Commits · cfc8ea87201dc9bb6aeb3fc80c61abee83e7cc06 · bug2833 / cloud-kernel

09 Dec, 2008 1 commit

Btrfs: superblock duplication · a512bbf8

Committed by Yan Zheng on Dec 08, 2008

This patch implements superblock duplication. Superblocks
are stored at offsets 16K, 64M and 256G on every device.
Space used by superblocks is reserved by the allocator,
which uses a reverse mapping function to find the logical
addresses that correspond to superblocks. Thank you,
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
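
The three offsets in this message follow a simple pattern: each is the previous one multiplied by 4096 (a left shift by 12 bits). The sketch below only illustrates that arithmetic in Python and is not the kernel code. The names `SUPER_BASE_OFFSET`, `MIRROR_SHIFT`, `superblock_offset` and `superblock_offsets`, and the 4096-byte superblock size, are assumptions made for the example.

```python
SUPER_BASE_OFFSET = 16 * 1024   # first copy at 16K, as stated in the commit message
MIRROR_SHIFT = 12               # assumption: each copy sits 4096 times further out
MAX_COPIES = 3                  # 16K, 64M and 256G
SUPER_SIZE = 4096               # assumption: bytes occupied by one superblock


def superblock_offset(copy):
    """Byte offset of superblock copy 0, 1 or 2."""
    return SUPER_BASE_OFFSET << (MIRROR_SHIFT * copy)


def superblock_offsets(device_bytes):
    """Offsets of all copies that fit entirely on a device of this size."""
    return [superblock_offset(i) for i in range(MAX_COPIES)
            if superblock_offset(i) + SUPER_SIZE <= device_bytes]
```

Under these assumptions a 10 GiB device would hold only the first two copies, because the 256G offset lies beyond its end.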

02 Dec, 2008 1 commit

Btrfs: remove unneeded total_trans · 6e3ad887

Committed by Sage Weil on Dec 02, 2008

Remove unneeded debugging sanity check.  It gets corrupted anyway when
multiple btrfs file systems are mounted, throwing bad warnings along the
way.
Signed-off-by: Sage Weil <sage@newdream.net>

19 Nov, 2008 1 commit

Btrfs: switch back to wait_on_page_writeback to wait on metadata writes · 105d931d

Committed by Chris Mason on Nov 18, 2008

The extent based waiting was using more CPU, and other fixes have helped
with the unplug storm problems.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

18 Nov, 2008 3 commits

Btrfs: Add backrefs and forward refs for subvols and snapshots · 0660b5af

Committed by Chris Mason on Nov 17, 2008

Subvols and snapshots can now be referenced from any point in the directory
tree.  We need to maintain back refs for them so we can find lost
subvols.

Forward refs are added so that we know all of the subvols and
snapshots referenced anywhere in the directory tree of a single subvol.  This
can be used to do recursive snapshotting (though it isn't yet) and it is
also used to detect and prevent directory loops when creating new snapshots.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

Btrfs: Give each subvol and snapshot their own anonymous devid · 3394e160

Committed by Chris Mason on Nov 17, 2008

Each subvolume has its own private inode number space, and so we need
to fill in different device numbers for each subvolume to avoid confusing
applications.

This commit puts a struct super_block into struct btrfs_root so it can
call set_anon_super() and get a different device number generated for
each root.

btrfs_rename is changed to prevent renames across subvols.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

Btrfs: Allow subvolumes and snapshots anywhere in the directory tree · 3de4586c

Committed by Chris Mason on Nov 17, 2008

Before, all snapshots and subvolumes lived in a single flat directory.  This
was awkward and confusing because the single flat directory was only writable
with the ioctls.

This commit changes the ioctls to create subvols and snapshots at any
point in the directory tree.  This requires making separate ioctls for
snapshot and subvol creation instead of combining them into one.

The subvol ioctl does:

btrfsctl -S subvol_name parent_dir

After the ioctl is done subvol_name lives inside parent_dir.

The snapshot ioctl does:

btrfsctl -s path_for_snapshot root_to_snapshot

path_for_snapshot can be an absolute or relative path.  btrfsctl breaks it up
into directory and basename components.

root_to_snapshot can be any file or directory in the FS.  The snapshot
is taken of the entire root where that file lives.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
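
The splitting of path_for_snapshot into directory and basename components can be pictured with a small Python sketch. This is not btrfsctl's code: the function name `split_snapshot_path` is hypothetical, and the handling of trailing slashes and bare names is an assumption chosen for the example.

```python
import os.path


def split_snapshot_path(path_for_snapshot):
    """Return (parent_dir, name) for an absolute or relative snapshot path."""
    # Strip trailing slashes so "snaps/today/" still yields the name "today".
    trimmed = path_for_snapshot.rstrip("/") or "/"
    parent, name = os.path.split(trimmed)
    # A bare name such as "today" is taken to mean the current directory.
    return (parent or ".", name)
```

For instance, `split_snapshot_path("/mnt/snaps/today")` gives the parent directory `/mnt/snaps` and the snapshot name `today`.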

08 Nov, 2008 1 commit

Btrfs: Avoid unplug storms during commit · 5f2cc086

Committed by Chris Mason on Nov 07, 2008

While doing a commit, btrfs makes sure all the metadata blocks
were properly written to disk, calling wait_on_page_writeback for
each page.  This writeback happens after allowing another transaction
to start, so it competes for the disk with other processes in the FS.

If the page writeback bit is still set, each wait_on_page_writeback might
trigger an unplug, even though the page might be waiting for checksumming
to finish or might be waiting for the async work queue to submit the
bio.

This trades wait_on_page_writeback for waiting on the extent writeback
bits.  It won't trigger any unplugs and substantially improves performance
in a number of workloads.

This also changes the async bio submission to avoid requeueing if there
is only one device.  The requeue just wastes CPU time because there are
no other devices to service.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

31 Oct, 2008 1 commit

Btrfs: update nodatacow code v2 · 80ff3856

Committed by Yan Zheng on Oct 30, 2008

This patch simplifies the nodatacow checker. If all references
were created after the latest snapshot, then we can avoid COW
safely. This patch also updates run_delalloc_nocow to do more
fine-grained checking.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
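
The rule "all references were created after the latest snapshot" can be sketched as a comparison of generation numbers. The Python below is an illustration only, not the checker from the patch. The `ExtentRef` type, its fields, and the function `can_skip_cow` are hypothetical, and treating an empty reference list as "must COW" is an assumption.

```python
from dataclasses import dataclass


@dataclass
class ExtentRef:
    root_id: int       # hypothetical: tree that holds the reference
    generation: int    # transaction in which the reference was created


def can_skip_cow(refs, last_snapshot_gen):
    """True when every reference is newer than the latest snapshot."""
    # A reference created at or before the snapshot may be shared with it,
    # so overwriting the extent in place would corrupt the snapshot.
    return bool(refs) and all(r.generation > last_snapshot_gen for r in refs)
```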

30 Oct, 2008 4 commits

Btrfs: prevent looping forever in finish_current_insert and del_pending_extents · 87ef2bb4

Committed by Chris Mason on Oct 30, 2008

finish_current_insert and del_pending_extents process extent tree modifications
that build up while we are changing the extent tree. It is a confusing
bit of code that prevents recursion.

Both functions run through a list of pending operations and both funcs
add to the list of pending operations. If you have two procs in either
one of them, they can end up looping forever making more work for each other.

This patch makes them walk forward through the list of pending changes instead
of always trying to process the entire list. At transaction commit
time, we catch any changes that were left over.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

Btrfs: Add root tree pointer transaction ids · 84234f3a

Committed by Yan Zheng on Oct 29, 2008

This patch adds transaction IDs to root tree pointers.
Transaction IDs in tree pointers are compared with the
generation numbers in block headers when reading root
blocks of trees. This can detect some types of IO errors.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
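
The comparison described in this message can be sketched as follows. This Python is a simplified model, not the btrfs implementation: `RootPointer`, `BlockHeader`, `GenerationMismatch` and `read_root_block` are hypothetical names, and `read_block` stands in for whatever reads a block from disk.

```python
from dataclasses import dataclass


class GenerationMismatch(Exception):
    """The block on disk was not written by the expected transaction."""


@dataclass
class RootPointer:
    bytenr: int        # where the root block lives
    generation: int    # transaction ID stored alongside the pointer


@dataclass
class BlockHeader:
    bytenr: int
    generation: int    # transaction ID stamped into the block when written


def read_root_block(pointer, read_block):
    """Read a root block and reject it if the two IDs disagree."""
    header = read_block(pointer.bytenr)
    if header.generation != pointer.generation:
        raise GenerationMismatch(
            f"block {pointer.bytenr}: expected generation "
            f"{pointer.generation}, found {header.generation}")
    return header
```

A mismatch suggests that a write never reached the disk or landed in the wrong place, which is one of the IO error types such a check can catch.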

Btrfs: nuke fs wide allocation mutex V2 · 25179201

Committed by Josef Bacik on Oct 29, 2008

This patch removes the giant fs_info->alloc_mutex and replaces it with a bunch
of little locks.

There is now a pinned_mutex, which is used when messing with the pinned_extents
extent io tree, and the extent_ins_mutex which is used with the pending_del and
extent_ins extent io trees.

The locking for the extent tree stuff was inspired by a patch that Yan Zheng
wrote to fix a race condition. I cleaned it up some and changed the locking
around a little bit, but the idea remains the same. Basically instead of
holding the extent_ins_mutex throughout the processing of an extent on the
extent_ins or pending_del trees, we just hold it while we're searching and when
we clear the bits on those trees, and lock the extent for the duration of the
operations on the extent.

Also to keep from getting hung up waiting to lock an extent, I've added a
try_lock_extent so that if we cannot lock the extent, we move on to the next one
in the tree and come back to that one later. I have tested this heavily and it
does not appear to break anything. This has to be applied on top of my
find_free_extent redo patch.

I tested this patch on top of Yan's space rebalancing code and it worked fine.
The only thing that has changed since the last version is that I pulled out all
my debugging stuff; apparently I forgot to run guilt refresh before I sent the
last patch out. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
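
The try-lock idea (skip an extent that is busy and come back to it later) can be sketched in Python with non-blocking lock acquisition. This is an analogy using `threading.Lock`, not the kernel's try_lock_extent; the `Extent` class and `process_pending` function are hypothetical.

```python
import threading


class Extent:
    def __init__(self, start):
        self.start = start
        self.lock = threading.Lock()   # stand-in for the per-extent lock


def process_pending(extents, handle):
    """Handle every extent that can be locked right now; return the rest."""
    skipped = []
    for ext in extents:
        # Never block here: a busy extent is set aside for a later pass.
        if not ext.lock.acquire(blocking=False):
            skipped.append(ext)
            continue
        try:
            handle(ext)
        finally:
            ext.lock.release()
    return skipped
```

The caller is expected to call `process_pending` again on the returned list, so one busy extent does not stall work on the others.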

Btrfs: Improve space balancing code · f82d02d9

Committed by Yan Zheng on Oct 29, 2008

This patch improves the space balancing code to keep more sharing
of tree blocks. The only case that breaks sharing of tree blocks is
when data extents get fragmented during balancing. The main changes in
this patch are:

Add a 'drop sub-tree' function. This solves the problem in the old code
where the BTRFS_HEADER_FLAG_WRITTEN check breaks sharing of tree blocks.

Remove relocation mapping tree. Relocation mappings are stored in
struct btrfs_ref_path and updated dynamically during walking up/down
the reference path. This reduces CPU usage and simplifies code.

This patch also fixes a bug. Root items for reloc trees should be
updated in btrfs_free_reloc_root.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>

04 Oct, 2008 1 commit

Btrfs: remove last_log_alloc allocator optimization · 30c43e24

Committed by Chris Mason on Oct 03, 2008

The tree logging code was trying to separate tree log allocations
from normal metadata allocations to improve writeback patterns during
an fsync.

But, the code was not effective and ended up just mixing tree log
blocks with regular metadata.  That seems to be working fairly well,
so the last_log_alloc code can be removed.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

30 Sep, 2008 1 commit

Btrfs: add and improve comments · d352ac68

Committed by Chris Mason on Sep 29, 2008

This improves the comments at the top of many functions.  It didn't
dive into the guts of functions because I was trying to
avoid merging problems with the new allocator and back reference work.

extent-tree.c and volumes.c were both skipped, and there is definitely
more work to do in cleaning and commenting the code.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

26 Sep, 2008 3 commits

Btrfs: update space balancing code · 1a40e23b

Committed by Zheng Yan on Sep 26, 2008

This patch updates the space balancing code to utilize the new
backref format.  Before, btrfs-vol -b would break any COW links
on data blocks or metadata.  This was slow and caused the amount
of space used to explode if a large number of snapshots were present.

The new code keeps the sharing of all data extents and
most of the tree blocks.

To maintain the sharing of data extents, the space balance code uses
a separate inode to hold data extent pointers, then updates the references
to point to the new location.

To maintain the sharing of tree blocks, the space balance code uses
reloc trees to relocate tree blocks in reference counted roots.
There is one reloc tree for each subvol, and all reloc trees share
the same root key objectid. Reloc trees are snapshots of the latest
committed roots of subvols (root->commit_root).

To relocate a tree block referenced by a subvol, there are two steps:
COW the block through the subvol's reloc tree, then update the block pointer in
the subvol to point to the new block. Since all reloc trees share
the same root key objectid, doing special handling for tree blocks
owned by them is easy. Once a tree block has been COWed in one
reloc tree, we can use the resulting new block directly when the
same block is required to COW again through other reloc trees.
In this way, relocated tree blocks are shared between reloc trees,
so they are also shared between subvols.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
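
The reuse of an already relocated block across reloc trees amounts to a shared lookup table. The Python sketch below shows only that idea under simplifying assumptions: the function `cow_through_reloc_tree`, the `relocated` dictionary and the `allocate` callback are hypothetical and do not mirror the patch's data structures.

```python
def cow_through_reloc_tree(old_block, relocated, allocate):
    """Return the new location of old_block, copying it at most once.

    relocated maps old block numbers to new ones and is shared by all
    reloc trees; allocate copies a block and returns the new block number.
    """
    if old_block not in relocated:
        relocated[old_block] = allocate(old_block)
    return relocated[old_block]
```

When a second reloc tree asks for the same block, it gets the copy made for the first one, which is how the relocated block stays shared between subvols in this model.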

Btrfs: extent_map and data=ordered fixes for space balancing · 5b21f2ed

Committed by Zheng Yan on Sep 26, 2008

* Add an EXTENT_BOUNDARY state bit to keep the writepage code
from merging data extents that are in the process of being
relocated.  This allows us to do accounting for them properly.

* The balancing code relocates data extents independent of the underlying
inode.  The extent_map code was modified to properly account for
things moving around (invalidating extent_map caches in the inode).

* Don't take the drop_mutex in the create_subvol ioctl.  It isn't
required.

* Fix walking of the ordered extent list to avoid races with sys_unlink

* Change the lock ordering rules.  Transaction start goes outside
the drop_mutex.  This allows btrfs_commit_transaction to directly
drop the relocation trees.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

Btrfs: Add shared reference cache · e4657689

Committed by Zheng Yan on Sep 26, 2008

Btrfs has a cache of reference counts in leaves, allowing it to
avoid reading tree leaves while deleting snapshots.  To reduce
contention with multiple subvolumes, this cache is private to each
subvolume.

This patch adds shared reference cache support. The new space
balancing code plays with multiple subvols at the same time, so
the old per-subvol reference cache is not well suited.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

25 Sep, 2008 23 commits

Btrfs: Record dirty pages tree-log pages in an extent_io tree · d0c803c4

Committed by Chris Mason on Sep 11, 2008

This is the same way the transaction code makes sure that all the
other tree blocks are safely on disk.  There's an extent_io tree
for each root, and any blocks allocated to the tree logs are
recorded in that tree.

At tree-log sync, the extent_io tree is walked to flush down the
dirty pages and wait for them.

The main benefit is less time spent walking the tree log and skipping
clean pages, and getting sequential IO down to the drive.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

Btrfs: Tree logging fixes · 4bef0848

Committed by Chris Mason on Sep 08, 2008

* Pin down data blocks to prevent them from being reallocated like so:

trans 1: allocate file extent
trans 2: free file extent
trans 3: free file extent during old snapshot deletion
trans 3: allocate file extent to new file
trans 3: fsync new file

Before the tree logging code, this was legal because the fsync
would commit the transaction that did the final data extent free
and the transaction that allocated the extent to the new file
at the same time.

With the tree logging code, the tree log subtransaction can commit
before the transaction that freed the extent.  If we crash,
we're left with two different files using the extent.

* Don't wait in start_transaction if log replay is going on.  This
avoids deadlocks from iput while we're cleaning up link counts in the
replay code.

* Don't deadlock in replay_one_name by trying to read an inode off
the disk while holding paths for the directory

* Hold the buffer lock while we mark a buffer as written.  This
closes a race where someone is changing a buffer while we write it.
They are supposed to mark it dirty again after they change it, but
this violates the cow rules.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

Btrfs: Add a write ahead tree log to optimize synchronous operations · e02119d5

Committed by Chris Mason on Sep 05, 2008

File syncs and directory syncs are optimized by copying their
items into a special (copy-on-write) log tree. There is one log tree per
subvolume and the btrfs super block points to a tree of log tree roots.

After a crash, items are copied out of the log tree and back into the
subvolume. See tree-log.c for all the details.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

Btrfs: Wait for async bio submissions to make some progress at queue time · b64a2851

Committed by Chris Mason on Aug 20, 2008

Before, the btrfs bdi congestion function was used to test for too many
async bios.  This keeps that check to throttle pdflush, but also
adds a check while queuing bios.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

Btrfs: Transaction commit: don't use filemap_fdatawait · 777e6bd7

Committed by Chris Mason on Aug 15, 2008

After writing out all the remaining btree blocks in the transaction,
the commit code would use filemap_fdatawait to make sure it was all
on disk.  This means it would wait for blocks written by other procs
as well.

The new code walks the list of blocks for this transaction again
and waits only for those required by this transaction.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

Btrfs: Fix nodatacow for the new data=ordered mode · 7ea394f1

Committed by Yan Zheng on Aug 05, 2008

Signed-off-by: Chris Mason <chris.mason@oracle.com>

Btrfs: Various small fixes. · b48652c1

Committed by Yan Zheng on Aug 04, 2008

This trivial patch contains two locking fixes and an off-by-one fix.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

Btrfs: fix ioctl-initiated transactions vs wait_current_trans() · 9ca9ee09

Committed by Sage Weil on Aug 04, 2008

Commit 597:466b27332893 (btrfs_start_transaction: wait for commits in
progress) breaks the transaction start/stop ioctls by making
btrfs_start_transaction conditionally wait for the next transaction to
start.  If an application artificially is holding a transaction open,
things deadlock.

This workaround maintains a count of open ioctl-initiated transactions in
fs_info, and avoids wait_current_trans() if any are currently open (in
start_transaction() and btrfs_throttle()).  The start transaction ioctl
uses a new btrfs_start_ioctl_transaction() that _does_ call
wait_current_trans(), effectively pushing the join/wait decision to the
outer ioctl-initiated transaction.

This more or less neuters btrfs_throttle() when ioctl-initiated
transactions are in use, but that seems like a pretty fundamental
consequence of wrapping lots of write()'s in a transaction.  Btrfs has no
way to tell if the application considers a given operation as part of its
transaction.

Obviously, if the transaction start/stop ioctls aren't being used, there
is no effect on current behavior.
Signed-off-by: Sage Weil <sage@newdream.net>
---
 ctree.h       |    1 +
 ioctl.c       |   12 +++++++++++-
 transaction.c |   18 +++++++++++++-----
 transaction.h |    2 ++
 4 files changed, 27 insertions(+), 6 deletions(-)
Signed-off-by: Chris Mason <chris.mason@oracle.com>
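
The counting scheme described in this message can be modelled in a few lines of Python. This is a loose sketch rather than the patch itself: the `FsInfo` class, its fields, and the three functions are hypothetical, and the real code waits on a transaction instead of returning a flag.

```python
class FsInfo:
    def __init__(self):
        self.open_ioctl_trans = 0        # ioctl-initiated transactions still open
        self.commit_in_progress = False  # stand-in for "a commit is finishing"


def should_wait_for_commit(fs_info):
    """Ordinary transaction starts skip the wait while an ioctl transaction is open."""
    return fs_info.commit_in_progress and fs_info.open_ioctl_trans == 0


def start_ioctl_transaction(fs_info):
    """The outer ioctl transaction still honours a commit in progress."""
    must_wait = fs_info.commit_in_progress
    fs_info.open_ioctl_trans += 1
    return must_wait


def end_ioctl_transaction(fs_info):
    fs_info.open_ioctl_trans -= 1
```

In this model the wait decision is taken once, when the application opens its transaction, and every operation nested inside it joins without waiting.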

Btrfs: More throttle tuning · 2dd3e67b

Committed by Chris Mason on Aug 04, 2008

* Make walk_down_tree wake up throttled tasks more often
* Make walk_down_tree call cond_resched during long loops
* As the size of the ref cache grows, wait longer in throttle
* Get rid of the reada code in walk_down_tree, the leaves don't get
  read anymore, thanks to the ref cache.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

btrfs_search_slot: reduce lock contention by cowing in two stages · 65b51a00

Committed by Chris Mason on Aug 01, 2008

A btree block cow has two parts: the first is to allocate a destination
block and the second is to copy the old block over.

The first part needs locks in the extent allocation tree, and may need to
do IO. This changeset splits that into a separate function that can be
called without any tree locks held.

btrfs_search_slot is changed to drop its path and start over if it has
to COW a contended block. This often means that many writers will
pre-alloc a new destination for the same contended block, but they
cache their prealloc for later use on lower levels in the tree.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

Btrfs: Throttle less often waiting for snapshots to delete · 18e35e0a

Committed by Chris Mason on Aug 01, 2008

Signed-off-by: Chris Mason <chris.mason@oracle.com>

Btrfs: Throttle tuning · 37d1aeee

Committed by Chris Mason on Jul 31, 2008

This avoids waiting for transactions with pages locked by breaking out
the code to wait for the current transaction to close into a function
called by btrfs_throttle.

It also lowers the limits for where we start throttling.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

Btrfs: implement memory reclaim for leaf reference cache · bcc63abb

Committed by Yan on Jul 30, 2008

The memory reclaiming issue happens when a snapshot exists. In that
case, some cache entries may not be used during old snapshot dropping,
so they will remain in the cache until umount.

The patch adds a field to struct btrfs_leaf_ref to record its creation time. In
addition, the patch links all dead roots of a given snapshot together in order
of creation time. After an old snapshot has been completely dropped, we check
the dead root list and remove all cache entries created before the oldest dead
root in the list.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

Btrfs: Update and fix mount -o nodatacow · f321e491

Committed by Yan Zheng on Jul 30, 2008

To check whether a given file extent is referenced by multiple snapshots, the
checker walks down the fs tree through the dead root and checks all tree blocks
in the path.

We can easily detect whether a given tree block is directly referenced by another
snapshot. We can also detect any indirect reference from another snapshot by
checking the reference's generation. The checker can always detect multiple
references, but can't reliably detect cases of a single reference. So btrfs may
do file data cow even when there is only one reference.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

Btrfs: Throttle operations if the reference cache gets too large · ab78c84d

Committed by Chris Mason on Jul 29, 2008

A large reference cache is directly related to a lot of work pending
for the cleaner thread.  This throttles back new operations based on
the size of the reference cache so the cleaner thread will be able to keep
up.

Overall, this actually makes the FS faster because the cleaner thread will
be more likely to find things in cache.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

Btrfs: Leaf reference cache update · 017e5369

Committed by Chris Mason on Jul 28, 2008

This changes the reference cache to make a single cache per root
instead of one cache per transaction, and to key by the byte number
of the disk block instead of the keys inside.

This makes it much less likely to have cache misses if a snapshot
or something has an extra reference on a higher node or a leaf while
the first transaction that added the leaf into the cache is dropping.

Some throttling is added to functions that free blocks heavily so they
wait for old transactions to drop.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

Btrfs: Add a leaf reference cache · 31153d81

Committed by Yan Zheng on Jul 28, 2008

Much of the IO done while dropping snapshots is done looking up
leaves in the filesystem trees to see if they point to any extents and
to drop the references on any extents found.

This creates a cache so that IO isn't required.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

Btrfs: Implement new dir index format · aec7477b

Committed by Josef Bacik on Jul 24, 2008

Signed-off-by: Chris Mason <chris.mason@oracle.com>

Btrfs: Take the csum mutex while reading checksums · ed98b56a

Committed by Chris Mason on Jul 22, 2008

Signed-off-by: Chris Mason <chris.mason@oracle.com>

Btrfs: Fix some data=ordered related data corruptions · f421950f

Committed by Chris Mason on Jul 22, 2008

Stress testing was showing data checksum errors, most of which were caused
by a lookup bug in the extent_map tree.  The tree was caching the last
pointer returned, and searches would check the last pointer first.

But, search callers also expect the search to return the very first
matching extent in the range, which wasn't always true with the last
pointer usage.

For now, the code to cache the last return value is just removed.  It is
easy to fix, but I think lookups are rare enough that it isn't required anymore.

This commit also replaces do_sync_mapping_range with a local copy of the
related functions.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
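
The class of bug described here (a cached last result that overlaps the search range but is not the first match) can be reproduced with a toy model. The Python below is not the extent_map code: extents are plain `(start, end)` tuples in a sorted list, and `first_overlap` and `CachedLookup` are hypothetical names.

```python
def first_overlap(extents, start, end):
    """Return the first extent (sorted by start) that overlaps [start, end)."""
    for ext in extents:
        if ext[0] < end and ext[1] > start:
            return ext
    return None


class CachedLookup:
    """Buggy variant: checks the last returned extent before searching."""

    def __init__(self, extents):
        self.extents = sorted(extents)
        self.last = None

    def lookup(self, start, end):
        last = self.last
        if last is not None and last[0] < end and last[1] > start:
            return last   # overlaps the range, but may not be the first match
        self.last = first_overlap(self.extents, start, end)
        return self.last
```

With extents `(0, 4096)` and `(4096, 8192)`, a lookup of the second extent followed by a lookup of the whole range returns the second extent again from the cached variant, while `first_overlap` returns the first.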

btrfs_start_transaction: wait for commits in progress to finish · f9295749

Committed by Chris Mason on Jul 17, 2008

btrfs_commit_transaction has to loop waiting for any writers in the
transaction to finish before it can proceed.  btrfs_start_transaction
should be polite and not join a transaction that is in the process
of being finished off.

There are a few places that can't wait, basically the ones doing IO that
might be needed to finish the transaction.  For them, btrfs_join_transaction
is added.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

Btrfs: New data=ordered implementation · e6dcd2dc

Committed by Chris Mason on Jul 17, 2008

The old data=ordered code would force commit to wait until
all the data extents from the transaction were fully on disk.  This
introduced large latencies into the commit and stalled new writers
in the transaction for a long time.

The new code changes the way data allocations and extents work:

* When delayed allocation is filled, data extents are reserved, and
  the extent bit EXTENT_ORDERED is set on the entire range of the extent.
  A struct btrfs_ordered_extent is allocated and inserted into a per-inode
  rbtree to track the pending extents.

* As each page is written EXTENT_ORDERED is cleared on the bytes corresponding
  to that page.

* When all of the bytes corresponding to a single struct btrfs_ordered_extent
  are written, the previously reserved extent is inserted into the FS
  btree and into the extent allocation trees.  The checksums for the file
  data are also updated.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
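
The three steps above can be sketched as a small state machine. This Python model is an illustration under simplifying assumptions, not the kernel implementation: a set of page offsets stands in for the EXTENT_ORDERED bits, a dictionary stands in for the per-inode rbtree, and the names `OrderedExtent`, `Inode`, `fill_delalloc` and `page_written` are hypothetical.

```python
PAGE_SIZE = 4096


class OrderedExtent:
    def __init__(self, start, length):
        self.start = start
        self.length = length
        # Pages that still carry the "ordered" mark (not yet written).
        self.pending = set(range(start, start + length, PAGE_SIZE))


class Inode:
    def __init__(self):
        self.ordered = {}   # start offset -> OrderedExtent (rbtree stand-in)
        self.btree = []     # (start, length) of extents made visible

    def fill_delalloc(self, start, length):
        """Step 1: reserve the extent and track it as pending."""
        self.ordered[start] = OrderedExtent(start, length)

    def page_written(self, page_start):
        """Steps 2 and 3: clear one page; publish the extent when none remain."""
        for ext in list(self.ordered.values()):
            if page_start in ext.pending:
                ext.pending.discard(page_start)
                if not ext.pending:
                    del self.ordered[ext.start]
                    self.btree.append((ext.start, ext.length))
                return
```

In this model the extent only becomes visible in `btree` after its last page is written, so a commit never has to wait for the data IO.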

Btrfs: Drop some verbose printks · 77a41afb

Committed by Chris Mason on Jul 08, 2008

Signed-off-by: Chris Mason <chris.mason@oracle.com>

bug2833 / cloud-kernel is in sync with its fork source project