Commits · 5d4f98a28c7d334091c1b7744f48a1acdd2a4ae0 · openanolis / cloud-kernel

10 Jun, 2009 · 2 commits

Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE) · 5d4f98a2

Authored by Yan Zheng on Jun 10, 2009

This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.

When a tree block in a subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.

The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.

When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits the old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
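
The rule in the paragraph above can be sketched in a few lines of C. This is only an illustration: `struct block`, `MAX_PTRS` and `cow_block()` are hypothetical names invented for the sketch and do not appear in btrfs.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_PTRS 4

/* Hypothetical in-memory stand-in for a tree block or extent. */
struct block {
	int refs;			/* reference count */
	int freed;			/* set once the block is released */
	size_t nr;			/* number of child pointers in use */
	struct block *ptrs[MAX_PTRS];	/* extents this block points to */
};

/*
 * Copy-on-write 'old' into 'new' following the rule described above.
 * 'new' must be a zeroed block supplied by the caller.
 */
static void cow_block(struct block *old, struct block *new)
{
	size_t i;

	assert(old->refs >= 1 && old->nr <= MAX_PTRS);

	new->refs = 1;
	new->nr = old->nr;
	for (i = 0; i < old->nr; i++)
		new->ptrs[i] = old->ptrs[i];

	if (old->refs == 1) {
		/*
		 * Non-shared: free the old block at once. The new block
		 * inherits its references, so child counts are untouched.
		 */
		old->refs = 0;
		old->freed = 1;
	} else {
		/* Shared: every child gains a referrer, the old block loses one. */
		for (i = 0; i < new->nr; i++)
			new->ptrs[i]->refs++;
		old->refs--;
	}
}
```

In the non-shared case no child reference count changes, which is the saving this commit is after.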

This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.

We can solve this by introducing a new type of back ref. The new
back ref provides information about the pointer's key, level and in which
tree the pointer lives. This information allows us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.

This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.

This commit adds a per-subvolume red-black tree to keep track of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers fall within a given range.

This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduces the overhead of checking back refs.

The improved balancing code scales significantly better with a large
number of snapshots.

This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>

5d4f98a2

btrfs: Fix set/clear_extent_bit for 'end == (u64)-1' · 5c939df5

Authored by Yan Zheng on May 27, 2009

There is some code of the form 'start = state->end + 1;' in set_extent_bit
and clear_extent_bit. It overflows when end == (u64)-1.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>

5c939df5
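
To make the overflow in 5c939df5 concrete, here is a minimal userspace sketch, not the patch itself. `walk_range()` and its fixed state length are assumptions made for illustration; the point is that the loop must test for the end of the range before it computes `state_end + 1`.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t u64;

/*
 * Walk consecutive extent states covering [start, end], where each state
 * spans 'len' bytes (a hypothetical fixed size to keep the sketch short).
 * Returns the number of states visited.
 */
static unsigned int walk_range(u64 start, u64 end, u64 len)
{
	unsigned int visited = 0;

	assert(len > 0 && start <= end);

	for (;;) {
		/* Last byte of the current state, clamped to 'end'. */
		u64 state_end = (end - start < len - 1) ? end : start + (len - 1);

		visited++;

		/*
		 * Stop before computing state_end + 1: when state_end is
		 * (u64)-1 the addition wraps to 0 and the walk would start
		 * over from the beginning of the address space.
		 */
		if (state_end >= end)
			break;
		start = state_end + 1;
	}
	return visited;
}
```

Calling `walk_range((u64)-1 - 9, (u64)-1, 4)` visits three states and terminates, even though the last state ends at the largest representable offset.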

05 Jun, 2009 · 1 commit

Btrfs: Fix oops and use after free during space balancing · 44fb5511

Authored by Chris Mason on Jun 04, 2009

The btrfs allocator uses list_for_each to walk the available block
groups when searching for free blocks.  It starts off with a hint
to help find the best block group for a given allocation.

The hint is resolved into a block group, but we don't properly check
to make sure the block group we find isn't in the middle of being
freed due to filesystem shrinking or balancing.  If it is being
freed, the list pointers in it are bogus and can't be trusted.  But,
the code happily goes along and uses them in the list_for_each loop,
leading to all kinds of fun.

The fix used here is to check to make sure the block group we find really
is on the list before we use it.  list_del_init is used when removing
it from the list, so we can do a proper check.

The allocation clustering code has a similar bug where it will trust
the block group in the current free space cluster.  If our allocation
flags have changed (going from single spindle dup to raid1 for example)
because the drives in the FS have changed, we're not allowed to use
the old block group any more.

The fix used here is to check the current cluster against the
current allocation flags.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

44fb5511
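
The reason `list_del_init` makes the check in 44fb5511 possible can be shown with a userspace re-creation of the kernel's circular list. The list helpers below mirror the kernel names but are reimplemented here; `struct block_group` and `hint_is_usable()` are hypothetical and only sketch the idea.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace re-creation of the kernel's circular doubly linked list. */
struct list_head {
	struct list_head *next, *prev;
};

static void INIT_LIST_HEAD(struct list_head *h)
{
	h->next = h;
	h->prev = h;
}

static bool list_empty(const struct list_head *h)
{
	return h->next == h;
}

static void list_add_tail(struct list_head *entry, struct list_head *head)
{
	entry->prev = head->prev;
	entry->next = head;
	head->prev->next = entry;
	head->prev = entry;
}

/* Unlink 'entry' and point it back at itself, so list_empty(entry) is true. */
static void list_del_init(struct list_head *entry)
{
	entry->prev->next = entry->next;
	entry->next->prev = entry->prev;
	INIT_LIST_HEAD(entry);
}

/* Hypothetical, stripped-down block group. */
struct block_group {
	struct list_head list;
};

/* A hinted block group may only seed the list walk while it is still linked. */
static bool hint_is_usable(const struct block_group *bg)
{
	assert(bg != NULL);
	return !list_empty(&bg->list);
}
```

A plain unlink would leave stale `next` and `prev` pointers in the removed entry, so there would be nothing reliable to test before walking from it.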

04 Jun, 2009 · 1 commit

Btrfs: set device->total_disk_bytes when adding new device · 2cc3c559

Authored by Yan Zheng on Jun 04, 2009

It was not being properly initialized, and so the size saved to
disk was not correct.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

2cc3c559

15 May, 2009 · 6 commits

Btrfs: Spelling fix in btrfs_lookup_first_block_group comments · 9f55684c

Authored by Sankar P on May 14, 2009

Signed-off-by: Sankar P <sankar.curiosity@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>

9f55684c

Btrfs: make show_options result match actual option names · 6b65c5c6

Authored by Sage Weil on May 14, 2009

The notreelog and flushoncommit mount options were being printed slightly
differently from their actual option names.
Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>

6b65c5c6

Btrfs: remove outdated comment in btrfs_ioctl_resize() · 5d847a8e

Authored by Li Hong on May 14, 2009

In Li Zefan's commit dae7b665,
the combined kmalloc() and copy_from_user() call was replaced by
memdup_user(). So btrfs_ioctl_resize() doesn't use GFP_NOFS any more.
Signed-off-by: Li Hong <lihong.hi@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>

5d847a8e

Btrfs: remove some WARN_ONs in the IO failure path · cc7b0c9b

Authored by Chris Mason on May 14, 2009

These debugging WARN_ONs make too much console noise during regular
IO failures. An IO failure will still generate a number of messages
as we verify checksums etc, but these two are not needed.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

cc7b0c9b

Btrfs: Don't loop forever on metadata IO failures · 76a05b35

Authored by Chris Mason on May 14, 2009

When a btrfs metadata read fails, the first thing we try to do is find
a good copy on another mirror of the block.  If this fails, read_tree_block()
ends up returning a buffer that isn't up to date.

The btrfs btree reading code was reworked to drop locks and repeat
the search when IO was done, but the changes didn't add a check for failed
reads.  The end result was looping forever on buffers that were never
going to become up to date.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

76a05b35
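
The shape of the bug described in 76a05b35 is a retry loop with no exit for a failed read. The sketch below is a simplified model under stated assumptions: `struct buffer`, `read_fn`, `read_ok()`, `read_fail()` and `search_with_reread()` are all hypothetical names, not btrfs functions.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical buffer: 'uptodate' stays false if every mirror failed. */
struct buffer {
	bool uptodate;
};

/* Hypothetical read callback; sets buf->uptodate on success. */
typedef void (*read_fn)(struct buffer *buf);

/* Two hypothetical readers, used only to exercise the sketch. */
static void read_ok(struct buffer *buf)
{
	buf->uptodate = true;
}

static void read_fail(struct buffer *buf)
{
	buf->uptodate = false;
}

/*
 * Search that drops its locks, reads the block, and then restarts.
 * Returns 0 once the buffer is usable, or -EIO if the read failed.
 */
static int search_with_reread(struct buffer *buf, read_fn do_read)
{
	assert(buf != NULL && do_read != NULL);

	for (;;) {
		if (buf->uptodate)
			return 0;	/* block is in memory, search can proceed */

		do_read(buf);		/* locks were dropped for the IO */

		/*
		 * The kind of check the commit message says was missing:
		 * without it a failed read sends us around the loop forever.
		 */
		if (!buf->uptodate)
			return -EIO;
	}
}
```

With `read_fail` as the reader the function returns `-EIO` after one attempt instead of spinning.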

Btrfs: init inode ordered_data_close flag properly · 2757495c

Authored by Chris Mason on May 14, 2009

This flag is used to decide when we need to send a given file through
the ordered code to make sure it is fully written before a transaction
commits.  It was not being properly set to zero when the inode was
being set up.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

2757495c

09 May, 2009 · 1 commit

Convert obvious places to deactivate_locked_super() · 6f5bbff9

Authored by Al Viro on May 06, 2009

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

6f5bbff9

28 Apr, 2009 · 2 commits

Btrfs: look for acls during btrfs_read_locked_inode · 46a53cca

Authored by Chris Mason on Apr 27, 2009

This changes btrfs_read_locked_inode() to peek ahead in the btree for acl items.
If it is certain a given inode has no acls, it will set the in memory acl
fields to null to avoid acl lookups completely.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

46a53cca

Btrfs: fix acl caching · 7b1a14bb

Authored by Chris Mason on Apr 27, 2009

Linus noticed the btrfs code to cache acls wasn't properly caching
a NULL acl when the inode didn't have any acls.  This meant the common
case of no acls resulted in expensive btree searches every time the
kernel checked permissions (which is quite often).

This is a modified version of Linus' original patch:

Properly set initial acl fields to BTRFS_ACL_NOT_CACHED in the inode.
This forces an acl lookup when permission checks are done.

Fix btrfs_get_acl to avoid lookups and locking when the inode acls fields
are set to null.

Fix btrfs_get_acl to use the right return value from __btrfs_getxattr
when deciding to cache a NULL acl.  It was storing a NULL acl when
__btrfs_getxattr returned -ENOENT, but __btrfs_getxattr was actually returning
-ENODATA for this case.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

7b1a14bb
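
The caching scheme described in 7b1a14bb relies on telling "never looked up" apart from "looked up, no ACL". A minimal sketch of that idea follows. The types and functions are hypothetical, and the sentinel value chosen for `ACL_NOT_CACHED` is an assumption that only echoes the `BTRFS_ACL_NOT_CACHED` name from the commit message.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct acl {
	int dummy;
};

/*
 * Sentinel meaning "never looked up", distinct from NULL, which means
 * "looked up, none present". The value is an assumption of this sketch.
 */
#define ACL_NOT_CACHED ((struct acl *)-1)

/* Hypothetical, stripped-down inode. */
struct inode {
	struct acl *cached_acl;
	int lookups;		/* counts expensive lookups, for illustration */
};

/* Hypothetical stand-in for the xattr lookup: this inode has no ACL. */
static int lookup_acl(struct inode *inode, struct acl **out)
{
	inode->lookups++;
	*out = NULL;
	return -ENODATA;
}

static struct acl *get_acl(struct inode *inode)
{
	struct acl *acl;
	int ret;

	assert(inode != NULL);

	if (inode->cached_acl != ACL_NOT_CACHED)
		return inode->cached_acl;	/* cache hit, including a cached NULL */

	ret = lookup_acl(inode, &acl);
	if (ret == -ENODATA)
		acl = NULL;			/* "no ACL" is a cacheable answer */
	else if (ret < 0)
		return NULL;			/* real error: do not cache */

	inode->cached_acl = acl;
	return acl;
}
```

Testing for the wrong error code (`-ENOENT` instead of `-ENODATA`) would skip the caching branch, so every permission check would repeat the lookup.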

27 Apr, 2009 · 6 commits

Btrfs: Fix a bunch of printk() warnings. · 21380931

Authored by Joel Becker on Apr 21, 2009

Just happened to notice a bunch of %llu vs u64 warnings.  Here's a patch
to cast them all.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>

21380931

Btrfs: Fix a trivial warning using max() of u64 vs ULL. · e63b6a6c

Authored by Joel Becker on Apr 21, 2009

A small warning popped up on ia64 because inode-map.c was comparing a
u64 object id with the ULL FIRST_FREE_OBJECTID.  My first thought was
that all the OBJECTID constants should contain the u64 cast because
btrfs code deals entirely in u64s.  But then I saw how large that was,
and figured I'd just fix the max() call.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>

e63b6a6c

Btrfs: remove unused btrfs_bit_radix slab · 45c06543

Authored by Chris Mason on Apr 27, 2009

Signed-off-by: Chris Mason <chris.mason@oracle.com>

45c06543

Btrfs: ratelimit IO error printks · 193f284d

Authored by Chris Mason on Apr 27, 2009

Btrfs has printks for various IO errors, including bad checksums and
mismatches between what we expect the block headers to contain and what
we actually find on the disk.

Longer term we need a real reporting mechanism for this, but for now
printk is going to have to do.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

193f284d

Btrfs: remove #if 0 code · b7967db7

Authored by Chris Mason on Apr 27, 2009

Btrfs had some old code sitting around under #if 0; this drops it.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

b7967db7

Btrfs: When shrinking, only update disk size on success · d6397bae

Authored by Chris Ball on Apr 27, 2009

Previously, we updated a device's size prior to attempting a shrink
operation. This patch moves the device resizing logic to only happen if
the shrink completes successfully. In the process, it introduces a new
field to btrfs_device -- disk_total_bytes -- to track the on-disk size.
Signed-off-by: Chris Ball <cjb@laptop.org>
Signed-off-by: Chris Mason <chris.mason@oracle.com>

d6397bae
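
The ordering change described in d6397bae amounts to "attempt the shrink first, record the new size only if it worked". The sketch below is a rough model, not the patch: `struct device` is stripped down, and `relocate_beyond()` is a hypothetical stand-in for the work of moving data out of the tail of the device.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

typedef uint64_t u64;

/* Hypothetical, stripped-down device. */
struct device {
	u64 total_bytes;	/* in-memory size */
	u64 disk_total_bytes;	/* size recorded on disk */
	u64 bytes_used;
};

/* Hypothetical relocation step: fails if live data does not fit. */
static int relocate_beyond(const struct device *dev, u64 new_size)
{
	return dev->bytes_used > new_size ? -ENOSPC : 0;
}

static int shrink_device(struct device *dev, u64 new_size)
{
	int ret;

	assert(dev != 0 && new_size <= dev->total_bytes);

	ret = relocate_beyond(dev, new_size);
	if (ret)
		return ret;	/* failure: both sizes are left untouched */

	/* Only a completed shrink updates the recorded sizes. */
	dev->total_bytes = new_size;
	dev->disk_total_bytes = new_size;
	return 0;
}
```

A failed call leaves `disk_total_bytes` at its old value, so the size written to disk never describes a shrink that did not happen.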

25 Apr, 2009 · 6 commits

Btrfs: fix deadlocks and stalls on dead root removal · 59bc5c75

Authored by Chris Mason on Apr 24, 2009

After a transaction commit, the old roots of the subvol btrees are sent through
snapshot removal. This is what actually frees up any blocks replaced by
COW, and anything the old blocks pointed to.

Snapshot deletion will pause when a transaction commit has started, which
helps to avoid a huge amount of delayed reference count updates piling up
as the transaction is trying to close.

But, this pause happens after the snapshot deletion process has asked other
procs on the system to throttle back a bit so that it can make progress.

We don't want to throttle everyone while we're waiting for the transaction
commit, it leads to deadlocks in the user transaction ioctls used by Ceph
and makes things slower in general.

This patch changes things to avoid the throttling while we sleep.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

59bc5c75

Btrfs: fix fallocate deadlock on inode extent lock · e980b50c

Authored by Chris Mason on Apr 24, 2009

The btrfs fallocate call takes an extent lock on the entire range
being fallocated, and then runs through insert_reserved_extent on each
extent as they are allocated.

The problem with this is that btrfs_drop_extents may decide to try
and take the same extent lock fallocate was already holding.  The solution
used here is to push down knowledge of the range that is already locked
going into btrfs_drop_extents.

It turns out that at least one other caller had the same bug.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

e980b50c

Btrfs: kill btrfs_cache_create · 9601e3f6

Authored by Christoph Hellwig on Apr 13, 2009

Just use kmem_cache_create directly.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chris Mason <chris.mason@oracle.com>

9601e3f6

Btrfs: don't export symbols · 0d4bf11e

Authored by Christoph Hellwig on Apr 13, 2009

Currently the extent_map code is only for btrfs, so don't export its
symbols.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chris Mason <chris.mason@oracle.com>

0d4bf11e

Btrfs: simplify makefile · 2ea2544e

Authored by Christoph Hellwig on Apr 13, 2009

Get rid of the hacks for building out of tree, and always use += for
assigning to the object lists.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chris Mason <chris.mason@oracle.com>

2ea2544e

Btrfs: try to keep a healthy ratio of metadata vs data block groups · 97e728d4

Authored by Josef Bacik on Apr 21, 2009

This patch makes the chunk allocator keep a good ratio of metadata vs data
block groups. By default for every 8 data block groups, we'll allocate 1
metadata chunk, or about 12% of the disk will be allocated for metadata. This
can be changed by specifying the metadata_ratio mount option.

This is simply the number of data block groups that have to be allocated to
force a metadata chunk allocation. By making sure we allocate metadata chunks
more often, we are less likely to get into situations where the whole disk
has been allocated as data block groups.
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>

97e728d4
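
The counting rule behind the metadata_ratio option in 97e728d4 can be sketched as follows. `struct chunk_alloc` and `alloc_data_chunk()` are hypothetical names; the sketch models only the "every N data block groups forces one metadata chunk" behaviour from the commit message, not the real allocator.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical allocator state. */
struct chunk_alloc {
	unsigned int metadata_ratio;		/* data chunks per forced metadata chunk */
	unsigned int data_since_metadata;	/* data chunks since the last forced one */
	unsigned int data_chunks;
	unsigned int metadata_chunks;
};

/* Allocate one data chunk; returns true if a metadata chunk was forced too. */
static bool alloc_data_chunk(struct chunk_alloc *a)
{
	assert(a->metadata_ratio > 0);

	a->data_chunks++;
	if (++a->data_since_metadata < a->metadata_ratio)
		return false;

	/* The ratio has been reached: force a metadata chunk and start over. */
	a->data_since_metadata = 0;
	a->metadata_chunks++;
	return true;
}
```

With the default ratio of 8 from the commit message, sixteen data allocations force two metadata chunks.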

22 Apr, 2009 · 1 commit

Btrfs: fix btrfs fallocate oops and deadlock · 546888da

Authored by Chris Mason on Apr 21, 2009

Btrfs fallocate was incorrectly starting a transaction with a lock held
on the extent_io tree for the file, which could deadlock. Strictly
speaking it was using join_transaction which would be safe, but it is better
to move the transaction outside of the lock.

When preallocated extents are overwritten, btrfs_mark_buffer_dirty was
being called on an unlocked buffer. This was triggering an assertion and
oops because the lock is supposed to be held.

The bug was calling btrfs_mark_buffer_dirty on a leaf after btrfs_del_item had
been run. btrfs_del_item takes care of dirtying things, so the solution is
to skip the btrfs_mark_buffer_dirty call in this case.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

546888da

21 Apr, 2009 · 5 commits

btrfs: use memdup_user() · dae7b665

Authored by Li Zefan on Apr 08, 2009

Remove open-coded memdup_user().

Note this changes some GFP_NOFS to GFP_KERNEL: since copy_from_user() may
cause a page fault, it's pointless to pass GFP_NOFS to kmalloc().
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

dae7b665

Btrfs: use the right node in reada_for_balance · 8c594ea8

Authored by Chris Mason on Apr 20, 2009

reada_for_balance was using the wrong index into the path node array,
so it wasn't reading the right blocks.  We never directly used the
results of the read done by this function because the btree search is
started over at the end.

This fixes reada_for_balance to reada in the correct node and to
avoid searching past the last slot in the node.  It also makes sure to
hold the parent lock while we are finding the nodes to read.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

8c594ea8

Btrfs: fix oops on page->mapping->host during writepage · 11c8349b

Authored by Chris Mason on Apr 20, 2009

The extent_io writepage call updates the writepage index in the inode
as it makes progress.  But, it was doing the update after unlocking the page,
which isn't legal because page->mapping can't be trusted once the page
is unlocked.

This led to an oops, especially common with compression turned on.  The
fix here is to update the writeback index before unlocking the page.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

11c8349b

Btrfs: add a priority queue to the async thread helpers · d313d7a3

Authored by Chris Mason on Apr 20, 2009

Btrfs is using WRITE_SYNC_PLUG to send down synchronous IOs with a
higher priority.  But, the checksumming helper threads prevent it
from being fully effective.

There are two problems.  First, a big queue of pending checksumming
will delay the synchronous IO behind other lower priority writes.  Second,
the checksumming uses an ordered async work queue.  The ordering makes sure
that IOs are sent to the block layer in the same order they are sent
to the checksumming threads.  Usually this gives us less seeky IO.

But, when we start mixing IO priorities, the lower priority IO can delay
the higher priority IO.

This patch solves both problems by adding a high priority list to the async
helper threads, and a new btrfs_set_work_high_prio(), which is used
to put a new async work item onto the higher priority list.

The ordering is still done on high priority IO, but all of the high
priority bios are ordered separately from the low priority bios.  This
ordering is purely an IO optimization, it is not involved in data
or metadata integrity.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

d313d7a3
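
The two-list scheme described in d313d7a3 can be sketched as two FIFOs with a fixed service order. Everything below is hypothetical (`struct work`, `struct fifo`, `struct worker`, `queue_work()`, `next_work()`); it only illustrates that high priority items run first while each list keeps its own submission order.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical work item; 'next' links items in submission order. */
struct work {
	int id;
	int high_prio;
	struct work *next;
};

struct fifo {
	struct work *head, *tail;
};

/* Hypothetical worker with one FIFO per priority level. */
struct worker {
	struct fifo prio_pending;
	struct fifo pending;
};

static void fifo_push(struct fifo *q, struct work *w)
{
	w->next = NULL;
	if (q->tail)
		q->tail->next = w;
	else
		q->head = w;
	q->tail = w;
}

static struct work *fifo_pop(struct fifo *q)
{
	struct work *w = q->head;

	if (w) {
		q->head = w->next;
		if (!q->head)
			q->tail = NULL;
	}
	return w;
}

/* Route the item to the list that matches its priority. */
static void queue_work(struct worker *wk, struct work *w)
{
	assert(wk != NULL && w != NULL);
	fifo_push(w->high_prio ? &wk->prio_pending : &wk->pending, w);
}

/* High priority items always run first; each list keeps its own order. */
static struct work *next_work(struct worker *wk)
{
	struct work *w = fifo_pop(&wk->prio_pending);

	return w ? w : fifo_pop(&wk->pending);
}
```

Submitting items 1 (low), 2 (high), 3 (low), 4 (high) yields the service order 2, 4, 1, 3.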

Btrfs: use WRITE_SYNC for synchronous writes · ffbd517d

Authored by Chris Mason on Apr 20, 2009

Part of reducing fsync/O_SYNC/O_DIRECT latencies is using WRITE_SYNC for
writes we plan on waiting on in the near future.  This patch
mirrors recent changes in other filesystems and the generic code to
use WRITE_SYNC when WB_SYNC_ALL is passed and to use WRITE_SYNC for
other latency critical writes.

Btrfs uses async worker threads for checksumming before the write is done,
and then again to actually submit the bios.  The bio submission code just
runs a per-device list of bios that need to be sent down the pipe.

This list is split into low priority and high priority lists so the
WRITE_SYNC IO happens first.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

ffbd517d

03 Apr, 2009 · 9 commits

Btrfs: BUG to BUG_ON changes · c293498b

Authored by Stoyan Gaydarov on Apr 02, 2009

Signed-off-by: Chris Mason <chris.mason@oracle.com>

c293498b

Btrfs: remove dead code · 3e7ad38d

Authored by Dan Carpenter on Apr 02, 2009

Remove an unneeded return statement and conditional.
Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>

3e7ad38d

Btrfs: remove dead code · ff0a5836

Authored by Dan Carpenter on Apr 02, 2009

merge is always NULL at this point.
Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>

ff0a5836

Btrfs: fix typos in comments · d4a78947

Authored by Wu Fengguang on Apr 02, 2009

Signed-off-by: Chris Mason <chris.mason@oracle.com>

d4a78947

Btrfs: remove unused ftrace include · 2e966ed2

Authored by Jim Owens on Apr 02, 2009

Signed-off-by: jim owens <jowens@hp.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>

2e966ed2

Btrfs: fix __ucmpdi2 compile bug on 32 bit builds · 93dbfad7

Authored by Heiko Carstens on Apr 03, 2009

We get this on 32 bit builds:

fs/built-in.o: In function `extent_fiemap':
(.text+0x1019f2): undefined reference to `__ucmpdi2'

Happens because of a switch statement with a 64 bit argument.
Convert this to an if statement to fix this.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>

93dbfad7
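
The conversion described in 93dbfad7 replaces a switch on a 64 bit value with equality tests. The sketch below shows the shape of such a change under stated assumptions: `BLOCK_HOLE`, `BLOCK_INLINE`, `enum kind` and `classify()` are hypothetical and are not the identifiers used in `extent_fiemap`.

```c
#include <stdint.h>

typedef uint64_t u64;

/* Hypothetical 64 bit sentinel values, for illustration only. */
#define BLOCK_HOLE	((u64)-3)
#define BLOCK_INLINE	((u64)-2)

enum kind { KIND_REGULAR, KIND_HOLE, KIND_INLINE };

/*
 * Classify a 64 bit block_start with plain equality comparisons.
 * The commit message reports that a switch statement with a 64 bit
 * argument produced an undefined reference to __ucmpdi2 on 32 bit
 * builds; an if/else chain like this one was the fix it describes.
 */
static enum kind classify(u64 block_start)
{
	if (block_start == BLOCK_HOLE)
		return KIND_HOLE;
	else if (block_start == BLOCK_INLINE)
		return KIND_INLINE;
	return KIND_REGULAR;
}
```

The behaviour is the same as the equivalent switch; only the form of the comparison handed to the compiler changes.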

Btrfs: free inode struct when btrfs_new_inode fails · 09771430

Authored by Shen Feng on Apr 02, 2009

btrfs_new_inode doesn't call iput to free the inode
when it fails.
Signed-off-by: Shen Feng <shen@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>

09771430

Btrfs: fix race in worker_loop · b5555f77

Authored by Amit Gud on Apr 02, 2009

We need to check kthread_should_stop after schedule_timeout() before calling
schedule(). Without this check, threads can sleep with potentially no one to
wake them up, causing mount(2) to hang in btrfs_stop_workers waiting for
threads to stop.
Signed-off-by: Amit Gud <gud@ksu.edu>
Signed-off-by: Chris Mason <chris.mason@oracle.com>

b5555f77

Btrfs: add flushoncommit mount option · dccae999

Authored by Sage Weil on Apr 02, 2009

The 'flushoncommit' mount option forces any data dirtied by a write in a
prior transaction to commit as part of the current commit.  This makes
the committed state a fully consistent view of the file system from the
application's perspective (i.e., it includes all completed file system
operations).  This was previously the behavior only when a snapshot is
created.

This is used by Ceph to ensure that completed writes make it to the
platter along with the metadata operations they are bound to (by
BTRFS_IOC_TRANS_{START,END}).
Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>

dccae999

openanolis / cloud-kernel · synced successfully about 1 year ago
