Commits · 49eb7e46d47ea72a9bd2a5f8cedb04f5159cc277 · openeuler / raspberrypi-kernel

25 Sep, 2008 · 40 commits

Btrfs: Dir fsync optimizations · 49eb7e46

Committed by Chris Mason on Sep 11, 2008

Drop i_mutex during the commit

Don't bother doing the fsync at all unless the dir is marked as dirtied
and needing fsync in this transaction.  For directories, this means
that someone has unlinked a file from the dir without fsyncing the
file.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

49eb7e46

Btrfs: Add a write ahead tree log to optimize synchronous operations · e02119d5

Committed by Chris Mason on Sep 05, 2008

File syncs and directory syncs are optimized by copying their
items into a special (copy-on-write) log tree. There is one log tree per
subvolume and the btrfs super block points to a tree of log tree roots.

After a crash, items are copied out of the log tree and back into the
subvolume. See tree-log.c for all the details.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

e02119d5

Btrfs: Add debugging checks to track down corrupted metadata · a1b32a59

Committed by Chris Mason on Sep 05, 2008

Signed-off-by: Chris Mason <chris.mason@oracle.com>

a1b32a59

Btrfs: Maintain a list of inodes that are delalloc and a way to wait on them · ea8c2819

Committed by Chris Mason on Aug 04, 2008

Signed-off-by: Chris Mason <chris.mason@oracle.com>

ea8c2819

Btrfs: Improve and cleanup locking done by walk_down_tree · f87f057b

Committed by Chris Mason on Aug 01, 2008

While dropping snapshots, walk_down_tree does most of the work of checking
reference counts and limiting tree traversal to just the blocks that
we are freeing.

It dropped and held the allocation mutex in strange and confusing ways.
This commit changes it to only hold the mutex while actually freeing a block.

The rest of the checks around reference counts should be safe without the lock
because we only allow one process in btrfs_drop_snapshot at a time. Other
processes dropping reference counts should not drop it to 1 because
their tree roots already have an extra ref on the block.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

f87f057b

Btrfs: Drop some debugging around the extent_map pinned flag · 3ce7e67a

Committed by Chris Mason on Jul 31, 2008

Signed-off-by: Chris Mason <chris.mason@oracle.com>

3ce7e67a

Btrfs: Throttle tuning · 37d1aeee

Committed by Chris Mason on Jul 31, 2008

This avoids waiting for transactions with pages locked by breaking out
the code to wait for the current transaction to close into a function
called by btrfs_throttle.

It also lowers the limits for where we start throttling.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

37d1aeee

Btrfs: Add compatibility for kernels >= 2.6.27-rc1 · 0ee0fda0

Committed by Sven Wegener on Jul 30, 2008

Add a couple of #if's to follow API changes.
Signed-off-by: Sven Wegener <sven.wegener@stealer.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>

0ee0fda0

Btrfs: implement memory reclaim for leaf reference cache · bcc63abb

Committed by Yan on Jul 30, 2008

The memory reclaim issue happens when a snapshot exists. In that
case, some cache entries may not be used while an old snapshot is dropped,
so they remain in the cache until umount.

The patch adds a field to struct btrfs_leaf_ref to record its create time. In addition,
the patch links all dead roots of a given snapshot together in order of
create time. After an old snapshot is completely dropped, we check the dead
root list and remove all cache entries created before the oldest dead root in
the list.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

bcc63abb
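
The reclaim rule in the commit above can be illustrated with a small user-space sketch. This is not the kernel code: `toy_leaf_ref`, `toy_prune_refs_before`, and the use of a plain array are all hypothetical, and the create time is assumed to be a monotonically increasing counter such as a transaction id. The caller is assumed to pass the create time of the oldest dead root as the cutoff.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for a cached leaf reference. */
struct toy_leaf_ref {
    uint64_t bytenr;      /* disk byte number of the leaf */
    uint64_t create_time; /* assumed: counter value when the entry was cached */
};

/*
 * Drop every cache entry created before `cutoff` (the create time of the
 * oldest dead root), compacting the array in place.
 * Returns the number of entries kept.
 */
size_t toy_prune_refs_before(struct toy_leaf_ref *refs, size_t n, uint64_t cutoff)
{
    size_t kept = 0;

    assert(refs != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (refs[i].create_time >= cutoff)
            refs[kept++] = refs[i]; /* still needed by some dead root */
    }
    return kept;
}
```

Because the commit keeps dead roots linked in create-time order, the cutoff would simply be read from the head of that list; the sketch leaves that lookup to the caller.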

Btrfs: Throttle operations if the reference cache gets too large · ab78c84d

Committed by Chris Mason on Jul 29, 2008

A large reference cache is directly related to a lot of work pending
for the cleaner thread.  This throttles back new operations based on
the size of the reference cache so the cleaner thread will be able to keep
up.

Overall, this actually makes the FS faster because the cleaner thread will
be more likely to find things in cache.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

ab78c84d

Btrfs: Leaf reference cache update · 017e5369

Committed by Chris Mason on Jul 28, 2008

This changes the reference cache to make a single cache per root
instead of one cache per transaction, and to key by the byte number
of the disk block instead of the keys inside.

This makes it much less likely to have cache misses if a snapshot
or something has an extra reference on a higher node or a leaf while
the first transaction that added the leaf into the cache is dropping.

Some throttling is added to functions that free blocks heavily so they
wait for old transactions to drop.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

017e5369

Btrfs: Fix some data=ordered related data corruptions · f421950f

Committed by Chris Mason on Jul 22, 2008

Stress testing was showing data checksum errors, most of which were caused
by a lookup bug in the extent_map tree.  The tree was caching the last
pointer returned, and searches would check the last pointer first.

But, search callers also expect the search to return the very first
matching extent in the range, which wasn't always true with the last
pointer usage.

For now, the code to cache the last return value is just removed.  It is
easy to fix, but I think lookups are rare enough that it isn't required anymore.

This commit also replaces do_sync_mapping_range with a local copy of the
related functions.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

f421950f
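
To make the lookup bug in the commit above concrete, here is a toy model using a sorted array rather than the kernel's tree. All names (`toy_extent`, `toy_first_overlap`, `toy_cached_overlap`) are hypothetical. The sketch shows how a "check the last hit first" shortcut can return an extent that does overlap the range but is not the first one in it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified extent covering bytes [start, start + len). */
struct toy_extent {
    uint64_t start;
    uint64_t len;
};

/*
 * Return the index of the FIRST extent overlapping [start, start + len) in an
 * array sorted by start offset with no overlaps, or -1 if there is none.
 */
long toy_first_overlap(const struct toy_extent *map, size_t n,
                       uint64_t start, uint64_t len)
{
    size_t lo = 0, hi = n;
    uint64_t end = start + len;

    assert(len > 0);
    /* Binary search for the first extent that ends beyond `start`. */
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (map[mid].start + map[mid].len <= start)
            lo = mid + 1;
        else
            hi = mid;
    }
    if (lo < n && map[lo].start < end)
        return (long)lo;
    return -1;
}

/* The shortcut: reuse the last hit if it overlaps, even if it is not the first match. */
long toy_cached_overlap(const struct toy_extent *map, size_t n, long last,
                        uint64_t start, uint64_t len)
{
    if (last >= 0 && (size_t)last < n &&
        map[last].start < start + len &&
        map[last].start + map[last].len > start)
        return last; /* overlaps, but an earlier extent may overlap too */
    return toy_first_overlap(map, n, start, len);
}
```

With three adjacent 4 KiB extents and a query for bytes 2048 to 10240, the plain search returns index 0 while the shortcut with a stale last hit of 2 returns index 2.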

Btrfs: Data ordered fixes · 4a096752

Committed by Chris Mason on Jul 21, 2008

* In btrfs_delete_inode, wait for ordered extents after calling
truncate_inode_pages.  This is much faster, and more correct.

* Properly clear out the PageChecked bit everywhere we redirty the page.

* Change the writepage fixup handler to lock the page range and check to
see if an ordered extent had been inserted since the improperly dirtied
page was discovered

* Wait for ordered extents outside the transaction.  This isn't required
for locking rules but does improve transaction latencies

* Reduce contention on the alloc_mutex by dropping it while incrementing
refs on a node/leaf and while dropping refs on a leaf.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

4a096752

Btrfs: Keep extent mappings in ram until pending ordered extents are done · 7f3c74fb

Committed by Chris Mason on Jul 18, 2008

It was possible for stale mappings from disk to be used instead of the
new pending ordered extent. This adds a flag to the extent map struct
to keep it pinned until the pending ordered extent is actually on disk.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

7f3c74fb

Add a per-inode lock around btrfs_drop_extents · ee6e6504

Committed by Chris Mason on Jul 17, 2008

btrfs_drop_extents is always called with a range lock held on the inode.
But, it may operate on extents outside that range as it drops and splits
them.

This patch adds a per-inode mutex that is held while calling
btrfs_drop_extents and while inserting new extents into the tree.  It
prevents races from two procs working against adjacent ranges in the tree.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

ee6e6504

Btrfs: Don't pin pages in ram until the entire ordered extent is on disk. · ba1da2f4

Committed by Chris Mason on Jul 17, 2008

Checksum items are not inserted until the entire ordered extent is on disk,
but individual pages might be clean and available for reclaim long before
the whole extent is on disk.

In order to allow those pages to be freed, we need to be able to search
the list of ordered extents to find the checksum that is going to be inserted
in the tree.  This way if the page needs to be read back in before
the checksums are in the btree, we'll be able to verify the checksum on
the page.

This commit adds the ability to search the pending ordered extents for
a given offset in the file, and changes btrfs_releasepage to allow
ordered pages to be freed.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

ba1da2f4

btrfs_start_transaction: wait for commits in progress to finish · f9295749

Committed by Chris Mason on Jul 17, 2008

btrfs_commit_transaction has to loop waiting for any writers in the
transaction to finish before it can proceed.  btrfs_start_transaction
should be polite and not join a transaction that is in the process
of being finished off.

There are a few places that can't wait, basically the ones doing IO that
might be needed to finish the transaction.  For them, btrfs_join_transaction
is added.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

f9295749

Btrfs: Update on disk i_size only after pending ordered extents are done · dbe674a9

Committed by Chris Mason on Jul 17, 2008

This changes the ordered data code to update i_size after the extent
is on disk.  An on disk i_size is maintained in the in-memory btrfs inode
structures, and this is updated as extents finish.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

dbe674a9

Btrfs: Use async helpers to deal with pages that have been improperly dirtied · 247e743c

Committed by Chris Mason on Jul 17, 2008

Higher layers sometimes call set_page_dirty without asking the filesystem
to help. This causes many problems for the data=ordered and cow code.
This commit detects pages that haven't been properly set up for IO and
kicks off an async helper to deal with them.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

247e743c

Btrfs: New data=ordered implementation · e6dcd2dc

Committed by Chris Mason on Jul 17, 2008

The old data=ordered code would force commit to wait until
all the data extents from the transaction were fully on disk.  This
introduced large latencies into the commit and stalled new writers
in the transaction for a long time.

The new code changes the way data allocations and extents work:

* When delayed allocation is filled, data extents are reserved, and
  the extent bit EXTENT_ORDERED is set on the entire range of the extent.
  A struct btrfs_ordered_extent is allocated and inserted into a per-inode
  rbtree to track the pending extents.

* As each page is written EXTENT_ORDERED is cleared on the bytes corresponding
  to that page.

* When all of the bytes corresponding to a single struct btrfs_ordered_extent
  are written, the previously reserved extent is inserted into the FS
  btree and into the extent allocation trees.  The checksums for the file
  data are also updated.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

e6dcd2dc
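
The three steps listed in the commit above can be sketched as a tiny state machine. This is only an illustration under stated assumptions: `toy_ordered_extent` and its helpers are hypothetical, pages are assumed to be 4 KiB, and the per-range EXTENT_ORDERED bit is reduced to a single count of bytes not yet written.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_PAGE_SIZE 4096u /* assumption: 4 KiB pages */

/* Hypothetical, much-reduced stand-in for struct btrfs_ordered_extent. */
struct toy_ordered_extent {
    uint64_t file_offset; /* start of the range in the file */
    uint64_t len;         /* total bytes reserved for the extent */
    uint64_t bytes_left;  /* bytes still marked ordered, i.e. not yet written */
    bool inserted;        /* set once the extent would go into the btree */
};

/* Step 1: delayed allocation is filled; the whole range starts out ordered. */
void toy_ordered_init(struct toy_ordered_extent *oe, uint64_t file_offset,
                      uint64_t len)
{
    oe->file_offset = file_offset;
    oe->len = len;
    oe->bytes_left = len;
    oe->inserted = false;
}

/*
 * Steps 2 and 3: one page finished writeback. Clear its bytes and, once
 * nothing is left, mark the extent as ready for insertion.
 * Returns true only on the write that completes the extent.
 */
bool toy_page_written(struct toy_ordered_extent *oe)
{
    uint64_t done = oe->bytes_left < TOY_PAGE_SIZE ? oe->bytes_left
                                                   : TOY_PAGE_SIZE;

    assert(!oe->inserted); /* no writes expected after completion */
    oe->bytes_left -= done;
    if (oe->bytes_left == 0) {
        oe->inserted = true;
        return true;
    }
    return false;
}
```

The point of the design is visible even in the toy: nothing waits for the whole transaction's data, and each extent becomes visible in the metadata only when its own last page is on disk.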

Btrfs: Add a per-inode csum mutex to avoid races creating csum items · 1b1e2135

Committed by Chris Mason on Jun 25, 2008

Signed-off-by: Chris Mason <chris.mason@oracle.com>

1b1e2135

Add btrfs_end_transaction_throttle to force writers to wait for pending commits · 89ce8a63

Committed by Chris Mason on Jun 25, 2008

The existing throttle mechanism was often not sufficient to prevent
new writers from coming in and making a given transaction run forever.
This adds an explicit wait at the end of most operations so they will
allow the current transaction to close.

There is no wait inside file_write, inode updates, or cow filling, all of which
have different deadlock possibilities.

This is a temporary measure until better asynchronous commit support is
added.  This code leads to stalls as it waits for data=ordered
writeback, and it really needs to be fixed.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

89ce8a63

Fix btrfs_del_ordered_inode to allow forcing the drop during unlinks · 594a24eb

Committed by Chris Mason on Jun 25, 2008

This allows us to delete an unlinked inode with dirty pages from the list
instead of forcing commit to write these out before deleting the inode.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

594a24eb

Btrfs: Replace the big fs_mutex with a collection of other locks · a2135011

Committed by Chris Mason on Jun 25, 2008

Extent allocations are still protected by a large alloc_mutex.
Objectid allocations are covered by an objectid mutex.
Other btree operations are protected by a lock on individual btree nodes.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

a2135011

Btrfs: transaction ioctls · 6bf13c0c

Committed by Sage Weil on Jun 10, 2008

These ioctls let a user application hold a transaction open while it
performs a series of operations.  A final ioctl does a sync on the fs
(closing the current transaction).  This is the main requirement for
Ceph's OSD to be able to keep the data it's storing in a btrfs volume
consistent, and AFAICS it works just fine.  The application would do
something like

	fd = ::open("some/file", O_RDONLY);
	::ioctl(fd, BTRFS_IOC_TRANS_START);
	/* do a bunch of stuff */
	::ioctl(fd, BTRFS_IOC_TRANS_END);

or just

	::close(fd);

And to ensure it commits to disk,

	::ioctl(fd, BTRFS_IOC_SYNC);

When a transaction is held open, the trans_handle is attached to the
struct file (via private_data) so that it will get cleaned up if the
process dies unexpectedly.  A held transaction is also ended on fsync() to
avoid a deadlock.

A misbehaving application could also deliberately hold a transaction open,
effectively locking up the FS, so it may make sense to restrict something
like this to root or something.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

6bf13c0c

btrfs delete ordered inode handling fix · e1b81e67

Committed by Mingming on May 27, 2008

Use btrfs_release_file instead of a put_inode call
Signed-off-by: Chris Mason <chris.mason@oracle.com>

e1b81e67

Fix corners in writepage and btrfs_truncate_page · 211c17f5

Committed by Chris Mason on May 15, 2008

The extent_io writepage calls needed an extra check for discarding
pages that started on the last byte in the file.

btrfs_truncate_page needed checks to make sure the page was still part
of the file after reading it, and most importantly, needed to wait for
all IO to the page to finish before freeing the corresponding extents on
disk.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

211c17f5

Btrfs: Add workaround for AppArmor changing remove_suid() · 12fa8ec6

Committed by Jeff Mahoney on May 02, 2008

In openSUSE 10.3, AppArmor modifies remove_suid to take a struct path
rather than just a dentry. This patch tests that the kernel is openSUSE
10.3 or newer and adjusts the call accordingly.

Debian/Ubuntu with AppArmor applied will also need a similar patch.
Maintainers of btrfs under those distributions should build on this
patch or, alternatively, alter their package descriptions to add
-DREMOVE_SUID_PATH to the compiler command line.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ b/compat.h	2008-02-06 16:46:13.000000000 -0500
@@ -0,0 +1,15 @@
+#ifndef _COMPAT_H_
+#define _COMPAT_H_
+
+
+/*
+ * Even if AppArmor isn't enabled, it still has different prototypes.
+ * Add more distro/version pairs here to declare which has AppArmor applied.
+ */
+#if defined(CONFIG_SUSE_KERNEL)
+# if LINUX_VERSION_CODE >= KERNEL_VERSION(2,6,22)
+# define REMOVE_SUID_PATH 1
+# endif
+#endif
+
+#endif /* _COMPAT_H_ */
--- a/file.c	2008-02-06 11:37:39.000000000 -0500
+++ b/file.c	2008-02-06 16:46:23.000000000 -0500
@@ -37,6 +37,7 @@
 #include "ordered-data.h"
 #include "ioctl.h"
 #include "print-tree.h"
+#include "compat.h"

 static int btrfs_copy_from_user(loff_t pos, int num_pages, int write_bytes,
@@ -790,7 +791,11 @@ static ssize_t btrfs_file_write(struct f
 		goto out_nolock;
 	if (count == 0)
 		goto out_nolock;
+#ifdef REMOVE_SUID_PATH
+	err = remove_suid(&file->f_path);
+#else
 	err = remove_suid(fdentry(file));
+#endif
 	if (err)
 		goto out_nolock;
 	file_update_time(file);
Signed-off-by: Chris Mason <chris.mason@oracle.com>

12fa8ec6

Btrfs: Fix do_sync_file_range ifdefs (2.6.22) · bb8885cc

Committed by Chris Mason on May 02, 2008

Signed-off-by: Chris Mason <chris.mason@oracle.com>

bb8885cc

Btrfs: Clone file data ioctl · f2eb0a24

Committed by Sage Weil on May 02, 2008

Add a new ioctl to clone file data
Signed-off-by: Chris Mason <chris.mason@oracle.com>

f2eb0a24

Btrfs: Throttle file_write when data=ordered is flushing the inode · 81d7ed29

Committed by Chris Mason on Apr 25, 2008

Signed-off-by: Chris Mason <chris.mason@oracle.com>

81d7ed29

Btrfs: Set nodatasum on the inode when written by a nodatasum mount · 409c6118

Committed by Chris Mason on Apr 22, 2008

Signed-off-by: Chris Mason <chris.mason@oracle.com>

409c6118

Btrfs: Use the extent map cache to find the logical disk block during data retries · 3b951516

Committed by Chris Mason on Apr 17, 2008

The data read retry code needs to find the logical disk block before it
can resubmit new bios. But, finding this block isn't allowed to take
the fs_mutex because that will deadlock with a number of different callers.

This changes the retry code to use the extent map cache instead, but
that requires the extent map cache to have the extent we're looking for.
This is a problem because btrfs_drop_extent_cache just drops the entire
extent instead of the little tiny part it is invalidating.

The bulk of the code in this patch changes btrfs_drop_extent_cache to
invalidate only a portion of the extent cache, and changes btrfs_get_extent
to deal with the results.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

3b951516
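
The change to btrfs_drop_extent_cache described above, invalidating only part of a cached extent instead of the whole thing, comes down to splitting a range around a hole. The following is a minimal sketch of that idea, not the kernel function: `toy_em` and `toy_drop_range` are hypothetical names, and details a real extent map entry carries (such as adjusting the disk block start of the tail piece) are left out.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical cached extent covering bytes [start, end). */
struct toy_em {
    uint64_t start;
    uint64_t end;
};

/*
 * Invalidate [drop_start, drop_end) within `em`. The surviving pieces (at
 * most two: one before and one after the hole) are written to `out`.
 * Returns how many pieces survive.
 */
int toy_drop_range(struct toy_em em, uint64_t drop_start, uint64_t drop_end,
                   struct toy_em out[2])
{
    int n = 0;

    assert(em.start < em.end && drop_start < drop_end);
    if (drop_end <= em.start || drop_start >= em.end) {
        out[n++] = em; /* no overlap: keep the extent whole */
        return n;
    }
    if (drop_start > em.start)
        out[n++] = (struct toy_em){ em.start, drop_start }; /* front piece */
    if (drop_end < em.end)
        out[n++] = (struct toy_em){ drop_end, em.end };     /* tail piece */
    return n;
}
```

Keeping the front and tail pieces cached is what lets a later lookup, such as the read retry path, still find a mapping without going back to the btree.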

Btrfs: A few updates for 2.6.18 and versions older than 2.6.25 · b248a415

Committed by Chris Mason on Apr 14, 2008

This includes fixing a missing spinlock init call that caused oops on mount
for most kernels other than 2.6.25.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

b248a415

Btrfs: Add O_DIRECT read and write (writes == buffered + cache flush) · 16432985

Committed by Chris Mason on Apr 10, 2008

This adds basic O_DIRECT read and write support.  In the write case, we
just do a normal buffered write followed by a cache flush.  O_DIRECT +
O_SYNC are required to trigger metadata syncs.

In the read case, there is a basic btrfs_get_block call for use by
the generic O_DIRECT code.  This does honor multi-volume mapping rules
but it skips all checksumming.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

16432985

Btrfs: Properly cast before shifting · 0740c82b

Committed by Chris Mason on Feb 19, 2008

Signed-off-by: Chris Mason <chris.mason@oracle.com>

0740c82b

Btrfs: Take the extent lock before dropping the delalloc bits · d99cb30a

Committed by Chris Mason on Feb 19, 2008

Signed-off-by: Chris Mason <chris.mason@oracle.com>

d99cb30a

Btrfs: Properly clear dirty and delalloc extent bits while preparing the file for write · 0762704b

Committed by Chris Mason on Feb 19, 2008

Yan Zheng noticed that we don't clear the extent state tree dirty and delalloc
bits when we clear the dirty bits on the page during file write.

This leads to csum errors later on.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

0762704b

Btrfs: Fix "no csum found for inode" issue. · 39b5637f

Committed by Yan on Feb 15, 2008

A few code paths were not properly updated for the extent map changes.  This
may be the cause of the "no csum found for inode" issue.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

39b5637f

Btrfs: Fix i_blocks accounting · 9069218d

Committed by Chris Mason on Feb 08, 2008

Now that delayed allocation accounting works, i_blocks accounting is changed
to only modify i_blocks when extents are inserted or removed.

The fillattr call is changed to include the delayed allocation byte count
in the i_blocks result.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

9069218d
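
As a rough illustration of the fillattr change above, the reported block count can be thought of as the sum of two byte counters converted to 512-byte units, which is the unit stat(2) uses for st_blocks. The function name and its parameters are hypothetical and only mirror the description in the commit message.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Report i_blocks in 512-byte units, counting bytes already accounted to
 * inserted extents plus bytes still waiting in delayed allocation.
 * Both counters are assumptions standing in for the inode's real fields.
 */
uint64_t toy_stat_blocks(uint64_t extent_bytes, uint64_t delalloc_bytes)
{
    assert(extent_bytes <= UINT64_MAX - delalloc_bytes); /* no overflow */
    return (extent_bytes + delalloc_bytes) >> 9;
}
```

With this split, a file with buffered but unallocated data still reports a non-zero block count, while the stored i_blocks value only changes when extents are actually inserted or removed.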