Commits · 7f3c74fb831fa19bafe087e817c0a5ff3883f1ea · openanolis / cloud-kernel

25 Sep, 2008 · 40 commits

Btrfs: Keep extent mappings in ram until pending ordered extents are done · 7f3c74fb

Authored by Chris Mason on Jul 18, 2008

It was possible for stale mappings from disk to be used instead of the
new pending ordered extent. This adds a flag to the extent map struct
to keep it pinned until the pending ordered extent is actually on disk.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

7f3c74fb

Btrfs: Don't allow releasepage to succeed if EXTENT_ORDERED is set · 211f90e6

Authored by Chris Mason on Jul 18, 2008

Signed-off-by: Chris Mason <chris.mason@oracle.com>

211f90e6

Btrfs: Handle data checksumming on bios that span multiple ordered extents · 3edf7d33

Authored by Chris Mason on Jul 18, 2008

Data checksumming is done right before the bio is sent down the IO stack,
which means a single bio might span more than one ordered extent. In
this case, the checksumming data is split between two ordered extents.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

3edf7d33

Btrfs: Cleanup and comment ordered-data.c · eb84ae03

Authored by Chris Mason on Jul 17, 2008

Signed-off-by: Chris Mason <chris.mason@oracle.com>

eb84ae03

Add a per-inode lock around btrfs_drop_extents · ee6e6504

Authored by Chris Mason on Jul 17, 2008

btrfs_drop_extents is always called with a range lock held on the inode.
But, it may operate on extents outside that range as it drops and splits
them.

This patch adds a per-inode mutex that is held while calling
btrfs_drop_extents and while inserting new extents into the tree.  It
prevents races from two procs working against adjacent ranges in the tree.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

ee6e6504

Btrfs: Don't pin pages in ram until the entire ordered extent is on disk. · ba1da2f4

Authored by Chris Mason on Jul 17, 2008

Checksum items are not inserted until the entire ordered extent is on disk,
but individual pages might be clean and available for reclaim long before
the whole extent is on disk.

In order to allow those pages to be freed, we need to be able to search
the list of ordered extents to find the checksum that is going to be inserted
in the tree.  This way if the page needs to be read back in before
the checksums are in the btree, we'll be able to verify the checksum on
the page.

This commit adds the ability to search the pending ordered extents for
a given offset in the file, and changes btrfs_releasepage to allow
ordered pages to be freed.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

ba1da2f4

btrfs_start_transaction: wait for commits in progress to finish · f9295749

Authored by Chris Mason on Jul 17, 2008

btrfs_commit_transaction has to loop waiting for any writers in the
transaction to finish before it can proceed.  btrfs_start_transaction
should be polite and not join a transaction that is in the process
of being finished off.

There are a few places that can't wait, basically the ones doing IO that
might be needed to finish the transaction.  For them, btrfs_join_transaction
is added.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

f9295749
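
The difference between starting and joining a transaction, as described in f9295749, can be modeled in a few lines. The following is only a userspace sketch, not the kernel code: `struct trans_model` and every function name here are hypothetical, and where the real btrfs_start_transaction would sleep until the commit finishes, this model simply returns false.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified stand-in for a running transaction. */
struct trans_model {
    bool in_commit;    /* a commit is waiting for writers to drain */
    int  num_writers;  /* handles currently attached to the transaction */
};

/*
 * "Polite" start: refuse to attach to a transaction that is being
 * finished off. The kernel would wait here; the model reports false.
 */
static bool start_transaction_model(struct trans_model *t)
{
    if (t->in_commit)
        return false;
    t->num_writers++;
    return true;
}

/* Join: for callers whose IO may be needed to finish the commit. */
static void join_transaction_model(struct trans_model *t)
{
    t->num_writers++;
}

static void end_transaction_model(struct trans_model *t)
{
    assert(t->num_writers > 0);
    t->num_writers--;
}

/* The commit loop may only proceed once every writer has left. */
static bool commit_can_proceed(const struct trans_model *t)
{
    return t->in_commit && t->num_writers == 0;
}
```

In this model, a caller that gets false from `start_transaction_model` stands for a writer that has to wait for the next transaction, while `join_transaction_model` always succeeds.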

Btrfs: Update on disk i_size only after pending ordered extents are done · dbe674a9

Authored by Chris Mason on Jul 17, 2008

This changes the ordered data code to update i_size after the extent
is on disk.  An on-disk i_size is maintained in the in-memory btrfs inode
structures, and this is updated as extents finish.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

dbe674a9

Btrfs: Use async helpers to deal with pages that have been improperly dirtied · 247e743c

Authored by Chris Mason on Jul 17, 2008

Higher layers sometimes call set_page_dirty without asking the filesystem
to help. This causes many problems for the data=ordered and cow code.
This commit detects pages that haven't been properly set up for IO and
kicks off an async helper to deal with them.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

247e743c

Btrfs: New data=ordered implementation · e6dcd2dc

Authored by Chris Mason on Jul 17, 2008

The old data=ordered code would force commit to wait until
all the data extents from the transaction were fully on disk.  This
introduced large latencies into the commit and stalled new writers
in the transaction for a long time.

The new code changes the way data allocations and extents work:

* When delayed allocation is filled, data extents are reserved, and
  the extent bit EXTENT_ORDERED is set on the entire range of the extent.
  A struct btrfs_ordered_extent is allocated and inserted into a per-inode
  rbtree to track the pending extents.

* As each page is written EXTENT_ORDERED is cleared on the bytes corresponding
  to that page.

* When all of the bytes corresponding to a single struct btrfs_ordered_extent
  are written, the previously reserved extent is inserted into the FS
  btree and into the extent allocation trees.  The checksums for the file
  data are also updated.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

e6dcd2dc
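
The per-extent accounting described in e6dcd2dc can be sketched as a byte counter that reaches zero when the last page of the extent has been written. This is a userspace model under stated assumptions: `struct ordered_extent_model` and its functions are hypothetical names, the real struct btrfs_ordered_extent lives in a per-inode rbtree and carries more state, and the `inserted` flag merely marks the point where the kernel would add the extent to the FS btree and update checksums.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for struct btrfs_ordered_extent. */
struct ordered_extent_model {
    uint64_t file_offset;  /* start of the reserved range in the file */
    uint64_t len;          /* total bytes covered by the extent */
    uint64_t bytes_left;   /* bytes still flagged as ordered (unwritten) */
    bool     inserted;     /* set once the extent would go into the btree */
};

/* Called when delayed allocation is filled and the extent is reserved. */
static void ordered_extent_start(struct ordered_extent_model *oe,
                                 uint64_t file_offset, uint64_t len)
{
    oe->file_offset = file_offset;
    oe->len = len;
    oe->bytes_left = len;
    oe->inserted = false;
}

/*
 * Account for a finished write of `len` bytes at `offset`.
 * Returns true only for the write that completes the whole extent.
 */
static bool ordered_extent_write_done(struct ordered_extent_model *oe,
                                      uint64_t offset, uint64_t len)
{
    assert(offset >= oe->file_offset);
    assert(offset + len <= oe->file_offset + oe->len);
    assert(len <= oe->bytes_left);

    oe->bytes_left -= len;
    if (oe->bytes_left == 0 && !oe->inserted) {
        oe->inserted = true;  /* kernel: insert extent, update csums */
        return true;
    }
    return false;
}
```

The point of the design is visible in the return value: only the final page write triggers the metadata update, so a commit no longer has to wait for every data extent in the transaction.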

Btrfs: Add a per-inode csum mutex to avoid races creating csum items · 1b1e2135

Authored by Chris Mason on Jun 25, 2008

Signed-off-by: Chris Mason <chris.mason@oracle.com>

1b1e2135

Add btrfs_end_transaction_throttle to force writers to wait for pending commits · 89ce8a63

Authored by Chris Mason on Jun 25, 2008

The existing throttle mechanism was often not sufficient to prevent
new writers from coming in and making a given transaction run forever.
This adds an explicit wait at the end of most operations so they will
allow the current transaction to close.

There is no wait inside file_write, inode updates, or cow filling, all of
which have different deadlock possibilities.

This is a temporary measure until better asynchronous commit support is
added.  This code leads to stalls as it waits for data=ordered
writeback, and it really needs to be fixed.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

89ce8a63

Fix btrfs_del_ordered_inode to allow forcing the drop during unlinks · 594a24eb

Authored by Chris Mason on Jun 25, 2008

This allows us to delete an unlinked inode with dirty pages from the list
instead of forcing commit to write these out before deleting the inode.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

594a24eb

Btrfs: Replace the big fs_mutex with a collection of other locks · a2135011

Authored by Chris Mason on Jun 25, 2008

Extent allocations are still protected by a large alloc_mutex.
Objectid allocations are covered by an objectid mutex.
Other btree operations are protected by a lock on individual btree nodes.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

a2135011

Btrfs: Start btree concurrency work. · 925baedd

Authored by Chris Mason on Jun 25, 2008

The allocation trees and the chunk trees are serialized via their own
dedicated mutexes.  This means allocation location is still not very
fine grained.

The main FS btree is protected by locks on each block in the btree.  Locks
are taken top / down, and as processing finishes on a given level of the
tree, the lock is released after locking the lower level.

The end result of a search is now a path where only the lowest level
is locked.  Releasing or freeing the path drops any locks held.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

925baedd
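
The top-down locking order described in 925baedd (lock the lower level, then release the level above) is often called lock coupling. Below is a single-threaded sketch that records lock state in a flag instead of using a real lock; `struct node_model`, `search_down` and `release_path` are hypothetical names, and each node has a single child so that no key comparison is needed.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical tree block: the flag stands in for a per-block lock. */
struct node_model {
    int locked;                /* 1 while held, 0 otherwise */
    struct node_model *child;  /* next level down, NULL at the leaf */
};

static void lock_node(struct node_model *n)
{
    assert(!n->locked);
    n->locked = 1;
}

static void unlock_node(struct node_model *n)
{
    assert(n->locked);
    n->locked = 0;
}

/*
 * Walk from the root to the lowest level. At every step the lower
 * block is locked before the block above it is released, so the
 * walker always holds at least one lock on its way down.
 */
static struct node_model *search_down(struct node_model *root)
{
    struct node_model *cur = root;

    lock_node(cur);
    while (cur->child != NULL) {
        struct node_model *next = cur->child;

        lock_node(next);   /* take the lower level first */
        unlock_node(cur);  /* then drop the level above */
        cur = next;
    }
    return cur;            /* only the lowest level is still locked */
}

/* Releasing the path drops the one lock that is still held. */
static void release_path(struct node_model *lowest)
{
    unlock_node(lowest);
}
```

After `search_down` returns, only the leaf is marked locked, which matches the commit's statement that the result of a search is a path where only the lowest level is locked.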

Btrfs: split out ioctl.c · f46b5a66

Authored by Christoph Hellwig on Jun 11, 2008

Split the ioctl handling out of inode.c into a file of its own.
Also fix up checkpatch.pl warnings for the moved code.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chris Mason <chris.mason@oracle.com>

f46b5a66

Btrfs: Add async worker threads for pre and post IO checksumming · 8b712842

Authored by Chris Mason on Jun 11, 2008

Btrfs has been using workqueues to spread the checksumming load across
other CPUs in the system.  But, workqueues only schedule work on the
same CPU that queued the work, giving them a limited benefit for systems with
higher CPU counts.

This code adds a generic facility to schedule work with pools of kthreads,
and changes the bio submission code to queue bios up.  The queueing is
important to make sure large numbers of procs on the system don't
turn streaming workloads into random workloads by sending IO down
concurrently.

The end result of all of this is much higher performance (and CPU usage) when
doing checksumming on large machines.  Two worker pools are created,
one for writes and one for endio processing.  The two could deadlock if
we tried to service both from a single pool.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

8b712842

Btrfs: transaction ioctls · 6bf13c0c

Authored by Sage Weil on Jun 10, 2008

These ioctls let a user application hold a transaction open while it
performs a series of operations.  A final ioctl does a sync on the fs
(closing the current transaction).  This is the main requirement for
Ceph's OSD to be able to keep the data it's storing in a btrfs volume
consistent, and AFAICS it works just fine.  The application would do
something like

	fd = ::open("some/file", O_RDONLY);
	::ioctl(fd, BTRFS_IOC_TRANS_START);
	/* do a bunch of stuff */
	::ioctl(fd, BTRFS_IOC_TRANS_END);
or just
	::close(fd);

And to ensure it commits to disk,

	::ioctl(fd, BTRFS_IOC_SYNC);

When a transaction is held open, the trans_handle is attached to the
struct file (via private_data) so that it will get cleaned up if the
process dies unexpectedly.  A held transaction is also ended on fsync() to
avoid a deadlock.

A misbehaving application could also deliberately hold a transaction open,
effectively locking up the FS, so it may make sense to restrict something
like this to root or something.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

6bf13c0c

Btrfs: Invalidate dcache entry after creating snapshot and · 3b96362c

Authored by Sven Wegener on Jun 09, 2008

We need to invalidate an existing dcache entry after creating a new
snapshot or subvolume, because a negative dcache entry will stop us from
accessing the new snapshot or subvolume.

---
  ctree.h       |   23 +++++++++++++++++++++++
  inode.c       |    4 ++++
  transaction.c |    4 ++++
  3 files changed, 31 insertions(+)
Signed-off-by: Chris Mason <chris.mason@oracle.com>

3b96362c

btrfs delete ordered inode handling fix · e1b81e67

Authored by Mingming on May 27, 2008

Use btrfs_release_file instead of a put_inode call.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

e1b81e67

Fix corners in writepage and btrfs_truncate_page · 211c17f5

Authored by Chris Mason on May 15, 2008

The extent_io writepage calls needed an extra check for discarding
pages that started on the last byte in the file.

btrfs_truncate_page needed checks to make sure the page was still part
of the file after reading it, and most importantly, needed to wait for
all IO to the page to finish before freeing the corresponding extents on
disk.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

211c17f5

Btrfs: Handle write errors on raid1 and raid10 · 1259ab75

Authored by Chris Mason on May 12, 2008

When duplicate copies exist, writes are allowed to fail to one of those
copies.  This changeset includes a few changes that allow the FS to
continue even when some IOs fail.

It also adds verification of the parent generation number for btree blocks.
This generation is stored in the pointer to a block, and it ensures
that missed writes are detected.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

1259ab75
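
The parent generation check mentioned in 1259ab75 amounts to comparing the generation recorded in the parent's pointer with the generation found in the block that was actually read. The sketch below is a userspace illustration only: both structs and `verify_parent_generation` are hypothetical names, and the real on-disk structures hold many more fields.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical pointer stored in a parent node. */
struct block_ptr_model {
    uint64_t bytenr;      /* where the child block lives */
    uint64_t generation;  /* generation the parent expects to find */
};

/* Hypothetical header of the block that was read from disk. */
struct block_header_model {
    uint64_t bytenr;
    uint64_t generation;
};

/*
 * Returns 0 when the block matches what the parent recorded and -1
 * otherwise. A stale generation means a write to this copy was
 * missed, so the caller could retry with another mirror.
 */
static int verify_parent_generation(const struct block_ptr_model *ptr,
                                    const struct block_header_model *hdr)
{
    assert(ptr != NULL && hdr != NULL);

    if (hdr->bytenr != ptr->bytenr)
        return -1;
    if (hdr->generation != ptr->generation)
        return -1;
    return 0;
}
```

Because the expected generation is stored in the parent rather than in the block itself, a copy that silently kept old contents cannot vouch for itself.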

Btrfs: A number of nodatacow fixes · bbaf549e

Authored by Chris Mason on May 08, 2008

Once part of a delalloc request fails the cow checks, just cow the
entire range.

It is possible for the back references to all be from the same root,
but still have snapshots against an extent.  The checks are now more strict,
forcing cow any time there are multiple refs against the data extent.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

bbaf549e

Btrfs: Update nodatacow mode to support cloned single files and resizing · a68d5933

Authored by Chris Mason on May 08, 2008

Before, nodatacow only checked to make sure multiple roots didn't have
references on a single extent.  This check makes sure that multiple
inodes don't have references.

nodatacow needed an extra check to see if the block group was currently
readonly.  This way cows forced by the chunk relocation code are honored.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

a68d5933

Btrfs: Add support for online device removal · a061fc8d

Authored by Chris Mason on May 07, 2008

This required a few structural changes to the code that manages bdev pointers:

The VFS super block now gets an anon-bdev instead of a pointer to the
lowest bdev.  This allows us to avoid swapping the super block bdev pointer
around at run time.

The code to read in the super block no longer goes through the extent
buffer interface.  Things got ugly keeping the mapping constant.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

a061fc8d

Btrfs: Fix clone ioctl to not hold the path over inserts · 5d9cd9ec

Authored by Chris Mason on May 05, 2008

Signed-off-by: Chris Mason <chris.mason@oracle.com>

5d9cd9ec

Btrfs: Silence bogus inode.c compiler warnings · b9d86667

Authored by Chris Mason on May 02, 2008

Signed-off-by: Chris Mason <chris.mason@oracle.com>

b9d86667

Btrfs: Clone file data ioctl · f2eb0a24

Authored by Sage Weil on May 02, 2008

Add a new ioctl to clone file data.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

f2eb0a24

Btrfs: Add balance ioctl to restripe the chunks · ec44a35c

Authored by Chris Mason on Apr 28, 2008

Signed-off-by: Chris Mason <chris.mason@oracle.com>

ec44a35c

Btrfs: Add new ioctl to add devices · 788f20eb

Authored by Chris Mason on Apr 28, 2008

Signed-off-by: Chris Mason <chris.mason@oracle.com>

788f20eb

Btrfs: Do more optimal file RA during shrinking and defrag · 8e7bf94f

Authored by Chris Mason on Apr 28, 2008

Signed-off-by: Chris Mason <chris.mason@oracle.com>

8e7bf94f

Btrfs: Make the resizer work based on shrinking and growing devices · 8f18cf13

Authored by Chris Mason on Apr 25, 2008

Signed-off-by: Chris Mason <chris.mason@oracle.com>

8f18cf13

Btrfs: Throttle file_write when data=ordered is flushing the inode · 81d7ed29

Authored by Chris Mason on Apr 25, 2008

Signed-off-by: Chris Mason <chris.mason@oracle.com>

81d7ed29

Btrfs: Fix the unplug_io_fn to grab a consistent copy of page->mapping · bcbfce8a

Authored by Chris Mason on Apr 22, 2008

Signed-off-by: Chris Mason <chris.mason@oracle.com>

bcbfce8a

Fix btrfs_get_extent and get_block corner cases, and disable O_DIRECT reads · e1c4b745

Authored by Chris Mason on Apr 22, 2008

The generic O_DIRECT code assumes all the bios have the same bdev,
which isn't true for multi-device btrfs.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

e1c4b745

Btrfs: Make an unplug function that doesn't unplug every spindle · f2d8d74d

Authored by Chris Mason on Apr 21, 2008

Signed-off-by: Chris Mason <chris.mason@oracle.com>

f2d8d74d

Btrfs: Remove debugging statements from the invalidatepage calls · 4ef64eae

Authored by Chris Mason on Apr 21, 2008

Signed-off-by: Chris Mason <chris.mason@oracle.com>

4ef64eae

Force page->private removal in btrfs_invalidatepage · 9ad6b7bc

Authored by Chris Mason on Apr 18, 2008

btrfs_invalidatepage is not allowed to leave pages around on the lru.
Any such pages will trigger an oops later on because the VM will see
page->private and assume it is a buffer head.

This also forces extra flushes of the async work queues before
dropping all the pages on the btree inode during unmount.  Left over
items on the work queues are one possible cause of busy state ranges
during truncate_inode_pages.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

9ad6b7bc

Btrfs: Use the extent map cache to find the logical disk block during data retries · 3b951516

Authored by Chris Mason on Apr 17, 2008

The data read retry code needs to find the logical disk block before it
can resubmit new bios. But, finding this block isn't allowed to take
the fs_mutex because that will deadlock with a number of different callers.

This changes the retry code to use the extent map cache instead, but
that requires the extent map cache to have the extent we're looking for.
This is a problem because btrfs_drop_extent_cache just drops the entire
extent instead of the little tiny part it is invalidating.

The bulk of the code in this patch changes btrfs_drop_extent_cache to
invalidate only a portion of the extent cache, and changes btrfs_get_extent
to deal with the results.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

3b951516
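
Invalidating only part of a cached extent, as 3b951516 describes for btrfs_drop_extent_cache, comes down to splitting one range around a hole. The following is a rough userspace sketch of that arithmetic, not the kernel routine: `struct em_model` and `drop_range` are hypothetical names, and a real extent map also tracks the disk block and flags that have to be adjusted for the right-hand piece.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical cached mapping covering [start, start + len). */
struct em_model {
    uint64_t start;
    uint64_t len;
};

/*
 * Remove [drop_start, drop_start + drop_len) from `em`.
 * The surviving pieces are written to out[0..1]; the return value
 * is how many pieces remain (0, 1 or 2).
 */
static int drop_range(const struct em_model *em, uint64_t drop_start,
                      uint64_t drop_len, struct em_model out[2])
{
    uint64_t em_end = em->start + em->len;
    uint64_t drop_end = drop_start + drop_len;
    int n = 0;

    assert(drop_len > 0);

    /* No overlap: the mapping survives untouched. */
    if (drop_end <= em->start || drop_start >= em_end) {
        out[n++] = *em;
        return n;
    }
    /* Keep whatever lies in front of the dropped range. */
    if (drop_start > em->start) {
        out[n].start = em->start;
        out[n].len = drop_start - em->start;
        n++;
    }
    /* Keep whatever lies behind the dropped range. */
    if (drop_end < em_end) {
        out[n].start = drop_end;
        out[n].len = em_end - drop_end;
        n++;
    }
    return n;
}
```

Keeping the untouched pieces in the cache is what lets the read retry path find its extent without taking fs_mutex.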

Btrfs: Don't wait on tree block writeback before freeing them anymore · 699122f5

Authored by Chris Mason on Apr 16, 2008

This isn't required anymore because we don't reallocate blocks that
have already been written in this transaction.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

699122f5

openanolis / cloud-kernel synced successfully about 1 year ago