Commits · c2c6ca417e2db7a519e6e92c82f4a933d940d076 · openeuler / Kernel

15 Mar, 2010 2 commits

Btrfs: cache ordered extent when completing io · 5a1a3df1

Authored by Josef Bacik on Feb 02, 2010

When finishing io we run btrfs_dec_test_ordered_pending, and then immediately
run btrfs_lookup_ordered_extent, but btrfs_dec_test_ordered_pending does that
already, so we're searching twice when we don't have to. This patch lets us
pass a btrfs_ordered_extent in to btrfs_dec_test_ordered_pending so if we do
complete io on that ordered extent we can just use the one we found then instead
of having to do another btrfs_lookup_ordered_extent. This made my fio job with
the other patch go from 24 mb/s to 29 mb/s.
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>

5a1a3df1
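
The idea in this commit message can be sketched in user-space C. This is only an illustration under stated assumptions, not the kernel code: the `ordered_extent` struct, the flat array standing in for the per-inode rbtree, the `lookup_count` counter, and the function names `lookup_ordered` and `dec_test_pending` are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct btrfs_ordered_extent. */
struct ordered_extent {
    unsigned long long file_offset; /* first byte covered */
    unsigned long long len;         /* number of bytes covered */
    unsigned long long bytes_left;  /* bytes still under IO */
};

/* Counts searches so the saved lookup is observable. */
static int lookup_count;

/* Linear search standing in for the rbtree search. */
static struct ordered_extent *lookup_ordered(struct ordered_extent *tree,
                                             size_t n,
                                             unsigned long long offset)
{
    lookup_count++;
    for (size_t i = 0; i < n; i++) {
        if (offset >= tree[i].file_offset &&
            offset < tree[i].file_offset + tree[i].len)
            return &tree[i];
    }
    return NULL;
}

/*
 * Account for io_size finished bytes at offset. Returns 1 when the
 * extent has no bytes left under IO. When that happens and cached is
 * non-NULL, the extent that was found is handed back through *cached,
 * so the caller does not have to search a second time.
 */
static int dec_test_pending(struct ordered_extent *tree, size_t n,
                            unsigned long long offset,
                            unsigned long long io_size,
                            struct ordered_extent **cached)
{
    struct ordered_extent *oe = lookup_ordered(tree, n, offset);
    int done = 0;

    if (oe && io_size <= oe->bytes_left) {
        oe->bytes_left -= io_size;
        done = (oe->bytes_left == 0);
    }
    if (cached)
        *cached = done ? oe : NULL;
    return done;
}
```

With this shape, an IO completion handler makes one search per call and reuses `*cached` once the function reports completion, instead of following it with a separate lookup.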

Btrfs: change the ordered tree to use a spinlock instead of a mutex · 49958fd7

Authored by Josef Bacik on Feb 02, 2010

The ordered tree used to need a mutex, but currently all we use it for is to
protect the rb_tree, and a spin_lock is just fine for that. Using a spin_lock
instead makes dbench run a little faster, 58 mb/s instead of 51 mb/s, and have
less latency, 3445.138 ms instead of 3820.633 ms.
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>

49958fd7

09 Mar, 2010 1 commit

Btrfs: use RB_ROOT to intialize rb_trees instead of setting rb_node to NULL · 6bef4d31

Authored by Eric Paris on Feb 23, 2010

btrfs initializes rb trees in quite a number of places by setting rb_node =
NULL;  The problem with this is that 17d9ddc7 in the
linux-next tree adds a new field to that struct which needs to be NULL for
the new rbtree library code to work properly.  This patch uses RB_ROOT as
the initializer so all of the relevant fields will be NULL'd.  Without the
patch I get a panic.
Signed-off-by: Eric Paris <eparis@redhat.com>
Acked-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>

6bef4d31
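
The hazard described here can be shown with a small user-space sketch. It assumes a hypothetical `tree_root` struct with a later-added `extra` field and a `TREE_ROOT` macro written in the spirit of the kernel's `RB_ROOT`; none of these names come from the kernel sources.

```c
#include <assert.h>
#include <stddef.h>

struct node;

/* Hypothetical root struct that gained a second field over time. */
struct tree_root {
    struct node *top; /* the field callers always knew about */
    void *extra;      /* a field added later; must start out NULL */
};

/* One initializer that covers every field, in the spirit of RB_ROOT. */
#define TREE_ROOT ((struct tree_root){ NULL, NULL })

struct owner {
    struct tree_root root;
};

/* Fragile: clears only the one field the caller knows about. */
static void init_by_field(struct owner *o)
{
    o->root.top = NULL;
}

/* Robust: whole-struct assignment clears fields added later too. */
static void init_by_macro(struct owner *o)
{
    o->root = TREE_ROOT;
}
```

If the owning struct is not zeroed beforehand, `init_by_field` leaves whatever was in `extra`, which is the kind of stale value that could lead to the panic mentioned in the message.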

18 Dec, 2009 2 commits

Btrfs: Add delayed iput · 24bbcf04

Authored by Yan, Zheng on Nov 12, 2009

iput() can trigger new transactions if we are dropping the
final reference, so calling it in btrfs_commit_transaction
may end in a deadlock. This patch adds delayed iput to avoid
the issue.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>

24bbcf04
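
A minimal sketch of the deferral pattern, assuming a toy `inode` with a reference count and a singly linked "delayed" list. The struct layouts and the names `put_inode`, `add_delayed_put`, and `run_delayed_puts` are hypothetical and are not the kernel's API.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical minimal inode: a reference count plus a list link. */
struct inode {
    int refs;
    int evicted;                /* set when the last reference is dropped */
    struct inode *next_delayed; /* link in the delayed list */
};

struct fs_info {
    int in_commit;              /* nonzero while a commit is running */
    struct inode *delayed_head; /* inodes whose put was postponed */
};

/*
 * Drop one reference. The final drop does the teardown, which in a
 * real filesystem might need to start a new transaction.
 */
static void put_inode(struct inode *ino)
{
    if (--ino->refs == 0)
        ino->evicted = 1;
}

/*
 * Queue the put instead of running it now. In this sketch an inode
 * may be queued at most once, because the link is stored in the inode.
 */
static void add_delayed_put(struct fs_info *fs, struct inode *ino)
{
    ino->next_delayed = fs->delayed_head;
    fs->delayed_head = ino;
}

/* Run every queued put; only valid when no commit is in progress. */
static void run_delayed_puts(struct fs_info *fs)
{
    assert(!fs->in_commit);
    while (fs->delayed_head) {
        struct inode *ino = fs->delayed_head;

        fs->delayed_head = ino->next_delayed;
        ino->next_delayed = NULL;
        put_inode(ino);
    }
}
```

The point of the design is that code running inside a commit only appends to the list, and the potentially transaction-starting work happens later from a safe context.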

Btrfs: Fix disk_i_size update corner case · c2167754

Authored by Yan, Zheng on Nov 12, 2009

In some cases, file extents are inserted without involving an
ordered struct. In these cases, we update disk_i_size directly,
without checking for pending ordered extents or the DELALLOC bit. This
patch extends btrfs_ordered_update_i_size() to handle these cases.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>

c2167754

02 Oct, 2009 1 commit

Btrfs: remove duplicates of filemap_ helpers · 8aa38c31

Authored by Christoph Hellwig on Oct 01, 2009

Use filemap_fdatawrite_range and filemap_fdatawait_range instead of
local copies of the functions.  For filemap_fdatawait_range that
also means replacing the awkward old wait_on_page_writeback_range
calling convention with the regular filemap byte offsets.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chris Mason <chris.mason@oracle.com>

8aa38c31

12 Sep, 2009 1 commit

Btrfs: Use PagePrivate2 to track pages in the data=ordered code. · 8b62b72b

Authored by Chris Mason on Sep 02, 2009

Btrfs writes go through delalloc to the data=ordered code.  This
makes sure that all of the data is on disk before the metadata
that references it.  The tracking means that we have to make sure
each page in an extent is fully written before we add that extent into
the on-disk btree.

This was done in the past by setting the EXTENT_ORDERED bit for the
range of an extent when it was added to the data=ordered code, and then
clearing the EXTENT_ORDERED bit in the extent state tree as each page
finished IO.

One of the reasons we had to do this was because sometimes pages are
magically dirtied without page_mkwrite being called.  The EXTENT_ORDERED
bit is checked at writepage time, and if it isn't there, our page became
dirty without going through the proper path.

These bit operations make for a number of rbtree searches for each page,
and can cause considerable lock contention.

This commit switches from the EXTENT_ORDERED bit to use PagePrivate2.
As pages go into the ordered code, PagePrivate2 is set on each one.
This is a cheap operation because we already have all the pages locked
and ready to go.

As IO finishes, the PagePrivate2 bit is cleared and the ordered
accounting is updated for each page.

At writepage time, if the PagePrivate2 bit is missing, we go into the
writepage fixup code to handle improperly dirtied pages.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

8b62b72b

01 Apr, 2009 1 commit

Btrfs: add extra flushing for renames and truncates · 5a3f23d5

Authored by Chris Mason on Mar 31, 2009

Renames and truncates are both common ways to replace old data with new
data. The filesystem can make an effort to make sure the new data is
on disk before actually replacing the old data.

This is especially important for rename, which many applications use as
though it were atomic for both the data and the metadata involved. The
current btrfs code will happily replace a file that is fully on disk
with one that was just created and still has pending IO.

If we crash after transaction commit but before the IO is done, we'll end
up replacing a good file with a zero length file. The solution used
here is to create a list of inodes that need special ordering and force
them to disk before the commit is done. This is similar to the
ext3 style data=ordering, except it is only done on selected files.

Btrfs is able to get away with this because it does not wait on commits
very often, even for fsync (which uses a sub-commit).

For renames, we order the file when it wasn't already
on disk and when it is replacing an existing file. Larger files
are sent to filemap_flush right away (before the transaction handle is
opened).

For truncates, we order if the file goes from non-zero size down to
zero size. This is a little different, because at the time of the
truncate the file has no dirty bytes to order. But, we flag the inode
so that it is added to the ordered list on close (via release method). We
also immediately add it to the ordered list of the current transaction
so that we can try to flush down any writes the application sneaks in
before commit.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

5a3f23d5

09 Dec, 2008 1 commit

Btrfs: move data checksumming into a dedicated tree · d20f7043

Authored by Chris Mason on Dec 08, 2008

Btrfs stores checksums for each data block.  Until now, they have
been stored in the subvolume trees, indexed by the inode that is
referencing the data block.  This means that when we read the inode,
we've probably read in at least some checksums as well.

But, this has a few problems:

* The checksums are indexed by logical offset in the file.  When
compression is on, this means we have to do the expensive checksumming
on the uncompressed data.  It would be faster if we could checksum
the compressed data instead.

* If we implement encryption, we'll be checksumming the plain text and
storing that on disk.  This is significantly less secure.

* For either compression or encryption, we have to get the plain text
back before we can verify the checksum as correct.  This makes the raid
layer balancing and extent moving much more expensive.

* It makes the front end caching code more complex, as we have to touch
the subvolume and inodes as we cache extents.

* There is potentially one copy of the checksum in each subvolume
referencing an extent.

The solution used here is to store the extent checksums in a dedicated
tree.  This allows us to index the checksums by physical extent
start and length.  It means:

* The checksum is against the data stored on disk, after any compression
or encryption is done.

* The checksum is stored in a central location, and can be verified without
following back references, or reading inodes.

This makes compression significantly faster by reducing the amount of
data that needs to be checksummed.  It will also allow much faster
raid management code in general.

The checksums are indexed by a key with a fixed objectid (a magic value
in ctree.h) and offset set to the starting byte of the extent.  This
allows us to copy the checksum items into the fsync log tree directly (or
any other tree), without having to invent a second format for them.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

d20f7043
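
The keying scheme in the last paragraph of this message can be sketched as follows. This is an illustration only: the objectid constant, the 4096-byte sector size, the 32-bit checksum width, and the `csum_item` layout are assumptions made for the example, not the on-disk format.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical fixed objectid; the real magic value lives in ctree.h. */
#define CSUM_OBJECTID_SKETCH 0xC5ULL
/* Assumed sector size: one checksum per 4096 bytes of on-disk data. */
#define SECTOR_SIZE_SKETCH 4096ULL

/* Key with a fixed objectid and the extent's starting disk byte. */
struct csum_key {
    uint64_t objectid;
    uint64_t offset;
};

struct csum_item {
    struct csum_key key;
    uint32_t nr;          /* number of checksums in this item */
    const uint32_t *sums; /* one checksum per sector */
};

/*
 * Find the checksum covering an on-disk byte address. Only the disk
 * address is needed; no inode or subvolume is consulted.
 * Returns 1 and stores the checksum in *out when found, else 0.
 */
static int lookup_csum(const struct csum_item *items, size_t n,
                       uint64_t disk_bytenr, uint32_t *out)
{
    for (size_t i = 0; i < n; i++) {
        uint64_t start = items[i].key.offset;
        uint64_t end = start + items[i].nr * SECTOR_SIZE_SKETCH;

        if (items[i].key.objectid != CSUM_OBJECTID_SKETCH)
            continue;
        if (disk_bytenr >= start && disk_bytenr < end) {
            *out = items[i].sums[(disk_bytenr - start) / SECTOR_SIZE_SKETCH];
            return 1;
        }
    }
    return 0;
}
```

Because the key carries no per-file information, the same item could be copied into another tree unchanged, which is the property the message relies on for the fsync log tree.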

31 Oct, 2008 2 commits

Btrfs: Add fallocate support v2 · d899e052

Authored by Yan Zheng on Oct 30, 2008

This patch updates btrfs-progs for fallocate support.

fallocate is a little different in Btrfs because we need to tell the
COW system that a given preallocated extent doesn't need to be
cow'd as long as there are no snapshots of it.  This leverages the
-o nodatacow checks.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>

d899e052

Btrfs: update nodatacow code v2 · 80ff3856

Authored by Yan Zheng on Oct 30, 2008

This patch simplifies the nodatacow checker. If all references
were created after the latest snapshot, then we can avoid COW
safely. This patch also updates run_delalloc_nocow to do more
fine-grained checking.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>

80ff3856

30 Oct, 2008 1 commit

Btrfs: Add zlib compression support · c8b97818

Authored by Chris Mason on Oct 29, 2008

This is a large change for adding compression on reading and writing,
both for inline and regular extents.  It does some fairly large
surgery to the writeback paths.

Compression is off by default and enabled by mount -o compress.  Even
when the -o compress mount option is not used, it is possible to read
compressed extents off the disk.

If compression for a given set of pages fails to make them smaller, the
file is flagged to avoid future compression attempts later.

* While finding delalloc extents, the pages are locked before being sent down
to the delalloc handler.  This allows the delalloc handler to do complex things
such as cleaning the pages, marking them writeback and starting IO on their
behalf.

* Inline extents are inserted at delalloc time now.  This allows us to compress
the data before inserting the inline extent, and it allows us to insert
an inline extent that spans multiple pages.

* All of the in-memory extent representations (extent_map.c, ordered-data.c etc)
are changed to record both an in-memory size and an on disk size, as well
as a flag for compression.

From a disk format point of view, the extent pointers in the file are changed
to record the on disk size of a given extent and some encoding flags.
Space in the disk format is allocated for compression encoding, as well
as encryption and a generic 'other' field.  Neither the encryption nor the
'other' field is currently used.

In order to limit the amount of data read for a single random read in the
file, the size of a compressed extent is limited to 128k.  This is a
software only limit, the disk format supports u64 sized compressed extents.

In order to limit the ram consumed while processing extents, the uncompressed
size of a compressed extent is limited to 256k.  This is a software only limit
and will be subject to tuning later.

Checksumming is still done on compressed extents, and it is done on the
uncompressed version of the data.  This way additional encodings can be
layered on without having to figure out which encoding to checksum.

Compression happens at delalloc time, which is basically single threaded because
it is usually done by a single pdflush thread.  This makes it tricky to
spread the compression load across all the cpus on the box.  We'll have to
look at parallel pdflush walks of dirty inodes at a later time.

Decompression is hooked into readpages and it does spread across CPUs nicely.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

c8b97818

04 Oct, 2008 1 commit

Btrfs: O_DIRECT writes via buffered writes + invaldiate · cb843a6f

Authored by Chris Mason on Oct 03, 2008

This reworks the btrfs O_DIRECT write code a bit. It had always fallen
back to buffered IO and done an invalidate, but needed to be updated
for the data=ordered code. The invalidate wasn't actually removing pages
because they were still inside an ordered extent.

This also combines the O_DIRECT/O_SYNC paths where possible, and kicks
off IO in the main btrfs_file_write loop to keep the pipe down to the
disk full as we process long writes.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

cb843a6f

25 Sep, 2008 16 commits

Btrfs: Fix nodatacow for the new data=ordered mode · 7ea394f1

Authored by Yan Zheng on Aug 05, 2008

Signed-off-by: Chris Mason <chris.mason@oracle.com>

7ea394f1

Btrfs: Fix the defragmention code and the block relocation code for data=ordered · 3eaa2885

Authored by Chris Mason on Jul 24, 2008

Before setting an extent to delalloc, the code needs to wait for
pending ordered extents.

Also, the relocation code needs to wait for ordered IO before scanning
the block group again.  This is because the extents are not removed
until the IO for the new extents is finished
Signed-off-by: Chris Mason <chris.mason@oracle.com>

3eaa2885

Btrfs: Fix 32 bit compiles by using an unsigned long byte count in the ordered extent · 9ba4611a

Authored by Chris Mason on Jul 23, 2008

The ordered extents have to fit in memory, so an unsigned long is sufficient.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

9ba4611a

Btrfs: Take the csum mutex while reading checksums · ed98b56a

Authored by Chris Mason on Jul 22, 2008

Signed-off-by: Chris Mason <chris.mason@oracle.com>

ed98b56a

Btrfs: Fix some data=ordered related data corruptions · f421950f

Authored by Chris Mason on Jul 22, 2008

Stress testing was showing data checksum errors, most of which were caused
by a lookup bug in the extent_map tree.  The tree was caching the last
pointer returned, and searches would check the last pointer first.

But, search callers also expect the search to return the very first
matching extent in the range, which wasn't always true with the last
pointer usage.

For now, the code to cache the last return value is just removed.  It is
easy to fix, but I think lookups are rare enough that it isn't required anymore.

This commit also replaces do_sync_mapping_range with a local copy of the
related functions.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

f421950f

Btrfs: Handle data checksumming on bios that span multiple ordered extents · 3edf7d33

Authored by Chris Mason on Jul 18, 2008

Data checksumming is done right before the bio is sent down the IO stack,
which means a single bio might span more than one ordered extent. In
this case, the checksumming data is split between two ordered extents.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

3edf7d33

Btrfs: Cleanup and comment ordered-data.c · eb84ae03

Authored by Chris Mason on Jul 17, 2008

Signed-off-by: Chris Mason <chris.mason@oracle.com>

eb84ae03

Btrfs: Don't pin pages in ram until the entire ordered extent is on disk. · ba1da2f4

Authored by Chris Mason on Jul 17, 2008

Checksum items are not inserted until the entire ordered extent is on disk,
but individual pages might be clean and available for reclaim long before
the whole extent is on disk.

In order to allow those pages to be freed, we need to be able to search
the list of ordered extents to find the checksum that is going to be inserted
in the tree.  This way if the page needs to be read back in before
the checksums are in the btree, we'll be able to verify the checksum on
the page.

This commit adds the ability to search the pending ordered extents for
a given offset in the file, and changes btrfs_releasepage to allow
ordered pages to be freed.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

ba1da2f4

Btrfs: Update on disk i_size only after pending ordered extents are done · dbe674a9

Authored by Chris Mason on Jul 17, 2008

This changes the ordered data code to update i_size after the extent
is on disk.  An on disk i_size is maintained in the in-memory btrfs inode
structures, and this is updated as extents finish.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

dbe674a9

Btrfs: New data=ordered implementation · e6dcd2dc

Authored by Chris Mason on Jul 17, 2008

The old data=ordered code would force commit to wait until
all the data extents from the transaction were fully on disk.  This
introduced large latencies into the commit and stalled new writers
in the transaction for a long time.

The new code changes the way data allocations and extents work:

* When delayed allocation is filled, data extents are reserved, and
  the extent bit EXTENT_ORDERED is set on the entire range of the extent.
  A struct btrfs_ordered_extent is allocated and inserted into a per-inode
  rbtree to track the pending extents.

* As each page is written EXTENT_ORDERED is cleared on the bytes corresponding
  to that page.

* When all of the bytes corresponding to a single struct btrfs_ordered_extent
  are written, the previously reserved extent is inserted into the FS
  btree and into the extent allocation trees.  The checksums for the file
  data are also updated.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

e6dcd2dc
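
The three steps listed in this message can be modeled with one pending bit per page. This is a toy model under stated assumptions: the `ordered_sketch` struct, the 64-page limit, and the function names are hypothetical, and the `inserted` flag merely stands in for inserting the extent into the btree.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed limit so the pending set fits in one 64-bit word. */
#define MAX_PAGES_SKETCH 64u

/* Hypothetical tracker for one ordered extent. */
struct ordered_sketch {
    uint64_t pending;  /* bit i set: page i has not been written yet */
    unsigned nr_pages;
    int inserted;      /* set once the extent counts as "in the btree" */
};

/* Step 1: reserve the extent and mark every page as pending. */
static void ordered_reserve(struct ordered_sketch *oe, unsigned nr_pages)
{
    assert(nr_pages > 0 && nr_pages <= MAX_PAGES_SKETCH);
    oe->nr_pages = nr_pages;
    oe->pending = (nr_pages == MAX_PAGES_SKETCH)
                      ? ~0ULL
                      : ((1ULL << nr_pages) - 1);
    oe->inserted = 0;
}

/*
 * Steps 2 and 3: clear the bit for a written page. Returns 1 exactly
 * once, on the write that completes the extent, which is the point at
 * which the reserved extent would be inserted and checksums updated.
 */
static int ordered_page_written(struct ordered_sketch *oe, unsigned page)
{
    assert(page < oe->nr_pages);
    oe->pending &= ~(1ULL << page);
    if (oe->pending == 0 && !oe->inserted) {
        oe->inserted = 1;
        return 1;
    }
    return 0;
}
```

Pages may complete in any order; metadata becomes visible only after the last one, so a commit no longer has to wait for all data IO in the transaction.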

Fix btrfs_del_ordered_inode to allow forcing the drop during unlinks · 594a24eb

Authored by Chris Mason on Jun 25, 2008

This allows us to delete an unlinked inode with dirty pages from the list
instead of forcing commit to write these out before deleting the inode.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

594a24eb

btrfs delete ordered inode handling fix · e1b81e67

Authored by Mingming on May 27, 2008

Use btrfs_release_file instead of a put_inode call
Signed-off-by: Chris Mason <chris.mason@oracle.com>

e1b81e67

Btrfs: Throttle file_write when data=ordered is flushing the inode · 81d7ed29

Authored by Chris Mason on Apr 25, 2008

Signed-off-by: Chris Mason <chris.mason@oracle.com>

81d7ed29

Btrfs: Fix data=ordered vs wait_on_inode deadlock on older kernels · 4d5e74bc

Authored by Chris Mason on Jan 16, 2008

Using ilookup5 during data=ordered writeback could deadlock on I_LOCK.  This
saves a pointer to the inode instead.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

4d5e74bc

Rework btrfs_drop_inode to avoid scheduling · cee36a03

Authored by Chris Mason on Jan 15, 2008

Signed-off-by: Chris Mason <chris.mason@oracle.com>

cee36a03

Btrfs: Add data=ordered support · dc17ff8f

Authored by Chris Mason on Jan 08, 2008

This forces file data extents down to the disk along with the metadata that
references them. The current implementation is fairly simple, and just
writes out all of the dirty pages in an inode before the commit.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

dc17ff8f

28 Aug, 2007 1 commit

Btrfs: Extent based page cache code. This uses an rbtree of extents and tests instead of buffer heads. · a52d9a80

Authored by Chris Mason on Aug 27, 2007

Signed-off-by: Chris Mason <chris.mason@oracle.com>

a52d9a80

11 Aug, 2007 1 commit

Btrfs: delay commits during fsync to allow more writers · 15ee9bc7

Authored by Josef Bacik on Aug 10, 2007

Signed-off-by: Chris Mason <chris.mason@oracle.com>

15ee9bc7

14 Jun, 2007 1 commit

btrfs: Code cleanup · f1ace244

Authored by Aneesh on Jun 13, 2007

Attached below are some of the code cleanups that I came across while
reading the code.

a) alloc_path already calls init_path.
b) Mention that btrfs_inode is the in-memory copy. Ext4 has ext4_inode_info as
the in-memory copy and ext4_inode as the disk copy.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

f1ace244

12 Jun, 2007 1 commit

Btrfs: add GPLv2 · 6cbd5570

Authored by Chris Mason on Jun 12, 2007

Signed-off-by: Chris Mason <chris.mason@oracle.com>

6cbd5570

01 May, 2007 1 commit

Btrfs: allocator improvements, inode block groups · 31f3c99b

Authored by Chris Mason on Apr 30, 2007

Signed-off-by: Chris Mason <chris.mason@oracle.com>

31f3c99b

11 Apr, 2007 1 commit

Btrfs: drop the inode map tree · 1b05da2e

Authored by Chris Mason on Apr 10, 2007

Signed-off-by: Chris Mason <chris.mason@oracle.com>

1b05da2e

07 Apr, 2007 1 commit

Btrfs: start of support for many FS volumes · d6e4a428

Authored by Chris Mason on Apr 06, 2007

Signed-off-by: Chris Mason <chris.mason@oracle.com>

d6e4a428

02 Apr, 2007 1 commit

Btrfs: still corruption hunting · 2c90e5d6

Authored by Chris Mason on Apr 02, 2007

Signed-off-by: Chris Mason <chris.mason@oracle.com>

2c90e5d6

openeuler / Kernel synced successfully about 1 year ago
