Commits · e4404d6e8da678d852b7f767f665f8edf76c9e9f · bug2833 / cloud-kernel

10 Dec, 2008 · 1 commit

Btrfs: Delete csum items when freeing extents · 459931ec

Committed by Chris Mason on Dec 10, 2008

This finishes off the new checksumming code by removing csum items
for extents that are no longer in use.

The trick is doing it without racing because a single csum item may
hold csums for more than one extent.  Extra checks are added to
btrfs_csum_file_blocks to make sure that we are using the correct
csum item after dropping locks.

A new btrfs_split_item is added to split a single csum item in place,
without dropping the leaf lock.  This is used to
remove csum bytes from the middle of an item.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

459931ec
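
The middle-of-item case described in this commit can be pictured with a small model. The sketch below is not the kernel code: `CsumItem`, `remove_range`, and the 4096-byte sector size are assumptions made for illustration. It treats a csum item as a starting disk byte plus one checksum per sector, and shows how freeing an extent that lies in the middle of the item leaves a head item and a tail item.

```python
from dataclasses import dataclass
from typing import List

SECTOR_SIZE = 4096  # assumed block size, for illustration only


@dataclass
class CsumItem:
    """Hypothetical model: checksums for consecutive sectors starting at `start`."""
    start: int        # disk byte offset covered by sums[0]
    sums: List[int]   # one checksum per sector

    @property
    def end(self) -> int:
        return self.start + len(self.sums) * SECTOR_SIZE


def remove_range(item: CsumItem, start: int, length: int) -> List[CsumItem]:
    """Drop the checksums covering [start, start + length).

    Returns zero, one, or two items. Two items means the freed range sat in
    the middle of the item, so the item had to be split.
    """
    assert start % SECTOR_SIZE == 0 and length % SECTOR_SIZE == 0
    lo = max(start, item.start)
    hi = min(start + length, item.end)
    if lo >= hi:                      # no overlap, the item is untouched
        return [item]
    first = (lo - item.start) // SECTOR_SIZE
    last = (hi - item.start) // SECTOR_SIZE
    remaining = []
    if first > 0:                     # the head of the item survives
        remaining.append(CsumItem(item.start, item.sums[:first]))
    if last < len(item.sums):         # the tail survives under a new start offset
        remaining.append(CsumItem(hi, item.sums[last:]))
    return remaining
```

In this model the tail is re-keyed by the first byte it still covers, which is why one item can turn into two when only the middle extent is freed.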

09 Dec, 2008 · 1 commit

Btrfs: move data checksumming into a dedicated tree · d20f7043

Committed by Chris Mason on Dec 08, 2008

Btrfs stores checksums for each data block.  Until now, they have
been stored in the subvolume trees, indexed by the inode that is
referencing the data block.  This means that when we read the inode,
we've probably read in at least some checksums as well.

But, this has a few problems:

* The checksums are indexed by logical offset in the file.  When
compression is on, this means we have to do the expensive checksumming
on the uncompressed data.  It would be faster if we could checksum
the compressed data instead.

* If we implement encryption, we'll be checksumming the plain text and
storing that on disk.  This is significantly less secure.

* For either compression or encryption, we have to get the plain text
back before we can verify the checksum as correct.  This makes the raid
layer balancing and extent moving much more expensive.

* It makes the front end caching code more complex, as we have to touch
the subvolume and inodes as we cache extents.

* There is potentially one copy of the checksum in each subvolume
referencing an extent.

The solution used here is to store the extent checksums in a dedicated
tree.  This allows us to index the checksums by physical extent
start and length.  It means:

* The checksum is against the data stored on disk, after any compression
or encryption is done.

* The checksum is stored in a central location, and can be verified without
following back references, or reading inodes.

This makes compression significantly faster by reducing the amount of
data that needs to be checksummed.  It will also allow much faster
raid management code in general.

The checksums are indexed by a key with a fixed objectid (a magic value
in ctree.h) and offset set to the starting byte of the extent.  This
allows us to copy the checksum items into the fsync log tree directly (or
any other tree), without having to invent a second format for them.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

d20f7043
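
The key layout described in this commit (a fixed objectid, with the offset set to the starting byte of the extent) can be sketched as follows. This is an illustration only: `Key`, `csum_key`, `csum_offset_in_item`, the constant values, the 4096-byte sector size, and the 4-byte checksum size are all assumptions of the sketch, and the real definitions should be taken from ctree.h.

```python
from typing import NamedTuple, Optional

# Placeholder values chosen for this sketch; consult ctree.h for the real ones.
CSUM_OBJECTID = 0xFFFF_FFFF_FFFF_FFF6
CSUM_KEY_TYPE = 128
SECTOR_SIZE = 4096   # assumed bytes of data covered by one checksum
CSUM_SIZE = 4        # assumed bytes per checksum


class Key(NamedTuple):
    objectid: int
    type: int
    offset: int


def csum_key(disk_bytenr: int) -> Key:
    """Key for the csum item whose first checksum covers disk_bytenr."""
    return Key(CSUM_OBJECTID, CSUM_KEY_TYPE, disk_bytenr)


def csum_offset_in_item(key: Key, item_size: int, disk_bytenr: int) -> Optional[int]:
    """Byte offset of the checksum for disk_bytenr inside the item.

    Returns None when the key is not a csum key or the item does not
    cover disk_bytenr.
    """
    if key.objectid != CSUM_OBJECTID or key.type != CSUM_KEY_TYPE:
        return None
    sectors = item_size // CSUM_SIZE
    if not key.offset <= disk_bytenr < key.offset + sectors * SECTOR_SIZE:
        return None
    return (disk_bytenr - key.offset) // SECTOR_SIZE * CSUM_SIZE
```

Because nothing in such a key refers to an inode or a subvolume, the same item could be copied into another tree unchanged, which is the property the commit message relies on for the fsync log tree.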

02 Dec, 2008 · 1 commit

Btrfs: add support for multiple csum algorithms · 607d432d

Committed by Josef Bacik on Dec 02, 2008

This patch gives us the space we will need in order to have different csum
algorithms at some point in the future. We save the csum algorithm type
in the superblock, and use that instead of defines.
Signed-off-by: Josef Bacik <jbacik@redhat.com>

607d432d
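
One way to picture "a type saved in the superblock instead of defines" is a lookup table keyed by that type. The sketch below is an assumption-laden illustration, not the patch itself: `SuperBlock`, `CSUM_TABLE`, `csum_size`, and `csum_data` are hypothetical names, and `zlib.crc32` is only a standard-library stand-in for whatever checksum function the filesystem actually uses.

```python
import zlib
from dataclasses import dataclass

CSUM_TYPE_CRC32 = 0  # hypothetical type id used by this sketch

# One row per algorithm: (digest size in bytes, function returning the digest).
# Adding an algorithm later means adding a row, not changing a compile-time define.
CSUM_TABLE = {
    CSUM_TYPE_CRC32: (4, lambda data: zlib.crc32(data).to_bytes(4, "little")),
}


@dataclass
class SuperBlock:
    """Hypothetical superblock carrying only the field this sketch needs."""
    csum_type: int = CSUM_TYPE_CRC32


def csum_size(sb: SuperBlock) -> int:
    """Checksum size for this filesystem, looked up from the stored type."""
    return CSUM_TABLE[sb.csum_type][0]


def csum_data(sb: SuperBlock, data: bytes) -> bytes:
    """Checksum data with whichever algorithm the superblock names."""
    size, fn = CSUM_TABLE[sb.csum_type]
    digest = fn(data)
    assert len(digest) == size
    return digest
```

Callers ask the superblock for the size rather than hard-coding it, so item layouts that depend on the checksum size follow the stored type.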

11 Nov, 2008 · 1 commit

Btrfs: Use invalidatepage when writepage finds a page outside of i_size · 39be25cd

Committed by Chris Mason on Nov 10, 2008

With all the recent fixes to the delalloc locking, it is now safe
again to use invalidatepage inside the writepage code for
pages outside of i_size. This used to deadlock against some of the
code to write locked ranges of pages, but all of that has been fixed.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

39be25cd

30 Oct, 2008 · 1 commit

Btrfs: Add zlib compression support · c8b97818

Committed by Chris Mason on Oct 29, 2008

This is a large change for adding compression on reading and writing,
both for inline and regular extents.  It does some fairly large
surgery to the writeback paths.

Compression is off by default and enabled by mount -o compress.  Even
when the -o compress mount option is not used, it is possible to read
compressed extents off the disk.

If compression for a given set of pages fails to make them smaller, the
file is flagged to avoid future compression attempts.

* While finding delalloc extents, the pages are locked before being sent down
to the delalloc handler.  This allows the delalloc handler to do complex things
such as cleaning the pages, marking them writeback and starting IO on their
behalf.

* Inline extents are inserted at delalloc time now.  This allows us to compress
the data before inserting the inline extent, and it allows us to insert
an inline extent that spans multiple pages.

* All of the in-memory extent representations (extent_map.c, ordered-data.c etc)
are changed to record both an in-memory size and an on disk size, as well
as a flag for compression.

From a disk format point of view, the extent pointers in the file are changed
to record the on disk size of a given extent and some encoding flags.
Space in the disk format is allocated for compression encoding, as well
as encryption and a generic 'other' field.  Neither the encryption nor the
'other' field is currently used.

In order to limit the amount of data read for a single random read in the
file, the size of a compressed extent is limited to 128k.  This is a
software-only limit; the disk format supports u64-sized compressed extents.

In order to limit the RAM consumed while processing extents, the uncompressed
size of a compressed extent is limited to 256k.  This is a software-only limit
and will be subject to tuning later.

Checksumming is still done on compressed extents, and it is done on the
uncompressed version of the data.  This way additional encodings can be
layered on without having to figure out which encoding to checksum.

Compression happens at delalloc time, which is basically single threaded because
it is usually done by a single pdflush thread.  This makes it tricky to
spread the compression load across all the CPUs on the box.  We'll have to
look at parallel pdflush walks of dirty inodes at a later time.

Decompression is hooked into readpages and it does spread across CPUs nicely.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

c8b97818
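
The size limits and the "stop trying" flag from this commit message can be modelled in a few lines. This is a rough sketch under stated assumptions, not the writeback code: `FileState`, `encode_extent`, and the decision to treat a result larger than 128k as a failed compression are inventions of the sketch. Only the two limits and the flagging behaviour come from the message above.

```python
import zlib

MAX_COMPRESSED = 128 * 1024     # limit on the compressed size of one extent
MAX_UNCOMPRESSED = 256 * 1024   # limit on the uncompressed size of one extent


class FileState:
    """Hypothetical per-file state holding the 'do not compress' flag."""

    def __init__(self):
        self.no_compress = False


def encode_extent(state: FileState, data: bytes):
    """Return (payload, is_compressed) for one extent's worth of data."""
    if len(data) > MAX_UNCOMPRESSED:
        raise ValueError("caller must split data into extents of at most 256k")
    if state.no_compress:               # an earlier attempt did not pay off
        return data, False
    packed = zlib.compress(data)
    # Assumption of this sketch: exceeding the 128k cap counts as a failure too.
    if len(packed) >= len(data) or len(packed) > MAX_COMPRESSED:
        state.no_compress = True        # flag the file, skip future attempts
        return data, False
    return packed, True
```

For example, a 256k run of zero bytes compresses well and is stored compressed, while a tiny three-byte write grows under zlib, so the file is flagged and later writes are stored as-is.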

25 Sep, 2008 · 25 commits

Btrfs: Fix variable init during csum creation · 639cb586

Committed by Chris Mason on Aug 28, 2008

Signed-off-by: Chris Mason <chris.mason@oracle.com>

639cb586

Btrfs: Lookup readpage checksums on bio submission again · 4d1b5fb4

Committed by Chris Mason on Aug 20, 2008

This optimization had been removed because I thought it was triggering
csum errors.  The real cause of the errors was elsewhere, and so
this optimization is back.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

4d1b5fb4

Btrfs: Lower contention on the csum mutex · 53863232

Committed by Chris Mason on Aug 15, 2008

This takes the csum mutex deeper in the call chain and releases it
more often.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

53863232

btrfs_lookup_bio_sums seems broken, go back to the readpage_io_hook for now · 3de9d6b6

Committed by Chris Mason on Aug 04, 2008

Signed-off-by: Chris Mason <chris.mason@oracle.com>

3de9d6b6

Btrfs: Hold csum mutex while reading in sums during readpages · 6dab8157

Committed by Chris Mason on Aug 04, 2008

Signed-off-by: Chris Mason <chris.mason@oracle.com>

6dab8157

Btrfs: Fix streaming read performance with checksumming on · 61b49440

Committed by Chris Mason on Jul 31, 2008

Large streaming reads make for large bios, which means each entry on the
async work queue lists represents a large amount of data. IO
congestion throttling on the device was kicking in before the async
worker threads decided a single thread was busy and needed some help.

The end result was that a streaming read would result in a single CPU
running at 100% instead of balancing the work off to other CPUs.

This patch also changes the pre-IO checksum lookup done by reads to
work on a per-bio basis instead of per-page. The per-page lookup resulted
in many extra btree lookups on large streaming reads. Doing the checksum lookup
right before bio submit allows us to reuse searches while processing
adjacent offsets.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

61b49440

Btrfs: implement memory reclaim for leaf reference cache · bcc63abb

Committed by Yan on Jul 30, 2008

The memory reclaim issue happens when a snapshot exists. In that
case, some cache entries may not be used while an old snapshot is dropped,
so they remain in the cache until umount.

The patch adds a field to struct btrfs_leaf_ref to record the creation time. In addition,
the patch links all dead roots of a given snapshot together in order of
creation time. After an old snapshot is completely dropped, we check the dead
root list and remove all cache entries created before the oldest dead root in
the list.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

bcc63abb

Btrfs: Take the csum mutex while reading checksums · ed98b56a

Committed by Chris Mason on Jul 22, 2008

Signed-off-by: Chris Mason <chris.mason@oracle.com>

ed98b56a

Fix btrfs_wait_ordered_extent_range to properly wait · e5a2217e

Committed by Chris Mason on Jul 18, 2008

Signed-off-by: Chris Mason <chris.mason@oracle.com>

e5a2217e

Btrfs: Keep extent mappings in ram until pending ordered extents are done · 7f3c74fb

Committed by Chris Mason on Jul 18, 2008

It was possible for stale mappings from disk to be used instead of the
new pending ordered extent. This adds a flag to the extent map struct
to keep it pinned until the pending ordered extent is actually on disk.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

7f3c74fb

Btrfs: Handle data checksumming on bios that span multiple ordered extents · 3edf7d33

Committed by Chris Mason on Jul 18, 2008

Data checksumming is done right before the bio is sent down the IO stack,
which means a single bio might span more than one ordered extent. In
this case, the checksumming data is split between two ordered extents.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

3edf7d33

Btrfs: New data=ordered implementation · e6dcd2dc

Committed by Chris Mason on Jul 17, 2008

The old data=ordered code would force commit to wait until
all the data extents from the transaction were fully on disk.  This
introduced large latencies into the commit and stalled new writers
in the transaction for a long time.

The new code changes the way data allocations and extents work:

* When delayed allocation is filled, data extents are reserved, and
  the extent bit EXTENT_ORDERED is set on the entire range of the extent.
  A struct btrfs_ordered_extent is allocated and inserted into a per-inode
  rbtree to track the pending extents.

* As each page is written, EXTENT_ORDERED is cleared on the bytes corresponding
  to that page.

* When all of the bytes corresponding to a single struct btrfs_ordered_extent
  are written, the previously reserved extent is inserted into the FS
  btree and into the extent allocation trees.  The checksums for the file
  data are also updated.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

e6dcd2dc
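
The three steps listed in this commit message can be sketched as a small bookkeeping model. It is a simplification under stated assumptions: `OrderedExtent`, `Inode`, `fill_delalloc`, and `page_written` are hypothetical names, a plain dict stands in for the per-inode rbtree, a byte counter stands in for the EXTENT_ORDERED bits, and each page is assumed to be reported as written exactly once.

```python
PAGE_SIZE = 4096  # assumed page size for the sketch


class OrderedExtent:
    """Hypothetical stand-in for struct btrfs_ordered_extent."""

    def __init__(self, file_offset, length):
        self.file_offset = file_offset
        self.length = length
        self.bytes_left = length      # plays the role of the EXTENT_ORDERED bits


class Inode:
    def __init__(self):
        self.ordered = {}   # file_offset -> OrderedExtent (the kernel uses an rbtree)
        self.btree = []     # stand-in for the FS btree: (file_offset, length) records

    def fill_delalloc(self, file_offset, length):
        """Step 1: reserve the extent and start tracking it as pending."""
        self.ordered[file_offset] = OrderedExtent(file_offset, length)

    def page_written(self, page_offset):
        """Step 2: one page finished IO. Returns True if the extent completed."""
        for ext in self.ordered.values():
            if ext.file_offset <= page_offset < ext.file_offset + ext.length:
                break
        else:
            raise KeyError(page_offset)
        ext.bytes_left -= PAGE_SIZE
        if ext.bytes_left > 0:
            return False
        # Step 3: every byte is on disk, so the metadata can now be inserted.
        del self.ordered[ext.file_offset]
        self.btree.append((ext.file_offset, ext.length))
        return True
```

The point of the model is that metadata insertion is driven by IO completion for each extent, rather than by a commit waiting for all data in the transaction.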

Btrfs: Clone file data ioctl · f2eb0a24

Committed by Sage Weil on May 02, 2008

Add a new ioctl to clone file data
Signed-off-by: Chris Mason <chris.mason@oracle.com>

f2eb0a24

Btrfs: Write bio checksumming outside the FS mutex · e015640f

Committed by Chris Mason on Apr 16, 2008

This significantly improves streaming write performance by allowing
concurrency in the data checksumming.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

e015640f

Btrfs: Use KM_USERN instead of KM_IRQ during data summing · eb20978f

Committed by Chris Mason on Feb 21, 2008

Signed-off-by: Chris Mason <chris.mason@oracle.com>

eb20978f

Btrfs: Make sure bio pages are adjacent during bulk csumming · 2e1a992e

Committed by Chris Mason on Feb 20, 2008

Signed-off-by: Chris Mason <chris.mason@oracle.com>

2e1a992e

Btrfs: While doing checksums on bios, cache the extent_buffer mapping · 6e92f5e6

Committed by Chris Mason on Feb 20, 2008

Signed-off-by: Chris Mason <chris.mason@oracle.com>

6e92f5e6

Btrfs: checksum file data at bio submission time instead of during writepage · 065631f6

Committed by Chris Mason on Feb 20, 2008

When we checksum file data during writepage, the checksumming is done one
page at a time, making it difficult to do bulk metadata modifications
to insert checksums for large ranges of the file at once.

This patch changes btrfs to checksum on a per-bio basis instead. The
bios are checksummed before they are handed off to the block layer, so
each bio is contiguous and only has pages from the same inode.

Checksumming on a bio basis allows us to insert and modify the file
checksum items in large groups. It also allows the checksumming to
be done more easily by async worker threads.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

065631f6

Btrfs: Add some extra debugging around file data checksum failures · aadfeb6e

Committed by Chris Mason on Jan 29, 2008

Signed-off-by: Chris Mason <chris.mason@oracle.com>

aadfeb6e

Btrfs: Fix a number of inline extent problems that Yan Zheng reported. · 179e29e4

Committed by Chris Mason on Nov 01, 2007

The fixes do a number of things:

1) Most btrfs_drop_extent callers will try to leave the inline extents in
place.  It can truncate bytes off the beginning of the inline extent if
required.

2) writepage can now update the inline extent, allowing mmap writes to
go directly into the inline extent.

3) btrfs_truncate_in_transaction truncates inline extents.

4) extent_map.c fixed to not merge inline extent mappings and hole
mappings together.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

179e29e4

Minor fix for btrfs_csum_file_block. · b56baf5b

Committed by Yan on Oct 29, 2007

Execution should go to label 'insert' when 'btrfs_next_leaf' returns a
non-zero value; otherwise the parameter 'slot' for
'btrfs_item_key_to_cpu' may be out of bounds. The original code jumps
to label 'insert' only when 'btrfs_next_leaf' returns a negative
value.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

b56baf5b

Btrfs: Optimize csum insertion to create larger items when possible · f578d4bd

Committed by Chris Mason on Oct 25, 2007

This reduces the number of calls to btrfs_extend_item and greatly lowers
the CPU usage while writing large files.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

f578d4bd

Btrfs: Add back file data checksumming · ff79f819

Committed by Chris Mason on Oct 15, 2007

Signed-off-by: Chris Mason <chris.mason@oracle.com>

ff79f819

Btrfs: Allow tree blocks larger than the page size · db94535d

Committed by Chris Mason on Oct 15, 2007

Signed-off-by: Chris Mason <chris.mason@oracle.com>

db94535d

Btrfs: Create extent_buffer interface for large blocksizes · 5f39d397

Committed by Chris Mason on Oct 15, 2007

Signed-off-by: Chris Mason <chris.mason@oracle.com>

5f39d397

11 Jul, 2007 · 1 commit

Btrfs: trivial include fixups · ec6b910f

Committed by Zach Brown on Jul 11, 2007

Almost none of the files including module.h need to do so,
so remove the includes.

Include sched.h in extent-tree.c to silence a warning about cond_resched()
being undeclared.
Signed-off-by: Zach Brown <zach.brown@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>

ec6b910f

23 Jun, 2007 · 1 commit

Btrfs: Audit callers and return codes to make sure -ENOSPC gets up the stack · 54aa1f4d

Committed by Chris Mason on Jun 22, 2007

Signed-off-by: Chris Mason <chris.mason@oracle.com>

54aa1f4d

18 Jun, 2007 · 1 commit

Subject: Rework btrfs_file_write to only allocate while page locks are held · 8c2383c3

Committed by Chris Mason on Jun 18, 2007

Signed-off-by: Chris Mason <chris.mason@oracle.com>

8c2383c3

16 Jun, 2007 · 1 commit

Btrfs: patch queue: page_mkwrite · 9ebefb18

Committed by Chris Mason on Jun 15, 2007

Signed-off-by: Chris Mason <chris.mason@oracle.com>

9ebefb18

14 Jun, 2007 · 1 commit

btrfs: Code cleanup · f1ace244

Committed by Aneesh on Jun 13, 2007

Attached below are some code cleanups that I came across while
reading the code.

a) alloc_path already calls init_path.
b) Mention that btrfs_inode is the in-memory copy. Ext4 has ext4_inode_info as
the in-memory copy and ext4_inode as the disk copy.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

f1ace244

12 Jun, 2007 · 3 commits

Btrfs: add GPLv2 · 6cbd5570

Committed by Chris Mason on Jun 12, 2007

Signed-off-by: Chris Mason <chris.mason@oracle.com>

6cbd5570

Btrfs: printk fixes · 5af3981c

Committed by Chris Mason on Jun 12, 2007

Signed-off-by: Chris Mason <chris.mason@oracle.com>

5af3981c

Btrfs: 64 bit div fixes · 84f54cfa

Committed by Chris Mason on Jun 12, 2007

Signed-off-by: Chris Mason <chris.mason@oracle.com>

84f54cfa

30 May, 2007 · 1 commit

Btrfs: fixup various fsx failures · 1de037a4

Committed by Chris Mason on May 29, 2007

Signed-off-by: Chris Mason <chris.mason@oracle.com>

1de037a4

25 May, 2007 · 1 commit

Btrfs: sparse files! · 3a686375

Committed by Chris Mason on May 24, 2007

Signed-off-by: Chris Mason <chris.mason@oracle.com>

3a686375
