Commits · 09a2a8f96e3009273bed1833b3f210e2c68728a5 · bug2833 / cloud-kernel

07 May, 2013 2 commits

Btrfs: fix bad extent logging · 09a2a8f9

Committed by Josef Bacik on Apr 05, 2013

A user sent me a btrfs-image of a file system that was panicking on mount during
the log recovery.  I had originally thought these problems were from a bug in
the free space cache code, but that was just a symptom of the problem.  The
problem is if your application does something like this

[prealloc][prealloc][prealloc]

the internal extent maps will merge those all together into one extent map, even
though on disk they are 3 separate extents.  So if you go to write into one of
these ranges the extent map will be right since we use the physical extent when
doing the write, but when we log the extents they will use the wrong sizes for
the remainder prealloc space.  If this doesn't happen to trip up the free space
cache (which it won't in a lot of cases) then you will get bogus entries in your
extent tree which will screw stuff up later.  The data and such will still work,
but everything else is broken.  This patch fixes this by not allowing extents
that are on the modified list to be merged.  This has the side effect that we
are no longer adding everything to the modified list all the time, which means
we now have to call btrfs_drop_extents every time we log an extent into the
tree.  So this allows me to drop all this speciality code I was using to get
around calling btrfs_drop_extents.  With this patch the testcase I've created no
longer creates a bogus file system after replaying the log.  Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
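
The merging rule described in this commit message can be illustrated with a small model. This is only a sketch in Python, not the kernel code: the `ExtentMap` class, its `on_modified_list` flag, and the `can_merge` and `merge_all` helpers are hypothetical names chosen for illustration, and real extent maps carry far more state.

```python
from dataclasses import dataclass


@dataclass
class ExtentMap:
    start: int                      # file offset of the mapping
    length: int                     # number of bytes covered
    prealloc: bool = True           # mapping describes preallocated space
    on_modified_list: bool = False  # hypothetical flag: queued for logging


def can_merge(prev: ExtentMap, nxt: ExtentMap) -> bool:
    """Adjacent prealloc mappings merge only if neither awaits logging."""
    if prev.on_modified_list or nxt.on_modified_list:
        return False
    return (prev.prealloc and nxt.prealloc
            and prev.start + prev.length == nxt.start)


def merge_all(maps):
    """Merge a list of mappings (sorted by start) wherever can_merge allows."""
    merged = []
    for em in maps:
        if merged and can_merge(merged[-1], em):
            merged[-1].length += em.length
        else:
            # Copy so the caller's mappings are left untouched.
            merged.append(ExtentMap(em.start, em.length, em.prealloc,
                                    em.on_modified_list))
    return merged
```

With three adjacent prealloc mappings, `merge_all` collapses them into one; once the middle mapping is flagged as modified, all three stay separate, so each would be logged with its own size.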

Btrfs: log ram bytes properly · cc95bef6

Committed by Josef Bacik on Apr 04, 2013

When logging changed extents I was logging ram_bytes as the current length,
which isn't correct, it's supposed to be the ram bytes of the original extent.
This is for compression where even if we split the extent we need to know the
ram bytes so when we uncompress the extent we know how big it will be. This was
still working out right with compression for some reason but I think we were
getting lucky. It was definitely off for prealloc which is why I noticed it,
btrfsck was complaining about it. With this patch btrfsck no longer complains
after a log replay. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
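
As a rough illustration of the point about `ram_bytes`, the sketch below splits an extent for logging and keeps the original extent's uncompressed size on both pieces. The `LoggedExtent` class and `split_for_logging` function are hypothetical, and the model assumes for simplicity that the extent's length in the file equals its `ram_bytes`.

```python
from dataclasses import dataclass


@dataclass
class LoggedExtent:
    file_offset: int  # where this piece starts in the file
    num_bytes: int    # length of this piece
    ram_bytes: int    # uncompressed size of the ORIGINAL extent


def split_for_logging(orig_start, orig_ram_bytes, split_at):
    """Split an extent at file offset split_at; both pieces keep ram_bytes."""
    if not orig_start < split_at < orig_start + orig_ram_bytes:
        raise ValueError("split point must fall inside the extent")
    left_len = split_at - orig_start
    right_len = orig_ram_bytes - left_len
    # Logging the piece length as ram_bytes would be the mistake described
    # in the commit message; the original size is carried over instead.
    return (LoggedExtent(orig_start, left_len, orig_ram_bytes),
            LoggedExtent(split_at, right_len, orig_ram_bytes))
```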

25 Jan, 2013 1 commit

Btrfs: do not allow logged extents to be merged or removed · 201a9038

Committed by Josef Bacik on Jan 24, 2013

We drop the extent map tree lock while we're logging extents, so somebody
could come in and merge another extent into this one and screw up our
logging, or they could even remove us from the list which would keep us from
logging the extent or freeing our ref on it, so we need to make sure to not
clear LOGGING until after the extent is logged, and then we can merge it to
adjacent extents. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>

17 Dec, 2012 2 commits

Btrfs: do not mark ems as prealloc if we are writing to them · b11e234d

Committed by Josef Bacik on Dec 03, 2012

We are going to use EM's to log extents in the future, so we need to not
mark them as prealloc if they aren't actually prealloc extents. Instead
mark them with FILLING so we know to amend mod_start/mod_len and that way
we don't confuse the extent logging code. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>

Btrfs: keep track of the extent's original block length · b4939680

Committed by Josef Bacik on Dec 03, 2012

If we've written to a prealloc extent we need to know the original block len
for the extent. We can't figure this out currently since ->block_len is
just set to the extent length. So introduce ->orig_block_len so that we
know how many bytes were in the original extent for proper extent logging
that future patches will need. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>

04 Oct, 2012 1 commit

Btrfs: do not hold the write_lock on the extent tree while logging · ff44c6e3

Committed by Josef Bacik on Sep 14, 2012

Dave Sterba pointed out a sleeping while atomic bug while doing fsync. This
is because I'm an idiot and didn't realize that rwlock's were spin locks, so
we've been holding this thing while doing allocations and such which is not
good. This patch fixes this by dropping the write lock before we do
anything heavy and re-acquire it when it is done. We also need to take a
ref on the em's in case their corresponding pages are evicted and mark them
as being logged so that releasepage does not remove them and doesn't remove
them from our local list. Thanks,
Reported-by: Dave Sterba <dave@jikos.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>

02 Oct, 2012 2 commits

Btrfs: improve fsync by filtering extents that we want · 4e2f84e6

Committed by Liu Bo on Aug 27, 2012

This is based on Josef's "Btrfs: turbo charge fsync".

Josef's patch above performs very well in the random sync write test,
because there are not many extents to merge.

However, it does not perform well on this test:
dd if=/dev/zero of=foobar bs=4k count=12500 oflag=sync

The reason is that when we do sequential sync writes, we need to merge the
current extent with the previous one, so we get accumulated
extents to log:

A(4k) --> AA(8k) --> AAA(12k) --> AAAA(16k) ...

So we have to flush more and more checksums into the log tree, which is the
bottleneck according to my tests.

But we can avoid this by telling fsync which extents actually need
to be logged.

With this, I did the above dd sync write test (size=50m),

         w/o (orig)   w/ (josef's)   w/ (this)
SATA      104KB/s       109KB/s       121KB/s
ramdisk   1.5MB/s       1.5MB/s       10.7MB/s (613%)
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
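
The growth pattern A, AA, AAA, ... in this commit message can be put into numbers. The sketch below is a simplified model, not a measurement: it assumes every write is one 4k block, that all writes merge into a single ever-growing extent, and that the hypothetical `blocks_logged` helper counts the blocks whose checksums are logged across all fsyncs.

```python
def blocks_logged(num_writes, merged):
    """Count 4k blocks whose checksums are logged over num_writes fsyncs."""
    if merged:
        # The k-th fsync logs the whole merged extent of k blocks,
        # giving 1 + 2 + ... + num_writes in total.
        return sum(range(1, num_writes + 1))
    # With filtering, each fsync logs only the newly written block.
    return num_writes
```

Under these assumptions the merged case grows quadratically with the number of writes while the filtered case grows linearly, which is consistent with checksum logging being the bottleneck for sequential sync writes.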

Btrfs: turbo charge fsync · 5dc562c5

Committed by Josef Bacik on Aug 17, 2012

At least for the vm workload.  Currently on fsync we will

1) Truncate all items in the log tree for the given inode if they exist

and

2) Copy all items for a given inode into the log

The problem with this is that for things like VMs you can have lots of
extents from the fragmented writing behavior, and worse yet you may have
only modified a few extents, not the entire thing.  This patch fixes this
problem by tracking which transid modified our extent, and then when we do
the tree logging we find all of the extents we've modified in our current
transaction, sort them and commit them.  We also only truncate up to the
xattrs of the inode and copy that stuff in normally, and then just drop any
extents in the range we have that exist in the log already.  Here are some
numbers of a 50 meg fio job that does random writes and fsync()s after every
write

		Original	Patched
SATA drive	82KB/s		140KB/s
Fusion drive	431KB/s		2532KB/s

So around 2-6 times faster depending on your hardware.  There are a few
corner cases, for example if you truncate at all we have to do it the old
way since there is no way to be sure what is in the log is ok.  This
probably could be done smarter, but if you write-fsync-truncate-write-fsync
you deserve what you get.  All this work is in RAM of course so if your
inode gets evicted from cache and you read it in and fsync it we'll do it
the slow way if we are still in the same transaction that we last modified
the inode in.

The biggest cool part of this is that it requires no changes to the recovery
code, so if you fsync with this patch and crash and load an old kernel, it
will run the recovery and be a-ok.  I have tested this pretty thoroughly
with an fsync tester and everything comes back fine, as well as xfstests.
Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
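
The selection step described in this commit message (find the extents modified in the current transaction, sort them, then log them) might look roughly like the following. This is a sketch under assumptions: the `ExtentMap` class, its `generation` field holding the modifying transid, and `extents_to_log` are illustrative names, and the comparison used in the kernel may differ.

```python
from dataclasses import dataclass


@dataclass
class ExtentMap:
    start: int       # file offset of the mapping
    length: int      # number of bytes covered
    generation: int  # assumed: transid that last modified this extent


def extents_to_log(extent_maps, current_transid):
    """Pick extents modified in the current transaction, ordered by offset."""
    modified = [em for em in extent_maps
                if em.generation == current_transid]
    return sorted(modified, key=lambda em: em.start)
```

An inode with thousands of extents but only a handful touched in the current transaction would then log only that handful, rather than copying every item.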

15 Feb, 2012 1 commit

btrfs: fix structs where bitfields and spinlock/atomic share 8B word · c08782da

Committed by David Sterba on Jan 26, 2012

On ia64, powerpc64 and sparc64 the bitfield is modified through a RMW cycle and current
gcc rewrites the adjacent 4B word, which in case of a spinlock or atomic has
a disastrous effect.

https://lkml.org/lkml/2012/2/1/220
Signed-off-by: David Sterba <dsterba@suse.cz>

02 May, 2011 2 commits

btrfs: drop gfp parameter from alloc_extent_map · 172ddd60

Committed by David Sterba on Apr 21, 2011

pass GFP_NOFS directly to kmem_cache_alloc
Signed-off-by: David Sterba <dsterba@suse.cz>

btrfs: drop unused parameter from extent_map_tree_init · a8067e02

Committed by David Sterba on Apr 21, 2011

the GFP flags are not stored anywhere and all allocations are done via
alloc_extent_map(GFP_NOFS).
Signed-off-by: David Sterba <dsterba@suse.cz>

22 Dec, 2010 1 commit

btrfs: Allow to add new compression algorithm · 261507a0

Committed by Li Zefan on Dec 17, 2010

Make the code aware of compression type, instead of always assuming
zlib compression.

Also make the zlib workspace function as common code for all
compression types.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>

19 Sep, 2009 1 commit

Btrfs: search for an allocation hint while filling file COW · b917b7c3

Committed by Chris Mason on Sep 18, 2009

The allocator has some nice knobs for sending hints about where
to try and allocate new blocks, but when we're doing file allocations
we're not sending any hint at all.

This commit adds a simple extent map search to see if we can
quickly and easily find a hint for the allocator.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

12 Sep, 2009 2 commits

Btrfs: Fix extent replacement race · a1ed835e

Committed by Chris Mason on Sep 11, 2009

Data COW means that whenever we write to a file, we replace any old
extent pointers with new ones.  There was a window where a readpage
might find the old extent pointers on disk and cache them in the
extent_map tree in ram in the middle of a given write replacing them.

Even though both the readpage and the write had their respective bytes
in the file locked, the extent readpage inserts may cover more bytes than
it had locked down.

This commit closes the race by keeping the new extent pinned in the extent
map tree until after the on-disk btree is properly setup with the new
extent pointers.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

Btrfs: switch extent_map to a rw lock · 890871be

Committed by Chris Mason on Sep 02, 2009

There are two main users of the extent_map tree.  The
first is regular file inodes, where it is evenly spread
between readers and writers.

The second is the chunk allocation tree, which maps blocks from
logical addresses to physical ones, and it is 99.99% reads.

The mapping tree is a point of lock contention during heavy IO
workloads, so this commit switches things to a rw lock.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

10 Nov, 2008 1 commit

Btrfs: Fix csum error for compressed data · ff5b7ee3

Committed by Yan Zheng on Nov 10, 2008

The decompress code doesn't take the logical offset in extent
pointer into account. If the logical offset isn't zero, data
will be decompressed into wrong pages.

The solution used here is to record the starting offset of the extent
in the file separately from the logical start of the extent_map struct.
This allows us to avoid problems inserting overlapping extents.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>

31 Oct, 2008 2 commits

Btrfs: Add fallocate support v2 · d899e052

Committed by Yan Zheng on Oct 30, 2008

This patch updates btrfs-progs for fallocate support.

fallocate is a little different in Btrfs because we need to tell the
COW system that a given preallocated extent doesn't need to be
cow'd as long as there are no snapshots of it.  This leverages the
-o nodatacow checks.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>

Btrfs: update hole handling v2 · 9036c102

Committed by Yan Zheng on Oct 30, 2008

This patch splits the hole insertion code out of btrfs_setattr
into btrfs_cont_expand and updates btrfs_get_extent to properly
handle the case that file extent items are not continuous.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>

30 Oct, 2008 1 commit

Btrfs: Add zlib compression support · c8b97818

Committed by Chris Mason on Oct 29, 2008

This is a large change for adding compression on reading and writing,
both for inline and regular extents.  It does some fairly large
surgery to the writeback paths.

Compression is off by default and enabled by mount -o compress.  Even
when the -o compress mount option is not used, it is possible to read
compressed extents off the disk.

If compression for a given set of pages fails to make them smaller, the
file is flagged to avoid future compression attempts later.

* While finding delalloc extents, the pages are locked before being sent down
to the delalloc handler.  This allows the delalloc handler to do complex things
such as cleaning the pages, marking them writeback and starting IO on their
behalf.

* Inline extents are inserted at delalloc time now.  This allows us to compress
the data before inserting the inline extent, and it allows us to insert
an inline extent that spans multiple pages.

* All of the in-memory extent representations (extent_map.c, ordered-data.c etc)
are changed to record both an in-memory size and an on disk size, as well
as a flag for compression.

From a disk format point of view, the extent pointers in the file are changed
to record the on disk size of a given extent and some encoding flags.
Space in the disk format is allocated for compression encoding, as well
as encryption and a generic 'other' field.  Neither the encryption nor the
'other' field is currently used.

In order to limit the amount of data read for a single random read in the
file, the size of a compressed extent is limited to 128k.  This is a
software only limit, the disk format supports u64 sized compressed extents.

In order to limit the ram consumed while processing extents, the uncompressed
size of a compressed extent is limited to 256k.  This is a software only limit
and will be subject to tuning later.

Checksumming is still done on compressed extents, and it is done on the
uncompressed version of the data.  This way additional encodings can be
layered on without having to figure out which encoding to checksum.

Compression happens at delalloc time, which is basically single threaded because
it is usually done by a single pdflush thread.  This makes it tricky to
spread the compression load across all the cpus on the box.  We'll have to
look at parallel pdflush walks of dirty inodes at a later time.

Decompression is hooked into readpages and it does spread across CPUs nicely.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

25 Sep, 2008 21 commits

Btrfs: Fix some data=ordered related data corruptions · f421950f

Committed by Chris Mason on Jul 22, 2008

Stress testing was showing data checksum errors, most of which were caused
by a lookup bug in the extent_map tree.  The tree was caching the last
pointer returned, and searches would check the last pointer first.

But, search callers also expect the search to return the very first
matching extent in the range, which wasn't always true with the last
pointer usage.

For now, the code to cache the last return value is just removed.  It is
easy to fix, but I think lookups are rare enough that it isn't required anymore.

This commit also replaces do_sync_mapping_range with a local copy of the
related functions.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
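
The lookup problem described in this commit message can be reproduced in a toy model. The sketch below is not the extent_map code: extents are plain `(start, length)` tuples, and `overlaps`, `lookup_first` and `lookup_with_cache` are hypothetical helpers that only mimic the "check the last pointer first" behaviour.

```python
def overlaps(ext, start, length):
    """True if extent (ext_start, ext_len) intersects [start, start+length)."""
    ext_start, ext_len = ext
    return ext_start < start + length and start < ext_start + ext_len


def lookup_first(extents, start, length):
    """Return the lowest-offset extent overlapping the range, or None."""
    for ext in sorted(extents):
        if overlaps(ext, start, length):
            return ext
    return None


def lookup_with_cache(extents, start, length, last):
    """Buggy variant: trusts the cached extent if it overlaps at all."""
    if last is not None and overlaps(last, start, length):
        return last
    return lookup_first(extents, start, length)
```

If the cached extent is the second of two adjacent extents and the next search covers both, the cached variant returns the second extent even though callers expect the first one in the range.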

Btrfs: Keep extent mappings in ram until pending ordered extents are done · 7f3c74fb

Committed by Chris Mason on Jul 18, 2008

It was possible for stale mappings from disk to be used instead of the
new pending ordered extent. This adds a flag to the extent map struct
to keep it pinned until the pending ordered extent is actually on disk.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

Btrfs: Split the extent_map code into two parts · d1310b2e

Committed by Chris Mason on Jan 24, 2008

There is now extent_map for mapping offsets in the file to disk and
extent_io for state tracking, IO submission and extent_buffers.

The new extent_map code shifts from [start,end] pairs to [start,len], and
pushes the locking out into the caller.  This allows a few performance
optimizations and is easier to use.

A number of extent_map usage bugs were fixed, mostly with failing
to remove extent_map entries when changing the file.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
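
For readers unfamiliar with the two range conventions mentioned in this commit message, here is a minimal conversion sketch. It assumes the old `end` was inclusive (the last byte covered); that detail is not stated in the message, and the helper names are made up for illustration.

```python
def to_start_len(start, end):
    """Convert an inclusive [start, end] pair to (start, len)."""
    if end < start:
        raise ValueError("end must not be before start")
    return start, end - start + 1


def to_start_end(start, length):
    """Convert (start, len) back to an inclusive [start, end] pair."""
    if length <= 0:
        raise ValueError("length must be positive")
    return start, start + length - 1
```

One practical benefit of the `[start,len]` form is that adjacency checks need no off-by-one adjustment: two mappings touch exactly when `a_start + a_len == b_start`.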

Btrfs: Implement basic support for -ENOSPC · 1832a6d5

Committed by Chris Mason on Dec 21, 2007

This is intended to prevent accidentally filling the drive.  A determined
user can still make things oops.

It includes some accounting of the current bytes under delayed allocation,
but this will change as things get optimized.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

Btrfs: section mismatch warnings · 17636e03

Committed by Christian Hesse on Dec 11, 2007

Hello everybody,

compiling btrfs into the kernel results in section mismatch warnings. __exit
functions are called where they are not allowed to. The attached patch fixes
this for me. Not sure if it is correct though.
Signed-off-by: Christian Hesse <mail@earthworm.de>
Signed-off-by: Chris Mason <chris.mason@oracle.com>

Btrfs: Add efficient dirty accounting to the extent_map tree · ca664626

Committed by Chris Mason on Nov 27, 2007

Signed-off-by: Chris Mason <chris.mason@oracle.com>

Btrfs: Limit btree writeback to prevent seeks · 793955bc

Committed by Chris Mason on Nov 26, 2007

Signed-off-by: Chris Mason <chris.mason@oracle.com>

Btrfs: Return value checking in module init · 2f4cbe64

Committed by Wyatt Banks on Nov 19, 2007

Signed-off-by: Chris Mason <chris.mason@oracle.com>

Btrfs: Add readpages support · 3ab2fb5a

Committed by Chris Mason on Nov 08, 2007

Signed-off-by: Chris Mason <chris.mason@oracle.com>

Btrfs: Add writepages support · b293f02e

Committed by Chris Mason on Nov 01, 2007

Signed-off-by: Chris Mason <chris.mason@oracle.com>

Btrfs: Fix a number of inline extent problems that Yan Zheng reported. · 179e29e4

Committed by Chris Mason on Nov 01, 2007

The fixes do a number of things:

1) Most btrfs_drop_extent callers will try to leave the inline extents in
place.  It can truncate bytes off the beginning of the inline extent if
required.

2) writepage can now update the inline extent, allowing mmap writes to
go directly into the inline extent.

3) btrfs_truncate_in_transaction truncates inline extents

4) extent_map.c fixed to not merge inline extent mappings and hole
mappings together
Signed-off-by: Chris Mason <chris.mason@oracle.com>

Btrfs: Add back metadata checksumming · 19c00ddc

Committed by Chris Mason on Oct 15, 2007

Signed-off-by: Chris Mason <chris.mason@oracle.com>

Btrfs: extent_map optimizations to cut down on CPU usage · 810191ff

Committed by Chris Mason on Oct 15, 2007

Signed-off-by: Chris Mason <chris.mason@oracle.com>

Btrfs: Add an extent buffer LRU to reduce radix tree hits · 4dc11904

Committed by Chris Mason on Oct 15, 2007

Signed-off-by: Chris Mason <chris.mason@oracle.com>

Btrfs: Add back the online defragging code · 6b80053d

Committed by Chris Mason on Oct 15, 2007

Signed-off-by: Chris Mason <chris.mason@oracle.com>

Btrfs: Use an array of pages in the extent buffers to reduce the cost of find_get_page · 09e71a32

Committed by Chris Mason on Oct 15, 2007

Signed-off-by: Chris Mason <chris.mason@oracle.com>

Btrfs: Allow tree blocks larger than the page size · db94535d

Committed by Chris Mason on Oct 15, 2007

Signed-off-by: Chris Mason <chris.mason@oracle.com>

Btrfs: Change the remaining radix trees used by extent-tree.c to extent_map trees · 1a5bc167

Committed by Chris Mason on Oct 15, 2007

Signed-off-by: Chris Mason <chris.mason@oracle.com>

Btrfs: Stop using radix trees for the block group cache · 96b5179d

Committed by Chris Mason on Oct 15, 2007

Signed-off-by: Chris Mason <chris.mason@oracle.com>

Btrfs: Fix extent_buffer and extent_state leaks · f510cfec

Committed by Chris Mason on Oct 15, 2007

Signed-off-by: Chris Mason <chris.mason@oracle.com>

Btrfs: Avoid memcpy where possible in extent_buffers · 6d36dcd4

Committed by Chris Mason on Oct 15, 2007

Signed-off-by: Chris Mason <chris.mason@oracle.com>
