提交 · bb721703aa551e98dc5c7fb259cf90343408baf2 · bug2833 / cloud-kernel

02 2月, 2013 8 次提交

Btrfs: reduce CPU contention while waiting for delayed extent operations · bb721703

由 Chris Mason 提交于 1月 29, 2013

We batch up operations to the extent allocation tree, which allows
us to deal with the recursive nature of using the extent allocation
tree to allocate extents to the extent allocation tree.

It also provides a mechanism to sort and collect extent
operations, which makes it much more efficient to record extents
that are close together.

The delayed extent operations must all be finished before the
running transaction commits, so we have code to make sure and run a few
of the batched operations when closing our transaction handles.

This creates a great deal of contention for the locks in the
delayed extent operation tree, and also contention for the lock on the
extent allocation tree itself.  All the extra contention just slows
down the operations and doesn't get things done any faster.

This commit changes things to use a wait queue instead.  As procs
want to run the delayed operations, one of them races in and gets
permission to hit the tree, and the others step back and wait for
progress to be made.
Signed-off-by: NChris Mason <chris.mason@fusionio.com>

bb721703

Btrfs: reduce lock contention on extent buffer locks · 242e18c7

由 Chris Mason 提交于 1月 29, 2013

The extent buffers have a refs_lock which we use to make coordinate freeing
the extent buffer with operations on the radix tree. On tree roots and
other extent buffers that very cache hot, this can be highly contended.

These are also the extent buffers that are basically pinned in memory.
This commit adds code to cmpxchg our way through the ref modifications,
and as long as the result of the reference change is still pinned in
ram, we skip the expensive spinlock.
Signed-off-by: NChris Mason <chris.mason@fusionio.com>

242e18c7

Btrfs: fix cluster alignment for mount -o ssd · 8de972b4

由 Chris Mason 提交于 1月 04, 2013

With the new raid56 code, we want to make sure we're
properly aligning our allocation clusters with -o ssd
Signed-off-by: NChris Mason <chris.mason@fusionio.com>

8de972b4

Btrfs: add a plugging callback to raid56 writes · 6ac0f488

由 Chris Mason 提交于 1月 31, 2013

Buffered writes and DIRECT_IO writes will often break up
big contiguous changes to the file into sub-stripe writes.

This adds a plugging callback to gather those smaller writes full stripe
writes.

Example on flash:

fio job to do 64K writes in batches of 3 (which makes a full stripe):

With plugging: 450MB/s
Without plugging: 220MB/s
Signed-off-by: NChris Mason <chris.mason@fusionio.com>

6ac0f488

Btrfs: Add a stripe cache to raid56 · 4ae10b3a

由 Chris Mason 提交于 1月 31, 2013

The stripe cache allows us to avoid extra read/modify/write cycles
by caching the pages we read off the disk.  Pages are cached when:

* They are read in during a read/modify/write cycle

* They are written during a read/modify/write cycle

* They are involved in a parity rebuild

Pages are not cached if we're doing a full stripe write.  We're
assuming that a full stripe write won't be followed by another
partial stripe write any time soon.

This provides a substantial boost in performance for workloads that
synchronously modify adjacent offsets in the file, and for the parity
rebuild use case in general.

The size of the stripe cache isn't tunable (yet) and is set at 1024
entries.

Example on flash: dd if=/dev/zero of=/mnt/xxx bs=4K oflag=direct

Without the stripe cache  -- 2.1MB/s
With the stripe cache 21MB/s
Signed-off-by: NChris Mason <chris.mason@fusionio.com>

4ae10b3a

Btrfs: RAID5 and RAID6 · 53b381b3

由 David Woodhouse 提交于 1月 29, 2013

This builds on David Woodhouse's original Btrfs raid5/6 implementation.
The code has changed quite a bit, blame Chris Mason for any bugs.

Read/modify/write is done after the higher levels of the filesystem have
prepared a given bio.  This means the higher layers are not responsible
for building full stripes, and they don't need to query for the topology
of the extents that may get allocated during delayed allocation runs.
It also means different files can easily share the same stripe.

But, it does expose us to incorrect parity if we crash or lose power
while doing a read/modify/write cycle.  This will be addressed in a
later commit.

Scrub is unable to repair crc errors on raid5/6 chunks.

Discard does not work on raid5/6 (yet)

The stripe size is fixed at 64KiB per disk.  This will be tunable
in a later commit.
Signed-off-by: NChris Mason <chris.mason@fusionio.com>

53b381b3

Btrfs: add rw argument to merge_bio_hook() · 64a16701

由 David Woodhouse 提交于 7月 15, 2009

We'll want to merge writes so they can fill a full RAID[56] stripe, but
not necessarily reads.
Signed-off-by: NDavid Woodhouse <David.Woodhouse@intel.com>
Signed-off-by: NChris Mason <chris.mason@fusionio.com>

64a16701

btrfs: don't try to notify udev about missing devices · 3c911608

由 Eric Sandeen 提交于 1月 31, 2013

If we remove a missing device, bdev is null, and if we
send that off to btrfs_kobject_uevent we'll panic.
Signed-off-by: NEric Sandeen <sandeen@redhat.com>
Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
Signed-off-by: NChris Mason <chris.mason@fusionio.com>

3c911608

19 12月, 2012 2 次提交

Revert "Btrfs: reorder tree mod log operations in deleting a pointer" · 57ba86c0

由 Chris Mason 提交于 12月 18, 2012

This reverts commit 6a7a665d.

This was bug was fixed differently in 3.6, so this commit
isn't needed.

Conflicts:
	fs/btrfs/ctree.c
Signed-off-by: NChris Mason <chris.mason@fusionio.com>

57ba86c0

Revert "Btrfs: MOD_LOG_KEY_REMOVE_WHILE_MOVING never change node's nritems" · 4c3e6969

由 Chris Mason 提交于 12月 18, 2012

This reverts commit 95c80bb1.

The bug addressed by this commit was fixed differently back in 3.6
Signed-off-by: NChris Mason <chris.mason@fusionio.com>

4c3e6969

18 12月, 2012 2 次提交

Btrfs: fix a bug of per-file nocow · 213490b3

由 Liu Bo 提交于 9月 11, 2012

Users report a bug, the reproducer is:
$ mkfs.btrfs /dev/loop0
$ mount /dev/loop0 /mnt/btrfs/
$ mkdir /mnt/btrfs/dir
$ chattr +C /mnt/btrfs/dir/
$ dd if=/dev/zero of=/mnt/btrfs/dir/foo bs=4K count=10;
$ lsattr /mnt/btrfs/dir/foo
---------------C- /mnt/btrfs/dir/foo
$ filefrag /mnt/btrfs/dir/foo
/mnt/btrfs/dir/foo: 1 extent found    ---> an extent
$ dd if=/dev/zero of=/mnt/btrfs/dir/foo bs=4K count=1 seek=5 conv=notrunc,nocreat; sync
$ filefrag /mnt/btrfs/dir/foo
/mnt/btrfs/dir/foo: 3 extents found   ---> with nocow, btrfs breaks the extent into three parts

The new created file should not only inherit the NODATACOW flag, but also
honor NODATASUM flag, because we must do COW on a file extent with checksum.
Signed-off-by: NLiu Bo <bo.li.liu@oracle.com>
Signed-off-by: NChris Mason <chris.mason@fusionio.com>

213490b3

Btrfs: fix hash overflow handling · 9c52057c

由 Chris Mason 提交于 12月 17, 2012

The handling for directory crc hash overflows was fairly obscure,
split_leaf returns EOVERFLOW when we try to extend the item and that is
supposed to bubble up to userland.  For a while it did so, but along the
way we added better handling of errors and forced the FS readonly if we
hit IO errors during the directory insertion.

Along the way, we started testing only for EEXIST and the EOVERFLOW case
was dropped.  The end result is that we may force the FS readonly if we
catch a directory hash bucket overflow.

This fixes a few problem spots.  First I add tests for EOVERFLOW in the
places where we can safely just return the error up the chain.

btrfs_rename is harder though, because it tries to insert the new
directory item only after it has already unlinked anything the rename
was going to overwrite.  Rather than adding very complex logic, I added
a helper to test for the hash overflow case early while it is still safe
to bail out.

Snapshot and subvolume creation had a similar problem, so they are using
the new helper now too.
Signed-off-by: NChris Mason <chris.mason@fusionio.com>
Reported-by: NPascal Junod <pascal@junod.info>

9c52057c

17 12月, 2012 28 次提交

Btrfs: don't take inode delalloc mutex if we're a free space inode · c64c2bd8

由 Josef Bacik 提交于 12月 14, 2012

This confuses and angers lockdep even though it's ok.  We don't really need
the lock for free space inodes since only the transaction committer will be
reserving space.  Thanks,
Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
Signed-off-by: NChris Mason <chris.mason@fusionio.com>

c64c2bd8

Btrfs: fix autodefrag and umount lockup · 1135d6df

由 Josef Bacik 提交于 12月 14, 2012

This happens because writeback_inodes_sb_nr_if_idle does down_read.  This
doesn't work for us and it has not been fixed upstream yet, so do it
ourselves and use that instead so we can stop having this stupid long
standing lockup.  Thanks,
Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
Signed-off-by: NChris Mason <chris.mason@fusionio.com>

1135d6df

Btrfs: fix permissions of empty files not affected by umask · 9185aa58

由 Filipe Brandenburger 提交于 11月 30, 2012

When a new file is created with btrfs_create(), the inode will initially be
created with permissions 0666 and later on in btrfs_init_acl() it will be
adapted to mask out the umask bits. The problem is that this change won't make
it into the btrfs_inode unless there's another change to the inode (e.g. writing
content changing the size or touching the file changing the mtime.)

This fix adds a call to btrfs_update_inode() to btrfs_create() to make sure that
the change will not get lost if the in-memory inode is flushed before other
changes are made to the file.
Signed-off-by: NFilipe Brandenburger <filbranden@google.com>
Reviewed-by: NLiu Bo <bo.li.liu@oracle.com>
Signed-off-by: NChris Mason <chris.mason@fusionio.com>

9185aa58

Btrfs: put raid properties into global table · 31e50229

由 Liu Bo 提交于 11月 21, 2012

Raid properties can be shared among raid calculation code, we can put
them into a global table to keep it simple.
Signed-off-by: NLiu Bo <bo.li.liu@oracle.com>
Signed-off-by: NChris Mason <chris.mason@fusionio.com>

31e50229

Btrfs: fix BUG() in scrub when first superblock reading gives EIO · 4ded4f63

由 Stefan Behrens 提交于 11月 14, 2012

This fixes a very special case that can be reproduced by just
disconnecting a disk at runtime, and without unmounting the
filesystem first, start scrub on the filesystem with the
disconnected disk. All read and write EIOs are handled
correctly, only the first superblock is an exception and gives
a BUG() in a subfunction. The BUG() is correct, it would crash
later otherwise. The subfunction must not be called for
superblocks and this is what the fix changes.
Reported-by: NJoeri Vanthienen <mail@joerivanthienen.be>
Signed-off-by: NStefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: NChris Mason <chris.mason@fusionio.com>

4ded4f63

Btrfs: do not call file_update_time in aio_write · 6c760c07

由 Josef Bacik 提交于 11月 09, 2012

This starts a transaction and dirties the inode everytime we call it, which
is super expensive if you have a write heavy workload. We will be updating
the inode when the IO completes and we reserve the space for the inode
update when we reserve space for the write, so there is no chance of loss of
information or enospc issues. Thanks,
Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
Signed-off-by: NChris Mason <chris.mason@fusionio.com>

6c760c07

Btrfs: only unlock and relock if we have to · 5124e00e

由 Josef Bacik 提交于 11月 07, 2012

I noticed while doing fsync tests that we were always dropping the path and
re-searching when we first cow the log root even though we've already gotten
the write lock on the root. That's because we don't take into account that
there might not be a parent node, so fix the check to make sure there is
actually a parent node before we undo all of this work for nothing. Thanks,
Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
Signed-off-by: NChris Mason <chris.mason@fusionio.com>

5124e00e

Btrfs: use tokens where we can in the tree log · 0b1c6cca

由 Josef Bacik 提交于 10月 23, 2012

If we are syncing over and over the overhead of doing all those maps in
fill_inode_item and log_changed_extents really starts to hurt, so use map
tokens so we can avoid all the extra mapping. Since the token maps from our
offset to the end of the page make sure to set the first thing in the item
first so we really only do one map. Thanks,
Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
Signed-off-by: NChris Mason <chris.mason@fusionio.com>

0b1c6cca

Btrfs: optimize leaf_space_used · 41be1f3b

由 Josef Bacik 提交于 10月 15, 2012

This gets called at least 4 times for every level while adding an object,
and it involves 3 kmapping calls, which on my box take about 5us a piece.
So instead use a token, which brings us down to 1 kmap call and makes this
function take 1/3 of the time per call. Thanks,
Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
Signed-off-by: NChris Mason <chris.mason@fusionio.com>

41be1f3b

Btrfs: don't memset new tokens · ad914559

由 Josef Bacik 提交于 10月 15, 2012

Our token logic depends on token->kaddr being set, and if it is not it sets
everything properly as needed.  So instead of memsetting just set
token->kaddr to NULL.  Thanks,
Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
Signed-off-by: NChris Mason <chris.mason@fusionio.com>

ad914559

Btrfs: only clear dirty on the buffer if it is marked as dirty · ed7b63eb

由 Josef Bacik 提交于 10月 15, 2012

No reason to set the path blocking or loop through all of the pages if the
extent buffer isn't actually marked dirty.  Thanks,
Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
Signed-off-by: NChris Mason <chris.mason@fusionio.com>

ed7b63eb

Btrfs: move checks in set_page_dirty under DEBUG · bb146eb2

由 Josef Bacik 提交于 10月 15, 2012

This is a high traffic function, let's try and do as little as possible
during normal operations shall we?
Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
Signed-off-by: NChris Mason <chris.mason@fusionio.com>

bb146eb2

Btrfs: log changed inodes based on the extent map tree · 70c8a91c

由 Josef Bacik 提交于 10月 11, 2012

We don't really need to copy extents from the source tree since we have all
of the information already available to us in the extent_map tree. So
instead just write the extents straight to the log tree and don't bother to
copy the extent items from the source tree.
Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
Signed-off-by: NChris Mason <chris.mason@fusionio.com>

70c8a91c

Btrfs: add path->really_keep_locks · d6393786

由 Josef Bacik 提交于 12月 12, 2012

You'd think path->keep_locks would keep all the locks wouldn't you? You'd
be wrong. It only keeps them if the slot is pointing to the last item in
the node. This is for use with btrfs_next_leaf, which needs this sort of
thing. But the horrible horrible things I'm going to do to the tree log
means I really need everything held from root to leaf so I can add and
delete items in the same search. Thanks,
Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
Signed-off-by: NChris Mason <chris.mason@fusionio.com>

d6393786

Btrfs: do not mark ems as prealloc if we are writing to them · b11e234d

由 Josef Bacik 提交于 12月 03, 2012

We are going to use EM's to log extents in the future, so we need to not
mark them as prealloc if they aren't actually prealloc extents. Instead
mark them with FILLING so we know to ammend mod_start/mod_len and that way
we don't confuse the extent logging code. Thanks,
Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
Signed-off-by: NChris Mason <chris.mason@fusionio.com>

b11e234d

Btrfs: keep track of the extents original block length · b4939680

由 Josef Bacik 提交于 12月 03, 2012

If we've written to a prealloc extent we need to know the original block len
for the extent. We can't figure this out currently since ->block_len is
just set to the extent length. So introduce ->orig_block_len so that we
know how many bytes were in the original extent for proper extent logging
that future patches will need. Thanks,
Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
Signed-off-by: NChris Mason <chris.mason@fusionio.com>

b4939680

Btrfs: inline csums if we're fsyncing · b812ce28

由 Josef Bacik 提交于 11月 16, 2012

The tree logging stuff needs the csums to be on the ordered extents in order
to log them properly, so mark that we're sync and inline the csum creation
so we don't have to wait on the csumming to be done when logging extents
that are still in flight.  Thanks,
Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
Signed-off-by: NChris Mason <chris.mason@fusionio.com>

b812ce28

Btrfs: don't bother copying if we're only logging the inode · a95249b3

由 Josef Bacik 提交于 10月 11, 2012

We don't copy inode items anwyay, we just copy them straight into the log
from the in memory inode. So if we know we're only logging the inode, don't
bother dropping anything, just try to insert it and either if it succeeds or
we get EEXIST we can update the inode item in the log and carry on. Thanks,
Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
Signed-off-by: NChris Mason <chris.mason@fusionio.com>

a95249b3

Btrfs: only log the inode item if we can get away with it · e9976151

由 Josef Bacik 提交于 10月 11, 2012

Currently we copy all the file information into the log, inode item, the
refs, xattrs etc. Except most of this doesn't change from fsync to fsync,
just the inode item changes. So set a flag if an xattr changes or a link is
added, and otherwise only log the inode item. Thanks,
Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
Signed-off-by: NChris Mason <chris.mason@fusionio.com>

e9976151

Btrfs: rename root_times_lock to root_item_lock · 5f3ab90a

由 Anand Jain 提交于 12月 07, 2012

Originally root_times_lock was introduced as part of send/receive
code however newly developed patch to label the subvol reused
the same lock, so renaming it for a meaningful name.
Signed-off-by: NAnand Jain <anand.jain@oracle.com>
Signed-off-by: NChris Mason <chris.mason@fusionio.com>

5f3ab90a

btrfs: Notify udev when removing device · b8b8ff59

由 Lukas Czerner 提交于 12月 06, 2012

Currently udev does not know about the device being removed from the
file system. This may result in the situation where we're unable to
mount the file system by UUID or by LABEL because the by-uuid and
by-label links may still point to the device which is no longer part of
the btrfs file system and hence does not have any btrfs super block.

It can be easily reproduced by the following:

mkfs.btrfs -L bugfs /dev/loop[0-6]
mount /dev/loop0 /mnt/test
btrfs device delete /dev/loop0 /mnt/test
umount /mnt/test

mount LABEL=bugfs /mnt/test <---- this fails

then see:

ls -l /dev/disk/by-label/bugfs

which will still point to the /dev/loop0

We did not noticed this before because libblkid would send the udev
event for us when it notice that the link does not fit the reality,
however it does not do that anymore and completely relies on udev
information.

Fix this by sending the KOBJ_CHANGE event to the bdev kobject after
successful device removal.

Note that this does not affect device addition, because we will open the
device prior the addition from userspace and udev will notice that and
reread the device afterwards.
Signed-off-by: NLukas Czerner <lczerner@redhat.com>
Signed-off-by: NChris Mason <chris.mason@fusionio.com>

b8b8ff59

Btrfs: fix wrong return value of btrfs_truncate_page() · ac6a2b36

由 Miao Xie 提交于 12月 05, 2012

ret variant may be set to 0 if we read page successfully, but it might be
released before we lock it again. On this case, if we fail to allocate a
new page, we will return 0, it is wrong, fix it.
Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
Signed-off-by: NChris Mason <chris.mason@fusionio.com>

ac6a2b36

Btrfs: punch hole past the end of the file · 7426cc04

由 Miao Xie 提交于 12月 05, 2012

Since we can pre-allocate the space past EOF, we should be able to reclaim
that space if we need. This patch implements it by removing the EOF check.

Though the manual of fallocate command says we can use truncate command to
reclaim the pre-allocated space which past EOF, but because truncate command
changes the file size, we must run several commands to reclaim the space if we
don't want to change the file size, so it is not a good choice.
Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
Signed-off-by: NChris Mason <chris.mason@fusionio.com>

7426cc04

Btrfs: fix the page that is beyond EOF · 0061280d

由 Miao Xie 提交于 12月 05, 2012

Steps to reproduce:
 # mkfs.btrfs <disk>
 # mount <disk> <mnt>
 # dd if=/dev/zero of=<mnt>/<file> bs=512 seek=5 count=8
 # fallocate -p -o 2048 -l 16384 <mnt>/<file>
 # dd if=/dev/zero of=<mnt>/<file> bs=4096 seek=3 count=8 conv=notrunc,nocreat
 # umount <mnt>
 # dmesg
 WARNING: at fs/btrfs/inode.c:7140 btrfs_destroy_inode+0x2eb/0x330

The reason is that we inputed a range which is beyond the end of the file. And
because the end of this range was not page-aligned, we had to truncate the last
page in this range, this operation is similar to a buffered file write. In other
words, we reserved enough space and clear the data which was in the hole range
on that page. But when we expanded that test file, write the data into the same
page, we forgot that we have reserved enough space for the buffered write of
that page because in most cases there is no page that is beyond the end of
the file. As a result, we reserved the space twice.

In fact, we needn't truncate the page if it is beyond the end of the file, just
release the allocated space in that range. Fix the above problem by this way.
Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
Signed-off-by: NChris Mason <chris.mason@fusionio.com>

0061280d

Btrfs: fix off-by-one error of the same page check in btrfs_punch_hole() · 6347b3c4

由 Miao Xie 提交于 12月 05, 2012

(start + len) is the start of the adjacent extent, not the end of the current
extent, so we should not use it to check the hole is on the same page or not.
Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
Signed-off-by: NChris Mason <chris.mason@fusionio.com>

6347b3c4

Btrfs: fix missing reserved space release in error path of delalloc reservation · 4b5829a8

由 Miao Xie 提交于 12月 05, 2012

We forget to release the reserved space in the error path of delalloc
reservatiom, fix it.
Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
Signed-off-by: NChris Mason <chris.mason@fusionio.com>

4b5829a8

Btrfs: don't auto defrag a file when doing directIO · 543eabd5

由 Miao Xie 提交于 12月 05, 2012

If we runt the direct IO, we should not run auto defrag, because it may
introduce buffered IO vs direcIO problem, and make direct IO slow down.
Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
Signed-off-by: NChris Mason <chris.mason@fusionio.com>

543eabd5

Btrfs: parse parent 0 into correct value in tracepoint · fb57dc81

由 Liu Bo 提交于 11月 30, 2012

Value 0 is not a tree id, so besides an upper limit, a lower limit is
necessary as well while parsing root types of tracepoint.
Signed-off-by: NLiu Bo <bo.li.liu@oracle.com>
Signed-off-by: NChris Mason <chris.mason@fusionio.com>

fb57dc81

bug2833 / cloud-kernel 与 Fork 源项目一致

bug2833 / cloud-kernel
与 Fork 源项目一致