1. 28 March 2011, 12 commits
    • Btrfs: add btrfs_trim_fs() to handle FITRIM · f7039b1d
      Authored by Li Dongyang
      We take a free extent out of the allocator, trim it, and then put it back.
      Before we trim the block group we should make sure the block group is
      cached, so this also includes a small change that lets cache_block_group()
      run without a transaction. (A userspace sketch of driving FITRIM follows
      this entry.)
      Signed-off-by: Li Dongyang <lidongyang@novell.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      f7039b1d
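      Not part of the patch — a minimal userspace sketch of how FITRIM is
      driven against a mounted btrfs filesystem. FITRIM and struct
      fstrim_range come from <linux/fs.h>; the mount point path is an
      assumption here.

        /* fstrim-sketch.c: ask the filesystem to discard all of its free space */
        #include <stdio.h>
        #include <string.h>
        #include <limits.h>
        #include <fcntl.h>
        #include <unistd.h>
        #include <sys/ioctl.h>
        #include <linux/fs.h>

        int main(int argc, char **argv)
        {
            struct fstrim_range range;
            int fd = open(argc > 1 ? argv[1] : "/mnt/btrfs", O_RDONLY);

            if (fd < 0) {
                perror("open");
                return 1;
            }
            memset(&range, 0, sizeof(range));
            range.len = ULLONG_MAX;   /* trim the whole filesystem */
            range.minlen = 0;         /* no minimum extent size */
            if (ioctl(fd, FITRIM, &range) < 0) {
                perror("FITRIM");
                close(fd);
                return 1;
            }
            printf("trimmed %llu bytes\n", (unsigned long long)range.len);
            close(fd);
            return 0;
        }

      On success the kernel reports the number of bytes it trimmed back in
      range.len, which is the count this trim path fills in.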
    • Btrfs: adjust btrfs_discard_extent() return errors and trimmed bytes · 5378e607
      Authored by Li Dongyang
      Callers of btrfs_discard_extent() should check whether we are mounted
      with -o discard, because we want FITRIM to work even when the fs is not
      mounted with -o discard. We should also use REQ_DISCARD when mapping the
      free extent so that we get a full mapping. Finally, we only return an
      error if:
      1. the error is not EOPNOTSUPP, or
      2. no device supports discard.
      (An illustrative sketch of this error-handling rule follows this entry.)
      Signed-off-by: Li Dongyang <lidongyang@novell.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      5378e607
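      An illustrative, self-contained restatement of the error rule described
      above — not the kernel code; discard_one() is a stand-in for issuing the
      discard to a single device:

        #include <errno.h>

        /* Discard on every device; ignore per-device EOPNOTSUPP, remember
         * real errors, and fail with EOPNOTSUPP only if no device at all
         * supported the discard. */
        static int discard_on_all_devices(int (*discard_one)(int dev), int ndevs)
        {
            int i, ret, err = 0, supported = 0;

            for (i = 0; i < ndevs; i++) {
                ret = discard_one(i);
                if (ret == 0)
                    supported++;           /* this device honoured the discard */
                else if (ret != -EOPNOTSUPP)
                    err = ret;             /* keep only "real" errors */
            }
            if (!supported)
                return -EOPNOTSUPP;        /* nobody could discard */
            return err;
        }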
    • Btrfs: make btrfs_map_block() return entire free extent for each device of RAID0/1/10/DUP · fce3bb9a
      Authored by Li Dongyang
      btrfs_map_block() only returns a single stripe length, but when trimming
      an extent we want the full extent mapped to each disk, so add a length
      field to btrfs_bio_stripe and fill it when we are mapping for
      REQ_DISCARD. (A small structural sketch follows this entry.)
      Signed-off-by: Li Dongyang <lidongyang@novell.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      fce3bb9a
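      The shape of the change, sketched with stand-in names (struct
      example_bio_stripe and example_device are hypothetical; the real
      structures live in fs/btrfs/volumes.h):

        #include <stdint.h>

        struct example_device;              /* stand-in for the per-device struct */

        /* Each stripe of a mapping now carries its own length, so a
         * REQ_DISCARD mapping can describe the full extent on every device
         * instead of a single stripe-sized chunk. */
        struct example_bio_stripe {
            struct example_device *dev;
            uint64_t physical;              /* start of the range on this device */
            uint64_t length;                /* filled only when mapping for REQ_DISCARD */
        };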
    • Btrfs: make update_reserved_bytes() public · b4d00d56
      Authored by Li Dongyang
      Make the function public, since we need to update the reserved-extent
      accounting after taking an extent out for trimming.
      Signed-off-by: Li Dongyang <lidongyang@novell.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      b4d00d56
    • btrfs: return EXDEV when linking from different subvolumes · 3ab3564f
      Authored by Mark Fasheh
      btrfs_link returns EPERM if a cross-subvolume link is attempted.
      
      However, in this case I believe EXDEV to be the more appropriate value.
      From the link(2) man page:
      
      EXDEV  oldpath and newpath are not on the same mounted file system.  (Linux
             permits a file system to be mounted at multiple points, but link()
             does not work across different mount points, even if the same file
             system is mounted on both.)
      
      This matters because an application may behave differently depending on
      the return code. (An illustrative check follows this entry.)
      Signed-off-by: Mark Fasheh <mfasheh@suse.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      3ab3564f
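      The core of the change, restated as a self-contained illustration (the
      real check compares the btrfs roots of the two inodes inside
      btrfs_link()):

        #include <errno.h>

        /* Refuse a hard link whose source and target live in different
         * subvolumes, and report it as a cross-device link rather than a
         * permission problem. */
        static int check_same_subvolume(const void *src_root, const void *dst_root)
        {
            if (src_root != dst_root)
                return -EXDEV;    /* was -EPERM before this change */
            return 0;
        }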
    • Btrfs: Per file/directory controls for COW and compression · 75e7cb7f
      Authored by Liu Bo
      Data compression and data COW are currently controlled for the entire FS
      by mount options. ioctls are needed to set this on a per-file or
      per-directory basis. This has been proposed previously, but VFS
      developers wanted us to use generic ioctls rather than btrfs-specific
      ones.
      
      According to Chris's comment, there should be just one true compression
      method (probably LZO) stored in the super block. However, before that we
      will wait until the method is stable enough to be adopted into the super
      block, so I list it as a long-term goal and just store the setting in
      RAM today.
      
      After applying this patch, we can use the generic FS_IOC_SETFLAGS ioctl
      to control the data-COW and compression attributes of files and
      directories. (A userspace usage sketch follows this entry.)
      
      NOTE:
       - The compression type is selected by the following rule: if btrfs is
         mounted with a compress option (zlib or lzo), that type is used;
         otherwise we use the default compression type (zlib today).
      
      v1->v2:
      - rebase onto the latest btrfs.
      v2->v3:
      - fix a problem where a file's NOCOW setting from the mount option could
        be overridden by inheritance from the parent directory.
      Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      75e7cb7f
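      A userspace sketch of the interface described above, using the generic
      ioctls from <linux/fs.h>. The flag names (FS_NOCOW_FL, FS_COMPR_FL) are
      the ones defined in later kernel headers; whether a given kernel honours
      them on btrfs depends on its version, so treat this as illustrative:

        #include <stdio.h>
        #include <fcntl.h>
        #include <unistd.h>
        #include <sys/ioctl.h>
        #include <linux/fs.h>

        int main(int argc, char **argv)
        {
            int fd, flags;

            if (argc < 2) {
                fprintf(stderr, "usage: %s <file>\n", argv[0]);
                return 1;
            }
            fd = open(argv[1], O_RDONLY);
            if (fd < 0 || ioctl(fd, FS_IOC_GETFLAGS, &flags) < 0) {
                perror(argv[1]);
                return 1;
            }
            flags |= FS_NOCOW_FL;      /* disable copy-on-write for this file */
            flags &= ~FS_COMPR_FL;     /* make sure per-file compression is off */
            if (ioctl(fd, FS_IOC_SETFLAGS, &flags) < 0) {
                perror("FS_IOC_SETFLAGS");
                close(fd);
                return 1;
            }
            close(fd);
            return 0;
        }

      chattr +C / chattr +c from e2fsprogs drive the same ioctl, which is
      exactly the "generic rather than btrfs-specific" interface the VFS
      developers asked for.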
    • btrfs: use GFP_NOFS instead of GFP_KERNEL · fc0e4a31
      Authored by Miao Xie
      In filesystem context we must allocate memory with GFP_NOFS, otherwise
      reclaim may start another filesystem operation and hang the kswapd
      thread. (A tiny illustrative fragment follows this entry.)
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      fc0e4a31
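      A minimal kernel-side fragment illustrating the rule (example_alloc() is
      a hypothetical helper; the point is only the GFP flag):

        #include <linux/slab.h>

        /* Called from a path that may already hold filesystem locks or be in
         * writeback: GFP_NOFS keeps memory reclaim from re-entering the
         * filesystem, which GFP_KERNEL could do and deadlock. */
        static void *example_alloc(size_t size)
        {
            return kmalloc(size, GFP_NOFS);
        }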
    • Btrfs: check return value of read_tree_block() · 97d9a8a4
      Authored by Tsutomu Itoh
      This patch checks the return value of read_tree_block() and performs
      error handling when it is NULL. (An illustrative fragment follows this
      entry.)
      Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      97d9a8a4
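      The pattern being added, as a fragment (the argument list reflects the
      read_tree_block() signature of that era and is shown only for
      illustration):

        struct extent_buffer *eb;

        eb = read_tree_block(root, bytenr, blocksize, generation);
        if (!eb)                /* allocation or read failure */
            return -EIO;        /* propagate instead of crashing later */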
    • btrfs: properly access unaligned checksum buffer · 7e75bf3f
      Authored by David Sterba
      On Fri, Mar 18, 2011 at 11:56:53AM -0400, Chris Mason wrote:
      > Thanks for fielding this one.  Does put_unaligned_le32 optimize away on
      > platforms with efficient access?  It would be great if we didn't need
      > the #ifdef.
      
      (quick test: the assembly output is the same for put_unaligned_le32 and a
      direct assignment on my x86_64)
      I was originally following the examples in
      Documentation/unaligned-memory-access.txt. From other code it seems to me
      that the define CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS is intended for
      larger portions of code. The macros/wrappers for {put,get}_unaligned* are
      chosen via arch/<arch>/include/asm/unaligned.h accordingly, so it is safe
      to use put_unaligned_le32 without the ifdef. (A short usage fragment
      follows this entry.)
      
      dave
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      7e75bf3f
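      The helper in question, as a fragment (variable names illustrative):

        #include <asm/unaligned.h>

        /* Store a 32-bit little-endian checksum into a buffer that may not be
         * 4-byte aligned. On architectures with efficient unaligned access
         * this compiles down to a plain store, so no #ifdef is needed. */
        put_unaligned_le32(csum, csum_buf);   /* instead of *(u32 *)csum_buf = csum */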
    • Btrfs: cleanup some BUG_ON() · db5b493a
      Authored by Tsutomu Itoh
      This patch changes some BUG_ON() calls into error returns (though most
      callers still use BUG_ON()). (The before/after pattern is sketched after
      this entry.)
      Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      db5b493a
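      The before/after shape of the cleanup (do_something() is a placeholder
      for whichever callee used to be wrapped in BUG_ON):

        ret = do_something();
        if (ret)             /* was: BUG_ON(ret), which crashed the kernel */
            return ret;      /* hand the error back to the caller instead */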
    • Btrfs: add initial tracepoint support for btrfs · 1abe9b8a
      Authored by liubo
      Tracepoints can provide insight into why btrfs hits bugs and are greatly
      helpful for debugging, e.g.:
                    dd-7822  [000]  2121.641088: btrfs_inode_request: root = 5(FS_TREE), gen = 4, ino = 256, blocks = 8, disk_i_size = 0, last_trans = 8, logged_trans = 0
                    dd-7822  [000]  2121.641100: btrfs_inode_new: root = 5(FS_TREE), gen = 8, ino = 257, blocks = 0, disk_i_size = 0, last_trans = 0, logged_trans = 0
       btrfs-transacti-7804  [001]  2146.935420: btrfs_cow_block: root = 2(EXTENT_TREE), refs = 2, orig_buf = 29368320 (orig_level = 0), cow_buf = 29388800 (cow_level = 0)
       btrfs-transacti-7804  [001]  2146.935473: btrfs_cow_block: root = 1(ROOT_TREE), refs = 2, orig_buf = 29364224 (orig_level = 0), cow_buf = 29392896 (cow_level = 0)
       btrfs-transacti-7804  [001]  2146.972221: btrfs_transaction_commit: root = 1(ROOT_TREE), gen = 8
         flush-btrfs-2-7821  [001]  2155.824210: btrfs_chunk_alloc: root = 3(CHUNK_TREE), offset = 1103101952, size = 1073741824, num_stripes = 1, sub_stripes = 0, type = DATA
         flush-btrfs-2-7821  [001]  2155.824241: btrfs_cow_block: root = 2(EXTENT_TREE), refs = 2, orig_buf = 29388800 (orig_level = 0), cow_buf = 29396992 (cow_level = 0)
         flush-btrfs-2-7821  [001]  2155.824255: btrfs_cow_block: root = 4(DEV_TREE), refs = 2, orig_buf = 29372416 (orig_level = 0), cow_buf = 29401088 (cow_level = 0)
         flush-btrfs-2-7821  [000]  2155.824329: btrfs_cow_block: root = 3(CHUNK_TREE), refs = 2, orig_buf = 20971520 (orig_level = 0), cow_buf = 20975616 (cow_level = 0)
       btrfs-endio-wri-7800  [001]  2155.898019: btrfs_cow_block: root = 5(FS_TREE), refs = 2, orig_buf = 29384704 (orig_level = 0), cow_buf = 29405184 (cow_level = 0)
       btrfs-endio-wri-7800  [001]  2155.898043: btrfs_cow_block: root = 7(CSUM_TREE), refs = 2, orig_buf = 29376512 (orig_level = 0), cow_buf = 29409280 (cow_level = 0)
      
      Here is what I have added:
      
      1) ordered_extent:
              btrfs_ordered_extent_add
              btrfs_ordered_extent_remove
              btrfs_ordered_extent_start
              btrfs_ordered_extent_put
      
      These provide critical information to understand how ordered_extents are
      updated.
      
      2) extent_map:
              btrfs_get_extent
      
      extent_map is used in both the read and write cases, and it is useful for
      tracking how btrfs-specific IO is running.
      
      3) writepage:
              __extent_writepage
              btrfs_writepage_end_io_hook
      
      Pages are critical resources and produce a lot of corner cases during
      writeback, so it is valuable to know how a page is written to disk.
      
      4) inode:
              btrfs_inode_new
              btrfs_inode_request
              btrfs_inode_evict
      
      These show where and when an inode is created and when it is evicted.
      
      5) sync:
              btrfs_sync_file
              btrfs_sync_fs
      
      These show sync arguments.
      
      6) transaction:
              btrfs_transaction_commit
      
      In a transaction-based filesystem, it is useful to know the generation
      and who performs the commit.
      
      7) back reference and cow:
      	btrfs_delayed_tree_ref
      	btrfs_delayed_data_ref
      	btrfs_delayed_ref_head
      	btrfs_cow_block
      
      Btrfs natively supports back references; these tracepoints are helpful
      for understanding btrfs's COW mechanism.
      
      8) chunk:
      	btrfs_chunk_alloc
      	btrfs_chunk_free
      
      A chunk is a link between a physical offset and a logical offset and
      represents space information in btrfs; these are helpful for tracing
      space usage.
      
      9) reserved_extent:
      	btrfs_reserved_extent_alloc
      	btrfs_reserved_extent_free
      
      These can show how btrfs uses its space. (A hedged sketch of a
      TRACE_EVENT declaration follows this entry.)
      Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      1abe9b8a
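      A hedged sketch of how one of these tracepoints can be declared with the
      TRACE_EVENT() macro. The fields are simplified and not the exact
      definition from include/trace/events/btrfs.h:

        TRACE_EVENT(btrfs_transaction_commit,

            TP_PROTO(struct btrfs_root *root),

            TP_ARGS(root),

            TP_STRUCT__entry(
                __field(u64, generation)
                __field(u64, root_objectid)
            ),

            TP_fast_assign(
                __entry->generation    = root->fs_info->generation;
                __entry->root_objectid = root->root_key.objectid;
            ),

            TP_printk("root = %llu, gen = %llu",
                      (unsigned long long)__entry->root_objectid,
                      (unsigned long long)__entry->generation)
        );

      At the call site the event is emitted with
      trace_btrfs_transaction_commit(root), and it can be enabled at runtime
      through /sys/kernel/debug/tracing/events/btrfs/.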
    • Btrfs: use RCU instead of a spinlock to protect the root node · 240f62c8
      Authored by Chris Mason
      The pointer to the extent buffer for the root of each tree is protected
      by a spinlock so that we can safely read the pointer and take a
      reference on the extent buffer.
      
      But now that the extent buffers are freed via RCU, we can safely use
      rcu_read_lock instead. (A read-side fragment follows this entry.)
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      240f62c8
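      The read-side pattern after the change, as a fragment:

        struct extent_buffer *eb;

        rcu_read_lock();
        eb = rcu_dereference(root->node);   /* no spinlock needed any more */
        /* take a reference on eb while still inside the RCU read section */
        rcu_read_unlock();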
  2. 26 March 2011, 3 commits
    • Btrfs: mark the bio with an error if we have a failure in dio · c0da7aa1
      Authored by Josef Bacik
      I noticed that dio_end_io calls the appropriate endio function with an
      error, but the endio functions don't actually do anything with that
      error; they assume that if there was an error then the bio will not be
      uptodate.  So if we had checksum failures we would never pass back EIO.
      So if there is an error in our endio functions, make sure to clear the
      uptodate flag on the bio.  Thanks,
      (An illustrative fragment follows this entry.)
      Signed-off-by: Josef Bacik <josef@redhat.com>
      c0da7aa1
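      The gist of the fix, as a fragment against the bio API of that era
      (bi_flags/BIO_UPTODATE; newer kernels track this in bi_status instead):

        /* In the dio endio path: record the error on the bio itself so the
         * completion code reports EIO instead of silently succeeding. */
        if (err)
            clear_bit(BIO_UPTODATE, &bio->bi_flags);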
    • Btrfs: don't allocate dip->csums when doing writes · 98bc3149
      Authored by Josef Bacik
      When doing direct writes we store the checksums in the ordered-sum
      structures of the ordered extent and write them out when the write
      completes, so we don't even use the dip->csums array.  So if we're
      writing, don't bother allocating dip->csums since we won't use it
      anyway.  Thanks,
      Signed-off-by: Josef Bacik <josef@redhat.com>
      98bc3149
    • Btrfs: cleanup how we setup free space clusters · 4e69b598
      Authored by Josef Bacik
      This patch makes the free space cluster refilling code a little easier to
      understand, and fixes some things with the bitmap part of it.  Currently
      we want to refill a cluster with either
      
      1) all normal extent entries (those without bitmaps), or
      2) a bitmap entry with enough space.
      
      The current code has ugly jump-around logic that first tries to fill up
      the cluster with extent entries and then, if it can't do that, tries to
      find a bitmap to use.  So instead split this out into two functions, one
      that tries to find only normal entries, and one that tries to find
      bitmaps.
      
      This also fixes something suboptimal we would do with bitmaps.  If we
      used a bitmap we would just tell the cluster that we were pointing at a
      bitmap, and it would do the tree search in the block group for that entry
      every time we tried to make an allocation.  Instead of doing that, now we
      just add it to the cluster's group.
      
      I tested this with my ENOSPC tests and xfstests and it survived.
      Signed-off-by: Josef Bacik <josef@redhat.com>
      4e69b598
  3. 21 March 2011, 3 commits
    • Btrfs: don't be as aggressive about using bitmaps · 32cb0840
      Authored by Josef Bacik
      We have been creating bitmaps for small extents unconditionally forever.
      This was great for testing that the bitmap code worked, but is overkill
      normally.  So instead of always adding small chunks of free space to
      bitmaps, only start doing so once we go past half of our extent
      threshold.  This keeps us from creating a bitmap for just one small free
      extent at the front of the block group, and makes the allocator a little
      faster as a result.  Thanks,
      (A sketch of the heuristic follows this entry.)
      Signed-off-by: Josef Bacik <josef@redhat.com>
      32cb0840
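      The heuristic, sketched with approximate field names (the real ones live
      in fs/btrfs/free-space-cache.c):

        /* Keep small free extents out of bitmaps until regular extent entries
         * have already consumed at least half of the block group's extent
         * threshold. */
        if (block_group->free_extents * 2 <= block_group->extents_thresh)
            return 0;   /* don't create or use a bitmap for this extent yet */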
    • Btrfs: deal with min_bytes appropriately when looking for a cluster · d0a365e8
      Authored by Josef Bacik
      We do all this fun stuff with min_bytes, but we either don't use it in
      the case of plain extents, or use it completely wrong in the case of
      bitmaps.  So fix this for both cases:
      
      1) In the extent case, stop looking for space once window_free >=
      min_bytes, instead of bytes + empty_size.
      
      2) In the bitmap case, we were looking for stretches of free space that
      were at least min_bytes in size, which was not right at all.  So instead
      search for stretches of free space that are at least bytes in size (this
      will make a difference when we have > page size blocks) and then only
      search for min_bytes amount of free space.
      
      Thanks,
      Reviewed-by: Li Zefan <lizf@cn.fujitsu.com>
      Signed-off-by: Josef Bacik <josef@redhat.com>
      d0a365e8
    • Btrfs: check free space in block group before searching for a cluster · 7d0d2e8e
      Authored by Josef Bacik
      The free space cluster code is heavy duty, so there is no sense in going
      through the entire song and dance if there isn't enough space in the
      block group to begin with.  Thanks,
      Signed-off-by: Josef Bacik <josef@redhat.com>
      7d0d2e8e
  4. 18 March 2011, 16 commits
  5. 12 March 2011, 1 commit
    • Btrfs: break out of shrink_delalloc earlier · 36e39c40
      Authored by Chris Mason
      Josef had changed shrink_delalloc to exit after three shrink
      attempts, which wasn't quite enough because new writers could
      race in and steal free space.
      
      But it also fixed deadlocks and stalls as we tried to recover
      delalloc reservations.  The code was tweaked to loop 1024
      times, and would reset the counter any time a small amount
      of progress was made.  This was too drastic, and with a
      lot of writers we can end up stuck in shrink_delalloc forever.
      
      The shrink_delalloc loop is fairly complex because the caller is looping
      too, and the caller will go ahead and force a transaction commit to make
      sure we reclaim space.
      
      This reworks things to exit shrink_delalloc when we've forced some
      writeback and the delalloc reservations have gone down.  This means
      the writeback has not just started but has also finished at
      least some of the metadata changes required to reclaim delalloc
      space.
      
      Even if we've got this wrong and return ENOSPC too early, that is still
      a big improvement over the current behavior of hanging the machine.
      
      Test 224 in xfstests hammers on this nicely, and with 1000 writers
      trying to fill a 1GB drive we get our first ENOSPC at 93% full.  The
      other writers are able to continue until we get 100%.
      
      This is a worst case test for btrfs because the 1000 writers are doing
      small IO, and the small FS size means we don't have a lot of room
      for metadata chunks.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      36e39c40
  6. 11 March 2011, 2 commits
  7. 09 March 2011, 1 commit
  8. 08 March 2011, 1 commit
    • Btrfs: deal with short returns from copy_from_user · 31339acd
      Authored by Chris Mason
      When copy_from_user is only able to copy some of the bytes we requested,
      we may end up creating a partially up-to-date page.  To avoid garbage in
      the page, we need to treat a partial copy as a zero-length copy.
      
      This makes the rest of the file_write code drop the page and retry the
      whole copy instead of marking the partially up-to-date page as dirty.
      (An illustrative fragment follows this entry.)
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      cc: stable@kernel.org
      31339acd
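      The rule, as a fragment of the copy loop (the surrounding context is
      simplified; the point is zeroing out a short copy into a page that was
      not already up to date):

        copied = iov_iter_copy_from_user_atomic(page, i, offset, bytes);
        flush_dcache_page(page);
        if (!PageUptodate(page) && copied < bytes)
            copied = 0;   /* drop the page and redo the whole copy, rather
                             than dirty a partially-filled page */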
  9. 07 March 2011, 1 commit
    • Btrfs: fix regressions in copy_from_user handling · b1bf862e
      Authored by Chris Mason
      Commit 914ee295 fixed deadlocks in
      btrfs_file_write where we would catch page faults on pages we had
      locked.
      
      But, there were a few problems:
      
      1) The x86-32 iov_iter_copy_from_user_atomic code always fails to copy
      data when the amount to copy is more than 4K and the offset to start
      copying from is not page aligned.  The result was btrfs_file_write
      looping forever, retrying iov_iter_copy_from_user_atomic.
      
      We deal with this by changing btrfs_file_write to drop down to single
      page copies when iov_iter_copy_from_user_atomic starts returning failure.
      
      2) The btrfs_file_write code was leaking delalloc reservations when
      iov_iter_copy_from_user_atomic returned zero.  The looping above would
      result in the entire filesystem running out of delalloc reservations and
      constantly trying to flush things to disk.
      
      3) btrfs_file_write will lock down page cache pages, make sure
      any writeback is finished, do the copy_from_user and then release them.
      Before the loop runs we check the first and last pages in the write to
      see if they are only being partially modified.  If the start or end of
      the write isn't aligned, we make sure the corresponding pages are
      up to date so that we don't introduce garbage into the file.
      
      With the copy_from_user changes, we're allowing the VM to reclaim the
      pages after a partial update from copy_from_user, but we're not
      making sure the page cache page is up to date when we loop around to
      resume the write.
      
      We deal with this by pushing the up to date checks down into the page
      prep code.  This fits better with how the rest of file_write works.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      Reported-by: Mitch Harder <mitch.harder@sabayonlinux.org>
      cc: stable@kernel.org
      b1bf862e