提交 · bd7de2c9a449e26a5493d918618eb20ae60d56bd · xiphi1978 / linux

29 8月, 2012 4 次提交

Btrfs: fix enospc problems when deleting a subvol · 5a24e84c

由 Josef Bacik 提交于 8月 08, 2012

Subvol delete is a special kind of awful where we use the global reserve to
cover the ENOSPC requirements. The problem is once we're done removing
everything we do a btrfs_update_inode(), which by default will try to do the
delayed update stuff which will use it's own reserve. There will be no
space in this reserve and we'll return ENOSPC. So instead use
btrfs_update_inode_fallback() which will just fallback to updating the inode
item in the case of enospc. This is fine because the global reserve covers
the space requirements for this. With this patch I can now delete a subvol
on a problem image Dave Sterba sent me. Thanks,
Reported-by: NDavid Sterba <dave@jikos.cz>
Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
Signed-off-by: NChris Mason <chris.mason@fusionio.com>

5a24e84c

Btrfs: barrier before waitqueue_active · 66657b31

由 Josef Bacik 提交于 8月 01, 2012

We need a barrir before calling waitqueue_active otherwise we will miss
wakeups.  So in places that do atomic_dec(); then atomic_read() use
atomic_dec_return() which imply a memory barrier (see memory-barriers.txt)
and then add an explicit memory barrier everywhere else that need them.
Thanks,
Signed-off-by: NJosef Bacik <jbacik@fusionio.com>

66657b31

Btrfs: don't allocate a seperate csums array for direct reads · c329861d

由 Josef Bacik 提交于 8月 03, 2012

We've been allocating a big array for csums instead of storing them in the
io_tree like we do for buffered reads because previously we were locking the
entire range, so we didn't have an extent state for each sector of the
range. But now that we do the range locking as we map the buffers we can
limit the mapping lenght to sectorsize and use the private part of the
io_tree for our csums. This allows us to avoid an extra memory allocation
for direct reads which could incur latency. Thanks,
Signed-off-by: NJosef Bacik <jbacik@fusionio.com>

c329861d

Btrfs: lock extents as we map them in DIO · eb838e73

由 Josef Bacik 提交于 7月 31, 2012

A deadlock in xfstests 113 was uncovered by commit

d187663e

This is because we would not return EIOCBQUEUED for short AIO reads, instead
we'd wait for the DIO to complete and then return the amount of data we
transferred, which would allow our stuff to unlock the remaning amount. But
with this change this no longer happens, so if we have a short AIO read (for
example if we try to read past EOF), we could leave the section from EOF to
the end of where we tried to read locked. Fixing this is tricky since there
is no clear way to know exactly how much data DIO truly submitted for IO, so
to make this less hard on ourselves and less combersome we need to lock the
extents as we try to map them, and then we unlock any areas we didn't
actually map. This makes us completely safe from deadlocks and reliance on
a particular behavior of the DIO code. This also lays the groundwork for
allowing us to use the normal csum storage method for reads which means we
can remove an allocation. Thanks,
Signed-off-by: NJosef Bacik <jbacik@fusionio.com>

eb838e73

26 7月, 2012 1 次提交

Btrfs: introduce subvol uuids and times · 8ea05e3a

由 Alexander Block 提交于 7月 25, 2012

This patch introduces uuids for subvolumes. Each
subvolume has it's own uuid. In case it was snapshotted,
it also contains parent_uuid. In case it was received,
it also contains received_uuid.

It also introduces subvolume ctime/otime/stime/rtime. The
first two are comparable to the times found in inodes. otime
is the origin/creation time and ctime is the change time.
stime/rtime are only valid on received subvolumes.
stime is the time of the subvolume when it was
sent. rtime is the time of the subvolume when it was
received.

Additionally to the times, we have a transid for each
time. They are updated at the same place as the times.

btrfs receive uses stransid and rtransid to find out
if a received subvolume changed in the meantime.

If an older kernel mounts a filesystem with the
extented fields, all fields become invalid. The next
mount with a new kernel will detect this and reset the
fields.
Signed-off-by: NAlexander Block <ablock84@googlemail.com>
Reviewed-by: NDavid Sterba <dave@jikos.cz>
Reviewed-by: NArne Jansen <sensille@gmx.net>
Reviewed-by: NJan Schmidt <list.btrfs@jan-o-sch.net>
Reviewed-by: NAlex Lyakas <alex.bolshoy.btrfs@gmail.com>

8ea05e3a

24 7月, 2012 6 次提交

Btrfs: zero unused bytes in inode item · 293f7e07

由 Li Zefan 提交于 7月 10, 2012

The otime field is not zeroed, so users will see random otime in an old
filesystem with a new kernel which has otime support in the future.

The reserved bytes are also not zeroed, and we'll have compatibility
issue if we make use of those bytes.
Signed-off-by: NLi Zefan <lizefan@huawei.com>

293f7e07

Btrfs: kill free_space pointer from inode structure · b4d7c3c9

由 Li Zefan 提交于 7月 09, 2012

Inodes always allocate free space with BTRFS_BLOCK_GROUP_DATA type,
which means every inode has the same BTRFS_I(inode)->free_space pointer.

This shrinks struct btrfs_inode by 4 bytes (or 8 bytes on 64 bits).
Signed-off-by: NLi Zefan <lizefan@huawei.com>

b4d7c3c9

Btrfs: kill root from btrfs_is_free_space_inode · 83eea1f1

由 Liu Bo 提交于 7月 10, 2012

Since root can be fetched via BTRFS_I macro directly, we can save an args
for btrfs_is_free_space_inode().
Signed-off-by: NLiu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: NJosef Bacik <jbacik@fusionio.com>

83eea1f1

Btrfs: fix typo in cow_file_range_async and async_cow_submit · 287082b0

由 Liu Bo 提交于 6月 28, 2012

It should be 10 * 1024 * 1024.
Signed-off-by: NLiu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: NJiri Kosina <jkosina@suse.cz>

287082b0

Btrfs: return error of btrfs_update_inode() to caller · b9959295

由 Tsutomu Itoh 提交于 6月 25, 2012

We didn't check error of btrfs_update_inode(), but that error looks
easy to bubble back up.
Reviewed-by: NDavid Sterba <dsterba@suse.cz>
Signed-off-by: NTsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: NJosef Bacik <jbacik@fusionio.com>

b9959295

Btrfs: don't update atime on RO subvolumes · 2bc55652

由 Alexander Block 提交于 6月 15, 2012

Before the update_time inode operation was indroduced, it was
not possible to prevent updates of atime on RO subvolumes. VFS
was only able to check for RO on the mount, but did not know
anything about btrfs subvolumes.

btrfs_update_time does now check if the root is RO and skip
updating of times.
Signed-off-by: NAlexander Block <ablock84@googlemail.com>

2bc55652

03 7月, 2012 2 次提交

Btrfs: fix wrong check during log recovery · 6bf02314

由 Liu Bo 提交于 6月 25, 2012

When we're evicting an inode during log recovery, we need to ensure that the inode
is not in orphan state any more, which means inode's run_time flags has _no_
BTRFS_INODE_HAS_ORPHAN_ITEM.  Thus, the BUG_ON was triggered because of a wrong
check for the flags.
Reviewed-by: NDavid Sterba <dsterba@suse.cz>
Signed-off-by: NLiu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: NJosef Bacik <jbacik@fusionio.com>

6bf02314

Btrfs: fix dio write vs buffered read race · c3473e83

由 Josef Bacik 提交于 6月 19, 2012

Miao pointed out there's a problem with mixing dio writes and buffered
reads. If the read happens between us invalidating the page range and
actually locking the extent we can bring in pages into page cache. Then
once the write finishes if somebody tries to read again it will just find
uptodate pages and we'll read stale data. So we need to lock the extent and
check for uptodate bits in the range. If there are uptodate bits we need to
unlock and invalidate again. This will keep this race from happening since
we will hold the extent locked until we create the ordered extent, and then
teh read side always waits for ordered extents. There was also a race in
how we updated i_size, previously we were relying on the generic DIO stuff
to adjust the i_size after the DIO had completed, but this happens outside
of the extent lock which means reads could come in and not see the updated
i_size. So instead move this work into where we create the extents, and
then this way the update ordered i_size stuff works properly in the endio
handlers. Thanks,
Signed-off-by: NJosef Bacik <josef@redhat.com>

c3473e83

21 6月, 2012 1 次提交

Btrfs: delay iput with async extents · cb77fcd8

由 Josef Bacik 提交于 6月 15, 2012

There is some concern that these iput()'s could be the final iputs and could
induce lockups on people waiting on writeback. This would happen in the
rare case that we don't create ordered extents because of an error, but it
is theoretically possible and we already have a mechanism to deal with this
so just make them delayed iputs to negate any worry.
Signed-off-by: NJosef Bacik <josef@redhat.com>
Signed-off-by: NChris Mason <chris.mason@fusionio.com>

cb77fcd8

15 6月, 2012 5 次提交

Btrfs: fix missing inherited flag in rename · bc178237

由 Liu Bo 提交于 6月 14, 2012

When we move a file into a directory with compression flag, we need to
inherite BTRFS_INODE_COMPRESS and clear BTRFS_INODE_NOCOMPRESS as well.
But if we move a file into a directory without compression flag, we need
to clear both of them.

It is the way how our setflags deals with compression flag, so keep
the same behaviour here.
Signed-off-by: NLiu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: NChris Mason <chris.mason@fusionio.com>

bc178237

Btrfs: call filemap_fdatawrite twice for compression · 7ddf5a42

由 Josef Bacik 提交于 6月 08, 2012

I removed this in an earlier commit and I was wrong. Because compression
can return from filemap_fdatawrite() without having actually set any of it's
pages as writeback() it can make filemap_fdatawait() do essentially nothing,
and then we won't find any ordered extents because they may not have been
created yet. So not only does this make fsync() completely useless, but it
will also screw up if you truncate on a non-page aligned offset since we
zero out the end and then wait on ordered extents and then call drop caches.
We can drop the cache before the io completes and then we try to unpin the
extent we just wrote we won't find it and everything goes sideways. So fix
this by putting it back and put a giant comment there to keep me from trying
to remove it in the future. Thanks,
Signed-off-by: NJosef Bacik <josef@redhat.com>

7ddf5a42

Btrfs: keep inode pinned when compressing writes · 8180ef88

由 Josef Bacik 提交于 6月 08, 2012

A user reported lots of problems using compression on the new code and it
turns out part of the problem was that igrab() was failing when we added a
new ordered extent. This is because when writing out an inode under
compression we immediately return without actually doing anything to the
pages, and then in another thread at some point down the line actually do
the ordered dance. The problem is between the point that we start writeback
and we actually add the ordered extent we could be trying to reclaim the
inode, which makes igrab() return NULL. So we need to do an igrab() when we
create the async extent and then drop it when we are done with it. This
makes sure we stay pinned in memory until the ordered extent can get a
reference on it and we are good to go. With this patch we no longer panic
in btrfs_finish_ordered_io(). Thanks,
Signed-off-by: NJosef Bacik <josef@redhat.com>

8180ef88

Btrfs: unlock everything properly in the error case for nocow · 17ca04af

由 Josef Bacik 提交于 5月 31, 2012

I was getting hung on umount when a transaction was aborted because a range
of one of the free space inodes was still locked. This is because the nocow
stuff doesn't unlock anything on error. This fixed the problem and I
verified that is what was happening. Thanks,
Signed-off-by: NJosef Bacik <josef@redhat.com>

17ca04af

Btrfs: pass locked_page into extent_clear_unlock_delalloc if theres an error · beb42dd7

由 Josef Bacik 提交于 5月 30, 2012

While doing my enospc work I got a transaction abortion that resulted in a
panic when we tried to unlock_page() an already unlocked page.  This is
because we aren't calling extent_clear_unlock_delalloc with the locked page
so it was unlocking all the pages in the range.  This is wrong since
__extent_writepage expects to have the page locked still unless we return
*page_started as 1.  This should keep us from panicing.  Thanks,
Signed-off-by: NJosef Bacik <josef@redhat.com>

beb42dd7

02 6月, 2012 1 次提交

Btrfs: move over to use ->update_time · e41f941a

由 Josef Bacik 提交于 3月 26, 2012

Btrfs had been doing it's own file_update_time so we could catch ENOSPC
properly, so just update our btrfs_update_time to work with the new stuff and
then we'll be fancy later.  Thanks,
Signed-off-by: NJosef Bacik <josef@redhat.com>

e41f941a

30 5月, 2012 5 次提交

Btrfs: fall back to non-inline if we don't have enough space · 2adcac1a

由 Josef Bacik 提交于 5月 23, 2012

If cow_file_range_inline fails with ENOSPC we abort the transaction which
isn't very nice. This really shouldn't be happening anyways but there's no
sense in making it a horrible error when we can easily just go allocate
normal data space for this stuff. Thanks,
Signed-off-by: NJosef Bacik <josef@redhat.com>

2adcac1a

Btrfs: fix how we deal with the orphan block rsv · 8a35d95f

由 Josef Bacik 提交于 5月 23, 2012

Ceph was hitting this race where we would remove an inode from the per-root
orphan list before we would release the space we had reserved for the inode.
We actually don't need a list or anything, we just need to make sure the
root doesn't try to free up the orphan reserve until after the inodes have
released their reservations. So use an atomic counter instead of a list on
the root and only decrement the counter after we've released our
reservation. I've tested this as well as several others and we no longer
see the warnings that you would see while running ceph. Thanks,
Btrfs: fix how we deal with the orphan block rsv

8a35d95f

Btrfs: convert the inode bit field to use the actual bit operations · 72ac3c0d

由 Josef Bacik 提交于 5月 23, 2012

Miao pointed this out while I was working on an orphan problem that messing
with a bitfield where different ranges are protected by different locks
doesn't work out right. Turns out we've been doing this forever where we
have different parts of the bit field protected by either no lock at all or
different locks which could cause all sorts of weird problems including the
issue I was hitting. So instead make a runtime_flags thing that we use the
normal bit operations on that are all atomic so we can keep having our
no/different locking for the different flags and then make force_compress
it's own thing so it can be treated normally. Thanks,
Signed-off-by: NJosef Bacik <josef@redhat.com>

72ac3c0d

Btrfs: finish ordered extents in their own thread · 5fd02043

由 Josef Bacik 提交于 5月 02, 2012

We noticed that the ordered extent completion doesn't really rely on having
a page and that it could be done independantly of ending the writeback on a
page. This patch makes us not do the threaded endio stuff for normal
buffered writes and direct writes so we can end page writeback as soon as
possible (in irq context) and only start threads to do the ordered work when
it is actually done. Compression needs to be reworked some to take
advantage of this as well, but atm it has to do a find_get_page in its endio
handler so it must be done in its own thread. This makes direct writes
quite a bit faster. Thanks,
Signed-off-by: NJosef Bacik <josef@redhat.com>

5fd02043

Btrfs: use i_version instead of our own sequence · 0c4d2d95

由 Josef Bacik 提交于 4月 05, 2012

We've been keeping around the inode sequence number in hopes that somebody
would use it, but nobody uses it and people actually use i_version which
serves the same purpose, so use i_version where we used the incore inode's
sequence number and that way the sequence is updated properly across the
board, and not just in file write. Thanks,
Signed-off-by: NJosef Bacik <josef@redhat.com>

0c4d2d95

06 5月, 2012 1 次提交

vfs: Rename end_writeback() to clear_inode() · dbd5768f

由 Jan Kara 提交于 5月 03, 2012

After we moved inode_sync_wait() from end_writeback() it doesn't make sense
to call the function end_writeback() anymore. Rename it to clear_inode()
which well says what the function really does - set I_CLEAR flag.
Signed-off-by: NJan Kara <jack@suse.cz>
Signed-off-by: NFengguang Wu <fengguang.wu@intel.com>

dbd5768f

28 4月, 2012 1 次提交

Btrfs: avoid deadlocks from GFP_KERNEL allocations during btrfs_real_readdir · fede766f

由 Chris Mason 提交于 4月 27, 2012

Btrfs has an optimization where it will preallocate dentries during
readdir to fill in enough information to open the inode without an extra
lookup.

But, we're calling d_alloc, which is doing GFP_KERNEL allocations, and
that leads to deadlocks because our readdir code has tree locks held.

For now, disable this optimization.  We'll fix the gfp mask in the next
merge window.
Signed-off-by: NChris Mason <chris.mason@oracle.com>

fede766f

19 4月, 2012 3 次提交

Btrfs: always store the mirror we read the eb from · 5cf1ab56

由 Josef Bacik 提交于 4月 16, 2012

A user reported a panic where we were trying to fix a bad mirror but the
mirror number we were giving was 0, which is invalid. This is because we
don't do the transid verification until after the read, so as far as the
read code is concerned the read was a success. So instead store the mirror
we read from so that if there is some failure post read we know which mirror
to try next and which mirror needs to be fixed if we find a good copy of the
block. Thanks,
Signed-off-by: NJosef Bacik <josef@redhat.com>

5cf1ab56

btrfs: fix race in reada · 8c9c2bf7

由 Arne Jansen 提交于 2月 25, 2012

When inserting into the radix tree returns EEXIST, get the existing
entry without giving up the spinlock in between.
There was a race for both the zones trees and the extent tree.
Signed-off-by: NArne Jansen <sensille@gmx.net>

8c9c2bf7

Btrfs: avoid setting ->d_op twice · 848cce0d

由 Li Zefan 提交于 2月 21, 2012

Follow those instructions, and you'll trigger a warning in the
beginning of d_set_d_op():

  # mkfs.btrfs /dev/loop3
  # mount /dev/loop3 /mnt
  # btrfs sub create /mnt/sub
  # btrfs sub snap /mnt /mnt/snap
  # touch /mnt/snap/sub
  touch: cannot touch `tmp': Permission denied

__d_alloc() set d_op to sb->s_d_op (btrfs_dentry_operations), and
then simple_lookup() reset it to simple_dentry_operations, which
triggered the warning.
Signed-off-by: NLi Zefan <lizf@cn.fujitsu.com>

848cce0d

29 3月, 2012 1 次提交

Btrfs: fix recursive defragment with autodefrag option · 4cb13e5d

由 Liu Bo 提交于 3月 29, 2012

$ mkfs.btrfs disk
$ mount disk /mnt -o autodefrag
$ dd if=/dev/zero of=/mnt/foobar bs=4k count=10 2>/dev/null && sync
$ for i in `seq 9 -2 0`; do dd if=/dev/zero of=/mnt/foobar bs=4k count=1 \
  seek=$i conv=notrunc 2> /dev/null; done && sync

then we'll get to defrag "foobar" again and again.
So does option "-o autodefrag,compress".

Reasons:
When the cleaner kthread gets to fetch inodes from the defrag tree and defrag
them, it will dirty pages and submit them, this will comes to another DATA COW
where the processing inode will be inserted to the defrag tree again.

This patch sets a rule for COW code, i.e. insert an inode when we're really
going to make some defragments.
Signed-off-by: NLiu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: NChris Mason <chris.mason@oracle.com>

4cb13e5d

27 3月, 2012 2 次提交

Btrfs: ensure an entire eb is written at once · 0b32f4bb

由 Josef Bacik 提交于 3月 13, 2012

This patch simplifies how we track our extent buffers. Previously we could exit
writepages with only having written half of an extent buffer, which meant we had
to track the state of the pages and the state of the extent buffers differently.
Now we only read in entire extent buffers and write out entire extent buffers,
this allows us to simply set bits in our bflags to indicate the state of the eb
and we no longer have to do things like track uptodate with our iotree. Thanks,
Signed-off-by: NJosef Bacik <josef@redhat.com>
Signed-off-by: NChris Mason <chris.mason@oracle.com>

0b32f4bb

Btrfs: remove search_start and search_end from find_free_extent and callers · 81c9ad23

由 Josef Bacik 提交于 1月 18, 2012

We have been passing nothing but (u64)-1 to find_free_extent for search_end in
all of the callers, so it's completely useless, and we've always been passing 0
in as search_start, so just remove them as function arguments and move
search_start into find_free_extent. Thanks,
Signed-off-by: NJosef Bacik <josef@redhat.com>

81c9ad23

22 3月, 2012 7 次提交

btrfs: replace many BUG_ONs with proper error handling · 79787eaa

由 Jeff Mahoney 提交于 3月 12, 2012

 btrfs currently handles most errors with BUG_ON. This patch is a work-in-
 progress but aims to handle most errors other than internal logic
 errors and ENOMEM more gracefully.

 This iteration prevents most crashes but can run into lockups with
 the page lock on occasion when the timing "works out."
Signed-off-by: NJeff Mahoney <jeffm@suse.com>

79787eaa

btrfs: Don't BUG_ON errors from btrfs_create_subvol_root() · ce598979

由 Mark Fasheh 提交于 7月 26, 2011

This is called from only one place - create_subvol() which passes errors
safely back out to it's caller, btrfs_mksubvol where they are handled.

Additionally, btrfs_create_subvol_root() itself bug's needlessly from error
return of btrfs_update_inode(). Since create_subvol() was fixed to catch
errors we can bubble this one up too.
Signed-off-by: NMark Fasheh <mfasheh@suse.com>

ce598979

btrfs: split extent_state ops · 3fbe5c02

由 Jeff Mahoney 提交于 3月 01, 2012

set_extent_bit can do exclusive locking but only when called by lock_extent*,

Drop the exclusive bits argument except when called by lock_extent.
Signed-off-by: NJeff Mahoney <jeffm@suse.com>

3fbe5c02

btrfs: drop gfp_t from lock_extent · d0082371

由 Jeff Mahoney 提交于 3月 01, 2012

 lock_extent and unlock_extent are always called with GFP_NOFS, drop the
 argument and use GFP_NOFS consistently.
Signed-off-by: NJeff Mahoney <jeffm@suse.com>

d0082371

J
btrfs: return void in functions without error conditions · 143bede5
由 Jeff Mahoney 提交于 3月 01, 2012
```
Signed-off-by: NJeff Mahoney <jeffm@suse.com>
```
143bede5

btrfs: ->submit_bio_hook error push-up · 355808c2

由 Jeff Mahoney 提交于 10月 03, 2011

This pushes failures from the submit_bio_hook callbacks,
btrfs_submit_bio_hook and btree_submit_bio_hook into the callers, including
callers of submit_one_bio where it catches the failures with BUG_ON.

It also pushes up through the ->readpage_io_failed_hook to
end_bio_extent_writepage where the error is already caught with BUG_ON.
Signed-off-by: NJeff Mahoney <jeffm@suse.com>

355808c2

btrfs: Factor out tree->ops->merge_bio_hook call · 3444a972

由 Jeff Mahoney 提交于 10月 03, 2011

In submit_extent_page, there's a visually noisy if statement that, in
the midst of other conditions, does the tree dependency for tree->ops
and tree->ops->merge_bio_hook before calling it, and then another
condition afterwards. If an error is returned from merge_bio_hook,
there's no way to catch it. It's considered a routine "1" return
value instead of a failure.

This patch factors out the dependency check into a new local merge_bio
routine and BUG's on an error. The if statement is less noisy as a side-
effect.
Signed-off-by: NJeff Mahoney <jeffm@suse.com>

3444a972