Commits · fccb84c94a9755f48668e43d0a44d6ecc750900f · openanolis / cloud-kernel

02 Oct, 2014 14 commits

btrfs: move checks for DUMMY_ROOT into a helper · fccb84c9

Committed by David Sterba on Sep 29, 2014

Signed-off-by: David Sterba <dsterba@suse.cz>

btrfs: new define for the inline extent data start · 7ec20afb

Committed by David Sterba on Jul 24, 2014

Use a common definition for the inline data start so we don't have to
open-code it and introduce bugs like the one fixed by "Btrfs: fix wrong
max inline data size limit".
Signed-off-by: David Sterba <dsterba@suse.cz>

btrfs: kill extent_buffer_page helper · fb85fc9a

Committed by David Sterba on Jul 31, 2014

It used to be more complex but now it's just a simple array access.
Signed-off-by: David Sterba <dsterba@suse.cz>

btrfs: drop constant param from btrfs_release_extent_buffer_page · a50924e3

Committed by David Sterba on Jul 31, 2014

All callers use the same value, simplify the function.
Signed-off-by: David Sterba <dsterba@suse.cz>

btrfs: hide typecast to definition of BTRFS_SEND_TRANS_STUB · 2755a0de

Committed by David Sterba on Jul 31, 2014

Signed-off-by: David Sterba <dsterba@suse.cz>

btrfs: let merge_reloc_roots return void · 94404e82

Committed by David Sterba on Jul 30, 2014

Signed-off-by: David Sterba <dsterba@suse.cz>

btrfs: remove unused members from struct scrub_warning · 8b9456da

Committed by David Sterba on Jul 30, 2014

Signed-off-by: David Sterba <dsterba@suse.cz>

btrfs: use slab for end_io_wq structures · 97eb6b69

Committed by David Sterba on Jul 30, 2014

The structure is frequently reused.  Rename it according to the slab
name.
Signed-off-by: David Sterba <dsterba@suse.cz>

btrfs: fix error labels in init_btrfs_fs · af13b492

Committed by David Sterba on Jul 30, 2014

btrfs_interface_init rarely fails but we could leak the prelim_ref slab.
Signed-off-by: David Sterba <dsterba@suse.cz>

btrfs: use enum for wq endio metadata type · bfebd8b5

Committed by David Sterba on Jul 30, 2014

The enum exists but is not consistently used.
Signed-off-by: David Sterba <dsterba@suse.cz>

btrfs: remove unused extent state bits · 01d5bc37

Committed by David Sterba on Jul 30, 2014

The last users are long gone.
Signed-off-by: David Sterba <dsterba@suse.cz>

Btrfs: set default max_inline to 8KiB instead of 8MiB · 95ac567a

Committed by Filipe David Borba Manana on Aug 08, 2013

8MiB is way too large and likely set by mistake. This is not
a significant issue as in practice the max amount of data
added to an inline extent is also limited by the page cache
and btree leaf sizes.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: David Sterba <dsterba@suse.cz>
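
A rough sketch of why the oversized default was mostly harmless, assuming the effective inline limit is simply the smallest of the mount option, the page size, and the leaf data size. The macro and function names are hypothetical, not kernel identifiers, and the sizes passed in are placeholders.

```c
#include <stdint.h>

/* Hypothetical names for illustration only. */
#define OLD_DEFAULT_MAX_INLINE (8ULL * 1024 * 1024) /* 8 MiB */
#define NEW_DEFAULT_MAX_INLINE (8ULL * 1024)        /* 8 KiB */

/*
 * Assumed model: the amount of data that can actually be inlined is
 * capped by max_inline, the page size, and the leaf data size.
 */
static uint64_t effective_max_inline(uint64_t max_inline,
				     uint64_t page_size,
				     uint64_t leaf_data_size)
{
	uint64_t limit = max_inline;

	if (page_size < limit)
		limit = page_size;
	if (leaf_data_size < limit)
		limit = leaf_data_size;
	return limit;
}
```

Under this model, both defaults give the same result whenever the page or leaf size is the tighter bound, which is consistent with the commit's remark that the mistake was not a significant issue in practice.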

btrfs: remove unused variable from btrfs_parse_options · 143f3636

Committed by David Sterba on Jul 29, 2014

Signed-off-by: David Sterba <dsterba@suse.cz>

btrfs: defrag, use unsigned type for extent thresh · aab110ab

Committed by David Sterba on Jul 29, 2014

The signed type does not match the ioctl structure, and all extent
calculations are done on unsigned types.
Signed-off-by: David Sterba <dsterba@suse.cz>
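
The hazard of mixing a signed threshold with unsigned extent lengths can be sketched as follows. The function names are hypothetical and this is not the kernel code; it only illustrates that a negative `int`, which is what a large unsigned threshold typically turns into when stored in an `int`, becomes a huge value once converted for an unsigned comparison.

```c
#include <stdint.h>

/* Comparison with the threshold held in a signed int (the mismatch). */
static int reaches_thresh_signed(uint64_t extent_len, int thresh)
{
	/* A negative thresh is converted to a huge unsigned value here. */
	return extent_len >= (uint64_t)thresh;
}

/* Comparison with the threshold kept unsigned, like the ioctl field. */
static int reaches_thresh_unsigned(uint64_t extent_len, uint32_t thresh)
{
	return extent_len >= thresh;
}
```

With the unsigned variant, a 3 GiB extent compared against a 2 GiB threshold behaves as expected, while the signed variant can never be satisfied once the stored threshold has gone negative.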

23 Sep, 2014 3 commits

Btrfs: try not to ENOSPC on log replay · 1d52c78a

Committed by Josef Bacik on Sep 18, 2014

When doing log replay we may have to update inodes, which traditionally goes
through our delayed inode stuff. This will try to move space over from the
trans handle, but we don't reserve space in our trans handle on replay since we
don't know how much we will need, so instead we try to flush. But because we
have a trans handle open we won't flush anything, so if we are out of reserve
space we will simply return ENOSPC. Since we know that if an operation made it
into the log then we definitely had space before the box bought the farm then we
don't need to worry about doing this space reservation. Use the
fs_info->log_root_recovering flag to skip the delayed inode stuff and update the
item directly. Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>

Btrfs: don't do async reclaim during log replay · f6acfd50

Committed by Josef Bacik on Sep 18, 2014

While trying to reproduce a log ENOSPC bug, I hit a panic in the async reclaim
code during log replay. This is because we use fs_info->fs_root as our root for
shrinking and such. Technically we can use whatever root we want, but let's
just not allow async reclaim while we're doing log replay. Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>

Btrfs: remove empty block groups automatically · 47ab2a6c

Committed by Josef Bacik on Sep 18, 2014

One problem that has plagued us is that a user will use up all of his space with
data, remove a bunch of that data, and then try to create a bunch of small files
and run out of space. This happens because all the chunks were allocated for
data since the metadata requirements were so low. But now there's a bunch of
empty data block groups and not enough metadata space to do anything. This
patch solves this problem by automatically deleting empty block groups. If we
notice the used count go down to 0 when deleting or on mount notice that a block
group has a used count of 0 then we will queue it to be deleted.

When the cleaner thread runs we will double check to make sure the block group
is still empty and then we will delete it. This patch has the side effect of no
longer having a bunch of BUG_ON()'s in the chunk delete code, which will be
helpful for both this and relocate. Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
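
The queue-then-recheck flow described in this commit can be modelled in a few lines. This is only a sketch with hypothetical types and names; the real code works on btrfs block group structures and runs the second step from the cleaner thread.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct demo_block_group {
	uint64_t used;
	bool queued;   /* waiting for the cleaner */
	bool removed;
};

/*
 * Free bytes from a block group (bytes must not exceed bg->used) and
 * queue the group for deletion once nothing is left in use.
 */
static void demo_free_bytes(struct demo_block_group *bg, uint64_t bytes)
{
	bg->used -= bytes;
	if (bg->used == 0)
		bg->queued = true;
}

/* Cleaner pass: re-check each queued group, delete it only if still empty. */
static size_t demo_cleaner_run(struct demo_block_group *bgs, size_t n)
{
	size_t deleted = 0;

	for (size_t i = 0; i < n; i++) {
		if (!bgs[i].queued)
			continue;
		bgs[i].queued = false;
		if (bgs[i].used == 0) {
			bgs[i].removed = true;
			deleted++;
		}
	}
	return deleted;
}
```

The second check matters because a queued block group may receive new allocations before the cleaner gets to it.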

19 Sep, 2014 2 commits

Btrfs: fix data corruption after fast fsync and writeback error · 8407f553

Committed by Filipe Manana on Sep 05, 2014

When we do a fast fsync, we start all ordered operations and then while
they're running in parallel we visit the list of modified extent maps
and construct their matching file extent items and write them to the
log btree. After that, in btrfs_sync_log() we wait for all the ordered
operations to finish (via btrfs_wait_logged_extents).

The problem with this is that we were completely ignoring errors that
can happen in the extent write path, such as -ENOSPC, a temporary -ENOMEM
or -EIO. When such an error happens, it means parts of the on-disk
extent weren't written to, and so we end up logging file extent items
that point to extents that contain garbage/random data. After a
crash/reboot plus log replay, our inode's metadata then points to those
extents.

This contrasts with the full (non-fast) fsync path, where we
start all ordered operations, wait for them to finish and then write
to the log btree. In this path, after each ordered operation completes
we check if it's flagged with an error (BTRFS_ORDERED_IOERR) and return
-EIO if so (via btrfs_wait_ordered_range).

So if an error happens with any ordered operation, just return an -EIO
error to userspace, so that it knows that not all of its previous writes
were durably persisted and the application can take proper action (like
redoing the writes, for example), and definitely do not leave any file
extent items in the log that refer to extents that were not fully written.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
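
A minimal sketch of the behaviour this commit asks for: once any ordered operation is flagged with an I/O error, the fast fsync fails with -EIO instead of reporting success. The struct and function names are hypothetical and the waiting itself is not modelled.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for an ordered operation and its error flag. */
struct demo_ordered_op {
	bool io_error;
};

/* Check every completed ordered operation; report -EIO if any failed. */
static int demo_wait_ordered_ops(const struct demo_ordered_op *ops, size_t n)
{
	int ret = 0;

	for (size_t i = 0; i < n; i++) {
		/* A real implementation would first wait for ops[i] to finish. */
		if (ops[i].io_error)
			ret = -EIO;
	}
	return ret;
}

/* Fast fsync sketch: fail rather than keep a log pointing at unwritten extents. */
static int demo_fast_fsync(const struct demo_ordered_op *ops, size_t n)
{
	int ret = demo_wait_ordered_ops(ops, n);

	if (ret)
		return ret; /* caller learns its writes were not persisted */
	return 0;
}
```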

Btrfs: fix fsync race leading to invalid data after log replay · 669249ee

Committed by Filipe Manana on Sep 02, 2014

When the fsync callback (btrfs_sync_file) starts, it first waits for
the writeback of any dirty pages to start and finish without holding
the inode's mutex (to reduce contention). After this it acquires the
inode's mutex and repeats that process via btrfs_wait_ordered_range
only if we're doing a full sync (BTRFS_INODE_NEEDS_FULL_SYNC flag
is set on the inode).

This is not safe for a non full sync - we need to start and wait for
writeback to finish for any pages that might have been made dirty
before acquiring the inode's mutex and after that first step mentioned
before. Why this is needed is explained by the following comment added
to btrfs_sync_file:

  "Right before acquiring the inode's mutex, we might have new
   writes dirtying pages, which won't immediately start the
   respective ordered operations - that is done through the
   fill_delalloc callbacks invoked from the writepage and
   writepages address space operations. So make sure we start
   all ordered operations before starting to log our inode. Not
   doing this means that while logging the inode, writeback
   could start and invoke writepage/writepages, which would call
   the fill_delalloc callbacks (cow_file_range,
   submit_compressed_extents). These callbacks add first an
   extent map to the modified list of extents and then create
   the respective ordered operation, which means in
   tree-log.c:btrfs_log_inode() we might capture all existing
   ordered operations (with btrfs_get_logged_extents()) before
   the fill_delalloc callback adds its ordered operation, and by
   the time we visit the modified list of extent maps (with
   btrfs_log_changed_extents()), we see and process the extent
   map they created. We then use the extent map to construct a
   file extent item for logging without waiting for the
   respective ordered operation to finish - this file extent
   item points to a disk location that might not have yet been
   written to, containing random data - so after a crash a log
   replay will make our inode have file extent items that point
   to disk locations containing invalid data, as we returned
   success to userspace without waiting for the respective
   ordered operation to finish, because it wasn't captured by
   btrfs_get_logged_extents()."
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>

18 Sep, 2014 21 commits

Btrfs: fix wrong parse of extent map's tracepoint · 254a2d14

Committed by Liu Bo on Sep 17, 2014

The extent map tracepoint doesn't parse @flag correctly. We set @flag via
set_bit(), so it has to be parsed bit by bit, using each flag as a bit
position.

Also add the missing flag, EXTENT_FLAG_FS_MAPPING.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
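
To illustrate the difference between a bit number and a bit mask, here is a small userspace sketch. `demo_set_bit()` is a stand-in for the kernel's `set_bit()`, and the enum values are illustrative rather than copied from the btrfs headers.

```c
/* Bit numbers in the style of EXTENT_FLAG_*; the values are illustrative. */
enum {
	DEMO_FLAG_PINNED = 0,
	DEMO_FLAG_COMPRESSED = 1,
	DEMO_FLAG_PREALLOC = 3,
};

/* Userspace stand-in for set_bit(): sets bit number nr in *flags. */
static void demo_set_bit(int nr, unsigned long *flags)
{
	*flags |= 1UL << nr;
}

/* Wrong: treats the bit number itself as a mask. */
static int demo_test_as_mask(unsigned long flags, int nr)
{
	return (flags & (unsigned long)nr) != 0;
}

/* Right: converts the bit number into a mask before testing. */
static int demo_test_as_bit(unsigned long flags, int nr)
{
	return (flags & (1UL << nr)) != 0;
}
```

Decoding with the bit number as a mask both misses flags that are set and reports flags that are not, which is the kind of wrong output a tracepoint would show.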

btrfs: Fix and enhance merge_extent_mapping() to insert best fitted extent map · e6c4efd8

Committed by Qu Wenruo on Sep 17, 2014

The following commit enhanced merge_extent_mapping() to reduce
fragmentation in the extent map tree, but it can't handle the case where
the existing extent map lies before map_start:
51f39 btrfs: Use right extent length when inserting overlap extent map.

[BUG]
When the existing extent map's start is before map_start, em->len
becomes negative, which corrupts the extent map and makes inserting the
new extent map fail.
This happens when someone gets a large extent map, but by the time it is
about to be inserted into the extent map tree, someone else has already
committed some writes and split the huge extent into small parts.

[REPRODUCER]
It is very easy to trigger using filebench with the randomrw personality.
It reproduces almost 100% of the time when using an 8G preallocated file
in a 60s randomrw test.

[FIX]
This patch can now handle any existing extent position.
Since it does not directly use existing->start, it now finds the
previous and next extents around map_start.
So the old existing->start < map_start bug will never happen again.

[ENHANCE]
This patch inserts the best fitted extent map into the extent map tree,
rather than the oldest [map_start, map_start + sectorsize) or the
relatively newer but not perfect [map_start, existing->start).

The patch first searches for an existing extent that does not intersect
with the desired map range [map_start, map_start + len).
The existing extent will be either before or behind map_start, and based
on the existing extent, we can find the previous and next extents
around map_start.

So the best fitted extent would be [prev->end, next->start).
If prev or next is not found, em->start would be prev->end and em->end
would be next->start.

With this patch, fragmentation in the extent map tree should be reduced
much more than with the 51f39 commit, and an unneeded extent map tree
search is avoided.
Reported-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
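
The "best fitted" range [prev->end, next->start) can be sketched as below. This is not the kernel's merge_extent_mapping(); the type and function names are hypothetical, and the sketch assumes that a missing neighbour falls back to the new extent map's own boundary and that the result is clamped to the range the new extent map covers.

```c
#include <stddef.h>
#include <stdint.h>

/* Half-open byte range [start, end); a stand-in for an extent map. */
struct demo_range {
	uint64_t start;
	uint64_t end;
};

/* Compute the widest gap between prev and next that em can fill. */
static struct demo_range demo_best_fit(struct demo_range em,
				       const struct demo_range *prev,
				       const struct demo_range *next)
{
	struct demo_range fit;

	fit.start = prev ? prev->end : em.start;
	fit.end = next ? next->start : em.end;
	/* Never extend beyond what the new extent map actually covers. */
	if (fit.start < em.start)
		fit.start = em.start;
	if (fit.end > em.end)
		fit.end = em.end;
	return fit;
}
```

For a new map covering [0, 1 MiB) with a previous extent ending at 4096 and a next extent starting at 65536, the sketch yields [4096, 65536).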

Btrfs: fix up bounds checking in lseek · 4d1a40c6

Committed by Liu Bo on Sep 16, 2014

A user reported the warning below. lseek's SEEK_SET/SEEK_CUR/SEEK_END
allow a negative value for @offset, but btrfs's SEEK_DATA/SEEK_HOLE handling
doesn't account for that and converts the negative @offset into an unsigned
type, so we get an (end < start) warning.

[ 1269.835374] ------------[ cut here ]------------
[ 1269.836809] WARNING: CPU: 0 PID: 1241 at fs/btrfs/extent_io.c:430 insert_state+0x11d/0x140()
[ 1269.838816] BTRFS: end < start 4094 18446744073709551615
[ 1269.840334] CPU: 0 PID: 1241 Comm: a.out Tainted: G        W      3.16.0+ #306
[ 1269.858229] Call Trace:
[ 1269.858612]  [<ffffffff81801a69>] dump_stack+0x4e/0x68
[ 1269.858952]  [<ffffffff8107894c>] warn_slowpath_common+0x8c/0xc0
[ 1269.859416]  [<ffffffff81078a36>] warn_slowpath_fmt+0x46/0x50
[ 1269.859929]  [<ffffffff813b0fbd>] insert_state+0x11d/0x140
[ 1269.860409]  [<ffffffff813b1396>] __set_extent_bit+0x3b6/0x4e0
[ 1269.860805]  [<ffffffff813b21c7>] lock_extent_bits+0x87/0x200
[ 1269.861697]  [<ffffffff813a5b28>] btrfs_file_llseek+0x148/0x2a0
[ 1269.862168]  [<ffffffff811f201e>] SyS_lseek+0xae/0xc0
[ 1269.862620]  [<ffffffff8180b212>] system_call_fastpath+0x16/0x1b
[ 1269.862970] ---[ end trace 4d33ea885832054b ]---

This patch assumes that btrfs should start searching for DATA/HOLE from the
beginning of the file if the given @offset is negative.

It also aligns the range passed to lock_extent_bits.
Reported-by: Toralf Förster <toralf.foerster@gmx.de>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
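
The start value 18446744073709551615 in the warning above is what -1 becomes when reinterpreted as an unsigned 64-bit integer. A minimal sketch of the conversion problem, plus clamping and sector alignment, follows. The helper names are hypothetical and the sector size is assumed to be a power of two; this is not the actual patch.

```c
#include <stdint.h>

/* Reinterpreting a negative offset as unsigned yields a huge start. */
static uint64_t demo_start_unchecked(int64_t offset)
{
	return (uint64_t)offset;
}

/* Clamp negative offsets to the beginning of the file. */
static uint64_t demo_start_clamped(int64_t offset)
{
	return offset < 0 ? 0 : (uint64_t)offset;
}

/* Align down to a power-of-two sector size. */
static uint64_t demo_round_down(uint64_t x, uint64_t sectorsize)
{
	return x & ~(sectorsize - 1);
}

/* Align up to a power-of-two sector size. */
static uint64_t demo_round_up(uint64_t x, uint64_t sectorsize)
{
	return (x + sectorsize - 1) & ~(sectorsize - 1);
}
```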

Btrfs: cleanup the read failure record after write or when the inode is freeing · f612496b

Committed by Miao Xie on Sep 12, 2014

After the data is written successfully, we should clean up the read failure record
in that range because:
- If data COW is set for the file, the range that the failure record pointed to is
  mapped to a new place, so the record is invalid.
- If data COW is not set for the file and there is no error during writing,
  the corrupted data is corrected, so the failure record can be removed. And if
  some errors happen on the mirrors, we needn't worry about them either, because
  the failure record will be recreated if we read the same place again.

Sometimes we may fail to correct the data, so the failure records are left
in the tree. We need to free them when we free the inode, or memory leaks.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>

Btrfs: implement repair function when direct read fails · 8b110e39

Committed by Miao Xie on Sep 12, 2014

This patch implements the data repair function for failed direct reads.

The details of the implementation are:
- When we find the data is not right, we try to read the data from another
  mirror.
- When the io on the mirror ends, we insert the endio work into a
  dedicated btrfs workqueue, not the common read endio workqueue, because the
  original endio work is still blocked in the btrfs endio workqueue. If we
  inserted the endio work of the io on the mirror into that workqueue, a
  deadlock would happen.
- After we get the right data, we write it back to the corrupted mirror.
- If the data on the new mirror is still corrupted, we try the next
  mirror until we read the right data or all the mirrors are traversed.
- After the above work, we set the uptodate flag according to the result.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
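
The mirror retry loop from the list above can be modelled roughly as follows. This is a simplified sketch with hypothetical names: each mirror holds a single value, "checksum ok" just means the value matches the expected one, and the dedicated workqueue is not modelled at all.

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for checksum verification of the data read from a mirror. */
static bool demo_csum_ok(uint32_t data, uint32_t expected)
{
	return data == expected;
}

/*
 * Try the other mirrors in turn after mirror `failed` returned bad data.
 * On success, copy the good data back over the corrupted mirror and
 * return true (the "uptodate" result); return false if all are bad.
 */
static bool demo_repair_read(uint32_t *mirrors, int num_mirrors, int failed,
			     uint32_t expected)
{
	for (int step = 1; step < num_mirrors; step++) {
		int m = (failed + step) % num_mirrors;

		if (demo_csum_ok(mirrors[m], expected)) {
			mirrors[failed] = mirrors[m]; /* write back */
			return true;
		}
	}
	return false;
}
```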

Btrfs: Set real mirror number for read operation on RAID0/5/6 · 28e1cc7d

Committed by Miao Xie on Sep 12, 2014

We need the real mirror number for RAID0/5/6 when reading data. Otherwise,
if a read error happens, we would pass 0 as the number of the mirror on which
the io error happened. That is wrong and would cause the filesystem to read
the data from the corrupted mirror again.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>

Btrfs: modify clean_io_failure and make it suit direct io · 1203b681

Committed by Miao Xie on Sep 12, 2014

We could not use clean_io_failure in the direct IO path because it got the
filesystem information from the page structure, but the page in the direct
IO bio didn't have the filesystem information in its structure. So we need
to modify it and pass all the information it needs via parameters.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>

Btrfs: modify repair_io_failure and make it suit direct io · ffdd2018

Committed by Miao Xie on Sep 12, 2014

The original code of repair_io_failure was only used for buffered reads,
because it got some filesystem data from the page structure, which is safe
for a page in the page cache. But when we do a direct read, the pages in the
bio are not in the page cache, that is, there is no filesystem data in the
page structure. In order to implement direct read data repair, we need to
modify repair_io_failure and pass all the filesystem data it needs via
function parameters.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>

Btrfs: split bio_readpage_error into several functions · 2fe6303e

Committed by Miao Xie on Sep 12, 2014

The data repair function for direct reads will be implemented later, and some
code in bio_readpage_error will be reused, so split bio_readpage_error into
several functions that the direct read repair can use later.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>

Btrfs: Cleanup unused variant and argument of IO failure handlers · 454ff3de

Committed by Miao Xie on Sep 12, 2014

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>

Btrfs: fix missing error handler if submiting re-read bio fails · 6c387ab2

Committed by Miao Xie on Sep 12, 2014

We forgot to free the failure record and bio after submitting the re-read bio
failed. Fix it.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>

Btrfs: do file data check by sub-bio's self · c1dc0896

Committed by Miao Xie on Sep 12, 2014

Direct IO splits the original bio into several sub-bios because of the raid
stripe limit, and the filesystem waits for all sub-bios and then runs the final
end io process.

But it was very hard to implement data repair when a dio read failure happens,
because in the final end io function we didn't know which mirror the data was
read from. So in order to implement the data repair, we have to move the file data
check from the final end io function to the sub-bio end io function, in which we
can get the mirror number of the device we access. This patch does this work as the
first step of the direct io data repair implementation.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>

Btrfs: cleanup similar code of the buffered data data check and dio read data check · dc380aea

Committed by Miao Xie on Sep 12, 2014

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>

Btrfs: load checksum data once when submitting a direct read io · 23ea8e5a

Committed by Miao Xie on Sep 12, 2014

The current code loads checksum data several times when we split
a whole direct read io because of the raid stripe limit, which makes us
search the csum tree several times. In fact, this just wastes time and
makes contention on the csum tree root more serious. This patch
improves the situation by loading the data once.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>

Btrfs: modify rw_devices counter under chunk_mutex context · c3929c36

Committed by Miao Xie on Sep 03, 2014

The rw_devices counter is often used to tune the profile when doing chunk
allocation, so we should modify it under the chunk_mutex context to avoid
getting the wrong chunk profile.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>

Btrfs: move the missing device to its own fs device list · 5f375835

Committed by Miao Xie on Sep 03, 2014

For a missing device, we don't know which fs it belongs to before we read its
fsid from the chunk tree. So we add it to the current fs device list at first.
When we get its fsid, we should move it to its own fs device list.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>

Btrfs: stop mounting the fs if the non-ENOENT errors happen when opening seed fs · 416d7b80

Committed by Miao Xie on Sep 03, 2014

When we open a seed filesystem, if the degraded mount option is set, we continue to
mount the fs if we don't find some devices in the seed filesystem. But we should stop
mounting if other errors happen. Fix it.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>

Btrfs: make the logic of source device removing more clear · 82372bc8

Committed by Miao Xie on Sep 03, 2014

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>

Btrfs: fix use-after-free problem of the device during device replace · 67a2c45e

Committed by Miao Xie on Sep 03, 2014

The problem is:
	Task0(device scan task)		Task1(device replace task)
	scan_one_device()
	mutex_lock(&uuid_mutex)
	device = find_device()
					mutex_lock(&device_list_mutex)
					lock_chunk()
					rm_and_free_source_device
					unlock_chunk()
					mutex_unlock(&device_list_mutex)
	check device

Destroying the target device if device replace fails has the same problem.

We fix this problem by holding uuid_mutex while destroying the source device or
target device, just like the device remove operation.

It is a temporary solution; we can fix this problem and make the code
clearer with an atomic counter in the future.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>

Btrfs: fix unprotected device list access when cloning fs devices · adbbb863

Committed by Miao Xie on Sep 03, 2014

We can build a new filesystem based on a seed filesystem, and we need to
clone the fs devices when we open the new filesystem. But someone might clear
the seed flag of the seed filesystem, then mount that filesystem and
remove some device. If we mount the new filesystem, we might access
a device list that is being changed while we clone the fs devices.
Fix it.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>

Btrfs: Fix misuse of chunk mutex · 2196d6e8

Committed by Miao Xie on Sep 03, 2014

There were several problems with chunk mutex usage:
- Locking the chunk mutex when updating metadata. It would cause a nested
  deadlock because updating metadata might need to allocate new chunks,
  which requires acquiring the chunk mutex. We remove the chunk mutex in
  this case, because the b-tree lock and other lock mechanisms can help us.
- An ABBA deadlock occurred between device_list_mutex and chunk_mutex.
  When we update device status, we must acquire device_list_mutex at the
  beginning, and then we might get chunk_mutex during the device status
  update because we need to allocate new chunks for metadata COW. But in
  most places, we acquire chunk_mutex first and then acquire the device list
  mutex. We need to change the lock order.
- In some places we needn't acquire chunk_mutex. For example, we needn't get
  chunk_mutex when we free an empty seed fs_devices structure.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
