Commits · b2117a39fa96cf4814e7cab8c11494149ba6f29d · xiphi1978 / linux

17 Jan, 2011 8 commits

btrfs: make the chunk allocator utilize the devices better · b2117a39

Authored by Miao Xie on Jan 05, 2011

This patch changes how the chunk allocator behaves when it cannot get enough
free extents of the default size.

Implementation:
1. Look up a suitable free extent on each device and keep the search result.
   If no suitable free extent is found, keep the largest free extent.
2. If we get enough suitable free extents of the default size, chunk allocation
   succeeds.
3. If we cannot get enough free extents, but the number of extents of the
   default size is >= min_stripes, we just change the mapping information
   (reduce the number of stripes in the extent map), and chunk allocation
   succeeds.
4. If the number of extents of the default size is < min_stripes, sort the
   devices by the size of their largest free extent, in descending order.
5. Use the size of the largest free extent on the (num_stripes - 1)th device as
   the stripe size to allocate the device space.

This way, the chunk allocator can allocate chunks that are as large as possible
when device space is short, and make full use of the devices.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
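The five implementation steps can be modelled in a few lines. This is only a sketch in Python, not the kernel code: `pick_stripes` and its arguments are hypothetical names, each device is reduced to the size of its largest free extent, and capping the stripe count at the number of devices that still have free space is an assumption of this model.

```python
def pick_stripes(max_free, default_size, num_stripes, min_stripes):
    """Toy model of the allocation steps.

    max_free[i] is the largest free extent on device i (hypothetical input).
    Returns (stripe_size, stripes_used), or None when allocation must fail.
    """
    # Steps 1-2: count devices whose largest free extent fits the default size.
    fitting = [size for size in max_free if size >= default_size]
    if len(fitting) >= num_stripes:
        return default_size, num_stripes
    # Step 3: fewer stripes than requested, but still at least min_stripes.
    if len(fitting) >= min_stripes:
        return default_size, len(fitting)
    # Step 4: rank devices by their largest free extent, descending.
    usable = sorted((size for size in max_free if size > 0), reverse=True)
    stripes = min(num_stripes, len(usable))  # assumption of this sketch
    if stripes < min_stripes:
        return None
    # Step 5: the smallest extent among the chosen devices sets the stripe size.
    return usable[stripes - 1], stripes
```

With three devices whose largest free extents are 100, 40 and 30 units and a default size of 64, the model falls through to steps 4 and 5 and shrinks the stripe size instead of failing.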

btrfs: restructure find_free_dev_extent() · 7bfc837d

Authored by Miao Xie on Jan 05, 2011

- Make it return the start position and length of the max free space when it
  cannot find a suitable free space.
- Make it more readable.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>

btrfs: fix wrong calculation of stripe size · 1974a3b4

Authored by Miao Xie on Jan 05, 2011

There are two tiny problems:
- One is that when we check whether the chunk size is greater than the max
  chunk size, we should take mirrors into account, but the original code didn't.
- The other is that btrfs shouldn't use the size of the residual free space as
  the length of a dup chunk when doing chunk allocation. The device space that
  a dup chunk needs is twice as large as the chunk size, so if we use the size
  of the residual free space as the length of a dup chunk, we cannot get enough
  free space. Fix it.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
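The dup case comes down to simple arithmetic. The sketch below is an illustration only: `dup_chunk_length` is a hypothetical helper, and sizes are plain integers rather than the kernel's byte counts and alignment rules.

```python
def dup_chunk_length(residual_free, max_chunk_size, copies=2):
    """Largest chunk length that still fits when each byte is stored `copies`
    times on the same device (2 for a dup chunk)."""
    # The device must hold `copies` stripes, so only a fraction of the
    # residual free space can become chunk length.
    return min(residual_free // copies, max_chunk_size)
```

Using the residual free space itself as the chunk length would require twice that much device space, which is exactly the failure the commit describes.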

btrfs: try to reclaim some space when chunk allocation fails · d52a5b5f

Authored by Miao Xie on Jan 05, 2011

We cannot write data into files when there is very little space left in the
filesystem.

Reproduce steps:
 # mkfs.btrfs /dev/sda1
 # mount /dev/sda1 /mnt
 # dd if=/dev/zero of=/mnt/tmpfile0 bs=4K count=1
 # dd if=/dev/zero of=/mnt/tmpfile1 bs=4K count=99999999999999
   (fill the filesystem)
 # umount /mnt
 # mount /dev/sda1 /mnt
 # rm -f /mnt/tmpfile0
 # dd if=/dev/zero of=/mnt/tmpfile0 bs=4K count=1
   (fails with ENOSPC)

But if we do the last step again, we can write data successfully. The cause of
the problem is that btrfs didn't try to commit the current transaction and
reclaim some space when chunk allocation failed.

This patch fixes it by committing the current transaction to reclaim some
space when chunk allocation fails.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
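The retry idea can be sketched as follows. This is a hedged Python model, not the kernel implementation: `NoSpace`, `allocate` and `commit_transaction` are hypothetical stand-ins for the ENOSPC error, the chunk allocation call and the transaction commit.

```python
class NoSpace(Exception):
    """Stand-in for an ENOSPC failure (hypothetical name)."""


def allocate_with_reclaim(allocate, commit_transaction):
    """Try to allocate; on failure commit once to reclaim space, then retry."""
    try:
        return allocate()
    except NoSpace:
        # Committing may release space still pinned by the running transaction.
        commit_transaction()
        # A second failure is a real out-of-space condition and propagates.
        return allocate()
```

This mirrors the observation in the message that simply repeating the last `dd` succeeded: the space was reclaimable, but nothing forced the reclaim on the first attempt.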

btrfs: fix wrong data space statistics · 299a08b1

Authored by Miao Xie on Jan 05, 2011

Josef has implemented mixed data/metadata chunks, so we must count the space of
those chunks just as we do for data chunks.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>

fs/btrfs: Fix build of ctree · f580eb09

Authored by Stefan Schmidt on Jan 12, 2011

CC [M]  fs/btrfs/ctree.o
In file included from fs/btrfs/ctree.c:21:0:
fs/btrfs/ctree.h:1003:17: error: field 'super_kobj' has incomplete type
fs/btrfs/ctree.h:1074:17: error: field 'root_kobj' has incomplete type
make[2]: *** [fs/btrfs/ctree.o] Error 1
make[1]: *** [fs/btrfs] Error 2
make: *** [fs] Error 2

We need to include kobject.h here.
Reported-by: Jeff Garzik <jeff@garzik.org>
Fix-suggested-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Stefan Schmidt <stefan@datenfreihafen.org>
Signed-off-by: Chris Mason <chris.mason@oracle.com>

Merge branch 'lzo-support' of git://repo.or.cz/linux-btrfs-devel into btrfs-38 · f892436e
Authored by Chris Mason on Jan 16, 2011

Merge branch 'readonly-snapshots' of git://repo.or.cz/linux-btrfs-devel into btrfs-38 · 26c79f6b
Authored by Chris Mason on Jan 16, 2011

05 Jan, 2011 1 commit

Btrfs: fix off by one while setting block groups readonly · 65e5341b

Authored by Chris Mason on Dec 24, 2010

When we read in block groups, we'll set non-redundant groups
readonly if we find a raid1, DUP or raid10 group.  But the
ro code has an off by one bug in the math around testing to
make sure our accounting doesn't go wrong.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

23 Dec, 2010 3 commits

Btrfs: Add BTRFS_IOC_SUBVOL_GETFLAGS/SETFLAGS ioctls · 0caa102d

Authored by Li Zefan on Dec 20, 2010

This allows us to set a snapshot or a subvolume readonly or writable
on the fly.

Usage:

Set BTRFS_SUBVOL_RDONLY of btrfs_ioctl_vol_arg_v2->flags, and then
call ioctl(BTRFS_IOC_SUBVOL_SETFLAGS);

Changelog for v3:

- Change to pass __u64 as ioctl parameter.

Changelog for v2:

- Add _GETFLAGS ioctl.
- Check if the passed fd is the root of a subvolume.
- Change the name from _SNAP_SETFLAGS to _SUBVOL_SETFLAGS.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
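The get/modify/set flow that these two ioctls enable might look like the sketch below. It is a model only: `get_flags` and `set_flags` are hypothetical callables standing in for the GETFLAGS and SETFLAGS ioctls, and the bit value of `BTRFS_SUBVOL_RDONLY` is an assumption recalled from the btrfs UAPI header that should be checked against `linux/btrfs.h`.

```python
# Assumption: bit value as recalled from the btrfs UAPI header; verify it.
BTRFS_SUBVOL_RDONLY = 1 << 1


def set_readonly(get_flags, set_flags, readonly):
    """Read the current flags, flip only the read-only bit, write them back."""
    flags = get_flags()
    if readonly:
        new_flags = flags | BTRFS_SUBVOL_RDONLY
    else:
        new_flags = flags & ~BTRFS_SUBVOL_RDONLY
    if new_flags != flags:
        set_flags(new_flags)  # skip the write when nothing changes
    return new_flags
```

Reading the flags first keeps any other bits intact, which is presumably why the v2 changelog added the _GETFLAGS ioctl alongside _SETFLAGS.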

Btrfs: Add readonly snapshots support · b83cc969

Authored by Li Zefan on Dec 20, 2010

Usage:

Set BTRFS_SUBVOL_RDONLY of btrfs_ioctl_vol_arg_v2->flags, and call
ioctl(BTRFS_IOC_SNAP_CREATE_V2).

Implementation:

- Set readonly bit of btrfs_root_item->flags.
- Add readonly checks in btrfs_permission (inode_permission),
btrfs_setattr, btrfs_set/remove_xattr and some ioctls.

Changelog for v3:

- Eliminate btrfs_root->readonly, but check btrfs_root->root_item.flags.
- Rename BTRFS_ROOT_SNAP_RDONLY to BTRFS_ROOT_SUBVOL_RDONLY.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>

Btrfs: Refactor btrfs_ioctl_snap_create() · fa0d2b9b

Authored by Li Zefan on Dec 20, 2010

Split it into two functions for two different ioctls, since they
share no common code.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>

22 Dec, 2010 6 commits

btrfs: Extract duplicate decompress code · 3a39c18d

Authored by Li Zefan on Nov 08, 2010

Add a common function to copy decompressed data from working buffer
to bio pages.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>

btrfs: Allow to specify compress method when defrag · 1a419d85

Authored by Li Zefan on Oct 25, 2010

Update defrag ioctl, so one can choose lzo or zlib when turning
on compression in defrag operation.

Changelog:

v1 -> v2
- Add incompatibility flag.
- Check for an invalid compress type.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>

btrfs: Add lzo compression support · a6fa6fae

Authored by Li Zefan on Oct 25, 2010

Lzo is a much faster compression algorithm than gzip, so it would allow
more users to enable transparent compression, and users can choose
between compression ratio and speed for different applications.

Usage:

 # mount -t btrfs -o compress[=<zlib,lzo>] dev /mnt
or
 # mount -t btrfs -o compress-force[=<zlib,lzo>] dev /mnt

"-o compress" without argument is still allowed for compatibility.

Compatibility:

If we mount a filesystem with lzo compression, it can no longer be
mounted by old kernels. One reason is that an old kernel would otherwise
hand compressed data stored in inline extents directly to the user.

Performance:

The test copied a linux source tarball (~400M) from an ext4 partition
to the btrfs partition, and then extracted it.

(time in seconds)
           lzo        zlib        nocompress
copy:      10.6       21.7        14.9
extract:   70.1       94.4        66.6

(data size in MB)
           lzo        zlib        nocompress
copy:      185.87     108.69      394.49
extract:   193.80     132.36      381.21

Changelog:

v1 -> v2:
- Select LZO_COMPRESS and LZO_DECOMPRESS in btrfs Kconfig.
- Add incompatibility flag.
- Fix error handling in compress code.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
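The size table in this commit can be turned into compression ratios with a short calculation. The numbers below are copied from the table; the `ratio` helper and the dictionary layout are only an illustration.

```python
# Data sizes in MB, copied from the table in the commit message.
sizes_mb = {
    "copy": {"lzo": 185.87, "zlib": 108.69, "nocompress": 394.49},
    "extract": {"lzo": 193.80, "zlib": 132.36, "nocompress": 381.21},
}


def ratio(test, method):
    """Compressed size as a fraction of the uncompressed size for one test."""
    return sizes_mb[test][method] / sizes_mb[test]["nocompress"]


for test in sizes_mb:
    for method in ("lzo", "zlib"):
        print(f"{test:8s} {method:5s} {ratio(test, method):.3f}")
```

For the copy test this works out to roughly 47% of the original size with lzo and roughly 28% with zlib, which is the ratio-versus-speed trade-off the message refers to.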

btrfs: Allow to add new compression algorithm · 261507a0

Authored by Li Zefan on Dec 17, 2010

Make the code aware of compression type, instead of always assuming
zlib compression.

Also make the zlib workspace function as common code for all
compression types.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>

btrfs: Fix error handling in zlib · 4b72029d

Authored by Li Zefan on Nov 09, 2010

Return failure if alloc_page() fails to allocate memory,
and the upper code will just give up compression.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>

btrfs: Fix bugs in zlib workspace · 8844355d

Authored by Li Zefan on Oct 25, 2010

- Fix a race that can result in alloc_workspace > cpus.
- Check num_workspace again after wakeup.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
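Both fixes follow a general pattern for bounded pools: reserve the slot while holding the lock, and re-check the conditions after every wakeup. The sketch below shows that pattern with Python's `threading.Condition`; `WorkspacePool` and its fields are hypothetical and do not mirror the kernel data structures.

```python
import threading


class WorkspacePool:
    """Bounded pool: never creates more than `limit` workspaces."""

    def __init__(self, limit):
        self.limit = limit
        self.idle = []
        self.allocated = 0
        self.cond = threading.Condition()

    def get(self):
        with self.cond:
            while True:
                if self.idle:
                    return self.idle.pop()
                if self.allocated < self.limit:
                    # Reserve the slot under the lock so two callers cannot
                    # both pass the limit check.
                    self.allocated += 1
                    return object()
                # After a wakeup the loop re-checks both conditions instead of
                # assuming a workspace is available.
                self.cond.wait()

    def put(self, workspace):
        with self.cond:
            self.idle.append(workspace)
            self.cond.notify()
```

In the commit the limit corresponds to the number of CPUs; here it is just a constructor argument.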

14 Dec, 2010 3 commits

Btrfs: prevent RAID level downgrades when space is low · 83a50de9

Authored by Chris Mason on Dec 13, 2010

The extent allocator has code that allows us to fill
allocations from any available block group, even if it doesn't
match the raid level we've requested.

This was put in because adding a new drive to a filesystem
made with the default mkfs options actually upgrades the metadata from
single spindle dup to full RAID1.

But, the code also allows us to allocate from a raid0 chunk when we
really want a raid1 or raid10 chunk.  This can cause big trouble because
mkfs creates a small (4MB) raid0 chunk for data and metadata which then
goes unused for raid1/raid10 installs.

The allocator will happily wander in and allocate from that chunk when
things get tight, which is not correct.

The fix here is to make sure that we provide duplication when the
caller has asked for it.  It does allow the dups to be any raid level,
which preserves the dup->raid1 upgrade abilities.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
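The rule described in the last paragraph can be expressed as a small predicate. This is a sketch under stated assumptions: the flag values are illustrative bits chosen for the example, not the kernel's block group constants, and `block_group_usable` is a hypothetical name.

```python
# Illustrative bit values only; the kernel defines its own constants.
RAID0, RAID1, DUP, RAID10 = 1, 2, 4, 8
REDUNDANT = RAID1 | DUP | RAID10


def block_group_usable(group_flags, wanted_flags):
    """A block group may serve a request unless the caller asked for
    duplication and the group provides none."""
    if wanted_flags & REDUNDANT and not group_flags & REDUNDANT:
        return False
    return True
```

A request for raid1 can still be served from a DUP group, which keeps the dup->raid1 upgrade path, but it can no longer fall back to a raid0 chunk.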

Btrfs: account for missing devices in RAID allocation profiles · cd02dca5

Authored by Chris Mason on Dec 13, 2010

When we mount in RAID degraded mode without adding a new device to
replace the failed one, we can end up using the wrong RAID flags for
allocations.

This results in strange combinations of block groups (raid1 in a raid10
filesystem) and corruptions when we try to allocate blocks from single
spindle chunks on drives that are actually missing.

The first device has two small 4MB chunks in it that mkfs creates and
these are usually unused in a raid1 or raid10 setup.  But, in -o degraded,
the allocator will fall back to these because the mask of desired raid groups
isn't correct.

The fix here is to count the missing devices as we build up the list
of devices in the system.  This count is used when picking the
raid level to make sure we continue using the same levels that were
in place before we lost a drive.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
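The effect of counting missing devices can be illustrated with a toy profile filter. Everything here is an assumption made for the example: the function name, the string profile names, and the thresholds (more than one device for raid0/raid1, at least four for raid10) are a simplification and not the kernel logic.

```python
def allowed_profiles(wanted, rw_devices, missing_devices):
    """Drop profiles that the device count cannot support.

    Missing devices are counted so a degraded mount keeps its old profiles.
    """
    num_devices = rw_devices + missing_devices
    allowed = set(wanted)
    if num_devices == 1:
        allowed -= {"raid0", "raid1"}
    if num_devices < 4:
        allowed.discard("raid10")
    return allowed
```

With three writable devices and one missing, raid10 stays allowed; if the missing device were ignored, the same call would drop raid10 and fall back to something else.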

Btrfs: EIO when we fail to read tree roots · 68433b73

Authored by Chris Mason on Dec 13, 2010

If we just get a plain IO error when we read tree roots, the code
wasn't properly sending that error up the chain.  This allowed mounts to
continue when they should have failed, and allowed operations
on partially set up root structs.  The end result was usually oopsen
on spinlocks that hadn't been spun up correctly.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

11 Dec, 2010 7 commits

Btrfs: fix compiler warnings · 3dd1462e

Authored by Jan Beulich on Dec 07, 2010

... regarding an unused function when !MIGRATION, and regarding a
printk() format string vs argument mismatch.
Signed-off-by: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>

Btrfs: Make async snapshot ioctl more generic · fdfb1e4f

Authored by Li Zefan on Dec 10, 2010

If we had reserved some bytes in struct btrfs_ioctl_vol_args, we
wouldn't have to create a new structure for async snapshot creation.

Here we convert async snapshot ioctl to use a more generic ABI, as
we'll add more ioctls for snapshots/subvolumes in the future, readonly
snapshots for example.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>

Btrfs: pwrite blocked when writing from the mmaped buffer of the same page · 914ee295

Authored by Xin Zhong on Dec 09, 2010

This problem was found in MeeGo testing:
http://bugs.meego.com/show_bug.cgi?id=6672
A file in btrfs is mmaped and the mmaped buffer is passed to pwrite to write to
the same page of the same file. In btrfs_file_aio_write(), the pages are locked
by prepare_pages(). So when btrfs_copy_from_user() is called, a page fault
happens and the same page needs to be locked again in filemap_fault(). The fix
is to move iov_iter_fault_in_readable() before prepare_pages() so that the page
fault happens before the pages are locked, and also to disable page faults in
the critical region in btrfs_copy_from_user().

Reviewed-by: Yan, Zheng <zheng.z.yan@intel.com>
Signed-off-by: Zhong, Xin <xin.zhong@intel.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>

Btrfs: Fix a crash when mounting a subvolume · f106e82c

Authored by Li Zefan on Dec 07, 2010

We should drop dentry before deactivating the superblock, otherwise
we can hit this bug:

BUG: Dentry f349a690{i=100,n=/} still in use (1) [unmount of btrfs loop1]
...

Steps to reproduce the bug:

  # mount /dev/loop1 /mnt
  # mkdir save
  # btrfs subvolume snapshot /mnt save/snap1
  # umount /mnt
  # mount -o subvol=save/snap1 /dev/loop1 /mnt
  (crash)
Reported-by: Michael Niederle <mniederle@gmx.at>
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>

Btrfs: fix sync subvol/snapshot creation · 75eaa0e2

Authored by Sage Weil on Dec 10, 2010

We were incorrectly taking the async path even for the sync ioctls by
passing in &transid unconditionally.

There's ample room for further cleanup here, but this keeps the fix simple.
Signed-off-by: Sage Weil <sage@newdream.net>
Reviewed-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>

Btrfs: Fix page leak in compressed writeback path · 24ae6365

Authored by Yan, Zheng on Dec 06, 2010

"start + num_bytes >= actual_end" can happen when compressed page writeback races
with file truncation. In that case we need to unlock and release pages past the
end of file.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>

Btrfs: do not BUG if we fail to remove the orphan item for dead snapshots · 84cd948c

Authored by Josef Bacik on Dec 08, 2010

Not being able to delete an orphan item isn't a horrible thing. The worst that
happens is that the next time around we try to do the orphan cleanup, can't
find the referenced object, and just delete the item and move on.
Signed-off-by: Josef Bacik <josef@redhat.com>

10 Dec, 2010 4 commits

Btrfs: fixup return code for btrfs_del_orphan_item · 7e1fea73

Authored by Josef Bacik on Dec 08, 2010

If the orphan item doesn't exist, we return 1, which doesn't make any sense to
the callers. Instead return -ENOENT if we didn't find the item. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>

Btrfs: do not do fast caching if we are allocating blocks for tree_root · b8399dee

Authored by Josef Bacik on Dec 08, 2010

Since the fast caching uses normal tree locking, we can possibly deadlock if we
get to the caching via a btrfs_search_slot() on the tree_root. So just check to
see if the root we are on is the tree root, and just don't do the fast caching.
Reported-by: Sage Weil <sage@newdream.net>
Signed-off-by: Josef Bacik <josef@redhat.com>

Btrfs: deal with space cache errors better · 2b20982e

Authored by Josef Bacik on Dec 03, 2010

Currently if the space cache inode generation number doesn't match the
generation number in the space cache header we will just fail to load the space
cache, but we won't mark the space cache as an error, so we'll keep getting that
error each time somebody tries to cache that block group until we actually clear
the thing. Fix this by marking the space cache as having an error so we only
get the message once. This patch also makes it so that we don't try to set up
the space cache for a block group that isn't cached, since we won't be able to
write it out anyway. None of these problems are actual problems, they are just
annoying and sub-optimal. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>

Btrfs: fix use after free in O_DIRECT · 955256f2

Authored by Josef Bacik on Nov 19, 2010

This fixes a bug where we use dip after we have freed it.  Instead just use the
file_offset that was passed to the function.  Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>

29 Nov, 2010 2 commits

Btrfs: don't use migrate page without CONFIG_MIGRATION · 5a92bc88

Authored by Chris Mason on Nov 29, 2010

Fixes compile error
Signed-off-by: Chris Mason <chris.mason@oracle.com>

Btrfs: deal with DIO bios that span more than one ordered extent · 163cf09c

Authored by Chris Mason on Nov 28, 2010

The new DIO bio splitting code has problems when the bio
spans more than one ordered extent.  This will happen as the
generic DIO code merges our get_blocks calls together into
a bigger single bio.

This fixes things by walking forward in the ordered extent
code finding all the overlapping ordered extents and completing them
all at once.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
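The forward walk over ordered extents can be sketched as an interval scan. This is a simplified Python model, not the kernel code: `extents_to_complete` is a hypothetical name, and ordered extents are reduced to sorted, non-overlapping `(offset, length)` pairs.

```python
def extents_to_complete(ordered, start, length):
    """Return every ordered extent that overlaps [start, start + length).

    `ordered` is a list of (offset, length) pairs sorted by offset.
    """
    end = start + length
    position = start
    found = []
    for offset, size in ordered:
        if offset + size <= position:
            continue  # entirely before the part of the bio still unhandled
        if offset >= end:
            break  # past the end of the bio; later extents cannot overlap
        found.append((offset, size))
        position = offset + size  # walk forward past this extent
    return found
```

A bio that was merged across two get_blocks calls then yields two ordered extents from a single walk, so both can be completed together.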

28 Nov, 2010 6 commits

Btrfs: setup blank root and fs_info for mount time · 450ba0ea

Authored by Josef Bacik on Nov 19, 2010

There is a problem with how we use sget: it searches through the list of supers
attached to the fs_type looking for a super with the same fs_devices as what
we're trying to mount. This depends on sb->s_fs_info being filled, but we don't
fill that in until we get to btrfs_fill_super, so we could hit supers on the
fs_type super list that have a null s_fs_info. In order to fix that we need to
go ahead and set up a blank root with a blank fs_info to hold fs_devices. That
way our test will work out right and then we can set s_fs_info in
btrfs_set_super, and then open_ctree will simply use our pre-allocated root and
fs_info when setting everything up. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>

Btrfs: fix fiemap · 975f84fe

Authored by Josef Bacik on Nov 23, 2010

There are two big problems currently with FIEMAP:

1) We return extents for holes. This isn't supposed to happen; we should simply
not return extents for holes, and userspace then interprets the lack of an
extent as a hole.

2) We sometimes don't set FIEMAP_EXTENT_LAST properly. This is because we wait
to see an EXTENT_FLAG_VACANCY flag on the em, but this won't happen if, say, we
ask fiemap to map up to the last extent in a file, and there is nothing but
holes up to the i_size. To fix this we need to look up the last extent in this
file and save the logical offset, so if we happen to try and map that extent we
can be sure to set FIEMAP_EXTENT_LAST.

With this patch we now pass xfstest 225, which we have never passed before.
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
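Both fixes can be modelled on a list of extents. The sketch below is an illustration in Python rather than the kernel implementation: `fiemap_model` and the `(offset, length, is_hole)` tuples are hypothetical, while `FIEMAP_EXTENT_LAST = 0x1` is the value defined in `linux/fiemap.h`.

```python
FIEMAP_EXTENT_LAST = 0x1  # value from linux/fiemap.h


def fiemap_model(extents, start, length):
    """Map [start, start + length) over sorted (offset, length, is_hole) tuples.

    Holes are never reported, and the last real extent of the file is flagged
    even when only holes follow it.
    """
    # Look up the logical offset of the last non-hole extent once, up front.
    last_offset = max((off for off, _, hole in extents if not hole), default=None)
    end = start + length
    result = []
    for off, size, hole in extents:
        if hole or off + size <= start or off >= end:
            continue
        flags = FIEMAP_EXTENT_LAST if off == last_offset else 0
        result.append((off, size, flags))
    return result
```

Saving the offset of the last real extent is what lets the flag be set without waiting for a vacancy marker that never arrives when the file ends in holes.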

Btrfs - fix race between btrfs_get_sb() and umount · 619c8c76

Authored by Ian Kent on Nov 22, 2010

When mounting a btrfs file system btrfs_test_super() may attempt to
use sb->s_fs_info, the btrfs root, of a super block that is going away
and that has had the btrfs root set to NULL in its ->put_super(). But
if the super block is going away it cannot be an existing super block
so we can return false in this case.
Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>

Btrfs: update inode ctime when using links · bc1cbf1f

Authored by Josef Bacik on Nov 23, 2010

Currently we fail xfstest 236 because we're not updating the inode ctime on
link.  This is a simple fix, and makes it so we pass 236 now.
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>

Btrfs: make sure new inode size is ok in fallocate · 0ed42a63

Authored by Josef Bacik on Nov 22, 2010

We have been failing xfstest 228 forever, because we don't check to make sure
the new inode size is acceptable as far as RLIMIT is concerned.  Just check to
make sure it's ok to create an inode with this new size and error out if not.
With this patch we now pass 228.
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>

Btrfs: fix typo in fallocate to make it honor actual size · 55a61d1d

Authored by Josef Bacik on Nov 22, 2010

There is a typo in __btrfs_prealloc_file_range() where we set the i_size to
actual_len/cur_offset, and then just set it to cur_offset again, and do the same
with btrfs_ordered_update_i_size(). This fixes it back to keeping i_size in a
local variable and then updating i_size properly. Tested this with

xfs_io -F -f -c "falloc 0 1" -c "pwrite 0 1" foo

stat'ing foo gives us a size of 1 instead of 4096 like it was. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
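The intended size calculation can be shown with the numbers from the test. This is a hedged model, not the kernel function: `new_i_size` is a hypothetical helper, `cur_offset` stands for the end of the block-aligned preallocated range and `actual_len` for the end of the range the caller asked for.

```python
def new_i_size(old_i_size, cur_offset, actual_len):
    """File size after preallocation, honoring the requested length.

    The size only grows, and it grows to the requested end rather than to the
    block-aligned end of the allocation.
    """
    i_size = min(cur_offset, actual_len)
    return max(old_i_size, i_size)
```

For `falloc 0 1` on an empty file with 4096-byte blocks, `cur_offset` is 4096 and `actual_len` is 1, so the model yields a size of 1, matching the stat result quoted in the message.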