提交 · 39c2d7faccc5ca5a1be682b01c0db5fafa8adeda · openeuler / raspberrypi-kernel

03 6月, 2015 7 次提交

Btrfs: fix -ENOSPC on block group removal · 39c2d7fa

由 Filipe Manana 提交于 5月 20, 2015

Unlike when attempting to allocate a new block group, where we check
that we have enough space in the system space_info to update the device
items and insert a new chunk item in the chunk tree, we were not checking
if the system space_info had enough space for updating the device items
and deleting the chunk item in the chunk tree. This often lead to -ENOSPC
error when attempting to allocate blocks for the chunk tree (during btree
node/leaf COW operations) while updating the device items or deleting the
chunk item, which resulted in the current transaction being aborted and
turning the filesystem into read-only mode.

While running fstests generic/038, which stresses allocation of block
groups and removal of unused block groups, with a large scratch device
(750Gb) this happened often, despite more than enough unallocated space,
and resulted in the following trace:

[68663.586604] WARNING: CPU: 3 PID: 1521 at fs/btrfs/super.c:260 __btrfs_abort_transaction+0x52/0x114 [btrfs]()
[68663.600407] BTRFS: Transaction aborted (error -28)
(...)
[68663.730829] Call Trace:
[68663.732585]  [<ffffffff8142fa46>] dump_stack+0x4f/0x7b
[68663.734334]  [<ffffffff8108b6a2>] ? console_unlock+0x361/0x3ad
[68663.739980]  [<ffffffff81045ea5>] warn_slowpath_common+0xa1/0xbb
[68663.757153]  [<ffffffffa036ca6d>] ? __btrfs_abort_transaction+0x52/0x114 [btrfs]
[68663.760925]  [<ffffffff81045f05>] warn_slowpath_fmt+0x46/0x48
[68663.762854]  [<ffffffffa03b159d>] ? btrfs_update_device+0x15a/0x16c [btrfs]
[68663.764073]  [<ffffffffa036ca6d>] __btrfs_abort_transaction+0x52/0x114 [btrfs]
[68663.765130]  [<ffffffffa03b3638>] btrfs_remove_chunk+0x597/0x5ee [btrfs]
[68663.765998]  [<ffffffffa0384663>] ? btrfs_delete_unused_bgs+0x245/0x296 [btrfs]
[68663.767068]  [<ffffffffa0384676>] btrfs_delete_unused_bgs+0x258/0x296 [btrfs]
[68663.768227]  [<ffffffff8143527f>] ? _raw_spin_unlock_irq+0x2d/0x4c
[68663.769081]  [<ffffffffa038b109>] cleaner_kthread+0x13d/0x16c [btrfs]
[68663.799485]  [<ffffffffa038afcc>] ? btrfs_alloc_root+0x28/0x28 [btrfs]
[68663.809208]  [<ffffffff8105f367>] kthread+0xef/0xf7
[68663.828795]  [<ffffffff810e603f>] ? time_hardirqs_on+0x15/0x28
[68663.844942]  [<ffffffff8105f278>] ? __kthread_parkme+0xad/0xad
[68663.846486]  [<ffffffff81435a88>] ret_from_fork+0x58/0x90
[68663.847760]  [<ffffffff8105f278>] ? __kthread_parkme+0xad/0xad
[68663.849503] ---[ end trace 798477c6d6dbaad6 ]---
[68663.850525] BTRFS: error (device sdc) in btrfs_remove_chunk:2652: errno=-28 No space left

So fix this by verifying that enough space exists in system space_info,
and reserving the space in the chunk block reserve, before attempting to
delete the block group and allocate a new system chunk if we don't have
enough space to perform the necessary updates and delete in the chunk
tree. Like for the block group creation case, we don't error our if we
fail to allocate a new system chunk, since we might end up not needing
it (no node/leaf splits happen during the COW operations and/or we end
up not needing to COW any btree nodes or leafs because they were already
COWed in the current transaction and their writeback didn't start yet).
Signed-off-by: NFilipe Manana <fdmanana@suse.com>
Signed-off-by: NChris Mason <clm@fb.com>

39c2d7fa

Btrfs: fix chunk allocation regression leading to transaction abort · c152b63e

由 Filipe Manana 提交于 5月 14, 2015

With commit 1b984508 ("Btrfs: fix find_free_dev_extent() malfunction
in case device tree has hole") introduced in the kernel 4.1 merge window,
we end up using part of a device hole for which there are already pending
chunks or pinned chunks. Before that commit we didn't use the hole and
would just move on to the next hole in the device.

However when we adjust the start offset for the chunk allocation and we
have pinned chunks, we set it blindly to the end offset of the pinned
chunk we are currently processing, which is dangerous because we can
have a pending chunk that has a start offset that matches the end offset
of our pinned chunk - leading us to a case where we end up getting two
pending chunks that start at the same physical device offset, which makes
us later abort the current transaction with -EEXIST when finishing the
chunk allocation at btrfs_create_pending_block_groups():

[194737.659017] ------------[ cut here ]------------
[194737.660192] WARNING: CPU: 15 PID: 31111 at fs/btrfs/super.c:260 __btrfs_abort_transaction+0x52/0x106 [btrfs]()
[194737.662209] BTRFS: Transaction aborted (error -17)
[194737.663175] Modules linked in: btrfs dm_snapshot dm_bufio dm_flakey dm_mod crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop fuse
[194737.674015] CPU: 15 PID: 31111 Comm: xfs_io Tainted: G        W       4.0.0-rc5-btrfs-next-9+ #2
[194737.675986] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
[194737.682999]  0000000000000009 ffff8800564c7a98 ffffffff8142fa46 ffffffff8108b6a2
[194737.684540]  ffff8800564c7ae8 ffff8800564c7ad8 ffffffff81045ea5 ffff8800564c7b78
[194737.686017]  ffffffffa0383aa7 00000000ffffffef ffff88000c7ba000 ffff8801a1f66f40
[194737.687509] Call Trace:
[194737.688068]  [<ffffffff8142fa46>] dump_stack+0x4f/0x7b
[194737.689027]  [<ffffffff8108b6a2>] ? console_unlock+0x361/0x3ad
[194737.690095]  [<ffffffff81045ea5>] warn_slowpath_common+0xa1/0xbb
[194737.691198]  [<ffffffffa0383aa7>] ? __btrfs_abort_transaction+0x52/0x106 [btrfs]
[194737.693789]  [<ffffffff81045f05>] warn_slowpath_fmt+0x46/0x48
[194737.695065]  [<ffffffffa0383aa7>] __btrfs_abort_transaction+0x52/0x106 [btrfs]
[194737.696806]  [<ffffffffa039a3bd>] btrfs_create_pending_block_groups+0x101/0x130 [btrfs]
[194737.698683]  [<ffffffffa03aa433>] __btrfs_end_transaction+0x84/0x366 [btrfs]
[194737.700329]  [<ffffffffa03aa725>] btrfs_end_transaction+0x10/0x12 [btrfs]
[194737.701924]  [<ffffffffa0394b51>] btrfs_check_data_free_space+0x11f/0x27c [btrfs]
[194737.703675]  [<ffffffffa03b8ba4>] __btrfs_buffered_write+0x16a/0x4c8 [btrfs]
[194737.705417]  [<ffffffffa03bb502>] ? btrfs_file_write_iter+0x19a/0x431 [btrfs]
[194737.707058]  [<ffffffffa03bb511>] ? btrfs_file_write_iter+0x1a9/0x431 [btrfs]
[194737.708560]  [<ffffffffa03bb68d>] btrfs_file_write_iter+0x325/0x431 [btrfs]
[194737.710673]  [<ffffffff81067d85>] ? get_parent_ip+0xe/0x3e
[194737.712076]  [<ffffffff811534c3>] new_sync_write+0x7c/0xa0
[194737.713293]  [<ffffffff81153b58>] vfs_write+0xb2/0x117
[194737.714443]  [<ffffffff81154424>] SyS_pwrite64+0x64/0x82
[194737.715646]  [<ffffffff81435b32>] system_call_fastpath+0x12/0x17
[194737.717175] ---[ end trace f2d5dc04e56d7e48 ]---
[194737.718170] BTRFS: error (device sdc) in btrfs_create_pending_block_groups:9524: errno=-17 Object already exists

The -EEXIST failure comes from btrfs_finish_chunk_alloc(), called by
btrfs_create_pending_block_groups(), when it attempts to insert a
duplicated device extent item via btrfs_alloc_dev_extent().

This issue was reproducible with fstests generic/038 running in a loop for
several hours (it's very hard to hit) and using MOUNT_OPTIONS="-o discard".
Applying Jeff's recent patch titled "btrfs: add missing discards when
unpinning extents with -o discard" makes the issue much easier to reproduce
(usually within 4 to 5 hours), since it pins chunks for longer periods of
time when an unused block group is deleted by the cleaner kthread.

Fix this by making sure that we never adjust the start offset to a lower
value than it currently has.

Fixes: 1b984508 ("Btrfs: fix find_free_dev_extent() malfunction in case device tree has hole"
Signed-off-by: NFilipe Manana <fdmanana@suse.com>
Signed-off-by: NChris Mason <clm@fb.com>

c152b63e

btrfs: use after free when closing devices · 2037a093

由 Sasha Levin 提交于 5月 12, 2015

__btrfs_close_devices() would call_rcu to free the device, which is racy with
list_for_each_entry() accessing the memory to retrieve the next device on the
list.
Signed-off-by: NSasha Levin <sasha.levin@oracle.com>
Reviewed-by: NDavid Sterba <dsterba@suse.cz>
Signed-off-by: NChris Mason <clm@fb.com>

2037a093

Btrfs: check error before reporting missing device and add uuid · 33b97e43

由 Anand Jain 提交于 5月 08, 2015

Report missing device when add is successful,
otherwise it would exit as ENOMEM. And add uuid
to the report.
Signed-off-by: NAnand Jain <anand.jain@oracle.com>
Reviewed-by: NDavid Sterba <dsterba@suse.cz>
Signed-off-by: NChris Mason <clm@fb.com>

33b97e43

Btrfs: log when missing device is created · 816fcebe

由 Anand Jain 提交于 4月 27, 2015

Signed-off-by: NAnand Jain <anand.jain@oracle.com>
Reviewed-by: NDavid Sterba <dsterba@suse.cz>
Signed-off-by: NChris Mason <clm@fb.com>

816fcebe

btrfs: fix warnings after changes in btrfs_abort_transaction · 6d13f549

由 David Sterba 提交于 4月 24, 2015

fs/btrfs/volumes.c: In function ‘btrfs_create_uuid_tree’:
fs/btrfs/volumes.c:3909:3: warning: format ‘%d’ expects argument of type ‘int’, but argument 4 has type ‘long int’ [-Wformat=]
   btrfs_abort_transaction(trans, tree_root,
   ^
  CC [M]  fs/btrfs/ioctl.o
fs/btrfs/ioctl.c: In function ‘create_subvol’:
fs/btrfs/ioctl.c:549:3: warning: format ‘%d’ expects argument of type ‘int’, but argument 4 has type ‘long int’ [-Wformat=]
   btrfs_abort_transaction(trans, root, PTR_ERR(new_root));

PTR_ERR returns long, but we're really using 'int' for the error codes
everywhere so just set and use the local variable.
Signed-off-by: NDavid Sterba <dsterba@suse.cz>
Signed-off-by: NChris Mason <clm@fb.com>

6d13f549

Btrfs: check pending chunks when shrinking fs to avoid corruption · 53e489bc

由 Filipe Manana 提交于 6月 02, 2015

When we shrink the usable size of a device (its total_bytes), we go over
all the device extent items in the device tree and attempt to relocate
the chunk of any device extent that goes beyond the new usable size for
the device. We do that after setting the new usable size (total_bytes) in
the device object, so that all new allocations (and reallocations) don't
use areas of the device that go beyond the new (shorter) size. However we
were not considering that before setting the new size in the device,
pending chunks might have been created that use device extents that go
beyond the new size, and those device extents are not yet in the device
tree after we search the device tree - they are still attached to the
list of new block group for some ongoing transaction handle, and they are
only added to the device tree when the transaction handle is ended (via
btrfs_create_pending_block_groups()).

So check for pending chunks with device extents that go beyond the new
size and if any exists, commit the current transaction and repeat the
search in the device tree.

Not doing this it would mean we would return success to user space while
still having extents that go beyond the new size, and later user space
could override those locations on the device while the fs still references
them, causing all sorts of corruption and unexpected events.
Signed-off-by: NFilipe Manana <fdmanana@suse.com>
Signed-off-by: NChris Mason <clm@fb.com>

53e489bc

20 5月, 2015 1 次提交

Btrfs: fix racy system chunk allocation when setting block group ro · a9629596

由 Filipe Manana 提交于 5月 18, 2015

If while setting a block group read-only we end up allocating a system
chunk, through check_system_chunk(), we were not doing it while holding
the chunk mutex which is a problem if a concurrent chunk allocation is
happening, through do_chunk_alloc(), as it means both block groups can
end up using the same logical addresses and physical regions in the
device(s). So make sure we hold the chunk mutex.

Cc: stable@vger.kernel.org  # 4.0+
Fixes: 2f081088 ("btrfs: delete chunk allocation attemp when
                      setting block group ro")
Signed-off-by: NFilipe Manana <fdmanana@suse.com>
Signed-off-by: NChris Mason <clm@fb.com>

a9629596

26 4月, 2015 1 次提交

Btrfs: fix find_free_dev_extent() malfunction in case device tree has hole · 1b984508

由 Forrest Liu 提交于 2月 09, 2015

If device tree has hole, find_free_dev_extent() cannot find available
address properly.

The problem can be reproduce by following script.

    mntpath=/btrfs
    loopdev=/dev/loop0
    filepath=/home/forrest/image

    umount $mntpath
    losetup -d $loopdev
    truncate --size 100g $filepath
    losetup $loopdev $filepath
    mkfs.btrfs -f $loopdev
    mount $loopdev $mntpath

    # make device tree with one big hole
    for i in `seq 1 1 100`; do
        fallocate -l 1g $mntpath/$i
    done
    sync
    for i in `seq 1 1 95`; do
        rm $mntpath/$i
    done
    sync

    # wait cleaner thread remove unused block group
    sleep 300

    fallocate -l 1g $mntpath/aaa

    # failed to allocate new chunk
    fallocate -l 1g $mntpath/bbb

Above script will make device tree with one big hole, and can only allocate
just one chunk in a transaction, so failed to allocate new chunk for $mntpath/bbb

    item 8 key (1 DEV_EXTENT 2185232384) itemoff 15859 itemsize 48
        dev extent chunk_tree 3
        chunk objectid 256 chunk offset 106292051968 length 1073741824
    item 9 key (1 DEV_EXTENT 104190705664) itemoff 15811 itemsize 48
        dev extent chunk_tree 3
        chunk objectid 256 chunk offset 103108575232 length 1073741824
Signed-off-by: NForrest Liu <forrestl@synology.com>
Reviewed-by: NLiu Bo <bo.li.liu@oracle.com>
Signed-off-by: NChris Mason <clm@fb.com>

1b984508

13 4月, 2015 1 次提交

btrfs: Fix tail space processing in find_free_dev_extent() · f2ab7618

由 Zhao Lei 提交于 2月 16, 2015

It is another reason for NO_SPACE case.

When we found enough free space in loop and saved them to
max_hole_start/size before, and tail space contains pending extent,
origional innocent max_hole_start/size are reset in retry.

As a result, find_free_dev_extent() returns less space than it can,
and cause NO_SPACE in user program.
Reviewed-by: NLiu Bo <bo.li.liu@oracle.com>
Signed-off-by: NZhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: NChris Mason <clm@fb.com>

f2ab7618

04 3月, 2015 9 次提交

btrfs: remove shadowing variables in __btrfs_map_block · 258ece02

由 David Sterba 提交于 2月 24, 2015

1) We can safely use the function's 'i'. Fixes warning

fs/btrfs/volumes.c:5257:7: warning: declaration of 'i' shadows a previous local
fs/btrfs/volumes.c:4951:6: warning: shadowed declaration is here

2) A local variable duplicates name of an argument, we can use the value
directly. Fixes warning

fs/btrfs/volumes.c:5433:8: warning: declaration of 'length' shadows a parameter
fs/btrfs/volumes.c:4935:27: warning: shadowed declaration is here
Signed-off-by: NDavid Sterba <dsterba@suse.cz>

258ece02

D
btrfs: cleanup, use correct type in div_u64_rem · 9d644a62
由 David Sterba 提交于 2月 20, 2015
```
div_u64_rem expects u32 for divisior and reminder.
Signed-off-by: NDavid Sterba <dsterba@suse.cz>
```
9d644a62

btrfs: replace remaining do_div calls with div_u64 variants · 47c5713f

由 David Sterba 提交于 2月 20, 2015

Switch to div_u64_rem that does type checking and has more obvious
semantics than do_div.
Signed-off-by: NDavid Sterba <dsterba@suse.cz>

47c5713f

btrfs: cleanup 64bit/32bit divs, provably bounded values · b8b93add

由 David Sterba 提交于 1月 16, 2015

The divisor is derived from nodesize or PAGE_SIZE, fits into 32bit type.
Get rid of a few more do_div instances.
Signed-off-by: NDavid Sterba <dsterba@suse.cz>

b8b93add

btrfs: cleanup, use kmalloc_array/kcalloc array helpers · 31e818fe

由 David Sterba 提交于 2月 20, 2015

Convert kmalloc(nr * size, ..) to kmalloc_array that does additional
overflow checks, the zeroing variant is kcalloc.
Signed-off-by: NDavid Sterba <dsterba@suse.cz>

31e818fe

btrfs: need_resched not needed with cond_resched · 853d8ec4

由 David Sterba 提交于 1月 08, 2015

Cleanup, no special reason to do

if (need_resched())
        cond_resched();
Signed-off-by: NDavid Sterba <dsterba@suse.cz>

853d8ec4

D
btrfs: cleanup, use correct type in div_u64_rem · 29cf342b
由 David Sterba 提交于 2月 20, 2015
```
div_u64_rem expects u32 for divisior and reminder.
Signed-off-by: NDavid Sterba <dsterba@suse.cz>
```
29cf342b

btrfs: replace remaining do_div calls with div_u64 variants · 35b850f1

由 David Sterba 提交于 2月 20, 2015

Switch to div_u64_rem that does type checking and has more obvious
semantics than do_div.
Signed-off-by: NDavid Sterba <dsterba@suse.cz>

35b850f1

btrfs: cleanup 64bit/32bit divs, provably bounded values · c7abe829

由 David Sterba 提交于 1月 16, 2015

The divisor is derived from nodesize or PAGE_SIZE, fits into 32bit type.
Get rid of a few more do_div instances.
Signed-off-by: NDavid Sterba <dsterba@suse.cz>

c7abe829

20 2月, 2015 1 次提交

Btrfs: fix allocation size calculations in alloc_btrfs_bio · e57cf21e

由 Chris Mason 提交于 2月 19, 2015

Since commit 8e5cfb55 (Btrfs: Make raid_map array be inlined in
btrfs_bio structure), the raid map array is allocated along with the
btrfs bio in alloc_btrfs_bio.  The calculation used to decide how much
we need to allocate was using the wrong parameter passed into the
allocation function.

The passed in real_stripes will be zero if a target replace operation
is not currently running.  We want to use total_stripes instead.
Signed-off-by: NChris Mason <clm@fb.com>
Reported-by: NDavid Sterba <dsterba@suse.cz>
Tested-by: NDavid Sterba <dsterba@suse.cz>

e57cf21e

17 2月, 2015 4 次提交

btrfs: remove unused fs_info arg from btrfs_close_extra_devices() · 9eaed21e

由 Eric Sandeen 提交于 8月 01, 2014

The commit:
8dabb742 Btrfs: change core code of btrfs to support the
        device replace operations
added the fs_info argument, but never used it -
just remove it again.
Signed-off-by: NEric Sandeen <sandeen@redhat.com>
Signed-off-by: NDavid Sterba <dsterba@suse.cz>

9eaed21e

btrfs: cleanup: use for() loop in btrfs_map_bio() · 08da757d

由 Zhao Lei 提交于 2月 12, 2015

for() is obviously better in these code block, and remove noused
init-value to reduce about 6 bytes binary size.
Signed-off-by: NZhao Lei <zhaolei@cn.fujitsu.com>
Reviewed-by: NDavid Sterba <dsterba@suse.cz>
Signed-off-by: NDavid Sterba <dsterba@suse.cz>

08da757d

btrfs: remove unused chunk_tree argument in several functions · a688a04a

由 Zhao Lei 提交于 2月 09, 2015

There functions include unused chunk_tree argument from the begining,
it is time to remove them and clean up relative code to prepare value
of this argument in caller.
Signed-off-by: NZhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: NDavid Sterba <dsterba@suse.cz>

a688a04a

btrfs: constify structs with op functions or static definitions · e8c9f186

由 David Sterba 提交于 1月 02, 2015

There are some op tables that can be easily made const, similarly the
sysfs feature and raid tables. This is motivated by PaX CONSTIFY plugin.
Signed-off-by: NDavid Sterba <dsterba@suse.cz>

e8c9f186

15 2月, 2015 1 次提交

btrfs: Fix out-of-space bug · 13212b54

由 Zhao Lei 提交于 2月 12, 2015

Btrfs will report NO_SPACE when we create and remove files for several times,
and we can't write to filesystem until mount it again.

Steps to reproduce:
 1: Create a single-dev btrfs fs with default option
 2: Write a file into it to take up most fs space
 3: Delete above file
 4: Wait about 100s to let chunk removed
 5: goto 2

Script is like following:
 #!/bin/bash

 # Recommend 1.2G space, too large disk will make test slow
 DEV="/dev/sda16"
 MNT="/mnt/tmp"

 dev_size="$(lsblk -bn -o SIZE "$DEV")" || exit 2
 file_size_m=$((dev_size * 75 / 100 / 1024 / 1024))

 echo "Loop write ${file_size_m}M file on $((dev_size / 1024 / 1024))M dev"

 for ((i = 0; i < 10; i++)); do umount "$MNT" 2>/dev/null; done
 echo "mkfs $DEV"
 mkfs.btrfs -f "$DEV" >/dev/null || exit 2
 echo "mount $DEV $MNT"
 mount "$DEV" "$MNT" || exit 2

 for ((loop_i = 0; loop_i < 20; loop_i++)); do
     echo
     echo "loop $loop_i"

     echo "dd file..."
     cmd=(dd if=/dev/zero of="$MNT"/file0 bs=1M count="$file_size_m")
     "${cmd[@]}" 2>/dev/null || {
         # NO_SPACE error triggered
         echo "dd failed: ${cmd[*]}"
         exit 1
     }

     echo "rm file..."
     rm -f "$MNT"/file0 || exit 2

     for ((i = 0; i < 10; i++)); do
         df "$MNT" | tail -1
         sleep 10
     done
 done

Reason:
 It is triggered by commit: 47ab2a6c
 which is used to remove empty block groups automatically, but the
 reason is not in that patch. Code before works well because btrfs
 don't need to create and delete chunks so many times with high
 complexity.
 Above bug is caused by many reason, any of them can trigger it.

Reason1:
 When we remove some continuous chunks but leave other chunks after,
 these disk space should be used by chunk-recreating, but in current
 code, only first create will successed.
 Fixed by Forrest Liu <forrestl@synology.com> in:
 Btrfs: fix find_free_dev_extent() malfunction in case device tree has hole

Reason2:
 contains_pending_extent() return wrong value in calculation.
 Fixed by Forrest Liu <forrestl@synology.com> in:
 Btrfs: fix find_free_dev_extent() malfunction in case device tree has hole

Reason3:
 btrfs_check_data_free_space() try to commit transaction and retry
 allocating chunk when the first allocating failed, but space_info->full
 is set in first allocating, and prevent second allocating in retry.
 Fixed in this patch by clear space_info->full in commit transaction.

 Tested for severial times by above script.

Changelog v3->v4:
 use light weight int instead of atomic_t to record have_remove_bgs in
 transaction, suggested by:
 Josef Bacik <jbacik@fb.com>

Changelog v2->v3:
 v2 fixed the bug by adding more commit-transaction, but we
 only need to reclaim space when we are really have no space for
 new chunk, noticed by:
 Filipe David Manana <fdmanana@gmail.com>

 Actually, our code already have this type of commit-and-retry,
 we only need to make it working with removed-bgs.
 v3 fixed the bug with above way.

Changelog v1->v2:
 v1 will introduce a new bug when delete and create chunk in same disk
 space in same transaction, noticed by:
 Filipe David Manana <fdmanana@gmail.com>
 V2 fix this bug by commit transaction after remove block grops.
Reported-by: NTsutomu Itoh <t-itoh@jp.fujitsu.com>
Suggested-by: NFilipe David Manana <fdmanana@gmail.com>
Suggested-by: NJosef Bacik <jbacik@fb.com>
Signed-off-by: NZhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: NChris Mason <clm@fb.com>

13212b54

03 2月, 2015 2 次提交

btrfs: add more checks to btrfs_read_sys_array · e3540eab

由 David Sterba 提交于 11月 05, 2014

Verify that the sys_array has enough bytes to read the next item.
Signed-off-by: NDavid Sterba <dsterba@suse.cz>
Signed-off-by: NChris Mason <clm@fb.com>

e3540eab

btrfs: cleanup, rename a few variables in btrfs_read_sys_array · 1ffb22cf

由 David Sterba 提交于 10月 31, 2014

There's a pointer to buffer, integer offset and offset passed as
pointer, try to find matching names for them.
Signed-off-by: NDavid Sterba <dsterba@suse.cz>
Signed-off-by: NChris Mason <clm@fb.com>

1ffb22cf

22 1月, 2015 5 次提交

Btrfs: Introduce BTRFS_BLOCK_GROUP_RAID56_MASK to check raid56 simply · ffe2d203

由 Zhao Lei 提交于 1月 20, 2015

So we can check raid56 with:
 (map->type & BTRFS_BLOCK_GROUP_RAID56_MASK)
instead of long:
 (map->type & (BTRFS_BLOCK_GROUP_RAID5 | BTRFS_BLOCK_GROUP_RAID6))
Signed-off-by: NZhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
Signed-off-by: NChris Mason <clm@fb.com>

ffe2d203

Btrfs: Include map_type in raid_bio · 10f11900

由 Zhao Lei 提交于 1月 20, 2015

Corrent code use many kinds of "clever" way to determine operation
target's raid type, as:
  raid_map != NULL
  or
  raid_map[MAX_NR] == RAID[56]_Q_STRIPE

To make code easy to maintenance, this patch put raid type into
bbio, and we can always get raid type from bbio with a "stupid"
way.
Signed-off-by: NZhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
Signed-off-by: NChris Mason <clm@fb.com>

10f11900

Btrfs: add ref_count and free function for btrfs_bio · 6e9606d2

由 Zhao Lei 提交于 1月 20, 2015

1: ref_count is simple than current RBIO_HOLD_BBIO_MAP_BIT flag
   to keep btrfs_bio's memory in raid56 recovery implement.
2: free function for bbio will make code clean and flexible, plus
   forced data type checking in compile.

Changelog v1->v2:
 Rename following by David Sterba's suggestion:
 put_btrfs_bio() -> btrfs_put_bio()
 get_btrfs_bio() -> btrfs_get_bio()
 bbio->ref_count -> bbio->refs
Signed-off-by: NZhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
Signed-off-by: NChris Mason <clm@fb.com>

6e9606d2

Btrfs: Make raid_map array be inlined in btrfs_bio structure · 8e5cfb55

由 Zhao Lei 提交于 1月 20, 2015

It can make code more simple and clear, we need not care about
free bbio and raid_map together.
Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
Signed-off-by: NZhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: NChris Mason <clm@fb.com>

8e5cfb55

Btrfs: sort raid_map before adding tgtdev stripes · cc7539ed

由 Zhao Lei 提交于 1月 20, 2015

It can avoid complex calculation of real stripes in sort,
moreover, we can clean up code of sorting tgtdev_map because it
will be in order initially.
Signed-off-by: NZhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
Signed-off-by: NChris Mason <clm@fb.com>

cc7539ed

17 12月, 2014 1 次提交
- A
  btrfs: filp_open() returns ERR_PTR() on failure, not NULL... · 98af592f
  由 Al Viro 提交于 12月 14, 2014
```
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
```
  98af592f
13 12月, 2014 1 次提交

btrfs: sink blocksize parameter to btrfs_find_create_tree_block · a83fffb7

由 David Sterba 提交于 6月 15, 2014

Finally it's clear that the requested blocksize is always equal to
nodesize, with one exception, the superblock.

Superblock has fixed size regardless of the metadata block size, but
uses the same helpers to initialize sys array/chunk tree and to work
with the chunk items. So it pretends to be an extent_buffer for a
moment, btrfs_read_sys_array is full of special cases, we're adding one
more.
Signed-off-by: NDavid Sterba <dsterba@suse.cz>

a83fffb7

03 12月, 2014 6 次提交

Btrfs: fix fs mapping extent map leak · 495e64f4

由 Filipe Manana 提交于 12月 02, 2014

On chunk allocation error (label "error_del_extent"), after adding the
extent map to the tree and to the pending chunks list, we would leave
decrementing the extent map's refcount by 2 instead of 3 (our allocation
+ tree reference + list reference).

Also, on chunk/block group removal, if the block group was on the list
pending_chunks we weren't decrementing the respective list reference.

Detected by 'rmmod btrfs':

[20770.105881] kmem_cache_destroy btrfs_extent_map: Slab cache still has objects
[20770.106127] CPU: 2 PID: 11093 Comm: rmmod Tainted: G        W    L 3.17.0-rc5-btrfs-next-1+ #1
[20770.106128] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
[20770.106130]  0000000000000000 ffff8800ba867eb8 ffffffff813e7a13 ffff8800a2e11040
[20770.106132]  ffff8800ba867ed0 ffffffff81105d0c 0000000000000000 ffff8800ba867ee0
[20770.106134]  ffffffffa035d65e ffff8800ba867ef0 ffffffffa03b0654 ffff8800ba867f78
[20770.106136] Call Trace:
[20770.106142]  [<ffffffff813e7a13>] dump_stack+0x45/0x56
[20770.106145]  [<ffffffff81105d0c>] kmem_cache_destroy+0x4b/0x90
[20770.106164]  [<ffffffffa035d65e>] extent_map_exit+0x1a/0x1c [btrfs]
[20770.106176]  [<ffffffffa03b0654>] exit_btrfs_fs+0x27/0x9d3 [btrfs]
[20770.106179]  [<ffffffff8109dc97>] SyS_delete_module+0x153/0x1c4
[20770.106182]  [<ffffffff8121261b>] ? trace_hardirqs_on_thunk+0x3a/0x3c
[20770.106184]  [<ffffffff813ebf52>] system_call_fastpath+0x16/0x1b

This applies on top (depends on) of my previous patch titled:
"Btrfs: fix race between fs trimming and block group remove/allocation"

But the issue in fact was already present before that change, it only
became easier to hit after Josef's 3.18 patch that added automatic
removal of empty block groups.
Signed-off-by: NFilipe Manana <fdmanana@suse.com>
Signed-off-by: NChris Mason <clm@fb.com>

495e64f4

Btrfs: fix race between fs trimming and block group remove/allocation · 04216820

由 Filipe Manana 提交于 11月 27, 2014

Our fs trim operation, which is completely transactionless (doesn't start
or joins an existing transaction) consists of visiting all block groups
and then for each one to iterate its free space entries and perform a
discard operation against the space range represented by the free space
entries. However before performing a discard, the corresponding free space
entry is removed from the free space rbtree, and when the discard completes
it is added back to the free space rbtree.

If a block group remove operation happens while the discard is ongoing (or
before it starts and after a free space entry is hidden), we end up not
waiting for the discard to complete, remove the extent map that maps
logical address to physical addresses and the corresponding chunk metadata
from the the chunk and device trees. After that and before the discard
completes, the current running transaction can finish and a new one start,
allowing for new block groups that map to the same physical addresses to
be allocated and written to.

So fix this by keeping the extent map in memory until the discard completes
so that the same physical addresses aren't reused before it completes.

If the physical locations that are under a discard operation end up being
used for a new metadata block group for example, and dirty metadata extents
are written before the discard finishes (the VM might call writepages() of
our btree inode's i_mapping for example, or an fsync log commit happens) we
end up overwriting metadata with zeroes, which leads to errors from fsck
like the following:

        checking extents
        Check tree block failed, want=833912832, have=0
        Check tree block failed, want=833912832, have=0
        Check tree block failed, want=833912832, have=0
        Check tree block failed, want=833912832, have=0
        Check tree block failed, want=833912832, have=0
        read block failed check_tree_block
        owner ref check failed [833912832 16384]
        Errors found in extent allocation tree or chunk allocation
        checking free space cache
        checking fs roots
        Check tree block failed, want=833912832, have=0
        Check tree block failed, want=833912832, have=0
        Check tree block failed, want=833912832, have=0
        Check tree block failed, want=833912832, have=0
        Check tree block failed, want=833912832, have=0
        read block failed check_tree_block
        root 5 root dir 256 error
        root 5 inode 260 errors 2001, no inode item, link count wrong
                unresolved ref dir 256 index 0 namelen 8 name foobar_3 filetype 1 errors 6, no dir index, no inode ref
        root 5 inode 262 errors 2001, no inode item, link count wrong
                unresolved ref dir 256 index 0 namelen 8 name foobar_5 filetype 1 errors 6, no dir index, no inode ref
        root 5 inode 263 errors 2001, no inode item, link count wrong
        (...)
Signed-off-by: NFilipe Manana <fdmanana@suse.com>
Signed-off-by: NChris Mason <clm@fb.com>

04216820

Btrfs, raid56: fix use-after-free problem in the final device replace procedure on raid56 · 4245215d

由 Miao Xie 提交于 11月 25, 2014

The commit c404e0dc (Btrfs: fix use-after-free in the finishing
procedure of the device replace) fixed a use-after-free problem
which happened when removing the source device at the end of device
replace, but at that time, btrfs didn't support device replace
on raid56, so we didn't fix the problem on the raid56 profile.
Currently, we implemented device replace for raid56, so we need
kick that problem out before we enable that function for raid56.

The fix method is very simple, we just increase the bio per-cpu
counter before we submit a raid56 io, and decrease the counter
when the raid56 io ends.
Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>

4245215d

Btrfs, replace: write dirty pages into the replace target device · 2c8cdd6e

由 Miao Xie 提交于 11月 14, 2014

The implementation is simple:
- In order to avoid changing the code logic of btrfs_map_bio and
  RAID56, we add the stripes of the replace target devices at the
  end of the stripe array in btrfs bio, and we sort those target
  device stripes in the array. And we keep the number of the target
  device stripes in the btrfs bio.
- Except write operation on RAID56, all the other operation don't
  take the target device stripes into account.
- When we do write operation, we read the data from the common devices
  and calculate the parity. Then write the dirty data and new parity
  out, at this time, we will find the relative replace target stripes
  and wirte the relative data into it.

Note: The function that copying old data on the source device to
the target device was implemented in the past, it is similar to
the other RAID type.
Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>

2c8cdd6e

Btrfs, scrub: repair the common data on RAID5/6 if it is corrupted · af8e2d1d

由 Miao Xie 提交于 10月 23, 2014

This patch implement the RAID5/6 common data repair function, the
implementation is similar to the scrub on the other RAID such as
RAID1, the differentia is that we don't read the data from the
mirror, we use the data repair function of RAID5/6.
Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>

af8e2d1d

Btrfs: remove unnecessary code of stripe_index assignment in __btrfs_map_block · 6de65650

由 Zhao Lei 提交于 11月 13, 2014

stripe_index's value was set again in latter line:
stripe_index = 0;
Signed-off-by: NZhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
Reviewed-by: NDavid Sterba <dsterba@suse.cz>

6de65650