Commits · 1bbc621ef28462456131c035eaeb5567a1a2a2fe · OpenHarmony / kernel_linux

11 Apr, 2015 1 commit

Btrfs: allow block group cache writeout outside critical section in commit · 1bbc621e

Committed by Chris Mason on Apr 06, 2015

We loop through all of the dirty block groups during commit and write
the free space cache.  In order to make sure the cache is correct, we do
this while no other writers are allowed in the commit.

If a large number of block groups are dirty, this can introduce long
stalls during the final stages of the commit, which can block new procs
trying to change the filesystem.

This commit changes the block group cache writeout to take appropriate
locks and allow it to run earlier in the commit.  We'll still have to
redo some of the block groups, but it means we can get most of the work
out of the way without blocking the entire FS.

Signed-off-by: Chris Mason <clm@fb.com>

1bbc621e
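
The two-phase writeout described in this commit message can be pictured with a small userspace sketch. It is only an illustration under stated assumptions, not the btrfs code: `struct block_group`, `write_cache()`, `early_writeout()` and `final_writeout()` are hypothetical names, and a plain `dirty` flag stands in for the real dirty list and its locking.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a block group with a free space cache. */
struct block_group {
	bool dirty;          /* cache needs to be written */
	unsigned int writes; /* how many times the cache was written */
};

/* Pretend to write one cache; this clears the dirty flag. */
static void write_cache(struct block_group *bg)
{
	bg->writes++;
	bg->dirty = false;
}

/*
 * Phase 1: runs early in the commit while other writers are still
 * allowed, so a group may become dirty again after it was written.
 * Returns the number of caches written.
 */
static size_t early_writeout(struct block_group *bgs, size_t n)
{
	size_t written = 0;

	for (size_t i = 0; i < n; i++) {
		if (bgs[i].dirty) {
			write_cache(&bgs[i]);
			written++;
		}
	}
	return written;
}

/*
 * Phase 2: runs inside the critical section, with no other writers.
 * It is the same loop, but only groups that were dirtied again since
 * phase 1 are left to redo.
 */
static size_t final_writeout(struct block_group *bgs, size_t n)
{
	return early_writeout(bgs, n);
}
```

If three of four groups are dirty, phase 1 writes three caches; if one of them is dirtied again before the critical section, phase 2 only has to redo that one, which is the stall reduction the message is after.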

27 Mar, 2015 1 commit

Btrfs: Remove the check for old-style mkfs · e56a951e

Committed by Liu Bo on Mar 17, 2015

This was used to make sure that a fresh btrfs from an older mkfs.btrfs,
but it also allows us to mount a buggy btrfs if this btrfs has the right
superblock head part but has something wrong with chunk tree part[1], and
after that we can hit BUG_ON()s set in the code to prevent something
impossible.

Since David has released "Btrfs progs v3.19-rc2", just remove the check;
anyone who wants to make a fresh btrfs should use the latest one.

[1]: http://www.spinics.net/lists/linux-btrfs/msg42358.html

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Omar Sandoval <osandov@osandov.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>

e56a951e

14 Mar, 2015 1 commit

btrfs: fix sizeof format specifier in btrfs_check_super_valid() · d2207129

Committed by Fabian Frederick on Feb 14, 2015

This patch fixes a MIPS compilation warning:

fs/btrfs/disk-io.c: In function 'btrfs_check_super_valid':
fs/btrfs/disk-io.c:3927:21: warning: format '%lu' expects argument
of type 'long unsigned int', but argument 3 has type 'unsigned int' [-Wformat]

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Chris Mason <clm@fb.com>

d2207129
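
The warning above comes from printing a `sizeof` result with `%lu`. A minimal userspace sketch of the usual remedy for this class of warning, the `%zu` conversion, follows; `struct demo_super` and `format_size()` are hypothetical names used only for illustration and are not taken from the patch.

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical structure; only its size matters for the example. */
struct demo_super {
	char magic[8];
	unsigned long long bytenr;
};

/*
 * sizeof yields a size_t, which is 'unsigned int' on some 32-bit
 * targets and 'unsigned long' on typical 64-bit ones, so '%lu' only
 * matches on some architectures. '%zu' matches size_t everywhere.
 */
static int format_size(char *buf, size_t buflen)
{
	return snprintf(buf, buflen, "size limit %zu",
			sizeof(struct demo_super));
}
```

An alternative is to cast the value to `unsigned long` and keep `%lu`, but `%zu` avoids the cast and states the intent directly.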

04 Mar, 2015 2 commits

btrfs: cleanup, use kmalloc_array/kcalloc array helpers · 31e818fe

Committed by David Sterba on Feb 20, 2015

Convert kmalloc(nr * size, ..) to kmalloc_array, which does additional
overflow checks; the zeroing variant is kcalloc.

Signed-off-by: David Sterba <dsterba@suse.cz>

31e818fe
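
The overflow check mentioned above can be sketched in userspace C as follows. This is an analogue written for illustration, not the kernel's implementation; `checked_malloc_array()` is a hypothetical name.

```c
#include <stdint.h>
#include <stdlib.h>

/*
 * Userspace analogue of an overflow-checked array allocation:
 * return NULL instead of allocating a too-small buffer when
 * n * size would wrap around SIZE_MAX.
 */
static void *checked_malloc_array(size_t n, size_t size)
{
	if (size != 0 && n > SIZE_MAX / size)
		return NULL;
	return malloc(n * size);
}
```

A plain `malloc(n * size)` with a wrapped product would succeed and return a buffer far smaller than the caller expects; the checked variant turns that case into an ordinary allocation failure. In userspace, `calloc()` plays the role of the zeroing variant.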

btrfs: cleanup 64bit/32bit divs, compile time constants · f8c269d7

Committed by David Sterba on Jan 16, 2015

Switch to div_u64 if the divisor is a numeric constant or sum of
sizeof()s. We can remove a few instances of do_div that have the hidden
semantics of changing the 1st argument.

Small power-of-two divisors are converted to bitshifts, large values are
kept intact for clarity.

Signed-off-by: David Sterba <dsterba@suse.cz>

f8c269d7
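
The "hidden semantics" of do_div mentioned above are easier to see side by side with a plain function. The sketch below imitates both helpers in userspace; the `demo_` functions are hypothetical stand-ins written for this illustration, not the kernel definitions.

```c
#include <stdint.h>

/*
 * Imitation of do_div(): the quotient is stored back into *n and the
 * remainder is returned. The kernel version is a macro that modifies
 * its first argument directly, which is easy to overlook at call sites.
 */
static uint32_t demo_do_div(uint64_t *n, uint32_t base)
{
	uint32_t rem = (uint32_t)(*n % base);

	*n /= base;
	return rem;
}

/* Imitation of div_u64(): a plain function that returns the quotient. */
static uint64_t demo_div_u64(uint64_t dividend, uint32_t divisor)
{
	return dividend / divisor;
}

/* A small power-of-two divisor can become a shift: x / 8 == x >> 3. */
static uint64_t demo_div_by_8(uint64_t x)
{
	return x >> 3;
}
```

With the function form, the dividend stays untouched and the result is used as an ordinary expression, so the reader does not have to remember which argument is overwritten.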

21 Feb, 2015 1 commit

btrfs: cleanup 64bit/32bit divs, compile time constants · 16068ec1

Committed by David Sterba on Jan 16, 2015

Switch to div_u64 if the divisor is a numeric constant or sum of
sizeof()s. We can remove a few instances of do_div that have the hidden
semantics of changing the 1st argument.

Small power-of-two divisors are converted to bitshifts, large values are
kept intact for clarity.

Signed-off-by: David Sterba <dsterba@suse.cz>

16068ec1

17 Feb, 2015 15 commits

btrfs: cleanup, reduce temporary variables in btrfs_read_roots · a4f3d2c4

Committed by David Sterba on Feb 16, 2015

Signed-off-by: David Sterba <dsterba@suse.cz>

a4f3d2c4

btrfs: use correct type for workqueue flags · 6f011058

Committed by David Sterba on Feb 16, 2015

Through all the local wrappers to alloc_workqueue, __alloc_workqueue_key
takes an unsigned int.

Signed-off-by: David Sterba <dsterba@suse.cz>

6f011058

btrfs: factor btrfs_read_roots() out of open_ctree() · 4bbcaa64

Committed by Eric Sandeen on Aug 01, 2014

Also, remove the two local variables create_uuid_tree
and check_uuid_tree; we can use the existence of
the uuid root and/or the RESCAN_UUID_TREE flag to
determine what action to take.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: David Sterba <dsterba@suse.cz>

4bbcaa64

btrfs: factor btrfs_replay_log() out of open_ctree() · 63443bf5

Committed by Eric Sandeen on Aug 01, 2014

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: David Sterba <dsterba@suse.cz>

63443bf5

btrfs: factor btrfs_init_workqueues() out of open_ctree() · 2a458198

Committed by Eric Sandeen on Feb 16, 2015

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
[renamed to btrfs_init_workqueues]
Signed-off-by: David Sterba <dsterba@suse.cz>

2a458198

btrfs: factor btrfs_init_qgroup() out of open_ctree() · f9e92e40

Committed by Eric Sandeen on Aug 01, 2014

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
[renamed to btrfs_init_qgroup]
Signed-off-by: David Sterba <dsterba@suse.cz>

f9e92e40

btrfs: factor btrfs_init_dev_replace_locks() out of open_ctree() · ad618368

Committed by Eric Sandeen on Aug 01, 2014

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
[renamed to btrfs_init_dev_replace_locks]
Signed-off-by: David Sterba <dsterba@suse.cz>

ad618368

btrfs: factor btrfs_init_btree_inode() out of open_ctree() · f37938e0

Committed by Eric Sandeen on Aug 01, 2014

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
[renamed to btrfs_init_btree_inode]
Signed-off-by: David Sterba <dsterba@suse.cz>

f37938e0

btrfs: factor btrfs_init_balance() out of open_ctree() · 779a65a4

Committed by Eric Sandeen on Aug 01, 2014

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
[renamed to btrfs_init_balance]
Signed-off-by: David Sterba <dsterba@suse.cz>

779a65a4

btrfs: factor btrfs_init_scrub() out of open_ctree() · 638aa7ed

Committed by Eric Sandeen on Aug 01, 2014

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
[renamed to btrfs_init_scrub]
Signed-off-by: David Sterba <dsterba@suse.cz>

638aa7ed

btrfs: consistently use fs_info in close_ctree() · 04892340

Committed by Eric Sandeen on Aug 01, 2014

close_ctree() has a local fs_info var for convenience;
use it consistently.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: David Sterba <dsterba@suse.cz>

04892340

btrfs: remove unused fs_info arg from btrfs_close_extra_devices() · 9eaed21e

Committed by Eric Sandeen on Aug 01, 2014

The commit:
8dabb742 Btrfs: change core code of btrfs to support the
        device replace operations
added the fs_info argument, but never used it -
just remove it again.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: David Sterba <dsterba@suse.cz>

9eaed21e

btrfs: fix sizeof format specifier in btrfs_check_super_valid() · 41d6b13e

Committed by Fabian Frederick on Feb 14, 2015

This patch fixes a MIPS compilation warning:

fs/btrfs/disk-io.c: In function 'btrfs_check_super_valid':
fs/btrfs/disk-io.c:3927:21: warning: format '%lu' expects argument
of type 'long unsigned int', but argument 3 has type 'unsigned int' [-Wformat]

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: David Sterba <dsterba@suse.cz>

41d6b13e

btrfs: constify structs with op functions or static definitions · e8c9f186

Committed by David Sterba on Jan 02, 2015

There are some op tables that can be easily made const, similarly the
sysfs feature and raid tables. This is motivated by the PaX CONSTIFY plugin.

Signed-off-by: David Sterba <dsterba@suse.cz>

e8c9f186

Btrfs: disk-io: replace root args iff only fs_info used · 01d58472

Committed by Daniel Dressler on Nov 21, 2014

This is the 3rd independent patch of a larger project to clean up btrfs's
internal usage of btrfs_root. Many functions take btrfs_root only to
grab the fs_info struct.

By requiring a root these functions cause programmer overhead. That
these functions can accept any valid root is not obvious until
inspection.

This patch reduces the specificity of such functions to accept the
fs_info directly.

These patches can be applied independently and thus are not being
submitted as a patch series. There should be about 26 patches by the
project's completion. Each patch will clean up between 1 and 34 functions
apiece.  Each patch covers a single file's functions.

This patch affects the following function(s):
  1) csum_tree_block
  2) csum_dirty_buffer
  3) check_tree_block_fsid
  4) btrfs_find_tree_block
  5) clean_tree_block

Signed-off-by: Daniel Dressler <danieru.dressler@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.cz>

01d58472
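
The kind of signature change this patch describes can be illustrated with a small before/after sketch. The structures and functions below are hypothetical, heavily simplified stand-ins and do not reproduce any of the five functions listed in the message.

```c
/* Hypothetical, simplified stand-ins for the btrfs structures. */
struct demo_fs_info {
	unsigned int nodesize;
};

struct demo_root {
	struct demo_fs_info *fs_info;
	int objectid;
};

/* Before: takes a root, but only ever dereferences root->fs_info. */
static unsigned int block_size_old(const struct demo_root *root)
{
	return root->fs_info->nodesize;
}

/* After: asks for exactly what it uses, so callers need no root at hand. */
static unsigned int block_size_new(const struct demo_fs_info *fs_info)
{
	return fs_info->nodesize;
}
```

With the old form a reader has to inspect the body to learn that any valid root gives the same answer; the new signature makes that obvious at the call site.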

03 Feb, 2015 3 commits

Btrfs: fix race between transaction commit and empty block group removal · d4b450cd

Committed by Filipe Manana on Jan 29, 2015

Committing a transaction can race with automatic removal of empty block
groups (cleaner kthread), leading to a BUG_ON() in the transaction
commit code while running btrfs_finish_extent_commit(). The following
sequence diagram shows how it can happen:

           CPU 1                                       CPU 2

btrfs_commit_transaction()
  fs_info->running_transaction = NULL
  btrfs_finish_extent_commit()
    find_first_extent_bit()
      -> found range for block group X
         in fs_info->freed_extents[]

                                               btrfs_delete_unused_bgs()
                                                 -> found block group X

                                                 Removed block group X's range
                                                 from fs_info->freed_extents[]

                                                 btrfs_remove_chunk()
                                                    btrfs_remove_block_group(bg X)

    unpin_extent_range(bg X range)
       btrfs_lookup_block_group(bg X)
          -> returns NULL
            -> BUG_ON()

The trace that results from the BUG_ON() is:

[48665.187808] ------------[ cut here ]------------
[48665.188032] kernel BUG at fs/btrfs/extent-tree.c:5675!
[48665.188032] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC
[48665.188032] Modules linked in: dm_flakey dm_mod crc32c_generic btrfs xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop parport_pc evdev microcode
[48665.197388] CPU: 2 PID: 31211 Comm: kworker/u32:16 Tainted: G        W      3.19.0-rc5-btrfs-next-4+ #1
[48665.197388] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
[48665.197388] Workqueue: events_unbound btrfs_async_reclaim_metadata_space [btrfs]
[48665.197388] task: ffff880222011810 ti: ffff8801b56a4000 task.ti: ffff8801b56a4000
[48665.197388] RIP: 0010:[<ffffffffa0350d05>]  [<ffffffffa0350d05>] unpin_extent_range+0x6a/0x1ba [btrfs]
[48665.197388] RSP: 0018:ffff8801b56a7b88  EFLAGS: 00010246
[48665.197388] RAX: 0000000000000000 RBX: ffff8802143a6000 RCX: ffff8802220120c8
[48665.197388] RDX: 0000000000000001 RSI: 0000000000000001 RDI: ffff8800a3c140b0
[48665.197388] RBP: ffff8801b56a7bd8 R08: 0000000000000003 R09: 0000000000000000
[48665.197388] R10: 0000000000000000 R11: 000000000000bbac R12: 0000000012e8e000
[48665.197388] R13: ffff8800a3c14000 R14: 0000000000000000 R15: 0000000000000000
[48665.197388] FS:  0000000000000000(0000) GS:ffff88023ec40000(0000) knlGS:0000000000000000
[48665.197388] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[48665.197388] CR2: 00007f065e42f270 CR3: 0000000206f70000 CR4: 00000000000006e0
[48665.197388] Stack:
[48665.197388]  ffff8801b56a7bd8 0000000012ea0000 01ff8800a3c14138 0000000012e9ffff
[48665.197388]  ffff880141df3dd8 ffff8802143a6000 ffff8800a3c14138 ffff880141df3df0
[48665.197388]  ffff880141df3dd8 0000000000000000 ffff8801b56a7c08 ffffffffa0354227
[48665.197388] Call Trace:
[48665.197388]  [<ffffffffa0354227>] btrfs_finish_extent_commit+0xb0/0xd9 [btrfs]
[48665.197388]  [<ffffffffa0366b4b>] btrfs_commit_transaction+0x791/0x92c [btrfs]
[48665.197388]  [<ffffffffa0352432>] flush_space+0x43d/0x452 [btrfs]
[48665.197388]  [<ffffffff814295c3>] ? _raw_spin_unlock+0x28/0x33
[48665.197388]  [<ffffffffa035255f>] btrfs_async_reclaim_metadata_space+0x118/0x164 [btrfs]
[48665.197388]  [<ffffffff81059917>] ? process_one_work+0x14b/0x3ab
[48665.197388]  [<ffffffff810599ac>] process_one_work+0x1e0/0x3ab
[48665.197388]  [<ffffffff81079fa9>] ? trace_hardirqs_off+0xd/0xf
[48665.197388]  [<ffffffff8105a55b>] worker_thread+0x210/0x2d0
[48665.197388]  [<ffffffff8105a34b>] ? rescuer_thread+0x2c3/0x2c3
[48665.197388]  [<ffffffff8105e5c0>] kthread+0xef/0xf7
[48665.197388]  [<ffffffff81429682>] ? _raw_spin_unlock_irq+0x2d/0x39
[48665.197388]  [<ffffffff8105e4d1>] ? __kthread_parkme+0xad/0xad
[48665.197388]  [<ffffffff81429dec>] ret_from_fork+0x7c/0xb0
[48665.197388]  [<ffffffff8105e4d1>] ? __kthread_parkme+0xad/0xad
[48665.197388] Code: 85 f6 74 14 49 8b 06 49 03 46 09 49 39 c4 72 1d 4c 89 f7 e8 83 ec ff ff 4c 89 e6 4c 89 ef e8 1e f1 ff ff 48 85 c0 49 89 c6 75 02 <0f> 0b 49 8b 1e 49 03 5e 09 48 8b
[48665.197388] RIP  [<ffffffffa0350d05>] unpin_extent_range+0x6a/0x1ba [btrfs]
[48665.197388]  RSP <ffff8801b56a7b88>
[48665.272246] ---[ end trace b9c6ab9957521376 ]---

Fix this by ensuring that unpinning the block group's range in
btrfs_finish_extent_commit() is done in a synchronized fashion
with removing the block group's range from freed_extents[]
in btrfs_delete_unused_bgs().

This race got introduced with the change:

    Btrfs: remove empty block groups automatically
    commit 47ab2a6c

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>

d4b450cd
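
One way to picture the synchronization described in this fix is a lock that makes "find the range and unpin it" a single step with respect to "remove the range". The sketch below is a userspace model under that assumption; the structure, the lock name and both functions are hypothetical, and the message itself does not say which primitive the patch uses.

```c
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical model: one block group and its entry in freed_extents[]. */
struct demo_fs {
	pthread_mutex_t unpin_lock; /* serializes unpin against range removal */
	bool range_in_freed_extents;
	bool block_group_exists;
};

/*
 * Commit side: finding the range and unpinning it happen under one lock.
 * Returns 1 if a range was unpinned, 0 if there was nothing to do, and
 * -1 for the inconsistent state (range found, group gone) that hit the
 * BUG_ON() in the trace.
 */
static int demo_finish_extent_commit(struct demo_fs *fs)
{
	int ret = 0;

	pthread_mutex_lock(&fs->unpin_lock);
	if (fs->range_in_freed_extents) {
		if (!fs->block_group_exists) {
			ret = -1;
		} else {
			fs->range_in_freed_extents = false; /* unpin */
			ret = 1;
		}
	}
	pthread_mutex_unlock(&fs->unpin_lock);
	return ret;
}

/* Cleaner side: the range is dropped under the same lock, then the group. */
static void demo_delete_unused_bg(struct demo_fs *fs)
{
	pthread_mutex_lock(&fs->unpin_lock);
	fs->range_in_freed_extents = false;
	pthread_mutex_unlock(&fs->unpin_lock);

	fs->block_group_exists = false; /* removal follows the range removal */
}
```

In this model the commit side either sees the range while the block group still exists, or does not see the range at all, so the -1 case cannot be reached whichever side takes the lock first.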

btrfs: add checks for sys_chunk_array sizes · ce7fca5f

Committed by David Sterba on Oct 31, 2014

Verify that the size lies within the possible minimum and maximum; validity
of the contents is checked in btrfs_read_sys_array.

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>

ce7fca5f
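
A bounds check of the kind described above might look like the following sketch. The three constants are assumptions chosen for illustration (the real limits come from the on-disk format definitions), and `sys_array_size_ok()` is a hypothetical name.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative limits only; all three values are assumptions. */
#define DEMO_SYS_ARRAY_MAX 2048u /* capacity of the array in the superblock */
#define DEMO_KEY_SIZE      17u   /* size of one disk key */
#define DEMO_CHUNK_SIZE    80u   /* size of one chunk item with one stripe */

/*
 * A usable array must hold at least one key + chunk pair and must fit
 * into the space reserved for it. The contents are not examined here.
 */
static bool sys_array_size_ok(uint32_t size)
{
	if (size > DEMO_SYS_ARRAY_MAX)
		return false;
	if (size < DEMO_KEY_SIZE + DEMO_CHUNK_SIZE)
		return false;
	return true;
}
```

Rejecting an out-of-range size up front means the later parsing loop never has to walk past the end of the array or start from an empty one.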

btrfs: more superblock checks, lower bounds on devices and sectorsize/nodesize · 75d6ad38

Committed by David Sterba on Oct 31, 2014

I received a few crafted images from Jiri, all got through the recently
added superblock checks. The lower bounds checks for num_devices and
sector/node sizes were missing and caused a crash during mount.

Tools for symbolic code execution were used to prepare the image
contents.

Reported-by: Jiri Slaby <jslaby@suse.cz>
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>

75d6ad38

22 Jan, 2015 5 commits

Btrfs: fix unused members in struct btrfs_root · 78f55e5e

Committed by Anand Jain on Jan 13, 2015

There isn't any real use of the following members of struct btrfs_root
so delete them.

struct kobject root_kobj;
struct completion kobj_unregister;

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>

78f55e5e

btrfs: set proper message level for skinny metadata · 5efa0490

Committed by David Sterba on Dec 19, 2014

This has been confusing people for too long, the message is really just
informative.

CC: <stable@vger.kernel.org> # 3.10+
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>

5efa0490

btrfs: update message levels after checksum errors · f0954c66

Committed by David Sterba on Dec 19, 2014

The errors are worth noting and might get missed with INFO level.

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>

f0954c66

btrfs: update message levels during failed mount · aa8ee312

Committed by David Sterba on Dec 19, 2014

All error conditions from open_ctree shall be ERR. Warning would
suggest that something's wrong and we can continue.

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>

aa8ee312

btrfs: update message levels for errors · 68b663d1

Committed by David Sterba on Dec 19, 2014

Several messages point to some internal problem; level INFO is
wrong here.

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>

68b663d1

21 Jan, 2015 3 commits

fs: remove default_backing_dev_info · df0ce26c

Committed by Christoph Hellwig on Jan 14, 2015

Now that default_backing_dev_info is not used for writeback purposes we can
get rid of it easily:

 - instead of using its name for tracing unregistered bdi we just use
   "unknown"
 - btrfs and ceph can just assign the default read ahead window themselves
   like several other filesystems already do.
 - we can assign noop_backing_dev_info as the default one in alloc_super.
   All filesystems already either assigned their own or
   noop_backing_dev_info.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>

df0ce26c

fs: remove mapping->backing_dev_info · b83ae6d4

Committed by Christoph Hellwig on Jan 14, 2015

Now that we never use the backing_dev_info pointer in struct address_space
we can simply remove it and save 4 to 8 bytes in every inode.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Reviewed-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>

b83ae6d4

fs: introduce f_op->mmap_capabilities for nommu mmap support · b4caecd4

Committed by Christoph Hellwig on Jan 14, 2015

Since "BDI: Provide backing device capability information [try #3]" the
backing_dev_info structure also provides flags for the kind of mmap
operation available in a nommu environment, which is entirely unrelated
to its original purpose.

Introduce a new nommu-only file operation to provide this information to
the nommu mmap code instead.  Splitting this from the backing_dev_info
structure allows removing lots of backing_dev_info instances that aren't
otherwise needed, and entirely gets rid of the concept of providing a
backing_dev_info for a character device.  It also removes the need for
the mtd_inodefs filesystem.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Tejun Heo <tj@kernel.org>
Acked-by: Brian Norris <computersforpeace@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>

b4caecd4

15 Jan, 2015 2 commits

btrfs: expand btrfs_find_item if found_key is NULL · 1d4c08e0

Committed by David Sterba on Jan 02, 2015

If the found_key is NULL, then btrfs_find_item becomes a verbose wrapper
for simple btrfs_search_slot.

After we've removed all such callers, passing a NULL key is not valid
anymore.

Signed-off-by: David Sterba <dsterba@suse.cz>

1d4c08e0

btrfs: fix leak of path in btrfs_find_item · 381cf658

Committed by David Sterba on Jan 02, 2015

If btrfs_find_item is called with NULL path it allocates one locally but
does not free it. Affected paths are inserting an orphan item for a file
and for a subvol root.

Move the path allocation to the callers.

CC: <stable@vger.kernel.org> # 3.14+
Fixes: 3f870c28 ("btrfs: expand btrfs_find_item() to include find_orphan_item functionality")
Signed-off-by: David Sterba <dsterba@suse.cz>

381cf658
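
The ownership rule behind this fix, "whoever allocates the path frees it", can be sketched in userspace as below. All names are hypothetical stand-ins for illustration, and the `live_paths` counter exists only to make a leak visible.

```c
#include <stdlib.h>

/* Hypothetical stand-in for a search path object. */
struct demo_path {
	int slot;
};

static int live_paths; /* allocations that have not been freed yet */

static struct demo_path *demo_alloc_path(void)
{
	struct demo_path *p = calloc(1, sizeof(*p));

	if (p)
		live_paths++;
	return p;
}

static void demo_free_path(struct demo_path *p)
{
	if (p) {
		live_paths--;
		free(p);
	}
}

/* The helper never allocates: it only uses the path it is given. */
static int demo_find_item(struct demo_path *path, int key)
{
	if (!path)
		return -1; /* a NULL path is a caller bug, not a request to allocate */
	path->slot = key;
	return 0;
}

/* The caller owns the path for its whole lifetime, so nothing can leak. */
static int demo_insert_orphan(int key)
{
	struct demo_path *path = demo_alloc_path();
	int ret;

	if (!path)
		return -12; /* stands in for -ENOMEM */
	ret = demo_find_item(path, key);
	demo_free_path(path);
	return ret;
}
```

Keeping allocation and release in the same function makes the pairing easy to audit, whereas a helper that allocates "on demand" hides the allocation from the code that would have to free it.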

13 Dec, 2014 4 commits

btrfs: sink parameter len to alloc_extent_buffer · ce3e6984

Committed by David Sterba on Jun 15, 2014

Because we're using the globally known nodesize. Do the same for the sanity
test function variant.

Signed-off-by: David Sterba <dsterba@suse.cz>

ce3e6984

btrfs: sink blocksize parameter to btrfs_find_create_tree_block · a83fffb7

Committed by David Sterba on Jun 15, 2014

Finally it's clear that the requested blocksize is always equal to
nodesize, with one exception, the superblock.

Superblock has fixed size regardless of the metadata block size, but
uses the same helpers to initialize sys array/chunk tree and to work
with the chunk items. So it pretends to be an extent_buffer for a
moment, btrfs_read_sys_array is full of special cases, we're adding one
more.

Signed-off-by: David Sterba <dsterba@suse.cz>

a83fffb7

btrfs: sink blocksize parameter to reada_tree_block_flagged · c0dcaa4d

Committed by David Sterba on Jun 15, 2014

Signed-off-by: David Sterba <dsterba@suse.cz>

c0dcaa4d

btrfs: sink blocksize parameter to readahead_tree_block · d3e46fea

Committed by David Sterba on Jun 15, 2014

All callers pass nodesize.

Signed-off-by: David Sterba <dsterba@suse.cz>

d3e46fea

11 Dec, 2014 1 commit

Btrfs: fix fs corruption on transaction abort if device supports discard · 678886bd

Committed by Filipe Manana on Dec 07, 2014

When we abort a transaction we iterate over all the ranges marked as dirty
in fs_info->freed_extents[0] and fs_info->freed_extents[1], clear them
from those trees, add them back (unpin) to the free space caches and, if
the fs was mounted with "-o discard", perform a discard on those regions.
Also, after adding the regions to the free space caches, a fitrim ioctl call
can see those ranges in a block group's free space cache and perform a discard
on the ranges, so the same issue can happen without "-o discard" as well.

This causes corruption, affecting one or multiple btree nodes (in the worst
case leaving the fs unmountable) because some of those ranges (the ones in
the fs_info->pinned_extents tree) correspond to btree nodes/leafs that are
referred to by the last committed super block - breaking the rule that anything
that was committed by a transaction is untouched until the next transaction
commits successfully.

I ran into this while running in a loop (for several hours) the fstest that
I recently submitted:

  [PATCH] fstests: add btrfs test to stress chunk allocation/removal and fstrim

The corruption always happened when a transaction aborted and then fsck complained
like this:

   _check_btrfs_filesystem: filesystem on /dev/sdc is inconsistent
   *** fsck.btrfs output ***
   Check tree block failed, want=94945280, have=0
   Check tree block failed, want=94945280, have=0
   Check tree block failed, want=94945280, have=0
   Check tree block failed, want=94945280, have=0
   Check tree block failed, want=94945280, have=0
   read block failed check_tree_block
   Couldn't open file system

In this case 94945280 corresponded to the root of a tree.
Using ftrace, I observed the following sequence of steps:

   1) transaction N started, fs_info->pinned_extents pointed to
      fs_info->freed_extents[0];

   2) node/eb 94945280 is created;

   3) eb is persisted to disk;

   4) transaction N commit starts, fs_info->pinned_extents now points to
      fs_info->freed_extents[1], and transaction N completes;

   5) transaction N + 1 starts;

   6) eb is COWed, and btrfs_free_tree_block() called for this eb;

   7) eb range (94945280 to 94945280 + 16Kb) is added to
      fs_info->pinned_extents (fs_info->freed_extents[1]);

   8) Something goes wrong in transaction N + 1, like hitting ENOSPC
      for example, and the transaction is aborted, turning the fs into
      readonly mode. The stack trace I got for example:

      [112065.253935]  [<ffffffff8140c7b6>] dump_stack+0x4d/0x66
      [112065.254271]  [<ffffffff81042984>] warn_slowpath_common+0x7f/0x98
      [112065.254567]  [<ffffffffa0325990>] ? __btrfs_abort_transaction+0x50/0x10b [btrfs]
      [112065.261674]  [<ffffffff810429e5>] warn_slowpath_fmt+0x48/0x50
      [112065.261922]  [<ffffffffa032949e>] ? btrfs_free_path+0x26/0x29 [btrfs]
      [112065.262211]  [<ffffffffa0325990>] __btrfs_abort_transaction+0x50/0x10b [btrfs]
      [112065.262545]  [<ffffffffa036b1d6>] btrfs_remove_chunk+0x537/0x58b [btrfs]
      [112065.262771]  [<ffffffffa033840f>] btrfs_delete_unused_bgs+0x1de/0x21b [btrfs]
      [112065.263105]  [<ffffffffa0343106>] cleaner_kthread+0x100/0x12f [btrfs]
      (...)
      [112065.264493] ---[ end trace dd7903a975a31a08 ]---
      [112065.264673] BTRFS: error (device sdc) in btrfs_remove_chunk:2625: errno=-28 No space left
      [112065.264997] BTRFS info (device sdc): forced readonly

   9) The cleaner kthread sees that the BTRFS_FS_STATE_ERROR bit is set in
      fs_info->fs_state and calls btrfs_cleanup_transaction(), which in
      turn calls btrfs_destroy_pinned_extent();

   10) Then btrfs_destroy_pinned_extent() iterates over all the ranges
       marked as dirty in fs_info->freed_extents[], and for each one
       it calls discard, if the fs was mounted with "-o discard", and
       adds the range to the free space cache of the respective block
       group;

   11) btrfs_trim_block_group(), invoked from the fitrim ioctl code path,
       sees the free space entries and performs a discard;

   12) After an umount and mount (or fsck), our eb's location on disk was full
       of zeroes, and it should have been untouched, because it was marked as
       dirty in the fs_info->pinned_extents tree, and therefore used by the
       trees that the last committed superblock points to.

Fix this by not performing a discard and not adding the ranges to the free space
caches - it's useless from this point since the fs is now in readonly mode and
we won't write free space caches to disk anymore (otherwise we would leak space)
nor any new superblock. Not adding the ranges to the free space caches also
prevents other code paths from allocating that space and writing to it,
therefore being safer and simpler.

This isn't a new problem, as it's been present since 2011 (git commit
acce952b).

Cc: stable@vger.kernel.org  # any kernel released after 2011-01-06
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>

678886bd

03 Dec, 2014 1 commit

Btrfs: fix race between fs trimming and block group remove/allocation · 04216820

Committed by Filipe Manana on Nov 27, 2014

Our fs trim operation, which is completely transactionless (it doesn't start
or join an existing transaction), consists of visiting all block groups
and, for each one, iterating over its free space entries and performing a
discard operation against the space range represented by the free space
entries. However before performing a discard, the corresponding free space
entry is removed from the free space rbtree, and when the discard completes
it is added back to the free space rbtree.

If a block group remove operation happens while the discard is ongoing (or
before it starts and after a free space entry is hidden), we end up not
waiting for the discard to complete, remove the extent map that maps
logical address to physical addresses and the corresponding chunk metadata
from the chunk and device trees. After that and before the discard
completes, the current running transaction can finish and a new one start,
allowing for new block groups that map to the same physical addresses to
be allocated and written to.

So fix this by keeping the extent map in memory until the discard completes
so that the same physical addresses aren't reused before it completes.

If the physical locations that are under a discard operation end up being
used for a new metadata block group for example, and dirty metadata extents
are written before the discard finishes (the VM might call writepages() of
our btree inode's i_mapping for example, or an fsync log commit happens) we
end up overwriting metadata with zeroes, which leads to errors from fsck
like the following:

        checking extents
        Check tree block failed, want=833912832, have=0
        Check tree block failed, want=833912832, have=0
        Check tree block failed, want=833912832, have=0
        Check tree block failed, want=833912832, have=0
        Check tree block failed, want=833912832, have=0
        read block failed check_tree_block
        owner ref check failed [833912832 16384]
        Errors found in extent allocation tree or chunk allocation
        checking free space cache
        checking fs roots
        Check tree block failed, want=833912832, have=0
        Check tree block failed, want=833912832, have=0
        Check tree block failed, want=833912832, have=0
        Check tree block failed, want=833912832, have=0
        Check tree block failed, want=833912832, have=0
        read block failed check_tree_block
        root 5 root dir 256 error
        root 5 inode 260 errors 2001, no inode item, link count wrong
                unresolved ref dir 256 index 0 namelen 8 name foobar_3 filetype 1 errors 6, no dir index, no inode ref
        root 5 inode 262 errors 2001, no inode item, link count wrong
                unresolved ref dir 256 index 0 namelen 8 name foobar_5 filetype 1 errors 6, no dir index, no inode ref
        root 5 inode 263 errors 2001, no inode item, link count wrong
        (...)

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>

04216820
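
The idea of keeping the mapping alive until the discard completes can be modelled with a counter of in-flight trims. The sketch below is a single-threaded illustration with hypothetical names; real code would also need locking around these fields, which is omitted here.

```c
#include <stdbool.h>

/* Hypothetical model of a block group whose mapping a trim may be using. */
struct demo_bg {
	int trimming;         /* number of discards in flight */
	bool removed;         /* the block group was deleted */
	bool mapping_present; /* logical-to-physical mapping still in memory */
};

static void demo_trim_start(struct demo_bg *bg)
{
	bg->trimming++;
}

/* Removal drops the mapping only if no discard is running. */
static void demo_remove_bg(struct demo_bg *bg)
{
	bg->removed = true;
	if (bg->trimming == 0)
		bg->mapping_present = false;
}

/* The last trimmer to finish drops a mapping that removal left behind. */
static void demo_trim_end(struct demo_bg *bg)
{
	bg->trimming--;
	if (bg->trimming == 0 && bg->removed)
		bg->mapping_present = false;
}

/* The physical range may back a new block group only once the mapping is gone. */
static bool demo_range_reusable(const struct demo_bg *bg)
{
	return !bg->mapping_present;
}
```

While a discard is in flight, a removed block group still holds its mapping, so the same physical addresses cannot be handed out again until the discard has finished.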

OpenHarmony / kernel_linux, last synced more than 3 years ago