- 17 February 2017, 1 commit
-
-
By David Sterba
Never used. Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
- 06 December 2016, 5 commits
-
-
By Jeff Mahoney
Now we only use the root parameter to print the root objectid in a tracepoint; we can use the root from the transaction handle for that. It's also used to join the transaction with async commits, so we remove the comment that it's just for checking. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
By Jeff Mahoney
There are loads of functions in btrfs that accept a root parameter but only use it to obtain an fs_info pointer. Let's convert those to just accept an fs_info pointer directly. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
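As a rough illustration of the kind of conversion described above (a minimal standalone sketch with made-up struct and function names, not the actual btrfs code), a helper that only dereferences root->fs_info can simply take the fs_info pointer directly:

```c
#include <stdio.h>

/* Simplified stand-ins for the kernel structures; names are illustrative only. */
struct demo_fs_info { unsigned int sectorsize; };
struct demo_root    { struct demo_fs_info *fs_info; };

/* Before: takes a root but only uses it to reach fs_info. */
static unsigned int sectorsize_via_root(struct demo_root *root)
{
	return root->fs_info->sectorsize;
}

/* After: takes the fs_info pointer directly, which is what it really needs. */
static unsigned int sectorsize_via_fs_info(struct demo_fs_info *fs_info)
{
	return fs_info->sectorsize;
}

int main(void)
{
	struct demo_fs_info fs_info = { .sectorsize = 4096 };
	struct demo_root root = { .fs_info = &fs_info };

	printf("%u %u\n", sectorsize_via_root(&root),
	       sectorsize_via_fs_info(&fs_info));
	return 0;
}
```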
-
By Jeff Mahoney
In routines where someptr->fs_info is referenced multiple times, we introduce a convenience variable. This makes the code considerably more readable. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
By Jeff Mahoney
We track the node sizes per-root, but they never vary from the values in the superblock. This patch messes with the 80-column style a bit, but the subsequent patches that factor out root->fs_info into a convenience variable fix it up again. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
By Jeff Mahoney
Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
- 29 November 2016, 1 commit
-
-
By Christoph Hellwig
btrfs_map_block supports different types of mappings, which to a large extent resemble block layer operations. But they don't always do, and currently btrfs dangerously overlays its own flags over the block layer flags. This is just asking for a conflict, so introduce a separate map flags enum inside of btrfs instead. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
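A minimal sketch of the idea (illustrative names only, not the actual enum btrfs introduced): keeping the filesystem's mapping operations in their own enum means they can never collide with, or be silently or'ed into, the block layer's request flags.

```c
#include <stdio.h>

/* Illustrative stand-ins for block layer request ops. */
enum demo_req_op { DEMO_REQ_OP_READ = 0, DEMO_REQ_OP_WRITE = 1 };

/*
 * Hypothetical filesystem-private mapping operations: a separate enum,
 * so the values cannot be mixed into block layer flags by accident.
 */
enum demo_map_op {
	DEMO_MAP_READ,
	DEMO_MAP_WRITE,
	DEMO_MAP_DISCARD,
	DEMO_MAP_GET_READ_MIRRORS,
};

static const char *map_op_name(enum demo_map_op op)
{
	switch (op) {
	case DEMO_MAP_READ:             return "read";
	case DEMO_MAP_WRITE:            return "write";
	case DEMO_MAP_DISCARD:          return "discard";
	case DEMO_MAP_GET_READ_MIRRORS: return "get read mirrors";
	}
	return "unknown";
}

int main(void)
{
	/* A function taking enum demo_map_op documents its intent and lets
	 * the compiler warn when a block layer op is passed instead. */
	printf("%s\n", map_op_name(DEMO_MAP_GET_READ_MIRRORS));
	return 0;
}
```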
-
- 01 November 2016, 1 commit
-
-
By Christoph Hellwig
Remove the WRITE_* and READ_SYNC wrappers, and just use the flags directly. Where applicable this also drops usage of the bio_set_op_attrs wrapper. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
-
- 27 September 2016, 1 commit
-
-
By Jeff Mahoney
CodingStyle chapter 2: "[...] never break user-visible strings such as printk messages, because that breaks the ability to grep for them." This patch unsplits user-visible strings. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
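To illustrate the CodingStyle rule (a generic userspace sketch, not the actual patched btrfs messages): the user-visible string stays on one line even if it exceeds 80 columns, so grepping the source for the full message still finds it.

```c
#include <stdio.h>

static void report_error(const char *dev, int err)
{
	/* Discouraged: the user-visible string is split across two literals,
	 * so grepping for the full sentence in the source fails. */
	fprintf(stderr, "demo: failed to start scrub on device %s: "
			"operation returned error %d\n", dev, err);

	/* Preferred: keep the user-visible string whole, even past 80 columns. */
	fprintf(stderr, "demo: failed to start scrub on device %s: operation returned error %d\n", dev, err);
}

int main(void)
{
	report_error("/dev/sda", -5);
	return 0;
}
```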
-
- 26 July 2016, 2 commits
-
-
By Jeff Mahoney
In order to provide an fsid for trace events, we'll need a btrfs_fs_info pointer. The most lightweight way to do that for btrfs_work structures is to associate it with the __btrfs_workqueue structure. Each queued btrfs_work structure has a workqueue associated with it, so that's a natural fit. It's a privately defined structure, so we add accessors to retrieve the fs_info pointer. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
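A standalone sketch of that association (struct and function names are made up for illustration; the real, private definitions live inside btrfs): the workqueue struct carries the fs_info pointer, and small accessors expose it to code that only holds a work item.

```c
#include <stdio.h>

struct demo_fs_info { unsigned long long fsid; };

/* Privately defined workqueue type: records the owning filesystem. */
struct demo_workqueue {
	struct demo_fs_info *fs_info;
	const char *name;
};

/* Each queued work item points back at its workqueue. */
struct demo_work {
	struct demo_workqueue *wq;
};

/* Accessors, so callers never need to know the private struct layout. */
static struct demo_fs_info *demo_workqueue_owner(const struct demo_workqueue *wq)
{
	return wq->fs_info;
}

static struct demo_fs_info *demo_work_owner(const struct demo_work *work)
{
	return demo_workqueue_owner(work->wq);
}

int main(void)
{
	struct demo_fs_info fs = { .fsid = 0xabcdefULL };
	struct demo_workqueue wq = { .fs_info = &fs, .name = "scrub" };
	struct demo_work work = { .wq = &wq };

	printf("work belongs to fs %llx\n", demo_work_owner(&work)->fsid);
	return 0;
}
```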
-
By Chandan Rajendra
btrfs/073 invokes the scrub ioctl in a tight loop. In the subpage-blocksize scenario this results in a lot of "scrub: size assumption sectorsize != PAGE_SIZE" messages being printed on the console. To reduce the number of such messages this commit uses btrfs_err_rl() instead of btrfs_err(). Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com> Signed-off-by: David Sterba <dsterba@suse.com>
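The _rl variants print at a limited rate instead of once per call. A tiny userspace analogy of the idea (a hand-rolled timer, not the kernel's ratelimit machinery or the btrfs helpers):

```c
#include <stdio.h>
#include <time.h>

/* Print at most once per 'interval' seconds; silently drop the rest. */
static void err_ratelimited(const char *msg, time_t interval)
{
	static time_t last;
	time_t now = time(NULL);

	if (last != 0 && now - last < interval)
		return;
	last = now;
	fprintf(stderr, "%s\n", msg);
}

int main(void)
{
	/* A tight loop like btrfs/073 would otherwise flood the console. */
	for (int i = 0; i < 1000000; i++)
		err_ratelimited("scrub: size assumption sectorsize != PAGE_SIZE", 1);
	return 0;
}
```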
-
- 08 June 2016, 2 commits
-
-
By Mike Christie
These should be the easier cases for converting btrfs to bio_set_op_attrs/bio_op; they are mostly just cut-and-replace type changes. Signed-off-by: Mike Christie <mchristi@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
-
By Mike Christie
This has callers of submit_bio/submit_bio_wait set bio->bi_rw instead of passing it in, which makes the usage consistent with generic_make_request and with how we set the other bio fields. Signed-off-by: Mike Christie <mchristi@redhat.com> Fixed up fs/ext4/crypto.c. Signed-off-by: Jens Axboe <axboe@fb.com>
-
- 30 May 2016, 3 commits
-
-
By Filipe Manana
After it finishes processing a device extent, the device replace code sets the block group back to RW mode and only after that sets the left cursor to match the logical end address of the block group, so that future writes into extents belonging to the block group go to both the source (old) and target (new) devices. However, from the moment we turn the block group back to RW mode we have a short time window, lasting until we update the left cursor's value, where extents can be allocated from the block group and written to, in which case they will not be copied/written to the target (new) device. Fix this by updating the left cursor's value before turning the block group back to RW mode. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Josef Bacik <jbacik@fb.com>
-
By Filipe Manana
We were assigning new values to fields of the device replace object without holding the respective lock after processing each device extent. This is important for the left cursor field, which can be accessed by a concurrent task running __btrfs_map_block (which, correctly, takes the device replace lock). So change these fields while holding the device replace lock. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Josef Bacik <jbacik@fb.com>
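A minimal userspace sketch of the pattern the fix restores (a pthread mutex standing in for the device replace lock; struct and field names are illustrative): both the writer updating the cursors and a concurrent reader take the same lock, so the reader never observes a half-updated pair.

```c
#include <pthread.h>
#include <stdio.h>

/* Illustrative stand-in for the device replace state. */
struct demo_dev_replace {
	pthread_mutex_t lock;
	unsigned long long cursor_left;
	unsigned long long cursor_right;
};

/* Writer: update the cursors only while holding the lock. */
static void advance_cursors(struct demo_dev_replace *dr,
			    unsigned long long left, unsigned long long right)
{
	pthread_mutex_lock(&dr->lock);
	dr->cursor_left = left;
	dr->cursor_right = right;
	pthread_mutex_unlock(&dr->lock);
}

/* Reader (think __btrfs_map_block): takes the same lock before looking. */
static unsigned long long read_cursor_left(struct demo_dev_replace *dr)
{
	pthread_mutex_lock(&dr->lock);
	unsigned long long left = dr->cursor_left;
	pthread_mutex_unlock(&dr->lock);
	return left;
}

int main(void)
{
	struct demo_dev_replace dr;

	pthread_mutex_init(&dr.lock, NULL);
	dr.cursor_left = 0;
	dr.cursor_right = 0;

	advance_cursors(&dr, 1 << 20, 2 << 20);
	printf("cursor_left = %llu\n", read_cursor_left(&dr));

	pthread_mutex_destroy(&dr.lock);
	return 0;
}
```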
-
By Filipe Manana
When we do a device replace, for each device extent we find from the source device we set the corresponding block group to readonly mode, to prevent writes into it from happening while we are copying the device extent from the source to the target device. However, just before we set the block group to readonly mode, some concurrent task might have already allocated an extent from it or decided it could perform a nocow write into one of its extents, which can make the device replace process miss copying an extent, since it uses the extent tree's commit root to search for extents and only once it finishes searching for all extents belonging to the block group does it set the left cursor to the logical end address of the block group. This is a problem if the respective ordered extents finish while we are searching for extents using the extent tree's commit root and no transaction commit happens while we are iterating the tree, since it's the delayed references created by the ordered extents (when they complete) that insert the extent items into the extent tree (using the non-commit root of course). Example:

CPU 1 - device replace task:
  btrfs_dev_replace_start()
    btrfs_scrub_dev()
      scrub_enumerate_chunks()
        --> finds device extent belonging to block group X

CPU 2 - writer task:
  <transaction N starts>
  starts buffered write against some inode
  writepages is run against that inode, forcing delalloc to run
  btrfs_writepages()
    extent_writepages()
      extent_write_cache_pages()
        __extent_writepage()
          writepage_delalloc()
            run_delalloc_range()
              cow_file_range()
                btrfs_reserve_extent()
                  --> allocates an extent from block group X (which is not yet in RO mode)
                btrfs_add_ordered_extent()
                  --> creates ordered extent Y
          flush_epd_write_bio()
            --> bio against the extent from block group X is submitted

CPU 1 - device replace task:
  btrfs_inc_block_group_ro(bg X)
    --> sets block group X to readonly
  scrub_chunk(bg X)
    scrub_stripe(device extent from srcdev)
      --> keeps searching for extent items belonging to the block group using the extent tree's commit root
      --> it never blocks due to fs_info->scrub_pause_req as no one tries to commit transaction N
      --> copies all extents found from the source device into the target device
      --> finishes search loop

CPU 2 - writer task:
  bio completes
  ordered extent Y completes and creates a delayed data reference which will add an extent item to the extent tree when run (typically at transaction commit time)
    --> so the task doing the scrub/device replace at CPU 1 misses this and does not copy this extent into the new/target device

CPU 1 - device replace task:
  btrfs_dec_block_group_ro(bg X)
    --> turns block group X back to RW mode
  dev_replace->cursor_left is set to the logical end offset of block group X

So fix this by waiting for all cow and nocow writes after setting a block group to readonly mode. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Josef Bacik <jbacik@fb.com>
-
- 26 May 2016, 2 commits
-
-
By Nicholas D Steeves
Signed-off-by: Nicholas D Steeves <nsteeves@gmail.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
By Zhao Lei
We usually call btrfs_put_bbio() when btrfs_map_block() fails, and btrfs_put_bbio() works correctly whether bbio is a valid value or NULL. But there is an exception: in some cases btrfs_map_block() returns failure without touching *bbio (keeping its original value), and if bbio was not initialized yet, invalid memory access will happen. The above case is in scrub_missing_raid56_pages(), with a similar case in scrub_raid56_parity(). Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>
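A small sketch of the defensive pattern this fix relies on (generic C with hypothetical names, not the btrfs code): initialize the out-pointer to NULL so that a NULL-tolerant put/free stays safe even when the callee fails before ever touching it.

```c
#include <stdio.h>
#include <stdlib.h>

struct demo_bio_ctx { int nr_stripes; };

/* NULL-tolerant release helper, in the spirit of btrfs_put_bbio(). */
static void demo_put_ctx(struct demo_bio_ctx *ctx)
{
	if (!ctx)
		return;
	free(ctx);
}

/* May fail early without ever touching *ctx_ret. */
static int demo_map_block(unsigned long long len, struct demo_bio_ctx **ctx_ret)
{
	if (len == 0)
		return -1;	/* early failure: *ctx_ret is left untouched */
	*ctx_ret = calloc(1, sizeof(**ctx_ret));
	return *ctx_ret ? 0 : -1;
}

int main(void)
{
	/* Initializing to NULL makes the error path safe; an uninitialized
	 * pointer here would be exactly the bug described above. */
	struct demo_bio_ctx *ctx = NULL;

	if (demo_map_block(0, &ctx) < 0) {
		demo_put_ctx(ctx);	/* fine: ctx is NULL, not garbage */
		return 1;
	}
	demo_put_ctx(ctx);
	return 0;
}
```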
-
- 16 May 2016, 1 commit
-
-
By Scott Talbert
A 'struct bio' is allocated in scrub_missing_raid56_pages(), but it was never freed anywhere. Signed-off-by: Scott Talbert <scott.talbert@hgst.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
- 06 May 2016, 3 commits
-
-
By Ashish Samant
The pagev array in scrub_block{} is of size SCRUB_MAX_PAGES_PER_BLOCK, so page_index should be checked against the same limit in the BUG_ON(). Signed-off-by: Ashish Samant <ashish.samant@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
By Liu Bo
The struct 'map_lookup' uses type int for @stripe_len, while btrfs_chunk_stripe_len() can return a u64 value, so @stripe_len may end up with an undefined value, which can lead to a 'divide error' in __btrfs_map_block(). This changes 'map_lookup' to use type u64 for stripe_len. Also, since right now we only use BTRFS_STRIPE_LEN for stripe_len, this adds a validity check for BTRFS_STRIPE_LEN. Reported-by: Vegard Nossum <vegard.nossum@oracle.com> Reported-by: Quentin Casasnovas <quentin.casasnovas@oracle.com> Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> [ folded division fix to scrub_raid56_parity ] Signed-off-by: David Sterba <dsterba@suse.com>
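A compact illustration of why the type matters (generic C, not the btrfs structures; the constant and struct names are invented): narrowing a 64-bit stripe length into an int can silently produce zero or a nonsense value, and a later division by it misbehaves, while widening the field and validating the expected constant avoids that.

```c
#include <stdint.h>
#include <stdio.h>

#define DEMO_STRIPE_LEN (64ULL * 1024)	/* the only stripe length we expect */

struct demo_map_bad  { int      stripe_len; };	/* before: int */
struct demo_map_good { uint64_t stripe_len; };	/* after:  u64 */

int main(void)
{
	uint64_t from_disk = 1ULL << 32;	/* bogus on-disk value */

	struct demo_map_bad  bad  = { .stripe_len = (int)from_disk };
	struct demo_map_good good = { .stripe_len = from_disk };

	/* On typical ABIs the narrowing turns the bogus value into 0; a later
	 * "offset / stripe_len" would then trap with a divide error. */
	printf("truncated stripe_len = %d\n", bad.stripe_len);

	/* With the full-width field we can simply validate and reject. */
	if (good.stripe_len != DEMO_STRIPE_LEN)
		fprintf(stderr, "invalid stripe_len %llu, rejecting chunk\n",
			(unsigned long long)good.stripe_len);
	return 0;
}
```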
-
By David Sterba
The key variable occupies 17 bytes and key_start is used once, so we can simply reuse the existing 'key' for that purpose. As the key is not a simple type, the compiler does not do this by itself. Signed-off-by: David Sterba <dsterba@suse.com>
-
- 29 April 2016, 2 commits
-
-
By David Sterba
Callers pass GFP_NOFS and GFP_KERNEL. No need to pass the flags around. Signed-off-by: David Sterba <dsterba@suse.com>
-
By David Sterba
All callers pass GFP_NOFS. Signed-off-by: David Sterba <dsterba@suse.com>
-
- 05 April 2016, 1 commit
-
-
By Kirill A. Shutemov
The PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced a *long* time ago with the promise that one day it would be possible to implement the page cache with bigger chunks than PAGE_SIZE. This promise never materialized, and likely never will. We have many places where PAGE_CACHE_SIZE is assumed to be equal to PAGE_SIZE, and it's a constant source of confusion whether PAGE_CACHE_* or PAGE_* constants should be used in a particular case, especially on the border between fs and mm. Globally switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause too much breakage to be doable. Let's stop pretending that pages in the page cache are special. They are not. The changes are pretty straightforward:

- <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
- page_cache_get() -> get_page();
- page_cache_release() -> put_page();

This patch contains automated changes generated with coccinelle using the script below. For some reason coccinelle doesn't patch header files, so I've called spatch for them manually. The only adjustment after coccinelle is a revert of the changes to the PAGE_CACHE_ALIGN definition: we are going to drop it later. There are a few places in the code where coccinelle didn't reach; I'll fix them manually in a separate patch. Comments and documentation will also be addressed in a separate patch.

virtual patch

@@
expression E;
@@
- E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E

@@
expression E;
@@
- E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E

@@
@@
- PAGE_CACHE_SHIFT
+ PAGE_SHIFT

@@
@@
- PAGE_CACHE_SIZE
+ PAGE_SIZE

@@
@@
- PAGE_CACHE_MASK
+ PAGE_MASK

@@
expression E;
@@
- PAGE_CACHE_ALIGN(E)
+ PAGE_ALIGN(E)

@@
expression E;
@@
- page_cache_get(E)
+ get_page(E)

@@
expression E;
@@
- page_cache_release(E)
+ put_page(E)

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
- 12 March 2016, 1 commit
-
-
By Dan Carpenter
It's basically harmless if "ref_level" isn't initialized, since it's only used for an error message, but it causes a static checker warning. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
- 23 February 2016, 1 commit
-
-
By Liu Bo
Xfstests btrfs/011 complains about a deadlock warning:

[ 1226.649039] =========================================================
[ 1226.649039] [ INFO: possible irq lock inversion dependency detected ]
[ 1226.649039] 4.1.0+ #270 Not tainted
[ 1226.649039] ---------------------------------------------------------
[ 1226.652955] kswapd0/46 just changed the state of lock:
[ 1226.652955] (&delayed_node->mutex){+.+.-.}, at: [<ffffffff81458735>] __btrfs_release_delayed_node+0x45/0x1d0
[ 1226.652955] but this lock took another, RECLAIM_FS-unsafe lock in the past:
[ 1226.652955] (&fs_info->dev_replace.lock){+.+.+.}

and interrupts could create inverse lock ordering between them.

[ 1226.652955] other info that might help us debug this:
[ 1226.652955] Chain exists of: &delayed_node->mutex --> &found->groups_sem --> &fs_info->dev_replace.lock
[ 1226.652955] Possible interrupt unsafe locking scenario:
[ 1226.652955]        CPU0                          CPU1
[ 1226.652955]        ----                          ----
[ 1226.652955]   lock(&fs_info->dev_replace.lock);
[ 1226.652955]                                      local_irq_disable();
[ 1226.652955]                                      lock(&delayed_node->mutex);
[ 1226.652955]                                      lock(&found->groups_sem);
[ 1226.652955]   <Interrupt>
[ 1226.652955]     lock(&delayed_node->mutex);
[ 1226.652955] *** DEADLOCK ***

Commit 084b6e7c ("btrfs: Fix a lockdep warning when running xfstest.") tried to fix a similar one that has exactly the same warning, but with that we still run into this one. The above lock chain comes from:

btrfs_commit_transaction
  ->btrfs_run_delayed_items
    ...
    ->__btrfs_update_delayed_inode
      ...
      ->__btrfs_cow_block
        ...
        ->find_free_extent
          ->cache_block_group
            ->load_free_space_cache
              ->btrfs_readpages
                ->submit_one_bio
                  ...
                  ->__btrfs_map_block
                    ->btrfs_dev_replace_lock

However, with high memory pressure, tasks which hold dev_replace.lock can be interrupted by kswapd, and kswapd then tries to release memory occupied by superblocks, inodes and dentries, where we may call evict_inode, and it comes to:

[ 1226.652955] [<ffffffff81458735>] __btrfs_release_delayed_node+0x45/0x1d0
[ 1226.652955] [<ffffffff81459e74>] btrfs_remove_delayed_node+0x24/0x30
[ 1226.652955] [<ffffffff8140c5fe>] btrfs_evict_inode+0x34e/0x700

delayed_node->mutex may be acquired in __btrfs_release_delayed_node(), and this leads to an ABBA deadlock. To fix this, we can use the "blocking rwlock" approach used in the case of extent_buffer, but things are simpler here since we only need to turn the read side's spinlock into a blocking lock. With this, btrfs/011 no longer produces warnings in dmesg. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
- 11 February 2016, 1 commit
-
-
By David Sterba
Scrub is not on the critical writeback path, so we don't need to use GFP_NOFS for all allocations. The failures are handled and stats are passed back to userspace. Let's use GFP_KERNEL on the paths where everything is ok, i.e. setting up the global structures and the IO submission paths. Functions that do the repair and fixups still use GFP_NOFS, as we might want to skip any other filesystem activity if we encounter an error. This could turn out to be unnecessary, but requires more review compared to the easy cases in this patch. Signed-off-by: David Sterba <dsterba@suse.com>
-
- 23 January 2016, 1 commit
-
-
By Al Viro
Parallel to mutex_{lock,unlock,trylock,is_locked,lock_nested}, inode_foo(inode) is mutex_foo(&inode->i_mutex). Please use those for access to ->i_mutex; over the coming cycle ->i_mutex will become an rwsem, with ->lookup() done with it held only shared. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
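A userspace sketch of what such wrappers buy (a pthread mutex standing in for i_mutex; the struct and helper names below are invented for illustration): callers say what they mean, lock the inode, rather than how it is currently implemented, so the underlying primitive can change later without touching every call site.

```c
#include <pthread.h>
#include <stdio.h>

struct demo_inode {
	pthread_mutex_t i_mutex;	/* today a mutex; tomorrow maybe an rwsem */
	long i_size;
};

/* Wrappers in the spirit of inode_lock()/inode_unlock(): call sites no longer
 * spell out &inode->i_mutex, so the lock type can change in one place. */
static void demo_inode_lock(struct demo_inode *inode)
{
	pthread_mutex_lock(&inode->i_mutex);
}

static void demo_inode_unlock(struct demo_inode *inode)
{
	pthread_mutex_unlock(&inode->i_mutex);
}

int main(void)
{
	struct demo_inode inode;

	pthread_mutex_init(&inode.i_mutex, NULL);

	demo_inode_lock(&inode);
	inode.i_size = 4096;
	demo_inode_unlock(&inode);

	printf("i_size = %ld\n", inode.i_size);
	pthread_mutex_destroy(&inode.i_mutex);
	return 0;
}
```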
-
- 20 January 2016, 1 commit
-
-
By Zhao Lei
The current code tries to calculate rbio->dbitmap's size so that it is aligned to sizeof(long), but the implementation doesn't achieve this; it is aligned to sizeof(char) instead. This patch fixes the above calculation and uses sizeof(long) instead of a hard-coded "8" to improve portability. Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
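The arithmetic involved, as a standalone sketch (not the btrfs code itself): rounding the bit count up to whole bytes gives a char-aligned length, while the intended calculation rounds up to whole longs and then converts to bytes.

```c
#include <limits.h>
#include <stdio.h>

#define BITS_PER_BYTE CHAR_BIT
#define BITS_PER_LONG (sizeof(long) * CHAR_BIT)
#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

int main(void)
{
	unsigned int nr_bits = 100;	/* e.g. stripe pages to track */

	/* Byte-aligned: what the old calculation effectively produced. */
	size_t bytes_char_aligned = DIV_ROUND_UP(nr_bits, BITS_PER_BYTE);

	/* Long-aligned: what a bitmap of unsigned longs actually needs,
	 * a whole number of longs, expressed in bytes. */
	size_t bytes_long_aligned =
		DIV_ROUND_UP(nr_bits, BITS_PER_LONG) * sizeof(long);

	printf("%u bits -> %zu bytes (char aligned), %zu bytes (long aligned)\n",
	       nr_bits, bytes_char_aligned, bytes_long_aligned);
	return 0;
}
```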
-
- 16 January 2016, 1 commit
-
-
By Jeff Mahoney
Overloading extent_map->bdev to hold a struct map_lookup * might have started out as a means to an end, but it's a pattern that's used all over the place now. Let's get rid of the casting and just add a union instead. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
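A condensed sketch of the before/after (struct and field names are illustrative, not the exact btrfs definitions): instead of smuggling a map_lookup pointer through a block-device field with casts, the struct carries a union so each use is typed.

```c
#include <stdio.h>

struct demo_block_device { int dev_id; };
struct demo_map_lookup   { unsigned long long stripe_len; };

/* Before: a single pointer field, with callers casting back and forth. */
struct demo_extent_map_old {
	struct demo_block_device *bdev;	/* sometimes really a map_lookup *... */
};

/* After: a union makes both uses explicit and removes the casts. */
struct demo_extent_map_new {
	union {
		struct demo_block_device *bdev;
		struct demo_map_lookup *map_lookup;
	};
};

int main(void)
{
	struct demo_map_lookup map = { .stripe_len = 65536 };

	/* The overload being removed: store via a cast, read via a cast. */
	struct demo_extent_map_old old_em = {
		.bdev = (struct demo_block_device *)&map
	};
	struct demo_extent_map_new new_em = { .map_lookup = &map };

	printf("old (cast back): %llu, new: %llu\n",
	       ((struct demo_map_lookup *)old_em.bdev)->stripe_len,
	       new_em.map_lookup->stripe_len);
	return 0;
}
```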
-
- 07 January 2016, 3 commits
-
-
By David Sterba
Replace the integers by enums for better readability. The value 2 does not have any meaning since a7175319 "Btrfs: do less aggressive btree readahead" (2009-01-22). Signed-off-by: David Sterba <dsterba@suse.com>
-
By David Sterba
Signed-off-by: David Sterba <dsterba@suse.com>
-
By David Sterba
The helper btrfs_alloc_workqueue will add the "btrfs-" prefix. Signed-off-by: David Sterba <dsterba@suse.com>
-
- 03 December 2015, 1 commit
-
-
By David Sterba
We've always passed 0. Stack usage will slightly decrease. Signed-off-by: David Sterba <dsterba@suse.com>
-
- 25 November 2015, 3 commits
-
-
By Filipe Manana
Currently scrub can race with the cleaner kthread when the latter attempts to delete an unused block group, and the result is preventing the cleaner kthread from ever deleting the block group later - unless the block group becomes used and unused again. The following diagram illustrates that race:

CPU 1 - cleaner kthread:
  btrfs_delete_unused_bgs()
    gets block group X from fs_info->unused_bgs and removes it from that list

CPU 2 - scrub:
  scrub_enumerate_chunks()
    searches device tree using its commit root
    finds device extent for block group X
    gets block group X from the tree fs_info->block_group_cache_tree (via btrfs_lookup_block_group())
    sets bg X to RO

CPU 1 - cleaner kthread:
  sees the block group is already RO and therefore doesn't delete it nor adds it back to the unused list

So fix this by making scrub add the block group again to the list of unused block groups if the block group is still unused when it finishes scrubbing it and it hasn't been removed already. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
-
By Filipe Manana
Scrub can race with the cleaner kthread deleting block groups that are unused (and with relocation too), leading to a failure with error -EINVAL that gets returned to user space. The following diagram illustrates how it happens:

CPU 1 - cleaner kthread:
  btrfs_delete_unused_bgs()
    gets block group X from fs_info->unused_bgs
    sets block group to RO
    btrfs_remove_chunk(bg X)
      deletes device extents

CPU 2 - scrub:
  scrub_enumerate_chunks()
    searches device tree using its commit root
    finds device extent for block group X
    gets block group X from the tree fs_info->block_group_cache_tree (via btrfs_lookup_block_group())
    sets bg X to RO (again)

CPU 1 - cleaner kthread:
    btrfs_remove_block_group(bg X)
      deletes block group from fs_info->block_group_cache_tree
      removes extent map from fs_info->mapping_tree

CPU 2 - scrub:
  scrub_chunk(offset X)
    searches fs_info->mapping_tree for extent map starting at offset X
      --> doesn't find any such extent map
      --> returns -EINVAL and scrub errors out to userspace with -EINVAL

Fix this by treating an extent map lookup failure as an indicator of block group deletion. Issue reproduced with fstest btrfs/071. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
-
By Zhaolei
xfstests/011 failed on a node with a small_size filesystem. It can be reproduced with the following script:

DEV_LIST="/dev/vdd /dev/vde"
DEV_REPLACE="/dev/vdf"

do_test()
{
    local mkfs_opt="$1"
    local size="$2"

    dmesg -c >/dev/null
    umount $SCRATCH_MNT &>/dev/null
    echo mkfs.btrfs -f $mkfs_opt "${DEV_LIST[*]}"
    mkfs.btrfs -f $mkfs_opt "${DEV_LIST[@]}" || return 1
    mount "${DEV_LIST[0]}" $SCRATCH_MNT

    echo -n "Writing big files"
    dd if=/dev/urandom of=$SCRATCH_MNT/t0 bs=1M count=1 >/dev/null 2>&1
    for ((i = 1; i <= size; i++)); do
        echo -n .
        /bin/cp $SCRATCH_MNT/t0 $SCRATCH_MNT/t$i || return 1
    done
    echo

    echo Start replace
    btrfs replace start -Bf "${DEV_LIST[0]}" "$DEV_REPLACE" $SCRATCH_MNT || {
        dmesg
        return 1
    }
    return 0
}

# Set size to a value near the fs size;
# for example, 1897 can trigger this bug on a 2.6G device.
#
./do_test "-d raid1 -m raid1" 1897

The system reports a replace failure with the following warning in dmesg:

[ 134.710853] BTRFS: dev_replace from /dev/vdd (devid 1) to /dev/vdf started
[ 135.542390] BTRFS: btrfs_scrub_dev(/dev/vdd, 1, /dev/vdf) failed -28
[ 135.543505] ------------[ cut here ]------------
[ 135.544127] WARNING: CPU: 0 PID: 4080 at fs/btrfs/dev-replace.c:428 btrfs_dev_replace_start+0x398/0x440()
[ 135.545276] Modules linked in:
[ 135.545681] CPU: 0 PID: 4080 Comm: btrfs Not tainted 4.3.0 #256
[ 135.546439] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.2-0-g33fbe13 by qemu-project.org 04/01/2014
[ 135.547798] ffffffff81c5bfcf ffff88003cbb3d28 ffffffff817fe7b5 0000000000000000
[ 135.548774] ffff88003cbb3d60 ffffffff810a88f1 ffff88002b030000 00000000ffffffe4
[ 135.549774] ffff88003c080000 ffff88003c082588 ffff88003c28ab60 ffff88003cbb3d70
[ 135.550758] Call Trace:
[ 135.551086] [<ffffffff817fe7b5>] dump_stack+0x44/0x55
[ 135.551737] [<ffffffff810a88f1>] warn_slowpath_common+0x81/0xc0
[ 135.552487] [<ffffffff810a89e5>] warn_slowpath_null+0x15/0x20
[ 135.553211] [<ffffffff81448c88>] btrfs_dev_replace_start+0x398/0x440
[ 135.554051] [<ffffffff81412c3e>] btrfs_ioctl+0x1d2e/0x25c0
[ 135.554722] [<ffffffff8114c7ba>] ? __audit_syscall_entry+0xaa/0xf0
[ 135.555506] [<ffffffff8111ab36>] ? current_kernel_time64+0x56/0xa0
[ 135.556304] [<ffffffff81201e3d>] do_vfs_ioctl+0x30d/0x580
[ 135.557009] [<ffffffff8114c7ba>] ? __audit_syscall_entry+0xaa/0xf0
[ 135.557855] [<ffffffff810011d1>] ? do_audit_syscall_entry+0x61/0x70
[ 135.558669] [<ffffffff8120d1c1>] ? __fget_light+0x61/0x90
[ 135.559374] [<ffffffff81202124>] SyS_ioctl+0x74/0x80
[ 135.559987] [<ffffffff81809857>] entry_SYSCALL_64_fastpath+0x12/0x6f
[ 135.560842] ---[ end trace 2a5c1fc3205abbdd ]---

Reason: when big data is written to the fs, all of the free space gets allocated for data chunks. An operation such as scrub needs set_block_ro(), and when there is only one metadata chunk in the system (or the other metadata chunks are all full), that function will try to allocate a new chunk and fail because there is no space left on the device.

Fix: when set_block_ro fails for a metadata chunk, it is not a problem, because scrub_lock pauses commit_transaction at the same time and metadata is always cowed, so the on-the-fly writepages will not write data into the same place as scrub/replace. Letting replace continue in this case is not a problem.

Tested by the above script, plus xfstests/011 and 100 runs of xfstests/070.

Changelog v1->v2:
1: Add detailed comments in the source and commit message.
2: Add dmesg detail into the commit message.
3: Limit the return value of -ENOSPC to be passed.
All suggested by: Filipe Manana <fdmanana@gmail.com>

Suggested-by: Filipe Manana <fdmanana@gmail.com> Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
-
- 11 November 2015, 2 commits
-
-
By Zhao Lei
It is useless. Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
-
By Zhao Lei
We don't need to pass so many arguments for rechecking sblock now; this patch cleans them up. Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
-