Commits · f004fae0cfeb96d33240eb5471f14cb6fbbd4eea · openeuler / raspberrypi-kernel

23 Feb, 2016 2 commits

Btrfs: fix lockdep deadlock warning due to dev_replace · 73beece9

Authored by Liu Bo on Jul 17, 2015

Xfstests btrfs/011 complains about a deadlock warning,

[ 1226.649039] =========================================================
[ 1226.649039] [ INFO: possible irq lock inversion dependency detected ]
[ 1226.649039] 4.1.0+ #270 Not tainted
[ 1226.649039] ---------------------------------------------------------
[ 1226.652955] kswapd0/46 just changed the state of lock:
[ 1226.652955]  (&delayed_node->mutex){+.+.-.}, at: [<ffffffff81458735>] __btrfs_release_delayed_node+0x45/0x1d0
[ 1226.652955] but this lock took another, RECLAIM_FS-unsafe lock in the past:
[ 1226.652955]  (&fs_info->dev_replace.lock){+.+.+.}

and interrupts could create inverse lock ordering between them.

[ 1226.652955]
other info that might help us debug this:
[ 1226.652955] Chain exists of:
  &delayed_node->mutex --> &found->groups_sem --> &fs_info->dev_replace.lock

[ 1226.652955]  Possible interrupt unsafe locking scenario:

[ 1226.652955]        CPU0                    CPU1
[ 1226.652955]        ----                    ----
[ 1226.652955]   lock(&fs_info->dev_replace.lock);
[ 1226.652955]                                local_irq_disable();
[ 1226.652955]                                lock(&delayed_node->mutex);
[ 1226.652955]                                lock(&found->groups_sem);
[ 1226.652955]   <Interrupt>
[ 1226.652955]     lock(&delayed_node->mutex);
[ 1226.652955]
 *** DEADLOCK ***

Commit 084b6e7c ("btrfs: Fix a lockdep warning when running xfstest.") tried
to fix a similar case that produces exactly the same warning, but even with
that fix we still run into this one.

The above lock chain comes from
btrfs_commit_transaction
  ->btrfs_run_delayed_items
    ...
    ->__btrfs_update_delayed_inode
      ...
      ->__btrfs_cow_block
         ...
         ->find_free_extent
            ->cache_block_group
              ->load_free_space_cache
                ->btrfs_readpages
                  ->submit_one_bio
                    ...
                    ->__btrfs_map_block
                      ->btrfs_dev_replace_lock

However, under high memory pressure, tasks which hold dev_replace.lock can
be interrupted by kswapd, which then tries to release memory occupied by
superblocks, inodes and dentries. There we may call evict_inode, which leads
to

[ 1226.652955]  [<ffffffff81458735>] __btrfs_release_delayed_node+0x45/0x1d0
[ 1226.652955]  [<ffffffff81459e74>] btrfs_remove_delayed_node+0x24/0x30
[ 1226.652955]  [<ffffffff8140c5fe>] btrfs_evict_inode+0x34e/0x700

delayed_node->mutex may be acquired in __btrfs_release_delayed_node(), and this
leads to an ABBA deadlock.

To fix this, we can use the "blocking rwlock" approach used for extent_buffer, but
things are simpler here since we only need to turn the reader's spinlock into a
blocking lock.

With this, btrfs/011 no longer produces warnings in dmesg.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>

73beece9

btrfs: avoid uninitialized variable warning · f827ba9a

Authored by Arnd Bergmann on Feb 22, 2016

With CONFIG_SMP and CONFIG_PREEMPT both disabled, gcc decides
to partially inline the get_state_failrec() function but cannot
figure out that this means the failrec pointer is always valid
if the function returns success, which causes a harmless
warning:

fs/btrfs/extent_io.c: In function 'clean_io_failure':
fs/btrfs/extent_io.c:2131:4: error: 'failrec' may be used uninitialized in this function [-Werror=maybe-uninitialized]

This marks get_state_failrec() and set_state_failrec() both
as 'noinline', which avoids the warning in all cases for me,
and seems less ugly than adding a fake initialization.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Fixes: 47dc196a ("btrfs: use proper type for failrec in extent_state")
Signed-off-by: David Sterba <dsterba@suse.com>

f827ba9a
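
A minimal sketch of the pattern behind f827ba9a, assuming GCC or Clang. The names `struct failrec`, `lookup_failrec` and `mirror_of` are hypothetical stand-ins, not the btrfs definitions; only the idea of marking the helper `noinline` comes from the commit message.

```c
#include <stddef.h>

/* Hypothetical stand-in for the record the commit calls "failrec". */
struct failrec {
	int mirror;
};

/* The kernel spells this attribute `noinline`; this is the GCC/Clang form. */
#define NOINLINE __attribute__((noinline))

/*
 * Returns 0 and fills *out on success, -1 otherwise. If the compiler
 * partially inlines a helper like this, it can lose track of the fact
 * that *out is always written on the success path and warn in the caller.
 * Keeping the helper out of line avoids that without a fake initializer.
 */
static NOINLINE int lookup_failrec(struct failrec *table, size_t n, size_t idx,
				   struct failrec **out)
{
	if (idx >= n)
		return -1;
	*out = &table[idx];
	return 0;
}

/* Caller: rec is only read after lookup_failrec() reported success. */
static int mirror_of(struct failrec *table, size_t n, size_t idx)
{
	struct failrec *rec;

	if (lookup_failrec(table, n, idx, &rec))
		return -1;
	return rec->mirror;
}
```

Calling `mirror_of()` with an in-range index returns the stored mirror number, and an out-of-range index returns -1 without ever touching `rec`.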

18 Feb, 2016 24 commits

btrfs: drop null testing before destroy functions · 5598e900

Authored by Kinglong Mee on Jan 29, 2016

Cleanup.

kmem_cache_destroy() already checks for a NULL argument,
so drop the redundant NULL test before calling it.
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>

5598e900
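
The cleanup in 5598e900 relies on a destroy function that tolerates NULL. A userspace sketch of the same convention follows; `struct cache`, `cache_create`, `cache_destroy` and `teardown` are hypothetical names used for illustration, not kernel APIs.

```c
#include <stdlib.h>

/* Hypothetical cache type; a userspace stand-in for struct kmem_cache. */
struct cache {
	void *slab;
};

static struct cache *cache_create(size_t size)
{
	struct cache *c = malloc(sizeof(*c));

	if (!c)
		return NULL;
	c->slab = malloc(size);
	if (!c->slab) {
		free(c);
		return NULL;
	}
	return c;
}

/* Like kmem_cache_destroy(), this is a no-op when passed NULL. */
static void cache_destroy(struct cache *c)
{
	if (!c)
		return;
	free(c->slab);
	free(c);
}

/* Callers can therefore drop their own "if (c)" guard. */
static void teardown(struct cache **slot)
{
	cache_destroy(*slot);
	*slot = NULL;
}
```

Because the NULL check lives in one place, `teardown()` can safely be called twice on the same slot, or on a slot whose creation failed.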

btrfs: fix build warning · 89771cc9

Authored by Sudip Mukherjee on Feb 16, 2016

We were getting a build warning:
fs/btrfs/extent-tree.c:7021:34: warning: ‘used_bg’ may be used
	uninitialized in this function

It is not a valid warning, as used_bg is never used uninitialized: locked
is initially false, so we can never be in the section where 'used_bg' is
used. But gcc is not able to understand that, so we initialize it at
declaration to silence the warning.
Signed-off-by: Sudip Mukherjee <sudip@vectorindia.org>
Signed-off-by: David Sterba <dsterba@suse.com>

89771cc9

btrfs: use proper type for failrec in extent_state · 47dc196a

Authored by David Sterba on Feb 11, 2016

We use the private member of extent_state to store the failrec and play
pointless pointer games.
Signed-off-by: David Sterba <dsterba@suse.com>

47dc196a

btrfs: Replace CURRENT_TIME by current_fs_time() · 04b285f3

Authored by Deepa Dinamani on Feb 06, 2016

The CURRENT_TIME macro is not appropriate for filesystems as it
doesn't use the right granularity for filesystem timestamps.
Use current_fs_time() instead.
Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Cc: Chris Mason <clm@fb.com>
Cc: Josef Bacik <jbacik@fb.com>
Cc: linux-btrfs@vger.kernel.org
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

04b285f3

btrfs: remove open-coded swap() in backref.c:__merge_refs · 8f682f69

Authored by Dave Jones on Jan 28, 2016

The kernel provides a swap() that does the same thing as this code.
Signed-off-by: Dave Jones <dsj@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>

8f682f69
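
For readers unfamiliar with the helper mentioned in 8f682f69, here is a simplified sketch of a `swap()` macro in the style the kernel uses. It depends on the GNU C `typeof` extension (GCC and Clang). `struct ref` and `order_refs` are hypothetical and only illustrate replacing an open-coded three-line exchange.

```c
/*
 * Simplified swap() in the style of the kernel helper. It relies on the
 * GNU C typeof extension, so it needs GCC or Clang.
 */
#define swap(a, b) \
	do { __typeof__(a) __tmp = (a); (a) = (b); (b) = __tmp; } while (0)

/* Hypothetical reference record, used only for this example. */
struct ref {
	int level;
};

/* Order two pointers so that *lo refers to the smaller level. */
static void order_refs(struct ref **lo, struct ref **hi)
{
	if ((*lo)->level > (*hi)->level)
		swap(*lo, *hi);	/* replaces tmp = a; a = b; b = tmp; */
}
```

The macro works for any assignable type, which is why a single shared helper can replace per-call-site temporaries.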

btrfs: remove redundant error check · ac1407ba

Authored by Byongho Lee on Jan 27, 2016

While running btrfs_mksubvol(), d_really_is_positive() is called twice:
first in btrfs_mksubvol() and then inside btrfs_may_create().  So
remove the first call.
Signed-off-by: Byongho Lee <bhlee.kernel@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>

ac1407ba

btrfs: simplify expression in btrfs_calc_trans_metadata_size() · 0138b6fe

Authored by Byongho Lee on Jan 27, 2016

Simplify expression in btrfs_calc_trans_metadata_size().
Signed-off-by: Byongho Lee <bhlee.kernel@gmail.com>
Reviewed-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: David Sterba <dsterba@suse.com>

0138b6fe

Btrfs: check reserved when deciding to background flush · baee8790

Authored by Josef Bacik on Jan 26, 2016

We will sometimes start background flushing the various enospc related things
(delayed nodes, delalloc, etc) if we are getting close to reserving all of our
available space.  We don't want to do this however when we are actually using
this space as it causes unneeded thrashing.  We currently try to do this by
checking bytes_used >= thresh, but bytes_used is only part of the equation, we
need to use bytes_reserved as well as this represents space that is very likely
to become bytes_used in the future.

My tracing tool keeps count of the number of times we kick off the async
flusher. The following are counts for the entire run of generic/027:

            No Patch    Patch
avg:        5385        5009
median:     5500        4916

We skewed lower than the average with my patch and higher than the average
without the patch. Overall it cuts the flushing by anywhere from 5-10%, which
in the case of actual ENOSPC is quite helpful.  Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>

baee8790
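
The check described in baee8790 can be sketched as follows. This is an illustration under stated assumptions, not the kernel function: `struct space_info` here is a hypothetical two-field subset, and `should_background_flush` covers only the threshold comparison the message talks about.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical subset of the space accounting fields named in the commit. */
struct space_info {
	uint64_t bytes_used;
	uint64_t bytes_reserved;
};

/*
 * Returns true when background flushing is still worthwhile. Reserved
 * bytes are very likely to become used bytes, so both are counted
 * against the threshold; once their sum reaches it, flushing would only
 * cause thrashing and is skipped.
 */
static bool should_background_flush(const struct space_info *si,
				    uint64_t thresh)
{
	return si->bytes_used + si->bytes_reserved < thresh;
}
```

With 60 bytes used, 30 reserved and a threshold of 80, a check on `bytes_used` alone would still start the flusher, while the combined check does not.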

Btrfs: add transaction space reservation tracepoints · 88d3a5aa

Authored by Josef Bacik on Jan 13, 2016

There are a few places where we add to trans->bytes_reserved but don't have the
corresponding trace point.  With these added my tool no longer sees transaction
leaks.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>

88d3a5aa

Btrfs: fix truncate_space_check · dc95f7bf

Authored by Josef Bacik on Jan 13, 2016

truncate_space_check is using btrfs_csum_bytes_to_leaves() but forgetting to
multiply by nodesize so we get an actual byte count. We need a tracepoint here
so that we have the matching reserve for the release that will come later. Also
add a comment to make clear what the intent of truncate_space_check is.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>

dc95f7bf
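
The unit mismatch fixed in dc95f7bf is easy to see in a small worked example. The helpers below are hypothetical: `csums_to_leaves` is a simplified stand-in for btrfs_csum_bytes_to_leaves(), and the per-leaf capacity is passed in rather than derived from the filesystem.

```c
#include <stdint.h>

/*
 * Hypothetical stand-in for btrfs_csum_bytes_to_leaves(): the number of
 * leaves needed for num_csums checksums when one leaf holds
 * csums_per_leaf of them (rounded up). csums_per_leaf must be non-zero.
 */
static uint64_t csums_to_leaves(uint64_t num_csums, uint64_t csums_per_leaf)
{
	return (num_csums + csums_per_leaf - 1) / csums_per_leaf;
}

/* A leaf count is not a byte count: scale by the node size to reserve. */
static uint64_t csum_reserve_bytes(uint64_t num_csums, uint64_t csums_per_leaf,
				   uint64_t nodesize)
{
	return csums_to_leaves(num_csums, csums_per_leaf) * nodesize;
}
```

For 1000 checksums at 300 per leaf and a 16384-byte node size, the leaf count is 4 but the reservation has to be 65536 bytes. Reserving "4" would be off by four orders of magnitude.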

Btrfs: change how we update the global block rsv · fb4b10e5

Authored by Josef Bacik on Jan 11, 2016

I'm writing a tool to visualize the enospc system in order to help debug enospc
bugs and I found weird data and tracked it down to when we update the global
block rsv. We add all of the remaining free space to the block rsv, do a trace
event, then remove the extra and do another trace event. This makes my
visualization look silly and is unintuitive code as well. Fix this stuff to only
add the amount we are missing, or free the amount we are missing. This is less
clean to read but more explicit in what it is doing, as well as only emitting
events for values that make sense. Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>

fb4b10e5
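
The "only add what is missing, or free what is extra" idea from fb4b10e5 might look like the sketch below. `struct block_rsv` and `update_rsv` are hypothetical simplifications; the real code also handles locking and tracepoints, which are omitted here.

```c
#include <stdint.h>

/* Hypothetical, simplified reservation: target size, currently reserved. */
struct block_rsv {
	uint64_t size;
	uint64_t reserved;
};

/*
 * Move only the shortfall from *free_bytes into the reservation, or
 * return only the excess. The signed delta is returned so that a single
 * trace event can be emitted for the amount that actually changed.
 */
static int64_t update_rsv(struct block_rsv *rsv, uint64_t *free_bytes)
{
	int64_t delta = 0;

	if (rsv->reserved < rsv->size) {
		uint64_t missing = rsv->size - rsv->reserved;
		uint64_t take = missing < *free_bytes ? missing : *free_bytes;

		rsv->reserved += take;
		*free_bytes -= take;
		delta = (int64_t)take;
	} else if (rsv->reserved > rsv->size) {
		uint64_t excess = rsv->reserved - rsv->size;

		rsv->reserved -= excess;
		*free_bytes += excess;
		delta = -(int64_t)excess;
	}
	return delta;
}
```

Compared with adding all free space and then trimming, each call changes the counters once, so a tracing tool sees one event with a meaningful value.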

btrfs: reada: ignore creating reada_extent for a non-existent device · 7aff8cf4

Authored by Zhao Lei on Jan 14, 2016

For a non-existent device, the old code skips adding it to the dev's reada
queue.

To solve the problem of unfinished waiting in raid5/6,
commit 5fbc7c59 ("Btrfs: fix unfinished readahead thread for
raid5/6 degraded mounting")
added an exception for the first stripe: in short, the first
stripe is always processed whether the device exists or not.

Actually there is a better way to meet the above requirement: just skip
creating the reada_extent for a non-existent device, which makes the
code simpler and more effective.
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>

7aff8cf4

btrfs: reada: avoid undone reada extents in btrfs_reada_wait · 4fe7a0e1

Authored by Zhao Lei on Jan 26, 2016

Reada background works are not designed to finish all jobs
completely; they will stop in the following cases:
1: When a device reaches the workload limit (MAX_IN_FLIGHT)
2: Total reads reach the max limit (10000)
3: No device has any more queued jobs, which often happens in the DUP case

And if all background works exit with jobs remaining,
btrfs_reada_wait() will wait indefinitely.

The above problem rarely happened in the old code, because:
1: Every work queues 2x new works
   So many works reduced the chance of undone jobs.
2: One work will continue looping 10000 times in case of no jobs
   This reduced the no-thread window time.

But after we fixed the above cases, "undone reada extents" happened
frequently.

Fix:
 Check to ensure we have at least one thread if there are undone jobs
 in btrfs_reada_wait().
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>

4fe7a0e1

btrfs: reada: limit max works count · 2fefd558

Authored by Zhao Lei on Jan 07, 2016

Reada creates 2 works for each level of tree recursively.

For a tree with many levels, the number of created works
is 2^level_of_tree.
We don't actually need so many works in parallel, so this patch limits
the max number of works to BTRFS_MAX_MIRRORS * 2.

The per-fs works_counter will also be used by btrfs_reada_wait() to
check if there are background workers.
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>

2fefd558
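
A counter-based cap like the one described in 2fefd558 can be sketched with C11 atomics. Everything here is an assumption made for illustration: `MAX_MIRRORS` is given the value 3, and `work_try_start`, `work_done` and `works_pending` are hypothetical names, not the btrfs functions.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define MAX_MIRRORS 3			/* assumed value, for illustration */
#define MAX_WORKS (MAX_MIRRORS * 2)	/* cap on concurrent works */

/* Per-filesystem in the real code; a single global in this sketch. */
static atomic_int works_cnt;

/* Try to reserve a work slot; returns false once the cap is reached. */
static bool work_try_start(void)
{
	if (atomic_fetch_add(&works_cnt, 1) >= MAX_WORKS) {
		atomic_fetch_sub(&works_cnt, 1);	/* undo the reservation */
		return false;
	}
	return true;
}

/* Release the slot when a work finishes. */
static void work_done(void)
{
	atomic_fetch_sub(&works_cnt, 1);
}

/* A waiter can use the same counter to see if any worker is still alive. */
static bool works_pending(void)
{
	return atomic_load(&works_cnt) > 0;
}
```

The same counter serves two purposes, as the commit message notes: it bounds parallelism, and it lets a waiter tell whether any background worker remains.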

btrfs: reada: simplify dev->reada_in_flight processing · 895a11b8

Authored by Zhao Lei on Jan 12, 2016

There is no need to decrease dev->reada_in_flight inside
__readahead_hook() and reada_extent_put().
reada_extent_put() has no chance to decrease dev->reada_in_flight
in the free operation, because reada_extent holds an additional refcnt
when scheduled to a dev.

We can put the inc and dec operations for dev->reada_in_flight in one
place instead to make the logic simple and safe, and turn the useless
reada_extent->scheduled_for into a bool flag.
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>

895a11b8

btrfs: reada: Fix a debug code typo · 8afd6841

Authored by Zhao Lei on Dec 31, 2015

Remove one copy of the loop to fix the typo when iterating zones.
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>

8afd6841

btrfs: reada: Jump into cleanup in direct way for __readahead_hook() · 57f16e08

Authored by Zhao Lei on Dec 31, 2015

The current code sets nritems to 0 to turn the for loop into a no-op in
order to bypass it, and sets generation's value, which is not necessary.
Jumping into cleanup directly is a better choice.
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>

57f16e08

btrfs: reada: Use fs_info instead of root in __readahead_hook's argument · 02873e43

Authored by Zhao Lei on Dec 31, 2015

What __readahead_hook() needs exactly is fs_info; there is no need to convert
fs_info to root in the caller and convert it back in __readahead_hook().
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>

02873e43

btrfs: reada: Pass reada_extent into __readahead_hook directly · 6e39dbe8

Authored by Zhao Lei on Dec 31, 2015

reada_start_machine_dev() already has the reada_extent pointer; passing
it into __readahead_hook() directly instead of searching the radix_tree
will make the code run faster.
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>

6e39dbe8

btrfs: reada: move reada_extent_put to place after __readahead_hook() · b257cf50

Authored by Zhao Lei on Dec 31, 2015

We can't release reada_extent earlier than __readahead_hook(), because
__readahead_hook() still needs to use it; it is necessary to hold a refcnt
to avoid it being freed.

Actually it is not a problem after my patch named:
  Avoid many times of empty loop
It makes the reada_extent in the above line include at least one reada_extctl,
which keeps one additional refcnt for the reada_extent.

But we still need this patch to keep the code logically clean.
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>

b257cf50

btrfs: reada: Remove level argument in several functions · 1e7970c0

Authored by Zhao Lei on Dec 31, 2015

level is not used in several functions; remove it from their arguments,
and remove the related code that gets its value.
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>

1e7970c0

btrfs: reada: bypass adding extent when all zone failed · 31945021

Authored by Zhao Lei on Dec 31, 2015

When adding all dev_zones for a reada_extent fails, the extent
will have no chance to be selected to run, and stays in memory
forever.

We should bypass this extent to avoid the above case.
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>

31945021

btrfs: reada: add all reachable mirrors into reada device list · 6a159d2a

Authored by Zhao Lei on Dec 31, 2015

If some device is not reachable, we should skip it and continue adding
the next one, instead of breaking on the bad device.
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>

6a159d2a

btrfs: reada: Move is_need_to_readahead condition earlier · a3f7fde2

Authored by Zhao Lei on Dec 31, 2015

Move the is_need_to_readahead condition earlier to avoid a useless loop
that gets related data for readahead.
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>

a3f7fde2

16 Feb, 2016 4 commits

btrfs: reada: Avoid many times of empty loop · 97d5f0e6

Authored by Zhao Lei on Dec 31, 2015

We can see the following loop (10000 times) in trace_log:
 [   75.416137] ZL_DEBUG: reada_start_machine_dev:730: pid=771 comm=kworker/u2:3 re->ref_cnt ffff88003741e0c0 1 -> 2
 [   75.417413] ZL_DEBUG: reada_extent_put:524: pid=771 comm=kworker/u2:3 re = ffff88003741e0c0, refcnt = 2 -> 1
 [   75.418611] ZL_DEBUG: __readahead_hook:129: pid=771 comm=kworker/u2:3 re->ref_cnt ffff88003741e0c0 1 -> 2
 [   75.419793] ZL_DEBUG: reada_extent_put:524: pid=771 comm=kworker/u2:3 re = ffff88003741e0c0, refcnt = 2 -> 1

 [   75.421016] ZL_DEBUG: reada_start_machine_dev:730: pid=771 comm=kworker/u2:3 re->ref_cnt ffff88003741e0c0 1 -> 2
 [   75.422324] ZL_DEBUG: reada_extent_put:524: pid=771 comm=kworker/u2:3 re = ffff88003741e0c0, refcnt = 2 -> 1
 [   75.423661] ZL_DEBUG: __readahead_hook:129: pid=771 comm=kworker/u2:3 re->ref_cnt ffff88003741e0c0 1 -> 2
 [   75.424882] ZL_DEBUG: reada_extent_put:524: pid=771 comm=kworker/u2:3 re = ffff88003741e0c0, refcnt = 2 -> 1

 ...(10000 times)

 [  124.101672] ZL_DEBUG: reada_start_machine_dev:730: pid=771 comm=kworker/u2:3 re->ref_cnt ffff88003741e0c0 1 -> 2
 [  124.102850] ZL_DEBUG: reada_extent_put:524: pid=771 comm=kworker/u2:3 re = ffff88003741e0c0, refcnt = 2 -> 1
 [  124.104008] ZL_DEBUG: __readahead_hook:129: pid=771 comm=kworker/u2:3 re->ref_cnt ffff88003741e0c0 1 -> 2
 [  124.105121] ZL_DEBUG: reada_extent_put:524: pid=771 comm=kworker/u2:3 re = ffff88003741e0c0, refcnt = 2 -> 1

Reason:
 If more than one user triggers reada on the same extent, the first task
 finishes setting up the reada data struct and calls reada_start_machine()
 to start, while the second task has only added a ref_count and has not yet
 added its reada_extctl struct completely. The reada_extent cannot finish
 all jobs, and will be selected in __reada_start_machine() 10000
 times (the total loop count in __reada_start_machine()).

Fix:
 For a reada_extent without a job, we don't need to run it; just return
 0 to let the caller break.
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>

97d5f0e6

btrfs: reada: Add missed segment checking in reada_find_zone · 8e9aa51f

Authored by Zhao Lei on Dec 18, 2015

When rechecking zone-in-tree, we still need to check that the zone includes
our logical address.
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>

8e9aa51f

btrfs: reada: reduce additional fs_info->reada_lock in reada_find_zone · c37f49c7

Authored by Zhao Lei on Dec 18, 2015

We can avoid an additional lock acquisition and one pair of
kref_get/put by combining two conditions.
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>

c37f49c7

btrfs: reada: Fix in-segment calculation for reada · 50378530

Authored by Zhao Lei on Dec 18, 2015

reada_zone->end is the end pos of the segment:
 end = start + cache->key.offset - 1;

So we need to use "<=" in the condition to judge whether a pos is in the
segment.

The problem happened rarely, because the logical pos rarely pointed
to the last 4k of a blockgroup, but we need to fix it to make the code
logically right.
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>

50378530
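
The off-by-one addressed in 50378530 comes from `end` being an inclusive bound. A small sketch, using a hypothetical `struct zone` and helpers rather than the btrfs reada_zone code:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical zone with an inclusive end: end = start + len - 1. */
struct zone {
	uint64_t start;
	uint64_t end;
};

static struct zone make_zone(uint64_t start, uint64_t len)
{
	struct zone z = { start, start + len - 1 };

	return z;
}

/* Because end is the last valid position, the upper bound uses <=. */
static bool zone_contains(const struct zone *z, uint64_t logical)
{
	return logical >= z->start && logical <= z->end;
}
```

A zone starting at 4096 with length 8192 ends at 12287. Using `<` instead of `<=` would wrongly exclude position 12287, which is exactly the rarely hit last-block case the message describes.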

12 Feb, 2016 3 commits

btrfs: Introduce new mount option alias for nologreplay · fed8f166

Authored by Qu Wenruo on Jan 19, 2016

Introduce a new mount option alias "norecovery" for nologreplay, to keep
"norecovery" behavior the same as in other filesystems.
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>

fed8f166

btrfs: Introduce new mount option to disable tree log replay · 96da0919

Authored by Qu Wenruo on Jan 19, 2016

Introduce a new mount option "nologreplay" to co-operate with the "ro" mount
option to get a real readonly mount, like "norecovery" in ext* and xfs.

Since the new parse_options() needs to check the new flags at remount time,
add a new parameter to parse_options().
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Tested-by: Austin S. Hemmelgarn <ahferroin7@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>

96da0919

btrfs: Introduce new mount option usebackuproot to replace recovery · 8dcddfa0

Authored by Qu Wenruo on Jan 19, 2016

The current "recovery" mount option will only try to use the backup root.
However the word "recovery" is too generic and may be confusing for some
users.

Introduce a new and more specific mount option, "usebackuproot", to
replace the "recovery" mount option.
"recovery" will be kept for compatibility reasons, but will be
deprecated.

Also, since "usebackuproot" only affects mount behavior and has nothing
to do with the filesystem after open_ctree(), clear the flag
after the mount has succeeded.

This provides the basis for the later unified "norecovery" mount option.
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
[ dropped usebackuproot from show_mount, added note about 'recovery' to
  docs ]
Signed-off-by: David Sterba <dsterba@suse.com>

8dcddfa0

11 Feb, 2016 7 commits

btrfs: teach print_leaf about temporary item subtypes · 9f07e1d7

Authored by David Sterba on Jan 25, 2016

Signed-off-by: David Sterba <dsterba@suse.com>

9f07e1d7

btrfs: teach print_leaf about permanent item subtypes · 585a3d0d

Authored by David Sterba on Jan 25, 2016

Signed-off-by: David Sterba <dsterba@suse.com>

585a3d0d

btrfs: switch dev stats item to the permanent item key · 242e2956

Authored by David Sterba on Jan 25, 2016

Signed-off-by: David Sterba <dsterba@suse.com>

242e2956

btrfs: introduce key type for persistent permanent items · 50c2d5ab

Authored by David Sterba on Jan 25, 2016

The number of distinct key types is not that big that we could waste one
for something new we want to store in the tree.

Similar to the temporary items, we'll introduce a new name for an
existing key value and use the objectid for further extension.  The
victim is the BTRFS_DEV_STATS_KEY (249).

The device stats are an example of a permanent item.
Signed-off-by: David Sterba <dsterba@suse.com>

50c2d5ab

btrfs: switch balance item to the temporary item key · c479cb4f

Authored by David Sterba on Jan 25, 2016

No visible change.
Signed-off-by: David Sterba <dsterba@suse.com>

c479cb4f

btrfs: introduce key type for persistent temporary items · 0bbbccb1

Authored by David Sterba on Jan 25, 2016

The number of distinct key types is not that big that we could waste one
for something new we want to store in the tree. We'll introduce a new
name for an existing key value and use the objectid for further
extension.  The victim is the BTRFS_BALANCE_ITEM_KEY (248).

The balance status item is a good example of a temporary
item. It exists from the beginning of the balance and keeps the status
until it finishes.
Signed-off-by: David Sterba <dsterba@suse.com>

0bbbccb1

btrfs: properly set the termination value of ctx->pos in readdir · bc4ef759

Authored by David Sterba on Nov 13, 2015

The value of ctx->pos in the last readdir call is supposed to be set to
INT_MAX due to 32bit compatibility, unless 'pos' is intentionally set to a
larger value, then it's LLONG_MAX.

There's a report from PaX SIZE_OVERFLOW plugin that "ctx->pos++"
overflows (https://forums.grsecurity.net/viewtopic.php?f=1&t=4284), on a
64bit arch, where the value is 0x7fffffffffffffff ie. LLONG_MAX before
the increment.

We can get to that situation like this:

* emit all regular readdir entries
* still in the same call to readdir, bump the last pos to INT_MAX
* next call to readdir will not emit any entries, but will reach the
  bump code again, finds pos to be INT_MAX and sets it to LLONG_MAX

Normally this is not a problem, but if we call readdir again, we'll find
'pos' set to LLONG_MAX and the unconditional increment will overflow.

The report from Victor at
(http://thread.gmane.org/gmane.comp.file-systems.btrfs/49500) with debugging
print shows that pattern:

 Overflow: e
 Overflow: 7fffffff
 Overflow: 7fffffffffffffff
 PAX: size overflow detected in function btrfs_real_readdir
   fs/btrfs/inode.c:5760 cicus.935_282 max, count: 9, decl: pos; num: 0;
   context: dir_context;
 CPU: 0 PID: 2630 Comm: polkitd Not tainted 4.2.3-grsec #1
 Hardware name: Gigabyte Technology Co., Ltd. H81ND2H/H81ND2H, BIOS F3 08/11/2015
  ffffffff81901608 0000000000000000 ffffffff819015e6 ffffc90004973d48
  ffffffff81742f0f 0000000000000007 ffffffff81901608 ffffc90004973d78
  ffffffff811cb706 0000000000000000 ffff8800d47359e0 ffffc90004973ed8
 Call Trace:
  [<ffffffff81742f0f>] dump_stack+0x4c/0x7f
  [<ffffffff811cb706>] report_size_overflow+0x36/0x40
  [<ffffffff812ef0bc>] btrfs_real_readdir+0x69c/0x6d0
  [<ffffffff811dafc8>] iterate_dir+0xa8/0x150
  [<ffffffff811e6d8d>] ? __fget_light+0x2d/0x70
  [<ffffffff811dba3a>] SyS_getdents+0xba/0x1c0
 Overflow: 1a
  [<ffffffff811db070>] ? iterate_dir+0x150/0x150
  [<ffffffff81749b69>] entry_SYSCALL_64_fastpath+0x12/0x83

The jump from 7fffffff to 7fffffffffffffff happens when new dir entries
are not yet synced and are processed from the delayed list. Then the code
could go to the bump section again even though it might not emit any new
dir entries from the delayed list.

The fix avoids entering the "bump" section again once we've finished
emitting the entries, both for synced and delayed entries.

References: https://forums.grsecurity.net/viewtopic.php?f=1&t=4284
Reported-by: Victor <services@swwu.com>
CC: stable@vger.kernel.org
Signed-off-by: David Sterba <dsterba@suse.com>
Tested-by: Holger Hoffstätte <holger.hoffstaette@googlemail.com>
Signed-off-by: Chris Mason <clm@fb.com>

bc4ef759
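
The overflow path described in bc4ef759 can be modelled in a few lines. This is a simplified sketch, not the actual patch: `finish_readdir` is a hypothetical function that isolates the end-of-directory "bump" step, and the `emitted` flag is an assumed way of expressing "this call produced at least one entry".

```c
#include <limits.h>
#include <stdbool.h>

/*
 * Hypothetical end-of-directory step. Without the guard, a call that
 * starts with pos == LLONG_MAX would execute pos++ and overflow.
 */
static long long finish_readdir(long long pos, bool emitted)
{
	/* Nothing emitted: a termination value is already in place. */
	if (!emitted)
		return pos;

	pos++;	/* step past the last emitted entry */

	/* First bump goes to INT_MAX; anything at or past it, to LLONG_MAX. */
	return pos >= INT_MAX ? LLONG_MAX : INT_MAX;
}
```

In this model, a call that emits entries ending at position 14 leaves `pos` at INT_MAX, and later calls that emit nothing return `pos` unchanged, so the unconditional increment is never reached with LLONG_MAX.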