提交 · f1de968376340c97ac2d7acd25fa3107c398e0e5 · openeuler / raspberrypi-kernel

29 1月, 2014 40 次提交

Btrfs: fix the race between write back and nocow buffered write · f1de9683

由 Miao Xie 提交于 1月 09, 2014

When we ran the 274th case of xfstests with nodatacow mount option,
We met the following warning message:
WARNING: CPU: 1 PID: 14185 at fs/btrfs/extent-tree.c:3734 btrfs_free_reserved_data_space+0xa6/0xd0

It is caused by the race between the write back and nocow buffered
write:
  Task1				Task2
  __btrfs_buffered_write()
    skip data reservation
    reserve the metadata space
    copy the data
    dirty the pages
    unlock the pages
				write back the pages
				release the data space
   				  becasue there is no
				  noreserve flag
   set the noreserve flag

This patch fixes this problem by unlocking the pages after
the noreserve flag is set.
Reported-by: NTsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
Signed-off-by: NJosef Bacik <jbacik@fb.com>
Signed-off-by: NChris Mason <clm@fb.com>

f1de9683

Btrfs: only process as many file extents as there are refs · 7ef81ac8

由 Josef Bacik 提交于 1月 24, 2014

The backref walking code will search down to the key it is looking for and then
proceed to walk _all_ of the extents on the file until it hits the end. This is
suboptimal with large files, we only need to look for as many extents as we have
references for that inode. I have a testcase that creates a randomly written 4
gig file and before this patch it took 6min 30sec to do the initial send, with
this patch it takes 2min 30sec to do the intial send. Thanks,
Signed-off-by: NJosef Bacik <jbacik@fb.com>
Signed-off-by: NChris Mason <clm@fb.com>

7ef81ac8

Btrfs: fix qgroup rescan to work with skinny metadata · 3a6d75e8

由 Josef Bacik 提交于 1月 23, 2014

Could have sworn I fixed this before but apparently not.  This makes us pass
btrfs/022 with skinny metadata enabled.  Thanks,
Signed-off-by: NJosef Bacik <jbacik@fb.com>
Signed-off-by: NChris Mason <clm@fb.com>

3a6d75e8

Btrfs: fix extent_from_logical to deal with skinny metadata · 580f0a67

由 Josef Bacik 提交于 1月 23, 2014

I don't think this is an issue and I've not seen it in practice but
extent_from_logical will fail to find a skinny extent because it uses
btrfs_previous_item and gives it the normal extent item type. This is just not
a place to use btrfs_previous_item since we care about either normal extents or
skinny extents, so open code btrfs_previous_item to properly check. This would
only affect metadata and the only place this is used for metadata is scrub and
I'm pretty sure it's just for printing stuff out, not actually doing any work so
hopefully it was never a problem other than a cosmetic one. Thanks,
Signed-off-by: NJosef Bacik <jbacik@fb.com>
Signed-off-by: NChris Mason <clm@fb.com>

580f0a67

Btrfs: throttle delayed refs better · 0a2b2a84

由 Josef Bacik 提交于 1月 23, 2014

On one of our gluster clusters we noticed some pretty big lag spikes. This
turned out to be because our transaction commit was taking like 3 minutes to
complete. This is because we have like 30 gigs of metadata, so our global
reserve would end up being the max which is like 512 mb. So our throttling code
would allow a ridiculous amount of delayed refs to build up and then they'd all
get run at transaction commit time, and for a cold mounted file system that
could take up to 3 minutes to run. So fix the throttling to be based on both
the size of the global reserve and how long it takes us to run delayed refs.
This patch tracks the time it takes to run delayed refs and then only allows 1
seconds worth of outstanding delayed refs at a time. This way it will auto-tune
itself from cold cache up to when everything is in memory and it no longer has
to go to disk. This makes our transaction commits take much less time to run.
Thanks,
Signed-off-by: NJosef Bacik <jbacik@fb.com>
Signed-off-by: NChris Mason <clm@fb.com>

0a2b2a84

Btrfs: attach delayed ref updates to delayed ref heads · d7df2c79

由 Josef Bacik 提交于 1月 23, 2014

Currently we have two rb-trees, one for delayed ref heads and one for all of the
delayed refs, including the delayed ref heads. When we process the delayed refs
we have to hold onto the delayed ref lock for all of the selecting and merging
and such, which results in quite a bit of lock contention. This was solved by
having a waitqueue and only one flusher at a time, however this hurts if we get
a lot of delayed refs queued up.

So instead just have an rb tree for the delayed ref heads, and then attach the
delayed ref updates to an rb tree that is per delayed ref head. Then we only
need to take the delayed ref lock when adding new delayed refs and when
selecting a delayed ref head to process, all the rest of the time we deal with a
per delayed ref head lock which will be much less contentious.

The locking rules for this get a little more complicated since we have to lock
up to 3 things to properly process delayed refs, but I will address that problem
later. For now this passes all of xfstests and my overnight stress tests.
Thanks,
Signed-off-by: NJosef Bacik <jbacik@fb.com>
Signed-off-by: NChris Mason <clm@fb.com>

d7df2c79

Btrfs: make fsync latency less sucky · 5039eddc

由 Josef Bacik 提交于 1月 15, 2014

Looking into some performance related issues with large amounts of metadata
revealed that we can have some pretty huge swings in fsync() performance. If we
have a lot of delayed refs backed up (as you will tend to do with lots of
metadata) fsync() will wander off and try to run some of those delayed refs
which can result in reading from disk and such. Since the actual act of fsync()
doesn't create any delayed refs there is no need to make it throttle on delayed
ref stuff, that will be handled by other people. With this patch we get much
smoother fsync performance with large amounts of metadata. Thanks,
Signed-off-by: NJosef Bacik <jbacik@fb.com>
Signed-off-by: NChris Mason <clm@fb.com>

5039eddc

Btrfs: add support for inode properties · 63541927

由 Filipe David Borba Manana 提交于 1月 07, 2014

This change adds infrastructure to allow for generic properties for
inodes. Properties are name/value pairs that can be associated with
inodes for different purposes. They are stored as xattrs with the
prefix "btrfs."

Properties can be inherited - this means when a directory inode has
inheritable properties set, these are added to new inodes created
under that directory. Further, subvolumes can also have properties
associated with them, and they can be inherited from their parent
subvolume. Naturally, directory properties have priority over subvolume
properties (in practice a subvolume property is just a regular
property associated with the root inode, objectid 256, of the
subvolume's fs tree).

This change also adds one specific property implementation, named
"compression", whose values can be "lzo" or "zlib" and it's an
inheritable property.

The corresponding changes to btrfs-progs were also implemented.
A patch with xfstests for this feature will follow once there's
agreement on this change/feature.

Further, the script at the bottom of this commit message was used to
do some benchmarks to measure any performance penalties of this feature.

Basically the tests correspond to:

Test 1 - create a filesystem and mount it with compress-force=lzo,
then sequentially create N files of 64Kb each, measure how long it took
to create the files, unmount the filesystem, mount the filesystem and
perform an 'ls -lha' against the test directory holding the N files, and
report the time the command took.

Test 2 - create a filesystem and don't use any compression option when
mounting it - instead set the compression property of the subvolume's
root to 'lzo'. Then create N files of 64Kb, and report the time it took.
The unmount the filesystem, mount it again and perform an 'ls -lha' like
in the former test. This means every single file ends up with a property
(xattr) associated to it.

Test 3 - same as test 2, but uses 4 properties - 3 are duplicates of the
compression property, have no real effect other than adding more work
when inheriting properties and taking more btree leaf space.

Test 4 - same as test 3 but with 10 properties per file.

Results (in seconds, and averages of 5 runs each), for different N
numbers of files follow.

* Without properties (test 1)

                    file creation time        ls -lha time
10 000 files              3.49                   0.76
100 000 files            47.19                   8.37
1 000 000 files         518.51                 107.06

* With 1 property (compression property set to lzo - test 2)

                    file creation time        ls -lha time
10 000 files              3.63                    0.93
100 000 files            48.56                    9.74
1 000 000 files         537.72                  125.11

* With 4 properties (test 3)

                    file creation time        ls -lha time
10 000 files              3.94                    1.20
100 000 files            52.14                   11.48
1 000 000 files         572.70                  142.13

* With 10 properties (test 4)

                    file creation time        ls -lha time
10 000 files              4.61                    1.35
100 000 files            58.86                   13.83
1 000 000 files         656.01                  177.61

The increased latencies with properties are essencialy because of:

*) When creating an inode, we now synchronously write 1 more item
   (an xattr item) for each property inherited from the parent dir
   (or subvolume). This could be done in an asynchronous way such
   as we do for dir intex items (delayed-inode.c), which could help
   reduce the file creation latency;

*) With properties, we now have larger fs trees. For this particular
   test each xattr item uses 75 bytes of leaf space in the fs tree.
   This could be less by using a new item for xattr items, instead of
   the current btrfs_dir_item, since we could cut the 'location' and
   'type' fields (saving 18 bytes) and maybe 'transid' too (saving a
   total of 26 bytes per xattr item) from the btrfs_dir_item type.

Also tried batching the xattr insertions (ignoring proper hash
collision handling, since it didn't exist) when creating files that
inherit properties from their parent inode/subvolume, but the end
results were (surprisingly) essentially the same.

Test script:

$ cat test.pl
  #!/usr/bin/perl -w

  use strict;
  use Time::HiRes qw(time);
  use constant NUM_FILES => 10_000;
  use constant FILE_SIZES => (64 * 1024);
  use constant DEV => '/dev/sdb4';
  use constant MNT_POINT => '/home/fdmanana/btrfs-tests/dev';
  use constant TEST_DIR => (MNT_POINT . '/testdir');

  system("mkfs.btrfs", "-l", "16384", "-f", DEV) == 0 or die "mkfs.btrfs failed!";

  # following line for testing without properties
  #system("mount", "-o", "compress-force=lzo", DEV, MNT_POINT) == 0 or die "mount failed!";

  # following 2 lines for testing with properties
  system("mount", DEV, MNT_POINT) == 0 or die "mount failed!";
  system("btrfs", "prop", "set", MNT_POINT, "compression", "lzo") == 0 or die "set prop failed!";

  system("mkdir", TEST_DIR) == 0 or die "mkdir failed!";
  my ($t1, $t2);

  $t1 = time();
  for (my $i = 1; $i <= NUM_FILES; $i++) {
      my $p = TEST_DIR . '/file_' . $i;
      open(my $f, '>', $p) or die "Error opening file!";
      $f->autoflush(1);
      for (my $j = 0; $j < FILE_SIZES; $j += 4096) {
          print $f ('A' x 4096) or die "Error writing to file!";
      }
      close($f);
  }
  $t2 = time();
  print "Time to create " . NUM_FILES . ": " . ($t2 - $t1) . " seconds.\n";
  system("umount", DEV) == 0 or die "umount failed!";
  system("mount", DEV, MNT_POINT) == 0 or die "mount failed!";

  $t1 = time();
  system("bash -c 'ls -lha " . TEST_DIR . " > /dev/null'") == 0 or die "ls failed!";
  $t2 = time();
  print "Time to ls -lha all files: " . ($t2 - $t1) . " seconds.\n";
  system("umount", DEV) == 0 or die "umount failed!";
Signed-off-by: NFilipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: NJosef Bacik <jbacik@fb.com>
Signed-off-by: NChris Mason <clm@fb.com>

63541927

Btrfs: faster file extent item replace operations · 1acae57b

由 Filipe David Borba Manana 提交于 1月 07, 2014

When writing to a file we drop existing file extent items that cover the
write range and then add a new file extent item that represents that write
range.

Before this change we were doing a tree lookup to remove the file extent
items, and then after we did another tree lookup to insert the new file
extent item.
Most of the time all the file extent items we need to drop are located
within a single leaf - this is the leaf where our new file extent item ends
up at. Therefore, in this common case just combine these 2 operations into
a single one.

By avoiding the second btree navigation for insertion of the new file extent
item, we reduce btree node/leaf lock acquisitions/releases, btree block/leaf
COW operations, CPU time on btree node/leaf key binary searches, etc.

Besides for file writes, this is an operation that happens for file fsync's
as well. However log btrees are much less likely to big as big as regular
fs btrees, therefore the impact of this change is smaller.

The following benchmark was performed against an SSD drive and a
HDD drive, both for random and sequential writes:

  sysbench --test=fileio --file-num=4096 --file-total-size=8G \
     --file-test-mode=[rndwr|seqwr] --num-threads=512 \
     --file-block-size=8192 \ --max-requests=1000000 \
     --file-fsync-freq=0 --file-io-mode=sync [prepare|run]

All results below are averages of 10 runs of the respective test.

** SSD sequential writes

Before this change: 225.88 Mb/sec
After this change:  277.26 Mb/sec

** SSD random writes

Before this change: 49.91 Mb/sec
After this change:  56.39 Mb/sec

** HDD sequential writes

Before this change: 68.53 Mb/sec
After this change:  69.87 Mb/sec

** HDD random writes

Before this change: 13.04 Mb/sec
After this change:  14.39 Mb/sec
Signed-off-by: NFilipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: NJosef Bacik <jbacik@fb.com>
Signed-off-by: NChris Mason <clm@fb.com>

1acae57b

Btrfs: handle EAGAIN case properly in btrfs_drop_snapshot() · 90515e7f

由 Wang Shilong 提交于 1月 07, 2014

We may return early in btrfs_drop_snapshot(), we shouldn't
call btrfs_std_err() for this case, fix it.

Cc: stable@vger.kernel.org
Signed-off-by: NWang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: NJosef Bacik <jbacik@fb.com>
Signed-off-by: NChris Mason <clm@fb.com>

90515e7f

Btrfs: remove unnecessary transaction commit before send · 8e56338d

由 Wang Shilong 提交于 1月 07, 2014

We will finish orphan cleanups during snapshot, so we don't
have to commit transaction here.
Signed-off-by: NWang Shilong <wangsl.fnst@cn.fujitsu.com>
Reviewed-by: NMiao Xie <miaox@cn.fujitsu.com>
Signed-off-by: NJosef Bacik <jbacik@fb.com>
Signed-off-by: NChris Mason <clm@fb.com>

8e56338d

Btrfs: fix protection between send and root deletion · 18f687d5

由 Wang Shilong 提交于 1月 07, 2014

We should gurantee that parent and clone roots can not be destroyed
during send, for this we have two ideas.

1.by holding @subvol_sem, this might be a nightmare, because it will
block all subvolumes deletion for a long time.

2.Miao pointed out we can reuse @send_in_progress, that mean we will
skip snapshot deletion if root sending is in progress.

Here we adopt the second approach since it won't block other subvolumes
deletion for a long time.

Besides in btrfs_clean_one_deleted_snapshot(), we only check first root
, if this root is involved in send, we return directly rather than
continue to check.There are several reasons about it:

1.this case happen seldomly.
2.after sending,cleaner thread can continue to drop that root.
3.make code simple

Cc: David Sterba <dsterba@suse.cz>
Signed-off-by: NWang Shilong <wangsl.fnst@cn.fujitsu.com>
Reviewed-by: NMiao Xie <miaox@cn.fujitsu.com>
Signed-off-by: NJosef Bacik <jbacik@fb.com>
Signed-off-by: NChris Mason <clm@fb.com>

18f687d5

Btrfs: fix wrong send_in_progress accounting · 896c14f9

由 Wang Shilong 提交于 1月 07, 2014

Steps to reproduce:
 # mkfs.btrfs -f /dev/sda8
 # mount /dev/sda8 /mnt
 # btrfs sub snapshot -r /mnt /mnt/snap1
 # btrfs sub snapshot -r /mnt /mnt/snap2
 # btrfs send /mnt/snap1 -p /mnt/snap2 -f /mnt/1
 # dmesg

The problem is that we will sort clone roots(include @send_root), it
might push @send_root before thus @send_root's @send_in_progress will
be decreased twice.

Cc: David Sterba <dsterba@suse.cz>
Signed-off-by: NWang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: NJosef Bacik <jbacik@fb.com>
Signed-off-by: NChris Mason <clm@fb.com>

896c14f9

btrfs: Add treelog mount option. · a88998f2

由 Qu Wenruo 提交于 1月 06, 2014

Add treelog mount option to enable tree log with
remount option.
Signed-off-by: NQu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: NDavid Sterba <dsterba@suse.cz>
Signed-off-by: NJosef Bacik <jbacik@fb.com>
Signed-off-by: NChris Mason <clm@fb.com>

a88998f2

btrfs: Add datasum mount option. · d399167d

由 Qu Wenruo 提交于 1月 06, 2014

Add datasum mount option to enable checksum with
remount option.
Signed-off-by: NQu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: NDavid Sterba <dsterba@suse.cz>
Signed-off-by: NJosef Bacik <jbacik@fb.com>
Signed-off-by: NChris Mason <clm@fb.com>

d399167d

btrfs: Add datacow mount option. · a258af7a

由 Qu Wenruo 提交于 1月 06, 2014

Add datacow mount option to enable copy-on-write with
remount option.
Signed-off-by: NQu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: NDavid Sterba <dsterba@suse.cz>
Signed-off-by: NJosef Bacik <jbacik@fb.com>
Signed-off-by: NChris Mason <clm@fb.com>

a258af7a

btrfs: Add acl mount option. · bd0330ad

由 Qu Wenruo 提交于 1月 06, 2014

Add acl mount option to enable acl with remount option.
Signed-off-by: NQu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: NDavid Sterba <dsterba@suse.cz>
Signed-off-by: NJosef Bacik <jbacik@fb.com>
Signed-off-by: NChris Mason <clm@fb.com>

bd0330ad

btrfs: Add noflushoncommit mount option. · 2c9ee856

由 Qu Wenruo 提交于 1月 06, 2014

Add noflushoncommit mount option to disable flush on commit with
remount option.
Signed-off-by: NQu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: NDavid Sterba <dsterba@suse.cz>
Signed-off-by: NJosef Bacik <jbacik@fb.com>
Signed-off-by: NChris Mason <clm@fb.com>

2c9ee856

btrfs: Add noenospc_debug mount option. · 53036293

由 Qu Wenruo 提交于 1月 06, 2014

Add noenospc_debug mount option to disable ENOSPC debug with
remount option.
Signed-off-by: NQu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: NDavid Sterba <dsterba@suse.cz>
Signed-off-by: NJosef Bacik <jbacik@fb.com>
Signed-off-by: NChris Mason <clm@fb.com>

53036293

btrfs: Add nodiscard mount option. · e07a2ade

由 Qu Wenruo 提交于 1月 06, 2014

Add nodiscard mount option to disable discard with remount option.
Signed-off-by: NQu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: NDavid Sterba <dsterba@suse.cz>
Signed-off-by: NJosef Bacik <jbacik@fb.com>
Signed-off-by: NChris Mason <clm@fb.com>

e07a2ade

btrfs: Add noautodefrag mount option. · fc0ca9af

由 Qu Wenruo 提交于 1月 06, 2014

Btrfs has autodefrag mount option but no pairing noautodefrag option,
which makes it impossible to disable autodefrag without umount.
Signed-off-by: NQu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: NDavid Sterba <dsterba@suse.cz>
Signed-off-by: NJosef Bacik <jbacik@fb.com>
Signed-off-by: NChris Mason <clm@fb.com>

fc0ca9af

btrfs: Add "barrier" option to support "-o remount,barrier" · 842bef58

由 Qu Wenruo 提交于 1月 06, 2014

Btrfs can be remounted without barrier, but there is no "barrier" option
so nobody can remount btrfs back with barrier on. Only umount and
mount again can re-enable barrier.(Quite awkward)

Also the mount options in the document is also changed slightly for the
further pairing options changes.
Reported-by: NDaniel Blueman <daniel@quora.org>
Signed-off-by: NQu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: NMike Fleetwood <mike.fleetwood@googlemail.com>
Reviewed-by: NDavid Sterba <dsterba@suse.cz>
Signed-off-by: NJosef Bacik <jbacik@fb.com>
Signed-off-by: NChris Mason <clm@fb.com>

842bef58

Btrfs: only fua the first superblock when writting supers · e8117c26

由 Wang Shilong 提交于 1月 03, 2014

We only intent to fua the first superblock in every device from
comments, fix it.
Signed-off-by: NWang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: NJosef Bacik <jbacik@fb.com>
Signed-off-by: NChris Mason <clm@fb.com>

e8117c26

Btrfs: return free space to global_rsv as much as possible · 17504584

由 Liu Bo 提交于 12月 29, 2013

@full is not protected within global_rsv.lock, so we may think global_rsv
is already full but in fact it's not, so we miss the opportunity to return
free space to global_rsv directly when we release other block_rsvs.
Signed-off-by: NLiu Bo <bo.li.liu@oracle.com>
Signed-off-by: NJosef Bacik <jbacik@fb.com>
Signed-off-by: NChris Mason <clm@fb.com>

17504584

Btrfs: fix an oops when we fail to relocate tree blocks · 1708cc57

由 Wang Shilong 提交于 12月 28, 2013

During balance test, we hit an oops:
[ 2013.841551] kernel BUG at fs/btrfs/relocation.c:1174!

The problem is that if we fail to relocate tree blocks, we should
update backref cache, otherwise, some pending nodes are not updated
while snapshot check @cache->last_trans is within one transaction
and won't update it and then oops happen.
Signed-off-by: NWang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: NJosef Bacik <jbacik@fb.com>
Signed-off-by: NChris Mason <clm@fb.com>

1708cc57

Btrfs: fix the wrong nocow range check · e77751aa

由 Miao Xie 提交于 12月 27, 2013

The following warning message was outputed when running the 274th case
of xfstests with nodatacow option:
 BUG: Bad page state in process kswapd0  pfn:1c66f
 page:ffffea0000636848 count:0 mapcount:0 mapping:(null) index:0x78000
 page flags: 0x1000000000100a(error|uptodate|private_2)

It is because the check of nocow range was wrong, we should compare the
start and end position of the extent with the write position to verify
if the write position was in the extent, but the current code just used
the start postion to do the check, so we got the wrong extent and told
the caller that it was a nocow write. And then when we write back the
dirty pages, we found we should cow the extent, but at that time, there
was no space in the fs, we had to the error flag for the page. When
someone reclaimed that page, the above warning outputed. Fix it.
Reported-by: NTsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
Signed-off-by: NJosef Bacik <jbacik@fb.com>
Signed-off-by: NChris Mason <clm@fb.com>

e77751aa

Btrfs: fix an oops when we fail to merge reloc roots · 25e293c2

由 Wang Shilong 提交于 12月 26, 2013

Previously, we will free reloc root memory and then force filesystem
to be readonly. The problem is that there may be another thread commiting
transaction which will try to access freed reloc root during merging reloc
roots process.

To keep consistency snapshots shared space, we should allow snapshot
finished if possible, so here we don't free reloc root memory.
signed-off-by: NWang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: NJosef Bacik <jbacik@fb.com>
Signed-off-by: NChris Mason <clm@fb.com>

25e293c2

Btrfs: remove unused argument from select_reloc_root() · dc4103f9

由 Wang Shilong 提交于 12月 26, 2013

@nr is no longer used, remove it from select_reloc_root()
Signed-off-by: NWang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: NJosef Bacik <jbacik@fb.com>
Signed-off-by: NChris Mason <clm@fb.com>

dc4103f9

Btrfs: reduce btree node locking duration on item update · eb653de1

由 Filipe David Borba Manana 提交于 12月 23, 2013

If we do a btree search with the goal of updating an existing item
without changing its size (ins_len == 0 and cow == 1), then we never
need to hold locks on upper level nodes (even when slot == 0) after we
COW their child nodes/leaves, as we won't have node splits or merges
in this scenario (that is, no key additions, removals or shifts on any
nodes or leaves).

Therefore release the locks immediately after COWing the child nodes/leaves
while navigating the btree, even if their parent slot is 0, instead of
returning a path to the caller with those nodes locked, which would get
released only when the caller releases or frees the path (or if it calls
btrfs_unlock_up_safe).

This is a common scenario, for example when updating inode items in fs
trees and block group items in the extent tree.

The following benchmarks were performed on a quad core machine with 32Gb
of ram, using a leaf/node size of 4Kb (to generate deeper fs trees more
quickly).

  sysbench --test=fileio --file-num=131072 --file-total-size=8G \
    --file-test-mode=seqwr --num-threads=512 --file-block-size=8192 \
    --max-requests=100000 --file-io-mode=sync [prepare|run]

Before this change:  49.85Mb/s (average of 5 runs)
After this change:   50.38Mb/s (average of 5 runs)
Signed-off-by: NFilipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: NJosef Bacik <jbacik@fb.com>
Signed-off-by: NChris Mason <clm@fb.com>

eb653de1

fs/btrfs: Integer overflow in btrfs_ioctl_resize() · eb8052e0

由 Wenliang Fan 提交于 12月 20, 2013

The local variable 'new_size' comes from userspace. If a large number
was passed, there would be an integer overflow in the following line:
	new_size = old_size + new_size;
Signed-off-by: NWenliang Fan <fanwlexca@gmail.com>
Signed-off-by: NJosef Bacik <jbacik@fb.com>
Signed-off-by: NChris Mason <clm@fb.com>

eb8052e0

Btrfs: stop caching thread if extent_commit_sem is contended · c9ea7b24

由 Josef Bacik 提交于 9月 19, 2013

We can starve out the transaction commit with a bunch of caching threads all
running at the same time.  This is because we will only drop the
extent_commit_sem if we need_resched(), which isn't likely to happen since we
will be reading a lot from the disk so have already schedule()'ed plenty.  Alex
observed that he could starve out a transaction commit for up to a minute with
32 caching threads all running at once.  This will allow us to drop the
extent_commit_sem to allow the transaction commit to swap the commit_root out
and then all the cachers will start back up. Here is an explanation provided by
Igno

So, just to fill in what happens in this loop:

                                mutex_unlock(&caching_ctl->mutex);
                                cond_resched();
                                goto again;

where 'again:' takes caching_ctl->mutex and fs_info->extent_commit_sem
again:

        again:
                mutex_lock(&caching_ctl->mutex);
                /* need to make sure the commit_root doesn't disappear */
                down_read(&fs_info->extent_commit_sem);

So, if I'm reading the code correct, there can be a fair amount of
concurrency here: there may be multiple 'caching kthreads' per filesystem
active, while there's one fs_info->extent_commit_sem per filesystem
AFAICS.

So, what happens if there are a lot of CPUs all busy holding the
->extent_commit_sem rwsem read-locked and a writer arrives? They'd all
rush to try to release the fs_info->extent_commit_sem, and they'd block in
the down_read() because there's a writer waiting.

So there's a guarantee of forward progress. This should answer akpm's
concern I think.

Thanks,
Acked-by: NIngo Molnar <mingo@kernel.org>
Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
Signed-off-by: NChris Mason <clm@fb.com>

c9ea7b24

Btrfs: introduce the delayed inode ref deletion for the single link inode · 67de1176

由 Miao Xie 提交于 12月 26, 2013

The inode reference item is close to inode item, so we insert it simultaneously
with the inode item insertion when we create a file/directory.. In fact, we also
can handle the inode reference deletion by the same way. So we made this patch to
introduce the delayed inode reference deletion for the single link inode(At most
case, the file doesn't has hard link, so we don't take the hard link into account).

This function is based on the delayed inode mechanism. After applying this patch,
we can reduce the time of the file/directory deletion by ~10%.
Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
Signed-off-by: NChris Mason <clm@fb.com>

67de1176

M
Btrfs: use flags instead of the bool variants in delayed node · 7cf35d91
由 Miao Xie 提交于 12月 26, 2013
```
Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
Signed-off-by: NChris Mason <clm@fb.com>
```
7cf35d91

Btrfs: remove btrfs_end_transaction_dmeta() · a56dbd89

由 Miao Xie 提交于 12月 26, 2013

Two reasons:
- btrfs_end_transaction_dmeta() is the same as btrfs_end_transaction_throttle()
  so it is unnecessary.
- All the delayed items should be dealt in the current transaction, so the
  workers should not commit the transaction, instead, deal with the delayed
  items as many as possible.

So we can remove btrfs_end_transaction_dmeta()
Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
Signed-off-by: NChris Mason <clm@fb.com>

a56dbd89

Btrfs: cleanup code of btrfs_balance_delayed_items() · 0353808c

由 Miao Xie 提交于 12月 26, 2013

- move the condition check for wait into a function
- use wait_event_interruptible instead of prepare-schedule-finish process
Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
Signed-off-by: NChris Mason <clm@fb.com>

0353808c

Btrfs: don't run delayed nodes again after all nodes flush · 4dd466d3

由 Miao Xie 提交于 12月 26, 2013

If the number of the delayed items is greater than the upper limit, we will
try to flush all the delayed items. After that, it is unnecessary to run
them again because they are being dealt with by the wokers or the number of
them is less than the lower limit.
Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
Signed-off-by: NChris Mason <clm@fb.com>

4dd466d3

Btrfs: remove residual code in delayed inode async helper · 74c40f92

由 Miao Xie 提交于 12月 26, 2013

Before applying the patch
  commit de3cb945
  title: Btrfs: improve the delayed inode throttling

We need requeue the async work after the current work was done, it
introduced a deadlock problem. So we wrote the code that this patch
removes to avoid the above problem. But after applying the above
patch, the deadlock problem didn't exist. So we should remove that
fix code.
Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
Signed-off-by: NChris Mason <clm@fb.com>

74c40f92

Btrfs: convert printk to btrfs_ and fix BTRFS prefix · efe120a0

由 Frank Holton 提交于 12月 20, 2013

Convert all applicable cases of printk and pr_* to the btrfs_* macros.

Fix all uses of the BTRFS prefix.
Signed-off-by: NFrank Holton <fholton@gmail.com>
Signed-off-by: NJosef Bacik <jbacik@fb.com>
Signed-off-by: NChris Mason <clm@fb.com>

efe120a0

Btrfs: fix tree mod logging · 5de865ee

由 Filipe David Borba Manana 提交于 12月 20, 2013

While running the test btrfs/004 from xfstests in a loop, it failed
about 1 time out of 20 runs in my desktop. The failure happened in
the backref walking part of the test, and the test's error message was
like this:

  btrfs/004 93s ... [failed, exit status 1] - output mismatch (see /home/fdmanana/git/hub/xfstests_2/results//btrfs/004.out.bad)
      --- tests/btrfs/004.out	2013-11-26 18:25:29.263333714 +0000
      +++ /home/fdmanana/git/hub/xfstests_2/results//btrfs/004.out.bad	2013-12-10 15:25:10.327518516 +0000
      @@ -1,3 +1,8 @@
       QA output created by 004
       *** test backref walking
      -*** done
      +unexpected output from
      +	/home/fdmanana/git/hub/btrfs-progs/btrfs inspect-internal logical-resolve -P 141512704 /home/fdmanana/btrfs-tests/scratch_1
      +expected inum: 405, expected address: 454656, file: /home/fdmanana/btrfs-tests/scratch_1/snap1/p0/d6/d3d/d156/fce, got:
      +
       ...
       (Run 'diff -u tests/btrfs/004.out /home/fdmanana/git/hub/xfstests_2/results//btrfs/004.out.bad' to see the entire diff)
  Ran: btrfs/004
  Failures: btrfs/004
  Failed 1 of 1 tests

But immediately after the test finished, the btrfs inspect-internal command
returned the expected output:

  $ btrfs inspect-internal logical-resolve -P 141512704 /home/fdmanana/btrfs-tests/scratch_1
  inode 405 offset 454656 root 258
  inode 405 offset 454656 root 5

It turned out this was because the btrfs_search_old_slot() calls performed
during backref walking (backref.c:__resolve_indirect_ref) were not finding
anything. The reason for this turned out to be that the tree mod logging
code was not logging some node multi-step operations atomically, therefore
btrfs_search_old_slot() callers iterated often over an incomplete tree that
wasn't fully consistent with any tree state from the past. Besides missing
items, this often (but not always) resulted in -EIO errors during old slot
searches, reported in dmesg like this:

[ 4299.933936] ------------[ cut here ]------------
[ 4299.933949] WARNING: CPU: 0 PID: 23190 at fs/btrfs/ctree.c:1343 btrfs_search_old_slot+0x57b/0xab0 [btrfs]()
[ 4299.933950] Modules linked in: btrfs raid6_pq xor pci_stub vboxpci(O) vboxnetadp(O) vboxnetflt(O) vboxdrv(O) bnep rfcomm bluetooth parport_pc ppdev binfmt_misc joydev snd_hda_codec_h
[ 4299.933977] CPU: 0 PID: 23190 Comm: btrfs Tainted: G        W  O 3.12.0-fdm-btrfs-next-16+ #70
[ 4299.933978] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./Z77 Pro4, BIOS P1.50 09/04/2012
[ 4299.933979]  000000000000053f ffff8806f3fd98f8 ffffffff8176d284 0000000000000007
[ 4299.933982]  0000000000000000 ffff8806f3fd9938 ffffffff8104a81c ffff880659c64b70
[ 4299.933984]  ffff880659c643d0 ffff8806599233d8 ffff880701e2e938 0000160000000000
[ 4299.933987] Call Trace:
[ 4299.933991]  [<ffffffff8176d284>] dump_stack+0x55/0x76
[ 4299.933994]  [<ffffffff8104a81c>] warn_slowpath_common+0x8c/0xc0
[ 4299.933997]  [<ffffffff8104a86a>] warn_slowpath_null+0x1a/0x20
[ 4299.934003]  [<ffffffffa065d3bb>] btrfs_search_old_slot+0x57b/0xab0 [btrfs]
[ 4299.934005]  [<ffffffff81775f3b>] ? _raw_read_unlock+0x2b/0x50
[ 4299.934010]  [<ffffffffa0655001>] ? __tree_mod_log_search+0x81/0xc0 [btrfs]
[ 4299.934019]  [<ffffffffa06dd9b0>] __resolve_indirect_refs+0x130/0x5f0 [btrfs]
[ 4299.934027]  [<ffffffffa06a21f1>] ? free_extent_buffer+0x61/0xc0 [btrfs]
[ 4299.934034]  [<ffffffffa06de39c>] find_parent_nodes+0x1fc/0xe40 [btrfs]
[ 4299.934042]  [<ffffffffa06b13e0>] ? defrag_lookup_extent+0xe0/0xe0 [btrfs]
[ 4299.934048]  [<ffffffffa06b13e0>] ? defrag_lookup_extent+0xe0/0xe0 [btrfs]
[ 4299.934056]  [<ffffffffa06df980>] iterate_extent_inodes+0xe0/0x250 [btrfs]
[ 4299.934058]  [<ffffffff817762db>] ? _raw_spin_unlock+0x2b/0x50
[ 4299.934065]  [<ffffffffa06dfb82>] iterate_inodes_from_logical+0x92/0xb0 [btrfs]
[ 4299.934071]  [<ffffffffa06b13e0>] ? defrag_lookup_extent+0xe0/0xe0 [btrfs]
[ 4299.934078]  [<ffffffffa06b7015>] btrfs_ioctl+0xf65/0x1f60 [btrfs]
[ 4299.934080]  [<ffffffff811658b8>] ? handle_mm_fault+0x278/0xb00
[ 4299.934083]  [<ffffffff81075563>] ? up_read+0x23/0x40
[ 4299.934085]  [<ffffffff8177a41c>] ? __do_page_fault+0x20c/0x5a0
[ 4299.934088]  [<ffffffff811b2946>] do_vfs_ioctl+0x96/0x570
[ 4299.934090]  [<ffffffff81776e23>] ? error_sti+0x5/0x6
[ 4299.934093]  [<ffffffff810b71e8>] ? trace_hardirqs_off_caller+0x28/0xd0
[ 4299.934096]  [<ffffffff81776a09>] ? retint_swapgs+0xe/0x13
[ 4299.934098]  [<ffffffff811b2eb1>] SyS_ioctl+0x91/0xb0
[ 4299.934100]  [<ffffffff813eecde>] ? trace_hardirqs_on_thunk+0x3a/0x3f
[ 4299.934102]  [<ffffffff8177ef12>] system_call_fastpath+0x16/0x1b
[ 4299.934102]  [<ffffffff8177ef12>] system_call_fastpath+0x16/0x1b
[ 4299.934104] ---[ end trace 48f0cfc902491414 ]---
[ 4299.934378] btrfs bad fsid on block 0

These tree mod log operations that must be performed atomically, tree_mod_log_free_eb,
tree_mod_log_eb_copy, tree_mod_log_insert_root and tree_mod_log_insert_move, used to
be performed atomically before the following commit:

  c8cc6341
  (Btrfs: stop using GFP_ATOMIC for the tree mod log allocations)

That change removed the atomicity of such operations. This patch restores the
atomicity while still not doing the GFP_ATOMIC allocations of tree_mod_elem
structures, so it has to do the allocations using GFP_NOFS before acquiring
the mod log lock.

This issue has been experienced by several users recently, such as for example:

  http://www.spinics.net/lists/linux-btrfs/msg28574.html

After running the btrfs/004 test for 679 consecutive iterations with this
patch applied, I didn't ran into the issue anymore.

Cc: stable@vger.kernel.org
Signed-off-by: NFilipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: NJosef Bacik <jbacik@fb.com>
Signed-off-by: NChris Mason <clm@fb.com>

5de865ee

btrfs: check balance of send_in_progress · 66ef7d65

由 David Sterba 提交于 12月 17, 2013

Warn if the balance goes below zero, which appears to be unlikely
though. Otherwise cleans up the code a bit.
Signed-off-by: NDavid Sterba <dsterba@suse.cz>
Signed-off-by: NJosef Bacik <jbacik@fb.com>
Signed-off-by: NChris Mason <clm@fb.com>

66ef7d65