1. 08 4月, 2014 2 次提交
  2. 07 4月, 2014 1 次提交
    • J
      Btrfs: remove transaction from send · 9e351cc8
      Josef Bacik 提交于
      Lets try this again.  We can deadlock the box if we send on a box and try to
      write onto the same fs with the app that is trying to listen to the send pipe.
      This is because the writer could get stuck waiting for a transaction commit
      which is being blocked by the send.  So fix this by making sure looking at the
      commit roots is always going to be consistent.  We do this by keeping track of
      which roots need to have their commit roots swapped during commit, and then
      taking the commit_root_sem and swapping them all at once.  Then make sure we
      take a read lock on the commit_root_sem in cases where we search the commit root
      to make sure we're always looking at a consistent view of the commit roots.
      Previously we had problems with this because we would swap a fs tree commit root
      and then swap the extent tree commit root independently which would cause the
      backref walking code to screw up sometimes.  With this patch we no longer
      deadlock and pass all the weird send/receive corner cases.  Thanks,
      Reportedy-by: NHugo Mills <hugo@carfax.org.uk>
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      9e351cc8
  3. 11 3月, 2014 22 次提交
  4. 29 1月, 2014 15 次提交
    • C
      Btrfs: don't use ram_bytes for uncompressed inline items · 514ac8ad
      Chris Mason 提交于
      If we truncate an uncompressed inline item, ram_bytes isn't updated to reflect
      the new size.  The fixe uses the size directly from the item header when
      reading uncompressed inlines, and also fixes truncate to update the
      size as it goes.
      Reported-by: NJens Axboe <axboe@fb.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      CC: stable@vger.kernel.org
      514ac8ad
    • M
      Btrfs: change the members' order of btrfs_space_info structure to reduce the cache miss · 26b47ff6
      Miao Xie 提交于
      It is better that the position of the lock is close to the data which is
      protected by it, because they may be in the same cache line, we will load
      less cache lines when we access them. So we rearrange the members' position
      of btrfs_space_info structure to make the lock be closer to the its data.
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.cz>
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      26b47ff6
    • Q
      btrfs: Add noinode_cache mount option · 3818aea2
      Qu Wenruo 提交于
      Add noinode_cache mount option for btrfs.
      
      Since inode map cache involves all the btrfs_find_free_ino/return_ino
      things and if just trigger the mount_opt,
      an inode number get from inode map cache will not returned to inode map
      cache.
      
      To keep the find and return inode both in the same behavior,
      a new bit in mount_opt, CHANGE_INODE_CACHE, is introduced for this idea.
      CHANGE_INODE_CACHE is set/cleared in remounting, and the original
      INODE_MAP_CACHE is set/cleared according to CHANGE_INODE_CACHE after a
      success transaction.
      Since find/return inode is all done between btrfs_start_transaction and
      btrfs_commit_transaction, this will keep consistent behavior.
      
      Also noinode_cache mount option will not stop the caching_kthread.
      
      Cc: David Sterba <dsterba@suse.cz>
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: NQu Wenruo <quwenruo@cn.fujitsu.com>
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      3818aea2
    • W
      Btrfs: fix to search previous metadata extent item since skinny metadata · ade2e0b3
      Wang Shilong 提交于
      There is a bug that using btrfs_previous_item() to search metadata extent item.
      This is because in btrfs_previous_item(), we need type match, however, since
      skinny metada was introduced by josef, we may mix this two types. So just
      use btrfs_previous_item() is not working right.
      
      To keep btrfs_previous_item() like normal tree search, i introduce another
      function btrfs_previous_extent_item().
      Signed-off-by: NWang Shilong <wangsl.fnst@cn.fujitsu.com>
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      ade2e0b3
    • J
      Btrfs: throttle delayed refs better · 0a2b2a84
      Josef Bacik 提交于
      On one of our gluster clusters we noticed some pretty big lag spikes.  This
      turned out to be because our transaction commit was taking like 3 minutes to
      complete.  This is because we have like 30 gigs of metadata, so our global
      reserve would end up being the max which is like 512 mb.  So our throttling code
      would allow a ridiculous amount of delayed refs to build up and then they'd all
      get run at transaction commit time, and for a cold mounted file system that
      could take up to 3 minutes to run.  So fix the throttling to be based on both
      the size of the global reserve and how long it takes us to run delayed refs.
      This patch tracks the time it takes to run delayed refs and then only allows 1
      seconds worth of outstanding delayed refs at a time.  This way it will auto-tune
      itself from cold cache up to when everything is in memory and it no longer has
      to go to disk.  This makes our transaction commits take much less time to run.
      Thanks,
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      0a2b2a84
    • F
      Btrfs: add support for inode properties · 63541927
      Filipe David Borba Manana 提交于
      This change adds infrastructure to allow for generic properties for
      inodes. Properties are name/value pairs that can be associated with
      inodes for different purposes. They are stored as xattrs with the
      prefix "btrfs."
      
      Properties can be inherited - this means when a directory inode has
      inheritable properties set, these are added to new inodes created
      under that directory. Further, subvolumes can also have properties
      associated with them, and they can be inherited from their parent
      subvolume. Naturally, directory properties have priority over subvolume
      properties (in practice a subvolume property is just a regular
      property associated with the root inode, objectid 256, of the
      subvolume's fs tree).
      
      This change also adds one specific property implementation, named
      "compression", whose values can be "lzo" or "zlib" and it's an
      inheritable property.
      
      The corresponding changes to btrfs-progs were also implemented.
      A patch with xfstests for this feature will follow once there's
      agreement on this change/feature.
      
      Further, the script at the bottom of this commit message was used to
      do some benchmarks to measure any performance penalties of this feature.
      
      Basically the tests correspond to:
      
      Test 1 - create a filesystem and mount it with compress-force=lzo,
      then sequentially create N files of 64Kb each, measure how long it took
      to create the files, unmount the filesystem, mount the filesystem and
      perform an 'ls -lha' against the test directory holding the N files, and
      report the time the command took.
      
      Test 2 - create a filesystem and don't use any compression option when
      mounting it - instead set the compression property of the subvolume's
      root to 'lzo'. Then create N files of 64Kb, and report the time it took.
      The unmount the filesystem, mount it again and perform an 'ls -lha' like
      in the former test. This means every single file ends up with a property
      (xattr) associated to it.
      
      Test 3 - same as test 2, but uses 4 properties - 3 are duplicates of the
      compression property, have no real effect other than adding more work
      when inheriting properties and taking more btree leaf space.
      
      Test 4 - same as test 3 but with 10 properties per file.
      
      Results (in seconds, and averages of 5 runs each), for different N
      numbers of files follow.
      
      * Without properties (test 1)
      
                          file creation time        ls -lha time
      10 000 files              3.49                   0.76
      100 000 files            47.19                   8.37
      1 000 000 files         518.51                 107.06
      
      * With 1 property (compression property set to lzo - test 2)
      
                          file creation time        ls -lha time
      10 000 files              3.63                    0.93
      100 000 files            48.56                    9.74
      1 000 000 files         537.72                  125.11
      
      * With 4 properties (test 3)
      
                          file creation time        ls -lha time
      10 000 files              3.94                    1.20
      100 000 files            52.14                   11.48
      1 000 000 files         572.70                  142.13
      
      * With 10 properties (test 4)
      
                          file creation time        ls -lha time
      10 000 files              4.61                    1.35
      100 000 files            58.86                   13.83
      1 000 000 files         656.01                  177.61
      
      The increased latencies with properties are essencialy because of:
      
      *) When creating an inode, we now synchronously write 1 more item
         (an xattr item) for each property inherited from the parent dir
         (or subvolume). This could be done in an asynchronous way such
         as we do for dir intex items (delayed-inode.c), which could help
         reduce the file creation latency;
      
      *) With properties, we now have larger fs trees. For this particular
         test each xattr item uses 75 bytes of leaf space in the fs tree.
         This could be less by using a new item for xattr items, instead of
         the current btrfs_dir_item, since we could cut the 'location' and
         'type' fields (saving 18 bytes) and maybe 'transid' too (saving a
         total of 26 bytes per xattr item) from the btrfs_dir_item type.
      
      Also tried batching the xattr insertions (ignoring proper hash
      collision handling, since it didn't exist) when creating files that
      inherit properties from their parent inode/subvolume, but the end
      results were (surprisingly) essentially the same.
      
      Test script:
      
      $ cat test.pl
        #!/usr/bin/perl -w
      
        use strict;
        use Time::HiRes qw(time);
        use constant NUM_FILES => 10_000;
        use constant FILE_SIZES => (64 * 1024);
        use constant DEV => '/dev/sdb4';
        use constant MNT_POINT => '/home/fdmanana/btrfs-tests/dev';
        use constant TEST_DIR => (MNT_POINT . '/testdir');
      
        system("mkfs.btrfs", "-l", "16384", "-f", DEV) == 0 or die "mkfs.btrfs failed!";
      
        # following line for testing without properties
        #system("mount", "-o", "compress-force=lzo", DEV, MNT_POINT) == 0 or die "mount failed!";
      
        # following 2 lines for testing with properties
        system("mount", DEV, MNT_POINT) == 0 or die "mount failed!";
        system("btrfs", "prop", "set", MNT_POINT, "compression", "lzo") == 0 or die "set prop failed!";
      
        system("mkdir", TEST_DIR) == 0 or die "mkdir failed!";
        my ($t1, $t2);
      
        $t1 = time();
        for (my $i = 1; $i <= NUM_FILES; $i++) {
            my $p = TEST_DIR . '/file_' . $i;
            open(my $f, '>', $p) or die "Error opening file!";
            $f->autoflush(1);
            for (my $j = 0; $j < FILE_SIZES; $j += 4096) {
                print $f ('A' x 4096) or die "Error writing to file!";
            }
            close($f);
        }
        $t2 = time();
        print "Time to create " . NUM_FILES . ": " . ($t2 - $t1) . " seconds.\n";
        system("umount", DEV) == 0 or die "umount failed!";
        system("mount", DEV, MNT_POINT) == 0 or die "mount failed!";
      
        $t1 = time();
        system("bash -c 'ls -lha " . TEST_DIR . " > /dev/null'") == 0 or die "ls failed!";
        $t2 = time();
        print "Time to ls -lha all files: " . ($t2 - $t1) . " seconds.\n";
        system("umount", DEV) == 0 or die "umount failed!";
      Signed-off-by: NFilipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      63541927
    • F
      Btrfs: faster file extent item replace operations · 1acae57b
      Filipe David Borba Manana 提交于
      When writing to a file we drop existing file extent items that cover the
      write range and then add a new file extent item that represents that write
      range.
      
      Before this change we were doing a tree lookup to remove the file extent
      items, and then after we did another tree lookup to insert the new file
      extent item.
      Most of the time all the file extent items we need to drop are located
      within a single leaf - this is the leaf where our new file extent item ends
      up at. Therefore, in this common case just combine these 2 operations into
      a single one.
      
      By avoiding the second btree navigation for insertion of the new file extent
      item, we reduce btree node/leaf lock acquisitions/releases, btree block/leaf
      COW operations, CPU time on btree node/leaf key binary searches, etc.
      
      Besides for file writes, this is an operation that happens for file fsync's
      as well. However log btrees are much less likely to big as big as regular
      fs btrees, therefore the impact of this change is smaller.
      
      The following benchmark was performed against an SSD drive and a
      HDD drive, both for random and sequential writes:
      
        sysbench --test=fileio --file-num=4096 --file-total-size=8G \
           --file-test-mode=[rndwr|seqwr] --num-threads=512 \
           --file-block-size=8192 \ --max-requests=1000000 \
           --file-fsync-freq=0 --file-io-mode=sync [prepare|run]
      
      All results below are averages of 10 runs of the respective test.
      
      ** SSD sequential writes
      
      Before this change: 225.88 Mb/sec
      After this change:  277.26 Mb/sec
      
      ** SSD random writes
      
      Before this change: 49.91 Mb/sec
      After this change:  56.39 Mb/sec
      
      ** HDD sequential writes
      
      Before this change: 68.53 Mb/sec
      After this change:  69.87 Mb/sec
      
      ** HDD random writes
      
      Before this change: 13.04 Mb/sec
      After this change:  14.39 Mb/sec
      Signed-off-by: NFilipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      1acae57b
    • F
      Btrfs: convert printk to btrfs_ and fix BTRFS prefix · efe120a0
      Frank Holton 提交于
      Convert all applicable cases of printk and pr_* to the btrfs_* macros.
      
      Fix all uses of the BTRFS prefix.
      Signed-off-by: NFrank Holton <fholton@gmail.com>
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      efe120a0
    • D
      btrfs: Check read-only status of roots during send · 2c686537
      David Sterba 提交于
      All the subvolues that are involved in send must be read-only during the
      whole operation. The ioctl SUBVOL_SETFLAGS could be used to change the
      status to read-write and the result of send stream is undefined if the
      data change unexpectedly.
      
      Fix that by adding a refcount for all involved roots and verify that
      there's no send in progress during SUBVOL_SETFLAGS ioctl call that does
      read-only -> read-write transition.
      
      We need refcounts because there are no restrictions on number of send
      parallel operations currently run on a single subvolume, be it source,
      parent or one of the multiple clone sources.
      
      Kernel is silent when the RO checks fail and returns EPERM. The same set
      of checks is done already in userspace before send starts.
      Signed-off-by: NDavid Sterba <dsterba@suse.cz>
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      2c686537
    • F
      Btrfs: remove field tree_mod_seq_elem from btrfs_fs_info struct · e223cfcd
      Filipe David Borba Manana 提交于
      It's not used anywhere, so just drop it.
      Signed-off-by: NFilipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      e223cfcd
    • J
      Btrfs: move the extent buffer radix tree into the fs_info · f28491e0
      Josef Bacik 提交于
      I need to create a fake tree to test qgroups and I don't want to have to setup a
      fake btree_inode.  The fact is we only use the radix tree for the fs_info, so
      everybody else who allocates an extent_io_tree is just wasting the space anyway.
      This patch moves the radix tree and its lock into btrfs_fs_info so there is less
      stuff I have to fake to do qgroup sanity tests.  Thanks,
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      f28491e0
    • F
      Btrfs: make btrfs_debug match pr_debug handling related to DEBUG · 27a0dd61
      Frank Holton 提交于
      The kernel macro pr_debug is defined as a empty statement when DEBUG is
      not defined. Make btrfs_debug match pr_debug to avoid spamming
      the kernel log with debug messages
      Signed-off-by: NFrank Holton <fholton@gmail.com>
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      27a0dd61
    • S
      btrfs: cleanup: removed unused 'btrfs_get_inode_ref_index' · 33b98f22
      Sergei Trofimovich 提交于
      Found by uselex.rb:
      > btrfs_get_inode_ref_index: [R]: exported from:
      fs/btrfs/inode-item.o fs/btrfs/btrfs.o fs/btrfs/built-in.o
      Signed-off-by: NSergei Trofimovich <slyfox@gentoo.org>
      Reviewed-by: NDavid Stebra <dsterba@suse.cz>
      Signed-off-by: NWang Shilong <wangsl.fnst@cn.fujitsu.com>
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      33b98f22
    • K
      btrfs: bootstrap generic btrfs_find_item interface · e33d5c3d
      Kelley Nielsen 提交于
      There are many btrfs functions that manually search the tree for an
      item. They all reimplement the same mechanism and differ in the
      conditions that they use to find the item. __inode_info() is one such
      example. Zach Brown proposed creating a new interface to take the place
      of these functions.
      
      This patch is the first step to creating the interface. A new function,
      btrfs_find_item, has been added to ctree.c and prototyped in ctree.h.
      It is identical to __inode_info, except that the order of the parameters
      has been rearranged to more closely those of similar functions elsewhere
      in the code (now, root and path come first, then the objectid, offset
      and type, and the key to be filled in last). __inode_info's callers have
      been set to call this new function instead, and __inode_info itself has
      been removed.
      Signed-off-by: NKelley Nielsen <kelleynnn@gmail.com>
      Suggested-by: NZach Brown <zab@redhat.com>
      Reviewed-by: NJosh Triplett <josh@joshtriplett.org>
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      e33d5c3d
    • J
      btrfs: publish device membership in sysfs · 29e5be24
      Jeff Mahoney 提交于
      Now that we have the infrastructure for per-super attributes, we can
      publish device membership in /sys/fs/btrfs/<fsid>/devices. The information
      is published as symlinks to the block devices.
      Signed-off-by: NJeff Mahoney <jeffm@suse.com>
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      29e5be24