1. 10 Jun 2014, 10 commits
  2. 08 Apr 2014, 3 commits
    • Btrfs: fix unlock in __start_delalloc_inodes() · a1ecaabb
      Wang Shilong authored
      This patch fixes a regression caused by the following patch:
      Btrfs: don't flush all delalloc inodes when we doesn't get s_umount lock
      
      Breaking out of the while loop makes us call @spin_unlock() without
      having called @spin_lock() first; fix it.
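      
      A minimal sketch of the locking pattern being fixed; the names and the
      loop structure are illustrative, not the actual btrfs code:
      
        spin_lock(&root->delalloc_lock);
        while (!list_empty(&root->delalloc_inodes)) {
                /* ... pick the next delalloc inode ... */
                spin_unlock(&root->delalloc_lock);
      
                /* ... flush work done without the lock held ... */
      
                if (work_failed)
                        goto out;       /* fixed: a plain "break" here fell
                                         * through to the spin_unlock() below
                                         * although the lock is not held */
      
                spin_lock(&root->delalloc_lock);
        }
        spin_unlock(&root->delalloc_lock);
      out:
        return ret;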
      Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
      Reviewed-by: David Sterba <dsterba@suse.cz>
      Signed-off-by: Chris Mason <clm@fb.com>
      a1ecaabb
    • Btrfs: don't compress for a small write · 68bb462d
      Wang Shilong authored
      Compressing a small file range (<= blocksize) that is not stored as an
      inline extent cannot save any disk space, so skipping it saves us some
      CPU time.
      
      This patch also fixes the compression flag being cleared wrongly on the
      inode: for example, when @total_in is 4096 we may get @total_compressed
      of 52, but because we first align @total_compressed to the page cache
      size, we conclude that @total_in == @total_compressed and therefore
      clear the inode's compression flag.
      
      An exception is when inserting the inline extent fails while we still
      have @total_compressed < @total_in; we still reset the inode's flag in
      that case, which is fine because the compression effect is not good
      anyway.
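      
      A rough sketch of the early bail-out this adds; the variable names are
      illustrative rather than the exact ones used in compress_file_range():
      
        /*
         * A range of at most one block that cannot become an inline extent
         * (it does not start at offset 0, or the file extends past it)
         * gains nothing from compression, so write it out uncompressed
         * and skip the CPU cost.
         */
        if (total_in <= blocksize &&
            (start > 0 || end + 1 < inode_size))
                goto cleanup_and_bail_uncompressed;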
      Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      68bb462d
    • Btrfs: fix snapshot vs nocow writting · e9894fd3
      Wang Shilong authored
      While running fsstress and snapshots concurrently, we will hit something
      like the following:
      
      Thread 1			Thread 2
      
      |->fallocate
        |->write pages
          |->join transaction
             |->add ordered extent
          |->end transaction
      				|->flushing data
      				  |->creating pending snapshots
      |->write data into src root's
         fallocated space
      
      After the above workflows finish, we end up in a state where the source
      and snapshot roots share the same space, but the source root has written
      data into the fallocated space. This makes fsck fail to verify the
      checksums of the snapshot root's preallocated file extent data. Nocow
      writing has the same problem.
      
      Fix this by synchronizing snapshot creation with nocow writing (see the
      sketch below):
      
       1. For nocow writing, if there are pending snapshots we fall back to
          the COW path.
      
       2. If there are pending nocow writes, snapshot creation for this root
          is blocked until the nocow writes finish.
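      
      A simplified sketch of that synchronization; the field and helper names
      are illustrative, not the exact ones added by the patch:
      
        /* nocow writer side: decide between nocow and cow */
        if (atomic_read(&root->will_be_snapshotted))
                force_cow = 1;          /* a snapshot is pending, stay safe */
      
        /* snapshot side: announce the snapshot, then wait for writers */
        atomic_inc(&root->will_be_snapshotted);
        wait_event(root->nocow_write_waitq,
                   atomic_read(&root->nocow_writers) == 0);
        /* ... create the pending snapshot ... */
        atomic_dec(&root->will_be_snapshotted);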
      Reported-by: Gui Hecheng <guihc.fnst@cn.fujitsu.com>
      Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      e9894fd3
  3. 04 Apr 2014, 1 commit
    • mm + fs: store shadow entries in page cache · 91b0abe3
      Johannes Weiner authored
      Reclaim will be leaving shadow entries in the page cache radix tree upon
      evicting the real page.  As those pages are found from the LRU, an
      iput() can lead to the inode being freed concurrently.  At this point,
      reclaim must no longer install shadow pages because the inode freeing
      code needs to ensure the page tree is really empty.
      
      Add an address_space flag, AS_EXITING, that the inode freeing code sets
      under the tree lock before doing the final truncate.  Reclaim will check
      for this flag before installing shadow pages.
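      
      A condensed sketch of the handshake described above, assuming helpers
      along the lines of mapping_set_exiting()/mapping_exiting(); simplified,
      not a verbatim excerpt of the patch:
      
        /* inode freeing side, before the final truncate: */
        spin_lock_irq(&mapping->tree_lock);
        mapping_set_exiting(mapping);   /* sets AS_EXITING */
        spin_unlock_irq(&mapping->tree_lock);
        /* ... truncate remaining pages and shadow entries ... */
      
        /* reclaim side, when about to leave a shadow entry behind: */
        if (mapping_exiting(mapping))
                return;                 /* inode is going away, keep the
                                         * radix tree empty instead */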
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Reviewed-by: Minchan Kim <minchan@kernel.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Bob Liu <bob.liu@oracle.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Luigi Semenzato <semenzato@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Metin Doslu <metin@citusdata.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Ozgun Erdogan <ozgun@citusdata.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Roman Gushchin <klamm@yandex-team.ru>
      Cc: Ryan Mallon <rmallon@gmail.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      91b0abe3
  4. 11 Mar 2014, 13 commits
  5. 15 Feb 2014, 1 commit
    • Btrfs: unset DCACHE_DISCONNECTED when mounting default subvol · 3a0dfa6a
      Josef Bacik authored
      A user was running into errors from an NFS export of a subvolume that had a
      default subvol set.  When we mount a default subvol we will use d_obtain_alias()
      to find an existing dentry for the subvolume in the case that the root subvol
      has already been mounted, or a dummy one is allocated in the case that the root
      subvol has not already been mounted.  This allows us to connect the dentry later
      on if we wander into the path.  However if we don't ever wander into the path we
      will keep DCACHE_DISCONNECTED set for a long time, which angers NFS.  It doesn't
      appear to cause any problems but it is annoying nonetheless, so simply unset
      DCACHE_DISCONNECTED in the get_default_root case and switch btrfs_lookup() to
      use d_materialise_unique() instead which will make everything play nicely
      together and reconnect things if we wander into the default subvol path by
      some other route.  With this patch I'm no longer getting the NFS errors when
      exporting a volume that has been mounted with a default subvol set.  Thanks,
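      
      A minimal sketch of the get_default_root change (simplified, not the
      literal hunk):
      
        dentry = d_obtain_alias(inode);
        if (!IS_ERR(dentry)) {
                /*
                 * The default subvol root is a well-known, connected
                 * location; don't leave it looking disconnected to NFS
                 * just because nobody has walked the path yet.
                 */
                spin_lock(&dentry->d_lock);
                dentry->d_flags &= ~DCACHE_DISCONNECTED;
                spin_unlock(&dentry->d_lock);
        }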
      
      cc: bfields@fieldses.org
      cc: ebiederm@xmission.com
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      3a0dfa6a
  6. 04 Feb 2014, 1 commit
  7. 29 Jan 2014, 11 commits
    • Btrfs: setup inode location during btrfs_init_inode_locked · 90d3e592
      Chris Mason authored
      We have a race during inode init because the BTRFS_I(inode)->location is setup
      after the inode hash table lock is dropped.  btrfs_find_actor uses the location
      field, so our search might not find an existing inode in the hash table if we
      race with the inode init code.
      
      This commit changes things to set up the location field sooner.  Also, the find actor now
      uses only the location objectid to match inodes.  For inode hashing, we just
      need a unique and stable test, it doesn't have to reflect the inode numbers we
      show to userland.
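      
      A hedged sketch of the find actor after this change; the argument
      struct and field names are illustrative:
      
        static int btrfs_find_actor(struct inode *inode, void *opaque)
        {
                struct btrfs_iget_args *args = opaque;
      
                /*
                 * Match on the location objectid only: it is unique and
                 * stable for hashing, and it is set before the inode is
                 * inserted into the hash table.
                 */
                return args->ino == BTRFS_I(inode)->location.objectid &&
                       args->root == BTRFS_I(inode)->root;
        }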
      Signed-off-by: Chris Mason <clm@fb.com>
      CC: stable@vger.kernel.org
      90d3e592
    • Btrfs: don't use ram_bytes for uncompressed inline items · 514ac8ad
      Chris Mason authored
      If we truncate an uncompressed inline item, ram_bytes isn't updated to reflect
      the new size.  The fix uses the size directly from the item header when
      reading uncompressed inlines, and also fixes truncate to update the
      size as it goes.
      Reported-by: Jens Axboe <axboe@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      CC: stable@vger.kernel.org
      514ac8ad
    • btrfs: fix warning while merging two adjacent extents · 3c9665df
      Gui Hecheng authored
      When we have two adjacent extents in relink_extent_backref,
      we try to merge them. When we use btrfs_search_slot to locate the
      slot for the current extent, we shouldn't set "ins_len = 1",
      because we will merge it into the previous extent rather than
      insert a new item. Otherwise, we may end up creating a new leaf
      in btrfs_search_slot and path->slots[0] will be 0. Then we try to
      fetch the previous item using "path->slots[0]--", which causes
      a warning as follows:
      
      	[  145.713385] WARNING: CPU: 3 PID: 1796 at fs/btrfs/extent_io.c:5043 map_private_extent_buffer+0xd4/0xe0
      	[  145.713387] btrfs bad mapping eb start 53370886 len 4096, wanted 167772306 8
      	...
      	[  145.713462]  [<ffffffffa034b1f4>] map_private_extent_buffer+0xd4/0xe0
      	[  145.713476]  [<ffffffffa030097a>] ? btrfs_free_path+0x2a/0x40
      	[  145.713485]  [<ffffffffa0340864>] btrfs_get_token_64+0x64/0xf0
      	[  145.713498]  [<ffffffffa033472c>] relink_extent_backref+0x41c/0x820
      	[  145.713508]  [<ffffffffa0334d69>] btrfs_finish_ordered_io+0x239/0xa80
      
      I encountered this warning when running defrag on a filesystem created
      with mkfs.btrfs -M, while reads/writes and snapshots were running in
      the background.
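      
      A minimal sketch of the intended call (assumed, abbreviated):
      
        /*
         * We merge into the previous item rather than inserting a new one,
         * so pass ins_len = 0 (and cow = 1) to avoid a leaf split that
         * could leave path->slots[0] == 0 in a freshly created leaf.
         */
        ret = btrfs_search_slot(trans, root, &key, path, 0, 1);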
      Signed-off-by: Gui Hecheng <guihc.fnst@cn.fujitsu.com>
      Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      3c9665df
    • Btrfs: fix transaction abortion when remounting btrfs from RW to RO · 2c21b4d7
      Wang Shilong authored
      Steps to reproduce:
       # mkfs.btrfs -f /dev/sda8
       # mount /dev/sda8 /mnt -o flushoncommit
       # dd if=/dev/zero of=/mnt/data bs=4k count=102400 &
       # mount /dev/sda8 /mnt -o remount, ro
      
      When remounting from RW to RO, the logic is to first set the RO flag
      and then commit the transaction. However, with the flushoncommit
      option enabled we do an RO check while committing the transaction,
      so we hit a transaction abort here.
      
      Actually, the check here is wrong: we should check whether
      FS_STATE_ERROR is set instead. Fix it.
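      
      A rough sketch of the corrected check, illustrating the intent rather
      than the exact hunk:
      
        /* before: refuse whenever the fs is already read-only */
        if (root->fs_info->sb->s_flags & MS_RDONLY)
                return -EROFS;
      
        /* after: only refuse when the filesystem is in an error state */
        if (test_bit(BTRFS_FS_STATE_ERROR, &root->fs_info->fs_state))
                return -EROFS;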
      Reported-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
      Suggested-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      2c21b4d7
    • Btrfs: add support for inode properties · 63541927
      Filipe David Borba Manana authored
      This change adds infrastructure to allow for generic properties for
      inodes. Properties are name/value pairs that can be associated with
      inodes for different purposes. They are stored as xattrs with the
      prefix "btrfs."
      
      Properties can be inherited - this means when a directory inode has
      inheritable properties set, these are added to new inodes created
      under that directory. Further, subvolumes can also have properties
      associated with them, and they can be inherited from their parent
      subvolume. Naturally, directory properties have priority over subvolume
      properties (in practice a subvolume property is just a regular
      property associated with the root inode, objectid 256, of the
      subvolume's fs tree).
      
      This change also adds one specific property implementation, named
      "compression", whose values can be "lzo" or "zlib" and it's an
      inheritable property.
      
      The corresponding changes to btrfs-progs were also implemented.
      A patch with xfstests for this feature will follow once there's
      agreement on this change/feature.
      
      Further, the script at the bottom of this commit message was used to
      do some benchmarks to measure any performance penalties of this feature.
      
      Basically the tests correspond to:
      
      Test 1 - create a filesystem and mount it with compress-force=lzo,
      then sequentially create N files of 64Kb each, measure how long it took
      to create the files, unmount the filesystem, mount the filesystem and
      perform an 'ls -lha' against the test directory holding the N files, and
      report the time the command took.
      
      Test 2 - create a filesystem and don't use any compression option when
      mounting it - instead set the compression property of the subvolume's
      root to 'lzo'. Then create N files of 64Kb, and report the time it took.
      Then unmount the filesystem, mount it again and perform an 'ls -lha' like
      in the former test. This means every single file ends up with a property
      (xattr) associated to it.
      
      Test 3 - same as test 2, but uses 4 properties - 3 are duplicates of the
      compression property and have no real effect other than adding more work
      when inheriting properties and taking more btree leaf space.
      
      Test 4 - same as test 3 but with 10 properties per file.
      
      Results (in seconds, and averages of 5 runs each), for different N
      numbers of files follow.
      
      * Without properties (test 1)
      
                          file creation time        ls -lha time
      10 000 files              3.49                   0.76
      100 000 files            47.19                   8.37
      1 000 000 files         518.51                 107.06
      
      * With 1 property (compression property set to lzo - test 2)
      
                          file creation time        ls -lha time
      10 000 files              3.63                    0.93
      100 000 files            48.56                    9.74
      1 000 000 files         537.72                  125.11
      
      * With 4 properties (test 3)
      
                          file creation time        ls -lha time
      10 000 files              3.94                    1.20
      100 000 files            52.14                   11.48
      1 000 000 files         572.70                  142.13
      
      * With 10 properties (test 4)
      
                          file creation time        ls -lha time
      10 000 files              4.61                    1.35
      100 000 files            58.86                   13.83
      1 000 000 files         656.01                  177.61
      
      The increased latencies with properties are essentially because of:
      
      *) When creating an inode, we now synchronously write 1 more item
         (an xattr item) for each property inherited from the parent dir
         (or subvolume). This could be done in an asynchronous way, as
         we do for dir index items (delayed-inode.c), which could help
         reduce the file creation latency;
      
      *) With properties, we now have larger fs trees. For this particular
         test each xattr item uses 75 bytes of leaf space in the fs tree.
         This could be less by using a new item for xattr items, instead of
         the current btrfs_dir_item, since we could cut the 'location' and
         'type' fields (saving 18 bytes) and maybe 'transid' too (saving a
         total of 26 bytes per xattr item) from the btrfs_dir_item type.
      
      Also tried batching the xattr insertions (ignoring proper hash
      collision handling, since it didn't exist) when creating files that
      inherit properties from their parent inode/subvolume, but the end
      results were (surprisingly) essentially the same.
      
      Test script:
      
      $ cat test.pl
        #!/usr/bin/perl -w
      
        use strict;
        use Time::HiRes qw(time);
        use constant NUM_FILES => 10_000;
        use constant FILE_SIZES => (64 * 1024);
        use constant DEV => '/dev/sdb4';
        use constant MNT_POINT => '/home/fdmanana/btrfs-tests/dev';
        use constant TEST_DIR => (MNT_POINT . '/testdir');
      
        system("mkfs.btrfs", "-l", "16384", "-f", DEV) == 0 or die "mkfs.btrfs failed!";
      
        # following line for testing without properties
        #system("mount", "-o", "compress-force=lzo", DEV, MNT_POINT) == 0 or die "mount failed!";
      
        # following 2 lines for testing with properties
        system("mount", DEV, MNT_POINT) == 0 or die "mount failed!";
        system("btrfs", "prop", "set", MNT_POINT, "compression", "lzo") == 0 or die "set prop failed!";
      
        system("mkdir", TEST_DIR) == 0 or die "mkdir failed!";
        my ($t1, $t2);
      
        $t1 = time();
        for (my $i = 1; $i <= NUM_FILES; $i++) {
            my $p = TEST_DIR . '/file_' . $i;
            open(my $f, '>', $p) or die "Error opening file!";
            $f->autoflush(1);
            for (my $j = 0; $j < FILE_SIZES; $j += 4096) {
                print $f ('A' x 4096) or die "Error writing to file!";
            }
            close($f);
        }
        $t2 = time();
        print "Time to create " . NUM_FILES . ": " . ($t2 - $t1) . " seconds.\n";
        system("umount", DEV) == 0 or die "umount failed!";
        system("mount", DEV, MNT_POINT) == 0 or die "mount failed!";
      
        $t1 = time();
        system("bash -c 'ls -lha " . TEST_DIR . " > /dev/null'") == 0 or die "ls failed!";
        $t2 = time();
        print "Time to ls -lha all files: " . ($t2 - $t1) . " seconds.\n";
        system("umount", DEV) == 0 or die "umount failed!";
      Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      63541927
    • Btrfs: faster file extent item replace operations · 1acae57b
      Filipe David Borba Manana authored
      When writing to a file we drop existing file extent items that cover the
      write range and then add a new file extent item that represents that write
      range.
      
      Before this change we were doing a tree lookup to remove the file extent
      items, and then after we did another tree lookup to insert the new file
      extent item.
      Most of the time all the file extent items we need to drop are located
      within a single leaf - this is the leaf where our new file extent item ends
      up at. Therefore, in this common case just combine these 2 operations into
      a single one.
      
      By avoiding the second btree navigation for insertion of the new file extent
      item, we reduce btree node/leaf lock acquisitions/releases, btree block/leaf
      COW operations, CPU time on btree node/leaf key binary searches, etc.
      
      Besides file writes, this operation also happens during file fsyncs.
      However, log btrees are much less likely to be as big as regular fs
      btrees, so the impact of this change is smaller there.
      
      The following benchmark was performed against an SSD drive and a
      HDD drive, both for random and sequential writes:
      
        sysbench --test=fileio --file-num=4096 --file-total-size=8G \
           --file-test-mode=[rndwr|seqwr] --num-threads=512 \
           --file-block-size=8192 --max-requests=1000000 \
           --file-fsync-freq=0 --file-io-mode=sync [prepare|run]
      
      All results below are averages of 10 runs of the respective test.
      
      ** SSD sequential writes
      
      Before this change: 225.88 Mb/sec
      After this change:  277.26 Mb/sec
      
      ** SSD random writes
      
      Before this change: 49.91 Mb/sec
      After this change:  56.39 Mb/sec
      
      ** HDD sequential writes
      
      Before this change: 68.53 Mb/sec
      After this change:  69.87 Mb/sec
      
      ** HDD random writes
      
      Before this change: 13.04 Mb/sec
      After this change:  14.39 Mb/sec
      Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      1acae57b
    • Btrfs: fix the wrong nocow range check · e77751aa
      Miao Xie authored
      The following warning message was output when running the 274th case
      of xfstests with the nodatacow option:
       BUG: Bad page state in process kswapd0  pfn:1c66f
       page:ffffea0000636848 count:0 mapcount:0 mapping:(null) index:0x78000
       page flags: 0x1000000000100a(error|uptodate|private_2)
      
      This is because the nocow range check was wrong: we should compare both
      the start and the end position of the extent with the write position to
      verify that the write position lies inside the extent, but the current
      code only used the start position for the check. So we picked the wrong
      extent and told the caller it was a nocow write. Then, when writing back
      the dirty pages, we found we had to COW the extent after all, but by that
      time there was no space left in the fs, so we had to set the error flag
      on the page. When someone later reclaimed that page, the above warning
      was output. Fix it.
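      
      A tiny sketch of the corrected containment check; the variable names
      are illustrative:
      
        /*
         * The write position must lie inside the found extent, not merely
         * start at or after the extent's start offset.
         */
        if (write_start < extent_start ||
            write_start >= extent_start + extent_len)
                goto out_cow;           /* not a nocow candidate */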
      Reported-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      e77751aa
    • Btrfs: reduce btree node locking duration on item update · eb653de1
      Filipe David Borba Manana authored
      If we do a btree search with the goal of updating an existing item
      without changing its size (ins_len == 0 and cow == 1), then we never
      need to hold locks on upper level nodes (even when slot == 0) after we
      COW their child nodes/leaves, as we won't have node splits or merges
      in this scenario (that is, no key additions, removals or shifts on any
      nodes or leaves).
      
      Therefore release the locks immediately after COWing the child nodes/leaves
      while navigating the btree, even if their parent slot is 0, instead of
      returning a path to the caller with those nodes locked, which would get
      released only when the caller releases or frees the path (or if it calls
      btrfs_unlock_up_safe).
      
      This is a common scenario, for example when updating inode items in fs
      trees and block group items in the extent tree.
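      
      A hedged sketch of the condition for the early unlock during the btree
      descent (not the literal patch):
      
        /*
         * Once the child at this level has been COWed and we are only
         * updating an existing item in place (ins_len == 0, cow == 1),
         * no split or merge can propagate upwards, so the locks above
         * this level can be dropped now, even if path->slots[level] == 0.
         */
        if (ins_len == 0 && cow)
                btrfs_unlock_up_safe(path, level + 1);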
      
      The following benchmarks were performed on a quad core machine with 32Gb
      of ram, using a leaf/node size of 4Kb (to generate deeper fs trees more
      quickly).
      
        sysbench --test=fileio --file-num=131072 --file-total-size=8G \
          --file-test-mode=seqwr --num-threads=512 --file-block-size=8192 \
          --max-requests=100000 --file-io-mode=sync [prepare|run]
      
      Before this change:  49.85Mb/s (average of 5 runs)
      After this change:   50.38Mb/s (average of 5 runs)
      Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      eb653de1
    • Btrfs: introduce the delayed inode ref deletion for the single link inode · 67de1176
      Miao Xie authored
      The inode reference item is stored close to the inode item, so we insert
      it together with the inode item when we create a file/directory. In fact,
      we can handle the inode reference deletion the same way, so this patch
      introduces delayed inode reference deletion for single-link inodes (in
      most cases a file has no hard links, so we don't take hard links into
      account).
      
      This feature is based on the delayed inode mechanism. After applying this
      patch, the time for file/directory deletion is reduced by ~10%.
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      67de1176
    • Btrfs: convert printk to btrfs_ and fix BTRFS prefix · efe120a0
      Frank Holton authored
      Convert all applicable cases of printk and pr_* to the btrfs_* macros.
      
      Fix all uses of the BTRFS prefix.
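      
      An illustrative before/after of the kind of conversion this makes (the
      message text here is made up):
      
        /* before: raw printk with a hand-rolled "btrfs:" prefix */
        printk(KERN_INFO "btrfs: setting incompat feature flag\n");
      
        /* after: prefix and fs_info context handled by the btrfs_* helpers */
        btrfs_info(fs_info, "setting incompat feature flag");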
      Signed-off-by: Frank Holton <fholton@gmail.com>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      efe120a0
    • Btrfs: fix a warning when iput a file · 180589ef
      Wang Shilong authored
      See the warning below:
      
      [ 1209.102076]  [<ffffffffa04721b9>] remove_extent_mapping+0x69/0x70 [btrfs]
      [ 1209.102084]  [<ffffffffa0466b06>] btrfs_evict_inode+0x96/0x4d0 [btrfs]
      [ 1209.102089]  [<ffffffff81073010>] ? wake_atomic_t_function+0x40/0x40
      [ 1209.102092]  [<ffffffff8118ab2e>] evict+0x9e/0x190
      [ 1209.102094]  [<ffffffff8118b313>] iput+0xf3/0x180
      [ 1209.102101]  [<ffffffffa0461fd1>] btrfs_run_delayed_iputs+0xb1/0xd0 [btrfs]
      [ 1209.102107]  [<ffffffffa045d358>] __btrfs_end_transaction+0x268/0x350 [btrfs]
      
      Clear the extent bit here to avoid triggering the WARN_ON() in remove_extent_mapping().
      Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      180589ef