1. 11 3月, 2014 4 次提交
    • F
      Btrfs: make defrag not fragment files when using prealloc extents · e2127cf0
      Filipe Manana 提交于
      When using prealloc extents, a file defragment operation may actually
      fragment the file and increase the amount of data space used by the file.
      This change fixes that behaviour.
      
      Example:
      
      $ mkfs.btrfs -f /dev/sdb3
      $ mount /dev/sdb3 /mnt
      $ cd /mnt
      $ xfs_io -f -c 'falloc 0 1048576' foobar && sync
      $ xfs_io -c 'pwrite -S 0xff -b 100000 5000 100000' foobar
      $ xfs_io -c 'pwrite -S 0xac -b 100000 200000 100000' foobar
      $ xfs_io -c 'pwrite -S 0xe1 -b 100000 900000 100000' foobar && sync
      
      Before defragmenting the file:
      
      $ btrfs filesystem df /mnt
      Data, single: total=8.00MiB, used=1.25MiB
      System, DUP: total=8.00MiB, used=16.00KiB
      System, single: total=4.00MiB, used=0.00
      Metadata, DUP: total=1.00GiB, used=112.00KiB
      Metadata, single: total=8.00MiB, used=0.00
      
      $ btrfs-debug-tree /dev/sdb3
      (...)
      	item 6 key (257 EXTENT_DATA 0) itemoff 15810 itemsize 53
      		prealloc data disk byte 12845056 nr 1048576
      		prealloc data offset 0 nr 4096
      	item 7 key (257 EXTENT_DATA 4096) itemoff 15757 itemsize 53
      		extent data disk byte 12845056 nr 1048576
      		extent data offset 4096 nr 102400 ram 1048576
      		extent compression 0
      	item 8 key (257 EXTENT_DATA 106496) itemoff 15704 itemsize 53
      		prealloc data disk byte 12845056 nr 1048576
      		prealloc data offset 106496 nr 90112
      	item 9 key (257 EXTENT_DATA 196608) itemoff 15651 itemsize 53
      		extent data disk byte 12845056 nr 1048576
      		extent data offset 196608 nr 106496 ram 1048576
      		extent compression 0
      	item 10 key (257 EXTENT_DATA 303104) itemoff 15598 itemsize 53
      		prealloc data disk byte 12845056 nr 1048576
      		prealloc data offset 303104 nr 593920
      	item 11 key (257 EXTENT_DATA 897024) itemoff 15545 itemsize 53
      		extent data disk byte 12845056 nr 1048576
      		extent data offset 897024 nr 106496 ram 1048576
      		extent compression 0
      	item 12 key (257 EXTENT_DATA 1003520) itemoff 15492 itemsize 53
      		prealloc data disk byte 12845056 nr 1048576
      		prealloc data offset 1003520 nr 45056
      (...)
      
      Now defragmenting the file results in more data space used than before:
      
      $ btrfs filesystem defragment -f foobar && sync
      $ btrfs filesystem df /mnt
      Data, single: total=8.00MiB, used=1.55MiB
      System, DUP: total=8.00MiB, used=16.00KiB
      System, single: total=4.00MiB, used=0.00
      Metadata, DUP: total=1.00GiB, used=112.00KiB
      Metadata, single: total=8.00MiB, used=0.00
      
      And the corresponding file extent items are now no longer perfectly sequential
      as before, and we're now needlessly using more space from data block groups:
      
      $ btrfs-debug-tree /dev/sdb3
      (...)
      	item 6 key (257 EXTENT_DATA 0) itemoff 15810 itemsize 53
      		extent data disk byte 12845056 nr 1048576
      		extent data offset 0 nr 4096 ram 1048576
      		extent compression 0
      	item 7 key (257 EXTENT_DATA 4096) itemoff 15757 itemsize 53
      		extent data disk byte 13893632 nr 102400
      		extent data offset 0 nr 102400 ram 102400
      		extent compression 0
      	item 8 key (257 EXTENT_DATA 106496) itemoff 15704 itemsize 53
      		extent data disk byte 12845056 nr 1048576
      		extent data offset 106496 nr 90112 ram 1048576
      		extent compression 0
      	item 9 key (257 EXTENT_DATA 196608) itemoff 15651 itemsize 53
      		extent data disk byte 13996032 nr 106496
      		extent data offset 0 nr 106496 ram 106496
      		extent compression 0
      	item 10 key (257 EXTENT_DATA 303104) itemoff 15598 itemsize 53
      		prealloc data disk byte 12845056 nr 1048576
      		prealloc data offset 303104 nr 593920
      	item 11 key (257 EXTENT_DATA 897024) itemoff 15545 itemsize 53
      		extent data disk byte 14102528 nr 106496
      		extent data offset 0 nr 106496 ram 106496
      		extent compression 0
      	item 12 key (257 EXTENT_DATA 1003520) itemoff 15492 itemsize 53
      		extent data disk byte 12845056 nr 1048576
      		extent data offset 1003520 nr 45056 ram 1048576
      		extent compression 0
      (...)
      
      With this change, the above example will no longer cause allocation of new data
      space nor change the sequentiality of the file extents, that is, defragment will
      be effectless, leaving all extent items pointing to the extent starting at disk
      byte 12845056.
      
      In a 20Gb filesystem I had, mounted with the autodefrag option and 20 files of
      400Mb each, initially consisting of a single prealloc extent of 400Mb, having
      random writes happening at a low rate, lead to a total of over ~17Gb of data
      space used, not far from eventually reaching an ENOSPC state.
      Signed-off-by: NFilipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      e2127cf0
    • F
      Btrfs: correctly flush data on defrag when compression is enabled · dec8ef90
      Filipe Manana 提交于
      When the defrag flag BTRFS_DEFRAG_RANGE_START_IO is set and compression
      enabled, we weren't flushing completely, as writing compressed extents
      is a 2 steps process, one to compress the data and another one to write
      the compressed data to disk.
      Signed-off-by: NFilipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      dec8ef90
    • H
      btrfs: Fix 32/64-bit problem with BTRFS_SET_RECEIVED_SUBVOL ioctl · abccd00f
      Hugo Mills 提交于
      The structure for BTRFS_SET_RECEIVED_IOCTL packs differently on 32-bit
      and 64-bit systems. This means that it is impossible to use btrfs
      receive on a system with a 64-bit kernel and 32-bit userspace, because
      the structure size (and hence the ioctl number) is different.
      
      This patch adds a compatibility structure and ioctl to deal with the
      above case.
      Signed-off-by: NHugo Mills <hugo@carfax.org.uk>
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      abccd00f
    • K
      btrfs: Return EXDEV for cross file system snapshot · 23ad5b17
      Kusanagi Kouichi 提交于
      EXDEV seems an appropriate error if an operation fails bacause it
      crosses file system boundaries.
      Reviewed-by: NDavid Sterba <dsterba@suse.cz>
      Signed-off-by: NKusanagi Kouichi <slash@ac.auone-net.jp>
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      23ad5b17
  2. 15 2月, 2014 1 次提交
  3. 09 2月, 2014 2 次提交
  4. 29 1月, 2014 14 次提交
    • J
      btrfs: fix defrag 32-bit integer overflow · c41570c9
      Justin Maggard 提交于
      When defragging a very large file, the cluster variable can wrap its 32-bit
      signed int type and become negative, which eventually gets passed to
      btrfs_force_ra() as a very large unsigned long value.  On 32-bit platforms,
      this eventually results in an Oops from the SLAB allocator.
      
      Change the cluster and max_cluster signed int variables to unsigned long to
      match the readahead functions.  This also allows the min() comparison in
      btrfs_defrag_file() to work as intended.
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      c41570c9
    • D
      btrfs: call permission checks earlier in ioctls and return EPERM · bd60ea0f
      David Sterba 提交于
      The owner and capability checks in IOC_SUBVOL_SETFLAGS and
      SET_RECEIVED_SUBVOL should be called before any other checks are done.
      
      Also unify the error code to EPERM.
      Signed-off-by: NDavid Sterba <dsterba@suse.cz>
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      bd60ea0f
    • D
      btrfs: restrict snapshotting to own subvolumes · d0242061
      David Sterba 提交于
      Currently, any user can snapshot any subvolume if the path is accessible and
      thus indirectly create and keep files he does not own under his direcotries.
      This is not possible with traditional directories.
      
      In security context, a user can snapshot root filesystem and pin any
      potentially buggy binaries, even if the updates are applied.
      
      All the snapshots are visible to the administrator, so it's possible to
      verify if there are suspicious snapshots.
      
      Another more practical problem is that any user can pin the space used
      by eg. root and cause ENOSPC.
      
      Original report:
      https://bugs.launchpad.net/ubuntu/+source/apparmor/+bug/484786
      
      CC: stable@vger.kernel.org
      Signed-off-by: NDavid Sterba <dsterba@suse.cz>
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      d0242061
    • F
      Btrfs: faster file extent item search in clone ioctl · e4355f34
      Filipe David Borba Manana 提交于
      When we are looking for file extent items that intersect the cloning
      range, for each one that falls completely outside the range, don't
      release the path and do another full tree search - just move on
      to the next slot and copy the file extent item into our buffer only
      if the item intersects the cloning range.
      Signed-off-by: NFilipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      e4355f34
    • F
      Btrfs: unlock inodes in correct order in clone ioctl · c57c2b3e
      Filipe David Borba Manana 提交于
      In the clone ioctl, when the source and target inodes are different,
      we can acquire their mutexes in 2 possible different orders. After
      we're done cloning, we were releasing the mutexes always in the same
      order - the most correct way of doing it is to release them by the
      reverse order they were acquired.
      Signed-off-by: NFilipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      c57c2b3e
    • L
      Btrfs: release subvolume's block_rsv before transaction commit · de6e8200
      Liu Bo 提交于
      We don't have to keep subvolume's block_rsv during transaction commit,
      and within transaction commit, we may also need the free space reclaimed
      from this block_rsv to process delayed refs.
      Signed-off-by: NLiu Bo <bo.li.liu@oracle.com>
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      de6e8200
    • F
      Btrfs: add support for inode properties · 63541927
      Filipe David Borba Manana 提交于
      This change adds infrastructure to allow for generic properties for
      inodes. Properties are name/value pairs that can be associated with
      inodes for different purposes. They are stored as xattrs with the
      prefix "btrfs."
      
      Properties can be inherited - this means when a directory inode has
      inheritable properties set, these are added to new inodes created
      under that directory. Further, subvolumes can also have properties
      associated with them, and they can be inherited from their parent
      subvolume. Naturally, directory properties have priority over subvolume
      properties (in practice a subvolume property is just a regular
      property associated with the root inode, objectid 256, of the
      subvolume's fs tree).
      
      This change also adds one specific property implementation, named
      "compression", whose values can be "lzo" or "zlib" and it's an
      inheritable property.
      
      The corresponding changes to btrfs-progs were also implemented.
      A patch with xfstests for this feature will follow once there's
      agreement on this change/feature.
      
      Further, the script at the bottom of this commit message was used to
      do some benchmarks to measure any performance penalties of this feature.
      
      Basically the tests correspond to:
      
      Test 1 - create a filesystem and mount it with compress-force=lzo,
      then sequentially create N files of 64Kb each, measure how long it took
      to create the files, unmount the filesystem, mount the filesystem and
      perform an 'ls -lha' against the test directory holding the N files, and
      report the time the command took.
      
      Test 2 - create a filesystem and don't use any compression option when
      mounting it - instead set the compression property of the subvolume's
      root to 'lzo'. Then create N files of 64Kb, and report the time it took.
      The unmount the filesystem, mount it again and perform an 'ls -lha' like
      in the former test. This means every single file ends up with a property
      (xattr) associated to it.
      
      Test 3 - same as test 2, but uses 4 properties - 3 are duplicates of the
      compression property, have no real effect other than adding more work
      when inheriting properties and taking more btree leaf space.
      
      Test 4 - same as test 3 but with 10 properties per file.
      
      Results (in seconds, and averages of 5 runs each), for different N
      numbers of files follow.
      
      * Without properties (test 1)
      
                          file creation time        ls -lha time
      10 000 files              3.49                   0.76
      100 000 files            47.19                   8.37
      1 000 000 files         518.51                 107.06
      
      * With 1 property (compression property set to lzo - test 2)
      
                          file creation time        ls -lha time
      10 000 files              3.63                    0.93
      100 000 files            48.56                    9.74
      1 000 000 files         537.72                  125.11
      
      * With 4 properties (test 3)
      
                          file creation time        ls -lha time
      10 000 files              3.94                    1.20
      100 000 files            52.14                   11.48
      1 000 000 files         572.70                  142.13
      
      * With 10 properties (test 4)
      
                          file creation time        ls -lha time
      10 000 files              4.61                    1.35
      100 000 files            58.86                   13.83
      1 000 000 files         656.01                  177.61
      
      The increased latencies with properties are essencialy because of:
      
      *) When creating an inode, we now synchronously write 1 more item
         (an xattr item) for each property inherited from the parent dir
         (or subvolume). This could be done in an asynchronous way such
         as we do for dir intex items (delayed-inode.c), which could help
         reduce the file creation latency;
      
      *) With properties, we now have larger fs trees. For this particular
         test each xattr item uses 75 bytes of leaf space in the fs tree.
         This could be less by using a new item for xattr items, instead of
         the current btrfs_dir_item, since we could cut the 'location' and
         'type' fields (saving 18 bytes) and maybe 'transid' too (saving a
         total of 26 bytes per xattr item) from the btrfs_dir_item type.
      
      Also tried batching the xattr insertions (ignoring proper hash
      collision handling, since it didn't exist) when creating files that
      inherit properties from their parent inode/subvolume, but the end
      results were (surprisingly) essentially the same.
      
      Test script:
      
      $ cat test.pl
        #!/usr/bin/perl -w
      
        use strict;
        use Time::HiRes qw(time);
        use constant NUM_FILES => 10_000;
        use constant FILE_SIZES => (64 * 1024);
        use constant DEV => '/dev/sdb4';
        use constant MNT_POINT => '/home/fdmanana/btrfs-tests/dev';
        use constant TEST_DIR => (MNT_POINT . '/testdir');
      
        system("mkfs.btrfs", "-l", "16384", "-f", DEV) == 0 or die "mkfs.btrfs failed!";
      
        # following line for testing without properties
        #system("mount", "-o", "compress-force=lzo", DEV, MNT_POINT) == 0 or die "mount failed!";
      
        # following 2 lines for testing with properties
        system("mount", DEV, MNT_POINT) == 0 or die "mount failed!";
        system("btrfs", "prop", "set", MNT_POINT, "compression", "lzo") == 0 or die "set prop failed!";
      
        system("mkdir", TEST_DIR) == 0 or die "mkdir failed!";
        my ($t1, $t2);
      
        $t1 = time();
        for (my $i = 1; $i <= NUM_FILES; $i++) {
            my $p = TEST_DIR . '/file_' . $i;
            open(my $f, '>', $p) or die "Error opening file!";
            $f->autoflush(1);
            for (my $j = 0; $j < FILE_SIZES; $j += 4096) {
                print $f ('A' x 4096) or die "Error writing to file!";
            }
            close($f);
        }
        $t2 = time();
        print "Time to create " . NUM_FILES . ": " . ($t2 - $t1) . " seconds.\n";
        system("umount", DEV) == 0 or die "umount failed!";
        system("mount", DEV, MNT_POINT) == 0 or die "mount failed!";
      
        $t1 = time();
        system("bash -c 'ls -lha " . TEST_DIR . " > /dev/null'") == 0 or die "ls failed!";
        $t2 = time();
        print "Time to ls -lha all files: " . ($t2 - $t1) . " seconds.\n";
        system("umount", DEV) == 0 or die "umount failed!";
      Signed-off-by: NFilipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      63541927
    • W
      fs/btrfs: Integer overflow in btrfs_ioctl_resize() · eb8052e0
      Wenliang Fan 提交于
      The local variable 'new_size' comes from userspace. If a large number
      was passed, there would be an integer overflow in the following line:
      	new_size = old_size + new_size;
      Signed-off-by: NWenliang Fan <fanwlexca@gmail.com>
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      eb8052e0
    • F
      Btrfs: convert printk to btrfs_ and fix BTRFS prefix · efe120a0
      Frank Holton 提交于
      Convert all applicable cases of printk and pr_* to the btrfs_* macros.
      
      Fix all uses of the BTRFS prefix.
      Signed-off-by: NFrank Holton <fholton@gmail.com>
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      efe120a0
    • D
      btrfs: Check read-only status of roots during send · 2c686537
      David Sterba 提交于
      All the subvolues that are involved in send must be read-only during the
      whole operation. The ioctl SUBVOL_SETFLAGS could be used to change the
      status to read-write and the result of send stream is undefined if the
      data change unexpectedly.
      
      Fix that by adding a refcount for all involved roots and verify that
      there's no send in progress during SUBVOL_SETFLAGS ioctl call that does
      read-only -> read-write transition.
      
      We need refcounts because there are no restrictions on number of send
      parallel operations currently run on a single subvolume, be it source,
      parent or one of the multiple clone sources.
      
      Kernel is silent when the RO checks fail and returns EPERM. The same set
      of checks is done already in userspace before send starts.
      Signed-off-by: NDavid Sterba <dsterba@suse.cz>
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      2c686537
    • T
      Btrfs: fix error check of btrfs_lookup_dentry() · 5662344b
      Tsutomu Itoh 提交于
      Clean up btrfs_lookup_dentry() to never return NULL, but PTR_ERR(-ENOENT)
      instead. This keeps the return value convention consistent.
      
      Callers who use btrfs_lookup_dentry() require a trivial update.
      
      create_snapshot() in particular looks like it can also lose a BUG_ON(!inode)
      which is not really needed - there seems less harm in returning ENOENT to
      userspace at that point in the stack than there is to crash the machine.
      Signed-off-by: NTsutomu Itoh <t-itoh@jp.fujitsu.com>
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      5662344b
    • J
      btrfs: add ioctl to export size of global metadata reservation · 01e219e8
      Jeff Mahoney 提交于
      btrfs filesystem df output will show the size of the metadata space
      and how much of it is used, and the user assumes that the difference
      is all usable space. Since that's not actually the case due to the
      global metadata reservation, we should provide the full picture to the
      user.
      
      This patch adds an ioctl that exports the size of the global metadata
      reservation so that btrfs filesystem df can report it.
      Signed-off-by: NJeff Mahoney <jeffm@suse.com>
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      01e219e8
    • J
      btrfs: use feature attribute names to print better error messages · 3b02a68a
      Jeff Mahoney 提交于
      Now that we have the feature name strings available in the kernel via
      the sysfs attributes, we can use them for printing better failure
      messages from the ioctl path.
      Signed-off-by: NJeff Mahoney <jeffm@suse.com>
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      3b02a68a
    • J
      btrfs: add ioctls to query/change feature bits online · 2eaa055f
      Jeff Mahoney 提交于
      There are some feature bits that require no offline setup and can
      be enabled online. I've only reviewed extended irefs, but there will
      probably be more.
      
      We introduce three new ioctls:
      - BTRFS_IOC_GET_SUPPORTED_FEATURES: query the kernel for supported features.
      - BTRFS_IOC_GET_FEATURES: query the kernel for enabled features on a per-fs
        basis, as well as querying for which features are changeable with mounted.
      - BTRFS_IOC_SET_FEATURES: change features on a per-fs basis.
      
      We introduce two new masks per feature set (_SAFE_SET and _SAFE_CLEAR) that
      allow us to define which features are safe to change at runtime.
      
      The failure modes for BTRFS_IOC_SET_FEATURES are as follows:
      - Enabling a completely unsupported feature: warns and returns -ENOTSUPP
      - Enabling a feature that can only be done offline: warns and returns -EPERM
      Signed-off-by: NJeff Mahoney <jeffm@suse.com>
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      2eaa055f
  5. 12 12月, 2013 1 次提交
  6. 15 11月, 2013 2 次提交
  7. 12 11月, 2013 11 次提交
  8. 21 9月, 2013 3 次提交
  9. 01 9月, 2013 2 次提交
    • A
      btrfs: return btrfs error code for dev excl ops err · e57138b3
      Anand Jain 提交于
      now threads can return BTRFS_ERROR_DEV_EXCL_RUN_IN_PROGRESS
      as defined in btrfs.h for the dev excl operation error in
      the FS, which means with this kernel would stop logging
      (almost an user error) into the /var/log/messages
      
      v2: accepts Josef' comment
      Signed-off-by: NAnand Jain <anand.jain@oracle.com>
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      Signed-off-by: NChris Mason <chris.mason@fusionio.com>
      e57138b3
    • F
      Btrfs: fix race conditions in BTRFS_IOC_FS_INFO ioctl · f7171750
      Filipe David Borba Manana 提交于
      The handler for the ioctl BTRFS_IOC_FS_INFO was reading the
      number of devices before acquiring the device list mutex.
      
      This could lead to inconsistent results because the update of
      the device list and the number of devices counter (amongst other
      counters related to the device list) are updated in volumes.c
      while holding the device list mutex - except for 2 places, one
      was volumes.c:btrfs_prepare_sprout() and the other was
      volumes.c:device_list_add().
      
      For example, if we have 2 devices, with IDs 1 and 2 and then add
      a new device, with ID 3, and while adding the device is in progress
      an BTRFS_IOC_FS_INFO ioctl arrives, it could return a number of
      devices of 2 and a max dev id of 3. This would be incorrect.
      
      Also, this ioctl handler was reading the fsid while it can be
      updated concurrently. This can happen when while a new device is
      being added and the current filesystem is in seeding mode.
      Example:
      
      $ mkfs.btrfs -f /dev/sdb1
      $ mkfs.btrfs -f /dev/sdb2
      $ btrfstune -S 1 /dev/sdb1
      $ mount /dev/sdb1 /mnt/test
      $ btrfs device add /dev/sdb2 /mnt/test
      
      If during the last step a BTRFS_IOC_FS_INFO ioctl was requested, it
      could read an fsid that was never valid (some bits part of the old
      fsid and others part of the new fsid). Also, it could read a number
      of devices that doesn't match the number of devices in the list and
      the max device id, as explained before.
      Signed-off-by: NFilipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      Signed-off-by: NChris Mason <chris.mason@fusionio.com>
      f7171750