1. 29 Jan, 2014 (40 commits)
    • btrfs: call permission checks earlier in ioctls and return EPERM · bd60ea0f
      David Sterba committed
      The owner and capability checks in IOC_SUBVOL_SETFLAGS and
      SET_RECEIVED_SUBVOL should be called before any other checks are done.
      
      Also unify the error code to EPERM.
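
      The ordering described above can be sketched as follows. This is a minimal illustration, not the kernel code: the function name, its parameters and the EINVAL validation step are assumptions made for the example.

```python
import errno


def set_subvol_flags(caller_uid, owner_uid, has_capability, flags, supported_flags):
    """Hypothetical ioctl-style handler: returns 0 or a negative errno."""
    # Ownership/capability check comes first, before any argument is inspected,
    # so an unprivileged caller always sees the same error code.
    if caller_uid != owner_uid and not has_capability:
        return -errno.EPERM
    # Argument validation is reached only by permitted callers.
    if flags & ~supported_flags:
        return -errno.EINVAL
    return 0
```

      With this ordering, an unprivileged caller passing invalid flags gets EPERM rather than a validation error.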
      Signed-off-by: David Sterba <dsterba@suse.cz>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • btrfs: restrict snapshotting to own subvolumes · d0242061
      David Sterba committed
      Currently, any user can snapshot any subvolume if the path is accessible and
      thus indirectly create and keep files he does not own under his directories.
      This is not possible with traditional directories.

      In a security context, a user can snapshot the root filesystem and pin any
      potentially buggy binaries, even after updates are applied.

      All the snapshots are visible to the administrator, so it's possible to
      check for suspicious snapshots.

      Another, more practical problem is that any user can pin the space used
      by e.g. root and cause ENOSPC.
      
      Original report:
      https://bugs.launchpad.net/ubuntu/+source/apparmor/+bug/484786
      
      CC: stable@vger.kernel.org
      Signed-off-by: David Sterba <dsterba@suse.cz>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: fix wrong block group in trace during the free space allocation · 89d4346a
      Miao Xie committed
      We allocate the free space from the former block group, not the current
      one, so we should use the former one to output the trace information.
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: cleanup the code of used_block_group in find_free_extent() · 215a63d1
      Miao Xie committed
      used_block_group is only needed for a space cluster that doesn't
      belong to the current block group; the other places need not use it.
      Otherwise the logic of the code is unclear.
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • 920e4a58
    • Btrfs: change the members' order of btrfs_space_info structure to reduce the cache miss · 26b47ff6
      Miao Xie committed
      It is better for a lock to be placed close to the data it protects:
      they may then share a cache line, so we load fewer cache lines
      when we access them. So we rearrange the members of the
      btrfs_space_info structure to bring the lock closer to its data.
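
      The effect of member order on cache lines can be illustrated with ctypes. This is a sketch under stated assumptions: a 64-byte cache line, a structure that starts on a cache-line boundary, and two made-up layouts (Scattered, Grouped) that are not the real btrfs_space_info.

```python
import ctypes

CACHE_LINE = 64  # assumed cache line size in bytes


class Scattered(ctypes.Structure):
    # Lock and the data it protects are separated by unrelated members.
    _fields_ = [
        ("lock", ctypes.c_uint32),
        ("unrelated", ctypes.c_uint8 * 120),
        ("bytes_used", ctypes.c_uint64),
    ]


class Grouped(ctypes.Structure):
    # Lock sits right next to the data it protects.
    _fields_ = [
        ("lock", ctypes.c_uint32),
        ("bytes_used", ctypes.c_uint64),
        ("unrelated", ctypes.c_uint8 * 120),
    ]


def cache_lines_touched(struct, names):
    """Return the set of cache-line indexes holding the named members."""
    return {getattr(struct, name).offset // CACHE_LINE for name in names}
```

      Taking the lock and then updating bytes_used touches two cache lines in Scattered but only one in Grouped.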
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Reviewed-by: David Sterba <dsterba@suse.cz>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: fix wrong search path initialization before searching tree root · ffcfaf81
      Wang Shilong committed
      To search the tree root without transaction protection, we should neither search
      the commit root nor skip locking here. Fix it.
      Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: flush the dirty pages of the ordered extent aggressively during logging csum · 23c671a5
      Miao Xie committed
      The performance of fsync sometimes dropped suddenly. The main reason
      was that we might flush only part of the dirty pages in an ordered
      extent, then get that ordered extent and wait for the csum calculation. But if
      no task flushed the remaining part, we would wait until the flusher flushed it,
      sometimes for several seconds, which made the performance drop
      suddenly. (On my box, it dropped from 56MB/s to 4-10MB/s.)

      This patch mitigates the problem by flushing the remaining dirty pages aggressively.
      
      Test Environment:
      CPU:		2CPU * 2Cores
      Memory:		4GB
      Partition:	20GB(HDD)
      
      Test Command:
       # sysbench --num-threads=8 --test=fileio --file-num=1 \
       > --file-total-size=8G --file-block-size=32768 \
       > --file-io-mode=sync --file-fsync-freq=100 \
       > --file-fsync-end=no --max-requests=10000 \
       > --file-test-mode=rndwr run
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: fix transaction abortion when remounting btrfs from RW to RO · 2c21b4d7
      Wang Shilong committed
      Steps to reproduce:
       # mkfs.btrfs -f /dev/sda8
       # mount /dev/sda8 /mnt -o flushoncommit
       # dd if=/dev/zero of=/mnt/data bs=4k count=102400 &
       # mount /dev/sda8 /mnt -o remount,ro
      
      When remounting from RW to RO, the logic is to first set the RO
      flag and then commit the transaction. However, with the
      flushoncommit option enabled, we do an RO check while committing the
      transaction, so we get a transaction abort here.

      Actually, the check here is wrong: we should check whether
      FS_STATE_ERROR is set. Fix it.
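
      The changed condition can be sketched like this. The function name, the bit value chosen for FS_STATE_ERROR and the EROFS return code are assumptions for illustration only, not taken from the patch.

```python
import errno

FS_STATE_ERROR = 1 << 0  # illustrative bit value, not the kernel's definition


def flush_during_commit(read_only, fs_state):
    """Hypothetical flush step run inside a transaction commit."""
    # The read-only flag is deliberately ignored: during a RW -> RO remount
    # it is already set while the final commit still has to flush data.
    if fs_state & FS_STATE_ERROR:
        return -errno.EROFS
    return 0
```

      A remount in progress (read_only set, no error state) is then allowed to finish its commit.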
      Reported-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
      Suggested-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: faster file extent item search in clone ioctl · e4355f34
      Filipe David Borba Manana committed
      When we are looking for file extent items that intersect the cloning
      range, for each one that falls completely outside the range, don't
      release the path and do another full tree search - just move on
      to the next slot and copy the file extent item into our buffer only
      if the item intersects the cloning range.
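
      The idea of walking slots instead of searching again can be sketched on a sorted list. This is a simplified model, assuming a single leaf represented as a Python list of (offset, length) pairs; the function name is hypothetical.

```python
import bisect


def extents_in_range(leaf, start, end):
    """leaf: (offset, length) file extents sorted by offset.

    Returns the extents intersecting [start, end) using one search
    followed by a linear walk over the following slots.
    """
    offsets = [off for off, _ in leaf]
    # One search positions us at the first slot that could intersect.
    slot = max(bisect.bisect_right(offsets, start) - 1, 0)
    found = []
    while slot < len(leaf):
        off, length = leaf[slot]
        if off >= end:
            break
        # Items entirely before the range are skipped without a new search.
        if off + length > start:
            found.append((off, length))
        slot += 1
    return found
```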
      Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: fix extent state leak on transaction abortion · 1a4319cc
      Liu Bo committed
      When a transaction is aborted, we fail to commit the transaction and do
      cleanup work instead.  After that, when we umount btrfs, we free the fs roots' log
      trees, but that happens after we unpin extents, so the extents
      pinned by freeing log trees remain in memory and leak.
      Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • btrfs: Cleanup the btrfs_parse_options for remount. · 07802534
      Qu Wenruo committed
      Remount appends the new mount options to the original mount
      options, which makes btrfs_parse_options check the old options and then
      the new options, causing confusing output like "enabling XXX" followed by
      "disable XXX".

      This patch adds an extra check before every btrfs_info to skip the
      output from checking the old options.
      Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • btrfs: Add noinode_cache mount option · 3818aea2
      Qu Wenruo committed
      Add noinode_cache mount option for btrfs.

      The inode map cache is used by btrfs_find_free_ino/return_ino, so if
      the mount option were simply toggled, an inode number obtained from
      the inode map cache would not be returned to the inode map cache.

      To keep finding and returning inode numbers consistent,
      a new bit in mount_opt, CHANGE_INODE_CACHE, is introduced.
      CHANGE_INODE_CACHE is set/cleared on remount, and the original
      INODE_MAP_CACHE is set/cleared according to CHANGE_INODE_CACHE after a
      successful transaction.
      Since finding and returning an inode number both happen between
      btrfs_start_transaction and btrfs_commit_transaction, this keeps the
      behavior consistent.

      Also, the noinode_cache mount option will not stop the caching_kthread.
      
      Cc: David Sterba <dsterba@suse.cz>
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: fix to search previous metadata extent item since skinny metadata · ade2e0b3
      Wang Shilong committed
      Using btrfs_previous_item() to search for a metadata extent item is buggy.
      btrfs_previous_item() requires a type match, but since skinny metadata
      was introduced by Josef, the two item types may be mixed. So
      btrfs_previous_item() alone does not work correctly.

      To keep btrfs_previous_item() behaving like a normal tree search, I introduce
      another function, btrfs_previous_extent_item().
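
      A backward search that accepts either item type can be sketched as below. This is an illustration on a flat list of keys, not the kernel function; the Python function name and the list representation are assumptions.

```python
EXTENT_ITEM_KEY = 168    # regular extent item type
METADATA_ITEM_KEY = 169  # skinny metadata item type


def previous_extent_item(keys, slot):
    """keys: list of (objectid, type) in tree order.

    Walk backwards from slot and return the index of the nearest item
    whose type is either kind of extent item, or None if there is none.
    """
    for index in range(slot - 1, -1, -1):
        if keys[index][1] in (EXTENT_ITEM_KEY, METADATA_ITEM_KEY):
            return index
    return None
```

      A search that matched only EXTENT_ITEM_KEY would miss a skinny metadata item sitting in the same position.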
      Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: fix missing skinny metadata check in scrub_stripe() · 7c76edb7
      Wang Shilong committed
      First check whether skinny metadata is supported, and then use the
      right type to search.
      Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: fix send to not send non-aligned clone operations · 28e5dd8f
      Filipe David Borba Manana committed
      It is possible for the send feature to send clone operations that
      request a cloning range (offset + length) that is not aligned with
      the block size. This makes the btrfs receive command issue a
      clone ioctl call that will fail, as the ioctl will return an -EINVAL
      error because of the unaligned range.
      
      Fix this by not sending clone operations for non block aligned ranges,
      and instead send regular write operation for these (less common) cases.
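
      The decision can be sketched as follows. This is a simplified model: the function name, the 4096-byte default block size and the choice to test both the offset and the length are assumptions, and the exact condition used by the patch may differ.

```python
def pick_send_command(offset, length, block_size=4096):
    """Choose which operation to emit for a range that could be cloned."""
    aligned = offset % block_size == 0 and length % block_size == 0
    # Unaligned ranges fall back to a plain write, which the receiving
    # side can always apply.
    return "clone" if aligned else "write"
```

      For example, a 2978-byte range at offset 1482752 is not block aligned in length, so it would be sent as a write.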
      
      The following xfstest reproduces this issue, which fails on the second
      btrfs receive command without this change:
      
        seq=`basename $0`
        seqres=$RESULT_DIR/$seq
        echo "QA output created by $seq"
      
        tmp=`mktemp -d`
      
        status=1	# failure is the default!
        trap "_cleanup; exit \$status" 0 1 2 3 15
      
        _cleanup()
        {
            rm -fr $tmp
        }
      
        # get standard environment, filters and checks
        . ./common/rc
        . ./common/filter
      
        # real QA test starts here
        _supported_fs btrfs
        _supported_os Linux
        _require_scratch
        _need_to_be_root
      
        rm -f $seqres.full
      
        _scratch_mkfs >/dev/null 2>&1
        _scratch_mount
      
        $XFS_IO_PROG -f -c "truncate 819200" $SCRATCH_MNT/foo | _filter_xfs_io
        $BTRFS_UTIL_PROG filesystem sync $SCRATCH_MNT | _filter_scratch
      
        $XFS_IO_PROG -c "falloc -k 819200 667648" $SCRATCH_MNT/foo | _filter_xfs_io
        $BTRFS_UTIL_PROG filesystem sync $SCRATCH_MNT | _filter_scratch
      
        $XFS_IO_PROG -f -c "pwrite 1482752 2978" $SCRATCH_MNT/foo | _filter_xfs_io
        $BTRFS_UTIL_PROG filesystem sync $SCRATCH_MNT | _filter_scratch
      
        $BTRFS_UTIL_PROG subvol snapshot -r $SCRATCH_MNT $SCRATCH_MNT/mysnap1 | \
            _filter_scratch
      
        $XFS_IO_PROG -f -c "truncate 883305" $SCRATCH_MNT/foo | _filter_xfs_io
        $BTRFS_UTIL_PROG filesystem sync $SCRATCH_MNT | _filter_scratch
      
        $BTRFS_UTIL_PROG subvol snapshot -r $SCRATCH_MNT $SCRATCH_MNT/mysnap2 | \
            _filter_scratch
      
        $BTRFS_UTIL_PROG send $SCRATCH_MNT/mysnap1 -f $tmp/1.snap 2>&1 | _filter_scratch
        $BTRFS_UTIL_PROG send -p $SCRATCH_MNT/mysnap1 $SCRATCH_MNT/mysnap2 \
            -f $tmp/2.snap 2>&1 | _filter_scratch
      
        md5sum $SCRATCH_MNT/foo | _filter_scratch
        md5sum $SCRATCH_MNT/mysnap1/foo | _filter_scratch
        md5sum $SCRATCH_MNT/mysnap2/foo | _filter_scratch
      
        _scratch_unmount
        _check_btrfs_filesystem $SCRATCH_DEV
        _scratch_mkfs >/dev/null 2>&1
        _scratch_mount
      
        $BTRFS_UTIL_PROG receive $SCRATCH_MNT -f $tmp/1.snap
        md5sum $SCRATCH_MNT/mysnap1/foo | _filter_scratch
      
        $BTRFS_UTIL_PROG receive $SCRATCH_MNT -f $tmp/2.snap
        md5sum $SCRATCH_MNT/mysnap2/foo | _filter_scratch
      
        _scratch_unmount
        _check_btrfs_filesystem $SCRATCH_DEV
      
        status=0
        exit
      
      The test's expected output is:
      
        QA output created by 025
        FSSync 'SCRATCH_MNT'
        FSSync 'SCRATCH_MNT'
        wrote 2978/2978 bytes at offset 1482752
        XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
        FSSync 'SCRATCH_MNT'
        Create a readonly snapshot of 'SCRATCH_MNT' in 'SCRATCH_MNT/mysnap1'
        FSSync 'SCRATCH_MNT'
        Create a readonly snapshot of 'SCRATCH_MNT' in 'SCRATCH_MNT/mysnap2'
        At subvol SCRATCH_MNT/mysnap1
        At subvol SCRATCH_MNT/mysnap2
        129b8eaee8d3c2bcad49bec596591cb3  SCRATCH_MNT/foo
        42b6369eae2a8725c1aacc0440e597aa  SCRATCH_MNT/mysnap1/foo
        129b8eaee8d3c2bcad49bec596591cb3  SCRATCH_MNT/mysnap2/foo
        At subvol mysnap1
        42b6369eae2a8725c1aacc0440e597aa  SCRATCH_MNT/mysnap1/foo
        At snapshot mysnap2
        129b8eaee8d3c2bcad49bec596591cb3  SCRATCH_MNT/mysnap2/foo
      Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: fix btrfs boot when compiled as built-in · 14a958e6
      Filipe David Borba Manana committed
      After the change titled "Btrfs: add support for inode properties", if
      btrfs was built-in the kernel (i.e. not as a module), it would cause a
      kernel panic, as reported recently by Fengguang:
      
      [    2.024722] BUG: unable to handle kernel NULL pointer dereference at           (null)
      [    2.027814] IP: [<ffffffff81501594>] crc32c+0xc/0x6b
      [    2.028684] PGD 0
      [    2.028684] Oops: 0000 [#1] SMP
      [    2.028684] Modules linked in:
      [    2.028684] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 3.13.0-rc7-04795-ga7b57c2 #1
      [    2.028684] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
      [    2.028684] task: ffff88000edba100 ti: ffff88000edd6000 task.ti: ffff88000edd6000
      [    2.028684] RIP: 0010:[<ffffffff81501594>]  [<ffffffff81501594>] crc32c+0xc/0x6b
      [    2.028684] RSP: 0000:ffff88000edd7e58  EFLAGS: 00010246
      [    2.028684] RAX: 0000000000000000 RBX: ffffffff82295550 RCX: 0000000000000000
      [    2.028684] RDX: 0000000000000011 RSI: ffffffff81efe393 RDI: 00000000fffffffe
      [    2.028684] RBP: ffff88000edd7e60 R08: 0000000000000003 R09: 0000000000015d20
      [    2.028684] R10: ffffffff81ef225e R11: ffffffff811b0222 R12: ffffffffffffffff
      [    2.028684] R13: 0000000000000239 R14: 0000000000000000 R15: 0000000000000000
      [    2.028684] FS:  0000000000000000(0000) GS:ffff88000fa00000(0000) knlGS:0000000000000000
      [    2.028684] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      [    2.028684] CR2: 0000000000000000 CR3: 000000000220c000 CR4: 00000000000006f0
      [    2.028684] Stack:
      [    2.028684]  ffffffff82295550 ffff88000edd7e80 ffffffff8238af62 ffffffff8238ac05
      [    2.028684]  0000000000000000 ffff88000edd7e98 ffffffff8238ac0f ffffffff8238ac05
      [    2.028684]  ffff88000edd7f08 ffffffff810002ba ffff88000edd7f00 ffffffff810e2404
      [    2.028684] Call Trace:
      [    2.028684]  [<ffffffff8238af62>] btrfs_props_init+0x4f/0x96
      [    2.028684]  [<ffffffff8238ac05>] ? ftrace_define_fields_btrfs_space_reservation+0x145/0x145
      [    2.028684]  [<ffffffff8238ac0f>] init_btrfs_fs+0xa/0xf0
      [    2.028684]  [<ffffffff8238ac05>] ? ftrace_define_fields_btrfs_space_reservation+0x145/0x145
      [    2.028684]  [<ffffffff810002ba>] do_one_initcall+0xa4/0x13a
      [    2.028684]  [<ffffffff810e2404>] ? parse_args+0x25f/0x33d
      [    2.028684]  [<ffffffff8234cf75>] kernel_init_freeable+0x1aa/0x230
      [    2.028684]  [<ffffffff8234c785>] ? do_early_param+0x88/0x88
      [    2.028684]  [<ffffffff819f61b5>] ? rest_init+0x89/0x89
      [    2.028684]  [<ffffffff819f61c3>] kernel_init+0xe/0x109
      
      The issue here is that the initialization function of btrfs (super.c:init_btrfs_fs)
      started using crc32c (from lib/libcrc32c.c). But when it needs to call crc32c (as
      part of the properties initialization routine), libcrc32c is not yet initialized,
      so crc32c dereferenced a NULL pointer (lib/libcrc32c.c:tfm), causing the kernel
      panic on boot.

      The fix is to use the crypto component directly for crc32c (lib/libcrc32c.c
      is basically a wrapper around crypto). This is what ext4 does
      as well: it uses crypto directly to get crc32c functionality.

      Verified this works both when btrfs is built-in and when it's a loadable kernel module.
      Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: unlock inodes in correct order in clone ioctl · c57c2b3e
      Filipe David Borba Manana committed
      In the clone ioctl, when the source and target inodes are different,
      we can acquire their mutexes in 2 possible different orders. After
      we're done cloning, we were releasing the mutexes always in the same
      order - the most correct way of doing it is to release them in the
      reverse order in which they were acquired.
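
      Releasing in reverse acquisition order can be sketched with two ordinary locks. The helper names and the use of id() to pick an acquisition order are assumptions for the example, not what the ioctl does.

```python
import threading


def lock_pair(a, b):
    """Acquire two locks in a stable order; return them in acquisition order."""
    first, second = (a, b) if id(a) < id(b) else (b, a)
    first.acquire()
    second.acquire()
    return [first, second]


def unlock_all(held):
    """Release locks in the reverse of the order they were acquired."""
    released = []
    for lock in reversed(held):
        lock.release()
        released.append(lock)
    return released
```

      Because lock_pair returns the locks in the order they were taken, unlock_all does not need to know which of the two orders was used.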
      Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: optimize to remove unnecessary removal with ulist reallocation · f499e40f
      Wang Shilong committed
      We are not going to free memory here, so there is no need to remove every
      node one by one; just reinitializing the root node is enough.

      Cc: Liu Bo <bo.li.liu@oracle.com>
      Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: release subvolume's block_rsv before transaction commit · de6e8200
      Liu Bo committed
      We don't have to keep subvolume's block_rsv during transaction commit,
      and within transaction commit, we may also need the free space reclaimed
      from this block_rsv to process delayed refs.
      Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: fix the race between write back and nocow buffered write · f1de9683
      Miao Xie committed
      When we ran the 274th case of xfstests with the nodatacow mount option,
      we met the following warning message:
      WARNING: CPU: 1 PID: 14185 at fs/btrfs/extent-tree.c:3734 btrfs_free_reserved_data_space+0xa6/0xd0
      
      It is caused by the race between the write back and nocow buffered
      write:
        Task1				Task2
        __btrfs_buffered_write()
          skip data reservation
          reserve the metadata space
          copy the data
          dirty the pages
          unlock the pages
      				write back the pages
      				release the data space
      				  because there is no
      				  noreserve flag
         set the noreserve flag
      
      This patch fixes this problem by unlocking the pages after
      the noreserve flag is set.
      Reported-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: only process as many file extents as there are refs · 7ef81ac8
      Josef Bacik committed
      The backref walking code will search down to the key it is looking for and then
      proceed to walk _all_ of the extents on the file until it hits the end.  This is
      suboptimal with large files; we only need to look for as many extents as we have
      references for that inode.  I have a testcase that creates a randomly written 4
      gig file and before this patch it took 6min 30sec to do the initial send, with
      this patch it takes 2min 30sec to do the initial send.  Thanks,
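
      The early exit can be sketched as a loop that stops once the expected number of references has been seen. The function name and the (offset, bytenr) list model are assumptions; the real code walks btree items.

```python
def find_extent_refs(file_extents, wanted_bytenr, ref_count):
    """file_extents: list of (file_offset, disk_bytenr) in file order.

    Returns (matching file offsets, number of extents visited).
    """
    matches = []
    visited = 0
    for offset, bytenr in file_extents:
        visited += 1
        if bytenr == wanted_bytenr:
            matches.append(offset)
            # Stop as soon as every known reference has been found,
            # instead of walking the rest of the file.
            if len(matches) == ref_count:
                break
    return matches, visited
```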
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: fix qgroup rescan to work with skinny metadata · 3a6d75e8
      Josef Bacik committed
      Could have sworn I fixed this before but apparently not.  This makes us pass
      btrfs/022 with skinny metadata enabled.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: fix extent_from_logical to deal with skinny metadata · 580f0a67
      Josef Bacik committed
      I don't think this is an issue and I've not seen it in practice but
      extent_from_logical will fail to find a skinny extent because it uses
      btrfs_previous_item and gives it the normal extent item type.  This is just not
      a place to use btrfs_previous_item since we care about either normal extents or
      skinny extents, so open code btrfs_previous_item to properly check.  This would
      only affect metadata and the only place this is used for metadata is scrub and
      I'm pretty sure it's just for printing stuff out, not actually doing any work so
      hopefully it was never a problem other than a cosmetic one.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: throttle delayed refs better · 0a2b2a84
      Josef Bacik committed
      On one of our gluster clusters we noticed some pretty big lag spikes.  This
      turned out to be because our transaction commit was taking like 3 minutes to
      complete.  This is because we have like 30 gigs of metadata, so our global
      reserve would end up being the max which is like 512 mb.  So our throttling code
      would allow a ridiculous amount of delayed refs to build up and then they'd all
      get run at transaction commit time, and for a cold mounted file system that
      could take up to 3 minutes to run.  So fix the throttling to be based on both
      the size of the global reserve and how long it takes us to run delayed refs.
      This patch tracks the time it takes to run delayed refs and then only allows 1
      second's worth of outstanding delayed refs at a time.  This way it will auto-tune
      itself from cold cache up to when everything is in memory and it no longer has
      to go to disk.  This makes our transaction commits take much less time to run.
      Thanks,
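
      The time-based throttle can be sketched as below. The class name, the 3:1 weighted average and the exact comparison are assumptions for illustration; only the "1 second's worth of outstanding refs" idea comes from the message above.

```python
class DelayedRefThrottle:
    """Hypothetical tracker: how long does one delayed ref take to run?"""

    def __init__(self, budget_seconds=1.0):
        self.budget_seconds = budget_seconds
        self.avg_seconds_per_ref = 0.0

    def record_run(self, refs_run, elapsed_seconds):
        if refs_run <= 0:
            return
        sample = elapsed_seconds / refs_run
        if self.avg_seconds_per_ref == 0.0:
            self.avg_seconds_per_ref = sample
        else:
            # Assumed smoothing: 3 parts history, 1 part new sample.
            self.avg_seconds_per_ref = (3 * self.avg_seconds_per_ref + sample) / 4

    def should_throttle(self, outstanding_refs):
        # Throttle once the backlog would take longer than the budget to run.
        return outstanding_refs * self.avg_seconds_per_ref >= self.budget_seconds
```

      As runs get faster (warm cache), the average drops and more outstanding refs are tolerated before throttling kicks in.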
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: attach delayed ref updates to delayed ref heads · d7df2c79
      Josef Bacik committed
      Currently we have two rb-trees, one for delayed ref heads and one for all of the
      delayed refs, including the delayed ref heads.  When we process the delayed refs
      we have to hold onto the delayed ref lock for all of the selecting and merging
      and such, which results in quite a bit of lock contention.  This was solved by
      having a waitqueue and only one flusher at a time, however this hurts if we get
      a lot of delayed refs queued up.
      
      So instead just have an rb tree for the delayed ref heads, and then attach the
      delayed ref updates to an rb tree that is per delayed ref head.  Then we only
      need to take the delayed ref lock when adding new delayed refs and when
      selecting a delayed ref head to process, all the rest of the time we deal with a
      per delayed ref head lock which will be much less contentious.
      
      The locking rules for this get a little more complicated since we have to lock
      up to 3 things to properly process delayed refs, but I will address that problem
      later.  For now this passes all of xfstests and my overnight stress tests.
      Thanks,
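
      The two-level structure can be sketched as follows. This is a toy model: it uses a dict and lists where the kernel uses rb-trees, and all class and function names are hypothetical.

```python
import threading


class DelayedRefHead:
    def __init__(self, bytenr):
        self.bytenr = bytenr
        self.lock = threading.Lock()  # per-head lock
        self.refs = []                # updates attached to this head


class DelayedRefRoot:
    def __init__(self):
        self.lock = threading.Lock()  # global lock, guards the head index
        self.heads = {}

    def add_ref(self, bytenr, update):
        # The global lock is taken when adding a new delayed ref...
        with self.lock:
            head = self.heads.get(bytenr)
            if head is None:
                head = self.heads[bytenr] = DelayedRefHead(bytenr)
            head.refs.append(update)

    def select_head(self):
        # ...and when selecting a head to process.
        with self.lock:
            if not self.heads:
                return None
            _, head = self.heads.popitem()
            return head


def run_head(head):
    # Processing only needs the per-head lock.
    with head.lock:
        return sum(head.refs)
```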
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: make fsync latency less sucky · 5039eddc
      Josef Bacik committed
      Looking into some performance related issues with large amounts of metadata
      revealed that we can have some pretty huge swings in fsync() performance.  If we
      have a lot of delayed refs backed up (as you will tend to do with lots of
      metadata) fsync() will wander off and try to run some of those delayed refs
      which can result in reading from disk and such.  Since the actual act of fsync()
      doesn't create any delayed refs there is no need to make it throttle on delayed
      ref stuff, that will be handled by other people.  With this patch we get much
      smoother fsync performance with large amounts of metadata.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: add support for inode properties · 63541927
      Filipe David Borba Manana committed
      This change adds infrastructure to allow for generic properties for
      inodes. Properties are name/value pairs that can be associated with
      inodes for different purposes. They are stored as xattrs with the
      prefix "btrfs."
      
      Properties can be inherited - this means when a directory inode has
      inheritable properties set, these are added to new inodes created
      under that directory. Further, subvolumes can also have properties
      associated with them, and they can be inherited from their parent
      subvolume. Naturally, directory properties have priority over subvolume
      properties (in practice a subvolume property is just a regular
      property associated with the root inode, objectid 256, of the
      subvolume's fs tree).
      
      This change also adds one specific property implementation, named
      "compression", whose values can be "lzo" or "zlib" and it's an
      inheritable property.
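
      The inheritance priority can be sketched with plain dictionaries. The function name is hypothetical and every property is treated as inheritable here, which is an assumption for the example.

```python
XATTR_PREFIX = "btrfs."


def inherited_xattrs(subvol_props, dir_props):
    """Merge inheritable properties for a new inode and name them as xattrs."""
    merged = dict(subvol_props)
    # Directory properties take priority over subvolume properties.
    merged.update(dir_props)
    return {XATTR_PREFIX + name: value for name, value in merged.items()}
```

      A new file created in a directory with compression set to "lzo", inside a subvolume set to "zlib", would end up with "btrfs.compression" set to "lzo".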
      
      The corresponding changes to btrfs-progs were also implemented.
      A patch with xfstests for this feature will follow once there's
      agreement on this change/feature.
      
      Further, the script at the bottom of this commit message was used to
      do some benchmarks to measure any performance penalties of this feature.
      
      Basically the tests correspond to:
      
      Test 1 - create a filesystem and mount it with compress-force=lzo,
      then sequentially create N files of 64Kb each, measure how long it took
      to create the files, unmount the filesystem, mount the filesystem and
      perform an 'ls -lha' against the test directory holding the N files, and
      report the time the command took.
      
      Test 2 - create a filesystem and don't use any compression option when
      mounting it - instead set the compression property of the subvolume's
      root to 'lzo'. Then create N files of 64Kb, and report the time it took.
      Then unmount the filesystem, mount it again and perform an 'ls -lha' like
      in the former test. This means every single file ends up with a property
      (xattr) associated to it.
      
      Test 3 - same as test 2, but uses 4 properties - 3 are duplicates of the
      compression property, have no real effect other than adding more work
      when inheriting properties and taking more btree leaf space.
      
      Test 4 - same as test 3 but with 10 properties per file.
      
      Results (in seconds, and averages of 5 runs each), for different N
      numbers of files follow.
      
      * Without properties (test 1)
      
                          file creation time        ls -lha time
      10 000 files              3.49                   0.76
      100 000 files            47.19                   8.37
      1 000 000 files         518.51                 107.06
      
      * With 1 property (compression property set to lzo - test 2)
      
                          file creation time        ls -lha time
      10 000 files              3.63                    0.93
      100 000 files            48.56                    9.74
      1 000 000 files         537.72                  125.11
      
      * With 4 properties (test 3)
      
                          file creation time        ls -lha time
      10 000 files              3.94                    1.20
      100 000 files            52.14                   11.48
      1 000 000 files         572.70                  142.13
      
      * With 10 properties (test 4)
      
                          file creation time        ls -lha time
      10 000 files              4.61                    1.35
      100 000 files            58.86                   13.83
      1 000 000 files         656.01                  177.61
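
      The relative cost of properties can be read off the tables with a small calculation. The snippet below only reuses the file creation times for 1 000 000 files listed above; the helper name is an assumption.

```python
# File creation times in seconds for 1 000 000 files, copied from the tables.
BASELINE = 518.51
WITH_PROPERTIES = {1: 537.72, 4: 572.70, 10: 656.01}


def overhead_percent(baseline, measured):
    """Relative slowdown of measured versus baseline, in percent."""
    return (measured - baseline) / baseline * 100.0


for count, seconds in WITH_PROPERTIES.items():
    print(f"{count:2d} properties: +{overhead_percent(BASELINE, seconds):.1f}%")
```

      For this data set the creation overhead comes out at roughly 3.7%, 10.5% and 26.5% for 1, 4 and 10 properties.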
      
      The increased latencies with properties are essentially because of:
      
      *) When creating an inode, we now synchronously write 1 more item
         (an xattr item) for each property inherited from the parent dir
         (or subvolume). This could be done in an asynchronous way such
         as we do for dir index items (delayed-inode.c), which could help
         reduce the file creation latency;
      
      *) With properties, we now have larger fs trees. For this particular
         test each xattr item uses 75 bytes of leaf space in the fs tree.
         This could be less by using a new item for xattr items, instead of
         the current btrfs_dir_item, since we could cut the 'location' and
         'type' fields (saving 18 bytes) and maybe 'transid' too (saving a
         total of 26 bytes per xattr item) from the btrfs_dir_item type.
      
      Also tried batching the xattr insertions (ignoring proper hash
      collision handling, since it didn't exist) when creating files that
      inherit properties from their parent inode/subvolume, but the end
      results were (surprisingly) essentially the same.
      
      Test script:
      
      $ cat test.pl
        #!/usr/bin/perl -w
      
        use strict;
        use Time::HiRes qw(time);
        use constant NUM_FILES => 10_000;
        use constant FILE_SIZES => (64 * 1024);
        use constant DEV => '/dev/sdb4';
        use constant MNT_POINT => '/home/fdmanana/btrfs-tests/dev';
        use constant TEST_DIR => (MNT_POINT . '/testdir');
      
        system("mkfs.btrfs", "-l", "16384", "-f", DEV) == 0 or die "mkfs.btrfs failed!";
      
        # following line for testing without properties
        #system("mount", "-o", "compress-force=lzo", DEV, MNT_POINT) == 0 or die "mount failed!";
      
        # following 2 lines for testing with properties
        system("mount", DEV, MNT_POINT) == 0 or die "mount failed!";
        system("btrfs", "prop", "set", MNT_POINT, "compression", "lzo") == 0 or die "set prop failed!";
      
        system("mkdir", TEST_DIR) == 0 or die "mkdir failed!";
        my ($t1, $t2);
      
        $t1 = time();
        for (my $i = 1; $i <= NUM_FILES; $i++) {
            my $p = TEST_DIR . '/file_' . $i;
            open(my $f, '>', $p) or die "Error opening file!";
            $f->autoflush(1);
            for (my $j = 0; $j < FILE_SIZES; $j += 4096) {
                print $f ('A' x 4096) or die "Error writing to file!";
            }
            close($f);
        }
        $t2 = time();
        print "Time to create " . NUM_FILES . ": " . ($t2 - $t1) . " seconds.\n";
        system("umount", DEV) == 0 or die "umount failed!";
        system("mount", DEV, MNT_POINT) == 0 or die "mount failed!";
      
        $t1 = time();
        system("bash -c 'ls -lha " . TEST_DIR . " > /dev/null'") == 0 or die "ls failed!";
        $t2 = time();
        print "Time to ls -lha all files: " . ($t2 - $t1) . " seconds.\n";
        system("umount", DEV) == 0 or die "umount failed!";
      Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      63541927
    • F
      Btrfs: faster file extent item replace operations · 1acae57b
      Filipe David Borba Manana committed
      When writing to a file we drop existing file extent items that cover the
      write range and then add a new file extent item that represents that write
      range.
      
      Before this change we were doing a tree lookup to remove the file extent
      items, and then after we did another tree lookup to insert the new file
      extent item.
      Most of the time all the file extent items we need to drop are located
      within a single leaf - this is the leaf where our new file extent item ends
      up at. Therefore, in this common case just combine these 2 operations into
      a single one.
      
      By avoiding the second btree navigation for insertion of the new file extent
      item, we reduce btree node/leaf lock acquisitions/releases, btree block/leaf
      COW operations, CPU time on btree node/leaf key binary searches, etc.
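The idea of merging the drop and the insert into one lookup can be illustrated with a toy user-space model. This is not the kernel code: the `Leaf` class, its `lookups` counter and both helper functions are hypothetical, a sorted Python list stands in for a btree leaf, and the sketch assumes every extent to drop lies entirely inside the write range (no partial overlaps to split).

```python
import bisect


class Leaf:
    """Toy stand-in for a btree leaf: (offset, length) extents sorted by offset."""

    def __init__(self, extents):
        self.extents = sorted(extents)
        self.lookups = 0  # counts simulated tree searches

    def search(self, offset):
        """Return the slot of the first extent starting at or after offset."""
        self.lookups += 1
        return bisect.bisect_left(self.extents, (offset, 0))

    def end_of_range(self, slot, end):
        """Walk forward inside the leaf to the first extent starting at or after end."""
        while slot < len(self.extents) and self.extents[slot][0] < end:
            slot += 1
        return slot


def replace_two_lookups(leaf, start, length):
    """Old behaviour: one search to drop, a second search to insert."""
    first = leaf.search(start)
    last = leaf.end_of_range(first, start + length)
    del leaf.extents[first:last]
    slot = leaf.search(start)  # second navigation
    leaf.extents.insert(slot, (start, length))


def replace_one_lookup(leaf, start, length):
    """New behaviour: reuse the slot found while dropping for the insert."""
    first = leaf.search(start)
    last = leaf.end_of_range(first, start + length)
    leaf.extents[first:last] = [(start, length)]


if __name__ == "__main__":
    blocks = [(i * 4096, 4096) for i in range(4)]
    old, new = Leaf(blocks), Leaf(blocks)
    replace_two_lookups(old, 4096, 8192)
    replace_one_lookup(new, 4096, 8192)
    print(old.extents == new.extents, old.lookups, new.lookups)
```

Both variants leave the leaf in the same state; the second simply skips the repeated search, which in the real tree also means skipping the associated locking and COW work.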
      
      Besides file writes, this operation also happens for file fsyncs.
      However, log btrees are much less likely to be as big as regular
      fs btrees, therefore the impact of this change is smaller there.
      
      The following benchmark was performed against an SSD drive and a
      HDD drive, both for random and sequential writes:
      
        sysbench --test=fileio --file-num=4096 --file-total-size=8G \
           --file-test-mode=[rndwr|seqwr] --num-threads=512 \
           --file-block-size=8192 --max-requests=1000000 \
           --file-fsync-freq=0 --file-io-mode=sync [prepare|run]
      
      All results below are averages of 10 runs of the respective test.
      
      ** SSD sequential writes
      
      Before this change: 225.88 Mb/sec
      After this change:  277.26 Mb/sec
      
      ** SSD random writes
      
      Before this change: 49.91 Mb/sec
      After this change:  56.39 Mb/sec
      
      ** HDD sequential writes
      
      Before this change: 68.53 Mb/sec
      After this change:  69.87 Mb/sec
      
      ** HDD random writes
      
      Before this change: 13.04 Mb/sec
      After this change:  14.39 Mb/sec
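For a quick comparison, the relative gains can be derived from the averages above. The snippet below is just that arithmetic; the dictionary name and helper are illustrative.

```python
# Averages reported above, in Mb/sec: (before, after).
RESULTS = {
    "SSD sequential writes": (225.88, 277.26),
    "SSD random writes": (49.91, 56.39),
    "HDD sequential writes": (68.53, 69.87),
    "HDD random writes": (13.04, 14.39),
}


def gain_percent(before: float, after: float) -> float:
    """Relative throughput improvement in percent."""
    return (after - before) / before * 100.0


if __name__ == "__main__":
    for name, (before, after) in RESULTS.items():
        print(f"{name}: {gain_percent(before, after):+.1f}%")
```

This works out to roughly 22.7% and 13.0% for the SSD sequential and random cases, and roughly 2.0% and 10.4% for the HDD ones.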
      Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      1acae57b
    • W
      Btrfs: handle EAGAIN case properly in btrfs_drop_snapshot() · 90515e7f
      Wang Shilong committed
      When btrfs_drop_snapshot() returns early with -EAGAIN, we shouldn't
      call btrfs_std_err(). Fix it.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      90515e7f
    • W
      Btrfs: remove unnecessary transaction commit before send · 8e56338d
      Wang Shilong committed
      Orphan cleanup is already finished during snapshot creation, so we
      don't have to commit the transaction here.
      Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
      Reviewed-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      8e56338d
    • W
      Btrfs: fix protection between send and root deletion · 18f687d5
      Wang Shilong committed
      We should guarantee that parent and clone roots cannot be destroyed
      during send. There are two ways to do this:
      
      1. Hold @subvol_sem. This might be a nightmare, because it would
         block all subvolume deletion for a long time.
      
      2. Miao pointed out that we can reuse @send_in_progress, which means
         we skip snapshot deletion if a send of that root is in progress.
      
      Here we adopt the second approach since it won't block other subvolume
      deletions for a long time.
      
      Besides, in btrfs_clean_one_deleted_snapshot() we only check the first
      root. If this root is involved in a send, we return directly rather
      than continue checking. There are several reasons for this:
      
      1. This case seldom happens.
      2. After the send finishes, the cleaner thread can continue to drop
         that root.
      3. It keeps the code simple.
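The second approach can be pictured with a small model. This is a sketch under stated assumptions, not the kernel implementation: the `Root` class and `clean_one_deleted_snapshot()` helper are hypothetical, and a plain list stands in for the queue of deleted roots.

```python
class Root:
    """Toy subvolume root with a send-in-progress counter."""

    def __init__(self, name):
        self.name = name
        self.send_in_progress = 0
        self.dropped = False


def clean_one_deleted_snapshot(dead_roots):
    """One cleaner pass: look only at the first deleted root.

    Returns True if a root was dropped, False if there was nothing to do
    or the first root is still being sent (retry on a later pass).
    """
    if not dead_roots:
        return False
    root = dead_roots[0]
    if root.send_in_progress > 0:
        return False  # do not look further, keep the code simple
    dead_roots.pop(0)
    root.dropped = True
    return True


if __name__ == "__main__":
    snap = Root("snap1")
    dead = [snap]
    snap.send_in_progress += 1               # send starts
    print(clean_one_deleted_snapshot(dead))  # cleaner skips the root
    snap.send_in_progress -= 1               # send finishes
    print(clean_one_deleted_snapshot(dead))  # cleaner drops the root
```

The cost of the simple check is that a busy first root delays cleanup of the roots queued behind it until the send finishes.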
      
      Cc: David Sterba <dsterba@suse.cz>
      Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
      Reviewed-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      18f687d5
    • W
      Btrfs: fix wrong send_in_progress accounting · 896c14f9
      Wang Shilong committed
      Steps to reproduce:
       # mkfs.btrfs -f /dev/sda8
       # mount /dev/sda8 /mnt
       # btrfs sub snapshot -r /mnt /mnt/snap1
       # btrfs sub snapshot -r /mnt /mnt/snap2
       # btrfs send /mnt/snap1 -p /mnt/snap2 -f /mnt/1
       # dmesg
      
      The problem is that we sort the clone roots (including @send_root),
      which may move @send_root to an earlier position, so @send_root's
      @send_in_progress ends up being decreased twice.
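A toy model may help show how sorting leads to the double decrement. Everything here is a simplified assumption rather than the actual kernel logic: the `Root` class, the object ids, and both rollback helpers are hypothetical, and the parent root's own accounting is left out.

```python
class Root:
    """Toy root with an object id and a send-in-progress counter."""

    def __init__(self, objectid):
        self.objectid = objectid
        self.send_in_progress = 0


def start_send(send_root, clone_sources):
    """Take one reference on every clone source and on the send root."""
    for root in clone_sources:
        root.send_in_progress += 1
    send_root.send_in_progress += 1
    # The send root is appended last, then the array is sorted for bsearch.
    clone_roots = list(clone_sources) + [send_root]
    clone_roots.sort(key=lambda r: r.objectid)
    return clone_roots


def rollback_buggy(clone_roots, send_root, n_sources):
    """Assumes the first n_sources entries are the clone sources."""
    for root in clone_roots[:n_sources]:
        root.send_in_progress -= 1
    send_root.send_in_progress -= 1


def rollback_fixed(clone_roots):
    """One possible fix: the sorted array already holds every root once."""
    for root in clone_roots:
        root.send_in_progress -= 1


if __name__ == "__main__":
    snap1, snap2 = Root(257), Root(258)  # snap1 is sent, snap2 is a clone source
    roots = start_send(snap1, [snap2])
    rollback_buggy(roots, snap1, 1)
    print(snap1.send_in_progress, snap2.send_in_progress)
```

In the buggy variant the sorted array puts the send root first, so it is decremented once by index and once by name while the clone source keeps its reference.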
      
      Cc: David Sterba <dsterba@suse.cz>
      Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      896c14f9
    • Q
      btrfs: Add treelog mount option. · a88998f2
      Qu Wenruo committed
      Add the treelog mount option so that the tree log can be enabled
      again via remount.
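The pattern behind this option, and the similar option pairs in this series, is that a negative and a positive option name toggle the same flag, so a remount can undo an earlier setting. The sketch below only models that idea: the flag names and the `apply_options()` helper are hypothetical, and a Python set stands in for the mount-option bitmask.

```python
# Option name -> (hypothetical flag, whether the option sets or clears it).
OPTION_EFFECTS = {
    "notreelog": ("NOTREELOG", True),
    "treelog": ("NOTREELOG", False),
    "nodatasum": ("NODATASUM", True),
    "datasum": ("NODATASUM", False),
    "nodatacow": ("NODATACOW", True),
    "datacow": ("NODATACOW", False),
    "discard": ("DISCARD", True),
    "nodiscard": ("DISCARD", False),
}


def apply_options(flags, options):
    """Return a new flag set after applying a comma-separated option string."""
    result = set(flags)
    for opt in options.split(","):
        flag, enable = OPTION_EFFECTS[opt]
        if enable:
            result.add(flag)
        else:
            result.discard(flag)
    return result


if __name__ == "__main__":
    mounted = apply_options(set(), "notreelog,discard")
    remounted = apply_options(mounted, "treelog")
    print(sorted(mounted), sorted(remounted))
```

Without the positive counterpart there would be no option string that clears the flag again, which is why a remount could not re-enable the feature.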
      Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
      Reviewed-by: David Sterba <dsterba@suse.cz>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      a88998f2
    • Q
      btrfs: Add datasum mount option. · d399167d
      Qu Wenruo committed
      Add the datasum mount option so that data checksumming can be
      enabled again via remount.
      Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
      Reviewed-by: David Sterba <dsterba@suse.cz>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      d399167d
    • Q
      btrfs: Add datacow mount option. · a258af7a
      Qu Wenruo committed
      Add the datacow mount option so that copy-on-write can be enabled
      again via remount.
      Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
      Reviewed-by: David Sterba <dsterba@suse.cz>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      a258af7a
    • Q
      btrfs: Add acl mount option. · bd0330ad
      Qu Wenruo committed
      Add the acl mount option so that ACLs can be enabled again via remount.
      Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
      Reviewed-by: David Sterba <dsterba@suse.cz>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      bd0330ad
    • Q
      btrfs: Add noflushoncommit mount option. · 2c9ee856
      Qu Wenruo committed
      Add the noflushoncommit mount option so that flush on commit can be
      disabled via remount.
      Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
      Reviewed-by: David Sterba <dsterba@suse.cz>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      2c9ee856
    • Q
      btrfs: Add noenospc_debug mount option. · 53036293
      Qu Wenruo committed
      Add the noenospc_debug mount option so that ENOSPC debugging can be
      disabled via remount.
      Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
      Reviewed-by: David Sterba <dsterba@suse.cz>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      53036293
    • Q
      btrfs: Add nodiscard mount option. · e07a2ade
      Qu Wenruo committed
      Add the nodiscard mount option so that discard can be disabled via
      remount.
      Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
      Reviewed-by: David Sterba <dsterba@suse.cz>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      e07a2ade