1. 03 Jun, 2015 18 commits
    • Btrfs: lock superblock before remounting for rw subvol · 773cd04e
      Omar Sandoval committed
      Since commit 0723a047 ("btrfs: allow mounting btrfs subvolumes with
      different ro/rw options"), when mounting a subvolume read/write when
      another subvolume has previously been mounted read-only, we first do a
      remount. However, this should be done with the superblock locked, as per
      sync_filesystem():
      
      	/*
      	 * We need to be protected against the filesystem going from
      	 * r/o to r/w or vice versa.
      	 */
      	WARN_ON(!rwsem_is_locked(&sb->s_umount));
      
      This WARN_ON can easily be hit with:
      
      mkfs.btrfs -f /dev/vdb
      mount /dev/vdb /mnt
      btrfs subvol create /mnt/vol1
      btrfs subvol create /mnt/vol2
      umount /mnt
      mount -oro,subvol=/vol1 /dev/vdb /mnt
      mount -orw,subvol=/vol2 /dev/vdb /mnt2
      
      Fixes: 0723a047 ("btrfs: allow mounting btrfs subvolumes with different ro/rw options")
      Reviewed-by: David Sterba <dsterba@suse.cz>
      Signed-off-by: Omar Sandoval <osandov@osandov.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: wake up extent state waiters on unlock through clear_extent_bits · 0f31871f
      Filipe Manana committed
      When we clear an extent state's EXTENT_LOCKED bit with clear_extent_bits()
      through free_io_failure(), we weren't waking up any tasks waiting for the
      extent's state EXTENT_LOCKED bit, leading to a hang.
      
      So make sure clear_extent_bits() ends up waking up any waiters if the
      bit EXTENT_LOCKED is supplied by its callers.
      
      Zygo Blaxell was experiencing such hangs at inode eviction time after
      file unlinks. Thanks to him for a set of scripts to reproduce the issue.
      Reported-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: fix chunk allocation regression leading to transaction abort · c152b63e
      Filipe Manana committed
      With commit 1b984508 ("Btrfs: fix find_free_dev_extent() malfunction
      in case device tree has hole") introduced in the kernel 4.1 merge window,
      we end up using part of a device hole for which there are already pending
      chunks or pinned chunks. Before that commit we didn't use the hole and
      would just move on to the next hole in the device.
      
      However when we adjust the start offset for the chunk allocation and we
      have pinned chunks, we set it blindly to the end offset of the pinned
      chunk we are currently processing, which is dangerous because we can
      have a pending chunk that has a start offset that matches the end offset
      of our pinned chunk - leading us to a case where we end up getting two
      pending chunks that start at the same physical device offset, which makes
      us later abort the current transaction with -EEXIST when finishing the
      chunk allocation at btrfs_create_pending_block_groups():
      
      [194737.659017] ------------[ cut here ]------------
      [194737.660192] WARNING: CPU: 15 PID: 31111 at fs/btrfs/super.c:260 __btrfs_abort_transaction+0x52/0x106 [btrfs]()
      [194737.662209] BTRFS: Transaction aborted (error -17)
      [194737.663175] Modules linked in: btrfs dm_snapshot dm_bufio dm_flakey dm_mod crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop fuse
      [194737.674015] CPU: 15 PID: 31111 Comm: xfs_io Tainted: G        W       4.0.0-rc5-btrfs-next-9+ #2
      [194737.675986] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
      [194737.682999]  0000000000000009 ffff8800564c7a98 ffffffff8142fa46 ffffffff8108b6a2
      [194737.684540]  ffff8800564c7ae8 ffff8800564c7ad8 ffffffff81045ea5 ffff8800564c7b78
      [194737.686017]  ffffffffa0383aa7 00000000ffffffef ffff88000c7ba000 ffff8801a1f66f40
      [194737.687509] Call Trace:
      [194737.688068]  [<ffffffff8142fa46>] dump_stack+0x4f/0x7b
      [194737.689027]  [<ffffffff8108b6a2>] ? console_unlock+0x361/0x3ad
      [194737.690095]  [<ffffffff81045ea5>] warn_slowpath_common+0xa1/0xbb
      [194737.691198]  [<ffffffffa0383aa7>] ? __btrfs_abort_transaction+0x52/0x106 [btrfs]
      [194737.693789]  [<ffffffff81045f05>] warn_slowpath_fmt+0x46/0x48
      [194737.695065]  [<ffffffffa0383aa7>] __btrfs_abort_transaction+0x52/0x106 [btrfs]
      [194737.696806]  [<ffffffffa039a3bd>] btrfs_create_pending_block_groups+0x101/0x130 [btrfs]
      [194737.698683]  [<ffffffffa03aa433>] __btrfs_end_transaction+0x84/0x366 [btrfs]
      [194737.700329]  [<ffffffffa03aa725>] btrfs_end_transaction+0x10/0x12 [btrfs]
      [194737.701924]  [<ffffffffa0394b51>] btrfs_check_data_free_space+0x11f/0x27c [btrfs]
      [194737.703675]  [<ffffffffa03b8ba4>] __btrfs_buffered_write+0x16a/0x4c8 [btrfs]
      [194737.705417]  [<ffffffffa03bb502>] ? btrfs_file_write_iter+0x19a/0x431 [btrfs]
      [194737.707058]  [<ffffffffa03bb511>] ? btrfs_file_write_iter+0x1a9/0x431 [btrfs]
      [194737.708560]  [<ffffffffa03bb68d>] btrfs_file_write_iter+0x325/0x431 [btrfs]
      [194737.710673]  [<ffffffff81067d85>] ? get_parent_ip+0xe/0x3e
      [194737.712076]  [<ffffffff811534c3>] new_sync_write+0x7c/0xa0
      [194737.713293]  [<ffffffff81153b58>] vfs_write+0xb2/0x117
      [194737.714443]  [<ffffffff81154424>] SyS_pwrite64+0x64/0x82
      [194737.715646]  [<ffffffff81435b32>] system_call_fastpath+0x12/0x17
      [194737.717175] ---[ end trace f2d5dc04e56d7e48 ]---
      [194737.718170] BTRFS: error (device sdc) in btrfs_create_pending_block_groups:9524: errno=-17 Object already exists
      
      The -EEXIST failure comes from btrfs_finish_chunk_alloc(), called by
      btrfs_create_pending_block_groups(), when it attempts to insert a
      duplicated device extent item via btrfs_alloc_dev_extent().
      
      This issue was reproducible with fstests generic/038 running in a loop for
      several hours (it's very hard to hit) and using MOUNT_OPTIONS="-o discard".
      Applying Jeff's recent patch titled "btrfs: add missing discards when
      unpinning extents with -o discard" makes the issue much easier to reproduce
      (usually within 4 to 5 hours), since it pins chunks for longer periods of
      time when an unused block group is deleted by the cleaner kthread.
      
      Fix this by making sure that we never adjust the start offset to a lower
      value than it currently has.
      
      Fixes: 1b984508 ("Btrfs: fix find_free_dev_extent() malfunction in case device tree has hole")
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
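The fix in the commit above amounts to a clamp: when the search skips past a pinned or pending chunk, the start offset may only move forward. A minimal userspace sketch, assuming a hypothetical helper `advance_search_start()` (the name and signature are illustrative and are not taken from the kernel source):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper for illustration only; it is not the kernel code.
 * Move the search start past a busy range that ends at busy_end, but
 * never move it backwards.
 */
uint64_t advance_search_start(uint64_t search_start, uint64_t busy_end)
{
	uint64_t new_start = busy_end > search_start ? busy_end : search_start;

	assert(new_start >= search_start);	/* the invariant the fix restores */
	return new_start;
}
```

With this shape, a busy range that ends below the current start leaves the start untouched instead of pulling it back.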
    • btrfs: use after free when closing devices · 2037a093
      Sasha Levin committed
      __btrfs_close_devices() would call_rcu to free the device, which is racy with
      list_for_each_entry() accessing the memory to retrieve the next device on the
      list.
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.cz>
      Signed-off-by: Chris Mason <clm@fb.com>
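One common way to avoid this kind of use after free is to read the link to the next entry before the current entry is released; the kernel's list_for_each_entry_safe() iterator follows that pattern for list_head lists. The sketch below shows the idea in userspace with a hypothetical singly linked `struct dev`; it is not the btrfs device structure, and the commit message does not state which approach the patch takes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a device list entry; not the btrfs structure. */
struct dev {
	int id;
	struct dev *next;
};

/* Prepend a new entry; returns the new head, or the old head if malloc fails. */
struct dev *dev_push(struct dev *head, int id)
{
	struct dev *d = malloc(sizeof(*d));

	if (!d)
		return head;
	d->id = id;
	d->next = head;
	return d;
}

/* Free every entry and return how many were freed. */
size_t dev_close_all(struct dev **head)
{
	size_t freed = 0;
	struct dev *cur;

	assert(head != NULL);
	cur = *head;
	while (cur) {
		struct dev *next = cur->next;	/* read the link before freeing cur */

		free(cur);
		cur = next;
		freed++;
	}
	*head = NULL;
	return freed;
}
```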
    • btrfs: make root id query unprivileged · 01b810b8
      David Sterba committed
      The INO_LOOKUP ioctl can look up the path for a given inode number and is
      thus restricted. As a side effect it can find the root id of the
      containing subvolume and we're using this in the 'btrfs inspect rootid'
      command.
      
      The restriction is unnecessary in case we set the ioctl args
       args::treeid    = 0
       args::objectid  = 256 (BTRFS_FIRST_FREE_OBJECTID)
      
      Then the path will be empty and the treeid is filled with the root id of
      the inode on which the ioctl is called. This behaviour is unchanged,
      after the root restriction is removed.
      Signed-off-by: David Sterba <dsterba@suse.cz>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: fix block group ->space_info null pointer dereference · 2e6e5183
      Filipe Manana committed
      When we create a block group we add it to the rbtree of block groups
      before setting its ->space_info field (while it's NULL). This is
      problematic since other tasks can access the block group from the
      rbtree and attempt to use its ->space_info before it is set by
      btrfs_make_block_group().
      
      This can happen for example when a concurrent fitrim ioctl operation
      is ongoing, which produces a trace like the following when
      CONFIG_DEBUG_PAGEALLOC is set.
      
      [11509.604369] BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
      [11509.606373] IP: [<ffffffff8107d675>] __lock_acquire+0xb4/0xf02
      [11509.608179] PGD 2296a8067 PUD 22f4a2067 PMD 0
      [11509.608179] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
      [11509.608179] Modules linked in: btrfs crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop fuse acpi_cpufreq processor i2c_piix4 psmou
      [11509.608179] CPU: 10 PID: 8538 Comm: fstrim Tainted: G        W       4.0.0-rc5-btrfs-next-9+ #2
      [11509.608179] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
      [11509.608179] task: ffff88009f5c46d0 ti: ffff8801b3edc000 task.ti: ffff8801b3edc000
      [11509.608179] RIP: 0010:[<ffffffff8107d675>]  [<ffffffff8107d675>] __lock_acquire+0xb4/0xf02
      [11509.608179] RSP: 0018:ffff8801b3edf9e8  EFLAGS: 00010002
      [11509.608179] RAX: 0000000000000046 RBX: 0000000000000000 RCX: 0000000000000000
      [11509.608179] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000018
      [11509.608179] RBP: ffff8801b3edfaa8 R08: 0000000000000001 R09: 0000000000000000
      [11509.608179] R10: 0000000000000000 R11: ffff88009f5c4f98 R12: 0000000000000000
      [11509.608179] R13: 0000000000000000 R14: 0000000000000018 R15: ffff88009f5c46d0
      [11509.608179] FS:  00007f280a10e840(0000) GS:ffff88023ed40000(0000) knlGS:0000000000000000
      [11509.608179] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      [11509.608179] CR2: 0000000000000018 CR3: 00000002119bc000 CR4: 00000000000006e0
      [11509.608179] Stack:
      [11509.608179]  0000000000000000 0000000000000000 0000000000000004 0000000000000000
      [11509.608179]  ffff880100000000 ffffffff00000000 0000000000000001 ffffffff00000000
      [11509.608179]  0000000000000001 0000000000000000 ffff880100000000 00000000000006c4
      [11509.608179] Call Trace:
      [11509.608179]  [<ffffffff8107dc57>] ? __lock_acquire+0x696/0xf02
      [11509.608179]  [<ffffffff8107e806>] lock_acquire+0xa5/0x116
      [11509.608179]  [<ffffffffa04cc876>] ? do_trimming+0x51/0x145 [btrfs]
      [11509.608179]  [<ffffffff81434f37>] _raw_spin_lock+0x34/0x44
      [11509.608179]  [<ffffffffa04cc876>] ? do_trimming+0x51/0x145 [btrfs]
      [11509.608179]  [<ffffffffa04cc876>] do_trimming+0x51/0x145 [btrfs]
      [11509.608179]  [<ffffffffa04cde7d>] btrfs_trim_block_group+0x201/0x491 [btrfs]
      [11509.608179]  [<ffffffffa04849e2>] btrfs_trim_fs+0xe0/0x129 [btrfs]
      [11509.608179]  [<ffffffffa04bb80a>] btrfs_ioctl_fitrim+0x138/0x167 [btrfs]
      [11509.608179]  [<ffffffffa04c002f>] btrfs_ioctl+0x50d/0x21e8 [btrfs]
      [11509.608179]  [<ffffffff81123bda>] ? might_fault+0x58/0xb5
      [11509.608179]  [<ffffffff81123bda>] ? might_fault+0x58/0xb5
      [11509.608179]  [<ffffffff81123bda>] ? might_fault+0x58/0xb5
      [11509.608179]  [<ffffffff81158050>] ? cp_new_stat+0x147/0x15e
      [11509.608179]  [<ffffffff81163041>] do_vfs_ioctl+0x3c6/0x479
      [11509.608179]  [<ffffffff81158116>] ? SYSC_newfstat+0x25/0x2e
      [11509.608179]  [<ffffffff81435b54>] ? ret_from_sys_call+0x1d/0x58
      [11509.608179]  [<ffffffff8116b915>] ? __fget_light+0x2d/0x4f
      [11509.608179]  [<ffffffff8116314e>] SyS_ioctl+0x5a/0x7f
      [11509.608179]  [<ffffffff81435b32>] system_call_fastpath+0x12/0x17
      [11509.608179] Code: f4 01 00 0f 85 c0 00 00 00 48 c7 c1 f3 1f 7d 81 48 c7 c2 aa cb 7c 81 be fc 0b 00 00 eb 70 83 3d 61 eb 9c 00 00 0f 84 a5 00 00 00 <49> 81 3e 40 a3 2b 82 b8 00 00 00
      [11509.608179] RIP  [<ffffffff8107d675>] __lock_acquire+0xb4/0xf02
      [11509.608179]  RSP <ffff8801b3edf9e8>
      [11509.608179] CR2: 0000000000000018
      [11509.608179] ---[ end trace 570a5c6769f0e49a ]---
      
      Which corresponds to the following access in fs/btrfs/free-space-cache.c:
      
        static int do_trimming(struct btrfs_block_group_cache *block_group,
                               u64 *total_trimmed, u64 start, u64 bytes,
                               u64 reserved_start, u64 reserved_bytes,
                               struct btrfs_trim_range *trim_entry)
        {
             struct btrfs_space_info *space_info = block_group->space_info;
        (...)
             spin_lock(&space_info->lock);
             ^^^^^ - block_group->space_info is NULL...
      
      Fix this by ensuring the block group's ->space_info is set before adding
      the block group to the rbtree.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: check error before reporting missing device and add uuid · 33b97e43
      Anand Jain committed
      Report the missing device only when the add is successful;
      otherwise it would exit with ENOMEM. Also add the uuid
      to the report.
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.cz>
      Signed-off-by: Chris Mason <clm@fb.com>
    • btrfs: Fix superblock csum type check. · 1f6e4b3f
      Qu Wenruo committed
      The old csum type check is wrong and can't catch csum_type 1 (not supported).
      
      Fix it to avoid a hostile division by zero.
      Reported-by: Lukas Lueg <lukas.lueg@gmail.com>
      Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
      Reviewed-by: David Sterba <dsterba@suse.cz>
      Signed-off-by: Chris Mason <clm@fb.com>
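The general shape of such a check is to validate the on-disk type against the table of known checksum sizes before the size is ever used as a divisor. The sketch below assumes a hypothetical one-entry table `csum_sizes` and hypothetical helper names; it does not reproduce the kernel's table or its exact condition.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical table for illustration: only checksum type 0 is known. */
static const uint16_t csum_sizes[] = { 4 };

#define NUM_CSUM_TYPES (sizeof(csum_sizes) / sizeof(csum_sizes[0]))

/* Return 1 if csum_type indexes a known entry, 0 otherwise. */
int csum_type_supported(uint16_t csum_type)
{
	return csum_type < NUM_CSUM_TYPES;
}

/*
 * Number of checksums that fit in `bytes`, or -1 for an unknown type.
 * Validating first means the divisor is always a real table entry.
 */
long csums_per_block(uint16_t csum_type, size_t bytes)
{
	if (!csum_type_supported(csum_type))
		return -1;
	assert(csum_sizes[csum_type] != 0);
	return (long)(bytes / csum_sizes[csum_type]);
}
```

Note the strict `<` comparison against the table length: an off-by-one here would let the first unsupported type through.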
    • Btrfs: incremental send, fix clone operations for compressed extents · 619d8c4e
      Filipe Manana committed
      Marc reported a problem where the receiving end of an incremental send
      was performing clone operations that failed with -EINVAL. This happened
      because, unlike for uncompressed extents, we were not checking whether the
      source clone offset and length, after adding the data offset, fall
      within the source file's boundaries.
      
      So make sure we do such checks when attempting to issue clone operations
      for compressed extents.
      
      Problem reproducible with the following steps:
      
        $ mkfs.btrfs -f /dev/sdb
        $ mount -o compress /dev/sdb /mnt
        $ mkfs.btrfs -f /dev/sdc
        $ mount -o compress /dev/sdc /mnt2
      
        # Create the file with a single extent of 128K. This creates a metadata file
        # extent item with a data start offset of 0 and a logical length of 128K.
        $ xfs_io -f -c "pwrite -S 0xaa 64K 128K" -c "fsync" /mnt/foo
      
        # Now rewrite the range 64K to 176K of our file. This will make the inode's
        # metadata continue to point to the 128K extent we created before, but now
        # with an extent item that points to the extent with a data start offset of
        # 112K and a logical length of 16K.
        # That metadata file extent item is associated with the logical file offset
        # at 176K and covers the logical file range 176K to 192K.
        $ xfs_io -c "pwrite -S 0xbb 64K 112K" -c "fsync" /mnt/foo
      
        # Now rewrite the range 180K to 192K. This will make the inode's metadata
        # continue to point to the 128K extent we created earlier, with a single
        # extent item that points to it with a start offset of 112K and a logical
        # length of 4K.
        # That metadata file extent item is associated with the logical file offset
        # at 176K and covers the logical file range 176K to 180K.
        $ xfs_io -c "pwrite -S 0xcc 180K 12K" -c "fsync" /mnt/foo
      
        $ btrfs subvolume snapshot -r /mnt /mnt/snap1
      
        $ touch /mnt/bar
        # Calls the btrfs clone ioctl.
        $ ~/xfstests/src/cloner -s $((176 * 1024)) -d $((176 * 1024)) \
          -l $((4 * 1024)) /mnt/foo /mnt/bar
      
        $ btrfs subvolume snapshot -r /mnt /mnt/snap2
      
        $ btrfs send /mnt/snap1 | btrfs receive /mnt2
        At subvol /mnt/snap1
        At subvol snap1
      
        $ btrfs send -p /mnt/snap1 /mnt/snap2 | btrfs receive /mnt2
        At subvol /mnt/snap2
        At snapshot snap2
        ERROR: failed to clone extents to bar
        Invalid argument
      
      A test case for fstests follows soon.
      Reported-by: Marc MERLIN <marc@merlins.org>
      Tested-by: Marc MERLIN <marc@merlins.org>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Tested-by: David Sterba <dsterba@suse.cz>
      Tested-by: Jan Alexander Steffens (heftig) <jan.steffens@gmail.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • btrfs: qgroup: Fix possible leak in btrfs_add_qgroup_relation() · ab3680dd
      Christian Engelmayer committed
      Commit 9c8b35b1 ("btrfs: quota: Automatically update related qgroups or
      mark INCONSISTENT flags when assigning/deleting a qgroup relations.")
      introduced the allocation of a temporary ulist in function
      btrfs_add_qgroup_relation() and added the corresponding cleanup to the out
      path. However, the allocation was introduced before the src/dst level check
      that directly returns. Fix the possible leakage of the ulist by moving the
      allocation after the input validation. Detected by Coverity CID 1295988.
      Signed-off-by: Christian Engelmayer <cengelma@gmx.at>
      Reviewed-by: David Sterba <dsterba@suse.cz>
      Signed-off-by: Chris Mason <clm@fb.com>
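The pattern behind this fix is to run input validation before acquiring any resource, so that early returns have nothing to clean up. A minimal sketch follows; `add_relation()` is a hypothetical function, and the level rule it checks is an assumption made only to have something to validate.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/*
 * Hypothetical sketch, not btrfs_add_qgroup_relation() itself.
 * The level rule (source must be below destination) is an assumption
 * used only to have something to validate.
 */
int add_relation(unsigned int src_level, unsigned int dst_level)
{
	int *tmp;

	/* Validate first: this early return cannot leak anything. */
	if (src_level >= dst_level)
		return -EINVAL;

	tmp = malloc(16 * sizeof(*tmp));
	if (!tmp)
		return -ENOMEM;

	tmp[0] = (int)(dst_level - src_level);	/* placeholder for real work */
	assert(tmp[0] > 0);

	free(tmp);	/* single cleanup point for the temporary buffer */
	return 0;
}
```

Had the allocation come before the level check, the `-EINVAL` return would have skipped the `free()`.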
    • Btrfs: fix mutex unlock without prior lock on space cache truncation · 35c76642
      Filipe Manana committed
      If the call to btrfs_truncate_inode_items() failed and we don't have a block
      group, we were unlocking the cache_write_mutex without having locked it (we
      do it only if we have a block group).
      
      Fixes: 1bbc621e ("Btrfs: allow block group cache writeout outside critical section in commit")
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.cz>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: log when missing device is created · 816fcebe
      Anand Jain committed
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.cz>
      Signed-off-by: Chris Mason <clm@fb.com>
    • btrfs: fix warnings after changes in btrfs_abort_transaction · 6d13f549
      David Sterba committed
      fs/btrfs/volumes.c: In function ‘btrfs_create_uuid_tree’:
      fs/btrfs/volumes.c:3909:3: warning: format ‘%d’ expects argument of type ‘int’, but argument 4 has type ‘long int’ [-Wformat=]
         btrfs_abort_transaction(trans, tree_root,
         ^
        CC [M]  fs/btrfs/ioctl.o
      fs/btrfs/ioctl.c: In function ‘create_subvol’:
      fs/btrfs/ioctl.c:549:3: warning: format ‘%d’ expects argument of type ‘int’, but argument 4 has type ‘long int’ [-Wformat=]
         btrfs_abort_transaction(trans, root, PTR_ERR(new_root));
      
      PTR_ERR returns long, but we're really using 'int' for the error codes
      everywhere so just set and use the local variable.
      Signed-off-by: David Sterba <dsterba@suse.cz>
      Signed-off-by: Chris Mason <clm@fb.com>
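The warning arises because a `long` value is passed where the format string expects `%d`. Storing the error in an `int` local first makes the argument type match. The sketch below uses userspace stand-ins `err_to_ptr()` and `ptr_to_err()` for the kernel's ERR_PTR()/PTR_ERR(); those names and `format_abort()` are hypothetical, and the message text is borrowed from the log output quoted earlier on this page.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Userspace stand-ins for the kernel's ERR_PTR()/PTR_ERR(); hypothetical names. */
void *err_to_ptr(long err)
{
	return (void *)(intptr_t)err;
}

long ptr_to_err(const void *ptr)
{
	return (long)(intptr_t)ptr;
}

/* Format an abort message; the int local matches the %d conversion. */
int format_abort(char *buf, size_t len, const void *maybe_err)
{
	int ret = (int)ptr_to_err(maybe_err);	/* error codes fit in int */

	assert(buf != NULL && len > 0);
	snprintf(buf, len, "Transaction aborted (error %d)", ret);
	return ret;
}
```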
    • btrfs: add 'cold' compiler annotations to all error handling functions · c0d19e2b
      David Sterba committed
      The annotated functions will be placed into the .text.unlikely section. The
      annotation also hints the compiler to move the code out of the hot paths,
      and may implicitly mark the if-statement leading to that block as unlikely.
      
      This is a heuristic; the impact on the generated code is not
      significant.
      Signed-off-by: David Sterba <dsterba@suse.cz>
      Signed-off-by: Chris Mason <clm@fb.com>
    • btrfs: report exact callsite where transaction abort occurs · 1a9a8a71
      David Sterba committed
      WARN is called from a single location and all bug reports say that's in
      super.c __btrfs_abort_transaction. This is slightly confusing as we'd
      rather want to know the exact callsite. Although this information is
      printed in the syslog below the stack trace, finding it requires a closer
      look and we usually see only the headline from WARNING.
      
      Moving the WARN into the macro has to inline some code and increases
      code by a few kilobytes:
      
        text    data     bss     dec     hex filename
      835481   20305   14120  869906   d4612 btrfs.ko.before
      842883   20305   14120  877308   d62fc btrfs.ko.after
      
      The delta is +7k (130+ calls), measured on 3.19 x86_64, distro config.
      The increase is not small and could lead to worse icache use. The code
      is on error/exit paths that can be recognized by compiler as cold and
      moved out of the way so the impact is speculated to be low, if
      measurable at all.
      Signed-off-by: David Sterba <dsterba@suse.cz>
      Signed-off-by: Chris Mason <clm@fb.com>
    • btrfs: let tree defrag work in SSD mode · 13028901
      David Sterba committed
      A long time ago (2008) the defrag was automatic for new b-tree writes but
      has been disabled after performance problems. There was a leftover in
      tree-defrag.c that effectively stops any defragmentation on b-trees.
      This is a bit unexpected and IMHO undesired. The SSD mode is an
      optimization and defrag is supposed to work if the user asks for it.
      
      Related commits:
      
      6702ed49
      Btrfs: Add run time btree defrag, and an ioctl to force btree defrag
      
      e18e4809
      Btrfs: Add mount -o ssd, which includes optimizations for seek free
      storage
      
      b3236e68
      Btrfs: Leave on the tree defragger in mount -o ssd, it still helps there
      
      9afbb0b7
      Btrfs: Disable tree defrag in SSD mode
      
      The last three commits switch the defrag+ssd off/on/off and the last one
      
      3f157a2f
      Btrfs: Online btree defragmentation fixes
      
      misses the bits from tree-defrag.c to revert to the behaviour introduced
      in e18e4809.
      Signed-off-by: David Sterba <dsterba@suse.cz>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: check pending chunks when shrinking fs to avoid corruption · 53e489bc
      Filipe Manana committed
      When we shrink the usable size of a device (its total_bytes), we go over
      all the device extent items in the device tree and attempt to relocate
      the chunk of any device extent that goes beyond the new usable size for
      the device. We do that after setting the new usable size (total_bytes) in
      the device object, so that all new allocations (and reallocations) don't
      use areas of the device that go beyond the new (shorter) size. However we
      were not considering that before setting the new size in the device,
      pending chunks might have been created that use device extents that go
      beyond the new size, and those device extents are not yet in the device
      tree after we search the device tree - they are still attached to the
      list of new block groups for some ongoing transaction handle, and they are
      only added to the device tree when the transaction handle is ended (via
      btrfs_create_pending_block_groups()).
      
      So check for pending chunks with device extents that go beyond the new
      size and if any exists, commit the current transaction and repeat the
      search in the device tree.
      
      Not doing this would mean we would return success to user space while
      still having extents that go beyond the new size, and later user space
      could overwrite those locations on the device while the fs still references
      them, causing all sorts of corruption and unexpected events.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
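The extra check described in the commit above can be pictured as a scan over the device extents of pending chunks, looking for any that end past the new size. This is a sketch under stated assumptions: `struct dev_extent` and `pending_extent_beyond()` are hypothetical names, and the real code walks kernel data structures rather than a flat array.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a device extent of a pending chunk. */
struct dev_extent {
	uint64_t start;
	uint64_t len;
};

/* Return 1 if any pending extent ends beyond new_size, 0 otherwise. */
int pending_extent_beyond(const struct dev_extent *pending, size_t count,
			  uint64_t new_size)
{
	assert(pending != NULL || count == 0);
	for (size_t i = 0; i < count; i++) {
		if (pending[i].start + pending[i].len > new_size)
			return 1;
	}
	return 0;
}
```

Per the commit message, a positive result means the current transaction is committed and the device tree search is repeated.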
    • Btrfs: don't invalidate root dentry when subvolume deletion fails · 64ad6c48
      Omar Sandoval committed
      Since commit bafc9b75 ("vfs: More precise tests in d_invalidate"),
      mounted subvolumes can be deleted because d_invalidate() won't fail.
      However, we run into problems when we attempt to delete the default
      subvolume while it is mounted as the root filesystem:
      
      	# btrfs subvol list /
      	ID 257 gen 306 top level 5 path rootvol
      	ID 267 gen 334 top level 5 path snap1
      	# btrfs subvol get-default /
      	ID 267 gen 334 top level 5 path snap1
      	# btrfs inspect-internal rootid /
      	267
      	# mount -o subvol=/ /dev/vda1 /mnt
      	# btrfs subvol del /mnt/snap1
      	Delete subvolume (no-commit): '/mnt/snap1'
      	ERROR: cannot delete '/mnt/snap1' - Operation not permitted
      	# findmnt /
      	findmnt: can't read /proc/mounts: No such file or directory
      	# ls /proc
      	#
      
      Markus reported that this same scenario simply led to a kernel oops.
      
      This happens because in btrfs_ioctl_snap_destroy(), we call
      d_invalidate() before we check may_destroy_subvol(), which means that we
      detach the submounts and drop the dentry before erroring out. Instead,
      we should only invalidate the dentry once the deletion has succeeded.
      Additionally, the shrink_dcache_sb() isn't necessary; d_invalidate()
      will prune the dcache for the deleted subvolume.
      
      Cc: <stable@vger.kernel.org>
      Fixes: bafc9b75 ("vfs: More precise tests in d_invalidate")
      Reported-by: Markus Schauler <mschauler@gmail.com>
      Signed-off-by: Omar Sandoval <osandov@osandov.com>
      Signed-off-by: Chris Mason <clm@fb.com>
  2. 21 May, 2015 1 commit
  3. 20 May, 2015 2 commits
    • Btrfs: fix racy system chunk allocation when setting block group ro · a9629596
      Filipe Manana committed
      If while setting a block group read-only we end up allocating a system
      chunk, through check_system_chunk(), we were not doing it while holding
      the chunk mutex which is a problem if a concurrent chunk allocation is
      happening, through do_chunk_alloc(), as it means both block groups can
      end up using the same logical addresses and physical regions in the
      device(s). So make sure we hold the chunk mutex.
      
      Cc: stable@vger.kernel.org  # 4.0+
      Fixes: 2f081088 ("btrfs: delete chunk allocation attemp when setting block group ro")
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • btrfs: clear 'ret' in btrfs_check_shared() loop · 2c2ed5aa
      Mark Fasheh committed
      btrfs_check_shared() is leaking a return value of '1' from
      find_parent_nodes(). As a result, callers (in this case, extent_fiemap())
      are told extents are shared when they are not. This in turn broke fiemap on
      btrfs for kernels v3.18 and up.
      
      The fix is simple - we just have to clear 'ret' after we are done processing
      the results of find_parent_nodes().
      
      It wasn't clear to me at first what was happening with return values in
      btrfs_check_shared() and find_parent_nodes() - thanks to Josef for the help
      on irc. I added documentation to both functions to make things more clear
      for the next hacker who might come across them.
      
      If we could queue this up for -stable too that would be great.
      Signed-off-by: Mark Fasheh <mfasheh@suse.de>
      Reviewed-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>
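The bug class here is a helper's non-negative status leaking out as the caller's answer when the loop ends without a match. The sketch below reproduces the shape with hypothetical functions `collect_owners()` and `check_shared_sketch()`; the owner-count rule is invented for illustration and has nothing to do with how find_parent_nodes() works.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: reports an owner count for an item.
 * Returns a negative value on error, otherwise a positive status
 * that callers must not mistake for "shared".
 */
int collect_owners(int item, int *owners)
{
	if (item < 0)
		return -1;
	*owners = item % 3;	/* made-up rule, for illustration only */
	return 1;
}

/* Return 1 if any item has more than one owner, 0 if none, <0 on error. */
int check_shared_sketch(const int *items, size_t count)
{
	int ret = 0;

	assert(items != NULL || count == 0);
	for (size_t i = 0; i < count; i++) {
		int owners = 0;

		ret = collect_owners(items[i], &owners);
		if (ret < 0)
			break;
		ret = 0;	/* clear the helper's status before deciding */
		if (owners > 1) {
			ret = 1;
			break;
		}
	}
	return ret;
}
```

Without the `ret = 0` line, a run in which no item is shared would still return the helper's 1.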
  4. 11 May, 2015 4 commits
    • Btrfs: fix race when reusing stale extent buffers that leads to BUG_ON · 062c19e9
      Filipe Manana committed
      There's a race between releasing extent buffers that are flagged as stale
      and recycling them that makes us hit the following BUG_ON at
      btrfs_release_extent_buffer_page:
      
          BUG_ON(extent_buffer_under_io(eb))
      
      The BUG_ON is triggered because the extent buffer has the flag
      EXTENT_BUFFER_DIRTY set as a consequence of having been reused and made
      dirty by another concurrent task.
      
      Here follows a sequence of steps that leads to the BUG_ON.
      
            CPU 0                                                    CPU 1                                                CPU 2
      
      path->nodes[0] == eb X
      X->refs == 2 (1 for the tree, 1 for the path)
      btrfs_header_generation(X) == current trans id
      flag EXTENT_BUFFER_DIRTY set on X
      
      btrfs_release_path(path)
          unlocks X
      
                                                            reads eb X
                                                               X->refs incremented to 3
                                                            locks eb X
                                                            btrfs_del_items(X)
                                                               X becomes empty
                                                               clean_tree_block(X)
                                                                   clear EXTENT_BUFFER_DIRTY from X
                                                               btrfs_del_leaf(X)
                                                                   unlocks X
                                                                   extent_buffer_get(X)
                                                                      X->refs incremented to 4
                                                                   btrfs_free_tree_block(X)
                                                                      X's range is not pinned
                                                                      X's range added to free
                                                                        space cache
                                                                   free_extent_buffer_stale(X)
                                                                      lock X->refs_lock
                                                                      set EXTENT_BUFFER_STALE on X
                                                                      release_extent_buffer(X)
                                                                          X->refs decremented to 3
                                                                          unlocks X->refs_lock
                                                            btrfs_release_path()
                                                               unlocks X
                                                               free_extent_buffer(X)
                                                                   X->refs becomes 2
      
                                                                                                            __btrfs_cow_block(Y)
                                                                                                                btrfs_alloc_tree_block()
                                                                                                                    btrfs_reserve_extent()
                                                                                                                        find_free_extent()
                                                                                                                            gets offset == X->start
                                                                                                                    btrfs_init_new_buffer(X->start)
                                                                                                                        btrfs_find_create_tree_block(X->start)
                                                                                                                            alloc_extent_buffer(X->start)
                                                                                                                                find_extent_buffer(X->start)
                                                                                                                                    finds eb X in radix tree
      
          free_extent_buffer(X)
              lock X->refs_lock
                  test X->refs == 2
                  test bit EXTENT_BUFFER_STALE is set
                  test !extent_buffer_under_io(eb)
      
                                                                                                                                    increments X->refs to 3
                                                                                                                                    mark_extent_buffer_accessed(X)
                                                                                                                                        check_buffer_tree_ref(X)
                                                                                                                                          --> does nothing,
                                                                                                                                              X->refs >= 2 and
                                                                                                                                              EXTENT_BUFFER_TREE_REF
                                                                                                                                              is set in X
                                                                                                                    clear EXTENT_BUFFER_STALE from X
                                                                                                                    locks X
                                                                                                                btrfs_mark_buffer_dirty()
                                                                                                                    set_extent_buffer_dirty(X)
                                                                                                                        check_buffer_tree_ref(X)
                                                                                                                           --> does nothing, X->refs >= 2 and
                                                                                                                               EXTENT_BUFFER_TREE_REF is set
                                                                                                                        sets EXTENT_BUFFER_DIRTY on X
      
                  test and clear EXTENT_BUFFER_TREE_REF
                  decrements X->refs to 2
              release_extent_buffer(X)
                  decrements X->refs to 1
                  unlock X->refs_lock
      
                                                                                                            unlock X
                                                                                                            free_extent_buffer(X)
                                                                                                                lock X->refs_lock
                                                                                                                release_extent_buffer(X)
                                                                                                                    decrements X->refs to 0
                                                                                                                    btrfs_release_extent_buffer_page(X)
                                                                                                                         BUG_ON(extent_buffer_under_io(X))
                                                                                                                             --> EXTENT_BUFFER_DIRTY set on X
      
      Fix this by making find_extent_buffer() wait for any ongoing task currently
      executing free_extent_buffer()/free_extent_buffer_stale() if the extent
      buffer has the stale flag set.
      A cleaner alternative would be to always increment the extent buffer's
      reference count while holding its refs_lock spinlock, but find_extent_buffer()
      is a performance critical area and that would cause lock contention whenever
      multiple tasks search for the same extent buffer concurrently.
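
The waiting scheme described above can be modeled in user space roughly as follows. This is only a sketch under stated assumptions, not the kernel code: `struct eb_model`, `eb_model_get()` and the `atomic_flag` based spinlock are hypothetical stand-ins for the extent buffer, the lookup path and `refs_lock`.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical user-space stand-in for an extent buffer. */
struct eb_model {
	atomic_int refs;        /* reference count */
	atomic_bool stale;      /* models EXTENT_BUFFER_STALE */
	atomic_flag refs_lock;  /* models the refs_lock spinlock */
};

static void eb_model_init(struct eb_model *eb, int refs, bool stale)
{
	atomic_init(&eb->refs, refs);
	atomic_init(&eb->stale, stale);
	atomic_flag_clear(&eb->refs_lock);
}

static void model_spin_lock(atomic_flag *lock)
{
	while (atomic_flag_test_and_set_explicit(lock, memory_order_acquire))
		; /* busy-wait until the current holder releases the lock */
}

static void model_spin_unlock(atomic_flag *lock)
{
	atomic_flag_clear_explicit(lock, memory_order_release);
}

/*
 * Lockless lookup: take a reference without the lock, then, only if the
 * buffer is stale, take and drop refs_lock so that a task that is in the
 * middle of freeing the buffer finishes before the caller goes on.
 * Returns false if the reference count was already zero.
 */
static bool eb_model_get(struct eb_model *eb)
{
	int old = atomic_load(&eb->refs);

	/* Equivalent of an "increment unless zero" operation. */
	do {
		if (old == 0)
			return false;
	} while (!atomic_compare_exchange_weak(&eb->refs, &old, old + 1));

	if (atomic_load(&eb->stale)) {
		model_spin_lock(&eb->refs_lock);
		model_spin_unlock(&eb->refs_lock);
	}
	return true;
}
```

The common, non-stale case stays lock-free; only lookups that race with a stale buffer pay for the lock round trip, which is the trade-off the paragraph above argues for.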
      
      A build server running a SLES 12 kernel (3.12 kernel + over 450 upstream
      btrfs patches backported from newer kernels) was hitting this often:
      
      [1212302.461948] kernel BUG at ../fs/btrfs/extent_io.c:4507!
      (...)
      [1212302.470219] CPU: 1 PID: 19259 Comm: bs_sched Not tainted 3.12.36-38-default #1
      [1212302.540792] Hardware name: Supermicro PDSM4/PDSM4, BIOS 6.00 04/17/2006
      [1212302.540792] task: ffff8800e07e0100 ti: ffff8800d6412000 task.ti: ffff8800d6412000
      [1212302.540792] RIP: 0010:[<ffffffffa0507081>]  [<ffffffffa0507081>] btrfs_release_extent_buffer_page.constprop.51+0x101/0x110 [btrfs]
      (...)
      [1212302.630008] Call Trace:
      [1212302.630008]  [<ffffffffa05070cd>] release_extent_buffer+0x3d/0xa0 [btrfs]
      [1212302.630008]  [<ffffffffa04c2d9d>] btrfs_release_path+0x1d/0xa0 [btrfs]
      [1212302.630008]  [<ffffffffa04c5c7e>] read_block_for_search.isra.33+0x13e/0x3a0 [btrfs]
      [1212302.630008]  [<ffffffffa04c8094>] btrfs_search_slot+0x3f4/0xa80 [btrfs]
      [1212302.630008]  [<ffffffffa04cf5d8>] lookup_inline_extent_backref+0xf8/0x630 [btrfs]
      [1212302.630008]  [<ffffffffa04d13dd>] __btrfs_free_extent+0x11d/0xc40 [btrfs]
      [1212302.630008]  [<ffffffffa04d64a4>] __btrfs_run_delayed_refs+0x394/0x11d0 [btrfs]
      [1212302.630008]  [<ffffffffa04db379>] btrfs_run_delayed_refs.part.66+0x69/0x280 [btrfs]
      [1212302.630008]  [<ffffffffa04ed2ad>] __btrfs_end_transaction+0x2ad/0x3d0 [btrfs]
      [1212302.630008]  [<ffffffffa04f7505>] btrfs_evict_inode+0x4a5/0x500 [btrfs]
      [1212302.630008]  [<ffffffff811b9e28>] evict+0xa8/0x190
      [1212302.630008]  [<ffffffff811b0330>] do_unlinkat+0x1a0/0x2b0
      
      I was also able to reproduce this on a 3.19 kernel, corresponding to Chris'
      integration branch from about a month ago, running the following stress
      test on a qemu/kvm guest (with 4 virtual CPUs and 16 GB of RAM):
      
        while true; do
           mkfs.btrfs -l 4096 -f -b `expr 20 \* 1024 \* 1024 \* 1024` /dev/sdd
           mount /dev/sdd /mnt
           snapshot_cmd="btrfs subvolume snapshot -r /mnt"
           snapshot_cmd="$snapshot_cmd /mnt/snap_\`date +'%H_%M_%S_%N'\`"
           fsstress -d /mnt -n 25000 -p 8 -x "$snapshot_cmd" -X 100
           umount /mnt
        done
      
      Which usually triggers the BUG_ON within less than 24 hours:
      
      [49558.618097] ------------[ cut here ]------------
      [49558.619732] kernel BUG at fs/btrfs/extent_io.c:4551!
      (...)
      [49558.620031] CPU: 3 PID: 23908 Comm: fsstress Tainted: G        W      3.19.0-btrfs-next-7+ #3
      [49558.620031] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
      [49558.620031] task: ffff8800319fc0d0 ti: ffff880220da8000 task.ti: ffff880220da8000
      [49558.620031] RIP: 0010:[<ffffffffa0476b1a>]  [<ffffffffa0476b1a>] btrfs_release_extent_buffer_page+0x20/0xe9 [btrfs]
      (...)
      [49558.620031] Call Trace:
      [49558.620031]  [<ffffffffa0476c73>] release_extent_buffer+0x90/0xd3 [btrfs]
      [49558.620031]  [<ffffffff8142b10c>] ? _raw_spin_lock+0x3b/0x43
      [49558.620031]  [<ffffffffa0477052>] ? free_extent_buffer+0x37/0x94 [btrfs]
      [49558.620031]  [<ffffffffa04770ab>] free_extent_buffer+0x90/0x94 [btrfs]
      [49558.620031]  [<ffffffffa04396d5>] btrfs_release_path+0x4a/0x69 [btrfs]
      [49558.620031]  [<ffffffffa0444907>] __btrfs_free_extent+0x778/0x80c [btrfs]
      [49558.620031]  [<ffffffffa044a485>] __btrfs_run_delayed_refs+0xad2/0xc62 [btrfs]
      [49558.728054]  [<ffffffff811420d5>] ? kmemleak_alloc_recursive.constprop.52+0x16/0x18
      [49558.728054]  [<ffffffffa044c1e8>] btrfs_run_delayed_refs+0x6d/0x1ba [btrfs]
      [49558.728054]  [<ffffffffa045917f>] ? join_transaction.isra.9+0xb9/0x36b [btrfs]
      [49558.728054]  [<ffffffffa045a75c>] btrfs_commit_transaction+0x4c/0x981 [btrfs]
      [49558.728054]  [<ffffffffa0434f86>] btrfs_sync_fs+0xd5/0x10d [btrfs]
      [49558.728054]  [<ffffffff81155923>] ? iterate_supers+0x60/0xc4
      [49558.728054]  [<ffffffff8117966a>] ? do_sync_work+0x91/0x91
      [49558.728054]  [<ffffffff8117968a>] sync_fs_one_sb+0x20/0x22
      [49558.728054]  [<ffffffff81155939>] iterate_supers+0x76/0xc4
      [49558.728054]  [<ffffffff811798e8>] sys_sync+0x55/0x83
      [49558.728054]  [<ffffffff8142bbd2>] system_call_fastpath+0x12/0x17
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.cz>
      Signed-off-by: Chris Mason <clm@fb.com>
      062c19e9
    • F
      Btrfs: fix race between block group creation and their cache writeout · ff1f8250
      Filipe Manana committed
      So creating a block group has 2 distinct phases:
      
      Phase 1 - creates the btrfs_block_group_cache item and adds it to the
      rbtree fs_info->block_group_cache_tree and to the corresponding list
      space_info->block_groups[];
      
      Phase 2 - adds the block group item to the extent tree and corresponding
      items to the chunk tree.
      
      The first phase adds the block_group_cache_item to a list of pending block
      groups in the transaction handle, and phase 2 happens when
      btrfs_end_transaction() is called against the transaction handle.
      
      It happens that once phase 1 completes, other concurrent tasks that use
      their own transaction handle, but point to the same running transaction
      (struct btrfs_trans_handle->transaction), can use this block group for
      space allocations and therefore mark it dirty. Dirty block groups are
      tracked in a list belonging to the currently running transaction (struct
      btrfs_transaction) and not in the transaction handle (btrfs_trans_handle).
      
      This is a problem because once a task calls btrfs_commit_transaction(),
      it calls btrfs_start_dirty_block_groups() which will see all dirty block
      groups and attempt to start their writeout, including those that are
      still attached to the transaction handle of some concurrent task that
      hasn't called btrfs_end_transaction() yet - which means those block
      groups haven't gone through phase 2 yet and therefore when
      write_one_cache_group() is called, it won't find the block group items
      in the extent tree and will abort the current transaction with -ENOENT,
      turning the fs into readonly mode and requiring a remount.
      
      Fix this by ignoring -ENOENT when looking for block group items in the
      extent tree when we attempt to start the writeout of the block group
      caches outside the critical section of the transaction commit. We will
      try again later during the critical section, and if we still don't
      find the block group item in the extent tree then, we abort the current
      transaction.
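
The decision described above can be sketched as a small user-space function. It is an illustration only: `enum writeout_action` and `classify_writeout()` are hypothetical names that do not exist in the kernel, and `ret` stands for the result of looking up the block group item.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical outcome of one block group cache writeout attempt. */
enum writeout_action {
	WRITEOUT_OK,     /* item found and updated, nothing more to do */
	WRITEOUT_RETRY,  /* item not there yet, retry in the critical section */
	WRITEOUT_ABORT,  /* real failure, abort the transaction */
};

/*
 * ret is 0 on success or a negative errno. -ENOENT is tolerated only
 * outside the critical section of the commit, because the block group
 * may not have gone through phase 2 yet.
 */
static enum writeout_action classify_writeout(int ret, bool in_critical_section)
{
	if (ret == 0)
		return WRITEOUT_OK;
	if (ret == -ENOENT && !in_critical_section)
		return WRITEOUT_RETRY;
	return WRITEOUT_ABORT;
}
```

Inside the critical section every pending block group has completed phase 2, so a missing item there is treated as a genuine inconsistency.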
      
      This issue happened twice, once while running fstests btrfs/067 and once
      for btrfs/078, which produced the following trace:
      
      [ 3278.703014] WARNING: CPU: 7 PID: 18499 at fs/btrfs/super.c:260 __btrfs_abort_transaction+0x52/0x114 [btrfs]()
      [ 3278.707329] BTRFS: Transaction aborted (error -2)
      (...)
      [ 3278.731555] Call Trace:
      [ 3278.732396]  [<ffffffff8142fa46>] dump_stack+0x4f/0x7b
      [ 3278.733860]  [<ffffffff8108b6a2>] ? console_unlock+0x361/0x3ad
      [ 3278.735312]  [<ffffffff81045ea5>] warn_slowpath_common+0xa1/0xbb
      [ 3278.736874]  [<ffffffffa03ada6d>] ? __btrfs_abort_transaction+0x52/0x114 [btrfs]
      [ 3278.738302]  [<ffffffff81045f05>] warn_slowpath_fmt+0x46/0x48
      [ 3278.739520]  [<ffffffffa03ada6d>] __btrfs_abort_transaction+0x52/0x114 [btrfs]
      [ 3278.741222]  [<ffffffffa03b9e56>] write_one_cache_group+0xae/0xbf [btrfs]
      [ 3278.742797]  [<ffffffffa03c487b>] btrfs_start_dirty_block_groups+0x170/0x2b2 [btrfs]
      [ 3278.744492]  [<ffffffffa03d309c>] btrfs_commit_transaction+0x130/0x9c9 [btrfs]
      [ 3278.746084]  [<ffffffff8107d33d>] ? trace_hardirqs_on+0xd/0xf
      [ 3278.747249]  [<ffffffffa03e5660>] btrfs_sync_file+0x313/0x387 [btrfs]
      [ 3278.748744]  [<ffffffff8117acad>] vfs_fsync_range+0x95/0xa4
      [ 3278.749958]  [<ffffffff81435b54>] ? ret_from_sys_call+0x1d/0x58
      [ 3278.751218]  [<ffffffff8117acd8>] vfs_fsync+0x1c/0x1e
      [ 3278.754197]  [<ffffffff8117ae54>] do_fsync+0x34/0x4e
      [ 3278.755192]  [<ffffffff8117b07c>] SyS_fsync+0x10/0x14
      [ 3278.756236]  [<ffffffff81435b32>] system_call_fastpath+0x12/0x17
      [ 3278.757366] ---[ end trace 9a4d4df4969709aa ]---
      
      Fixes: 1bbc621e ("Btrfs: allow block group cache writeout
                            outside critical section in commit")
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      ff1f8250
    • F
      Btrfs: fix panic when starting bg cache writeout after IO error · 28aeeac1
      Filipe Manana committed
      When waiting for the writeback of block group cache we returned
      immediately if there was an error during writeback without waiting
      for the ordered extent to complete. This left a short time window
      where if some other task attempts to start the writeout for the same
      block group cache it can attempt to add a new ordered extent, starting
      at the same offset (0) before the previous one is removed from the
      ordered tree, causing an ordered tree panic (calls BUG()).
      
      This normally doesn't happen in other write paths, such as buffered
      writes or direct IO writes for regular files, since before marking
      page ranges dirty we lock the ranges and wait for any ordered extents
      within the range to complete first.
      
      Fix this by making btrfs_wait_ordered_range() not return immediately
      if it gets an error from the writeback, waiting for all ordered extents
      to complete first.
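
A minimal user-space sketch of that behavior, assuming a simplified model: `struct ordered_model` and `wait_all_ordered()` are hypothetical, and setting `completed` stands in for blocking until an ordered extent finishes.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical ordered extent: only what the sketch needs. */
struct ordered_model {
	int io_error;   /* 0 or a negative errno recorded by the I/O */
	int completed;  /* set once the extent has been waited on */
};

/*
 * Wait for every ordered extent in the range even if writeback already
 * failed. The first error seen is remembered and returned at the end
 * instead of causing an early return.
 */
static int wait_all_ordered(struct ordered_model *oe, size_t n, int writeback_ret)
{
	int ret = writeback_ret;

	for (size_t i = 0; i < n; i++) {
		oe[i].completed = 1;  /* stands in for waiting on completion */
		if (oe[i].io_error && !ret)
			ret = oe[i].io_error;
	}
	return ret;
}
```

Because no extent is left behind in the ordered tree, a later writeout of the same cache cannot collide with a stale entry at offset 0.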
      
      This issue happened often when running the fstest btrfs/088 and it's
      easy to trigger it by running in a loop until the panic happens:
      
        for ((i = 1; i <= 10000; i++)) do ./check btrfs/088 ; done
      
      [17156.862573] BTRFS critical (device sdc): panic in ordered_data_tree_panic:70: Inconsistency in ordered tree at offset 0 (errno=-17 Object already exists)
      [17156.864052] ------------[ cut here ]------------
      [17156.864052] kernel BUG at fs/btrfs/ordered-data.c:70!
      (...)
      [17156.864052] Call Trace:
      [17156.864052]  [<ffffffffa03876e3>] btrfs_add_ordered_extent+0x12/0x14 [btrfs]
      [17156.864052]  [<ffffffffa03787e2>] run_delalloc_nocow+0x5bf/0x747 [btrfs]
      [17156.864052]  [<ffffffffa03789ff>] run_delalloc_range+0x95/0x353 [btrfs]
      [17156.864052]  [<ffffffffa038b7fe>] writepage_delalloc.isra.16+0xb9/0x13f [btrfs]
      [17156.864052]  [<ffffffffa038d75b>] __extent_writepage+0x129/0x1f7 [btrfs]
      [17156.864052]  [<ffffffffa038da5a>] extent_write_cache_pages.isra.15.constprop.28+0x231/0x2f4 [btrfs]
      [17156.864052]  [<ffffffff810ad2af>] ? __module_text_address+0x12/0x59
      [17156.864052]  [<ffffffff8107d33d>] ? trace_hardirqs_on+0xd/0xf
      [17156.864052]  [<ffffffffa038df76>] extent_writepages+0x4b/0x5c [btrfs]
      [17156.864052]  [<ffffffff81144431>] ? kmem_cache_free+0x9b/0xce
      [17156.864052]  [<ffffffffa0376a46>] ? btrfs_submit_direct+0x3fc/0x3fc [btrfs]
      [17156.864052]  [<ffffffffa0389cd6>] ? free_extent_state+0x8c/0xc1 [btrfs]
      [17156.864052]  [<ffffffffa0374871>] btrfs_writepages+0x28/0x2a [btrfs]
      [17156.864052]  [<ffffffff8110c4c8>] do_writepages+0x23/0x2c
      [17156.864052]  [<ffffffff81102f36>] __filemap_fdatawrite_range+0x5a/0x61
      [17156.864052]  [<ffffffff81102f6e>] filemap_fdatawrite_range+0x13/0x15
      [17156.864052]  [<ffffffffa0383ef7>] btrfs_fdatawrite_range+0x21/0x48 [btrfs]
      [17156.864052]  [<ffffffffa03ab89e>] __btrfs_write_out_cache.isra.14+0x2d9/0x3a7 [btrfs]
      [17156.864052]  [<ffffffffa03ac1ab>] ? btrfs_write_out_cache+0x41/0xdc [btrfs]
      [17156.864052]  [<ffffffffa03ac1fd>] btrfs_write_out_cache+0x93/0xdc [btrfs]
      [17156.864052]  [<ffffffffa0363847>] ? btrfs_start_dirty_block_groups+0x13a/0x2b2 [btrfs]
      [17156.864052]  [<ffffffffa03638e6>] btrfs_start_dirty_block_groups+0x1d9/0x2b2 [btrfs]
      [17156.864052]  [<ffffffff8107d33d>] ? trace_hardirqs_on+0xd/0xf
      [17156.864052]  [<ffffffffa037209e>] btrfs_commit_transaction+0x130/0x9c9 [btrfs]
      [17156.864052]  [<ffffffffa034c748>] btrfs_sync_fs+0xe1/0x12d [btrfs]
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      28aeeac1
    • F
      Btrfs: fix crash after inode cache writeback failure · e43699d4
      Filipe Manana committed
      If the writeback of an inode cache failed we were unnecessarily
      attempting to release again the delalloc metadata that we previously
      reserved. However attempting to do this a second time triggers an
      assertion at drop_outstanding_extent() because we have no more
      outstanding extents for our inode cache's inode. If we were able
      to start writeback of the cache the reserved metadata space is
      released at btrfs_finish_ordered_io(), even if an error happens
      during writeback.
      
      So make sure we don't repeat the metadata space release if writeback
      started for our inode cache.
      
      This issue was trivial to reproduce by running the fstest btrfs/088
      with "-o inode_cache", which triggered the assertion leading to a
      BUG() call and requiring a reboot in order to run the remaining
      fstests. Trace produced by btrfs/088:
      
      [255289.385904] BTRFS: assertion failed: BTRFS_I(inode)->outstanding_extents >= num_extents, file: fs/btrfs/extent-tree.c, line: 5276
      [255289.388094] ------------[ cut here ]------------
      [255289.389184] kernel BUG at fs/btrfs/ctree.h:4057!
      [255289.390125] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
      (...)
      [255289.392068] Call Trace:
      [255289.392068]  [<ffffffffa035e774>] drop_outstanding_extent+0x3d/0x6d [btrfs]
      [255289.392068]  [<ffffffffa0364988>] btrfs_delalloc_release_metadata+0x54/0xe3 [btrfs]
      [255289.392068]  [<ffffffffa03b4174>] btrfs_write_out_ino_cache+0x95/0xad [btrfs]
      [255289.392068]  [<ffffffffa036f5c4>] btrfs_save_ino_cache+0x275/0x2dc [btrfs]
      [255289.392068]  [<ffffffffa03e2d83>] commit_fs_roots.isra.12+0xaa/0x137 [btrfs]
      [255289.392068]  [<ffffffff8107d33d>] ? trace_hardirqs_on+0xd/0xf
      [255289.392068]  [<ffffffffa037841f>] ? btrfs_commit_transaction+0x4b1/0x9c9 [btrfs]
      [255289.392068]  [<ffffffff814351a4>] ? _raw_spin_unlock+0x32/0x46
      [255289.392068]  [<ffffffffa037842e>] btrfs_commit_transaction+0x4c0/0x9c9 [btrfs]
      (...)
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      e43699d4
  5. 07 May, 2015 1 commit
  6. 30 Apr, 2015 1 commit
  7. 26 Apr, 2015 9 commits
    • Y
      Btrfs: fill ->last_trans for delayed inode in btrfs_fill_inode. · 6e17d30b
      Yang Dongsheng committed
      We need to fill the inode when we find a node for it in delayed_nodes_tree.
      However, we currently do not fill ->last_trans, which causes the xfstests
      test generic/311 to fail. The scenario of 311 is shown below:
      
      Problem:
      	(1). test_fd = open(fname, O_RDWR|O_DIRECT)
      	(2). pwrite(test_fd, buf, 4096, 0)
      	(3). close(test_fd)
      	(4). drop_all_caches()	<-------- "echo 3 > /proc/sys/vm/drop_caches"
      	(5). test_fd = open(fname, O_RDWR|O_DIRECT)
      	(6). fsync(test_fd);
      				<-------- we did not get the correct log entry for the file
      Reason:
      	When we re-open this file in (5), we find a node in
      delayed_nodes_tree and fill the inode we are looking up with its
      information. But ->last_trans is not filled, so fsync() checks
      ->last_trans, finds that it is 0 and concludes that this inode
      is already in a committed tree, so it does not record the extents
      for it.
      
      Fix:
      	This patch fills ->last_trans properly and sets the
      runtime_flags if needed in this situation. Then we get the
      log entries we expect after (6) and generic/311 passes.
      Signed-off-by: Dongsheng Yang <yangds.fnst@cn.fujitsu.com>
      Reviewed-by: Miao Xie <miaoxie@huawei.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      6e17d30b
    • O
      btrfs: unlock i_mutex after attempting to delete subvolume during send · 909e26dc
      Omar Sandoval committed
      Whenever the check for a send in progress introduced in commit
      521e0546 (btrfs: protect snapshots from deleting during send) is
      hit, we return without unlocking inode->i_mutex. This is easy to see
      with lockdep enabled:
      
      [  +0.000059] ================================================
      [  +0.000028] [ BUG: lock held when returning to user space! ]
      [  +0.000029] 4.0.0-rc5-00096-g3c435c1e #93 Not tainted
      [  +0.000026] ------------------------------------------------
      [  +0.000029] btrfs/211 is leaving the kernel with locks still held!
      [  +0.000029] 1 lock held by btrfs/211:
      [  +0.000023]  #0:  (&type->i_mutex_dir_key){+.+.+.}, at: [<ffffffff8135b8df>] btrfs_ioctl_snap_destroy+0x2df/0x7a0
      
      Make sure we unlock it in the error path.
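
The usual shape of such a fix is a single unlock label that every exit path goes through. The sketch below models that pattern in user space; `struct dir_model` and `snap_destroy_model()` are hypothetical, and the `-EPERM` return value is an assumption of this sketch rather than something stated in the commit message.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for the directory inode and its i_mutex. */
struct dir_model {
	bool locked;
};

/*
 * Every path taken after the lock is acquired leaves through
 * out_unlock, so the early "send in progress" exit cannot return
 * with the lock still held.
 */
static int snap_destroy_model(struct dir_model *dir, bool send_in_progress)
{
	int err = 0;

	dir->locked = true;            /* models mutex_lock(&inode->i_mutex) */

	if (send_in_progress) {
		err = -EPERM;          /* assumed error code for this sketch */
		goto out_unlock;
	}

	/* The actual subvolume deletion would happen here. */

out_unlock:
	dir->locked = false;           /* models mutex_unlock(&inode->i_mutex) */
	return err;
}
```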
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.cz>
      Cc: stable@vger.kernel.org
      Signed-off-by: Omar Sandoval <osandov@osandov.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      909e26dc
    • O
      btrfs: check io_ctl_prepare_pages return in __btrfs_write_out_cache · b8605454
      Omar Sandoval committed
      If io_ctl_prepare_pages fails, the pages in io_ctl.pages are not valid.
      When we try to access them later, things will blow up in various ways.
      
      Also fix the comment about the return value, which is an errno on error,
      not -1, and update the cases where it was not.
      Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
      Signed-off-by: Omar Sandoval <osandov@osandov.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      b8605454
    • O
      btrfs: fix race on ENOMEM in alloc_extent_buffer · 5ca64f45
      Omar Sandoval committed
      Consider the following interleaving of overlapping calls to
      alloc_extent_buffer:
      
      Call 1:
      
      - Successfully allocates a few pages with find_or_create_page
      - find_or_create_page fails, goto free_eb
      - Unlocks the allocated pages
      
      Call 2:
      - Calls find_or_create_page and gets a page in call 1's extent_buffer
      - Finds that the page is already associated with an extent_buffer
      - Grabs a reference to the half-written extent_buffer and calls
        mark_extent_buffer_accessed on it
      
      mark_extent_buffer_accessed will then try to call mark_page_accessed on
      a null page and panic.
      
      The fix is to decrement the reference count on the half-written
      extent_buffer before unlocking the pages so call 2 won't use it. We
      should also set exists = NULL in the case that we don't use exists to
      avoid accidentally returning a freed extent_buffer in an error case.
      Signed-off-by: Omar Sandoval <osandov@osandov.com>
      Reviewed-by: David Sterba <dsterba@suse.cz>
      Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      5ca64f45
    • O
      btrfs: handle ENOMEM in btrfs_alloc_tree_block · 67b7859e
      Omar Sandoval committed
      This is one of the first places to give out when memory is tight. Handle
      it properly rather than with a BUG_ON.
      
      Also fix the comment about the return value, which is an ERR_PTR, not
      NULL, on error.
      Signed-off-by: Omar Sandoval <osandov@osandov.com>
      Reviewed-by: David Sterba <dsterba@suse.cz>
      Signed-off-by: Chris Mason <clm@fb.com>
      67b7859e
    • F
      Btrfs: fix find_free_dev_extent() malfunction in case device tree has hole · 1b984508
      Forrest Liu committed
      If the device tree has a hole, find_free_dev_extent() cannot find an
      available address properly.
      
      The problem can be reproduced with the following script.
      
          mntpath=/btrfs
          loopdev=/dev/loop0
          filepath=/home/forrest/image
      
          umount $mntpath
          losetup -d $loopdev
          truncate --size 100g $filepath
          losetup $loopdev $filepath
          mkfs.btrfs -f $loopdev
          mount $loopdev $mntpath
      
          # make device tree with one big hole
          for i in `seq 1 1 100`; do
              fallocate -l 1g $mntpath/$i
          done
          sync
          for i in `seq 1 1 95`; do
              rm $mntpath/$i
          done
          sync
      
          # wait for the cleaner thread to remove unused block groups
          sleep 300
      
          fallocate -l 1g $mntpath/aaa
      
          # fails to allocate a new chunk
          fallocate -l 1g $mntpath/bbb
      
      The above script creates a device tree with one big hole. Only one chunk
      can then be allocated per transaction, so allocating a new chunk for
      $mntpath/bbb fails.
      
          item 8 key (1 DEV_EXTENT 2185232384) itemoff 15859 itemsize 48
              dev extent chunk_tree 3
              chunk objectid 256 chunk offset 106292051968 length 1073741824
          item 9 key (1 DEV_EXTENT 104190705664) itemoff 15811 itemsize 48
              dev extent chunk_tree 3
              chunk objectid 256 chunk offset 103108575232 length 1073741824
      Signed-off-by: Forrest Liu <forrestl@synology.com>
      Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      1b984508
    • C
      Btrfs: don't check for delalloc_bytes in cache_save_setup · e4c88f00
      Chris Mason committed
      Now that we're doing free space cache writeback outside the critical
      section in the commit, there is a bigger window for delalloc_bytes to
      be added after a cache has been written.  find_free_extent may do this
      without putting the block group back into the dirty list, and also
      without a transaction running.
      
      Checking for delalloc_bytes in cache_save_setup means we might leave the
      cache marked as written without invalidating it.  Consistency checks
      during mount will toss the cache, but it's better to get rid of the
      check in cache_save_setup and let it get invalidated by the checks
      already done during cache write out.
      Signed-off-by: Chris Mason <clm@fb.com>
      e4c88f00
    • F
      Btrfs: fix deadlock when starting writeback of bg caches · 24b89d08
      Filipe Manana committed
      While starting the writes of the dirty block group caches, if we don't
      find a block group item in the extent tree we were leaving without
      releasing our path, running delayed references and then looping again to
      process any new dirty block groups. However this second iteration of the
      loop could cause a deadlock because it tries to lock some other extent
      tree node/leaf that another task has already locked, while that task is
      itself blocked waiting for a lock on some node/leaf in our path that was
      not released before.
      We could also deadlock when running the delayed references, as we could
      end up trying to lock the same nodes/leaves that we have in our local path
      (with a different lock type).
      
      I hit such a case when running xfstests:
      
      [20892.242791] ------------[ cut here ]------------
      [20892.243776] WARNING: CPU: 0 PID: 13299 at fs/btrfs/super.c:260 __btrfs_abort_transaction+0x52/0x114 [btrfs]()
      [20892.245874] BTRFS: Transaction aborted (error -2)
      (...)
      [20892.269378] Call Trace:
      [20892.269915]  [<ffffffff8142fa46>] dump_stack+0x4f/0x7b
      [20892.271097]  [<ffffffff8108b6a2>] ? console_unlock+0x361/0x3ad
      [20892.272173]  [<ffffffff81045ea5>] warn_slowpath_common+0xa1/0xbb
      [20892.273386]  [<ffffffffa0509a6d>] ? __btrfs_abort_transaction+0x52/0x114 [btrfs]
      [20892.274857]  [<ffffffff81045f05>] warn_slowpath_fmt+0x46/0x48
      [20892.275851]  [<ffffffffa0509a6d>] __btrfs_abort_transaction+0x52/0x114 [btrfs]
      [20892.277341]  [<ffffffffa0515e10>] write_one_cache_group+0x68/0xaf [btrfs]
      [20892.278628]  [<ffffffffa052088a>] btrfs_start_dirty_block_groups+0x18d/0x29b [btrfs]
      [20892.280191]  [<ffffffffa052f077>] btrfs_commit_transaction+0x130/0x9c9 [btrfs]
      (...)
      [20892.291316] ---[ end trace 597f77e664245373 ]---
      [20892.293955] BTRFS: error (device sdg) in write_one_cache_group:3184: errno=-2 No such entry
      [20892.297390] BTRFS info (device sdg): forced readonly
      [20892.298222] ------------[ cut here ]------------
      [20892.299190] WARNING: CPU: 0 PID: 13299 at fs/btrfs/ctree.c:2683 btrfs_search_slot+0x7e/0x7d2 [btrfs]()
      (...)
      [20892.326253] Call Trace:
      [20892.326904]  [<ffffffff8142fa46>] dump_stack+0x4f/0x7b
      [20892.329503]  [<ffffffff8108b6a2>] ? console_unlock+0x361/0x3ad
      [20892.330815]  [<ffffffff81045ea5>] warn_slowpath_common+0xa1/0xbb
      [20892.332556]  [<ffffffffa0510b73>] ? btrfs_search_slot+0x7e/0x7d2 [btrfs]
      [20892.333955]  [<ffffffff81045f62>] warn_slowpath_null+0x1a/0x1c
      [20892.335562]  [<ffffffffa0510b73>] btrfs_search_slot+0x7e/0x7d2 [btrfs]
      [20892.336849]  [<ffffffff8107b024>] ? arch_local_irq_save+0x9/0xc
      [20892.338222]  [<ffffffffa051ad52>] ? cache_save_setup+0x43/0x2a5 [btrfs]
      [20892.339823]  [<ffffffffa051ad66>] ? cache_save_setup+0x57/0x2a5 [btrfs]
      [20892.341275]  [<ffffffff814351a4>] ? _raw_spin_unlock+0x32/0x46
      [20892.342810]  [<ffffffffa0515de7>] write_one_cache_group+0x3f/0xaf [btrfs]
      [20892.344184]  [<ffffffffa052088a>] btrfs_start_dirty_block_groups+0x18d/0x29b [btrfs]
      [20892.347162]  [<ffffffffa052f077>] btrfs_commit_transaction+0x130/0x9c9 [btrfs]
      (...)
      [20892.361015] ---[ end trace 597f77e664245374 ]---
      [21120.688097] INFO: task kworker/u8:17:29854 blocked for more than 120 seconds.
      [21120.689881]       Tainted: G        W       4.0.0-rc5-btrfs-next-9+ #2
      [21120.691384] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      (...)
      [21120.703696] Call Trace:
      [21120.704310]  [<ffffffff8143107e>] schedule+0x74/0x83
      [21120.705490]  [<ffffffffa055f025>] btrfs_tree_lock+0xd7/0x236 [btrfs]
      [21120.706757]  [<ffffffff81075cd6>] ? signal_pending_state+0x31/0x31
      [21120.708156]  [<ffffffffa054ac1e>] lock_extent_buffer_for_io+0x3e/0x194 [btrfs]
      [21120.709892]  [<ffffffffa054bb86>] ? btree_write_cache_pages+0x273/0x385 [btrfs]
      [21120.711605]  [<ffffffffa054bc42>] btree_write_cache_pages+0x32f/0x385 [btrfs]
      [21120.723440]  [<ffffffffa0527552>] btree_writepages+0x23/0x5c [btrfs]
      [21120.724943]  [<ffffffff8110c4c8>] do_writepages+0x23/0x2c
      [21120.726008]  [<ffffffff81176dde>] __writeback_single_inode+0x73/0x2fa
      [21120.727230]  [<ffffffff8117714a>] ? writeback_sb_inodes+0xe5/0x38b
      [21120.728526]  [<ffffffff811771fb>] ? writeback_sb_inodes+0x196/0x38b
      [21120.729701]  [<ffffffff8117726a>] writeback_sb_inodes+0x205/0x38b
      (...)
      [21120.747853] INFO: task btrfs:13282 blocked for more than 120 seconds.
      [21120.749459]       Tainted: G        W       4.0.0-rc5-btrfs-next-9+ #2
      [21120.751137] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      (...)
      [21120.768457] Call Trace:
      [21120.769039]  [<ffffffff8143107e>] schedule+0x74/0x83
      [21120.770107]  [<ffffffffa052f25c>] btrfs_commit_transaction+0x315/0x9c9 [btrfs]
      [21120.771558]  [<ffffffff81075cd6>] ? signal_pending_state+0x31/0x31
      [21120.773659]  [<ffffffffa056fd8c>] prepare_to_relocate+0xcb/0xd2 [btrfs]
      [21120.776257]  [<ffffffffa05741da>] relocate_block_group+0x44/0x4a9 [btrfs]
      [21120.777755]  [<ffffffffa05747a0>] ? btrfs_relocate_block_group+0x161/0x288 [btrfs]
      [21120.779459]  [<ffffffffa05747a8>] btrfs_relocate_block_group+0x169/0x288 [btrfs]
      [21120.781153]  [<ffffffffa0550403>] btrfs_relocate_chunk.isra.29+0x3e/0xa7 [btrfs]
      [21120.783918]  [<ffffffffa05518fd>] btrfs_balance+0xaa4/0xc52 [btrfs]
      [21120.785436]  [<ffffffff8114306e>] ? cpu_cache_get.isra.39+0xe/0x1f
      [21120.786434]  [<ffffffffa0559252>] btrfs_ioctl_balance+0x23f/0x2b0 [btrfs]
      (...)
      [21120.889251] INFO: task fsstress:13288 blocked for more than 120 seconds.
      [21120.890526]       Tainted: G        W       4.0.0-rc5-btrfs-next-9+ #2
      [21120.891773] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      (...)
      [21120.899960] Call Trace:
      [21120.900743]  [<ffffffff8143107e>] schedule+0x74/0x83
      [21120.903004]  [<ffffffffa055f025>] btrfs_tree_lock+0xd7/0x236 [btrfs]
      [21120.904383]  [<ffffffff81075cd6>] ? signal_pending_state+0x31/0x31
      [21120.905608]  [<ffffffffa051125b>] btrfs_search_slot+0x766/0x7d2 [btrfs]
      [21120.906812]  [<ffffffff8114290e>] ? virt_to_head_page+0x9/0x2c
      [21120.907874]  [<ffffffff81144b7f>] ? cache_alloc_debugcheck_after.isra.42+0x16c/0x1cb
      [21120.909551]  [<ffffffffa05124e0>] btrfs_insert_empty_items+0x5d/0xa8 [btrfs]
      [21120.910914]  [<ffffffffa0512585>] btrfs_insert_item+0x5a/0xa5 [btrfs]
      [21120.912181]  [<ffffffffa0520271>] ? btrfs_create_pending_block_groups+0x96/0x130 [btrfs]
      [21120.913784]  [<ffffffffa052028a>] btrfs_create_pending_block_groups+0xaf/0x130 [btrfs]
      [21120.915374]  [<ffffffffa052ffc2>] __btrfs_end_transaction+0x84/0x366 [btrfs]
      [21120.916735]  [<ffffffffa05302b4>] btrfs_end_transaction+0x10/0x12 [btrfs]
      [21120.917996]  [<ffffffffa051ab26>] btrfs_check_data_free_space+0x11f/0x27c [btrfs]
      [21120.919478]  [<ffffffffa051ba25>] btrfs_delalloc_reserve_space+0x1e/0x51 [btrfs]
      [21120.921226]  [<ffffffffa05382f2>] btrfs_truncate_page+0x85/0x2c4 [btrfs]
      [21120.923121]  [<ffffffffa0538572>] btrfs_cont_expand+0x41/0x3ef [btrfs]
      [21120.924449]  [<ffffffffa0541091>] ? btrfs_file_write_iter+0x19a/0x431 [btrfs]
      [21120.926602]  [<ffffffff8107b024>] ? arch_local_irq_save+0x9/0xc
      [21120.927769]  [<ffffffffa0541091>] ? btrfs_file_write_iter+0x19a/0x431 [btrfs]
      [21120.929324]  [<ffffffffa05410a0>] ? btrfs_file_write_iter+0x1a9/0x431 [btrfs]
      [21120.930723]  [<ffffffffa05410d9>] btrfs_file_write_iter+0x1e2/0x431 [btrfs]
      [21120.931897]  [<ffffffff81067d85>] ? get_parent_ip+0xe/0x3e
      [21120.934446]  [<ffffffff811534c3>] new_sync_write+0x7c/0xa0
      [21120.935528]  [<ffffffff81153b58>] vfs_write+0xb2/0x117
      (...)
      
      Fixes: 1bbc621e ("Btrfs: allow block group cache writeout
                            outside critical section in commit")
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      24b89d08
    • F
      Btrfs: fix race between start dirty bg cache writeout and bg deletion · b58d1a9e
      Filipe Manana committed
      While running xfstests I ran into the following:
      
      [20892.242791] ------------[ cut here ]------------
      [20892.243776] WARNING: CPU: 0 PID: 13299 at fs/btrfs/super.c:260 __btrfs_abort_transaction+0x52/0x114 [btrfs]()
      [20892.245874] BTRFS: Transaction aborted (error -2)
      [20892.247329] Modules linked in: btrfs dm_snapshot dm_bufio dm_flakey dm_mod crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop fuse$
      [20892.258488] CPU: 0 PID: 13299 Comm: fsstress Tainted: G        W       4.0.0-rc5-btrfs-next-9+ #2
      [20892.262011] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
      [20892.264738]  0000000000000009 ffff880427f8bc18 ffffffff8142fa46 ffffffff8108b6a2
      [20892.266244]  ffff880427f8bc68 ffff880427f8bc58 ffffffff81045ea5 ffff880427f8bc48
      [20892.267761]  ffffffffa0509a6d 00000000fffffffe ffff8803545d6f40 ffffffffa05a15a0
      [20892.269378] Call Trace:
      [20892.269915]  [<ffffffff8142fa46>] dump_stack+0x4f/0x7b
      [20892.271097]  [<ffffffff8108b6a2>] ? console_unlock+0x361/0x3ad
      [20892.272173]  [<ffffffff81045ea5>] warn_slowpath_common+0xa1/0xbb
      [20892.273386]  [<ffffffffa0509a6d>] ? __btrfs_abort_transaction+0x52/0x114 [btrfs]
      [20892.274857]  [<ffffffff81045f05>] warn_slowpath_fmt+0x46/0x48
      [20892.275851]  [<ffffffffa0509a6d>] __btrfs_abort_transaction+0x52/0x114 [btrfs]
      [20892.277341]  [<ffffffffa0515e10>] write_one_cache_group+0x68/0xaf [btrfs]
      [20892.278628]  [<ffffffffa052088a>] btrfs_start_dirty_block_groups+0x18d/0x29b [btrfs]
      [20892.280191]  [<ffffffffa052f077>] btrfs_commit_transaction+0x130/0x9c9 [btrfs]
      [20892.281781]  [<ffffffff8107d33d>] ? trace_hardirqs_on+0xd/0xf
      [20892.282873]  [<ffffffffa054163b>] btrfs_sync_file+0x313/0x387 [btrfs]
      [20892.284111]  [<ffffffff8117acad>] vfs_fsync_range+0x95/0xa4
      [20892.285203]  [<ffffffff810e603f>] ? time_hardirqs_on+0x15/0x28
      [20892.286290]  [<ffffffff8123960b>] ? trace_hardirqs_on_thunk+0x3a/0x3f
      [20892.287469]  [<ffffffff8117acd8>] vfs_fsync+0x1c/0x1e
      [20892.288412]  [<ffffffff8117ae54>] do_fsync+0x34/0x4e
      [20892.289348]  [<ffffffff8117b07c>] SyS_fsync+0x10/0x14
      [20892.290255]  [<ffffffff81435b32>] system_call_fastpath+0x12/0x17
      [20892.291316] ---[ end trace 597f77e664245373 ]---
      [20892.293955] BTRFS: error (device sdg) in write_one_cache_group:3184: errno=-2 No such entry
      [20892.297390] BTRFS info (device sdg): forced readonly
      
      This happens because in btrfs_start_dirty_block_groups() we splice the
      transaction's list of dirty block groups into a local list and then we
      keep extracting the first element of the list without holding the
      cache_write_mutex mutex. This means that before we acquire that mutex
      the first block group on the list might be removed by a concurrent task
      running btrfs_remove_block_group(). So make sure we extract the first
      element (and test the list emptiness) while holding that mutex.
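
      The locking rule described above can be illustrated with a minimal
      userspace sketch. It is not the btrfs code: struct bg, struct
      dirty_list, pop_first_locked() and remove_locked() are hypothetical
      names, and a pthread mutex stands in for cache_write_mutex. The point
      is only that the emptiness test and the extraction of the first
      element happen inside the same critical section that a concurrent
      remover uses.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-in for a block group sitting on a dirty list. */
struct bg {
    int id;
    struct bg *next;
};

/* Hypothetical list head; the mutex plays the role of cache_write_mutex. */
struct dirty_list {
    pthread_mutex_t lock;
    struct bg *head;
};

/*
 * Test for emptiness and detach the first element while holding the lock,
 * so a concurrent remover cannot take the element away in between.
 * Returns NULL when the list is empty.
 */
static struct bg *pop_first_locked(struct dirty_list *l)
{
    struct bg *first;

    assert(l != NULL);
    pthread_mutex_lock(&l->lock);
    first = l->head;
    if (first != NULL) {
        l->head = first->next;
        first->next = NULL;
    }
    pthread_mutex_unlock(&l->lock);
    return first;
}

/*
 * What a concurrent deleter would do: unlink the victim under the same
 * lock. Returns 1 if the victim was on the list, 0 otherwise.
 */
static int remove_locked(struct dirty_list *l, struct bg *victim)
{
    struct bg **pp;
    int removed = 0;

    assert(l != NULL);
    pthread_mutex_lock(&l->lock);
    for (pp = &l->head; *pp != NULL; pp = &(*pp)->next) {
        if (*pp == victim) {
            *pp = victim->next;
            victim->next = NULL;
            removed = 1;
            break;
        }
    }
    pthread_mutex_unlock(&l->lock);
    return removed;
}
```

      With this shape, a writer loop that keeps calling pop_first_locked()
      until it returns NULL never sees an element that remove_locked() has
      already unlinked.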
      
      Fixes: 1bbc621e ("Btrfs: allow block group cache writeout
                            outside critical section in commit")
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      b58d1a9e
  8. 25 Apr, 2015 2 commits
    • J
      direct-io: only inc/dec inode->i_dio_count for file systems · fe0f07d0
      Jens Axboe committed
      do_blockdev_direct_IO() increments and decrements the inode
      ->i_dio_count for each IO operation. It does this to protect against
      truncate of a file. Block devices don't need this sort of protection.
      
      For a capable multiqueue setup, this atomic int is the only shared
      state between applications accessing the device for O_DIRECT, and it
      presents a scaling wall for that. In my testing, as much as 30% of
      system time is spent incrementing and decrementing this value. A mixed
      read/write workload improved from ~2.5M IOPS to ~9.6M IOPS, with
      better latencies too. Before:
      
      clat percentiles (usec):
       |  1.00th=[   33],  5.00th=[   34], 10.00th=[   34], 20.00th=[   34],
       | 30.00th=[   34], 40.00th=[   34], 50.00th=[   35], 60.00th=[   35],
       | 70.00th=[   35], 80.00th=[   35], 90.00th=[   37], 95.00th=[   80],
       | 99.00th=[   98], 99.50th=[  151], 99.90th=[  155], 99.95th=[  155],
       | 99.99th=[  165]
      
      After:
      
      clat percentiles (usec):
       |  1.00th=[   95],  5.00th=[  108], 10.00th=[  129], 20.00th=[  149],
       | 30.00th=[  155], 40.00th=[  161], 50.00th=[  167], 60.00th=[  171],
       | 70.00th=[  177], 80.00th=[  185], 90.00th=[  201], 95.00th=[  270],
       | 99.00th=[  390], 99.50th=[  398], 99.90th=[  418], 99.95th=[  422],
       | 99.99th=[  438]
      
      In other setups, Robert Elliott reported seeing good performance
      improvements:
      
      https://lkml.org/lkml/2015/4/3/557
      
      The more applications accessing the device, the worse it gets.
      
      Add a new direct-io flag, DIO_SKIP_DIO_COUNT, which tells
      do_blockdev_direct_IO() that it need not worry about incrementing
      or decrementing the inode i_dio_count for this caller.
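
      The idea of a per-caller opt-out can be sketched in userspace C. This
      is an illustration only, not the kernel implementation: struct
      sketch_inode, SKETCH_SKIP_DIO_COUNT (including its value),
      sketch_dio_begin() and sketch_dio_end() are hypothetical names, and a
      C11 atomic stands in for the shared counter.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical flag bit; the value is made up for this sketch. */
#define SKETCH_SKIP_DIO_COUNT 0x08

/* Hypothetical stand-in for an inode with a shared in-flight I/O counter. */
struct sketch_inode {
    atomic_int dio_count;
};

/* Account for a new direct I/O unless the caller opted out. */
static void sketch_dio_begin(struct sketch_inode *inode, int flags)
{
    if (!(flags & SKETCH_SKIP_DIO_COUNT))
        atomic_fetch_add(&inode->dio_count, 1);
}

/* Undo the accounting on completion, under the same condition. */
static void sketch_dio_end(struct sketch_inode *inode, int flags)
{
    if (!(flags & SKETCH_SKIP_DIO_COUNT)) {
        /* A counted completion must match an earlier counted begin. */
        assert(atomic_load(&inode->dio_count) > 0);
        atomic_fetch_sub(&inode->dio_count, 1);
    }
}
```

      A caller that passes the flag on both begin and end never touches the
      shared atomic, which is the contention the message describes.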
      
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: Elliott, Robert (Server Storage) <elliott@hp.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      fe0f07d0
    • C
      Btrfs: prevent list corruption during free space cache processing · a3bdccc4
      Chris Mason committed
      __btrfs_write_out_cache is holding the ctl->tree_lock while it prepares
      a list of bitmaps to record in the free space cache.  It was dropping
      the lock while it worked on other components, which made a window for
      free_bitmap() to free the bitmap struct without removing it from the
      list.
      
      This changes things to hold the lock the whole time, and also makes sure
      we hold the lock during enospc cleanup.
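
      As a rough userspace illustration of "hold the lock the whole time"
      (not the btrfs code), the sketch below builds a temporary list and
      consumes it inside a single critical section. struct sketch_bitmap,
      struct sketch_ctl and sketch_write_out() are hypothetical names, and
      a pthread mutex stands in for ctl->tree_lock.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical bitmap entry: linked into the tree and, temporarily,
 * onto a writeout list. */
struct sketch_bitmap {
    unsigned long bits;
    struct sketch_bitmap *next_in_tree;
    struct sketch_bitmap *next_on_list;
};

/* Hypothetical control structure; the mutex plays the role of tree_lock. */
struct sketch_ctl {
    pthread_mutex_t tree_lock;
    struct sketch_bitmap *entries;
};

/*
 * Collect every entry onto a temporary list and copy it out without
 * dropping the lock in between, so no other thread can free an entry
 * that is still on the list. Returns the number of entries copied.
 */
static int sketch_write_out(struct sketch_ctl *ctl, unsigned long *out, int cap)
{
    struct sketch_bitmap *e, *list = NULL, **tail = &list;
    int n = 0;

    assert(ctl != NULL && out != NULL);
    pthread_mutex_lock(&ctl->tree_lock);

    /* Step 1: build the temporary list. */
    for (e = ctl->entries; e != NULL; e = e->next_in_tree) {
        e->next_on_list = NULL;
        *tail = e;
        tail = &e->next_on_list;
    }

    /* Step 2: consume the list; the lock is still held here. */
    for (e = list; e != NULL && n < cap; e = e->next_on_list)
        out[n++] = e->bits;

    pthread_mutex_unlock(&ctl->tree_lock);
    return n;
}
```
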
      Reported-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      a3bdccc4
  9. 24 Apr, 2015 1 commit
    • C
      Btrfs: fix inode cache writeout · 85db36cf
      Chris Mason committed
      The code to fix stalls during free space cache IO wasn't using
      the correct root when waiting on the IO for inode caches.  This
      is only a problem when the inode cache is enabled with
      
      mount -o inode_cache
      
      This fixes the inode cache writeout to preserve any error values and
      makes sure not to override the root when inode cache writeout is done.
      Reported-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      85db36cf
  10. 16 Apr, 2015 1 commit