1. 03 6月, 2015 17 次提交
    • F
      Btrfs: wake up extent state waiters on unlock through clear_extent_bits · 0f31871f
      Filipe Manana 提交于
      When we clear an extent state's EXTENT_LOCKED bit with clear_extent_bits()
      through free_io_failure(), we weren't waking up any tasks waiting for the
      extent's state EXTENT_LOCKED bit, leading to an hang.
      
      So make sure clear_extent_bits() ends up waking up any waiters if the
      bit EXTENT_LOCKED is supplied by its callers.
      
      Zygo Blaxell was experiencing such hangs at inode eviction time after
      file unlinks. Thanks to him for a set of scripts to reproduce the issue.
      Reported-by: NZygo Blaxell <ce3g8jdj@umail.furryterror.org>
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Reviewed-by: NLiu Bo <bo.li.liu@oracle.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      0f31871f
    • F
      Btrfs: fix chunk allocation regression leading to transaction abort · c152b63e
      Filipe Manana 提交于
      With commit 1b984508 ("Btrfs: fix find_free_dev_extent() malfunction
      in case device tree has hole") introduced in the kernel 4.1 merge window,
      we end up using part of a device hole for which there are already pending
      chunks or pinned chunks. Before that commit we didn't use the hole and
      would just move on to the next hole in the device.
      
      However when we adjust the start offset for the chunk allocation and we
      have pinned chunks, we set it blindly to the end offset of the pinned
      chunk we are currently processing, which is dangerous because we can
      have a pending chunk that has a start offset that matches the end offset
      of our pinned chunk - leading us to a case where we end up getting two
      pending chunks that start at the same physical device offset, which makes
      us later abort the current transaction with -EEXIST when finishing the
      chunk allocation at btrfs_create_pending_block_groups():
      
      [194737.659017] ------------[ cut here ]------------
      [194737.660192] WARNING: CPU: 15 PID: 31111 at fs/btrfs/super.c:260 __btrfs_abort_transaction+0x52/0x106 [btrfs]()
      [194737.662209] BTRFS: Transaction aborted (error -17)
      [194737.663175] Modules linked in: btrfs dm_snapshot dm_bufio dm_flakey dm_mod crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop fuse
      [194737.674015] CPU: 15 PID: 31111 Comm: xfs_io Tainted: G        W       4.0.0-rc5-btrfs-next-9+ #2
      [194737.675986] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
      [194737.682999]  0000000000000009 ffff8800564c7a98 ffffffff8142fa46 ffffffff8108b6a2
      [194737.684540]  ffff8800564c7ae8 ffff8800564c7ad8 ffffffff81045ea5 ffff8800564c7b78
      [194737.686017]  ffffffffa0383aa7 00000000ffffffef ffff88000c7ba000 ffff8801a1f66f40
      [194737.687509] Call Trace:
      [194737.688068]  [<ffffffff8142fa46>] dump_stack+0x4f/0x7b
      [194737.689027]  [<ffffffff8108b6a2>] ? console_unlock+0x361/0x3ad
      [194737.690095]  [<ffffffff81045ea5>] warn_slowpath_common+0xa1/0xbb
      [194737.691198]  [<ffffffffa0383aa7>] ? __btrfs_abort_transaction+0x52/0x106 [btrfs]
      [194737.693789]  [<ffffffff81045f05>] warn_slowpath_fmt+0x46/0x48
      [194737.695065]  [<ffffffffa0383aa7>] __btrfs_abort_transaction+0x52/0x106 [btrfs]
      [194737.696806]  [<ffffffffa039a3bd>] btrfs_create_pending_block_groups+0x101/0x130 [btrfs]
      [194737.698683]  [<ffffffffa03aa433>] __btrfs_end_transaction+0x84/0x366 [btrfs]
      [194737.700329]  [<ffffffffa03aa725>] btrfs_end_transaction+0x10/0x12 [btrfs]
      [194737.701924]  [<ffffffffa0394b51>] btrfs_check_data_free_space+0x11f/0x27c [btrfs]
      [194737.703675]  [<ffffffffa03b8ba4>] __btrfs_buffered_write+0x16a/0x4c8 [btrfs]
      [194737.705417]  [<ffffffffa03bb502>] ? btrfs_file_write_iter+0x19a/0x431 [btrfs]
      [194737.707058]  [<ffffffffa03bb511>] ? btrfs_file_write_iter+0x1a9/0x431 [btrfs]
      [194737.708560]  [<ffffffffa03bb68d>] btrfs_file_write_iter+0x325/0x431 [btrfs]
      [194737.710673]  [<ffffffff81067d85>] ? get_parent_ip+0xe/0x3e
      [194737.712076]  [<ffffffff811534c3>] new_sync_write+0x7c/0xa0
      [194737.713293]  [<ffffffff81153b58>] vfs_write+0xb2/0x117
      [194737.714443]  [<ffffffff81154424>] SyS_pwrite64+0x64/0x82
      [194737.715646]  [<ffffffff81435b32>] system_call_fastpath+0x12/0x17
      [194737.717175] ---[ end trace f2d5dc04e56d7e48 ]---
      [194737.718170] BTRFS: error (device sdc) in btrfs_create_pending_block_groups:9524: errno=-17 Object already exists
      
      The -EEXIST failure comes from btrfs_finish_chunk_alloc(), called by
      btrfs_create_pending_block_groups(), when it attempts to insert a
      duplicated device extent item via btrfs_alloc_dev_extent().
      
      This issue was reproducible with fstests generic/038 running in a loop for
      several hours (it's very hard to hit) and using MOUNT_OPTIONS="-o discard".
      Applying Jeff's recent patch titled "btrfs: add missing discards when
      unpinning extents with -o discard" makes the issue much easier to reproduce
      (usually within 4 to 5 hours), since it pins chunks for longer periods of
      time when an unused block group is deleted by the cleaner kthread.
      
      Fix this by making sure that we never adjust the start offset to a lower
      value than it currently has.
      
      Fixes: 1b984508 ("Btrfs: fix find_free_dev_extent() malfunction in case device tree has hole"
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      c152b63e
    • S
      btrfs: use after free when closing devices · 2037a093
      Sasha Levin 提交于
      __btrfs_close_devices() would call_rcu to free the device, which is racy with
      list_for_each_entry() accessing the memory to retrieve the next device on the
      list.
      Signed-off-by: NSasha Levin <sasha.levin@oracle.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.cz>
      Signed-off-by: NChris Mason <clm@fb.com>
      2037a093
    • D
      btrfs: make root id query unprivileged · 01b810b8
      David Sterba 提交于
      The INO_LOOKUP ioctl can lookup path for a given inode number and is
      thus restricted. As a sideefect it can find the root id of the
      containing subvolume and we're using this int the 'btrfs inspect rootid'
      command.
      
      The restriction is unnecessary in case we set the ioctl args
       args::treeid    = 0
       args::objectid  = 256 (BTRFS_FIRST_FREE_OBJECTID)
      
      Then the path will be empty and the treeid is filled with the root id of
      the inode on which the ioctl is called. This behaviour is unchanged,
      after the root restriction is removed.
      Signed-off-by: NDavid Sterba <dsterba@suse.cz>
      Signed-off-by: NChris Mason <clm@fb.com>
      01b810b8
    • F
      Btrfs: fix block group ->space_info null pointer dereference · 2e6e5183
      Filipe Manana 提交于
      When we create a block group we add it to the rbtree of block groups
      before setting its ->space_info field (while it's NULL). This is
      problematic since other tasks can access the block group from the
      rbtree and attempt to use its ->space_info before it is set by
      btrfs_make_block_group().
      
      This can happen for example when a concurrent fitrim ioctl operation
      is ongoing, which produces a trace like the following when
      CONFIG_DEBUG_PAGEALLOC is set.
      
      [11509.604369] BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
      [11509.606373] IP: [<ffffffff8107d675>] __lock_acquire+0xb4/0xf02
      [11509.608179] PGD 2296a8067 PUD 22f4a2067 PMD 0
      [11509.608179] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
      [11509.608179] Modules linked in: btrfs crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop fuse acpi_cpufreq processor i2c_piix4 psmou
      [11509.608179] CPU: 10 PID: 8538 Comm: fstrim Tainted: G        W       4.0.0-rc5-btrfs-next-9+ #2
      [11509.608179] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
      [11509.608179] task: ffff88009f5c46d0 ti: ffff8801b3edc000 task.ti: ffff8801b3edc000
      [11509.608179] RIP: 0010:[<ffffffff8107d675>]  [<ffffffff8107d675>] __lock_acquire+0xb4/0xf02
      [11509.608179] RSP: 0018:ffff8801b3edf9e8  EFLAGS: 00010002
      [11509.608179] RAX: 0000000000000046 RBX: 0000000000000000 RCX: 0000000000000000
      [11509.608179] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000018
      [11509.608179] RBP: ffff8801b3edfaa8 R08: 0000000000000001 R09: 0000000000000000
      [11509.608179] R10: 0000000000000000 R11: ffff88009f5c4f98 R12: 0000000000000000
      [11509.608179] R13: 0000000000000000 R14: 0000000000000018 R15: ffff88009f5c46d0
      [11509.608179] FS:  00007f280a10e840(0000) GS:ffff88023ed40000(0000) knlGS:0000000000000000
      [11509.608179] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      [11509.608179] CR2: 0000000000000018 CR3: 00000002119bc000 CR4: 00000000000006e0
      [11509.608179] Stack:
      [11509.608179]  0000000000000000 0000000000000000 0000000000000004 0000000000000000
      [11509.608179]  ffff880100000000 ffffffff00000000 0000000000000001 ffffffff00000000
      [11509.608179]  0000000000000001 0000000000000000 ffff880100000000 00000000000006c4
      [11509.608179] Call Trace:
      [11509.608179]  [<ffffffff8107dc57>] ? __lock_acquire+0x696/0xf02
      [11509.608179]  [<ffffffff8107e806>] lock_acquire+0xa5/0x116
      [11509.608179]  [<ffffffffa04cc876>] ? do_trimming+0x51/0x145 [btrfs]
      [11509.608179]  [<ffffffff81434f37>] _raw_spin_lock+0x34/0x44
      [11509.608179]  [<ffffffffa04cc876>] ? do_trimming+0x51/0x145 [btrfs]
      [11509.608179]  [<ffffffffa04cc876>] do_trimming+0x51/0x145 [btrfs]
      [11509.608179]  [<ffffffffa04cde7d>] btrfs_trim_block_group+0x201/0x491 [btrfs]
      [11509.608179]  [<ffffffffa04849e2>] btrfs_trim_fs+0xe0/0x129 [btrfs]
      [11509.608179]  [<ffffffffa04bb80a>] btrfs_ioctl_fitrim+0x138/0x167 [btrfs]
      [11509.608179]  [<ffffffffa04c002f>] btrfs_ioctl+0x50d/0x21e8 [btrfs]
      [11509.608179]  [<ffffffff81123bda>] ? might_fault+0x58/0xb5
      [11509.608179]  [<ffffffff81123bda>] ? might_fault+0x58/0xb5
      [11509.608179]  [<ffffffff81123bda>] ? might_fault+0x58/0xb5
      [11509.608179]  [<ffffffff81158050>] ? cp_new_stat+0x147/0x15e
      [11509.608179]  [<ffffffff81163041>] do_vfs_ioctl+0x3c6/0x479
      [11509.608179]  [<ffffffff81158116>] ? SYSC_newfstat+0x25/0x2e
      [11509.608179]  [<ffffffff81435b54>] ? ret_from_sys_call+0x1d/0x58
      [11509.608179]  [<ffffffff8116b915>] ? __fget_light+0x2d/0x4f
      [11509.608179]  [<ffffffff8116314e>] SyS_ioctl+0x5a/0x7f
      [11509.608179]  [<ffffffff81435b32>] system_call_fastpath+0x12/0x17
      [11509.608179] Code: f4 01 00 0f 85 c0 00 00 00 48 c7 c1 f3 1f 7d 81 48 c7 c2 aa cb 7c 81 be fc 0b 00 00 eb 70 83 3d 61 eb 9c 00 00 0f 84 a5 00 00 00 <49> 81 3e 40 a3 2b 82 b8 00 00 00
      [11509.608179] RIP  [<ffffffff8107d675>] __lock_acquire+0xb4/0xf02
      [11509.608179]  RSP <ffff8801b3edf9e8>
      [11509.608179] CR2: 0000000000000018
      [11509.608179] ---[ end trace 570a5c6769f0e49a ]---
      
      Which corresponds to the following access in fs/btrfs/free-space-cache.c:
      
        static int do_trimming(struct btrfs_block_group_cache *block_group,
                               u64 *total_trimmed, u64 start, u64 bytes,
                               u64 reserved_start, u64 reserved_bytes,
                               struct btrfs_trim_range *trim_entry)
        {
             struct btrfs_space_info *space_info = block_group->space_info;
        (...)
             spin_lock(&space_info->lock);
             ^^^^^ - block_group->space_info is NULL...
      
      Fix this by ensuring the block group's ->space_info is set before adding
      the block group to the rbtree.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      2e6e5183
    • A
      Btrfs: check error before reporting missing device and add uuid · 33b97e43
      Anand Jain 提交于
      Report missing device when add is successful,
      otherwise it would exit as ENOMEM. And add uuid
      to the report.
      Signed-off-by: NAnand Jain <anand.jain@oracle.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.cz>
      Signed-off-by: NChris Mason <clm@fb.com>
      33b97e43
    • Q
      btrfs: Fix superblock csum type check. · 1f6e4b3f
      Qu Wenruo 提交于
      Old csum type check is wrong and can't catch csum_type 1(not supported).
      
      Fix it to avoid hostile 0 division.
      Reported-by: NLukas Lueg <lukas.lueg@gmail.com>
      Signed-off-by: NQu Wenruo <quwenruo@cn.fujitsu.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.cz>
      Signed-off-by: NChris Mason <clm@fb.com>
      1f6e4b3f
    • F
      Btrfs: incremental send, fix clone operations for compressed extents · 619d8c4e
      Filipe Manana 提交于
      Marc reported a problem where the receiving end of an incremental send
      was performing clone operations that failed with -EINVAL. This happened
      because, unlike for uncompressed extents, we were not checking if the
      source clone offset and length, after summing the data offset, falls
      within the source file's boundaries.
      
      So make sure we do such checks when attempting to issue clone operations
      for compressed extents.
      
      Problem reproducible with the following steps:
      
        $ mkfs.btrfs -f /dev/sdb
        $ mount -o compress /dev/sdb /mnt
        $ mkfs.btrfs -f /dev/sdc
        $ mount -o compress /dev/sdc /mnt2
      
        # Create the file with a single extent of 128K. This creates a metadata file
        # extent item with a data start offset of 0 and a logical length of 128K.
        $ xfs_io -f -c "pwrite -S 0xaa 64K 128K" -c "fsync" /mnt/foo
      
        # Now rewrite the range 64K to 112K of our file. This will make the inode's
        # metadata continue to point to the 128K extent we created before, but now
        # with an extent item that points to the extent with a data start offset of
        # 112K and a logical length of 16K.
        # That metadata file extent item is associated with the logical file offset
        # at 176K and covers the logical file range 176K to 192K.
        $ xfs_io -c "pwrite -S 0xbb 64K 112K" -c "fsync" /mnt/foo
      
        # Now rewrite the range 180K to 12K. This will make the inode's metadata
        # continue to point the the 128K extent we created earlier, with a single
        # extent item that points to it with a start offset of 112K and a logical
        # length of 4K.
        # That metadata file extent item is associated with the logical file offset
        # at 176K and covers the logical file range 176K to 180K.
        $ xfs_io -c "pwrite -S 0xcc 180K 12K" -c "fsync" /mnt/foo
      
        $ btrfs subvolume snapshot -r /mnt /mnt/snap1
      
        $ touch /mnt/bar
        # Calls the btrfs clone ioctl.
        $ ~/xfstests/src/cloner -s $((176 * 1024)) -d $((176 * 1024)) \
          -l $((4 * 1024)) /mnt/foo /mnt/bar
      
        $ btrfs subvolume snapshot -r /mnt /mnt/snap2
      
        $ btrfs send /mnt/snap1 | btrfs receive /mnt2
        At subvol /mnt/snap1
        At subvol snap1
      
        $ btrfs send -p /mnt/snap1 /mnt/snap2 | btrfs receive /mnt2
        At subvol /mnt/snap2
        At snapshot snap2
        ERROR: failed to clone extents to bar
        Invalid argument
      
      A test case for fstests follows soon.
      Reported-by: NMarc MERLIN <marc@merlins.org>
      Tested-by: NMarc MERLIN <marc@merlins.org>
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Tested-by: NDavid Sterba <dsterba@suse.cz>
      Tested-by: NJan Alexander Steffens (heftig) <jan.steffens@gmail.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      619d8c4e
    • C
      btrfs: qgroup: Fix possible leak in btrfs_add_qgroup_relation() · ab3680dd
      Christian Engelmayer 提交于
      Commit 9c8b35b1 ("btrfs: quota: Automatically update related qgroups or
      mark INCONSISTENT flags when assigning/deleting a qgroup relations.")
      introduced the allocation of a temporary ulist in function
      btrfs_add_qgroup_relation() and added the corresponding cleanup to the out
      path. However, the allocation was introduced before the src/dst level check
      that directly returns. Fix the possible leakage of the ulist by moving the
      allocation after the input validation. Detected by Coverity CID 1295988.
      Signed-off-by: NChristian Engelmayer <cengelma@gmx.at>
      Reviewed-by: NDavid Sterba <dsterba@suse.cz>
      Signed-off-by: NChris Mason <clm@fb.com>
      ab3680dd
    • F
      Btrfs: fix mutex unlock without prior lock on space cache truncation · 35c76642
      Filipe Manana 提交于
      If the call to btrfs_truncate_inode_items() failed and we don't have a block
      group, we were unlocking the cache_write_mutex without having locked it (we
      do it only if we have a block group).
      
      Fixes: 1bbc621e ("Btrfs: allow block group cache writeout
                            outside critical section in commit")
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.cz>
      Signed-off-by: NChris Mason <clm@fb.com>
      35c76642
    • A
      Btrfs: log when missing device is created · 816fcebe
      Anand Jain 提交于
      Signed-off-by: NAnand Jain <anand.jain@oracle.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.cz>
      Signed-off-by: NChris Mason <clm@fb.com>
      816fcebe
    • D
      btrfs: fix warnings after changes in btrfs_abort_transaction · 6d13f549
      David Sterba 提交于
      fs/btrfs/volumes.c: In function ‘btrfs_create_uuid_tree’:
      fs/btrfs/volumes.c:3909:3: warning: format ‘%d’ expects argument of type ‘int’, but argument 4 has type ‘long int’ [-Wformat=]
         btrfs_abort_transaction(trans, tree_root,
         ^
        CC [M]  fs/btrfs/ioctl.o
      fs/btrfs/ioctl.c: In function ‘create_subvol’:
      fs/btrfs/ioctl.c:549:3: warning: format ‘%d’ expects argument of type ‘int’, but argument 4 has type ‘long int’ [-Wformat=]
         btrfs_abort_transaction(trans, root, PTR_ERR(new_root));
      
      PTR_ERR returns long, but we're really using 'int' for the error codes
      everywhere so just set and use the local variable.
      Signed-off-by: NDavid Sterba <dsterba@suse.cz>
      Signed-off-by: NChris Mason <clm@fb.com>
      6d13f549
    • D
      btrfs: add 'cold' compiler annotations to all error handling functions · c0d19e2b
      David Sterba 提交于
      The annotated functios will be placed into .text.unlikely section. The
      annotation also hints compiler to move the code out of the hot paths,
      and may implicitly mark if-statement leading to that block as unlikely.
      
      This is a heuristic, the impact on the generated code is not
      significant.
      Signed-off-by: NDavid Sterba <dsterba@suse.cz>
      Signed-off-by: NChris Mason <clm@fb.com>
      c0d19e2b
    • D
      btrfs: report exact callsite where transaction abort occurs · 1a9a8a71
      David Sterba 提交于
      WARN is called from a single location and all bugreports say that's in
      super.c __btrfs_abort_transaction. This is slightly confusing as we'd
      rather want to know the exact callsite. Whereas this information is
      printed in the syslog below the stacktrace, this requires further look
      and we usually see only the headline from WARNING.
      
      Moving the WARN into the macro has to inline some code and increases
      code by a few kilobytes:
      
        text    data     bss     dec     hex filename
      835481   20305   14120  869906   d4612 btrfs.ko.before
      842883   20305   14120  877308   d62fc btrfs.ko.after
      
      The delta is +7k (130+ calls), measured on 3.19 x86_64, distro config.
      The increase is not small and could lead to worse icache use. The code
      is on error/exit paths that can be recognized by compiler as cold and
      moved out of the way so the impact is speculated to be low, if
      measurable at all.
      Signed-off-by: NDavid Sterba <dsterba@suse.cz>
      Signed-off-by: NChris Mason <clm@fb.com>
      1a9a8a71
    • D
      btrfs: let tree defrag work in SSD mode · 13028901
      David Sterba 提交于
      Long time ago (2008) the defrag was automatic for new b-tree writes but
      has been disabled after performance problems. There was a leftover in
      tree-defrag.c that effectively stops any defragmentation on b-trees.
      This is a bit unexpected and IMHO undesired. The SSD mode is an
      optimization and defrag is supposed to work if the users asks for it.
      
      Related commits:
      
      6702ed49
      Btrfs: Add run time btree defrag, and an ioctl to force btree defrag
      
      e18e4809
      Btrfs: Add mount -o ssd, which includes optimizations for seek free
      storage
      
      b3236e68
      Btrfs: Leave on the tree defragger in mount -o ssd, it still helps there
      
      9afbb0b7
      Btrfs: Disable tree defrag in SSD mode
      
      The last three commits switch the defrag+ssd off/on/off and the last one
      
      3f157a2f
      Btrfs: Online btree defragmentation fixes
      
      misses the bits from tree-defrag.c to revert to the behaviour introduced
      in e18e4809.
      Signed-off-by: NDavid Sterba <dsterba@suse.cz>
      Signed-off-by: NChris Mason <clm@fb.com>
      13028901
    • F
      Btrfs: check pending chunks when shrinking fs to avoid corruption · 53e489bc
      Filipe Manana 提交于
      When we shrink the usable size of a device (its total_bytes), we go over
      all the device extent items in the device tree and attempt to relocate
      the chunk of any device extent that goes beyond the new usable size for
      the device. We do that after setting the new usable size (total_bytes) in
      the device object, so that all new allocations (and reallocations) don't
      use areas of the device that go beyond the new (shorter) size. However we
      were not considering that before setting the new size in the device,
      pending chunks might have been created that use device extents that go
      beyond the new size, and those device extents are not yet in the device
      tree after we search the device tree - they are still attached to the
      list of new block group for some ongoing transaction handle, and they are
      only added to the device tree when the transaction handle is ended (via
      btrfs_create_pending_block_groups()).
      
      So check for pending chunks with device extents that go beyond the new
      size and if any exists, commit the current transaction and repeat the
      search in the device tree.
      
      Not doing this it would mean we would return success to user space while
      still having extents that go beyond the new size, and later user space
      could override those locations on the device while the fs still references
      them, causing all sorts of corruption and unexpected events.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      53e489bc
    • O
      Btrfs: don't invalidate root dentry when subvolume deletion fails · 64ad6c48
      Omar Sandoval 提交于
      Since commit bafc9b75 ("vfs: More precise tests in d_invalidate"),
      mounted subvolumes can be deleted because d_invalidate() won't fail.
      However, we run into problems when we attempt to delete the default
      subvolume while it is mounted as the root filesystem:
      
      	# btrfs subvol list /
      	ID 257 gen 306 top level 5 path rootvol
      	ID 267 gen 334 top level 5 path snap1
      	# btrfs subvol get-default /
      	ID 267 gen 334 top level 5 path snap1
      	# btrfs inspect-internal rootid /
      	267
      	# mount -o subvol=/ /dev/vda1 /mnt
      	# btrfs subvol del /mnt/snap1
      	Delete subvolume (no-commit): '/mnt/snap1'
      	ERROR: cannot delete '/mnt/snap1' - Operation not permitted
      	# findmnt /
      	findmnt: can't read /proc/mounts: No such file or directory
      	# ls /proc
      	#
      
      Markus reported that this same scenario simply led to a kernel oops.
      
      This happens because in btrfs_ioctl_snap_destroy(), we call
      d_invalidate() before we check may_destroy_subvol(), which means that we
      detach the submounts and drop the dentry before erroring out. Instead,
      we should only invalidate the dentry once the deletion has succeeded.
      Additionally, the shrink_dcache_sb() isn't necessary; d_invalidate()
      will prune the dcache for the deleted subvolume.
      
      Cc: <stable@vger.kernel.org>
      Fixes: bafc9b75 ("vfs: More precise tests in d_invalidate")
      Reported-by: NMarkus Schauler <mschauler@gmail.com>
      Signed-off-by: NOmar Sandoval <osandov@osandov.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      64ad6c48
  2. 29 5月, 2015 12 次提交
    • A
      d_walk() might skip too much · 2159184e
      Al Viro 提交于
      when we find that a child has died while we'd been trying to ascend,
      we should go into the first live sibling itself, rather than its sibling.
      
      Off-by-one in question had been introduced in "deal with deadlock in
      d_walk()" and the fix needs to be backported to all branches this one
      has been backported to.
      
      Cc: stable@vger.kernel.org # 3.2 and later
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      2159184e
    • B
      omfs: fix potential integer overflow in allocator · 5a6b2b36
      Bob Copeland 提交于
      Both 'i' and 'bits_per_entry' are signed integers but the result is a
      u64 block number.  Cast i to u64 to avoid truncation on 32-bit targets.
      
      Found by Coverity (CID 200679).
      Signed-off-by: NBob Copeland <me@bobcopeland.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5a6b2b36
    • B
      omfs: fix sign confusion for bitmap loop counter · c0345ee5
      Bob Copeland 提交于
      The count variable is used to iterate down to (below) zero from the size
      of the bitmap and handle the one-filling the remainder of the last
      partial bitmap block.  The loop conditional expects count to be signed
      in order to detect when the final block is processed, after which count
      goes negative.
      
      Unfortunately, a recent change made this unsigned along with some other
      related fields.  The result of is this is that during mount,
      omfs_get_imap will overrun the bitmap array and corrupt memory unless
      number of blocks happens to be a multiple of 8 * blocksize.
      
      Fix by changing count back to signed: it is guaranteed to fit in an s32
      without overflow due to an enforced limit on the number of blocks in the
      filesystem.
      Signed-off-by: NBob Copeland <me@bobcopeland.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c0345ee5
    • B
      omfs: set error return when d_make_root() fails · 3a281f94
      Bob Copeland 提交于
      A static checker found the following issue in the error path for
      omfs_fill_super:
      
          fs/omfs/inode.c:552 omfs_fill_super()
          warn: missing error code here? 'd_make_root()' failed. 'ret' = '0'
      
      Fix by returning -ENOMEM in this case.
      Signed-off-by: NBob Copeland <me@bobcopeland.com>
      Reported-by: NDan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3a281f94
    • S
      fs, omfs: add NULL terminator in the end up the token list · dcbff39d
      Sasha Levin 提交于
      match_token() expects a NULL terminator at the end of the token list so
      that it would know where to stop.  Not having one causes it to overrun
      to invalid memory.
      
      In practice, passing a mount option that omfs didn't recognize would
      sometimes panic the system.
      Signed-off-by: NSasha Levin <sasha.levin@oracle.com>
      Signed-off-by: NBob Copeland <me@bobcopeland.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      dcbff39d
    • A
      fs/binfmt_elf.c:load_elf_binary(): return -EINVAL on zero-length mappings · 2b1d3ae9
      Andrew Morton 提交于
      load_elf_binary() returns `retval', not `error'.
      
      Fixes: a87938b2 ("fs/binfmt_elf.c: fix bug in loading of PIE binaries")
      Reported-by: NJames Hogan <james.hogan@imgtec.com>
      Cc: Michael Davidson <md@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2b1d3ae9
    • B
      xfs: fix broken i_nlink accounting for whiteout tmpfile inode · 22419ac9
      Brian Foster 提交于
      XFS uses the internal tmpfile() infrastructure for the whiteout inode
      used for RENAME_WHITEOUT operations. For tmpfile inodes, XFS allocates
      the inode, drops di_nlink, adds the inode to the agi unlinked list,
      calls d_tmpfile() which correspondingly drops i_nlink of the vfs inode,
      and then finishes the common inode setup (e.g., clear I_NEW and unlock).
      
      The d_tmpfile() call was originally made inxfs_create_tmpfile(), but was
      pulled up out of that function as part of the following commit to
      resolve a deadlock issue:
      
      	330033d6 xfs: fix tmpfile/selinux deadlock and initialize security
      
      As a result, callers of xfs_create_tmpfile() are responsible for either
      calling d_tmpfile() or fixing up i_nlink appropriately. The whiteout
      tmpfile allocation helper does neither. As a result, the vfs ->i_nlink
      becomes inconsistent with the on-disk ->di_nlink once xfs_rename() links
      it back into the source dentry and calls xfs_bumplink().
      
      Update the assert in xfs_rename() to help detect this problem in the
      future and update xfs_rename_alloc_whiteout() to decrement the link
      count as part of the manual tmpfile inode setup.
      Signed-off-by: NBrian Foster <bfoster@redhat.com>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      
      22419ac9
    • D
      xfs: xfs_iozero can return positive errno · cddc1162
      Dave Chinner 提交于
      It was missed when we converted everything in XFs to use negative error
      numbers, so fix it now. Bug introduced in 3.17 by commit 2451337d ("xfs: global
      error sign conversion"), and should go back to stable kernels.
      
      Thanks to Brian Foster for noticing it.
      
      cc: <stable@vger.kernel.org> # 3.17, 3.18, 3.19, 4.0
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      cddc1162
    • D
      xfs: xfs_attr_inactive leaves inconsistent attr fork state behind · 6dfe5a04
      Dave Chinner 提交于
      xfs_attr_inactive() is supposed to clean up the attribute fork when
      the inode is being freed. While it removes attribute fork extents,
      it completely ignores attributes in local format, which means that
      there can still be active attributes on the inode after
      xfs_attr_inactive() has run.
      
      This leads to problems with concurrent inode writeback - the in-core
      inode attribute fork is removed without locking on the assumption
      that nothing will be attempting to access the attribute fork after a
      call to xfs_attr_inactive() because it isn't supposed to exist on
      disk any more.
      
      To fix this, make xfs_attr_inactive() completely remove all traces
      of the attribute fork from the inode, regardless of it's state.
      Further, also remove the in-core attribute fork structure safely so
      that there is nothing further that needs to be done by callers to
      clean up the attribute fork. This means we can remove the in-core
      and on-disk attribute forks atomically.
      
      Also, on error simply remove the in-memory attribute fork. There's
      nothing that can be done with it once we have failed to remove the
      on-disk attribute fork, so we may as well just blow it away here
      anyway.
      
      cc: <stable@vger.kernel.org> # 3.12 to 4.0
      Reported-by: NWaiman Long <waiman.long@hp.com>
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      
      6dfe5a04
    • D
      xfs: extent size hints can round up extents past MAXEXTLEN · 6dea405e
      Dave Chinner 提交于
      This results in BMBT corruption, as seen by this test:
      
      # mkfs.xfs -f -d size=40051712b,agcount=4 /dev/vdc
      ....
      # mount /dev/vdc /mnt/scratch
      # xfs_io -ft -c "extsize 16m" -c "falloc 0 30g" -c "bmap -vp" /mnt/scratch/foo
      
      which results in this failure on a debug kernel:
      
      XFS: Assertion failed: (blockcount & xfs_mask64hi(64-BMBT_BLOCKCOUNT_BITLEN)) == 0, file: fs/xfs/libxfs/xfs_bmap_btree.c, line: 211
      ....
      Call Trace:
       [<ffffffff814cf0ff>] xfs_bmbt_set_allf+0x8f/0x100
       [<ffffffff814cf18d>] xfs_bmbt_set_all+0x1d/0x20
       [<ffffffff814f2efe>] xfs_iext_insert+0x9e/0x120
       [<ffffffff814c7956>] ? xfs_bmap_add_extent_hole_real+0x1c6/0xc70
       [<ffffffff814c7956>] xfs_bmap_add_extent_hole_real+0x1c6/0xc70
       [<ffffffff814caaab>] xfs_bmapi_write+0x72b/0xed0
       [<ffffffff811c72ac>] ? kmem_cache_alloc+0x15c/0x170
       [<ffffffff814fe070>] xfs_alloc_file_space+0x160/0x400
       [<ffffffff81ddcc29>] ? down_write+0x29/0x60
       [<ffffffff815063eb>] xfs_file_fallocate+0x29b/0x310
       [<ffffffff811d2bc8>] ? __sb_start_write+0x58/0x120
       [<ffffffff811e3e18>] ? do_vfs_ioctl+0x318/0x570
       [<ffffffff811cd680>] vfs_fallocate+0x140/0x260
       [<ffffffff811ce6f8>] SyS_fallocate+0x48/0x80
       [<ffffffff81ddec09>] system_call_fastpath+0x12/0x17
      
      The tracepoint that indicates the extent that triggered the assert
      failure is:
      
      xfs_iext_insert:   idx 0 offset 0 block 16777224 count 2097152 flag 1
      
      Clearly indicating that the extent length is greater than MAXEXTLEN,
      which is 2097151. A prior trace point shows the allocation was an
      exact size match and that a length greater than MAXEXTLEN was asked
      for:
      
      xfs_alloc_size_done:  agno 1 agbno 8 minlen 2097152 maxlen 2097152
      					    ^^^^^^^        ^^^^^^^
      
      We don't see this problem with extent size hints through the IO path
      because we can't do single IOs large enough to trigger MAXEXTLEN
      allocation. fallocate(), OTOH, is not limited in it's allocation
      sizes and so needs help here.
      
      The issue is that the extent size hint alignment is rounding up the
      extent size past MAXEXTLEN, because xfs_bmapi_write() is not taking
      into account extent size hints when calculating the maximum extent
      length to allocate. xfs_bmapi_reserve_delalloc() is already doing
      this, but direct extent allocation is not.
      
      Unfortunately, the calculation in xfs_bmapi_reserve_delalloc() is
      wrong, and it works only because delayed allocation extents are not
      limited in size to MAXEXTLEN in the in-core extent tree. hence this
      calculation does not work for direct allocation, and the delalloc
      code needs fixing. This may, in fact be the underlying bug that
      occassionally causes transaction overruns in delayed allocation
      extent conversion, so now we know it's wrong we should fix it, too.
      Many thanks to Brian Foster for finding this problem during review
      of this patch.
      
      Hence the fix, after much code reading, is to allow
      xfs_bmap_extsize_align() to align partial extents when full
      alignment would extend the alignment past MAXEXTLEN. We can safely
      do this because all callers have higher layer allocation loops that
      already handle short allocations, and so will simply run another
      allocation to cover the remainder of the requested allocation range
      that we ignored during alignment. The advantage of this approach is
      that it also removes the need for callers to do anything other than
      limit their requests to MAXEXTLEN - they don't really need to be
      aware of extent size hints at all.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      6dea405e
    • D
      xfs: inode and free block counters need to use __percpu_counter_compare · 8c1903d3
      Dave Chinner 提交于
      Because the counters use a custom batch size, the comparison
      functions need to be aware of that batch size otherwise the
      comparison does not work correctly. This leads to ASSERT failures
      on generic/027 like this:
      
       XFS: Assertion failed: 0, file: fs/xfs/xfs_mount.c, line: 1099
       ------------[ cut here ]------------
      ....
       Call Trace:
        [<ffffffff81522a39>] xfs_mod_icount+0x99/0xc0
        [<ffffffff815285cb>] xfs_trans_unreserve_and_mod_sb+0x28b/0x5b0
        [<ffffffff8152f941>] xfs_log_commit_cil+0x321/0x580
        [<ffffffff81528e17>] xfs_trans_commit+0xb7/0x260
        [<ffffffff81503d4d>] xfs_bmap_finish+0xcd/0x1b0
        [<ffffffff8151da41>] xfs_inactive_ifree+0x1e1/0x250
        [<ffffffff8151dbe0>] xfs_inactive+0x130/0x200
        [<ffffffff81523a21>] xfs_fs_evict_inode+0x91/0xf0
        [<ffffffff811f3958>] evict+0xb8/0x190
        [<ffffffff811f433b>] iput+0x18b/0x1f0
        [<ffffffff811e8853>] do_unlinkat+0x1f3/0x320
        [<ffffffff811d548a>] ? filp_close+0x5a/0x80
        [<ffffffff811e999b>] SyS_unlinkat+0x1b/0x40
        [<ffffffff81e0892e>] system_call_fastpath+0x12/0x71
      
      This is a regression introduced by commit 501ab323 ("xfs: use generic
      percpu counters for inode counter").
      
      This patch fixes the same problem for both the inode counter and the
      free block counter in the superblocks.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      8c1903d3
    • G
      xfs: use percpu_counter_read_positive for mp->m_icount · 74f9ce1c
      George Wang 提交于
      Function percpu_counter_read just return the current counter, which can be
      negative. This will cause the checking of "allocated inode
      counts <= m_maxicount" false positive. Use percpu_counter_read_positive can
      solve this problem, and be consistent with the purpose to introduce percpu
      mechanism to xfs.
      Signed-off-by: NGeorge Wang <xuw2015@gmail.com>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      
      74f9ce1c
  3. 21 5月, 2015 6 次提交
  4. 20 5月, 2015 3 次提交
    • S
      [cifs] fix null pointer check · 1dc92c45
      Steve French 提交于
      Dan Carpenter pointed out an inconsistent null pointer check
      in smb2_hdr_assemble that was pointed out by static checker.
      Signed-off-by: NSteve French <smfrench@gmail.com>
      Reviewed-by: NSachin Prabhu <sprabhu@redhat.com>
      CC: Dan Carpenter <dan.carpenter@oracle.com>w
      1dc92c45
    • F
      Btrfs: fix racy system chunk allocation when setting block group ro · a9629596
      Filipe Manana 提交于
      If while setting a block group read-only we end up allocating a system
      chunk, through check_system_chunk(), we were not doing it while holding
      the chunk mutex which is a problem if a concurrent chunk allocation is
      happening, through do_chunk_alloc(), as it means both block groups can
      end up using the same logical addresses and physical regions in the
      device(s). So make sure we hold the chunk mutex.
      
      Cc: stable@vger.kernel.org  # 4.0+
      Fixes: 2f081088 ("btrfs: delete chunk allocation attemp when
                            setting block group ro")
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      a9629596
    • M
      btrfs: clear 'ret' in btrfs_check_shared() loop · 2c2ed5aa
      Mark Fasheh 提交于
      btrfs_check_shared() is leaking a return value of '1' from
      find_parent_nodes(). As a result, callers (in this case, extent_fiemap())
      are told extents are shared when they are not. This in turn broke fiemap on
      btrfs for kernels v3.18 and up.
      
      The fix is simple - we just have to clear 'ret' after we are done processing
      the results of find_parent_nodes().
      
      It wasn't clear to me at first what was happening with return values in
      btrfs_check_shared() and find_parent_nodes() - thanks to Josef for the help
      on irc. I added documentation to both functions to make things more clear
      for the next hacker who might come across them.
      
      If we could queue this up for -stable too that would be great.
      Signed-off-by: NMark Fasheh <mfasheh@suse.de>
      Reviewed-by: NJosef Bacik <jbacik@fb.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      2c2ed5aa
  5. 19 5月, 2015 1 次提交
    • M
      ovl: mount read-only if workdir can't be created · cc6f67bc
      Miklos Szeredi 提交于
      OpenWRT folks reported that overlayfs fails to mount if upper fs is full,
      because workdir can't be created.  Wordir creation can fail for various
      other reasons too.
      
      There's no reason that the mount itself should fail, overlayfs can work
      fine without a workdir, as long as the overlay isn't modified.
      
      So mount it read-only and don't allow remounting read-write.
      
      Add a couple of WARN_ON()s for the impossible case of workdir being used
      despite being read-only.
      
      Reported-by: Bastian Bittorf <bittorf@bluebottle.com> 
      Signed-off-by: NMiklos Szeredi <mszeredi@suse.cz>
      Cc: <stable@vger.kernel.org> # v3.18+
      cc6f67bc
  6. 15 5月, 2015 1 次提交