1. 22 5月, 2019 3 次提交
    • N
      btrfs: Honour FITRIM range constraints during free space trim · 8b13bb91
      Nikolay Borisov 提交于
      commit c2d1b3aae33605a61cbab445d8ae1c708ccd2698 upstream.
      
      Up until now trimming the freespace was done irrespective of what the
      arguments of the FITRIM ioctl were. For example fstrim's -o/-l arguments
      will be entirely ignored. Fix it by correctly handling those paramter.
      This requires breaking if the found freespace extent is after the end of
      the passed range as well as completing trim after trimming
      fstrim_range::len bytes.
      
      Fixes: 499f377f ("btrfs: iterate over unused chunk space in FITRIM")
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      8b13bb91
    • N
      btrfs: Correctly free extent buffer in case btree_read_extent_buffer_pages fails · 87dcf0c6
      Nikolay Borisov 提交于
      commit 537f38f019fa0b762dbb4c0fc95d7fcce9db8e2d upstream.
      
      If a an eb fails to be read for whatever reason - it's corrupted on disk
      and parent transid/key validations fail or IO for eb pages fail then
      this buffer must be removed from the buffer cache. Currently the code
      calls free_extent_buffer if an error occurs. Unfortunately this doesn't
      achieve the desired behavior since btrfs_find_create_tree_block returns
      with eb->refs == 2.
      
      On the other hand free_extent_buffer will only decrement the refs once
      leaving it added to the buffer cache radix tree.  This enables later
      code to look up the buffer from the cache and utilize it potentially
      leading to a crash.
      
      The correct way to free the buffer is call free_extent_buffer_stale.
      This function will correctly call atomic_dec explicitly for the buffer
      and subsequently call release_extent_buffer which will decrement the
      final reference thus correctly remove the invalid buffer from buffer
      cache. This change affects only newly allocated buffers since they have
      eb->refs == 2.
      
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=202755Reported-by: NJungyeon <jungyeon@gatech.edu>
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      87dcf0c6
    • Q
      btrfs: Check the first key and level for cached extent buffer · d8925a1f
      Qu Wenruo 提交于
      commit 448de471cd4cab0cedd15770082567a69a784a11 upstream.
      
      [BUG]
      When reading a file from a fuzzed image, kernel can panic like:
      
        BTRFS warning (device loop0): csum failed root 5 ino 270 off 0 csum 0x98f94189 expected csum 0x00000000 mirror 1
        assertion failed: !memcmp_extent_buffer(b, &disk_key, offsetof(struct btrfs_leaf, items[0].key), sizeof(disk_key)), file: fs/btrfs/ctree.c, line: 2544
        ------------[ cut here ]------------
        kernel BUG at fs/btrfs/ctree.h:3500!
        invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
        RIP: 0010:btrfs_search_slot.cold.24+0x61/0x63 [btrfs]
        Call Trace:
         btrfs_lookup_csum+0x52/0x150 [btrfs]
         __btrfs_lookup_bio_sums+0x209/0x640 [btrfs]
         btrfs_submit_bio_hook+0x103/0x170 [btrfs]
         submit_one_bio+0x59/0x80 [btrfs]
         extent_read_full_page+0x58/0x80 [btrfs]
         generic_file_read_iter+0x2f6/0x9d0
         __vfs_read+0x14d/0x1a0
         vfs_read+0x8d/0x140
         ksys_read+0x52/0xc0
         do_syscall_64+0x60/0x210
         entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      [CAUSE]
      The fuzzed image has a corrupted leaf whose first key doesn't match its
      parent:
      
        checksum tree key (CSUM_TREE ROOT_ITEM 0)
        node 29741056 level 1 items 14 free 107 generation 19 owner CSUM_TREE
        fs uuid 3381d111-94a3-4ac7-8f39-611bbbdab7e6
        chunk uuid 9af1c3c7-2af5-488b-8553-530bd515f14c
        	...
                key (EXTENT_CSUM EXTENT_CSUM 79691776) block 29761536 gen 19
      
        leaf 29761536 items 1 free space 1726 generation 19 owner CSUM_TREE
        leaf 29761536 flags 0x1(WRITTEN) backref revision 1
        fs uuid 3381d111-94a3-4ac7-8f39-611bbbdab7e6
        chunk uuid 9af1c3c7-2af5-488b-8553-530bd515f14c
                item 0 key (EXTENT_CSUM EXTENT_CSUM 8798638964736) itemoff 1751 itemsize 2244
                        range start 8798638964736 end 8798641262592 length 2297856
      
      When reading the above tree block, we have extent_buffer->refs = 2 in
      the context:
      
      - initial one from __alloc_extent_buffer()
        alloc_extent_buffer()
        |- __alloc_extent_buffer()
           |- atomic_set(&eb->refs, 1)
      
      - one being added to fs_info->buffer_radix
        alloc_extent_buffer()
        |- check_buffer_tree_ref()
           |- atomic_inc(&eb->refs)
      
      So if even we call free_extent_buffer() in read_tree_block or other
      similar situation, we only decrease the refs by 1, it doesn't reach 0
      and won't be freed right now.
      
      The staled eb and its corrupted content will still be kept cached.
      
      Furthermore, we have several extra cases where we either don't do first
      key check or the check is not proper for all callers:
      
      - scrub
        We just don't have first key in this context.
      
      - shared tree block
        One tree block can be shared by several snapshot/subvolume trees.
        In that case, the first key check for one subvolume doesn't apply to
        another.
      
      So for the above reasons, a corrupted extent buffer can sneak into the
      buffer cache.
      
      [FIX]
      Call verify_level_key in read_block_for_search to do another
      verification. For that purpose the function is exported.
      
      Due to above reasons, although we can free corrupted extent buffer from
      cache, we still need the check in read_block_for_search(), for scrub and
      shared tree blocks.
      
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=202755
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=202757
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=202759
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=202761
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=202767
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=202769Reported-by: NYoon Jungyeon <jungyeon@gatech.edu>
      CC: stable@vger.kernel.org # 4.19+
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      d8925a1f
  2. 17 4月, 2019 3 次提交
  3. 06 4月, 2019 1 次提交
    • Q
      btrfs: qgroup: Make qgroup async transaction commit more aggressive · dcedd379
      Qu Wenruo 提交于
      [ Upstream commit f5fef4593653dfa2a865c485bb81415de51d5c99 ]
      
      [BUG]
      Btrfs qgroup will still hit EDQUOT under the following case:
      
        $ dev=/dev/test/test
        $ mnt=/mnt/btrfs
        $ umount $mnt &> /dev/null
        $ umount $dev &> /dev/null
      
        $ mkfs.btrfs -f $dev
        $ mount $dev $mnt -o nospace_cache
      
        $ btrfs subv create $mnt/subv
        $ btrfs quota enable $mnt
        $ btrfs quota rescan -w $mnt
        $ btrfs qgroup limit -e 1G $mnt/subv
      
        $ fallocate -l 900M $mnt/subv/padding
        $ sync
      
        $ rm $mnt/subv/padding
      
        # Hit EDQUOT
        $ xfs_io -f -c "pwrite 0 512M" $mnt/subv/real_file
      
      [CAUSE]
      Since commit a514d638 ("btrfs: qgroup: Commit transaction in advance
      to reduce early EDQUOT"), btrfs is not forced to commit transaction to
      reclaim more quota space.
      
      Instead, we just check pertrans metadata reservation against some
      threshold and try to do asynchronously transaction commit.
      
      However in above case, the pertrans metadata reservation is pretty small
      thus it will never trigger asynchronous transaction commit.
      
      [FIX]
      Instead of only accounting pertrans metadata reservation, we calculate
      how much free space we have, and if there isn't much free space left,
      commit transaction asynchronously to try to free some space.
      
      This may slow down the fs when we have less than 32M free qgroup space,
      but should reduce a lot of false EDQUOT, so the cost should be
      acceptable.
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      dcedd379
  4. 03 4月, 2019 6 次提交
    • F
      Btrfs: fix assertion failure on fsync with NO_HOLES enabled · fd1b2536
      Filipe Manana 提交于
      commit 0ccc3876e4b2a1559a4dbe3126dda4459d38a83b upstream.
      
      Back in commit a89ca6f2 ("Btrfs: fix fsync after truncate when
      no_holes feature is enabled") I added an assertion that is triggered when
      an inline extent is found to assert that the length of the (uncompressed)
      data the extent represents is the same as the i_size of the inode, since
      that is true most of the time I couldn't find or didn't remembered about
      any exception at that time. Later on the assertion was expanded twice to
      deal with a case of a compressed inline extent representing a range that
      matches the sector size followed by an expanding truncate, and another
      case where fallocate can update the i_size of the inode without adding
      or updating existing extents (if the fallocate range falls entirely within
      the first block of the file). These two expansion/fixes of the assertion
      were done by commit 7ed586d0a8241 ("Btrfs: fix assertion on fsync of
      regular file when using no-holes feature") and commit 6399fb5a
      ("Btrfs: fix assertion failure during fsync in no-holes mode").
      These however missed the case where an falloc expands the i_size of an
      inode to exactly the sector size and inline extent exists, for example:
      
       $ mkfs.btrfs -f -O no-holes /dev/sdc
       $ mount /dev/sdc /mnt
      
       $ xfs_io -f -c "pwrite -S 0xab 0 1096" /mnt/foobar
       wrote 1096/1096 bytes at offset 0
       1 KiB, 1 ops; 0.0002 sec (4.448 MiB/sec and 4255.3191 ops/sec)
      
       $ xfs_io -c "falloc 1096 3000" /mnt/foobar
       $ xfs_io -c "fsync" /mnt/foobar
       Segmentation fault
      
       $ dmesg
       [701253.602385] assertion failed: len == i_size || (len == fs_info->sectorsize && btrfs_file_extent_compression(leaf, extent) != BTRFS_COMPRESS_NONE) || (len < i_size && i_size < fs_info->sectorsize), file: fs/btrfs/tree-log.c, line: 4727
       [701253.602962] ------------[ cut here ]------------
       [701253.603224] kernel BUG at fs/btrfs/ctree.h:3533!
       [701253.603503] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC PTI
       [701253.603774] CPU: 2 PID: 7192 Comm: xfs_io Tainted: G        W         5.0.0-rc8-btrfs-next-45 #1
       [701253.604054] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626ccb91-prebuilt.qemu-project.org 04/01/2014
       [701253.604650] RIP: 0010:assfail.constprop.23+0x18/0x1a [btrfs]
       (...)
       [701253.605591] RSP: 0018:ffffbb48c186bc48 EFLAGS: 00010286
       [701253.605914] RAX: 00000000000000de RBX: ffff921d0a7afc08 RCX: 0000000000000000
       [701253.606244] RDX: 0000000000000000 RSI: ffff921d36b16868 RDI: ffff921d36b16868
       [701253.606580] RBP: ffffbb48c186bcf0 R08: 0000000000000000 R09: 0000000000000000
       [701253.606913] R10: 0000000000000003 R11: 0000000000000000 R12: ffff921d05d2de18
       [701253.607247] R13: ffff921d03b54000 R14: 0000000000000448 R15: ffff921d059ecf80
       [701253.607769] FS:  00007f14da906700(0000) GS:ffff921d36b00000(0000) knlGS:0000000000000000
       [701253.608163] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       [701253.608516] CR2: 000056087ea9f278 CR3: 00000002268e8001 CR4: 00000000003606e0
       [701253.608880] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
       [701253.609250] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
       [701253.609608] Call Trace:
       [701253.609994]  btrfs_log_inode+0xdfb/0xe40 [btrfs]
       [701253.610383]  btrfs_log_inode_parent+0x2be/0xa60 [btrfs]
       [701253.610770]  ? do_raw_spin_unlock+0x49/0xc0
       [701253.611150]  btrfs_log_dentry_safe+0x4a/0x70 [btrfs]
       [701253.611537]  btrfs_sync_file+0x3b2/0x440 [btrfs]
       [701253.612010]  ? do_sysinfo+0xb0/0xf0
       [701253.612552]  do_fsync+0x38/0x60
       [701253.612988]  __x64_sys_fsync+0x10/0x20
       [701253.613360]  do_syscall_64+0x60/0x1b0
       [701253.613733]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
       [701253.614103] RIP: 0033:0x7f14da4e66d0
       (...)
       [701253.615250] RSP: 002b:00007fffa670fdb8 EFLAGS: 00000246 ORIG_RAX: 000000000000004a
       [701253.615647] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007f14da4e66d0
       [701253.616047] RDX: 000056087ea9c260 RSI: 000056087ea9c260 RDI: 0000000000000003
       [701253.616450] RBP: 0000000000000001 R08: 0000000000000020 R09: 0000000000000010
       [701253.616854] R10: 000000000000009b R11: 0000000000000246 R12: 000056087ea9c260
       [701253.617257] R13: 000056087ea9c240 R14: 0000000000000000 R15: 000056087ea9dd10
       (...)
       [701253.619941] ---[ end trace e088d74f132b6da5 ]---
      
      Updating the assertion again to allow for this particular case would result
      in a meaningless assertion, plus there is currently no risk of logging
      content that would result in any corruption after a log replay if the size
      of the data encoded in an inline extent is greater than the inode's i_size
      (which is not currently possibe either with or without compression),
      therefore just remove the assertion.
      
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      fd1b2536
    • N
      btrfs: Avoid possible qgroup_rsv_size overflow in btrfs_calculate_inode_block_rsv_size · 0ae3b84b
      Nikolay Borisov 提交于
      commit 139a56170de67101791d6e6c8e940c6328393fe9 upstream.
      
      qgroup_rsv_size is calculated as the product of
      outstanding_extent * fs_info->nodesize. The product is calculated with
      32 bit precision since both variables are defined as u32. Yet
      qgroup_rsv_size expects a 64 bit result.
      
      Avoid possible multiplication overflow by casting outstanding_extent to
      u64. Such overflow would in the worst case (64K nodesize) require more
      than 65536 extents, which is quite large and i'ts not likely that it
      would happen in practice.
      
      Fixes-coverity-id: 1435101
      Fixes: ff6bc37e ("btrfs: qgroup: Use independent and accurate per inode qgroup rsv")
      CC: stable@vger.kernel.org # 4.19+
      Reviewed-by: NQu Wenruo <wqu@suse.com>
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      0ae3b84b
    • A
      btrfs: raid56: properly unmap parity page in finish_parity_scrub() · 1cf4ab01
      Andrea Righi 提交于
      commit 3897b6f0a859288c22fb793fad11ec2327e60fcd upstream.
      
      Parity page is incorrectly unmapped in finish_parity_scrub(), triggering
      a reference counter bug on i386, i.e.:
      
       [ 157.662401] kernel BUG at mm/highmem.c:349!
       [ 157.666725] invalid opcode: 0000 [#1] SMP PTI
      
      The reason is that kunmap(p_page) was completely left out, so we never
      did an unmap for the p_page and the loop unmapping the rbio page was
      iterating over the wrong number of stripes: unmapping should be done
      with nr_data instead of rbio->real_stripes.
      
      Test case to reproduce the bug:
      
       - create a raid5 btrfs filesystem:
         # mkfs.btrfs -m raid5 -d raid5 /dev/sdb /dev/sdc /dev/sdd /dev/sde
      
       - mount it:
         # mount /dev/sdb /mnt
      
       - run btrfs scrub in a loop:
         # while :; do btrfs scrub start -BR /mnt; done
      
      BugLink: https://bugs.launchpad.net/bugs/1812845
      Fixes: 5a6ac9ea ("Btrfs, raid56: support parity scrub on raid56")
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: NJohannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: NAndrea Righi <andrea.righi@canonical.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      1cf4ab01
    • D
      btrfs: don't report readahead errors and don't update statistics · d952c337
      David Sterba 提交于
      commit 0cc068e6ee59c1fffbfa977d8bf868b7551d80ac upstream.
      
      As readahead is an optimization, all errors are usually filtered out,
      but still properly handled when the real read call is done. The commit
      5e9d3982 ("btrfs: readpages() should submit IO as read-ahead") added
      REQ_RAHEAD to readpages() because that's only used for readahead
      (despite what one would expect from the callback name).
      
      This causes a flood of messages and inflated read error stats, so skip
      reporting in case it's readahead.
      
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=202403Reported-by: NLimeTech <tomm@lime-technology.com>
      Fixes: 5e9d3982 ("btrfs: readpages() should submit IO as read-ahead")
      CC: stable@vger.kernel.org # 4.19+
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      d952c337
    • J
      btrfs: remove WARN_ON in log_dir_items · b57220cc
      Josef Bacik 提交于
      commit 2cc8334270e281815c3850c3adea363c51f21e0d upstream.
      
      When Filipe added the recursive directory logging stuff in
      2f2ff0ee ("Btrfs: fix metadata inconsistencies after directory
      fsync") he specifically didn't take the directory i_mutex for the
      children directories that we need to log because of lockdep.  This is
      generally fine, but can lead to this WARN_ON() tripping if we happen to
      run delayed deletion's in between our first search and our second search
      of dir_item/dir_indexes for this directory.  We expect this to happen,
      so the WARN_ON() isn't necessary.  Drop the WARN_ON() and add a comment
      so we know why this case can happen.
      
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      b57220cc
    • F
      Btrfs: fix incorrect file size after shrinking truncate and fsync · 22dcb30f
      Filipe Manana 提交于
      commit bf504110bc8aa05df48b0e5f0aa84bfb81e0574b upstream.
      
      If we do a shrinking truncate against an inode which is already present
      in the respective log tree and then rename it, as part of logging the new
      name we end up logging an inode item that reflects the old size of the
      file (the one which we previously logged) and not the new smaller size.
      The decision to preserve the size previously logged was added by commit
      1a4bcf47 ("Btrfs: fix fsync data loss after adding hard link to
      inode") in order to avoid data loss after replaying the log. However that
      decision is only needed for the case the logged inode size is smaller then
      the current size of the inode, as explained in that commit's change log.
      If the current size of the inode is smaller then the previously logged
      size, we know a shrinking truncate happened and therefore need to use
      that smaller size.
      
      Example to trigger the problem:
      
        $ mkfs.btrfs -f /dev/sdb
        $ mount /dev/sdb /mnt
      
        $ xfs_io -f -c "pwrite -S 0xab 0 8000" /mnt/foo
        $ xfs_io -c "fsync" /mnt/foo
        $ xfs_io -c "truncate 3000" /mnt/foo
      
        $ mv /mnt/foo /mnt/bar
        $ xfs_io -c "fsync" /mnt/bar
      
        <power failure>
      
        $ mount /dev/sdb /mnt
        $ od -t x1 -A d /mnt/bar
        0000000 ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab
        *
        0008000
      
      Once we rename the file, we log its name (and inode item), and because
      the inode was already logged before in the current transaction, we log it
      with a size of 8000 bytes because that is the size we previously logged
      (with the first fsync). As part of the rename, besides logging the inode,
      we do also sync the log, which is done since commit d4682ba0
      ("Btrfs: sync log after logging new name"), so the next fsync against our
      inode is effectively a no-op, since no new changes happened since the
      rename operation. Even if did not sync the log during the rename
      operation, the same problem (fize size of 8000 bytes instead of 3000
      bytes) would be visible after replaying the log if the log ended up
      getting synced to disk through some other means, such as for example by
      fsyncing some other modified file. In the example above the fsync after
      the rename operation is there just because not every filesystem may
      guarantee logging/journalling the inode (and syncing the log/journal)
      during the rename operation, for example it is needed for f2fs, but not
      for ext4 and xfs.
      
      Fix this scenario by, when logging a new name (which is triggered by
      rename and link operations), using the current size of the inode instead
      of the previously logged inode size.
      
      A test case for fstests follows soon.
      
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=202695
      CC: stable@vger.kernel.org # 4.4+
      Reported-by: NSeulbae Kim <seulbae@gatech.edu>
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      22dcb30f
  5. 24 3月, 2019 4 次提交
    • F
      Btrfs: fix corruption reading shared and compressed extents after hole punching · 898488e2
      Filipe Manana 提交于
      commit 8e928218780e2f1cf2f5891c7575e8f0b284fcce upstream.
      
      In the past we had data corruption when reading compressed extents that
      are shared within the same file and they are consecutive, this got fixed
      by commit 005efedf ("Btrfs: fix read corruption of compressed and
      shared extents") and by commit 808f80b4 ("Btrfs: update fix for read
      corruption of compressed and shared extents"). However there was a case
      that was missing in those fixes, which is when the shared and compressed
      extents are referenced with a non-zero offset. The following shell script
      creates a reproducer for this issue:
      
        #!/bin/bash
      
        mkfs.btrfs -f /dev/sdc &> /dev/null
        mount -o compress /dev/sdc /mnt/sdc
      
        # Create a file with 3 consecutive compressed extents, each has an
        # uncompressed size of 128Kb and a compressed size of 4Kb.
        for ((i = 1; i <= 3; i++)); do
            head -c 4096 /dev/zero
            for ((j = 1; j <= 31; j++)); do
                head -c 4096 /dev/zero | tr '\0' "\377"
            done
        done > /mnt/sdc/foobar
        sync
      
        echo "Digest after file creation:   $(md5sum /mnt/sdc/foobar)"
      
        # Clone the first extent into offsets 128K and 256K.
        xfs_io -c "reflink /mnt/sdc/foobar 0 128K 128K" /mnt/sdc/foobar
        xfs_io -c "reflink /mnt/sdc/foobar 0 256K 128K" /mnt/sdc/foobar
        sync
      
        echo "Digest after cloning:         $(md5sum /mnt/sdc/foobar)"
      
        # Punch holes into the regions that are already full of zeroes.
        xfs_io -c "fpunch 0 4K" /mnt/sdc/foobar
        xfs_io -c "fpunch 128K 4K" /mnt/sdc/foobar
        xfs_io -c "fpunch 256K 4K" /mnt/sdc/foobar
        sync
      
        echo "Digest after hole punching:   $(md5sum /mnt/sdc/foobar)"
      
        echo "Dropping page cache..."
        sysctl -q vm.drop_caches=1
        echo "Digest after hole punching:   $(md5sum /mnt/sdc/foobar)"
      
        umount /dev/sdc
      
      When running the script we get the following output:
      
        Digest after file creation:   5a0888d80d7ab1fd31c229f83a3bbcc8  /mnt/sdc/foobar
        linked 131072/131072 bytes at offset 131072
        128 KiB, 1 ops; 0.0033 sec (36.960 MiB/sec and 295.6830 ops/sec)
        linked 131072/131072 bytes at offset 262144
        128 KiB, 1 ops; 0.0015 sec (78.567 MiB/sec and 628.5355 ops/sec)
        Digest after cloning:         5a0888d80d7ab1fd31c229f83a3bbcc8  /mnt/sdc/foobar
        Digest after hole punching:   5a0888d80d7ab1fd31c229f83a3bbcc8  /mnt/sdc/foobar
        Dropping page cache...
        Digest after hole punching:   fba694ae8664ed0c2e9ff8937e7f1484  /mnt/sdc/foobar
      
      This happens because after reading all the pages of the extent in the
      range from 128K to 256K for example, we read the hole at offset 256K
      and then when reading the page at offset 260K we don't submit the
      existing bio, which is responsible for filling all the page in the
      range 128K to 256K only, therefore adding the pages from range 260K
      to 384K to the existing bio and submitting it after iterating over the
      entire range. Once the bio completes, the uncompressed data fills only
      the pages in the range 128K to 256K because there's no more data read
      from disk, leaving the pages in the range 260K to 384K unfilled. It is
      just a slightly different variant of what was solved by commit
      005efedf ("Btrfs: fix read corruption of compressed and shared
      extents").
      
      Fix this by forcing a bio submit, during readpages(), whenever we find a
      compressed extent map for a page that is different from the extent map
      for the previous page or has a different starting offset (in case it's
      the same compressed extent), instead of the extent map's original start
      offset.
      
      A test case for fstests follows soon.
      Reported-by: NZygo Blaxell <ce3g8jdj@umail.furryterror.org>
      Fixes: 808f80b4 ("Btrfs: update fix for read corruption of compressed and shared extents")
      Fixes: 005efedf ("Btrfs: fix read corruption of compressed and shared extents")
      Cc: stable@vger.kernel.org # 4.3+
      Tested-by: NZygo Blaxell <ce3g8jdj@umail.furryterror.org>
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      898488e2
    • J
      btrfs: ensure that a DUP or RAID1 block group has exactly two stripes · 1a00f7fd
      Johannes Thumshirn 提交于
      commit 349ae63f40638a28c6fce52e8447c2d14b84cc0c upstream.
      
      We recently had a customer issue with a corrupted filesystem. When
      trying to mount this image btrfs panicked with a division by zero in
      calc_stripe_length().
      
      The corrupt chunk had a 'num_stripes' value of 1. calc_stripe_length()
      takes this value and divides it by the number of copies the RAID profile
      is expected to have to calculate the amount of data stripes. As a DUP
      profile is expected to have 2 copies this division resulted in 1/2 = 0.
      Later then the 'data_stripes' variable is used as a divisor in the
      stripe length calculation which results in a division by 0 and thus a
      kernel panic.
      
      When encountering a filesystem with a DUP block group and a
      'num_stripes' value unequal to 2, refuse mounting as the image is
      corrupted and will lead to unexpected behaviour.
      
      Code inspection showed a RAID1 block group has the same issues.
      
      Fixes: e06cd3dd ("Btrfs: add validadtion checks for chunk loading")
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: NQu Wenruo <wqu@suse.com>
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NJohannes Thumshirn <jthumshirn@suse.de>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      1a00f7fd
    • F
      Btrfs: setup a nofs context for memory allocation at __btrfs_set_acl · 6e24f5a1
      Filipe Manana 提交于
      commit a0873490660246db587849a9e172f2b7b21fa88a upstream.
      
      We are holding a transaction handle when setting an acl, therefore we can
      not allocate the xattr value buffer using GFP_KERNEL, as we could deadlock
      if reclaim is triggered by the allocation, therefore setup a nofs context.
      
      Fixes: 39a27ec1 ("btrfs: use GFP_KERNEL for xattr and acl allocations")
      CC: stable@vger.kernel.org # 4.9+
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      6e24f5a1
    • F
      Btrfs: setup a nofs context for memory allocation at btrfs_create_tree() · 61f92096
      Filipe Manana 提交于
      commit b89f6d1fcb30a8cbdc18ce00c7d93792076af453 upstream.
      
      We are holding a transaction handle when creating a tree, therefore we can
      not allocate the root using GFP_KERNEL, as we could deadlock if reclaim is
      triggered by the allocation, therefore setup a nofs context.
      
      Fixes: 74e4d827 ("btrfs: let callers of btrfs_alloc_root pass gfp flags")
      CC: stable@vger.kernel.org # 4.9+
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      61f92096
  6. 13 2月, 2019 2 次提交
    • E
      btrfs: use tagged writepage to mitigate livelock of snapshot · f5d5b543
      Ethan Lien 提交于
      [ Upstream commit 3cd24c698004d2f7668e0eb9fc1f096f533c791b ]
      
      Snapshot is expected to be fast. But if there are writers steadily
      creating dirty pages in our subvolume, the snapshot may take a very long
      time to complete. To fix the problem, we use tagged writepage for
      snapshot flusher as we do in the generic write_cache_pages(), so we can
      omit pages dirtied after the snapshot command.
      
      This does not change the semantics regarding which data get to the
      snapshot, if there are pages being dirtied during the snapshotting
      operation.  There's a sync called before snapshot is taken in old/new
      case, any IO in flight just after that may be in the snapshot but this
      depends on other system effects that might still sync the IO.
      
      We do a simple snapshot speed test on a Intel D-1531 box:
      
      fio --ioengine=libaio --iodepth=32 --bs=4k --rw=write --size=64G
      --direct=0 --thread=1 --numjobs=1 --time_based --runtime=120
      --filename=/mnt/sub/testfile --name=job1 --group_reporting & sleep 5;
      time btrfs sub snap -r /mnt/sub /mnt/snap; killall fio
      
      original: 1m58sec
      patched:  6.54sec
      
      This is the best case for this patch since for a sequential write case,
      we omit nearly all pages dirtied after the snapshot command.
      
      For a multi writers, random write test:
      
      fio --ioengine=libaio --iodepth=32 --bs=4k --rw=randwrite --size=64G
      --direct=0 --thread=1 --numjobs=4 --time_based --runtime=120
      --filename=/mnt/sub/testfile --name=job1 --group_reporting & sleep 5;
      time btrfs sub snap -r /mnt/sub /mnt/snap; killall fio
      
      original: 15.83sec
      patched:  10.35sec
      
      The improvement is smaller compared to the sequential write case,
      since we omit only half of the pages dirtied after snapshot command.
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NEthan Lien <ethanlien@synology.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      f5d5b543
    • A
      btrfs: harden agaist duplicate fsid on scanned devices · 3733632e
      Anand Jain 提交于
      [ Upstream commit a9261d4125c97ce8624e9941b75dee1b43ad5df9 ]
      
      It's not that impossible to imagine that a device OR a btrfs image is
      copied just by using the dd or the cp command. Which in case both the
      copies of the btrfs will have the same fsid. If on the system with
      automount enabled, the copied FS gets scanned.
      
      We have a known bug in btrfs, that we let the device path be changed
      after the device has been mounted. So using this loop hole the new
      copied device would appears as if its mounted immediately after it's
      been copied.
      
      For example:
      
      Initially.. /dev/mmcblk0p4 is mounted as /
      
        $ lsblk
        NAME        MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
        mmcblk0     179:0    0 29.2G  0 disk
        |-mmcblk0p4 179:4    0    4G  0 part /
        |-mmcblk0p2 179:2    0  500M  0 part /boot
        |-mmcblk0p3 179:3    0  256M  0 part [SWAP]
        `-mmcblk0p1 179:1    0  256M  0 part /boot/efi
      
        $ btrfs fi show
           Label: none  uuid: 07892354-ddaa-4443-90ea-f76a06accaba
           Total devices 1 FS bytes used 1.40GiB
           devid    1 size 4.00GiB used 3.00GiB path /dev/mmcblk0p4
      
      Copy mmcblk0 to sda
      
        $ dd if=/dev/mmcblk0 of=/dev/sda
      
      And immediately after the copy completes the change in the device
      superblock is notified which the automount scans using btrfs device scan
      and the new device sda becomes the mounted root device.
      
        $ lsblk
        NAME        MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
        sda           8:0    1 14.9G  0 disk
        |-sda4        8:4    1    4G  0 part /
        |-sda2        8:2    1  500M  0 part
        |-sda3        8:3    1  256M  0 part
        `-sda1        8:1    1  256M  0 part
        mmcblk0     179:0    0 29.2G  0 disk
        |-mmcblk0p4 179:4    0    4G  0 part
        |-mmcblk0p2 179:2    0  500M  0 part /boot
        |-mmcblk0p3 179:3    0  256M  0 part [SWAP]
        `-mmcblk0p1 179:1    0  256M  0 part /boot/efi
      
        $ btrfs fi show /
          Label: none  uuid: 07892354-ddaa-4443-90ea-f76a06accaba
          Total devices 1 FS bytes used 1.40GiB
          devid    1 size 4.00GiB used 3.00GiB path /dev/sda4
      
      The bug is quite nasty that you can't either unmount /dev/sda4 or
      /dev/mmcblk0p4. And the problem does not get solved until you take sda
      out of the system on to another system to change its fsid using the
      'btrfstune -u' command.
      Signed-off-by: NAnand Jain <anand.jain@oracle.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      3733632e
  7. 07 2月, 2019 2 次提交
    • E
      btrfs: On error always free subvol_name in btrfs_mount · 9ee5987f
      Eric W. Biederman 提交于
      commit 532b618bdf237250d6d4566536d4b6ce3d0a31fe upstream.
      
      The subvol_name is allocated in btrfs_parse_subvol_options and is
      consumed and freed in mount_subvol.  Add a free to the error paths that
      don't call mount_subvol so that it is guaranteed that subvol_name is
      freed when an error happens.
      
      Fixes: 312c89fb ("btrfs: cleanup btrfs_mount() using btrfs_mount_root()")
      Cc: stable@vger.kernel.org # v4.19+
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      9ee5987f
    • F
      Btrfs: fix deadlock when allocating tree block during leaf/node split · 5bce1436
      Filipe Manana 提交于
      commit a6279470762c19ba97e454f90798373dccdf6148 upstream.
      
      When splitting a leaf or node from one of the trees that are modified when
      flushing pending block groups (extent, chunk, device and free space trees),
      we need to allocate a new tree block, which in turn can result in the need
      to allocate a new block group. After allocating the new block group we may
      need to flush new block groups that were previously allocated during the
      course of the current transaction, which is what may cause a deadlock due
      to attempts to write lock twice the same leaf or node, as when splitting
      a leaf or node we are holding a write lock on it and its parent node.
      
      The same type of deadlock can also happen when increasing the tree's
      height, since we are holding a lock on the existing root while allocating
      the tree block to use as the new root node.
      
      An example trace when the deadlock happens during the leaf split path is:
      
        [27175.293054] CPU: 0 PID: 3005 Comm: kworker/u17:6 Tainted: G        W         4.19.16 #1
        [27175.293942] Hardware name: Penguin Computing Relion 1900/MD90-FS0-ZB-XX, BIOS R15 06/25/2018
        [27175.294846] Workqueue: btrfs-extent-refs btrfs_extent_refs_helper [btrfs]
        (...)
        [27175.298384] RSP: 0018:ffffab2087107758 EFLAGS: 00010246
        [27175.299269] RAX: 0000000000000bbd RBX: ffff9fadc7141c48 RCX: 0000000000000001
        [27175.300155] RDX: 0000000000000001 RSI: 0000000000000002 RDI: ffff9fadc7141c48
        [27175.301023] RBP: 0000000000000001 R08: ffff9faeb6ac1040 R09: ffff9fa9c0000000
        [27175.301887] R10: 0000000000000000 R11: 0000000000000040 R12: ffff9fb21aac8000
        [27175.302743] R13: ffff9fb1a64d6a20 R14: 0000000000000001 R15: ffff9fb1a64d6a18
        [27175.303601] FS:  0000000000000000(0000) GS:ffff9fb21fa00000(0000) knlGS:0000000000000000
        [27175.304468] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        [27175.305339] CR2: 00007fdc8743ead8 CR3: 0000000763e0a006 CR4: 00000000003606f0
        [27175.306220] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        [27175.307087] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        [27175.307940] Call Trace:
        [27175.308802]  btrfs_search_slot+0x779/0x9a0 [btrfs]
        [27175.309669]  ? update_space_info+0xba/0xe0 [btrfs]
        [27175.310534]  btrfs_insert_empty_items+0x67/0xc0 [btrfs]
        [27175.311397]  btrfs_insert_item+0x60/0xd0 [btrfs]
        [27175.312253]  btrfs_create_pending_block_groups+0xee/0x210 [btrfs]
        [27175.313116]  do_chunk_alloc+0x25f/0x300 [btrfs]
        [27175.313984]  find_free_extent+0x706/0x10d0 [btrfs]
        [27175.314855]  btrfs_reserve_extent+0x9b/0x1d0 [btrfs]
        [27175.315707]  btrfs_alloc_tree_block+0x100/0x5b0 [btrfs]
        [27175.316548]  split_leaf+0x130/0x610 [btrfs]
        [27175.317390]  btrfs_search_slot+0x94d/0x9a0 [btrfs]
        [27175.318235]  btrfs_insert_empty_items+0x67/0xc0 [btrfs]
        [27175.319087]  alloc_reserved_file_extent+0x84/0x2c0 [btrfs]
        [27175.319938]  __btrfs_run_delayed_refs+0x596/0x1150 [btrfs]
        [27175.320792]  btrfs_run_delayed_refs+0xed/0x1b0 [btrfs]
        [27175.321643]  delayed_ref_async_start+0x81/0x90 [btrfs]
        [27175.322491]  normal_work_helper+0xd0/0x320 [btrfs]
        [27175.323328]  ? move_linked_works+0x6e/0xa0
        [27175.324160]  process_one_work+0x191/0x370
        [27175.324976]  worker_thread+0x4f/0x3b0
        [27175.325763]  kthread+0xf8/0x130
        [27175.326531]  ? rescuer_thread+0x320/0x320
        [27175.327284]  ? kthread_create_worker_on_cpu+0x50/0x50
        [27175.328027]  ret_from_fork+0x35/0x40
        [27175.328741] ---[ end trace 300a1b9f0ac30e26 ]---
      
      Fix this by preventing the flushing of new blocks groups when splitting a
      leaf/node and when inserting a new root node for one of the trees modified
      by the flushing operation, similar to what is done when COWing a node/leaf
      from on of these trees.
      
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=202383Reported-by: NEli V <eliventer@gmail.com>
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      5bce1436
  8. 26 1月, 2019 4 次提交
    • J
      btrfs: improve error handling of btrfs_add_link · 310f8296
      Johannes Thumshirn 提交于
      [ Upstream commit 1690dd41e0cb1dade80850ed8a3eb0121b96d22f ]
      
      In the error handling block, err holds the return value of either
      btrfs_del_root_ref() or btrfs_del_inode_ref() but it hasn't been checked
      since it's introduction with commit fe66a05a (Btrfs: improve error
      handling for btrfs_insert_dir_item callers) in 2012.
      
      If the error handling in the error handling fails, there's not much left
      to do and the abort either happened earlier in the callees or is
      necessary here.
      
      So if one of btrfs_del_root_ref() or btrfs_del_inode_ref() failed, abort
      the transaction, but still return the original code of the failure
      stored in 'ret' as this will be reported to the user.
      Signed-off-by: NJohannes Thumshirn <jthumshirn@suse.de>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      310f8296
    • A
      btrfs: fix use-after-free due to race between replace start and cancel · 38b17eee
      Anand Jain 提交于
      [ Upstream commit d189dd70e2556181732598956d808ea53cc8774e ]
      
      The device replace cancel thread can race with the replace start thread
      and if fs_info::scrubs_running is not yet set, btrfs_scrub_cancel() will
      fail to stop the scrub thread.
      
      The scrub thread continues with the scrub for replace which then will
      try to write to the target device and which is already freed by the
      cancel thread.
      
      scrub_setup_ctx() warns as tgtdev is NULL.
      
        struct scrub_ctx *scrub_setup_ctx(struct btrfs_device *dev, int is_dev_replace)
        {
        ...
      	  if (is_dev_replace) {
      		  WARN_ON(!fs_info->dev_replace.tgtdev);  <===
      		  sctx->pages_per_wr_bio = SCRUB_PAGES_PER_WR_BIO;
      		  sctx->wr_tgtdev = fs_info->dev_replace.tgtdev;
      		  sctx->flush_all_writes = false;
      	  }
      
        [ 6724.497655] BTRFS info (device sdb): dev_replace from /dev/sdb (devid 1) to /dev/sdc started
        [ 6753.945017] BTRFS info (device sdb): dev_replace from /dev/sdb (devid 1) to /dev/sdc canceled
        [ 6852.426700] WARNING: CPU: 0 PID: 4494 at fs/btrfs/scrub.c:622 scrub_setup_ctx.isra.19+0x220/0x230 [btrfs]
        ...
        [ 6852.428928] RIP: 0010:scrub_setup_ctx.isra.19+0x220/0x230 [btrfs]
        ...
        [ 6852.432970] Call Trace:
        [ 6852.433202]  btrfs_scrub_dev+0x19b/0x5c0 [btrfs]
        [ 6852.433471]  btrfs_dev_replace_start+0x48c/0x6a0 [btrfs]
        [ 6852.433800]  btrfs_dev_replace_by_ioctl+0x3a/0x60 [btrfs]
        [ 6852.434097]  btrfs_ioctl+0x2476/0x2d20 [btrfs]
        [ 6852.434365]  ? do_sigaction+0x7d/0x1e0
        [ 6852.434623]  do_vfs_ioctl+0xa9/0x6c0
        [ 6852.434865]  ? syscall_trace_enter+0x1c8/0x310
        [ 6852.435124]  ? syscall_trace_enter+0x1c8/0x310
        [ 6852.435387]  ksys_ioctl+0x60/0x90
        [ 6852.435663]  __x64_sys_ioctl+0x16/0x20
        [ 6852.435907]  do_syscall_64+0x50/0x180
        [ 6852.436150]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      Further, as the replace thread enters scrub_write_page_to_dev_replace()
      without the target device it panics:
      
        static int scrub_add_page_to_wr_bio(struct scrub_ctx *sctx,
      				      struct scrub_page *spage)
        {
        ...
      	bio_set_dev(bio, sbio->dev->bdev); <======
      
        [ 6929.715145] BUG: unable to handle kernel NULL pointer dereference at 00000000000000a0
        ..
        [ 6929.717106] Workqueue: btrfs-scrub btrfs_scrub_helper [btrfs]
        [ 6929.717420] RIP: 0010:scrub_write_page_to_dev_replace+0xb4/0x260
        [btrfs]
        ..
        [ 6929.721430] Call Trace:
        [ 6929.721663]  scrub_write_block_to_dev_replace+0x3f/0x60 [btrfs]
        [ 6929.721975]  scrub_bio_end_io_worker+0x1af/0x490 [btrfs]
        [ 6929.722277]  normal_work_helper+0xf0/0x4c0 [btrfs]
        [ 6929.722552]  process_one_work+0x1f4/0x520
        [ 6929.722805]  ? process_one_work+0x16e/0x520
        [ 6929.723063]  worker_thread+0x46/0x3d0
        [ 6929.723313]  kthread+0xf8/0x130
        [ 6929.723544]  ? process_one_work+0x520/0x520
        [ 6929.723800]  ? kthread_delayed_work_timer_fn+0x80/0x80
        [ 6929.724081]  ret_from_fork+0x3a/0x50
      
      Fix this by letting the btrfs_dev_replace_finishing() to do the job of
      cleaning after the cancel, including freeing of the target device.
      btrfs_dev_replace_finishing() is called when btrfs_scub_dev() returns
      along with the scrub return status.
      Signed-off-by: NAnand Jain <anand.jain@oracle.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      38b17eee
    • H
      btrfs: alloc_chunk: fix more DUP stripe size handling · 720b86a5
      Hans van Kranenburg 提交于
      [ Upstream commit baf92114c7e6dd6124aa3d506e4bc4b694da3bc3 ]
      
      Commit 92e222df "btrfs: alloc_chunk: fix DUP stripe size handling"
      fixed calculating the stripe_size for a new DUP chunk.
      
      However, the same calculation reappears a bit later, and that one was
      not changed yet. The resulting bug that is exposed is that the newly
      allocated device extents ('stripes') can have a few MiB overlap with the
      next thing stored after them, which is another device extent or the end
      of the disk.
      
      The scenario in which this can happen is:
      * The block device for the filesystem is less than 10GiB in size.
      * The amount of contiguous free unallocated disk space chosen to use for
        chunk allocation is 20% of the total device size, or a few MiB more or
        less.
      
      An example:
      - The filesystem device is 7880MiB (max_chunk_size gets set to 788MiB)
      - There's 1578MiB unallocated raw disk space left in one contiguous
        piece.
      
      In this case stripe_size is first calculated as 789MiB, (half of
      1578MiB).
      
      Since 789MiB (stripe_size * data_stripes) > 788MiB (max_chunk_size), we
      enter the if block. Now stripe_size value is immediately overwritten
      while calculating an adjusted value based on max_chunk_size, which ends
      up as 788MiB.
      
      Next, the value is rounded up to a 16MiB boundary, 800MiB, which is
      actually more than the value we had before. However, the last comparison
      fails to detect this, because it's comparing the value with the total
      amount of free space, which is about twice the size of stripe_size.
      
      In the example above, this means that the resulting raw disk space being
      allocated is 1600MiB, while only a gap of 1578MiB has been found. The
      second device extent object for this DUP chunk will overlap for 22MiB
      with whatever comes next.
      
      The underlying problem here is that the stripe_size is reused all the
      time for different things. So, when entering the code in the if block,
      stripe_size is immediately overwritten with something else. If later we
      decide we want to have the previous value back, then the logic to
      compute it was copy pasted in again.
      
      With this change, the value in stripe_size is not unnecessarily
      destroyed, so the duplicated calculation is not needed any more.
      Signed-off-by: NHans van Kranenburg <hans.van.kranenburg@mendix.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      720b86a5
    • Q
      btrfs: volumes: Make sure there is no overlap of dev extents at mount time · bb5717a4
      Qu Wenruo 提交于
      [ Upstream commit 5eb193812a42dc49331f25137a38dfef9612d3e4 ]
      
      Enhance btrfs_verify_dev_extents() to remember previous checked dev
      extents, so it can verify no dev extents can overlap.
      
      Analysis from Hans:
      
      "Imagine allocating a DATA|DUP chunk.
      
       In the chunk allocator, we first set...
         max_stripe_size = SZ_1G;
         max_chunk_size = BTRFS_MAX_DATA_CHUNK_SIZE
       ... which is 10GiB.
      
       Then...
         /* we don't want a chunk larger than 10% of writeable space */
         max_chunk_size = min(div_factor(fs_devices->total_rw_bytes, 1),
             		 max_chunk_size);
      
       Imagine we only have one 7880MiB block device in this filesystem. Now
       max_chunk_size is down to 788MiB.
      
       The next step in the code is to search for max_stripe_size * dev_stripes
       amount of free space on the device, which is in our example 1GiB * 2 =
       2GiB. Imagine the device has exactly 1578MiB free in one contiguous
       piece. This amount of bytes will be put in devices_info[ndevs - 1].max_avail
      
       Next we recalculate the stripe_size (which is actually the device extent
       length), based on the actual maximum amount of available raw disk space:
         stripe_size = div_u64(devices_info[ndevs - 1].max_avail, dev_stripes);
      
       stripe_size is now 789MiB
      
       Next we do...
         data_stripes = num_stripes / ncopies
       ...where data_stripes ends up as 1, because num_stripes is 2 (the amount
       of device extents we're going to have), and DUP has ncopies 2.
      
       Next there's a check...
         if (stripe_size * data_stripes > max_chunk_size)
       ...which matches because 789MiB * 1 > 788MiB.
      
       We go into the if code, and next is...
         stripe_size = div_u64(max_chunk_size, data_stripes);
       ...which resets stripe_size to max_chunk_size: 788MiB
      
       Next is a fun one...
         /* bump the answer up to a 16MB boundary */
         stripe_size = round_up(stripe_size, SZ_16M);
       ...which changes stripe_size from 788MiB to 800MiB.
      
       We're not done changing stripe_size yet...
         /* But don't go higher than the limits we found while searching
          * for free extents
          */
         stripe_size = min(devices_info[ndevs - 1].max_avail,
             	      stripe_size);
      
       This is bad. max_avail is twice the stripe_size (we need to fit 2 device
       extents on the same device for DUP).
      
       The result here is that 800MiB < 1578MiB, so it's unchanged. However,
       the resulting DUP chunk will need 1600MiB disk space, which isn't there,
       and the second dev_extent might extend into the next thing (next
       dev_extent? end of device?) for 22MiB.
      
       The last shown line of code relies on a situation where there's twice
       the value of stripe_size present as value for the variable stripe_size
       when it's DUP. This was actually the case before commit 92e222df
       "btrfs: alloc_chunk: fix DUP stripe size handling", from which I quote:
         "[...] in the meantime there's a check to see if the stripe_size does
       not exceed max_chunk_size. Since during this check stripe_size is twice
       the amount as intended, the check will reduce the stripe_size to
       max_chunk_size if the actual correct to be used stripe_size is more than
       half the amount of max_chunk_size."
      
       In the previous version of the code, the 16MiB alignment (why is this
       done, by the way?) would result in a 50% chance that it would actually
       do an 8MiB alignment for the individual dev_extents, since it was
       operating on double the size. Does this matter?
      
       Does it matter that stripe_size can be set to anything which is not
       16MiB aligned because of the amount of remaining available disk space
       which is just taken?
      
       What is the main purpose of this round_up?
      
       The most straightforward thing to do seems something like...
         stripe_size = min(
             div_u64(devices_info[ndevs - 1].max_avail, dev_stripes),
             stripe_size
         )
       ..just putting half of the max_avail into stripe_size."
      
      Link: https://lore.kernel.org/linux-btrfs/b3461a38-e5f8-f41d-c67c-2efac8129054@mendix.com/Reported-by: NHans van Kranenburg <hans.van.kranenburg@mendix.com>
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      [ add analysis from report ]
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      bb5717a4
  9. 23 1月, 2019 2 次提交
    • J
      btrfs: wait on ordered extents on abort cleanup · 01634ac5
      Josef Bacik 提交于
      commit 74d5d229b1bf60f93bff244b2dfc0eb21ec32a07 upstream.
      
      If we flip read-only before we initiate writeback on all dirty pages for
      ordered extents we've created then we'll have ordered extents left over
      on umount, which results in all sorts of bad things happening.  Fix this
      by making sure we wait on ordered extents if we have to do the aborted
      transaction cleanup stuff.
      
      generic/475 can produce this warning:
      
       [ 8531.177332] WARNING: CPU: 2 PID: 11997 at fs/btrfs/disk-io.c:3856 btrfs_free_fs_root+0x95/0xa0 [btrfs]
       [ 8531.183282] CPU: 2 PID: 11997 Comm: umount Tainted: G        W 5.0.0-rc1-default+ #394
       [ 8531.185164] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),BIOS rel-1.11.2-0-gf9626cc-prebuilt.qemu-project.org 04/01/2014
       [ 8531.187851] RIP: 0010:btrfs_free_fs_root+0x95/0xa0 [btrfs]
       [ 8531.193082] RSP: 0018:ffffb1ab86163d98 EFLAGS: 00010286
       [ 8531.194198] RAX: ffff9f3449494d18 RBX: ffff9f34a2695000 RCX:0000000000000000
       [ 8531.195629] RDX: 0000000000000002 RSI: 0000000000000001 RDI:0000000000000000
       [ 8531.197315] RBP: ffff9f344e930000 R08: 0000000000000001 R09:0000000000000000
       [ 8531.199095] R10: 0000000000000000 R11: ffff9f34494d4ff8 R12:ffffb1ab86163dc0
       [ 8531.200870] R13: ffff9f344e9300b0 R14: ffffb1ab86163db8 R15:0000000000000000
       [ 8531.202707] FS:  00007fc68e949fc0(0000) GS:ffff9f34bd800000(0000)knlGS:0000000000000000
       [ 8531.204851] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       [ 8531.205942] CR2: 00007ffde8114dd8 CR3: 000000002dfbd000 CR4:00000000000006e0
       [ 8531.207516] Call Trace:
       [ 8531.208175]  btrfs_free_fs_roots+0xdb/0x170 [btrfs]
       [ 8531.210209]  ? wait_for_completion+0x5b/0x190
       [ 8531.211303]  close_ctree+0x157/0x350 [btrfs]
       [ 8531.212412]  generic_shutdown_super+0x64/0x100
       [ 8531.213485]  kill_anon_super+0x14/0x30
       [ 8531.214430]  btrfs_kill_super+0x12/0xa0 [btrfs]
       [ 8531.215539]  deactivate_locked_super+0x29/0x60
       [ 8531.216633]  cleanup_mnt+0x3b/0x70
       [ 8531.217497]  task_work_run+0x98/0xc0
       [ 8531.218397]  exit_to_usermode_loop+0x83/0x90
       [ 8531.219324]  do_syscall_64+0x15b/0x180
       [ 8531.220192]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
       [ 8531.221286] RIP: 0033:0x7fc68e5e4d07
       [ 8531.225621] RSP: 002b:00007ffde8116608 EFLAGS: 00000246 ORIG_RAX:00000000000000a6
       [ 8531.227512] RAX: 0000000000000000 RBX: 00005580c2175970 RCX:00007fc68e5e4d07
       [ 8531.229098] RDX: 0000000000000001 RSI: 0000000000000000 RDI:00005580c2175b80
       [ 8531.230730] RBP: 0000000000000000 R08: 00005580c2175ba0 R09:00007ffde8114e80
       [ 8531.232269] R10: 0000000000000000 R11: 0000000000000246 R12:00005580c2175b80
       [ 8531.233839] R13: 00007fc68eac61c4 R14: 00005580c2175a68 R15:0000000000000000
      
      Leaving a tree in the rb-tree:
      
      3853 void btrfs_free_fs_root(struct btrfs_root *root)
      3854 {
      3855         iput(root->ino_cache_inode);
      3856         WARN_ON(!RB_EMPTY_ROOT(&root->inode_tree));
      
      CC: stable@vger.kernel.org
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      [ add stacktrace ]
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      01634ac5
    • D
      Revert "btrfs: balance dirty metadata pages in btrfs_finish_ordered_io" · 4675f90e
      David Sterba 提交于
      commit 77b7aad195099e7c6da11e94b7fa6ef5e6fb0025 upstream.
      
      This reverts commit e73e81b6.
      
      This patch causes a few problems:
      
      - adds latency to btrfs_finish_ordered_io
      - as btrfs_finish_ordered_io is used for free space cache, generating
        more work from btrfs_btree_balance_dirty_nodelay could end up in the
        same workque, effectively deadlocking
      
      12260 kworker/u96:16+btrfs-freespace-write D
      [<0>] balance_dirty_pages+0x6e6/0x7ad
      [<0>] balance_dirty_pages_ratelimited+0x6bb/0xa90
      [<0>] btrfs_finish_ordered_io+0x3da/0x770
      [<0>] normal_work_helper+0x1c5/0x5a0
      [<0>] process_one_work+0x1ee/0x5a0
      [<0>] worker_thread+0x46/0x3d0
      [<0>] kthread+0xf5/0x130
      [<0>] ret_from_fork+0x24/0x30
      [<0>] 0xffffffffffffffff
      
      Transaction commit will wait on the freespace cache:
      
      838 btrfs-transacti D
      [<0>] btrfs_start_ordered_extent+0x154/0x1e0
      [<0>] btrfs_wait_ordered_range+0xbd/0x110
      [<0>] __btrfs_wait_cache_io+0x49/0x1a0
      [<0>] btrfs_write_dirty_block_groups+0x10b/0x3b0
      [<0>] commit_cowonly_roots+0x215/0x2b0
      [<0>] btrfs_commit_transaction+0x37e/0x910
      [<0>] transaction_kthread+0x14d/0x180
      [<0>] kthread+0xf5/0x130
      [<0>] ret_from_fork+0x24/0x30
      [<0>] 0xffffffffffffffff
      
      And then writepages ends up waiting on transaction commit:
      
      9520 kworker/u96:13+flush-btrfs-1 D
      [<0>] wait_current_trans+0xac/0xe0
      [<0>] start_transaction+0x21b/0x4b0
      [<0>] cow_file_range_inline+0x10b/0x6b0
      [<0>] cow_file_range.isra.69+0x329/0x4a0
      [<0>] run_delalloc_range+0x105/0x3c0
      [<0>] writepage_delalloc+0x119/0x180
      [<0>] __extent_writepage+0x10c/0x390
      [<0>] extent_write_cache_pages+0x26f/0x3d0
      [<0>] extent_writepages+0x4f/0x80
      [<0>] do_writepages+0x17/0x60
      [<0>] __writeback_single_inode+0x59/0x690
      [<0>] writeback_sb_inodes+0x291/0x4e0
      [<0>] __writeback_inodes_wb+0x87/0xb0
      [<0>] wb_writeback+0x3bb/0x500
      [<0>] wb_workfn+0x40d/0x610
      [<0>] process_one_work+0x1ee/0x5a0
      [<0>] worker_thread+0x1e0/0x3d0
      [<0>] kthread+0xf5/0x130
      [<0>] ret_from_fork+0x24/0x30
      [<0>] 0xffffffffffffffff
      
      Eventually, we have every process in the system waiting on
      balance_dirty_pages(), and nobody is able to make progress on page
      writeback.
      
      The original patch tried to fix an OOM condition, that happened on 4.4 but no
      success reproducing that on later kernels (4.19 and 4.20). This is more likely
      a problem in OOM itself.
      
      Link: https://lore.kernel.org/linux-btrfs/20180528054821.9092-1-ethanlien@synology.com/Reported-by: NChris Mason <clm@fb.com>
      CC: stable@vger.kernel.org # 4.18+
      CC: ethanlien <ethanlien@synology.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      4675f90e
  10. 17 1月, 2019 4 次提交
    • F
      Btrfs: use nofs context when initializing security xattrs to avoid deadlock · 7a1b9b76
      Filipe Manana 提交于
      commit 827aa18e7b903c5ff3b3cd8fec328a99b1dbd411 upstream.
      
      When initializing the security xattrs, we are holding a transaction handle
      therefore we need to use a GFP_NOFS context in order to avoid a deadlock
      with reclaim in case it's triggered.
      
      Fixes: 39a27ec1 ("btrfs: use GFP_KERNEL for xattr and acl allocations")
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      7a1b9b76
    • F
      Btrfs: fix deadlock when enabling quotas due to concurrent snapshot creation · 79aa5c0d
      Filipe Manana 提交于
      commit 9a6f209e36500efac51528132a3e3083586eda5f upstream.
      
      If the quota enable and snapshot creation ioctls are called concurrently
      we can get into a deadlock where the task enabling quotas will deadlock
      on the fs_info->qgroup_ioctl_lock mutex because it attempts to lock it
      twice, or the task creating a snapshot tries to commit the transaction
      while the task enabling quota waits for the former task to commit the
      transaction while holding the mutex. The following time diagrams show how
      both cases happen.
      
      First scenario:
      
                 CPU 0                                    CPU 1
      
       btrfs_ioctl()
        btrfs_ioctl_quota_ctl()
         btrfs_quota_enable()
          mutex_lock(fs_info->qgroup_ioctl_lock)
          btrfs_start_transaction()
      
                                                   btrfs_ioctl()
                                                    btrfs_ioctl_snap_create_v2
                                                     create_snapshot()
                                                      --> adds snapshot to the
                                                          list pending_snapshots
                                                          of the current
                                                          transaction
      
          btrfs_commit_transaction()
           create_pending_snapshots()
             create_pending_snapshot()
              qgroup_account_snapshot()
               btrfs_qgroup_inherit()
      	   mutex_lock(fs_info->qgroup_ioctl_lock)
      	    --> deadlock, mutex already locked
      	        by this task at
      		btrfs_quota_enable()
      
      Second scenario:
      
                 CPU 0                                    CPU 1
      
       btrfs_ioctl()
        btrfs_ioctl_quota_ctl()
         btrfs_quota_enable()
          mutex_lock(fs_info->qgroup_ioctl_lock)
          btrfs_start_transaction()
      
                                                   btrfs_ioctl()
                                                    btrfs_ioctl_snap_create_v2
                                                     create_snapshot()
                                                      --> adds snapshot to the
                                                          list pending_snapshots
                                                          of the current
                                                          transaction
      
                                                      btrfs_commit_transaction()
                                                       --> waits for task at
                                                           CPU 0 to release
                                                           its transaction
                                                           handle
      
          btrfs_commit_transaction()
           --> sees another task started
               the transaction commit first
           --> releases its transaction
               handle
           --> waits for the transaction
               commit to be completed by
               the task at CPU 1
      
                                                       create_pending_snapshot()
                                                        qgroup_account_snapshot()
                                                         btrfs_qgroup_inherit()
                                                          mutex_lock(fs_info->qgroup_ioctl_lock)
                                                           --> deadlock, task at CPU 0
                                                               has the mutex locked but
                                                               it is waiting for us to
                                                               finish the transaction
                                                               commit
      
      So fix this by setting the quota enabled flag in fs_info after committing
      the transaction at btrfs_quota_enable(). This ends up serializing quota
      enable and snapshot creation as if the snapshot creation happened just
      before the quota enable request. The quota rescan task, scheduled after
      committing the transaction in btrfs_quote_enable(), will do the accounting.
      
      Fixes: 6426c7ad ("btrfs: qgroup: Fix qgroup accounting when creating snapshot")
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      79aa5c0d
    • F
      Btrfs: fix access to available allocation bits when starting balance · 829431a2
      Filipe Manana 提交于
      commit 5a8067c0d17feb7579db0476191417b441a8996e upstream.
      
      The available allocation bits members from struct btrfs_fs_info are
      protected by a sequence lock, and when starting balance we access them
      incorrectly in two different ways:
      
      1) In the read sequence lock loop at btrfs_balance() we use the values we
         read from fs_info->avail_*_alloc_bits and we can immediately do actions
         that have side effects and can not be undone (printing a message and
         jumping to a label). This is wrong because a retry might be needed, so
         our actions must not have side effects and must be repeatable as long
         as read_seqretry() returns a non-zero value. In other words, we were
         essentially ignoring the sequence lock;
      
      2) Right below the read sequence lock loop, we were reading the values
         from avail_metadata_alloc_bits and avail_data_alloc_bits without any
         protection from concurrent writers, that is, reading them outside of
         the read sequence lock critical section.
      
      So fix this by making sure we only read the available allocation bits
      while in a read sequence lock critical section and that what we do in the
      critical section is repeatable (has nothing that can not be undone) so
      that any eventual retry that is needed is handled properly.
      
      Fixes: de98ced9 ("Btrfs: use seqlock to protect fs_info->avail_{data, metadata, system}_alloc_bits")
      Fixes: 14506127 ("btrfs: fix a bogus warning when converting only data or metadata")
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      829431a2
    • F
      Btrfs: fix deadlock when using free space tree due to block group creation · d7068618
      Filipe Manana 提交于
      commit a6d8654d885d7d79a3fb82da64eaa489ca332a82 upstream.
      
      When modifying the free space tree we can end up COWing one of its extent
      buffers which in turn might result in allocating a new chunk, which in
      turn can result in flushing (finish creation) of pending block groups. If
      that happens we can deadlock because creating a pending block group needs
      to update the free space tree, and if any of the updates tries to modify
      the same extent buffer that we are COWing, we end up in a deadlock since
      we try to write lock twice the same extent buffer.
      
      So fix this by skipping pending block group creation if we are COWing an
      extent buffer from the free space tree. This is a case missed by commit
      5ce555578e091 ("Btrfs: fix deadlock when writing out free space caches").
      
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=202173
      Fixes: 5ce555578e091 ("Btrfs: fix deadlock when writing out free space caches")
      CC: stable@vger.kernel.org # 4.18+
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      d7068618
  11. 10 1月, 2019 6 次提交
    • F
      Btrfs: send, fix race with transaction commits that create snapshots · 9eec74b4
      Filipe Manana 提交于
      commit be6821f82c3cc36e026f5afd10249988852b35ea upstream.
      
      If we create a snapshot of a snapshot currently being used by a send
      operation, we can end up with send failing unexpectedly (returning
      -ENOENT error to user space for example). The following diagram shows
      how this happens.
      
                  CPU 1                                   CPU2                                CPU3
      
       btrfs_ioctl_send()
        (...)
                                           create_snapshot()
                                            -> creates snapshot of a
                                               root used by the send
                                               task
                                            btrfs_commit_transaction()
                                             create_pending_snapshot()
        __get_inode_info()
         btrfs_search_slot()
          btrfs_search_slot_get_root()
           down_read commit_root_sem
      
           get reference on eb of the
           commit root
            -> eb with bytenr == X
      
           up_read commit_root_sem
      
                                              btrfs_cow_block(root node)
                                               btrfs_free_tree_block()
                                                -> creates delayed ref to
                                                   free the extent
      
                                             btrfs_run_delayed_refs()
                                              -> runs the delayed ref,
                                                 adds extent to
                                                 fs_info->pinned_extents
      
                                             btrfs_finish_extent_commit()
                                              unpin_extent_range()
                                               -> marks extent as free
                                                  in the free space cache
      
                                            transaction commit finishes
      
                                                                             btrfs_start_transaction()
                                                                              (...)
                                                                              btrfs_cow_block()
                                                                               btrfs_alloc_tree_block()
                                                                                btrfs_reserve_extent()
                                                                                 -> allocates extent at
                                                                                    bytenr == X
                                                                                btrfs_init_new_buffer(bytenr X)
                                                                                 btrfs_find_create_tree_block()
                                                                                  alloc_extent_buffer(bytenr X)
                                                                                   find_extent_buffer(bytenr X)
                                                                                    -> returns existing eb,
                                                                                       which the send task got
      
                                                                              (...)
                                                                               -> modifies content of the
                                                                                  eb with bytenr == X
      
          -> uses an eb that now
             belongs to some other
             tree and no more matches
             the commit root of the
             snapshot, resuts will be
             unpredictable
      
      The consequences of this race can be various, and can lead to searches in
      the commit root performed by the send task failing unexpectedly (unable to
      find inode items, returning -ENOENT to user space, for example) or not
      failing because an inode item with the same number was added to the tree
      that reused the metadata extent, in which case send can behave incorrectly
      in the worst case or just fail later for some reason.
      
      Fix this by performing a copy of the commit root's extent buffer when doing
      a search in the context of a send operation.
      
      CC: stable@vger.kernel.org # 4.4.x: 1fc28d8e: Btrfs: move get root out of btrfs_search_slot to a helper
      CC: stable@vger.kernel.org # 4.4.x: f9ddfd05: Btrfs: remove unused check of skip_locking
      CC: stable@vger.kernel.org # 4.4.x
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      9eec74b4
    • J
      btrfs: run delayed items before dropping the snapshot · 6911b074
      Josef Bacik 提交于
      commit 0568e82dbe2510fc1fa664f58e5c997d3f1e649e upstream.
      
      With my delayed refs patches in place we started seeing a large amount
      of aborts in __btrfs_free_extent:
      
       BTRFS error (device sdb1): unable to find ref byte nr 91947008 parent 0 root 35964  owner 1 offset 0
       Call Trace:
        ? btrfs_merge_delayed_refs+0xaf/0x340
        __btrfs_run_delayed_refs+0x6ea/0xfc0
        ? btrfs_set_path_blocking+0x31/0x60
        btrfs_run_delayed_refs+0xeb/0x180
        btrfs_commit_transaction+0x179/0x7f0
        ? btrfs_check_space_for_delayed_refs+0x30/0x50
        ? should_end_transaction.isra.19+0xe/0x40
        btrfs_drop_snapshot+0x41c/0x7c0
        btrfs_clean_one_deleted_snapshot+0xb5/0xd0
        cleaner_kthread+0xf6/0x120
        kthread+0xf8/0x130
        ? btree_invalidatepage+0x90/0x90
        ? kthread_bind+0x10/0x10
        ret_from_fork+0x35/0x40
      
      This was because btrfs_drop_snapshot depends on the root not being
      modified while it's dropping the snapshot.  It will unlock the root node
      (and really every node) as it walks down the tree, only to re-lock it
      when it needs to do something.  This is a problem because if we modify
      the tree we could cow a block in our path, which frees our reference to
      that block.  Then once we get back to that shared block we'll free our
      reference to it again, and get ENOENT when trying to lookup our extent
      reference to that block in __btrfs_free_extent.
      
      This is ultimately happening because we have delayed items left to be
      processed for our deleted snapshot _after_ all of the inodes are closed
      for the snapshot.  We only run the delayed inode item if we're deleting
      the inode, and even then we do not run the delayed insertions or delayed
      removals.  These can be run at any point after our final inode does its
      last iput, which is what triggers the snapshot deletion.  We can end up
      with the snapshot deletion happening and then have the delayed items run
      on that file system, resulting in the above problem.
      
      This problem has existed forever, however my patches made it much easier
      to hit as I wake up the cleaner much more often to deal with delayed
      iputs, which made us more likely to start the snapshot dropping work
      before the transaction commits, which is when the delayed items would
      generally be run.  Before, generally speaking, we would run the delayed
      items, commit the transaction, and wakeup the cleaner thread to start
      deleting snapshots, which means we were less likely to hit this problem.
      You could still hit it if you had multiple snapshots to be deleted and
      ended up with lots of delayed items, but it was definitely harder.
      
      Fix for now by simply running all the delayed items before starting to
      drop the snapshot.  We could make this smarter in the future by making
      the delayed items per-root, and then simply drop any delayed items for
      roots that we are going to delete.  But for now just a quick and easy
      solution is the safest.
      
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      6911b074
    • F
      Btrfs: fix fsync of files with multiple hard links in new directories · 10b04210
      Filipe Manana 提交于
      commit 41bd60676923822de1df2c50b3f9a10171f4338a upstream.
      
      The log tree has a long standing problem that when a file is fsync'ed we
      only check for new ancestors, created in the current transaction, by
      following only the hard link for which the fsync was issued. We follow the
      ancestors using the VFS' dget_parent() API. This means that if we create a
      new link for a file in a directory that is new (or in an any other new
      ancestor directory) and then fsync the file using an old hard link, we end
      up not logging the new ancestor, and on log replay that new hard link and
      ancestor do not exist. In some cases, involving renames, the file will not
      exist at all.
      
      Example:
      
        mkfs.btrfs -f /dev/sdb
        mount /dev/sdb /mnt
      
        mkdir /mnt/A
        touch /mnt/foo
        ln /mnt/foo /mnt/A/bar
        xfs_io -c fsync /mnt/foo
      
        <power failure>
      
      In this example after log replay only the hard link named 'foo' exists
      and directory A does not exist, which is unexpected. In other major linux
      filesystems, such as ext4, xfs and f2fs for example, both hard links exist
      and so does directory A after mounting again the filesystem.
      
      Checking if any new ancestors are new and need to be logged was added in
      2009 by commit 12fcfd22 ("Btrfs: tree logging unlink/rename fixes"),
      however only for the ancestors of the hard link (dentry) for which the
      fsync was issued, instead of checking for all ancestors for all of the
      inode's hard links.
      
      So fix this by tracking the id of the last transaction where a hard link
      was created for an inode and then on fsync fallback to a full transaction
      commit when an inode has more than one hard link and at least one new hard
      link was created in the current transaction. This is the simplest solution
      since this is not a common use case (adding frequently hard links for
      which there's an ancestor created in the current transaction and then
      fsync the file). In case it ever becomes a common use case, a solution
      that consists of iterating the fs/subvol btree for each hard link and
      check if any ancestor is new, could be implemented.
      
      This solves many unexpected scenarios reported by Jayashree Mohan and
      Vijay Chidambaram, and for which there is a new test case for fstests
      under review.
      
      Fixes: 12fcfd22 ("Btrfs: tree logging unlink/rename fixes")
      CC: stable@vger.kernel.org # 4.4+
      Reported-by: NVijay Chidambaram <vvijay03@gmail.com>
      Reported-by: NJayashree Mohan <jayashree2912@gmail.com>
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      10b04210
    • L
      btrfs: skip file_extent generation check for free_space_inode in run_delalloc_nocow · 7708a830
      Lu Fengqi 提交于
      commit 27a7ff554e8d349627a90bda275c527b7348adae upstream.
      
      The test case btrfs/001 with inode_cache mount option will encounter the
      following warning:
      
        WARNING: CPU: 1 PID: 23700 at fs/btrfs/inode.c:956 cow_file_range.isra.19+0x32b/0x430 [btrfs]
        CPU: 1 PID: 23700 Comm: btrfs Kdump: loaded Tainted: G        W  O      4.20.0-rc4-custom+ #30
        Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
        RIP: 0010:cow_file_range.isra.19+0x32b/0x430 [btrfs]
        Call Trace:
         ? free_extent_buffer+0x46/0x90 [btrfs]
         run_delalloc_nocow+0x455/0x900 [btrfs]
         btrfs_run_delalloc_range+0x1a7/0x360 [btrfs]
         writepage_delalloc+0xf9/0x150 [btrfs]
         __extent_writepage+0x125/0x3e0 [btrfs]
         extent_write_cache_pages+0x1b6/0x3e0 [btrfs]
         ? __wake_up_common_lock+0x63/0xc0
         extent_writepages+0x50/0x80 [btrfs]
         do_writepages+0x41/0xd0
         ? __filemap_fdatawrite_range+0x9e/0xf0
         __filemap_fdatawrite_range+0xbe/0xf0
         btrfs_fdatawrite_range+0x1b/0x50 [btrfs]
         __btrfs_write_out_cache+0x42c/0x480 [btrfs]
         btrfs_write_out_ino_cache+0x84/0xd0 [btrfs]
         btrfs_save_ino_cache+0x551/0x660 [btrfs]
         commit_fs_roots+0xc5/0x190 [btrfs]
         btrfs_commit_transaction+0x2bf/0x8d0 [btrfs]
         btrfs_mksubvol+0x48d/0x4d0 [btrfs]
         btrfs_ioctl_snap_create_transid+0x170/0x180 [btrfs]
         btrfs_ioctl_snap_create_v2+0x124/0x180 [btrfs]
         btrfs_ioctl+0x123f/0x3030 [btrfs]
      
      The file extent generation of the free space inode is equal to the last
      snapshot of the file root, so the inode will be passed to cow_file_rage.
      But the inode was created and its extents were preallocated in
      btrfs_save_ino_cache, there are no cow copies on disk.
      
      The preallocated extent is not yet in the extent tree, and
      btrfs_cross_ref_exist will ignore the -ENOENT returned by
      check_committed_ref, so we can directly write the inode to the disk.
      
      Fixes: 78d4295b ("btrfs: lift some btrfs_cross_ref_exist checks in nocow path")
      CC: stable@vger.kernel.org # 4.18+
      Reviewed-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NLu Fengqi <lufq.fnst@cn.fujitsu.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      7708a830
    • A
      btrfs: dev-replace: go back to suspend state if another EXCL_OP is running · c1f90eb0
      Anand Jain 提交于
      commit 05c49e6bc1e8866ecfd674ebeeb58cdbff9145c2 upstream.
      
      In a secnario where balance and replace co-exists as below,
      
        - start balance
        - pause balance
        - start replace
        - reboot
      
      and when system restarts, balance resumes first. Then the replace is
      attempted to restart but will fail as the EXCL_OP lock is already held
      by the balance. If so place the replace state back to
      BTRFS_IOCTL_DEV_REPLACE_STATE_SUSPENDED state.
      
      Fixes: 010a47bd ("btrfs: add proper safety check before resuming dev-replace")
      CC: stable@vger.kernel.org # 4.18+
      Signed-off-by: NAnand Jain <anand.jain@oracle.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      c1f90eb0
    • A
      btrfs: dev-replace: go back to suspended state if target device is missing · 28867a52
      Anand Jain 提交于
      commit 0d228ece59a35a9b9e8ff0d40653234a6d90f61e upstream.
      
      At the time of forced unmount we place the running replace to
      BTRFS_IOCTL_DEV_REPLACE_STATE_SUSPENDED state, so when the system comes
      back and expect the target device is missing.
      
      Then let the replace state continue to be in
      BTRFS_IOCTL_DEV_REPLACE_STATE_SUSPENDED state instead of
      BTRFS_IOCTL_DEV_REPLACE_STATE_STARTED as there isn't any matching scrub
      running as part of replace.
      
      Fixes: e93c89c1 ("Btrfs: add new sources for device replace code")
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: NAnand Jain <anand.jain@oracle.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      28867a52
  12. 21 12月, 2018 1 次提交
    • O
      Btrfs: fix missing delayed iputs on unmount · b4c7c826
      Omar Sandoval 提交于
      [ Upstream commit d6fd0ae25c6495674dc5a41a8d16bc8e0073276d ]
      
      There's a race between close_ctree() and cleaner_kthread().
      close_ctree() sets btrfs_fs_closing(), and the cleaner stops when it
      sees it set, but this is racy; the cleaner might have already checked
      the bit and could be cleaning stuff. In particular, if it deletes unused
      block groups, it will create delayed iputs for the free space cache
      inodes. As of "btrfs: don't run delayed_iputs in commit", we're no
      longer running delayed iputs after a commit. Therefore, if the cleaner
      creates more delayed iputs after delayed iputs are run in
      btrfs_commit_super(), we will leak inodes on unmount and get a busy
      inode crash from the VFS.
      
      Fix it by parking the cleaner before we actually close anything. Then,
      any remaining delayed iputs will always be handled in
      btrfs_commit_super(). This also ensures that the commit in close_ctree()
      is really the last commit, so we can get rid of the commit in
      cleaner_kthread().
      
      The fstest/generic/475 followed by 476 can trigger a crash that
      manifests as a slab corruption caused by accessing the freed kthread
      structure by a wake up function. Sample trace:
      
      [ 5657.077612] BUG: unable to handle kernel NULL pointer dereference at 00000000000000cc
      [ 5657.079432] PGD 1c57a067 P4D 1c57a067 PUD da10067 PMD 0
      [ 5657.080661] Oops: 0000 [#1] PREEMPT SMP
      [ 5657.081592] CPU: 1 PID: 5157 Comm: fsstress Tainted: G        W         4.19.0-rc8-default+ #323
      [ 5657.083703] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626cc-prebuilt.qemu-project.org 04/01/2014
      [ 5657.086577] RIP: 0010:shrink_page_list+0x2f9/0xe90
      [ 5657.091937] RSP: 0018:ffffb5c745c8f728 EFLAGS: 00010287
      [ 5657.092953] RAX: 0000000000000074 RBX: ffffb5c745c8f830 RCX: 0000000000000000
      [ 5657.094590] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff9a8747fdf3d0
      [ 5657.095987] RBP: ffffb5c745c8f9e0 R08: 0000000000000000 R09: 0000000000000000
      [ 5657.097159] R10: ffff9a8747fdf5e8 R11: 0000000000000000 R12: ffffb5c745c8f788
      [ 5657.098513] R13: ffff9a877f6ff2c0 R14: ffff9a877f6ff2c8 R15: dead000000000200
      [ 5657.099689] FS:  00007f948d853b80(0000) GS:ffff9a877d600000(0000) knlGS:0000000000000000
      [ 5657.101032] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [ 5657.101953] CR2: 00000000000000cc CR3: 00000000684bd000 CR4: 00000000000006e0
      [ 5657.103159] Call Trace:
      [ 5657.103776]  shrink_inactive_list+0x194/0x410
      [ 5657.104671]  shrink_node_memcg.constprop.84+0x39a/0x6a0
      [ 5657.105750]  shrink_node+0x62/0x1c0
      [ 5657.106529]  try_to_free_pages+0x1a4/0x500
      [ 5657.107408]  __alloc_pages_slowpath+0x2c9/0xb20
      [ 5657.108418]  __alloc_pages_nodemask+0x268/0x2b0
      [ 5657.109348]  kmalloc_large_node+0x37/0x90
      [ 5657.110205]  __kmalloc_node+0x236/0x310
      [ 5657.111014]  kvmalloc_node+0x3e/0x70
      
      Fixes: 30928e9baac2 ("btrfs: don't run delayed_iputs in commit")
      Signed-off-by: NOmar Sandoval <osandov@fb.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      [ add trace ]
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      b4c7c826
  13. 17 12月, 2018 1 次提交
    • R
      Btrfs: send, fix infinite loop due to directory rename dependencies · 91f6a9aa
      Robbie Ko 提交于
      [ Upstream commit a4390aee72713d9e73f1132bcdeb17d72fbbf974 ]
      
      When doing an incremental send, due to the need of delaying directory move
      (rename) operations we can end up in infinite loop at
      apply_children_dir_moves().
      
      An example scenario that triggers this problem is described below, where
      directory names correspond to the numbers of their respective inodes.
      
      Parent snapshot:
      
       .
       |--- 261/
             |--- 271/
                   |--- 266/
                         |--- 259/
                         |--- 260/
                         |     |--- 267
                         |
                         |--- 264/
                         |     |--- 258/
                         |           |--- 257/
                         |
                         |--- 265/
                         |--- 268/
                         |--- 269/
                         |     |--- 262/
                         |
                         |--- 270/
                         |--- 272/
                         |     |--- 263/
                         |     |--- 275/
                         |
                         |--- 274/
                               |--- 273/
      
      Send snapshot:
      
       .
       |-- 275/
            |-- 274/
                 |-- 273/
                      |-- 262/
                           |-- 269/
                                |-- 258/
                                     |-- 271/
                                          |-- 268/
                                               |-- 267/
                                                    |-- 270/
                                                         |-- 259/
                                                         |    |-- 265/
                                                         |
                                                         |-- 272/
                                                              |-- 257/
                                                                   |-- 260/
                                                                   |-- 264/
                                                                        |-- 263/
                                                                             |-- 261/
                                                                                  |-- 266/
      
      When processing inode 257 we delay its move (rename) operation because its
      new parent in the send snapshot, inode 272, was not yet processed. Then
      when processing inode 272, we delay the move operation for that inode
      because inode 274 is its ancestor in the send snapshot. Finally we delay
      the move operation for inode 274 when processing it because inode 275 is
      its new parent in the send snapshot and was not yet moved.
      
      When finishing processing inode 275, we start to do the move operations
      that were previously delayed (at apply_children_dir_moves()), resulting in
      the following iterations:
      
      1) We issue the move operation for inode 274;
      
      2) Because inode 262 depended on the move operation of inode 274 (it was
         delayed because 274 is its ancestor in the send snapshot), we issue the
         move operation for inode 262;
      
      3) We issue the move operation for inode 272, because it was delayed by
         inode 274 too (ancestor of 272 in the send snapshot);
      
      4) We issue the move operation for inode 269 (it was delayed by 262);
      
      5) We issue the move operation for inode 257 (it was delayed by 272);
      
      6) We issue the move operation for inode 260 (it was delayed by 272);
      
      7) We issue the move operation for inode 258 (it was delayed by 269);
      
      8) We issue the move operation for inode 264 (it was delayed by 257);
      
      9) We issue the move operation for inode 271 (it was delayed by 258);
      
      10) We issue the move operation for inode 263 (it was delayed by 264);
      
      11) We issue the move operation for inode 268 (it was delayed by 271);
      
      12) We verify if we can issue the move operation for inode 270 (it was
          delayed by 271). We detect a path loop in the current state, because
          inode 267 needs to be moved first before we can issue the move
          operation for inode 270. So we delay again the move operation for
          inode 270, this time we will attempt to do it after inode 267 is
          moved;
      
      13) We issue the move operation for inode 261 (it was delayed by 263);
      
      14) We verify if we can issue the move operation for inode 266 (it was
          delayed by 263). We detect a path loop in the current state, because
          inode 270 needs to be moved first before we can issue the move
          operation for inode 266. So we delay again the move operation for
          inode 266, this time we will attempt to do it after inode 270 is
          moved (its move operation was delayed in step 12);
      
      15) We issue the move operation for inode 267 (it was delayed by 268);
      
      16) We verify if we can issue the move operation for inode 266 (it was
          delayed by 270). We detect a path loop in the current state, because
          inode 270 needs to be moved first before we can issue the move
          operation for inode 266. So we delay again the move operation for
          inode 266, this time we will attempt to do it after inode 270 is
          moved (its move operation was delayed in step 12). So here we added
          again the same delayed move operation that we added in step 14;
      
      17) We attempt again to see if we can issue the move operation for inode
          266, and as in step 16, we realize we can not due to a path loop in
          the current state due to a dependency on inode 270. Again we delay
          inode's 266 rename to happen after inode's 270 move operation, adding
          the same dependency to the empty stack that we did in steps 14 and 16.
          The next iteration will pick the same move dependency on the stack
          (the only entry) and realize again there is still a path loop and then
          again the same dependency to the stack, over and over, resulting in
          an infinite loop.
      
      So fix this by preventing adding the same move dependency entries to the
      stack by removing each pending move record from the red black tree of
      pending moves. This way the next call to get_pending_dir_moves() will
      not return anything for the current parent inode.
      
      A test case for fstests, with this reproducer, follows soon.
      Signed-off-by: NRobbie Ko <robbieko@synology.com>
      Reviewed-by: NFilipe Manana <fdmanana@suse.com>
      [Wrote changelog with example and more clear explanation]
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      91f6a9aa
  14. 08 12月, 2018 1 次提交
    • Q
      btrfs: tree-checker: Don't check max block group size as current max chunk size limit is unreliable · b0234f15
      Qu Wenruo 提交于
      commit 10950929e994c5ecee149ff0873388d3c98f12b5 upstream.
      
      [BUG]
      A completely valid btrfs will refuse to mount, with error message like:
        BTRFS critical (device sdb2): corrupt leaf: root=2 block=239681536 slot=172 \
          bg_start=12018974720 bg_len=10888413184, invalid block group size, \
          have 10888413184 expect (0, 10737418240]
      
      This has been reported several times as the 4.19 kernel is now being
      used. The filesystem refuses to mount, but is otherwise ok and booting
      4.18 is a workaround.
      
      Btrfs check returns no error, and all kernels used on this fs is later
      than 2011, which should all have the 10G size limit commit.
      
      [CAUSE]
      For a 12 devices btrfs, we could allocate a chunk larger than 10G due to
      stripe stripe bump up.
      
      __btrfs_alloc_chunk()
      |- max_stripe_size = 1G
      |- max_chunk_size = 10G
      |- data_stripe = 11
      |- if (1G * 11 > 10G) {
             stripe_size = 976128930;
             stripe_size = round_up(976128930, SZ_16M) = 989855744
      
      However the final stripe_size (989855744) * 11 = 10888413184, which is
      still larger than 10G.
      
      [FIX]
      For the comprehensive check, we need to do the full check at chunk read
      time, and rely on bg <-> chunk mapping to do the check.
      
      We could just skip the length check for now.
      
      Fixes: fce466ea ("btrfs: tree-checker: Verify block_group_item")
      Cc: stable@vger.kernel.org # v4.19+
      Reported-by: NWang Yugui <wangyugui@e16-tech.com>
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      b0234f15