1. 15 9月, 2015 1 次提交
    • F
      Btrfs: fix read corruption of compressed and shared extents · 005efedf
      Filipe Manana 提交于
      If a file has a range pointing to a compressed extent, followed by
      another range that points to the same compressed extent and a read
      operation attempts to read both ranges (either completely or part of
      them), the pages that correspond to the second range are incorrectly
      filled with zeroes.
      
      Consider the following example:
      
        File layout
        [0 - 8K]                      [8K - 24K]
            |                             |
            |                             |
         points to extent X,         points to extent X,
         offset 4K, length of 8K     offset 0, length 16K
      
        [extent X, compressed length = 4K uncompressed length = 16K]
      
      If a readpages() call spans the 2 ranges, a single bio to read the extent
      is submitted - extent_io.c:submit_extent_page() would only create a new
      bio to cover the second range pointing to the extent if the extent it
      points to had a different logical address than the extent associated with
      the first range. This has a consequence of the compressed read end io
      handler (compression.c:end_compressed_bio_read()) finish once the extent
      is decompressed into the pages covering the first range, leaving the
      remaining pages (belonging to the second range) filled with zeroes (done
      by compression.c:btrfs_clear_biovec_end()).
      
      So fix this by submitting the current bio whenever we find a range
      pointing to a compressed extent that was preceded by a range with a
      different extent map. This is the simplest solution for this corner
      case. Making the end io callback populate both ranges (or more, if we
      have multiple pointing to the same extent) is a much more complex
      solution since each bio is tightly coupled with a single extent map and
      the extent maps associated to the ranges pointing to the shared extent
      can have different offsets and lengths.
      
      The following test case for fstests triggers the issue:
      
        seq=`basename $0`
        seqres=$RESULT_DIR/$seq
        echo "QA output created by $seq"
        tmp=/tmp/$$
        status=1	# failure is the default!
        trap "_cleanup; exit \$status" 0 1 2 3 15
      
        _cleanup()
        {
            rm -f $tmp.*
        }
      
        # get standard environment, filters and checks
        . ./common/rc
        . ./common/filter
      
        # real QA test starts here
        _need_to_be_root
        _supported_fs btrfs
        _supported_os Linux
        _require_scratch
        _require_cloner
      
        rm -f $seqres.full
      
        test_clone_and_read_compressed_extent()
        {
            local mount_opts=$1
      
            _scratch_mkfs >>$seqres.full 2>&1
            _scratch_mount $mount_opts
      
            # Create a test file with a single extent that is compressed (the
            # data we write into it is highly compressible no matter which
            # compression algorithm is used, zlib or lzo).
            $XFS_IO_PROG -f -c "pwrite -S 0xaa 0K 4K"        \
                            -c "pwrite -S 0xbb 4K 8K"        \
                            -c "pwrite -S 0xcc 12K 4K"       \
                            $SCRATCH_MNT/foo | _filter_xfs_io
      
            # Now clone our extent into an adjacent offset.
            $CLONER_PROG -s $((4 * 1024)) -d $((16 * 1024)) -l $((8 * 1024)) \
                $SCRATCH_MNT/foo $SCRATCH_MNT/foo
      
            # Same as before but for this file we clone the extent into a lower
            # file offset.
            $XFS_IO_PROG -f -c "pwrite -S 0xaa 8K 4K"         \
                            -c "pwrite -S 0xbb 12K 8K"        \
                            -c "pwrite -S 0xcc 20K 4K"        \
                            $SCRATCH_MNT/bar | _filter_xfs_io
      
            $CLONER_PROG -s $((12 * 1024)) -d 0 -l $((8 * 1024)) \
                $SCRATCH_MNT/bar $SCRATCH_MNT/bar
      
            echo "File digests before unmounting filesystem:"
            md5sum $SCRATCH_MNT/foo | _filter_scratch
            md5sum $SCRATCH_MNT/bar | _filter_scratch
      
            # Evicting the inode or clearing the page cache before reading
            # again the file would also trigger the bug - reads were returning
            # all bytes in the range corresponding to the second reference to
            # the extent with a value of 0, but the correct data was persisted
            # (it was a bug exclusively in the read path). The issue happened
            # only if the same readpages() call targeted pages belonging to the
            # first and second ranges that point to the same compressed extent.
            _scratch_remount
      
            echo "File digests after mounting filesystem again:"
            # Must match the same digests we got before.
            md5sum $SCRATCH_MNT/foo | _filter_scratch
            md5sum $SCRATCH_MNT/bar | _filter_scratch
        }
      
        echo -e "\nTesting with zlib compression..."
        test_clone_and_read_compressed_extent "-o compress=zlib"
      
        _scratch_unmount
      
        echo -e "\nTesting with lzo compression..."
        test_clone_and_read_compressed_extent "-o compress=lzo"
      
        status=0
        exit
      
      Cc: stable@vger.kernel.org
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Reviewed-by: Qu Wenruo<quwenruo@cn.fujitsu.com>
      Reviewed-by: NLiu Bo <bo.li.liu@oracle.com>
      005efedf
  2. 22 8月, 2015 1 次提交
    • C
      btrfs: fix compile when block cgroups are not enabled · 3a9508b0
      Chris Mason 提交于
      bio->bi_css and bio->bi_ioc don't exist when block cgroups are not on.
      This adds an ifdef around them.  It's not perfect, but our
      use of bi_ioc is being removed in the 4.3 merge window.
      
      The bi_css usage really should go into bio_clone, but I want to make
      sure that doesn't introduce problems for other bio_clone use cases.
      Signed-off-by: NChris Mason <clm@fb.com>
      3a9508b0
  3. 20 8月, 2015 1 次提交
    • M
      btrfs: Prevent from early transaction abort · d1b5c567
      Michal Hocko 提交于
      Btrfs relies on GFP_NOFS allocation when committing the transaction but
      this allocation context is rather weak wrt. reclaim capabilities. The
      page allocator currently tries hard to not fail these allocations if
      they are small (<=PAGE_ALLOC_COSTLY_ORDER) so this is not a problem
      currently but there is an attempt to move away from the default no-fail
      behavior and allow these allocation to fail more eagerly. And this would
      lead to a pre-mature transaction abort as follows:
      
      [   55.328093] Call Trace:
      [   55.328890]  [<ffffffff8154e6f0>] dump_stack+0x4f/0x7b
      [   55.330518]  [<ffffffff8108fa28>] ? console_unlock+0x334/0x363
      [   55.332738]  [<ffffffff8110873e>] __alloc_pages_nodemask+0x81d/0x8d4
      [   55.334910]  [<ffffffff81100752>] pagecache_get_page+0x10e/0x20c
      [   55.336844]  [<ffffffffa007d916>] alloc_extent_buffer+0xd0/0x350 [btrfs]
      [   55.338973]  [<ffffffffa0059d8c>] btrfs_find_create_tree_block+0x15/0x17 [btrfs]
      [   55.341329]  [<ffffffffa004f728>] btrfs_alloc_tree_block+0x18c/0x405 [btrfs]
      [   55.343566]  [<ffffffffa003fa34>] split_leaf+0x1e4/0x6a6 [btrfs]
      [   55.345577]  [<ffffffffa0040567>] btrfs_search_slot+0x671/0x831 [btrfs]
      [   55.347679]  [<ffffffff810682d7>] ? get_parent_ip+0xe/0x3e
      [   55.349434]  [<ffffffffa0041cb2>] btrfs_insert_empty_items+0x5d/0xa8 [btrfs]
      [   55.351681]  [<ffffffffa004ecfb>] __btrfs_run_delayed_refs+0x7a6/0xf35 [btrfs]
      [   55.353979]  [<ffffffffa00512ea>] btrfs_run_delayed_refs+0x6e/0x226 [btrfs]
      [   55.356212]  [<ffffffffa0060e21>] ? start_transaction+0x192/0x534 [btrfs]
      [   55.358378]  [<ffffffffa0060e21>] ? start_transaction+0x192/0x534 [btrfs]
      [   55.360626]  [<ffffffffa0060221>] btrfs_commit_transaction+0x4c/0xaba [btrfs]
      [   55.362894]  [<ffffffffa0060e21>] ? start_transaction+0x192/0x534 [btrfs]
      [   55.365221]  [<ffffffffa0073428>] btrfs_sync_file+0x29c/0x310 [btrfs]
      [   55.367273]  [<ffffffff81186808>] vfs_fsync_range+0x8f/0x9e
      [   55.369047]  [<ffffffff81186833>] vfs_fsync+0x1c/0x1e
      [   55.370654]  [<ffffffff81186869>] do_fsync+0x34/0x4e
      [   55.372246]  [<ffffffff81186ab3>] SyS_fsync+0x10/0x14
      [   55.373851]  [<ffffffff81554f97>] system_call_fastpath+0x12/0x6f
      [   55.381070] BTRFS: error (device hdb1) in btrfs_run_delayed_refs:2821: errno=-12 Out of memory
      [   55.382431] BTRFS warning (device hdb1): Skipping commit of aborted transaction.
      [   55.382433] BTRFS warning (device hdb1): cleanup_transaction:1692: Aborting unused transaction(IO failure).
      [   55.384280] ------------[ cut here ]------------
      [   55.384312] WARNING: CPU: 0 PID: 3010 at fs/btrfs/delayed-ref.c:438 btrfs_select_ref_head+0xd9/0xfe [btrfs]()
      [...]
      [   55.384337] Call Trace:
      [   55.384353]  [<ffffffff8154e6f0>] dump_stack+0x4f/0x7b
      [   55.384357]  [<ffffffff8107f717>] ? down_trylock+0x2d/0x37
      [   55.384359]  [<ffffffff81046977>] warn_slowpath_common+0xa1/0xbb
      [   55.384398]  [<ffffffffa00a1d6b>] ? btrfs_select_ref_head+0xd9/0xfe [btrfs]
      [   55.384400]  [<ffffffff81046a34>] warn_slowpath_null+0x1a/0x1c
      [   55.384423]  [<ffffffffa00a1d6b>] btrfs_select_ref_head+0xd9/0xfe [btrfs]
      [   55.384446]  [<ffffffffa004e5f7>] ? __btrfs_run_delayed_refs+0xa2/0xf35 [btrfs]
      [   55.384455]  [<ffffffffa004e600>] __btrfs_run_delayed_refs+0xab/0xf35 [btrfs]
      [   55.384476]  [<ffffffffa00512ea>] btrfs_run_delayed_refs+0x6e/0x226 [btrfs]
      [   55.384499]  [<ffffffffa0060e21>] ? start_transaction+0x192/0x534 [btrfs]
      [   55.384521]  [<ffffffffa0060e21>] ? start_transaction+0x192/0x534 [btrfs]
      [   55.384543]  [<ffffffffa0060221>] btrfs_commit_transaction+0x4c/0xaba [btrfs]
      [   55.384565]  [<ffffffffa0060e21>] ? start_transaction+0x192/0x534 [btrfs]
      [   55.384588]  [<ffffffffa0073428>] btrfs_sync_file+0x29c/0x310 [btrfs]
      [   55.384591]  [<ffffffff81186808>] vfs_fsync_range+0x8f/0x9e
      [   55.384592]  [<ffffffff81186833>] vfs_fsync+0x1c/0x1e
      [   55.384593]  [<ffffffff81186869>] do_fsync+0x34/0x4e
      [   55.384594]  [<ffffffff81186ab3>] SyS_fsync+0x10/0x14
      [   55.384595]  [<ffffffff81554f97>] system_call_fastpath+0x12/0x6f
      [...]
      [   55.384608] ---[ end trace c29799da1d4dd621 ]---
      [   55.437323] BTRFS info (device hdb1): forced readonly
      [   55.438815] BTRFS info (device hdb1): delayed_refs has NO entry
      
      Fix this by being explicit about the no-fail behavior of this allocation
      path and use __GFP_NOFAIL.
      Signed-off-by: NMichal Hocko <mhocko@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      d1b5c567
  4. 14 8月, 2015 1 次提交
  5. 09 8月, 2015 1 次提交
    • C
      Btrfs: add support for blkio controllers · da2f0f74
      Chris Mason 提交于
      This attaches accounting information to bios as we submit them so the
      new blkio controllers can throttle on btrfs filesystems.
      
      Not much is required, we're just associating bios with blkcgs during clone,
      calling wbc_init_bio()/wbc_account_io() during writepages submission,
      and attaching the bios to the current context during direct IO.
      
      Finally if we are splitting bios during btrfs_map_bio, this attaches
      accounting information to the split.
      
      The end result is able to throttle nicely on single disk filesystems.  A
      little more work is required for multi-device filesystems.
      Signed-off-by: NChris Mason <clm@fb.com>
      da2f0f74
  6. 29 7月, 2015 1 次提交
    • C
      block: add a bi_error field to struct bio · 4246a0b6
      Christoph Hellwig 提交于
      Currently we have two different ways to signal an I/O error on a BIO:
      
       (1) by clearing the BIO_UPTODATE flag
       (2) by returning a Linux errno value to the bi_end_io callback
      
      The first one has the drawback of only communicating a single possible
      error (-EIO), and the second one has the drawback of not beeing persistent
      when bios are queued up, and are not passed along from child to parent
      bio in the ever more popular chaining scenario.  Having both mechanisms
      available has the additional drawback of utterly confusing driver authors
      and introducing bugs where various I/O submitters only deal with one of
      them, and the others have to add boilerplate code to deal with both kinds
      of error returns.
      
      So add a new bi_error field to store an errno value directly in struct
      bio and remove the existing mechanisms to clean all this up.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NHannes Reinecke <hare@suse.de>
      Reviewed-by: NNeilBrown <neilb@suse.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      4246a0b6
  7. 03 6月, 2015 2 次提交
  8. 19 5月, 2015 1 次提交
  9. 11 5月, 2015 1 次提交
    • F
      Btrfs: fix race when reusing stale extent buffers that leads to BUG_ON · 062c19e9
      Filipe Manana 提交于
      There's a race between releasing extent buffers that are flagged as stale
      and recycling them that makes us it the following BUG_ON at
      btrfs_release_extent_buffer_page:
      
          BUG_ON(extent_buffer_under_io(eb))
      
      The BUG_ON is triggered because the extent buffer has the flag
      EXTENT_BUFFER_DIRTY set as a consequence of having been reused and made
      dirty by another concurrent task.
      
      Here follows a sequence of steps that leads to the BUG_ON.
      
            CPU 0                                                    CPU 1                                                CPU 2
      
      path->nodes[0] == eb X
      X->refs == 2 (1 for the tree, 1 for the path)
      btrfs_header_generation(X) == current trans id
      flag EXTENT_BUFFER_DIRTY set on X
      
      btrfs_release_path(path)
          unlocks X
      
                                                            reads eb X
                                                               X->refs incremented to 3
                                                            locks eb X
                                                            btrfs_del_items(X)
                                                               X becomes empty
                                                               clean_tree_block(X)
                                                                   clear EXTENT_BUFFER_DIRTY from X
                                                               btrfs_del_leaf(X)
                                                                   unlocks X
                                                                   extent_buffer_get(X)
                                                                      X->refs incremented to 4
                                                                   btrfs_free_tree_block(X)
                                                                      X's range is not pinned
                                                                      X's range added to free
                                                                        space cache
                                                                   free_extent_buffer_stale(X)
                                                                      lock X->refs_lock
                                                                      set EXTENT_BUFFER_STALE on X
                                                                      release_extent_buffer(X)
                                                                          X->refs decremented to 3
                                                                          unlocks X->refs_lock
                                                            btrfs_release_path()
                                                               unlocks X
                                                               free_extent_buffer(X)
                                                                   X->refs becomes 2
      
                                                                                                            __btrfs_cow_block(Y)
                                                                                                                btrfs_alloc_tree_block()
                                                                                                                    btrfs_reserve_extent()
                                                                                                                        find_free_extent()
                                                                                                                            gets offset == X->start
                                                                                                                    btrfs_init_new_buffer(X->start)
                                                                                                                        btrfs_find_create_tree_block(X->start)
                                                                                                                            alloc_extent_buffer(X->start)
                                                                                                                                find_extent_buffer(X->start)
                                                                                                                                    finds eb X in radix tree
      
          free_extent_buffer(X)
              lock X->refs_lock
                  test X->refs == 2
                  test bit EXTENT_BUFFER_STALE is set
                  test !extent_buffer_under_io(eb)
      
                                                                                                                                    increments X->refs to 3
                                                                                                                                    mark_extent_buffer_accessed(X)
                                                                                                                                        check_buffer_tree_ref(X)
                                                                                                                                          --> does nothing,
                                                                                                                                              X->refs >= 2 and
                                                                                                                                              EXTENT_BUFFER_TREE_REF
                                                                                                                                              is set in X
                                                                                                                    clear EXTENT_BUFFER_STALE from X
                                                                                                                    locks X
                                                                                                                btrfs_mark_buffer_dirty()
                                                                                                                    set_extent_buffer_dirty(X)
                                                                                                                        check_buffer_tree_ref(X)
                                                                                                                           --> does nothing, X->refs >= 2 and
                                                                                                                               EXTENT_BUFFER_TREE_REF is set
                                                                                                                        sets EXTENT_BUFFER_DIRTY on X
      
                  test and clear EXTENT_BUFFER_TREE_REF
                  decrements X->refs to 2
              release_extent_buffer(X)
                  decrements X->refs to 1
                  unlock X->refs_lock
      
                                                                                                            unlock X
                                                                                                            free_extent_buffer(X)
                                                                                                                lock X->refs_lock
                                                                                                                release_extent_buffer(X)
                                                                                                                    decrements X->refs to 0
                                                                                                                    btrfs_release_extent_buffer_page(X)
                                                                                                                         BUG_ON(extent_buffer_under_io(X))
                                                                                                                             --> EXTENT_BUFFER_DIRTY set on X
      
      Fix this by making find_extent buffer wait for any ongoing task currently
      executing free_extent_buffer()/free_extent_buffer_stale() if the extent
      buffer has the stale flag set.
      A more clean alternative would be to always increment the extent buffer's
      reference count while holding its refs_lock spinlock but find_extent_buffer
      is a performance critical area and that would cause lock contention whenever
      multiple tasks search for the same extent buffer concurrently.
      
      A build server running a SLES 12 kernel (3.12 kernel + over 450 upstream
      btrfs patches backported from newer kernels) was hitting this often:
      
      [1212302.461948] kernel BUG at ../fs/btrfs/extent_io.c:4507!
      (...)
      [1212302.470219] CPU: 1 PID: 19259 Comm: bs_sched Not tainted 3.12.36-38-default #1
      [1212302.540792] Hardware name: Supermicro PDSM4/PDSM4, BIOS 6.00 04/17/2006
      [1212302.540792] task: ffff8800e07e0100 ti: ffff8800d6412000 task.ti: ffff8800d6412000
      [1212302.540792] RIP: 0010:[<ffffffffa0507081>]  [<ffffffffa0507081>] btrfs_release_extent_buffer_page.constprop.51+0x101/0x110 [btrfs]
      (...)
      [1212302.630008] Call Trace:
      [1212302.630008]  [<ffffffffa05070cd>] release_extent_buffer+0x3d/0xa0 [btrfs]
      [1212302.630008]  [<ffffffffa04c2d9d>] btrfs_release_path+0x1d/0xa0 [btrfs]
      [1212302.630008]  [<ffffffffa04c5c7e>] read_block_for_search.isra.33+0x13e/0x3a0 [btrfs]
      [1212302.630008]  [<ffffffffa04c8094>] btrfs_search_slot+0x3f4/0xa80 [btrfs]
      [1212302.630008]  [<ffffffffa04cf5d8>] lookup_inline_extent_backref+0xf8/0x630 [btrfs]
      [1212302.630008]  [<ffffffffa04d13dd>] __btrfs_free_extent+0x11d/0xc40 [btrfs]
      [1212302.630008]  [<ffffffffa04d64a4>] __btrfs_run_delayed_refs+0x394/0x11d0 [btrfs]
      [1212302.630008]  [<ffffffffa04db379>] btrfs_run_delayed_refs.part.66+0x69/0x280 [btrfs]
      [1212302.630008]  [<ffffffffa04ed2ad>] __btrfs_end_transaction+0x2ad/0x3d0 [btrfs]
      [1212302.630008]  [<ffffffffa04f7505>] btrfs_evict_inode+0x4a5/0x500 [btrfs]
      [1212302.630008]  [<ffffffff811b9e28>] evict+0xa8/0x190
      [1212302.630008]  [<ffffffff811b0330>] do_unlinkat+0x1a0/0x2b0
      
      I was also able to reproduce this on a 3.19 kernel, corresponding to Chris'
      integration branch from about a month ago, running the following stress
      test on a qemu/kvm guest (with 4 virtual cpus and 16Gb of ram):
      
        while true; do
           mkfs.btrfs -l 4096 -f -b `expr 20 \* 1024 \* 1024 \* 1024` /dev/sdd
           mount /dev/sdd /mnt
           snapshot_cmd="btrfs subvolume snapshot -r /mnt"
           snapshot_cmd="$snapshot_cmd /mnt/snap_\`date +'%H_%M_%S_%N'\`"
           fsstress -d /mnt -n 25000 -p 8 -x "$snapshot_cmd" -X 100
           umount /mnt
        done
      
      Which usually triggers the BUG_ON within less than 24 hours:
      
      [49558.618097] ------------[ cut here ]------------
      [49558.619732] kernel BUG at fs/btrfs/extent_io.c:4551!
      (...)
      [49558.620031] CPU: 3 PID: 23908 Comm: fsstress Tainted: G        W      3.19.0-btrfs-next-7+ #3
      [49558.620031] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
      [49558.620031] task: ffff8800319fc0d0 ti: ffff880220da8000 task.ti: ffff880220da8000
      [49558.620031] RIP: 0010:[<ffffffffa0476b1a>]  [<ffffffffa0476b1a>] btrfs_release_extent_buffer_page+0x20/0xe9 [btrfs]
      (...)
      [49558.620031] Call Trace:
      [49558.620031]  [<ffffffffa0476c73>] release_extent_buffer+0x90/0xd3 [btrfs]
      [49558.620031]  [<ffffffff8142b10c>] ? _raw_spin_lock+0x3b/0x43
      [49558.620031]  [<ffffffffa0477052>] ? free_extent_buffer+0x37/0x94 [btrfs]
      [49558.620031]  [<ffffffffa04770ab>] free_extent_buffer+0x90/0x94 [btrfs]
      [49558.620031]  [<ffffffffa04396d5>] btrfs_release_path+0x4a/0x69 [btrfs]
      [49558.620031]  [<ffffffffa0444907>] __btrfs_free_extent+0x778/0x80c [btrfs]
      [49558.620031]  [<ffffffffa044a485>] __btrfs_run_delayed_refs+0xad2/0xc62 [btrfs]
      [49558.728054]  [<ffffffff811420d5>] ? kmemleak_alloc_recursive.constprop.52+0x16/0x18
      [49558.728054]  [<ffffffffa044c1e8>] btrfs_run_delayed_refs+0x6d/0x1ba [btrfs]
      [49558.728054]  [<ffffffffa045917f>] ? join_transaction.isra.9+0xb9/0x36b [btrfs]
      [49558.728054]  [<ffffffffa045a75c>] btrfs_commit_transaction+0x4c/0x981 [btrfs]
      [49558.728054]  [<ffffffffa0434f86>] btrfs_sync_fs+0xd5/0x10d [btrfs]
      [49558.728054]  [<ffffffff81155923>] ? iterate_supers+0x60/0xc4
      [49558.728054]  [<ffffffff8117966a>] ? do_sync_work+0x91/0x91
      [49558.728054]  [<ffffffff8117968a>] sync_fs_one_sb+0x20/0x22
      [49558.728054]  [<ffffffff81155939>] iterate_supers+0x76/0xc4
      [49558.728054]  [<ffffffff811798e8>] sys_sync+0x55/0x83
      [49558.728054]  [<ffffffff8142bbd2>] system_call_fastpath+0x12/0x17
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.cz>
      Signed-off-by: NChris Mason <clm@fb.com>
      062c19e9
  10. 30 4月, 2015 1 次提交
  11. 26 4月, 2015 1 次提交
    • O
      btrfs: fix race on ENOMEM in alloc_extent_buffer · 5ca64f45
      Omar Sandoval 提交于
      Consider the following interleaving of overlapping calls to
      alloc_extent_buffer:
      
      Call 1:
      
      - Successfully allocates a few pages with find_or_create_page
      - find_or_create_page fails, goto free_eb
      - Unlocks the allocated pages
      
      Call 2:
      - Calls find_or_create_page and gets a page in call 1's extent_buffer
      - Finds that the page is already associated with an extent_buffer
      - Grabs a reference to the half-written extent_buffer and calls
        mark_extent_buffer_accessed on it
      
      mark_extent_buffer_accessed will then try to call mark_page_accessed on
      a null page and panic.
      
      The fix is to decrement the reference count on the half-written
      extent_buffer before unlocking the pages so call 2 won't use it. We
      should also set exists = NULL in the case that we don't use exists to
      avoid accidentally returning a freed extent_buffer in an error case.
      Signed-off-by: NOmar Sandoval <osandov@osandov.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.cz>
      Reviewed-by: NLiu Bo <bo.li.liu@oracle.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      5ca64f45
  12. 27 3月, 2015 1 次提交
  13. 18 3月, 2015 1 次提交
  14. 15 2月, 2015 1 次提交
    • J
      Btrfs: account for large extents with enospc · dcab6a3b
      Josef Bacik 提交于
      On our gluster boxes we stream large tar balls of backups onto our fses.  With
      160gb of ram this means we get really large contiguous ranges of dirty data, but
      the way our ENOSPC stuff works is that as long as it's contiguous we only hold
      metadata reservation for one extent.  The problem is we limit our extents to
      128mb, so we'll end up with at least 800 extents so our enospc accounting is
      quite a bit lower than what we need.  To keep track of this make sure we
      increase outstanding_extents for every multiple of the max extent size so we can
      be sure to have enough reserved metadata space.  Thanks,
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      dcab6a3b
  15. 12 2月, 2015 1 次提交
  16. 03 2月, 2015 1 次提交
    • N
      btrfs: clear bio reference after submit_one_bio() · 289454ad
      Naohiro Aota 提交于
      After submit_one_bio(), `bio' can go away. However submit_extent_page()
      leave `bio' referable if submit_one_bio() failed (e.g. -ENOMEM on OOM).
      It will cause invalid paging request when submit_extent_page() is called
      next time.
      
      I reproduced ENOMEM case with the following script (need
      CONFIG_FAIL_PAGE_ALLOC, and CONFIG_FAULT_INJECTION_DEBUG_FS).
      
        #!/bin/bash
      
        dmesgout=dmesg.txt
        start=100000
        end=300000
        step=1000
      
        # btrfs options
        device=/dev/vdb1
        directory=/mnt/btrfs
      
        # fault-injection options
        percent=100
        times=3
      
        mkdir -p $directory || exit 1
        mount -o compress $device $directory || exit 1
      
        rm -f $directory/file || exit 1
        dd if=/dev/zero of=$directory/file bs=1M count=512 || exit 1
      
        for interval in `seq $start $step $end`; do
                dmesg -C
                echo 1 > /proc/sys/vm/drop_caches
                sync
                export FAILCMD_TYPE=fail_page_alloc
                ./failcmd.sh -p $percent -t $times -i $interval \
                        --ignore-gfp-highmem=N --ignore-gfp-wait=N --min-order=0 \
                        -- \
                        cat $directory/file > /dev/null
                dmesg > ${dmesgout}
                if grep -q BUG: ${dmesgout}; then
                        cat ${dmesgout}
                        exit 1
                fi
        done
      
        umount $directory
        exit 0
      Signed-off-by: NNaohiro Aota <naota@elisp.net>
      Tested-by: NSatoru Takeuchi <takeuchi_satoru@jp.fujitsu.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      289454ad
  17. 22 1月, 2015 2 次提交
    • Z
      Btrfs: add ref_count and free function for btrfs_bio · 6e9606d2
      Zhao Lei 提交于
      1: ref_count is simple than current RBIO_HOLD_BBIO_MAP_BIT flag
         to keep btrfs_bio's memory in raid56 recovery implement.
      2: free function for bbio will make code clean and flexible, plus
         forced data type checking in compile.
      
      Changelog v1->v2:
       Rename following by David Sterba's suggestion:
       put_btrfs_bio() -> btrfs_put_bio()
       get_btrfs_bio() -> btrfs_get_bio()
       bbio->ref_count -> bbio->refs
      Signed-off-by: NZhao Lei <zhaolei@cn.fujitsu.com>
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      6e9606d2
    • D
      btrfs: switch extent_state state to unsigned · 9ee49a04
      David Sterba 提交于
      Currently there's a 4B hole in the structure between refs and state and there
      are only 16 bits used so we can make it unsigned. This will get a better
      packing and may save some stack space for local variables.
      
      The size of extent_state gets reduced by 8B and there are usually a lot
      of slab objects.
      
      struct extent_state {
      	u64                        start;                /*     0     8 */
      	u64                        end;                  /*     8     8 */
      	struct rb_node             rb_node;              /*    16    24 */
      	wait_queue_head_t          wq;                   /*    40    24 */
      	/* --- cacheline 1 boundary (64 bytes) --- */
      	atomic_t                   refs;                 /*    64     4 */
      
      	/* XXX 4 bytes hole, try to pack */
      
      	long unsigned int          state;                /*    72     8 */
      	u64                        private;              /*    80     8 */
      
      	/* size: 88, cachelines: 2, members: 7 */
      	/* sum members: 84, holes: 1, sum holes: 4 */
      	/* last cacheline: 24 bytes */
      };
      Signed-off-by: NDavid Sterba <dsterba@suse.cz>
      Signed-off-by: NChris Mason <clm@fb.com>
      9ee49a04
  18. 20 1月, 2015 1 次提交
  19. 13 12月, 2014 3 次提交
  20. 21 11月, 2014 4 次提交
    • F
      Btrfs: avoid premature -ENOMEM in clear_extent_bit() · c7bc6319
      Filipe Manana 提交于
      We try to allocate an extent state structure before acquiring the extent
      state tree's spinlock as we might need a new one later and therefore avoid
      doing later an atomic allocation while holding the tree's spinlock. However
      we returned -ENOMEM if that initial non-atomic allocation failed, which is
      a bit excessive since we might end up not needing the pre-allocated extent
      state at all - for the case where the tree doesn't have any extent states
      that cover the input range and cover too any other range. Therefore don't
      return -ENOMEM if that pre-allocation fails.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      c7bc6319
    • F
      Btrfs: avoid returning -ENOMEM in convert_extent_bit() too early · c8fd3de7
      Filipe Manana 提交于
      We try to allocate an extent state before acquiring the tree's spinlock
      just in case we end up needing to split an existing extent state into two.
      If that allocation failed, we would return -ENOMEM.
      However, our only single caller (transaction/log commit code), passes in
      an extent state that was cached from a call to find_first_extent_bit() and
      that has a very high chance to match exactly the input range (always true
      for a transaction commit and very often, but not always, true for a log
      commit) - in this case we end up not needing at all that initial extent
      state used for an eventual split. Therefore just don't return -ENOMEM if
      we can't allocate the temporary extent state, since we might not need it
      at all, and if we end up needing one, we'll do it later anyway.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      c8fd3de7
    • F
      Btrfs: make find_first_extent_bit be able to cache any state · e38e2ed7
      Filipe Manana 提交于
      Right now the only caller of find_first_extent_bit() that is interested
      in caching extent states (transaction or log commit), never gets an extent
      state cached. This is because find_first_extent_bit() only caches states
      that have at least one of the flags EXTENT_IOBITS or EXTENT_BOUNDARY, and
      the transaction/log commit caller always passes a tree that doesn't have
      ever extent states with any of those flags (they can only have one of the
      following flags: EXTENT_DIRTY, EXTENT_NEW or EXTENT_NEED_WAIT).
      
      This change together with the following one in the patch series (titled
      "Btrfs: avoid returning -ENOMEM in convert_extent_bit() too early") will
      help reduce significantly the chances of calls to convert_extent_bit()
      fail with -ENOMEM when called from the transaction/log commit code.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      e38e2ed7
    • F
      Btrfs: set page and mapping error on compressed write failure · 704de49d
      Filipe Manana 提交于
      If we fail in submit_compressed_extents() before calling btrfs_submit_compressed_write(),
      we start and end the writeback for the pages (clear their dirty flag, unlock them, etc)
      but we don't tag the pages, nor the inode's mapping, with an error. This makes it
      impossible for a caller of filemap_fdatawait_range() (fsync, or transaction commit
      for e.g.) know that there was an error.
      
      Note that the return value of submit_compressed_extents() is useless, as that function
      is executed by a workqueue task and not directly by the fill_delalloc callback. This
      means the writepage/s callbacks of the inode's address space operations don't get that
      return value.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      704de49d
  21. 04 10月, 2014 3 次提交
    • F
      Btrfs: be aware of btree inode write errors to avoid fs corruption · 656f30db
      Filipe Manana 提交于
      While we have a transaction ongoing, the VM might decide at any time
      to call btree_inode->i_mapping->a_ops->writepages(), which will start
      writeback of dirty pages belonging to btree nodes/leafs. This call
      might return an error or the writeback might finish with an error
      before we attempt to commit the running transaction. If this happens,
      we might have no way of knowing that such error happened when we are
      committing the transaction - because the pages might no longer be
      marked dirty nor tagged for writeback (if a subsequent modification
      to the extent buffer didn't happen before the transaction commit) which
      makes filemap_fdata[write|wait]_range unable to find such pages (even
      if they're marked with SetPageError).
      So if this happens we must abort the transaction, otherwise we commit
      a super block with btree roots that point to btree nodes/leafs whose
      content on disk is invalid - either garbage or the content of some
      node/leaf from a past generation that got cowed or deleted and is no
      longer valid (for this later case we end up getting error messages like
      "parent transid verify failed on 10826481664 wanted 25748 found 29562"
      when reading btree nodes/leafs from disk).
      
      Note that setting and checking AS_EIO/AS_ENOSPC in the btree inode's
      i_mapping would not be enough because we need to distinguish between
      log tree extents (not fatal) vs non-log tree extents (fatal) and
      because the next call to filemap_fdatawait_range() will catch and clear
      such errors in the mapping - and that call might be from a log sync and
      not from a transaction commit, which means we would not know about the
      error at transaction commit time. Also, checking for the eb flag
      EXTENT_BUFFER_IOERR at transaction commit time isn't done and would
      not be completely reliable, as the eb might be removed from memory and
      read back when trying to get it, which clears that flag right before
      reading the eb's pages from disk, making us not know about the previous
      write error.
      
      Using the new 3 flags for the btree inode also makes us achieve the
      goal of AS_EIO/AS_ENOSPC when writepages() returns success, started
      writeback for all dirty pages and before filemap_fdatawait_range() is
      called, the writeback for all dirty pages had already finished with
      errors - because we were not using AS_EIO/AS_ENOSPC,
      filemap_fdatawait_range() would return success, as it could not know
      that writeback errors happened (the pages were no longer tagged for
      writeback).
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      656f30db
    • L
      Btrfs: fix crash of btrfs_release_extent_buffer_page · 81465028
      Liu Bo 提交于
      This is actually inspired by Filipe's patch.  When write_one_eb() fails on
      submit_extent_page(), it'll give up writing this eb and mark it with
      EXTENT_BUFFER_IOERR.  So if it's not the last page that encounter the failure,
      there are some left pages which remain DIRTY, and if a later COW on this eb
      happens, ie. eb is COWed and freed, it'd run into BUG_ON in
      btrfs_release_extent_buffer_page() for the DIRTY page, ie. BUG_ON(PageDirty(page));
      
      This adds the missing clear_page_dirty_for_io() for the rest pages of eb.
      Signed-off-by: NLiu Bo <bo.li.liu@oracle.com>
      Reviewed-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      81465028
    • F
      Btrfs: add missing end_page_writeback on submit_extent_page failure · 55e3bd2e
      Filipe Manana 提交于
      If submit_extent_page() fails in write_one_eb(), we end up with the current
      page not marked dirty anymore, unlocked and marked for writeback. But we never
      end up calling end_page_writeback() against the page, which will make calls to
      filemap_fdatawait_range (e.g. at transaction commit time) hang forever waiting
      for the writeback bit to be cleared from the page.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Reviewed-by: NLiu Bo <bo.li.liu@oracle.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      55e3bd2e
  22. 02 10月, 2014 2 次提交
  23. 18 9月, 2014 8 次提交
    • M
      Btrfs: cleanup the read failure record after write or when the inode is freeing · f612496b
      Miao Xie 提交于
      After the data is written successfully, we should cleanup the read failure record
      in that range because
      - If we set data COW for the file, the range that the failure record pointed to is
        mapped to a new place, so it is invalid.
      - If we set no data COW for the file, and if there is no error during writting,
        the corrupted data is corrected, so the failure record can be removed. And if
        some errors happen on the mirrors, we also needn't worry about it because the
        failure record will be recreated if we read the same place again.
      
      Sometimes, we may fail to correct the data, so the failure records will be left
      in the tree, we need free them when we free the inode or the memory leak happens.
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      f612496b
    • M
      Btrfs: implement repair function when direct read fails · 8b110e39
      Miao Xie 提交于
      This patch implement data repair function when direct read fails.
      
      The detail of the implementation is:
      - When we find the data is not right, we try to read the data from the other
        mirror.
      - When the io on the mirror ends, we will insert the endio work into the
        dedicated btrfs workqueue, not common read endio workqueue, because the
        original endio work is still blocked in the btrfs endio workqueue, if we
        insert the endio work of the io on the mirror into that workqueue, deadlock
        would happen.
      - After we get right data, we write it back to the corrupted mirror.
      - And if the data on the new mirror is still corrupted, we will try next
        mirror until we read right data or all the mirrors are traversed.
      - After the above work, we set the uptodate flag according to the result.
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      8b110e39
    • M
      Btrfs: modify clean_io_failure and make it suit direct io · 1203b681
      Miao Xie 提交于
      We could not use clean_io_failure in the direct IO path because it got the
      filesystem information from the page structure, but the page in the direct
      IO bio didn't have the filesystem information in its structure. So we need
      modify it and pass all the information it need by parameters.
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      1203b681
    • M
      Btrfs: modify repair_io_failure and make it suit direct io · ffdd2018
      Miao Xie 提交于
      The original code of repair_io_failure was just used for buffered read,
      because it got some filesystem data from page structure, it is safe for
      the page in the page cache. But when we do a direct read, the pages in bio
      are not in the page cache, that is there is no filesystem data in the page
      structure. In order to implement direct read data repair, we need modify
      repair_io_failure and pass all filesystem data it need by function
      parameters.
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      ffdd2018
    • M
      Btrfs: split bio_readpage_error into several functions · 2fe6303e
      Miao Xie 提交于
      The data repair function of direct read will be implemented later, and some code
      in bio_readpage_error will be reused, so split bio_readpage_error into
      several functions which will be used in direct read repair later.
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      2fe6303e
    • M
      454ff3de
    • M
      Btrfs: fix missing error handler if submiting re-read bio fails · 6c387ab2
      Miao Xie 提交于
      We forgot to free failure record and bio after submitting re-read bio failed,
      fix it.
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      6c387ab2
    • M
      Btrfs: do file data check by sub-bio's self · c1dc0896
      Miao Xie 提交于
      Direct IO splits the original bio to several sub-bios because of the limit of
      raid stripe, and the filesystem will wait for all sub-bios and then run final
      end io process.
      
      But it was very hard to implement the data repair when dio read failure happens,
      because at the final end io function, we didn't know which mirror the data was
      read from. So in order to implement the data repair, we have to move the file data
      check in the final end io function to the sub-bio end io function, in which we can
      get the mirror number of the device we access. This patch did this work as the
      first step of the direct io data repair implementation.
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      c1dc0896