    btrfs: fix a race between scrub and block group removal/allocation · 2473d24f
    Filipe Manana committed
    When scrub is verifying the extents of a block group for a device, it is
    possible that the corresponding block group gets removed and its logical
    address and device extents get used for a new block group allocation.
    When this happens scrub incorrectly reports that errors were detected
    and, if the new block group has a different profile than the old,
    deleted block group, we can crash due to a NULL pointer dereference.
    Possibly other unexpected and weird consequences can happen as well.
    
    Consider the following sequence of actions that leads to the null pointer
    dereference crash when scrub is running in parallel with balance:
    
    1) Balance sets block group X to read-only mode and starts relocating it.
       Block group X is a metadata block group, has a raid1 profile (two
       device extents, each one in a different device) and a logical address
       of 19424870400;
    
    2) Scrub is running and finds device extent E, which belongs to block
       group X. It enters scrub_stripe() to find all extents allocated to
       block group X (the search is done using the extent tree);
    
    3) Balance finishes relocating block group X and removes block group X;
    
    4) Balance starts relocating another block group and when trying to
       commit the current transaction as part of the preparation step
       (prepare_to_relocate()), it blocks because scrub is running;
    
    5) The scrub task finds the metadata extent at the logical address
       19425001472 and marks the pages of the extent to be read by a bio
       (struct scrub_bio). The extent item's flags, which have the bit
       BTRFS_EXTENT_FLAG_TREE_BLOCK set, are added to each page (struct
       scrub_page). It is these flags in the scrub pages that tell the
       bio's end io function (scrub_bio_end_io_worker) which type of extent
       it is dealing with. At this point we end up with 4 pages in a bio
       which is ready for submission (the metadata extent has a size of
       16Kb, so that gives 4 pages on x86);
    
    6) At the next iteration of scrub_stripe(), scrub checks that there is a
       pause request from the relocation task trying to commit a transaction,
       therefore it submits the pending bio and pauses, waiting for the
       transaction commit to complete before resuming;
    
    7) The relocation task commits the transaction. The device extent E, that
       was used by our block group X, is now available for allocation, since
       the commit root for the device tree was swapped by the transaction
       commit;
    
    8) Another task doing a direct IO write allocates a new data block group Y
       which ends up using device extent E. This new block group Y also ends up
       getting the same logical address that block group X had: 19424870400.
       This happens because block group X was the block group with the highest
       logical address and, when allocating Y, find_next_chunk() returns the
       end offset of the current last block group to be used as the logical
       address for the new block group, which is
    
            18351128576 + 1073741824 = 19424870400
    
       So our new block group Y has the same logical address and device extent
       that block group X had. However Y is a data block group, while X was
       a metadata one, and Y has a raid0 profile, while X had a raid1 profile;
    
    9) After allocating block group Y, the direct IO submits a bio to write
       to device extent E;
    
    10) The read bio submitted by scrub reads the 4 pages (16Kb) from device
        extent E, which now correspond to the data written by the task that
        did a direct IO write. Then at the end io function associated with
        the bio, scrub_bio_end_io_worker(), we call scrub_block_complete()
        which calls scrub_checksum(). This latter function checks the flags
        of the first page, and sees that the bit BTRFS_EXTENT_FLAG_TREE_BLOCK
        is set in the flags, so it assumes it has a metadata extent and
        then calls scrub_checksum_tree_block(). That function returns an
        error, since interpreting data as a metadata extent causes the
        checksum verification to fail.
    
        So this makes scrub_checksum() call scrub_handle_errored_block(),
        which determines 'failed_mirror_index' to be 1, since the device
        extent E was allocated as the second mirror of block group X.
    
        It allocates BTRFS_MAX_MIRRORS scrub_block structures as an array at
        'sblocks_for_recheck', and all the memory is initialized to zeroes by
        kcalloc().
    
        After that it calls scrub_setup_recheck_block(), which is responsible
        for filling each of those structures. However, when that function
        calls btrfs_map_sblock() against the logical address of the metadata
        extent, 19425001472, it gets a struct btrfs_bio ('bbio') that matches
        the current block group Y. However block group Y has a raid0 profile
        and not a raid1 profile like X had, so the following call returns 1:
    
           scrub_nr_raid_mirrors(bbio)
    
        And as a result scrub_setup_recheck_block() only initializes the
        first (index 0) scrub_block structure in 'sblocks_for_recheck'.
    
        Then scrub_recheck_block() is called by scrub_handle_errored_block()
        with the second (index 1) scrub_block structure as the argument,
        because 'failed_mirror_index' was previously set to 1.
        This scrub_block was not initialized by scrub_setup_recheck_block(),
        so it has zero pages, its 'page_count' member is 0 and its 'pagev'
        page array has all members pointing to NULL.
    
        Finally when scrub_recheck_block() calls scrub_recheck_block_checksum()
        we have a NULL pointer dereference when accessing the flags of the first
        page, as pagev[0] is NULL:
    
        static void scrub_recheck_block_checksum(struct scrub_block *sblock)
        {
            (...)
            if (sblock->pagev[0]->flags & BTRFS_EXTENT_FLAG_DATA)
                scrub_checksum_data(sblock);
            (...)
        }
    
        Producing a stack trace like the following:
    
        [542998.008985] BUG: kernel NULL pointer dereference, address: 0000000000000028
        [542998.010238] #PF: supervisor read access in kernel mode
        [542998.010878] #PF: error_code(0x0000) - not-present page
        [542998.011516] PGD 0 P4D 0
        [542998.011929] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI
        [542998.012786] CPU: 3 PID: 4846 Comm: kworker/u8:1 Tainted: G    B   W         5.6.0-rc7-btrfs-next-58 #1
        [542998.014524] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014
        [542998.016065] Workqueue: btrfs-scrub btrfs_work_helper [btrfs]
        [542998.017255] RIP: 0010:scrub_recheck_block_checksum+0xf/0x20 [btrfs]
        [542998.018474] Code: 4c 89 e6 ...
        [542998.021419] RSP: 0018:ffffa7af0375fbd8 EFLAGS: 00010202
        [542998.022120] RAX: 0000000000000000 RBX: ffff9792e674d120 RCX: 0000000000000000
        [542998.023178] RDX: 0000000000000001 RSI: ffff9792e674d120 RDI: ffff9792e674d120
        [542998.024465] RBP: 0000000000000000 R08: 0000000000000067 R09: 0000000000000001
        [542998.025462] R10: ffffa7af0375fa50 R11: 0000000000000000 R12: ffff9791f61fe800
        [542998.026357] R13: ffff9792e674d120 R14: 0000000000000001 R15: ffffffffc0e3dfc0
        [542998.027237] FS:  0000000000000000(0000) GS:ffff9792fb200000(0000) knlGS:0000000000000000
        [542998.028327] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        [542998.029261] CR2: 0000000000000028 CR3: 00000000b3b18003 CR4: 00000000003606e0
        [542998.030301] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        [542998.031316] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        [542998.032380] Call Trace:
        [542998.032752]  scrub_recheck_block+0x162/0x400 [btrfs]
        [542998.033500]  ? __alloc_pages_nodemask+0x31e/0x460
        [542998.034228]  scrub_handle_errored_block+0x6f8/0x1920 [btrfs]
        [542998.035170]  scrub_bio_end_io_worker+0x100/0x520 [btrfs]
        [542998.035991]  btrfs_work_helper+0xaa/0x720 [btrfs]
        [542998.036735]  process_one_work+0x26d/0x6a0
        [542998.037275]  worker_thread+0x4f/0x3e0
        [542998.037740]  ? process_one_work+0x6a0/0x6a0
        [542998.038378]  kthread+0x103/0x140
        [542998.038789]  ? kthread_create_worker_on_cpu+0x70/0x70
        [542998.039419]  ret_from_fork+0x3a/0x50
        [542998.039875] Modules linked in: dm_snapshot dm_thin_pool ...
        [542998.047288] CR2: 0000000000000028
        [542998.047724] ---[ end trace bde186e176c7f96a ]---
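
    The address reuse from step 8 and the mirror count mismatch from step 10
    can be reproduced in a small user space model. This is only a sketch
    under stated assumptions, not the kernel code: every model_* name is
    hypothetical, MODEL_MAX_MIRRORS is an illustrative stand-in for
    BTRFS_MAX_MIRRORS, and calloc() plays the role of kcalloc(). Instead of
    dereferencing pagev[0], the model only reports whether doing so would
    be safe.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Illustrative stand-ins, not the kernel's values or types. */
#define MODEL_MAX_MIRRORS     3
#define MODEL_PAGES_PER_BLOCK 4   /* 16Kb metadata extent, 4Kb pages */

struct model_page { uint64_t flags; };

struct model_block {
    int page_count;
    struct model_page *pagev[MODEL_PAGES_PER_BLOCK];
};

/* Step 8: a new chunk is placed at the end offset of the last block group. */
uint64_t model_next_chunk_start(uint64_t last_start, uint64_t last_len)
{
    return last_start + last_len;
}

/* Fill only the first nr_mirrors blocks; the rest stay zeroed. */
void model_setup_recheck(struct model_block *blocks, int nr_mirrors,
                         struct model_page *pages)
{
    for (int m = 0; m < nr_mirrors; m++) {
        for (int p = 0; p < MODEL_PAGES_PER_BLOCK; p++)
            blocks[m].pagev[p] = &pages[m * MODEL_PAGES_PER_BLOCK + p];
        blocks[m].page_count = MODEL_PAGES_PER_BLOCK;
    }
}

/* Would reading blocks[idx].pagev[0]->flags be safe? */
int model_recheck_is_safe(const struct model_block *blocks, int idx)
{
    return blocks[idx].pagev[0] != NULL;
}

/*
 * nr_mirrors_now: mirror count of the block group currently mapped at the
 *                 logical address (2 for raid1 X, 1 for raid0 Y).
 * failed_mirror_index: index computed while X still existed.
 * Returns 1 if the recheck is safe, 0 if it would dereference NULL,
 * -1 on allocation failure.
 */
int model_scenario(int nr_mirrors_now, int failed_mirror_index)
{
    struct model_page pages[MODEL_MAX_MIRRORS * MODEL_PAGES_PER_BLOCK] = {{0}};
    struct model_block *blocks;
    int safe;

    assert(nr_mirrors_now >= 0 && nr_mirrors_now <= MODEL_MAX_MIRRORS);
    assert(failed_mirror_index >= 0 &&
           failed_mirror_index < MODEL_MAX_MIRRORS);

    /* Zero-initialized array, like 'sblocks_for_recheck'. */
    blocks = calloc(MODEL_MAX_MIRRORS, sizeof(*blocks));
    if (!blocks)
        return -1;

    model_setup_recheck(blocks, nr_mirrors_now, pages);
    safe = model_recheck_is_safe(blocks, failed_mirror_index);
    free(blocks);
    return safe;
}
```

    With two mirrors (raid1, block group X) index 1 is populated and the
    recheck is safe. With one mirror (raid0, block group Y) index 1 is
    still all zeroes, which is the state that leads to the crash above.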
    
    This issue has been around for a long time, possibly since scrub was
    first introduced.
    The last time I ran into it was over 2 years ago. After recently fixing
    fstests to pass the "--full-balance" command line option to btrfs-progs
    when doing balance, several tests started to more heavily exercise balance
    with fsstress, scrub and other operations in parallel, and therefore
    started to hit this issue again (with btrfs/061 for example).
    
    Fix this by having scrub increment the 'trimming' counter of the block
    group, which pins the block group in such a way that it guarantees neither
    its logical address nor device extents can be reused by future block group
    allocations until we decrement the 'trimming' counter. Also make sure that
    on each iteration of scrub_stripe() we stop scrubbing the block group if
    it was removed already.
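
    The pinning idea can be sketched as a single-threaded user space model.
    This is not the patch itself: the model_* names are hypothetical,
    'pin_count' stands in for the block group's 'trimming' counter, and the
    locking that real concurrent code would need is deliberately left out.
    The concurrent removal is simulated by flipping a flag at a chosen
    iteration.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the relevant block group state. */
struct model_bg {
    int  pin_count;  /* models the 'trimming' counter */
    bool removed;    /* set once the block group has been removed */
};

void model_pin(struct model_bg *bg)
{
    bg->pin_count++;
}

void model_unpin(struct model_bg *bg)
{
    assert(bg->pin_count > 0);
    bg->pin_count--;
}

/*
 * The logical address and device extents may only be handed to a new
 * block group once the old one is removed and nobody pins it anymore.
 */
bool model_can_reuse(const struct model_bg *bg)
{
    return bg->removed && bg->pin_count == 0;
}

/*
 * Model of the scrub loop: pin for the whole duration and stop as soon
 * as the block group is seen as removed. 'remove_at' is the iteration at
 * which a simulated concurrent removal happens (-1 for never).
 * Returns the number of iterations that actually scrubbed something.
 */
int model_scrub(struct model_bg *bg, int nr_iterations, int remove_at)
{
    int done = 0;

    model_pin(bg);
    for (int i = 0; i < nr_iterations; i++) {
        if (i == remove_at)
            bg->removed = true;      /* simulated concurrent removal */
        if (bg->removed)
            break;                   /* stop scrubbing a removed group */
        assert(!model_can_reuse(bg));
        done++;
    }
    model_unpin(bg);
    return done;
}
```

    In this model a block group that is removed while pinned never becomes
    reusable until model_unpin() runs, so a read submitted during the loop
    cannot land on storage that was handed to a different block group.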
    
    A later patch in the series will rename the block group's 'trimming'
    counter and its helpers to a more generic name, since now it is not used
    exclusively for pinning while trimming anymore.
    
    CC: stable@vger.kernel.org # 4.4+
    Signed-off-by: Filipe Manana <fdmanana@suse.com>
    Signed-off-by: David Sterba <dsterba@suse.com>