1. 25 May 2020, 3 commits
    • btrfs: scrub, only lookup for csums if we are dealing with a data extent · 89490303
      Filipe Manana committed
      When scrubbing a stripe, whenever we find an extent we look up its
      checksums in the checksum tree. However we do that even for metadata
      extents, which do not have checksum items stored in the checksum tree;
      only data extents do.
      
      So do the checksum lookup only if we are processing a data extent.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: rename member 'trimming' of block group to a more generic name · 6b7304af
      Filipe Manana committed
      Back in 2014, commit 04216820 ("Btrfs: fix race between fs trimming
      and block group remove/allocation"), I added the 'trimming' member to the
      block group structure. Its purpose was to prevent races between trimming
      and block group deletion/allocation by pinning the block group in a way
      that prevents its logical address and device extents from being reused
      while trimming is in progress, so that another task cannot delete the
      block group and have yet another task allocate a new block group that
      gets the same logical address and device extents while the trimming task
      is still running.
      
      After the previous fix for scrub (patch "btrfs: fix a race between scrub
      and block group removal/allocation"), scrub now also has the same needs that
      trimming has, so the member name 'trimming' no longer makes sense.
      Since there is already a 'pinned' member in the block group that refers
      to space reservations (pinned bytes), rename the member to 'frozen',
      add a comment on top of it describing its general purpose, and rename
      the helpers that increment and decrement the counter as well, to match
      the new member name.
      
      The next patch in the series will move the helpers into a more suitable
      file (from free-space-cache.c to block-group.c).
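      
      A sketch of the rename's shape (member and helper names follow the
      description above; the comment wording is illustrative):
      
        struct btrfs_block_group {
                /* ... */
                /*
                 * Number of tasks that need the block group's logical address
                 * and device extents pinned, i.e. not reusable by new block
                 * groups: trimming and, after the previous patch, scrub.
                 */
                atomic_t frozen;        /* was: atomic_t trimming */
                /* ... */
        };
      
        /* Previously btrfs_get_block_group_trimming() and _put_: */
        void btrfs_freeze_block_group(struct btrfs_block_group *cache);
        void btrfs_unfreeze_block_group(struct btrfs_block_group *cache);
      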
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: fix a race between scrub and block group removal/allocation · 2473d24f
      Filipe Manana committed
      When scrub is verifying the extents of a block group for a device, it is
      possible that the corresponding block group gets removed and its logical
      address and device extents get used for a new block group allocation.
      When this happens scrub incorrectly reports that errors were detected
      and, if the new block group has a different profile than the old,
      deleted block group, we can crash due to a NULL pointer dereference.
      Possibly other unexpected and weird consequences can happen as well.
      
      Consider the following sequence of actions that leads to the null pointer
      dereference crash when scrub is running in parallel with balance:
      
      1) Balance sets block group X to read-only mode and starts relocating it.
         Block group X is a metadata block group, has a raid1 profile (two
         device extents, each one in a different device) and a logical address
         of 19424870400;
      
      2) Scrub is running and finds device extent E, which belongs to block
         group X. It enters scrub_stripe() to find all extents allocated to
         block group X, the search is done using the extent tree;
      
      3) Balance finishes relocating block group X and removes block group X;
      
      4) Balance starts relocating another block group and when trying to
         commit the current transaction as part of the preparation step
         (prepare_to_relocate()), it blocks because scrub is running;
      
      5) The scrub task finds the metadata extent at the logical address
         19425001472 and marks the pages of the extent to be read by a bio
         (struct scrub_bio). The extent item's flags, which have the bit
         BTRFS_EXTENT_FLAG_TREE_BLOCK set, are added to each page (struct
         scrub_page). It is these flags in the scrub pages that tell the
         bio's end io function (scrub_bio_end_io_worker) which type of extent
         it is dealing with. At this point we end up with 4 pages in a bio
         which is ready for submission (the metadata extent has a size of
         16Kb, so that gives 4 pages on x86);
      
      6) At the next iteration of scrub_stripe(), scrub checks that there is a
         pause request from the relocation task trying to commit a transaction,
         therefore it submits the pending bio and pauses, waiting for the
         transaction commit to complete before resuming;
      
      7) The relocation task commits the transaction. The device extent E, that
         was used by our block group X, is now available for allocation, since
         the commit root for the device tree was swapped by the transaction
         commit;
      
      8) Another task doing a direct IO write allocates a new data block group Y
         which ends using device extent E. This new block group Y also ends up
         getting the same logical address that block group X had: 19424870400.
         This happens because block group X was the block group with the highest
         logical address and, when allocating Y, find_next_chunk() returns the
         end offset of the current last block group to be used as the logical
         address for the new block group, which is
      
              18351128576 + 1073741824 = 19424870400
      
         So our new block group Y has the same logical address and device extent
         that block group X had. However Y is a data block group, while X was
         a metadata one, and Y has a raid0 profile, while X had a raid1 profile;
      
      9) After allocating block group Y, the direct IO submits a bio to write
         to device extent E;
      
      10) The read bio submitted by scrub reads the 4 pages (16Kb) from device
          extent E, which now correspond to the data written by the task that
          did a direct IO write. Then at the end io function associated with
          the bio, scrub_bio_end_io_worker(), we call scrub_block_complete()
          which calls scrub_checksum(). This later function checks the flags
          of the first page, and sees that the bit BTRFS_EXTENT_FLAG_TREE_BLOCK
          is set in the flags, so it assumes it has a metadata extent and
          then calls scrub_checksum_tree_block(). That function returns an
          error, since interpreting data as a metadata extent causes the
          checksum verification to fail.
      
          So this makes scrub_checksum() call scrub_handle_errored_block(),
          which determines 'failed_mirror_index' to be 1, since the device
          extent E was allocated as the second mirror of block group X.
      
          It allocates BTRFS_MAX_MIRRORS scrub_block structures as an array at
          'sblocks_for_recheck', and all the memory is initialized to zeroes by
          kcalloc().
      
          After that it calls scrub_setup_recheck_block(), which is responsible
          for filling each of those structures. However, when that function
          calls btrfs_map_sblock() against the logical address of the metadata
          extent, 19425001472, it gets a struct btrfs_bio ('bbio') that matches
          the current block group Y. However block group Y has a raid0 profile
          and not a raid1 profile like X had, so the following call returns 1:
      
             scrub_nr_raid_mirrors(bbio)
      
          And as a result scrub_setup_recheck_block() only initializes the
          first (index 0) scrub_block structure in 'sblocks_for_recheck'.
      
          Then scrub_recheck_block() is called by scrub_handle_errored_block()
          with the second (index 1) scrub_block structure as the argument,
          because 'failed_mirror_index' was previously set to 1.
          This scrub_block was not initialized by scrub_setup_recheck_block(),
          so it has zero pages, its 'page_count' member is 0 and its 'pagev'
          page array has all members pointing to NULL.
      
          Finally when scrub_recheck_block() calls scrub_recheck_block_checksum()
          we have a NULL pointer dereference when accessing the flags of the first
          page, as pagev[0] is NULL:
      
          static void scrub_recheck_block_checksum(struct scrub_block *sblock)
          {
              (...)
              if (sblock->pagev[0]->flags & BTRFS_EXTENT_FLAG_DATA)
                  scrub_checksum_data(sblock);
              (...)
          }
      
          Producing a stack trace like the following:
      
          [542998.008985] BUG: kernel NULL pointer dereference, address: 0000000000000028
          [542998.010238] #PF: supervisor read access in kernel mode
          [542998.010878] #PF: error_code(0x0000) - not-present page
          [542998.011516] PGD 0 P4D 0
          [542998.011929] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI
          [542998.012786] CPU: 3 PID: 4846 Comm: kworker/u8:1 Tainted: G    B   W         5.6.0-rc7-btrfs-next-58 #1
          [542998.014524] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014
          [542998.016065] Workqueue: btrfs-scrub btrfs_work_helper [btrfs]
          [542998.017255] RIP: 0010:scrub_recheck_block_checksum+0xf/0x20 [btrfs]
          [542998.018474] Code: 4c 89 e6 ...
          [542998.021419] RSP: 0018:ffffa7af0375fbd8 EFLAGS: 00010202
          [542998.022120] RAX: 0000000000000000 RBX: ffff9792e674d120 RCX: 0000000000000000
          [542998.023178] RDX: 0000000000000001 RSI: ffff9792e674d120 RDI: ffff9792e674d120
          [542998.024465] RBP: 0000000000000000 R08: 0000000000000067 R09: 0000000000000001
          [542998.025462] R10: ffffa7af0375fa50 R11: 0000000000000000 R12: ffff9791f61fe800
          [542998.026357] R13: ffff9792e674d120 R14: 0000000000000001 R15: ffffffffc0e3dfc0
          [542998.027237] FS:  0000000000000000(0000) GS:ffff9792fb200000(0000) knlGS:0000000000000000
          [542998.028327] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
          [542998.029261] CR2: 0000000000000028 CR3: 00000000b3b18003 CR4: 00000000003606e0
          [542998.030301] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
          [542998.031316] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
          [542998.032380] Call Trace:
          [542998.032752]  scrub_recheck_block+0x162/0x400 [btrfs]
          [542998.033500]  ? __alloc_pages_nodemask+0x31e/0x460
          [542998.034228]  scrub_handle_errored_block+0x6f8/0x1920 [btrfs]
          [542998.035170]  scrub_bio_end_io_worker+0x100/0x520 [btrfs]
          [542998.035991]  btrfs_work_helper+0xaa/0x720 [btrfs]
          [542998.036735]  process_one_work+0x26d/0x6a0
          [542998.037275]  worker_thread+0x4f/0x3e0
          [542998.037740]  ? process_one_work+0x6a0/0x6a0
          [542998.038378]  kthread+0x103/0x140
          [542998.038789]  ? kthread_create_worker_on_cpu+0x70/0x70
          [542998.039419]  ret_from_fork+0x3a/0x50
          [542998.039875] Modules linked in: dm_snapshot dm_thin_pool ...
          [542998.047288] CR2: 0000000000000028
          [542998.047724] ---[ end trace bde186e176c7f96a ]---
      
      This issue has been around for a long time, possibly since scrub was
      introduced.
      The last time I ran into it was over 2 years ago. After recently fixing
      fstests to pass the "--full-balance" command line option to btrfs-progs
      when doing balance, several tests started to more heavily exercise balance
      with fsstress, scrub and other operations in parallel, and therefore
      started to hit this issue again (with btrfs/061 for example).
      
      Fix this by having scrub increment the 'trimming' counter of the block
      group, which pins the block group in such a way that it guarantees neither
      its logical address nor device extents can be reused by future block group
      allocations until we decrement the 'trimming' counter. Also make sure that
      on each iteration of scrub_stripe() we stop scrubbing the block group if
      it was removed already.
      
      A later patch in the series will rename the block group's 'trimming'
      counter and its helpers to a more generic name, since now it is not used
      exclusively for pinning while trimming anymore.
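      
      A sketch of the resulting loop body in scrub_enumerate_chunks()
      (argument lists elided; the helper names are the pre-rename ones, as
      noted above):
      
        /* Pin the block group so its address space cannot be reused. */
        spin_lock(&cache->lock);
        if (cache->removed) {
                /* Deleted while we were searching, nothing to scrub. */
                spin_unlock(&cache->lock);
                btrfs_put_block_group(cache);
                continue;
        }
        btrfs_get_block_group_trimming(cache);
        spin_unlock(&cache->lock);
      
        ret = scrub_chunk(/* ... */);
      
        /* The logical address and device extents stayed pinned until now. */
        btrfs_put_block_group_trimming(cache);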
      
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  2. 24 March 2020, 5 commits
  3. 24 January 2020, 1 commit
    • btrfs: scrub: Require mandatory block group RO for dev-replace · 1bbb97b8
      Qu Wenruo committed
      [BUG]
      For dev-replace test cases with fsstress, like btrfs/06[45] and
      btrfs/071, looped runs can lead to random failures where scrub finds
      checksum errors.
      
      The probability is not high, around 1/20 to 1/100, but it means data
      corruption.
      
      The bug is observable after commit b12de528 ("btrfs: scrub: Don't
      check free space before marking a block group RO").
      
      [CAUSE]
      Dev-replace has two sources of writes:
      
      - Write duplication
        All writes to source device will also be duplicated to target device.
      
        Content:	Not yet persisted data/meta
      
      - Scrub copy
        Dev-replace reuses the scrub code to iterate through existing extents,
        and copies the verified data to the target device.
      
        Content:	Previously persisted data and metadata
      
      The difference in contents makes the following race possible:
      	Regular Writer		|	Dev-replace
      -----------------------------------------------------------------
        ^                             |
        | Preallocate one data extent |
        | at bytenr X, len 1M		|
        v				|
        ^ Commit transaction		|
        | Now extent [X, X+1M) is in  |
        v commit root			|
       ================== Dev replace starts =========================
        				| ^
      				| | Scrub extent [X, X+1M)
      				| | Read [X, X+1M)
      				| | (The content is mostly garbage
      				| |  since it's preallocated)
        ^				| v
        | Write back happens for	|
        | extent [X, X+512K)		|
        | New data writes to both	|
        | source and target dev.	|
        v				|
      				| ^
      				| | Scrub writes back extent [X, X+1M)
      				| | to target device.
      				| | This will overwrite the new data in
      				| | [X, X+512K)
      				| v
      
      This race can only happen for nocow writes. Thus metadata and data COW
      writes are safe, as COW never overwrites extents of a previous
      transaction (in the commit root).
      
      This behavior can be confirmed by disabling all fallocate related calls
      in fsstress (*); with that, all related tests can pass a 2000 run loop.
      
      *: FSSTRESS_AVOID="-f fallocate=0 -f allocsp=0 -f zero=0 -f insert=0 \
      		   -f collapse=0 -f punch=0 -f resvsp=0"
         I didn't expect the resvsp ioctl to fall back to fallocate in VFS...
      
      [FIX]
      Make dev-replace require the block group to be marked RO, and wait for
      in-flight nocow writes before calling scrub_chunk().
      
      This patch mostly reverts commit 76a8efa1 ("btrfs: Continue replace
      when set_block_ro failed") for the dev-replace path.
      
      The side effect is that dev-replace can be stricter about available
      space, but that is definitely worth it to avoid data corruption.
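      
      A sketch of the dev-replace branch in scrub_enumerate_chunks() under
      this fix; the two wait helpers are existing btrfs functions, while the
      surrounding control flow is simplified:
      
        /* RO is now mandatory for dev-replace: no fallback to a RW group. */
        ret = btrfs_inc_block_group_ro(cache, sctx->is_dev_replace);
        if (ret && sctx->is_dev_replace)
                goto out;       /* abort rather than risk corruption */
      
        if (sctx->is_dev_replace) {
                /* Wait for in-flight nocow writes into this block group. */
                btrfs_wait_block_group_reservations(cache);
                btrfs_wait_nocow_writers(cache);
        }
      
        ret = scrub_chunk(/* ... */);
      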
      Reported-by: Filipe Manana <fdmanana@suse.com>
      Fixes: 76a8efa1 ("btrfs: Continue replace when set_block_ro failed")
      Fixes: b12de528 ("btrfs: scrub: Don't check free space before marking a block group RO")
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  4. 20 January 2020, 1 commit
    • btrfs: handle empty block_group removal for async discard · 6e80d4f8
      Dennis Zhou committed
      block_group removal is a little tricky. It can race with the extent
      allocator, the cleaner thread, and balancing. The current path is for a
      block_group to be added to the unused_bgs list. Then, when the cleaner
      thread comes around, it starts a transaction and then proceeds with
      removing the block_group. Extents that are pinned are subsequently
      removed from the pinned trees and then eventually a discard is issued
      for the entire block_group.
      
      Async discard introduces another player into the game, the discard
      workqueue. While it has none of the racing issues, the new problem is
      ensuring we don't leave free space untrimmed prior to forgetting the
      block_group.  This is handled by placing fully free block_groups on a
      separate discard queue. This is necessary to maintain discarding order
      as in the future we will slowly trim even fully free block_groups. The
      ordering helps us make progress on the same block_group rather than,
      say, on the most recently freed block_group, or needing to search
      through the fully freed block_groups at the beginning of a list to find
      where to insert.
      
      The new order of events is: a fully freed block group gets placed on the
      unused discard queue first. Once it's processed, it will be placed on
      the unused_bgs list and then the original sequence of events will
      happen, just without the final whole block_group discard.
      
      The mount flags can change when processing unused_bgs, so when flipping
      from DISCARD to DISCARD_ASYNC, the unused_bgs must be punted to the
      discard_list to be trimmed. If we flip off DISCARD_ASYNC, we punt
      free block groups on the discard_list to the unused_bg queue which will
      do the final discard for us.
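      
      A sketch of that punting decision for a fully freed block group
      (function names from the async discard series; the call site is
      illustrative):
      
        if (btrfs_test_opt(fs_info, DISCARD_ASYNC))
                /* Queue for trimming first; it reaches unused_bgs later. */
                btrfs_discard_queue_work(&fs_info->discard_ctl, block_group);
        else
                /* Old behavior: straight to the unused_bgs list. */
                btrfs_mark_bg_unused(block_group);
      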
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Dennis Zhou <dennis@kernel.org>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  5. 19 November 2019, 6 commits
    • Btrfs: fix block group remaining RO forever after error during device replace · 042528f8
      Filipe Manana committed
      When doing a device replace, while at scrub.c:scrub_enumerate_chunks(), we
      set the block group to RO mode and then wait for any ongoing writes into
      extents of the block group to complete. While doing that wait we overwrite
      the value of the variable 'ret' and can break out of the loop if an error
      happens without turning the block group back into RW mode. So what happens
      is the following:
      
      1) btrfs_inc_block_group_ro() returns 0, meaning it set the block group
         to RO mode (its ->ro field set to 1 or incremented to some value > 1);
      
      2) Then btrfs_wait_ordered_roots() returns a value > 0;
      
      3) Then if either joining or committing the transaction fails, we break
         out of the loop without calling btrfs_dec_block_group_ro(), leaving
         the block group in RO mode forever.
      
      To fix this, just remove the code that waits for ongoing writes to extents
      of the block group, since it's not needed because in the initial setup
      phase of a device replace operation, before starting to find all chunks
      and their extents, we set the target device for replace while holding
      fs_info->dev_replace->rwsem, which ensures that after releasing that
      semaphore, any writes into the source device are made to the target device
      as well (__btrfs_map_block() guarantees that). So while at
      scrub_enumerate_chunks() we only need to worry about finding and copying
      extents (from the source device to the target device) that were written
      before we started the device replace operation.
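      
      A simplified sketch of the removed, buggy pattern ('ret' from the wait
      overwrote the RO status, and the breaks skipped the cleanup):
      
        ret = btrfs_inc_block_group_ro(cache);  /* 0: group is now RO */
        /* ... */
        ret = btrfs_wait_ordered_roots(fs_info, U64_MAX,
                                       cache->start, cache->length);
        if (ret > 0) {
                trans = btrfs_join_transaction(root);
                if (IS_ERR(trans))
                        break;  /* bug: block group left RO forever */
                ret = btrfs_commit_transaction(trans);
                if (ret)
                        break;  /* bug: the same RO leak */
        }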
      
      Fixes: f0e9b7d6 ("Btrfs: fix race setting block group readonly during device replace")
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: scrub: Don't check free space before marking a block group RO · b12de528
      Qu Wenruo committed
      [BUG]
      When running btrfs/072 with only one online CPU, it has a pretty high
      chance to fail:
      
        btrfs/072 12s ... _check_dmesg: something found in dmesg (see xfstests-dev/results//btrfs/072.dmesg)
        - output mismatch (see xfstests-dev/results//btrfs/072.out.bad)
            --- tests/btrfs/072.out     2019-10-22 15:18:14.008965340 +0800
            +++ /xfstests-dev/results//btrfs/072.out.bad      2019-11-14 15:56:45.877152240 +0800
            @@ -1,2 +1,3 @@
             QA output created by 072
             Silence is golden
            +Scrub find errors in "-m dup -d single" test
            ...
      
      And with the following call trace:
      
        BTRFS info (device dm-5): scrub: started on devid 1
        ------------[ cut here ]------------
        BTRFS: Transaction aborted (error -27)
        WARNING: CPU: 0 PID: 55087 at fs/btrfs/block-group.c:1890 btrfs_create_pending_block_groups+0x3e6/0x470 [btrfs]
        CPU: 0 PID: 55087 Comm: btrfs Tainted: G        W  O      5.4.0-rc1-custom+ #13
        Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
        RIP: 0010:btrfs_create_pending_block_groups+0x3e6/0x470 [btrfs]
        Call Trace:
         __btrfs_end_transaction+0xdb/0x310 [btrfs]
         btrfs_end_transaction+0x10/0x20 [btrfs]
         btrfs_inc_block_group_ro+0x1c9/0x210 [btrfs]
         scrub_enumerate_chunks+0x264/0x940 [btrfs]
         btrfs_scrub_dev+0x45c/0x8f0 [btrfs]
         btrfs_ioctl+0x31a1/0x3fb0 [btrfs]
         do_vfs_ioctl+0x636/0xaa0
         ksys_ioctl+0x67/0x90
         __x64_sys_ioctl+0x43/0x50
         do_syscall_64+0x79/0xe0
         entry_SYSCALL_64_after_hwframe+0x49/0xbe
        ---[ end trace 166c865cec7688e7 ]---
      
      [CAUSE]
      The error number -27 is -EFBIG, returned from the following call chain:
      btrfs_end_transaction()
      |- __btrfs_end_transaction()
         |- btrfs_create_pending_block_groups()
            |- btrfs_finish_chunk_alloc()
               |- btrfs_add_system_chunk()
      
      This happens because we have used up all space of
      btrfs_super_block::sys_chunk_array.
      
      The root cause is that we have the following bad loop that creates tons
      of system chunks:
      
      1. The only SYSTEM chunk is being scrubbed
         It's very common to have only one SYSTEM chunk.
      2. A new SYSTEM bg will be allocated
         btrfs_inc_block_group_ro() checks if we have enough space after
         marking the current bg RO; if not, it allocates a new chunk.
      3. The new SYSTEM bg is still empty, so it will be reclaimed
         During the reclaim, we will mark it RO again.
      4. That newly allocated empty SYSTEM bg gets scrubbed
         We go back to step 2, as the bg is already marked RO but not yet
         cleaned up.
      
      If the cleaner kthread doesn't get executed fast enough (e.g. only one
      CPU), then we will get more and more empty SYSTEM chunks, using up all
      the space of btrfs_super_block::sys_chunk_array.
      
      [FIX]
      Since scrub/dev-replace doesn't always need to allocate new extents,
      especially chunk tree extents, we don't really need to do chunk
      pre-allocation.
      
      To break the above spiral, introduce a new parameter to
      btrfs_inc_block_group_ro(), @do_chunk_alloc, which indicates whether we
      need extra chunk pre-allocation.
      
      For relocation, we pass @do_chunk_alloc=true, while for scrub, we pass
      @do_chunk_alloc=false.
      This should keep unnecessary empty chunks from popping up for scrub.
      
      Also, since btrfs_inc_block_group_ro() now takes two parameters, add
      more comments for it.
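      
      A sketch of the new interface and its two call sites, per the
      description above:
      
        /* @do_chunk_alloc: pre-allocate a chunk in case space runs low. */
        int btrfs_inc_block_group_ro(struct btrfs_block_group *cache,
                                     bool do_chunk_alloc);
      
        /* Relocation needs the pre-allocation to make progress: */
        ret = btrfs_inc_block_group_ro(rc->block_group, true);
      
        /* Scrub does not, which breaks the empty-SYSTEM-chunk spiral: */
        ret = btrfs_inc_block_group_ro(cache, false);
      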
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: rename btrfs_block_group_cache · 32da5386
      David Sterba committed
      The type name is misleading: a single entry is named 'cache' while this
      normally means a collection of objects. Rename it everywhere. The
      identifier was also quite long, making function prototypes harder to
      format.
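      
      The rename in a nutshell (illustrative declarations):
      
        /* Before: 'cache' suggests a collection, and the name is long. */
        struct btrfs_block_group_cache *cache;
      
        /* After: */
        struct btrfs_block_group *block_group;
      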
      Suggested-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: clean up locking name in scrub_enumerate_chunks() · 3ec17a67
      Dan Carpenter committed
      The "&fs_info->dev_replace.rwsem" and "&dev_replace->rwsem" refer to
      the same lock but Smatch is not clever enough to figure that out so it
      leads to static checker warnings.  It's better to use it consistently
      anyway.
      Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: add dedicated members for start and length of a block group · b3470b5d
      David Sterba committed
      The on-disk format of the block group item makes use of the key that
      stores the offset and length. This is further used in the code, although
      it makes things harder to understand. The key is also packed, so the
      offset/length is not properly aligned as u64.
      
      Add start (key.objectid) and length (key.offset) members to the block
      group and remove the embedded key. When the item is searched or written,
      a local variable for the key is used.
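      
      A sketch of the change (using the pre-rename struct name, since this
      patch predates the rename above):
      
        struct btrfs_block_group_cache {
                /* ... */
                u64 start;      /* was key.objectid: logical start */
                u64 length;     /* was key.offset: size of the group */
                /* ... */
        };
      
        /* A key is built locally only when the item is searched or written: */
        key.objectid = cache->start;
        key.type = BTRFS_BLOCK_GROUP_ITEM_KEY;
        key.offset = cache->length;
      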
      Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: move block_group_item::used to block group · bf38be65
      David Sterba committed
      For unknown reasons, the member 'used' in the block group struct is
      stored in the b-tree item and accessed everywhere using the special
      accessor helper. Let's unify it and make it a regular member and only
      update the item before writing it to the tree.
      
      The item is still being used for flags and chunk_objectid; there's some
      duplication until the item is removed in following patches.
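      
      A sketch: 'used' becomes a plain member and the on-disk item is only
      synchronized when written out (shown as a direct field assignment
      rather than through an accessor helper):
      
        struct btrfs_block_group_cache {
                /* ... */
                u64 used;       /* regular member, no accessors needed */
                /* ... */
        };
      
        /* Just before writing the item to the tree: */
        struct btrfs_block_group_item bgi;
      
        bgi.used = cpu_to_le64(cache->used);
      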
      Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  6. 18 November 2019, 2 commits
  7. 09 September 2019, 1 commit
  8. 02 July 2019, 1 commit
  9. 01 July 2019, 4 commits
  10. 30 April 2019, 2 commits
  11. 25 February 2019, 8 commits
    • btrfs: init csum_list before possible free · e49be14b
      Dan Robertson committed
      The scrub_ctx csum_list member must be initialized before scrub_free_ctx
      is called. If the csum_list is not initialized beforehand, the
      list_empty() call in scrub_free_csums() will result in a NULL pointer
      dereference if the allocation fails in the for loop.
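      
      A sketch of the fix in scrub_setup_ctx(): the list initialization moves
      above the loop whose failure path frees the context (simplified):
      
        sctx = kzalloc(sizeof(*sctx), GFP_KERNEL);
        if (!sctx)
                goto nomem;
        /* Must precede any failure path that ends in scrub_free_ctx(). */
        INIT_LIST_HEAD(&sctx->csum_list);
      
        for (i = 0; i < SCRUB_BIOS_PER_SCTX; ++i) {
                struct scrub_bio *sbio = kzalloc(sizeof(*sbio), GFP_KERNEL);
      
                if (!sbio)
                        goto nomem;     /* scrub_free_csums() is safe now */
                /* ... */
        }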
      
      Fixes: a2de733c ("btrfs: scrub")
      CC: stable@vger.kernel.org # 3.0+
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Dan Robertson <dan@dlrobertson.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: scrub: add assertions for worker pointers · c8352942
      David Sterba committed
      The scrub worker pointers are not NULL iff the scrub is running, so
      reset them back once the last reference is dropped. Add assertions to
      the initial phase of scrub to verify that.
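      
      A sketch of the assertions and the reset, assuming the scrub workqueue
      member names in fs_info:
      
        /* Startup: pointers must not be left over from a previous run. */
        ASSERT(fs_info->scrub_workers == NULL);
        ASSERT(fs_info->scrub_wr_completion_workers == NULL);
        ASSERT(fs_info->scrub_parity_workers == NULL);
      
        /* Teardown, once the last reference is dropped: */
        fs_info->scrub_workers = NULL;
        fs_info->scrub_wr_completion_workers = NULL;
        fs_info->scrub_parity_workers = NULL;
      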
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: scrub: convert scrub_workers_refcnt to refcount_t · ff09c4ca
      Anand Jain committed
      Use the refcount_t for fs_info::scrub_workers_refcnt instead of int so
      we get the extra checks. All reference changes are still done under
      scrub_lock.
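      
      A sketch of the converted counter (all of this still runs under
      scrub_lock, per the message above):
      
        if (refcount_read(&fs_info->scrub_workers_refcnt) == 0) {
                /* First scrub: create the workqueues, then set count to 1. */
                refcount_set(&fs_info->scrub_workers_refcnt, 1);
        } else {
                refcount_inc(&fs_info->scrub_workers_refcnt);
        }
      
        /* ... and on the put side: */
        if (refcount_dec_and_test(&fs_info->scrub_workers_refcnt)) {
                /* Last user: the workqueues can be torn down. */
        }
      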
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: scrub: add scrub_lock lockdep check in scrub_workers_get · eb4318e5
      Anand Jain committed
      scrub_workers_refcnt is protected by scrub_lock, so add a
      lockdep_assert_held() check in scrub_workers_get().
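      
      The check in context (the function signature is per that era's scrub.c
      and is meant as an illustration):
      
        static noinline_for_stack int scrub_workers_get(
                        struct btrfs_fs_info *fs_info, int is_dev_replace)
        {
                lockdep_assert_held(&fs_info->scrub_lock);
                /* ... */
        }
      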
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Suggested-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: scrub: fix circular locking dependency warning · 1cec3f27
      Anand Jain committed
      This fixes a longstanding lockdep warning triggered by
      fstests/btrfs/011.
      
      The circular locking dependency check reports the warning [1] below;
      that's because btrfs_scrub_dev() calls the stack #0 below with
      fs_info::scrub_lock held. The test case leading to this warning:
      
        $ mkfs.btrfs -f /dev/sdb
        $ mount /dev/sdb /btrfs
        $ btrfs scrub start -B /btrfs
      
      In fact we have fs_info::scrub_workers_refcnt to track whether init and
      destroy of the scrub workers are needed. So once we have incremented and
      decremented the fs_info::scrub_workers_refcnt value in the thread, it's
      OK to drop the scrub_lock and only then do the btrfs_destroy_workqueue()
      part. So this patch drops the scrub_lock before calling
      btrfs_destroy_workqueue().
      
        [359.258534] ======================================================
        [359.260305] WARNING: possible circular locking dependency detected
        [359.261938] 5.0.0-rc6-default #461 Not tainted
        [359.263135] ------------------------------------------------------
        [359.264672] btrfs/20975 is trying to acquire lock:
        [359.265927] 00000000d4d32bea ((wq_completion)"%s-%s""btrfs", name){+.+.}, at: flush_workqueue+0x87/0x540
        [359.268416]
        [359.268416] but task is already holding lock:
        [359.270061] 0000000053ea26a6 (&fs_info->scrub_lock){+.+.}, at: btrfs_scrub_dev+0x322/0x590 [btrfs]
        [359.272418]
        [359.272418] which lock already depends on the new lock.
        [359.272418]
        [359.274692]
        [359.274692] the existing dependency chain (in reverse order) is:
        [359.276671]
        [359.276671] -> #3 (&fs_info->scrub_lock){+.+.}:
        [359.278187]        __mutex_lock+0x86/0x9c0
        [359.279086]        btrfs_scrub_pause+0x31/0x100 [btrfs]
        [359.280421]        btrfs_commit_transaction+0x1e4/0x9e0 [btrfs]
        [359.281931]        close_ctree+0x30b/0x350 [btrfs]
        [359.283208]        generic_shutdown_super+0x64/0x100
        [359.284516]        kill_anon_super+0x14/0x30
        [359.285658]        btrfs_kill_super+0x12/0xa0 [btrfs]
        [359.286964]        deactivate_locked_super+0x29/0x60
        [359.288242]        cleanup_mnt+0x3b/0x70
        [359.289310]        task_work_run+0x98/0xc0
        [359.290428]        exit_to_usermode_loop+0x83/0x90
        [359.291445]        do_syscall_64+0x15b/0x180
        [359.292598]        entry_SYSCALL_64_after_hwframe+0x49/0xbe
        [359.294011]
        [359.294011] -> #2 (sb_internal#2){.+.+}:
        [359.295432]        __sb_start_write+0x113/0x1d0
        [359.296394]        start_transaction+0x369/0x500 [btrfs]
        [359.297471]        btrfs_finish_ordered_io+0x2aa/0x7c0 [btrfs]
        [359.298629]        normal_work_helper+0xcd/0x530 [btrfs]
        [359.299698]        process_one_work+0x246/0x610
        [359.300898]        worker_thread+0x3c/0x390
        [359.302020]        kthread+0x116/0x130
        [359.303053]        ret_from_fork+0x24/0x30
        [359.304152]
        [359.304152] -> #1 ((work_completion)(&work->normal_work)){+.+.}:
        [359.306100]        process_one_work+0x21f/0x610
        [359.307302]        worker_thread+0x3c/0x390
        [359.308465]        kthread+0x116/0x130
        [359.309357]        ret_from_fork+0x24/0x30
        [359.310229]
        [359.310229] -> #0 ((wq_completion)"%s-%s""btrfs", name){+.+.}:
        [359.311812]        lock_acquire+0x90/0x180
        [359.312929]        flush_workqueue+0xaa/0x540
        [359.313845]        drain_workqueue+0xa1/0x180
        [359.314761]        destroy_workqueue+0x17/0x240
        [359.315754]        btrfs_destroy_workqueue+0x57/0x200 [btrfs]
        [359.317245]        scrub_workers_put+0x2c/0x60 [btrfs]
        [359.318585]        btrfs_scrub_dev+0x336/0x590 [btrfs]
        [359.319944]        btrfs_dev_replace_by_ioctl.cold.19+0x179/0x1bb [btrfs]
        [359.321622]        btrfs_ioctl+0x28a4/0x2e40 [btrfs]
        [359.322908]        do_vfs_ioctl+0xa2/0x6d0
        [359.324021]        ksys_ioctl+0x3a/0x70
        [359.325066]        __x64_sys_ioctl+0x16/0x20
        [359.326236]        do_syscall_64+0x54/0x180
        [359.327379]        entry_SYSCALL_64_after_hwframe+0x49/0xbe
        [359.328772]
        [359.328772] other info that might help us debug this:
        [359.328772]
        [359.330990] Chain exists of:
        [359.330990]   (wq_completion)"%s-%s""btrfs", name --> sb_internal#2 --> &fs_info->scrub_lock
        [359.330990]
        [359.334376]  Possible unsafe locking scenario:
        [359.334376]
        [359.336020]        CPU0                    CPU1
        [359.337070]        ----                    ----
        [359.337821]   lock(&fs_info->scrub_lock);
        [359.338506]                                lock(sb_internal#2);
        [359.339506]                                lock(&fs_info->scrub_lock);
        [359.341461]   lock((wq_completion)"%s-%s""btrfs", name);
        [359.342437]
        [359.342437]  *** DEADLOCK ***
        [359.342437]
        [359.343745] 1 lock held by btrfs/20975:
        [359.344788]  #0: 0000000053ea26a6 (&fs_info->scrub_lock){+.+.}, at: btrfs_scrub_dev+0x322/0x590 [btrfs]
        [359.346778]
        [359.346778] stack backtrace:
        [359.347897] CPU: 0 PID: 20975 Comm: btrfs Not tainted 5.0.0-rc6-default #461
        [359.348983] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626cc-prebuilt.qemu-project.org 04/01/2014
        [359.350501] Call Trace:
        [359.350931]  dump_stack+0x67/0x90
        [359.351676]  print_circular_bug.isra.37.cold.56+0x15c/0x195
        [359.353569]  check_prev_add.constprop.44+0x4f9/0x750
        [359.354849]  ? check_prev_add.constprop.44+0x286/0x750
        [359.356505]  __lock_acquire+0xb84/0xf10
        [359.357505]  lock_acquire+0x90/0x180
        [359.358271]  ? flush_workqueue+0x87/0x540
        [359.359098]  flush_workqueue+0xaa/0x540
        [359.359912]  ? flush_workqueue+0x87/0x540
        [359.360740]  ? drain_workqueue+0x1e/0x180
        [359.361565]  ? drain_workqueue+0xa1/0x180
        [359.362391]  drain_workqueue+0xa1/0x180
        [359.363193]  destroy_workqueue+0x17/0x240
        [359.364539]  btrfs_destroy_workqueue+0x57/0x200 [btrfs]
        [359.365673]  scrub_workers_put+0x2c/0x60 [btrfs]
        [359.366618]  btrfs_scrub_dev+0x336/0x590 [btrfs]
        [359.367594]  ? start_transaction+0xa1/0x500 [btrfs]
        [359.368679]  btrfs_dev_replace_by_ioctl.cold.19+0x179/0x1bb [btrfs]
        [359.369545]  btrfs_ioctl+0x28a4/0x2e40 [btrfs]
        [359.370186]  ? __lock_acquire+0x263/0xf10
        [359.370777]  ? kvm_clock_read+0x14/0x30
        [359.371392]  ? kvm_sched_clock_read+0x5/0x10
        [359.372248]  ? sched_clock+0x5/0x10
        [359.372786]  ? sched_clock_cpu+0xc/0xc0
        [359.373662]  ? do_vfs_ioctl+0xa2/0x6d0
        [359.374552]  do_vfs_ioctl+0xa2/0x6d0
        [359.375378]  ? do_sigaction+0xff/0x250
        [359.376233]  ksys_ioctl+0x3a/0x70
        [359.376954]  __x64_sys_ioctl+0x16/0x20
        [359.377772]  do_syscall_64+0x54/0x180
        [359.378841]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
        [359.380422] RIP: 0033:0x7f5429296a97
      
      Backporting to older kernels: scrub_nocow_workers must be freed the same
      way as the others.
      
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      [ update changelog ]
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: scrub: print messages when started or finished · d1e14420
      Anand Jain committed
      Kernel log messages help with debugging and auditing, so add them for
      scrub.
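      
      For example (the exact message text is illustrative):
      
        btrfs_info(fs_info, "scrub: started on devid %llu", devid);
        /* ... */
        btrfs_info(fs_info, "scrub: finished on devid %llu with status: %d",
                   devid, ret);
      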
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: merge btrfs_find_device and find_device · 09ba3bc9
      Anand Jain committed
      Both btrfs_find_device() and find_device() do the same thing, except
      that the latter does not take the seed device into account in the device
      scanning context. We can merge them.
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: refactor btrfs_find_device() take fs_devices as argument · e4319cd9
      Anand Jain committed
      btrfs_find_device() accepts fs_info as an argument and retrieves
      fs_devices from fs_info.
      
      Instead, pass fs_devices directly, so that this function can also be
      used in a non-mount context (during device scanning).
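      
      The signature change, roughly (the merge patch above additionally folds
      in a 'seed' flag):
      
        /* Before: fs_devices always derived from fs_info, mount-only. */
        struct btrfs_device *btrfs_find_device(struct btrfs_fs_info *fs_info,
                                               u64 devid, u8 *uuid, u8 *fsid);
      
        /* After: callers pass fs_devices, usable during scanning too. */
        struct btrfs_device *btrfs_find_device(
                        struct btrfs_fs_devices *fs_devices,
                        u64 devid, u8 *uuid, u8 *fsid);
      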
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  12. 17 December 2018, 6 commits