1. 17 Dec, 2018 9 commits
  2. 06 Nov, 2018 2 commits
    • Btrfs: fix deadlock on tree root leaf when finding free extent · 4222ea71
      Filipe Manana committed
      When we are writing out a free space cache, during the transaction commit
      phase, we can end up in a deadlock which results in a stack trace like the
      following:
      
       schedule+0x28/0x80
       btrfs_tree_read_lock+0x8e/0x120 [btrfs]
       ? finish_wait+0x80/0x80
       btrfs_read_lock_root_node+0x2f/0x40 [btrfs]
       btrfs_search_slot+0xf6/0x9f0 [btrfs]
       ? evict_refill_and_join+0xd0/0xd0 [btrfs]
       ? inode_insert5+0x119/0x190
       btrfs_lookup_inode+0x3a/0xc0 [btrfs]
       ? kmem_cache_alloc+0x166/0x1d0
       btrfs_iget+0x113/0x690 [btrfs]
       __lookup_free_space_inode+0xd8/0x150 [btrfs]
       lookup_free_space_inode+0x5b/0xb0 [btrfs]
       load_free_space_cache+0x7c/0x170 [btrfs]
       ? cache_block_group+0x72/0x3b0 [btrfs]
       cache_block_group+0x1b3/0x3b0 [btrfs]
       ? finish_wait+0x80/0x80
       find_free_extent+0x799/0x1010 [btrfs]
       btrfs_reserve_extent+0x9b/0x180 [btrfs]
       btrfs_alloc_tree_block+0x1b3/0x4f0 [btrfs]
       __btrfs_cow_block+0x11d/0x500 [btrfs]
       btrfs_cow_block+0xdc/0x180 [btrfs]
       btrfs_search_slot+0x3bd/0x9f0 [btrfs]
       btrfs_lookup_inode+0x3a/0xc0 [btrfs]
       ? kmem_cache_alloc+0x166/0x1d0
       btrfs_update_inode_item+0x46/0x100 [btrfs]
       cache_save_setup+0xe4/0x3a0 [btrfs]
       btrfs_start_dirty_block_groups+0x1be/0x480 [btrfs]
       btrfs_commit_transaction+0xcb/0x8b0 [btrfs]
      
      At cache_save_setup() we need to update the inode item of a block group's
      cache which is located in the tree root (fs_info->tree_root), which means
      that it may result in COWing a leaf from that tree. If that happens we
      need to find a free metadata extent and while looking for one, if we find
      a block group which was not cached yet we attempt to load its cache by
      calling cache_block_group(). However this function will try to load the
      inode of the free space cache, which requires finding the matching inode
      item in the tree root - if that inode item is located in the same leaf as
      the inode item of the space cache we are updating at cache_save_setup(),
      we end up in a deadlock, since we try to obtain a read lock on the same
      extent buffer that we previously write locked.
      
      So fix this by using the tree root's commit root when searching for a
      block group's free space cache inode item when we are attempting to load
      a free space cache. This is safe since block groups, once loaded, stay in
      memory forever, as do their caches, so after they are first loaded
      we will never need to read their inode items again. New block groups
      get their ->cached field set to BTRFS_CACHE_FINISHED as soon as they are
      created, meaning we will not need to read their inode item.

      Reported-by: Andrew Nelson <andrew.s.nelson@gmail.com>
      Link: https://lore.kernel.org/linux-btrfs/CAPTELenq9x5KOWuQ+fa7h1r3nsJG8vyiTH8+ifjURc_duHh2Wg@mail.gmail.com/
      Fixes: 9d66e233 ("Btrfs: load free space cache if it exists")
      Tested-by: Andrew Nelson <andrew.s.nelson@gmail.com>
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
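
The deadlock described in commit 4222ea71 can be illustrated with a toy model. This is only a sketch and not the kernel code: `toy_leaf`, `toy_root`, and every `toy_*` function are hypothetical names invented here, and the "lock" simply reports failure where the kernel would block forever. It shows why a lookup through the live tree fails while the leaf is write locked, and why a lookup through a lock-free commit-root copy does not.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for an extent buffer lock: non-recursive,
 * one writer or many readers. */
struct toy_leaf {
    bool write_locked;
    int readers;
};

/* Hypothetical root: a live leaf that needs locking, and a copy from the
 * last committed transaction that is read without taking locks. */
struct toy_root {
    struct toy_leaf *current;
    const struct toy_leaf *commit;
};

/* Returns true if the lock was taken, false where the kernel would block. */
static bool toy_try_read_lock(struct toy_leaf *leaf)
{
    if (leaf->write_locked)
        return false;
    leaf->readers++;
    return true;
}

static bool toy_try_write_lock(struct toy_leaf *leaf)
{
    if (leaf->write_locked || leaf->readers > 0)
        return false;
    leaf->write_locked = true;
    return true;
}

/* Models looking up the free space cache inode item. */
static bool toy_lookup_cache_inode(struct toy_root *root, bool use_commit_root)
{
    if (use_commit_root)
        return root->commit != NULL;  /* no lock is taken on this path */
    return toy_try_read_lock(root->current);
}

/* Models the nested chain from the stack trace: the inode item update holds
 * the write lock while the allocator tries to load another cache.
 * Returns false where the real code would deadlock. */
static bool toy_cache_save_setup(struct toy_root *root, bool use_commit_root)
{
    bool ok;

    if (!toy_try_write_lock(root->current))
        return false;
    ok = toy_lookup_cache_inode(root, use_commit_root);
    if (ok && !use_commit_root)
        root->current->readers--;     /* drop the read lock we took */
    root->current->write_locked = false;
    return ok;
}
```

Calling `toy_cache_save_setup(&root, false)` models the pre-fix behaviour and reports the would-be deadlock, while passing `true` models searching the commit root and succeeds.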
    • Btrfs: fix cur_offset in the error case for nocow · 506481b2
      Robbie Ko committed
      When cow_file_range() fails, it unlocks the related resources for the
      range [start..end) itself, so run_delalloc_nocow() must not unlock
      them a second time.

      In some cases (e.g. cur_offset <= end && cow_start != -1), cur_offset is
      not updated correctly, so move the cur_offset update before the call to
      cow_file_range().
      
        kernel BUG at mm/page-writeback.c:2663!
        Internal error: Oops - BUG: 0 [#1] SMP
        CPU: 3 PID: 31525 Comm: kworker/u8:7 Tainted: P O
        Hardware name: Realtek_RTD1296 (DT)
        Workqueue: writeback wb_workfn (flush-btrfs-1)
        task: ffffffc076db3380 ti: ffffffc02e9ac000 task.ti: ffffffc02e9ac000
        PC is at clear_page_dirty_for_io+0x1bc/0x1e8
        LR is at clear_page_dirty_for_io+0x14/0x1e8
        pc : [<ffffffc00033c91c>] lr : [<ffffffc00033c774>] pstate: 40000145
        sp : ffffffc02e9af4f0
        Process kworker/u8:7 (pid: 31525, stack limit = 0xffffffc02e9ac020)
        Call trace:
        [<ffffffc00033c91c>] clear_page_dirty_for_io+0x1bc/0x1e8
        [<ffffffbffc514674>] extent_clear_unlock_delalloc+0x1e4/0x210 [btrfs]
        [<ffffffbffc4fb168>] run_delalloc_nocow+0x3b8/0x948 [btrfs]
        [<ffffffbffc4fb948>] run_delalloc_range+0x250/0x3a8 [btrfs]
        [<ffffffbffc514c0c>] writepage_delalloc.isra.21+0xbc/0x1d8 [btrfs]
        [<ffffffbffc516048>] __extent_writepage+0xe8/0x248 [btrfs]
        [<ffffffbffc51630c>] extent_write_cache_pages.isra.17+0x164/0x378 [btrfs]
        [<ffffffbffc5185a8>] extent_writepages+0x48/0x68 [btrfs]
        [<ffffffbffc4f5828>] btrfs_writepages+0x20/0x30 [btrfs]
        [<ffffffc00033d758>] do_writepages+0x30/0x88
        [<ffffffc0003ba0f4>] __writeback_single_inode+0x34/0x198
        [<ffffffc0003ba6c4>] writeback_sb_inodes+0x184/0x3c0
        [<ffffffc0003ba96c>] __writeback_inodes_wb+0x6c/0xc0
        [<ffffffc0003bac20>] wb_writeback+0x1b8/0x1c0
        [<ffffffc0003bb0f0>] wb_workfn+0x150/0x250
        [<ffffffc0002b0014>] process_one_work+0x1dc/0x388
        [<ffffffc0002b02f0>] worker_thread+0x130/0x500
        [<ffffffc0002b6344>] kthread+0x10c/0x110
        [<ffffffc000284590>] ret_from_fork+0x10/0x40
        Code: d503201f a9025bb5 a90363b7 f90023b9 (d4210000)
      
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Robbie Ko <robbieko@synology.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
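
The double unlock described in commit 506481b2 can be sketched with a small model. Everything here is an assumption made for illustration: the `toy_*` names, the page array, and in particular the `cur_offset < end` check in the caller's error path are hypothetical and are not taken from the kernel source. The model only shows why updating `cur_offset` before the failing call keeps the caller from unlocking the same range again.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_PAGES 8

/* Hypothetical range of pages with a counter for repeated unlocks. */
struct toy_range {
    bool locked[TOY_PAGES];
    int double_unlocks;
};

/* Unlock pages in [start, end), recording any page unlocked twice. */
static void toy_unlock(struct toy_range *r, size_t start, size_t end)
{
    for (size_t i = start; i < end; i++) {
        if (!r->locked[i])
            r->double_unlocks++;
        r->locked[i] = false;
    }
}

/* Stand-in for a failing cow_file_range(): on error it unlocks its whole
 * range itself and returns a negative value. */
static int toy_cow_range_fail(struct toy_range *r, size_t start, size_t end)
{
    toy_unlock(r, start, end);
    return -1;
}

/* Stand-in for the caller. With update_first == false, cur_offset still
 * points at cow_start when the error path runs (modelling the bug). With
 * update_first == true, it is advanced before the call (modelling the fix). */
static int toy_run_nocow(struct toy_range *r, size_t cow_start, size_t end,
                         bool update_first)
{
    size_t cur_offset = cow_start;
    int ret;

    if (update_first)
        cur_offset = end;
    ret = toy_cow_range_fail(r, cow_start, end);
    if (ret && cur_offset < end)      /* assumed shape of the cleanup */
        toy_unlock(r, cur_offset, end);
    return ret;
}
```

With all pages initially locked, the buggy ordering unlocks every page in the range twice, while the fixed ordering leaves `double_unlocks` at zero.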
  3. 19 Oct, 2018 2 commits
  4. 17 Oct, 2018 1 commit
  5. 15 Oct, 2018 17 commits
  6. 11 Oct, 2018 1 commit
  7. 23 Aug, 2018 1 commit
    • Btrfs: sync log after logging new name · d4682ba0
      Filipe Manana committed
      When we add a new name for an inode which was logged in the current
      transaction, we update the inode in the log so that its new name and
      ancestors are added to the log. However when we do this we do not persist
      the log, so the changes remain in memory only, and as a consequence, any
      ancestors that were created in the current transaction are updated such
      that future calls to btrfs_inode_in_log() return true. This leads to a
      subsequent fsync against such new ancestor directories returning
      immediately, without persisting the log, therefore after a power failure
      the new ancestor directories do not exist, despite fsync being called
      against them explicitly.
      
      Example:
      
        $ mkfs.btrfs -f /dev/sdb
        $ mount /dev/sdb /mnt
      
        $ mkdir /mnt/A
        $ mkdir /mnt/B
        $ mkdir /mnt/A/C
        $ touch /mnt/B/foo
        $ xfs_io -c "fsync" /mnt/B/foo
        $ ln /mnt/B/foo /mnt/A/C/foo
        $ xfs_io -c "fsync" /mnt/A
        <power failure>
      
      After the power failure, directory "A" does not exist, despite the explicit
      fsync on it.
      
      Instead of fixing this by changing the behaviour of the explicit fsync on
      directory "A" to persist the log instead of doing nothing, make the logging
      of the new file name (which happens when creating a hard link or renaming)
      persist the log. This approach is not only simpler, since it does not
      require adding new fields to the in-memory inode structure, but also gives
      us the same behaviour as ext4, xfs and f2fs (possibly other filesystems too).
      
      A test case for fstests follows soon.
      
      Fixes: 12fcfd22 ("Btrfs: tree logging unlink/rename fixes")
      Reported-by: Vijay Chidambaram <vvijay03@gmail.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
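
For readers unfamiliar with fsync on a directory, as used in the example of commit d4682ba0, the following is a minimal user-space sketch using standard POSIX calls. The helper name `fsync_path` is hypothetical, and the sketch makes no claim about how xfs_io implements its "fsync" command; it only shows that a directory can be opened read-only and passed to fsync() on Linux.

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Open a path read-only and fsync it. On Linux this works for directories
 * as well as regular files. Returns 0 on success, -1 on failure with errno
 * set by the failing call. */
static int fsync_path(const char *path)
{
    int fd = open(path, O_RDONLY);
    int ret;

    if (fd < 0)
        return -1;
    ret = fsync(fd);
    if (close(fd) != 0 && ret == 0)
        ret = -1;
    return ret;
}
```

In terms of the example, `fsync_path("/mnt/A")` corresponds to the explicit fsync on directory "A" that is expected to make the directory survive a power failure.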
  8. 18 Aug, 2018 1 commit
    • Btrfs: fix unexpected failure of nocow buffered writes after snapshotting when low on space · 8ecebf4d
      Robbie Ko committed
      Commit e9894fd3 ("Btrfs: fix snapshot vs nocow writting") forced
      nocow writes to fall back to COW, during writeback, when a snapshot is
      created. This caused writes made before creating the snapshot to
      unexpectedly fail with ENOSPC during writeback, even though success (0)
      had already been returned to user space through the write system call.
      
      The steps leading to this problem are:
      
      1. When it's not possible to allocate data space for a write, the
         buffered write path checks if a NOCOW write is possible.  If it is,
         it will not reserve space and success (0) is returned to user space.
      
      2. Then when a snapshot is created, the root's will_be_snapshotted
         atomic is incremented and writeback is triggered for all inodes that
         belong to the root being snapshotted. Incrementing that atomic forces
         all previous writes to fall back to COW during writeback (running
         delalloc).
      
      3. This causes writeback for the inodes to fail, which sets the ENOSPC
         error in their mappings, so that a subsequent fsync on them will
         report the error to user space. So it's not a completely silent data
         loss (since fsync will report ENOSPC) but it's a very unexpected and
         undesirable behaviour, because if a clean shutdown/unmount of the
         filesystem happens without previous calls to fsync, the data is
         expected to be present in the files after mounting the filesystem
         again.
      
      So fix this by adding a new atomic named snapshot_force_cow to the
      root structure, which prevents this behaviour and works as follows:
      
      1. It is incremented when we start to create a snapshot after triggering
         writeback and before waiting for writeback to finish.
      
      2. This new atomic is now what is used by writeback (running delalloc)
         to decide whether we need to fall back to COW or not. Because we
         increment this new atomic after triggering writeback in the
         snapshot creation ioctl, we ensure that all buffered writes that
         happened before snapshot creation will succeed and not fall back to
         COW (which would make them fail with ENOSPC).
      
      3. The existing atomic, will_be_snapshotted, is kept because it is used
         to force new buffered writes, that start after we started
         snapshotting, to reserve data space even when NOCOW is possible.
         This makes these writes fail early with ENOSPC when there's no
         available space to allocate, preventing the unexpected behaviour of
         writeback later failing with ENOSPC due to a fallback to COW mode.
      
      Fixes: e9894fd3 ("Btrfs: fix snapshot vs nocow writting")
      Signed-off-by: Robbie Ko <robbieko@synology.com>
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
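
The interplay of the two counters described in commit 8ecebf4d can be sketched as a single-threaded model. This is an illustration under assumptions, not the kernel implementation: the counter names come from the commit message, but `toy_root` and all `toy_*` functions are hypothetical, and real writeback runs concurrently rather than as a direct function call.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical root with the two counters named in the commit message. */
struct toy_root {
    atomic_int will_be_snapshotted; /* gates new buffered writes */
    atomic_int snapshot_force_cow;  /* gates writeback */
};

/* Buffered write path: returns 0 if the write is accepted, -ENOSPC if not. */
static int toy_buffered_write(struct toy_root *root, bool space_available,
                              bool nocow_possible)
{
    if (space_available)
        return 0;       /* data space reserved normally */
    if (nocow_possible && atomic_load(&root->will_be_snapshotted) == 0)
        return 0;       /* accepted without a reservation */
    return -ENOSPC;     /* fail early instead of failing in writeback */
}

/* Writeback of a write that was accepted without a reservation. */
static int toy_writeback_unreserved(struct toy_root *root)
{
    if (atomic_load(&root->snapshot_force_cow) > 0)
        return -ENOSPC; /* forced COW with no space to allocate */
    return 0;           /* NOCOW overwrite in place */
}

/* Snapshot creation ordering from the commit message: writeback is triggered
 * first, and only then is snapshot_force_cow incremented. Returns the result
 * of writing back one pending unreserved write. */
static int toy_snapshot_flush(struct toy_root *root)
{
    int ret;

    atomic_fetch_add(&root->will_be_snapshotted, 1);
    ret = toy_writeback_unreserved(root);
    atomic_fetch_add(&root->snapshot_force_cow, 1);
    return ret;
}
```

In this model a write accepted before the snapshot is written back successfully, while a write attempted after snapshotting has started fails early with -ENOSPC in the write path rather than later in writeback.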
  9. 06 Aug, 2018 6 commits