1. 24 Mar, 2020 16 commits
  2. 21 Mar, 2020 1 commit
  3. 14 Mar, 2020 1 commit
    • btrfs: fix log context list corruption after rename whiteout error · 236ebc20
      Committed by Filipe Manana
      During a rename whiteout, if btrfs_whiteout_for_rename() returns an error
      we can end up returning from btrfs_rename() with the log context object
      still in the root's log context list. This happens if 'sync_log' was
      set to true before we called btrfs_whiteout_for_rename(). It is
      dangerous because the log context object was allocated on the stack, so
      the linked list (root->log_ctxs) ends up corrupt.
      
      After btrfs_rename() returns, any task that is running btrfs_sync_log()
      concurrently can end up crashing because that linked list is traversed by
      btrfs_sync_log() (through btrfs_remove_all_log_ctxs()). That results in
      the same issue that commit e6c61710 ("Btrfs: fix log context list
      corruption after rename exchange operation") fixed.
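
The invariant behind this fix can be modeled in user space. The following is a minimal Python sketch, not the kernel code: `Root`, `LogCtx`, `rename_with_whiteout` and the `whiteout_fails` flag are hypothetical names, and a plain list stands in for `root->log_ctxs`. It only illustrates that once the context has been added to the list, every exit path has to take it off again.

```python
class LogCtx:
    """Stand-in for the on-stack log context object."""


class Root:
    def __init__(self):
        # Models root->log_ctxs, the list that btrfs_sync_log() walks.
        self.log_ctxs = []


def rename_with_whiteout(root, sync_log, whiteout_fails):
    """Model of a rename path that may fail while creating the whiteout."""
    ctx = LogCtx()
    ctx_added = False
    try:
        if sync_log:
            root.log_ctxs.append(ctx)
            ctx_added = True
        if whiteout_fails:
            raise OSError("whiteout creation failed")
        return 0
    finally:
        # Unlink the context on success and on error alike, so the list
        # never refers to an object whose lifetime ended with this call.
        if ctx_added:
            root.log_ctxs.remove(ctx)
```

In this model the list is empty after the call whether or not the whiteout step fails, which is the property the error path was missing.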
      
      Fixes: d4682ba0 ("Btrfs: sync log after logging new name")
      CC: stable@vger.kernel.org # 4.19+
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  4. 03 Mar, 2020 1 commit
    • btrfs: fix RAID direct I/O reads with alternate csums · e7a04894
      Committed by Omar Sandoval
      btrfs_lookup_and_bind_dio_csum() does pointer arithmetic which assumes
      32-bit checksums. If using a larger checksum, this leads to spurious
      failures when a direct I/O read crosses a stripe. This is easy
      to reproduce:
      
        # mkfs.btrfs -f --checksum blake2 -d raid0 /dev/vdc /dev/vdd
        ...
        # mount /dev/vdc /mnt
        # cd /mnt
        # dd if=/dev/urandom of=foo bs=1M count=1 status=none
        # dd if=foo of=/dev/null bs=1M iflag=direct status=none
        dd: error reading 'foo': Input/output error
        # dmesg | tail -1
        [  135.821568] BTRFS warning (device vdc): csum failed root 5 ino 257 off 421888 ...
      
      Fix it by using the actual checksum size.
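
To see why the 32-bit assumption breaks, consider how the position of a block's checksum inside a packed checksum buffer is computed. The sketch below is a user-space illustration in Python, not the kernel function: `csum_offset` and `csum_for_block` are hypothetical helpers, and the 32-byte blake2 size is an assumption about the digest length.

```python
CRC32C_SIZE = 4    # bytes per checksum with the default crc32c
BLAKE2_SIZE = 32   # bytes per checksum with blake2 (assumed 256-bit digest)


def csum_offset(block_index, csum_size):
    """Byte offset of the checksum for block `block_index` in a packed buffer."""
    return block_index * csum_size


def csum_for_block(buf, block_index, csum_size):
    """Slice one checksum out of a packed checksum buffer."""
    start = csum_offset(block_index, csum_size)
    return buf[start:start + csum_size]


# Three blake2-sized checksums packed back to back.
packed = (bytes([0xAA]) * BLAKE2_SIZE
          + bytes([0xBB]) * BLAKE2_SIZE
          + bytes([0xCC]) * BLAKE2_SIZE)

# Using the real checksum size finds the third checksum.
correct = csum_for_block(packed, 2, BLAKE2_SIZE)

# Stepping by 4 bytes lands inside the first checksum instead of the third.
wrong_start = csum_offset(2, CRC32C_SIZE)
```

With a 4-byte stride the lookup for the third block starts at byte 8, which is still inside the first checksum, so the comparison against the data fails even though the data is intact.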
      
      Fixes: 1e25a2e3 ("btrfs: don't assume ordered sums to be 4 bytes")
      CC: stable@vger.kernel.org # 5.4+
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  5. 21 Feb, 2020 1 commit
    • Btrfs: fix deadlock during fast fsync when logging prealloc extents beyond eof · a5ae50de
      Committed by Filipe Manana
      While logging the prealloc extents of an inode during a fast fsync we call
      btrfs_truncate_inode_items(), through btrfs_log_prealloc_extents(), while
      holding a read lock on a leaf of the inode's root (not the log root, the
      fs/subvol root), and then that function locks the file range in the inode's
      iotree. This can lead to a deadlock when:
      
      * the fsync is ranged
      
      * the file has prealloc extents beyond eof
      
      * writeback for a range different from the fsync range starts
        during the fsync
      
      * the size of the file is not sector size aligned
      
      This is because, when finishing an ordered extent, we first lock a file
      range and then try to COW the fs/subvol tree to insert an extent item.
      
      The following diagram shows how the deadlock can happen.
      
                 CPU 1                                        CPU 2
      
        btrfs_sync_file()
          --> for range [0, 1MiB)
      
          --> inode has a size of
              1MiB and has 1 prealloc
              extent beyond the
              i_size, starting at offset
              4MiB
      
          flushes all delalloc for the
          range [0MiB, 1MiB) and waits
          for the respective ordered
          extents to complete
      
                                                    --> before task at CPU 1 locks the
                                                        inode, a write into file range
                                                        [1MiB, 2MiB + 1KiB) is made
      
                                                    --> i_size is updated to 2MiB + 1KiB
      
                                                    --> writeback is started for that
                                                        range, [1MiB, 2MiB + 4KiB)
                                                        --> end offset rounded up to
                                                            be sector size aligned
      
          btrfs_log_dentry_safe()
            btrfs_log_inode_parent()
              btrfs_log_inode()
      
                btrfs_log_changed_extents()
                  btrfs_log_prealloc_extents()
                    --> does a search on the
                        inode's root
                    --> holds a read lock on
                        leaf X
      
                                                    btrfs_finish_ordered_io()
                                                      --> locks range [1MiB, 2MiB + 4KiB)
                                                          --> end offset rounded up
                                                              to be sector size aligned
      
                                                      --> tries to cow leaf X, through
                                                          insert_reserved_file_extent()
                                                          --> already locked by the
                                                              task at CPU 1
      
                    btrfs_truncate_inode_items()
      
                      --> gets an i_size of
                          2MiB + 1KiB, which is
                          not sector size
                          aligned
      
                      --> tries to lock file
                          range [2MiB, (u64)-1)
                          --> the start range
                              is rounded down
                              from 2MiB + 1K
                              to 2MiB to be sector
                              size aligned
      
                          --> but the subrange
                              [2MiB, 2MiB + 4KiB) is
                              already locked by
                              task at CPU 2 which
                              is waiting to get a
                              write lock on leaf X
                              for which we are
                              holding a read lock
      
                                      *** deadlock ***
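
The range arithmetic in the diagram can be checked with a few lines of Python. This is only a worked example under stated assumptions: the 4KiB sector size is inferred from the rounding shown in the diagram, and `round_down`, `round_up` and `overlap` are hypothetical helpers, not kernel functions.

```python
KIB = 1024
MIB = 1024 * KIB
SECTOR = 4 * KIB          # sector size implied by the rounding in the diagram
U64_MAX = 2**64 - 1


def round_down(x, align):
    return x - (x % align)


def round_up(x, align):
    return round_down(x + align - 1, align)


def overlap(a, b):
    """Intersection of two half-open ranges, or None if they are disjoint."""
    start, end = max(a[0], b[0]), min(a[1], b[1])
    return (start, end) if start < end else None


i_size = 2 * MIB + 1 * KIB

# CPU 2: the ordered extent's range, with the end rounded up.
ordered_range = (1 * MIB, round_up(i_size, SECTOR))
# CPU 1: truncate locks from i_size rounded down up to the maximum offset.
truncate_range = (round_down(i_size, SECTOR), U64_MAX)

contended = overlap(ordered_range, truncate_range)
```

The two requests intersect in [2MiB, 2MiB + 4KiB), which is the subrange each task ends up waiting on while holding what the other one needs.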
      
      This results in a stack trace like the following, triggered by test case
      generic/561 from fstests:
      
        [ 2779.973608] INFO: task kworker/u8:6:247 blocked for more than 120 seconds.
        [ 2779.979536]       Not tainted 5.6.0-rc2-btrfs-next-53 #1
        [ 2779.984503] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
        [ 2779.990136] kworker/u8:6    D    0   247      2 0x80004000
        [ 2779.990457] Workqueue: btrfs-endio-write btrfs_work_helper [btrfs]
        [ 2779.990466] Call Trace:
        [ 2779.990491]  ? __schedule+0x384/0xa30
        [ 2779.990521]  schedule+0x33/0xe0
        [ 2779.990616]  btrfs_tree_read_lock+0x19e/0x2e0 [btrfs]
        [ 2779.990632]  ? remove_wait_queue+0x60/0x60
        [ 2779.990730]  btrfs_read_lock_root_node+0x2f/0x40 [btrfs]
        [ 2779.990782]  btrfs_search_slot+0x510/0x1000 [btrfs]
        [ 2779.990869]  btrfs_lookup_file_extent+0x4a/0x70 [btrfs]
        [ 2779.990944]  __btrfs_drop_extents+0x161/0x1060 [btrfs]
        [ 2779.990987]  ? mark_held_locks+0x6d/0xc0
        [ 2779.990994]  ? __slab_alloc.isra.49+0x99/0x100
        [ 2779.991060]  ? insert_reserved_file_extent.constprop.19+0x64/0x300 [btrfs]
        [ 2779.991145]  insert_reserved_file_extent.constprop.19+0x97/0x300 [btrfs]
        [ 2779.991222]  ? start_transaction+0xdd/0x5c0 [btrfs]
        [ 2779.991291]  btrfs_finish_ordered_io+0x4f4/0x840 [btrfs]
        [ 2779.991405]  btrfs_work_helper+0xaa/0x720 [btrfs]
        [ 2779.991432]  process_one_work+0x26d/0x6a0
        [ 2779.991460]  worker_thread+0x4f/0x3e0
        [ 2779.991481]  ? process_one_work+0x6a0/0x6a0
        [ 2779.991489]  kthread+0x103/0x140
        [ 2779.991499]  ? kthread_create_worker_on_cpu+0x70/0x70
        [ 2779.991515]  ret_from_fork+0x3a/0x50
        (...)
        [ 2780.026211] INFO: task fsstress:17375 blocked for more than 120 seconds.
        [ 2780.027480]       Not tainted 5.6.0-rc2-btrfs-next-53 #1
        [ 2780.028482] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
        [ 2780.030035] fsstress        D    0 17375  17373 0x00004000
        [ 2780.030038] Call Trace:
        [ 2780.030044]  ? __schedule+0x384/0xa30
        [ 2780.030052]  schedule+0x33/0xe0
        [ 2780.030075]  lock_extent_bits+0x20c/0x320 [btrfs]
        [ 2780.030094]  ? btrfs_truncate_inode_items+0xf4/0x1150 [btrfs]
        [ 2780.030098]  ? rcu_read_lock_sched_held+0x59/0xa0
        [ 2780.030102]  ? remove_wait_queue+0x60/0x60
        [ 2780.030122]  btrfs_truncate_inode_items+0x133/0x1150 [btrfs]
        [ 2780.030151]  ? btrfs_set_path_blocking+0xb2/0x160 [btrfs]
        [ 2780.030165]  ? btrfs_search_slot+0x379/0x1000 [btrfs]
        [ 2780.030195]  btrfs_log_changed_extents.isra.8+0x841/0x93e [btrfs]
        [ 2780.030202]  ? do_raw_spin_unlock+0x49/0xc0
        [ 2780.030215]  ? btrfs_get_num_csums+0x10/0x10 [btrfs]
        [ 2780.030239]  btrfs_log_inode+0xf83/0x1124 [btrfs]
        [ 2780.030251]  ? __mutex_unlock_slowpath+0x45/0x2a0
        [ 2780.030275]  btrfs_log_inode_parent+0x2a0/0xe40 [btrfs]
        [ 2780.030282]  ? dget_parent+0xa1/0x370
        [ 2780.030309]  btrfs_log_dentry_safe+0x4a/0x70 [btrfs]
        [ 2780.030329]  btrfs_sync_file+0x3f3/0x490 [btrfs]
        [ 2780.030339]  do_fsync+0x38/0x60
        [ 2780.030343]  __x64_sys_fdatasync+0x13/0x20
        [ 2780.030345]  do_syscall_64+0x5c/0x280
        [ 2780.030348]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
        [ 2780.030356] RIP: 0033:0x7f2d80f6d5f0
        [ 2780.030361] Code: Bad RIP value.
        [ 2780.030362] RSP: 002b:00007ffdba3c8548 EFLAGS: 00000246 ORIG_RAX: 000000000000004b
        [ 2780.030364] RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007f2d80f6d5f0
        [ 2780.030365] RDX: 00007ffdba3c84b0 RSI: 00007ffdba3c84b0 RDI: 0000000000000003
        [ 2780.030367] RBP: 000000000000004a R08: 0000000000000001 R09: 00007ffdba3c855c
        [ 2780.030368] R10: 0000000000000078 R11: 0000000000000246 R12: 00000000000001f4
        [ 2780.030369] R13: 0000000051eb851f R14: 00007ffdba3c85f0 R15: 0000557a49220d90
      
      So fix this by making btrfs_truncate_inode_items() not lock the range in
      the inode's iotree when the target root is a log root, since it's not
      needed to lock the range for log roots as the protection from the inode's
      lock and log_mutex are all that's needed.
      
      Fixes: 28553fa9 ("Btrfs: fix race between shrinking truncate and fiemap")
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  6. 19 Feb, 2020 6 commits
  7. 17 Feb, 2020 1 commit
    • btrfs: don't set path->leave_spinning for truncate · 52e29e33
      Committed by Josef Bacik
      The only time we actually leave the path spinning is if we're truncating
      a small amount and don't actually free an extent, which is not a common
      occurrence.  We have to set the path blocking in order to add the
      delayed ref anyway, so the first extent we find we set the path to
      blocking and stay blocking for the duration of the operation.  With the
      upcoming file extent map stuff there will be another case that we have
      to have the path blocking, so just swap to blocking always.
      
      Note: this patch also fixes a warning after 28553fa9 ("Btrfs: fix
      race between shrinking truncate and fiemap") got merged that inserts
      extent locks around truncation so the path must not leave spinning locks
      after btrfs_search_slot.
      
        [70.794783] BUG: sleeping function called from invalid context at mm/slab.h:565
        [70.794834] in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 1141, name: rsync
        [70.794863] 5 locks held by rsync/1141:
        [70.794876]  #0: ffff888417b9c408 (sb_writers#17){.+.+}, at: mnt_want_write+0x20/0x50
        [70.795030]  #1: ffff888428de28e8 (&type->i_mutex_dir_key#13/1){+.+.}, at: lock_rename+0xf1/0x100
        [70.795051]  #2: ffff888417b9c608 (sb_internal#2){.+.+}, at: start_transaction+0x394/0x560
        [70.795124]  #3: ffff888403081768 (btrfs-fs-01){++++}, at: btrfs_try_tree_write_lock+0x2f/0x160
        [70.795203]  #4: ffff888403086568 (btrfs-fs-00){++++}, at: btrfs_try_tree_write_lock+0x2f/0x160
        [70.795222] CPU: 5 PID: 1141 Comm: rsync Not tainted 5.6.0-rc2-backup+ #2
        [70.795362] Call Trace:
        [70.795374]  dump_stack+0x71/0xa0
        [70.795445]  ___might_sleep.part.96.cold.106+0xa6/0xb6
        [70.795459]  kmem_cache_alloc+0x1d3/0x290
        [70.795471]  alloc_extent_state+0x22/0x1c0
        [70.795544]  __clear_extent_bit+0x3ba/0x580
        [70.795557]  ? _raw_spin_unlock_irq+0x24/0x30
        [70.795569]  btrfs_truncate_inode_items+0x339/0xe50
        [70.795647]  btrfs_evict_inode+0x269/0x540
        [70.795659]  ? dput.part.38+0x29/0x460
        [70.795671]  evict+0xcd/0x190
        [70.795682]  __dentry_kill+0xd6/0x180
        [70.795754]  dput.part.38+0x2ad/0x460
        [70.795765]  do_renameat2+0x3cb/0x540
        [70.795777]  __x64_sys_rename+0x1c/0x20
      Reported-by: Dave Jones <davej@codemonkey.org.uk>
      Fixes: 28553fa9 ("Btrfs: fix race between shrinking truncate and fiemap")
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      [ add note ]
      Signed-off-by: David Sterba <dsterba@suse.com>
  8. 13 Feb, 2020 7 commits
    • btrfs: sysfs, move device id directories to UUID/devinfo · 1b9867eb
      Committed by Anand Jain
      Originally it was planned to create device id directories under
      UUID/devinfo, but they ended up under UUID/devices by mistake. We really
      want them under devinfo so the bare device node names are not mixed with device
      ids and are easy to enumerate.
      
      Fixes: 668e48af ("btrfs: sysfs, add devid/dev_state kobject and device attributes")
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: sysfs, add UUID/devinfo kobject · a013d141
      Committed by Anand Jain
      Create directory /sys/fs/btrfs/UUID/devinfo to hold device directories
      named by device id (unlike /devices).
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • Btrfs: fix race between shrinking truncate and fiemap · 28553fa9
      Committed by Filipe Manana
      When there is a fiemap executing in parallel with a shrinking truncate
      we can end up in a situation where we have extent maps for which we no
      longer have corresponding file extent items. This is generally harmless
      and at the moment the only consequences are missing file extent items
      representing holes after we expand the file size again after the
      truncate operation removed the prealloc extent items, and stale
      information for future fiemap calls (reporting extents that no longer
      exist or may have been reallocated to other files for example).
      
      Consider the following example:
      
      1) Our inode has a size of 128KiB, one 128KiB extent at file offset 0
         and a 1MiB prealloc extent at file offset 128KiB;
      
      2) Task A starts doing a shrinking truncate of our inode to reduce it to
         a size of 64KiB. Before it searches the subvolume tree for file
         extent items to delete, it drops all the extent maps in the range
         from 64KiB to (u64)-1 by calling btrfs_drop_extent_cache();
      
      3) Task B starts doing a fiemap against our inode. When looking up
         the inode's extent maps in the range from 128KiB to (u64)-1, it
         doesn't find any in the inode's extent map tree, since they were
         removed by task A.  Because it didn't find any in the extent map
         tree, it scans the inode's subvolume tree for file extent items, and
         it finds the 1MiB prealloc extent at file offset 128KiB, then it
         creates an extent map based on that file extent item and adds it to
         inode's extent map tree (this ends up being done by
         btrfs_get_extent() <- btrfs_get_extent_fiemap() <-
         get_extent_skip_holes());
      
      4) Task A then drops the prealloc extent at file offset 128KiB and
         shrinks the 128KiB extent at file offset 0 to a length of 64KiB. The
         truncation operation finishes and we end up with an extent map
         representing a 1MiB prealloc extent at file offset 128KiB, even
         though that extent no longer exists;
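
Steps 1 to 4 can be replayed deterministically in a small Python model. This is a sketch under simplifying assumptions, not kernel code: two dictionaries keyed by file offset stand in for the subvolume tree items and the inode's extent map tree, and whole extent maps are dropped where the kernel could split them.

```python
KIB = 1024
MIB = 1024 * KIB

# Step 1: file extent items (on disk) and extent maps (in memory).
file_extent_items = {0: ("regular", 128 * KIB), 128 * KIB: ("prealloc", 1 * MIB)}
extent_maps = dict(file_extent_items)

new_size = 64 * KIB

# Step 2: task A drops every cached extent map that ends beyond the new size.
for off in [o for o, (_, length) in extent_maps.items() if o + length > new_size]:
    del extent_maps[off]

# Step 3: task B (fiemap) misses in the cache, reads the item from the
# subvolume tree and caches a new extent map built from it.
fiemap_off = 128 * KIB
if fiemap_off not in extent_maps:
    extent_maps[fiemap_off] = file_extent_items[fiemap_off]

# Step 4: task A deletes the prealloc item and trims the first extent.
del file_extent_items[128 * KIB]
file_extent_items[0] = ("regular", new_size)

# Extent maps that no longer have a backing file extent item.
stale = sorted(off for off in extent_maps if off not in file_extent_items)
```

After the replay, `stale` holds the offset 128KiB: the cached prealloc extent map outlived the item it was built from.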
      
      After this the two types of problems we have are:
      
      1) Future calls to fiemap always report that a 1MiB prealloc extent
         exists at file offset 128KiB. This is stale information, no longer
         correct;
      
      2) If the size of the file is increased, by a truncate operation that
         increases the file size or by a write into a file offset > 64KiB for
         example, we end up not inserting file extent items to represent holes
         for any range between 128KiB and 128KiB + 1MiB, since the hole
         expansion function, btrfs_cont_expand() will skip hole insertion for
         any range for which an extent map exists that represents a prealloc
         extent. This causes fsck to complain about missing file extent items
         when not using the NO_HOLES feature.
      
      The second issue could be often triggered by test case generic/561 from
      fstests, which runs fsstress and duperemove in parallel, and duperemove
      does frequent fiemap calls.
      
      Essentially the problem happens because fiemap does not acquire the
      inode's lock while truncate does, and fiemap locks the file range in the
      inode's iotree while truncate does not. So fix the issue by making
      btrfs_truncate_inode_items() lock the file range from the new file size
      to (u64)-1, so that it serializes with fiemap.
      
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: log message when rw remount is attempted with unclean tree-log · 10a3a3ed
      Committed by David Sterba
      A remount to a read-write filesystem is not safe when there is a tree-log
      to be replayed. Files that could be opened until now might be affected
      by the changes in the tree-log.
      
      A regular mount is needed to replay the log so the filesystem presents
      a consistent view with the pending changes included.
      
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: print message when tree-log replay starts · e8294f2f
      Committed by David Sterba
      There's no logged information about tree-log replay although this is
      something that points to a previous unclean unmount. Other filesystems
      report that as well.
      Suggested-by: Chris Murphy <lists@colorremedies.com>
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • Btrfs: fix race between using extent maps and merging them · ac05ca91
      Committed by Filipe Manana
      We have a few cases where we allow an extent map that is in an extent map
      tree to be merged with other extents in the tree. Such cases include the
      unpinning of an extent after the respective ordered extent completed or
      after logging an extent during a fast fsync. This can lead to subtle and
      dangerous problems because when doing the merge some other task might be
      using the same extent map and as a consequence see an inconsistent state of
      the extent map - for example it sees the new length but has seen the old
      start offset.
      
      With luck this triggers a BUG_ON(), and not some silent bug, such as the
      following one in __do_readpage():
      
        $ cat -n fs/btrfs/extent_io.c
        3061  static int __do_readpage(struct extent_io_tree *tree,
        3062                           struct page *page,
        (...)
        3127                  em = __get_extent_map(inode, page, pg_offset, cur,
        3128                                        end - cur + 1, get_extent, em_cached);
        3129                  if (IS_ERR_OR_NULL(em)) {
        3130                          SetPageError(page);
        3131                          unlock_extent(tree, cur, end);
        3132                          break;
        3133                  }
        3134                  extent_offset = cur - em->start;
        3135                  BUG_ON(extent_map_end(em) <= cur);
        (...)
      
      Consider the following example scenario, where we end up hitting the
      BUG_ON() in __do_readpage().
      
      We have an inode with a size of 8KiB and 2 extent maps:
      
        extent A: file offset 0, length 4KiB, disk_bytenr = X, persisted on disk by
                  a previous transaction
      
        extent B: file offset 4KiB, length 4KiB, disk_bytenr = X + 4KiB, not yet
                  persisted but writeback started for it already. The extent map
                  is pinned since there's writeback and an ordered extent in
                  progress, so it can not be merged with extent map A yet
      
      The following sequence of steps leads to the BUG_ON():
      
      1) The ordered extent for extent B completes, the respective page gets its
         writeback bit cleared and the extent map is unpinned, at that point it
         is not yet merged with extent map A because it's in the list of modified
         extents;
      
      2) Due to memory pressure, or some other reason, the MM subsystem releases
         the page corresponding to extent B - btrfs_releasepage() is called and
         returns 1, meaning the page can be released as it's not dirty, not under
         writeback anymore and the extent range is not locked in the inode's
         iotree. However the extent map is not released, either because we are
         not in a context that allows memory allocations to block or because the
         inode's size is smaller than 16MiB - in this case our inode has a size
         of 8KiB;
      
      3) Task B needs to read extent B and ends up in __do_readpage() through
         the btrfs_readpage() callback. At __do_readpage() it gets a reference
         to extent map B;
      
      4) Task A, doing a fast fsync, calls clear_em_logging() against extent map B
         while holding the write lock on the inode's extent map tree - this
         results in try_merge_map() being called and since it's possible to merge
         extent map B with extent map A now (the extent map B was removed from
         the list of modified extents), the merging begins - it sets extent map
         B's start offset to 0 (was 4KiB), but before it increments the map's
         length to 8KiB (4KiB + 4KiB), task B is at:
      
         BUG_ON(extent_map_end(em) <= cur);
      
         The call to extent_map_end() sees the extent map has a start of 0
         and a length still at 4KiB, so it returns 4KiB and 'cur' is 4KiB, so
         the BUG_ON() is triggered.
      
      So it's dangerous to modify an extent map that is in the tree, because some
      other task might have got a reference to it before and still be using it,
      and it needs to see a consistent map while using it. Generally this is very rare
      since most paths that lookup and use extent maps also have the file range
      locked in the inode's iotree. The fsync path is pretty much the only
      exception where we don't do it to avoid serialization with concurrent
      reads.
      
      Fix this by not allowing an extent map to be merged if it's being used
      by tasks other than the one attempting to merge the extent map (when the
      reference count of the extent map is greater than 2).
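
The torn read from step 4 and the reference count guard can be illustrated with a small Python model. This is a hedged sketch, not the kernel implementation: `ExtentMap` and `can_merge` are hypothetical names, and the threshold of 2 is taken from the sentence above without modeling which references it accounts for.

```python
KIB = 1024


class ExtentMap:
    def __init__(self, start, length, refs):
        self.start = start
        self.length = length
        self.refs = refs  # models the extent map's reference count

    def end(self):
        # Same idea as extent_map_end(): start + length.
        return self.start + self.length


def can_merge(em):
    """Guard from the fix: refuse to merge while other tasks may use the map."""
    return em.refs <= 2


cur = 4 * KIB  # the reader wants the page at file offset 4KiB

# Unfixed behaviour: the merge updates start first, then length.
em_b = ExtentMap(start=4 * KIB, length=4 * KIB, refs=3)
em_b.start = 0
torn_end = em_b.end()             # reader observes start=0, length=4KiB
bug_on_fires = torn_end <= cur    # the BUG_ON() condition

# Fixed behaviour: with a third reference held by the reader, no merge happens.
em_fixed = ExtentMap(start=4 * KIB, length=4 * KIB, refs=3)
if can_merge(em_fixed):
    em_fixed.start, em_fixed.length = 0, 8 * KIB
reader_sees_consistent_map = em_fixed.end() > cur
```

In the unfixed sequence the reader computes an end of 4KiB, which is not greater than `cur`, so the check fails. With the guard the map is left untouched while the reader holds its reference.
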
      Reported-by: ryusuke1925 <st13s20@gm.ibaraki-ct.ac.jp>
      Reported-by: Koki Mitani <koki.mitani.xg@hco.ntt.co.jp>
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=206211
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: ref-verify: fix memory leaks · f311ade3
      Committed by Wenwen Wang
      In btrfs_ref_tree_mod(), 'ref' and 'ra' are allocated through kzalloc() and
      kmalloc(), respectively. In the following code, if an error occurs, the
      execution will be redirected to 'out' or 'out_unlock' and the function will
      be exited. However, on some of the paths, 'ref' and 'ra' are not
      deallocated, leading to memory leaks. For example, if 'action' is
      BTRFS_ADD_DELAYED_EXTENT, add_block_entry() will be invoked. If the return
      value indicates an error, the execution will be redirected to 'out'. But,
      'ref' is not deallocated on this path, causing a memory leak.
      
      To fix the above issues, deallocate both 'ref' and 'ra' before exiting from
      the function when an error is encountered.
      
      CC: stable@vger.kernel.org # 4.15+
      Signed-off-by: Wenwen Wang <wenwen@cs.uga.edu>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  9. 03 Feb, 2020 1 commit
    • btrfs: do not zero f_bavail if we have available space · d55966c4
      Committed by Josef Bacik
      There was some logic added a while ago to clear out f_bavail in statfs()
      if we did not have enough free metadata space to satisfy our global
      reserve.  This was incorrect at the time, but it didn't really pose a
      problem for normal file systems because we would often allocate chunks
      if we got this low on free metadata space, and thus wouldn't really hit
      this case unless we were actually full.
      
      Fast forward to today and now we are much better about not allocating
      metadata chunks all of the time.  Couple this with d792b0f1 ("btrfs:
      always reserve our entire size for the global reserve") which now means
      we'll easily have a larger global reserve than our free space, we are
      now more likely to trip over this while still having plenty of space.
      
      Fix this by skipping this logic if the global rsv's space_info is not
      full.  space_info->full is 0 unless we've attempted to allocate a chunk
      for that space_info and that has failed.  If this happens then the space
      for the global reserve is definitely sacred and we need to report
      b_avail == 0, but before then we can just use our calculated b_avail.
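
The decision described above can be written down as a small Python function. This is a sketch of the rule as stated in this message, not the statfs code itself: `reported_bavail` and all of its parameters are hypothetical stand-ins for kernel state.

```python
def reported_bavail(calculated_bavail, free_meta, global_rsv_size, space_info_full):
    """Return the available-space value to report, per the rule described above.

    calculated_bavail -- available space as normally calculated
    free_meta         -- free metadata space
    global_rsv_size   -- size of the global block reserve
    space_info_full   -- True once a chunk allocation for the global
                         reserve's space_info has been tried and failed
    """
    # Report zero only when the reserve does not fit AND no further
    # metadata chunk can be allocated.
    if space_info_full and free_meta < global_rsv_size:
        return 0
    return calculated_bavail
```

Before the fix, the `space_info_full` condition was effectively missing, so a large global reserve alone was enough to report zero available space.
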
      Reported-by: Martin Steigerwald <martin@lichtvoll.de>
      Fixes: ca8a51b3 ("btrfs: statfs: report zero available if metadata are exhausted")
      CC: stable@vger.kernel.org # 4.5+
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Tested-by: Martin Steigerwald <martin@lichtvoll.de>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  10. 01 Feb, 2020 1 commit
  11. 31 Jan, 2020 4 commits
    • Btrfs: send, fix emission of invalid clone operations within the same file · 9722b101
      Committed by Filipe Manana
      When doing an incremental send and a file has extents shared with itself
      at different file offsets, it's possible for send to emit clone operations
      that will fail at the destination because the source range goes beyond the
      file's current size. This happens when the file size has increased in the
      send snapshot, there is a hole between the shared extents and both shared
      extents are at file offsets which are greater than the file's size in the
      parent snapshot.
      
      Example:
      
        $ mkfs.btrfs -f /dev/sdb
        $ mount /dev/sdb /mnt/sdb
      
        $ xfs_io -f -c "pwrite -S 0xf1 0 64K" /mnt/sdb/foobar
        $ btrfs subvolume snapshot -r /mnt/sdb /mnt/sdb/base
        $ btrfs send -f /tmp/1.snap /mnt/sdb/base
      
        # Create a 320K extent at file offset 512K.
        $ xfs_io -c "pwrite -S 0xab 512K 64K" /mnt/sdb/foobar
        $ xfs_io -c "pwrite -S 0xcd 576K 64K" /mnt/sdb/foobar
        $ xfs_io -c "pwrite -S 0xef 640K 64K" /mnt/sdb/foobar
        $ xfs_io -c "pwrite -S 0x64 704K 64K" /mnt/sdb/foobar
        $ xfs_io -c "pwrite -S 0x73 768K 64K" /mnt/sdb/foobar
      
        # Clone part of that 320K extent into a lower file offset (192K).
        # This file offset is greater than the file's size in the parent
        # snapshot (64K). Also the clone range is a bit behind the offset of
        # the 320K extent so that we leave a hole between the shared extents.
        $ xfs_io -c "reflink /mnt/sdb/foobar 448K 192K 192K" /mnt/sdb/foobar
      
        $ btrfs subvolume snapshot -r /mnt/sdb /mnt/sdb/incr
        $ btrfs send -p /mnt/sdb/base -f /tmp/2.snap /mnt/sdb/incr
      
        $ mkfs.btrfs -f /dev/sdc
        $ mount /dev/sdc /mnt/sdc
      
        $ btrfs receive -f /tmp/1.snap /mnt/sdc
        $ btrfs receive -f /tmp/2.snap /mnt/sdc
        ERROR: failed to clone extents to foobar: Invalid argument
      
      The problem is that after processing the extent at file offset 256K, which
      refers to the first 128K of the 320K extent created by the buffered write
      operations, we have 'cur_inode_next_write_offset' set to 384K, which
      corresponds to the end offset of the partially shared extent (256K + 128K)
      and to the current file size in the receiver. Then when we process the
      extent at offset 512K, we do extent backreference iteration to figure out
      if we can clone the extent from some other inode or from the same inode,
      and we consider the extent at offset 256K of the same inode as a valid
      source for a clone operation, which is not correct because at that point
      the current file size in the receiver is 384K, which corresponds to the
      end of last processed extent (at file offset 256K), so using a clone
      source range from 256K to 256K + 320K is invalid because that goes past
      the current size of the file (384K) - this makes the receiver get an
      -EINVAL error when attempting the clone operation.
      
      So fix this by excluding clone sources that have a range that goes beyond
      the current file size in the receiver when iterating extent backreferences.
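
The exclusion rule can be expressed as a simple bounds check. The Python below is a minimal sketch with hypothetical names (`clone_source_is_valid`, `receiver_file_size`); it is not the send code, it only reproduces the arithmetic of the example above.

```python
KIB = 1024


def clone_source_is_valid(src_offset, length, receiver_file_size, same_inode=True):
    """Check whether a clone source range may be used.

    A source range in the file being processed must end at or before the
    file's current size at the receiver. Sources in other inodes are not
    restricted by this particular check.
    """
    if not same_inode:
        return True
    return src_offset + length <= receiver_file_size


# Values from the example: the file is 384K long at the receiver when the
# 320K extent is processed.
rejected = clone_source_is_valid(256 * KIB, 320 * KIB, 384 * KIB)
accepted = clone_source_is_valid(256 * KIB, 128 * KIB, 384 * KIB)
```

Here `rejected` is false because 256K + 320K goes past 384K, while a 128K range starting at 256K ends exactly at the current size and is accepted.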
      
      A test case for fstests follows soon.
      
      Fixes: 11f2069c ("Btrfs: send, allow clone operations within the same file")
      CC: stable@vger.kernel.org # 5.5+
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      9722b101
    • J
      btrfs: do not do delalloc reservation under page lock · f4b1363c
      Josef Bacik committed
      We ran into a deadlock in production with the fixup worker.  The stack
      traces were as follows:
      
      Thread responsible for the writeout, waiting on the page lock
      
        [<0>] io_schedule+0x12/0x40
        [<0>] __lock_page+0x109/0x1e0
        [<0>] extent_write_cache_pages+0x206/0x360
        [<0>] extent_writepages+0x40/0x60
        [<0>] do_writepages+0x31/0xb0
        [<0>] __writeback_single_inode+0x3d/0x350
        [<0>] writeback_sb_inodes+0x19d/0x3c0
        [<0>] __writeback_inodes_wb+0x5d/0xb0
        [<0>] wb_writeback+0x231/0x2c0
        [<0>] wb_workfn+0x308/0x3c0
        [<0>] process_one_work+0x1e0/0x390
        [<0>] worker_thread+0x2b/0x3c0
        [<0>] kthread+0x113/0x130
        [<0>] ret_from_fork+0x35/0x40
        [<0>] 0xffffffffffffffff
      
      Thread of the fixup worker, which is holding the page lock
      
        [<0>] start_delalloc_inodes+0x241/0x2d0
        [<0>] btrfs_start_delalloc_roots+0x179/0x230
        [<0>] btrfs_alloc_data_chunk_ondemand+0x11b/0x2e0
        [<0>] btrfs_check_data_free_space+0x53/0xa0
        [<0>] btrfs_delalloc_reserve_space+0x20/0x70
        [<0>] btrfs_writepage_fixup_worker+0x1fc/0x2a0
        [<0>] normal_work_helper+0x11c/0x360
        [<0>] process_one_work+0x1e0/0x390
        [<0>] worker_thread+0x2b/0x3c0
        [<0>] kthread+0x113/0x130
        [<0>] ret_from_fork+0x35/0x40
        [<0>] 0xffffffffffffffff
      
      Thankfully the stars have to align just right to hit this.  First you
      have to end up in the fixup worker, which is tricky by itself (my
      reproducer does DIO reads into a MMAP'ed region, so not a common
      operation).  Then you have to have less than a page size of free data
      space and 0 unallocated space so you go down the "commit the transaction
      to free up pinned space" path.  This was accomplished by a random
      balance that was running on the host.  Then you get this deadlock.
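
The two stack traces can be read as a cycle in a waits-for graph. The sketch below is a simplified model rather than anything taken from the kernel: the node names and the `find_cycle` helper are hypothetical, and the last edge (the delalloc flush depending on the writeback thread) is an assumption drawn from the traces.

```python
def find_cycle(waits_for, start):
    """Follow waits-for edges from start; return the cycle if one is reached."""
    path = []
    node = start
    while node in waits_for and node not in path:
        path.append(node)
        node = waits_for[node]
    if node in path:
        # We came back to a node already visited: everything from its
        # first appearance onwards forms the cycle.
        return path[path.index(node):]
    return []


# Each edge reads "X cannot make progress until Y does" (simplified model).
waits_for = {
    "writeback thread": "page lock",       # blocked in __lock_page
    "page lock": "fixup worker",           # the fixup worker holds the lock
    "fixup worker": "delalloc flush",      # reserving space under the lock
    "delalloc flush": "writeback thread",  # assumed: flush needs the page written
}
cycle = find_cycle(waits_for, "writeback thread")
print(" -> ".join(cycle))
```

Removing the "fixup worker" to "delalloc flush" edge, which is what doing the reservation before taking the page lock amounts to in this model, leaves no cycle.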
      
      I'm still in the process of trying to force the deadlock to happen on
      demand, but I've hit other issues.  I can still trigger the fixup worker
      path itself, so this patch has been tested in that regard and the normal
      case is fine.
      
      Fixes: 87826df0 ("btrfs: delalloc for page dirtied out-of-band in fixup worker")
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      f4b1363c
    • J
      btrfs: drop the -EBUSY case in __extent_writepage_io · 5ab58055
      Josef Bacik committed
      Now that we only return 0 or -EAGAIN from btrfs_writepage_cow_fixup, we
      do not need this -EBUSY case.
      
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      5ab58055
    • C
      Btrfs: keep pages dirty when using btrfs_writepage_fixup_worker · 25f3c502
      Chris Mason committed
      For COW, btrfs expects dirty pages to have been through a few setup
      steps.  This includes reserving space for the new block allocations and marking
      the range in the state tree for delayed allocation.
      
      A few places outside btrfs will dirty pages directly, especially when unmapping
      mmap'd pages.  In order for these to properly go through COW, we run them
      through a fixup worker to wait for stable pages, and do the delalloc prep.
      
      87826df0 added a window where the dirty pages were cleaned, but pending
      more action from the fixup worker.  We clear_page_dirty_for_io() before
      we call into writepage, so the page is no longer dirty.  The commit
      changed it so now we leave the page clean between unlocking it here and
      the fixup worker starting at some point in the future.
      
      During this window, page migration can jump in and relocate the page.  Once our
      fixup work actually starts, it finds page->mapping is NULL and we end up
      freeing the page without ever writing it.
      
      This leads to CRC errors and other exciting problems, since it screws up the
      whole state machine for waiting for ordered extents.  The fix here is to keep
      the page dirty while we're waiting for the fixup worker to get to work.
      This is accomplished by returning -EAGAIN from btrfs_writepage_cow_fixup
      if we queued the page up for fixup, which will cause the writepage
      function to redirty the page.
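
The redirty behaviour can be illustrated with a toy model. Everything below is a hypothetical stand-in written for illustration (the `Page` class, `cow_fixup` and `writepage` are not the kernel functions), assuming only what the text above states: the dirty bit is cleared before writepage runs, and -EAGAIN from the fixup causes the page to be redirtied.

```python
import errno


class Page:
    """Toy page carrying only a dirty flag."""

    def __init__(self):
        self.dirty = True


def cow_fixup(page, needs_fixup):
    """Stand-in for the fixup check: ask for a retry if the page was queued."""
    return -errno.EAGAIN if needs_fixup else 0


def writepage(page, needs_fixup):
    """Model of the write path around the fixup call."""
    page.dirty = False  # models clear_page_dirty_for_io() before writepage
    ret = cow_fixup(page, needs_fixup)
    if ret == -errno.EAGAIN:
        # The page was queued for fixup: redirty it so it is not seen as
        # clean while the worker is still pending.
        page.dirty = True
    return ret


queued = Page()
writepage(queued, needs_fixup=True)
normal = Page()
writepage(normal, needs_fixup=False)
print(queued.dirty, normal.dirty)
```

In this model the queued page never appears clean between the writepage call and the fixup worker running, which is the window the text above describes.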
      
      Because we now expect the page to be dirty once it gets to the fixup
      worker we must adjust the error cases to call clear_page_dirty_for_io()
      on the page.  That is the bulk of the patch, but it is not the fix; the
      fix is the -EAGAIN from btrfs_writepage_cow_fixup.  We cannot separate
      these two changes out because the error conditions change with the new
      expectations.
      
      Signed-off-by: Chris Mason <clm@fb.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      25f3c502