1. 30 4月, 2019 9 次提交
    • A
      btrfs: modify local copy of btrfs_inode flags · d2b8fcfe
      Anand Jain 提交于
      Instead of updating the binode::flags directly, update a local copy, and
      then at the point of no error, store copy it to the binode::flags.
      Signed-off-by: NAnand Jain <anand.jain@oracle.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      d2b8fcfe
    • A
      btrfs: drop useless inode i_flags copy and restore · 11d3cd5c
      Anand Jain 提交于
      The patch ("btrfs: start transaction in btrfs_ioctl_setflags()") used
      btrfs_set_prop() instead of btrfs_set_prop_trans() by which now the
      inode::i_flags update functions such as
      btrfs_sync_inode_flags_to_i_flags() and btrfs_update_inode() is called
      in btrfs_ioctl_setflags() instead of
      btrfs_set_prop_trans()->btrfs_setxattr() as earlier. So the
      inode::i_flags remains unmodified until the thread has checked all the
      conditions. So drop the saved inode::i_flags in out_i_flags.
      Signed-off-by: NAnand Jain <anand.jain@oracle.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      11d3cd5c
    • A
      btrfs: start transaction in btrfs_ioctl_setflags() · ff9fef55
      Anand Jain 提交于
      Inode attribute can be set through the FS_IOC_SETFLAGS ioctl.  This
      flags also includes compression attribute for which we would set/reset
      the compression extended attribute. While doing this there is a bit of
      duplicate code, the following things happens twice:
      
      - start/end_transaction
      - inode_inc_iversion()
      - current_time update to inode->i_ctime
      - and btrfs_update_inode()
      
      These are updated both at btrfs_ioctl_setflags() and btrfs_set_props()
      as well.  This patch merges these two duplicate codes at
      btrfs_ioctl_setflags().
      Signed-off-by: NAnand Jain <anand.jain@oracle.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      ff9fef55
    • A
      btrfs: refactor btrfs_set_props to validate externally · f22125e5
      Anand Jain 提交于
      In preparation to merge multiple transactions when setting the
      compression flags, split btrfs_set_props() validation part outside of
      it.
      Signed-off-by: NAnand Jain <anand.jain@oracle.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      f22125e5
    • F
      Btrfs: fix race between send and deduplication that lead to failures and crashes · 62d54f3a
      Filipe Manana 提交于
      Send operates on read only trees and expects them to never change while it
      is using them. This is part of its initial design, and this expection is
      due to two different reasons:
      
      1) When it was introduced, no operations were allowed to modifiy read-only
         subvolumes/snapshots (including defrag for example).
      
      2) It keeps send from having an impact on other filesystem operations.
         Namely send does not need to keep locks on the trees nor needs to hold on
         to transaction handles and delay transaction commits. This ends up being
         a consequence of the former reason.
      
      However the deduplication feature was introduced later (on September 2013,
      while send was introduced in July 2012) and it allowed for deduplication
      with destination files that belong to read-only trees (subvolumes and
      snapshots).
      
      That means that having a send operation (either full or incremental) running
      in parallel with a deduplication that has the destination inode in one of
      the trees used by the send operation, can result in tree nodes and leaves
      getting freed and reused while send is using them. This problem is similar
      to the problem solved for the root nodes getting freed and reused when a
      snapshot is made against one tree that is currenly being used by a send
      operation, fixed in commits [1] and [2]. These commits explain in detail
      how the problem happens and the explanation is valid for any node or leaf
      that is not the root of a tree as well. This problem was also discussed
      and explained recently in a thread [3].
      
      The problem is very easy to reproduce when using send with large trees
      (snapshots) and just a few concurrent deduplication operations that target
      files in the trees used by send. A stress test case is being sent for
      fstests that triggers the issue easily. The most common error to hit is
      the send ioctl return -EIO with the following messages in dmesg/syslog:
      
       [1631617.204075] BTRFS error (device sdc): did not find backref in send_root. inode=63292, offset=0, disk_byte=5228134400 found extent=5228134400
       [1631633.251754] BTRFS error (device sdc): parent transid verify failed on 32243712 wanted 24 found 27
      
      The first one is very easy to hit while the second one happens much less
      frequently, except for very large trees (in that test case, snapshots
      with 100000 files having large xattrs to get deep and wide trees).
      Less frequently, at least one BUG_ON can be hit:
      
       [1631742.130080] ------------[ cut here ]------------
       [1631742.130625] kernel BUG at fs/btrfs/ctree.c:1806!
       [1631742.131188] invalid opcode: 0000 [#6] SMP DEBUG_PAGEALLOC PTI
       [1631742.131726] CPU: 1 PID: 13394 Comm: btrfs Tainted: G    B D W         5.0.0-rc8-btrfs-next-45 #1
       [1631742.132265] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626ccb91-prebuilt.qemu-project.org 04/01/2014
       [1631742.133399] RIP: 0010:read_node_slot+0x122/0x130 [btrfs]
       (...)
       [1631742.135061] RSP: 0018:ffffb530021ebaa0 EFLAGS: 00010246
       [1631742.135615] RAX: ffff93ac8912e000 RBX: 000000000000009d RCX: 0000000000000002
       [1631742.136173] RDX: 000000000000009d RSI: ffff93ac564b0d08 RDI: ffff93ad5b48c000
       [1631742.136759] RBP: ffffb530021ebb7d R08: 0000000000000001 R09: ffffb530021ebb7d
       [1631742.137324] R10: ffffb530021eba70 R11: 0000000000000000 R12: ffff93ac87d0a708
       [1631742.137900] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000001
       [1631742.138455] FS:  00007f4cdb1528c0(0000) GS:ffff93ad76a80000(0000) knlGS:0000000000000000
       [1631742.139010] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       [1631742.139568] CR2: 00007f5acb3d0420 CR3: 000000012be3e006 CR4: 00000000003606e0
       [1631742.140131] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
       [1631742.140719] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
       [1631742.141272] Call Trace:
       [1631742.141826]  ? do_raw_spin_unlock+0x49/0xc0
       [1631742.142390]  tree_advance+0x173/0x1d0 [btrfs]
       [1631742.142948]  btrfs_compare_trees+0x268/0x690 [btrfs]
       [1631742.143533]  ? process_extent+0x1070/0x1070 [btrfs]
       [1631742.144088]  btrfs_ioctl_send+0x1037/0x1270 [btrfs]
       [1631742.144645]  _btrfs_ioctl_send+0x80/0x110 [btrfs]
       [1631742.145161]  ? trace_sched_stick_numa+0xe0/0xe0
       [1631742.145685]  btrfs_ioctl+0x13fe/0x3120 [btrfs]
       [1631742.146179]  ? account_entity_enqueue+0xd3/0x100
       [1631742.146662]  ? reweight_entity+0x154/0x1a0
       [1631742.147135]  ? update_curr+0x20/0x2a0
       [1631742.147593]  ? check_preempt_wakeup+0x103/0x250
       [1631742.148053]  ? do_vfs_ioctl+0xa2/0x6f0
       [1631742.148510]  ? btrfs_ioctl_get_supported_features+0x30/0x30 [btrfs]
       [1631742.148942]  do_vfs_ioctl+0xa2/0x6f0
       [1631742.149361]  ? __fget+0x113/0x200
       [1631742.149767]  ksys_ioctl+0x70/0x80
       [1631742.150159]  __x64_sys_ioctl+0x16/0x20
       [1631742.150543]  do_syscall_64+0x60/0x1b0
       [1631742.150931]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
       [1631742.151326] RIP: 0033:0x7f4cd9f5add7
       (...)
       [1631742.152509] RSP: 002b:00007ffe91017708 EFLAGS: 00000202 ORIG_RAX: 0000000000000010
       [1631742.152892] RAX: ffffffffffffffda RBX: 0000000000000105 RCX: 00007f4cd9f5add7
       [1631742.153268] RDX: 00007ffe91017790 RSI: 0000000040489426 RDI: 0000000000000007
       [1631742.153633] RBP: 0000000000000007 R08: 00007f4cd9e79700 R09: 00007f4cd9e79700
       [1631742.153999] R10: 00007f4cd9e799d0 R11: 0000000000000202 R12: 0000000000000003
       [1631742.154365] R13: 0000555dfae53020 R14: 0000000000000000 R15: 0000000000000001
       (...)
       [1631742.156696] ---[ end trace 5dac9f96dcc3fd6b ]---
      
      That BUG_ON happens because while send is using a node, that node is COWed
      by a concurrent deduplication, gets freed and gets reused as a leaf (because
      a transaction commit happened in between), so when it attempts to read a
      slot from the extent buffer, at ctree.c:read_node_slot(), the extent buffer
      contents were wiped out and it now matches a leaf (which can even belong to
      some other tree now), hitting the BUG_ON(level == 0).
      
      Fix this concurrency issue by not allowing send and deduplication to run
      in parallel if both operate on the same readonly trees, returning EAGAIN
      to user space and logging an exlicit warning in dmesg/syslog.
      
      [1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=be6821f82c3cc36e026f5afd10249988852b35ea
      [2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=6f2f0b394b54e2b159ef969a0b5274e9bbf82ff2
      [3] https://lore.kernel.org/linux-btrfs/CAL3q7H7iqSEEyFaEtpRZw3cp613y+4k2Q8b4W7mweR3tZA05bQ@mail.gmail.com/
      
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      62d54f3a
    • Q
      btrfs: extent-tree: Use btrfs_ref to refactor btrfs_inc_extent_ref() · 82fa113f
      Qu Wenruo 提交于
      Use the new btrfs_ref structure and replace parameter list to clean up
      the usage of owner and level to distinguish the extent types.
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      82fa113f
    • G
      btrfs: Perform locking/unlocking in btrfs_remap_file_range() · 7984ae52
      Goldwyn Rodrigues 提交于
      Move code to make it more readable, so as locking and unlocking is
      done in the same function. The generic checks that are now performed in
      the locked section are unaffected.
      Signed-off-by: NGoldwyn Rodrigues <rgoldwyn@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      7984ae52
    • A
      btrfs: refactor btrfs_set_prop and add btrfs_set_prop_trans · 262c96a3
      Anand Jain 提交于
      btrfs_set_prop() takes transaction pointer as the first argument,
      however in ioctl.c for the purpose of setting the compression property,
      we call btrfs_set_prop() with NULL transaction pointer. Down in
      the call chain  btrfs_setxattr() starts transaction to update the
      attribute and also to update the inode.
      
      So for clarity, create btrfs_set_prop_trans() with no transaction
      pointer as argument, in preparation to start transaction here instead of
      doing it down the call chain at btrfs_setxattr().
      
      Also now the btrfs_set_prop() is a static function.
      Signed-off-by: NAnand Jain <anand.jain@oracle.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      262c96a3
    • A
      btrfs: merge _btrfs_set_prop helpers · 7715da84
      Anand Jain 提交于
      btrfs_set_prop() is a redirect to __btrfs_set_prop() with the
      transaction handle equal to NULL.  __btrfs_set_prop() in turn passes
      this to do_setxattr() which then transaction is actually created.
      
      Instead merge  __btrfs_set_prop() to btrfs_set_prop(), and update the
      caller with NULL argument.
      Signed-off-by: NAnand Jain <anand.jain@oracle.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      7715da84
  2. 29 3月, 2019 1 次提交
  3. 27 2月, 2019 1 次提交
    • F
      Btrfs: fix deadlock between clone/dedupe and rename · 4ea748e1
      Filipe Manana 提交于
      Reflinking (clone/dedupe) and rename are operations that operate on two
      inodes and therefore need to lock them in the same order to avoid ABBA
      deadlocks. It happens that Btrfs' reflink implementation always locked
      them in a different order from VFS's lock_two_nondirectories() helper,
      which is used by the rename code in VFS, resulting in ABBA type deadlocks.
      
      Btrfs' locking order:
      
        static void btrfs_double_inode_lock(struct inode *inode1, struct inode *inode2)
        {
               if (inode1 < inode2)
                      swap(inode1, inode2);
      
               inode_lock_nested(inode1, I_MUTEX_PARENT);
               inode_lock_nested(inode2, I_MUTEX_CHILD);
        }
      
      VFS's locking order:
      
        void lock_two_nondirectories(struct inode *inode1, struct inode *inode2)
        {
              if (inode1 > inode2)
                      swap(inode1, inode2);
      
              if (inode1 && !S_ISDIR(inode1->i_mode))
                      inode_lock(inode1);
              if (inode2 && !S_ISDIR(inode2->i_mode) && inode2 != inode1)
                      inode_lock_nested(inode2, I_MUTEX_NONDIR2);
      }
      
      Fix this by killing the btrfs helper function that does the double inode
      locking and replace it with VFS's helper lock_two_nondirectories().
      Reported-by: NZygo Blaxell <ce3g8jdj@umail.furryterror.org>
      Fixes: 416161db ("btrfs: offline dedupe")
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      4ea748e1
  4. 25 2月, 2019 11 次提交
  5. 09 1月, 2019 2 次提交
    • F
      Btrfs: fix race between reflink/dedupe and relocation · d8b55242
      Filipe Manana 提交于
      The recent rework that makes btrfs' remap_file_range operation use the
      generic helper generic_remap_file_range_prep() introduced a race between
      relocation and reflinking (for both cloning and deduplication) the file
      extents between the source and destination inodes.
      
      This happens because we no longer lock the source range anymore, and we do
      not lock it anymore because we wait for direct IO writes and writeback to
      complete early on the code path right after locking the inodes, which
      guarantees no other file operations interfere with the reflinking. However
      there is one exception which is relocation, since it replaces the byte
      number of file extents items in the fs tree after locking the range the
      file extent items represent. This is a problem because after finding each
      file extent to clone in the fs tree, the reflink process copies the file
      extent item into a local buffer, releases the search path, inserts new
      file extent items in the destination range and then increments the
      reference count for the extent mentioned in the file extent item that it
      previously copied to the buffer. If right after copying the file extent
      item into the buffer and releasing the path the relocation process
      updates the file extent item to point to the new extent, the reflink
      process ends up creating a delayed reference to increment the reference
      count of the old extent, for which the relocation process already created
      a delayed reference to drop it. This results in failure to run delayed
      references because we will attempt to increment the count of a reference
      that was already dropped. This is illustrated by the following diagram:
      
              CPU 1                                       CPU 2
      
                                              relocation is running
      
        btrfs_clone_files()
      
          btrfs_clone()
            --> finds extent item
                in source range
                point to extent
                at bytenr X
            --> copies it into a
                local buffer
            --> releases path
      
                                              replace_file_extents()
                                                --> successfully locks the
                                                    range represented by
                                                    the file extent item
                                                --> replaces disk_bytenr
                                                    field in the file
                                                    extent item with some
                                                    other value Y
                                                --> creates delayed reference
                                                    to increment reference
                                                    count for extent at
                                                    bytenr Y
                                                --> creates delayed reference
                                                    to drop the extent at
                                                    bytenr X
      
            --> starts transaction
            --> creates delayed
                reference to
                increment extent
                at bytenr X
      
                          <delayed references are run, due to a transaction
                           commit for example, and the transaction is aborted
                           with -EIO because we attempt to increment reference
                           count for the extent at bytenr X after we freed it>
      
      When this race is hit the running transaction ends up getting aborted with
      an -EIO error and a trace like the following is produced:
      
      [ 4382.553858] WARNING: CPU: 2 PID: 3648 at fs/btrfs/extent-tree.c:1552 lookup_inline_extent_backref+0x4f4/0x650 [btrfs]
      (...)
      [ 4382.556293] CPU: 2 PID: 3648 Comm: btrfs Tainted: G        W         4.20.0-rc6-btrfs-next-41 #1
      [ 4382.556294] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626ccb91-prebuilt.qemu-project.org 04/01/2014
      [ 4382.556308] RIP: 0010:lookup_inline_extent_backref+0x4f4/0x650 [btrfs]
      (...)
      [ 4382.556310] RSP: 0018:ffffac784408f738 EFLAGS: 00010202
      [ 4382.556311] RAX: 0000000000000001 RBX: ffff8980673c3a48 RCX: 0000000000000001
      [ 4382.556312] RDX: 0000000000000008 RSI: 0000000000000000 RDI: 0000000000000000
      [ 4382.556312] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000001
      [ 4382.556313] R10: 0000000000000001 R11: ffff897f40000000 R12: 0000000000001000
      [ 4382.556313] R13: 00000000c224f000 R14: ffff89805de9bd40 R15: ffff8980453f4548
      [ 4382.556315] FS:  00007f5e759178c0(0000) GS:ffff89807b300000(0000) knlGS:0000000000000000
      [ 4382.563130] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [ 4382.563562] CR2: 00007f2e9789fcbc CR3: 0000000120512001 CR4: 00000000003606e0
      [ 4382.564005] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [ 4382.564451] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [ 4382.564887] Call Trace:
      [ 4382.565343]  insert_inline_extent_backref+0x55/0xe0 [btrfs]
      [ 4382.565796]  __btrfs_inc_extent_ref.isra.60+0x88/0x260 [btrfs]
      [ 4382.566249]  ? __btrfs_run_delayed_refs+0x93/0x1650 [btrfs]
      [ 4382.566702]  __btrfs_run_delayed_refs+0xa22/0x1650 [btrfs]
      [ 4382.567162]  btrfs_run_delayed_refs+0x7e/0x1d0 [btrfs]
      [ 4382.567623]  btrfs_commit_transaction+0x50/0x9c0 [btrfs]
      [ 4382.568112]  ? _raw_spin_unlock+0x24/0x30
      [ 4382.568557]  ? block_rsv_release_bytes+0x14e/0x410 [btrfs]
      [ 4382.569006]  create_subvol+0x3c8/0x830 [btrfs]
      [ 4382.569461]  ? btrfs_mksubvol+0x317/0x600 [btrfs]
      [ 4382.569906]  btrfs_mksubvol+0x317/0x600 [btrfs]
      [ 4382.570383]  ? rcu_sync_lockdep_assert+0xe/0x60
      [ 4382.570822]  ? __sb_start_write+0xd4/0x1c0
      [ 4382.571262]  ? mnt_want_write_file+0x24/0x50
      [ 4382.571712]  btrfs_ioctl_snap_create_transid+0x117/0x1a0 [btrfs]
      [ 4382.572155]  ? _copy_from_user+0x66/0x90
      [ 4382.572602]  btrfs_ioctl_snap_create+0x66/0x80 [btrfs]
      [ 4382.573052]  btrfs_ioctl+0x7c1/0x30e0 [btrfs]
      [ 4382.573502]  ? mem_cgroup_commit_charge+0x8b/0x570
      [ 4382.573946]  ? do_raw_spin_unlock+0x49/0xc0
      [ 4382.574379]  ? _raw_spin_unlock+0x24/0x30
      [ 4382.574803]  ? __handle_mm_fault+0xf29/0x12d0
      [ 4382.575215]  ? do_vfs_ioctl+0xa2/0x6f0
      [ 4382.575622]  ? btrfs_ioctl_get_supported_features+0x30/0x30 [btrfs]
      [ 4382.576020]  do_vfs_ioctl+0xa2/0x6f0
      [ 4382.576405]  ksys_ioctl+0x70/0x80
      [ 4382.576776]  __x64_sys_ioctl+0x16/0x20
      [ 4382.577137]  do_syscall_64+0x60/0x1b0
      [ 4382.577488]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
      (...)
      [ 4382.578837] RSP: 002b:00007ffe04bf64c8 EFLAGS: 00000202 ORIG_RAX: 0000000000000010
      [ 4382.579174] RAX: ffffffffffffffda RBX: 00005564136f3050 RCX: 00007f5e74724dd7
      [ 4382.579505] RDX: 00007ffe04bf64d0 RSI: 000000005000940e RDI: 0000000000000003
      [ 4382.579848] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000044
      [ 4382.580164] R10: 0000000000000541 R11: 0000000000000202 R12: 00005564136f3010
      [ 4382.580477] R13: 0000000000000003 R14: 00005564136f3035 R15: 00005564136f3050
      [ 4382.580792] irq event stamp: 0
      [ 4382.581106] hardirqs last  enabled at (0): [<0000000000000000>]           (null)
      [ 4382.581441] hardirqs last disabled at (0): [<ffffffff8d085842>] copy_process.part.32+0x6e2/0x2320
      [ 4382.581772] softirqs last  enabled at (0): [<ffffffff8d085842>] copy_process.part.32+0x6e2/0x2320
      [ 4382.582095] softirqs last disabled at (0): [<0000000000000000>]           (null)
      [ 4382.582413] ---[ end trace d3c188e3e9367382 ]---
      [ 4382.623855] BTRFS: error (device sdc) in btrfs_run_delayed_refs:2981: errno=-5 IO failure
      [ 4382.624295] BTRFS info (device sdc): forced readonly
      
      Fix this by locking the source range before searching for the file extent
      items in the fs tree, since the relocation process will try to lock the
      range a file extent item represents before updating it with the new extent
      location.
      
      Fixes: 34a28e3d ("Btrfs: use generic_remap_file_range_prep() for cloning and deduplication")
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      d8b55242
    • F
      Btrfs: fix race between cloning range ending at eof and writeback · f7fa1107
      Filipe Manana 提交于
      The recent rework that makes btrfs' remap_file_range operation use the
      generic helper generic_remap_file_range_prep() introduced a race between
      writeback and cloning a range that covers the eof extent of the source
      file into a destination offset that is greater then the same file's size.
      
      This happens because we now wait for writeback to complete before doing
      the truncation of the eof block, while previously we did the truncation
      and then waited for writeback to complete. This leads to a race between
      writeback of the truncated block and cloning the file extents in the
      source range, because we copy each file extent item we find in the fs
      root into a buffer, then release the path and then increment the reference
      count for the extent referred in that file extent item we copied, which
      can no longer exist if writeback of the truncated eof block completes
      after we copied the file extent item into the buffer and before we
      incremented the reference count. This is illustrated by the following
      diagram:
      
              CPU 1                                       CPU 2
      
        btrfs_clone_files()
          btrfs_cont_expand()
            btrfs_truncate_block()
               --> zeroes part of the
                   page containg eof,
                   marking it for
                  delalloc
      
          btrfs_clone()
            --> finds extent item
                covering eof,
                points to extent
                at bytenr X
            --> copies it into a
                local buffer
            --> releases path
      
                                              writeback starts
      
                                              btrfs_finish_ordered_io()
                                                insert_reserved_file_extent()
                                                  __btrfs_drop_extents()
                                                    --> creates delayed
                                                        reference to drop
                                                        the extent at
                                                        bytenr X
      
            --> starts transaction
            --> creates delayed
                reference to
                increment extent
                at bytenr X
      
                          <delayed references are run, due to a transaction
                           commit for example, and the transaction is aborted
                           with -EIO because we attempt to increment reference
                           count for the extent at bytenr X after we freed it>
      
      When this race is hit the running transaction ends up getting aborted with
      an -EIO error and a trace like the following is produced:
      
      [ 4382.553858] WARNING: CPU: 2 PID: 3648 at fs/btrfs/extent-tree.c:1552 lookup_inline_extent_backref+0x4f4/0x650 [btrfs]
      (...)
      [ 4382.556293] CPU: 2 PID: 3648 Comm: btrfs Tainted: G        W         4.20.0-rc6-btrfs-next-41 #1
      [ 4382.556294] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626ccb91-prebuilt.qemu-project.org 04/01/2014
      [ 4382.556308] RIP: 0010:lookup_inline_extent_backref+0x4f4/0x650 [btrfs]
      (...)
      [ 4382.556310] RSP: 0018:ffffac784408f738 EFLAGS: 00010202
      [ 4382.556311] RAX: 0000000000000001 RBX: ffff8980673c3a48 RCX: 0000000000000001
      [ 4382.556312] RDX: 0000000000000008 RSI: 0000000000000000 RDI: 0000000000000000
      [ 4382.556312] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000001
      [ 4382.556313] R10: 0000000000000001 R11: ffff897f40000000 R12: 0000000000001000
      [ 4382.556313] R13: 00000000c224f000 R14: ffff89805de9bd40 R15: ffff8980453f4548
      [ 4382.556315] FS:  00007f5e759178c0(0000) GS:ffff89807b300000(0000) knlGS:0000000000000000
      [ 4382.563130] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [ 4382.563562] CR2: 00007f2e9789fcbc CR3: 0000000120512001 CR4: 00000000003606e0
      [ 4382.564005] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [ 4382.564451] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [ 4382.564887] Call Trace:
      [ 4382.565343]  insert_inline_extent_backref+0x55/0xe0 [btrfs]
      [ 4382.565796]  __btrfs_inc_extent_ref.isra.60+0x88/0x260 [btrfs]
      [ 4382.566249]  ? __btrfs_run_delayed_refs+0x93/0x1650 [btrfs]
      [ 4382.566702]  __btrfs_run_delayed_refs+0xa22/0x1650 [btrfs]
      [ 4382.567162]  btrfs_run_delayed_refs+0x7e/0x1d0 [btrfs]
      [ 4382.567623]  btrfs_commit_transaction+0x50/0x9c0 [btrfs]
      [ 4382.568112]  ? _raw_spin_unlock+0x24/0x30
      [ 4382.568557]  ? block_rsv_release_bytes+0x14e/0x410 [btrfs]
      [ 4382.569006]  create_subvol+0x3c8/0x830 [btrfs]
      [ 4382.569461]  ? btrfs_mksubvol+0x317/0x600 [btrfs]
      [ 4382.569906]  btrfs_mksubvol+0x317/0x600 [btrfs]
      [ 4382.570383]  ? rcu_sync_lockdep_assert+0xe/0x60
      [ 4382.570822]  ? __sb_start_write+0xd4/0x1c0
      [ 4382.571262]  ? mnt_want_write_file+0x24/0x50
      [ 4382.571712]  btrfs_ioctl_snap_create_transid+0x117/0x1a0 [btrfs]
      [ 4382.572155]  ? _copy_from_user+0x66/0x90
      [ 4382.572602]  btrfs_ioctl_snap_create+0x66/0x80 [btrfs]
      [ 4382.573052]  btrfs_ioctl+0x7c1/0x30e0 [btrfs]
      [ 4382.573502]  ? mem_cgroup_commit_charge+0x8b/0x570
      [ 4382.573946]  ? do_raw_spin_unlock+0x49/0xc0
      [ 4382.574379]  ? _raw_spin_unlock+0x24/0x30
      [ 4382.574803]  ? __handle_mm_fault+0xf29/0x12d0
      [ 4382.575215]  ? do_vfs_ioctl+0xa2/0x6f0
      [ 4382.575622]  ? btrfs_ioctl_get_supported_features+0x30/0x30 [btrfs]
      [ 4382.576020]  do_vfs_ioctl+0xa2/0x6f0
      [ 4382.576405]  ksys_ioctl+0x70/0x80
      [ 4382.576776]  __x64_sys_ioctl+0x16/0x20
      [ 4382.577137]  do_syscall_64+0x60/0x1b0
      [ 4382.577488]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
      (...)
      [ 4382.578837] RSP: 002b:00007ffe04bf64c8 EFLAGS: 00000202 ORIG_RAX: 0000000000000010
      [ 4382.579174] RAX: ffffffffffffffda RBX: 00005564136f3050 RCX: 00007f5e74724dd7
      [ 4382.579505] RDX: 00007ffe04bf64d0 RSI: 000000005000940e RDI: 0000000000000003
      [ 4382.579848] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000044
      [ 4382.580164] R10: 0000000000000541 R11: 0000000000000202 R12: 00005564136f3010
      [ 4382.580477] R13: 0000000000000003 R14: 00005564136f3035 R15: 00005564136f3050
      [ 4382.580792] irq event stamp: 0
      [ 4382.581106] hardirqs last  enabled at (0): [<0000000000000000>]           (null)
      [ 4382.581441] hardirqs last disabled at (0): [<ffffffff8d085842>] copy_process.part.32+0x6e2/0x2320
      [ 4382.581772] softirqs last  enabled at (0): [<ffffffff8d085842>] copy_process.part.32+0x6e2/0x2320
      [ 4382.582095] softirqs last disabled at (0): [<0000000000000000>]           (null)
      [ 4382.582413] ---[ end trace d3c188e3e9367382 ]---
      [ 4382.623855] BTRFS: error (device sdc) in btrfs_run_delayed_refs:2981: errno=-5 IO failure
      [ 4382.624295] BTRFS info (device sdc): forced readonly
      
      Fix this by waiting for writeback to complete after truncating the eof
      block.
      
      Fixes: 34a28e3d ("Btrfs: use generic_remap_file_range_prep() for cloning and deduplication")
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      f7fa1107
  6. 17 12月, 2018 4 次提交
    • F
      Btrfs: use generic_remap_file_range_prep() for cloning and deduplication · 34a28e3d
      Filipe Manana 提交于
      Since cloning and deduplication are no longer Btrfs specific operations, we
      now have generic code to handle parameter validation, compare file ranges
      used for deduplication, clear capabilities when cloning, etc. This change
      makes Btrfs use it, eliminating a lot of code in Btrfs and also fixing a
      few bugs, such as:
      
      1) When cloning, the destination file's capabilities were not dropped
         (the fstest generic/513 tests this);
      
      2) We were not checking if the destination file is immutable;
      
      3) Not checking if either the source or destination files are swap
         files (swap file support is coming soon for Btrfs);
      
      4) System limits were not checked (resource limits and O_LARGEFILE).
      
      Note that the generic helper generic_remap_file_range_prep() does start
      and waits for writeback by calling filemap_write_and_wait_range(), however
      that is not enough for Btrfs for two reasons:
      
      1) With compression, we need to start writeback twice in order to get the
         pages marked for writeback and ordered extents created;
      
      2) filemap_write_and_wait_range() (and all its other variants) only waits
         for the IO to complete, but we need to wait for the ordered extents to
         finish, so that when we do the actual reflinking operations the file
         extent items are in the fs tree. This is also important due to the fact
         that the generic helper, for the deduplication case, compares the
         contents of the pages in the requested range, which might require
         reading extents from disk in the very unlikely case that pages get
         invalidated after writeback finishes (so the file extent items must be
         up to date in the fs tree).
      
      Since these reasons are specific to Btrfs we have to do it in the Btrfs
      code before calling generic_remap_file_range_prep(). This also results
      in a simpler way of dealing with existing delalloc in the source/target
      ranges, specially for the deduplication case where we used to lock all
      the pages first and then if we found any dealloc for the range, or
      ordered extent, we would unlock the pages trigger writeback and wait for
      ordered extents to complete, then lock all the pages again and check if
      deduplication can be done. So now we get a simpler approach: lock the
      inodes, then trigger writeback and then wait for ordered extents to
      complete.
      
      So make btrfs use generic_remap_file_range_prep() (XFS and OCFS2 use it)
      to eliminate duplicated code, fix a few bugs and benefit from future bug
      fixes done there - for example the recent clone and dedupe bugs involving
      reflinking a partial EOF block got a counterpart fix in the generic
      helper, since it affected all filesystems supporting these operations,
      so we no longer need special checks in Btrfs for them.
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      34a28e3d
    • N
      btrfs: Remove fsid/metadata_fsid fields from btrfs_info · de37aa51
      Nikolay Borisov 提交于
      Currently btrfs_fs_info structure contains a copy of the
      fsid/metadata_uuid fields. Same values are also contained in the
      btrfs_fs_devices structure which fs_info has a reference to. Let's
      reduce duplication by removing the fields from fs_info and always refer
      to the ones in fs_devices. No functional changes.
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      de37aa51
    • E
      btrfs: use tagged writepage to mitigate livelock of snapshot · 3cd24c69
      Ethan Lien 提交于
      Snapshot is expected to be fast. But if there are writers steadily
      creating dirty pages in our subvolume, the snapshot may take a very long
      time to complete. To fix the problem, we use tagged writepage for
      snapshot flusher as we do in the generic write_cache_pages(), so we can
      omit pages dirtied after the snapshot command.
      
      This does not change the semantics regarding which data get to the
      snapshot, if there are pages being dirtied during the snapshotting
      operation.  There's a sync called before snapshot is taken in old/new
      case, any IO in flight just after that may be in the snapshot but this
      depends on other system effects that might still sync the IO.
      
      We do a simple snapshot speed test on a Intel D-1531 box:
      
      fio --ioengine=libaio --iodepth=32 --bs=4k --rw=write --size=64G
      --direct=0 --thread=1 --numjobs=1 --time_based --runtime=120
      --filename=/mnt/sub/testfile --name=job1 --group_reporting & sleep 5;
      time btrfs sub snap -r /mnt/sub /mnt/snap; killall fio
      
      original: 1m58sec
      patched:  6.54sec
      
      This is the best case for this patch since for a sequential write case,
      we omit nearly all pages dirtied after the snapshot command.
      
      For a multi writers, random write test:
      
      fio --ioengine=libaio --iodepth=32 --bs=4k --rw=randwrite --size=64G
      --direct=0 --thread=1 --numjobs=4 --time_based --runtime=120
      --filename=/mnt/sub/testfile --name=job1 --group_reporting & sleep 5;
      time btrfs sub snap -r /mnt/sub /mnt/snap; killall fio
      
      original: 15.83sec
      patched:  10.35sec
      
      The improvement is smaller compared to the sequential write case,
      since we omit only half of the pages dirtied after snapshot command.
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NEthan Lien <ethanlien@synology.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      3cd24c69
    • O
      Btrfs: prevent ioctls from interfering with a swap file · eede2bf3
      Omar Sandoval 提交于
      A later patch will implement swap file support for Btrfs, but before we
      do that, we need to make sure that the various Btrfs ioctls cannot
      change a swap file.
      
      When a swap file is active, we must make sure that the extents of the
      file are not moved and that they don't become shared. That means that
      the following are not safe:
      
      - chattr +c (enable compression)
      - reflink
      - dedupe
      - snapshot
      - defrag
      
      Don't allow those to happen on an active swap file.
      
      Additionally, balance, resize, device remove, and device replace are
      also unsafe if they affect an active swapfile. Add a red-black tree of
      block groups and devices which contain an active swapfile. Relocation
      checks each block group against this tree and skips it or errors out for
      balance or resize, respectively. Device remove and device replace check
      the tree for the device they will operate on.
      
      Note that we don't have to worry about chattr -C (disable nocow), which
      we ignore for non-empty files, because an active swapfile must be
      non-empty and can't be truncated. We also don't have to worry about
      autodefrag because it's only done on COW files. Truncate and fallocate
      are already taken care of by the generic code. Device add doesn't do
      relocation so it's not an issue, either.
      Signed-off-by: NOmar Sandoval <osandov@fb.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      eede2bf3
  7. 06 11月, 2018 2 次提交
    • F
      Btrfs: fix data corruption due to cloning of eof block · ac765f83
      Filipe Manana 提交于
      We currently allow cloning a range from a file which includes the last
      block of the file even if the file's size is not aligned to the block
      size. This is fine and useful when the destination file has the same size,
      but when it does not and the range ends somewhere in the middle of the
      destination file, it leads to corruption because the bytes between the EOF
      and the end of the block have undefined data (when there is support for
      discard/trimming they have a value of 0x00).
      
      Example:
      
       $ mkfs.btrfs -f /dev/sdb
       $ mount /dev/sdb /mnt
      
       $ export foo_size=$((256 * 1024 + 100))
       $ xfs_io -f -c "pwrite -S 0x3c 0 $foo_size" /mnt/foo
       $ xfs_io -f -c "pwrite -S 0xb5 0 1M" /mnt/bar
      
       $ xfs_io -c "reflink /mnt/foo 0 512K $foo_size" /mnt/bar
      
       $ od -A d -t x1 /mnt/bar
       0000000 b5 b5 b5 b5 b5 b5 b5 b5 b5 b5 b5 b5 b5 b5 b5 b5
       *
       0524288 3c 3c 3c 3c 3c 3c 3c 3c 3c 3c 3c 3c 3c 3c 3c 3c
       *
       0786528 3c 3c 3c 3c 00 00 00 00 00 00 00 00 00 00 00 00
       0786544 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
       *
       0790528 b5 b5 b5 b5 b5 b5 b5 b5 b5 b5 b5 b5 b5 b5 b5 b5
       *
       1048576
      
      The bytes in the range from 786532 (512Kb + 256Kb + 100 bytes) to 790527
      (512Kb + 256Kb + 4Kb - 1) got corrupted, having now a value of 0x00 instead
      of 0xb5.
      
      This is similar to the problem we had for deduplication that got recently
      fixed by commit de02b9f6 ("Btrfs: fix data corruption when
      deduplicating between different files").
      
      Fix this by not allowing such operations to be performed and return the
      errno -EINVAL to user space. This is what XFS is doing as well at the VFS
      level. This change however now makes us return -EINVAL instead of
      -EOPNOTSUPP for cases where the source range maps to an inline extent and
      the destination range's end is smaller then the destination file's size,
      since the detection of inline extents is done during the actual process of
      dropping file extent items (at __btrfs_drop_extents()). Returning the
      -EINVAL error is done early on and solely based on the input parameters
      (offsets and length) and destination file's size. This makes us consistent
      with XFS and anyone else supporting cloning since this case is now checked
      at a higher level in the VFS and is where the -EINVAL will be returned
      from starting with kernel 4.20 (the VFS changed was introduced in 4.20-rc1
      by commit 07d19dc9 ("vfs: avoid problematic remapping requests into
      partial EOF block"). So this change is more geared towards stable kernels,
      as it's unlikely the new VFS checks get removed intentionally.
      
      A test case for fstests follows soon, as well as an update to filter
      existing tests that expect -EOPNOTSUPP to accept -EINVAL as well.
      
      CC: <stable@vger.kernel.org> # 4.4+
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      ac765f83
    • F
      Btrfs: fix infinite loop on inode eviction after deduplication of eof block · 11023d3f
      Filipe Manana 提交于
      If we attempt to deduplicate the last block of a file A into the middle of
      a file B, and file A's size is not a multiple of the block size, we end
      rounding the deduplication length to 0 bytes, to avoid the data corruption
      issue fixed by commit de02b9f6 ("Btrfs: fix data corruption when
      deduplicating between different files"). However a length of zero will
      cause the insertion of an extent state with a start value greater (by 1)
      then the end value, leading to a corrupt extent state that will trigger a
      warning and cause chaos such as an infinite loop during inode eviction.
      Example trace:
      
       [96049.833585] ------------[ cut here ]------------
       [96049.833714] WARNING: CPU: 0 PID: 24448 at fs/btrfs/extent_io.c:436 insert_state+0x101/0x120 [btrfs]
       [96049.833767] CPU: 0 PID: 24448 Comm: xfs_io Not tainted 4.19.0-rc7-btrfs-next-39 #1
       [96049.833768] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626ccb91-prebuilt.qemu-project.org 04/01/2014
       [96049.833780] RIP: 0010:insert_state+0x101/0x120 [btrfs]
       [96049.833783] RSP: 0018:ffffafd2c3707af0 EFLAGS: 00010282
       [96049.833785] RAX: 0000000000000000 RBX: 000000000004dfff RCX: 0000000000000006
       [96049.833786] RDX: 0000000000000007 RSI: ffff99045c143230 RDI: ffff99047b2168a0
       [96049.833787] RBP: ffff990457851cd0 R08: 0000000000000001 R09: 0000000000000000
       [96049.833787] R10: ffffafd2c3707ab8 R11: 0000000000000000 R12: ffff9903b93b12c8
       [96049.833788] R13: 000000000004e000 R14: ffffafd2c3707b80 R15: ffffafd2c3707b78
       [96049.833790] FS:  00007f5c14e7d700(0000) GS:ffff99047b200000(0000) knlGS:0000000000000000
       [96049.833791] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       [96049.833792] CR2: 00007f5c146abff8 CR3: 0000000115f4c004 CR4: 00000000003606f0
       [96049.833795] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
       [96049.833796] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
       [96049.833796] Call Trace:
       [96049.833809]  __set_extent_bit+0x46c/0x6a0 [btrfs]
       [96049.833823]  lock_extent_bits+0x6b/0x210 [btrfs]
       [96049.833831]  ? _raw_spin_unlock+0x24/0x30
       [96049.833841]  ? test_range_bit+0xdf/0x130 [btrfs]
       [96049.833853]  lock_extent_range+0x8e/0x150 [btrfs]
       [96049.833864]  btrfs_double_extent_lock+0x78/0xb0 [btrfs]
       [96049.833875]  btrfs_extent_same_range+0x14e/0x550 [btrfs]
       [96049.833885]  ? rcu_read_lock_sched_held+0x3f/0x70
       [96049.833890]  ? __kmalloc_node+0x2b0/0x2f0
       [96049.833899]  ? btrfs_dedupe_file_range+0x19a/0x280 [btrfs]
       [96049.833909]  btrfs_dedupe_file_range+0x270/0x280 [btrfs]
       [96049.833916]  vfs_dedupe_file_range_one+0xd9/0xe0
       [96049.833919]  vfs_dedupe_file_range+0x131/0x1b0
       [96049.833924]  do_vfs_ioctl+0x272/0x6e0
       [96049.833927]  ? __fget+0x113/0x200
       [96049.833931]  ksys_ioctl+0x70/0x80
       [96049.833933]  __x64_sys_ioctl+0x16/0x20
       [96049.833937]  do_syscall_64+0x60/0x1b0
       [96049.833939]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
       [96049.833941] RIP: 0033:0x7f5c1478ddd7
       [96049.833943] RSP: 002b:00007ffe15b196a8 EFLAGS: 00000202 ORIG_RAX: 0000000000000010
       [96049.833945] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f5c1478ddd7
       [96049.833946] RDX: 00005625ece322d0 RSI: 00000000c0189436 RDI: 0000000000000004
       [96049.833947] RBP: 0000000000000000 R08: 00007f5c14a46f48 R09: 0000000000000040
       [96049.833948] R10: 0000000000000541 R11: 0000000000000202 R12: 0000000000000000
       [96049.833949] R13: 0000000000000000 R14: 0000000000000004 R15: 00005625ece322d0
       [96049.833954] irq event stamp: 6196
       [96049.833956] hardirqs last  enabled at (6195): [<ffffffff91b00663>] console_unlock+0x503/0x640
       [96049.833958] hardirqs last disabled at (6196): [<ffffffff91a037dd>] trace_hardirqs_off_thunk+0x1a/0x1c
       [96049.833959] softirqs last  enabled at (6114): [<ffffffff92600370>] __do_softirq+0x370/0x421
       [96049.833964] softirqs last disabled at (6095): [<ffffffff91a8dd4d>] irq_exit+0xcd/0xe0
       [96049.833965] ---[ end trace db7b05f01b7fa10c ]---
       [96049.935816] R13: 0000000000000000 R14: 00005562e5259240 R15: 00007ffff092b910
       [96049.935822] irq event stamp: 6584
       [96049.935823] hardirqs last  enabled at (6583): [<ffffffff91b00663>] console_unlock+0x503/0x640
       [96049.935825] hardirqs last disabled at (6584): [<ffffffff91a037dd>] trace_hardirqs_off_thunk+0x1a/0x1c
       [96049.935827] softirqs last  enabled at (6328): [<ffffffff92600370>] __do_softirq+0x370/0x421
       [96049.935828] softirqs last disabled at (6313): [<ffffffff91a8dd4d>] irq_exit+0xcd/0xe0
       [96049.935829] ---[ end trace db7b05f01b7fa123 ]---
       [96049.935840] ------------[ cut here ]------------
       [96049.936065] WARNING: CPU: 1 PID: 24463 at fs/btrfs/extent_io.c:436 insert_state+0x101/0x120 [btrfs]
       [96049.936107] CPU: 1 PID: 24463 Comm: umount Tainted: G        W         4.19.0-rc7-btrfs-next-39 #1
       [96049.936108] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626ccb91-prebuilt.qemu-project.org 04/01/2014
       [96049.936117] RIP: 0010:insert_state+0x101/0x120 [btrfs]
       [96049.936119] RSP: 0018:ffffafd2c3637bc0 EFLAGS: 00010282
       [96049.936120] RAX: 0000000000000000 RBX: 000000000004dfff RCX: 0000000000000006
       [96049.936121] RDX: 0000000000000007 RSI: ffff990445cf88e0 RDI: ffff99047b2968a0
       [96049.936122] RBP: ffff990457851cd0 R08: 0000000000000001 R09: 0000000000000000
       [96049.936123] R10: ffffafd2c3637b88 R11: 0000000000000000 R12: ffff9904574301e8
       [96049.936124] R13: 000000000004e000 R14: ffffafd2c3637c50 R15: ffffafd2c3637c48
       [96049.936125] FS:  00007fe4b87e72c0(0000) GS:ffff99047b280000(0000) knlGS:0000000000000000
       [96049.936126] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       [96049.936128] CR2: 00005562e52618d8 CR3: 00000001151c8005 CR4: 00000000003606e0
       [96049.936129] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
       [96049.936131] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
       [96049.936131] Call Trace:
       [96049.936141]  __set_extent_bit+0x46c/0x6a0 [btrfs]
       [96049.936154]  lock_extent_bits+0x6b/0x210 [btrfs]
       [96049.936167]  btrfs_evict_inode+0x1e1/0x5a0 [btrfs]
       [96049.936172]  evict+0xbf/0x1c0
       [96049.936174]  dispose_list+0x51/0x80
       [96049.936176]  evict_inodes+0x193/0x1c0
       [96049.936180]  generic_shutdown_super+0x3f/0x110
       [96049.936182]  kill_anon_super+0xe/0x30
       [96049.936189]  btrfs_kill_super+0x13/0x100 [btrfs]
       [96049.936191]  deactivate_locked_super+0x3a/0x70
       [96049.936193]  cleanup_mnt+0x3b/0x80
       [96049.936195]  task_work_run+0x93/0xc0
       [96049.936198]  exit_to_usermode_loop+0xfa/0x100
       [96049.936201]  do_syscall_64+0x17f/0x1b0
       [96049.936202]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
       [96049.936204] RIP: 0033:0x7fe4b80cfb37
       [96049.936206] RSP: 002b:00007ffff092b688 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
       [96049.936207] RAX: 0000000000000000 RBX: 00005562e5259060 RCX: 00007fe4b80cfb37
       [96049.936208] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 00005562e525faa0
       [96049.936209] RBP: 00005562e525faa0 R08: 00005562e525f770 R09: 0000000000000015
       [96049.936210] R10: 00000000000006b4 R11: 0000000000000246 R12: 00007fe4b85d1e64
       [96049.936211] R13: 0000000000000000 R14: 00005562e5259240 R15: 00007ffff092b910
       [96049.936211] R13: 0000000000000000 R14: 00005562e5259240 R15: 00007ffff092b910
       [96049.936216] irq event stamp: 6616
       [96049.936219] hardirqs last  enabled at (6615): [<ffffffff91b00663>] console_unlock+0x503/0x640
       [96049.936219] hardirqs last disabled at (6616): [<ffffffff91a037dd>] trace_hardirqs_off_thunk+0x1a/0x1c
       [96049.936222] softirqs last  enabled at (6328): [<ffffffff92600370>] __do_softirq+0x370/0x421
       [96049.936222] softirqs last disabled at (6313): [<ffffffff91a8dd4d>] irq_exit+0xcd/0xe0
       [96049.936223] ---[ end trace db7b05f01b7fa124 ]---
      
      The second stack trace, from inode eviction, is repeated forever due to
      the infinite loop during eviction.
      
      This is the same type of problem fixed way back in 2015 by commit
      113e8283 ("Btrfs: fix inode eviction infinite loop after extent_same
      ioctl") and commit ccccf3d6 ("Btrfs: fix inode eviction infinite loop
      after cloning into it").
      
      So fix this by returning immediately if the deduplication range length
      gets rounded down to 0 bytes, as there is nothing that needs to be done in
      such case.
      
      Example reproducer:
      
       $ mkfs.btrfs -f /dev/sdb
       $ mount /dev/sdb /mnt
      
       $ xfs_io -f -c "pwrite -S 0xe6 0 100" /mnt/foo
       $ xfs_io -f -c "pwrite -S 0xe6 0 1M" /mnt/bar
      
       # Unmount the filesystem and mount it again so that we start without any
       # extent state records when we ask for the deduplication.
       $ umount /mnt
       $ mount /dev/sdb /mnt
      
       $ xfs_io -c "dedupe /mnt/foo 0 500K 100" /mnt/bar
      
       # This unmount triggers the infinite loop.
       $ umount /mnt
      
      A test case for fstests will follow soon.
      
      Fixes: de02b9f6 ("Btrfs: fix data corruption when deduplicating between different files")
      CC: <stable@vger.kernel.org> # 4.19+
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      11023d3f
  8. 30 10月, 2018 2 次提交
  9. 15 10月, 2018 4 次提交
  10. 23 8月, 2018 1 次提交
    • F
      Btrfs: fix data corruption when deduplicating between different files · de02b9f6
      Filipe Manana 提交于
      If we deduplicate extents between two different files we can end up
      corrupting data if the source range ends at the size of the source file,
      the source file's size is not aligned to the filesystem's block size
      and the destination range does not go past the size of the destination
      file size.
      
      Example:
      
        $ mkfs.btrfs -f /dev/sdb
        $ mount /dev/sdb /mnt
      
        $ xfs_io -f -c "pwrite -S 0x6b 0 2518890" /mnt/foo
        # The first byte with a value of 0xae starts at an offset (2518890)
        # which is not a multiple of the sector size.
        $ xfs_io -c "pwrite -S 0xae 2518890 102398" /mnt/foo
      
        # Confirm the file content is full of bytes with values 0x6b and 0xae.
        $ od -t x1 /mnt/foo
        0000000 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
        *
        11467540 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b ae ae ae ae ae ae
        11467560 ae ae ae ae ae ae ae ae ae ae ae ae ae ae ae ae
        *
        11777540 ae ae ae ae ae ae ae ae
        11777550
      
        # Create a second file with a length not aligned to the sector size,
        # whose bytes all have the value 0x6b, so that its extent(s) can be
        # deduplicated with the first file.
        $ xfs_io -f -c "pwrite -S 0x6b 0 557771" /mnt/bar
      
        # Now deduplicate the entire second file into a range of the first file
        # that also has all bytes with the value 0x6b. The destination range's
        # end offset must not be aligned to the sector size and must be less
        # then the offset of the first byte with the value 0xae (byte at offset
        # 2518890).
        $ xfs_io -c "dedupe /mnt/bar 0 1957888 557771" /mnt/foo
      
        # The bytes in the range starting at offset 2515659 (end of the
        # deduplication range) and ending at offset 2519040 (start offset
        # rounded up to the block size) must all have the value 0xae (and not
        # replaced with 0x00 values). In other words, we should have exactly
        # the same data we had before we asked for deduplication.
        $ od -t x1 /mnt/foo
        0000000 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
        *
        11467540 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b ae ae ae ae ae ae
        11467560 ae ae ae ae ae ae ae ae ae ae ae ae ae ae ae ae
        *
        11777540 ae ae ae ae ae ae ae ae
        11777550
      
        # Unmount the filesystem and mount it again. This guarantees any file
        # data in the page cache is dropped.
        $ umount /dev/sdb
        $ mount /dev/sdb /mnt
      
        $ od -t x1 /mnt/foo
        0000000 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
        *
        11461300 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 00 00 00 00 00
        11461320 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
        *
        11470000 ae ae ae ae ae ae ae ae ae ae ae ae ae ae ae ae
        *
        11777540 ae ae ae ae ae ae ae ae
        11777550
      
        # The bytes in range 2515659 to 2519040 have a value of 0x00 and not a
        # value of 0xae, data corruption happened due to the deduplication
        # operation.
      
      So fix this by rounding down, to the sector size, the length used for the
      deduplication when the following conditions are met:
      
        1) Source file's range ends at its i_size;
        2) Source file's i_size is not aligned to the sector size;
        3) Destination range does not cross the i_size of the destination file.
      
      Fixes: e1d227a4 ("btrfs: Handle unaligned length in extent_same")
      CC: stable@vger.kernel.org # 4.2+
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      de02b9f6
  11. 18 8月, 2018 1 次提交
    • R
      Btrfs: fix unexpected failure of nocow buffered writes after snapshotting when low on space · 8ecebf4d
      Robbie Ko 提交于
      Commit e9894fd3 ("Btrfs: fix snapshot vs nocow writting") forced
      nocow writes to fallback to COW, during writeback, when a snapshot is
      created. This resulted in writes made before creating the snapshot to
      unexpectedly fail with ENOSPC during writeback when success (0) was
      returned to user space through the write system call.
      
      The steps leading to this problem are:
      
      1. When it's not possible to allocate data space for a write, the
         buffered write path checks if a NOCOW write is possible.  If it is,
         it will not reserve space and success (0) is returned to user space.
      
      2. Then when a snapshot is created, the root's will_be_snapshotted
         atomic is incremented and writeback is triggered for all inode's that
         belong to the root being snapshotted. Incrementing that atomic forces
         all previous writes to fallback to COW during writeback (running
         delalloc).
      
      3. This results in the writeback for the inodes to fail and therefore
         setting the ENOSPC error in their mappings, so that a subsequent
         fsync on them will report the error to user space. So it's not a
         completely silent data loss (since fsync will report ENOSPC) but it's
         a very unexpected and undesirable behaviour, because if a clean
         shutdown/unmount of the filesystem happens without previous calls to
         fsync, it is expected to have the data present in the files after
         mounting the filesystem again.
      
      So fix this by adding a new atomic named snapshot_force_cow to the
      root structure which prevents this behaviour and works the following way:
      
      1. It is incremented when we start to create a snapshot after triggering
         writeback and before waiting for writeback to finish.
      
      2. This new atomic is now what is used by writeback (running delalloc)
         to decide whether we need to fallback to COW or not. Because we
         incremented this new atomic after triggering writeback in the
         snapshot creation ioctl, we ensure that all buffered writes that
         happened before snapshot creation will succeed and not fallback to
         COW (which would make them fail with ENOSPC).
      
      3. The existing atomic, will_be_snapshotted, is kept because it is used
         to force new buffered writes, that start after we started
         snapshotting, to reserve data space even when NOCOW is possible.
         This makes these writes fail early with ENOSPC when there's no
         available space to allocate, preventing the unexpected behaviour of
         writeback later failing with ENOSPC due to a fallback to COW mode.
      
      Fixes: e9894fd3 ("Btrfs: fix snapshot vs nocow writting")
      Signed-off-by: NRobbie Ko <robbieko@synology.com>
      Reviewed-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      8ecebf4d
  12. 06 8月, 2018 2 次提交