1. 19 Sep, 2019 1 commit
    • Btrfs: fix assertion failure during fsync and use of stale transaction · 7cbd49cf
      Committed by Filipe Manana
      commit 410f954cb1d1c79ae485dd83a175f21954fd87cd upstream.
      
      Sometimes when fsync'ing a file we need to log that other inodes exist and
      when we need to do that we acquire a reference on the inodes and then drop
      that reference using iput() after logging them.
      
      That generally is not a problem except if we end up doing the final iput()
      (dropping the last reference) on the inode and that inode has a link count
      of 0, which can happen in a very short time window if the logging path
      gets a reference on the inode while it's being unlinked.
      
      In that case we end up getting the eviction callback, btrfs_evict_inode(),
      invoked through the iput() call chain which needs to drop all of the
      inode's items from its subvolume btree, and in order to do that, it needs
      to join a transaction at the helper function evict_refill_and_join().
      However because the task previously started a transaction at the fsync
      handler, btrfs_sync_file(), it has current->journal_info already pointing
      to a transaction handle and therefore evict_refill_and_join() will get
      that transaction handle from btrfs_join_transaction(). From this point on,
      two different problems can happen:
      
      1) evict_refill_and_join() will often change the transaction handle's
         block reserve (->block_rsv) and set its ->bytes_reserved field to a
         value greater than 0. If evict_refill_and_join() never commits the
         transaction, the eviction handler ends up decreasing the reference
         count (->use_count) of the transaction handle through the call to
         btrfs_end_transaction(), and after that point we have a transaction
         handle with a NULL ->block_rsv (which is the value prior to the
         transaction join from evict_refill_and_join()) and a ->bytes_reserved
         value greater than 0. If, after the eviction/iput completes, the inode
         logging path hits an error or decides that it must fall back to a
         transaction commit, the btrfs fsync handler, btrfs_sync_file(), gets a
         non-zero value from btrfs_log_dentry_safe(), and because of that
         non-zero value it tries to commit the transaction using a handle with
         a NULL ->block_rsv and a non-zero ->bytes_reserved value. This makes
         the transaction commit hit an assertion failure at
         btrfs_trans_release_metadata() because ->bytes_reserved is not zero but
         the ->block_rsv is NULL. The produced stack trace for that is like the
         following:
      
         [192922.917158] assertion failed: !trans->bytes_reserved, file: fs/btrfs/transaction.c, line: 816
         [192922.917553] ------------[ cut here ]------------
         [192922.917922] kernel BUG at fs/btrfs/ctree.h:3532!
         [192922.918310] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC PTI
         [192922.918666] CPU: 2 PID: 883 Comm: fsstress Tainted: G        W         5.1.4-btrfs-next-47 #1
         [192922.919035] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626ccb91-prebuilt.qemu-project.org 04/01/2014
         [192922.919801] RIP: 0010:assfail.constprop.25+0x18/0x1a [btrfs]
         (...)
         [192922.920925] RSP: 0018:ffffaebdc8a27da8 EFLAGS: 00010286
         [192922.921315] RAX: 0000000000000051 RBX: ffff95c9c16a41c0 RCX: 0000000000000000
         [192922.921692] RDX: 0000000000000000 RSI: ffff95cab6b16838 RDI: ffff95cab6b16838
         [192922.922066] RBP: ffff95c9c16a41c0 R08: 0000000000000000 R09: 0000000000000000
         [192922.922442] R10: ffffaebdc8a27e70 R11: 0000000000000000 R12: ffff95ca731a0980
         [192922.922820] R13: 0000000000000000 R14: ffff95ca84c73338 R15: ffff95ca731a0ea8
         [192922.923200] FS:  00007f337eda4e80(0000) GS:ffff95cab6b00000(0000) knlGS:0000000000000000
         [192922.923579] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
         [192922.923948] CR2: 00007f337edad000 CR3: 00000001e00f6002 CR4: 00000000003606e0
         [192922.924329] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
         [192922.924711] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
         [192922.925105] Call Trace:
         [192922.925505]  btrfs_trans_release_metadata+0x10c/0x170 [btrfs]
         [192922.925911]  btrfs_commit_transaction+0x3e/0xaf0 [btrfs]
         [192922.926324]  btrfs_sync_file+0x44c/0x490 [btrfs]
         [192922.926731]  do_fsync+0x38/0x60
         [192922.927138]  __x64_sys_fdatasync+0x13/0x20
         [192922.927543]  do_syscall_64+0x60/0x1c0
         [192922.927939]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
         (...)
         [192922.934077] ---[ end trace f00808b12068168f ]---
      
      2) If evict_refill_and_join() decides to commit the transaction, it will
         be able to do it, since the nested transaction join only increments the
         transaction handle's ->use_count reference counter and it does not
         prevent the transaction from getting committed. This means that after
         eviction completes, the fsync logging path will be using a transaction
         handle that refers to an already committed transaction. What happens
         when using such a stale transaction is unpredictable. At the very
         least there is a use-after-free on the transaction handle itself,
         since the transaction commit calls kmem_cache_free() against the
         handle regardless of its ->use_count value. We can also end up
         silently losing all the updates made to the log tree after that
         iput() in the logging path, or using a transaction handle that was
         meanwhile allocated to another task for a new transaction.
      
      In order to fix both of them, instead of using iput() during logging, use
      btrfs_add_delayed_iput(), so that the logging path of fsync never drops
      the last reference on an inode. That step is offloaded to a safe context
      (usually the cleaner kthread).
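
The first problem can be illustrated with a small toy model. This is only a sketch under simplifying assumptions, not the kernel code: `TransHandle`, `evict_inode()` and `fsync_with_fallback_commit()` are hypothetical names, the reserved byte count is arbitrary, and only the fields mentioned in the description above are modeled.

```python
class TransHandle:
    """Toy stand-in for a transaction handle (only the fields discussed above)."""

    def __init__(self):
        self.use_count = 1       # one reference held by the fsync handler
        self.block_rsv = None    # NULL block reserve at fsync time
        self.bytes_reserved = 0


def release_metadata(handle):
    # Mirrors the assertion described above: with no block reserve there
    # must be no reserved bytes left on the handle.
    if handle.block_rsv is None and handle.bytes_reserved != 0:
        raise RuntimeError("assertion failed: !trans->bytes_reserved")
    handle.bytes_reserved = 0


def evict_inode(handle):
    # Nested join: the task already owns a handle, so that handle is reused.
    handle.use_count += 1
    orig_rsv = handle.block_rsv
    handle.block_rsv = "evict_rsv"   # eviction switches the block reserve
    handle.bytes_reserved = 4096     # and reserves some bytes (arbitrary value)
    # Ending the nested join only drops the reference and restores the
    # previous block reserve; bytes_reserved is left untouched.
    handle.use_count -= 1
    handle.block_rsv = orig_rsv


def fsync_with_fallback_commit(use_delayed_iput):
    handle = TransHandle()
    delayed_iputs = []
    if use_delayed_iput:
        delayed_iputs.append("inode")   # final iput deferred to a safe context
    else:
        evict_inode(handle)             # final iput() runs eviction inline
    # Logging failed or asked for a full commit: commit with our handle.
    release_metadata(handle)
    return handle, delayed_iputs
```

With `use_delayed_iput=False` the model raises at commit time, because the handle is left with no block reserve but a non-zero reservation. With `use_delayed_iput=True` the handle is never touched by eviction and the commit proceeds.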
      
      The assertion failure issue was sporadically triggered by the test case
      generic/475 from fstests, which loads the dm error target while fsstress
      is running. That leads to fsync failing with -EIO errors while logging
      inodes and then later trying to commit the transaction, triggering the
      assertion failure.
      
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  2. 26 Jul, 2019 2 commits
    • Btrfs: fix fsync not persisting dentry deletions due to inode evictions · fffedf5c
      Committed by Filipe Manana
      commit 803f0f64d17769071d7287d9e3e3b79a3e1ae937 upstream.
      
      In order to avoid searches on a log tree when unlinking an inode, we check
      if the inode being unlinked was logged in the current transaction, as well
      as the inode of its parent directory. When either of the inodes was
      logged, we proceed to delete directory items and inode reference items
      from the log, to ensure that a subsequent fsync of only the inode being
      unlinked, or only of the parent directory, does not result in the entry
      still existing after a power failure.
      
      That check however is not reliable when one of the inodes involved (the
      one being unlinked or its parent directory's inode) is evicted, since the
      logged_trans field is transient, that is, it is not stored on disk, so it
      is lost when the inode is evicted and it is set to zero when the inode is
      loaded into memory again. As a consequence the checks currently being
      done by btrfs_del_dir_entries_in_log() and btrfs_del_inode_ref_in_log()
      always return true if the inode was evicted before, regardless of the
      inode having been logged or not before (and in the current transaction).
      This results in the dentry being unlinked still existing after a log
      replay if after the unlink operation only one of the inodes involved is
      fsync'ed.
      
      Example:
      
        $ mkfs.btrfs -f /dev/sdb
        $ mount /dev/sdb /mnt
      
        $ mkdir /mnt/dir
        $ touch /mnt/dir/foo
        $ xfs_io -c fsync /mnt/dir/foo
      
        # Keep an open file descriptor on our directory while we evict inodes.
        # We just want to evict the file's inode, the directory's inode must not
        # be evicted.
        $ ( cd /mnt/dir; while true; do :; done ) &
        $ pid=$!
      
        # Wait a bit to give time to background process to chdir to our test
        # directory.
        $ sleep 0.5
      
        # Trigger eviction of the file's inode.
        $ echo 2 > /proc/sys/vm/drop_caches
      
        # Unlink our file and fsync the parent directory. After a power failure
        # we don't expect to see the file anymore, since we fsync'ed the parent
        # directory.
        $ rm -f /mnt/dir/foo
        $ xfs_io -c fsync /mnt/dir
      
        <power failure>
      
        $ mount /dev/sdb /mnt
        $ ls /mnt/dir
        foo
        $
         --> file still there, unlink not persisted despite explicit fsync on dir
      
      Fix this by checking if the inode has the full_sync bit set in its runtime
      flags as well, since that bit is set every time an inode is loaded from
      disk, or for other less common cases such as after a shrinking truncate
      or failure to allocate extent maps for holes, and gets cleared after the
      first fsync. Also consider the inode as possibly logged only if it was
      last modified in the current transaction (besides having the full_sync
      bit set).
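
The check described above can be sketched as a small predicate. This is a toy model, not the kernel helper: the `Inode` class, the `possibly_logged()` name and the `with_fix` switch are assumptions made for illustration, and transaction ids are plain integers.

```python
from dataclasses import dataclass


@dataclass
class Inode:
    logged_trans: int = 0          # transient, zero again after a reload from disk
    last_trans: int = 0            # transaction that last modified the inode
    needs_full_sync: bool = True   # set on load, cleared after the first fsync


def possibly_logged(inode, transid, with_fix):
    """Return True if the inode may have been logged in transaction `transid`."""
    if inode.logged_trans == transid:
        return True
    if with_fix and inode.last_trans == transid and inode.needs_full_sync:
        # logged_trans may have been lost by an eviction, so be conservative
        # and treat the inode as possibly logged.
        return True
    return False


# An inode that was logged in transaction 7, then evicted and loaded again.
reloaded = Inode(logged_trans=0, last_trans=7, needs_full_sync=True)
```

Without the extra condition, `possibly_logged(reloaded, 7, with_fix=False)` is false and the unlink path would skip the log cleanup. With it, the reloaded inode is treated as possibly logged, while an inode last modified in an older transaction is still skipped.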
      
      Fixes: 3a5f1d45 ("Btrfs: Optimize btree walking while logging inodes")
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • Btrfs: fix data loss after inode eviction, renaming it, and fsync it · 110850ff
      Committed by Filipe Manana
      commit d1d832a0b51dd9570429bb4b81b2a6c1759e681a upstream.
      
      When we log an inode, regardless of logging it completely or only that it
      exists, we always update it as logged (logged_trans and last_log_commit
      fields of the inode are updated). This is generally fine and avoids future
      attempts to log it from having to do repeated work that brings no value.
      
      However, if we write data to a file, then evict its inode after all the
      delalloc was flushed (and ordered extents completed), rename the file and
      fsync it, we end up not logging the new extents, since the rename may
      result in logging that the inode exists in case the parent directory was
      logged before. The following reproducer shows and explains how this can
      happen:
      
        $ mkfs.btrfs -f /dev/sdb
        $ mount /dev/sdb /mnt
      
        $ mkdir /mnt/dir
        $ touch /mnt/dir/foo
        $ touch /mnt/dir/bar
      
        # Do a direct IO write instead of a buffered write because with a
        # buffered write we would need to make sure delalloc gets flushed and
        # completes before we do the inode eviction later, and we can not do
        # that from user space with calls to things such as sync(2) since that
        # results in a transaction commit as well.
        $ xfs_io -d -c "pwrite -S 0xd3 0 4K" /mnt/dir/bar
      
        # Keep the directory dir in use while we evict inodes. We want our file
        # bar's inode to be evicted but we don't want our directory's inode to
        # be evicted (if it were evicted too, we would not be able to reproduce
        # the issue since the first fsync below, of file foo, would result in a
        # transaction commit).
        $ ( cd /mnt/dir; while true; do :; done ) &
        $ pid=$!
      
        # Wait a bit to give time for the background process to chdir.
        $ sleep 0.1
      
        # Evict all inodes, except the inode for the directory dir because it is
        # currently in use by our background process.
        $ echo 2 > /proc/sys/vm/drop_caches
      
        # fsync file foo, which ends up persisting information about the parent
        # directory because it is a new inode.
        $ xfs_io -c fsync /mnt/dir/foo
      
        # Rename bar, this results in logging that this inode exists (inode item,
        # names, xattrs) because the parent directory is in the log.
        $ mv /mnt/dir/bar /mnt/dir/baz
      
        # Now fsync baz, which ends up doing absolutely nothing because of the
        # rename operation which logged that the inode exists only.
        $ xfs_io -c fsync /mnt/dir/baz
      
        <power failure>
      
        $ mount /dev/sdb /mnt
        $ od -t x1 -A d /mnt/dir/baz
        0000000
      
          --> Empty file, data we wrote is missing.
      
      Fix this by not updating last_sub_trans of an inode when we are logging
      only that it exists and the inode was not yet logged since it was loaded
      from disk (full_sync bit set), this is enough to make btrfs_inode_in_log()
      return false for this scenario and make us log the inode. The logged_trans
      of the inode is still always set, since that alone is used to track if
      names need to be deleted as part of unlink operations.
      
      Fixes: 257c62e1 ("Btrfs: avoid tree log commit when there are no changes")
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  3. 09 Jun, 2019 2 commits
    • Btrfs: fix fsync not persisting changed attributes of a directory · a8107111
      Committed by Filipe Manana
      commit 60d9f50308e5df19bc18c2fefab0eba4a843900a upstream.
      
      While logging an inode we follow its ancestors and for each one we mark
      it as logged in the current transaction, even if we have not logged it.
      As a consequence if we change an attribute of an ancestor, such as the
      UID or GID for example, and then explicitly fsync it, we end up not
      logging the inode at all despite returning success to user space, which
      results in the attribute being lost if a power failure happens after
      the fsync.
      
      Sample reproducer:
      
        $ mkfs.btrfs -f /dev/sdb
        $ mount /dev/sdb /mnt
      
        $ mkdir /mnt/dir
        $ chown 6007:6007 /mnt/dir
      
        $ sync
      
        $ chown 9003:9003 /mnt/dir
        $ touch /mnt/dir/file
        $ xfs_io -c fsync /mnt/dir/file
      
        # fsync our directory after fsync'ing the new file, should persist the
        # new values for the uid and gid.
        $ xfs_io -c fsync /mnt/dir
      
        <power failure>
      
        $ mount /dev/sdb /mnt
        $ stat -c %u:%g /mnt/dir
        6007:6007
      
          --> should be 9003:9003, the uid and gid were not persisted, despite
              the explicit fsync on the directory prior to the power failure
      
      Fix this by not updating the logged_trans field of ancestor inodes when
      logging an inode, since we have not logged them. Let only future calls to
      btrfs_log_inode() mark inodes as logged.
      
      This could be triggered by my recent fsync fuzz tester for fstests, for
      which an fstests patch exists titled "fstests: generic, fsync fuzz tester
      with fsstress".
      
      Fixes: 12fcfd22 ("Btrfs: tree logging unlink/rename fixes")
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • Btrfs: fix race updating log root item during fsync · 37fe0383
      Committed by Filipe Manana
      commit 06989c799f04810f6876900d4760c0edda369cf7 upstream.
      
      When syncing the log, the final phase of a fsync operation, we need to
      either create a log root's item or update the existing item in the log
      tree of log roots, and that depends on the current value of the log
      root's log_transid - if it's 1 we need to create the log root item,
      otherwise it must exist already and we update it. Since there is no
      synchronization between updating the log_transid and checking it for
      deciding whether the log root's item needs to be created or updated, we
      end up with a tiny race window in which an attempt to update the item
      fails because the item was not yet created:
      
                    CPU 1                                    CPU 2
      
        btrfs_sync_log()
      
          lock root->log_mutex
      
          set log root's log_transid to 1
      
          unlock root->log_mutex
      
                                                     btrfs_sync_log()
      
                                                       lock root->log_mutex
      
                                                       sets log root's
                                                       log_transid to 2
      
                                                       unlock root->log_mutex
      
          update_log_root()
      
            sees log root's log_transid
            with a value of 2
      
              calls btrfs_update_root(),
              which fails with -EUCLEAN
              and causes transaction abort
      
      Until recently the race led to a BUG_ON at btrfs_update_root(), but after
      the recent commit 7ac1e464c4d47 ("btrfs: Don't panic when we can't find a
      root key") we just abort the current transaction.
      
      A sample trace of the BUG_ON() on a SLE12 kernel:
      
        ------------[ cut here ]------------
        kernel BUG at ../fs/btrfs/root-tree.c:157!
        Oops: Exception in kernel mode, sig: 5 [#1]
        SMP NR_CPUS=2048 NUMA pSeries
        (...)
        Supported: Yes, External
        CPU: 78 PID: 76303 Comm: rtas_errd Tainted: G                 X 4.4.156-94.57-default #1
        task: c00000ffa906d010 ti: c00000ff42b08000 task.ti: c00000ff42b08000
        NIP: d000000036ae5cdc LR: d000000036ae5cd8 CTR: 0000000000000000
        REGS: c00000ff42b0b860 TRAP: 0700   Tainted: G                 X  (4.4.156-94.57-default)
        MSR: 8000000002029033 <SF,VEC,EE,ME,IR,DR,RI,LE>  CR: 22444484  XER: 20000000
        CFAR: d000000036aba66c SOFTE: 1
        GPR00: d000000036ae5cd8 c00000ff42b0bae0 d000000036bda220 0000000000000054
        GPR04: 0000000000000001 0000000000000000 c00007ffff8d37c8 0000000000000000
        GPR08: c000000000e19c00 0000000000000000 0000000000000000 3736343438312079
        GPR12: 3930373337303434 c000000007a3a800 00000000007fffff 0000000000000023
        GPR16: c00000ffa9d26028 c00000ffa9d261f8 0000000000000010 c00000ffa9d2ab28
        GPR20: c00000ff42b0bc48 0000000000000001 c00000ff9f0d9888 0000000000000001
        GPR24: c00000ffa9d26000 c00000ffa9d261e8 c00000ffa9d2a800 c00000ff9f0d9888
        GPR28: c00000ffa9d26028 c00000ffa9d2aa98 0000000000000001 c00000ffa98f5b20
        NIP [d000000036ae5cdc] btrfs_update_root+0x25c/0x4e0 [btrfs]
        LR [d000000036ae5cd8] btrfs_update_root+0x258/0x4e0 [btrfs]
        Call Trace:
        [c00000ff42b0bae0] [d000000036ae5cd8] btrfs_update_root+0x258/0x4e0 [btrfs] (unreliable)
        [c00000ff42b0bba0] [d000000036b53610] btrfs_sync_log+0x2d0/0xc60 [btrfs]
        [c00000ff42b0bce0] [d000000036b1785c] btrfs_sync_file+0x44c/0x4e0 [btrfs]
        [c00000ff42b0bd80] [c00000000032e300] vfs_fsync_range+0x70/0x120
        [c00000ff42b0bdd0] [c00000000032e44c] do_fsync+0x5c/0xb0
        [c00000ff42b0be10] [c00000000032e8dc] SyS_fdatasync+0x2c/0x40
        [c00000ff42b0be30] [c000000000009488] system_call+0x3c/0x100
        Instruction dump:
        7f43d378 4bffebb9 60000000 88d90008 3d220000 e8b90000 3b390009 e87a01f0
        e8898e08 e8f90000 4bfd48e5 60000000 <0fe00000> e95b0060 39200004 394a0ea0
        ---[ end trace 8f2dc8f919cabab8 ]---
      
      So fix this by doing the check of log_transid and updating or creating the
      log root's item while holding the root's log_mutex.
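
The interleaving in the diagram above can be replayed deterministically with a toy model. This is only a sketch: `LogRootTree`, `racy_interleaving()` and `serialized()` are hypothetical names, no real locking is involved, and the two tasks are simulated by ordering the steps by hand.

```python
class LogRootTree:
    """Toy tree of log roots: it only remembers whether the item exists."""

    def __init__(self):
        self.item_exists = False

    def insert_root(self):
        self.item_exists = True

    def update_root(self):
        if not self.item_exists:
            raise LookupError("-EUCLEAN: log root item not found")


def update_log_root(tree, log_transid):
    # Create the item on the first log sync, update it on later ones.
    if log_transid == 1:
        tree.insert_root()
    else:
        tree.update_root()


def racy_interleaving():
    tree = LogRootTree()
    log_transid = 1                      # task 1, under log_mutex
    log_transid = 2                      # task 2, after task 1 unlocked
    update_log_root(tree, log_transid)   # task 1, outside the mutex: sees 2


def serialized():
    tree = LogRootTree()
    log_transid = 1                      # task 1: set, check and create
    update_log_root(tree, log_transid)   # all while holding log_mutex
    log_transid = 2                      # task 2 only runs afterwards
    update_log_root(tree, log_transid)
    return tree.item_exists
```

In the racy ordering the first task reads a log_transid of 2 and tries to update an item that was never created. When the check and the create/update happen in the same critical section as the log_transid change, the first task always creates the item before any later task can ask for an update.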
      
      Fixes: 7237f183 ("Btrfs: fix tree logs parallel sync")
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  4. 31 May, 2019 1 commit
    • Btrfs: avoid fallback to transaction commit during fsync of files with holes · 4f9a774d
      Committed by Filipe Manana
      commit ebb929060aeb162417b4c1307e63daee47b208d9 upstream.
      
      When we are doing a full fsync (bit BTRFS_INODE_NEEDS_FULL_SYNC set) of a
      file that has holes and has file extent items spanning two or more leafs,
      we can end up falling back to a full transaction commit due to a logic
      bug that leads to failure to insert a duplicate file extent item that is
      meant to represent a hole between the last file extent item of a leaf and
      the first file extent item in the next leaf. The failure (EEXIST error)
      leads to a transaction commit (as most errors when logging an inode do).
      
      For example, we have the two following leafs:
      
      Leaf N:
      
        -----------------------------------------------
        | ..., ..., ..., (257, FILE_EXTENT_ITEM, 64K) |
        -----------------------------------------------
        The file extent item at the end of leaf N has a length of 4K,
        representing the file range from 64K to 68K - 1.
      
      Leaf N + 1:
      
        -----------------------------------------------
        | (257, FILE_EXTENT_ITEM, 72K), ..., ..., ... |
        -----------------------------------------------
        The file extent item at the first slot of leaf N + 1 has a length of
        4K too, representing the file range from 72K to 76K - 1.
      
      During the full fsync path, when we are at tree-log.c:copy_items() with
      leaf N as a parameter, after processing the last file extent item, that
      represents the extent at offset 64K, we take a look at the first file
      extent item at the next leaf (leaf N + 1), and notice there's a 4K hole
      between the two extents, and therefore we insert a file extent item
      representing that hole, starting at file offset 68K and ending at offset
      72K - 1. However we don't update the value of *last_extent, which is used
      to represent the end offset (plus 1, non-inclusive end) of the last file
      extent item inserted in the log, so it stays with a value of 68K and not
      with a value of 72K.
      
      Then, when copy_items() is called for leaf N + 1, because the value of
      *last_extent is smaller than the offset of the first extent item in the
      leaf (68K < 72K), we look at the last file extent item in the previous
      leaf (leaf N) and see that there's a 4K gap between it and our first file
      extent item (again, 68K < 72K), so we decide to insert a file extent item
      representing the hole, starting at file offset 68K and ending at offset
      72K - 1. This insertion will fail with -EEXIST being returned from
      btrfs_insert_file_extent() because we already inserted a file extent item
      representing a hole for this offset (68K) in the previous call to
      copy_items(), when processing leaf N.
      
      The -EEXIST error gets propagated to the fsync callback, btrfs_sync_file(),
      which falls back to a full transaction commit.
      
      Fix this by adjusting *last_extent after inserting a hole when we had to
      look at the next leaf.
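
The two-leaf example above can be replayed with a toy model of the hole tracking. This is a heavily simplified sketch, not tree-log.c: `copy_items()` here only looks at the first and last extent of each leaf, the log is a plain dictionary keyed by file offset, and the `with_fix` switch is an assumption added for comparison.

```python
K = 1024


def insert_hole(log, start, end):
    # A second item at the same file offset is rejected, like -EEXIST.
    if start in log:
        raise FileExistsError(f"-EEXIST: hole item at {start // K}K already logged")
    log[start] = end


def copy_items(log, leaf, next_leaf, last_extent, with_fix):
    """leaf and next_leaf are lists of (offset, length); returns the new last_extent."""
    first_offset = leaf[0][0]
    # Gap between what was logged so far and the first extent of this leaf.
    if last_extent is not None and last_extent < first_offset:
        insert_hole(log, last_extent, first_offset)
    last_offset, last_len = leaf[-1]
    last_extent = last_offset + last_len
    # Gap between the last extent of this leaf and the first of the next leaf.
    if next_leaf and last_extent < next_leaf[0][0]:
        insert_hole(log, last_extent, next_leaf[0][0])
        if with_fix:
            last_extent = next_leaf[0][0]   # remember that the hole was logged
    return last_extent


def log_two_leaves(with_fix):
    leaf_n = [(64 * K, 4 * K)]    # file range 64K to 68K - 1
    leaf_n1 = [(72 * K, 4 * K)]   # file range 72K to 76K - 1
    log = {}
    last_extent = copy_items(log, leaf_n, leaf_n1, None, with_fix)
    copy_items(log, leaf_n1, None, last_extent, with_fix)
    return log
```

Without the adjustment, the second call sees a last_extent of 68K and tries to log the same 68K hole again. With it, last_extent is already 72K and no duplicate insertion is attempted.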
      
      Fixes: 4ee3fad3 ("Btrfs: fix fsync after hole punching when using no-holes feature")
      Cc: stable@vger.kernel.org # 4.14+
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  5. 03 Apr, 2019 3 commits
    • Btrfs: fix assertion failure on fsync with NO_HOLES enabled · fd1b2536
      Committed by Filipe Manana
      commit 0ccc3876e4b2a1559a4dbe3126dda4459d38a83b upstream.
      
      Back in commit a89ca6f2 ("Btrfs: fix fsync after truncate when
      no_holes feature is enabled") I added an assertion that is triggered when
      an inline extent is found to assert that the length of the (uncompressed)
      data the extent represents is the same as the i_size of the inode, since
      that is true most of the time and I couldn't find or didn't remember
      any exception at that time. Later on the assertion was expanded twice to
      deal with a case of a compressed inline extent representing a range that
      matches the sector size followed by an expanding truncate, and another
      case where fallocate can update the i_size of the inode without adding
      or updating existing extents (if the fallocate range falls entirely within
      the first block of the file). These two expansion/fixes of the assertion
      were done by commit 7ed586d0a8241 ("Btrfs: fix assertion on fsync of
      regular file when using no-holes feature") and commit 6399fb5a
      ("Btrfs: fix assertion failure during fsync in no-holes mode").
      These however missed the case where a falloc expands the i_size of an
      inode to exactly the sector size and an inline extent exists, for example:
      
       $ mkfs.btrfs -f -O no-holes /dev/sdc
       $ mount /dev/sdc /mnt
      
       $ xfs_io -f -c "pwrite -S 0xab 0 1096" /mnt/foobar
       wrote 1096/1096 bytes at offset 0
       1 KiB, 1 ops; 0.0002 sec (4.448 MiB/sec and 4255.3191 ops/sec)
      
       $ xfs_io -c "falloc 1096 3000" /mnt/foobar
       $ xfs_io -c "fsync" /mnt/foobar
       Segmentation fault
      
       $ dmesg
       [701253.602385] assertion failed: len == i_size || (len == fs_info->sectorsize && btrfs_file_extent_compression(leaf, extent) != BTRFS_COMPRESS_NONE) || (len < i_size && i_size < fs_info->sectorsize), file: fs/btrfs/tree-log.c, line: 4727
       [701253.602962] ------------[ cut here ]------------
       [701253.603224] kernel BUG at fs/btrfs/ctree.h:3533!
       [701253.603503] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC PTI
       [701253.603774] CPU: 2 PID: 7192 Comm: xfs_io Tainted: G        W         5.0.0-rc8-btrfs-next-45 #1
       [701253.604054] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626ccb91-prebuilt.qemu-project.org 04/01/2014
       [701253.604650] RIP: 0010:assfail.constprop.23+0x18/0x1a [btrfs]
       (...)
       [701253.605591] RSP: 0018:ffffbb48c186bc48 EFLAGS: 00010286
       [701253.605914] RAX: 00000000000000de RBX: ffff921d0a7afc08 RCX: 0000000000000000
       [701253.606244] RDX: 0000000000000000 RSI: ffff921d36b16868 RDI: ffff921d36b16868
       [701253.606580] RBP: ffffbb48c186bcf0 R08: 0000000000000000 R09: 0000000000000000
       [701253.606913] R10: 0000000000000003 R11: 0000000000000000 R12: ffff921d05d2de18
       [701253.607247] R13: ffff921d03b54000 R14: 0000000000000448 R15: ffff921d059ecf80
       [701253.607769] FS:  00007f14da906700(0000) GS:ffff921d36b00000(0000) knlGS:0000000000000000
       [701253.608163] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       [701253.608516] CR2: 000056087ea9f278 CR3: 00000002268e8001 CR4: 00000000003606e0
       [701253.608880] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
       [701253.609250] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
       [701253.609608] Call Trace:
       [701253.609994]  btrfs_log_inode+0xdfb/0xe40 [btrfs]
       [701253.610383]  btrfs_log_inode_parent+0x2be/0xa60 [btrfs]
       [701253.610770]  ? do_raw_spin_unlock+0x49/0xc0
       [701253.611150]  btrfs_log_dentry_safe+0x4a/0x70 [btrfs]
       [701253.611537]  btrfs_sync_file+0x3b2/0x440 [btrfs]
       [701253.612010]  ? do_sysinfo+0xb0/0xf0
       [701253.612552]  do_fsync+0x38/0x60
       [701253.612988]  __x64_sys_fsync+0x10/0x20
       [701253.613360]  do_syscall_64+0x60/0x1b0
       [701253.613733]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
       [701253.614103] RIP: 0033:0x7f14da4e66d0
       (...)
       [701253.615250] RSP: 002b:00007fffa670fdb8 EFLAGS: 00000246 ORIG_RAX: 000000000000004a
       [701253.615647] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007f14da4e66d0
       [701253.616047] RDX: 000056087ea9c260 RSI: 000056087ea9c260 RDI: 0000000000000003
       [701253.616450] RBP: 0000000000000001 R08: 0000000000000020 R09: 0000000000000010
       [701253.616854] R10: 000000000000009b R11: 0000000000000246 R12: 000056087ea9c260
       [701253.617257] R13: 000056087ea9c240 R14: 0000000000000000 R15: 000056087ea9dd10
       (...)
       [701253.619941] ---[ end trace e088d74f132b6da5 ]---
      
      Updating the assertion again to allow for this particular case would result
      in a meaningless assertion, plus there is currently no risk of logging
      content that would result in any corruption after a log replay if the size
      of the data encoded in an inline extent is greater than the inode's i_size
      (which is not currently possible either with or without compression),
      therefore just remove the assertion.
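
The failing condition from the dmesg output above can be evaluated by hand for the reproducer. The sketch below transcribes the three clauses of the assertion into a hypothetical helper, `old_assertion_holds()`, and assumes a sector size of 4096 bytes, which matches the "exactly the sector size" case described (1096 + 3000 = 4096).

```python
def old_assertion_holds(length, i_size, sectorsize, compressed):
    """Evaluate the three clauses of the removed assertion.

    length:     size of the uncompressed data in the inline extent
    i_size:     inode size at fsync time
    compressed: True if the inline extent uses compression
    """
    return (
        length == i_size
        or (length == sectorsize and compressed)
        or (length < i_size and i_size < sectorsize)
    )


# Reproducer: 1096 bytes written inline, then "falloc 1096 3000".
inline_len = 1096
i_size_after_falloc = 1096 + 3000   # 4096, exactly the assumed sector size
```

For the reproducer, the first clause fails (1096 != 4096), the second fails (no compression), and the third fails because i_size is equal to, not smaller than, the sector size, so the assertion triggers.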
      
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • btrfs: remove WARN_ON in log_dir_items · b57220cc
      Committed by Josef Bacik
      commit 2cc8334270e281815c3850c3adea363c51f21e0d upstream.
      
      When Filipe added the recursive directory logging stuff in
      2f2ff0ee ("Btrfs: fix metadata inconsistencies after directory
      fsync") he specifically didn't take the directory i_mutex for the
      children directories that we need to log because of lockdep.  This is
      generally fine, but can lead to this WARN_ON() tripping if we happen to
      run delayed deletions in between our first search and our second search
      of dir_item/dir_indexes for this directory.  We expect this to happen,
      so the WARN_ON() isn't necessary.  Drop the WARN_ON() and add a comment
      so we know why this case can happen.
      
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • Btrfs: fix incorrect file size after shrinking truncate and fsync · 22dcb30f
      Committed by Filipe Manana
      commit bf504110bc8aa05df48b0e5f0aa84bfb81e0574b upstream.
      
      If we do a shrinking truncate against an inode which is already present
      in the respective log tree and then rename it, as part of logging the new
      name we end up logging an inode item that reflects the old size of the
      file (the one which we previously logged) and not the new smaller size.
      The decision to preserve the size previously logged was added by commit
      1a4bcf47 ("Btrfs: fix fsync data loss after adding hard link to
      inode") in order to avoid data loss after replaying the log. However that
      decision is only needed for the case the logged inode size is smaller than
      the current size of the inode, as explained in that commit's change log.
      If the current size of the inode is smaller than the previously logged
      size, we know a shrinking truncate happened and therefore need to use
      that smaller size.
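
The size rule described above can be sketched as follows. This is a minimal
illustration in Python, not the kernel code; the function name
`size_to_log_for_new_name` and its parameters are hypothetical and chosen only
for this example.

```python
def size_to_log_for_new_name(logged_size: int, current_size: int) -> int:
    """Pick the inode size to write to the log when logging a new name.

    Hypothetical model of the rule in the commit message:
    - logged size smaller than current size: keep the logged size
      (the data loss protection added by the earlier hard link fix);
    - current size smaller than logged size: a shrinking truncate
      happened, so use the current (smaller) size.
    """
    return min(logged_size, current_size)


# Scenario from this commit message: fsync at 8000 bytes, truncate to 3000,
# then rename. The log should carry the smaller, current size.
print(size_to_log_for_new_name(logged_size=8000, current_size=3000))
```

Taking the minimum covers both cases in one expression, which is why the
sketch does not branch on which value is larger.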
      
      Example to trigger the problem:
      
        $ mkfs.btrfs -f /dev/sdb
        $ mount /dev/sdb /mnt
      
        $ xfs_io -f -c "pwrite -S 0xab 0 8000" /mnt/foo
        $ xfs_io -c "fsync" /mnt/foo
        $ xfs_io -c "truncate 3000" /mnt/foo
      
        $ mv /mnt/foo /mnt/bar
        $ xfs_io -c "fsync" /mnt/bar
      
        <power failure>
      
        $ mount /dev/sdb /mnt
        $ od -t x1 -A d /mnt/bar
        0000000 ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab
        *
        0008000
      
      Once we rename the file, we log its name (and inode item), and because
      the inode was already logged before in the current transaction, we log it
      with a size of 8000 bytes because that is the size we previously logged
      (with the first fsync). As part of the rename, besides logging the inode,
      we do also sync the log, which is done since commit d4682ba0
      ("Btrfs: sync log after logging new name"), so the next fsync against our
      inode is effectively a no-op, since no new changes happened since the
      rename operation. Even if we did not sync the log during the rename
      operation, the same problem (file size of 8000 bytes instead of 3000
      bytes) would be visible after replaying the log if the log ended up
      getting synced to disk through some other means, such as for example by
      fsyncing some other modified file. In the example above the fsync after
      the rename operation is there just because not every filesystem may
      guarantee logging/journalling the inode (and syncing the log/journal)
      during the rename operation, for example it is needed for f2fs, but not
      for ext4 and xfs.
      
      Fix this scenario by, when logging a new name (which is triggered by
      rename and link operations), using the current size of the inode instead
      of the previously logged inode size.
      
      A test case for fstests follows soon.
      
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=202695
      CC: stable@vger.kernel.org # 4.4+
      Reported-by: Seulbae Kim <seulbae@gatech.edu>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  6. 10 Jan, 2019 1 commit
    • F
      Btrfs: fix fsync of files with multiple hard links in new directories · 10b04210
      Filipe Manana committed
      commit 41bd60676923822de1df2c50b3f9a10171f4338a upstream.
      
      The log tree has a long standing problem that when a file is fsync'ed we
      only check for new ancestors, created in the current transaction, by
      following only the hard link for which the fsync was issued. We follow the
      ancestors using the VFS' dget_parent() API. This means that if we create a
      new link for a file in a directory that is new (or in any other new
      ancestor directory) and then fsync the file using an old hard link, we end
      up not logging the new ancestor, and on log replay that new hard link and
      ancestor do not exist. In some cases, involving renames, the file will not
      exist at all.
      
      Example:
      
        mkfs.btrfs -f /dev/sdb
        mount /dev/sdb /mnt
      
        mkdir /mnt/A
        touch /mnt/foo
        ln /mnt/foo /mnt/A/bar
        xfs_io -c fsync /mnt/foo
      
        <power failure>
      
      In this example after log replay only the hard link named 'foo' exists
      and directory A does not exist, which is unexpected. In other major linux
      filesystems, such as ext4, xfs and f2fs for example, both hard links exist
      and so does directory A after mounting again the filesystem.
      
      Checking if any ancestors are new and need to be logged was added in
      2009 by commit 12fcfd22 ("Btrfs: tree logging unlink/rename fixes"),
      however only for the ancestors of the hard link (dentry) for which the
      fsync was issued, instead of checking for all ancestors for all of the
      inode's hard links.
      
      So fix this by tracking the id of the last transaction where a hard link
      was created for an inode and then on fsync fall back to a full transaction
      commit when an inode has more than one hard link and at least one new hard
      link was created in the current transaction. This is the simplest solution
      since this is not a common use case (frequently adding hard links for
      which there's an ancestor created in the current transaction and then
      fsyncing the file). In case it ever becomes a common use case, a solution
      that consists of iterating the fs/subvol btree for each hard link and
      checking if any ancestor is new could be implemented.
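
The fallback condition described above can be modelled with a small Python
sketch. It is only an illustration of the rule as stated in this commit
message; the function and parameter names (`fsync_needs_full_commit`,
`last_link_trans`, `current_trans`) are assumptions made for the example and
are not taken from the kernel source.

```python
def fsync_needs_full_commit(nlink: int, last_link_trans: int,
                            current_trans: int) -> bool:
    """Decide whether fsync should fall back to a full transaction commit.

    Hypothetical model: fall back when the inode has more than one hard
    link and at least one hard link was created in the current transaction.
    """
    return nlink > 1 and last_link_trans == current_trans


# Example from the commit message: 'foo' gets a second link 'A/bar' in the
# same transaction, then 'foo' is fsync'ed.
print(fsync_needs_full_commit(nlink=2, last_link_trans=7, current_trans=7))
```

A file with a single link, or one whose links were all created in earlier
transactions, keeps using the fast log-based fsync in this model.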
      
      This solves many unexpected scenarios reported by Jayashree Mohan and
      Vijay Chidambaram, and for which there is a new test case for fstests
      under review.
      
      Fixes: 12fcfd22 ("Btrfs: tree logging unlink/rename fixes")
      CC: stable@vger.kernel.org # 4.4+
      Reported-by: Vijay Chidambaram <vvijay03@gmail.com>
      Reported-by: Jayashree Mohan <jayashree2912@gmail.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  7. 21 Nov, 2018 1 commit
    • F
      Btrfs: fix missing data checksums after a ranged fsync (msync) · fa625a48
      Filipe Manana committed
      commit 008c6753f7e070c77c70d708a6bf0255b4381763 upstream.
      
      Recently we got a massive simplification for fsync, where for the fast
      path we no longer log new extents while their respective ordered extents
      are still running.
      
      However that simplification introduced a subtle regression for the case
      where we use a ranged fsync (msync). Consider the following example:
      
                     CPU 0                                    CPU 1
      
                                                  mmap write to range [2Mb, 4Mb[
        mmap write to range [512Kb, 1Mb[
        msync range [512Kb, 1Mb[
          --> triggers fast fsync
              (BTRFS_INODE_NEEDS_FULL_SYNC
               not set)
          --> creates extent map A for this
              range and adds it to list of
              modified extents
          --> starts ordered extent A for
              this range
          --> waits for it to complete
      
                                                  writeback triggered for range
                                                  [2Mb, 4Mb[
                                                    --> create extent map B and
                                                        adds it to the list of
                                                        modified extents
                                                    --> creates ordered extent B
      
          --> start looking for and logging
              modified extents
          --> logs extent maps A and B
          --> finds checksums for extent A
              in the csum tree, but not for
              extent B
        fsync (msync) finishes
      
                                                    --> ordered extent B
                                                        finishes and its
                                                        checksums are added
                                                        to the csum tree
      
                                      <power cut>
      
      After replaying the log, we have the extent covering the range [2Mb, 4Mb[
      but do not have the data checksum items covering that file range.
      
      This happens because at the very beginning of an fsync (btrfs_sync_file())
      we start and wait for IO in the given range [512Kb, 1Mb[ and therefore
      wait for any ordered extents in that range to complete before we start
      logging the extents. However if right before we start logging the extent
      in our range [512Kb, 1Mb[, writeback is started for any other dirty range,
      such as the range [2Mb, 4Mb[ due to memory pressure or a concurrent fsync
      or msync (btrfs_sync_file() starts writeback before acquiring the inode's
      lock), an ordered extent is created for that other range and a new extent
      map is created to represent that range and added to the inode's list of
      modified extents.
      
      That means that we will see that other extent in that list when collecting
      extents for logging (done at btrfs_log_changed_extents()) and log the
      extent before the respective ordered extent finishes - namely before the
      checksum items are added to the checksums tree, which is where
      log_extent_csums() looks for the checksums, therefore making us log an
      extent without logging its checksums. Before that massive simplification
      of fsync, this wasn't a problem because besides looking for checksums in
      the checksums tree, we also looked for them in any ordered extent still
      running.
      
      The consequence of data checksums missing for a file range is that users
      attempting to read the affected file range will get -EIO errors and dmesg
      reports the following:
      
       [10188.358136] BTRFS info (device sdc): no csum found for inode 297 start 57344
       [10188.359278] BTRFS warning (device sdc): csum failed root 5 ino 297 off 57344 csum 0x98f94189 expected csum 0x00000000 mirror 1
      
      So fix this by skipping extents outside of our logging range at
      btrfs_log_changed_extents() and leaving them on the list of modified
      extents so that any subsequent ranged fsync may collect them if needed.
      Also, if we find a hole extent outside of the range still log it, just
      to prevent having gaps between extent items after replaying the log,
      otherwise fsck will complain when we are not using the NO_HOLES feature
      (fstest btrfs/056 triggers such case).
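
The filtering rule of the fix can be sketched in Python as below. This is a
simplified model, not the code of btrfs_log_changed_extents(): the
`ExtentMap` class, the `collect_extents_for_range` helper and the half-open
byte ranges are all assumptions made for the illustration.

```python
from dataclasses import dataclass

KB = 1024
MB = 1024 * KB


@dataclass
class ExtentMap:
    name: str
    start: int          # file offset in bytes
    length: int         # length in bytes
    is_hole: bool = False


def collect_extents_for_range(modified, range_start, range_end):
    """Split modified extents into (to_log, kept_on_list).

    The logging range is treated as half-open: [range_start, range_end).
    Extents fully outside the range stay on the modified list so a later
    ranged fsync can pick them up, except holes, which are still logged
    to avoid gaps between extent items after log replay.
    """
    to_log, kept = [], []
    for em in modified:
        outside = em.start >= range_end or em.start + em.length <= range_start
        if outside and not em.is_hole:
            kept.append(em)
        else:
            to_log.append(em)
    return to_log, kept


# Extents A and B from the diagram above; msync covers [512Kb, 1Mb[.
extent_a = ExtentMap("A", 512 * KB, 512 * KB)
extent_b = ExtentMap("B", 2 * MB, 2 * MB)
to_log, kept = collect_extents_for_range([extent_a, extent_b], 512 * KB, 1 * MB)
print([em.name for em in to_log], [em.name for em in kept])
```

In this model extent B is never logged by the msync of the first range, so it
cannot end up in the log without its checksums.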
      
      Fixes: e7175a69 ("btrfs: remove the wait ordered logic in the log_one_extent path")
      CC: stable@vger.kernel.org # 4.19+
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  8. 14 Nov, 2018 5 commits
    • J
      btrfs: move the dio_sem higher up the callchain · 51c62a33
      Josef Bacik committed
      commit c495144b upstream.
      
      We're getting a lockdep splat because we take the dio_sem under the
      log_mutex.  What we really need is to protect fsync() from logging an
      extent map for an extent we never waited on higher up, so just guard the
      whole thing with dio_sem.
      
      ======================================================
      WARNING: possible circular locking dependency detected
      4.18.0-rc4-xfstests-00025-g5de5edbaf1d4 #411 Not tainted
      ------------------------------------------------------
      aio-dio-invalid/30928 is trying to acquire lock:
      0000000092621cfd (&mm->mmap_sem){++++}, at: get_user_pages_unlocked+0x5a/0x1e0
      
      but task is already holding lock:
      00000000cefe6b35 (&ei->dio_sem){++++}, at: btrfs_direct_IO+0x3be/0x400
      
      which lock already depends on the new lock.
      
      the existing dependency chain (in reverse order) is:
      
      -> #5 (&ei->dio_sem){++++}:
             lock_acquire+0xbd/0x220
             down_write+0x51/0xb0
             btrfs_log_changed_extents+0x80/0xa40
             btrfs_log_inode+0xbaf/0x1000
             btrfs_log_inode_parent+0x26f/0xa80
             btrfs_log_dentry_safe+0x50/0x70
             btrfs_sync_file+0x357/0x540
             do_fsync+0x38/0x60
             __ia32_sys_fdatasync+0x12/0x20
             do_fast_syscall_32+0x9a/0x2f0
             entry_SYSENTER_compat+0x84/0x96
      
      -> #4 (&ei->log_mutex){+.+.}:
             lock_acquire+0xbd/0x220
             __mutex_lock+0x86/0xa10
             btrfs_record_unlink_dir+0x2a/0xa0
             btrfs_unlink+0x5a/0xc0
             vfs_unlink+0xb1/0x1a0
             do_unlinkat+0x264/0x2b0
             do_fast_syscall_32+0x9a/0x2f0
             entry_SYSENTER_compat+0x84/0x96
      
      -> #3 (sb_internal#2){.+.+}:
             lock_acquire+0xbd/0x220
             __sb_start_write+0x14d/0x230
             start_transaction+0x3e6/0x590
             btrfs_evict_inode+0x475/0x640
             evict+0xbf/0x1b0
             btrfs_run_delayed_iputs+0x6c/0x90
             cleaner_kthread+0x124/0x1a0
             kthread+0x106/0x140
             ret_from_fork+0x3a/0x50
      
      -> #2 (&fs_info->cleaner_delayed_iput_mutex){+.+.}:
             lock_acquire+0xbd/0x220
             __mutex_lock+0x86/0xa10
             btrfs_alloc_data_chunk_ondemand+0x197/0x530
             btrfs_check_data_free_space+0x4c/0x90
             btrfs_delalloc_reserve_space+0x20/0x60
             btrfs_page_mkwrite+0x87/0x520
             do_page_mkwrite+0x31/0xa0
             __handle_mm_fault+0x799/0xb00
             handle_mm_fault+0x7c/0xe0
             __do_page_fault+0x1d3/0x4a0
             async_page_fault+0x1e/0x30
      
      -> #1 (sb_pagefaults){.+.+}:
             lock_acquire+0xbd/0x220
             __sb_start_write+0x14d/0x230
             btrfs_page_mkwrite+0x6a/0x520
             do_page_mkwrite+0x31/0xa0
             __handle_mm_fault+0x799/0xb00
             handle_mm_fault+0x7c/0xe0
             __do_page_fault+0x1d3/0x4a0
             async_page_fault+0x1e/0x30
      
      -> #0 (&mm->mmap_sem){++++}:
             __lock_acquire+0x42e/0x7a0
             lock_acquire+0xbd/0x220
             down_read+0x48/0xb0
             get_user_pages_unlocked+0x5a/0x1e0
             get_user_pages_fast+0xa4/0x150
             iov_iter_get_pages+0xc3/0x340
             do_direct_IO+0xf93/0x1d70
             __blockdev_direct_IO+0x32d/0x1c20
             btrfs_direct_IO+0x227/0x400
             generic_file_direct_write+0xcf/0x180
             btrfs_file_write_iter+0x308/0x58c
             aio_write+0xf8/0x1d0
             io_submit_one+0x3a9/0x620
             __ia32_compat_sys_io_submit+0xb2/0x270
             do_int80_syscall_32+0x5b/0x1a0
             entry_INT80_compat+0x88/0xa0
      
      other info that might help us debug this:
      
      Chain exists of:
        &mm->mmap_sem --> &ei->log_mutex --> &ei->dio_sem
      
       Possible unsafe locking scenario:
      
             CPU0                    CPU1
             ----                    ----
        lock(&ei->dio_sem);
                                     lock(&ei->log_mutex);
                                     lock(&ei->dio_sem);
        lock(&mm->mmap_sem);
      
       *** DEADLOCK ***
      
      1 lock held by aio-dio-invalid/30928:
       #0: 00000000cefe6b35 (&ei->dio_sem){++++}, at: btrfs_direct_IO+0x3be/0x400
      
      stack backtrace:
      CPU: 0 PID: 30928 Comm: aio-dio-invalid Not tainted 4.18.0-rc4-xfstests-00025-g5de5edbaf1d4 #411
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-2.el7 04/01/2014
      Call Trace:
       dump_stack+0x7c/0xbb
       print_circular_bug.isra.37+0x297/0x2a4
       check_prev_add.constprop.45+0x781/0x7a0
       ? __lock_acquire+0x42e/0x7a0
       validate_chain.isra.41+0x7f0/0xb00
       __lock_acquire+0x42e/0x7a0
       lock_acquire+0xbd/0x220
       ? get_user_pages_unlocked+0x5a/0x1e0
       down_read+0x48/0xb0
       ? get_user_pages_unlocked+0x5a/0x1e0
       get_user_pages_unlocked+0x5a/0x1e0
       get_user_pages_fast+0xa4/0x150
       iov_iter_get_pages+0xc3/0x340
       do_direct_IO+0xf93/0x1d70
       ? __alloc_workqueue_key+0x358/0x490
       ? __blockdev_direct_IO+0x14b/0x1c20
       __blockdev_direct_IO+0x32d/0x1c20
       ? btrfs_run_delalloc_work+0x40/0x40
       ? can_nocow_extent+0x490/0x490
       ? kvm_clock_read+0x1f/0x30
       ? can_nocow_extent+0x490/0x490
       ? btrfs_run_delalloc_work+0x40/0x40
       btrfs_direct_IO+0x227/0x400
       ? btrfs_run_delalloc_work+0x40/0x40
       generic_file_direct_write+0xcf/0x180
       btrfs_file_write_iter+0x308/0x58c
       aio_write+0xf8/0x1d0
       ? kvm_clock_read+0x1f/0x30
       ? __might_fault+0x3e/0x90
       io_submit_one+0x3a9/0x620
       ? io_submit_one+0xe5/0x620
       __ia32_compat_sys_io_submit+0xb2/0x270
       do_int80_syscall_32+0x5b/0x1a0
       entry_INT80_compat+0x88/0xa0
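
The circular dependency reported in the splat can be reproduced with a small
lock-order graph. The Python sketch below is only an illustration of the idea
behind the report, not of lockdep itself; the `has_cycle` helper is
hypothetical, the first edge list is taken from the "Chain exists of" summary
plus the new acquisition, and the second edge list is an assumed reading of
the lock order once dio_sem is taken before log_mutex.

```python
def has_cycle(edges):
    """Return True if the lock-order graph contains a cycle.

    Each edge (held, wanted) means 'wanted' is acquired while 'held' is
    already held. Uses a depth-first search with three node states.
    """
    graph = {}
    for held, wanted in edges:
        graph.setdefault(held, []).append(wanted)
        graph.setdefault(wanted, [])

    UNSEEN, IN_PROGRESS, DONE = 0, 1, 2
    state = {node: UNSEEN for node in graph}

    def visit(node):
        state[node] = IN_PROGRESS
        for nxt in graph[node]:
            if state[nxt] == IN_PROGRESS:
                return True          # back edge: cycle found
            if state[nxt] == UNSEEN and visit(nxt):
                return True
        state[node] = DONE
        return False

    return any(state[node] == UNSEEN and visit(node) for node in graph)


# Order before the change: dio_sem is taken under log_mutex, while direct IO
# takes mmap_sem under dio_sem.
before = [("mmap_sem", "log_mutex"), ("log_mutex", "dio_sem"),
          ("dio_sem", "mmap_sem")]
# Assumed order after the change: dio_sem is taken before log_mutex.
after = [("mmap_sem", "log_mutex"), ("dio_sem", "log_mutex"),
         ("dio_sem", "mmap_sem")]
print(has_cycle(before), has_cycle(after))
```

Taking dio_sem first removes the log_mutex to dio_sem edge, which is the edge
that closes the loop in the first graph.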
      
      CC: stable@vger.kernel.org # 4.14+
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • F
      Btrfs: fix assertion on fsync of regular file when using no-holes feature · e17af96e
      Filipe Manana committed
      commit 7ed586d0a8241e81d58c656c5b315f781fa6fc97 upstream.
      
      When using the NO_HOLES feature and logging a regular file, we were
      expecting that if we find an inline extent, that either its size in RAM
      (uncompressed and unencoded) matches the size of the file or if it does
      not, that it matches the sector size and it represents compressed data.
      This assertion does not cover a case where the length of the inline extent
      is smaller than the sector size and also smaller than the file's size; such
      a case is possible through fallocate. Example:
      
        $ mkfs.btrfs -f -O no-holes /dev/sdb
        $ mount /dev/sdb /mnt
      
        $ xfs_io -f -c "pwrite -S 0xb60 0 21" /mnt/foobar
        $ xfs_io -c "falloc 40 40" /mnt/foobar
        $ xfs_io -c "fsync" /mnt/foobar
      
      In the above example we trigger the assertion because the inline extent's
      length is 21 bytes while the file size is 80 bytes. The fallocate() call
      merely updated the file's size and did not touch the existing inline
      extent, as expected.
      
      So fix this by adjusting the assertion so that an inline extent length
      smaller than the file size is valid if the file size is smaller than the
      filesystem's sector size.
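
The adjusted check can be written out as a predicate. The Python sketch below
models the three accepted cases described in this commit message; it is not
the kernel assertion itself, and the function name
`inline_extent_len_is_valid` and its parameters are hypothetical.

```python
def inline_extent_len_is_valid(ram_len: int, file_size: int,
                               sector_size: int, compressed: bool) -> bool:
    """Model of the conditions under which an inline extent is accepted.

    1. The extent's in-RAM length matches the file size, or
    2. it matches the sector size and holds compressed data, or
    3. it is smaller than the file size while the file size is smaller
       than the sector size (the case added by this fix).
    """
    return (
        ram_len == file_size
        or (ram_len == sector_size and compressed)
        or (ram_len < file_size and file_size < sector_size)
    )


# Example from the commit message: 21 byte inline extent, 80 byte file.
# A 4096 byte sector size is assumed here for illustration.
print(inline_extent_len_is_valid(21, 80, 4096, compressed=False))
```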
      
      A test case for fstests follows soon.
      Reported-by: Anatoly Trosinenko <anatoly.trosinenko@gmail.com>
      Fixes: a89ca6f2 ("Btrfs: fix fsync after truncate when no_holes feature is enabled")
      CC: stable@vger.kernel.org # 4.14+
      Link: https://lore.kernel.org/linux-btrfs/CAE5jQCfRSBC7n4pUTFJcmHh109=gwyT9mFkCOL+NKfzswmR=_Q@mail.gmail.com/
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • F
      Btrfs: fix wrong dentries after fsync of file that got its parent replaced · d2c6df39
      Filipe Manana committed
      commit 0f375eed92b5a407657532637ed9652611a682f5 upstream.
      
      In a scenario like the following:
      
        mkdir /mnt/A               # inode 258
        mkdir /mnt/B               # inode 259
        touch /mnt/B/bar           # inode 260
      
        sync
      
        mv /mnt/B/bar /mnt/A/bar
        mv -T /mnt/A /mnt/B
        fsync /mnt/B/bar
      
        <power fail>
      
      After replaying the log we end up with file bar having 2 hard links, both
      with the name 'bar' and one in the directory with inode number 258 and the
      other in the directory with inode number 259. Also, we end up with the
      directory inode 259 still existing and with the directory inode 258 still
      named as 'A', instead of 'B'. In this scenario, file 'bar' should only
      have one hard link, located at directory inode 258, the directory inode
      259 should not exist anymore and the name for directory inode 258 should
      be 'B'.
      
      This incorrect behaviour happens because when attempting to log the old
      parents of an inode, we skip any parents that no longer exist. Fix this
      by forcing a full commit if an old parent no longer exists.
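
The change in behaviour can be illustrated with a short Python sketch. It is a
simplified model under stated assumptions: `plan_old_parent_logging` is a
hypothetical helper, and inode existence is represented by a plain set of
inode numbers rather than a btree lookup.

```python
def plan_old_parent_logging(old_parent_inos, existing_inos):
    """Decide how to handle the old parents of an inode being fsync'ed.

    Returns ("full_commit", []) if any old parent no longer exists,
    otherwise ("log", parents) with the parents that should be logged.
    Before the fix, a missing parent was skipped instead.
    """
    for ino in old_parent_inos:
        if ino not in existing_inos:
            return "full_commit", []
    return "log", list(old_parent_inos)


# Scenario from the commit message: 'bar' (inode 260) used to live in the
# directory with inode 259, which is gone after 'mv -T /mnt/A /mnt/B'.
print(plan_old_parent_logging([259], existing_inos={258, 260}))
```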
      
      A test case for fstests follows soon.
      
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • F
      Btrfs: fix warning when replaying log after fsync of a tmpfile · 55f21e16
      Filipe Manana committed
      commit f2d72f42d5fa3bf33761d9e47201745f624fcff5 upstream.
      
      When replaying a log which contains a tmpfile (which necessarily has a
      link count of 0) we end up calling inc_nlink(), at
      fs/btrfs/tree-log.c:replay_one_buffer(), which produces a warning like
      the following:
      
        [195191.943673] WARNING: CPU: 0 PID: 6924 at fs/inode.c:342 inc_nlink+0x33/0x40
        [195191.943723] CPU: 0 PID: 6924 Comm: mount Not tainted 4.19.0-rc6-btrfs-next-38 #1
        [195191.943724] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626ccb91-prebuilt.qemu-project.org 04/01/2014
        [195191.943726] RIP: 0010:inc_nlink+0x33/0x40
        [195191.943728] RSP: 0018:ffffb96e425e3870 EFLAGS: 00010246
        [195191.943730] RAX: 0000000000000000 RBX: ffff8c0d1e6af4f0 RCX: 0000000000000006
        [195191.943731] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8c0d1e6af4f0
        [195191.943731] RBP: 0000000000000097 R08: 0000000000000001 R09: 0000000000000000
        [195191.943732] R10: 0000000000000000 R11: 0000000000000000 R12: ffffb96e425e3a60
        [195191.943733] R13: ffff8c0d10cff0c8 R14: ffff8c0d0d515348 R15: ffff8c0d78a1b3f8
        [195191.943735] FS:  00007f570ee24480(0000) GS:ffff8c0dfb200000(0000) knlGS:0000000000000000
        [195191.943736] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        [195191.943737] CR2: 00005593286277c8 CR3: 00000000bb8f2006 CR4: 00000000003606f0
        [195191.943739] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        [195191.943740] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        [195191.943741] Call Trace:
        [195191.943778]  replay_one_buffer+0x797/0x7d0 [btrfs]
        [195191.943802]  walk_up_log_tree+0x1c1/0x250 [btrfs]
        [195191.943809]  ? rcu_read_lock_sched_held+0x3f/0x70
        [195191.943825]  walk_log_tree+0xae/0x1d0 [btrfs]
        [195191.943840]  btrfs_recover_log_trees+0x1d7/0x4d0 [btrfs]
        [195191.943856]  ? replay_dir_deletes+0x280/0x280 [btrfs]
        [195191.943870]  open_ctree+0x1c3b/0x22a0 [btrfs]
        [195191.943887]  btrfs_mount_root+0x6b4/0x800 [btrfs]
        [195191.943894]  ? rcu_read_lock_sched_held+0x3f/0x70
        [195191.943899]  ? pcpu_alloc+0x55b/0x7c0
        [195191.943906]  ? mount_fs+0x3b/0x140
        [195191.943908]  mount_fs+0x3b/0x140
        [195191.943912]  ? __init_waitqueue_head+0x36/0x50
        [195191.943916]  vfs_kern_mount+0x62/0x160
        [195191.943927]  btrfs_mount+0x134/0x890 [btrfs]
        [195191.943936]  ? rcu_read_lock_sched_held+0x3f/0x70
        [195191.943938]  ? pcpu_alloc+0x55b/0x7c0
        [195191.943943]  ? mount_fs+0x3b/0x140
        [195191.943952]  ? btrfs_remount+0x570/0x570 [btrfs]
        [195191.943954]  mount_fs+0x3b/0x140
        [195191.943956]  ? __init_waitqueue_head+0x36/0x50
        [195191.943960]  vfs_kern_mount+0x62/0x160
        [195191.943963]  do_mount+0x1f9/0xd40
        [195191.943967]  ? memdup_user+0x4b/0x70
        [195191.943971]  ksys_mount+0x7e/0xd0
        [195191.943974]  __x64_sys_mount+0x21/0x30
        [195191.943977]  do_syscall_64+0x60/0x1b0
        [195191.943980]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
        [195191.943983] RIP: 0033:0x7f570e4e524a
        [195191.943986] RSP: 002b:00007ffd83589478 EFLAGS: 00000206 ORIG_RAX: 00000000000000a5
        [195191.943989] RAX: ffffffffffffffda RBX: 0000563f335b2060 RCX: 00007f570e4e524a
        [195191.943990] RDX: 0000563f335b2240 RSI: 0000563f335b2280 RDI: 0000563f335b2260
        [195191.943992] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000020
        [195191.943993] R10: 00000000c0ed0000 R11: 0000000000000206 R12: 0000563f335b2260
        [195191.943994] R13: 0000563f335b2240 R14: 0000000000000000 R15: 00000000ffffffff
        [195191.944002] irq event stamp: 8688
        [195191.944010] hardirqs last  enabled at (8687): [<ffffffff9cb004c3>] console_unlock+0x503/0x640
        [195191.944012] hardirqs last disabled at (8688): [<ffffffff9ca037dd>] trace_hardirqs_off_thunk+0x1a/0x1c
        [195191.944018] softirqs last  enabled at (8638): [<ffffffff9cc0a5d1>] __set_page_dirty_nobuffers+0x101/0x150
        [195191.944020] softirqs last disabled at (8634): [<ffffffff9cc26bbe>] wb_wakeup_delayed+0x2e/0x60
        [195191.944022] ---[ end trace 5d6e873a9a0b811a ]---
      
      This happens because the inode does not have the flag I_LINKABLE set,
      which is a runtime only flag, not meant to be persisted, set when the
      inode is created through open(2) if the flag O_EXCL is not passed to it.
      Except for the warning, there are no other consequences (like corruptions
      or metadata inconsistencies).
      
      Since it's pointless to replay a tmpfile as it would be deleted in a
      later phase of the log replay procedure (it has a link count of 0), fix
      this by not logging tmpfiles and if a tmpfile is found in a log (created
      by a kernel without this change), skip the replay of the inode.
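
Both halves of the fix come down to a link count check. The Python sketch
below is a minimal model of that rule, not the tree-log code; the helper names
`should_log_inode` and `inodes_to_replay`, and the dictionary mapping inode
numbers to link counts, are assumptions made for the example.

```python
def should_log_inode(nlink: int) -> bool:
    """A tmpfile has a link count of 0 and is not logged at fsync time."""
    return nlink > 0


def inodes_to_replay(logged_inodes):
    """Return the inode numbers that log replay should process.

    `logged_inodes` maps inode number -> link count found in the log.
    Inodes with a link count of 0 (tmpfiles logged by a kernel without
    the fix) are skipped, since they would be deleted later anyway.
    """
    return [ino for ino, nlink in logged_inodes.items() if nlink > 0]


# Inode 257 is a regular file, inode 258 a tmpfile found in an old log.
print(should_log_inode(0), inodes_to_replay({257: 1, 258: 0}))
```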
      
      A test case for fstests follows soon.
      
      Fixes: 471d557a ("Btrfs: fix loss of prealloc extents past i_size after fsync log replay")
      CC: stable@vger.kernel.org # 4.18+
      Reported-by: Martin Steigerwald <martin@lichtvoll.de>
      Link: https://lore.kernel.org/linux-btrfs/3666619.NTnn27ZJZE@merkaba/
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • J
      btrfs: fix error handling in free_log_tree · cdecd48a
      Jeff Mahoney committed
      commit 374b0e2d upstream.
      
      When we hit an I/O error in free_log_tree->walk_log_tree during file system
      shutdown we can crash due to there not being a valid transaction handle.
      
      Use btrfs_handle_fs_error when there's no transaction handle to use.
      
        BUG: unable to handle kernel NULL pointer dereference at 0000000000000060
        IP: free_log_tree+0xd2/0x140 [btrfs]
        PGD 0 P4D 0
        Oops: 0000 [#1] SMP DEBUG_PAGEALLOC PTI
        Modules linked in: <modules>
        CPU: 2 PID: 23544 Comm: umount Tainted: G        W        4.12.14-kvmsmall #9 SLE15 (unreleased)
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.0.0-prebuilt.qemu-project.org 04/01/2014
        task: ffff96bfd3478880 task.stack: ffffa7cf40d78000
        RIP: 0010:free_log_tree+0xd2/0x140 [btrfs]
        RSP: 0018:ffffa7cf40d7bd10 EFLAGS: 00010282
        RAX: 00000000fffffffb RBX: 00000000fffffffb RCX: 0000000000000002
        RDX: 0000000000000000 RSI: ffff96c02f07d4c8 RDI: 0000000000000282
        RBP: ffff96c013cf1000 R08: ffff96c02f07d4c8 R09: ffff96c02f07d4d0
        R10: 0000000000000000 R11: 0000000000000002 R12: 0000000000000000
        R13: ffff96c005e800c0 R14: ffffa7cf40d7bdb8 R15: 0000000000000000
        FS:  00007f17856bcfc0(0000) GS:ffff96c03f600000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 0000000000000060 CR3: 0000000045ed6002 CR4: 00000000003606e0
        DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        Call Trace:
         ? wait_for_writer+0xb0/0xb0 [btrfs]
         btrfs_free_log+0x17/0x30 [btrfs]
         btrfs_drop_and_free_fs_root+0x9a/0xe0 [btrfs]
         btrfs_free_fs_roots+0xc0/0x130 [btrfs]
         ? wait_for_completion+0xf2/0x100
         close_ctree+0xea/0x2e0 [btrfs]
         ? kthread_stop+0x161/0x260
         generic_shutdown_super+0x6c/0x120
         kill_anon_super+0xe/0x20
         btrfs_kill_super+0x13/0x100 [btrfs]
         deactivate_locked_super+0x3f/0x70
         cleanup_mnt+0x3b/0x70
         task_work_run+0x78/0x90
         exit_to_usermode_loop+0x77/0xa6
         do_syscall_64+0x1c5/0x1e0
         entry_SYSCALL_64_after_hwframe+0x42/0xb7
        RIP: 0033:0x7f1784f90827
        RSP: 002b:00007ffdeeb03118 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
        RAX: 0000000000000000 RBX: 0000556a60c62970 RCX: 00007f1784f90827
        RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000556a60c62b50
        RBP: 0000000000000000 R08: 0000000000000005 R09: 00000000ffffffff
        R10: 0000556a60c63900 R11: 0000000000000246 R12: 0000556a60c62b50
        R13: 00007f17854a81c4 R14: 0000000000000000 R15: 0000000000000000
        RIP: free_log_tree+0xd2/0x140 [btrfs] RSP: ffffa7cf40d7bd10
        CR2: 0000000000000060
      
      Fixes: 681ae509 ("Btrfs: cleanup reserved space when freeing tree log on error")
      CC: <stable@vger.kernel.org> # v3.13
      Signed-off-by: Jeff Mahoney <jeffm@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  9. 23 Aug, 2018 1 commit
    • F
      Btrfs: sync log after logging new name · d4682ba0
      Filipe Manana committed
      When we add a new name for an inode which was logged in the current
      transaction, we update the inode in the log so that its new name and
      ancestors are added to the log. However when we do this we do not persist
      the log, so the changes remain in memory only, and as a consequence, any
      ancestors that were created in the current transaction are updated such
      that future calls to btrfs_inode_in_log() return true. This leads to a
      subsequent fsync against such new ancestor directories returning
      immediately, without persisting the log, therefore after a power failure
      the new ancestor directories do not exist, despite fsync being called
      against them explicitly.
      
      Example:
      
        $ mkfs.btrfs -f /dev/sdb
        $ mount /dev/sdb /mnt
      
        $ mkdir /mnt/A
        $ mkdir /mnt/B
        $ mkdir /mnt/A/C
        $ touch /mnt/B/foo
        $ xfs_io -c "fsync" /mnt/B/foo
        $ ln /mnt/B/foo /mnt/A/C/foo
        $ xfs_io -c "fsync" /mnt/A
        <power failure>
      
      After the power failure, directory "A" does not exist, despite the explicit
      fsync on it.
      
      Instead of fixing this by changing the behaviour of the explicit fsync on
      directory "A" to persist the log instead of doing nothing, make the logging
      of the new file name (which happens when creating a hard link or renaming)
      persist the log. This approach not only is simpler, not requiring addition
      of new fields to the inode in memory structure, but also gives us the same
      behaviour as ext4, xfs and f2fs (possibly other filesystems too).
      
      A test case for fstests follows soon.
      
      Fixes: 12fcfd22 ("Btrfs: tree logging unlink/rename fixes")
      Reported-by: Vijay Chidambaram <vvijay03@gmail.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  10. 06 Aug, 2018 11 commits
  11. 29 May, 2018 2 commits
    • D
      btrfs: replace waitqueue_actvie with cond_wake_up · 093258e6
      David Sterba committed
      Use the wrappers and reduce the amount of low-level details about the
      waitqueue management.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • D
      btrfs: add barriers to btrfs_sync_log before log_commit_wait wakeups · 3d3a2e61
      David Sterba committed
      Currently the code assumes that there's an implied barrier by the
      sequence of code preceding the wakeup, namely the mutex unlock.
      
      As Nikolay pointed out:
      
      I think this is wrong (not your code) but the original assumption that
      the RELEASE semantics provided by mutex_unlock is sufficient.
      According to memory-barriers.txt:
      
      Section 'LOCK ACQUISITION FUNCTIONS' states:
      
       (2) RELEASE operation implication:
      
           Memory operations issued before the RELEASE will be completed before the
           RELEASE operation has completed.
      
           Memory operations issued after the RELEASE *may* be completed before the
           RELEASE operation has completed.
      
      (I've bolded the may portion)
      
      The example given there:
      
      As an example, consider the following:
      
          *A = a;
          *B = b;
          ACQUIRE
          *C = c;
          *D = d;
          RELEASE
          *E = e;
          *F = f;
      
      The following sequence of events is acceptable:
      
          ACQUIRE, {*F,*A}, *E, {*C,*D}, *B, RELEASE
      
      So if we assume that *C is modifying the flag which the waitqueue is checking,
      and *E is the actual wakeup, then those accesses can be re-ordered...
      
      IMHO this code should be considered broken...
      ---
      
      To be on the safe side, add the barriers. The synchronization logic
      around log using the mutexes and several other threads does not make it
      easy to reason for/against the barrier.
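
      The reordering argument quoted above can be checked mechanically. The
      sketch below is a toy model, not kernel code: it flattens the grouped
      accesses such as {*F,*A} into a single completion order and applies only
      the two one-way rules quoted from memory-barriers.txt (accesses issued
      after the ACQUIRE complete after it, accesses issued before the RELEASE
      complete before it). The names is_acceptable, quoted and hoisted are
      made up for illustration.

```python
# Program order from the quoted example: A, B, ACQUIRE, C, D, RELEASE, E, F.
BEFORE_ACQUIRE = {"A", "B"}
CRITICAL_SECTION = {"C", "D"}
AFTER_RELEASE = {"E", "F"}


def is_acceptable(order):
    """Check a completion order against the one-way ACQUIRE/RELEASE rules."""
    pos = {name: i for i, name in enumerate(order)}
    acquire, release = pos["ACQUIRE"], pos["RELEASE"]
    # Issued after the ACQUIRE: must complete after the ACQUIRE.
    after_acquire_ok = all(
        pos[x] > acquire for x in CRITICAL_SECTION | AFTER_RELEASE
    )
    # Issued before the RELEASE: must complete before the RELEASE.
    before_release_ok = all(
        pos[x] < release for x in BEFORE_ACQUIRE | CRITICAL_SECTION
    )
    return after_acquire_ok and before_release_ok


# The sequence the quoted text calls acceptable.
quoted = ["ACQUIRE", "F", "A", "E", "C", "D", "B", "RELEASE"]

# A counter-example: *F is hoisted above the ACQUIRE, which is not allowed.
hoisted = ["F", "A", "B", "ACQUIRE", "C", "D", "RELEASE", "E"]
```

      In the acceptable sequence *E (standing in for the wakeup) completes
      before *C (standing in for the flag update), which is the case the
      added barriers are meant to rule out.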
      
      CC: Nikolay Borisov <nborisov@suse.com>
      Link: https://lkml.kernel.org/r/6ee068d8-1a69-3728-00d1-d86293d43c9f@suse.com
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  12. 17 May, 2018 1 commit
    • F
      Btrfs: fix duplicate extents after fsync of file with prealloc extents · 31d11b83
      Filipe Manana committed
      In commit 471d557a ("Btrfs: fix loss of prealloc extents past i_size
      after fsync log replay"), on fsync, we started to always log all prealloc
      extents beyond an inode's i_size in order to avoid losing them after a
      power failure. However, in some cases this can cause the log replay
      code to create duplicate extent items, with different lengths, in the
      extent tree. That happens because, as of that commit, we can now log
      extent items based on extent maps that are not on the "modified" list
      of extent maps of the inode's extent map tree. Logging extent items based
      on extent maps is used during the fast fsync path to save time and for
      this to work reliably it requires that the extent maps are not merged
      with other adjacent extent maps - having the extent maps in the list
      of modified extents gives such guarantee.
      
      Consider the following example, captured during a long run of fsstress,
      which illustrates this problem.
      
      We have inode 271, in the filesystem tree (root 5), to which all of the
      following operations and discussion apply.
      
      A buffered write starts at offset 312391 with a length of 933471 bytes
      (end offset at 1245862). At this point we have, for this inode, the
      following extent maps with their field values:
      
      em A, start 0, orig_start 0, len 40960, block_start 18446744073709551613,
            block_len 0, orig_block_len 0
      em B, start 40960, orig_start 40960, len 376832, block_start 1106399232,
            block_len 376832, orig_block_len 376832
      em C, start 417792, orig_start 417792, len 782336, block_start
            18446744073709551613, block_len 0, orig_block_len 0
      em D, start 1200128, orig_start 1200128, len 835584, block_start
            1106776064, block_len 835584, orig_block_len 835584
      em E, start 2035712, orig_start 2035712, len 245760, block_start
            1107611648, block_len 245760, orig_block_len 245760
      
      Extent map A corresponds to a hole and extent maps D and E correspond to
      preallocated extents.
      
      Extent map D ends where extent map E begins (1106776064 + 835584 =
      1107611648), but these extent maps were not merged because they are in
      the inode's list of modified extent maps.
      
      An fsync against this inode is made, which triggers the fast path
      (BTRFS_INODE_NEEDS_FULL_SYNC is not set). This fsync triggers writeback
      of the data previously written using buffered IO, and when the respective
      ordered extent finishes, btrfs_drop_extents() is called against the
      (aligned) range 311296..1249279. This causes a split of extent map D at
      btrfs_drop_extent_cache(), replacing extent map D with a new extent map
      D', also added to the list of modified extents, with the following
      values:
      
      em D', start 1249280, orig_start of 1200128,
             block_start 1106825216 (= 1106776064 + 1249280 - 1200128),
             orig_block_len 835584,
             block_len 786432 (835584 - (1249280 - 1200128))
      
      Then, during the fast fsync, btrfs_log_changed_extents() is called and
      extent maps D' and E are removed from the list of modified extents. The
      flag EXTENT_FLAG_LOGGING is also set on them. After the extents are
      logged, clear_em_logging() is called on each of them, and that causes
      extent map E to be merged with extent map D' (try_merge_map()), resulting
      in D' being deleted and E adjusted to:
      
      em E, start 1249280, orig_start 1200128, len 1032192,
            block_start 1106825216, block_len 1032192,
            orig_block_len 245760
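
      The split and merge arithmetic above can be reproduced with a small
      model. This is only a sketch under simplifying assumptions: ExtentMap,
      split_tail() and merge() are hypothetical stand-ins that carry just the
      fields listed in this message, and they are not the kernel's
      btrfs_drop_extent_cache() or try_merge_map().

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ExtentMap:
    """Simplified stand-in holding only the fields listed above."""
    start: int
    orig_start: int
    len: int
    block_start: int
    block_len: int
    orig_block_len: int


def split_tail(em, new_start):
    """Keep only the part of `em` at or after file offset `new_start`."""
    delta = new_start - em.start
    return replace(
        em,
        start=new_start,
        len=em.len - delta,
        block_start=em.block_start + delta,
        block_len=em.block_len - delta,
    )


def merge(prev, nxt):
    """Merge two maps that are contiguous both in the file and on disk."""
    assert prev.start + prev.len == nxt.start
    assert prev.block_start + prev.block_len == nxt.block_start
    return replace(
        nxt,
        start=prev.start,
        orig_start=prev.orig_start,
        len=prev.len + nxt.len,
        block_start=prev.block_start,
        block_len=prev.block_len + nxt.block_len,
    )


em_d = ExtentMap(1200128, 1200128, 835584, 1106776064, 835584, 835584)
em_e = ExtentMap(2035712, 2035712, 245760, 1107611648, 245760, 245760)

em_d_prime = split_tail(em_d, 1249280)   # D' after dropping 311296..1249279
em_e_merged = merge(em_d_prime, em_e)    # E after absorbing D'
```

      Note how the merged map inherits orig_start 1200128 from D' while its
      orig_block_len stays at 245760, so its fields no longer describe a
      single on-disk extent.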
      
      A direct IO write at offset 1847296 and length of 360448 bytes (end offset
      at 2207744) starts, and at that moment the following extent maps exist for
      our inode:
      
      em A, start 0, orig_start 0, len 40960, block_start 18446744073709551613,
            block_len 0, orig_block_len 0
      em B, start 40960, orig_start 40960, len 270336, block_start 1106399232,
            block_len 270336, orig_block_len 376832
      em C, start 311296, orig_start 311296, len 937984, block_start 1112842240,
            block_len 937984, orig_block_len 937984
      em E (prealloc), start 1249280, orig_start 1200128, len 1032192,
            block_start 1106825216, block_len 1032192, orig_block_len 245760
      
      The dio write results in drop_extent_cache() being called twice. The first
      time for a range that starts at offset 1847296 and ends at offset 2035711
      (length of 188416), which results in a double split of extent map E,
      replacing it with two new extent maps:
      
      em F, start 1249280, orig_start 1200128, block_start 1106825216,
            block_len 598016, orig_block_len 598016
      em G, start 2035712, orig_start 1200128, block_start 1107611648,
            block_len 245760, orig_block_len 1032192
      
      It also creates a new extent map that represents a part of the requested
      IO (through create_io_em()):
      
      em H, start 1847296, len 188416, block_start 1107423232, block_len 188416
      
      The second call to drop_extent_cache() has a range with a start offset of
      2035712 and end offset of 2207743 (length of 172032). This leads to
      replacing extent map G with a new extent map I with the following values:
      
      em I, start 2207744, orig_start 1200128, block_start 1107783680,
            block_len 73728, orig_block_len 1032192
      
      It also creates a new extent map that represents the second part of the
      requested IO (through create_io_em()):
      
      em J, start 2035712, len 172032, block_start 1107611648, block_len 172032
      
      The dio write set the inode's i_size to 2207744 bytes.
      
      After the dio write the inode has the following extent maps:
      
      em A, start 0, orig_start 0, len 40960, block_start 18446744073709551613,
            block_len 0, orig_block_len 0
      em B, start 40960, orig_start 40960, len 270336, block_start 1106399232,
            block_len 270336, orig_block_len 376832
      em C, start 311296, orig_start 311296, len 937984, block_start 1112842240,
            block_len 937984, orig_block_len 937984
      em F, start 1249280, orig_start 1200128, len 598016,
            block_start 1106825216, block_len 598016, orig_block_len 598016
      em H, start 1847296, orig_start 1200128, len 188416,
            block_start 1107423232, block_len 188416, orig_block_len 835584
      em J, start 2035712, orig_start 2035712, len 172032,
            block_start 1107611648, block_len 172032, orig_block_len 245760
      em I, start 2207744, orig_start 1200128, len 73728,
            block_start 1107783680, block_len 73728, orig_block_len 1032192
      
      Now make some change to the file, such as adding an xattr, and then
      fsync it again. This triggers a fast fsync path, and as of commit
      471d557a ("Btrfs: fix loss of prealloc extents past i_size after fsync
      log replay"), we use the extent map I to log a file extent item because
      it's a prealloc extent and it starts at an offset matching the inode's
      i_size. However when we log it, we create a file extent item with a value
      for the disk byte location that is wrong, as can be seen from the
      following output of "btrfs inspect-internal dump-tree":
      
       item 1 key (271 EXTENT_DATA 2207744) itemoff 3782 itemsize 53
           generation 22 type 2 (prealloc)
           prealloc data disk byte 1106776064 nr 1032192
           prealloc data offset 1007616 nr 73728
      
      Here the disk byte value corresponds to a calculation based on some fields
      from the extent map I:
      
        1106776064 = block_start (1107783680) - 1007616 (extent_offset)
        extent_offset = 2207744 (start) - 1200128 (orig_start) = 1007616
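
      The calculation can be replayed in a few lines. This is a sketch with a
      hypothetical helper, logged_extent(), that mirrors only the two formulas
      shown above. The second call assumes the map still carried orig_start
      2035712, the file offset where the second prealloc extent begins (the
      orig_start of extent map J), to show what the same formulas yield when
      the fields describe a single on-disk extent.

```python
def logged_extent(start, orig_start, block_start):
    """Return (disk_bytenr, extent_offset) from extent map fields."""
    extent_offset = start - orig_start
    disk_bytenr = block_start - extent_offset
    return disk_bytenr, extent_offset


# Extent map I as listed above: orig_start was inherited from D'/E merging.
wrong = logged_extent(start=2207744, orig_start=1200128,
                      block_start=1107783680)

# Same range, but anchored at the extent that really backs it (assumption:
# orig_start equal to that extent's first file offset, 2035712).
right = logged_extent(start=2207744, orig_start=2035712,
                      block_start=1107783680)
```

      With the stale orig_start the result points at disk byte 1106776064;
      anchored at 2035712 it points at 1107611648 with a data offset of 172032.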
      
      The disk byte value of 1106776064 clashes with disk byte values of the
      file extent items at offsets 1249280 and 1847296 in the fs tree:
      
              item 6 key (271 EXTENT_DATA 1249280) itemoff 3568 itemsize 53
                      generation 20 type 2 (prealloc)
                      prealloc data disk byte 1106776064 nr 835584
                      prealloc data offset 49152 nr 598016
              item 7 key (271 EXTENT_DATA 1847296) itemoff 3515 itemsize 53
                      generation 20 type 1 (regular)
                      extent data disk byte 1106776064 nr 835584
                      extent data offset 647168 nr 188416 ram 835584
                      extent compression 0 (none)
              item 8 key (271 EXTENT_DATA 2035712) itemoff 3462 itemsize 53
                      generation 20 type 1 (regular)
                      extent data disk byte 1107611648 nr 245760
                      extent data offset 0 nr 172032 ram 245760
                      extent compression 0 (none)
              item 9 key (271 EXTENT_DATA 2207744) itemoff 3409 itemsize 53
                      generation 20 type 2 (prealloc)
                      prealloc data disk byte 1107611648 nr 245760
                      prealloc data offset 172032 nr 73728
      
      Instead of the disk byte value of 1106776064, the value of 1107611648
      should have been logged. Also the data offset value should have been
      172032 and not 1007616.
      After a log replay we end up getting two extent items in the extent tree
      with different lengths, one of 835584, which is correct and existed
      before the log replay, and another one of 1032192 which is wrong and is
      based on the logged file extent item:
      
       item 12 key (1106776064 EXTENT_ITEM 835584) itemoff 3406 itemsize 53
          refs 2 gen 15 flags DATA
          extent data backref root 5 objectid 271 offset 1200128 count 2
       item 13 key (1106776064 EXTENT_ITEM 1032192) itemoff 3353 itemsize 53
          refs 1 gen 22 flags DATA
          extent data backref root 5 objectid 271 offset 1200128 count 1
      
      Obviously this leads to many problems and a filesystem check reports many
      errors:
      
       (...)
       checking extents
       Extent back ref already exists for 1106776064 parent 0 root 5 owner 271 offset 1200128 num_refs 1
       extent item 1106776064 has multiple extent items
       ref mismatch on [1106776064 835584] extent item 2, found 3
       Incorrect local backref count on 1106776064 root 5 owner 271 offset 1200128 found 2 wanted 1 back 0x55b1d0ad7680
       Backref 1106776064 root 5 owner 271 offset 1200128 num_refs 0 not found in extent tree
       Incorrect local backref count on 1106776064 root 5 owner 271 offset 1200128 found 1 wanted 0 back 0x55b1d0ad4e70
       Backref bytes do not match extent backref, bytenr=1106776064, ref bytes=835584, backref bytes=1032192
       backpointer mismatch on [1106776064 835584]
       checking free space cache
       block group 1103101952 has wrong amount of free space
       failed to load free space cache for block group 1103101952
       checking fs roots
       (...)
      
      So fix this by logging the prealloc extents beyond the inode's i_size
      based on searches in the subvolume tree instead of the extent maps.
      
      Fixes: 471d557a ("Btrfs: fix loss of prealloc extents past i_size after fsync log replay")
      CC: stable@vger.kernel.org # 4.14+
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  13. 14 May, 2018 1 commit
    • F
      Btrfs: fix xattr loss after power failure · 9a8fca62
      Filipe Manana committed
      If a file has xattrs and we fsync it (which clears the flags
      BTRFS_INODE_NEEDS_FULL_SYNC and BTRFS_INODE_COPY_EVERYTHING from its
      inode), then the current transaction commits and we fsync it again
      (with neither of those bits set in its inode), we end up not logging
      all of its xattrs. This results in all xattrs being deleted when
      replaying the log after a power failure.
      
      Trivial reproducer:
      
        $ mkfs.btrfs -f /dev/sdb
        $ mount /dev/sdb /mnt
      
        $ touch /mnt/foobar
        $ setfattr -n user.xa -v qwerty /mnt/foobar
        $ xfs_io -c "fsync" /mnt/foobar
      
        $ sync
      
        $ xfs_io -c "pwrite -S 0xab 0 64K" /mnt/foobar
        $ xfs_io -c "fsync" /mnt/foobar
        <power failure>
      
        $ mount /dev/sdb /mnt
        $ getfattr --absolute-names --dump /mnt/foobar
        <empty output>
        $
      
      So fix this by making sure all xattrs are logged if we log a file's inode
      item and neither the flag BTRFS_INODE_NEEDS_FULL_SYNC nor
      BTRFS_INODE_COPY_EVERYTHING was set in the inode.
      
      Fixes: 36283bf7 ("Btrfs: fix fsync xattr loss in the fast fsync path")
      Cc: <stable@vger.kernel.org> # 4.2+
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  14. 12 Apr, 2018 2 commits
    • D
      btrfs: replace GPL boilerplate by SPDX -- sources · c1d7c514
      David Sterba committed
      Remove GPL boilerplate text (long, short, one-line) and keep the rest,
      i.e. personal, company or original source copyright statements. Add the
      SPDX header.
      Signed-off-by: David Sterba <dsterba@suse.com>
    • F
      Btrfs: fix loss of prealloc extents past i_size after fsync log replay · 471d557a
      Filipe Manana committed
      Currently if we allocate extents beyond an inode's i_size (through the
      fallocate system call) and then fsync the file, we log the extents but
      after a power failure we replay them and then immediately drop them.
      This behaviour has existed since about 2009, commit c71bf099 ("Btrfs:
      Avoid orphan inodes cleanup while replaying log"), because it marks
      the inode as an orphan instead of dropping any extents beyond i_size
      before replaying logged extents, so after the log replay, and while
      the mount operation is still ongoing, we find the inode marked as an
      orphan and then perform a truncation (drop extents beyond the inode's
      i_size). Because the processing of orphan inodes is still done
      right after replaying the log and before the mount operation finishes,
      the intention of that commit does not make any sense (at least as
      of today). However reverting that behaviour is not enough, because
      we can not simply discard all extents beyond i_size and then replay
      logged extents, because we risk dropping extents beyond i_size created
      in past transactions, for example:
      
        add prealloc extent beyond i_size
        fsync - clears the flag BTRFS_INODE_NEEDS_FULL_SYNC from the inode
        transaction commit
        add another prealloc extent beyond i_size
        fsync - triggers the fast fsync path
        power failure
      
      In that scenario, we would drop the first extent and then replay the
      second one. To fix this just make sure that all prealloc extents
      beyond i_size are logged, and if we find too many (which is far from
      a common case), fall back to a full transaction commit (like we do when
      logging regular extents in the fast fsync path).
      
      Trivial reproducer:
      
       $ mkfs.btrfs -f /dev/sdb
       $ mount /dev/sdb /mnt
       $ xfs_io -f -c "pwrite -S 0xab 0 256K" /mnt/foo
       $ sync
       $ xfs_io -c "falloc -k 256K 1M" /mnt/foo
       $ xfs_io -c "fsync" /mnt/foo
       <power failure>
      
       # mount to replay log
       $ mount /dev/sdb /mnt
       # at this point the file only has one extent, at offset 0, size 256K
      
      A test case for fstests follows soon, covering multiple scenarios that
      involve adding prealloc extents with previous shrinking truncates and
      without such truncates.
      
      Fixes: c71bf099 ("Btrfs: Avoid orphan inodes cleanup while replaying log")
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  15. 06 Apr, 2018 2 commits
  16. 31 Mar, 2018 3 commits
    • Q
      btrfs: Validate child tree block's level and first key · 581c1760
      Qu Wenruo committed
      We have several reports of node pointers pointing to incorrect child
      tree blocks, which can even have the wrong owner and level while still
      having a valid generation and checksum.
      
      Although btrfs check can handle this and prints an error message like:
      leaf parent key incorrect 60670574592
      
      the kernel does not have enough checks for this type of corruption.
      At least add such a check to read_tree_block() and btrfs_read_buffer(),
      where we need two new parameters, @level and @first_key, to verify the
      child tree block.
      
      The new @level check is mandatory, and all call sites have been
      modified to extract the expected level from their call chain.
      
      The @first_key check is optional; the following call sites skip it:
      1) Root node/leaf
         As ROOT_ITEM doesn't contain the first key, skip the @first_key check.
      2) Direct backref
         Only the parent bytenr and level are known and we would need to
         resolve the key all by ourselves, so skip the @first_key check.
      
      Note also that this verification needs extra info from the nodeptr
      or ROOT_ITEM, so it can't fit into the current tree-checker framework,
      which is limited to node/leaf boundaries.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • F
      Btrfs: fix copy_items() return value when logging an inode · 8434ec46
      Filipe Manana committed
      When logging an inode, at tree-log.c:copy_items(), if we call
      btrfs_next_leaf() at the loop which checks for the need to log holes, we
      need to make sure copy_items() returns the value 1 to its caller and
      not 0 (on success). This is because the path the caller passed was
      released and is now different from what it was before, and the caller
      expects a return value of 0 to mean both success and that the path
      has not changed, while a return value of 1 means both success and
      signals the caller that it can not reuse the path, it has to perform
      another tree search.
      
      Even though this case should not be triggered under normal
      circumstances, or should at least be very rare, its consequences can be very
      unpredictable (especially when replaying a log tree).
      
      Fixes: 16e7549f ("Btrfs: incompatible format change to remove hole extents")
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • F
      Btrfs: fix fsync after hole punching when using no-holes feature · 4ee3fad3
      Filipe Manana committed
      When we have the no-holes mode enabled and fsync a file after punching a
      hole in it, we can end up not logging the whole hole range in the log tree.
      This happens if the file has extent items that span more than one leaf and
      we punch a hole that covers a range that starts in a leaf but does not go
      beyond the offset of the first extent in the next leaf.
      
      Example:
      
        $ mkfs.btrfs -f -O no-holes -n 65536 /dev/sdb
        $ mount /dev/sdb /mnt
        $ for ((i = 0; i <= 831; i++)); do
      	offset=$((i * 2 * 256 * 1024))
      	xfs_io -f -c "pwrite -S 0xab -b 256K $offset 256K" \
      		/mnt/foobar >/dev/null
          done
        $ sync
      
        # We now have 2 leaves in our filesystem fs tree, the first leaf has an
        # item corresponding to the extent at file offset 216530944 and the second
        # leaf has a first item corresponding to the extent at offset 217055232.
        # Now we punch a hole that partially covers the range of the extent at
        # offset 216530944 but does not go beyond the offset 217055232.
      
        $ xfs_io -c "fpunch $((216530944 + 128 * 1024 - 4000)) 256K" /mnt/foobar
        $ xfs_io -c "fsync" /mnt/foobar
      
        <power fail>
      
        # mount to replay the log
        $ mount /dev/sdb /mnt
      
        # Before this patch, only the subrange [216658016, 216662016[ (length of
        # 4000 bytes) was logged, leaving an incorrect file layout after log
        # replay.
      
      Fix this by checking if there is a hole between the last extent item that
      we processed and the first extent item in the next leaf, and if there is
      one, log an explicit hole extent item.
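
      The offsets used in the example can be checked with plain arithmetic.
      The sketch below only recomputes the numbers from the shell commands; it
      assumes a 4096 byte block size (not stated in the example) to show where
      the 4000 byte subrange ends, and the names used are illustrative.

```python
KIB = 1024
EXTENT_LEN = 256 * KIB          # each pwrite writes 256K
STRIDE = 2 * 256 * KIB          # and extents are spaced 512K apart
BLOCK_SIZE = 4096               # assumption: block size of the filesystem

# File offsets of the 832 extents written by the loop (i = 0..831).
extent_offsets = [i * STRIDE for i in range(832)]

last_in_first_leaf = 216530944  # extent referenced by the first leaf
first_in_next_leaf = 217055232  # first extent referenced by the second leaf

# Range passed to fpunch.
hole_start = last_in_first_leaf + 128 * KIB - 4000
hole_end = hole_start + 256 * KIB


def round_up(value, align):
    """Round `value` up to the next multiple of `align`."""
    return (value + align - 1) // align * align


# End of the partial block that the hole starts in.
first_block_boundary = round_up(hole_start, BLOCK_SIZE)
```

      The hole starts inside the extent at 216530944, runs past the end of
      that extent's data (216530944 + 256K) and stops short of 217055232,
      while the subrange that was logged before the fix ends at the first
      block boundary after the start of the hole.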
      
      Fixes: 16e7549f ("Btrfs: incompatible format change to remove hole extents")
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  17. 26 Mar, 2018 1 commit