1. 04 6月, 2019 12 次提交
  2. 31 5月, 2019 24 次提交
    • B
      NFS: Fix a double unlock from nfs_match,get_client · 26433652
      Benjamin Coddington 提交于
      [ Upstream commit c260121a97a3e4df6536edbc2f26e166eff370ce ]
      
      Now that nfs_match_client drops the nfs_client_lock, we should be
      careful
      to always return it in the same condition: locked.
      
      Fixes: 950a578c6128 ("NFS: make nfs_match_client killable")
      Reported-by: syzbot+228a82b263b5da91883d@syzkaller.appspotmail.com
      Signed-off-by: NBenjamin Coddington <bcodding@redhat.com>
      Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      26433652
    • C
      chardev: add additional check for minor range overlap · 65ec64f2
      Chengguang Xu 提交于
      [ Upstream commit de36e16d1557a0b6eb328bc3516359a12ba5c25c ]
      
      Current overlap checking cannot correctly handle
      a case which is baseminor < existing baseminor &&
      baseminor + minorct > existing baseminor + minorct.
      Signed-off-by: NChengguang Xu <cgxu519@gmx.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      65ec64f2
    • Q
      btrfs: Don't panic when we can't find a root key · bd3d8f4c
      Qu Wenruo 提交于
      [ Upstream commit 7ac1e464c4d473b517bb784f30d40da1f842482e ]
      
      When we failed to find a root key in btrfs_update_root(), we just panic.
      
      That's definitely not cool, fix it by outputting an unique error
      message, aborting current transaction and return -EUCLEAN. This should
      not normally happen as the root has been used by the callers in some
      way.
      Reviewed-by: NFilipe Manana <fdmanana@suse.com>
      Reviewed-by: NJohannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      bd3d8f4c
    • J
      btrfs: fix panic during relocation after ENOSPC before writeback happens · 431cbaec
      Josef Bacik 提交于
      [ Upstream commit ff612ba7849964b1898fd3ccd1f56941129c6aab ]
      
      We've been seeing the following sporadically throughout our fleet
      
      panic: kernel BUG at fs/btrfs/relocation.c:4584!
      netversion: 5.0-0
      Backtrace:
       #0 [ffffc90003adb880] machine_kexec at ffffffff81041da8
       #1 [ffffc90003adb8c8] __crash_kexec at ffffffff8110396c
       #2 [ffffc90003adb988] crash_kexec at ffffffff811048ad
       #3 [ffffc90003adb9a0] oops_end at ffffffff8101c19a
       #4 [ffffc90003adb9c0] do_trap at ffffffff81019114
       #5 [ffffc90003adba00] do_error_trap at ffffffff810195d0
       #6 [ffffc90003adbab0] invalid_op at ffffffff81a00a9b
          [exception RIP: btrfs_reloc_cow_block+692]
          RIP: ffffffff8143b614  RSP: ffffc90003adbb68  RFLAGS: 00010246
          RAX: fffffffffffffff7  RBX: ffff8806b9c32000  RCX: ffff8806aad00690
          RDX: ffff880850b295e0  RSI: ffff8806b9c32000  RDI: ffff88084f205bd0
          RBP: ffff880849415000   R8: ffffc90003adbbe0   R9: ffff88085ac90000
          R10: ffff8805f7369140  R11: 0000000000000000  R12: ffff880850b295e0
          R13: ffff88084f205bd0  R14: 0000000000000000  R15: 0000000000000000
          ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
       #7 [ffffc90003adbbb0] __btrfs_cow_block at ffffffff813bf1cd
       #8 [ffffc90003adbc28] btrfs_cow_block at ffffffff813bf4b3
       #9 [ffffc90003adbc78] btrfs_search_slot at ffffffff813c2e6c
      
      The way relocation moves data extents is by creating a reloc inode and
      preallocating extents in this inode and then copying the data into these
      preallocated extents.  Once we've done this for all of our extents,
      we'll write out these dirty pages, which marks the extent written, and
      goes into btrfs_reloc_cow_block().  From here we get our current
      reloc_control, which _should_ match the reloc_control for the current
      block group we're relocating.
      
      However if we get an ENOSPC in this path at some point we'll bail out,
      never initiating writeback on this inode.  Not a huge deal, unless we
      happen to be doing relocation on a different block group, and this block
      group is now rc->stage == UPDATE_DATA_PTRS.  This trips the BUG_ON() in
      btrfs_reloc_cow_block(), because we expect to be done modifying the data
      inode.  We are in fact done modifying the metadata for the data inode
      we're currently using, but not the one from the failed block group, and
      thus we BUG_ON().
      
      (This happens when writeback finishes for extents from the previous
      group, when we are at btrfs_finish_ordered_io() which updates the data
      reloc tree (inode item, drops/adds extent items, etc).)
      
      Fix this by writing out the reloc data inode always, and then breaking
      out of the loop after that point to keep from tripping this BUG_ON()
      later.
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NFilipe Manana <fdmanana@suse.com>
      [ add note from Filipe ]
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      431cbaec
    • R
      Btrfs: fix data bytes_may_use underflow with fallocate due to failed quota reserve · 1084fc9a
      Robbie Ko 提交于
      [ Upstream commit 39ad317315887c2cb9a4347a93a8859326ddf136 ]
      
      When doing fallocate, we first add the range to the reserve_list and
      then reserve the quota.  If quota reservation fails, we'll release all
      reserved parts of reserve_list.
      
      However, cur_offset is not updated to indicate that this range is
      already been inserted into the list.  Therefore, the same range is freed
      twice.  Once at list_for_each_entry loop, and once at the end of the
      function.  This will result in WARN_ON on bytes_may_use when we free the
      remaining space.
      
      At the end, under the 'out' label we have a call to:
      
         btrfs_free_reserved_data_space(inode, data_reserved, alloc_start, alloc_end - cur_offset);
      
      The start offset, third argument, should be cur_offset.
      
      Everything from alloc_start to cur_offset was freed by the
      list_for_each_entry_safe_loop.
      
      Fixes: 18513091 ("btrfs: update btrfs_space_info's bytes_may_use timely")
      Reviewed-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NRobbie Ko <robbieko@synology.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      1084fc9a
    • A
      gfs2: Fix occasional glock use-after-free · c4b51dbc
      Andreas Gruenbacher 提交于
      [ Upstream commit 9287c6452d2b1f24ea8e84bd3cf6f3c6f267f712 ]
      
      This patch has to do with the life cycle of glocks and buffers.  When
      gfs2 metadata or journaled data is queued to be written, a gfs2_bufdata
      object is assigned to track the buffer, and that is queued to various
      lists, including the glock's gl_ail_list to indicate it's on the active
      items list.  Once the page associated with the buffer has been written,
      it is removed from the ail list, but its life isn't over until a revoke
      has been successfully written.
      
      So after the block is written, its bufdata object is moved from the
      glock's gl_ail_list to a file-system-wide list of pending revokes,
      sd_log_le_revoke.  At that point the glock still needs to track how many
      revokes it contributed to that list (in gl_revokes) so that things like
      glock go_sync can ensure all the metadata has been not only written, but
      also revoked before the glock is granted to a different node.  This is
      to guarantee journal replay doesn't replay the block once the glock has
      been granted to another node.
      
      Ross Lagerwall recently discovered a race in which an inode could be
      evicted, and its glock freed after its ail list had been synced, but
      while it still had unwritten revokes on the sd_log_le_revoke list.  The
      evict decremented the glock reference count to zero, which allowed the
      glock to be freed.  After the revoke was written, function
      revoke_lo_after_commit tried to adjust the glock's gl_revokes counter
      and clear its GLF_LFLUSH flag, at which time it referenced the freed
      glock.
      
      This patch fixes the problem by incrementing the glock reference count
      in gfs2_add_revoke when the glock's first bufdata object is moved from
      the glock to the global revokes list. Later, when the glock's last such
      bufdata object is freed, the reference count is decremented. This
      guarantees that whichever process finishes last (the revoke writing or
      the evict) will properly free the glock, and neither will reference the
      glock after it has been freed.
      Reported-by: NRoss Lagerwall <ross.lagerwall@citrix.com>
      Signed-off-by: NAndreas Gruenbacher <agruenba@redhat.com>
      Signed-off-by: NBob Peterson <rpeterso@redhat.com>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      c4b51dbc
    • R
      NFS: make nfs_match_client killable · 36296b00
      Roberto Bergantinos Corpas 提交于
      [ Upstream commit 950a578c6128c2886e295b9c7ecb0b6b22fcc92b ]
      
          Actually we don't do anything with return value from
          nfs_wait_client_init_complete in nfs_match_client, as a
          consequence if we get a fatal signal and client is not
          fully initialised, we'll loop to "again" label
      
          This has been proven to cause soft lockups on some scenarios
          (no-carrier but configured network interfaces)
      Signed-off-by: NRoberto Bergantinos Corpas <rbergant@redhat.com>
      Reviewed-by: NBenjamin Coddington <bcodding@redhat.com>
      Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      36296b00
    • R
      gfs2: Fix lru_count going negative · bac85208
      Ross Lagerwall 提交于
      [ Upstream commit 7881ef3f33bb80f459ea6020d1e021fc524a6348 ]
      
      Under certain conditions, lru_count may drop below zero resulting in
      a large amount of log spam like this:
      
      vmscan: shrink_slab: gfs2_dump_glock+0x3b0/0x630 [gfs2] \
          negative objects to delete nr=-1
      
      This happens as follows:
      1) A glock is moved from lru_list to the dispose list and lru_count is
         decremented.
      2) The dispose function calls cond_resched() and drops the lru lock.
      3) Another thread takes the lru lock and tries to add the same glock to
         lru_list, checking if the glock is on an lru list.
      4) It is on a list (actually the dispose list) and so it avoids
         incrementing lru_count.
      5) The glock is moved to lru_list.
      5) The original thread doesn't dispose it because it has been re-added
         to the lru list but the lru_count has still decreased by one.
      
      Fix by checking if the LRU flag is set on the glock rather than checking
      if the glock is on some list and rearrange the code so that the LRU flag
      is added/removed precisely when the glock is added/removed from lru_list.
      Signed-off-by: NRoss Lagerwall <ross.lagerwall@citrix.com>
      Signed-off-by: NAndreas Gruenbacher <agruenba@redhat.com>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      bac85208
    • D
      Revert "btrfs: Honour FITRIM range constraints during free space trim" · 06a67c0f
      David Sterba 提交于
      This reverts commit 8b13bb91.
      
      There is currently no corresponding patch in master due to additional
      changes that would be significantly different from plain revert in the
      respective stable branch.
      
      The range argument was not handled correctly and could cause trim to
      overlap allocated areas or reach beyond the end of the device. The
      address space that fitrim normally operates on is in logical
      coordinates, while the discards are done on the physical device extents.
      This distinction cannot be made with the current ioctl interface and
      caused the confusion.
      
      The bug depends on the layout of block groups and does not always
      happen. The whole-fs trim (run by default by the fstrim tool) is not
      affected.
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      06a67c0f
    • A
      acct_on(): don't mess with freeze protection · 7c2bcb3c
      Al Viro 提交于
      commit 9419a3191dcb27f24478d288abaab697228d28e6 upstream.
      
      What happens there is that we are replacing file->path.mnt of
      a file we'd just opened with a clone and we need the write
      count contribution to be transferred from original mount to
      new one.  That's it.  We do *NOT* want any kind of freeze
      protection for the duration of switchover.
      
      IOW, we should just use __mnt_{want,drop}_write() for that
      switchover; no need to bother with mnt_{want,drop}_write()
      there.
      Tested-by: NAmir Goldstein <amir73il@gmail.com>
      Reported-by: syzbot+2a73a6ea9507b7112141@syzkaller.appspotmail.com
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      7c2bcb3c
    • A
      ovl: relax WARN_ON() for overlapping layers use case · 86c43c40
      Amir Goldstein 提交于
      commit acf3062a7e1ccf67c6f7e7c28671a6708fde63b0 upstream.
      
      This nasty little syzbot repro:
      https://syzkaller.appspot.com/x/repro.syz?x=12c7a94f400000
      
      Creates overlay mounts where the same directory is both in upper and lower
      layers. Simplified example:
      
        mkdir foo work
        mount -t overlay none foo -o"lowerdir=.,upperdir=foo,workdir=work"
      
      The repro runs several threads in parallel that attempt to chdir into foo
      and attempt to symlink/rename/exec/mkdir the file bar.
      
      The repro hits a WARN_ON() I placed in ovl_instantiate(), which suggests
      that an overlay inode already exists in cache and is hashed by the pointer
      of the real upper dentry that ovl_create_real() has just created. At the
      point of the WARN_ON(), for overlay dir inode lock is held and upper dir
      inode lock, so at first, I did not see how this was possible.
      
      On a closer look, I see that after ovl_create_real(), because of the
      overlapping upper and lower layers, a lookup by another thread can find the
      file foo/bar that was just created in upper layer, at overlay path
      foo/foo/bar and hash the an overlay inode with the new real dentry as lower
      dentry. This is possible because the overlay directory foo/foo is not
      locked and the upper dentry foo/bar is in dcache, so ovl_lookup() can find
      it without taking upper dir inode shared lock.
      
      Overlapping layers is considered a wrong setup which would result in
      unexpected behavior, but it shouldn't crash the kernel and it shouldn't
      trigger WARN_ON() either, so relax this WARN_ON() and leave a pr_warn()
      instead to cover all cases of failure to get an overlay inode.
      
      The error returned from failure to insert new inode to cache with
      inode_insert5() was changed to -EEXIST, to distinguish from the error
      -ENOMEM returned on failure to get/allocate inode with iget5_locked().
      
      Reported-by: syzbot+9c69c282adc4edd2b540@syzkaller.appspotmail.com
      Fixes: 01b39dcc ("ovl: use inode_insert5() to hash a newly...")
      Signed-off-by: NAmir Goldstein <amir73il@gmail.com>
      Signed-off-by: NMiklos Szeredi <mszeredi@redhat.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      86c43c40
    • J
      btrfs: honor path->skip_locking in backref code · 9c0339dd
      Josef Bacik 提交于
      commit 38e3eebff643db725633657d1d87a3be019d1018 upstream.
      
      Qgroups will do the old roots lookup at delayed ref time, which could be
      while walking down the extent root while running a delayed ref.  This
      should be fine, except we specifically lock eb's in the backref walking
      code irrespective of path->skip_locking, which deadlocks the system.
      Fix up the backref code to honor path->skip_locking, nobody will be
      modifying the commit_root when we're searching so it's completely safe
      to do.
      
      This happens since fb235dc0 ("btrfs: qgroup: Move half of the qgroup
      accounting time out of commit trans"), kernel may lockup with quota
      enabled.
      
      There is one backref trace triggered by snapshot dropping along with
      write operation in the source subvolume.  The example can be reliably
      reproduced:
      
        btrfs-cleaner   D    0  4062      2 0x80000000
        Call Trace:
         schedule+0x32/0x90
         btrfs_tree_read_lock+0x93/0x130 [btrfs]
         find_parent_nodes+0x29b/0x1170 [btrfs]
         btrfs_find_all_roots_safe+0xa8/0x120 [btrfs]
         btrfs_find_all_roots+0x57/0x70 [btrfs]
         btrfs_qgroup_trace_extent_post+0x37/0x70 [btrfs]
         btrfs_qgroup_trace_leaf_items+0x10b/0x140 [btrfs]
         btrfs_qgroup_trace_subtree+0xc8/0xe0 [btrfs]
         do_walk_down+0x541/0x5e3 [btrfs]
         walk_down_tree+0xab/0xe7 [btrfs]
         btrfs_drop_snapshot+0x356/0x71a [btrfs]
         btrfs_clean_one_deleted_snapshot+0xb8/0xf0 [btrfs]
         cleaner_kthread+0x12b/0x160 [btrfs]
         kthread+0x112/0x130
         ret_from_fork+0x27/0x50
      
      When dropping snapshots with qgroup enabled, we will trigger backref
      walk.
      
      However such backref walk at that timing is pretty dangerous, as if one
      of the parent nodes get WRITE locked by other thread, we could cause a
      dead lock.
      
      For example:
      
                 FS 260     FS 261 (Dropped)
                  node A        node B
                 /      \      /      \
             node C      node D      node E
            /   \         /  \        /     \
        leaf F|leaf G|leaf H|leaf I|leaf J|leaf K
      
      The lock sequence would be:
      
            Thread A (cleaner)             |       Thread B (other writer)
      -----------------------------------------------------------------------
      write_lock(B)                        |
      write_lock(D)                        |
      ^^^ called by walk_down_tree()       |
                                           |       write_lock(A)
                                           |       write_lock(D) << Stall
      read_lock(H) << for backref walk     |
      read_lock(D) << lock owner is        |
                      the same thread A    |
                      so read lock is OK   |
      read_lock(A) << Stall                |
      
      So thread A hold write lock D, and needs read lock A to unlock.
      While thread B holds write lock A, while needs lock D to unlock.
      
      This will cause a deadlock.
      
      This is not only limited to snapshot dropping case.  As the backref
      walk, even only happens on commit trees, is breaking the normal top-down
      locking order, makes it deadlock prone.
      
      Fixes: fb235dc0 ("btrfs: qgroup: Move half of the qgroup accounting time out of commit trans")
      CC: stable@vger.kernel.org # 4.14+
      Reported-and-tested-by: NDavid Sterba <dsterba@suse.com>
      Reported-by: NFilipe Manana <fdmanana@suse.com>
      Reviewed-by: NQu Wenruo <wqu@suse.com>
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NFilipe Manana <fdmanana@suse.com>
      [ rebase to latest branch and fix lock assert bug in btrfs/007 ]
      [ backport to linux-4.19.y branch, solve minor conflicts ]
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      [ copy logs and deadlock analysis from Qu's patch ]
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      9c0339dd
    • O
      NFSv4.1 fix incorrect return value in copy_file_range · cc1afc10
      Olga Kornievskaia 提交于
      commit 0769663b4f580566ef6cdf366f3073dbe8022c39 upstream.
      
      According to the NFSv4.2 spec if the input and output file is the
      same file, operation should fail with EINVAL. However, linux
      copy_file_range() system call has no such restrictions. Therefore,
      in such case let's return EOPNOTSUPP and allow VFS to fallback
      to doing do_splice_direct(). Also when copy_file_range is called
      on an NFSv4.0 or 4.1 mount (ie., a server that doesn't support
      COPY functionality), we also need to return EOPNOTSUPP and
      fallback to a regular copy.
      
      Fixes xfstest generic/075, generic/091, generic/112, generic/263
      for all NFSv4.x versions.
      Signed-off-by: NOlga Kornievskaia <kolga@netapp.com>
      Signed-off-by: NTrond Myklebust <trond.myklebust@hammerspace.com>
      Cc: Yu Xu <xuyu@linux.alibaba.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      cc1afc10
    • O
      NFSv4.2 fix unnecessary retry in nfs4_copy_file_range · e1eed692
      Olga Kornievskaia 提交于
      commit 45ac486ecf2dc998e25cf32f0cabf2deaad875be upstream.
      
      Currently nfs42_proc_copy_file_range() can not return EAGAIN.
      
      Fixes: e4648aa4 ("NFS recover from destination server reboot for copies")
      Signed-off-by: NOlga Kornievskaia <kolga@netapp.com>
      Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>
      Cc: Yu Xu <xuyu@linux.alibaba.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      e1eed692
    • T
      btrfs: sysfs: don't leak memory when failing add fsid · 94e1f966
      Tobin C. Harding 提交于
      commit e32773357d5cc271b1d23550b3ed026eb5c2a468 upstream.
      
      A failed call to kobject_init_and_add() must be followed by a call to
      kobject_put().  Currently in the error path when adding fs_devices we
      are missing this call.  This could be fixed by calling
      btrfs_sysfs_remove_fsid() if btrfs_sysfs_add_fsid() returns an error or
      by adding a call to kobject_put() directly in btrfs_sysfs_add_fsid().
      Here we choose the second option because it prevents the slightly
      unusual error path handling requirements of kobject from leaking out
      into btrfs functions.
      
      Add a call to kobject_put() in the error path of kobject_add_and_init().
      This causes the release method to be called if kobject_init_and_add()
      fails.  open_tree() is the function that calls btrfs_sysfs_add_fsid()
      and the error code in this function is already written with the
      assumption that the release method is called during the error path of
      open_tree() (as seen by the call to btrfs_sysfs_remove_fsid() under the
      fail_fsdev_sysfs label).
      
      Cc: stable@vger.kernel.org # v4.4+
      Reviewed-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: NTobin C. Harding <tobin@kernel.org>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      94e1f966
    • T
      btrfs: sysfs: Fix error path kobject memory leak · 946ad2ec
      Tobin C. Harding 提交于
      commit 450ff8348808a89cc27436771aa05c2b90c0eef1 upstream.
      
      If a call to kobject_init_and_add() fails we must call kobject_put()
      otherwise we leak memory.
      
      Calling kobject_put() when kobject_init_and_add() fails drops the
      refcount back to 0 and calls the ktype release method (which in turn
      calls the percpu destroy and kfree).
      
      Add call to kobject_put() in the error path of call to
      kobject_init_and_add().
      
      Cc: stable@vger.kernel.org # v4.4+
      Reviewed-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: NTobin C. Harding <tobin@kernel.org>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      946ad2ec
    • F
      Btrfs: fix race between ranged fsync and writeback of adjacent ranges · 92f907d7
      Filipe Manana 提交于
      commit 0c713cbab6200b0ab6473b50435e450a6e1de85d upstream.
      
      When we do a full fsync (the bit BTRFS_INODE_NEEDS_FULL_SYNC is set in the
      inode) that happens to be ranged, which happens during a msync() or writes
      for files opened with O_SYNC for example, we can end up with a corrupt log,
      due to different file extent items representing ranges that overlap with
      each other, or hit some assertion failures.
      
      When doing a ranged fsync we only flush delalloc and wait for ordered
      exents within that range. If while we are logging items from our inode
      ordered extents for adjacent ranges complete, we end up in a race that can
      make us insert the file extent items that overlap with others we logged
      previously and the assertion failures.
      
      For example, if tree-log.c:copy_items() receives a leaf that has the
      following file extents items, all with a length of 4K and therefore there
      is an implicit hole in the range 68K to 72K - 1:
      
        (257 EXTENT_ITEM 64K), (257 EXTENT_ITEM 72K), (257 EXTENT_ITEM 76K), ...
      
      It copies them to the log tree. However due to the need to detect implicit
      holes, it may release the path, in order to look at the previous leaf to
      detect an implicit hole, and then later it will search again in the tree
      for the first file extent item key, with the goal of locking again the
      leaf (which might have changed due to concurrent changes to other inodes).
      
      However when it locks again the leaf containing the first key, the key
      corresponding to the extent at offset 72K may not be there anymore since
      there is an ordered extent for that range that is finishing (that is,
      somewhere in the middle of btrfs_finish_ordered_io()), and it just
      removed the file extent item but has not yet replaced it with a new file
      extent item, so the part of copy_items() that does hole detection will
      decide that there is a hole in the range starting from 68K to 76K - 1,
      and therefore insert a file extent item to represent that hole, having
      a key offset of 68K. After that we now have a log tree with 2 different
      extent items that have overlapping ranges:
      
       1) The file extent item copied before copy_items() released the path,
          which has a key offset of 72K and a length of 4K, representing the
          file range 72K to 76K - 1.
      
       2) And a file extent item representing a hole that has a key offset of
          68K and a length of 8K, representing the range 68K to 76K - 1. This
          item was inserted after releasing the path, and overlaps with the
          extent item inserted before.
      
      The overlapping extent items can cause all sorts of unpredictable and
      incorrect behaviour, either when replayed or if a fast (non full) fsync
      happens later, which can trigger a BUG_ON() when calling
      btrfs_set_item_key_safe() through __btrfs_drop_extents(), producing a
      trace like the following:
      
        [61666.783269] ------------[ cut here ]------------
        [61666.783943] kernel BUG at fs/btrfs/ctree.c:3182!
        [61666.784644] invalid opcode: 0000 [#1] PREEMPT SMP
        (...)
        [61666.786253] task: ffff880117b88c40 task.stack: ffffc90008168000
        [61666.786253] RIP: 0010:btrfs_set_item_key_safe+0x7c/0xd2 [btrfs]
        [61666.786253] RSP: 0018:ffffc9000816b958 EFLAGS: 00010246
        [61666.786253] RAX: 0000000000000000 RBX: 000000000000000f RCX: 0000000000030000
        [61666.786253] RDX: 0000000000000000 RSI: ffffc9000816ba4f RDI: ffffc9000816b937
        [61666.786253] RBP: ffffc9000816b998 R08: ffff88011dae2428 R09: 0000000000001000
        [61666.786253] R10: 0000160000000000 R11: 6db6db6db6db6db7 R12: ffff88011dae2418
        [61666.786253] R13: ffffc9000816ba4f R14: ffff8801e10c4118 R15: ffff8801e715c000
        [61666.786253] FS:  00007f6060a18700(0000) GS:ffff88023f5c0000(0000) knlGS:0000000000000000
        [61666.786253] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        [61666.786253] CR2: 00007f6060a28000 CR3: 0000000213e69000 CR4: 00000000000006e0
        [61666.786253] Call Trace:
        [61666.786253]  __btrfs_drop_extents+0x5e3/0xaad [btrfs]
        [61666.786253]  ? time_hardirqs_on+0x9/0x14
        [61666.786253]  btrfs_log_changed_extents+0x294/0x4e0 [btrfs]
        [61666.786253]  ? release_extent_buffer+0x38/0xb4 [btrfs]
        [61666.786253]  btrfs_log_inode+0xb6e/0xcdc [btrfs]
        [61666.786253]  ? lock_acquire+0x131/0x1c5
        [61666.786253]  ? btrfs_log_inode_parent+0xee/0x659 [btrfs]
        [61666.786253]  ? arch_local_irq_save+0x9/0xc
        [61666.786253]  ? btrfs_log_inode_parent+0x1f5/0x659 [btrfs]
        [61666.786253]  btrfs_log_inode_parent+0x223/0x659 [btrfs]
        [61666.786253]  ? arch_local_irq_save+0x9/0xc
        [61666.786253]  ? lockref_get_not_zero+0x2c/0x34
        [61666.786253]  ? rcu_read_unlock+0x3e/0x5d
        [61666.786253]  btrfs_log_dentry_safe+0x60/0x7b [btrfs]
        [61666.786253]  btrfs_sync_file+0x317/0x42c [btrfs]
        [61666.786253]  vfs_fsync_range+0x8c/0x9e
        [61666.786253]  SyS_msync+0x13c/0x1c9
        [61666.786253]  entry_SYSCALL_64_fastpath+0x18/0xad
      
      A sample of a corrupt log tree leaf with overlapping extents I got from
      running btrfs/072:
      
            item 14 key (295 108 200704) itemoff 2599 itemsize 53
                    extent data disk bytenr 0 nr 0
                    extent data offset 0 nr 458752 ram 458752
            item 15 key (295 108 659456) itemoff 2546 itemsize 53
                    extent data disk bytenr 4343541760 nr 770048
                    extent data offset 606208 nr 163840 ram 770048
            item 16 key (295 108 663552) itemoff 2493 itemsize 53
                    extent data disk bytenr 4343541760 nr 770048
                    extent data offset 610304 nr 155648 ram 770048
            item 17 key (295 108 819200) itemoff 2440 itemsize 53
                    extent data disk bytenr 4334788608 nr 4096
                    extent data offset 0 nr 4096 ram 4096
      
      The file extent item at offset 659456 (item 15) ends at offset 823296
      (659456 + 163840) while the next file extent item (item 16) starts at
      offset 663552.
      
      Another different problem that the race can trigger is a failure in the
      assertions at tree-log.c:copy_items(), which expect that the first file
      extent item key we found before releasing the path exists after we have
      released path and that the last key we found before releasing the path
      also exists after releasing the path:
      
        $ cat -n fs/btrfs/tree-log.c
        4080          if (need_find_last_extent) {
        4081                  /* btrfs_prev_leaf could return 1 without releasing the path */
        4082                  btrfs_release_path(src_path);
        4083                  ret = btrfs_search_slot(NULL, inode->root, &first_key,
        4084                                  src_path, 0, 0);
        4085                  if (ret < 0)
        4086                          return ret;
        4087                  ASSERT(ret == 0);
        (...)
        4103                  if (i >= btrfs_header_nritems(src_path->nodes[0])) {
        4104                          ret = btrfs_next_leaf(inode->root, src_path);
        4105                          if (ret < 0)
        4106                                  return ret;
        4107                          ASSERT(ret == 0);
        4108                          src = src_path->nodes[0];
        4109                          i = 0;
        4110                          need_find_last_extent = true;
        4111                  }
        (...)
      
      The second assertion implicitly expects that the last key before the path
      release still exists, because the surrounding while loop only stops after
      we have found that key. When this assertion fails it produces a stack like
      this:
      
        [139590.037075] assertion failed: ret == 0, file: fs/btrfs/tree-log.c, line: 4107
        [139590.037406] ------------[ cut here ]------------
        [139590.037707] kernel BUG at fs/btrfs/ctree.h:3546!
        [139590.038034] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC PTI
        [139590.038340] CPU: 1 PID: 31841 Comm: fsstress Tainted: G        W         5.0.0-btrfs-next-46 #1
        (...)
        [139590.039354] RIP: 0010:assfail.constprop.24+0x18/0x1a [btrfs]
        (...)
        [139590.040397] RSP: 0018:ffffa27f48f2b9b0 EFLAGS: 00010282
        [139590.040730] RAX: 0000000000000041 RBX: ffff897c635d92c8 RCX: 0000000000000000
        [139590.041105] RDX: 0000000000000000 RSI: ffff897d36a96868 RDI: ffff897d36a96868
        [139590.041470] RBP: ffff897d1b9a0708 R08: 0000000000000000 R09: 0000000000000000
        [139590.041815] R10: 0000000000000008 R11: 0000000000000000 R12: 0000000000000013
        [139590.042159] R13: 0000000000000227 R14: ffff897cffcbba88 R15: 0000000000000001
        [139590.042501] FS:  00007f2efc8dee80(0000) GS:ffff897d36a80000(0000) knlGS:0000000000000000
        [139590.042847] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        [139590.043199] CR2: 00007f8c064935e0 CR3: 0000000232252002 CR4: 00000000003606e0
        [139590.043547] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        [139590.043899] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        [139590.044250] Call Trace:
        [139590.044631]  copy_items+0xa3f/0x1000 [btrfs]
        [139590.045009]  ? generic_bin_search.constprop.32+0x61/0x200 [btrfs]
        [139590.045396]  btrfs_log_inode+0x7b3/0xd70 [btrfs]
        [139590.045773]  btrfs_log_inode_parent+0x2b3/0xce0 [btrfs]
        [139590.046143]  ? do_raw_spin_unlock+0x49/0xc0
        [139590.046510]  btrfs_log_dentry_safe+0x4a/0x70 [btrfs]
        [139590.046872]  btrfs_sync_file+0x3b6/0x440 [btrfs]
        [139590.047243]  btrfs_file_write_iter+0x45b/0x5c0 [btrfs]
        [139590.047592]  __vfs_write+0x129/0x1c0
        [139590.047932]  vfs_write+0xc2/0x1b0
        [139590.048270]  ksys_write+0x55/0xc0
        [139590.048608]  do_syscall_64+0x60/0x1b0
        [139590.048946]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
        [139590.049287] RIP: 0033:0x7f2efc4be190
        (...)
        [139590.050342] RSP: 002b:00007ffe743243a8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
        [139590.050701] RAX: ffffffffffffffda RBX: 0000000000008d58 RCX: 00007f2efc4be190
        [139590.051067] RDX: 0000000000008d58 RSI: 00005567eca0f370 RDI: 0000000000000003
        [139590.051459] RBP: 0000000000000024 R08: 0000000000000003 R09: 0000000000008d60
        [139590.051863] R10: 0000000000000078 R11: 0000000000000246 R12: 0000000000000003
        [139590.052252] R13: 00000000003d3507 R14: 00005567eca0f370 R15: 0000000000000000
        (...)
        [139590.055128] ---[ end trace 193f35d0215cdeeb ]---
      
      So fix this race between a full ranged fsync and writeback of adjacent
      ranges by flushing all delalloc and waiting for all ordered extents to
      complete before logging the inode. This is the simplest way to solve the
      problem because currently the full fsync path does not deal with ranges
      at all (it assumes a full range from 0 to LLONG_MAX) and it always needs
      to look at adjacent ranges for hole detection. For use cases of ranged
      fsyncs this can make a few fsyncs slower but on the other hand it can
      make some following fsyncs to other ranges do less work or no need to do
      anything at all. A full fsync is rare anyway and happens only once after
      loading/creating an inode and once after less common operations such as a
      shrinking truncate.
      
      This is an issue that exists for a long time, and was often triggered by
      generic/127, because it does mmap'ed writes and msync (which triggers a
      ranged fsync). Adding support for the tree checker to detect overlapping
      extents (next patch in the series) and trigger a WARN() when such cases
      are found, and then calling btrfs_check_leaf_full() at the end of
      btrfs_insert_file_extent() made the issue much easier to detect. Running
      btrfs/072 with that change to the tree checker and making fsstress open
      files always with O_SYNC made it much easier to trigger the issue (as
      triggering it with generic/127 is very rare).
      
      CC: stable@vger.kernel.org # 3.16+
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      92f907d7
    • F
      Btrfs: avoid fallback to transaction commit during fsync of files with holes · 4f9a774d
      Filipe Manana 提交于
      commit ebb929060aeb162417b4c1307e63daee47b208d9 upstream.
      
      When we are doing a full fsync (bit BTRFS_INODE_NEEDS_FULL_SYNC set) of a
      file that has holes and has file extent items spanning two or more leafs,
      we can end up falling to back to a full transaction commit due to a logic
      bug that leads to failure to insert a duplicate file extent item that is
      meant to represent a hole between the last file extent item of a leaf and
      the first file extent item in the next leaf. The failure (EEXIST error)
      leads to a transaction commit (as most errors when logging an inode do).
      
      For example, we have the two following leafs:
      
      Leaf N:
      
        -----------------------------------------------
        | ..., ..., ..., (257, FILE_EXTENT_ITEM, 64K) |
        -----------------------------------------------
        The file extent item at the end of leaf N has a length of 4Kb,
        representing the file range from 64K to 68K - 1.
      
      Leaf N + 1:
      
        -----------------------------------------------
        | (257, FILE_EXTENT_ITEM, 72K), ..., ..., ... |
        -----------------------------------------------
        The file extent item at the first slot of leaf N + 1 has a length of
        4Kb too, representing the file range from 72K to 76K - 1.
      
      During the full fsync path, when we are at tree-log.c:copy_items() with
      leaf N as a parameter, after processing the last file extent item, that
      represents the extent at offset 64K, we take a look at the first file
      extent item at the next leaf (leaf N + 1), and notice there's a 4K hole
      between the two extents, and therefore we insert a file extent item
      representing that hole, starting at file offset 68K and ending at offset
      72K - 1. However we don't update the value of *last_extent, which is used
      to represent the end offset (plus 1, non-inclusive end) of the last file
      extent item inserted in the log, so it stays with a value of 68K and not
      with a value of 72K.
      
      Then, when copy_items() is called for leaf N + 1, because the value of
      *last_extent is smaller then the offset of the first extent item in the
      leaf (68K < 72K), we look at the last file extent item in the previous
      leaf (leaf N) and see it there's a 4K gap between it and our first file
      extent item (again, 68K < 72K), so we decide to insert a file extent item
      representing the hole, starting at file offset 68K and ending at offset
      72K - 1, this insertion will fail with -EEXIST being returned from
      btrfs_insert_file_extent() because we already inserted a file extent item
      representing a hole for this offset (68K) in the previous call to
      copy_items(), when processing leaf N.
      
      The -EEXIST error gets propagated to the fsync callback, btrfs_sync_file(),
      which falls back to a full transaction commit.
      
      Fix this by adjusting *last_extent after inserting a hole when we had to
      look at the next leaf.
      
      Fixes: 4ee3fad3 ("Btrfs: fix fsync after hole punching when using no-holes feature")
      Cc: stable@vger.kernel.org # 4.14+
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      4f9a774d
    • F
      Btrfs: do not abort transaction at btrfs_update_root() after failure to COW path · 7ec747c8
      Filipe Manana 提交于
      commit 72bd2323ec87722c115a5906bc6a1b31d11e8f54 upstream.
      
      Currently when we fail to COW a path at btrfs_update_root() we end up
      always aborting the transaction. However all the current callers of
      btrfs_update_root() are able to deal with errors returned from it, many do
      end up aborting the transaction themselves (directly or not, such as the
      transaction commit path), other BUG_ON() or just gracefully cancel whatever
      they were doing.
      
      When syncing the fsync log, we call btrfs_update_root() through
      tree-log.c:update_log_root(), and if it returns an -ENOSPC error, the log
      sync code does not abort the transaction, instead it gracefully handles
      the error and returns -EAGAIN to the fsync handler, so that it falls back
      to a transaction commit. Any other error different from -ENOSPC, makes the
      log sync code abort the transaction.
      
      So remove the transaction abort from btrfs_update_log() when we fail to
      COW a path to update the root item, so that if an -ENOSPC failure happens
      we avoid aborting the current transaction and have a chance of the fsync
      succeeding after falling back to a transaction commit.
      
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=203413
      Fixes: 79787eaa ("btrfs: replace many BUG_ONs with proper error handling")
      Cc: stable@vger.kernel.org # 4.4+
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Reviewed-by: NAnand Jain <anand.jain@oracle.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      7ec747c8
    • J
      btrfs: don't double unlock on error in btrfs_punch_hole · ce21e658
      Josef Bacik 提交于
      commit 8fca955057b9c58467d1b231e43f19c4cf26ae8c upstream.
      
      If we have an error writing out a delalloc range in
      btrfs_punch_hole_lock_range we'll unlock the inode and then goto
      out_only_mutex, where we will again unlock the inode.  This is bad,
      don't do this.
      
      Fixes: f27451f2 ("Btrfs: add support for fallocate's zero range operation")
      CC: stable@vger.kernel.org # 4.19+
      Reviewed-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      ce21e658
    • A
      gfs2: Fix sign extension bug in gfs2_update_stats · fdc78eed
      Andreas Gruenbacher 提交于
      commit 5a5ec83d6ac974b12085cd99b196795f14079037 upstream.
      
      Commit 4d207133 changed the types of the statistic values in struct
      gfs2_lkstats from s64 to u64.  Because of that, what should be a signed
      value in gfs2_update_stats turned into an unsigned value.  When shifted
      right, we end up with a large positive value instead of a small negative
      value, which results in an incorrect variance estimate.
      
      Fixes: 4d207133 ("gfs2: Make statistics unsigned, suitable for use with do_div()")
      Signed-off-by: NAndreas Gruenbacher <agruenba@redhat.com>
      Cc: stable@vger.kernel.org # v4.4+
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      fdc78eed
    • D
      f2fs: Fix use of number of devices · 70d33cce
      Damien Le Moal 提交于
      commit 0916878da355650d7e77104a7ac0fa1784eca852 upstream.
      
      For a single device mount using a zoned block device, the zone
      information for the device is stored in the sbi->devs single entry
      array and sbi->s_ndevs is set to 1. This differs from a single device
      mount using a regular block device which does not allocate sbi->devs
      and sets sbi->s_ndevs to 0.
      
      However, sbi->s_devs == 0 condition is used throughout the code to
      differentiate a single device mount from a multi-device mount where
      sbi->s_ndevs is always larger than 1. This results in problems with
      single zoned block device volumes as these are treated as multi-device
      mounts but do not have the start_blk and end_blk information set. One
      of the problem observed is skipping of zone discard issuing resulting in
      write commands being issued to full zones or unaligned to a zone write
      pointer.
      
      Fix this problem by simply treating the cases sbi->s_ndevs == 0 (single
      regular block device mount) and sbi->s_ndevs == 1 (single zoned block
      device mount) in the same manner. This is done by introducing the
      helper function f2fs_is_multi_device() and using this helper in place
      of direct tests of sbi->s_ndevs value, improving code readability.
      
      Fixes: 7bb3a371 ("f2fs: Fix zoned block device support")
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NDamien Le Moal <damien.lemoal@wdc.com>
      Reviewed-by: NChao Yu <yuchao0@huawei.com>
      Signed-off-by: NJaegeuk Kim <jaegeuk@kernel.org>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      
      70d33cce
    • J
      ext4: wait for outstanding dio during truncate in nojournal mode · 5220582c
      Jan Kara 提交于
      commit 82a25b027ca48d7ef197295846b352345853dfa8 upstream.
      
      We didn't wait for outstanding direct IO during truncate in nojournal
      mode (as we skip orphan handling in that case). This can lead to fs
      corruption or stale data exposure if truncate ends up freeing blocks
      and these get reallocated before direct IO finishes. Fix the condition
      determining whether the wait is necessary.
      
      CC: stable@vger.kernel.org
      Fixes: 1c9114f9 ("ext4: serialize unlocked dio reads with truncate")
      Reviewed-by: NIra Weiny <ira.weiny@intel.com>
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      5220582c
    • J
      ext4: do not delete unlinked inode from orphan list on failed truncate · 71e430fd
      Jan Kara 提交于
      commit ee0ed02ca93ef1ecf8963ad96638795d55af2c14 upstream.
      
      It is possible that unlinked inode enters ext4_setattr() (e.g. if
      somebody calls ftruncate(2) on unlinked but still open file). In such
      case we should not delete the inode from the orphan list if truncate
      fails. Note that this is mostly a theoretical concern as filesystem is
      corrupted if we reach this path anyway but let's be consistent in our
      orphan handling.
      Reviewed-by: NIra Weiny <ira.weiny@intel.com>
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      Cc: stable@kernel.org
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      71e430fd
  3. 26 5月, 2019 4 次提交