1. 18 Dec, 2019 20 commits
    • B
      gfs2: fix glock reference problem in gfs2_trans_remove_revoke · 0809e108
      Bob Peterson committed
      [ Upstream commit fe5e7ba11fcf1d75af8173836309e8562aefedef ]
      
      Commit 9287c6452d2b fixed a situation in which gfs2 could use a glock
      after it had been freed. To do that, it temporarily added a new glock
      reference by calling gfs2_glock_hold in function gfs2_add_revoke.
      However, if the bd element was removed by gfs2_trans_remove_revoke, it
      failed to drop the additional reference.
      
      This patch adds logic to gfs2_trans_remove_revoke to properly drop the
      additional glock reference.
      
      Fixes: 9287c6452d2b ("gfs2: Fix occasional glock use-after-free")
      Cc: stable@vger.kernel.org # v5.2+
      Signed-off-by: Bob Peterson <rpeterso@redhat.com>
      Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      0809e108
    • M
      mm, thp, proc: report THP eligibility for each vma · c76adee3
      Michal Hocko committed
      [ Upstream commit 7635d9cbe8327e131a1d3d8517dc186c2796ce2e ]
      
      Userspace falls short when trying to find out whether a specific memory
      range is eligible for THP.  There are usecases that would like to know
      that
      http://lkml.kernel.org/r/alpine.DEB.2.21.1809251248450.50347@chino.kir.corp.google.com
      : This is used to identify heap mappings that should be able to fault thp
      : but do not, and they normally point to a low-on-memory or fragmentation
      : issue.
      
      The only way to deduce this now is to query for the hg resp. nh flags and
      compare that state with the global setting.  Except that there is also
      PR_SET_THP_DISABLE that might change the picture.  So the final logic is
      not trivial.  Moreover the eligibility of the vma depends on the type of
      VMA as well.  In the past we have supported only anonymous memory VMAs,
      but things have changed and shmem based vmas are supported as well these
      days, and the query logic gets even more complicated because the
      eligibility depends on the mount option and another global configuration
      knob.
      
      Simplify the current state and report the THP eligibility in
      /proc/<pid>/smaps for each existing vma.  Reuse
      transparent_hugepage_enabled for this purpose.  The original
      implementation of this function assumes that the caller knows that the vma
      itself is supported for THP so make the core checks into
      __transparent_hugepage_enabled and use it for existing callers.
      __show_smap just uses the new transparent_hugepage_enabled which also
      checks the vma support status (please note that this one has to be out of
      line due to include dependency issues).
      
      [mhocko@kernel.org: fix oops with NULL ->f_mapping]
        Link: http://lkml.kernel.org/r/20181224185106.GC16738@dhcp22.suse.cz
      Link: http://lkml.kernel.org/r/20181211143641.3503-3-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Paul Oppenheimer <bepvte@gmail.com>
      Cc: William Kucharski <william.kucharski@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      c76adee3
    • Y
      ext4: fix a bug in ext4_wait_for_tail_page_commit · b1ec93dd
      yangerkun committed
      commit 565333a1554d704789e74205989305c811fd9c7a upstream.
      
      No need to wait for any commit once the page is fully truncated.
      Besides, waiting may confuse e.g. a concurrent ext4_writepage(): the page
      is still dirty (the dirty bit will be cleared by truncate_pagecache() in
      ext4_setattr()) but its buffers have already been freed, which triggers
      the bug shown below:
      
      [   26.057508] ------------[ cut here ]------------
      [   26.058531] kernel BUG at fs/ext4/inode.c:2134!
      ...
      [   26.088130] Call trace:
      [   26.088695]  ext4_writepage+0x914/0xb28
      [   26.089541]  writeout.isra.4+0x1b4/0x2b8
      [   26.090409]  move_to_new_page+0x3b0/0x568
      [   26.091338]  __unmap_and_move+0x648/0x988
      [   26.092241]  unmap_and_move+0x48c/0xbb8
      [   26.093096]  migrate_pages+0x220/0xb28
      [   26.093945]  kernel_mbind+0x828/0xa18
      [   26.094791]  __arm64_sys_mbind+0xc8/0x138
      [   26.095716]  el0_svc_common+0x190/0x490
      [   26.096571]  el0_svc_handler+0x60/0xd0
      [   26.097423]  el0_svc+0x8/0xc
      
      Run the program below (generated by syzkaller) in parallel, with ext3.
      
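      #define _GNU_SOURCE		/* for O_NOATIME */
      #include <assert.h>
      #include <fcntl.h>
      #include <numaif.h>		/* mbind(), MPOL_MF_MOVE (libnuma) */
      #include <string.h>
      #include <sys/ioctl.h>
      #include <sys/mman.h>
      #include <unistd.h>

      int main(void)
      {
      	int fd, fd1, ret;
      	void *addr;
      	size_t length = 4096;
      	int flags;
      	off_t offset = 0;
      	char *str = "12345";

      	fd = open("a", O_RDWR | O_CREAT, 0644);
      	assert(fd >= 0);

      	/* Truncate to 4k */
      	ret = ftruncate(fd, length);
      	assert(ret == 0);

      	/* Journal data mode */
      	flags = 0xc00f;
      	ret = ioctl(fd, _IOW('f', 2, long), &flags);
      	assert(ret == 0);

      	/* Truncate to 0 */
      	fd1 = open("a", O_TRUNC | O_NOATIME);
      	assert(fd1 >= 0);

      	addr = mmap(NULL, length, PROT_WRITE | PROT_READ,
      					MAP_SHARED, fd, offset);
      	assert(addr != MAP_FAILED);

      	/* Dirty the page, then ask the kernel to migrate it */
      	memcpy(addr, str, 5);
      	mbind(addr, length, 0, 0, 0, MPOL_MF_MOVE);
      	return 0;
      }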
      
      And the bug will be triggered once we see the following order.
      
      reproduce1                         reproduce2
      
      ...                            |   ...
      truncate to 4k                 |
      change to journal data mode    |
                                     |   memcpy(set page dirty)
      truncate to 0:                 |
      ext4_setattr:                  |
      ...                            |
      ext4_wait_for_tail_page_commit |
                                     |   mbind(trigger bug)
      truncate_pagecache(clean dirty)|   ...
      ...                            |
      
      mbind will call ext4_writepage() since the page is still dirty, and then
      report the bug since the buffers have been freed. Fix it by returning
      directly once offset equals 0, which means the page has been fully
      truncated.

      Reported-by: Hulk Robot <hulkci@huawei.com>
      Signed-off-by: yangerkun <yangerkun@huawei.com>
      Link: https://lore.kernel.org/r/20190919063508.1045-1-yangerkun@huawei.com
      Reviewed-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      b1ec93dd
    • D
      splice: only read in as much information as there is pipe buffer space · 326ba910
      Darrick J. Wong committed
      commit 3253d9d093376d62b4a56e609f15d2ec5085ac73 upstream.
      
      Andreas Grünbacher reports that on the two filesystems that support
      iomap directio, it's possible for splice() to return -EAGAIN (instead of
      a short splice) if the pipe being written to has less space available in
      its pipe buffers than the length supplied by the calling process.
      
      Months ago we fixed splice_direct_to_actor to clamp the length of the
      read request to the size of the splice pipe.  Do the same to do_splice.
      
      Fixes: 17614445576b6 ("splice: don't read more than available pipe space")
      Reported-by: syzbot+3c01db6025f26530cf8d@syzkaller.appspotmail.com
      Reported-by: Andreas Grünbacher <andreas.gruenbacher@gmail.com>
      Reviewed-by: Andreas Grünbacher <andreas.gruenbacher@gmail.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      326ba910
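
The clamping idea in the splice commit above can be illustrated with a minimal userspace sketch. It is not the kernel's do_splice code: `struct pipe_model`, `pipe_free_bytes` and `clamp_splice_len` are hypothetical names introduced only for this example, and the model assumes fixed-size pipe buffers.

```c
#include <stddef.h>

/* Hypothetical stand-in for a pipe: buffer slots, slots in use, slot size. */
struct pipe_model {
	size_t slots;      /* total number of pipe buffers */
	size_t used;       /* buffers currently occupied */
	size_t slot_bytes; /* bytes held by one buffer */
};

/* Bytes that can still be queued into the pipe. */
static size_t pipe_free_bytes(const struct pipe_model *p)
{
	return (p->slots - p->used) * p->slot_bytes;
}

/*
 * Clamp the caller's requested length to the free pipe space, so the
 * read yields a short splice rather than running out of buffers.
 */
static size_t clamp_splice_len(const struct pipe_model *p, size_t len)
{
	size_t room = pipe_free_bytes(p);

	return len < room ? len : room;
}
```

With 2 of 16 buffers free and 4096-byte buffers, a 64 KiB request would be clamped to 8192 bytes, while a 100-byte request passes through unchanged.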
    • T
      ext4: work around deleting a file with i_nlink == 0 safely · 8e7a8653
      Theodore Ts'o committed
      commit c7df4a1ecb8579838ec8c56b2bb6a6716e974f37 upstream.
      
      If the file system is corrupted such that a file's i_links_count is
      too small, then it's possible that when unlinking that file, i_nlink
      will already be zero.  Previously we were working around this kind of
      corruption by forcing i_nlink to one; but we were doing this before
      trying to delete the directory entry --- and if the file system is
      corrupted enough that ext4_delete_entry() fails, then we exit with
      i_nlink elevated, and this causes the orphan inode list handling to be
      FUBAR'ed, such that when we unmount the file system, the orphan inode
      list can get corrupted.
      
      A better way to fix this is to simply skip trying to call drop_nlink()
      if i_nlink is already zero, thus moving the check to the place where
      it makes the most sense.
      
      https://bugzilla.kernel.org/show_bug.cgi?id=205433
      
      Link: https://lore.kernel.org/r/20191112032903.8828-1-tytso@mit.edu
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Cc: stable@kernel.org
      Reviewed-by: Andreas Dilger <adilger@dilger.ca>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      8e7a8653
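
A rough sketch of the "skip drop_nlink() when i_nlink is already zero" approach described in the ext4 commit above, using a hypothetical `struct inode_model` rather than the kernel's struct inode. The helper name `drop_nlink_if_nonzero` is an assumption made for illustration.

```c
/* Hypothetical, simplified inode: only the link count matters here. */
struct inode_model {
	unsigned int nlink;
};

/*
 * Drop one link unless the count is already zero (as can happen on a
 * corrupted file system). Returns 1 if a link was dropped, 0 if the
 * drop was skipped.
 */
static int drop_nlink_if_nonzero(struct inode_model *inode)
{
	if (inode->nlink == 0)
		return 0;
	inode->nlink--;
	return 1;
}
```

Because the check sits at the point where the count is decremented, the count is never raised artificially before the directory entry is deleted.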
    • J
      reiserfs: fix extended attributes on the root directory · 7b3ea9bb
      Jeff Mahoney committed
      commit 60e4cf67a582d64f07713eda5fcc8ccdaf7833e6 upstream.
      
      Since commit d0a5b995 (vfs: Add IOP_XATTR inode operations flag)
      extended attributes haven't worked on the root directory in reiserfs.
      
      This is due to reiserfs conditionally setting the sb->s_xattrs handler
      array depending on whether it located or created the internal privroot
      directory.  It necessarily does this after the root inode is already
      read in.  The IOP_XATTR flag is set during inode initialization, so
      it never gets set on the root directory.
      
      This commit unconditionally assigns sb->s_xattrs and clears IOP_XATTR on
      internal inodes.  The old return values due to the conditional assignment
      are handled via open_xa_root, which now returns EOPNOTSUPP as the VFS
      would have done.
      
      Link: https://lore.kernel.org/r/20191024143127.17509-1-jeffm@suse.com
      CC: stable@vger.kernel.org
      Fixes: d0a5b995 ("vfs: Add IOP_XATTR inode operations flag")
      Signed-off-by: Jeff Mahoney <jeffm@suse.com>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      7b3ea9bb
    • J
      ext4: Fix credit estimate for final inode freeing · 595a92a4
      Jan Kara committed
      commit 65db869c754e7c271691dd5feabf884347e694f5 upstream.
      
      Estimate for the number of credits needed for final freeing of inode in
      ext4_evict_inode() was too small. We may modify 4 blocks (inode & sb for
      orphan deletion, bitmap & group descriptor for inode freeing) and not
      just 3.
      
      [ Fixed minor whitespace nit. -- TYT ]
      
      Fixes: e50e5129 ("ext4: xattr-in-inode support")
      CC: stable@vger.kernel.org
      Signed-off-by: Jan Kara <jack@suse.cz>
      Link: https://lore.kernel.org/r/20191105164437.32602-6-jack@suse.cz
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      595a92a4
    • D
      quota: fix livelock in dquot_writeback_dquots · f919b26f
      Dmitry Monakhov committed
      commit 6ff33d99fc5c96797103b48b7b0902c296f09c05 upstream.
      
      Write only quotas which are dirty at entry.
      
      XFSTEST: https://github.com/dmonakhov/xfstests/commit/b10ad23566a5bf75832a6f500e1236084083cddc
      
      Link: https://lore.kernel.org/r/20191031103920.3919-1-dmonakhov@openvz.org
      CC: stable@vger.kernel.org
      Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Signed-off-by: Dmitry Monakhov <dmtrmonakhov@yandex-team.ru>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      f919b26f
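
"Write only quotas which are dirty at entry" in the quota commit above can be pictured with a small model: snapshot the end of the dirty queue before starting, so entries redirtied during writeback cannot keep the loop running. The queue type and all function names below are hypothetical and are not the kernel's dquot structures.

```c
#include <stddef.h>

#define QCAP 16

/* Hypothetical dirty queue: ids of dirty entries, consumed from head. */
struct dirty_queue {
	int items[QCAP];
	size_t head, tail;
};

/* Append an id to the queue; returns -1 when the model queue is full. */
static int dq_push(struct dirty_queue *q, int id)
{
	if (q->tail == QCAP)
		return -1;
	q->items[q->tail++] = id;
	return 0;
}

/* Example writer that immediately redirties whatever it writes. */
static void write_and_redirty(struct dirty_queue *q, int id)
{
	dq_push(q, id);
}

/*
 * Write back only the entries that were queued when the call started.
 * Anything queued by write_one() during the run is left for a later call.
 */
static size_t writeback_dirty_at_entry(struct dirty_queue *q,
				       void (*write_one)(struct dirty_queue *, int))
{
	size_t end = q->tail; /* snapshot taken at entry */
	size_t written = 0;

	while (q->head < end) {
		write_one(q, q->items[q->head++]);
		written++;
	}
	return written;
}
```

Even with a writer that redirties every entry, the call terminates after handling the entries present at entry; looping until the queue is empty would not.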
    • C
      ext2: check err when partial != NULL · 26eca105
      Chengguang Xu committed
      commit e705f4b8aa27a59f8933e8f384e9752f052c469c upstream.
      
      Check err when partial == NULL is meaningless because
      partial == NULL means getting branch successfully without
      error.
      
      CC: stable@vger.kernel.org
      Link: https://lore.kernel.org/r/20191105045100.7104-1-cgxu519@mykernel.net
      Signed-off-by: Chengguang Xu <cgxu519@mykernel.net>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      26eca105
    • D
      quota: Check that quota is not dirty before release · 77b14d6e
      Dmitry Monakhov committed
      commit df4bb5d128e2c44848aeb36b7ceceba3ac85080d upstream.
      
      There is a race window where quota was redirtied once we drop dq_list_lock inside dqput(),
      but before we grab dquot->dq_lock inside dquot_release()
      
      TASK1                                                       TASK2 (chowner)
      ->dqput()
        we_slept:
          spin_lock(&dq_list_lock)
          if (dquot_dirty(dquot)) {
                spin_unlock(&dq_list_lock);
                dquot->dq_sb->dq_op->write_dquot(dquot);
                goto we_slept
          if (test_bit(DQ_ACTIVE_B, &dquot->dq_flags)) {
                spin_unlock(&dq_list_lock);
                dquot->dq_sb->dq_op->release_dquot(dquot);
                                                                  dqget()
      							    mark_dquot_dirty()
      							    dqput()
                goto we_slept;
              }

      So a dirty dquot will be released by TASK1, but on the next we_slept loop
      we detect this and call ->write_dquot() for it.

      XFSTEST: https://github.com/dmonakhov/xfstests/commit/440a80d4cbb39e9234df4d7240aee1d551c36107
      
      Link: https://lore.kernel.org/r/20191031103920.3919-2-dmonakhov@openvz.org
      CC: stable@vger.kernel.org
      Signed-off-by: Dmitry Monakhov <dmtrmonakhov@yandex-team.ru>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      77b14d6e
    • A
      ovl: relax WARN_ON() on rename to self · f785f33c
      Amir Goldstein committed
      commit 6889ee5a53b8d969aa542047f5ac8acdc0e79a91 upstream.
      
      In ovl_rename(), if new upper is hardlinked to old upper underneath
      overlayfs before upper dirs are locked, user will get an ESTALE error
      and a WARN_ON will be printed.
      
      Changes to underlying layers while overlayfs is mounted may result in
      unexpected behavior, but it shouldn't crash the kernel and it shouldn't
      trigger WARN_ON() either, so relax this WARN_ON().
      
      Reported-by: syzbot+bb1836a212e69f8e201a@syzkaller.appspotmail.com
      Fixes: 804032fa ("ovl: don't check rename to self")
      Cc: <stable@vger.kernel.org> # v4.9+
      Signed-off-by: Amir Goldstein <amir73il@gmail.com>
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      f785f33c
    • A
      ovl: fix corner case of non-unique st_dev;st_ino · 3e929ddf
      Amir Goldstein committed
      commit 9c6d8f13e9da10a26ad7f0a020ef86e8ef142835 upstream.
      
      On non-samefs overlay without xino, non pure upper inodes should use a
      pseudo_dev assigned to each unique lower fs and pure upper inodes use the
      real upper st_dev.
      
      It is fine for an overlay pure upper inode to use the same st_dev;st_ino
      values as the real upper inode, because the content of those two different
      filesystem objects is always the same.
      
      In this case, however:
       - two filesystems, A and B
       - upper layer is on A
       - lower layer 1 is also on A
       - lower layer 2 is on B
      
      Non pure upper overlay inode, whose origin is in layer 1 will have the same
      st_dev;st_ino values as the real lower inode. This may result in false
      positive results of 'diff' between the real lower and copied up overlay
      inode.
      
      Fix this by using the upper st_dev;st_ino values in this case.  This breaks
      the property of constant st_dev;st_ino across copy up of this case. This
      breakage will be fixed by a later patch.
      
      Fixes: 5148626b ("ovl: allocate anon bdev per unique lower fs")
      Cc: stable@vger.kernel.org # v4.17+
      Signed-off-by: Amir Goldstein <amir73il@gmail.com>
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      3e929ddf
    • J
      btrfs: record all roots for rename exchange on a subvol · 8862b80b
      Josef Bacik committed
      commit 3e1740993e43116b3bc71b0aad1e6872f6ccf341 upstream.
      
      Testing with the new fsstress support for subvolumes uncovered a pretty
      bad problem with rename exchange on subvolumes.  We're modifying two
      different subvolumes, but we only start the transaction on one of them,
      so the other one is not added to the dirty root list.  This is caught by
      btrfs_cow_block() with a warning because the root has not been updated,
      however if we do not modify this root again we'll end up pointing at an
      invalid root because the root item is never updated.
      
      Fix this by making sure we add the destination root to the trans list,
      the same as we do with normal renames.  This fixes the corruption.
      
      Fixes: cdd1fedf ("btrfs: add support for RENAME_EXCHANGE and RENAME_WHITEOUT")
      CC: stable@vger.kernel.org # 4.9+
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      8862b80b
    • F
      Btrfs: send, skip backreference walking for extents with many references · f8031853
      Filipe Manana committed
      commit fd0ddbe2509568b00df364156f47561e9f469f15 upstream.
      
      Backreference walking, which is used by send to figure if it can issue
      clone operations instead of write operations, can be very slow and use
      too much memory when extents have many references. This change simply
      skips backreference walking when an extent has more than 64 references,
      in which case we fallback to a write operation instead of a clone
      operation. This limit is conservative and in practice I observed no
      significant slowdown with up to 100 references and still low memory usage
      up to that limit.
      
      This is a temporary workaround until there are speedups in the backref
      walking code, and as such it does not attempt to add extra interfaces or
      knobs to tweak the threshold.

      Reported-by: Atemu <atemu.main@gmail.com>
      Link: https://lore.kernel.org/linux-btrfs/CAE4GHgkvqVADtS4AzcQJxo0Q1jKQgKaW3JGp3SGdoinVo=C9eQ@mail.gmail.com/T/#me55dc0987f9cc2acaa54372ce0492c65782be3fa
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      f8031853
    • Q
      btrfs: Remove btrfs_bio::flags member · dc2a320d
      Qu Wenruo committed
      commit 34b127aecd4fe8e6a3903e10f204a7b7ffddca22 upstream.
      
      The last user of btrfs_bio::flags was removed in commit 326e1dbb
      ("block: remove management of bi_remaining when restoring original
      bi_end_io"), remove it.
      
      (Tagged for stable as the structure is heavily used and space savings
      are desirable.)
      
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      dc2a320d
    • T
      btrfs: Avoid getting stuck during cyclic writebacks · dfca82a7
      Tejun Heo committed
      commit f7bddf1e27d18fbc7d3e3056ba449cfbe4e20b0a upstream.
      
      During a cyclic writeback, extent_write_cache_pages() uses done_index
      to update the writeback_index after the current run is over.  However,
      instead of current index + 1, it gets set to the current index itself.

      Unfortunately, this, combined with returning on EOF instead of looping
      back, can lead to the following pathological behavior.
      
      1. There is a single file which has accumulated enough dirty pages to
         trigger balance_dirty_pages() and the writer appending to the file
         with a series of short writes.
      
      2. balance_dirty_pages kicks in, wakes up background writeback and sleeps.
      
      3. Writeback kicks in and the cursor is on the last page of the dirty
         file.  Writeback is started or skipped if already in progress.  As
         it's EOF, extent_write_cache_pages() returns and the cursor is set
         to done_index which is pointing to the last page.
      
      4. Writeback is done.  Nothing happens till balance_dirty_pages
         finishes, at which point we go back to #1.
      
      This can almost completely stall out writing back of the file and keep
      the system over dirty threshold for a long time which can mess up the
      whole system.  We encountered this issue in production with a package
      handling application which can reliably reproduce the issue when
      running under tight memory limits.
      
      Reading the comment in the error handling section, this seems to be to
      avoid accidentally skipping a page in case the write attempt on the
      page doesn't succeed.  However, this concern seems bogus.
      
      On each page, the code either:
      
      * Skips and moves onto the next page.
      
      * Fails to issue and sets done_index to index + 1.
      
      * Successfully issues and continue to the next page if budget allows
        and not EOF.
      
      IOW, as long as it's not EOF and there's budget, the code never
      retries writing back the same page.  Only when a page happens to be
      the last page of a particular run, we end up retrying the page, which
      can't possibly guarantee anything data integrity related.  Besides,
      cyclic writes are only used for non-syncing writebacks meaning that
      there's no data integrity implication to begin with.
      
      Fix it by always setting done_index past the current page being
      processed.
      
      Note that this problem exists in other writepages too.
      
      CC: stable@vger.kernel.org # 4.19+
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      dfca82a7
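
The cursor problem in the cyclic writeback commit above can be reproduced in a toy model. The function below is not extent_write_cache_pages(); `writeback_run` and its `advance_past` switch are hypothetical, and the model assumes every page in the run is issued successfully.

```c
#include <stddef.h>

/*
 * One cyclic writeback run over pages [start, nr_pages), limited by
 * budget. Returns the index at which the next run will resume.
 *
 * advance_past == 0 models the old behaviour (cursor left on the page
 * just processed); advance_past != 0 models the fix (cursor moved past
 * the page just processed).
 */
static size_t writeback_run(size_t start, size_t nr_pages, size_t budget,
			    int advance_past)
{
	size_t done_index = start;
	size_t index;

	for (index = start; index < nr_pages && budget > 0; index++, budget--) {
		/* The page at 'index' would be issued for writeback here. */
		done_index = advance_past ? index + 1 : index;
	}
	return done_index;
}
```

Starting on the last page of a 10-page file, the old behaviour returns 9 again and again, so every run revisits the same page; with the fix the cursor reaches 10 (EOF) and the next cycle can wrap around to the start of the file.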
    • F
      Btrfs: fix negative subv_writers counter and data space leak after buffered write · 8155dbe0
      Filipe Manana committed
      commit a0e248bb502d5165b3314ac3819e888fdcdf7d9f upstream.
      
      When doing a buffered write it's possible to leave the subv_writers
      counter of the root, used for synchronization between buffered nocow
      writers and snapshotting, with a negative value. This happens in an
      exceptional case like the following:
      
      1) We fail to allocate data space for the write, since there's not
         enough available data space nor enough unallocated space for allocating
         a new data block group;
      
      2) Because of that failure, we try to go to NOCOW mode, which succeeds
         and therefore we set the local variable 'only_release_metadata' to true
         and set the root's subv_writers counter to 1 through the call to
         btrfs_start_write_no_snapshotting() made by check_can_nocow();
      
      3) The call to btrfs_copy_from_user() returns zero, which is very unlikely
         to happen but not impossible;
      
      4) No pages are copied because btrfs_copy_from_user() returned zero;
      
      5) We call btrfs_end_write_no_snapshotting() which decrements the root's
         subv_writers counter to 0;
      
      6) We don't set 'only_release_metadata' back to 'false' because we do
         it only if 'copied', the value returned by btrfs_copy_from_user(), is
         greater than zero;
      
      7) On the next iteration of the while loop, which processes the same
         page range, we are now able to allocate data space for the write (we
         got enough data space released in the meanwhile);
      
      8) After this if we fail at btrfs_delalloc_reserve_metadata(), because
         now there isn't enough free metadata space, or in some other place
         further below (prepare_pages(), lock_and_cleanup_extent_if_need(),
         btrfs_dirty_pages()), we break out of the while loop with
         'only_release_metadata' having a value of 'true';
      
      9) Because 'only_release_metadata' is 'true' we end up decrementing the
         root's subv_writers counter to -1 (through a call to
         btrfs_end_write_no_snapshotting()), and we also end up not releasing the
         data space previously reserved through btrfs_check_data_free_space().
         As a consequence the mechanism for synchronizing NOCOW buffered writes
         with snapshotting gets broken.
      
      Fix this by always setting 'only_release_metadata' to false at the start
      of each iteration.
      
      Fixes: 8257b2dc ("Btrfs: introduce btrfs_{start, end}_nocow_write() for each subvolume")
      Fixes: 7ee9e440 ("Btrfs: check if we can nocow if we don't have data space")
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      8155dbe0
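
The nine-step sequence in the subv_writers commit above boils down to a flag that survives from one loop iteration into the next. The following is a heavily simplified model of that control flow, not the btrfs write path: `struct iter_model`, `run_write_loop` and the `reset_each_iter` switch are hypothetical names chosen for this sketch.

```c
#include <stdbool.h>

/* Hypothetical description of what happens in one loop iteration. */
struct iter_model {
	bool data_space_ok; /* data space reservation succeeded */
	bool nocow_ok;      /* fallback to NOCOW is possible */
	int copied;         /* bytes copied from user space */
	bool later_failure; /* a later step fails and breaks the loop */
};

/* Returns the final value of the modelled writers counter. */
static int run_write_loop(const struct iter_model *it, int n,
			  bool reset_each_iter)
{
	int writers = 0;
	bool only_release_metadata = false;
	int i;

	for (i = 0; i < n; i++) {
		if (reset_each_iter)
			only_release_metadata = false; /* the fix */

		if (!it[i].data_space_ok) {
			if (!it[i].nocow_ok)
				break;
			only_release_metadata = true;
			writers++; /* start of NOCOW write */
		}
		if (it[i].later_failure)
			break; /* cleanup happens after the loop */

		if (it[i].copied == 0) {
			if (only_release_metadata)
				writers--; /* end of NOCOW write */
			continue; /* retry; the flag is not cleared here */
		}
		if (only_release_metadata) {
			writers--;
			only_release_metadata = false;
		}
	}
	if (only_release_metadata)
		writers--; /* error-path cleanup trusts the flag */
	return writers;
}
```

Feeding it a first iteration that falls back to NOCOW and copies nothing, followed by an iteration that gets data space but fails later, leaves the counter at -1 without the reset and at 0 with it.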
    • F
      Btrfs: fix metadata space leak on fixup worker failure to set range as delalloc · 9d0e32f0
      Filipe Manana committed
      commit 536870071dbc4278264f59c9a2f5f447e584d139 upstream.
      
      In the fixup worker, if we fail to mark the range as delalloc in the io
      tree, we must release the previously reserved metadata, as well as update
      the outstanding extents counter for the inode, otherwise we leak metadata
      space.
      
      In practice we can't return an error from btrfs_set_extent_delalloc(),
      which is just a wrapper around __set_extent_bit(), as for most errors
      __set_extent_bit() does a BUG_ON() (or panics which hits a BUG_ON() as
      well) and returning an -EEXIST error doesn't happen in this case since
      the exclusive bits parameter always has a value of 0 through this code
      path. Nevertheless, just fix the error handling in the fixup worker,
      in case one day __set_extent_bit() can return an error to this code
      path.
      
      Fixes: f3038ee3 ("btrfs: Handle btrfs_set_extent_delalloc failure in fixup worker")
      CC: stable@vger.kernel.org # 4.19+
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      9d0e32f0
    • J
      btrfs: use refcount_inc_not_zero in kill_all_nodes · eda96b24
      Josef Bacik committed
      commit baf320b9d531f1cfbf64c60dd155ff80a58b3796 upstream.
      
      We hit the following warning while running down a different problem
      
      [ 6197.175850] ------------[ cut here ]------------
      [ 6197.185082] refcount_t: underflow; use-after-free.
      [ 6197.194704] WARNING: CPU: 47 PID: 966 at lib/refcount.c:190 refcount_sub_and_test_checked+0x53/0x60
      [ 6197.521792] Call Trace:
      [ 6197.526687]  __btrfs_release_delayed_node+0x76/0x1c0
      [ 6197.536615]  btrfs_kill_all_delayed_nodes+0xec/0x130
      [ 6197.546532]  ? __btrfs_btree_balance_dirty+0x60/0x60
      [ 6197.556482]  btrfs_clean_one_deleted_snapshot+0x71/0xd0
      [ 6197.566910]  cleaner_kthread+0xfa/0x120
      [ 6197.574573]  kthread+0x111/0x130
      [ 6197.581022]  ? kthread_create_on_node+0x60/0x60
      [ 6197.590086]  ret_from_fork+0x1f/0x30
      [ 6197.597228] ---[ end trace 424bb7ae00509f56 ]---
      
      This is because the free side drops the ref without the lock, and then
      takes the lock if our refcount is 0.  So you can have nodes on the tree
      that have a refcount of 0.  Fix this by zero'ing out that element in our
      temporary array so we don't try to kill it again.
      
      CC: stable@vger.kernel.org # 4.14+
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      [ add comment ]
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      eda96b24
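
The "increment only if not zero" pattern named in the title of the commit above can be sketched in userspace with C11 atomics. This is an illustration of the pattern, not the kernel's refcount_t implementation; `ref_inc_not_zero`, `struct node_model` and `grab_live_nodes` are hypothetical names.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Take a reference only if the count is still non-zero. */
static bool ref_inc_not_zero(atomic_int *ref)
{
	int old = atomic_load(ref);

	while (old != 0) {
		/* On failure, 'old' is reloaded with the current value. */
		if (atomic_compare_exchange_weak(ref, &old, old + 1))
			return true;
	}
	return false;
}

/* Hypothetical node carrying only a reference count. */
struct node_model {
	atomic_int refs;
};

/*
 * Walk a temporary array of nodes: keep the ones we managed to pin and
 * zero out the slots whose count had already dropped to 0, so later
 * code does not try to release them again. Returns the number pinned.
 */
static int grab_live_nodes(struct node_model **nodes, int n)
{
	int live = 0;
	int i;

	for (i = 0; i < n; i++) {
		if (ref_inc_not_zero(&nodes[i]->refs))
			live++;
		else
			nodes[i] = NULL;
	}
	return live;
}
```

A node whose count is already 0 is being freed by someone else, so the slot is cleared instead of being handed to the release path a second time.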
    • J
      btrfs: check page->mapping when loading free space cache · 6e3b9068
      Josef Bacik committed
      commit 3797136b626ad4b6582223660c041efdea8f26b2 upstream.
      
      While testing 5.2 we ran into the following panic
      
      [52238.017028] BUG: kernel NULL pointer dereference, address: 0000000000000001
      [52238.105608] RIP: 0010:drop_buffers+0x3d/0x150
      [52238.304051] Call Trace:
      [52238.308958]  try_to_free_buffers+0x15b/0x1b0
      [52238.317503]  shrink_page_list+0x1164/0x1780
      [52238.325877]  shrink_inactive_list+0x18f/0x3b0
      [52238.334596]  shrink_node_memcg+0x23e/0x7d0
      [52238.342790]  ? do_shrink_slab+0x4f/0x290
      [52238.350648]  shrink_node+0xce/0x4a0
      [52238.357628]  balance_pgdat+0x2c7/0x510
      [52238.365135]  kswapd+0x216/0x3e0
      [52238.371425]  ? wait_woken+0x80/0x80
      [52238.378412]  ? balance_pgdat+0x510/0x510
      [52238.386265]  kthread+0x111/0x130
      [52238.392727]  ? kthread_create_on_node+0x60/0x60
      [52238.401782]  ret_from_fork+0x1f/0x30
      
      The page we were trying to drop had a page->private, but had no
      page->mapping and so called drop_buffers, assuming that we had a
      buffer_head on the page, and then panic'ed trying to deref 1, which is
      our page->private for data pages.
      
      This is happening because we're truncating the free space cache while
      we're trying to load the free space cache.  This isn't supposed to
      happen, and I'll fix that in a followup patch.  However we still
      shouldn't allow those sort of mistakes to result in messing with pages
      that do not belong to us.  So add the page->mapping check to verify that
      we still own this page after dropping and re-acquiring the page lock.
      
      The page is being unlocked as follows:
      btrfs_readpage
        extent_read_full_page
          __extent_read_full_page
            __do_readpage
              if (!nr)
      	   unlock_page  <-- nr can be 0 only if submit_extent_page
      			    returns an error
      
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      [ add callchain ]
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      6e3b9068
  2. 13 Dec, 2019 20 commits
    • B
      xfs: add missing error check in xfs_prepare_shift() · 17559e35
      Brian Foster committed
      commit 1749d1ea89bdf3181328b7d846e609d5a0e53e50 upstream.
      
      xfs_prepare_shift() fails to check the error return from
      xfs_flush_unmap_range(). If the latter fails, that could lead to an
      insert/collapse range operation over a delalloc range, which is not
      supported.
      
      Add an error check and return appropriately. This is reproduced
      rarely by generic/475.
      
      Fixes: 7f9f71be84bc ("xfs: extent shifting doesn't fully invalidate page cache")
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Allison Collins <allison.henderson@oracle.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Cc: Nobuhiro Iwamatsu <nobuhiro1.iwamatsu@toshiba.co.jp>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      17559e35
    • D
      iomap: partially revert 4721a601099 (simulated directio short read on EFAULT) · 4c67dbea
      Darrick J. Wong committed
      [ Upstream commit 8f67b5adc030553fbc877124306f3f3bdab89aa8 ]
      
      In commit 4721a601099, we tried to fix a problem wherein directio reads
      into a splice pipe will bounce EFAULT/EAGAIN all the way out to
      userspace by simulating a zero-byte short read.  This happens because
      some directio read implementations (xfs) will call
      bio_iov_iter_get_pages to grab pipe buffer pages and issue asynchronous
      reads, but as soon as we run out of pipe buffers that _get_pages call
      returns EFAULT, which the splice code translates to EAGAIN and bounces
      out to userspace.
      
      In that commit, the iomap code catches the EFAULT and simulates a
      zero-byte read, but that causes assertion errors on regular splice reads
      because xfs doesn't allow short directio reads.  This causes infinite
      splice() loops and assertion failures on generic/095 on overlayfs
      because xfs only permits total success or total failure of a directio
      operation.  The underlying issue in the pipe splice code has now been
      fixed by changing the pipe splice loop to avoid reading more data
      than there is space in the pipe.
      
      Therefore, it's no longer necessary to simulate the short directio, so
      remove the hack from iomap.
      
      Fixes: 4721a601099 ("iomap: dio data corruption and spurious errors when pipes fill")
      Reported-by: Murphy Zhou <jencce.kernel@gmail.com>
      Ranted-by: Amir Goldstein <amir73il@gmail.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      4c67dbea
    • D
      splice: don't read more than available pipe space · 019b6325
      Darrick J. Wong committed
      [ Upstream commit 17614445576b6af24e9cf36607c6448164719c96 ]
      
      In commit 4721a601099, we tried to fix a problem wherein directio reads
      into a splice pipe will bounce EFAULT/EAGAIN all the way out to
      userspace by simulating a zero-byte short read.  This happens because
      some directio read implementations (xfs) will call
      bio_iov_iter_get_pages to grab pipe buffer pages and issue asynchronous
      reads, but as soon as we run out of pipe buffers that _get_pages call
      returns EFAULT, which the splice code translates to EAGAIN and bounces
      out to userspace.
      
      In that commit, the iomap code catches the EFAULT and simulates a
      zero-byte read, but that causes assertion errors on regular splice reads
      because xfs doesn't allow short directio reads.
      
      The brokenness is compounded by splice_direct_to_actor immediately
      bailing on do_splice_to returning <= 0 without ever calling ->actor
      (which empties out the pipe), so if userspace calls back we'll EFAULT
      again on the full pipe, and nothing ever gets copied.
      
      Therefore, teach splice_direct_to_actor to clamp its requests to the
      amount of free space in the pipe and remove the simulated short read
      from the iomap directio code.
      
      Fixes: 4721a601099 ("iomap: dio data corruption and spurious errors when pipes fill")
      Reported-by: Murphy Zhou <jencce.kernel@gmail.com>
      Ranted-by: Amir Goldstein <amir73il@gmail.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      019b6325
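As a rough illustration of the clamping described in the splice commit above, the request size can be limited to the free space left in the pipe before the read is issued. This is a minimal Python model, not the kernel code: `PAGE_SIZE`, `clamp_to_pipe_space` and the slot counts are hypothetical names and values, and it assumes each free pipe slot takes at most one page.

```python
PAGE_SIZE = 4096  # assumed page size, for illustration only


def clamp_to_pipe_space(requested: int, total_slots: int, used_slots: int) -> int:
    """Return how many bytes may be requested without overfilling the pipe.

    Hypothetical helper: models a pipe as a fixed number of page-sized slots.
    """
    free_slots = total_slots - used_slots
    if free_slots < 0:
        raise ValueError("used_slots exceeds total_slots")
    # Never ask for more than the free slots can hold.
    return min(requested, free_slots * PAGE_SIZE)
```

With 4 of 16 slots free, a 1 MiB request is reduced to 16384 bytes, and a request against a full pipe is reduced to 0, so the reader never runs out of pipe buffers partway through a read.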
    • J
      iomap: Fix pipe page leakage during splicing · b59116ff
      Jan Kara committed
      commit 419e9c38aa075ed0cd3c13d47e15954b686bcdb6 upstream.
      
      When splicing using iomap_dio_rw() to a pipe, we may leak pipe pages
      because bio_iov_iter_get_pages() records that the pipe will have a full
      extent's worth of data.  However, if the file size is not block size
      aligned, iomap_dio_rw() returns less than what bio_iov_iter_get_pages()
      set up, and the splice code gets confused, leaking a pipe page with the
      file tail.
      
      Handle the situation similarly to the old direct IO implementation and
      revert iter to the actually returned read amount, which makes iter
      consistent with the value returned from iomap_dio_rw(), and thus the
      splice code is happy.
      
      Fixes: ff6a9292 ("iomap: implement direct I/O")
      CC: stable@vger.kernel.org
      Reported-by: syzbot+991400e8eba7e00a26e1@syzkaller.appspotmail.com
      Signed-off-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      b59116ff
    • T
      kernfs: fix ino wrap-around detection · 18493bac
      Tejun Heo committed
      commit e23f568aa63f64cd6b355094224cc9356c0f696b upstream.
      
      When the 32bit ino wraps around, kernfs increments the generation
      number to distinguish reused ino instances.  The wrap-around detection
      tests whether the allocated ino is lower than the cursor, but the
      cursor is pointing to the next ino to allocate so the condition never
      triggers.
      
      Fix it by remembering the last ino and comparing against that.
      
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Fixes: 4a3ef68a ("kernfs: implement i_generation")
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: stable@vger.kernel.org # v4.14+
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      18493bac
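The wrap-around bug in the kernfs commit above can be reproduced with a toy allocator. This is a sketch under stated assumptions, not the kernel's idr code: `CyclicAllocator`, `generation_with_cursor_check` and `generation_with_last_ino_check` are hypothetical names, and the allocator simply hands out IDs in order and wraps at `max_id`.

```python
class CyclicAllocator:
    """Toy cyclic ID allocator: hands out min_id..max_id, then wraps."""

    def __init__(self, min_id: int, max_id: int) -> None:
        self.min_id = min_id
        self.max_id = max_id
        self.cursor = min_id  # the NEXT id that will be handed out

    def alloc(self) -> int:
        ino = self.cursor
        self.cursor = self.min_id if ino == self.max_id else ino + 1
        return ino


def generation_with_cursor_check(alloc: CyclicAllocator, count: int) -> int:
    """Buggy detection: compare the new ino against the pre-allocation cursor."""
    generation = 1
    for _ in range(count):
        cursor = alloc.cursor  # already names the ino about to be returned
        ino = alloc.alloc()
        if ino < cursor:  # never true in this model, so wraps go unnoticed
            generation += 1
    return generation


def generation_with_last_ino_check(alloc: CyclicAllocator, count: int) -> int:
    """Fixed detection: remember the last ino and compare against that."""
    generation = 1
    last_ino = None
    for _ in range(count):
        ino = alloc.alloc()
        if last_ino is not None and ino < last_ino:
            generation += 1  # the allocator wrapped around
        last_ino = ino
    return generation
```

With IDs 1..4 and ten allocations the allocator wraps twice; the cursor-based check still reports generation 1, while the last-ino check reports generation 3.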
    • P
      CIFS: Fix SMB2 oplock break processing · d4785d88
      Pavel Shilovsky committed
      commit fa9c2362497fbd64788063288dc4e74daf977ebb upstream.
      
      Even when mounting with a modern protocol version, the server may be
      configured without support for SMB2.1 leases, in which case the client
      uses SMB2 oplocks to optimize IO performance through local caching.
      
      However, there is a problem in oplock break handling that leads
      to missing a break notification on the client that has a file
      opened. It later causes big latencies for other clients that
      are trying to open the same file.
      
      The problem reproduces when there are multiple shares from the
      same server mounted on the client. The processing code tries to
      match persistent and volatile file ids from the break notification
      with an open file, but it skips all shares besides the first one.
      Fix this by looking up in all shares belonging to the server that
      issued the oplock break.
      
      Cc: Stable <stable@vger.kernel.org>
      Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com>
      Signed-off-by: Steve French <stfrench@microsoft.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      d4785d88
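The lookup described in the oplock break commit above amounts to searching every share of the server rather than only the first. The following is a simplified Python model with hypothetical types (`OpenFile`, `Share`, `find_open_file`); it does not reflect the CIFS data structures or locking.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class OpenFile:
    persistent_id: int
    volatile_id: int
    name: str


@dataclass
class Share:
    name: str
    open_files: List[OpenFile] = field(default_factory=list)


def find_open_file(shares: List[Share], persistent_id: int,
                   volatile_id: int) -> Optional[OpenFile]:
    """Search all shares of one server for the file named in a break notification."""
    for share in shares:  # walk every share, not just shares[0]
        for open_file in share.open_files:
            if (open_file.persistent_id == persistent_id
                    and open_file.volatile_id == volatile_id):
                return open_file
    return None
```

A file opened on the second mounted share is found by this loop, whereas a lookup restricted to the first share would return nothing and the break would be dropped.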
    • P
      CIFS: Fix NULL-pointer dereference in smb2_push_mandatory_locks · df871e55
      Pavel Shilovsky committed
      commit 6f582b273ec23332074d970a7fb25bef835df71f upstream.
      
      Currently when the client creates a cifsFileInfo structure for
      a newly opened file, it allocates a list of byte-range locks
      with a pointer to the new cfile and attaches this list to the
      inode's lock list. The latter happens before initializing all
      other fields, e.g. cfile->tlink. Thus a partially initialized
      cifsFileInfo structure becomes available to other threads that
      walk through the inode's lock list. One example of such a thread
      may be an oplock break worker thread that tries to push all
      cached byte-range locks. This causes NULL-pointer dereference
      in smb2_push_mandatory_locks() when accessing cfile->tlink:
      
      [598428.945633] BUG: kernel NULL pointer dereference, address: 0000000000000038
      ...
      [598428.945749] Workqueue: cifsoplockd cifs_oplock_break [cifs]
      [598428.945793] RIP: 0010:smb2_push_mandatory_locks+0xd6/0x5a0 [cifs]
      ...
      [598428.945834] Call Trace:
      [598428.945870]  ? cifs_revalidate_mapping+0x45/0x90 [cifs]
      [598428.945901]  cifs_oplock_break+0x13d/0x450 [cifs]
      [598428.945909]  process_one_work+0x1db/0x380
      [598428.945914]  worker_thread+0x4d/0x400
      [598428.945921]  kthread+0x104/0x140
      [598428.945925]  ? process_one_work+0x380/0x380
      [598428.945931]  ? kthread_park+0x80/0x80
      [598428.945937]  ret_from_fork+0x35/0x40
      
      Fix this by reordering initialization steps of the cifsFileInfo
      structure: initialize all the fields first and then add the new
      byte-range lock list to the inode's lock list.
      
      Cc: Stable <stable@vger.kernel.org>
      Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com>
      Reviewed-by: Aurelien Aptel <aaptel@suse.com>
      Signed-off-by: Steve French <stfrench@microsoft.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      df871e55
    • M
      fuse: verify attributes · 710c33ad
      Miklos Szeredi committed
      commit eb59bd17d2fa6e5e84fba61a5ebdea984222e6d5 upstream.
      
      If a filesystem returns negative inode sizes, future reads on the file were
      causing the cpu to spin on truncate_pagecache.
      
      Create a helper to validate the attributes.  This now does two things:
      
       - check the file mode
       - check if the file size fits in i_size without overflowing
      
      Reported-by: Arijit Banerjee <arijit@rubrik.com>
      Fixes: d8a5ba45 ("[PATCH] FUSE - core")
      Cc: <stable@vger.kernel.org> # v2.6.14
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      710c33ad
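The two checks listed in the fuse commit above can be sketched as a small validator. This is an illustrative Python model, not the kernel helper: `attr_is_invalid` is a hypothetical name, the set of accepted file types is an assumption, and the size is treated as an unsigned 64-bit value that must fit in a signed 64-bit i_size.

```python
import stat

LLONG_MAX = 2**63 - 1  # largest value a signed 64-bit i_size can hold

# Assumed set of acceptable file types for this sketch.
_VALID_TYPE_CHECKS = (
    stat.S_ISREG, stat.S_ISDIR, stat.S_ISLNK, stat.S_ISCHR,
    stat.S_ISBLK, stat.S_ISFIFO, stat.S_ISSOCK,
)


def attr_is_invalid(mode: int, size: int) -> bool:
    """Return True if the reported attributes should be rejected."""
    valid_type = any(check(mode) for check in _VALID_TYPE_CHECKS)
    # A size above LLONG_MAX would read back as negative in i_size.
    return (not valid_type) or size > LLONG_MAX
```

A regular file of 1024 bytes passes, while a size of 2**64 - 1 (which would read back as -1) or a mode with no file type bits is rejected.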
    • M
      fuse: verify nlink · 9f435a5e
      Miklos Szeredi committed
      commit c634da718db9b2fac201df2ae1b1b095344ce5eb upstream.
      
      When adding a new hard link, make sure that i_nlink doesn't overflow.
      
      Fixes: ac45d613 ("fuse: fix nlink after unlink")
      Cc: <stable@vger.kernel.org> # v3.4
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      9f435a5e
    • Z
      nfsd: Return EPERM, not EACCES, in some SETATTR cases · 1ff89e6d
      zhengbin committed
      [ Upstream commit 255fbca65137e25b12bced18ec9a014dc77ecda0 ]
      
      As the man(2) page for utime/utimes states, EPERM is returned when the
      second parameter of utime or utimes is not NULL, the caller's effective UID
      does not match the owner of the file, and the caller is not privileged.
      
      However, in an NFS directory mounted from knfsd, it will return EACCES
      (from nfsd_setattr->fh_verify->nfsd_permission).  This patch fixes
      that.
      
      Signed-off-by: zhengbin <zhengbin13@huawei.com>
      Signed-off-by: J. Bruce Fields <bfields@redhat.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      1ff89e6d
    • K
      pstore/ram: Avoid NULL deref in ftrace merging failure path · 6b6f6030
      Kees Cook committed
      [ Upstream commit 8665569e97dd52920713b95675409648986b5b0d ]
      
      Given corruption in the ftrace records, it might be possible to allocate
      tmp_prz without assigning prz to it, but still marking it as needing to
      be freed, which would cause at least a NULL dereference.
      
      smatch warnings:
      fs/pstore/ram.c:340 ramoops_pstore_read() error: we previously assumed 'prz' could be null (see line 255)
      
      https://lists.01.org/pipermail/kbuild-all/2018-December/055528.html
      
      Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
      Fixes: 2fbea82b ("pstore: Merge per-CPU ftrace records into one")
      Cc: "Joel Fernandes (Google)" <joel@joelfernandes.org>
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      6b6f6030
    • D
      dlm: fix invalid cluster name warning · 446a04d8
      David Teigland committed
      [ Upstream commit 3595c559326d0b660bb088a88e22e0ca630a0e35 ]
      
      The warning added in commit 3b0e761b
        "dlm: print log message when cluster name is not set"
      
      did not account for the fact that lockspaces created
      from userland do not supply a cluster name, so bogus
      warnings are printed every time a userland lockspace
      is created.
      
      Signed-off-by: David Teigland <teigland@redhat.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      446a04d8
    • S
      nfsd: fix a warning in __cld_pipe_upcall() · 3ff6af8e
      Scott Mayhew committed
      [ Upstream commit b493fd31c0b89d9453917e977002de58bebc3802 ]
      
      __cld_pipe_upcall() emits a "do not call blocking ops when
      !TASK_RUNNING" warning due to the dput() call in rpc_queue_upcall().
      Fix it by using a completion instead of hand coding the wait.
      
      Signed-off-by: Scott Mayhew <smayhew@redhat.com>
      Signed-off-by: J. Bruce Fields <bfields@redhat.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      3ff6af8e
    • W
      dlm: NULL check before kmem_cache_destroy is not needed · 3b0107ca
      Wen Yang committed
      [ Upstream commit f31a89692830061bceba8469607e4e4b0f900159 ]
      
      kmem_cache_destroy(NULL) is safe, so remove the NULL check before
      freeing the mem. This patch also fixes ifnullfree.cocci warnings.
      
      Signed-off-by: Wen Yang <wen.yang99@zte.com.cn>
      Signed-off-by: David Teigland <teigland@redhat.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      3b0107ca
    • J
      lockd: fix decoding of TEST results · a2b7010f
      J. Bruce Fields committed
      [ Upstream commit b8db159239b3f51e2b909859935cc25cb3ff3eed ]
      
      We fail to advance the read pointer when reading the stat.oh field that
      identifies the lock-holder in a TEST result.
      
      This turns out not to matter if the server is knfsd, which always
      returns a zero-length field.  But other servers (Ganesha is an example)
      may not do this.  The result is bad values in fcntl F_GETLK results.
      
      Fix this.
      
      Signed-off-by: J. Bruce Fields <bfields@redhat.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      a2b7010f
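The lockd bug above is a decoder that reads a variable-length field without moving past it. A minimal Python sketch of XDR-style decoding shows why the offset must advance: `decode_opaque` and `decode_u32` are hypothetical helpers, and the integer placed after the owner handle in the usage below merely stands in for whatever field follows it on the wire.

```python
import struct


def decode_opaque(buf: bytes, offset: int):
    """Decode one XDR variable-length opaque; return (data, new_offset)."""
    (length,) = struct.unpack_from(">I", buf, offset)  # 4-byte big-endian length
    start = offset + 4
    end = start + length
    if end > len(buf):
        raise ValueError("truncated opaque field")
    # XDR pads opaque data to a multiple of four bytes.
    padded_end = start + ((length + 3) & ~3)
    return buf[start:end], padded_end


def decode_u32(buf: bytes, offset: int):
    """Decode one 32-bit unsigned integer; return (value, new_offset)."""
    (value,) = struct.unpack_from(">I", buf, offset)
    return value, offset + 4
```

For a buffer holding a 5-byte owner handle followed by an integer, the next field is only read correctly from the offset returned by `decode_opaque`. A zero-length handle hides the problem, since skipping only the length word then lands on the same offset as skipping the whole field.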
    • S
      f2fs: fix to allow node segment for GC by ioctl path · c8aa27cf
      Sahitya Tummala committed
      [ Upstream commit 08ac9a3870f6babb2b1fff46118536ca8a71ef19 ]
      
      Allow node type segments also to be GC'd via f2fs ioctl
      F2FS_IOC_GARBAGE_COLLECT_RANGE.
      
      Signed-off-by: Sahitya Tummala <stummala@codeaurora.org>
      Reviewed-by: Chao Yu <yuchao0@huawei.com>
      Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      c8aa27cf
    • Y
      f2fs: change segment to section in f2fs_ioc_gc_range · 313f1fef
      Yunlong Song committed
      [ Upstream commit 67b0e42b768c9ddc3fd5ca1aee3db815cfaa635c ]
      
      f2fs_ioc_gc_range advances by blocks_per_seg each time; however, f2fs_gc
      moves a section's worth of blocks each time, so change the step from
      segment to section.
      
      Signed-off-by: Yunlong Song <yunlong.song@huawei.com>
      Reviewed-by: Chao Yu <yuchao0@huawei.com>
      Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      313f1fef
    • Y
      f2fs: fix count of seg_freed to make sec_freed correct · 859c93a0
      Yunlong Song committed
      [ Upstream commit d6c66cd19ef322fe0d51ba09ce1b7f386acab04a ]
      
      When sbi->segs_per_sec > 1, and if some segno has 0 valid blocks before
      gc starts, do_garbage_collect will skip counting seg_freed++, and this
      will cause seg_freed < sbi->segs_per_sec and finally skip sec_freed++.
      
      Signed-off-by: Yunlong Song <yunlong.song@huawei.com>
      Signed-off-by: Chao Yu <yuchao0@huawei.com>
      Reviewed-by: Chao Yu <yuchao0@huawei.com>
      Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      859c93a0
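The accounting problem in the seg_freed commit above can be shown with a small model. This is a sketch, not the f2fs code: `count_freed_segments` and `section_freed` are hypothetical names, each list entry is a segment's valid block count before GC, and the model assumes that migrating a non-empty segment always succeeds.

```python
from typing import List


def count_freed_segments(valid_blocks: List[int], count_empty: bool) -> int:
    """Count segments considered freed after collecting one section."""
    seg_freed = 0
    for blocks in valid_blocks:
        if blocks == 0:
            # Segment was already empty before GC started.
            if count_empty:
                seg_freed += 1  # fixed behaviour: still counts as freed
            continue            # buggy behaviour: skipped without counting
        seg_freed += 1          # assumption: migration of valid blocks succeeds
    return seg_freed


def section_freed(valid_blocks: List[int], count_empty: bool) -> bool:
    """A section counts as freed only if every one of its segments does."""
    return count_freed_segments(valid_blocks, count_empty) == len(valid_blocks)
```

For a four-segment section with valid block counts `[0, 12, 7, 0]`, skipping the empty segments yields seg_freed of 2, so the section is not counted; counting them yields 4 and the section is counted as freed.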
    • C
      f2fs: fix to account preflush command for noflush_merge mode · c1054aeb
      Chao Yu committed
      [ Upstream commit a8075dc484cf10ebdb07bee2b17322fb0a846309 ]
      
      Previously, we only accounted preflush commands for flush_merge mode,
      so for noflush_merge mode we could not know the in-flight preflush
      command count. Fix it.
      
      Signed-off-by: Chao Yu <yuchao0@huawei.com>
      Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      c1054aeb
    • D
      iomap: readpages doesn't zero page tail beyond EOF · c3a62b65
      Dave Chinner committed
      [ Upstream commit 8c110d43c6bca4b24dd13272a9d4e0ba6f2ec957 ]
      
      When we read the EOF page of the file via readpages, we need
      to zero the region beyond EOF that we either do not read or
      should not contain data so that mmap does not expose stale data to
      user applications.
      
      However, iomap_adjust_read_range() fails to detect EOF correctly,
      and so fsx on 1k block size filesystems fails very quickly with
      mapreads exposing data beyond EOF. There are two problems here.
      
      Firstly, when calculating the end block of the EOF byte, we have
      to subtract one from the size to avoid a block aligned EOF reporting
      a block too large, i.e. a size of 1024 bytes is 1 block, which in
      index terms is block 0. Therefore we have to calculate the end block
      from (isize - 1), not isize.
      
      The second bug is in determining if the current page spans EOF, and so
      whether we need to split it into two halves, one for the IO, and the
      other for zeroing. Unfortunately, the code that checks whether
      we should split the block doesn't actually check if we span EOF, it
      just checks if the read spans the /offset in the page/ that EOF
      sits on. So it splits every read into two if EOF is not page
      aligned, regardless of whether we are reading the EOF block or not.
      
      Hence we need to restrict the "does the read span EOF" check to
      just the page that spans EOF, not every page we read.
      
      This patch results in correct EOF detection through readpages:
      
      xfs_vm_readpages:     dev 259:0 ino 0x43 nr_pages 24
      xfs_iomap_found:      dev 259:0 ino 0x43 size 0x66c00 offset 0x4f000 count 98304 type hole startoff 0x13c startblock 1368 blockcount 0x4
      iomap_readpage_actor: orig pos 323584 pos 323584, length 4096, poff 0 plen 4096, isize 420864
      xfs_iomap_found:      dev 259:0 ino 0x43 size 0x66c00 offset 0x50000 count 94208 type hole startoff 0x140 startblock 1497 blockcount 0x5c
      iomap_readpage_actor: orig pos 327680 pos 327680, length 94208, poff 0 plen 4096, isize 420864
      iomap_readpage_actor: orig pos 331776 pos 331776, length 90112, poff 0 plen 4096, isize 420864
      iomap_readpage_actor: orig pos 335872 pos 335872, length 86016, poff 0 plen 4096, isize 420864
      iomap_readpage_actor: orig pos 339968 pos 339968, length 81920, poff 0 plen 4096, isize 420864
      iomap_readpage_actor: orig pos 344064 pos 344064, length 77824, poff 0 plen 4096, isize 420864
      iomap_readpage_actor: orig pos 348160 pos 348160, length 73728, poff 0 plen 4096, isize 420864
      iomap_readpage_actor: orig pos 352256 pos 352256, length 69632, poff 0 plen 4096, isize 420864
      iomap_readpage_actor: orig pos 356352 pos 356352, length 65536, poff 0 plen 4096, isize 420864
      iomap_readpage_actor: orig pos 360448 pos 360448, length 61440, poff 0 plen 4096, isize 420864
      iomap_readpage_actor: orig pos 364544 pos 364544, length 57344, poff 0 plen 4096, isize 420864
      iomap_readpage_actor: orig pos 368640 pos 368640, length 53248, poff 0 plen 4096, isize 420864
      iomap_readpage_actor: orig pos 372736 pos 372736, length 49152, poff 0 plen 4096, isize 420864
      iomap_readpage_actor: orig pos 376832 pos 376832, length 45056, poff 0 plen 4096, isize 420864
      iomap_readpage_actor: orig pos 380928 pos 380928, length 40960, poff 0 plen 4096, isize 420864
      iomap_readpage_actor: orig pos 385024 pos 385024, length 36864, poff 0 plen 4096, isize 420864
      iomap_readpage_actor: orig pos 389120 pos 389120, length 32768, poff 0 plen 4096, isize 420864
      iomap_readpage_actor: orig pos 393216 pos 393216, length 28672, poff 0 plen 4096, isize 420864
      iomap_readpage_actor: orig pos 397312 pos 397312, length 24576, poff 0 plen 4096, isize 420864
      iomap_readpage_actor: orig pos 401408 pos 401408, length 20480, poff 0 plen 4096, isize 420864
      iomap_readpage_actor: orig pos 405504 pos 405504, length 16384, poff 0 plen 4096, isize 420864
      iomap_readpage_actor: orig pos 409600 pos 409600, length 12288, poff 0 plen 4096, isize 420864
      iomap_readpage_actor: orig pos 413696 pos 413696, length 8192, poff 0 plen 4096, isize 420864
      iomap_readpage_actor: orig pos 417792 pos 417792, length 4096, poff 0 plen 3072, isize 420864
      iomap_readpage_actor: orig pos 420864 pos 420864, length 1024, poff 3072 plen 1024, isize 420864
      
      As you can see, it now does full page reads until the last one which
      is split correctly at the block aligned EOF, reading 3072 bytes and
      zeroing the last 1024 bytes. The original version of the patch got
      this right, but it got another case wrong.
      
      The EOF crossing detection really needs to use the original length,
      as plen, while it starts at the end of the block, will be shortened
      as up-to-date blocks are found on the page. This means "orig_pos +
      plen" no longer points to the end of the page, and so will not
      correctly detect EOF crossing. Hence we have to use the length
      passed in to detect this partial page case:
      
      xfs_filemap_fault:    dev 259:1 ino 0x43  write_fault 0
      xfs_vm_readpage:      dev 259:1 ino 0x43 nr_pages 1
      xfs_iomap_found:      dev 259:1 ino 0x43 size 0x2cc00 offset 0x2c000 count 4096 type hole startoff 0xb0 startblock 282 blockcount 0x4
      iomap_readpage_actor: orig pos 180224 pos 181248, length 4096, poff 1024 plen 2048, isize 183296
      xfs_iomap_found:      dev 259:1 ino 0x43 size 0x2cc00 offset 0x2cc00 count 1024 type hole startoff 0xb3 startblock 285 blockcount 0x1
      iomap_readpage_actor: orig pos 183296 pos 183296, length 1024, poff 3072 plen 1024, isize 183296
      
      Here we see a trace where the first block on the EOF page is up to
      date, hence poff = 1024 bytes. The offset into the page of EOF is
      3072, so the range we want to read is 1024 - 3071, and the range we
      want to zero is 3072 - 4095. You can see this is split correctly
      now.
      
      This fixes the stale data beyond EOF problem that fsx quickly
      uncovers on 1k block size filesystems.
      
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      c3a62b65
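The arithmetic in the readpages commit above can be checked with a short Python sketch. It is illustrative only: `last_block_index` and `split_eof_page` are hypothetical helpers, ranges are half-open byte offsets within the page, and the inputs are taken from the traces in the commit message.

```python
def last_block_index(isize: int, block_size: int) -> int:
    """Index of the block holding the last byte of a file of isize bytes."""
    if isize <= 0:
        raise ValueError("file must contain at least one byte")
    # Use isize - 1: a block aligned size must not report one block too many.
    return (isize - 1) // block_size


def split_eof_page(page_pos: int, isize: int, poff: int, page_size: int = 4096):
    """Split the page spanning EOF into (read_range, zero_range).

    poff is the first offset in the page that is not already up to date.
    """
    eof_off = isize - page_pos  # offset of EOF within this page
    if not 0 < eof_off < page_size:
        raise ValueError("page does not span EOF")
    if poff >= eof_off:
        raise ValueError("nothing left to read before EOF")
    return (poff, eof_off), (eof_off, page_size)
```

A 1024-byte file on 1k blocks ends in block 0, whereas dividing isize directly would give 1. For the second trace (page at 180224, isize 183296, poff 1024) the split is bytes 1024-3071 to read and 3072-4095 to zero, matching the ranges given in the commit message.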