1. 02 December 2020 (3 commits)
  2. 16 November 2020 (2 commits)
  3. 15 November 2020 (2 commits)
    • afs: Fix afs_write_end() when called with copied == 0 [ver #3] · 3ad216ee
      David Howells authored
      When afs_write_end() is called with copied == 0, it tries to set the
      dirty region, but there is no way to encode a 0-length region in the
      page->private encoding.
      
      "0,0", for example, indicates a 1-byte region at offset 0.  The maths
      miscalculates this and sets it incorrectly.
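      To make the limitation concrete, here is a standalone illustration of the
      inclusive (from, to) encoding described above (ordinary userspace C, not
      the afs helpers themselves):

         #include <stdio.h>

         /* With an inclusive (from, to) pair, the covered length is
          * to - from + 1, so the smallest representable region is one
          * byte and copied == 0 has no valid encoding. */
         static unsigned long region_len(unsigned long from, unsigned long to)
         {
                 return to - from + 1;
         }

         int main(void)
         {
                 printf("(0,0) covers %lu byte(s)\n", region_len(0, 0)); /* 1, not 0 */
                 return 0;
         }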
      
      Fix this by doing nothing except unlocking and putting the page in this
      case.  We don't actually need to mark the page dirty, since presumably
      nothing changed.
      
      Fixes: 65dd2d60 ("afs: Alter dirty range encoding in page->private")
      Signed-off-by: David Howells <dhowells@redhat.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ocfs2: initialize ip_next_orphan · f5785283
      Wengang Wang authored
      Though the problem was found on an older 4.1.12 kernel, I think upstream
      has the same issue.
      
      On one node in the cluster, there is the following call trace:
      
         # cat /proc/21473/stack
         __ocfs2_cluster_lock.isra.36+0x336/0x9e0 [ocfs2]
         ocfs2_inode_lock_full_nested+0x121/0x520 [ocfs2]
         ocfs2_evict_inode+0x152/0x820 [ocfs2]
         evict+0xae/0x1a0
         iput+0x1c6/0x230
         ocfs2_orphan_filldir+0x5d/0x100 [ocfs2]
         ocfs2_dir_foreach_blk+0x490/0x4f0 [ocfs2]
         ocfs2_dir_foreach+0x29/0x30 [ocfs2]
         ocfs2_recover_orphans+0x1b6/0x9a0 [ocfs2]
         ocfs2_complete_recovery+0x1de/0x5c0 [ocfs2]
         process_one_work+0x169/0x4a0
         worker_thread+0x5b/0x560
         kthread+0xcb/0xf0
         ret_from_fork+0x61/0x90
      
      The above stack is not reasonable; the final iput shouldn't happen in the
      ocfs2_orphan_filldir() function.  Looking at the code,
      
        2067         /* Skip inodes which are already added to recover list, since dio may
        2068          * happen concurrently with unlink/rename */
        2069         if (OCFS2_I(iter)->ip_next_orphan) {
        2070                 iput(iter);
        2071                 return 0;
        2072         }
        2073
      
      The logic assumes the inode is already in the recovery list when it sees
      that ip_next_orphan is non-NULL, so it skips this inode after dropping the
      reference that was taken in ocfs2_iget().
      
      However, if the inode were really in the recovery list, it would hold
      another reference and the iput() at line 2070 would not be the final iput
      (dropping the last reference).  So I don't think the inode is actually in
      the recovery list (there is no vmcore to confirm this).
      
      Note that ocfs2_queue_orphans(), though it does not show up in the call
      trace, holds the cluster lock on the orphan directory while looking up
      unlinked inodes.  Evicting the on-disk inode can involve a lot of I/O,
      which may take a long time to finish.  That means this node could hold the
      cluster lock for a very long time, which can leave lock requests (from
      other nodes) on the orphan directory hanging for a long time.
      
      Looking more closely at ip_next_orphan, I found it is not initialized when
      a new ocfs2_inode_info structure is allocated.
      
      This causes reflink operations from some nodes to hang for a very long
      time waiting for the cluster lock on the orphan directory.
      
      Fix: initialize ip_next_orphan to NULL.
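      A minimal sketch of the idea, assuming the field is cleared where the
      ocfs2_inode_info is allocated (the exact upstream function body may
      differ):

         /* Sketch: clear ip_next_orphan when a fresh ocfs2_inode_info is set
          * up, so stale slab contents can never make the inode look as if it
          * were already queued for orphan recovery. */
         static struct inode *ocfs2_alloc_inode(struct super_block *sb)
         {
                 struct ocfs2_inode_info *oi;

                 oi = kmem_cache_alloc(ocfs2_inode_cachep, GFP_NOFS);
                 if (!oi)
                         return NULL;

                 /* ... existing initialization ... */
                 oi->ip_next_orphan = NULL;      /* the missing initialization */

                 return &oi->vfs_inode;
         }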
      Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Cc: Mark Fasheh <mark@fasheh.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: Changwei Ge <gechangwei@live.cn>
      Cc: Gang He <ghe@suse.com>
      Cc: Jun Piao <piaojun@huawei.com>
      Cc: <stable@vger.kernel.org>
      Link: https://lkml.kernel.org/r/20201109171746.27884-1-wen.gang.wang@oracle.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  4. 13 November 2020 (2 commits)
    • gfs2: Fix case in which ail writes are done to jdata holes · 4e79e3f0
      Bob Peterson authored
      Patch b2a846db ("gfs2: Ignore journal log writes for jdata holes")
      tried (unsuccessfully) to fix a case in which writes were done to jdata
      blocks, the blocks were sent to the ail list, and then a punch_hole or
      truncate operation caused the blocks to be freed.  In other words, the
      ail items are for jdata holes.  Before b2a846db, such a jdata hole caused
      function gfs2_block_map to return -EIO, which was eventually interpreted
      as an I/O error to the journal, leading to a withdraw.
      
      This patch changes function gfs2_get_block_noalloc, which is only used
      for jdata writes, so it returns -ENODATA rather than -EIO, and when
      -ENODATA is returned to gfs2_ail1_start_one, the error is ignored.
      We can safely ignore it because gfs2_ail1_start_one is only called
      when the jdata pages have already been written and truncated, so the
      ail1 content no longer applies.
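      A sketch of the resulting logic, reconstructed from the description above
      (function names as given in the message; the exact upstream code may
      differ):

         /* Sketch: the jdata-only block lookup reports a hole as -ENODATA
          * rather than -EIO; gfs2_ail1_start_one then treats -ENODATA as
          * "nothing to write" instead of as a journal I/O error. */
         static int gfs2_get_block_noalloc(struct inode *inode, sector_t lblock,
                                           struct buffer_head *bh_result, int create)
         {
                 int error = gfs2_block_map(inode, lblock, bh_result, 0);

                 if (error)
                         return error;
                 if (!buffer_mapped(bh_result))
                         return -ENODATA;        /* jdata hole, not an I/O error */
                 return 0;
         }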
      Signed-off-by: Bob Peterson <rpeterso@redhat.com>
      Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
    • Revert "gfs2: Ignore journal log writes for jdata holes" · d3039c06
      Bob Peterson authored
      This reverts commit b2a846db.
      
      That commit changed the behavior of function gfs2_block_map to return
      -ENODATA in cases where a hole (IOMAP_HOLE) is encountered and create is
      false.  While that fixed the intended problem for jdata, it also broke
      other callers of gfs2_block_map such as some jdata block reads.  Before
      the patch, an encountered hole would be skipped and the buffer seen as
      unmapped by the caller.  The patch changed the behavior to return
      -ENODATA, which is interpreted as an error by the caller.
      
      The -ENODATA return code should be restricted to the specific case where
      jdata holes are encountered during ail1 writes.  That will be done in a
      later patch.
      Signed-off-by: Bob Peterson <rpeterso@redhat.com>
      Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
  5. 12 November 2020 (10 commits)
  6. 11 November 2020 (7 commits)
    • vfs: move __sb_{start,end}_write* to fs.h · 9b852342
      Darrick J. Wong authored
      Now that we've straightened out the callers, move these three functions
      to fs.h since they're fairly trivial.
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Jan Kara <jack@suse.cz>
    • vfs: separate __sb_start_write into blocking and non-blocking helpers · 8a3c84b6
      Darrick J. Wong authored
      Break this function into two helpers so that it's obvious that the
      trylock versions return a value that must be checked, and the blocking
      versions don't require that.  While we're at it, clean up the return
      type mismatch.
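      A sketch of the resulting pair of helpers, roughly as they later appear in
      include/linux/fs.h (treat the exact bodies as an assumption):

         /* Blocking variant: always succeeds, so there is nothing to check. */
         static inline void __sb_start_write(struct super_block *sb, int level)
         {
                 percpu_down_read(sb->s_writers.rw_sem + level - 1);
         }

         /* Non-blocking variant: returns bool, and the caller must check it. */
         static inline bool __sb_start_write_trylock(struct super_block *sb, int level)
         {
                 return percpu_down_read_trylock(sb->s_writers.rw_sem + level - 1);
         }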
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
    • vfs: remove lockdep bogosity in __sb_start_write · 22843291
      Darrick J. Wong authored
      __sb_start_write has some weird looking lockdep code that claims to
      exist to handle nested freeze locking requests from xfs.  The code as
      written seems broken -- if we think we hold a read lock on any of the
      higher freeze levels (e.g. we hold SB_FREEZE_WRITE and are trying to
      lock SB_FREEZE_PAGEFAULT), it converts a blocking lock attempt into a
      trylock.
      
      However, it's not correct to downgrade a blocking lock attempt to a
      trylock unless the downgrading code or the callers are prepared to deal
      with that situation.  Neither __sb_start_write nor its callers handle
      this at all.  For example:
      
      sb_start_pagefault ignores the return value completely, with the result
      that if xfs_filemap_fault loses a race with a different thread trying to
      fsfreeze, it will proceed without pagefault freeze protection (thereby
      breaking locking rules) and then unlock the pagefault freeze lock that it
      doesn't own on its way out (thereby corrupting the lock state), which
      leads to a system hang shortly afterwards.
      
      Normally, this won't happen because our ownership of a read lock on a
      higher freeze protection level blocks fsfreeze from grabbing a write
      lock on that higher level.  *However*, if lockdep is offline,
      lock_is_held_type unconditionally returns 1, which means that
      percpu_rwsem_is_held returns 1, which means that __sb_start_write
      unconditionally converts blocking freeze lock attempts into trylocks,
      even when we *don't* hold anything that would block a fsfreeze.
      
      Apparently this all held together until 5.10-rc1, when bugs in lockdep
      caused lockdep to shut itself off early in an fstests run, and once
      fstests gets to the "race writes with freezer" tests, kaboom.  This
      might explain the long trail of vanishingly infrequent livelocks in
      fstests after lockdep goes offline that I've never been able to
      diagnose.
      
      We could fix it by spinning on the trylock if wait==true, but AFAICT the
      locking works fine if lockdep is not built at all (and I didn't see any
      complaints running fstests overnight), so remove this snippet entirely.
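      For reference, a sketch of the removed pattern, reconstructed from the
      description above (not a verbatim quote of the old code):

         #ifdef CONFIG_LOCKDEP
                 /* If lockdep thinks we already hold a read lock on one of the
                  * other freeze levels (e.g. SB_FREEZE_WRITE while acquiring
                  * SB_FREEZE_PAGEFAULT), silently downgrade the blocking
                  * acquisition to a trylock.  Once lockdep has shut itself off,
                  * percpu_rwsem_is_held() always reports "held", so every
                  * blocking attempt becomes a trylock. */
                 if (wait) {
                         int i;

                         for (i = 0; i < level - 1; i++)
                                 if (percpu_rwsem_is_held(sb->s_writers.rw_sem + i)) {
                                         force_trylock = true;
                                         break;
                                 }
                 }
         #endif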
      
      NOTE: Commit f4b554af in 2015 created the current weird logic (which
      used to exist in a different form in commit 5accdf82 from 2012) in
      __sb_start_write.  XFS solved this whole problem in the late 2.6 era by
      creating a variant of transactions (XFS_TRANS_NO_WRITECOUNT) that don't
      grab intwrite freeze protection, thus making lockdep's solution
      unnecessary.  The commit claims that Dave Chinner explained that the
      trylock hack + comment could be removed, but nobody ever did.
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Jan Kara <jack@suse.cz>
    • xfs: fix brainos in the refcount scrubber's rmap fragment processor · 54e9b09e
      Darrick J. Wong authored
      Fix some serious WTF in the reference count scrubber's rmap fragment
      processing.  The code comment says that this loop is supposed to move
      all fragment records starting at or before bno onto the worklist, but
      there's no obvious reason why nr (the number of items added) should
      increment starting from 1, and breaking the loop when we've added the
      target number seems dubious since we could have more rmap fragments that
      should have been added to the worklist.
      
      This seems to manifest in xfs/411 when adding one to the refcount field.
      
      Fixes: dbde19da ("xfs: cross-reference the rmapbt data with the refcountbt")
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
    • xfs: fix rmap key and record comparison functions · 6ff646b2
      Darrick J. Wong authored
      Keys for extent interval records in the reverse mapping btree are
      supposed to be computed as follows:
      
      (physical block, owner, fork, is_btree, is_unwritten, offset)
      
      This provides users the ability to look up a reverse mapping from a bmbt
      record -- start with the physical block; then if there are multiple
      records for the same block, move on to the owner; then the inode fork
      type; and so on to the file offset.
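      To make the intended ordering concrete, here is a plain C illustration of
      a comparison over the six fields listed above (a standalone illustration
      with hypothetical names, not the on-disk xfs structures):

         #include <stdbool.h>
         #include <stdint.h>

         /* Hypothetical flattened key, mirroring the tuple above. */
         struct rmap_key {
                 uint64_t phys_block;
                 uint64_t owner;
                 bool     attr_fork;
                 bool     bmbt_block;
                 bool     unwritten;
                 uint64_t offset;
         };

         /* Compare field by field, in key order; <0, 0, >0 like memcmp(). */
         static int rmap_key_cmp(const struct rmap_key *a, const struct rmap_key *b)
         {
                 if (a->phys_block != b->phys_block)
                         return a->phys_block < b->phys_block ? -1 : 1;
                 if (a->owner != b->owner)
                         return a->owner < b->owner ? -1 : 1;
                 if (a->attr_fork != b->attr_fork)
                         return a->attr_fork < b->attr_fork ? -1 : 1;
                 if (a->bmbt_block != b->bmbt_block)
                         return a->bmbt_block < b->bmbt_block ? -1 : 1;
                 if (a->unwritten != b->unwritten)
                         return a->unwritten < b->unwritten ? -1 : 1;
                 if (a->offset != b->offset)
                         return a->offset < b->offset ? -1 : 1;
                 return 0;
         }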
      
      However, the key comparison functions incorrectly remove the
      fork/btree/unwritten information that's encoded in the on-disk offset.
      This means that lookup comparisons are only done with:
      
      (physical block, owner, offset)
      
      This means that queries can return incorrect results.  On consistent
      filesystems this hasn't been an issue because blocks are never shared
      between forks or with bmbt blocks; and are never unwritten.  However,
      this bug means that online repair cannot always detect corruption in the
      key information in internal rmapbt nodes.
      
      Found by fuzzing keys[1].attrfork = ones on xfs/371.
      
      Fixes: 4b8ed677 ("xfs: add rmap btree operations")
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
    • xfs: set the unwritten bit in rmap lookup flags in xchk_bmap_get_rmapextents · 5dda3897
      Darrick J. Wong authored
      When the bmbt scrubber is looking up rmap extents, we need to set the
      extent flags from the bmbt record fully.  This will matter once we fix
      the rmap btree comparison functions to check those flags correctly.
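      A sketch of the kind of change described, assuming the scrubber carries a
      local flags variable (here called rflags, a hypothetical name) into the
      rmap lookup (the exact upstream code may differ):

         /* Sketch: propagate the unwritten state of the bmbt extent into the
          * flags used for the rmap lookup, so the comparison functions see
          * the same key that the rmapbt stores. */
         if (irec->br_state == XFS_EXT_UNWRITTEN)
                 rflags |= XFS_RMAP_UNWRITTEN;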
      
      Fixes: d852657c ("xfs: cross-reference reverse-mapping btree")
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
    • xfs: fix flags argument to rmap lookup when converting shared file rmaps · ea843989
      Darrick J. Wong authored
      Pass the same oldext argument (which contains the existing rmapping's
      unwritten state) to xfs_rmap_lookup_le_range at the start of
      xfs_rmap_convert_shared.  At this point in the code, flags is zero,
      which means that we perform lookups using the wrong key.
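      A sketch of the corrected lookup, based on the description above (the
      out-parameters &PREV and &found_rec are stand-in names; treat the exact
      call site as an assumption):

         /* Sketch: look up the existing rmapping with its real unwritten state
          * (oldext) instead of 'flags', which is still zero at this point. */
         error = xfs_rmap_lookup_le_range(cur, bno, owner, offset,
                                          oldext /* was: flags */,
                                          &PREV, &found_rec);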
      
      Fixes: 3f165b33 ("xfs: convert unwritten status of reverse mappings for shared files")
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
  7. 07 November 2020 (14 commits)