1. 07 1月, 2022 9 次提交
  2. 27 12月, 2021 3 次提交
    • D
      xfs: xfs_log_force_lsn isn't passed a LSN · e100c699
      Dave Chinner 提交于
      mainline-inclusion
      from mainline-v5.13-rc4
      commit 5f9b4b0d
      category: bugfix
      bugzilla: https://gitee.com/openeuler/kernel/issues/I4KIAO
      CVE: NA
      
      Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=5f9b4b0de8dc2fb8eb655463b438001c111570fe
      
      -------------------------------------------------
      
      In doing an investigation into AIL push stalls, I was looking at the
      log force code to see if an async CIL push could be done instead.
      This lead me to xfs_log_force_lsn() and looking at how it works.
      
      xfs_log_force_lsn() is only called from inode synchronisation
      contexts such as fsync(), and it takes the ip->i_itemp->ili_last_lsn
      value as the LSN to sync the log to. This gets passed to
      xlog_cil_force_lsn() via xfs_log_force_lsn() to flush the CIL to the
      journal, and then used by xfs_log_force_lsn() to flush the iclogs to
      the journal.
      
      The problem is that ip->i_itemp->ili_last_lsn does not store a
      log sequence number. What it stores is passed to it from the
      ->iop_committing method, which is called by xfs_log_commit_cil().
      The value this passes to the iop_committing method is the CIL
      context sequence number that the item was committed to.
      
      As it turns out, xlog_cil_force_lsn() converts the sequence to an
      actual commit LSN for the related context and returns that to
      xfs_log_force_lsn(). xfs_log_force_lsn() overwrites it's "lsn"
      variable that contained a sequence with an actual LSN and then uses
      that to sync the iclogs.
      
      This caused me some confusion for a while, even though I originally
      wrote all this code a decade ago. ->iop_committing is only used by
      a couple of log item types, and only inode items use the sequence
      number it is passed.
      
      Let's clean up the API, CIL structures and inode log item to call it
      a sequence number, and make it clear that the high level code is
      using CIL sequence numbers and not on-disk LSNs for integrity
      synchronisation purposes.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      Reviewed-by: NDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: NAllison Henderson <allison.henderson@oracle.com>
      Signed-off-by: NDarrick J. Wong <djwong@kernel.org>
      Signed-off-by: NGuo Xuenan <guoxuenan@huawei.com>
      Reviewed-by: NLihong Kou <koulihong@huawei.com>
      Reviewed-by: NZhang Yi <yi.zhang@huawei.com>
      Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
      e100c699
    • D
      xfs: update superblock counters correctly for !lazysbcount · 20560d6e
      Dave Chinner 提交于
      mainline-inclusion
      from mainline-v5.12-rc4
      commit 6543990a
      category: bugfix
      bugzilla: https://gitee.com/openeuler/kernel/issues/I4KIAO
      CVE: NA
      
      Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=6543990a168acf366f4b6174d7bd46ba15a8a2a6
      
      -------------------------------------------------
      
      Keep the mount superblock counters up to date for !lazysbcount
      filesystems so that when we log the superblock they do not need
      updating in any way because they are already correct.
      
      It's found by what Zorro reported:
      1. mkfs.xfs -f -l lazy-count=0 -m crc=0 $dev
      2. mount $dev $mnt
      3. fsstress -d $mnt -p 100 -n 1000 (maybe need more or less io load)
      4. umount $mnt
      5. xfs_repair -n $dev
      and I've seen no problem with this patch.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reported-by: NZorro Lang <zlang@redhat.com>
      Reviewed-by: NGao Xiang <hsiangkao@redhat.com>
      Signed-off-by: NGao Xiang <hsiangkao@redhat.com>
      Reviewed-by: NDarrick J. Wong <djwong@kernel.org>
      Signed-off-by: NDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      Signed-off-by: NGuo Xuenan <guoxuenan@huawei.com>
      Reviewed-by: NLihong Kou <koulihong@huawei.com>
      Reviewed-by: NZhang Yi <yi.zhang@huawei.com>
      Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
      20560d6e
    • D
      xfs: remove obsolete AGF counter debugging · dc0c7ae0
      Darrick J. Wong 提交于
      mainline-inclusion
      from mainline-v5.12-rc4
      commit 1aec7c3d
      category: bugfix
      bugzilla: https://gitee.com/openeuler/kernel/issues/I4KIAO
      CVE: NA
      
      Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=1aec7c3d05670b92b7339b19999009a93808efb9
      
      -------------------------------------------------
      
      In commit f8f2835a we changed the behavior of XFS to use EFIs to
      remove blocks from an overfilled AGFL because there were complaints
      about transaction overruns that stemmed from trying to free multiple
      blocks in a single transaction.
      
      Unfortunately, that commit missed a subtlety in the debug-mode
      transaction accounting when a realtime volume is attached.  If a
      realtime file undergoes a data fork mapping change such that realtime
      extents are allocated (or freed) in the same transaction that a data
      device block is also allocated (or freed), we can trip a debugging
      assertion.  This can happen (for example) if a realtime extent is
      allocated and it is necessary to reshape the bmbt to hold the new
      mapping.
      
      When we go to allocate a bmbt block from an AG, the first thing the data
      device block allocator does is ensure that the freelist is the proper
      length.  If the freelist is too long, it will trim the freelist to the
      proper length.
      
      In debug mode, trimming the freelist calls xfs_trans_agflist_delta() to
      record the decrement in the AG free list count.  Prior to f8f28 we would
      put the free block back in the free space btrees in the same
      transaction, which calls xfs_trans_agblocks_delta() to record the
      increment in the AG free block count.  Since AGFL blocks are included in
      the global free block count (fdblocks), there is no corresponding
      fdblocks update, so the AGFL free satisfies the following condition in
      xfs_trans_apply_sb_deltas:
      
      	/*
      	 * Check that superblock mods match the mods made to AGF counters.
      	 */
      	ASSERT((tp->t_fdblocks_delta + tp->t_res_fdblocks_delta) ==
      	       (tp->t_ag_freeblks_delta + tp->t_ag_flist_delta +
      		tp->t_ag_btree_delta));
      
      The comparison here used to be: (X + 0) == ((X+1) + -1 + 0), where X is
      the number blocks that were allocated.
      
      After commit f8f28 we defer the block freeing to the next chained
      transaction, which means that the calls to xfs_trans_agflist_delta and
      xfs_trans_agblocks_delta occur in separate transactions.  The (first)
      transaction that shortens the free list trips on the comparison, which
      has now become:
      
      (X + 0) == ((X) + -1 + 0)
      
      because we haven't freed the AGFL block yet; we've only logged an
      intention to free it.  When the second transaction (the deferred free)
      commits, it will evaluate the expression as:
      
      (0 + 0) == (1 + 0 + 0)
      
      and trip over that in turn.
      
      At this point, the astute reader may note that the two commits tagged by
      this patch have been in the kernel for a long time but haven't generated
      any bug reports.  How is it that the author became aware of this bug?
      
      This originally surfaced as an intermittent failure when I was testing
      realtime rmap, but a different bug report by Zorro Lang reveals the same
      assertion occuring on !lazysbcount filesystems.
      
      The common factor to both reports (and why this problem wasn't
      previously reported) becomes apparent if we consider when
      xfs_trans_apply_sb_deltas is called by __xfs_trans_commit():
      
      	if (tp->t_flags & XFS_TRANS_SB_DIRTY)
      		xfs_trans_apply_sb_deltas(tp);
      
      With a modern lazysbcount filesystem, transactions update only the
      percpu counters, so they don't need to set XFS_TRANS_SB_DIRTY, hence
      xfs_trans_apply_sb_deltas is rarely called.
      
      However, updates to the count of free realtime extents are not part of
      lazysbcount, so XFS_TRANS_SB_DIRTY will be set on transactions adding or
      removing data fork mappings to realtime files; similarly,
      XFS_TRANS_SB_DIRTY is always set on !lazysbcount filesystems.
      
      Dave mentioned in response to an earlier version of this patch:
      
      "IIUC, what you are saying is that this debug code is simply not
      exercised in normal testing and hasn't been for the past decade?  And it
      still won't be exercised on anything other than realtime device testing?
      
      "...it was debugging code from 1994 that was largely turned into dead
      code when lazysbcounters were introduced in 2007. Hence I'm not sure it
      holds any value anymore."
      
      This debugging code isn't especially helpful - you can modify the
      flcount on one AG and the freeblks of another AG, and it won't trigger.
      Add the fact that nobody noticed for a decade, and let's just get rid of
      it (and start testing realtime :P).
      
      This bug was found by running generic/051 on either a V4 filesystem
      lacking lazysbcount; or a V5 filesystem with a realtime volume.
      
      Cc: bfoster@redhat.com, zlang@redhat.com
      Fixes: f8f2835a ("xfs: defer agfl block frees when dfops is available")
      Signed-off-by: NDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      Signed-off-by: NGuo Xuenan <guoxuenan@huawei.com>
      Reviewed-by: NLihong Kou <koulihong@huawei.com>
      Reviewed-by: NZhang Yi <yi.zhang@huawei.com>
      Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
      dc0c7ae0
  3. 26 9月, 2020 1 次提交
  4. 16 9月, 2020 1 次提交
  5. 29 7月, 2020 1 次提交
  6. 07 7月, 2020 1 次提交
    • B
      xfs: preserve rmapbt swapext block reservation from freed blocks · f74681ba
      Brian Foster 提交于
      The rmapbt extent swap algorithm remaps individual extents between
      the source inode and the target to trigger reverse mapping metadata
      updates. If either inode straddles a format or other bmap allocation
      boundary, the individual unmap and map cycles can trigger repeated
      bmap block allocations and frees as the extent count bounces back
      and forth across the boundary. While net block usage is bound across
      the swap operation, this behavior can prematurely exhaust the
      transaction block reservation because it continuously drains as the
      transaction rolls. Each allocation accounts against the reservation
      and each free returns to global free space on transaction roll.
      
      The previous workaround to this problem attempted to detect this
      boundary condition and provide surplus block reservation to
      acommodate it. This is insufficient because more remaps can occur
      than implied by the extent counts; if start offset boundaries are
      not aligned between the two inodes, for example.
      
      To address this problem more generically and dynamically, add a
      transaction accounting mode that returns freed blocks to the
      transaction reservation instead of the superblock counters on
      transaction roll and use it when the rmapbt based algorithm is
      active. This allows the chain of remap transactions to preserve the
      block reservation based own its own frees and prevent premature
      exhaustion regardless of the remap pattern. Note that this is only
      safe for superblocks with lazy sb accounting, but the latter is
      required for v5 supers and the rmap feature depends on v5.
      
      Fixes: b3fed434 ("xfs: account format bouncing into rmapbt swapext tx reservation")
      Root-caused-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NBrian Foster <bfoster@redhat.com>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      f74681ba
  7. 27 5月, 2020 3 次提交
    • D
      xfs: remove the m_active_trans counter · b41b46c2
      Dave Chinner 提交于
      It's a global atomic counter, and we are hitting it at a rate of
      half a million transactions a second, so it's bouncing the counter
      cacheline all over the place on large machines. We don't actually
      need it anymore - it used to be required because the VFS freeze code
      could not track/prevent filesystem transactions that were running,
      but that problem no longer exists.
      
      Hence to remove the counter, we simply have to ensure that nothing
      calls xfs_sync_sb() while we are trying to quiesce the filesytem.
      That only happens if the log worker is still running when we call
      xfs_quiesce_attr(). The log worker is cancelled at the end of
      xfs_quiesce_attr() by calling xfs_log_quiesce(), so just call it
      early here and then we can remove the counter altogether.
      
      Concurrent create, 50 million inodes, identical 16p/16GB virtual
      machines on different physical hosts. Machine A has twice the CPU
      cores per socket of machine B:
      
      		unpatched	patched
      machine A:	3m16s		2m00s
      machine B:	4m04s		4m05s
      
      Create rates:
      		unpatched	patched
      machine A:	282k+/-31k	468k+/-21k
      machine B:	231k+/-8k	233k+/-11k
      
      Concurrent rm of same 50 million inodes:
      
      		unpatched	patched
      machine A:	6m42s		2m33s
      machine B:	4m47s		4m47s
      
      The transaction rate on the fast machine went from just under
      300k/sec to 700k/sec, which indicates just how much of a bottleneck
      this atomic counter was.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      b41b46c2
    • D
      xfs: reduce free inode accounting overhead · f18c9a90
      Dave Chinner 提交于
      Shaokun Zhang reported that XFS was using substantial CPU time in
      percpu_count_sum() when running a single threaded benchmark on
      a high CPU count (128p) machine from xfs_mod_ifree(). The issue
      is that the filesystem is empty when the benchmark runs, so inode
      allocation is running with a very low inode free count.
      
      With the percpu counter batching, this means comparisons when the
      counter is less that 128 * 256 = 32768 use the slow path of adding
      up all the counters across the CPUs, and this is expensive on high
      CPU count machines.
      
      The summing in xfs_mod_ifree() is only used to fire an assert if an
      underrun occurs. The error is ignored by the higher level code.
      Hence this is really just debug code and we don't need to run it
      on production kernels, nor do we need such debug checks to return
      error values just to trigger an assert.
      
      Finally, xfs_mod_icount/xfs_mod_ifree are only called from
      xfs_trans_unreserve_and_mod_sb(), so get rid of them and just
      directly call the percpu_counter_add/percpu_counter_compare
      functions. The compare functions are now run only on debug builds as
      they are internal to ASSERT() checks and so only compiled in when
      ASSERTs are active (CONFIG_XFS_DEBUG=y or CONFIG_XFS_WARN=y).
      Reported-by: NShaokun Zhang <zhangshaokun@hisilicon.com>
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      f18c9a90
    • D
      xfs: gut error handling in xfs_trans_unreserve_and_mod_sb() · dc3ffbb1
      Dave Chinner 提交于
      xfs: gut error handling in xfs_trans_unreserve_and_mod_sb()
      
      From: Dave Chinner <dchinner@redhat.com>
      
      The error handling in xfs_trans_unreserve_and_mod_sb() is largely
      incorrect - rolling back the changes in the transaction if only one
      counter underruns makes all the other counters incorrect. We still
      allow the change to proceed and committing the transaction, except
      now we have multiple incorrect counters instead of a single
      underflow.
      
      Further, we don't actually report the error to the caller, so this
      is completely silent except on debug kernels that will assert on
      failure before we even get to the rollback code.  Hence this error
      handling is broken, untested, and largely unnecessary complexity.
      
      Just remove it.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      dc3ffbb1
  8. 27 3月, 2020 2 次提交
  9. 26 3月, 2020 1 次提交
    • D
      xfs: prohibit fs freezing when using empty transactions · 27fb5a72
      Darrick J. Wong 提交于
      I noticed that fsfreeze can take a very long time to freeze an XFS if
      there happens to be a GETFSMAP caller running in the background.  I also
      happened to notice the following in dmesg:
      
      ------------[ cut here ]------------
      WARNING: CPU: 2 PID: 43492 at fs/xfs/xfs_super.c:853 xfs_quiesce_attr+0x83/0x90 [xfs]
      Modules linked in: xfs libcrc32c ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 ip_set_hash_ip ip_set_hash_net xt_tcpudp xt_set ip_set_hash_mac ip_set nfnetlink ip6table_filter ip6_tables bfq iptable_filter sch_fq_codel ip_tables x_tables nfsv4 af_packet [last unloaded: xfs]
      CPU: 2 PID: 43492 Comm: xfs_io Not tainted 5.6.0-rc4-djw #rc4
      Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.10.2-1ubuntu1 04/01/2014
      RIP: 0010:xfs_quiesce_attr+0x83/0x90 [xfs]
      Code: 7c 07 00 00 85 c0 75 22 48 89 df 5b e9 96 c1 00 00 48 c7 c6 b0 2d 38 a0 48 89 df e8 57 64 ff ff 8b 83 7c 07 00 00 85 c0 74 de <0f> 0b 48 89 df 5b e9 72 c1 00 00 66 90 0f 1f 44 00 00 41 55 41 54
      RSP: 0018:ffffc900030f3e28 EFLAGS: 00010202
      RAX: 0000000000000001 RBX: ffff88802ac54000 RCX: 0000000000000000
      RDX: 0000000000000000 RSI: ffffffff81e4a6f0 RDI: 00000000ffffffff
      RBP: ffff88807859f070 R08: 0000000000000001 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000010 R12: 0000000000000000
      R13: ffff88807859f388 R14: ffff88807859f4b8 R15: ffff88807859f5e8
      FS:  00007fad1c6c0fc0(0000) GS:ffff88807e000000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00007f0c7d237000 CR3: 0000000077f01003 CR4: 00000000001606a0
      Call Trace:
       xfs_fs_freeze+0x25/0x40 [xfs]
       freeze_super+0xc8/0x180
       do_vfs_ioctl+0x70b/0x750
       ? __fget_files+0x135/0x210
       ksys_ioctl+0x3a/0xb0
       __x64_sys_ioctl+0x16/0x20
       do_syscall_64+0x50/0x1a0
       entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      These two things appear to be related.  The assertion trips when another
      thread initiates a fsmap request (which uses an empty transaction) after
      the freezer waited for m_active_trans to hit zero but before the the
      freezer executes the WARN_ON just prior to calling xfs_log_quiesce.
      
      The lengthy delays in freezing happen because the freezer calls
      xfs_wait_buftarg to clean out the buffer lru list.  Meanwhile, the
      GETFSMAP caller is continuing to grab and release buffers, which means
      that it can take a very long time for the buffer lru list to empty out.
      
      We fix both of these races by calling sb_start_write to obtain freeze
      protection while using empty transactions for GETFSMAP and for metadata
      scrubbing.  The other two users occur during mount, during which time we
      cannot fs freeze.
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      27fb5a72
  10. 12 3月, 2020 1 次提交
  11. 19 11月, 2019 1 次提交
  12. 27 8月, 2019 1 次提交
  13. 01 7月, 2019 1 次提交
  14. 29 6月, 2019 7 次提交
  15. 12 6月, 2019 1 次提交
  16. 29 9月, 2018 1 次提交
    • D
      xfs: avoid lockdep false positives in xfs_trans_alloc · 8683edb7
      Dave Chinner 提交于
      We've had a few reports of lockdep tripping over memory reclaim
      context vs filesystem freeze "deadlocks". They all have looked
      to be false positives on analysis, but it seems that they are
      being tripped because we take freeze references before we run
      a GFP_KERNEL allocation for the struct xfs_trans.
      
      We can avoid this false positive vector just by re-ordering the
      operations in xfs_trans_alloc(). That is. we need allocate the
      structure before we take the freeze reference and enter the GFP_NOFS
      allocation context that follows the xfs_trans around. This prevents
      lockdep from seeing the GFP_KERNEL allocation inside the transaction
      context, and that prevents it from triggering the freeze level vs
      alloc context vs reclaim warnings.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      8683edb7
  17. 03 8月, 2018 5 次提交