1. 13 Apr 2020, 1 commit
  2. 03 Mar 2020, 1 commit
  3. 19 Nov 2019, 1 commit
  4. 14 Nov 2019, 1 commit
  5. 27 Aug 2019, 1 commit
  6. 29 Jun 2019, 2 commits
  7. 27 Apr 2019, 1 commit
  8. 17 Apr 2019, 1 commit
    • xfs: implement per-inode writeback completion queues · cb357bf3
      Darrick J. Wong authored
      When scheduling writeback of dirty file data in the page cache, XFS uses
      IO completion workqueue items to ensure that filesystem metadata only
      updates after the write completes successfully.  This is essential for
      converting unwritten extents to real extents at the right time and
      performing COW remappings.
      
      Unfortunately, XFS queues each IO completion work item to an unbounded
      workqueue, which means that the kernel can spawn dozens of threads to
      try to handle the items quickly.  These threads need to take the ILOCK
      to update file metadata, which results in heavy ILOCK contention when a
      large number of the work items target a single file.
      
      Worse yet, the writeback completion threads get stuck waiting for the
      ILOCK while holding transaction reservations, which can use up all
      available log reservation space.  When that happens, metadata updates to
      other parts of the filesystem grind to a halt, even if the filesystem
      could otherwise have handled it.
      
      Even worse, if one of the things grinding to a halt happens to be a
      thread in the middle of a defer-ops finish holding the same ILOCK and
      trying to obtain more log reservation having exhausted the permanent
      reservation, we now have an ABBA deadlock - writeback completion has a
      transaction reserved and wants the ILOCK, and someone else has the ILOCK
      and wants a transaction reservation.
      
      Therefore, we create a per-inode writeback io completion queue + work
      item.  When writeback finishes, it can add the ioend to the per-inode
      queue and let the single work item process that queue.  This
      dramatically cuts down on the number of kworkers and ILOCK contention
      in the system, and seems to have eliminated an occasional deadlock I
      was seeing while running generic/476.  (A sketch of the queueing
      pattern follows this entry.)
      
      Testing with a program that simulates a heavy random-write workload to a
      single file demonstrates that the number of kworkers drops from
      approximately 120 threads per file to 1, without dramatically changing
      write bandwidth or pagecache access latency.
      
      Note that we leave the xfs-conv workqueue's max_active alone because we
      still want to be able to run ioend processing for as many inodes as the
      system can handle.
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      cb357bf3
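      Below is a minimal sketch of the per-inode completion-queue pattern this
      commit describes, not the actual XFS code.  The demo_* names are
      hypothetical; the real implementation hangs the list, lock, and work item
      off struct xfs_inode (initialised with INIT_LIST_HEAD/INIT_WORK at inode
      setup) and uses its own workqueue rather than system_wq.

      #include <linux/list.h>
      #include <linux/spinlock.h>
      #include <linux/workqueue.h>

      struct demo_inode {
      	spinlock_t		ioend_lock;	/* protects ioend_list */
      	struct list_head	ioend_list;	/* completed, unprocessed ioends */
      	struct work_struct	ioend_work;	/* the single per-inode worker */
      };

      struct demo_ioend {
      	struct list_head	io_list;
      	/* ... unwritten/COW conversion state ... */
      };

      /* Workqueue context: drain everything queued so far in one pass. */
      static void demo_end_io_work(struct work_struct *work)
      {
      	struct demo_inode *ip =
      		container_of(work, struct demo_inode, ioend_work);
      	struct demo_ioend *ioend, *next;
      	LIST_HEAD(completions);

      	spin_lock_irq(&ip->ioend_lock);
      	list_splice_init(&ip->ioend_list, &completions);
      	spin_unlock_irq(&ip->ioend_lock);

      	list_for_each_entry_safe(ioend, next, &completions, io_list) {
      		list_del(&ioend->io_list);
      		/* take the ILOCK once and finish the ioend here */
      	}
      }

      /* Bio completion (irq) context: queue the ioend, kick the worker. */
      static void demo_end_bio(struct demo_inode *ip, struct demo_ioend *ioend)
      {
      	unsigned long flags;

      	spin_lock_irqsave(&ip->ioend_lock, flags);
      	list_add_tail(&ioend->io_list, &ip->ioend_list);
      	spin_unlock_irqrestore(&ip->ioend_lock, flags);

      	queue_work(system_wq, &ip->ioend_work);
      }

      queue_work() on an already-pending item returns false and does nothing,
      so bursts of completions naturally batch into a single pass of the one
      worker, which is what collapses ~120 kworkers per file down to one.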
  9. 15 Apr 2019, 1 commit
  10. 30 Jul 2018, 2 commits
  11. 27 Jul 2018, 1 commit
  12. 07 Jun 2018, 1 commit
    • xfs: convert to SPDX license tags · 0b61f8a4
      Dave Chinner authored
      Remove the verbose license text from XFS files and replace it
      with SPDX tags. This does not change the license of any of the code;
      it merely refers to the common, up-to-date license files in LICENSES/.
      
      This change was mostly scripted. fs/xfs/Makefile and
      fs/xfs/libxfs/xfs_fs.h were modified by hand, the rest were detected
      and modified by the following command:
      
      for f in `git grep -l "GNU General" fs/xfs/` ; do
      	echo $f
      	cat $f | awk -f hdr.awk > $f.new
      	mv -f $f.new $f
      done
      
      And the hdr.awk script that did the modification (including
      detecting the difference between GPL-2.0 and GPL-2.0+ licenses)
      is as follows:
      
      $ cat hdr.awk
      BEGIN {
      	hdr = 1.0
      	tag = "GPL-2.0"
      	str = ""
      }
      
      /^ \* This program is free software/ {
      	hdr = 2.0;
      	next
      }
      
      /any later version./ {
      	tag = "GPL-2.0+"
      	next
      }
      
      /^ \*\// {
      	if (hdr > 0.0) {
      		print "// SPDX-License-Identifier: " tag
      		print str
      		print $0
      		str=""
      		hdr = 0.0
      		next
      	}
      	print $0
      	next
      }
      
      /^ \* / {
      	if (hdr > 1.0)
      		next
      	if (hdr > 0.0) {
      		if (str != "")
      			str = str "\n"
      		str = str $0
      		next
      	}
      	print $0
      	next
      }
      
      /^ \*/ {
      	if (hdr > 0.0)
      		next
      	print $0
      	next
      }
      
      // {
      	if (hdr > 0.0) {
      		if (str != "")
      			str = str "\n"
      		str = str $0
      		next
      	}
      	print $0
      }
      
      END { }
      $
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      0b61f8a4
  13. 16 May 2018, 1 commit
  14. 10 May 2018, 2 commits
    • xfs: log item flags are racy · 22525c17
      Dave Chinner authored
      The log item flags include a flag that is protected by the AIL
      lock: XFS_LI_IN_AIL. We use non-atomic RMW operations to set and
      clear these flags, but most of the updates and checks are not done
      with the AIL lock held and so are susceptible to update races.
      
      Fix this by changing the log item flags to use atomic bitops rather
      than relying on the AIL lock for update serialisation.  (A sketch of
      the two patterns follows this entry.)
      Signed-Off-By: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      22525c17
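      A minimal sketch of the conversion, using a hypothetical demo_log_item;
      the real patch likewise turns li_flags into an unsigned long and the
      flag values into bit numbers so they work with set_bit()/test_bit().

      #include <linux/bitops.h>
      #include <linux/types.h>

      #define DEMO_LI_IN_AIL	0		/* bit number, not a mask */

      struct demo_log_item {
      	unsigned long	li_flags;	/* unsigned long, as bitops require */
      };

      static void demo_set_in_ail(struct demo_log_item *lip)
      {
      	/*
      	 * Racy before the fix: a plain RMW can lose a concurrent update:
      	 *	lip->li_flags |= (1 << DEMO_LI_IN_AIL);
      	 */

      	/* Atomic: safe even when the AIL lock is not held. */
      	set_bit(DEMO_LI_IN_AIL, &lip->li_flags);
      }

      static bool demo_in_ail(struct demo_log_item *lip)
      {
      	return test_bit(DEMO_LI_IN_AIL, &lip->li_flags);
      }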
    • xfs: validate cached inodes are free when allocated · afca6c5b
      Dave Chinner authored
      A recent fuzzed filesystem image caused random dcache corruption
      when the reproducer was run. This often showed up as panics in
      lookup_slow() on a null inode->i_ops pointer when doing pathwalks.
      
      BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
      ....
      Call Trace:
       lookup_slow+0x44/0x60
       walk_component+0x3dd/0x9f0
       link_path_walk+0x4a7/0x830
       path_lookupat+0xc1/0x470
       filename_lookup+0x129/0x270
       user_path_at_empty+0x36/0x40
       path_listxattr+0x98/0x110
       SyS_listxattr+0x13/0x20
       do_syscall_64+0xf5/0x280
       entry_SYSCALL_64_after_hwframe+0x42/0xb7
      
      but had many different failure modes including deadlocks trying to
      lock the inode that was just allocated or KASAN reports of
      use-after-free violations.
      
      The cause of the problem was a corrupt INOBT on a v4 fs where the
      root inode was marked as free in the inobt record. Hence when we
      allocated an inode, it chose the root inode to allocate, found it in
      the cache and re-initialised it.
      
      We recently fixed a similar inode allocation issue caused by inobt
      record corruption in xfs_iget_cache_miss() in commit
      ee457001 ("xfs: catch inode allocation state mismatch
      corruption"). This change adds similar checks to the cache-hit path
      to catch it, and turns the reproducer into a corruption shutdown
      situation.  (A sketch of the validation follows this entry.)
      Reported-by: Wen Xu <wen.xu@gatech.edu>
      Signed-Off-By: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      [darrick: fix typos in comment]
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      afca6c5b
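      A stand-alone sketch of the validation idea shared by this commit and
      ee457001 below, under assumed demo_* names.  The real checks live in the
      xfs_iget() cache-hit and cache-miss paths; XFS maps EFSCORRUPTED to
      EUCLEAN.

      #include <errno.h>
      #include <stdbool.h>
      #include <stdint.h>

      struct demo_inode {
      	uint16_t	di_mode;	/* zero when the inode is free on disk */
      };

      /*
       * If the inobt said this inode is free and handed it to the allocator,
       * the on-disk inode must agree.  A non-zero di_mode means the btree
       * and the inode disagree: report corruption rather than re-initialise
       * an in-use inode (such as the root directory).
       */
      static int demo_validate_allocated_inode(const struct demo_inode *ip,
      					 bool for_allocation)
      {
      	if (for_allocation && ip->di_mode != 0)
      		return -EUCLEAN;	/* what XFS calls EFSCORRUPTED */
      	return 0;
      }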
  15. 24 Mar 2018, 1 commit
    • xfs: catch inode allocation state mismatch corruption · ee457001
      Dave Chinner authored
      We recently came across a V4 filesystem causing memory corruption
      due to a newly allocated inode being set up twice and being added to
      the superblock inode list twice. From code inspection, the only way
      this could happen is if a newly allocated inode was not marked as
      free on disk (i.e. di_mode wasn't zero).
      
      Running the metadump on an upstream debug kernel fails during inode
      allocation like so:
      
      XFS: Assertion failed: ip->i_d.di_nblocks == 0, file: fs/xfs/xfs_inode.c, line: 838
       ------------[ cut here ]------------
      kernel BUG at fs/xfs/xfs_message.c:114!
      invalid opcode: 0000 [#1] PREEMPT SMP
      CPU: 11 PID: 3496 Comm: mkdir Not tainted 4.16.0-rc5-dgc #442
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
      RIP: 0010:assfail+0x28/0x30
      RSP: 0018:ffffc9000236fc80 EFLAGS: 00010202
      RAX: 00000000ffffffea RBX: 0000000000004000 RCX: 0000000000000000
      RDX: 00000000ffffffc0 RSI: 000000000000000a RDI: ffffffff8227211b
      RBP: ffffc9000236fce8 R08: 0000000000000000 R09: 0000000000000000
      R10: 0000000000000bec R11: f000000000000000 R12: ffffc9000236fd30
      R13: ffff8805c76bab80 R14: ffff8805c77ac800 R15: ffff88083fb12e10
      FS:  00007fac8cbff040(0000) GS:ffff88083fd00000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00007fffa6783ff8 CR3: 00000005c6e2b003 CR4: 00000000000606e0
      Call Trace:
       xfs_ialloc+0x383/0x570
       xfs_dir_ialloc+0x6a/0x2a0
       xfs_create+0x412/0x670
       xfs_generic_create+0x1f7/0x2c0
       ? capable_wrt_inode_uidgid+0x3f/0x50
       vfs_mkdir+0xfb/0x1b0
       SyS_mkdir+0xcf/0xf0
       do_syscall_64+0x73/0x1a0
       entry_SYSCALL_64_after_hwframe+0x42/0xb7
      
      Extracting the inode number we crashed on from an event trace and
      looking at it with xfs_db:
      
      xfs_db> inode 184452204
      xfs_db> p
      core.magic = 0x494e
      core.mode = 0100644
      core.version = 2
      core.format = 2 (extents)
      core.nlinkv2 = 1
      core.onlink = 0
      .....
      
      Confirms that it is not a free inode on disk. xfs_repair
      also trips over this inode:
      
      .....
      zero length extent (off = 0, fsbno = 0) in ino 184452204
      correcting nextents for inode 184452204
      bad attribute fork in inode 184452204, would clear attr fork
      bad nblocks 1 for inode 184452204, would reset to 0
      bad anextents 1 for inode 184452204, would reset to 0
      imap claims in-use inode 184452204 is free, would correct imap
      would have cleared inode 184452204
      .....
      disconnected inode 184452204, would move to lost+found
      
      And so we have a situation where the directory structure and the
      inobt thinks the inode is free, but the inode on disk thinks it is
      still in use. Where this corruption came from is not possible to
      diagnose, but we can detect it and prevent the kernel from oopsing
      on lookup. The reproducer now results in:
      
      $ sudo mkdir /mnt/scratch/{0,1,2,3,4,5}{0,1,2,3,4,5}
      mkdir: cannot create directory ‘/mnt/scratch/00’: File exists
      mkdir: cannot create directory ‘/mnt/scratch/01’: File exists
      mkdir: cannot create directory ‘/mnt/scratch/03’: Structure needs cleaning
      mkdir: cannot create directory ‘/mnt/scratch/04’: Input/output error
      mkdir: cannot create directory ‘/mnt/scratch/05’: Input/output error
      ....
      
      And this corruption shutdown:
      
      [   54.843517] XFS (loop0): Corruption detected! Free inode 0xafe846c not marked free on disk
      [   54.845885] XFS (loop0): Internal error xfs_trans_cancel at line 1023 of file fs/xfs/xfs_trans.c.  Caller xfs_create+0x425/0x670
      [   54.848994] CPU: 10 PID: 3541 Comm: mkdir Not tainted 4.16.0-rc5-dgc #443
      [   54.850753] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
      [   54.852859] Call Trace:
      [   54.853531]  dump_stack+0x85/0xc5
      [   54.854385]  xfs_trans_cancel+0x197/0x1c0
      [   54.855421]  xfs_create+0x425/0x670
      [   54.856314]  xfs_generic_create+0x1f7/0x2c0
      [   54.857390]  ? capable_wrt_inode_uidgid+0x3f/0x50
      [   54.858586]  vfs_mkdir+0xfb/0x1b0
      [   54.859458]  SyS_mkdir+0xcf/0xf0
      [   54.860254]  do_syscall_64+0x73/0x1a0
      [   54.861193]  entry_SYSCALL_64_after_hwframe+0x42/0xb7
      [   54.862492] RIP: 0033:0x7fb73bddf547
      [   54.863358] RSP: 002b:00007ffdaa553338 EFLAGS: 00000246 ORIG_RAX: 0000000000000053
      [   54.865133] RAX: ffffffffffffffda RBX: 00007ffdaa55449a RCX: 00007fb73bddf547
      [   54.866766] RDX: 0000000000000001 RSI: 00000000000001ff RDI: 00007ffdaa55449a
      [   54.868432] RBP: 00007ffdaa55449a R08: 00000000000001ff R09: 00005623a8670dd0
      [   54.870110] R10: 00007fb73be72d5b R11: 0000000000000246 R12: 00000000000001ff
      [   54.871752] R13: 00007ffdaa5534b0 R14: 0000000000000000 R15: 00007ffdaa553500
      [   54.873429] XFS (loop0): xfs_do_force_shutdown(0x8) called from line 1024 of file fs/xfs/xfs_trans.c.  Return address = ffffffff814cd050
      [   54.882790] XFS (loop0): Corruption of in-memory data detected.  Shutting down filesystem
      [   54.884597] XFS (loop0): Please umount the filesystem and rectify the problem(s)
      
      Note that this crash is only possible on v4 filesystems, or v5
      filesystems mounted with the ikeep mount option. For all other v5
      filesystems, this problem cannot occur because we don't read inodes
      we are allocating from disk - we simply overwrite them with the new
      inode information.
      Signed-Off-By: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
      Tested-by: Carlos Maiolino <cmaiolino@redhat.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      ee457001
  16. 29 Jan 2018, 2 commits
  17. 18 Jan 2018, 1 commit
    • xfs: recheck reflink / dirty page status before freeing CoW reservations · be78ff0e
      Darrick J. Wong authored
      Eryu Guan reported seeing occasional hangs when running generic/269 with
      a new fsstress that supports clonerange/deduperange.  The cause of this
      hang is an infinite loop when we convert the CoW fork extents from
      unwritten to real just prior to writing the pages out; the infinite
      loop happens because there's nothing in the CoW fork to convert, and so
      it spins forever.
      
      The fundamental issue here is that when we go to perform these CoW fork
      conversions, we're supposed to have an extent waiting for us, but the
      low space CoW reaper has snuck in and blown them away!  There are four
      conditions that can dissuade the reaper from touching our file -- no
      reflink iflag; dirty page cache; writeback in progress; or directio in
      progress.  We check the four conditions prior to taking the locks, but
      we neglect to recheck them once we have the locks, which is how we end
      up whacking the writeback that's in progress.
      
      Therefore, refactor the four checks into a helper function and call it
      again once we have the locks to make sure we really want to reap the
      inode.  While we're at it, add an ASSERT for this weird condition so
      that we'll fail noisily if we ever screw this up again.  (A sketch of
      the check/lock/recheck pattern follows this entry.)
      Reported-by: Eryu Guan <eguan@redhat.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Tested-by: Eryu Guan <eguan@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      be78ff0e
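      A minimal sketch of the check/lock/recheck pattern, using a pthread
      mutex in place of the XFS inode locks; the demo_inode fields are
      hypothetical stand-ins for the four conditions named above.

      #include <pthread.h>
      #include <stdbool.h>

      struct demo_inode {
      	pthread_mutex_t	lock;		/* stands in for ILOCK/MMAPLOCK */
      	bool		is_reflink;	/* reflink iflag set */
      	bool		dirty_pages;	/* dirty pagecache */
      	bool		writeback;	/* writeback in progress */
      	bool		dio;		/* direct I/O in progress */
      };

      /* The four conditions from the commit text, as one helper. */
      static bool can_reap(const struct demo_inode *ip)
      {
      	return ip->is_reflink && !ip->dirty_pages &&
      	       !ip->writeback && !ip->dio;
      }

      static void reap_cow_reservations(struct demo_inode *ip)
      {
      	if (!can_reap(ip))		/* cheap unlocked pre-check */
      		return;

      	pthread_mutex_lock(&ip->lock);
      	/*
      	 * Recheck under the lock: writeback may have started after the
      	 * unlocked check, and reaping now would free the very extents
      	 * it is about to convert.
      	 */
      	if (can_reap(ip)) {
      		/* ... free the CoW fork reservations here ... */
      	}
      	pthread_mutex_unlock(&ip->lock);
      }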
  18. 09 Jan 2018, 1 commit
  19. 22 Dec 2017, 1 commit
  20. 21 Dec 2017, 1 commit
  21. 27 Oct 2017, 1 commit
  22. 02 Sep 2017, 1 commit
  23. 20 Jun 2017, 2 commits
  24. 08 Jun 2017, 1 commit
  25. 28 Apr 2017, 2 commits
    • xfs: update ag iterator to support wait on new inodes · ae2c4ac2
      Brian Foster authored
      The AG inode iterator currently skips new inodes as such inodes are
      inserted into the inode radix tree before they are fully
      constructed. Certain contexts require the ability to wait on the
      construction of new inodes, however. The fs-wide dquot release from
      the quotaoff sequence is an example of this.
      
      Update the AG inode iterator to support the ability to wait on
      inodes flagged with XFS_INEW upon request. Create a new
      xfs_inode_ag_iterator_flags() interface and support a set of
      iteration flags to modify the iteration behavior. When the
      XFS_AGITER_INEW_WAIT flag is set, include XFS_INEW inodes in the
      radix tree inode lookup and wait on them before the callback is
      executed.  (This and the next entry are sketched together below.)
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      ae2c4ac2
    • xfs: support ability to wait on new inodes · 756baca2
      Brian Foster authored
      Inodes that are inserted into the perag tree but still under
      construction are flagged with the XFS_INEW bit. Most contexts either
      skip such inodes when they are encountered or have the ability to
      handle them.
      
      The runtime quotaoff sequence introduces a context that must wait
      for construction of such inodes to correctly ensure that all dquots
      in the fs are released. In anticipation of this, support the ability
      to wait on new inodes. Wake the appropriate bit waitqueue when
      XFS_INEW is cleared.
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      756baca2
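      A minimal kernel-style sketch covering this entry and the previous one:
      the constructor clears the flag bit and wakes waiters, and the iterator
      either skips a new inode or waits on it depending on the iteration flag.
      The DEMO_* names are hypothetical; the real code keys a bit waitqueue on
      the XFS_INEW bit in ip->i_flags.

      #include <linux/bitops.h>
      #include <linux/sched.h>
      #include <linux/wait_bit.h>	/* <linux/wait.h> on older kernels */

      #define DEMO_INEW		0	/* bit standing in for XFS_INEW */
      #define DEMO_AGITER_INEW_WAIT	(1 << 0)

      struct demo_inode {
      	unsigned long	i_flags;
      };

      /* Constructor side: clear the bit, then wake anyone sleeping on it. */
      static void demo_finish_inode_setup(struct demo_inode *ip)
      {
      	clear_bit(DEMO_INEW, &ip->i_flags);
      	smp_mb__after_atomic();	/* clear must be visible before the wake */
      	wake_up_bit(&ip->i_flags, DEMO_INEW);
      }

      /* Iterator side: decide whether to skip, or wait and then visit. */
      static bool demo_iter_should_visit(struct demo_inode *ip,
      				   unsigned int iter_flags)
      {
      	if (test_bit(DEMO_INEW, &ip->i_flags)) {
      		if (!(iter_flags & DEMO_AGITER_INEW_WAIT))
      			return false;		/* old behaviour: skip */
      		wait_on_bit(&ip->i_flags, DEMO_INEW, TASK_UNINTERRUPTIBLE);
      	}
      	return true;
      }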
  26. 08 Mar 2017, 1 commit
  27. 31 Jan 2017, 2 commits
    • xfs: sync eofblocks scans under iolock are livelock prone · c3155097
      Brian Foster authored
      The xfs_eofblocks.eof_scan_owner field is an internal field to
      facilitate invoking eofb scans from the kernel while under the iolock.
      This is necessary because the eofb scan acquires the iolock of each
      inode. Synchronous scans are invoked on certain buffered write failures
      while under iolock. In such cases, the scan owner indicates that the
      context for the scan already owns the particular iolock and prevents a
      double lock deadlock.
      
      eofblocks scans while under iolock are still livelock prone in the event
      of multiple parallel scans, however. If multiple buffered writes to
      different inodes fail and invoke eofblocks scans at the same time, each
      scan avoids a deadlock with its own inode by virtue of the
      eof_scan_owner field, but will never be able to acquire the iolock of
      the inode from the parallel scan. Because the low free space scans are
      invoked with SYNC_WAIT, the scan will not return until it has processed
      every tagged inode and thus both scans will spin indefinitely on the
      iolock being held across the opposite scan. This problem can be
      reproduced reliably by generic/224 on systems with higher cpu counts
      (x16).
      
      To avoid this problem, simplify the semantics of eofblocks scans to
      never invoke a scan while under iolock. This means that the buffered
      write context must drop the iolock before the scan. It must reacquire
      the lock before the write retry and also repeat the initial write
      checks, as the original state might no longer be valid once the iolock
      was dropped.  (A sketch of the revised retry flow follows this entry.)
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      c3155097
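      A minimal sketch of the revised buffered-write retry flow, using a
      pthread mutex in place of the kernel iolock.  All demo_* names are
      hypothetical, and the helpers are prototypes a real build would supply.

      #include <errno.h>
      #include <pthread.h>
      #include <stdbool.h>

      struct demo_inode {
      	pthread_mutex_t	iolock;
      };

      /* Prototypes standing in for the real write path and scanner. */
      static int  demo_buffered_write(struct demo_inode *ip); /* may return -ENOSPC */
      static bool demo_write_checks(struct demo_inode *ip);
      static void demo_eofblocks_scan_all(void); /* takes every inode's iolock */

      static int demo_write_with_retry(struct demo_inode *ip)
      {
      	int error;

      	pthread_mutex_lock(&ip->iolock);
      	error = demo_buffered_write(ip);
      	pthread_mutex_unlock(&ip->iolock);
      	if (error != -ENOSPC)
      		return error;

      	/*
      	 * Scan with no iolock held: two tasks doing this in parallel
      	 * can no longer spin forever on each other's iolock.
      	 */
      	demo_eofblocks_scan_all();

      	pthread_mutex_lock(&ip->iolock);
      	error = -EAGAIN;
      	if (demo_write_checks(ip))	/* repeat the initial checks */
      		error = demo_buffered_write(ip);
      	pthread_mutex_unlock(&ip->iolock);
      	return error;
      }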
    • xfs: pull up iolock from xfs_free_eofblocks() · a36b9261
      Brian Foster authored
      xfs_free_eofblocks() requires the IOLOCK_EXCL lock, but is called from
      different contexts where the lock may or may not be held. The
      need_iolock parameter exists for this reason, to indicate whether
      xfs_free_eofblocks() must acquire the iolock itself before it can
      proceed.
      
      This is ugly and confusing. Simplify the semantics of
      xfs_free_eofblocks() to require that the caller acquire the iolock
      appropriately, and kill the need_iolock parameter. While here, the mp
      param can be removed as well, since the xfs_mount is accessible from
      the xfs_inode structure. This patch does not change behavior.  (A
      sketch of the precondition-style API follows this entry.)
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      a36b9261
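      A sketch of the API simplification: the lock requirement becomes a
      documented precondition instead of a parameter.  The demo_* names and
      the held-flag bookkeeping are illustrative only.

      #include <assert.h>
      #include <pthread.h>
      #include <stdbool.h>

      struct demo_inode {
      	pthread_mutex_t	iolock;
      	bool		iolock_held;	/* bookkeeping for the assert only */
      };

      /*
       * Before: demo_free_eofblocks(mp, ip, need_iolock) took the lock
       * itself when asked.  After: holding the iolock is a precondition,
       * and mp is reachable from the inode, so both parameters disappear.
       */
      static int demo_free_eofblocks(struct demo_inode *ip)
      {
      	assert(ip->iolock_held);	/* caller must hold IOLOCK_EXCL */
      	/* ... trim post-EOF preallocated blocks ... */
      	return 0;
      }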
  28. 04 Jan 2017, 1 commit
    • xfs: fix crash and data corruption due to removal of busy COW extents · a1b7a4de
      Christoph Hellwig authored
      There is a race window between write_cache_pages calling
      clear_page_dirty_for_io and XFS calling set_page_writeback, in which
      the mapping for an inode is tagged neither as dirty, nor as writeback.
      
      If the COW shrinker hits in exactly that window we'll remove the delayed
      COW extents that writepages is trying to write back, which in release
      kernels will manifest as corruption of the bmap btree, and in debug
      kernels will trip the ASSERT that guards against calling xfs_bmapi_write
      with the COWFORK flag for holes.  A complex customer load manages to hit
      this window fairly reliably, probably by always having COW writeback in
      flight while the cow shrinker runs.
      
      This patch adds another check for having the I_DIRTY_PAGES flag set,
      which is still set during this race window.  While this fixes the
      problem, I'm still not overly happy about the way the COW shrinker
      works, as it still seems a bit fragile.  (A sketch of the widened check
      follows this entry.)
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      a1b7a4de
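      A minimal sketch of the widened predicate, with hypothetical demo_*
      fields standing in for the mapping tags and the VFS i_state flag.

      #include <stdbool.h>

      struct demo_inode {
      	bool	mapping_tagged_dirty;	  /* PAGECACHE_TAG_DIRTY set */
      	bool	mapping_tagged_writeback; /* PAGECACHE_TAG_WRITEBACK set */
      	bool	i_dirty_pages;		  /* I_DIRTY_PAGES in i_state */
      };

      static bool demo_cow_shrinker_can_reap(const struct demo_inode *ip)
      {
      	if (ip->mapping_tagged_dirty || ip->mapping_tagged_writeback)
      		return false;
      	/*
      	 * Between clear_page_dirty_for_io() and set_page_writeback()
      	 * neither mapping tag is set, but I_DIRTY_PAGES still is;
      	 * checking it closes the race window.
      	 */
      	if (ip->i_dirty_pages)
      		return false;
      	return true;
      }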
  29. 30 Nov 2016, 1 commit
  30. 10 Nov 2016, 1 commit
    • xfs: fix unbalanced inode reclaim flush locking · 98efe8af
      Brian Foster authored
      Filesystem shutdown testing on an older distro kernel has uncovered an
      imbalanced locking pattern for the inode flush lock in
      xfs_reclaim_inode(). Specifically, there is a double unlock sequence
      between the call to xfs_iflush_abort() and xfs_reclaim_inode() at the
      "reclaim:" label.
      
      This actually does not cause obvious problems on current kernels due to
      the current flush lock implementation. Older kernels use a counting
      based flush lock mechanism, however, which effectively breaks the lock
      indefinitely when an already unlocked flush lock is repeatedly unlocked.
      Though this only currently occurs on filesystem shutdown, it has
      reproduced the effect of elevating an fs shutdown to a system-wide crash
      or hang.
      
      As it turns out, the flush lock is not actually required for the reclaim
      logic in xfs_reclaim_inode() because by that time we have already cycled
      the flush lock once while holding ILOCK_EXCL. Therefore, remove the
      additional flush lock/unlock cycle around the 'reclaim:' label and
      update branches into this label to release the flush lock where
      appropriate. Add an assert to xfs_ifunlock() to help prevent future
      occurrences of the same problem.  (A sketch of the assert follows this
      entry.)
      Reported-by: Zorro Lang <zlang@redhat.com>
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      98efe8af
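      A minimal sketch of the unlock assert described above, with a boolean
      standing in for the flush lock; the demo_* names are hypothetical.

      #include <assert.h>
      #include <stdbool.h>

      struct demo_inode {
      	bool	flush_locked;
      };

      static void demo_ifunlock(struct demo_inode *ip)
      {
      	/*
      	 * On counting-style flush locks, unlocking an already unlocked
      	 * lock breaks it for every future user, so fail loudly instead.
      	 */
      	assert(ip->flush_locked);
      	ip->flush_locked = false;
      }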
  31. 08 Nov 2016, 1 commit
    • xfs: don't skip cow forks w/ delalloc blocks in cowblocks scan · 39937234
      Brian Foster authored
      The cowblocks background scanner currently clears the cowblocks tag
      for inodes without any real allocations in the cow fork. This
      excludes inodes with only delalloc blocks in the cow fork. While we
      might never expect to clear delalloc blocks from the cow fork in the
      background scanner, it is not necessarily correct to clear the
      cowblocks tag from such inodes.
      
      For example, if the background scanner happens to process an inode
      between a buffered write and writeback, the scanner catches the
      inode in a state after delalloc blocks have been allocated to the
      cow fork but before the delalloc blocks have been converted to real
      blocks by writeback. The background scanner then incorrectly clears
      the cowblocks tag, even if part of the aforementioned delalloc
      reservation will not be remapped to the data fork (i.e., extra
      blocks due to the cowextsize hint). This means that any such
      additional blocks in the cow fork might never be reclaimed by the
      background scanner and could persist until the inode itself is
      reclaimed.
      
      To address this problem, only skip and clear inodes without any cow
      fork allocations whatsoever from the background scanner. While we
      generally do not want to cancel delalloc reservations from the
      background scanner, the pagecache dirty check following the
      cowblocks check should prevent that situation. If we do end up with
      delalloc cow fork blocks without a dirty address space mapping, this
      is probably an indication that something has gone wrong and the
      blocks should be reclaimed, as they may never be converted to a real
      allocation.  (A sketch of the revised predicate follows this entry.)
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      39937234
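      A minimal sketch of the revised scan predicate, with assumed names:
      only a completely empty CoW fork may have its tag cleared.

      #include <stdbool.h>

      struct demo_cow_fork {
      	unsigned int	real_blocks;	 /* extents already allocated */
      	unsigned int	delalloc_blocks; /* reservations awaiting writeback */
      };

      /*
       * Old predicate: real_blocks == 0, which wrongly untagged inodes
       * holding only delalloc reservations.  New predicate: the fork must
       * be completely empty before the cowblocks tag is cleared.
       */
      static bool demo_can_clear_cowblocks_tag(const struct demo_cow_fork *cow)
      {
      	return cow->real_blocks == 0 && cow->delalloc_blocks == 0;
      }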
  32. 24 Oct 2016, 1 commit
  33. 06 Oct 2016, 1 commit
    • xfs: garbage collect old cowextsz reservations · 83104d44
      Darrick J. Wong authored
      Trim CoW reservations made on behalf of a cowextsz hint if they get too
      old or we run low on quota, so long as we don't have dirty data awaiting
      writeback or directio operations in progress.
      
      Garbage collection of the cowextsize extents is kept separate from
      prealloc extent reaping because setting the CoW prealloc lifetime to a
      (much) higher value than the regular prealloc extent lifetime has been
      useful for combatting CoW fragmentation on VM hosts where the VMs
      experience bursty write behaviors and we can keep the utilization ratios
      low enough that we don't start to run out of space.  IOWs, it benefits
      us to keep the CoW fork reservations around for as long as we can unless
      we run out of blocks or hit inode reclaim.
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      83104d44