1. 17 Feb, 2017 4 commits
    • B
      xfs: split indlen reservations fairly when under reserved · 75d65361
      Committed by Brian Foster
      Certain workloads that punch holes into speculative preallocation can
      cause delalloc indirect reservation splits when the delalloc extent is
      split in two. If further splits occur, an already short-handed extent
      can be split into two in a manner that leaves zero indirect blocks for
      one of the two new extents. This occurs because the shortage is large
      enough that the xfs_bmap_split_indlen() algorithm completely drains the
      requested indlen of one of the extents before it honors the existing
      reservation.
      
      This ultimately results in a warning from xfs_bmap_del_extent(). This
      has been observed during file copies of large, sparse files using 'cp
      --sparse=always.'
      
      To avoid this problem, update xfs_bmap_split_indlen() to explicitly
      apply the reservation shortage fairly between both extents. This smooths
      out the overall indlen shortage and defers the situation where we end up
      with a delalloc extent with zero indlen reservation to extreme
      circumstances.
      Reported-by: Patrick Dung <mpatdung@gmail.com>
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      75d65361
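The fair split described in this commit can be modeled in userspace. What follows is a simplified Python sketch, not the kernel's xfs_bmap_split_indlen(): the function name `split_indlen_fairly`, its parameters, and the integer-percentage scaling are assumptions made for illustration.

```python
def split_indlen_fairly(available, want1, want2):
    """Split `available` reserved blocks between two new extents.

    want1 and want2 are the worst-case indlen requests of the two extents.
    Hypothetical model: scale both requests by the same percentage, then
    hand out the rounding remainder one block at a time.
    """
    total_wanted = want1 + want2
    if available >= total_wanted:
        return want1, want2  # no shortage, both requests are honored

    # Apply the shortage to both extents by the same integer percentage.
    factor = available * 100 // total_wanted
    lens = {1: want1 * factor // 100, 2: want2 * factor // 100}
    wants = {1: want1, 2: want2}

    # Distribute leftover blocks; an extent left with zero is served first.
    remainder = available - (lens[1] + lens[2])
    order = (2, 1) if lens[2] == 0 else (1, 2)
    while remainder > 0:
        for which in order:
            if remainder > 0 and lens[which] < wants[which]:
                lens[which] += 1
                remainder -= 1
    return lens[1], lens[2]
```

In this model, `split_indlen_fairly(2, 6, 2)` returns `(1, 1)`: the smaller extent still receives a block even though the larger request alone would consume the whole reservation.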
    • B
      xfs: handle indlen shortage on delalloc extent merge · 0e339ef8
      Committed by Brian Foster
      When a delalloc extent is created, it can be merged with pre-existing,
      contiguous, delalloc extents. When this occurs,
      xfs_bmap_add_extent_hole_delay() merges the extents along with the
      associated indirect block reservations. The expectation here is that the
      combined worst case indlen reservation is always less than or equal to
      the indlen reservation for the individual extents.
      
      This is not always the case, however, as existing extents can hold less than
      the expected indlen reservation if the extent was previously split due
      to a hole punch. If a new extent merges with such an extent, the total
      indlen requirement may be larger than the sum of the indlen reservations
      held by both extents.
      
      xfs_bmap_add_extent_hole_delay() assumes that the worst case indlen
      reservation is always available and assigns it to the merged extent
      without consideration for the indlen held by the pre-existing extent. As
      a result, the subsequent xfs_mod_fdblocks() call can attempt an
      unintentional allocation rather than a free (indicated by an ASSERT()
      failure). Further, if the allocation happens to fail in this context,
      the failure goes unhandled and creates a filesystem wide block
      accounting inconsistency.
      
      Fix xfs_bmap_add_extent_hole_delay() to function as designed. Cap the
      indlen reservation assigned to the merged extent to the sum of the
      indlen reservations held by each of the individual extents.
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      0e339ef8
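The cap described in this commit amounts to a single `min()`. The sketch below is a Python model under stated assumptions, not the code in xfs_bmap_add_extent_hole_delay(): `merged_indlen` and its parameter names are hypothetical.

```python
def merged_indlen(worst_case, held_existing, held_new):
    """Return (indlen assigned to the merged extent, blocks freed).

    worst_case is the worst-case indlen for the merged extent;
    held_existing and held_new are the reservations the two extents
    actually hold, which may be short after an earlier hole punch.
    """
    held = held_existing + held_new
    new_len = min(worst_case, held)  # never assign more than is held
    freed = held - new_len           # always >= 0, so never an allocation
    return new_len, freed
```

Without the cap, a short-handed extent (for example `worst_case=10` with only 7 blocks held) would yield a negative "freed" count, which corresponds to the unintended allocation the commit message describes.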
    • B
      xfs: resurrect debug mode drop buffered writes mechanism · 9dbddd7b
      Committed by Brian Foster
      A debug mode write failure mechanism was introduced to XFS in commit
      801cc4e1 ("xfs: debug mode forced buffered write failure") to
      facilitate targeted testing of delalloc indirect reservation management
      from userspace. This code was subsequently rendered ineffective by the
      move to iomap based buffered writes in commit 68a9f5e7 ("xfs:
      implement iomap based buffered write path"). This likely went unnoticed
      because the associated userspace code had not made it into xfstests.
      
      Resurrect this mechanism to facilitate effective indlen reservation
      testing from xfstests. The move to iomap based buffered writes relocated
      the hook this mechanism needs to return write failure from XFS to
      generic code. The failure trigger must remain in XFS. Given that
      limitation, convert this from a write failure mechanism to one that
      simply drops writes without returning failure to userspace. Rename all
      "fail_writes" references to "drop_writes" to illustrate the point. This
      is more hacky than preferred, but still triggers the XFS error handling
      behavior required to drive the indlen tests. This is only available in
      DEBUG mode and for testing purposes only.
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      9dbddd7b
    • B
      xfs: clear delalloc and cache on buffered write failure · fa7f138a
      Committed by Brian Foster
      The buffered write failure handling code in
      xfs_file_iomap_end_delalloc() has a couple minor problems. First, if
      written == 0, start_fsb is not rounded down and it fails to kill off a
      delalloc block if the start offset is block unaligned. This results in a
      lingering delalloc block and broken delalloc block accounting detected
      at unmount time. Fix this by rounding down start_fsb in the unlikely
      event that written == 0.
      
      Second, it is possible for a failed overwrite of a delalloc extent to
      leave dirty pagecache around over a hole in the file. This is because it is
      possible to hit ->iomap_end() on write failure before the iomap code has
      attempted to allocate pagecache, and thus has no need to clean it up. If
      the targeted delalloc extent was successfully written by a previous
      write, however, then it does still have dirty pages when ->iomap_end()
      punches out the underlying blocks. This ultimately results in writeback
      over a hole. To fix this problem, unconditionally punch out the
      pagecache from XFS before the associated delalloc range.
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      fa7f138a
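The first problem in this commit is a rounding question, and a small model makes it concrete. This is a Python sketch, not the code in xfs_file_iomap_end_delalloc(): the function name, the byte-based parameters, and the 4096-byte block size are assumptions for illustration.

```python
def delalloc_punch_range(offset, length, written, block_size=4096):
    """Return (start_fsb, end_fsb), the block range to punch after a
    short or failed buffered write. end_fsb is exclusive."""
    def round_up(nbytes):
        return -(-nbytes // block_size)

    if written == 0:
        # Nothing was written: round down so that the block containing an
        # unaligned start offset is punched as well.
        start_fsb = offset // block_size
    else:
        # Keep every block that received data; start after the last one.
        start_fsb = round_up(offset + written)
    end_fsb = round_up(offset + length)
    return start_fsb, end_fsb
```

For a failed 1000-byte write at byte offset 4196, the model returns `(1, 2)`, so block 1 is punched. Rounding the start up instead would give an empty range and leave that delalloc block behind.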
  2. 10 Feb, 2017 6 commits
  3. 07 Feb, 2017 5 commits
  4. 04 Feb, 2017 1 commit
  5. 03 Feb, 2017 7 commits
    • D
      xfs: mark speculative prealloc CoW fork extents unwritten · 5eda4300
      Committed by Darrick J. Wong
      Christoph Hellwig pointed out that there's a potentially nasty race when
      performing simultaneous nearby directio cow writes:
      
      "Thread 1 writes a range from B to C
      
      "                    B --------- C
                                 p
      
      "a little later thread 2 writes from A to B
      
      "        A --------- B
                     p
      
      [editor's note: the 'p' marks denote cowextsize boundaries, which I added to
      make this more clear]
      
      "but the code preallocates beyond B into the range where thread
      "1 has just written, but ->end_io hasn't been called yet.
      "But once ->end_io is called thread 2 has already allocated
      "up to the extent size hint into the write range of thread 1,
      "so the end_io handler will splice the uninitialized blocks from
      "that preallocation back into the file right after B."
      
      We can avoid this race by ensuring that thread 1 cannot accidentally
      remap the blocks that thread 2 allocated (as part of speculative
      preallocation) as part of t2's write preparation in t1's end_io handler.
      The way we make this happen is by taking advantage of the unwritten
      extent flag as an intermediate step.
      
      Recall that when we begin the process of writing data to shared blocks,
      we create a delayed allocation extent in the CoW fork:
      
      D: --RRRRRRSSSRRRRRRRR---
      C: ------DDDDDDD---------
      
      When a thread prepares to CoW some dirty data out to disk, it will now
      convert the delalloc reservation into an /unwritten/ allocated extent in
      the cow fork.  The da conversion code tries to opportunistically
      allocate as much of a (speculatively prealloc'd) extent as possible, so
      we may end up allocating a larger extent than we're actually writing
      out:
      
      D: --RRRRRRSSSRRRRRRRR---
      U: ------UUUUUUU---------
      
      Next, we convert only the part of the extent that we're actively
      planning to write to normal (i.e. not unwritten) status:
      
      D: --RRRRRRSSSRRRRRRRR---
      U: ------UURRUUU---------
      
      If the write succeeds, the end_cow function will now scan the relevant
      range of the CoW fork for real extents and remap only the real extents
      into the data fork:
      
      D: --RRRRRRRRSRRRRRRRR---
      U: ------UU--UUU---------
      
      This ensures that we never obliterate valid data fork extents with
      unwritten blocks from the CoW fork.
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      5eda4300
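The four diagrams in this commit message can be replayed with a toy model in which each fork is a string of per-block states. This is a Python sketch under stated assumptions: the helper names `convert_delalloc`, `prepare_write`, and `end_cow` are hypothetical, and the second row of each diagram is treated as the CoW fork.

```python
def convert_delalloc(cow):
    """Convert every delalloc block (D) in the CoW fork to unwritten (U)."""
    return cow.replace("D", "U")


def prepare_write(cow, start, end):
    """Mark only the blocks about to be written, [start, end), as real (R)."""
    return cow[:start] + cow[start:end].replace("U", "R") + cow[end:]


def end_cow(data, cow, start, end):
    """Remap real CoW blocks in [start, end) into the data fork.

    Unwritten blocks are skipped, so speculative preallocation never
    replaces valid data fork extents.
    """
    data, cow = list(data), list(cow)
    for i in range(start, end):
        if cow[i] == "R":
            data[i] = "R"
            cow[i] = "-"
    return "".join(data), "".join(cow)


data = "--RRRRRRSSSRRRRRRRR---"
cow = convert_delalloc("------DDDDDDD---------")
cow = prepare_write(cow, 8, 10)
data, cow = end_cow(data, cow, 6, 13)
```

Even though `end_cow` scans the whole preallocated range (blocks 6 to 12), only the two blocks that were actually written move into the data fork, which reproduces the final pair of rows in the message.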
    • D
      xfs: allow unwritten extents in the CoW fork · 05a630d7
      Committed by Darrick J. Wong
      In the data fork, we only allow extents to perform the following state
      transitions:
      
      delay -> real <-> unwritten
      
      There's no way to move directly from a delalloc reservation to an
      /unwritten/ allocated extent.  However, for the CoW fork we want to be
      able to do the following to each extent:
      
      delalloc -> unwritten -> written -> remapped to data fork
      
      This will help us to avoid a race in the speculative CoW preallocation
      code between a first thread that is allocating a CoW extent and a second
      thread that is remapping part of a file after a write.  In order to do
      this, however, we need two things: first, we have to be able to
      transition from da to unwritten, and second the function that converts
      between real and unwritten has to be made aware of the cow fork.  Do
      both of those things.
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      05a630d7
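The two transition chains in this commit message can be written down as a lookup table. The Python sketch below is illustrative only: the state names follow the message, `transition_allowed` is a hypothetical helper, and it is an assumption that the CoW fork keeps the data fork transitions and adds the direct delay to unwritten step.

```python
# Data fork: delay -> real <-> unwritten
DATA_FORK = {
    ("delay", "real"),
    ("real", "unwritten"),
    ("unwritten", "real"),
}

# CoW fork (assumed): everything above, plus delay -> unwritten.
COW_FORK = DATA_FORK | {("delay", "unwritten")}


def transition_allowed(fork, old, new):
    """Return True if an extent in `fork` may move from `old` to `new`."""
    allowed = COW_FORK if fork == "cow" else DATA_FORK
    return (old, new) in allowed
```

Under this model, `transition_allowed("data", "delay", "unwritten")` is false while the same transition in the `"cow"` fork is permitted.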
    • D
      xfs: verify free block header fields · de14c5f5
      Committed by Darrick J. Wong
      Perform basic sanity checking of the directory free block header
      fields so that we avoid hanging the system on invalid data.
      
      (Granted that just means that now we shut down on directory write,
      but that seems better than hanging...)
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      de14c5f5
    • D
      xfs: check for obviously bad level values in the bmbt root · b3bf607d
      Committed by Darrick J. Wong
      We can't handle a bmbt that's taller than BTREE_MAXLEVELS, and there's
      no such thing as a zero-level bmbt (for that we have extents format),
      so if we see this, send back an error code.
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      b3bf607d
    • D
      xfs: filter out obviously bad btree pointers · d5a91bae
      Committed by Darrick J. Wong
      Don't let anybody load an obviously bad btree pointer.  Since the values
      come from disk, we must return an error, not just ASSERT.
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Eric Sandeen <sandeen@redhat.com>
      d5a91bae
    • D
      xfs: fail _dir_open when readahead fails · 7a652bbe
      Committed by Darrick J. Wong
      When we open a directory, we try to readahead block 0 of the directory
      on the assumption that we're going to need it soon.  If the bmbt is
      corrupt, the directory will never be usable and the readahead fails
      immediately, so we might as well prevent the directory from being opened
      at all.  This prevents a subsequent read or modify operation from
      hitting it and taking the fs offline.
      
      NOTE: We're only checking for early failures in the block mapping, not
      the readahead directory block itself.
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Eric Sandeen <sandeen@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      7a652bbe
    • D
      xfs: fix toctou race when locking an inode to access the data map · 4b5bd5bf
      Committed by Darrick J. Wong
      We use di_format and if_flags to decide whether we're grabbing the ilock
      in btree mode (btree extents not loaded) or shared mode (anything else),
      but the state of those fields can be changed by other threads that are
      also trying to load the btree extents -- IFEXTENTS gets set before the
      _bmap_read_extents call and cleared if it fails.
      
      We don't actually need to have IFEXTENTS set until after the bmbt
      records are successfully loaded and validated, which will fix the race
      between multiple threads trying to read the same directory.  The next
      patch strengthens directory bmbt validation by refusing to open the
      directory if reading the bmbt to start directory readahead fails.
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      4b5bd5bf
  6. 31 Jan, 2017 9 commits
  7. 28 Jan, 2017 1 commit
    • B
      xfs: prevent quotacheck from overloading inode lru · e0d76fa4
      Committed by Brian Foster
      Quotacheck runs at mount time in situations where quota accounting must
      be recalculated. In doing so, it uses bulkstat to visit every inode in
      the filesystem. Historically, every inode processed during quotacheck
      was released and immediately tagged for reclaim because quotacheck runs
      before the superblock is marked active by the VFS. In other words,
      the final iput() led to an immediate ->destroy_inode() call, which
      allowed the XFS background reclaim worker to start reclaiming inodes.
      
      Commit 17c12bcd ("xfs: when replaying bmap operations, don't let
      unlinked inodes get reaped") marks the XFS superblock active sooner as
      part of the mount process to support caching inodes processed during log
      recovery. This occurs before quotacheck and thus means all inodes
      processed by quotacheck are inserted into the LRU on release.  The
      s_umount lock is held until the mount has completed and thus prevents
      the shrinkers from operating on the sb. This means that quotacheck can
      excessively populate the inode LRU and lead to OOM conditions on systems
      without sufficient RAM.
      
      Update the quotacheck bulkstat handler to set XFS_IGET_DONTCACHE on
      inodes processed by quotacheck. This causes ->drop_inode() to return 1
      and in turn causes iput_final() to evict the inode. This preserves the
      original quotacheck behavior and prevents it from overloading the LRU
      and running out of memory.
      
      CC: stable@vger.kernel.org # v4.9
      Reported-by: Martin Svec <martin.svec@zoner.cz>
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Eric Sandeen <sandeen@redhat.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      e0d76fa4
  8. 27 Jan, 2017 1 commit
    • D
      xfs: fix bmv_count confusion w/ shared extents · c364b6d0
      Committed by Darrick J. Wong
      In a bmapx call, bmv_count is the total size of the array, including the
      zeroth element that userspace uses to supply the search key.  The output
      array starts at offset 1 so that we can set up the user for the next
      invocation.  Since we now can split an extent into multiple bmap records
      due to shared/unshared status, we have to be careful that we don't
      overflow the output array.
      
      In the original patch f86f4037 ("xfs: teach get_bmapx about shared
      extents and the CoW fork") I used cur_ext (the output index) to check
      for overflows, albeit with an off-by-one error.  Since nexleft no longer
      describes the number of unfilled slots in the output, we can rip all
      that out and use cur_ext for the overflow check directly.
      
      Failure to do this causes heap corruption in bmapx callers such as
      xfs_io and xfs_scrub.  xfs/328 can reproduce this problem.
      Reviewed-by: Eric Sandeen <sandeen@redhat.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      c364b6d0
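The overflow check described in this commit can be sketched as follows. This is a Python model, not the kernel's getbmapx implementation: `fill_bmapx` and the list-of-lists input are hypothetical, while `bmv_count` and `cur_ext` follow the names used in the message.

```python
def fill_bmapx(bmv_count, extents):
    """Fill a bmapx-style output array of bmv_count slots.

    extents is a list of lists: each inner list holds the records that
    one extent splits into (for example shared and unshared pieces).
    Slot 0 is reserved for the caller's search key, so at most
    bmv_count - 1 records fit.
    """
    out = [None] * bmv_count
    cur_ext = 0  # number of output records filled so far
    for pieces in extents:
        for record in pieces:
            # Check before every record, not once per extent, because a
            # single extent may now produce several records.
            if cur_ext >= bmv_count - 1:
                return out, cur_ext
            out[cur_ext + 1] = record
            cur_ext += 1
    return out, cur_ext
```

With `bmv_count=3` and two extents that each split into two records, the model stops after the second record instead of writing past the end of the array.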
  9. 26 Jan, 2017 2 commits
    • D
      xfs: clear _XBF_PAGES from buffers when readahead page · 2aa6ba7b
      Committed by Darrick J. Wong
      If we try to allocate memory pages to back an xfs_buf that we're trying
      to read, it's possible that we'll be so short on memory that the page
      allocation fails.  For a blocking read we'll just wait, but for
      readahead we simply dump all the pages we've collected so far.
      
      Unfortunately, after dumping the pages we neglect to clear the
      _XBF_PAGES state, which means that the subsequent call to xfs_buf_free
      thinks that b_pages still points to pages we own.  It then double-frees
      the b_pages pages.
      
      This results in screaming about negative page refcounts from the memory
      manager, which xfs oughtn't be triggering.  To reproduce this case,
      mount a filesystem where the size of the inodes far outweighs the
      available memory (a ~500M inode filesystem on a VM with 300MB memory
      did the trick here) and run bulkstat in parallel with other memory
      eating processes to put a huge load on the system.  The "check summary"
      phase of xfs_scrub also works for this purpose.
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Eric Sandeen <sandeen@redhat.com>
      2aa6ba7b
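The double free described in this commit comes from a stale ownership flag, which a toy model can show. The Python sketch below is not the xfs_buf code: the `Buf` class, its methods, the `clear_flag` switch, and the flag value are all stand-ins invented for illustration.

```python
XBF_PAGES = 1 << 0  # stand-in value, not the kernel's definition


class Buf:
    def __init__(self):
        self.flags = 0
        self.pages = []        # stale entries remain, like dangling pointers
        self.free_counts = {}  # page name -> number of times it was freed

    def _free_page(self, page):
        self.free_counts[page] = self.free_counts.get(page, 0) + 1

    def readahead_alloc(self, count, fail_at, clear_flag=True):
        """Allocate pages; on failure at index fail_at, dump what we have."""
        self.flags |= XBF_PAGES
        for i in range(count):
            if i == fail_at:
                for page in self.pages:
                    self._free_page(page)
                if clear_flag:
                    # The fix: the buffer no longer owns any pages.
                    self.flags &= ~XBF_PAGES
                return False
            self.pages.append(f"page{i}")
        return True

    def free(self):
        """Release the buffer; pages are freed only if the flag says we own them."""
        if self.flags & XBF_PAGES:
            for page in self.pages:
                self._free_page(page)
```

Calling `readahead_alloc(4, fail_at=2, clear_flag=False)` followed by `free()` frees each collected page twice in this model, while the default `clear_flag=True` frees each page once.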
    • C
      xfs: extsize hints are not unlikely in xfs_bmap_btalloc · 493611eb
      Committed by Christoph Hellwig
      With COW files they are the hotpath, just like for files with the
      extent size hint attribute.  We really shouldn't micro-manage anything
      but failure cases with unlikely.
      
      Additionally Arnd Bergmann recently reported that one of these two
      unlikely annotations causes link failures together with an upcoming
      kernel instrumentation patch, so let's get rid of it ASAP.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reported-by: Arnd Bergmann <arnd@arndb.de>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      493611eb
  10. 25 Jan, 2017 4 commits