1. 03 Aug, 2016 5 commits
    • xfs: add owner field to extent allocation and freeing · 340785cc
      Darrick J. Wong committed
      For the rmap btree to work, we have to feed the extent owner
      information to the allocation and freeing functions. This
      information is what will end up in the rmap btree that tracks
      allocated extents. While we technically don't need the owner
      information when freeing extents, passing it allows us to validate
      that the extent we are removing from the rmap btree actually
      belonged to the owner we expected it to belong to.
      
      We also define a special set of owner values for internal metadata
      that would otherwise have no owner. This allows us to tell the
      difference between metadata owned by different per-ag btrees, as
      well as static fs metadata (e.g. AG headers) and internal journal
      blocks.
      
      There are also a couple of special cases we need to take care of -
      during EFI recovery, we don't actually know who the original owner
      was, so we need to pass a wildcard to indicate that we aren't
      checking the owner for validity. We also need special handling in
      growfs, as we "free" the space in the last AG when extending it, but
      because it's new space it has no actual owner...
      
      While touching the xfs_bmap_add_free() function, re-order the
      parameters to put the struct xfs_mount first.
      
      Extend the owner field to include both the owner type and some sort
      of index within the owner.  The index field will be used to support
      reverse mappings when reflink is enabled.
      
      When we're freeing extents from an EFI, we don't have the owner
      information available (rmap updates have their own redo items).
      xfs_free_extent therefore doesn't need to do an rmap update. Make
      sure that the log replay code signals this correctly.
      
      This is based upon a patch originally from Dave Chinner. It has been
      extended to add more owner information with the intent of helping
      recovery operations when things go wrong (e.g. offset of user data
      block in a file).
      
      [dchinner: de-shout the xfs_rmap_*_owner helpers]
      [darrick: minor style fixes suggested by Christoph Hellwig]
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
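The owner validation described in this commit can be pictured with a minimal sketch. The names used here (`owner_info`, `OWNER_WILDCARD`, `OWNER_AG_HDR`, `OWNER_LOG`, `owner_matches`) are hypothetical and are not the kernel's definitions. The sketch only assumes that an owner is an identifier plus an index within that owner, and that a wildcard value disables the check.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical owner description: who owns the extent, plus an index
 * within that owner (for example an offset in a file). */
struct owner_info {
	uint64_t owner;
	uint64_t offset;
};

/* Hypothetical special owner values for metadata that has no inode. */
#define OWNER_WILDCARD  UINT64_MAX        /* "do not check", e.g. EFI recovery */
#define OWNER_AG_HDR    (UINT64_MAX - 1)  /* static AG headers */
#define OWNER_LOG       (UINT64_MAX - 2)  /* internal journal blocks */

/* Return true if an extent recorded for `recorded` may be freed by a
 * caller that expects it to belong to `expected`. */
static bool owner_matches(const struct owner_info *expected,
			  const struct owner_info *recorded)
{
	if (expected->owner == OWNER_WILDCARD)
		return true;
	return expected->owner == recorded->owner &&
	       expected->offset == recorded->offset;
}
```

A free request that passes the wildcard is accepted for any recorded owner, which is the behaviour the message describes for EFI recovery.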
    • xfs: rmap btree add more reserved blocks · 8018026e
      Darrick J. Wong committed
      Originally-From: Dave Chinner <dchinner@redhat.com>
      
      XFS reserves a small amount of space in each AG for the minimum
      number of free blocks needed for operation. Adding the rmap btree
      increases the number of reserved blocks, but it also increases the
      complexity of the calculation as the free inode btree is optional
      (like the rmbt).
      
      Rather than calculate the prealloc blocks every time we need to
      check it, add a function to calculate it at mount time and store it
      in the struct xfs_mount, and convert the XFS_PREALLOC_BLOCKS macro
      just to use the xfs-mount variable directly.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
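As a rough illustration of "calculate once at mount time, then read the cached value", consider the sketch below. The structure, field names, and the `FIXED_BLOCKS` value are assumptions made for the example, not the kernel's definitions. It assumes each optional btree adds one block on top of a fixed base.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical mount structure with only the fields this sketch needs. */
struct mount {
	bool     has_finobt;          /* free inode btree enabled */
	bool     has_rmapbt;          /* rmap btree enabled */
	uint32_t ag_prealloc_blocks;  /* cached at mount time */
};

/* Illustrative base count; not the on-disk constant. */
#define FIXED_BLOCKS 4u

/* Assumed rule: every enabled optional btree adds one block. */
static uint32_t calc_prealloc_blocks(const struct mount *mp)
{
	uint32_t blocks = FIXED_BLOCKS;

	if (mp->has_finobt)
		blocks++;
	if (mp->has_rmapbt)
		blocks++;
	return blocks;
}

/* Called once at mount time; later checks just read the cached field. */
static void mount_init_prealloc(struct mount *mp)
{
	mp->ag_prealloc_blocks = calc_prealloc_blocks(mp);
}
```

Callers that previously evaluated a macro on every check would read `mp->ag_prealloc_blocks` instead.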
    • xfs: introduce rmap btree definitions · b8704944
      Darrick J. Wong committed
      Originally-From: Dave Chinner <dchinner@redhat.com>
      
      Add new per-ag rmap btree definitions to the per-ag structures. The
      rmap btree will sit in the empty slots on disk after the free space
      btrees, and hence form a part of the array of space management
      btrees. This requires the definition of the btree to be contiguous
      with the free space btrees.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
    • xfs: add tracepoints and error injection for deferred extent freeing · ba9e7802
      Darrick J. Wong committed
      Add a couple of tracepoints for the deferred extent free operation and
      a site for injecting errors while finishing the operation.  This makes
      it easier to debug deferred ops and test log redo.
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
    • xfs: rework xfs_bmap_free callers to use xfs_defer_ops · 3ab78df2
      Darrick J. Wong committed
      Restructure everything that used xfs_bmap_free to use xfs_defer_ops
      instead.  For now we'll just remove the old symbols and play some
      cpp magic to make it work; in the next patch we'll actually rename
      everything.
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
  2. 21 Jun, 2016 2 commits
  3. 01 Jun, 2016 1 commit
  4. 04 Jan, 2016 2 commits
  5. 03 Nov, 2015 1 commit
    • xfs: introduce BMAPI_ZERO for allocating zeroed extents · 3fbbbea3
      Dave Chinner committed
      To enable DAX to do atomic allocation of zeroed extents, we need to
      drive the block zeroing deep into the allocator. Because
      xfs_bmapi_write() can return merged extents on allocation that were
      only partially allocated (i.e. requested range spans allocated and
      hole regions, allocation into the hole was contiguous), we cannot
      zero the extent returned from xfs_bmapi_write() as that can
      overwrite existing data with zeros.
      
      Hence we have to drive the extent zeroing into the allocation code,
      prior to where we merge the extents into the BMBT and return the
      resultant map. This means we need to propagate this need down to
      the xfs_alloc_vextent() and issue the block zeroing at this point.
      
      While this functionality is being introduced for DAX, there is no
      reason why it is specific to DAX - we can pre-zero blocks during the
      allocation transaction on any type of device. It's just slow (and
      usually slower than unwritten allocation and conversion) on
      traditional block devices so doesn't tend to get used. We can,
      however, hook hardware zeroing optimisations via sb_issue_zeroout()
      to this operation, so it may be useful in future and hence the
      "allocate zeroed blocks" API needs to be implementation neutral.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
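The reason zeroing must happen before the merge can be shown with a toy model in which one byte stands in for one block. Everything here (`struct extent`, `zero_new_blocks`, `merge`) is hypothetical and only illustrates the ordering argument; it is not how the allocator is written.

```c
#include <stdint.h>
#include <string.h>

/* One byte stands in for one filesystem block in this toy model. */
struct extent {
	uint64_t start;
	uint64_t len;
};

/* Zero only the blocks that were just allocated. This has to run before
 * the new extent is merged with its neighbours, because the merged
 * mapping no longer says which part was freshly allocated. */
static void zero_new_blocks(unsigned char *dev, struct extent fresh)
{
	memset(dev + fresh.start, 0, fresh.len);
}

/* Merge two adjacent extents into the single mapping a caller would see. */
static struct extent merge(struct extent a, struct extent b)
{
	struct extent m = { a.start, a.len + b.len };
	return m;
}
```

Zeroing the merged result instead would also wipe the blocks of the older extent, which is the data overwrite the message warns about.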
  6. 12 Oct, 2015 2 commits
    • xfs: per-filesystem stats counter implementation · ff6d6af2
      Bill O'Donnell committed
      This patch modifies the stats counting macros and the callers
      to those macros to properly increment, decrement, and add-to
      the xfs stats counts. The counts for global and per-fs stats
      are correctly advanced, and cleared by writing a "1" to the
      corresponding clear file.
      
      global counts: /sys/fs/xfs/stats/stats
      per-fs counts: /sys/fs/xfs/sda*/stats/stats
      
      global clear:  /sys/fs/xfs/stats/stats_clear
      per-fs clear:  /sys/fs/xfs/sda*/stats/stats_clear
      
      [dchinner: cleaned up macro variables, removed CONFIG_FS_PROC around
       stats structures and macros. ]
      Signed-off-by: Bill O'Donnell <billodo@redhat.com>
      Reviewed-by: Eric Sandeen <sandeen@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
    • xfs: validate metadata LSNs against log on v5 superblocks · a45086e2
      Brian Foster committed
      Since the onset of v5 superblocks, the LSN of the last modification has
      been included in a variety of on-disk data structures. This LSN is used
      to provide log recovery ordering guarantees (e.g., to ensure an older
      log recovery item is not replayed over a newer target data structure).
      
      While this works correctly from the point a filesystem is formatted and
      mounted, userspace tools have some problematic behaviors that defeat
      this mechanism. For example, xfs_repair historically zeroes out the log
      unconditionally (regardless of whether corruption is detected). If this
      occurs, the LSN of the filesystem is reset and the log is now in a
      problematic state with respect to on-disk metadata structures that might
      have a larger LSN. Until either the log catches up to the highest
      previously used metadata LSN or each affected data structure is modified
      and written out without incident (which resets the metadata LSN), log
      recovery is susceptible to filesystem corruption.
      
      This problem is ultimately addressed and repaired in the associated
      userspace tools. The kernel is still responsible for detecting the problem
      and notifying the user that something is wrong. Check the superblock LSN at
      mount time and fail the mount if it is invalid. From that point on,
      trigger verifier failure on any metadata I/O where an invalid LSN is
      detected. This results in a filesystem shutdown and guarantees that we
      do not log metadata changes with invalid LSNs on disk. Since this is a
      known issue with a known recovery path, present a warning to instruct
      the user how to recover.
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
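A minimal sketch of the kind of check described above: a metadata LSN is rejected when it lies ahead of the current position of the log. The helper names are hypothetical, and the sketch assumes an LSN packs the log cycle in the upper 32 bits and the block number in the lower 32 bits.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed encoding: upper 32 bits are the log cycle, lower 32 bits the block. */
static uint32_t lsn_cycle(uint64_t lsn)
{
	return (uint32_t)(lsn >> 32);
}

static uint32_t lsn_block(uint64_t lsn)
{
	return (uint32_t)(lsn & 0xffffffffu);
}

/* A metadata LSN is plausible only if it is not ahead of the log head. */
static bool metadata_lsn_valid(uint64_t meta_lsn,
			       uint32_t cur_cycle, uint32_t cur_block)
{
	if (lsn_cycle(meta_lsn) < cur_cycle)
		return true;
	if (lsn_cycle(meta_lsn) > cur_cycle)
		return false;
	return lsn_block(meta_lsn) <= cur_block;
}
```

After the log has been zeroed, the current cycle restarts at a low value, so metadata stamped with an older, larger cycle fails this test until the log catches up.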
  7. 25 Aug, 2015 1 commit
  8. 29 Jul, 2015 1 commit
    • xfs: create new metadata UUID field and incompat flag · ce748eaa
      Eric Sandeen committed
      This adds a new superblock field, sb_meta_uuid.  If set, along with
      a new incompat flag, the code will use that field on a V5 filesystem
      to compare to metadata UUIDs, which allows us to change the user-
      visible UUID at will.  Userspace handles the setting and clearing
      of the incompat flag as appropriate, as the UUID gets changed; i.e.
      setting the user-visible UUID back to the original UUID (as stored in
      the new field) will remove the incompatible feature flag.
      
      If the incompat flag is not set, this copies the user-visible UUID
      into the meta_uuid slot in memory when the superblock is read from disk;
      the meta_uuid field is not written back to disk in this case.
      
      The remainder of this patch simply switches verifiers, initializers,
      etc to use the new sb_meta_uuid field.
      Signed-off-by: Eric Sandeen <sandeen@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
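The in-memory handling described above might look roughly like the following. The structures and the boolean `has_meta_uuid` (standing in for the incompat feature flag) are simplifications invented for this sketch, not the real superblock layout.

```c
#include <stdbool.h>
#include <string.h>

struct uuid {
	unsigned char b[16];
};

/* Hypothetical in-memory superblock with only the fields used here. */
struct sb {
	struct uuid uuid;       /* user-visible UUID */
	struct uuid meta_uuid;  /* UUID stamped into metadata blocks */
	bool has_meta_uuid;     /* stands in for the incompat feature flag */
};

/* After reading from disk: without the flag, mirror the user-visible
 * UUID into the in-memory meta_uuid slot. */
static void sb_from_disk(struct sb *sb)
{
	if (!sb->has_meta_uuid)
		sb->meta_uuid = sb->uuid;
}

/* Verifiers always compare metadata against meta_uuid. */
static bool metadata_uuid_ok(const struct sb *sb, const struct uuid *on_block)
{
	return memcmp(&sb->meta_uuid, on_block, sizeof(*on_block)) == 0;
}
```

With this arrangement the verifiers need only one comparison path, whether or not the user-visible UUID has been changed.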
  9. 22 Jun, 2015 4 commits
  10. 29 May, 2015 1 commit
    • xfs: support min/max agbno args in block allocator · bfe46d4e
      Brian Foster committed
      The block allocator supports various arguments to tweak block allocation
      behavior and set allocation requirements. The sparse inode chunk feature
      introduces a new requirement not supported by the current arguments.
      Sparse inode allocations must convert or merge into an inode record that
      describes a fixed length chunk (64 inodes x inodesize). Full inode chunk
      allocations by definition always result in valid inode records. Sparse
      chunk allocations are smaller and the associated records can refer to
      blocks not owned by the inode chunk. This model can result in invalid
      inode records in certain cases.
      
      For example, if a sparse allocation occurs near the start of an AG, the
      aligned inode record for that chunk might refer to agbno 0. If an
      allocation occurs towards the end of the AG and the AG size is not
      aligned, the inode record could refer to blocks beyond the end of the
      AG. While neither of these scenarios directly results in corruption, they
      both insert invalid inode records and at minimum cause repair to
      complain, are unlikely to merge into full chunks over time, and set land
      mines for other areas of code.
      
      To guarantee sparse inode chunk allocation creates valid inode records,
      support the ability to specify an agbno range limit for
      XFS_ALLOCTYPE_NEAR_BNO block allocations. The min/max agbno's are
      specified in the allocation arguments and limit the block allocation
      algorithms to that range. The starting 'agbno' hint is clamped to the
      range if the specified agbno is out of range. If no sufficient extent is
      available within the range, the allocation fails. For backwards
      compatibility, the min/max fields can be initialized to 0 to disable
      range limiting (e.g., equivalent to min=0,max=agsize).
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
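The clamping rule for the starting hint can be sketched as below. `struct alloc_args` and `clamp_hint` are hypothetical names for illustration; only the behaviour (clamp the hint into the range, treat min=0/max=0 as min=0,max=agsize) is taken from the message above.

```c
#include <stdint.h>

/* Hypothetical subset of the allocation arguments. */
struct alloc_args {
	uint32_t agbno;      /* starting hint */
	uint32_t min_agbno;  /* both zero means "no range limit" */
	uint32_t max_agbno;
};

/* Clamp the starting hint into [min_agbno, max_agbno]. */
static void clamp_hint(struct alloc_args *args, uint32_t agsize)
{
	/* Backwards compatibility: 0/0 is equivalent to min=0,max=agsize. */
	if (args->min_agbno == 0 && args->max_agbno == 0)
		args->max_agbno = agsize;

	if (args->agbno < args->min_agbno)
		args->agbno = args->min_agbno;
	if (args->agbno > args->max_agbno)
		args->agbno = args->max_agbno;
}
```

The search itself would then have to reject candidate extents outside the range and fail the allocation if none fits, which this sketch does not model.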
  11. 24 Feb, 2015 1 commit
    • xfs: xfs_alloc_fix_minleft can underflow near ENOSPC · 3790a8cd
      Dave Chinner committed
      Test generic/224 is failing with a corruption being detected on one
      of Michael's test boxes.  Debug that Michael added is indicating
      that the minleft trimming is resulting in an underflow:
      
      .....
       before fixup:              rlen          1  args->len          0
       after xfs_alloc_fix_len  : rlen          1  args->len          1
       before goto out_nominleft: rlen          1  args->len          0
       before fixup:              rlen          1  args->len          0
       after xfs_alloc_fix_len  : rlen          1  args->len          1
       after fixup:               rlen          1  args->len          1
       before fixup:              rlen          1  args->len          0
       after xfs_alloc_fix_len  : rlen          1  args->len          1
       after fixup:               rlen 4294967295  args->len 4294967295
       XFS: Assertion failed: fs_is_ok, file: fs/xfs/libxfs/xfs_alloc.c, line: 1424
      
      The "goto out_nominleft:" indicates that we are getting close to
      ENOSPC in the AG, and a couple of allocations later we underflow
      and the corruption check fires in xfs_alloc_ag_vextent_size().
      
      The issue is that the extent length fixup comparisons are done
      with variables of xfs_extlen_t types. These are unsigned so an
      underflow looks like a really big value and hence is not detected
      as being smaller than the minimum length allowed for the extent.
      Hence the corruption check fires as it is noticing that the returned
      length is longer than the original extent length passed in.
      
      This can be easily fixed by ensuring we do the underflow test on
      signed values, the same way xfs_alloc_fix_len() prevents underflow.
      So that we realise in future that these casts prevent underflows from
      going undetected, add comments to the code indicating this.
      Reported-by: Michael L. Semon <mlsemon35@gmail.com>
      Tested-by: Michael L. Semon <mlsemon35@gmail.com>
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
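The underflow is easy to reproduce in isolation. In the sketch below, `extlen_t` stands in for the unsigned `xfs_extlen_t`, and the two helper functions are hypothetical; they only contrast an unsigned comparison with the signed comparison the fix relies on.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t extlen_t;  /* stands in for the unsigned xfs_extlen_t */

/* Buggy check: the subtraction wraps around, so the result looks like a
 * huge length and is never seen as smaller than minlen. */
static bool too_short_unsigned(extlen_t rlen, extlen_t trim, extlen_t minlen)
{
	return (extlen_t)(rlen - trim) < minlen;
}

/* Fixed check: compare as signed values so a negative result is caught. */
static bool too_short_signed(extlen_t rlen, extlen_t trim, extlen_t minlen)
{
	return (int32_t)(rlen - trim) < (int32_t)minlen;
}
```

With `rlen = 1` and `trim = 2`, the unsigned difference is 4294967295, the same value that appears in the debug output above, and only the signed comparison reports the extent as too short.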
  12. 23 Feb, 2015 2 commits
  13. 28 Nov, 2014 1 commit
  14. 09 Sep, 2014 1 commit
    • xfs: add a few more verifier tests · e1b05723
      Eric Sandeen committed
      These were exposed by fsfuzzer runs; without them we fail
      in various exciting and sometimes convoluted ways when we
      encounter disk corruption.
      
      Without the MAXLEVELS tests we tend to walk off the end of
      an array in a loop like this:
      
              for (i = 0; i < cur->bc_nlevels; i++) {
                      if (cur->bc_bufs[i])
      
      Without the dirblklog test we try to allocate more memory
      than we could possibly hope for and loop forever:
      
      xfs_dabuf_map()
      	nfsb = mp->m_dir_geo->fsbcount;
      	irecs = kmem_zalloc(sizeof(irec) * nfsb, KM_SLEEP...
      
      As for the logbsize check, that's the convoluted one.
      
      If logbsize is specified at mount time, it's sanitized
      in xfs_parseargs; in particular it makes sure that it's
      not > XLOG_MAX_RECORD_BSIZE.
      
      If not specified at mount time, it comes from the superblock
      via sb_logsunit; this is limited to 256k at mkfs time as well;
      it's copied into m_logbsize in xfs_finish_flags().
      
      However, if for some reason the on-disk value is corrupt and
      too large, nothing catches it.  It's a circuitous path, but
      that size eventually finds its way to places that make the kernel
      very unhappy, leading to oopses in xlog_pack_data() because we
      use the size as an index into iclog->ic_data, but the array
      is not necessarily that big.
      
      Anyway - bounds checking when we read from disk is a good thing!
      Signed-off-by: Eric Sandeen <sandeen@redhat.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
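Bounds checks of the kind this commit adds could look like the sketch below. The structure, the field names, and the `MAX_LEVELS` value are illustrative assumptions; only the 256k log buffer limit comes from the message above.

```c
#include <stdbool.h>
#include <stdint.h>

#define MAX_LEVELS     8u              /* illustrative btree depth limit */
#define MAX_LOG_BSIZE  (256u * 1024u)  /* the 256k limit mentioned above */

/* Hypothetical values read from disk that later index arrays or size
 * allocations. */
struct disk_fields {
	uint32_t nlevels;   /* btree height */
	uint32_t logsunit;  /* log stripe unit, later used as a buffer size */
};

/* Reject values that would walk off an array or oversize a buffer. */
static bool disk_fields_sane(const struct disk_fields *f)
{
	if (f->nlevels == 0 || f->nlevels > MAX_LEVELS)
		return false;
	if (f->logsunit > MAX_LOG_BSIZE)
		return false;
	return true;
}
```

The point is that the check runs when the value is read from disk, before any loop or allocation trusts it.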
  15. 25 Jun, 2014 2 commits
  16. 06 Jun, 2014 2 commits
    • xfs: Fix rounding in xfs_alloc_fix_len() · 30265117
      Jan Kara committed
      Rounding in xfs_alloc_fix_len() is wrong. As the comment states, the
      result should be a number of the form (k*prod+mod); however, due to a
      sign mistake the result is different. As a result, allocations on RAID
      arrays could be misaligned in some cases.
      
      This also seems to fix occasional assertion failure:
      	XFS_WANT_CORRUPTED_GOTO(rlen <= flen, error0)
      in xfs_alloc_ag_vextent_size().
      
      Also add an assertion that the result of xfs_alloc_fix_len() is of
      expected form.
      Signed-off-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
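A minimal sketch of rounding a length down to the form k*prod+mod follows. `fix_len` is a hypothetical standalone function, not the kernel routine; it assumes `prod > 0` and `mod < prod`, and returns the length unchanged when trimming would drop it below `minlen`.

```c
#include <stdint.h>

/* Trim rlen down to the largest value of the form k*prod + mod.
 * If that would fall below minlen, return rlen unchanged. */
static uint32_t fix_len(uint32_t rlen, uint32_t prod, uint32_t mod,
			uint32_t minlen)
{
	uint32_t k = rlen % prod;
	int64_t fixed;

	if (k == mod)
		return rlen;
	if (k > mod)
		fixed = (int64_t)rlen - (k - mod);
	else
		fixed = (int64_t)rlen - prod + (mod - k);

	/* Signed comparison so a negative result is not mistaken for a
	 * very large length. */
	if (fixed < (int64_t)minlen)
		return rlen;
	return (uint32_t)fixed;
}
```

For example, with `prod = 4` and `mod = 1`, a length of 10 is trimmed to 9 and a length of 8 is trimmed to 5; both results leave remainder 1 when divided by 4.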
    • xfs: kill xfs_buf_geterror() · 36de9556
      Dave Chinner committed
      Most of the callers are just calling ASSERT(!xfs_buf_geterror())
      which means they are checking for bp->b_error == 0. If bp is null in
      this case, we will assert fail, and hence it's no different in
      result to oopsing because of a null bp. In some cases, errors have
      already been checked for or the function returning the buffer can't
      return a buffer with an error, so it's just a redundant assert.
      Either way, the assert can be removed.
      
      The other two non-assert callers can just test for a buffer and
      error properly.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
  17. 27 Feb, 2014 4 commits
  18. 07 Nov, 2013 1 commit
  19. 24 Oct, 2013 3 commits
    • xfs: decouple inode and bmap btree header files · a4fbe6ab
      Dave Chinner committed
      Currently the xfs_inode.h header has a dependency on the definition
      of the BMAP btree records as the inode fork includes an array of
      xfs_bmbt_rec_host_t objects in its definition.
      
      Move all the btree format definitions from xfs_btree.h,
      xfs_bmap_btree.h, xfs_alloc_btree.h and xfs_ialloc_btree.h to
      xfs_format.h to continue the process of centralising the on-disk
      format definitions. With this done, the xfs inode definitions are no
      longer dependent on btree header files.
      
      This enables a massive culling of unnecessary includes, with close to
      200 #include directives removed from the XFS kernel code base.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Ben Myers <bpm@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
    • xfs: decouple log and transaction headers · 239880ef
      Dave Chinner committed
      xfs_trans.h has a dependency on xfs_log.h for a couple of
      structures. Most code that does transactions doesn't need to know
      anything about the log, but this dependency means that they have to
      include xfs_log.h. Decouple the xfs_trans.h and xfs_log.h header
      files and clean up the includes to be in dependency order.
      
      In doing this, remove the direct include of xfs_trans_reserve.h from
      xfs_trans.h so that we remove the dependency between xfs_trans.h and
      xfs_mount.h. Hence the xfs_trans.h include can be moved to
      indicate the actual dependencies other header files have on it.
      
      Note that these are kernel only header files, so this does not
      translate to any userspace changes at all.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Ben Myers <bpm@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
    • xfs: create a shared header file for format-related information · 70a9883c
      Dave Chinner committed
      All of the buffer operations structures are needed to be exported
      for xfs_db, so move them all to a common location rather than
      spreading them all over the place. They are verifying the on-disk
      format, so while xfs_format.h might be a good place, it is not part
      of the on disk format.
      
      Hence we need to create a new header file in which we centralise these
      related definitions. Start by moving the buffer operations
      structures, and then also move all the other definitions that have
      crept into xfs_log_format.h and xfs_format.h as there was no other
      shared header file to put them in.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Ben Myers <bpm@sgi.com>
  20. 13 Aug, 2013 1 commit
  21. 21 May, 2013 1 commit
    • xfs: Avoid pathological backwards allocation · 211d022c
      Jan Kara committed
      Writing a large file using direct IO in 16 MB chunks sometimes results
      in a pathological allocation pattern where 16 MB chunks of large free
      extent are allocated to a file in a reversed order. So extents of a file
      look for example as:
      
       ext logical physical expected length flags
         0        0        13          4550656
         1  4550656 188136807   4550668 12562432
         2 17113088 200699240 200699238 622592
         3 17735680 182046055 201321831   4096
         4 17739776 182041959 182050150   4096
         5 17743872 182037863 182046054   4096
         6 17747968 182033767 182041958   4096
         7 17752064 182029671 182037862   4096
      ...
      6757 45400064 154381644 154389835   4096
      6758 45404160 154377548 154385739   4096
      6759 45408256 252951571 154381643  73728 eof
      
      This happens because XFS_ALLOCTYPE_THIS_BNO allocation fails (the last
      extent in the file cannot be further extended) so we fall back to
      XFS_ALLOCTYPE_NEAR_BNO allocation which picks end of a large free
      extent as the best place to continue the file. Since the chunk at the
      end of the free extent again cannot be further extended, this behavior
      repeats until the whole free extent is consumed in a reversed order.
      
      For data allocations this backward allocation isn't beneficial so make
      xfs_alloc_compute_diff() pick start of a free extent instead of its end
      for them. That avoids the backward allocation pattern.
      
      See thread at http://oss.sgi.com/archives/xfs/2013-03/msg00144.html for
      more details about the reproduction case and why this solution was
      chosen.
      
      Based on idea by Dave Chinner <dchinner@redhat.com>.
      
      CC: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Mark Tinguely <tinguely@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
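The choice this commit changes can be reduced to a very small sketch. `pick_start` is a hypothetical, heavily simplified stand-in for the decision made in the allocator. It covers only the case where the wanted block lies at or beyond the end of the free extent, and it assumes `len <= flen`.

```c
#include <stdbool.h>
#include <stdint.h>

/* Choose where inside the free extent [fbno, fbno + flen) to place `len`
 * blocks when the wanted block lies at or beyond the extent's end. */
static uint64_t pick_start(uint64_t fbno, uint64_t flen, uint64_t len,
			   bool userdata)
{
	if (userdata)
		return fbno;           /* start: later chunks can grow forwards */
	return fbno + flen - len;  /* end: closest to the wanted block */
}
```

Placing a data chunk at the start of the free extent leaves the remaining free space directly after it, so the next exact-block allocation can extend the file instead of falling back and stepping backwards again.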
  22. 28 Apr, 2013 1 commit
    • xfs: buffer type overruns blf_flags field · 61fe135c
      Dave Chinner committed
      The buffer type passed to log recovery in the buffer log item
      overruns the blf_flags field. I had assumed that flags field was a
      32 bit value, and it turns out it is an unsigned short. Therefore
      having 19 flags doesn't really work.
      
      Convert the buffer type field to numeric value, and use the top 5
      bits of the flags field for it. We currently have 17 types of
      buffers, so using 5 bits gives us plenty of room for expansion in
      future....
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Ben Myers <bpm@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
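Storing a numeric type in the top 5 bits of a 16-bit flags field can be sketched as follows. The macro and function names (`BLFT_BITS`, `BLFT_SHIFT`, `BLFT_MASK`, `blft_set`, `blft_get`) are chosen for this example and are assumptions, not quotations from the patch.

```c
#include <stdint.h>

#define BLFT_BITS   5
#define BLFT_SHIFT  (16 - BLFT_BITS)  /* top 5 bits of a 16-bit field */
#define BLFT_MASK   ((uint16_t)(((1u << BLFT_BITS) - 1) << BLFT_SHIFT))

/* Store a numeric buffer type (0..31) without disturbing the low flag bits. */
static uint16_t blft_set(uint16_t flags, unsigned type)
{
	return (uint16_t)((flags & ~BLFT_MASK) |
			  ((type << BLFT_SHIFT) & BLFT_MASK));
}

/* Extract the numeric buffer type from the flags field. */
static unsigned blft_get(uint16_t flags)
{
	return (unsigned)(flags & BLFT_MASK) >> BLFT_SHIFT;
}
```

Five bits encode 32 distinct values, so 17 buffer types fit with room to spare, whereas one flag bit per type would need more bits than an unsigned short provides.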