1. 29 May, 2015 5 commits
    • xfs: pass inode count through ordered icreate log item · 463958af
      Brian Foster committed
      v5 superblocks use an ordered log item for logging the initialization of
      inode chunks. The icreate log item is currently hardcoded to an inode
      count of 64 inodes.
      
      The agbno and extent length are used to initialize the inode chunk from
      log recovery. While an incorrect inode count does not lead to bad inode
      chunk initialization, we should pass the correct inode count such that log
      recovery has enough data to perform meaningful validity checks on the
      chunk.
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
    • xfs: introduce inode record hole mask for sparse inode chunks · 5419040f
      Brian Foster committed
      The inode btrees track 64 inodes per record regardless of inode size.
      Thus, inode chunks on disk vary in size depending on the size of the
      inodes. This creates a contiguous allocation requirement for new inode
      chunks that can be difficult to satisfy on aged and fragmented (free
      space) filesystems.
      
      The inode record freecount currently uses 4 bytes on disk to track the
      free inode count. With a maximum freecount value of 64, only one byte is
      required. Convert the freecount field to a single byte and use two of
      the remaining 3 higher order bytes left for the hole mask field. Use the
      final leftover byte for the total count field.
      
      The hole mask field tracks holes in the chunks of physical space that
      the inode record refers to. This facilitates the sparse allocation of
      inode chunks when contiguous chunks are not available and allows the
      inode btrees to identify what portions of the chunk contain valid
      inodes. The total count field contains the total number of valid inodes
      referred to by the record. This can also be deduced from the hole mask.
      The count field provides clarity and redundancy for internal record
      verification.
      
      Note that neither of the new fields can be written to disk on filesystems
      without sparse inode support. Doing so writes to the high-order bytes of
      freecount and causes corruption from the perspective of older kernels.
      The on-disk inobt record data structure is updated with a union to
      distinguish between the original, "full" format and the new, "sparse"
      format. The conversion routines to get, insert and update records are
      updated to translate to and from the on-disk record accordingly such
      that freecount remains a 4-byte value on non-supported fs, yet the new
      fields of the in-core record are always valid with respect to the
      record. This means that higher level code can refer to the current
      in-core record format unconditionally and lower level code ensures that
      records are translated to/from disk according to the capabilities of the
      fs.
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
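      The record layout described in the commit above can be sketched in userspace C. This is a minimal illustration, not the kernel's definition: the type and function names are hypothetical, and it assumes that each of the 16 hole mask bits covers 4 of the 64 inodes and that a set bit marks a hole.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define INODES_PER_CHUNK        64
#define HOLEMASK_BITS           16
/* Assumption: 64 inodes / 16 mask bits = 4 inodes per hole mask bit. */
#define INODES_PER_HOLEMASK_BIT (INODES_PER_CHUNK / HOLEMASK_BITS)

/* Hypothetical view of the 4 bytes that originally held freecount. */
union inobt_rec_counts {
	struct { uint32_t freecount; } full;    /* original "full" format */
	struct {
		uint16_t holemask;              /* set bit = hole (assumed) */
		uint8_t  count;                 /* valid inodes in the record */
		uint8_t  freecount;             /* free inodes in the record */
	} sparse;                               /* new "sparse" format */
};

/* Count the set bits in a 16-bit mask. */
static unsigned popcount16(uint16_t v)
{
	unsigned n = 0;

	while (v) {
		n += v & 1u;
		v >>= 1;
	}
	return n;
}

/* The total count can be deduced from the hole mask. */
static unsigned count_from_holemask(uint16_t holemask)
{
	return INODES_PER_CHUNK - popcount16(holemask) * INODES_PER_HOLEMASK_BIT;
}

/* Redundancy check: the stored count must agree with the hole mask. */
static int sparse_counts_valid(const union inobt_rec_counts *rec)
{
	assert(rec != NULL);
	return rec->sparse.count == count_from_holemask(rec->sparse.holemask) &&
	       rec->sparse.freecount <= rec->sparse.count;
}
```

      A record with hole mask 0xFF00 would describe 32 valid inodes under these assumptions, so a stored count of anything other than 32 fails the check.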
    • xfs: use sparse chunk alignment for min. inode allocation requirement · 066a1884
      Brian Foster committed
      xfs_ialloc_ag_select() iterates through the allocation groups looking
      for free inodes or free space to determine whether to allow an inode
      allocation to proceed. If no free inodes are available, it assumes that
      an AG must have an extent longer than mp->m_ialloc_blks.
      
      Sparse inode chunk support currently allows for allocations smaller than
      the traditional inode chunk size specified in m_ialloc_blks. The current
      minimum sparse allocation is set in the superblock sb_spino_align field
      at mkfs time. Create a new m_ialloc_min_blks field in xfs_mount and use
      this to represent the minimum supported allocation size for inode
      chunks. Initialize m_ialloc_min_blks at mount time based on whether
      sparse inodes are supported.
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
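      A minimal sketch of the mount-time initialization described above, using a hypothetical stand-in for xfs_mount. Only the field names sb_spino_align, m_ialloc_blks and m_ialloc_min_blks come from the commit message; the struct, the feature flag and the helper names are assumptions for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical subset of the mount structure. */
struct mount_sketch {
	bool     has_sparse_inodes;   /* assumed feature flag */
	uint32_t sb_spino_align;      /* minimum sparse allocation, set at mkfs time */
	uint32_t m_ialloc_blks;       /* blocks in a full inode chunk */
	uint32_t m_ialloc_min_blks;   /* minimum supported inode chunk allocation */
};

/* Pick the minimum allocation size based on sparse inode support. */
static void init_ialloc_min_blks(struct mount_sketch *mp)
{
	assert(mp != NULL);
	if (mp->has_sparse_inodes)
		mp->m_ialloc_min_blks = mp->sb_spino_align;
	else
		mp->m_ialloc_min_blks = mp->m_ialloc_blks;
}

/* Simplified AG check: is the longest free extent big enough? */
static bool ag_may_alloc_inodes(const struct mount_sketch *mp, uint32_t longest)
{
	return longest >= mp->m_ialloc_min_blks;
}
```

      With sparse inodes enabled, an AG whose longest free extent is shorter than a full chunk can still be selected, provided it covers the sparse alignment.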
    • xfs: update free inode record logic to support sparse inode records · 999633d3
      Brian Foster committed
      xfs_difree_inobt() uses logic in a couple of places that assumes inobt
      records refer to fully allocated chunks. Specifically, the use of
      mp->m_ialloc_inos can cause problems for inode chunks that are sparsely
      allocated. Sparse inode chunks can, by definition, define a smaller
      number of inodes than a full inode chunk.
      
      Fix the logic that determines whether an inode record should be removed
      from the inobt to use the ir_free mask rather than ir_freecount. Fix the
      agi counters modification to use ir_freecount to add the actual number
      of inodes freed rather than assuming a full inode chunk.
      
      Also make sure that we preserve the behavior to not remove inode chunks
      if the block size is large enough for multiple inode chunks (e.g.,
      bsize=64k, isize=512). This behavior was previously implicit in that in
      such configurations, ir.freecount of a single record never matches
      m_ialloc_inos. Hence, add some comments as well.
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
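      The removal decision described above can be sketched as follows. This is an illustration under stated assumptions, not the kernel code: the struct and helper names are hypothetical, and it assumes that holes in a sparse record appear as set bits in ir_free, so a record with no allocated inodes has every bit set.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define INODES_PER_CHUNK 64
#define INOBT_ALL_FREE   UINT64_MAX   /* every bit of ir_free set */

/* Hypothetical in-core record subset. */
struct irec_sketch {
	uint64_t ir_free;        /* free mask; holes assumed to be set too */
	uint8_t  ir_freecount;   /* number of real free inodes */
};

/*
 * Remove the record only when nothing in it is allocated and a block
 * does not hold more inodes than one chunk (e.g. bsize=64k, isize=512
 * gives 128 inodes per block, so such records are kept).
 */
static bool should_remove_record(const struct irec_sketch *rec,
				 unsigned inodes_per_block)
{
	assert(rec != NULL);
	return rec->ir_free == INOBT_ALL_FREE &&
	       inodes_per_block <= INODES_PER_CHUNK;
}

/* Inodes released when the record goes away: the actual free count. */
static unsigned inodes_released(const struct irec_sketch *rec)
{
	return rec->ir_freecount;
}
```

      For a sparse record holding 32 inodes, the counters would change by 32 rather than by a full chunk of 64.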
    • xfs: create individual inode alloc. helper · d4cc540b
      Brian Foster committed
      Inode allocation from sparse inode records must filter the ir_free mask
      against ir_holemask.  In preparation for this requirement, create a
      helper to allocate an individual inode from an inode record.
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
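      Filtering ir_free against ir_holemask might look like the sketch below. The helper names are hypothetical, and the mapping of hole mask bit n to inodes 4n..4n+3 is an assumption made for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Expand a 16-bit hole mask into a 64-bit per-inode mask of holes. */
static uint64_t holemask_to_inode_mask(uint16_t holemask)
{
	uint64_t mask = 0;

	for (int bit = 0; bit < 16; bit++) {
		if (holemask & (1u << bit))
			mask |= (uint64_t)0xF << (bit * 4);  /* 4 inodes per bit (assumed) */
	}
	return mask;
}

/*
 * Return the index of the first inode that is free and not in a hole,
 * or -1 if the record has no allocatable inode.
 */
static int first_free_inode(uint64_t ir_free, uint16_t ir_holemask)
{
	uint64_t holes = holemask_to_inode_mask(ir_holemask);
	uint64_t realfree = ir_free & ~holes;

	assert((realfree & holes) == 0);  /* holes are never candidates */
	for (int i = 0; i < 64; i++) {
		if (realfree & ((uint64_t)1 << i))
			return i;
	}
	return -1;
}
```

      With every bit of the free mask set and a hole mask of 0x0001, the first usable inode is index 4, because inodes 0 to 3 fall in the hole.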
  2. 23 February, 2015 3 commits
    • xfs: pass mp to XFS_WANT_CORRUPTED_RETURN · 5fb5aeee
      Eric Sandeen committed
      Today, if we hit an XFS_WANT_CORRUPTED_RETURN we don't print any
      information about which filesystem hit it.  Passing in the mp allows
      us to print the filesystem (device) name, which is a pretty critical
      piece of information.
      
      Tested by running fsfuzzer 'til I hit some.
      Signed-off-by: Eric Sandeen <sandeen@redhat.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
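      The idea of passing the mount into the corruption macro can be shown with a userspace sketch. The macro below is hypothetical and much simpler than the kernel one; the struct, the error value and the message format are assumptions chosen for the example.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical error value standing in for the kernel's corruption errno. */
#define EFSCORRUPTED_SKETCH 117

/* Hypothetical mount subset: just the name we want in the report. */
struct mount_sketch {
	const char *fsname;
};

/* On a failed check, report which filesystem hit it, then return. */
#define WANT_CORRUPTED_RETURN_SKETCH(mp, expr)                           \
	do {                                                             \
		if (!(expr)) {                                           \
			fprintf(stderr, "XFS (%s): check failed: %s\n",  \
				(mp)->fsname, #expr);                    \
			return -EFSCORRUPTED_SKETCH;                     \
		}                                                        \
	} while (0)

/* Example caller: validate a btree level read from disk. */
static int check_level(const struct mount_sketch *mp, int level, int maxlevels)
{
	assert(mp != NULL);
	WANT_CORRUPTED_RETURN_SKETCH(mp, level >= 0 && level < maxlevels);
	return 0;
}
```

      Because the mount is an argument, the report names the device, which a macro without mp could not do.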
    • xfs: pass mp to XFS_WANT_CORRUPTED_GOTO · c29aad41
      Eric Sandeen committed
      Today, if we hit an XFS_WANT_CORRUPTED_GOTO we don't print any
      information about which filesystem hit it.  Passing in the mp allows
      us to print the filesystem (device) name, which is a pretty critical
      piece of information.
      
      Tested by running fsfuzzer 'til I hit some.
      Signed-off-by: Eric Sandeen <sandeen@redhat.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
    • xfs: use generic percpu counters for inode counter · 501ab323
      Dave Chinner committed
      XFS has hand-rolled per-cpu counters for the superblock since before
      there was any generic implementation. There are some warts around
      the use of them for the inode counter as the hand-rolled counter is
      designed to be accurate at zero, but has no specific accuracy at
      any other value. This design causes problems for the maximum inode
      count threshold enforcement, as there is no trigger that balances
      the counters as they get close to the maximum threshold.
      
      Instead of designing new triggers for balancing, just replace the
      hand-rolled per-cpu counter with a generic counter.  This enables us
      to update the counter through the normal superblock modification
      functions, but rather than do that we add a xfs_mod_icount() helper
      function (from Christoph Hellwig) and keep the percpu counter
      outside the superblock in the struct xfs_mount.
      
      This means we still need to initialise the per-cpu counter
      specifically when we read the superblock, and vice versa when we
      log/write it, but it does mean that we don't need to change any
      other code.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
  3. 04 December, 2014 1 commit
    • xfs: fix premature enospc on inode allocation · 7a1df156
      Dave Chinner committed
      After growing a filesystem, XFS can fail to allocate inodes even
      though there is a large amount of space available in the filesystem
      for inodes. The issue is caused by a nearly full allocation group
      having enough free space in it to be considered for inode
      allocation, but not enough contiguous free space to actually
      allocate inodes.  This situation results in successful selection
      of the AG for allocation, then failure of the allocation resulting
      in ENOSPC being reported to the caller.
      
      It is caused by two possible issues. Firstly, we only consider the
      longest free extent and whether it would fit an inode chunk. If the
      extent is not correctly aligned, then we can't allocate an inode
      chunk in it regardless of the fact that it is large enough. This
      tends to be a permanent error until space in the AG is freed.
      
      The second issue is that we don't actually lock the AGI or AGF when
      we are doing these checks, and so by the time we get to actually
      allocating the inode chunk the space we thought we had in the AG may
      have been allocated. This tends to be a spurious error as it
      requires a race to trigger. Hence this case is ignored in this patch
      as the reported problem is for permanent errors.
      
      The first issue could be addressed by simply taking into account the
      alignment when checking the longest extent. This, however, would
      prevent allocation in AGs that have aligned, exact sized extents
      free. However, this case should be fairly rare compared to the
      number of allocations that occur near ENOSPC that would trigger this
      condition.
      
      Hence, when selecting the inode AG, take into account the inode
      cluster alignment when checking the longest free extent in the AG.
      If we can't find any AGs with a contiguous free space large
      enough to be aligned, drop the alignment addition and just try for
      an AG that has enough contiguous free space available for an inode
      chunk. This won't prevent issues from occurring, but should avoid
      situations where other AGs have lots of free space but the selected
      AG can't allocate due to alignment constraints.
      Reported-by: Arkadiusz Miskiewicz <arekm@maven.pl>
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
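      The two-pass AG selection described above can be sketched as below. This is a simplified model with hypothetical names: it looks only at each AG's longest free extent and ignores the other free-space checks and locking that a real implementation needs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-AG summary: length of the longest free extent. */
struct ag_sketch {
	uint32_t longest;
};

/*
 * Pass 0 requires room for the chunk plus the cluster alignment.
 * Pass 1 drops the alignment addition and accepts any AG whose
 * longest extent fits a chunk. Returns the AG index, or -1.
 */
static int select_inode_ag(const struct ag_sketch *ags, int nags,
			   uint32_t chunk_blks, uint32_t align_blks)
{
	assert(ags != NULL && nags >= 0);
	for (int pass = 0; pass < 2; pass++) {
		uint32_t need = chunk_blks + (pass == 0 ? align_blks : 0);

		for (int i = 0; i < nags; i++) {
			if (ags[i].longest >= need)
				return i;
		}
	}
	return -1;
}
```

      With a chunk of 8 blocks and an alignment of 4, an AG whose longest extent is 8 is skipped in the first pass in favour of one with 12 or more, and is chosen only if no such AG exists.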
  4. 01 December, 2014 1 commit
  5. 28 November, 2014 3 commits
  6. 29 September, 2014 1 commit
  7. 09 September, 2014 1 commit
    • xfs: add a few more verifier tests · e1b05723
      Eric Sandeen committed
      These were exposed by fsfuzzer runs; without them we fail
      in various exciting and sometimes convoluted ways when we
      encounter disk corruption.
      
      Without the MAXLEVELS tests we tend to walk off the end of
      an array in a loop like this:
      
              for (i = 0; i < cur->bc_nlevels; i++) {
                      if (cur->bc_bufs[i])
      
      Without the dirblklog test we try to allocate more memory
      than we could possibly hope for and loop forever:
      
      xfs_dabuf_map()
      	nfsb = mp->m_dir_geo->fsbcount;
      	irecs = kmem_zalloc(sizeof(irec) * nfsb, KM_SLEEP...
      
      As for the logbsize check, that's the convoluted one.
      
      If logbsize is specified at mount time, it's sanitized
      in xfs_parseargs; in particular it makes sure that it's
      not > XLOG_MAX_RECORD_BSIZE.
      
      If not specified at mount time, it comes from the superblock
      via sb_logsunit; this is limited to 256k at mkfs time as well;
      it's copied into m_logbsize in xfs_finish_flags().
      
      However, if for some reason the on-disk value is corrupt and
      too large, nothing catches it.  It's a circuitous path, but
      that size eventually finds its way to places that make the kernel
      very unhappy, leading to oopses in xlog_pack_data() because we
      use the size as an index into iclog->ic_data, but the array
      is not necessarily that big.
      
      Anyway - bounds checking when we read from disk is a good thing!
      Signed-off-by: Eric Sandeen <sandeen@redhat.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
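      The kind of bounds checking this commit argues for can be sketched in userspace C. The constants and helper names below are hypothetical stand-ins: the level limit of 8 is an assumed value, and the 256k log size limit is taken from the commit message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAXLEVELS_SKETCH        8                /* assumed btree level limit */
#define MAX_RECORD_BSIZE_SKETCH (256u * 1024u)   /* 256k, per the message above */

/* Reject a level count that would index past the per-level arrays. */
static bool btree_levels_ok(uint32_t nlevels)
{
	return nlevels <= MAXLEVELS_SKETCH;
}

/* Reject an on-disk log stripe unit larger than the maximum record size. */
static bool logsunit_ok(uint32_t logsunit)
{
	return logsunit <= MAX_RECORD_BSIZE_SKETCH;
}

/*
 * Walk the per-level buffer array only after validating nlevels.
 * Returns the number of non-NULL buffers, or -1 if nlevels is corrupt.
 */
static int count_cached_bufs(void *const bufs[MAXLEVELS_SKETCH], uint32_t nlevels)
{
	int n = 0;

	assert(bufs != NULL);
	if (!btree_levels_ok(nlevels))
		return -1;
	for (uint32_t i = 0; i < nlevels; i++) {
		if (bufs[i])
			n++;
	}
	return n;
}
```

      Validating the value once when it is read from disk means the loop itself never has to guard against walking off the end of the array.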
  8. 25 June, 2014 2 commits
  9. 22 June, 2014 1 commit
  10. 06 June, 2014 1 commit
  11. 20 May, 2014 2 commits
  12. 24 April, 2014 6 commits
  13. 07 March, 2014 1 commit
    • xfs: avoid AGI/AGF deadlock scenario for inode chunk allocation · e480a723
      Brian Foster committed
      The inode chunk allocation path can lead to deadlock conditions if
      a transaction is dirtied with an AGF (to fix up the freelist) for
      an AG that cannot satisfy the actual allocation request. This code
      path is written to try and avoid this scenario, but it can be
      reproduced by running xfstests generic/270 in a loop on a 512b fs.
      
      An example situation is:
      - process A attempts an inode allocation on AG 3, modifies
        the freelist, fails the allocation and ultimately moves on to
        AG 0 with the AG 3 AGF held
      - process B is doing a free space operation (i.e., truncate) and
        acquires the AG 0 AGF, waits on the AG 3 AGF
      - process A acquires the AG 0 AGI, waits on the AG 0 AGF (deadlock)
      
      The problem here is that process A acquired the AG 3 AGF while
      moving on to AG 0 (and releasing the AG 3 AGI with the AG 3 AGF
      held). xfs_dialloc() makes one pass through each of the AGs when
      attempting to allocate an inode chunk. The expectation is a clean
      transaction if a particular AG cannot satisfy the allocation
      request. xfs_ialloc_ag_alloc() is written to support this through
      use of the minalignslop allocation args field.
      
      When using the agi->agi_newino optimization, we attempt an exact
      bno allocation request based on the location of the previously
      allocated chunk. minalignslop is set to inform the allocator that
      we will require alignment on this chunk, and thus to not allow the
      request for this AG if the extra space is not available. Suppose
      that the AG in question has just enough space for this request, but
      not at the requested bno. xfs_alloc_fix_freelist() will proceed as
      normal as it determines the request should succeed, and thus it is
      allowed to modify the agf. xfs_alloc_ag_vextent() ultimately fails
      because the requested bno is not available. In response, the caller
      moves on to a NEAR_BNO allocation request for the same AG. The
      alignment is set, but the minalignslop field is never reset. This
      increases the overall requirement of the request from the first
      attempt. If this delta is the difference between allocation success
      and failure for the AG, xfs_alloc_fix_freelist() rejects this
      request outright the second time around and causes the allocation
      request to unnecessarily fail for this AG.
      
      To address this situation, reset the minalignslop field immediately
      after use and prevent it from leaking into subsequent requests.
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Mark Tinguely <tinguely@sgi.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
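      The effect of a stale minalignslop can be shown with a small model. Everything here is a sketch under assumptions: the struct and helpers are hypothetical, and the requirement formula (minlen + alignment + minalignslop - 1) is a simplified stand-in for the allocator's actual space check.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical subset of the allocation arguments. */
struct alloc_args_sketch {
	uint32_t minlen;         /* blocks needed for the inode chunk */
	uint32_t alignment;      /* required alignment of the result */
	uint32_t minalignslop;   /* extra space reserved for later alignment */
};

/* Assumed model of the contiguous space an AG must offer. */
static uint32_t required_extent(const struct alloc_args_sketch *a)
{
	assert(a != NULL && a->alignment >= 1);
	return a->minlen + a->alignment + a->minalignslop - 1;
}

/* Exact-bno attempt: no alignment, but reserve slop for a retry. */
static void prepare_exact_bno(struct alloc_args_sketch *a, uint32_t cluster_align)
{
	a->alignment = 1;
	a->minalignslop = cluster_align - 1;
}

/* The fix: clear the slop right after the exact-bno attempt. */
static void finish_exact_bno(struct alloc_args_sketch *a)
{
	a->minalignslop = 0;
}

/* Near-bno retry: alignment is set explicitly. */
static void prepare_near_bno(struct alloc_args_sketch *a, uint32_t cluster_align)
{
	a->alignment = cluster_align;
}
```

      In this model, with minlen 8 and a cluster alignment of 4, the exact attempt needs 11 blocks. A retry that keeps the stale slop needs 14, while a retry after the reset needs 11 again, so the second request no longer asks for more than the first.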
  14. 27 February, 2014 4 commits
  15. 13 December, 2013 5 commits
  16. 12 December, 2013 1 commit
  17. 07 November, 2013 1 commit
  18. 24 October, 2013 1 commit
    • xfs: decouple inode and bmap btree header files · a4fbe6ab
      Dave Chinner committed
      Currently the xfs_inode.h header has a dependency on the definition
      of the BMAP btree records as the inode fork includes an array of
      xfs_bmbt_rec_host_t objects in its definition.
      
      Move all the btree format definitions from xfs_btree.h,
      xfs_bmap_btree.h, xfs_alloc_btree.h and xfs_ialloc_btree.h to
      xfs_format.h to continue the process of centralising the on-disk
      format definitions. With this done, the xfs inode definitions are no
      longer dependent on btree header files.
      
      This enables a massive culling of unnecessary includes, with close to
      200 #include directives removed from the XFS kernel code base.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Ben Myers <bpm@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>