1. 04 Oct, 2016 2 commits
  2. 26 Sep, 2016 1 commit
    • D
      xfs: remote attribute blocks aren't really userdata · 292378ed
      Dave Chinner committed
      When adding a new remote attribute, we write the attribute to the
      new extent before the allocation transaction is committed. This
      means we cannot reuse busy extents as that violates crash
      consistency semantics. Hence we currently treat remote attribute
      extent allocation like userdata because it has the same overwrite
      ordering constraints as userdata.
      
      Unfortunately, this also allows the allocator to incorrectly apply
      extent size hints to the remote attribute extent allocation. This
      results in interesting failures, such as transaction block
      reservation overruns and in-memory inode attribute fork corruption.
      
      To fix this, we need to separate the busy extent reuse configuration
      from the userdata configuration. This changes the definition of
      XFS_BMAPI_METADATA slightly - it now means that allocation is
      metadata and reuse of busy extents is acceptable due to the metadata
      ordering semantics of the journal. If this flag is not set, it
      means the allocation has unordered data writeback, and hence
      busy extent reuse is not allowed. It no longer implies the
      allocation is for user data, just that the data write will not be
      strictly ordered. This matches the semantics for both user data
      and remote attribute block allocation.
      
      As such, this patch changes the "userdata" field to a "datatype"
      field, and adds a "no busy reuse" flag to the field.
      When we detect an unordered data extent allocation, we immediately set
      the no reuse flag. We then set the "user data" flags based on the
      inode fork we are allocating the extent to. Hence we only set
      userdata flags on data fork allocations now and consider attribute
      fork remote extents to be an unordered metadata extent.
      
      The result is that remote attribute extents now have the expected
      allocation semantics, and the data fork allocation behaviour is
      completely unchanged.
      
      It should be noted that there may be other ways to fix this (e.g.
      use ordered metadata buffers for the remote attribute extent data
      write) but they are more invasive and difficult to validate both
      from a design and implementation POV. Hence this patch takes the
      simple, obvious route to fixing the problem...
      Reported-and-tested-by: Ross Zwisler <ross.zwisler@linux.intel.com>
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      292378ed
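The flag selection described in this commit message can be sketched roughly as follows. This is a minimal illustration, not the kernel code: the names `DT_USERDATA`, `DT_NOBUSY`, `enum fork` and `datatype_for()` are hypothetical stand-ins chosen for the example.

```c
#include <assert.h>

/* Hypothetical flag names; the kernel's actual identifiers may differ. */
#define DT_USERDATA  (1u << 0)  /* extent will hold file data */
#define DT_NOBUSY    (1u << 1)  /* busy extents must not be reused */

enum fork { FORK_DATA, FORK_ATTR };

/*
 * Pick the "datatype" flags for an allocation.
 * is_metadata: non-zero when the write is ordered by the journal.
 */
static unsigned int datatype_for(int is_metadata, enum fork whichfork)
{
    unsigned int datatype = 0;

    assert(whichfork == FORK_DATA || whichfork == FORK_ATTR);
    if (!is_metadata) {
        /* Unordered data writeback: never reuse busy extents. */
        datatype |= DT_NOBUSY;
        /* Only data fork allocations count as user data. */
        if (whichfork == FORK_DATA)
            datatype |= DT_USERDATA;
    }
    return datatype;
}
```

Under these assumptions a remote attribute allocation gets the no-reuse flag without the user data flag, so extent size hints would not be applied to it.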
  3. 19 Sep, 2016 1 commit
    • D
      xfs: set up per-AG free space reservations · 3fd129b6
      Darrick J. Wong committed
      One unfortunate quirk of the reference count and reverse mapping
      btrees -- they can expand in size when blocks are written to *other*
      allocation groups if, say, one large extent becomes a lot of tiny
      extents.  Since we don't want to start throwing errors in the middle
      of CoWing, we need to reserve some blocks to handle future expansion.
      The transaction block reservation counters aren't sufficient here
      because we have to have a reserve of blocks in every AG, not just
      somewhere in the filesystem.
      
      Therefore, create two per-AG block reservation pools.  One feeds the
      AGFL so that rmapbt expansion always succeeds, and the other feeds all
      other metadata so that refcountbt expansion never fails.
      
      Use the count of how many reserved blocks we need to have on hand to
      create a virtual reservation in the AG.  Through selective clamping of
      the maximum length of allocation requests and of the length of the
      longest free extent, we can make it look like there's less free space
      in the AG unless the reservation owner is asking for blocks.
      
      In other words, play some accounting tricks in-core to make sure that
      we always have blocks available.  On the plus side, there's nothing to
      clean up if we crash, which is in contrast to the strategy that the rough
      draft used (actually removing extents from the freespace btrees).
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      3fd129b6
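One way to picture the "virtual reservation" is a helper that shrinks the longest free extent reported to callers who do not own the reservation. The sketch below is a simplified model under stated assumptions: `struct ag_view` and `visible_longest()` are hypothetical names, and the real accounting in the kernel handles more cases (the AGFL, for instance).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical in-core view of one allocation group. */
struct ag_view {
    uint32_t freeblks;  /* total free blocks in the AG */
    uint32_t longest;   /* length of the longest free extent */
    uint32_t reserved;  /* blocks set aside for the reservation owner */
};

/*
 * Longest extent a caller is allowed to see. Free space outside the
 * longest extent covers the reservation first; only the shortfall is
 * subtracted from the longest extent. The owner sees the real value.
 */
static uint32_t visible_longest(const struct ag_view *ag, int is_owner)
{
    uint32_t elsewhere, shortfall;

    assert(ag->longest <= ag->freeblks);
    if (is_owner)
        return ag->longest;

    elsewhere = ag->freeblks - ag->longest;
    shortfall = ag->reserved > elsewhere ? ag->reserved - elsewhere : 0;
    return ag->longest > shortfall ? ag->longest - shortfall : 0;
}
```

Because nothing is written to disk, there is no state to undo after a crash, which is the advantage the commit message points out.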
  4. 26 Aug, 2016 1 commit
  5. 17 Aug, 2016 2 commits
  6. 03 Aug, 2016 9 commits
  7. 21 Jun, 2016 2 commits
  8. 01 Jun, 2016 1 commit
  9. 04 Jan, 2016 2 commits
  10. 03 Nov, 2015 1 commit
    • D
      xfs: introduce BMAPI_ZERO for allocating zeroed extents · 3fbbbea3
      Dave Chinner committed
      To enable DAX to do atomic allocation of zeroed extents, we need to
      drive the block zeroing deep into the allocator. Because
      xfs_bmapi_write() can return merged extents on allocation that were
      only partially allocated (i.e. requested range spans allocated and
      hole regions, allocation into the hole was contiguous), we cannot
      zero the extent returned from xfs_bmapi_write() as that can
      overwrite existing data with zeros.
      
      Hence we have to drive the extent zeroing into the allocation code,
      prior to where we merge the extents into the BMBT and return the
      resultant map. This means we need to propagate this need down to
      the xfs_alloc_vextent() and issue the block zeroing at this point.
      
      While this functionality is being introduced for DAX, there is no
      reason why it is specific to DAX - we can pre-zero blocks during the
      allocation transaction on any type of device. It's just slow (and
      usually slower than unwritten allocation and conversion) on
      traditional block devices so doesn't tend to get used. We can,
      however, hook hardware zeroing optimisations via sb_issue_zeroout()
      to this operation, so it may be useful in future and hence the
      "allocate zeroed blocks" API needs to be implementation neutral.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      3fbbbea3
  11. 12 Oct, 2015 2 commits
    • B
      xfs: per-filesystem stats counter implementation · ff6d6af2
      Bill O'Donnell committed
      This patch modifies the stats counting macros and the callers
      to those macros to properly increment, decrement, and add-to
      the xfs stats counts. The counts for global and per-fs stats
      are correctly advanced, and cleared by writing a "1" to the
      corresponding clear file.
      
      global counts: /sys/fs/xfs/stats/stats
      per-fs counts: /sys/fs/xfs/sda*/stats/stats
      
      global clear:  /sys/fs/xfs/stats/stats_clear
      per-fs clear:  /sys/fs/xfs/sda*/stats/stats_clear
      
      [dchinner: cleaned up macro variables, removed CONFIG_FS_PROC around
       stats structures and macros. ]
      Signed-off-by: Bill O'Donnell <billodo@redhat.com>
      Reviewed-by: Eric Sandeen <sandeen@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      ff6d6af2
    • B
      xfs: validate metadata LSNs against log on v5 superblocks · a45086e2
      Brian Foster committed
      Since the onset of v5 superblocks, the LSN of the last modification has
      been included in a variety of on-disk data structures. This LSN is used
      to provide log recovery ordering guarantees (e.g., to ensure an older
      log recovery item is not replayed over a newer target data structure).
      
      While this works correctly from the point a filesystem is formatted and
      mounted, userspace tools have some problematic behaviors that defeat
      this mechanism. For example, xfs_repair historically zeroes out the log
      unconditionally (regardless of whether corruption is detected). If this
      occurs, the LSN of the filesystem is reset and the log is now in a
      problematic state with respect to on-disk metadata structures that might
      have a larger LSN. Until either the log catches up to the highest
      previously used metadata LSN or each affected data structure is modified
      and written out without incident (which resets the metadata LSN), log
      recovery is susceptible to filesystem corruption.
      
      This problem is ultimately addressed and repaired in the associated
      userspace tools. The kernel is still responsible for detecting the problem
      and notifying the user that something is wrong. Check the superblock LSN at
      mount time and fail the mount if it is invalid. From that point on,
      trigger verifier failure on any metadata I/O where an invalid LSN is
      detected. This results in a filesystem shutdown and guarantees that we
      do not log metadata changes with invalid LSNs on disk. Since this is a
      known issue with a known recovery path, present a warning to instruct
      the user how to recover.
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      a45086e2
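The check described in this commit message amounts to rejecting any metadata LSN that lies ahead of the current head of the log. A minimal sketch follows, assuming an LSN packs a 32-bit cycle number in its upper half and a 32-bit block number in its lower half; the helper names are hypothetical and this is not the kernel's implementation.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout: cycle in the high 32 bits, block in the low 32 bits. */
static uint32_t lsn_cycle(uint64_t lsn) { return (uint32_t)(lsn >> 32); }
static uint32_t lsn_block(uint64_t lsn) { return (uint32_t)lsn; }

static uint64_t make_lsn(uint32_t cycle, uint32_t block)
{
    return ((uint64_t)cycle << 32) | block;
}

/*
 * A metadata LSN is plausible only if it is not ahead of the log head.
 * Returns 1 when valid, 0 when the structure claims a future LSN.
 */
static int lsn_is_valid(uint64_t lsn, uint64_t log_head)
{
    if (lsn_cycle(lsn) > lsn_cycle(log_head))
        return 0;
    if (lsn_cycle(lsn) == lsn_cycle(log_head) &&
        lsn_block(lsn) > lsn_block(log_head))
        return 0;
    return 1;
}
```

In this model a log that was zeroed by a repair tool restarts at a low cycle, so any structure still stamped with an older, larger LSN fails the check.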
  12. 25 Aug, 2015 1 commit
  13. 29 Jul, 2015 1 commit
    • E
      xfs: create new metadata UUID field and incompat flag · ce748eaa
      Eric Sandeen committed
      This adds a new superblock field, sb_meta_uuid.  If set, along with
      a new incompat flag, the code will use that field on a V5 filesystem
      to compare to metadata UUIDs, which allows us to change the user-
      visible UUID at will.  Userspace handles the setting and clearing
      of the incompat flag as appropriate, as the UUID gets changed; i.e.
      setting the user-visible UUID back to the original UUID (as stored in
      the new field) will remove the incompatible feature flag.
      
      If the incompat flag is not set, this copies the user-visible UUID into
      the meta_uuid slot in memory when the superblock is read from disk;
      the meta_uuid field is not written back to disk in this case.
      
      The remainder of this patch simply switches verifiers, initializers,
      etc to use the new sb_meta_uuid field.
      Signed-off-by: Eric Sandeen <sandeen@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      
      ce748eaa
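The in-memory handling of the two UUIDs can be illustrated with a small model. The struct and function names below (`struct sb_view`, `sb_finish_read()`, `metadata_uuid_ok()`) are hypothetical; the sketch only mirrors the behaviour the message describes: verifiers always compare against the meta UUID, which defaults to the user-visible UUID when the incompat flag is clear.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define UUID_LEN 16

/* Hypothetical, simplified in-memory superblock. */
struct sb_view {
    uint8_t uuid[UUID_LEN];       /* user-visible UUID */
    uint8_t meta_uuid[UUID_LEN];  /* UUID stamped into metadata blocks */
    int     has_meta_uuid;        /* models the new incompat flag */
};

/* After reading from disk: without the flag, both UUIDs are the same. */
static void sb_finish_read(struct sb_view *sb)
{
    assert(sb != NULL);
    if (!sb->has_meta_uuid)
        memcpy(sb->meta_uuid, sb->uuid, UUID_LEN);
}

/* Verifiers compare a block's UUID against meta_uuid, never uuid. */
static int metadata_uuid_ok(const struct sb_view *sb,
                            const uint8_t *block_uuid)
{
    return memcmp(sb->meta_uuid, block_uuid, UUID_LEN) == 0;
}
```

With this split, changing `uuid` does not require rewriting every metadata block, since existing blocks still match `meta_uuid`.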
  14. 22 Jun, 2015 4 commits
  15. 29 May, 2015 1 commit
    • B
      xfs: support min/max agbno args in block allocator · bfe46d4e
      Brian Foster committed
      The block allocator supports various arguments to tweak block allocation
      behavior and set allocation requirements. The sparse inode chunk feature
      introduces a new requirement not supported by the current arguments.
      Sparse inode allocations must convert or merge into an inode record that
      describes a fixed length chunk (64 inodes x inodesize). Full inode chunk
      allocations by definition always result in valid inode records. Sparse
      chunk allocations are smaller and the associated records can refer to
      blocks not owned by the inode chunk. This model can result in invalid
      inode records in certain cases.
      
      For example, if a sparse allocation occurs near the start of an AG, the
      aligned inode record for that chunk might refer to agbno 0. If an
      allocation occurs towards the end of the AG and the AG size is not
      aligned, the inode record could refer to blocks beyond the end of the
      AG. While neither of these scenarios directly results in corruption, they
      both insert invalid inode records and at minimum cause repair to
      complain, are unlikely to merge into full chunks over time and set land
      mines for other areas of code.
      
      To guarantee sparse inode chunk allocation creates valid inode records,
      support the ability to specify an agbno range limit for
      XFS_ALLOCTYPE_NEAR_BNO block allocations. The min/max agbno's are
      specified in the allocation arguments and limit the block allocation
      algorithms to that range. The starting 'agbno' hint is clamped to the
      range if the specified agbno is out of range. If no sufficient extent is
      available within the range, the allocation fails. For backwards
      compatibility, the min/max fields can be initialized to 0 to disable
      range limiting (e.g., equivalent to min=0,max=agsize).
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      bfe46d4e
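The range handling described above (0/0 disables limiting, and an out-of-range hint is clamped) might look like the following sketch. The type and function names are hypothetical, and treating `max` as an inclusive bound is an assumption made for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical pair of AG block number limits. */
struct agbno_range {
    uint32_t min;
    uint32_t max;
};

/* min == max == 0 means "no limit", i.e. min=0, max=agsize. */
static struct agbno_range effective_range(uint32_t min, uint32_t max,
                                          uint32_t agsize)
{
    struct agbno_range r = { min, max };

    if (min == 0 && max == 0)
        r.max = agsize;
    return r;
}

/* Pull the starting hint into the permitted range. */
static uint32_t clamp_hint(uint32_t agbno, struct agbno_range r)
{
    assert(r.min <= r.max);
    if (agbno < r.min)
        return r.min;
    if (agbno > r.max)
        return r.max;
    return agbno;
}
```

Only the hint is adjusted here; per the message, an allocation that cannot find a sufficient extent inside the range simply fails.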
  16. 24 Feb, 2015 1 commit
    • D
      xfs: xfs_alloc_fix_minleft can underflow near ENOSPC · 3790a8cd
      Dave Chinner committed
      Test generic/224 is failing with a corruption being detected on one
      of Michael's test boxes.  Debug that Michael added is indicating
      that the minleft trimming is resulting in an underflow:
      
      .....
       before fixup:              rlen          1  args->len          0
       after xfs_alloc_fix_len  : rlen          1  args->len          1
       before goto out_nominleft: rlen          1  args->len          0
       before fixup:              rlen          1  args->len          0
       after xfs_alloc_fix_len  : rlen          1  args->len          1
       after fixup:               rlen          1  args->len          1
       before fixup:              rlen          1  args->len          0
       after xfs_alloc_fix_len  : rlen          1  args->len          1
       after fixup:               rlen 4294967295  args->len 4294967295
       XFS: Assertion failed: fs_is_ok, file: fs/xfs/libxfs/xfs_alloc.c, line: 1424
      
      The "goto out_nominleft:" indicates that we are getting close to
      ENOSPC in the AG, and a couple of allocations later we underflow
      and the corruption check fires in xfs_alloc_ag_vextent_size().
      
      The issue is that the extent length fixup comparisons are done
      with variables of xfs_extlen_t types. These are unsigned so an
      underflow looks like a really big value and hence is not detected
      as being smaller than the minimum length allowed for the extent.
      Hence the corruption check fires as it is noticing that the returned
      length is longer than the original extent length passed in.
      
      This can be easily fixed by ensuring we do the underflow test on
      signed values, the same way xfs_alloc_fix_len() prevents underflow.
      So that we realise in future that these casts prevent underflows from
      going undetected, add comments to the code indicating this.
      Reported-by: Michael L. Semon <mlsemon35@gmail.com>
      Tested-by: Michael L. Semon <mlsemon35@gmail.com>
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      3790a8cd
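The underflow described in this message is easy to reproduce outside the kernel. The sketch below contrasts an unsigned comparison with a signed one; `extlen_t` stands in for the unsigned `xfs_extlen_t`, and both helper functions are hypothetical illustrations rather than the actual fix.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t extlen_t;  /* stand-in for the unsigned xfs_extlen_t */

/* Buggy pattern: the subtraction wraps, so the minlen check passes. */
static int trim_len_unchecked(extlen_t len, extlen_t trim,
                              extlen_t minlen, extlen_t *out)
{
    extlen_t rlen = len - trim;  /* wraps when trim > len */

    assert(out != NULL);
    if (rlen < minlen)
        return 0;
    *out = rlen;
    return 1;
}

/* Fixed pattern: compare as signed values so underflow shows up. */
static int trim_len_checked(extlen_t len, extlen_t trim,
                            extlen_t minlen, extlen_t *out)
{
    int64_t rlen = (int64_t)len - (int64_t)trim;

    assert(out != NULL);
    if (rlen < (int64_t)minlen)
        return 0;  /* too short, or negative after underflow */
    *out = (extlen_t)rlen;
    return 1;
}
```

Trimming 2 blocks from a 1-block extent makes the unchecked version accept a length of 4294967295, the same value seen in the debug output, while the signed comparison rejects it.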
  17. 23 Feb, 2015 2 commits
  18. 28 Nov, 2014 1 commit
  19. 09 Sep, 2014 1 commit
    • E
      xfs: add a few more verifier tests · e1b05723
      Eric Sandeen committed
      These were exposed by fsfuzzer runs; without them we fail
      in various exciting and sometimes convoluted ways when we
      encounter disk corruption.
      
      Without the MAXLEVELS tests we tend to walk off the end of
      an array in a loop like this:
      
              for (i = 0; i < cur->bc_nlevels; i++) {
                      if (cur->bc_bufs[i])
      
      Without the dirblklog test we try to allocate more memory
      than we could possibly hope for and loop forever:
      
      xfs_dabuf_map()
      	nfsb = mp->m_dir_geo->fsbcount;
      	irecs = kmem_zalloc(sizeof(irec) * nfsb, KM_SLEEP...
      
      As for the logbsize check, that's the convoluted one.
      
      If logbsize is specified at mount time, it's sanitized
      in xfs_parseargs; in particular it makes sure that it's
      not > XLOG_MAX_RECORD_BSIZE.
      
      If not specified at mount time, it comes from the superblock
      via sb_logsunit; this is limited to 256k at mkfs time as well;
      it's copied into m_logbsize in xfs_finish_flags().
      
      However, if for some reason the on-disk value is corrupt and
      too large, nothing catches it.  It's a circuitous path, but
      that size eventually finds its way to places that make the kernel
      very unhappy, leading to oopses in xlog_pack_data() because we
      use the size as an index into iclog->ic_data, but the array
      is not necessarily that big.
      
      Anyway - bounds checking when we read from disk is a good thing!
      Signed-off-by: Eric Sandeen <sandeen@redhat.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      e1b05723
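The MAXLEVELS case shows the general pattern: validate a count read from disk before using it as a loop bound over a fixed-size array. A minimal sketch, with a hypothetical `MAX_LEVELS` limit and a simplified cursor type that are not the kernel's definitions:

```c
#include <assert.h>
#include <stddef.h>

#define MAX_LEVELS 8  /* hypothetical cap, for illustration only */

/* Simplified cursor: one buffer pointer per btree level. */
struct cursor_view {
    int   nlevels;            /* value that originates on disk */
    void *bufs[MAX_LEVELS];
};

/* Verifier-style check: reject level counts the array cannot hold. */
static int nlevels_ok(int nlevels)
{
    return nlevels >= 1 && nlevels <= MAX_LEVELS;
}

/* Count attached buffers, or return -1 if nlevels is out of bounds. */
static int count_bufs(const struct cursor_view *cur)
{
    int i, n = 0;

    assert(cur != NULL);
    if (!nlevels_ok(cur->nlevels))
        return -1;
    for (i = 0; i < cur->nlevels; i++) {
        if (cur->bufs[i])
            n++;
    }
    return n;
}
```

Without the check, a corrupt `nlevels` larger than the array would send the loop past the end of `bufs`, which is the failure mode the message describes.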
  20. 25 Jun, 2014 2 commits
  21. 06 Jun, 2014 2 commits
    • J
      xfs: Fix rounding in xfs_alloc_fix_len() · 30265117
      Jan Kara committed
      Rounding in xfs_alloc_fix_len() is wrong. As the comment states, the
      result should be a number of the form (k*prod+mod); however, due to a
      sign mistake the result is different. As a result, allocations on raid
      arrays could be misaligned in some cases.
      
      This also seems to fix an occasional assertion failure:
      	XFS_WANT_CORRUPTED_GOTO(rlen <= flen, error0)
      in xfs_alloc_ag_vextent_size().
      
      Also add an assertion that the result of xfs_alloc_fix_len() is of the
      expected form.
      Signed-off-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      30265117
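Rounding a length down to the form k*prod+mod can be sketched as below. This is an illustration of the arithmetic under stated assumptions, not the kernel function: `round_len()` is a hypothetical name, and leaving the length unchanged when the rounded value would drop below `minlen` is a choice made for this example.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Round rlen down to the nearest value of the form k*prod + mod.
 * If that would fall below minlen, return rlen unchanged.
 */
static uint32_t round_len(uint32_t rlen, uint32_t prod, uint32_t mod,
                          uint32_t minlen)
{
    uint32_t k;
    int64_t fixed;

    assert(prod > 0 && mod < prod);
    k = rlen % prod;
    if (k == mod)
        return rlen;  /* already of the required form */

    if (k > mod)
        fixed = (int64_t)rlen - (k - mod);
    else
        fixed = (int64_t)rlen - prod + (mod - k);

    /* Signed comparison so a negative result is caught. */
    if (fixed < (int64_t)minlen)
        return rlen;
    return (uint32_t)fixed;
}
```

For example, with prod=4 and mod=1, a length of 10 becomes 9 and a length of 8 becomes 5; both leave a remainder of 1 when divided by 4.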
    • D
      xfs: kill xfs_buf_geterror() · 36de9556
      Dave Chinner committed
      Most of the callers are just calling ASSERT(!xfs_buf_geterror())
      which means they are checking for bp->b_error == 0. If bp is null in
      this case, we will assert fail, and hence it's no different in
      result to oopsing because of a null bp. In some cases, errors have
      already been checked for or the function returning the buffer can't
      return a buffer with an error, so it's just a redundant assert.
      Either way, the assert can be removed.
      
      The other two non-assert callers can just test for a buffer and
      error properly.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      
      
      36de9556