1. 29 May, 2015 2 commits
    • D
      xfs: extent size hints can round up extents past MAXEXTLEN · 6dea405e
      Dave Chinner committed
      This results in BMBT corruption, as seen by this test:
      
      # mkfs.xfs -f -d size=40051712b,agcount=4 /dev/vdc
      ....
      # mount /dev/vdc /mnt/scratch
      # xfs_io -ft -c "extsize 16m" -c "falloc 0 30g" -c "bmap -vp" /mnt/scratch/foo
      
      which results in this failure on a debug kernel:
      
      XFS: Assertion failed: (blockcount & xfs_mask64hi(64-BMBT_BLOCKCOUNT_BITLEN)) == 0, file: fs/xfs/libxfs/xfs_bmap_btree.c, line: 211
      ....
      Call Trace:
       [<ffffffff814cf0ff>] xfs_bmbt_set_allf+0x8f/0x100
       [<ffffffff814cf18d>] xfs_bmbt_set_all+0x1d/0x20
       [<ffffffff814f2efe>] xfs_iext_insert+0x9e/0x120
       [<ffffffff814c7956>] ? xfs_bmap_add_extent_hole_real+0x1c6/0xc70
       [<ffffffff814c7956>] xfs_bmap_add_extent_hole_real+0x1c6/0xc70
       [<ffffffff814caaab>] xfs_bmapi_write+0x72b/0xed0
       [<ffffffff811c72ac>] ? kmem_cache_alloc+0x15c/0x170
       [<ffffffff814fe070>] xfs_alloc_file_space+0x160/0x400
       [<ffffffff81ddcc29>] ? down_write+0x29/0x60
       [<ffffffff815063eb>] xfs_file_fallocate+0x29b/0x310
       [<ffffffff811d2bc8>] ? __sb_start_write+0x58/0x120
       [<ffffffff811e3e18>] ? do_vfs_ioctl+0x318/0x570
       [<ffffffff811cd680>] vfs_fallocate+0x140/0x260
       [<ffffffff811ce6f8>] SyS_fallocate+0x48/0x80
       [<ffffffff81ddec09>] system_call_fastpath+0x12/0x17
      
      The tracepoint that indicates the extent that triggered the assert
      failure is:
      
      xfs_iext_insert:   idx 0 offset 0 block 16777224 count 2097152 flag 1
      
      Clearly indicating that the extent length is greater than MAXEXTLEN,
      which is 2097151. A prior trace point shows the allocation was an
      exact size match and that a length greater than MAXEXTLEN was asked
      for:
      
      xfs_alloc_size_done:  agno 1 agbno 8 minlen 2097152 maxlen 2097152
      					    ^^^^^^^        ^^^^^^^
      
      We don't see this problem with extent size hints through the IO path
      because we can't do single IOs large enough to trigger MAXEXTLEN
      allocation. fallocate(), OTOH, is not limited in its allocation
      sizes and so needs help here.
      
      The issue is that the extent size hint alignment is rounding up the
      extent size past MAXEXTLEN, because xfs_bmapi_write() is not taking
      into account extent size hints when calculating the maximum extent
      length to allocate. xfs_bmapi_reserve_delalloc() is already doing
      this, but direct extent allocation is not.
      
      Unfortunately, the calculation in xfs_bmapi_reserve_delalloc() is
      wrong, and it works only because delayed allocation extents are not
      limited in size to MAXEXTLEN in the in-core extent tree. Hence this
      calculation does not work for direct allocation, and the delalloc
      code needs fixing. This may, in fact, be the underlying bug that
      occasionally causes transaction overruns in delayed allocation
      extent conversion, so now we know it's wrong we should fix it, too.
      Many thanks to Brian Foster for finding this problem during review
      of this patch.
      
      Hence the fix, after much code reading, is to allow
      xfs_bmap_extsize_align() to align partial extents when full
      alignment would extend the alignment past MAXEXTLEN. We can safely
      do this because all callers have higher layer allocation loops that
      already handle short allocations, and so will simply run another
      allocation to cover the remainder of the requested allocation range
      that we ignored during alignment. The advantage of this approach is
      that it also removes the need for callers to do anything other than
      limit their requests to MAXEXTLEN - they don't really need to be
      aware of extent size hints at all.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
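The partial-alignment idea described in the commit above can be sketched as follows. This is a minimal illustration, not the kernel code: `align_len_capped` is a hypothetical helper, lengths are in filesystem blocks, and the only value taken from the commit message is MAXEXTLEN = 2097151.

```c
#include <assert.h>
#include <stdint.h>

#define MAXEXTLEN 2097151u /* largest extent length in blocks, per the commit message */

/*
 * Hypothetical helper: round [off, off + len) outward to extsz boundaries,
 * then trim whole extsz units off the end until the result fits in MAXEXTLEN.
 * The caller is expected to loop and allocate the remainder.
 */
static uint64_t align_len_capped(uint64_t off, uint64_t len, uint32_t extsz)
{
	uint64_t start, end, alen;

	assert(extsz > 0 && extsz <= MAXEXTLEN);
	start = off - (off % extsz);            /* round the start down */
	end = off + len;
	if (end % extsz)
		end += extsz - (end % extsz);   /* round the end up */
	alen = end - start;
	while (alen > MAXEXTLEN)                /* accept a partial alignment */
		alen -= extsz;
	return alen;
}
```

With a 4096-block hint (16m, assuming 4 KiB blocks), a request of MAXEXTLEN blocks would round up to 2097152, the length seen in the failing tracepoint; the sketch returns 2093056 instead and leaves the tail for the next allocation pass.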
    • G
      xfs: use percpu_counter_read_positive for mp->m_icount · 74f9ce1c
      George Wang committed
      percpu_counter_read() just returns the current counter value, which
      can be negative. This can make the "allocated inode
      counts <= m_maxicount" check give a false positive. Using
      percpu_counter_read_positive() solves this problem and is consistent
      with the purpose of introducing the percpu mechanism to xfs.
      Signed-off-by: George Wang <xuw2015@gmail.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
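A small user-space sketch of why the clamped read matters. The struct and helpers below are simplified stand-ins, not the kernel's percpu_counter implementation, and the unsigned 64-bit limit in `over_limit` is an assumption made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-in for the approximate global value of a per-cpu counter. */
struct approx_counter {
	int64_t count;   /* may lag the true sum and dip below zero */
};

/* Returns the raw approximate value, negative or not. */
static int64_t counter_read(const struct approx_counter *c)
{
	return c->count;
}

/* Returns the approximate value clamped at zero. */
static int64_t counter_read_positive(const struct approx_counter *c)
{
	return c->count >= 0 ? c->count : 0;
}

/* Hypothetical limit check: would allocating newinodes exceed maxicount? */
static bool over_limit(int64_t icount, uint32_t newinodes, uint64_t maxicount)
{
	/* A negative sum converts to a huge unsigned value and trips the limit. */
	return (uint64_t)(icount + newinodes) > maxicount;
}
```

With a transiently negative count, the raw read reports the limit as exceeded while the clamped read does not.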
  2. 13 April, 2015 3 commits
    • B
      xfs: kill unnecessary firstused overflow check on attr3 leaf removal · 66db8104
      Brian Foster committed
      xfs_attr3_leaf_remove() removes an attribute from an attr leaf block. If
      the attribute nameval data happens to be at the start of the nameval
      region, a new start offset (firstused) for the region is calculated
      (since the region grows from the tail of the block to the start). Once
      the new firstused is calculated, it is checked for zero in an apparent
      overflow check.
      
      Now that the in-core firstused is 32-bit, overflow is not possible and
      this check can be removed. Since the purpose of this check is not
      documented and it appears to have existed since the port to Linux, be
      conservative and replace it with an assert.
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
    • B
      xfs: use larger in-core attr firstused field and detect overflow · e87021a2
      Brian Foster committed
      The on-disk xfs_attr3_leaf_hdr structure firstused field is 16-bit and
      subject to overflow when fs block size is 64k. The field is typically
      initialized to block size when an attr leaf block is initialized. This
      problem is demonstrated by assert failures when running xfstests
      generic/117 on an fs with 64k blocks.
      
      To support the existing attr leaf block algorithms for insertion,
      rebalance and entry movement, increase the size of the in-core firstused
      field to 32-bit and handle the potential overflow on conversion to/from
      the on-disk structure. If the overflow condition occurs, set a special
      value in the firstused field that is translated back on header read. The
      special value is only required in the case of an empty 64k attr block. A
      value of zero is used because firstused is initialized to the block size
      and grows backwards from there. Furthermore, the attribute block header
      occupies the first bytes of the block. Thus, a value of zero has no
      other legitimate meaning for this structure. Two new conversion helpers
      are created to manage the conversion of firstused to and from disk.
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
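The two conversion helpers described in the commit above might look roughly like this. It is a sketch under stated assumptions: the function names and `LEAF_NULLOFF` are hypothetical, and the on-disk byte-order conversion is left out to keep the overflow handling visible.

```c
#include <assert.h>
#include <stdint.h>

#define LEAF_NULLOFF 0u   /* hypothetical name for the on-disk "overflowed" marker */

/* In-core (32-bit) to on-disk (16-bit); blksize is the attr block size in bytes. */
static uint16_t firstused_to_disk(uint32_t firstused, uint32_t blksize)
{
	if (firstused > UINT16_MAX) {
		/* Only an empty 64k block can reach this point. */
		assert(firstused == blksize);
		return LEAF_NULLOFF;
	}
	return (uint16_t)firstused;
}

/* On-disk (16-bit) to in-core (32-bit); zero is translated back to blksize. */
static uint32_t firstused_from_disk(uint16_t disk, uint32_t blksize)
{
	if (disk == LEAF_NULLOFF) {
		assert(blksize > UINT16_MAX);
		return blksize;
	}
	return disk;
}
```

Zero is safe as the marker because the block header occupies the first bytes of the block, so no valid name/value region can start at offset zero.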
    • B
      xfs: pass attr geometry to attr leaf header conversion functions · 2f661241
      Brian Foster committed
      The firstused field of the xfs_attr3_leaf_hdr structure is subject to an
      overflow when fs blocksize is 64k. In preparation to handle this
      overflow in the header conversion functions, pass the attribute geometry
      to the functions that convert the in-core structure to and from the
      on-disk structure.
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
  3. 25 March, 2015 3 commits
  4. 24 February, 2015 2 commits
    • D
      xfs: xfs_alloc_fix_minleft can underflow near ENOSPC · 3790a8cd
      Dave Chinner committed
      Test generic/224 is failing with a corruption being detected on one
      of Michael's test boxes.  Debug that Michael added is indicating
      that the minleft trimming is resulting in an underflow:
      
      .....
       before fixup:              rlen          1  args->len          0
       after xfs_alloc_fix_len  : rlen          1  args->len          1
       before goto out_nominleft: rlen          1  args->len          0
       before fixup:              rlen          1  args->len          0
       after xfs_alloc_fix_len  : rlen          1  args->len          1
       after fixup:               rlen          1  args->len          1
       before fixup:              rlen          1  args->len          0
       after xfs_alloc_fix_len  : rlen          1  args->len          1
       after fixup:               rlen 4294967295  args->len 4294967295
       XFS: Assertion failed: fs_is_ok, file: fs/xfs/libxfs/xfs_alloc.c, line: 1424
      
      The "goto out_nominleft:" indicates that we are getting close to
      ENOSPC in the AG, and a couple of allocations later we underflow
      and the corruption check fires in xfs_alloc_ag_vextent_size().
      
      The issue is that the extent length fixup comparisons are done
      with variables of xfs_extlen_t types. These are unsigned so an
      underflow looks like a really big value and hence is not detected
      as being smaller than the minimum length allowed for the extent.
      Hence the corruption check fires as it is noticing that the returned
      length is longer than the original extent length passed in.
      
      This can be easily fixed by ensuring we do the underflow test on
      signed values, the same way xfs_alloc_fix_len() prevents underflow.
      So that we realise in future that these casts prevent underflows
      from going undetected, add comments to the code indicating this.
      Reported-by: Michael L. Semon <mlsemon35@gmail.com>
      Tested-by: Michael L. Semon <mlsemon35@gmail.com>
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
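The underflow described in the commit above is easy to reproduce in isolation. The two helpers below are hypothetical and only contrast an unsigned comparison with a signed one; `extlen_t` stands in for the unsigned xfs_extlen_t mentioned in the message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t extlen_t;   /* stand-in for the unsigned xfs_extlen_t */

/* Buggy form: rlen - diff wraps when diff > rlen, so the result looks huge. */
static bool too_short_unsigned(extlen_t rlen, extlen_t diff, extlen_t minlen)
{
	return rlen - diff < minlen;
}

/* Fixed form: do the arithmetic on signed values so a negative result is seen. */
static bool too_short_signed(extlen_t rlen, extlen_t diff, extlen_t minlen)
{
	return (int)rlen - (int)diff < (int)minlen;
}
```

Trimming 2 blocks from a 1-block extent wraps to 4294967295 in the unsigned form, the same value that appears in the debug output, so the "too short" test never fires; the signed form sees -1 and rejects the extent.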
    • W
      xfs: remove old and redundant comment in xfs_mount_validate_sb · dd5e7127
      Wang Sheng-Hui committed
      The error messages document the reason for the checks better than the comment
      and the comments about volume mounts date back to Irix and so aren't relevant
      any more. So just remove the old and redundant comment.
      Signed-off-by: Wang Sheng-Hui <shhuiw@foxmail.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
  5. 23 February, 2015 8 commits
    • E
      xfs: pass mp to XFS_WANT_CORRUPTED_RETURN · 5fb5aeee
      Eric Sandeen committed
      Today, if we hit an XFS_WANT_CORRUPTED_RETURN we don't print any
      information about which filesystem hit it.  Passing in the mp allows
      us to print the filesystem (device) name, which is a pretty critical
      piece of information.
      
      Tested by running fsfuzzer 'til I hit some.
      Signed-off-by: Eric Sandeen <sandeen@redhat.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
    • E
      xfs: pass mp to XFS_WANT_CORRUPTED_GOTO · c29aad41
      Eric Sandeen committed
      Today, if we hit an XFS_WANT_CORRUPTED_GOTO we don't print any
      information about which filesystem hit it.  Passing in the mp allows
      us to print the filesystem (device) name, which is a pretty critical
      piece of information.
      
      Tested by running fsfuzzer 'til I hit some.
      Signed-off-by: Eric Sandeen <sandeen@redhat.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
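The benefit of passing the mount to the corruption-check macros can be illustrated with a user-space sketch. Everything here is hypothetical: the macro name, the `mount_sketch` struct, the message format, and the stand-in error code are illustrative and are not the kernel definitions.

```c
#include <assert.h>
#include <stdio.h>

#define EFSCORRUPTED_SKETCH 117   /* stand-in error code for this sketch */

struct mount_sketch {
	const char *fsname;       /* device name to include in the report */
};

/* Hypothetical macro: report which filesystem failed the check, then return. */
#define WANT_CORRUPTED_RETURN(mp, cond) \
	do { \
		if (!(cond)) { \
			fprintf(stderr, "XFS (%s): check failed: %s\n", \
				(mp)->fsname, #cond); \
			return -EFSCORRUPTED_SKETCH; \
		} \
	} while (0)

/* Example caller: a non-positive length is treated as corruption. */
static int check_len(const struct mount_sketch *mp, int len)
{
	WANT_CORRUPTED_RETURN(mp, len > 0);
	return 0;
}
```

Without the `mp` argument the macro could only report the failed condition, not which mounted filesystem produced it.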
    • D
      xfs: remove xfs_mod_incore_sb API · 964aa8d9
      Dave Chinner committed
      Now that there are no users of the bitfield based incore superblock
      modification API, just remove the whole damn lot of it, including
      all the bitfield definitions. This finally removes a lot of cruft
      that has been around for a long time.
      
      Credit goes to Christoph Hellwig for providing a great patch
      connecting all the dots to enable us to do this. This patch is
      derived from that work.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
    • D
      xfs: introduce xfs_mod_frextents · bab98bbe
      Dave Chinner committed
      Add a new helper to modify the incore counter of free realtime
      extents. This matches the helpers used for inode and data block
      counters, and removes a significant user of the xfs_mod_incore_sb()
      interface.
      
      Based on a patch originally from Christoph Hellwig.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
    • D
      xfs: Remove icsb infrastructure · 5681ca40
      Dave Chinner committed
      Now that the in-core superblock infrastructure has been replaced with
      generic per-cpu counters, we don't need it anymore. Nuke it from
      orbit so we are sure that it won't haunt us again...
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
    • D
      xfs: use generic percpu counters for free block counter · 0d485ada
      Dave Chinner committed
      XFS has hand-rolled per-cpu counters for the superblock since before
      there was any generic implementation. The free block counter is
      special in that it is used for ENOSPC detection outside transaction
      contexts for delayed allocation. This means that the counter
      needs to be accurate at zero. The current per-cpu counter code jumps
      through lots of hoops to ensure we never run past zero, but we don't
      need to make all those jumps with the generic counter
      implementation.
      
      The generic counter implementation allows us to pass a "batch"
      threshold at which the addition/subtraction to the counter value
      will be folded back into global value under lock. We can use this
      feature to reduce the batch size as we approach 0 in a very similar
      manner to the existing counters and their rebalance algorithm. If we
      use a batch size of 1 as we approach 0, then every addition and
      subtraction will be done against the global value and hence allow
      accurate detection of zero threshold crossing.
      
      Hence we can replace the handrolled, accurate-at-zero counters with
      generic percpu counters.
      
      Note: this removes just enough of the icsb infrastructure to compile
      without warnings. The rest will go in subsequent commits.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
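The batch-size trick described in the commit above can be modelled with a single-threaded toy counter. This is a sketch, not the kernel's percpu_counter: the struct, `pcp_add`, `pick_batch`, the value of `BIG_BATCH`, and the "near zero" threshold are all assumptions chosen for illustration, and locking is omitted.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define NCPU 4
#define BIG_BATCH 1024   /* hypothetical batch size used far from zero */

struct pcp_counter {
	int64_t global;        /* folded value; locking omitted in this sketch */
	int64_t local[NCPU];   /* per-cpu deltas not yet folded */
};

/* Fold the local delta into the global value once it reaches the batch. */
static void pcp_add(struct pcp_counter *c, int cpu, int64_t amount, int64_t batch)
{
	int64_t v = c->local[cpu] + amount;

	if (llabs(v) >= batch) {
		c->global += v;
		c->local[cpu] = 0;
	} else {
		c->local[cpu] = v;
	}
}

/* Use a batch of 1 near zero so every update is applied to the global value. */
static int64_t pick_batch(const struct pcp_counter *c)
{
	return c->global < (int64_t)NCPU * BIG_BATCH ? 1 : BIG_BATCH;
}
```

Far from zero, small updates stay in the per-cpu slot and the global value is only approximate; once the global value drops below the threshold, each update is folded immediately, which is what makes an exact zero crossing detectable.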
    • D
      xfs: use generic percpu counters for free inode counter · e88b64ea
      Dave Chinner committed
      XFS has hand-rolled per-cpu counters for the superblock since before
      there was any generic implementation. The free inode counter is not
      used for any limit enforcement - the per-AG free inode counters are
      used during allocation to determine if there are inodes available for
      allocation.
      
      Hence we don't need any of the complexity of the hand-rolled
      counters and we can simply replace them with generic per-cpu
      counters similar to the inode counter.
      
      This version introduces a xfs_mod_ifree() helper function from
      Christoph Hellwig.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
    • D
      xfs: use generic percpu counters for inode counter · 501ab323
      Dave Chinner committed
      XFS has hand-rolled per-cpu counters for the superblock since before
      there was any generic implementation. There are some warts around
      the use of them for the inode counter as the hand rolled counter is
      designed to be accurate at zero, but has no specific accuracy at
      any other value. This design causes problems for the maximum inode
      count threshold enforcement, as there is no trigger that balances
      the counters as they get close to the maximum threshold.
      
      Instead of designing new triggers for balancing, just replace the
      handrolled per-cpu counter with a generic counter.  This enables us
      to update the counter through the normal superblock modification
      functions, but rather than do that we add a xfs_mod_icount() helper
      function (from Christoph Hellwig) and keep the percpu counter
      outside the superblock in the struct xfs_mount.
      
      This means we still need to initialise the per-cpu counter
      specifically when we read the superblock, and vice versa when we
      log/write it, but it does mean that we don't need to change any
      other code.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
  6. 22 January, 2015 4 commits
    • D
      xfs: set buf types when converting extent formats · fe22d552
      Dave Chinner committed
      Conversion from local to extent format does not set the buffer type
      correctly on the new extent buffer when symlink data is moved out
      of line.
      
      Fix the symlink code and leave a comment in the generic bmap code
      reminding us that the format-specific data copy needs to set the
      destination buffer type appropriately.
      
      cc: <stable@vger.kernel.org> # 3.10 to current
      Tested-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
    • D
      xfs: sanitise sb_bad_features2 handling · 074e427b
      Dave Chinner committed
      We currently have to ensure that every time we update sb_features2
      that we update sb_bad_features2. Now that we log and format the
      superblock in its entirety we actually don't have to care because
      we can simply update the sb_bad_features2 when we format it into the
      buffer. This removes the need for anything but the mount and
      superblock formatting code to care about sb_bad_features2, and
      hence removes the possibility that we forget to update bad_features2
      when necessary in the future.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
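Mirroring the field at format time, as the commit above describes, can be sketched like this. The two structs and `sb_format` are hypothetical and trimmed to the relevant fields, and byte-order conversion is left out.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical in-core superblock: only the real feature word is tracked. */
struct incore_sb_feat {
	uint32_t features2;
};

/* Hypothetical on-disk layout: carries both copies of the feature word. */
struct disk_sb_feat {
	uint32_t features2;
	uint32_t bad_features2;
};

/* Format the in-core superblock into the buffer; the mirror is filled here. */
static void sb_format(struct disk_sb_feat *to, const struct incore_sb_feat *from)
{
	to->features2 = from->features2;
	to->bad_features2 = from->features2;   /* always kept identical */
}
```

Because the copy happens in one place, code that changes `features2` elsewhere has nothing extra to remember.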
    • D
      xfs: consolidate superblock logging functions · 61e63ecb
      Dave Chinner committed
      We now have several superblock logging functions that are identical
      except for the transaction reservation and whether it should be a
      synchronous transaction or not. Consolidate these all into a single
      function, a single reservation and a sync flag and call it
      xfs_sync_sb().
      
      Also, xfs_mod_sb() is not really a modification function - it's the
      operation of logging the superblock buffer. Hence change the name of
      it to reflect this.
      
      Note that we have to change the mp->m_update_flags that are passed
      around at mount time to a boolean simply to indicate a superblock
      update is needed.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
    • D
      xfs: remove bitfield based superblock updates · 4d11a402
      Dave Chinner committed
      When we log changes to the superblock, we first have to write them
      to the on-disk buffer, and then log that. Right now we have a
      complex bitfield based arrangement to only write the modified field
      to the buffer before we log it.
      
      This used to be necessary as a performance optimisation because we
      logged the superblock buffer in every extent or inode allocation or
      freeing, and so performance was extremely important. We haven't done
      this for years, however, ever since the lazy superblock counters
      pulled the superblock logging out of the transaction commit
      fast path.
      
      Hence we have a bunch of complexity that is not necessary that makes
      writing the in-core superblock to disk much more complex than it
      needs to be. We only need to log the superblock now during
      management operations (e.g. during mount, unmount or quota control
      operations) so it is not a performance critical path anymore.
      
      As such, remove the complex field based logging mechanism and
      replace it with a simple conversion function similar to what we use
      for all other on-disk structures.
      
      This means we always log the entirety of the superblock, but again
      because we rarely modify the superblock this is not an issue for log
      bandwidth or CPU time. Indeed, if we do log the superblock
      frequently, delayed logging will minimise the impact of this
      overhead.
      
      [Fixed gquota/pquota inode sharing regression noticed by bfoster.]
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
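A "simple conversion function" of the kind the commit above describes converts every field unconditionally instead of consulting a bitmask of modified fields. The sketch below assumes a heavily trimmed, hypothetical superblock with three fields and a made-up 20-byte layout; the big-endian encoding reflects the XFS on-disk byte order.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, heavily trimmed in-core superblock. */
struct incore_sb {
	uint32_t blocksize;
	uint64_t dblocks;
	uint64_t icount;
};

#define DISK_SB_SIZE 20   /* 4 + 8 + 8 bytes in this made-up layout */

/* Store a 32-bit value in big-endian byte order. */
static void put_be32(uint8_t *p, uint32_t v)
{
	for (int i = 0; i < 4; i++)
		p[i] = (uint8_t)(v >> (24 - 8 * i));
}

/* Store a 64-bit value in big-endian byte order. */
static void put_be64(uint8_t *p, uint64_t v)
{
	put_be32(p, (uint32_t)(v >> 32));
	put_be32(p + 4, (uint32_t)v);
}

/* Convert every field, every time; no per-field dirty bitmask to track. */
static void sb_to_disk(uint8_t disk[DISK_SB_SIZE], const struct incore_sb *sb)
{
	put_be32(disk, sb->blocksize);
	put_be64(disk + 4, sb->dblocks);
	put_be64(disk + 12, sb->icount);
}
```

The cost is writing a few more bytes per call, which the commit message argues is irrelevant now that the superblock is only logged during management operations.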
  7. 09 January, 2015 4 commits
  8. 24 December, 2014 1 commit
    • J
      xfs: Keep sb_bad_features2 consistent with sb_features2 · 1a43ec03
      Jan Kara committed
      Currently when we modify sb_features2, we store the same value also in
      sb_bad_features2. However in most places we forget to mark field
      sb_bad_features2 for logging and thus it can happen that a change to it
      is lost. This results in an inconsistent sb_features2 and
      sb_bad_features2 fields e.g. after xfstests test xfs/187.
      
      Fix the problem by changing XFS_SB_FEATURES2 to actually mean both
      sb_features2 and sb_bad_features2 fields since this is always what we
      want to log. This isn't ideal because the fact that XFS_SB_FEATURES2
      means two fields could cause some problems in future; however, the code
      is hopefully less error prone than it is now.
      Signed-off-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
  9. 04 December, 2014 6 commits
    • D
      xfs: fix set-but-unused warnings · 32296f86
      Dave Chinner committed
      The kernel compile doesn't turn on these checks by default, so it's
      only when I do a kernel-user sync that I find that there are lots of
      compiler warnings waiting to be fixed. Fix up these set-but-unused
      warnings.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Eric Sandeen <sandeen@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
    • D
      xfs: move type conversion functions to xfs_dir.h · 9a2cc41c
      Dave Chinner committed
      These are currently considered private to libxfs, but they are
      widely used by the userspace code to decode, walk and check
      directory structures. Hence they really form part of the external
      API and as such need to be moved to xfs_dir2.h.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
    • D
      xfs: move ftype conversion functions to libxfs · 1b767ee3
      Dave Chinner committed
      These functions are needed in userspace for repair and mkfs to
      do the right thing. Move them to libxfs so they can be easily
      shared.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
    • D
      xfs: cleanup xfs_bmse_merge returns · 4db431f5
      Dave Chinner committed
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      
      xfs_bmse_merge() has a jump label for return that just returns the
      error value. Convert all the code to just return the error directly
      and use XFS_WANT_CORRUPTED_RETURN. This also allows the final call
      to xfs_bmbt_update() to return directly.
      
      Noticed while reviewing coccinelle return cleanup patches and
      wondering why the same return pattern as in xfs_bmse_shift_one()
      wasn't picked up by the checker pattern...
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
    • D
      xfs: cleanup xfs_bmse_shift_one goto mess · b11bd671
      Dave Chinner committed
      xfs_bmse_shift_one() jumps around determining whether to shift or
      merge, making the code flow difficult to follow. Clean it up and
      use direct error returns (including XFS_WANT_CORRUPTED_RETURN) to
      make the code flow better and be easier to read.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
    • D
      xfs: fix premature enospc on inode allocation · 7a1df156
      Dave Chinner committed
      After growing a filesystem, XFS can fail to allocate inodes even
      though there is a large amount of space available in the filesystem
      for inodes. The issue is caused by a nearly full allocation group
      having enough free space in it to be considered for inode
      allocation, but not enough contiguous free space to actually
      allocate inodes.  This situation results in successful selection
      of the AG for allocation, then failure of the allocation resulting
      in ENOSPC being reported to the caller.
      
      It is caused by two possible issues. Firstly, we only consider the
      longest free extent and whether it would fit an inode chunk. If the
      extent is not correctly aligned, then we can't allocate an inode
      chunk in it regardless of the fact that it is large enough. This
      tends to be a permanent error until space in the AG is freed.
      
      The second issue is that we don't actually lock the AGI or AGF when
      we are doing these checks, and so by the time we get to actually
      allocating the inode chunk the space we thought we had in the AG may
      have been allocated. This tends to be a spurious error as it
      requires a race to trigger. Hence this case is ignored in this patch
      as the reported problem is for permanent errors.
      
      The first issue could be addressed by simply taking into account the
      alignment when checking the longest extent. This, however, would
      prevent allocation in AGs that have aligned, exact sized extents
      free. However, this case should be fairly rare compared to the
      number of allocations that occur near ENOSPC that would trigger this
      condition.
      
      Hence, when selecting the inode AG, take into account the inode
      cluster alignment when checking the longest free extent in the AG.
      If we can't find any AGs with a contiguous free space large
      enough to be aligned, drop the alignment addition and just try for
      an AG that has enough contiguous free space available for an inode
      chunk. This won't prevent issues from occurring, but should avoid
      situations where other AGs have lots of free space but the selected
      AG can't allocate due to alignment constraints.
      Reported-by: Arkadiusz Miskiewicz <arekm@maven.pl>
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
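The two-pass AG selection described in the commit above can be sketched as follows. The struct and `pick_inode_ag` are hypothetical; lengths are in blocks, and the "chunk length plus alignment" test is one plausible reading of "take into account the inode cluster alignment", not a copy of the kernel logic.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-AG summary: length of the longest free extent, in blocks. */
struct ag_sketch {
	uint32_t longest;
};

/* Returns the index of a suitable AG, or -1 if none qualifies. */
static int pick_inode_ag(const struct ag_sketch *ags, int nags,
			 uint32_t chunk_len, uint32_t align)
{
	uint32_t need = chunk_len + align;   /* first pass: leave room to align */

	for (int pass = 0; pass < 2; pass++) {
		for (int i = 0; i < nags; i++)
			if (ags[i].longest >= need)
				return i;
		need = chunk_len;            /* second pass: drop the alignment slack */
	}
	return -1;
}
```

The first pass skips AGs whose longest extent only just fits a chunk; the second pass still accepts them, so AGs with an aligned, exact-sized extent are not ruled out when nothing better exists.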
  10. 01 December, 2014 2 commits
  11. 28 November, 2014 5 commits