1. 04 4月, 2017 3 次提交
  2. 29 3月, 2017 1 次提交
    • D
      xfs: rework the inline directory verifiers · 005c5db8
      Darrick J. Wong 提交于
      The inline directory verifiers should be called on the inode fork data,
      which means after iformat_local on the read side, and prior to
      ifork_flush on the write side.  This makes the fork verifier more
      consistent with the way buffer verifiers work -- i.e. they will operate
      on the memory buffer that the code will be reading and writing directly.
      
      Furthermore, revise the verifier function to return -EFSCORRUPTED so
      that we don't flood the logs with corruption messages and assert
      notices.  This has been a particular problem with xfs/348, which
      triggers the XFS_WANT_CORRUPTED_RETURN assertions, which halts the
      kernel when CONFIG_XFS_DEBUG=y.  Disk corruption isn't supposed to do
      that, at least not in a verifier.
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      ---
      v2: get the inode d_ops the proper way
      v3: describe the bug that this patch fixes; no code changes
      005c5db8
  3. 15 3月, 2017 1 次提交
  4. 09 3月, 2017 2 次提交
    • C
      xfs: try any AG when allocating the first btree block when reflinking · 2fcc319d
      Christoph Hellwig 提交于
      When a reflink operation causes the bmap code to allocate a btree block
      we're currently doing single-AG allocations due to having ->firstblock
      set and then try any higher AG due a little reflink quirk we've put in
      when adding the reflink code.  But given that we do not have a minleft
      reservation of any kind in this AG we can still not have any space in
      the same or higher AG even if the file system has enough free space.
      To fix this use a XFS_ALLOCTYPE_FIRST_AG allocation in this fall back
      path instead.
      
      [And yes, we need to redo this properly instead of piling hacks over
       hacks.  I'm working on that, but it's not going to be a small series.
       In the meantime this fixes the customer reported issue]
      
      Also add a warning for failing allocations to make it easier to debug.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      2fcc319d
    • B
      xfs: use iomap new flag for newly allocated delalloc blocks · f65e6fad
      Brian Foster 提交于
      Commit fa7f138a ("xfs: clear delalloc and cache on buffered write
      failure") fixed one regression in the iomap error handling code and
      exposed another. The fundamental problem is that if a buffered write
      is a rewrite of preexisting delalloc blocks and the write fails, the
      failure handling code can punch out preexisting blocks with valid
      file data.
      
      This was reproduced directly by sub-block writes in the LTP
      kernel/syscalls/write/write03 test. A first 100 byte write allocates
      a single block in a file. A subsequent 100 byte write fails and
      punches out the block, including the data successfully written by
      the previous write.
      
      To address this problem, update the ->iomap_begin() handler to
      distinguish newly allocated delalloc blocks from preexisting
      delalloc blocks via the IOMAP_F_NEW flag. Use this flag in the
      ->iomap_end() handler to decide when a failed or short write should
      punch out delalloc blocks.
      
      This introduces the subtle requirement that ->iomap_begin() should
      never combine newly allocated delalloc blocks with existing blocks
      in the resulting iomap descriptor. This can occur when a new
      delalloc reservation merges with a neighboring extent that is part
      of the current write, for example. Therefore, drop the
      post-allocation extent lookup from xfs_bmapi_reserve_delalloc() and
      just return the record inserted into the fork. This ensures only new
      blocks are returned and thus that preexisting delalloc blocks are
      always handled as "found" blocks and not punched out on a failed
      rewrite.
      Reported-by: NXiong Zhou <xzhou@redhat.com>
      Signed-off-by: NBrian Foster <bfoster@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      f65e6fad
  5. 18 2月, 2017 1 次提交
  6. 17 2月, 2017 4 次提交
    • C
      xfs: tune down agno asserts in the bmap code · 410d17f6
      Christoph Hellwig 提交于
      In various places we currently assert that xfs_bmap_btalloc allocates
      from the same as the firstblock value passed in, unless it's either
      NULLAGNO or the dop_low flag is set.  But the reflink code does not
      fully follow this convention as it passes in firstblock purely as
      a hint for the allocator without actually having previous allocations
      in the transaction, and without having a minleft check on the current
      AG, leading to the assert firing on a very full and heavily used
      file system.  As even the reflink code only allocates from equal or
      higher AGs for now we can simply the check to always allow for equal
      or higher AGs.
      
      Note that we need to eventually split the two meanings of the firstblock
      value.  At that point we can also allow the reflink code to allocate
      from any AG instead of limiting it in any way.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      410d17f6
    • C
      xfs: Use xfs_icluster_size_fsb() to calculate inode chunk alignment · 8ee9fdbe
      Chandan Rajendra 提交于
      On a ppc64 system, executing generic/256 test with 32k block size gives the following call trace,
      
      XFS: Assertion failed: args->maxlen > 0, file: /root/repos/linux/fs/xfs/libxfs/xfs_alloc.c, line: 2026
      
      kernel BUG at /root/repos/linux/fs/xfs/xfs_message.c:113!
      Oops: Exception in kernel mode, sig: 5 [#1]
      SMP NR_CPUS=2048
      DEBUG_PAGEALLOC
      NUMA
      pSeries
      Modules linked in:
      CPU: 2 PID: 19361 Comm: mkdir Not tainted 4.10.0-rc5 #58
      task: c000000102606d80 task.stack: c0000001026b8000
      NIP: c0000000004ef798 LR: c0000000004ef798 CTR: c00000000082b290
      REGS: c0000001026bb090 TRAP: 0700   Not tainted  (4.10.0-rc5)
      MSR: 8000000000029032 <SF,EE,ME,IR,DR,RI>
      CR: 28004428  XER: 00000000
      CFAR: c0000000004ef180 SOFTE: 1
      GPR00: c0000000004ef798 c0000001026bb310 c000000001157300 ffffffffffffffea
      GPR04: 000000000000000a c0000001026bb130 0000000000000000 ffffffffffffffc0
      GPR08: 00000000000000d1 0000000000000021 00000000ffffffd1 c000000000dd4990
      GPR12: 0000000022004444 c00000000fe00800 0000000020000000 0000000000000000
      GPR16: 0000000000000000 0000000043a606fc 0000000043a76c08 0000000043a1b3d0
      GPR20: 000001002a35cd60 c0000001026bbb80 0000000000000000 0000000000000001
      GPR24: 0000000000000240 0000000000000004 c00000062dc55000 0000000000000000
      GPR28: 0000000000000004 c00000062ecd9200 0000000000000000 c0000001026bb6c0
      NIP [c0000000004ef798] .assfail+0x28/0x30
      LR [c0000000004ef798] .assfail+0x28/0x30
      Call Trace:
      [c0000001026bb310] [c0000000004ef798] .assfail+0x28/0x30 (unreliable)
      [c0000001026bb380] [c000000000455d74] .xfs_alloc_space_available+0x194/0x1b0
      [c0000001026bb410] [c00000000045b914] .xfs_alloc_fix_freelist+0x144/0x480
      [c0000001026bb580] [c00000000045c368] .xfs_alloc_vextent+0x698/0xa90
      [c0000001026bb650] [c0000000004a6200] .xfs_ialloc_ag_alloc+0x170/0x820
      [c0000001026bb7c0] [c0000000004a9098] .xfs_dialloc+0x158/0x320
      [c0000001026bb8a0] [c0000000004e628c] .xfs_ialloc+0x7c/0x610
      [c0000001026bb990] [c0000000004e8138] .xfs_dir_ialloc+0xa8/0x2f0
      [c0000001026bbaa0] [c0000000004e8814] .xfs_create+0x494/0x790
      [c0000001026bbbf0] [c0000000004e5ebc] .xfs_generic_create+0x2bc/0x410
      [c0000001026bbce0] [c0000000002b4a34] .vfs_mkdir+0x154/0x230
      [c0000001026bbd70] [c0000000002bc444] .SyS_mkdirat+0x94/0x120
      [c0000001026bbe30] [c00000000000b760] system_call+0x38/0xfc
      Instruction dump:
      4e800020 60000000 7c0802a6 7c862378 3c82ffca 7ca72b78 38841c18 7c651b78
      38600000 f8010010 f821ff91 4bfff94d <0fe00000> 60000000 7c0802a6 7c892378
      
      When block size is larger than inode cluster size, the call to
      XFS_B_TO_FSBT(mp, mp->m_inode_cluster_size) returns 0. Also, mkfs.xfs
      would have set xfs_sb->sb_inoalignmt to 0. This causes
      xfs_ialloc_cluster_alignment() to return 0.  Due to this
      args.minalignslop (in xfs_ialloc_ag_alloc()) gets the unsigned
      equivalent of -1 assigned to it. This later causes alloc_len in
      xfs_alloc_space_available() to have a value of 0. In such a scenario
      when args.total is also 0, the assert statement "ASSERT(args->maxlen >
      0);" fails.
      
      This commit fixes the bug by replacing the call to XFS_B_TO_FSBT() in
      xfs_ialloc_cluster_alignment() with a call to xfs_icluster_size_fsb().
      Suggested-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NChandan Rajendra <chandan@linux.vnet.ibm.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      8ee9fdbe
    • B
      xfs: split indlen reservations fairly when under reserved · 75d65361
      Brian Foster 提交于
      Certain workoads that punch holes into speculative preallocation can
      cause delalloc indirect reservation splits when the delalloc extent is
      split in two. If further splits occur, an already short-handed extent
      can be split into two in a manner that leaves zero indirect blocks for
      one of the two new extents. This occurs because the shortage is large
      enough that the xfs_bmap_split_indlen() algorithm completely drains the
      requested indlen of one of the extents before it honors the existing
      reservation.
      
      This ultimately results in a warning from xfs_bmap_del_extent(). This
      has been observed during file copies of large, sparse files using 'cp
      --sparse=always.'
      
      To avoid this problem, update xfs_bmap_split_indlen() to explicitly
      apply the reservation shortage fairly between both extents. This smooths
      out the overall indlen shortage and defers the situation where we end up
      with a delalloc extent with zero indlen reservation to extreme
      circumstances.
      Reported-by: NPatrick Dung <mpatdung@gmail.com>
      Signed-off-by: NBrian Foster <bfoster@redhat.com>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      75d65361
    • B
      xfs: handle indlen shortage on delalloc extent merge · 0e339ef8
      Brian Foster 提交于
      When a delalloc extent is created, it can be merged with pre-existing,
      contiguous, delalloc extents. When this occurs,
      xfs_bmap_add_extent_hole_delay() merges the extents along with the
      associated indirect block reservations. The expectation here is that the
      combined worst case indlen reservation is always less than or equal to
      the indlen reservation for the individual extents.
      
      This is not always the case, however, as existing extents can less than
      the expected indlen reservation if the extent was previously split due
      to a hole punch. If a new extent merges with such an extent, the total
      indlen requirement may be larger than the sum of the indlen reservations
      held by both extents.
      
      xfs_bmap_add_extent_hole_delay() assumes that the worst case indlen
      reservation is always available and assigns it to the merged extent
      without consideration for the indlen held by the pre-existing extent. As
      a result, the subsequent xfs_mod_fdblocks() call can attempt an
      unintentional allocation rather than a free (indicated by an ASSERT()
      failure). Further, if the allocation happens to fail in this context,
      the failure goes unhandled and creates a filesystem wide block
      accounting inconsistency.
      
      Fix xfs_bmap_add_extent_hole_delay() to function as designed. Cap the
      indlen reservation assigned to the merged extent to the sum of the
      indlen reservations held by each of the individual extents.
      Signed-off-by: NBrian Foster <bfoster@redhat.com>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      0e339ef8
  7. 10 2月, 2017 1 次提交
  8. 07 2月, 2017 1 次提交
    • C
      xfs: go straight to real allocations for direct I/O COW writes · a14234c7
      Christoph Hellwig 提交于
      When we allocate COW fork blocks for direct I/O writes we currently first
      create a delayed allocation, and then convert it to a real allocation
      once we've got the delayed one.
      
      As there is no good reason for that this patch instead makes use call
      xfs_bmapi_write from the COW allocation path.  The only interesting bits
      are a few tweaks the low-level allocator to allow for this, most notably
      the need to remove the call to xfs_bmap_extsize_align for the cowextsize
      in xfs_bmap_btalloc - for the existing convert case it's a no-op, but
      for the direct allocation case it would blow up our block reservation
      way beyond what we reserved for the transaction.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      a14234c7
  9. 03 2月, 2017 6 次提交
  10. 31 1月, 2017 4 次提交
  11. 26 1月, 2017 1 次提交
  12. 25 1月, 2017 4 次提交
  13. 24 1月, 2017 1 次提交
    • C
      xfs: fix COW writeback race · d2b3964a
      Christoph Hellwig 提交于
      Due to the way how xfs_iomap_write_allocate tries to convert the whole
      found extents from delalloc to real space we can run into a race
      condition with multiple threads doing writes to this same extent.
      For the non-COW case that is harmless as the only thing that can happen
      is that we call xfs_bmapi_write on an extent that has already been
      converted to a real allocation.  For COW writes where we move the extent
      from the COW to the data fork after I/O completion the race is, however,
      not quite as harmless.  In the worst case we are now calling
      xfs_bmapi_write on a region that contains hole in the COW work, which
      will trip up an assert in debug builds or lead to file system corruption
      in non-debug builds.  This seems to be reproducible with workloads of
      small O_DSYNC write, although so far I've not managed to come up with
      a with an isolated reproducer.
      
      The fix for the issue is relatively simple:  tell xfs_bmapi_write
      that we are only asked to convert delayed allocations and skip holes
      in that case.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      d2b3964a
  14. 19 1月, 2017 1 次提交
  15. 18 1月, 2017 4 次提交
  16. 10 1月, 2017 4 次提交
    • C
      xfs: don't rely on ->total in xfs_alloc_space_available · 12ef8301
      Christoph Hellwig 提交于
      ->total is a bit of an odd parameter passed down to the low-level
      allocator all the way from the high-level callers.  It's supposed to
      contain the maximum number of blocks to be allocated for the whole
      transaction [1].
      
      But in xfs_iomap_write_allocate we only convert existing delayed
      allocations and thus only have a minimal block reservation for the
      current transaction, so xfs_alloc_space_available can't use it for
      the allocation decisions.  Use the maximum of args->total and the
      calculated block requirement to make a decision.  We probably should
      get rid of args->total eventually and instead apply ->minleft more
      broadly, but that will require some extensive changes all over.
      
      [1] which creates lots of confusion as most callers don't decrement it
      once doing a first allocation.  But that's for a separate series.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      12ef8301
    • C
      xfs: adjust allocation length in xfs_alloc_space_available · 54fee133
      Christoph Hellwig 提交于
      We must decide in xfs_alloc_fix_freelist if we can perform an
      allocation from a given AG is possible or not based on the available
      space, and should not fail the allocation past that point on a
      healthy file system.
      
      But currently we have two additional places that second-guess
      xfs_alloc_fix_freelist: xfs_alloc_ag_vextent tries to adjust the
      maxlen parameter to remove the reservation before doing the
      allocation (but ignores the various minium freespace requirements),
      and xfs_alloc_fix_minleft tries to fix up the allocated length
      after we've found an extent, but ignores the reservations and also
      doesn't take the AGFL into account (and thus fails allocations
      for not matching minlen in some cases).
      
      Remove all these later fixups and just correct the maxlen argument
      inside xfs_alloc_fix_freelist once we have the AGF buffer locked.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      54fee133
    • C
      xfs: fix bogus minleft manipulations · 255c5162
      Christoph Hellwig 提交于
      We can't just set minleft to 0 when we're low on space - that's exactly
      what we need minleft for: to protect space in the AG for btree block
      allocations when we are low on free space.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      255c5162
    • C
      xfs: bump up reserved blocks in xfs_alloc_set_aside · 5149fd32
      Christoph Hellwig 提交于
      Setting aside 4 blocks globally for bmbt splits isn't all that useful,
      as different threads can allocate space in parallel.  Bump it to 4
      blocks per AG to allow each thread that is currently doing an
      allocation to dip into it separately.  Without that we may no have
      enough reserved blocks if there are enough parallel transactions
      in an almost out space file system that all run into bmap btree
      splits.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      5149fd32
  17. 04 1月, 2017 1 次提交