1. Aug 13, 2013 (1 commit)
  2. May 21, 2013 (1 commit)
    • xfs: Avoid pathological backwards allocation · 211d022c
      Committed by Jan Kara
      Writing a large file using direct IO in 16 MB chunks sometimes results
      in a pathological allocation pattern where 16 MB chunks of large free
      extent are allocated to a file in a reversed order. So extents of a file
      look for example as:
      
       ext logical physical expected length flags
         0        0        13          4550656
         1  4550656 188136807   4550668 12562432
         2 17113088 200699240 200699238 622592
         3 17735680 182046055 201321831   4096
         4 17739776 182041959 182050150   4096
         5 17743872 182037863 182046054   4096
         6 17747968 182033767 182041958   4096
         7 17752064 182029671 182037862   4096
      ...
      6757 45400064 154381644 154389835   4096
      6758 45404160 154377548 154385739   4096
      6759 45408256 252951571 154381643  73728 eof
      
      This happens because XFS_ALLOCTYPE_THIS_BNO allocation fails (the last
      extent in the file cannot be further extended) so we fall back to
      XFS_ALLOCTYPE_NEAR_BNO allocation which picks end of a large free
      extent as the best place to continue the file. Since the chunk at the
      end of the free extent again cannot be further extended, this behavior
      repeats until the whole free extent is consumed in a reversed order.
      
      For data allocations this backward allocation isn't beneficial so make
      xfs_alloc_compute_diff() pick start of a free extent instead of its end
      for them. That avoids the backward allocation pattern.
      
      See thread at http://oss.sgi.com/archives/xfs/2013-03/msg00144.html for
      more details about the reproduction case and why this solution was
      chosen.
      
      Based on idea by Dave Chinner <dchinner@redhat.com>.
      
      CC: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Mark Tinguely <tinguely@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
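      A toy model of the change described above. The function name, signature, and values are hypothetical stand-ins for xfs_alloc_compute_diff(), not the kernel code: the point is that when the wanted block does not fit inside the free extent, a data (userdata) allocation now takes the start of the extent so the next chunk can extend it forwards, instead of the end, which produced the reversed pattern.

```c
#include <assert.h>

typedef unsigned long long agblock_t;

/*
 * Hypothetical sketch of the candidate-block choice. If the wanted
 * range fits inside the free extent, use it directly. Otherwise a
 * userdata allocation picks the START of the free extent (new
 * behaviour, grows forwards); the old fallback picked the END, so
 * each successive chunk landed just before the previous one.
 */
static agblock_t
pick_candidate(agblock_t freebno, agblock_t freelen,
               agblock_t wantbno, agblock_t wantlen, int userdata)
{
    if (wantbno >= freebno && wantbno + wantlen <= freebno + freelen)
        return wantbno;                    /* exact fit is still best */
    if (userdata)
        return freebno;                    /* new: start of free extent */
    return freebno + freelen - wantlen;    /* old: end of free extent */
}
```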
  3. Apr 28, 2013 (1 commit)
    • xfs: buffer type overruns blf_flags field · 61fe135c
      Committed by Dave Chinner
      The buffer type passed to log recovery in the buffer log item
      overruns the blf_flags field. I had assumed that the flags field
      was a 32 bit value, but it turns out it is an unsigned short.
      Therefore having 19 flags doesn't really work.
      
      Convert the buffer type field to a numeric value, and use the top 5
      bits of the flags field for it. We currently have 17 types of
      buffers, so using 5 bits gives us plenty of room for expansion in
      future....
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Ben Myers <bpm@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
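      The packing scheme can be sketched like this. The macro names mirror, but are not guaranteed to match, the kernel's XFS_BLFT_* definitions; the 16-bit width and 5-bit type field are taken from the commit text:

```c
#include <assert.h>
#include <stdint.h>

/* Pack a numeric buffer type into the top 5 bits of a 16-bit flags
 * field: 5 bits give 32 type values, room beyond the current 17. */
#define BLFT_BITS   5
#define BLFT_SHIFT  (16 - BLFT_BITS)                    /* 11 */
#define BLFT_MASK   (((1U << BLFT_BITS) - 1) << BLFT_SHIFT)

static uint16_t set_type(uint16_t flags, unsigned type)
{
    assert(type < (1U << BLFT_BITS));
    return (uint16_t)((flags & ~BLFT_MASK) | (type << BLFT_SHIFT));
}

static unsigned get_type(uint16_t flags)
{
    return (flags & BLFT_MASK) >> BLFT_SHIFT;
}
```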
  4. Apr 22, 2013 (2 commits)
    • xfs: add CRC checks to the AGFL · 77c95bba
      Committed by Christoph Hellwig
      Add CRC checks, location information and a magic number to the AGFL.
      Previously the AGFL was just a block containing nothing but the
      free block pointers.  The new AGFL has a real header with the usual
      boilerplate instead, so that we can verify it's not corrupted and
      written into the right place.
      
      [dchinner@redhat.com] Added LSN field, reworked significantly to fit
      into new verifier structure and growfs structure, enabled full
      verifier functionality now there is a header to verify and we can
      guarantee an initialised AGFL.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Ben Myers <bpm@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
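      The shape of the new header looks roughly like this. Field names follow the kernel's struct xfs_agfl, but the types here are simplified illustrations (on disk the fields are big-endian __be32/__be64 and the structure is packed for a fixed layout):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sketch of the self-describing AGFL header described above. */
struct agfl_header {
    uint32_t magicnum;   /* magic number: detects stray blocks */
    uint32_t seqno;      /* AG number: verifies the location */
    uint8_t  uuid[16];   /* filesystem identity */
    uint64_t lsn;        /* last-write LSN (added in the rework) */
    uint32_t crc;        /* CRC over the whole AGFL block */
    /* the free block pointers follow, filling the block */
};
```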
    • xfs: add CRC checks to the AGF · 4e0e6040
      Committed by Dave Chinner
      The AGF already has some self identifying fields (e.g. the sequence
      number) so we only need to add the uuid to it to identify the
      filesystem it belongs to. The location is fixed based on the
      sequence number, so there's no need to add a block number, either.
      
      Hence the only additional fields are the CRC and LSN fields. These
      are unlogged, so place some space between the end of the logged
      fields and them so that future expansion of the AGF for logged
      fields can be placed adjacent to the existing logged fields and
      hence not complicate the field-derived range based logging we
      currently have.
      
      Based originally on a patch from myself, modified further by
      Christoph Hellwig and then modified again to fit into the
      verifier structure with additional fields by myself. The multiple
      signed-off-by tags indicate the age and history of this patch.
      Signed-off-by: Dave Chinner <dgc@sgi.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Ben Myers <bpm@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
  5. Mar 08, 2013 (1 commit)
  6. Jan 04, 2013 (1 commit)
  7. Nov 16, 2012 (6 commits)
    • xfs: convert buffer verifiers to an ops structure. · 1813dd64
      Committed by Dave Chinner
      To separate the verifiers from iodone functions and associate read
      and write verifiers at the same time, introduce a buffer verifier
      operations structure to the xfs_buf.
      
      This avoids the need for assigning the write verifier, clearing the
      iodone function and re-running ioend processing in the read
      verifier, and gets rid of the nasty "b_pre_io" name for the write
      verifier function pointer. If we ever need to, it will also be
      easier to add further content specific callbacks to a buffer with an
      ops structure in place.
      
      We also avoid needing to export verifier functions, instead we
      can simply export the ops structures for those that are needed
      outside the function they are defined in.
      
      This patch also fixes a directory block readahead verifier issue
      it exposed.
      
      This patch also adds ops callbacks to the inode/alloc btree blocks
      initialised by growfs. These will need more work before they will
      work with CRCs.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Phil White <pwhite@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
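      A userspace sketch of the pattern. The kernel's struct xfs_buf_ops pairs a read and a write verifier; everything else here (struct names, magic value, verifier bodies) is a hypothetical stand-in showing why exporting one ops structure beats exporting two functions:

```c
#include <assert.h>
#include <stddef.h>

#define EFSCORRUPTED 117          /* stand-in for the kernel errno */
#define MY_MAGIC     0x58414746u  /* hypothetical magic value */

struct buf;

/* read and write verifiers associated at the same time */
struct buf_ops {
    void (*verify_read)(struct buf *bp);
    void (*verify_write)(struct buf *bp);
};

struct buf {
    const struct buf_ops *ops;  /* attached once, used for both paths */
    int error;                  /* set to EFSCORRUPTED on failure */
    unsigned magic;
};

static void my_verify_read(struct buf *bp)
{
    if (bp->magic != MY_MAGIC)
        bp->error = EFSCORRUPTED;
}

static void my_verify_write(struct buf *bp)
{
    if (bp->magic != MY_MAGIC)
        bp->error = EFSCORRUPTED;
}

/* only the ops structure needs to be visible outside this file */
const struct buf_ops my_buf_ops = {
    .verify_read  = my_verify_read,
    .verify_write = my_verify_write,
};
```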
    • xfs: connect up write verifiers to new buffers · b0f539de
      Committed by Dave Chinner
      Metadata buffers that are read from disk have write verifiers
      already attached to them, but newly allocated buffers do not. Add
      appropriate write verifiers to all new metadata buffers.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Ben Myers <bpm@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
    • xfs: add pre-write metadata buffer verifier callbacks · 612cfbfe
      Committed by Dave Chinner
      These verifiers are essentially the same code as the read verifiers,
      but do not require ioend processing. Hence factor the read verifier
      functions and add a new write verifier wrapper that is used as the
      callback.
      
      This is done as one large patch for all verifiers rather than one
      patch per verifier as the change is largely mechanical. This
      includes hooking up the write verifier via the read verifier
      function.
      
      Hooking up the write verifier for buffers obtained via
      xfs_trans_get_buf() will be done in a separate patch as that touches
      code in many different places rather than just the verifier
      functions.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Mark Tinguely <tinguely@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
    • xfs: verify AGFL blocks as they are read from disk · bb80c6d7
      Committed by Dave Chinner
      Add an AGFL block verify callback function and pass it into the
      buffer read functions.
      
      While this commit adds verification code to the AGFL, it cannot be
      used reliably until the CRC format change comes along as mkfs does
      not initialise the full AGFL. Hence it can be full of garbage at the
      first mount and will fail verification right now. CRC enabled
      filesystems won't have this problem, so leave the code that has
      already been written ifdef'd out until the proper time.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Phil White <pwhite@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
    • xfs: verify AGF blocks as they are read from disk · 5d5f527d
      Committed by Dave Chinner
      Add an AGF block verify callback function and pass it into the
      buffer read functions. This replaces the existing verification that
      is done after the read completes.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Mark Tinguely <tinguely@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
    • xfs: make buffer read verification an IO completion function · c3f8fc73
      Committed by Dave Chinner
      Add a verifier function callback capability to the buffer read
      interfaces.  This will be used by the callers to supply a function
      that verifies the contents of the buffer when it is read from disk.
      This patch does not provide callback functions, but simply modifies
      the interfaces to allow them to be called.
      
      The reason for adding this to the read interfaces is that it is very
      difficult to tell from the outside whether a buffer was just read
      from disk or whether we just pulled it out of cache. Supplying a
      callback allows the buffer cache to use its internal knowledge of
      the buffer to execute it only when the buffer is read from disk.
      
      It is intended that the verifier functions will mark the buffer with
      an EFSCORRUPTED error when verification fails. This allows the
      reading context to distinguish a verification error from an IO
      error, and potentially take further actions on the buffer (e.g.
      attempt repair) based on the error reported.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Phil White <pwhite@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
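      The interface idea in miniature (all names hypothetical, mocked in userspace): the cache layer, which knows whether the buffer was actually filled from disk, decides whether to run the supplied verifier, and a failing verifier marks the buffer EFSCORRUPTED so callers can distinguish corruption from an IO error:

```c
#include <assert.h>

#define EFSCORRUPTED 117   /* stand-in errno; EIO would mean IO failure */

struct mbuf {
    int from_disk;   /* set by the (mock) cache layer, invisible to callers */
    int error;
    unsigned magic;
};

typedef void (*verifier_t)(struct mbuf *bp);

static void verify_magic(struct mbuf *bp)
{
    if (bp->magic != 0xfeedbeefu)
        bp->error = EFSCORRUPTED;   /* corruption, not an IO error */
}

/* read interface now takes the verifier as an argument */
static void buf_read(struct mbuf *bp, verifier_t verify)
{
    if (bp->from_disk && verify)
        verify(bp);                 /* cache hits skip verification */
}
```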
  8. Nov 09, 2012 (3 commits)
    • xfs: move allocation stack switch up to xfs_bmapi_allocate · 1f3c785c
      Committed by Dave Chinner
      Switching stacks at xfs_alloc_vextent can cause deadlocks when we
      run out of worker threads on the allocation workqueue. This can
      occur because xfs_bmap_btalloc can make multiple calls to
      xfs_alloc_vextent() and even if xfs_alloc_vextent() fails it can
      return with the AGF locked in the current allocation transaction.
      
      If we then need to make another allocation, and all the allocation
      worker contexts are exhausted because they are blocked waiting for
      the AGF lock, the holder of the AGF cannot get its xfs_alloc_vextent
      work completed to release the AGF.  Hence allocation effectively
      deadlocks.
      
      To avoid this, move the stack switch one layer up to
      xfs_bmapi_allocate() so that all of the allocation attempts in a
      single switched stack transaction occur in a single worker context.
      This avoids the problem of an allocation being blocked waiting for
      a worker thread whilst holding the AGF.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Mark Tinguely <tinguely@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
    • xfs: introduce XFS_BMAPI_STACK_SWITCH · 326c0355
      Committed by Dave Chinner
      Certain allocation paths through xfs_bmapi_write() are in situations
      where we have limited stack available. These are almost always in
      the buffered IO writeback path when converting delayed allocation
      extents to real extents.
      
      The current stack switch occurs for userdata allocations, which
      means we also do stack switches for preallocation, direct IO and
      unwritten extent conversion, even though these call chains have never
      been implicated in a stack overrun.
      
      Hence, let's target just the single stack overrun offender for stack
      switches. To do that, introduce an XFS_BMAPI_STACK_SWITCH flag that
      the caller can pass xfs_bmapi_write() to indicate it should switch
      stacks if it needs to do allocation.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Mark Tinguely <tinguely@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
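      The flag check can be sketched as follows (the bit value and helper name are hypothetical; the real flag lives among the XFS_BMAPI_* definitions): a stack switch happens only when the caller opted in AND an allocation is actually needed.

```c
#include <assert.h>

#define BMAPI_STACK_SWITCH (1u << 6)   /* hypothetical bit value */

/* defer to a worker only for opted-in callers that must allocate */
static int needs_stack_switch(unsigned flags, int need_alloc)
{
    return need_alloc && (flags & BMAPI_STACK_SWITCH) != 0;
}
```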
    • xfs: zero allocation_args on the kernel stack · 408cc4e9
      Committed by Mark Tinguely
      Zero the kernel stack space that makes up the xfs_alloc_arg structures.
      Signed-off-by: Mark Tinguely <tinguely@sgi.com>
      Reviewed-by: Ben Myers <bpm@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
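      The fix in miniature (struct fields abbreviated, names illustrative): xfs_alloc_arg lives on the kernel stack, so any field a code path forgets to set carries stack garbage into the allocator; zeroing the whole structure first makes every unset field a well-defined zero.

```c
#include <assert.h>
#include <string.h>

/* abbreviated stand-in for the real xfs_alloc_arg */
struct alloc_arg {
    void *tp, *mp;
    unsigned long fsbno, minlen, maxlen, prod;
    int userdata;
};

static void init_args(struct alloc_arg *args)
{
    memset(args, 0, sizeof(*args));   /* was: fields set one by one */
}
```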
  9. Oct 19, 2012 (3 commits)
    • xfs: move allocation stack switch up to xfs_bmapi_allocate · e04426b9
      Committed by Dave Chinner
      Switching stacks at xfs_alloc_vextent can cause deadlocks when we
      run out of worker threads on the allocation workqueue. This can
      occur because xfs_bmap_btalloc can make multiple calls to
      xfs_alloc_vextent() and even if xfs_alloc_vextent() fails it can
      return with the AGF locked in the current allocation transaction.
      
      If we then need to make another allocation, and all the allocation
      worker contexts are exhausted because they are blocked waiting for
      the AGF lock, the holder of the AGF cannot get its xfs_alloc_vextent
      work completed to release the AGF.  Hence allocation effectively
      deadlocks.
      
      To avoid this, move the stack switch one layer up to
      xfs_bmapi_allocate() so that all of the allocation attempts in a
      single switched stack transaction occur in a single worker context.
      This avoids the problem of an allocation being blocked waiting for
      a worker thread whilst holding the AGF.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Mark Tinguely <tinguely@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
    • xfs: introduce XFS_BMAPI_STACK_SWITCH · 2455881c
      Committed by Dave Chinner
      Certain allocation paths through xfs_bmapi_write() are in situations
      where we have limited stack available. These are almost always in
      the buffered IO writeback path when converting delayed allocation
      extents to real extents.
      
      The current stack switch occurs for userdata allocations, which
      means we also do stack switches for preallocation, direct IO and
      unwritten extent conversion, even though these call chains have never
      been implicated in a stack overrun.
      
      Hence, let's target just the single stack overrun offender for stack
      switches. To do that, introduce an XFS_BMAPI_STACK_SWITCH flag that
      the caller can pass xfs_bmapi_write() to indicate it should switch
      stacks if it needs to do allocation.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Mark Tinguely <tinguely@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
    • xfs: zero allocation_args on the kernel stack · a0041684
      Committed by Mark Tinguely
      Zero the kernel stack space that makes up the xfs_alloc_arg structures.
      Signed-off-by: Mark Tinguely <tinguely@sgi.com>
      Reviewed-by: Ben Myers <bpm@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
  10. Jul 14, 2012 (4 commits)
  11. Jun 22, 2012 (2 commits)
  12. Jun 21, 2012 (1 commit)
    • xfs: fix debug_object WARN at xfs_alloc_vextent() · 3b876c8f
      Committed by Jeff Liu
      Fengguang reports:
      
      [  780.529603] XFS (vdd): Ending clean mount
      [  781.454590] ODEBUG: object is on stack, but not annotated
      [  781.455433] ------------[ cut here ]------------
      [  781.455433] WARNING: at /c/kernel-tests/sound/lib/debugobjects.c:301 __debug_object_init+0x173/0x1f1()
      [  781.455433] Hardware name: Bochs
      [  781.455433] Modules linked in:
      [  781.455433] Pid: 26910, comm: kworker/0:2 Not tainted 3.4.0+ #51
      [  781.455433] Call Trace:
      [  781.455433]  [<ffffffff8106bc84>] warn_slowpath_common+0x83/0x9b
      [  781.455433]  [<ffffffff8106bcb6>] warn_slowpath_null+0x1a/0x1c
      [  781.455433]  [<ffffffff814919a5>] __debug_object_init+0x173/0x1f1
      [  781.455433]  [<ffffffff81491c65>] debug_object_init+0x14/0x16
      [  781.455433]  [<ffffffff8108842a>] __init_work+0x20/0x22
      [  781.455433]  [<ffffffff8134ea56>] xfs_alloc_vextent+0x6c/0xd5
      
      Use INIT_WORK_ONSTACK in xfs_alloc_vextent instead of INIT_WORK.
      Reported-by: Wu Fengguang <wfg@linux.intel.com>
      Signed-off-by: Jie Liu <jeff.liu@oracle.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
  13. Jun 15, 2012 (1 commit)
    • xfs: fix debug_object WARN at xfs_alloc_vextent() · 0f2cf9d3
      Committed by Jeff Liu
      Fengguang reports:
      
      [  780.529603] XFS (vdd): Ending clean mount
      [  781.454590] ODEBUG: object is on stack, but not annotated
      [  781.455433] ------------[ cut here ]------------
      [  781.455433] WARNING: at /c/kernel-tests/sound/lib/debugobjects.c:301 __debug_object_init+0x173/0x1f1()
      [  781.455433] Hardware name: Bochs
      [  781.455433] Modules linked in:
      [  781.455433] Pid: 26910, comm: kworker/0:2 Not tainted 3.4.0+ #51
      [  781.455433] Call Trace:
      [  781.455433]  [<ffffffff8106bc84>] warn_slowpath_common+0x83/0x9b
      [  781.455433]  [<ffffffff8106bcb6>] warn_slowpath_null+0x1a/0x1c
      [  781.455433]  [<ffffffff814919a5>] __debug_object_init+0x173/0x1f1
      [  781.455433]  [<ffffffff81491c65>] debug_object_init+0x14/0x16
      [  781.455433]  [<ffffffff8108842a>] __init_work+0x20/0x22
      [  781.455433]  [<ffffffff8134ea56>] xfs_alloc_vextent+0x6c/0xd5
      
      Use INIT_WORK_ONSTACK in xfs_alloc_vextent instead of INIT_WORK.
      Reported-by: Wu Fengguang <wfg@linux.intel.com>
      Signed-off-by: Jie Liu <jeff.liu@oracle.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
  14. May 15, 2012 (4 commits)
  15. Mar 28, 2012 (1 commit)
    • xfs: fix fstrim offset calculations · a66d6363
      Committed by Dave Chinner
      xfs_ioc_fstrim() doesn't treat the incoming offset and length
      correctly. It treats them as a filesystem block address, rather than
      a disk address. This is wrong because the range passed in is a
      linear representation, while the filesystem block address notation
      is a sparse representation. Hence we cannot convert the range direct
      to filesystem block units and then use that for calculating the
      range to trim.
      
      While this sounds dangerous, the problem is limited to calculating
      what AGs need to be trimmed. The code that calculates the actual
      ranges to trim gets the right result (i.e. only ever discards free
      space), even though it uses the wrong ranges to limit what is
      trimmed. Hence this is not a bug that endangers user data.
      
      Fix this by treating the range as a disk address range and use the
      appropriate functions to convert the range into the desired formats
      for calculations.
      
      Further, fix the first free extent lookup (the longest) to actually
      find the largest free extent. Currently this lookup uses a <=
      lookup, which results in finding the extent to the left of the
      largest because we can never get an exact match on the largest
      extent. This is due to the fact that while we know its size, we
      don't know its location and so the exact match fails and we move
      one record to the left to get the next largest extent. Instead, use
      a >= search so that the lookup returns the largest extent regardless
      of the fact we don't get an exact match on it.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Ben Myers <bpm@sgi.com>
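      A toy model of the lookup fix. Records in the by-size btree are keyed by (length, startblock); the caller knows the largest length (from the AGF) but not its startblock, so it searches for (maxlen, 0), which can never match exactly. The linear searches below stand in for the btree lookups:

```c
#include <assert.h>

struct key { unsigned len, bno; };   /* (length, startblock) */

static int key_cmp(struct key a, struct key b)
{
    if (a.len != b.len)
        return a.len < b.len ? -1 : 1;
    if (a.bno != b.bno)
        return a.bno < b.bno ? -1 : 1;
    return 0;
}

/* records sorted ascending by key; return matching index or -1 */
static int lookup_le(const struct key *recs, int n, struct key want)
{
    int best = -1;
    for (int i = 0; i < n; i++)
        if (key_cmp(recs[i], want) <= 0)
            best = i;           /* lands LEFT of the largest extent */
    return best;
}

static int lookup_ge(const struct key *recs, int n, struct key want)
{
    for (int i = 0; i < n; i++)
        if (key_cmp(recs[i], want) >= 0)
            return i;           /* returns the largest extent */
    return -1;
}
```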
  16. Mar 23, 2012 (1 commit)
    • xfs: introduce an allocation workqueue · c999a223
      Committed by Dave Chinner
      We currently have significant issues with the amount of stack that
      allocation in XFS uses, especially in the writeback path. We can
      easily consume 4k of stack between mapping the page, manipulating
      the bmap btree and allocating blocks from the free list. Not to
      mention btree block readahead and other functionality that issues IO
      in the allocation path.
      
      As a result, we can no longer fit allocation in the writeback path
      in the stack space provided on x86_64. To alleviate this problem,
      introduce an allocation workqueue and move all allocations to a
      separate context. This can be easily added as an interposing layer
      into xfs_alloc_vextent(), which takes a single argument structure
      and does not return until the allocation is complete or has failed.
      
      To do this, add a work structure and a completion to the allocation
      args structure. This allows xfs_alloc_vextent to queue the args onto
      the workqueue and wait for it to be completed by the worker. This
      can be done completely transparently to the caller.
      
      The worker function needs to ensure that it sets and clears the
      PF_TRANS flag appropriately as it is being run in an active
      transaction context. Work can also be queued in a memory reclaim
      context, so a rescuer is needed for the workqueue.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Ben Myers <bpm@sgi.com>
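      A userspace model of the interposing layer (names hypothetical): the args structure gains a way for the caller to wait until a worker, running on a fresh stack, has completed the allocation. Here pthread_create/pthread_join stand in for queue_work() plus wait_for_completion():

```c
#include <assert.h>
#include <pthread.h>

struct alloc_args {
    unsigned long maxlen;   /* in: requested length */
    unsigned long len;      /* out: allocated length */
};

/* runs on the worker thread's own (fresh) stack */
static void *alloc_worker(void *p)
{
    struct alloc_args *args = p;
    args->len = args->maxlen;   /* the deep allocation chain runs here */
    return NULL;
}

/* same external contract: returns only when allocation is complete */
static unsigned long alloc_vextent(unsigned long maxlen)
{
    struct alloc_args args = { maxlen, 0 };
    pthread_t worker;

    if (pthread_create(&worker, NULL, alloc_worker, &args) != 0)
        return 0;
    pthread_join(&worker, NULL);   /* wait_for_completion() analogue */
    return args.len;
}
```

The kernel version also has to propagate transaction state (PF_TRANS) to the worker and give the workqueue a rescuer thread because work may be queued under memory reclaim, details a userspace model cannot show.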
  17. Oct 12, 2011 (1 commit)
  18. Jul 26, 2011 (1 commit)
  19. Jul 09, 2011 (1 commit)
  20. Jul 08, 2011 (1 commit)
  21. May 25, 2011 (2 commits)
    • xfs: do not discard alloc btree blocks · 55a7bc5a
      Committed by Christoph Hellwig
      Blocks for the allocation btree are allocated from and released to
      the AGFL, and thus frequently reused.  Even worse we do not have an
      easy way to avoid using an AGFL block when it is discarded due to
      the simple FILO list of free blocks, and thus can frequently stall
      on blocks that are currently undergoing a discard.
      
      Add a flag to the busy extent tracking structure to skip the discard
      for allocation btree blocks.  In normal operation these blocks are
      reused frequently enough that there is no need to discard them
      anyway, but if they spill over to the allocation btree as part of a
      balance we "leak" blocks that we would otherwise discard.  We could
      fix this by adding another flag and keeping these blocks in the
      rbtree even after they aren't busy any more so that we could discard
      them when they migrate out of the AGFL.  Given that this would cause
      significant overhead I don't think it's worthwhile for now.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Alex Elder <aelder@sgi.com>
    • xfs: add online discard support · e84661aa
      Committed by Christoph Hellwig
      Now that we have reliable tracking of deleted extents in a
      transaction we can easily implement "online" discard support
      which calls blkdev_issue_discard once a transaction commits.
      
      The actual discard is a two stage operation as we first have
      to mark the busy extent as not available for reuse before we
      can start the actual discard.  Note that we don't bother
      supporting discard for the non-delaylog mode.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Alex Elder <aelder@sgi.com>
  22. May 20, 2011 (1 commit)
    • xfs: obey minleft values during extent allocation correctly · bf59170a
      Committed by Dave Chinner
      When allocating an extent that is long enough to consume the
      remaining free space in an AG, we need to ensure that the allocation
      leaves enough space in the AG for any subsequent bmap btree blocks
      that are needed to track the new extent. These have to be allocated
      in the same AG as we only reserve enough blocks in an allocation
      transaction for modification of the freespace trees in a single AG.
      
      xfs_alloc_fix_minleft() has been considering blocks on the AGFL as
      free blocks available for extent and bmbt block allocation, which is
      not correct - blocks on the AGFL are there exclusively for the use
      of the free space btrees. As a result, when minleft is less than the
      number of blocks on the AGFL, xfs_alloc_fix_minleft() does not trim
      the given extent to leave minleft blocks available for bmbt
      allocation, and hence we can fail allocation during bmbt record
      insertion.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Alex Elder <aelder@sgi.com>
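      A toy model of the minleft accounting (function name and parameters are illustrative, not the kernel code): the blocks "available" for the trim calculation must exclude AGFL blocks, which exist only to feed the free space btrees. When the AGFL is counted (the old behaviour), a maximal allocation is not trimmed and can leave fewer than minleft blocks for later bmbt block allocation.

```c
#include <assert.h>

/* count_agfl = 1 models the old, buggy accounting */
static unsigned long
trim_for_minleft(unsigned long freeblks, unsigned long flcount,
                 unsigned long len, unsigned long minleft, int count_agfl)
{
    unsigned long avail = freeblks + (count_agfl ? flcount : 0);

    if (avail < minleft)
        return 0;                  /* allocation must fail outright */
    if (len + minleft > avail)
        len = avail - minleft;     /* trim to leave minleft blocks */
    return len;
}
```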