1. 29 May, 2015 (6 commits)
    • xfs: sparse inode chunks feature helpers and mount requirements · e5376fc1
      Brian Foster committed
      The sparse inode chunks feature uses the helper function to enable the
      allocation of sparse inode chunks. The incompatible feature bit is set
      on disk at mkfs time to prevent mount from unsupported kernels.
      
      Also, enforce the inode alignment requirements for sparse inode chunks
      at mount time. When the feature is enabled, the alignment of full inode
      chunks (and of all inode records) is increased from cluster size to
      inode chunk size.
      Sparse inode alignment must match the cluster size of the fs. Both
      superblock alignment fields are set as such by mkfs when sparse inode
      support is enabled.
      
      Finally, warn that sparse inode chunks is an experimental feature until
      further notice.
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
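The mount-time alignment rule described in e5376fc1 can be sketched as a small check. This is a minimal illustration, not the kernel code: the struct, the function name, the `sb_inoalignmt` field name, and the 64-inodes-per-chunk constant are assumptions made for the example, and both alignment fields are assumed to be expressed in filesystem blocks.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_INODES_PER_CHUNK 64u  /* assumed fixed chunk size in inodes */

/* Hypothetical, simplified view of the geometry relevant to the check. */
struct sketch_geom {
    uint32_t blocksize;       /* filesystem block size in bytes */
    uint32_t inodesize;       /* on-disk inode size in bytes */
    uint32_t cluster_bytes;   /* inode cluster size in bytes */
    uint32_t sb_inoalignmt;   /* full chunk alignment, in blocks (assumed name) */
    uint32_t sb_spino_align;  /* sparse chunk alignment, in blocks */
};

/*
 * With sparse inodes enabled, full chunk alignment is expected to equal the
 * inode chunk size and sparse alignment is expected to equal the cluster size.
 */
static bool sketch_sparse_alignment_ok(const struct sketch_geom *g)
{
    assert(g->blocksize != 0);

    uint32_t chunk_blocks = SKETCH_INODES_PER_CHUNK * g->inodesize / g->blocksize;
    uint32_t cluster_blocks = g->cluster_bytes / g->blocksize;

    return g->sb_inoalignmt == chunk_blocks &&
           g->sb_spino_align == cluster_blocks;
}
```

For 4k blocks, 512-byte inodes and an 8k cluster, the sketch accepts a full chunk alignment of 8 blocks and a sparse alignment of 2 blocks, and rejects anything else.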
    • xfs: use sparse chunk alignment for min. inode allocation requirement · 066a1884
      Brian Foster committed
      xfs_ialloc_ag_select() iterates through the allocation groups looking
      for free inodes or free space to determine whether to allow an inode
      allocation to proceed. If no free inodes are available, it assumes that
      an AG must have an extent longer than mp->m_ialloc_blks.
      
      Sparse inode chunk support currently allows for allocations smaller than
      the traditional inode chunk size specified in m_ialloc_blks. The current
      minimum sparse allocation is set in the superblock sb_spino_align field
      at mkfs time. Create a new m_ialloc_min_blks field in xfs_mount and use
      this to represent the minimum supported allocation size for inode
      chunks. Initialize m_ialloc_min_blks at mount time based on whether
      sparse inodes are supported.
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
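The initialisation of m_ialloc_min_blks described in 066a1884 might look roughly like the following. The structures here are hypothetical, stripped-down stand-ins for the superblock and xfs_mount, and the feature flag is modelled as a plain boolean.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the relevant superblock fields. */
struct sketch_sb {
    bool     has_sparse_inodes;  /* sparse inode feature enabled */
    uint32_t sb_spino_align;     /* sparse chunk alignment, in blocks */
};

/* Hypothetical stand-in for the relevant xfs_mount fields. */
struct sketch_mount {
    struct sketch_sb sb;
    uint32_t m_ialloc_blks;      /* blocks in a full inode chunk */
    uint32_t m_ialloc_min_blks;  /* smallest supported inode allocation */
};

/* Pick the minimum inode allocation size once, at mount time. */
static void sketch_set_ialloc_min_blks(struct sketch_mount *mp)
{
    assert(mp->m_ialloc_blks != 0);

    if (mp->sb.has_sparse_inodes && mp->sb.sb_spino_align != 0)
        mp->m_ialloc_min_blks = mp->sb.sb_spino_align;
    else
        mp->m_ialloc_min_blks = mp->m_ialloc_blks;
}
```

An AG selection loop can then compare the longest free extent against m_ialloc_min_blks instead of m_ialloc_blks.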
    • xfs: add sparse inode chunk alignment superblock field · fb4f2b4e
      Brian Foster committed
      Add sb_spino_align to the superblock to specify sparse inode chunk
      alignment. This also currently represents the minimum allowable sparse
      chunk allocation size.
      Signed-off-by: Brian Foster <bfoster@redhat.com>
    • xfs: support min/max agbno args in block allocator · bfe46d4e
      Brian Foster committed
      The block allocator supports various arguments to tweak block allocation
      behavior and set allocation requirements. The sparse inode chunk feature
      introduces a new requirement not supported by the current arguments.
      Sparse inode allocations must convert or merge into an inode record that
      describes a fixed length chunk (64 inodes x inodesize). Full inode chunk
      allocations by definition always result in valid inode records. Sparse
      chunk allocations are smaller and the associated records can refer to
      blocks not owned by the inode chunk. This model can result in invalid
      inode records in certain cases.
      
      For example, if a sparse allocation occurs near the start of an AG, the
      aligned inode record for that chunk might refer to agbno 0. If an
      allocation occurs towards the end of the AG and the AG size is not
      aligned, the inode record could refer to blocks beyond the end of the
      AG. While neither of these scenarios directly results in corruption,
      both insert invalid inode records that, at minimum, cause repair to
      complain, are unlikely to merge into full chunks over time, and set land
      mines for other areas of code.
      
      To guarantee sparse inode chunk allocation creates valid inode records,
      support the ability to specify an agbno range limit for
      XFS_ALLOCTYPE_NEAR_BNO block allocations. The min/max agbno's are
      specified in the allocation arguments and limit the block allocation
      algorithms to that range. The starting 'agbno' hint is clamped to the
      range if the specified agbno is out of range. If no sufficient extent is
      available within the range, the allocation fails. For backwards
      compatibility, the min/max fields can be initialized to 0 to disable
      range limiting (e.g., equivalent to min=0,max=agsize).
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
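The range handling described in bfe46d4e (clamp the hint, treat min=max=0 as "no limit") can be sketched as below. The type and function names are hypothetical. Whether the limit applies to the start block only or to the whole extent is not stated in the message, so the sketch only checks the start block.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t sketch_agblock_t;  /* hypothetical stand-in for an AG block number */

struct sketch_range {
    sketch_agblock_t min_agbno;
    sketch_agblock_t max_agbno;
};

/* min == max == 0 disables range limiting, i.e. min=0, max=agsize. */
static struct sketch_range
sketch_effective_range(sketch_agblock_t min, sketch_agblock_t max,
                       sketch_agblock_t agsize)
{
    struct sketch_range r = { min, max };

    if (min == 0 && max == 0)
        r.max_agbno = agsize;
    assert(r.min_agbno <= r.max_agbno);
    return r;
}

/* Clamp the caller's starting hint into the permitted range. */
static sketch_agblock_t
sketch_clamp_hint(sketch_agblock_t agbno, struct sketch_range r)
{
    if (agbno < r.min_agbno)
        return r.min_agbno;
    if (agbno > r.max_agbno)
        return r.max_agbno;
    return agbno;
}

/* A candidate extent is usable only if it starts inside the range. */
static bool
sketch_start_in_range(sketch_agblock_t start, struct sketch_range r)
{
    return start >= r.min_agbno && start <= r.max_agbno;
}
```

If no candidate passes the range check, the allocation would fail rather than fall back to a block outside the range.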
    • xfs: update free inode record logic to support sparse inode records · 999633d3
      Brian Foster committed
      xfs_difree_inobt() uses logic in a couple of places that assumes inobt
      records refer to fully allocated chunks. Specifically, the use of
      mp->m_ialloc_inos can cause problems for inode chunks that are sparsely
      allocated. Sparse inode chunks can, by definition, define a smaller
      number of inodes than a full inode chunk.
      
      Fix the logic that determines whether an inode record should be removed
      from the inobt to use the ir_free mask rather than ir_freecount. Fix the
      agi counters modification to use ir_freecount to add the actual number
      of inodes freed rather than assuming a full inode chunk.
      
      Also make sure that we preserve the behavior to not remove inode chunks
      if the block size is large enough for multiple inode chunks (e.g.,
      bsize=64k, isize=512). This behavior was previously implicit: in such
      configurations, the ir_freecount of a single record never matches
      m_ialloc_inos. Hence, add some comments as well.
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
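The record-removal decision described in 999633d3 can be illustrated with a small predicate. The names are hypothetical, and the sketch assumes that inodes in sparse (absent) regions are marked free in ir_free, so that a fully free sparse record still has every bit set while its ir_freecount is below 64.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_INODES_PER_CHUNK 64u
#define SKETCH_ALL_FREE         UINT64_MAX

/* Hypothetical, simplified in-core inobt record. */
struct sketch_irec {
    uint64_t ir_free;       /* one bit per inode, set = free */
    uint32_t ir_freecount;  /* free inodes that physically exist */
};

/*
 * Remove the record only when every inode is free according to the mask,
 * and never when one block holds more than one inode chunk.
 */
static bool sketch_can_remove_record(const struct sketch_irec *rec,
                                     uint32_t inodes_per_block)
{
    assert(rec->ir_freecount <= SKETCH_INODES_PER_CHUNK);

    return rec->ir_free == SKETCH_ALL_FREE &&
           inodes_per_block <= SKETCH_INODES_PER_CHUNK;
}

/* Counter adjustment uses the real inode count, not a full chunk. */
static uint32_t sketch_inodes_freed(const struct sketch_irec *rec)
{
    return rec->ir_freecount;
}
```

With bsize=64k and isize=512 there are 128 inodes per block, so the predicate returns false even for a completely free record, which matches the preserved behavior.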
    • xfs: create individual inode alloc. helper · d4cc540b
      Brian Foster committed
      Inode allocation from sparse inode records must filter the ir_free mask
      against ir_holemask.  In preparation for this requirement, create a
      helper to allocate an individual inode from an inode record.
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
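The filtering requirement that motivates d4cc540b (ir_free must be masked against ir_holemask) can be sketched as follows. This is an illustration of the idea, not the helper added by the commit. It assumes a 16-bit holemask in which each set bit marks 4 absent inodes, and a 64-bit ir_free in which set bits mean "free"; all names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_INODES_PER_CHUNK        64
#define SKETCH_HOLEMASK_BITS           16
#define SKETCH_INODES_PER_HOLEMASK_BIT 4   /* assumed granularity */

/* Expand the holemask to a 64-bit mask of inodes that physically exist. */
static uint64_t sketch_holemask_to_allocmask(uint16_t holemask)
{
    uint64_t holes = 0;

    for (int bit = 0; bit < SKETCH_HOLEMASK_BITS; bit++) {
        if (holemask & (1u << bit))
            holes |= (uint64_t)0xF << (bit * SKETCH_INODES_PER_HOLEMASK_BIT);
    }
    return ~holes;
}

/* Return the index of the first inode that is both present and free, or -1. */
static int sketch_first_free_inode(uint64_t ir_free, uint16_t ir_holemask)
{
    uint64_t realfree = ir_free & sketch_holemask_to_allocmask(ir_holemask);

    for (int i = 0; i < SKETCH_INODES_PER_CHUNK; i++) {
        if (realfree & ((uint64_t)1 << i))
            return i;
    }
    return -1;
}
```

Keeping this in one helper means the caller never has to know whether the record is sparse.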
  2. 16 Apr, 2015 (9 commits)
    • xfs: using generic_file_direct_write() is unnecessary · 0cefb29e
      Dave Chinner committed
      generic_file_direct_write() does all sorts of things to make DIO
      work "sorta ok" with mixed buffered IO workloads. We already do
      most of this work in xfs_file_aio_dio_write() because of the locking
      requirements, so there's only a couple of things it does for us.
      
      The first thing is that it does a page cache invalidation after the
      ->direct_IO callout. This can easily be added to the XFS code.
      
      The second thing it does is that if data was written, it updates the
      iov_iter structure to reflect the data written, and then does EOF
      size updates if necessary. For XFS, these EOF size updates are now
      not necessary, as we do them safely and race-free in IO completion
      context. That leaves just the iov_iter update, and that's also moved
      to the XFS code.
      
      Therefore we don't need to call generic_file_direct_write() and in
      doing so remove redundant buffered writeback and page cache
      invalidation calls from the DIO submission path. We also remove a
      racy EOF size update, and make the DIO submission code in XFS much
      easier to follow. Wins all round, really.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
    • xfs: direct IO EOF zeroing needs to drain AIO · 40c63fbc
      Dave Chinner committed
      When we are doing AIO DIO writes, the IOLOCK only provides an IO
      submission barrier. When we need to do EOF zeroing, we need to ensure
      that no other IO is in progress and all pending in-core EOF updates
      have been completed. This requires us to wait for all outstanding
      AIO DIO writes to the inode to complete and, if necessary, run their
      EOF updates.
      
      Once all the EOF updates are complete, we can then restart
      xfs_file_aio_write_checks() while holding the IOLOCK_EXCL, knowing
      that EOF is up to date and we have exclusive IO access to the file
      so we can run EOF block zeroing if we need to without interference.
      This gives EOF zeroing the same exclusivity against other IO as we
      provide truncate operations.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
    • xfs: DIO write completion size updates race · b9d59846
      Dave Chinner committed
      xfs_end_io_direct_write() can race with other IO completions when
      updating the in-core inode size. IO completion processing is not
      serialised for direct IO - it runs either under the IOLOCK_SHARED
      for non-AIO DIO, or without any IOLOCK held at all during AIO DIO
      completion. Hence the non-atomic test-and-set update of the in-core
      inode size is racy and can result in the in-core inode size going
      backwards if the race is hit just right.
      
      If the inode size goes backwards, this can trigger the EOF zeroing
      code to run incorrectly on the next IO, which then will zero data
      that has successfully been written to disk by a previous DIO.
      
      To fix this bug, we need to serialise the test/set updates of the
      in-core inode size. This first patch introduces locking around the
      relevant updates and checks in the DIO path. Because we now have an
      ioend in xfs_end_io_direct_write(), we know exactly when we are
      doing an IO that requires an in-core EOF update, and we know that
      they are not running in interrupt context. As such, we do not need to
      use irqsave() spinlock variants to protect against interrupts while
      the lock is held.
      
      Hence we can use an existing spinlock in the inode to do this
      serialisation and so not need to grow the struct xfs_inode just to
      work around this problem.
      
      This patch does not address the test/set EOF update in
      generic_file_direct_write() for various reasons - that will be done
      as a followup with separate explanation.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
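The serialised test/set update described in b9d59846 can be illustrated in userspace. This is only an analogue: a pthread mutex stands in for the existing inode spinlock, and the struct and function names are hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

/* Hypothetical stand-in for the in-core inode. */
struct sketch_inode {
    pthread_mutex_t lock;  /* stands in for the existing inode spinlock */
    int64_t         size;  /* in-core file size */
};

/*
 * Extend the in-core size only if this IO ends beyond it. The compare and
 * the store happen under one lock, so a completion for an earlier offset
 * cannot overwrite a larger size set by a concurrent completion.
 */
static void sketch_update_size(struct sketch_inode *ip, int64_t offset,
                               int64_t count)
{
    int64_t new_size = offset + count;

    assert(offset >= 0 && count >= 0);

    pthread_mutex_lock(&ip->lock);
    if (new_size > ip->size)
        ip->size = new_size;
    pthread_mutex_unlock(&ip->lock);
}
```

Without the lock, two completions could both read the old size, and the one carrying the smaller end offset could store last, which is the "size goes backwards" case the commit message describes.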
    • xfs: DIO writes within EOF don't need an ioend · a06c277a
      Dave Chinner committed
      DIO writes that lie entirely within EOF have nothing to do in IO
      completion. In this case, we don't need no steekin' ioend, and so we
      can avoid allocating an ioend until we have a mapping that spans
      EOF.
      
      This means that IO completion has two contexts - deferred completion
      to the dio workqueue that uses an ioend, and interrupt completion
      that does nothing because there is nothing that can be done in this
      context.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
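The decision described in a06c277a, whether a DIO write mapping needs an ioend at all, can be sketched as a predicate. The function name and parameters are hypothetical, and treating unwritten extents as always needing an ioend is an assumption drawn from the earlier commits in this series rather than from this message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * A mapping needs deferred completion work (and hence an ioend) if it
 * covers unwritten extents or extends past the current in-core EOF.
 * Writes that lie entirely within EOF need neither.
 */
static bool sketch_dio_needs_ioend(int64_t offset, int64_t size,
                                   int64_t isize, bool unwritten)
{
    assert(offset >= 0 && size >= 0 && isize >= 0);

    return unwritten || offset + size > isize;
}
```

A write ending exactly at EOF does not extend the file, so the sketch treats it as needing no ioend.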
    • xfs: handle DIO overwrite EOF update completion correctly · 6dfa1b67
      Dave Chinner committed
      Currently a DIO overwrite that extends the EOF (e.g. sub-block IO or
      write into allocated blocks beyond EOF) requires a transaction for
      the EOF update. This is done in IO completion context, but we aren't
      explicitly handling this situation properly and so it can run in
      interrupt context. Ensure that we defer IO that spans EOF correctly
      to the DIO completion workqueue, and now that we have an ioend in IO
      completion we can use the common ioend completion path to do all the
      work.
      
      Note: we do not preallocate the append transaction as we can have
      multiple mapping and allocation calls per direct IO. Hence
      preallocating can still leave us with nested transactions by
      attempting to map and allocate more blocks after we've preallocated
      an append transaction.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
    • xfs: DIO needs an ioend for writes · d5cc2e3f
      Dave Chinner committed
      Currently we can only tell DIO completion that an IO requires
      unwritten extent completion. This is done by a hacky non-null
      private pointer passed to IO completion, but the private pointer
      does not actually contain any information that is used.
      
      We also need to pass to IO completion the fact that the IO may be
      beyond EOF and so a size update transaction needs to be done. This
      is currently determined by checks in the io completion, but we need
      to determine if this is necessary at block mapping time as we need
      to defer the size update transactions to a completion workqueue,
      just like unwritten extent conversion.
      
      To do this, first we need to allocate and pass an ioend to IO
      completion. Add this for unwritten extent conversion; we'll do the
      EOF updates in the next commit.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
    • xfs: move DIO mapping size calculation · 1fdca9c2
      Dave Chinner committed
      The mapping size calculation is done last in __xfs_get_blocks(), but
      we are going to need the actual mapping size we will use to map the
      direct IO correctly in xfs_map_direct(). Factor out the calculation
      for code clarity, and move the call to be the first operation in
      mapping the extent to the returned buffer.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
    • xfs: factor DIO write mapping from get_blocks · a719370b
      Dave Chinner committed
      Clarify and separate the buffer mapping logic so that the direct IO mapping is
      not tangled up in propagating the extent status to the mapping buffer. This
      makes it easier to extend the direct IO mapping to use an ioend in future.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
    • VFS: normal filesystems (and lustre): d_inode() annotations · 2b0143b5
      David Howells committed
      that's the bulk of filesystem drivers dealing with inodes of their own
      Signed-off-by: David Howells <dhowells@redhat.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  3. 13 Apr, 2015 (6 commits)
  4. 12 Apr, 2015 (8 commits)
  5. 26 Mar, 2015 (1 commit)
  6. 25 Mar, 2015 (10 commits)