1. 07 11月, 2017 2 次提交
    • C
    • C
      xfs: use a b+tree for the in-core extent list · 6bdcf26a
      Christoph Hellwig 提交于
      Replace the current linear list and the indirection array for the in-core
      extent list with a b+tree to avoid the need for larger memory allocations
      for the indirection array when lots of extents are present.  The current
      extent list implementations leads to heavy pressure on the memory
      allocator when modifying files with a high extent count, and can lead
      to high latencies because of that.
      
      The replacement is a b+tree with a few quirks.  The leaf nodes directly
      store the extent record in two u64 values.  The encoding is a little bit
      different from the existing in-core extent records so that the start
      offset and length which are required for lookups can be retreived with
      simple mask operations.  The inner nodes store a 64-bit key containing
      the start offset in the first half of the node, and the pointers to the
      next lower level in the second half.  In either case we walk the node
      from the beginninig to the end and do a linear search, as that is more
      efficient for the low number of cache lines touched during a search
      (2 for the inner nodes, 4 for the leaf nodes) than a binary search.
      We store termination markers (zero length for the leaf nodes, an
      otherwise impossible high bit for the inner nodes) to terminate the key
      list / records instead of storing a count to use the available cache
      lines as efficiently as possible.
      
      One quirk of the algorithm is that while we normally split a node half and
      half like usual btree implementations we just spill over entries added at
      the very end of the list to a new node on its own.  This means we get a
      100% fill grade for the common cases of bulk insertion when reading an
      inode into memory, and when only sequentially appending to a file.  The
      downside is a slightly higher chance of splits on the first random
      insertions.
      
      Both insert and removal manually recurse into the lower levels, but
      the bulk deletion of the whole tree is still implemented as a recursive
      function call, although one limited by the overall depth and with very
      little stack usage in every iteration.
      
      For the first few extents we dynamically grow the list from a single
      extent to the next powers of two until we have a first full leaf block
      and that building the actual tree.
      
      The code started out based on the generic lib/btree.c code from Joern
      Engel based on earlier work from Peter Zijlstra, but has since been
      rewritten beyond recognition.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      6bdcf26a
  2. 02 11月, 2017 1 次提交
  3. 27 10月, 2017 1 次提交
  4. 26 9月, 2017 1 次提交
  5. 02 9月, 2017 3 次提交
  6. 05 8月, 2017 1 次提交
  7. 28 6月, 2017 1 次提交
  8. 20 6月, 2017 2 次提交
    • I
      sched/wait: Standardize 'struct wait_bit_queue' wait-queue entry field name · 21417136
      Ingo Molnar 提交于
      Rename 'struct wait_bit_queue::wait' to ::wq_entry, to more clearly
      name it as a wait-queue entry.
      
      Propagate it to a couple of usage sites where the wait-bit-queue internals
      are exposed.
      
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      21417136
    • D
      xfs: remove double-underscore integer types · c8ce540d
      Darrick J. Wong 提交于
      This is a purely mechanical patch that removes the private
      __{u,}int{8,16,32,64}_t typedefs in favor of using the system
      {u,}int{8,16,32,64}_t typedefs.  This is the sed script used to perform
      the transformation and fix the resulting whitespace and indentation
      errors:
      
      s/typedef\t__uint8_t/typedef __uint8_t\t/g
      s/typedef\t__uint/typedef __uint/g
      s/typedef\t__int\([0-9]*\)_t/typedef int\1_t\t/g
      s/__uint8_t\t/__uint8_t\t\t/g
      s/__uint/uint/g
      s/__int\([0-9]*\)_t\t/__int\1_t\t\t/g
      s/__int/int/g
      /^typedef.*int[0-9]*_t;$/d
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      c8ce540d
  9. 12 4月, 2017 1 次提交
  10. 04 4月, 2017 1 次提交
    • D
      xfs: rework the inline directory verifiers · 78420281
      Darrick J. Wong 提交于
      The inline directory verifiers should be called on the inode fork data,
      which means after iformat_local on the read side, and prior to
      ifork_flush on the write side.  This makes the fork verifier more
      consistent with the way buffer verifiers work -- i.e. they will operate
      on the memory buffer that the code will be reading and writing directly.
      
      Furthermore, revise the verifier function to return -EFSCORRUPTED so
      that we don't flood the logs with corruption messages and assert
      notices.  This has been a particular problem with xfs/348, which
      triggers the XFS_WANT_CORRUPTED_RETURN assertions, which halts the
      kernel when CONFIG_XFS_DEBUG=y.  Disk corruption isn't supposed to do
      that, at least not in a verifier.
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      78420281
  11. 29 3月, 2017 1 次提交
    • D
      xfs: rework the inline directory verifiers · 005c5db8
      Darrick J. Wong 提交于
      The inline directory verifiers should be called on the inode fork data,
      which means after iformat_local on the read side, and prior to
      ifork_flush on the write side.  This makes the fork verifier more
      consistent with the way buffer verifiers work -- i.e. they will operate
      on the memory buffer that the code will be reading and writing directly.
      
      Furthermore, revise the verifier function to return -EFSCORRUPTED so
      that we don't flood the logs with corruption messages and assert
      notices.  This has been a particular problem with xfs/348, which
      triggers the XFS_WANT_CORRUPTED_RETURN assertions, which halts the
      kernel when CONFIG_XFS_DEBUG=y.  Disk corruption isn't supposed to do
      that, at least not in a verifier.
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      ---
      v2: get the inode d_ops the proper way
      v3: describe the bug that this patch fixes; no code changes
      005c5db8
  12. 15 3月, 2017 1 次提交
  13. 08 3月, 2017 1 次提交
  14. 31 1月, 2017 1 次提交
  15. 25 1月, 2017 1 次提交
    • C
      xfs: use per-AG reservations for the finobt · 76d771b4
      Christoph Hellwig 提交于
      Currently we try to rely on the global reserved block pool for block
      allocations for the free inode btree, but I have customer reports
      (fairly complex workload, need to find an easier reproducer) where that
      is not enough as the AG where we free an inode that requires a new
      finobt block is entirely full.  This causes us to cancel a dirty
      transaction and thus a file system shutdown.
      
      I think the right way to guard against this is to treat the finot the same
      way as the refcount btree and have a per-AG reservations for the possible
      worst case size of it, and the patch below implements that.
      
      Note that this could increase mount times with large finobt trees.  In
      an ideal world we would have added a field for the number of finobt
      fields to the AGI, similar to what we did for the refcount blocks.
      We should do add it next time we rev the AGI or AGF format by adding
      new fields.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      76d771b4
  16. 05 12月, 2016 1 次提交
  17. 30 11月, 2016 1 次提交
  18. 06 10月, 2016 4 次提交
  19. 05 10月, 2016 1 次提交
    • D
      xfs: when replaying bmap operations, don't let unlinked inodes get reaped · 17c12bcd
      Darrick J. Wong 提交于
      Log recovery will iget an inode to replay BUI items and iput the inode
      when it's done.  Unfortunately, if the inode was unlinked, the iput
      will see that i_nlink == 0 and decide to truncate & free the inode,
      which prevents us from replaying subsequent BUIs.  We can't skip the
      BUIs because we have to replay all the redo items to ensure that
      atomic operations complete.
      
      Since unlinked inode recovery will reap the inode anyway, we can
      safely introduce a new inode flag to indicate that an inode is in this
      'unlinked recovery' state and should not be auto-reaped in the
      drop_inode path.
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      17c12bcd
  20. 28 9月, 2016 1 次提交
  21. 22 9月, 2016 1 次提交
  22. 03 8月, 2016 3 次提交
  23. 01 6月, 2016 1 次提交
  24. 18 5月, 2016 6 次提交
  25. 06 4月, 2016 2 次提交