1. 28 Nov, 2017 · 1 commit
    • xfs: always free inline data before resetting inode fork during ifree · 98c4f78d
      Committed by Darrick J. Wong
      In xfs_ifree, we reset the data/attr forks to extents format without
      bothering to free any inline data buffer that might still be around
      after all the blocks have been truncated off the file.  Prior to commit
      43518812 ("xfs: remove support for inlining data/extents into the
      inode fork") nobody noticed because the leftover inline data after
      truncation was small enough to fit inside the inline buffer inside the
      fork itself.
      
      However, now that we've removed the inline buffer, we /always/ have to
      free the inline data buffer or else we leak it like crazy.  This leak
      was found by turning on kmemleak for generic/001 or generic/388.
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
  2. 17 Nov, 2017 · 1 commit
    • xfs: fix forgotten rcu read unlock when skipping inode reclaim · 962cc1ad
      Committed by Darrick J. Wong
      In commit f2e9ad21 ("xfs: check for race with xfs_reclaim_inode"), we
      skip an inode if we're racing with freeing the inode via
      xfs_reclaim_inode, but we forgot to release the rcu read lock when
      dumping the inode, with the result that we exit to userspace with a lock
      held.  Don't do that; generic/320 with a 1k block size fails this
      very occasionally.
      
      ================================================
      WARNING: lock held when returning to user space!
      4.14.0-rc6-djwong #4 Tainted: G        W
      ------------------------------------------------
      rm/30466 is leaving the kernel with locks still held!
      1 lock held by rm/30466:
       #0:  (rcu_read_lock){....}, at: [<ffffffffa01364d3>] xfs_ifree_cluster.isra.17+0x2c3/0x6f0 [xfs]
      ------------[ cut here ]------------
      WARNING: CPU: 1 PID: 30466 at kernel/rcu/tree_plugin.h:329 rcu_note_context_switch+0x71/0x700
      Modules linked in: deadline_iosched dm_snapshot dm_bufio ext4 mbcache jbd2 dm_flakey xfs libcrc32c dax_pmem device_dax nd_pmem sch_fq_codel af_packet [last unloaded: scsi_debug]
      CPU: 1 PID: 30466 Comm: rm Tainted: G        W       4.14.0-rc6-djwong #4
      Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.10.2-1ubuntu1djwong0 04/01/2014
      task: ffff880037680000 task.stack: ffffc90001064000
      RIP: 0010:rcu_note_context_switch+0x71/0x700
      RSP: 0000:ffffc90001067e50 EFLAGS: 00010002
      RAX: 0000000000000001 RBX: ffff880037680000 RCX: ffff88003e73d200
      RDX: 0000000000000002 RSI: ffffffff819e53e9 RDI: ffffffff819f4375
      RBP: 0000000000000000 R08: 0000000000000000 R09: ffff880062c900d0
      R10: 0000000000000000 R11: 0000000000000000 R12: ffff880037680000
      R13: 0000000000000000 R14: ffffc90001067eb8 R15: ffff880037680690
      FS:  00007fa3b8ce8700(0000) GS:ffff88003ec00000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00007f69bf77c000 CR3: 000000002450a000 CR4: 00000000000006e0
      Call Trace:
       __schedule+0xb8/0xb10
       schedule+0x40/0x90
       exit_to_usermode_loop+0x6b/0xa0
       prepare_exit_to_usermode+0x7a/0x90
       retint_user+0x8/0x20
      RIP: 0033:0x7fa3b87fda87
      RSP: 002b:00007ffe41206568 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff02
      RAX: 0000000000000000 RBX: 00000000010e88c0 RCX: 00007fa3b87fda87
      RDX: 0000000000000000 RSI: 00000000010e89c8 RDI: 0000000000000005
      RBP: 0000000000000000 R08: 0000000000000003 R09: 0000000000000000
      R10: 000000000000015e R11: 0000000000000246 R12: 00000000010c8060
      R13: 00007ffe41206690 R14: 0000000000000000 R15: 0000000000000000
      ---[ end trace e88f83bf0cfbd07d ]---
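
The class of bug described above, a skip path that leaves a read-side critical section without releasing it, can be sketched in Python. This is a toy model, not the kernel code: `ReadLock`, `free_cluster`, the `buggy` switch, and the inode dictionaries are all hypothetical names introduced for illustration.

```python
class ReadLock:
    """Stand-in for an RCU read-side critical section; tracks nesting depth."""

    def __init__(self):
        self.depth = 0

    def lock(self):
        self.depth += 1

    def unlock(self):
        assert self.depth > 0, "unlock without matching lock"
        self.depth -= 1


def free_cluster(inodes, rcu, buggy=False):
    """Walk a list of inodes, skipping any that are being reclaimed.

    With buggy=True the skip path forgets to unlock, which models the
    imbalance reported in the warning above.
    """
    freed = []
    for ino in inodes:
        rcu.lock()
        if ino["being_reclaimed"]:
            if not buggy:
                rcu.unlock()  # the fix: drop the lock before skipping
            continue
        freed.append(ino["id"])
        rcu.unlock()
    return freed


inodes = [
    {"id": 1, "being_reclaimed": False},
    {"id": 2, "being_reclaimed": True},
    {"id": 3, "being_reclaimed": False},
]
```

Running `free_cluster(inodes, ReadLock())` leaves the lock depth at zero, while the `buggy=True` variant returns with one level still held.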
      
      Fixes: f2e9ad21
      Cc: Omar Sandoval <osandov@fb.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Omar Sandoval <osandov@fb.com>
  3. 07 Nov, 2017 · 2 commits
    • xfs: use a b+tree for the in-core extent list · 6bdcf26a
      Committed by Christoph Hellwig
      Replace the current linear list and the indirection array for the in-core
      extent list with a b+tree to avoid the need for larger memory allocations
      for the indirection array when lots of extents are present.  The current
      extent list implementation leads to heavy pressure on the memory
      allocator when modifying files with a high extent count, and can lead
      to high latencies because of that.
      
      The replacement is a b+tree with a few quirks.  The leaf nodes directly
      store the extent record in two u64 values.  The encoding is a little bit
      different from the existing in-core extent records so that the start
      offset and length which are required for lookups can be retrieved with
      simple mask operations.  The inner nodes store a 64-bit key containing
      the start offset in the first half of the node, and the pointers to the
      next lower level in the second half.  In either case we walk the node
      from the beginning to the end and do a linear search, as that is more
      efficient for the low number of cache lines touched during a search
      (2 for the inner nodes, 4 for the leaf nodes) than a binary search.
      We store termination markers (zero length for the leaf nodes, an
      otherwise impossible high bit for the inner nodes) to terminate the key
      list / records instead of storing a count to use the available cache
      lines as efficiently as possible.
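
The leaf layout described above can be sketched in Python. The bit widths, the way the start block is split across the two words, `RECS_PER_LEAF`, and the function names are all assumptions chosen for illustration, not the layout used by the patch. The sketch only shows the two ideas from the text: start offset and length come out with a single mask each, and a zero length terminates the linear scan.

```python
# Hypothetical bit layout: 54-bit start offset in the low bits of `lo`,
# 21-bit length in the low bits of `hi`, start block split across the rest.
STARTOFF_BITS = 54
LENGTH_BITS = 21
STARTOFF_MASK = (1 << STARTOFF_BITS) - 1
LENGTH_MASK = (1 << LENGTH_BITS) - 1
LO_SPARE_BITS = 64 - STARTOFF_BITS  # 10 bits of start block live in `lo`

RECS_PER_LEAF = 15  # hypothetical leaf capacity
EMPTY = (0, 0)      # zero length marks the end of the used records


def pack(startoff, length, startblock):
    """Pack one extent into two 64-bit words (lo, hi)."""
    assert 0 <= startoff <= STARTOFF_MASK
    assert 0 < length <= LENGTH_MASK
    assert 0 <= startblock < (1 << 53)
    lo = startoff | ((startblock & ((1 << LO_SPARE_BITS) - 1)) << STARTOFF_BITS)
    hi = length | ((startblock >> LO_SPARE_BITS) << LENGTH_BITS)
    return lo, hi


def unpack(lo, hi):
    """Recover (startoff, length, startblock) from the two words."""
    startblock = (lo >> STARTOFF_BITS) | ((hi >> LENGTH_BITS) << LO_SPARE_BITS)
    return lo & STARTOFF_MASK, hi & LENGTH_MASK, startblock


def make_leaf(extents):
    """Build a fixed-size leaf; unused slots are filled with terminators."""
    recs = [pack(*e) for e in extents]
    assert len(recs) <= RECS_PER_LEAF
    return recs + [EMPTY] * (RECS_PER_LEAF - len(recs))


def leaf_lookup(leaf, offset):
    """Linear scan for the extent covering `offset`; stop at the terminator."""
    for lo, hi in leaf:
        length = hi & LENGTH_MASK          # simple mask, no shifting
        if length == 0:
            break                          # termination marker reached
        startoff = lo & STARTOFF_MASK      # simple mask, no shifting
        if startoff <= offset < startoff + length:
            return unpack(lo, hi)
    return None
```

For example, with a leaf built from `[(0, 8, 100), (8, 4, 5000), (32, 16, 123456789)]`, looking up offset 10 returns the second extent, and looking up offset 20 returns `None` because it falls in a hole.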
      
      One quirk of the algorithm is that while we normally split a node half and
      half, like usual btree implementations, we just spill over entries added at
      the very end of the list to a new node of their own.  This means we get a
      100% fill grade for the common cases of bulk insertion when reading an
      inode into memory, and when only sequentially appending to a file.  The
      downside is a slightly higher chance of splits on the first random
      insertions.
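
The split quirk can be illustrated with a flat list of leaves in Python. This is a simplified sketch under stated assumptions: there are no inner nodes, `LEAF_CAPACITY` and `insert` are hypothetical, and "the very end of the list" is read here as appending past the last entry of the last leaf.

```python
import bisect

LEAF_CAPACITY = 4  # hypothetical; kept small so the effect is visible


def insert(leaves, key):
    """Insert `key` into `leaves`, a list of sorted lists kept in key order."""
    # Pick the last leaf whose first key is <= key (or the first leaf).
    idx = 0
    for i, leaf in enumerate(leaves):
        if leaf and leaf[0] <= key:
            idx = i
    leaf = leaves[idx]
    pos = bisect.bisect_left(leaf, key)

    if len(leaf) < LEAF_CAPACITY:
        leaf.insert(pos, key)
        return

    if idx == len(leaves) - 1 and pos == len(leaf):
        # Appending past the end of a full last leaf: leave it 100% full
        # and put the new entry in a leaf of its own.
        leaves.append([key])
        return

    # Any other insertion into a full leaf: conventional half/half split.
    leaf.insert(pos, key)
    mid = len(leaf) // 2
    leaves[idx:idx + 1] = [leaf[:mid], leaf[mid:]]
```

Appending the keys 0, 2, 4, ..., 22 in order to `[[]]` yields three completely full leaves. A later random insert of 3 lands in the full first leaf and triggers an ordinary split, which is the "slightly higher chance of splits" the text mentions.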
      
      Both insert and removal manually recurse into the lower levels, but
      the bulk deletion of the whole tree is still implemented as a recursive
      function call, although one limited by the overall depth and with very
      little stack usage in every iteration.
      
      For the first few extents we dynamically grow the list from a single
      extent to the next powers of two until we have a first full leaf block,
      and then start building the actual tree.
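
The growth policy for small extent lists can be sketched as follows. The leaf capacity of 15 records and the function names are assumptions for illustration; the sketch only shows capacity doubling from one record until it is capped at a full leaf.

```python
RECS_PER_LEAF = 15  # hypothetical number of records in a full leaf


def next_capacity(current):
    """Return the record capacity after one growth step."""
    if current == 0:
        return 1
    # Double, but never grow beyond one full leaf block.
    return min(current * 2, RECS_PER_LEAF)


def growth_steps():
    """List the capacities passed through before the first leaf is full."""
    caps = []
    cap = 0
    while cap < RECS_PER_LEAF:
        cap = next_capacity(cap)
        caps.append(cap)
    return caps
```

Under these assumptions the allocation passes through capacities of 1, 2, 4 and 8 records before reaching the full leaf, after which further growth would happen by adding tree nodes instead.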
      
      The code started out based on the generic lib/btree.c code from Joern
      Engel based on earlier work from Peter Zijlstra, but has since been
      rewritten beyond recognition.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
  4. 02 Nov, 2017 · 1 commit
  5. 27 Oct, 2017 · 1 commit
  6. 26 Sep, 2017 · 1 commit
  7. 02 Sep, 2017 · 3 commits
  8. 05 Aug, 2017 · 1 commit
  9. 28 Jun, 2017 · 1 commit
  10. 20 Jun, 2017 · 2 commits
    • sched/wait: Standardize 'struct wait_bit_queue' wait-queue entry field name · 21417136
      Committed by Ingo Molnar
      Rename 'struct wait_bit_queue::wait' to ::wq_entry, to more clearly
      name it as a wait-queue entry.
      
      Propagate it to a couple of usage sites where the wait-bit-queue internals
      are exposed.
      
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • xfs: remove double-underscore integer types · c8ce540d
      Committed by Darrick J. Wong
      This is a purely mechanical patch that removes the private
      __{u,}int{8,16,32,64}_t typedefs in favor of using the system
      {u,}int{8,16,32,64}_t typedefs.  This is the sed script used to perform
      the transformation and fix the resulting whitespace and indentation
      errors:
      
      s/typedef\t__uint8_t/typedef __uint8_t\t/g
      s/typedef\t__uint/typedef __uint/g
      s/typedef\t__int\([0-9]*\)_t/typedef int\1_t\t/g
      s/__uint8_t\t/__uint8_t\t\t/g
      s/__uint/uint/g
      s/__int\([0-9]*\)_t\t/__int\1_t\t\t/g
      s/__int/int/g
      /^typedef.*int[0-9]*_t;$/d
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
  11. 12 Apr, 2017 · 1 commit
  12. 04 Apr, 2017 · 1 commit
    • xfs: rework the inline directory verifiers · 78420281
      Committed by Darrick J. Wong
      The inline directory verifiers should be called on the inode fork data,
      which means after iformat_local on the read side, and prior to
      ifork_flush on the write side.  This makes the fork verifier more
      consistent with the way buffer verifiers work -- i.e. they will operate
      on the memory buffer that the code will be reading and writing directly.
      
      Furthermore, revise the verifier function to return -EFSCORRUPTED so
      that we don't flood the logs with corruption messages and assert
      notices.  This has been a particular problem with xfs/348, which
      triggers the XFS_WANT_CORRUPTED_RETURN assertions, which halt the
      kernel when CONFIG_XFS_DEBUG=y.  Disk corruption isn't supposed to do
      that, at least not in a verifier.
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
  13. 29 Mar, 2017 · 1 commit
    • xfs: rework the inline directory verifiers · 005c5db8
      Committed by Darrick J. Wong
      The inline directory verifiers should be called on the inode fork data,
      which means after iformat_local on the read side, and prior to
      ifork_flush on the write side.  This makes the fork verifier more
      consistent with the way buffer verifiers work -- i.e. they will operate
      on the memory buffer that the code will be reading and writing directly.
      
      Furthermore, revise the verifier function to return -EFSCORRUPTED so
      that we don't flood the logs with corruption messages and assert
      notices.  This has been a particular problem with xfs/348, which
      triggers the XFS_WANT_CORRUPTED_RETURN assertions, which halt the
      kernel when CONFIG_XFS_DEBUG=y.  Disk corruption isn't supposed to do
      that, at least not in a verifier.
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      ---
      v2: get the inode d_ops the proper way
      v3: describe the bug that this patch fixes; no code changes
  14. 15 Mar, 2017 · 1 commit
  15. 08 Mar, 2017 · 1 commit
  16. 31 Jan, 2017 · 1 commit
  17. 25 Jan, 2017 · 1 commit
    • xfs: use per-AG reservations for the finobt · 76d771b4
      Committed by Christoph Hellwig
      Currently we try to rely on the global reserved block pool for block
      allocations for the free inode btree, but I have customer reports
      (fairly complex workload, need to find an easier reproducer) where that
      is not enough as the AG where we free an inode that requires a new
      finobt block is entirely full.  This causes us to cancel a dirty
      transaction and thus a file system shutdown.
      
      I think the right way to guard against this is to treat the finobt the same
      way as the refcount btree and have per-AG reservations for the possible
      worst case size of it, and the patch below implements that.
      
      Note that this could increase mount times with large finobt trees.  In
      an ideal world we would have added a field for the number of finobt
      blocks to the AGI, similar to what we did for the refcount blocks.
      We should add it next time we rev the AGI or AGF format by adding
      new fields.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
  18. 05 Dec, 2016 · 1 commit
  19. 30 Nov, 2016 · 1 commit
  20. 06 Oct, 2016 · 4 commits
  21. 05 Oct, 2016 · 1 commit
    • xfs: when replaying bmap operations, don't let unlinked inodes get reaped · 17c12bcd
      Committed by Darrick J. Wong
      Log recovery will iget an inode to replay BUI items and iput the inode
      when it's done.  Unfortunately, if the inode was unlinked, the iput
      will see that i_nlink == 0 and decide to truncate & free the inode,
      which prevents us from replaying subsequent BUIs.  We can't skip the
      BUIs because we have to replay all the redo items to ensure that
      atomic operations complete.
      
      Since unlinked inode recovery will reap the inode anyway, we can
      safely introduce a new inode flag to indicate that an inode is in this
      'unlinked recovery' state and should not be auto-reaped in the
      drop_inode path.
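
The decision described above can be sketched as a small Python predicate. The flag name `I_RECOVERY`, its bit value, and the function `should_reap_on_iput` are hypothetical and stand in for whatever the patch actually introduces; the point is only that the recovery flag overrides the usual "link count is zero, so reap" rule.

```python
I_RECOVERY = 1 << 0  # hypothetical flag: inode is held for unlinked recovery


def should_reap_on_iput(nlink, flags):
    """Decide whether dropping the last reference should free the inode."""
    if flags & I_RECOVERY:
        # Log recovery still needs this inode for later redo items;
        # unlinked inode recovery will reap it afterwards.
        return False
    return nlink == 0
```

With the flag set, an unlinked inode survives the `iput` between replayed items; without it, the same inode is truncated and freed as soon as its link count is seen to be zero.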
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
  22. 28 Sep, 2016 · 1 commit
  23. 22 Sep, 2016 · 1 commit
  24. 03 Aug, 2016 · 3 commits
  25. 01 Jun, 2016 · 1 commit
  26. 18 May, 2016 · 6 commits