1. 13 January 2018 (2 commits)
  2. 09 January 2018 (2 commits)
  3. 07 November 2017 (2 commits)
    • xfs: use a b+tree for the in-core extent list · 6bdcf26a
      Christoph Hellwig committed
      Replace the current linear list and the indirection array for the in-core
      extent list with a b+tree, avoiding the need for ever larger memory
      allocations for the indirection array when lots of extents are present.
      The current extent list implementation leads to heavy pressure on the
      memory allocator when modifying files with a high extent count, and can
      cause high latencies because of that.
      
      The replacement is a b+tree with a few quirks.  The leaf nodes directly
      store the extent record in two u64 values.  The encoding is a little bit
      different from the existing in-core extent records, so that the start
      offset and length required for lookups can be retrieved with simple mask
      operations.  The inner nodes store a 64-bit key containing the start
      offset in the first half of the node, and the pointers to the next lower
      level in the second half.  In either case we walk the node from the
      beginning to the end and do a linear search, as that is more efficient
      for the low number of cache lines touched during a search (2 for the
      inner nodes, 4 for the leaf nodes) than a binary search.  We store
      termination markers (a zero length for the leaf nodes, an otherwise
      impossible high bit for the inner nodes) to terminate the key list /
      record list instead of storing a count, to use the available cache lines
      as efficiently as possible.
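      
      As a rough sketch of that layout (the constants, field names, and the
      leaf chaining here are assumptions, sized so a node fills a small
      256-byte allocation; the upstream definitions may differ):
      
      #include <linux/types.h>
      
      #define NODE_SIZE	256
      #define KEYS_PER_NODE	(NODE_SIZE / (sizeof(uint64_t) + sizeof(void *)))
      
      struct xfs_iext_rec {
      	uint64_t	lo;	/* start offset (+ state bits), mask-extractable */
      	uint64_t	hi;	/* start block + length, mask-extractable */
      };
      
      #define RECS_PER_LEAF	((NODE_SIZE - 2 * sizeof(void *)) / \
      			 sizeof(struct xfs_iext_rec))
      
      #define XFS_IEXT_KEY_INVALID	(1ULL << 63)	/* inner-node terminator */
      
      struct xfs_iext_node {
      	uint64_t	keys[KEYS_PER_NODE];	/* start offsets, first half */
      	void		*ptrs[KEYS_PER_NODE];	/* next lower level, second half */
      };
      
      struct xfs_iext_leaf {
      	struct xfs_iext_rec	recs[RECS_PER_LEAF];	/* zero length terminates */
      	struct xfs_iext_leaf	*prev;	/* assumed: leaves chained for iteration */
      	struct xfs_iext_leaf	*next;
      };
      
      /* inner-node descent: linear scan until the terminator or a larger key */
      static int xfs_iext_node_pos(struct xfs_iext_node *node, uint64_t offset)
      {
      	int i;
      
      	for (i = 1; i < KEYS_PER_NODE; i++) {
      		if (node->keys[i] == XFS_IEXT_KEY_INVALID ||
      		    node->keys[i] > offset)
      			break;
      	}
      	return i - 1;	/* child whose key range covers 'offset' */
      }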
      
      One quirk of the algorithm is that while we normally split a node in
      half, as usual b+tree implementations do, entries added at the very end
      of the list are simply spilled over to a new node of their own.  This
      means we get a 100% fill grade for the common cases of bulk insertion
      when reading an inode into memory, and when only sequentially appending
      to a file.  The downside is a slightly higher chance of splits on the
      first random insertions.
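      
      A minimal sketch of that split policy (the helper name, allocation flags,
      and locking are assumptions; NODE_SIZE and RECS_PER_LEAF as in the layout
      sketch above):
      
      #include <linux/slab.h>
      #include <linux/string.h>
      
      /* name assumed: split 'leaf' and adjust the insert position *pos */
      static struct xfs_iext_leaf *
      xfs_iext_split_leaf(struct xfs_iext_leaf *leaf, int *pos)
      {
      	struct xfs_iext_leaf *new = kzalloc(NODE_SIZE, GFP_NOFS);
      	int nr_keep = RECS_PER_LEAF;
      
      	if (!new)
      		return NULL;
      
      	if (*pos == RECS_PER_LEAF) {
      		/* sequential append: the old leaf stays 100% full and
      		 * only the new record spills into the fresh leaf */
      		*pos = 0;
      	} else {
      		/* random insert: fall back to the usual half/half split */
      		nr_keep = RECS_PER_LEAF / 2;
      		memcpy(new->recs, leaf->recs + nr_keep,
      		       (RECS_PER_LEAF - nr_keep) * sizeof(new->recs[0]));
      		memset(leaf->recs + nr_keep, 0,
      		       (RECS_PER_LEAF - nr_keep) * sizeof(new->recs[0]));
      		if (*pos >= nr_keep)
      			*pos -= nr_keep;	/* insert lands in the new leaf */
      	}
      
      	new->prev = leaf;
      	new->next = leaf->next;
      	leaf->next = new;
      	return new;
      }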
      
      Both insertion and removal manually recurse into the lower levels, but
      bulk deletion of the whole tree is still implemented as a recursive
      function call, albeit one limited by the overall depth and with very
      little stack usage in each iteration.
      
      For the first few extents we dynamically grow the list from a single
      extent to the next power of two until we have a first full leaf block,
      and only then build the actual tree.
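      
      A sketch of that growth path (the fork field names and both helpers are
      assumptions):
      
      /* double the bare record array until a leaf's worth of records exists */
      static int xfs_iext_make_room(struct xfs_ifork *ifp)
      {
      	int alloced;
      	void *new_root;
      
      	if (ifp->if_count < ifp->if_alloced)
      		return 0;			/* still room in the array */
      
      	if (ifp->if_alloced >= RECS_PER_LEAF)
      		return xfs_iext_build_tree(ifp);	/* hypothetical: promote */
      
      	alloced = ifp->if_alloced ? ifp->if_alloced * 2 : 1;
      	new_root = krealloc(ifp->if_root,
      			alloced * sizeof(struct xfs_iext_rec), GFP_NOFS);
      	if (!new_root)
      		return -ENOMEM;
      
      	ifp->if_root = new_root;
      	ifp->if_alloced = alloced;
      	return 0;
      }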
      
      The code started out based on the generic lib/btree.c code from Joern
      Engel (itself based on earlier work from Peter Zijlstra), but has since
      been rewritten beyond recognition.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
    • xfs: introduce the xfs_iext_cursor abstraction · b2b1712a
      Christoph Hellwig committed
      Add a new xfs_iext_cursor structure to hide the direct extent map
      index manipulations.  In addition to the existing lookup/get/insert/
      remove and update routines, new primitives are provided to get the
      first and last extent cursor, as well as to move up and down by one
      extent.  Also new are convenience helpers to increment/decrement the
      cursor and retrieve the new extent, to peek at the previous/next
      extent without updating the cursor, and last but not least a macro to
      iterate over all extents in a fork.
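      
      A sketch of the resulting interface, modeled directly on the description
      above (the cursor is just a leaf pointer plus a position, and the
      iteration macro is built from the first/get/next primitives):
      
      struct xfs_iext_cursor {
      	struct xfs_iext_leaf	*leaf;	/* current leaf node */
      	int			pos;	/* record index within that leaf */
      };
      
      /* visit every extent in a fork; 'got' receives each decoded extent */
      #define for_each_xfs_iext(ifp, cur, got)		\
      	for (xfs_iext_first((ifp), (cur));		\
      	     xfs_iext_get_extent((ifp), (cur), (got));	\
      	     xfs_iext_next((ifp), (cur)))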
      
      [darrick: rename for_each_iext to for_each_xfs_iext]
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
  4. 03 November 2017 (1 commit)
  5. 27 October 2017 (2 commits)
  6. 02 September 2017 (2 commits)
  7. 23 August 2017 (1 commit)
  8. 20 June 2017 (1 commit)
    • xfs: remove double-underscore integer types · c8ce540d
      Darrick J. Wong committed
      This is a purely mechanical patch that removes the private
      __{u,}int{8,16,32,64}_t typedefs in favor of using the system
      {u,}int{8,16,32,64}_t typedefs.  This is the sed script used to perform
      the transformation and fix the resulting whitespace and indentation
      errors:
      
      s/typedef\t__uint8_t/typedef __uint8_t\t/g
      s/typedef\t__uint/typedef __uint/g
      s/typedef\t__int\([0-9]*\)_t/typedef int\1_t\t/g
      s/__uint8_t\t/__uint8_t\t\t/g
      s/__uint/uint/g
      s/__int\([0-9]*\)_t\t/__int\1_t\t\t/g
      s/__int/int/g
      /^typedef.*int[0-9]*_t;$/d
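      
      As a representative example of its effect (whitespace illustrative), the
      script turns a typedef like this one from fs/xfs:
      
      -typedef	__uint32_t	xfs_agnumber_t;	/* AG number */
      +typedef	uint32_t	xfs_agnumber_t;	/* AG number */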
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
  9. 19 June 2017 (2 commits)
    • xfs: remove lsn relevant fields from xfs_trans structure and its users · f990fc5a
      Shan Hai committed
      The t_lsn field is not used anymore, and in the current code t_commit_lsn
      serves only as temporary storage for the checkpoint sequence number.
      
      Moreover, the start/commit LSNs are tracked as a transaction group tag in
      the xfs_cil_ctx rather than per transaction, so remove both fields from
      the xfs_trans structure and from their users to match that design.
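      
      In diff form, the removal amounts to roughly this (context abbreviated):
      
       struct xfs_trans {
       	...
      -	xfs_lsn_t		t_lsn;		/* unused */
      -	xfs_lsn_t		t_commit_lsn;	/* only tmp checkpoint
      -						 * sequence storage */
       	...
       };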
      Signed-off-by: Shan Hai <shan.hai@oracle.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
    • xfs: push buffer of flush locked dquot to avoid quotacheck deadlock · 7912e7fe
      Brian Foster committed
      Reclaim during quotacheck can lead to deadlocks on the dquot flush
      lock:
      
       - Quotacheck populates a local delwri queue with the physical dquot
         buffers.
       - Quotacheck performs the xfs_qm_dqusage_adjust() bulkstat and
         dirties all of the dquots.
       - Reclaim kicks in and attempts to flush a dquot whose buffer is
         already queued on the quotacheck queue. The flush succeeds, but
         queueing to the reclaim delwri queue fails as the backing buffer is
         already queued. The flush unlock is now deferred to I/O completion
         of the buffer from the quotacheck queue.
       - The dqadjust bulkstat continues and dirties the recently flushed
         dquot once again.
       - Quotacheck proceeds to the xfs_qm_flush_one() walk which requires
         the flush lock to update the backing buffers with the in-core
         recalculated values. It deadlocks on the redirtied dquot as the
         flush lock was already acquired by reclaim, but the buffer resides
         on the local delwri queue which isn't submitted until the end of
         quotacheck.
      
      This is reproduced by running quotacheck on a filesystem with a
      couple million inodes in low memory (512MB-1GB) situations. This is
      a regression as of commit 43ff2122 ("xfs: on-stack delayed write
      buffer lists"), which removed a trylock and buffer I/O submission
      from the quotacheck dquot flush sequence.
      
      Quotacheck first resets and collects the physical dquot buffers in a
      delwri queue. Then, it traverses the filesystem inodes via bulkstat,
      updates the in-core dquots, flushes the corrected dquots to the
      backing buffers and finally submits the delwri queue for I/O. Since
      the backing buffers are queued across the entire quotacheck
      operation, dquot reclaim cannot possibly complete a dquot flush
      before quotacheck completes.
      
      Therefore, quotacheck must submit the buffer for I/O in order to
      cycle the flush lock and flush the dirty in-core dquot to the
      buffer. Add a delwri queue buffer push mechanism to submit an
      individual buffer for I/O without losing the delwri queue status and
      use it from quotacheck to avoid the deadlock. This restores quotacheck
      behavior to what it was before the regression was introduced.
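      
      A sketch of the quotacheck-side use (the wrapper name and the buffer
      lookup helper are assumptions; xfs_buf_delwri_pushbuf() is the push
      mechanism this patch adds):
      
      /*
       * If the flush lock is held because reclaim already flushed this dquot,
       * push just its backing buffer out of our delwri queue; I/O completion
       * then cycles the flush lock without waiting for the final submission.
       */
      static int xfs_qm_dqflush_contended(struct xfs_dquot *dqp,
      		struct list_head *buffer_list)
      {
      	struct xfs_buf	*bp;
      
      	if (xfs_dqflock_nowait(dqp))
      		return 0;		/* uncontended: flush as usual */
      
      	bp = xfs_qm_dquot_to_buf(dqp);	/* hypothetical buffer lookup */
      	if (!bp)
      		return -EINVAL;
      
      	/* submit this one buffer without losing its delwri queue status */
      	xfs_buf_delwri_pushbuf(bp, buffer_list);
      
      	xfs_dqflock(dqp);		/* now only waits for that single I/O */
      	return 0;
      }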
      Reported-by: Martin Svec <martin.svec@zoner.cz>
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
  10. 26 April 2017 (3 commits)
  11. 04 April 2017 (1 commit)
  12. 25 February 2017 (1 commit)
    • mm,fs,dax: change ->pmd_fault to ->huge_fault · a2d58167
      Dave Jiang committed
      Patch series "1G transparent hugepage support for device dax", v2.
      
      The following series implements support for 1G transparent hugepages on
      x86 for device dax.  The bulk of the code was written by Matthew Wilcox
      a while back to support transparent 1G hugepages for fs DAX.  I have
      forward ported the relevant bits to 4.10-rc.  The current submission
      contains only the code necessary to support device DAX.
      
      Comments from Dan Williams: The motivation and intended users of this
      functionality mirror those of 1GB page support in hugetlbfs.  Given the
      expected capacities of persistent memory devices, an in-memory database
      may want to reduce TLB pressure beyond what it can already achieve with
      2MB mappings of a device-dax file.  We have customer feedback to that
      effect, as Willy mentioned in his previous version of these patches [1].
      
      [1]: https://lkml.org/lkml/2016/1/31/52
      
      Comments from Nilesh @ Oracle:
      
      There are applications which have a process model; if you assume 10,000
      processes attempting to mmap all of the 6 TB of memory available on a
      server, we are looking at the following:
      
      processes          : 10,000
      memory             : 6 TB
      pte @ 4K page size : 6 TB / 4 KB * 8 bytes = 12 GB per process,
                           * 10,000 processes = 120,000 GB
      pmd @ 2M page size : 120,000 GB / 512 = ~240 GB
      pud @ 1G page size : 240 GB / 512 = ~480 MB
      
      As you can see, with 2M pages this system would use up an exorbitant
      amount of DRAM just to hold the page tables; the 1G pages finally bring
      it down to a reasonable level.  Memory sizes will keep increasing, so
      this number will keep increasing as well.
      
      An argument can be made to convert the applications from a process model
      to a thread model, but in the real world that may not always be
      practical.  Hopefully this helps explain the use case where this is
      valuable.
      
      This patch (of 3):
      
      In preparation for adding the ability to handle PUD pages, convert
      vm_operations_struct.pmd_fault to vm_operations_struct.huge_fault.  The
      vm_fault structure is extended to include a union of the different page
      table pointers that may be needed, and three flag bits are reserved to
      indicate which type of pointer is in the union.
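      
      A sketch of the resulting interface (the flag values and surrounding
      fields are assumptions based on the description):
      
      struct vm_fault {
      	/* ... */
      	union {			/* which member is valid is encoded in flags */
      		pmd_t	*pmd;	/* pmd entry matching the faulting address */
      		pud_t	*pud;	/* pud entry matching the faulting address */
      	};
      	/* ... */
      };
      
      /* three reserved flag bits select the page table level (values assumed) */
      #define FAULT_FLAG_SIZE_PTE	0x0000
      #define FAULT_FLAG_SIZE_PMD	0x1000
      #define FAULT_FLAG_SIZE_PUD	0x2000
      
      struct vm_operations_struct {
      	/* ... */
      	int (*huge_fault)(struct vm_fault *vmf);	/* replaces ->pmd_fault */
      	/* ... */
      };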
      
      [ross.zwisler@linux.intel.com: remove unused function ext4_dax_huge_fault()]
        Link: http://lkml.kernel.org/r/1485813172-7284-1-git-send-email-ross.zwisler@linux.intel.com
      [dave.jiang@intel.com: clear PMD or PUD size flags when in fall through path]
        Link: http://lkml.kernel.org/r/148589842696.5820.16078080610311444794.stgit@djiang5-desk3.ch.intel.com
      Link: http://lkml.kernel.org/r/148545058784.17912.6353162518188733642.stgit@djiang5-desk3.ch.intel.com
      Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
      Signed-off-by: Dave Jiang <dave.jiang@intel.com>
      Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Jan Kara <jack@suse.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Nilesh Choudhury <nilesh.choudhury@oracle.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  13. 07 February 2017 (2 commits)
  14. 03 February 2017 (1 commit)
    • xfs: mark speculative prealloc CoW fork extents unwritten · 5eda4300
      Darrick J. Wong committed
      Christoph Hellwig pointed out that there's a potentially nasty race when
      performing simultaneous nearby directio cow writes:
      
      "Thread 1 writes a range from B to c
      
      "                    B --------- C
                                 p
      
      "a little later thread 2 writes from A to B
      
      "        A --------- B
                     p
      
      [editor's note: the 'p' markers denote cowextsize boundaries, which I
      added to make this clearer]
      
      "but the code preallocates beyond B into the range where thread
      "1 has just written, but ->end_io hasn't been called yet.
      "But once ->end_io is called thread 2 has already allocated
      "up to the extent size hint into the write range of thread 1,
      "so the end_io handler will splice the unintialized blocks from
      "that preallocation back into the file right after B."
      
      We can avoid this race by ensuring that thread 1 cannot accidentally
      remap the blocks that thread 2 allocated (as part of speculative
      preallocation) as part of t2's write preparation in t1's end_io handler.
      The way we make this happen is by taking advantage of the unwritten
      extent flag as an intermediate step.
      
      Recall that when we begin the process of writing data to shared blocks,
      we create a delayed allocation extent in the CoW fork:
      
      D: --RRRRRRSSSRRRRRRRR---
      C: ------DDDDDDD---------
      
      When a thread prepares to CoW some dirty data out to disk, it will now
      convert the delalloc reservation into an /unwritten/ allocated extent in
      the CoW fork.  The delalloc conversion code tries to opportunistically
      allocate as much of a (speculatively prealloc'd) extent as possible, so
      we may end up allocating a larger extent than we're actually writing
      out:
      
      D: --RRRRRRSSSRRRRRRRR---
      U: ------UUUUUUU---------
      
      Next, we convert only the part of the extent that we're actively
      planning to write to normal (i.e. not unwritten) status:
      
      D: --RRRRRRSSSRRRRRRRR---
      U: ------UURRUUU---------
      
      If the write succeeds, the end_cow function will now scan the relevant
      range of the CoW fork for real extents and remap only the real extents
      into the data fork:
      
      D: --RRRRRRRRSRRRRRRRR---
      U: ------UU--UUU---------
      
      This ensures that we never obliterate valid data fork extents with
      unwritten blocks from the CoW fork.
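      
      Condensed into one view (helper names illustrative, not the exact
      functions), the CoW write path now runs through these states:
      
      /*
       * write(2) dirties shared blocks:
       *     -> delalloc extent reserved in the CoW fork            (D)
       * writeback prepares the CoW:
       *     -> delalloc converted to an allocated UNWRITTEN extent,
       *        possibly covering the whole speculative prealloc    (U)
       *     -> only the subrange actually under I/O is converted
       *        to written                                          (R)
       * I/O completion (end_cow):
       *     -> scan the CoW fork, remap only *written* extents into
       *        the data fork; leftover unwritten blocks can never
       *        replace valid data
       */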
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
  15. 31 January 2017 (1 commit)
  16. 09 December 2016 (1 commit)
  17. 20 October 2016 (2 commits)
  18. 06 October 2016 (6 commits)
    • xfs: implement swapext for rmap filesystems · 1f08af52
      Darrick J. Wong committed
      Implement swapext for filesystems that have reverse mapping.  Back in
      the reflink patches, we augmented the bmap code with a 'REMAP' flag
      that updates only the bmbt and doesn't touch the allocator, and
      implemented log redo items for those two operations.  Now we can
      rewrite extent swapping as a (looong) series of remap operations.
      
      This is far less efficient than the fork swapping method implemented
      previously, so we only switch this on for rmap filesystems.
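      
      A sketch of that remap loop (error handling dropped and the two extents
      assumed equal-sized for brevity; the map/unmap deferred-op helpers are
      the ones added by the reflink series):
      
      /* swap one extent at a time purely via logged bmbt remap operations */
      while (count_fsb > 0) {
      	struct xfs_bmbt_irec	irec, uirec;
      	int			nimaps = 1;
      
      	/* read the extent at this offset from each file's data fork */
      	xfs_bmapi_read(ip, offset_fsb, count_fsb, &irec, &nimaps, 0);
      	nimaps = 1;
      	xfs_bmapi_read(tip, offset_fsb, irec.br_blockcount, &uirec,
      		       &nimaps, 0);
      
      	/* log REMAP intents: unmap from both files, remap into the other */
      	xfs_bmap_unmap_extent(mp, &dfops, ip, &irec);
      	xfs_bmap_unmap_extent(mp, &dfops, tip, &uirec);
      	xfs_bmap_map_extent(mp, &dfops, ip, &uirec);
      	xfs_bmap_map_extent(mp, &dfops, tip, &irec);
      
      	offset_fsb += irec.br_blockcount;
      	count_fsb -= irec.br_blockcount;
      }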
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
    • xfs: use interval query for rmap alloc operations on shared files · ceeb9c83
      Darrick J. Wong committed
      When it's possible for reverse mappings to overlap (data fork extents
      of files on reflink filesystems), use the interval query function to
      find the left neighbor of an extent we're trying to add, and be
      careful to use the lookup functions to update the neighbors and/or
      add new extents.
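      
      A sketch of the left-neighbor lookup built on the interval query (the
      filtering by owner/offset is elided and both function names are
      assumptions; xfs_rmap_query_range() is the interval query primitive):
      
      /* remember the candidate rmap that ends closest to the new extent */
      static int xfs_rmap_left_neighbor_helper(struct xfs_btree_cur *cur,
      		struct xfs_rmap_irec *rec, void *priv)
      {
      	struct xfs_rmap_irec	*best = priv;
      
      	if (rec->rm_startblock + rec->rm_blockcount >
      	    best->rm_startblock + best->rm_blockcount)
      		*best = *rec;
      	return 0;
      }
      
      static int xfs_rmap_left_neighbor(struct xfs_btree_cur *cur,
      		xfs_agblock_t bno, struct xfs_rmap_irec *neighbor)
      {
      	struct xfs_rmap_irec	low = { 0 };
      	struct xfs_rmap_irec	high = { .rm_startblock = bno - 1 };
      
      	/* visit every rmap starting before 'bno'; on a reflink fs
      	 * there may be several overlapping candidates */
      	return xfs_rmap_query_range(cur, &low, &high,
      			xfs_rmap_left_neighbor_helper, neighbor);
      }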
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
    • xfs: garbage collect old cowextsz reservations · 83104d44
      Darrick J. Wong committed
      Trim CoW reservations made on behalf of a cowextsz hint if they get too
      old or we run low on quota, so long as we don't have dirty data awaiting
      writeback or directio operations in progress.
      
      Garbage collection of the cowextsize extents is kept separate from
      prealloc extent reaping because setting the CoW prealloc lifetime to a
      (much) higher value than the regular prealloc extent lifetime has been
      useful for combating CoW fragmentation on VM hosts, where the VMs
      experience bursty write behavior and we can keep the utilization ratios
      low enough that we don't start to run out of space.  IOWs, it benefits
      us to keep the CoW fork reservations around for as long as we can,
      unless we run out of blocks or hit inode reclaim.
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
    • xfs: store in-progress CoW allocations in the refcount btree · 174edb0e
      Darrick J. Wong committed
      Due to the way the CoW algorithm in XFS works, there's an interval
      during which blocks allocated to handle a CoW can be lost -- if the FS
      goes down after the blocks are allocated but before the block
      remapping takes place.  This is exacerbated by the cowextsz hint --
      allocated reservations can sit around for a while, waiting to get
      used.
      
      Since the refcount btree doesn't normally store records with refcount
      of 1, we can use it to record these in-progress extents.  In-progress
      blocks cannot be shared because they're not user-visible, so there
      shouldn't be any conflicts with other programs.  This is a better
      solution than holding EFIs during writeback because (a) EFIs can't be
      relogged currently, (b) even if they could, EFIs are bound by
      available log space, which puts an unnecessary upper bound on how much
      CoW we can have in flight, and (c) we already have a mechanism to
      track blocks.
      
      At mount time, read the refcount records and free anything we find
      with a refcount of 1 because those were in-progress when the FS went
      down.
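      
      A sketch of the mount-time sweep (record-walk plumbing simplified; the
      freeing helper is hypothetical):
      
      /* refcount == 1 records can only be leftover CoW staging extents,
       * since genuinely shared blocks are tracked with refcounts >= 2 */
      static int xfs_refcount_recover_cow_record(struct xfs_btree_cur *cur,
      		union xfs_btree_rec *rec, void *priv)
      {
      	struct xfs_refcount_irec	irec;
      
      	xfs_refcount_btrec_to_irec(rec, &irec);
      	if (irec.rc_refcount != 1)
      		return 0;	/* genuinely shared: leave it alone */
      
      	/* in-progress CoW when the fs went down: delete the record
      	 * and give the blocks back to the allocator */
      	return xfs_refcount_free_cow_leftover(cur, &irec);	/* hypothetical */
      }
      
      /* at mount, walk each AG's refcount btree with the callback above */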
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
    • xfs: implement CoW for directio writes · 0613f16c
      Darrick J. Wong committed
      For O_DIRECT writes to shared blocks, we have to CoW them just like
      we would with buffered writes.  For writes that are not block-aligned,
      just bounce them to the page cache.
      
      For block-aligned writes, however, we can do better than that.  Use
      the same mechanisms that we employ for buffered CoW to set up a
      delalloc reservation, allocate all the blocks at once, issue the
      writes against the new blocks and use the same ioend functions to
      remap the blocks after the write.  This should be fairly performant.
      
      Christoph discovered that xfs_reflink_allocate_cow_range may stumble
      over invalid entries in the extent array, given that it drops the ilock
      but still expects the index to be stable.  Simply redoing the lookup on
      every iteration still isn't correct, given that xfs_bmapi_allocate will
      trigger a BUG_ON() if it hits a hole, and nothing prevents an
      xfs_bunmapi_cow call from removing extents once we have dropped the
      ilock either.
      
      This patch duplicates the inner loop of xfs_bmapi_allocate into a
      helper for xfs_reflink_allocate_cow_range so that it can be done under
      the same ilock critical section as our CoW fork delayed allocation.
      The directio CoW warts will be revisited in a later patch.
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
    • switch generic_file_splice_read() to use of ->read_iter() · 82c156f8
      Al Viro committed
      ... and kill the ->splice_read() instances that can be switched to it
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  19. 05 October 2016 (5 commits)
  20. 04 October 2016 (2 commits)