1. 16 Jan, 2016 1 commit
    • pmem, dax: clean up clear_pmem() · 52db400f
      Committed by Dan Williams
      To date, we have implemented two I/O usage models for persistent memory,
      PMEM (a persistent "ram disk") and DAX (mmap persistent memory into
      userspace).  This series adds a third, DAX-GUP, that allows DAX mappings
      to be the target of direct-i/o.  It allows userspace to coordinate
      DMA/RDMA from/to persistent memory.
      
      The implementation leverages the ZONE_DEVICE mm-zone that went into
      4.3-rc1 (also discussed at kernel summit) to flag pages that are owned
      and dynamically mapped by a device driver.  The pmem driver, after
      mapping a persistent memory range into the system memmap via
      devm_memremap_pages(), arranges for DAX to distinguish pfn-only versus
      page-backed pmem-pfns via flags in the new pfn_t type.
      
      The DAX code, upon seeing a PFN_DEV+PFN_MAP flagged pfn, flags the
      resulting pte(s) inserted into the process page tables with a new
      _PAGE_DEVMAP flag.  Later, when get_user_pages() is walking ptes it keys
      off _PAGE_DEVMAP to pin the device hosting the page range active.
      Finally, get_page() and put_page() are modified to take references
      against the device driver established page mapping.
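      The flag flow described above can be modelled in a few lines of
      userspace C.  This is only a sketch: the sketch_* names, the bit
      positions and the pin counter are assumptions made for illustration
      and are not the kernel's pfn_t or _PAGE_DEVMAP definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed bit positions; the real kernel values differ. */
#define SKETCH_PFN_DEV    (1u << 0)  /* pfn belongs to a device */
#define SKETCH_PFN_MAP    (1u << 1)  /* pfn has a struct page (memmap entry) */
#define SKETCH_PTE_DEVMAP (1u << 0)  /* stand-in for _PAGE_DEVMAP */

/* Stand-in for pfn_t: a frame number plus flag bits. */
struct sketch_pfn {
    uint64_t val;
    unsigned flags;
};

/* Hypothetical device whose page mapping must stay alive while pinned. */
struct sketch_device {
    int pins;
};

/* True only when the pfn is both device-owned and page-backed. */
static bool sketch_pfn_is_devmap(struct sketch_pfn pfn)
{
    const unsigned need = SKETCH_PFN_DEV | SKETCH_PFN_MAP;

    return (pfn.flags & need) == need;
}

/* DAX side: choose the pte flags for a pfn being inserted. */
static unsigned sketch_pte_flags_for(struct sketch_pfn pfn)
{
    return sketch_pfn_is_devmap(pfn) ? SKETCH_PTE_DEVMAP : 0u;
}

/* GUP side: pin the hosting device only when the pte carries the flag. */
static bool sketch_gup_pin(unsigned pte_flags, struct sketch_device *dev)
{
    if (!(pte_flags & SKETCH_PTE_DEVMAP))
        return false;
    dev->pins++;
    return true;
}
```

      A pfn flagged with only SKETCH_PFN_DEV yields a pte without the
      devmap bit, so the page-table walker never tries to pin a device
      for it.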
      
      This need for "struct page" for persistent memory also requires
      memory capacity to store the memmap array.  Given that the memmap array
      for a large pool of persistent memory may exhaust available DRAM,
      introduce a mechanism to allocate the memmap from persistent memory.  The new
      "struct vmem_altmap *" parameter to devm_memremap_pages() enables
      arch_add_memory() to use reserved pmem capacity rather than the page
      allocator.
      
      This patch (of 25):
      
      Both __dax_pmd_fault() and clear_pmem() were taking special steps to
      clear memory a page at a time to take advantage of non-temporal
      clear_page() implementations.  However, x86_64 does not use non-temporal
      instructions for clear_page(), and arch_clear_pmem() was always
      incurring the cost of __arch_wb_cache_pmem().
      
      Clean up the assumption that doing clear_pmem() a page at a time is more
      performant.
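
      A rough userspace model of the two shapes, assuming a 4096-byte page
      and a hypothetical sketch_wb_cache() that merely records how much
      write-back work was requested (it stands in for, and is not,
      __arch_wb_cache_pmem()):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096u  /* assumed page size for this sketch */

/* Bookkeeping for the hypothetical write-back helper. */
struct wb_stats {
    size_t calls;  /* number of write-back invocations */
    size_t bytes;  /* total bytes covered by write-back */
};

/* Stand-in for a cache write-back: records the request, does nothing else. */
static void sketch_wb_cache(struct wb_stats *st, const void *addr, size_t len)
{
    (void)addr;
    st->calls++;
    st->bytes += len;
}

/* Old shape: clear one page per iteration, writing back after each page. */
static void sketch_clear_by_page(unsigned char *buf, size_t len,
                                 struct wb_stats *st)
{
    assert(len % SKETCH_PAGE_SIZE == 0);
    for (size_t off = 0; off < len; off += SKETCH_PAGE_SIZE) {
        memset(buf + off, 0, SKETCH_PAGE_SIZE);
        sketch_wb_cache(st, buf + off, SKETCH_PAGE_SIZE);
    }
}

/* New shape: clear the whole range, then write it back in one call. */
static void sketch_clear_whole(unsigned char *buf, size_t len,
                               struct wb_stats *st)
{
    memset(buf, 0, len);
    sketch_wb_cache(st, buf, len);
}
```

      Both helpers leave identical memory contents and cover the same
      number of bytes with write-back, so in this model the per-page loop
      only adds call overhead.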
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Reported-by: Dave Hansen <dave.hansen@linux.intel.com>
      Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
      Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Christoffer Dall <christoffer.dall@linaro.org>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: David Airlie <airlied@linux.ie>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jan Kara <jack@suse.com>
      Cc: Jeff Dike <jdike@addtoit.com>
      Cc: Jens Axboe <axboe@fb.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Logan Gunthorpe <logang@deltatee.com>
      Cc: Matthew Wilcox <willy@linux.intel.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Toshi Kani <toshi.kani@hpe.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  2. 17 Nov, 2015 1 commit
    • dax: disable pmd mappings · ee82c9ed
      Committed by Dan Williams
      While dax pmd mappings are functional in the nominal path they trigger
      kernel crashes in the following paths:
      
       BUG: unable to handle kernel paging request at ffffea0004098000
       IP: [<ffffffff812362f7>] follow_trans_huge_pmd+0x117/0x3b0
       [..]
       Call Trace:
        [<ffffffff811f6573>] follow_page_mask+0x2d3/0x380
        [<ffffffff811f6708>] __get_user_pages+0xe8/0x6f0
        [<ffffffff811f7045>] get_user_pages_unlocked+0x165/0x1e0
        [<ffffffff8106f5b1>] get_user_pages_fast+0xa1/0x1b0
      
       kernel BUG at arch/x86/mm/gup.c:131!
       [..]
       Call Trace:
        [<ffffffff8106f34c>] gup_pud_range+0x1bc/0x220
        [<ffffffff8106f634>] get_user_pages_fast+0x124/0x1b0
      
       BUG: unable to handle kernel paging request at ffffea0004088000
       IP: [<ffffffff81235f49>] copy_huge_pmd+0x159/0x350
       [..]
       Call Trace:
        [<ffffffff811fad3c>] copy_page_range+0x34c/0x9f0
        [<ffffffff810a0daf>] copy_process+0x1b7f/0x1e10
        [<ffffffff810a11c1>] _do_fork+0x91/0x590
      
      All of these paths are interpreting a dax pmd mapping as a transparent
      huge page and making the assumption that the pfn is covered by the
      memmap, i.e. that the pfn has an associated struct page.  PTE mappings
      do not suffer the same fate since they have the _PAGE_SPECIAL flag to
      cause the gup path to fault.  We can do something similar for the PMD
      path, or otherwise defer pmd support for cases where a struct page is
      available.  For now, 4.4-rc and -stable need to disable dax pmd support
      by default.
      
      For development the "depends on BROKEN" line can be removed from
      CONFIG_FS_DAX_PMD.
      
      Cc: <stable@vger.kernel.org>
      Cc: Jan Kara <jack@suse.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Matthew Wilcox <willy@linux.intel.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reported-by: Ross Zwisler <ross.zwisler@linux.intel.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
  3. 13 Nov, 2015 1 commit
    • dax: fix __dax_pmd_fault crash · 152d7bd8
      Committed by Dan Williams
      Since 4.3 introduced devm_memremap_pages() the pfns handled by DAX may
      optionally have a struct page backing.  When a mapped pfn reaches
      vmf_insert_pfn_pmd() it fails with a crash signature like the following:
      
       kernel BUG at mm/huge_memory.c:905!
       [..]
       Call Trace:
        [<ffffffff812a73ba>] __dax_pmd_fault+0x2ea/0x5b0
        [<ffffffffa01a4182>] xfs_filemap_pmd_fault+0x92/0x150 [xfs]
        [<ffffffff811fbe02>] handle_mm_fault+0x312/0x1b50
      
      Fix this by falling back to 4K mappings in the pfn_valid() case.  Longer
      term, vmf_insert_pfn_pmd() needs to grow support for architectures that
      can provide a 'pmd_special' capability.
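
      The fallback rule can be written as a small decision function.  This
      is a sketch of the decision only: both inputs are plain booleans
      supplied by the caller, the second one standing in for a pfn_valid()
      check, and the sketch_* names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Mapping sizes the sketch can choose between. */
enum sketch_map_size {
    SKETCH_MAP_4K,   /* fall back to ordinary 4K pte mappings */
    SKETCH_MAP_PMD   /* install a huge pmd mapping */
};

/*
 * pmd_possible:        alignment and size checks already passed.
 * pfn_has_struct_page: stand-in for pfn_valid() being true for the pfn.
 */
static enum sketch_map_size sketch_pick_mapping(bool pmd_possible,
                                                bool pfn_has_struct_page)
{
    if (!pmd_possible)
        return SKETCH_MAP_4K;
    /* Page-backed pfns cannot take the pfn-only pmd insert path. */
    if (pfn_has_struct_page)
        return SKETCH_MAP_4K;
    return SKETCH_MAP_PMD;
}
```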
      
      Cc: <stable@vger.kernel.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Reported-by: Ross Zwisler <ross.zwisler@linux.intel.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
  4. 12 Nov, 2015 1 commit
  5. 03 Nov, 2015 1 commit
    • xfs: Don't use unwritten extents for DAX · 1ca19157
      Committed by Dave Chinner
      DAX has a page fault serialisation problem with block allocation.
      Because it allows concurrent page faults and does not have a page
      lock to serialise faults to the same page, it can get two concurrent
      faults to the page that race.
      
      When two read faults race, this isn't a huge problem as the data
      underlying the page is not changing and so "detect and drop" works
      just fine. The issues are to do with write faults.
      
      When two write faults occur, we serialise block allocation in
      get_blocks() so only one fault will allocate the extent. It will,
      however, be marked as an unwritten extent, and that is where the
      problem lies - the DAX fault code cannot differentiate between a
      block that was just allocated and a block that was preallocated and
      needs zeroing. The result is that both write faults end up zeroing
      the block and attempting to convert it back to written.
      
      The problem is that the first fault can zero and convert before the
      second fault starts zeroing, resulting in the zeroing for the second
      fault overwriting the data that the first fault wrote with zeros.
      The second fault then attempts to convert the unwritten extent,
      which is then a no-op because it's already written. Data loss occurs
      as a result of this race.
      
      Because there is no sane locking construct in the page fault code
      that we can use for serialisation across the page faults, we need to
      ensure block allocation and zeroing occurs atomically in the
      filesystem. This means we can still take concurrent page faults and
      the only time they will serialise is in the filesystem
      mapping/allocation callback. The page fault code will always see
      written, initialised extents, so we will be able to remove the
      unwritten extent handling from the DAX code when all filesystems are
      converted.
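
      The losing interleaving can be replayed deterministically in
      userspace.  The sketch below is an assumption-laden model: a "block"
      is a small byte array with an unwritten flag, each fault first
      samples the flag and later zeroes and converts, and the sketch_*
      names do not correspond to XFS or DAX functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Toy block: extent state plus a few bytes of data. */
struct sketch_block {
    bool unwritten;
    unsigned char data[8];
};

/* Fault step 1: sample the extent state. */
static bool sketch_sees_unwritten(const struct sketch_block *b)
{
    return b->unwritten;
}

/* Fault step 2: zero and convert if the sampled state was unwritten. */
static void sketch_zero_and_convert(struct sketch_block *b, bool saw_unwritten)
{
    if (!saw_unwritten)
        return;
    memset(b->data, 0, sizeof b->data);
    b->unwritten = false;  /* a no-op when the block is already written */
}

/* Racy order: both faults sample before either one zeroes. */
static unsigned char sketch_racy_interleaving(void)
{
    struct sketch_block b = { .unwritten = true };
    bool f1, f2;

    memset(b.data, 0xAA, sizeof b.data);  /* stale preallocated contents */
    f1 = sketch_sees_unwritten(&b);
    f2 = sketch_sees_unwritten(&b);
    sketch_zero_and_convert(&b, f1);
    b.data[0] = 0x42;                     /* first fault's user store */
    sketch_zero_and_convert(&b, f2);      /* late zeroing wipes the store */
    return b.data[0];
}

/* Fixed order: the filesystem zeroes and marks written in one step. */
static unsigned char sketch_fs_zeroes_atomically(void)
{
    struct sketch_block b = { .unwritten = true };
    bool f1, f2;

    memset(b.data, 0xAA, sizeof b.data);
    memset(b.data, 0, sizeof b.data);     /* allocation callback zeroes */
    b.unwritten = false;                  /* faults only see written */
    f1 = sketch_sees_unwritten(&b);
    f2 = sketch_sees_unwritten(&b);
    sketch_zero_and_convert(&b, f1);
    b.data[0] = 0x42;
    sketch_zero_and_convert(&b, f2);
    return b.data[0];
}
```

      In the racy order the stored byte is zeroed again; in the fixed
      order neither fault sees an unwritten extent, so the store survives.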
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
  6. 17 Oct, 2015 1 commit
  7. 02 Oct, 2015 1 commit
  8. 16 Sep, 2015 1 commit
  9. 10 Sep, 2015 1 commit
    • dax: update PMD fault handler with PMEM API · d77e92e2
      Committed by Ross Zwisler
      As part of the v4.3 merge window the DAX code was updated by Matthew and
      Kirill to handle PMD pages.  Also as part of the v4.3 merge window we
      updated the DAX code to do proper PMEM flushing (commit 2765cfbb:
      "dax: update I/O path to do proper PMEM flushing").
      
      The additional code added by the DAX PMD patches also needs to be
      updated to properly use the PMEM API.  This ensures that after a PMD
      fault is handled the zeros written to the newly allocated pages are
      durable on the DIMMs.
      
      linux/dax.h is included to get rid of a bunch of sparse warnings.
      Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Matthew Wilcox <willy@linux.intel.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Kirill Shutemov <kirill@shutemov.name>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  10. 09 Sep, 2015 8 commits
  11. 21 Aug, 2015 2 commits
  12. 29 Jul, 2015 1 commit
    • xfs: call dax_fault on read page faults for DAX · b2442c5a
      Committed by Dave Chinner
      When modifying the patch series to handle the XFS MMAP_LOCK nesting
      of page faults, I botched the conversion of the read page fault
      path, and so it is only ever calling through the page cache. Re-add
      the necessary __dax_fault() call for such files.
      
      Because the get_blocks callback on read faults may not set up the
      mapping buffer correctly to allow unwritten extent completion to be
      run, we need to allow callers of __dax_fault() to pass a null
      complete_unwritten() callback. The DAX code always zeros the
      unwritten page when it is read faulted so there are no stale data
      exposure issues with not doing the conversion. The only downside
      will be the potential for increased CPU overhead on repeated read
      faults of the same page. If this proves to be a problem, then the
      filesystem needs to fix its get_block callback and provide a
      convert_unwritten() callback to the read fault path.
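
      The trade-off of a null callback can be illustrated with a toy read
      fault that counts zeroing passes.  This is a sketch under stated
      assumptions: the sketch_* types and functions are hypothetical and
      only mirror the behaviour described above, not the __dax_fault()
      implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy page: extent state plus a counter of zeroing passes. */
struct sketch_page {
    bool unwritten;
    unsigned zero_passes;
};

/* Optional conversion callback; NULL means "do not convert". */
typedef void (*sketch_convert_fn)(struct sketch_page *pg);

/* Example callback that marks the extent written. */
static void sketch_mark_written(struct sketch_page *pg)
{
    pg->unwritten = false;
}

/* Read fault: always zero an unwritten page, convert only if asked to. */
static void sketch_read_fault(struct sketch_page *pg,
                              sketch_convert_fn complete_unwritten)
{
    if (!pg->unwritten)
        return;
    pg->zero_passes++;  /* stands in for zeroing the page contents */
    if (complete_unwritten != NULL)
        complete_unwritten(pg);
}
```

      With a NULL callback every repeated read fault zeroes the page
      again, which is the extra CPU cost mentioned above; with a callback
      the zeroing happens once.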
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Matthew Wilcox <willy@linux.intel.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
  13. 05 Jul, 2015 2 commits
  14. 04 Jun, 2015 2 commits
    • dax: expose __dax_fault for filesystems with locking constraints · ce5c5d55
      Committed by Dave Chinner
      Some filesystems cannot call dax_fault() directly because they have
      different locking and/or allocation constraints in the page fault IO
      path. To handle this, we need to follow the same model as the
      generic block_page_mkwrite code, where the internals are exposed via
      __block_page_mkwrite() so that filesystems can wrap the correct
      locking and operations around the outside. 
      
      This is loosely based on a patch originally from Matthew Wilcox.
      Unlike the original patch, it does not change ext4 code, error
      returns or unwritten extent conversion handling.  It also adds a
      __dax_mkwrite() wrapper for .page_mkwrite implementations to do the
      right thing, too.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
    • dax: don't abuse get_block mapping for endio callbacks · e842f290
      Committed by Dave Chinner
      dax_fault() currently relies on the get_block callback to attach an
      io completion callback to the mapping buffer head so that it can
      run unwritten extent conversion after zeroing allocated blocks.
      
      Instead of this hack, pass the conversion callback directly into
      dax_fault() similar to the get_block callback. When the filesystem
      allocates unwritten extents, it will set the buffer_unwritten()
      flag, and hence the dax_fault code can call the completion function
      in the contexts where it is necessary without overloading the
      mapping buffer head.
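
      A minimal sketch of the interface shape, assuming a toy extent type
      and hypothetical sketch_* names: the conversion callback is an
      explicit argument of the fault function, and it runs only when the
      extent is flagged unwritten.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Toy extent: state flag, data, and a count of conversions performed. */
struct sketch_extent {
    bool unwritten;
    unsigned char data[16];
    int conversions;
};

/* Completion callback type, passed alongside the extent. */
typedef void (*sketch_complete_fn)(struct sketch_extent *ext);

/* Example filesystem callback: convert unwritten to written. */
static void sketch_convert_unwritten(struct sketch_extent *ext)
{
    ext->unwritten = false;
    ext->conversions++;
}

/* Write fault: zero and convert only when the unwritten flag is set. */
static void sketch_write_fault(struct sketch_extent *ext,
                               sketch_complete_fn complete_unwritten)
{
    assert(complete_unwritten != NULL);
    if (ext->unwritten) {
        memset(ext->data, 0, sizeof ext->data);
        complete_unwritten(ext);
    }
}
```

      Because the callback arrives as a parameter, nothing has to be
      stashed on the mapping object for the fault code to find later.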
      
      Note: The changes to ext4 to use this interface are suspect at best.
      In fact, the way ext4 did this end_io assignment in the first place
      looks suspect because it only set a completion callback when there
      wasn't already some other write() call taking place on the same
      inode. The ext4 end_io code looks rather intricate and fragile with
      all its reference counting and passing to different contexts for
      modification via inode private pointers that aren't protected by
      locks...
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Acked-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
  15. 25 Apr, 2015 1 commit
    • direct-io: only inc/dec inode->i_dio_count for file systems · fe0f07d0
      Committed by Jens Axboe
      do_blockdev_direct_IO() increments and decrements the inode
      ->i_dio_count for each IO operation. It does this to protect against
      truncate of a file. Block devices don't need this sort of protection.
      
      For a capable multiqueue setup, this atomic int is the only shared
      state between applications accessing the device for O_DIRECT, and it
      presents a scaling wall for that. In my testing, as much as 30% of
      system time is spent incrementing and decrementing this value. A mixed
      read/write workload improved from ~2.5M IOPS to ~9.6M IOPS, with
      better latencies too. Before:
      
      clat percentiles (usec):
       |  1.00th=[   33],  5.00th=[   34], 10.00th=[   34], 20.00th=[   34],
       | 30.00th=[   34], 40.00th=[   34], 50.00th=[   35], 60.00th=[   35],
       | 70.00th=[   35], 80.00th=[   35], 90.00th=[   37], 95.00th=[   80],
       | 99.00th=[   98], 99.50th=[  151], 99.90th=[  155], 99.95th=[  155],
       | 99.99th=[  165]
      
      After:
      
      clat percentiles (usec):
       |  1.00th=[   95],  5.00th=[  108], 10.00th=[  129], 20.00th=[  149],
       | 30.00th=[  155], 40.00th=[  161], 50.00th=[  167], 60.00th=[  171],
       | 70.00th=[  177], 80.00th=[  185], 90.00th=[  201], 95.00th=[  270],
       | 99.00th=[  390], 99.50th=[  398], 99.90th=[  418], 99.95th=[  422],
       | 99.99th=[  438]
      
      In other setups, Robert Elliott reported seeing good performance
      improvements:
      
      https://lkml.org/lkml/2015/4/3/557
      
      The more applications accessing the device, the worse it gets.
      
      Add a new direct-io flag, DIO_SKIP_DIO_COUNT, which tells
      do_blockdev_direct_IO() that it need not worry about incrementing
      or decrementing the inode i_dio_count for this caller.
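
      The effect of the flag can be sketched with C11 atomics.  The bit
      value chosen for SKETCH_DIO_SKIP_DIO_COUNT and the sketch_* names
      are assumptions for illustration; only the idea of gating the shared
      counter updates on a caller-supplied flag comes from the text above.

```c
#include <assert.h>
#include <stdatomic.h>

/* Assumed bit value; the kernel's DIO_SKIP_DIO_COUNT may differ. */
#define SKETCH_DIO_SKIP_DIO_COUNT 0x08u

/* Toy inode holding only the shared direct-I/O counter. */
struct sketch_inode {
    atomic_int i_dio_count;
};

/*
 * Model one direct-I/O operation.  Returns how many times the shared
 * atomic counter was touched, which is the contention being avoided.
 */
static int sketch_direct_io(struct sketch_inode *inode, unsigned flags)
{
    const int counted = !(flags & SKETCH_DIO_SKIP_DIO_COUNT);
    int shared_updates = 0;

    if (counted) {
        atomic_fetch_add(&inode->i_dio_count, 1);
        shared_updates++;
    }

    /* The I/O itself would be submitted and completed here. */

    if (counted) {
        atomic_fetch_sub(&inode->i_dio_count, 1);
        shared_updates++;
    }
    return shared_updates;
}
```

      A caller that passes the flag never writes the shared cache line
      holding the counter, which is where the scaling wall came from.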
      
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: Elliott, Robert (Server Storage) <elliott@hp.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  16. 16 Apr, 2015 1 commit
  17. 12 Apr, 2015 1 commit
  18. 17 Feb, 2015 5 commits