1. 30 4月, 2020 2 次提交
  2. 24 7月, 2018 1 次提交
    • D
      filesystem-dax: Introduce dax_lock_mapping_entry() · c2a7d2a1
      Dan Williams 提交于
      In preparation for implementing support for memory poison (media error)
      handling via dax mappings, implement a lock_page() equivalent. Poison
      error handling requires rmap and needs guarantees that the page->mapping
      association is maintained / valid (inode not freed) for the duration of
      the lookup.
      
      In the device-dax case it is sufficient to simply hold a dev_pagemap
      reference. In the filesystem-dax case we need to use the entry lock.
      
      Export the entry lock via dax_lock_mapping_entry() that uses
      rcu_read_lock() to protect against the inode being freed, and
      revalidates the page->mapping association under xa_lock().
      
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Jan Kara <jack@suse.cz>
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      Signed-off-by: NDave Jiang <dave.jiang@intel.com>
      c2a7d2a1
  3. 29 6月, 2018 1 次提交
  4. 08 6月, 2018 1 次提交
  5. 31 5月, 2018 2 次提交
  6. 23 5月, 2018 1 次提交
    • D
      dax: Introduce a ->copy_to_iter dax operation · b3a9a0c3
      Dan Williams 提交于
      Similar to the ->copy_from_iter() operation, a platform may want to
      deploy an architecture or device specific routine for handling reads
      from a dax_device like /dev/pmemX. On x86 this routine will point to a
      machine check safe version of copy_to_iter(). For now, add the plumbing
      to device-mapper and the dax core.
      
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Mike Snitzer <snitzer@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      b3a9a0c3
  7. 22 5月, 2018 1 次提交
    • D
      mm, fs, dax: handle layout changes to pinned dax mappings · 5fac7408
      Dan Williams 提交于
      Background:
      
      get_user_pages() in the filesystem pins file backed memory pages for
      access by devices performing dma. However, it only pins the memory pages
      not the page-to-file offset association. If a file is truncated the
      pages are mapped out of the file and dma may continue indefinitely into
      a page that is owned by a device driver. This breaks coherency of the
      file vs dma, but the assumption is that if userspace wants the
      file-space truncated it does not matter what data is inbound from the
      device, it is not relevant anymore. The only expectation is that dma can
      safely continue while the filesystem reallocates the block(s).
      
      Problem:
      
      This expectation that dma can safely continue while the filesystem
      changes the block map is broken by dax. With dax the target dma page
      *is* the filesystem block. The model of leaving the page pinned for dma,
      but truncating the file block out of the file, means that the filesytem
      is free to reallocate a block under active dma to another file and now
      the expected data-incoherency situation has turned into active
      data-corruption.
      
      Solution:
      
      Defer all filesystem operations (fallocate(), truncate()) on a dax mode
      file while any page/block in the file is under active dma. This solution
      assumes that dma is transient. Cases where dma operations are known to
      not be transient, like RDMA, have been explicitly disabled via
      commits like 5f1d43de "IB/core: disable memory registration of
      filesystem-dax vmas".
      
      The dax_layout_busy_page() routine is called by filesystems with a lock
      held against mm faults (i_mmap_lock) to find pinned / busy dax pages.
      The process of looking up a busy page invalidates all mappings
      to trigger any subsequent get_user_pages() to block on i_mmap_lock.
      The filesystem continues to call dax_layout_busy_page() until it finally
      returns no more active pages. This approach assumes that the page
      pinning is transient, if that assumption is violated the system would
      have likely hung from the uncompleted I/O.
      
      Cc: Jeff Moyer <jmoyer@redhat.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Matthew Wilcox <mawilcox@microsoft.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Reported-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      5fac7408
  8. 03 4月, 2018 1 次提交
  9. 31 3月, 2018 1 次提交
  10. 08 1月, 2018 1 次提交
  11. 03 11月, 2017 2 次提交
  12. 02 11月, 2017 1 次提交
    • G
      License cleanup: add SPDX GPL-2.0 license identifier to files with no license · b2441318
      Greg Kroah-Hartman 提交于
      Many source files in the tree are missing licensing information, which
      makes it harder for compliance tools to determine the correct license.
      
      By default all files without license information are under the default
      license of the kernel, which is GPL version 2.
      
      Update the files which contain no license information with the 'GPL-2.0'
      SPDX license identifier.  The SPDX identifier is a legally binding
      shorthand, which can be used instead of the full boiler plate text.
      
      This patch is based on work done by Thomas Gleixner and Kate Stewart and
      Philippe Ombredanne.
      
      How this work was done:
      
      Patches were generated and checked against linux-4.14-rc6 for a subset of
      the use cases:
       - file had no licensing information it it.
       - file was a */uapi/* one with no licensing information in it,
       - file was a */uapi/* one with existing licensing information,
      
      Further patches will be generated in subsequent months to fix up cases
      where non-standard license headers were used, and references to license
      had to be inferred by heuristics based on keywords.
      
      The analysis to determine which SPDX License Identifier to be applied to
      a file was done in a spreadsheet of side by side results from of the
      output of two independent scanners (ScanCode & Windriver) producing SPDX
      tag:value files created by Philippe Ombredanne.  Philippe prepared the
      base worksheet, and did an initial spot review of a few 1000 files.
      
      The 4.13 kernel was the starting point of the analysis with 60,537 files
      assessed.  Kate Stewart did a file by file comparison of the scanner
      results in the spreadsheet to determine which SPDX license identifier(s)
      to be applied to the file. She confirmed any determination that was not
      immediately clear with lawyers working with the Linux Foundation.
      
      Criteria used to select files for SPDX license identifier tagging was:
       - Files considered eligible had to be source code files.
       - Make and config files were included as candidates if they contained >5
         lines of source
       - File already had some variant of a license header in it (even if <5
         lines).
      
      All documentation files were explicitly excluded.
      
      The following heuristics were used to determine which SPDX license
      identifiers to apply.
      
       - when both scanners couldn't find any license traces, file was
         considered to have no license information in it, and the top level
         COPYING file license applied.
      
         For non */uapi/* files that summary was:
      
         SPDX license identifier                            # files
         ---------------------------------------------------|-------
         GPL-2.0                                              11139
      
         and resulted in the first patch in this series.
      
         If that file was a */uapi/* path one, it was "GPL-2.0 WITH
         Linux-syscall-note" otherwise it was "GPL-2.0".  Results of that was:
      
         SPDX license identifier                            # files
         ---------------------------------------------------|-------
         GPL-2.0 WITH Linux-syscall-note                        930
      
         and resulted in the second patch in this series.
      
       - if a file had some form of licensing information in it, and was one
         of the */uapi/* ones, it was denoted with the Linux-syscall-note if
         any GPL family license was found in the file or had no licensing in
         it (per prior point).  Results summary:
      
         SPDX license identifier                            # files
         ---------------------------------------------------|------
         GPL-2.0 WITH Linux-syscall-note                       270
         GPL-2.0+ WITH Linux-syscall-note                      169
         ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause)    21
         ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause)    17
         LGPL-2.1+ WITH Linux-syscall-note                      15
         GPL-1.0+ WITH Linux-syscall-note                       14
         ((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause)    5
         LGPL-2.0+ WITH Linux-syscall-note                       4
         LGPL-2.1 WITH Linux-syscall-note                        3
         ((GPL-2.0 WITH Linux-syscall-note) OR MIT)              3
         ((GPL-2.0 WITH Linux-syscall-note) AND MIT)             1
      
         and that resulted in the third patch in this series.
      
       - when the two scanners agreed on the detected license(s), that became
         the concluded license(s).
      
       - when there was disagreement between the two scanners (one detected a
         license but the other didn't, or they both detected different
         licenses) a manual inspection of the file occurred.
      
       - In most cases a manual inspection of the information in the file
         resulted in a clear resolution of the license that should apply (and
         which scanner probably needed to revisit its heuristics).
      
       - When it was not immediately clear, the license identifier was
         confirmed with lawyers working with the Linux Foundation.
      
       - If there was any question as to the appropriate license identifier,
         the file was flagged for further research and to be revisited later
         in time.
      
      In total, over 70 hours of logged manual review was done on the
      spreadsheet to determine the SPDX license identifiers to apply to the
      source files by Kate, Philippe, Thomas and, in some cases, confirmation
      by lawyers working with the Linux Foundation.
      
      Kate also obtained a third independent scan of the 4.13 code base from
      FOSSology, and compared selected files where the other two scanners
      disagreed against that SPDX file, to see if there was new insights.  The
      Windriver scanner is based on an older version of FOSSology in part, so
      they are related.
      
      Thomas did random spot checks in about 500 files from the spreadsheets
      for the uapi headers and agreed with SPDX license identifier in the
      files he inspected. For the non-uapi files Thomas did random spot checks
      in about 15000 files.
      
      In initial set of patches against 4.14-rc6, 3 files were found to have
      copy/paste license identifier errors, and have been fixed to reflect the
      correct identifier.
      
      Additionally Philippe spent 10 hours this week doing a detailed manual
      inspection and review of the 12,461 patched files from the initial patch
      version early this week with:
       - a full scancode scan run, collecting the matched texts, detected
         license ids and scores
       - reviewing anything where there was a license detected (about 500+
         files) to ensure that the applied SPDX license was correct
       - reviewing anything where there was no detection but the patch license
         was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
         SPDX license was correct
      
      This produced a worksheet with 20 files needing minor correction.  This
      worksheet was then exported into 3 different .csv files for the
      different types of files to be modified.
      
      These .csv files were then reviewed by Greg.  Thomas wrote a script to
      parse the csv files and add the proper SPDX tag to the file, in the
      format that the file expected.  This script was further refined by Greg
      based on the output to detect more types of files automatically and to
      distinguish between header and source .c files (which need different
      comment types.)  Finally Greg ran the script using the .csv files to
      generate the patches.
      Reviewed-by: NKate Stewart <kstewart@linuxfoundation.org>
      Reviewed-by: NPhilippe Ombredanne <pombredanne@nexb.com>
      Reviewed-by: NThomas Gleixner <tglx@linutronix.de>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      b2441318
  13. 11 9月, 2017 1 次提交
    • M
      dax: remove the pmem_dax_ops->flush abstraction · c3ca015f
      Mikulas Patocka 提交于
      Commit abebfbe2 ("dm: add ->flush() dax operation support") is
      buggy. A DM device may be composed of multiple underlying devices and
      all of them need to be flushed. That commit just routes the flush
      request to the first device and ignores the other devices.
      
      It could be fixed by adding more complex logic to the device mapper. But
      there is only one implementation of the method pmem_dax_ops->flush - that
      is pmem_dax_flush() - and it calls arch_wb_cache_pmem(). Consequently, we
      don't need the pmem_dax_ops->flush abstraction at all, we can call
      arch_wb_cache_pmem() directly from dax_flush() because dax_dev->ops->flush
      can't ever reach anything different from arch_wb_cache_pmem().
      
      It should be also pointed out that for some uses of persistent memory it
      is needed to flush only a very small amount of data (such as 1 cacheline),
      and it would be overkill if we go through that device mapper machinery for
      a single flushed cache line.
      
      Fix this by removing the pmem_dax_ops->flush abstraction and call
      arch_wb_cache_pmem() directly from dax_flush(). Also, remove the device
      mapper code that forwards the flushes.
      
      Fixes: abebfbe2 ("dm: add ->flush() dax operation support")
      Cc: stable@vger.kernel.org
      Signed-off-by: NMikulas Patocka <mpatocka@redhat.com>
      Reviewed-by: NDan Williams <dan.j.williams@intel.com>
      Signed-off-by: NMike Snitzer <snitzer@redhat.com>
      c3ca015f
  14. 07 9月, 2017 3 次提交
    • R
      dax: move all DAX radix tree defs to fs/dax.c · 527b19d0
      Ross Zwisler 提交于
      Now that we no longer insert struct page pointers in DAX radix trees the
      page cache code no longer needs to know anything about DAX exceptional
      entries.  Move all the DAX exceptional entry definitions from dax.h to
      fs/dax.c.
      
      Link: http://lkml.kernel.org/r/20170724170616.25810-6-ross.zwisler@linux.intel.comSigned-off-by: NRoss Zwisler <ross.zwisler@linux.intel.com>
      Suggested-by: NJan Kara <jack@suse.cz>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Matthew Wilcox <mawilcox@microsoft.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      527b19d0
    • R
      dax: remove DAX code from page_cache_tree_insert() · d01ad197
      Ross Zwisler 提交于
      Now that we no longer insert struct page pointers in DAX radix trees we
      can remove the special casing for DAX in page_cache_tree_insert().
      
      This also allows us to make dax_wake_mapping_entry_waiter() local to
      fs/dax.c, removing it from dax.h.
      
      Link: http://lkml.kernel.org/r/20170724170616.25810-5-ross.zwisler@linux.intel.comSigned-off-by: NRoss Zwisler <ross.zwisler@linux.intel.com>
      Suggested-by: NJan Kara <jack@suse.cz>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Matthew Wilcox <mawilcox@microsoft.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d01ad197
    • R
      dax: use common 4k zero page for dax mmap reads · 91d25ba8
      Ross Zwisler 提交于
      When servicing mmap() reads from file holes the current DAX code
      allocates a page cache page of all zeroes and places the struct page
      pointer in the mapping->page_tree radix tree.
      
      This has three major drawbacks:
      
      1) It consumes memory unnecessarily. For every 4k page that is read via
         a DAX mmap() over a hole, we allocate a new page cache page. This
         means that if you read 1GiB worth of pages, you end up using 1GiB of
         zeroed memory. This is easily visible by looking at the overall
         memory consumption of the system or by looking at /proc/[pid]/smaps:
      
      	7f62e72b3000-7f63272b3000 rw-s 00000000 103:00 12   /root/dax/data
      	Size:            1048576 kB
      	Rss:             1048576 kB
      	Pss:             1048576 kB
      	Shared_Clean:          0 kB
      	Shared_Dirty:          0 kB
      	Private_Clean:   1048576 kB
      	Private_Dirty:         0 kB
      	Referenced:      1048576 kB
      	Anonymous:             0 kB
      	LazyFree:              0 kB
      	AnonHugePages:         0 kB
      	ShmemPmdMapped:        0 kB
      	Shared_Hugetlb:        0 kB
      	Private_Hugetlb:       0 kB
      	Swap:                  0 kB
      	SwapPss:               0 kB
      	KernelPageSize:        4 kB
      	MMUPageSize:           4 kB
      	Locked:                0 kB
      
      2) It is slower than using a common zero page because each page fault
         has more work to do. Instead of just inserting a common zero page we
         have to allocate a page cache page, zero it, and then insert it. Here
         are the average latencies of dax_load_hole() as measured by ftrace on
         a random test box:
      
          Old method, using zeroed page cache pages:	3.4 us
          New method, using the common 4k zero page:	0.8 us
      
         This was the average latency over 1 GiB of sequential reads done by
         this simple fio script:
      
           [global]
           size=1G
           filename=/root/dax/data
           fallocate=none
           [io]
           rw=read
           ioengine=mmap
      
      3) The fact that we had to check for both DAX exceptional entries and
         for page cache pages in the radix tree made the DAX code more
         complex.
      
      Solve these issues by following the lead of the DAX PMD code and using a
      common 4k zero page instead.  As with the PMD code we will now insert a
      DAX exceptional entry into the radix tree instead of a struct page
      pointer which allows us to remove all the special casing in the DAX
      code.
      
      Note that we do still pretty aggressively check for regular pages in the
      DAX radix tree, especially where we take action based on the bits set in
      the page.  If we ever find a regular page in our radix tree now that
      most likely means that someone besides DAX is inserting pages (which has
      happened lots of times in the past), and we want to find that out early
      and fail loudly.
      
      This solution also removes the extra memory consumption.  Here is that
      same /proc/[pid]/smaps after 1GiB of reading from a hole with the new
      code:
      
      	7f2054a74000-7f2094a74000 rw-s 00000000 103:00 12   /root/dax/data
      	Size:            1048576 kB
      	Rss:                   0 kB
      	Pss:                   0 kB
      	Shared_Clean:          0 kB
      	Shared_Dirty:          0 kB
      	Private_Clean:         0 kB
      	Private_Dirty:         0 kB
      	Referenced:            0 kB
      	Anonymous:             0 kB
      	LazyFree:              0 kB
      	AnonHugePages:         0 kB
      	ShmemPmdMapped:        0 kB
      	Shared_Hugetlb:        0 kB
      	Private_Hugetlb:       0 kB
      	Swap:                  0 kB
      	SwapPss:               0 kB
      	KernelPageSize:        4 kB
      	MMUPageSize:           4 kB
      	Locked:                0 kB
      
      Overall system memory consumption is similarly improved.
      
      Another major change is that we remove dax_pfn_mkwrite() from our fault
      flow, and instead rely on the page fault itself to make the PTE dirty
      and writeable.  The following description from the patch adding the
      vm_insert_mixed_mkwrite() call explains this a little more:
      
         "To be able to use the common 4k zero page in DAX we need to have our
          PTE fault path look more like our PMD fault path where a PTE entry
          can be marked as dirty and writeable as it is first inserted rather
          than waiting for a follow-up dax_pfn_mkwrite() =>
          finish_mkwrite_fault() call.
      
          Right now we can rely on having a dax_pfn_mkwrite() call because we
          can distinguish between these two cases in do_wp_page():
      
                  case 1: 4k zero page => writable DAX storage
                  case 2: read-only DAX storage => writeable DAX storage
      
          This distinction is made by via vm_normal_page(). vm_normal_page()
          returns false for the common 4k zero page, though, just as it does
          for DAX ptes. Instead of special casing the DAX + 4k zero page case
          we will simplify our DAX PTE page fault sequence so that it matches
          our DAX PMD sequence, and get rid of the dax_pfn_mkwrite() helper.
          We will instead use dax_iomap_fault() to handle write-protection
          faults.
      
          This means that insert_pfn() needs to follow the lead of
          insert_pfn_pmd() and allow us to pass in a 'mkwrite' flag. If
          'mkwrite' is set insert_pfn() will do the work that was previously
          done by wp_page_reuse() as part of the dax_pfn_mkwrite() call path"
      
      Link: http://lkml.kernel.org/r/20170724170616.25810-4-ross.zwisler@linux.intel.comSigned-off-by: NRoss Zwisler <ross.zwisler@linux.intel.com>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Matthew Wilcox <mawilcox@microsoft.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      91d25ba8
  15. 31 8月, 2017 1 次提交
  16. 27 7月, 2017 1 次提交
  17. 11 7月, 2017 1 次提交
  18. 30 6月, 2017 1 次提交
    • D
      libnvdimm, pmem, dax: export a cache control attribute · 6e0c90d6
      Dan Williams 提交于
      The dax_flush() operation can be turned into a nop on platforms where
      firmware arranges for cpu caches to be flushed on a power-fail event.
      The ACPI 6.2 specification defines a mechanism for the platform to
      indicate this capability so the kernel can select the proper default.
      However, for other platforms, the administrator must toggle this setting
      manually.
      
      Given this flush setting is a dax-specific mechanism we advertise it
      through a 'dax' attribute group hanging off a host device. For example,
      a 'pmem0' block-device gets a 'dax' sysfs-subdirectory with a
      'write_cache' attribute to control response to dax cache flush requests.
      This is similar to the 'queue/write_cache' attribute that appears under
      block devices.
      
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jeff Moyer <jmoyer@redhat.com>
      Cc: Matthew Wilcox <mawilcox@microsoft.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Suggested-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      6e0c90d6
  19. 28 6月, 2017 1 次提交
  20. 16 6月, 2017 2 次提交
    • D
      dm: add ->flush() dax operation support · abebfbe2
      Dan Williams 提交于
      Allow device-mapper to route flush operations to the
      per-target implementation. In order for the device stacking to work we
      need a dax_dev and a pgoff relative to that device. This gives each
      layer of the stack the information it needs to look up the operation
      pointer for the next level.
      
      This conceptually allows for an array of mixed device drivers with
      varying flush implementations.
      Reviewed-by: NToshi Kani <toshi.kani@hpe.com>
      Reviewed-by: NMike Snitzer <snitzer@redhat.com>
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      abebfbe2
    • D
      dax, pmem: introduce an optional 'flush' dax_operation · 3c1cebff
      Dan Williams 提交于
      Filesystem-DAX flushes caches whenever it writes to the address returned
      through dax_direct_access() and when writing back dirty radix entries.
      That flushing is only required in the pmem case, so add a dax operation
      to allow pmem to take this extra action, but skip it for other dax
      capable devices that do not provide a flush routine.
      
      An example for this differentiation might be a volatile ram disk where
      there is no expectation of persistence. In fact the pmem driver itself might
      front such an address range specified by the NFIT. So, this "no flush"
      property might be something passed down by the bus / libnvdimm.
      
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Matthew Wilcox <mawilcox@microsoft.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      3c1cebff
  21. 10 6月, 2017 2 次提交
    • D
      dm: add ->copy_from_iter() dax operation support · 7e026c8c
      Dan Williams 提交于
      Allow device-mapper to route copy_from_iter operations to the
      per-target implementation. In order for the device stacking to work we
      need a dax_dev and a pgoff relative to that device. This gives each
      layer of the stack the information it needs to look up the operation
      pointer for the next level.
      
      This conceptually allows for an array of mixed device drivers with
      varying copy_from_iter implementations.
      Reviewed-by: NToshi Kani <toshi.kani@hpe.com>
      Reviewed-by: NMike Snitzer <snitzer@redhat.com>
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      7e026c8c
    • D
      x86, uaccess: introduce copy_from_iter_flushcache for pmem / cache-bypass operations · 0aed55af
      Dan Williams 提交于
      The pmem driver has a need to transfer data with a persistent memory
      destination and be able to rely on the fact that the destination writes are not
      cached. It is sufficient for the writes to be flushed to a cpu-store-buffer
      (non-temporal / "movnt" in x86 terms), as we expect userspace to call fsync()
      to ensure data-writes have reached a power-fail-safe zone in the platform. The
      fsync() triggers a REQ_FUA or REQ_FLUSH to the pmem driver which will turn
      around and fence previous writes with an "sfence".
      
      Implement a __copy_from_user_inatomic_flushcache, memcpy_page_flushcache, and
      memcpy_flushcache, that guarantee that the destination buffer is not dirty in
      the cpu cache on completion. The new copy_from_iter_flushcache and sub-routines
      will be used to replace the "pmem api" (include/linux/pmem.h +
      arch/x86/include/asm/pmem.h). The availability of copy_from_iter_flushcache()
      and memcpy_flushcache() are gated by the CONFIG_ARCH_HAS_UACCESS_FLUSHCACHE
      config symbol, and fallback to copy_from_iter_nocache() and plain memcpy()
      otherwise.
      
      This is meant to satisfy the concern from Linus that if a driver wants to do
      something beyond the normal nocache semantics it should be something private to
      that driver [1], and Al's concern that anything uaccess related belongs with
      the rest of the uaccess code [2].
      
      The first consumer of this interface is a new 'copy_from_iter' dax operation so
      that pmem can inject cache maintenance operations without imposing this
      overhead on other dax-capable drivers.
      
      [1]: https://lists.01.org/pipermail/linux-nvdimm/2017-January/008364.html
      [2]: https://lists.01.org/pipermail/linux-nvdimm/2017-April/009942.html
      
      Cc: <x86@kernel.org>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jeff Moyer <jmoyer@redhat.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Toshi Kani <toshi.kani@hpe.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Matthew Wilcox <mawilcox@microsoft.com>
      Reviewed-by: NRoss Zwisler <ross.zwisler@linux.intel.com>
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      0aed55af
  22. 14 5月, 2017 1 次提交
    • D
      dax, xfs, ext4: compile out iomap-dax paths in the FS_DAX=n case · f5705aa8
      Dan Williams 提交于
      Tetsuo reports:
      
        fs/built-in.o: In function `xfs_file_iomap_end':
        xfs_iomap.c:(.text+0xe0ef9): undefined reference to `put_dax'
        fs/built-in.o: In function `xfs_file_iomap_begin':
        xfs_iomap.c:(.text+0xe1a7f): undefined reference to `dax_get_by_host'
        make: *** [vmlinux] Error 1
        $ grep DAX .config
        CONFIG_DAX=m
        # CONFIG_DEV_DAX is not set
        # CONFIG_FS_DAX is not set
      
      When FS_DAX=n we can/must throw away the dax code in filesystems.
      Implement 'fs_' versions of dax_get_by_host() and put_dax() that are
      nops in the FS_DAX=n case.
      
      Cc: <linux-xfs@vger.kernel.org>
      Cc: <linux-ext4@vger.kernel.org>
      Cc: Jan Kara <jack@suse.com>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Tested-by: NTony Luck <tony.luck@intel.com>
      Fixes: ef510424 ("block, dax: move 'select DAX' from BLOCK to FS_DAX")
      Reported-by: NTetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      f5705aa8
  23. 13 5月, 2017 1 次提交
    • R
      dax: prevent invalidation of mapped DAX entries · 4636e70b
      Ross Zwisler 提交于
      Patch series "mm,dax: Fix data corruption due to mmap inconsistency",
      v4.
      
      This series fixes data corruption that can happen for DAX mounts when
      page faults race with write(2) and as a result page tables get out of
      sync with block mappings in the filesystem and thus data seen through
      mmap is different from data seen through read(2).
      
      The series passes testing with t_mmap_stale test program from Ross and
      also other mmap related tests on DAX filesystem.
      
      This patch (of 4):
      
      dax_invalidate_mapping_entry() currently removes DAX exceptional entries
      only if they are clean and unlocked.  This is done via:
      
        invalidate_mapping_pages()
          invalidate_exceptional_entry()
            dax_invalidate_mapping_entry()
      
      However, for page cache pages removed in invalidate_mapping_pages()
      there is an additional criteria which is that the page must not be
      mapped.  This is noted in the comments above invalidate_mapping_pages()
      and is checked in invalidate_inode_page().
      
      For DAX entries this means that we can can end up in a situation where a
      DAX exceptional entry, either a huge zero page or a regular DAX entry,
      could end up mapped but without an associated radix tree entry.  This is
      inconsistent with the rest of the DAX code and with what happens in the
      page cache case.
      
      We aren't able to unmap the DAX exceptional entry because according to
      its comments invalidate_mapping_pages() isn't allowed to block, and
      unmap_mapping_range() takes a write lock on the mapping->i_mmap_rwsem.
      
      Since we essentially never have unmapped DAX entries to evict from the
      radix tree, just remove dax_invalidate_mapping_entry().
      
      Fixes: c6dcf52c ("mm: Invalidate DAX radix tree entries only if appropriate")
      Link: http://lkml.kernel.org/r/20170510085419.27601-2-jack@suse.czSigned-off-by: NRoss Zwisler <ross.zwisler@linux.intel.com>
      Signed-off-by: NJan Kara <jack@suse.cz>
      Reported-by: NJan Kara <jack@suse.cz>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: <stable@vger.kernel.org>    [4.10+]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4636e70b
  24. 09 5月, 2017 1 次提交
    • D
      block, dax: move "select DAX" from BLOCK to FS_DAX · ef510424
      Dan Williams 提交于
      For configurations that do not enable DAX filesystems or drivers, do not
      require the DAX core to be built.
      
      Given that the 'direct_access' method has been removed from
      'block_device_operations', we can also go ahead and remove the
      block-related dax helper functions from fs/block_dev.c to
      drivers/dax/super.c. This keeps dax details out of the block layer and
      lets the DAX core be built as a module in the FS_DAX=n case.
      
      Filesystems need to include dax.h to call bdev_dax_supported().
      
      Cc: linux-xfs@vger.kernel.org
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Matthew Wilcox <mawilcox@microsoft.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Reviewed-by: NJan Kara <jack@suse.com>
      Reported-by: NGeert Uytterhoeven <geert@linux-m68k.org>
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      ef510424
  25. 26 4月, 2017 2 次提交
    • D
      filesystem-dax: convert to dax_direct_access() · cccbce67
      Dan Williams 提交于
      Now that a dax_device is plumbed through all dax-capable drivers we can
      switch from block_device_operations to dax_operations for invoking
      ->direct_access.
      
      This also lets us kill off some usages of struct blk_dax_ctl on the way
      to its eventual removal.
      Suggested-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      cccbce67
    • D
      Revert "block: use DAX for partition table reads" · a41fe02b
      Dan Williams 提交于
      commit d1a5f2b4 ("block: use DAX for partition table reads") was
      part of a stalled effort to allow dax mappings of block devices. Since
      then the device-dax mechanism has filled the role of dax-mapping static
      device ranges.
      
      Now that we are moving ->direct_access() from a block_device operation
      to a dax_inode operation we would need block devices to map and carry
      their own dax_inode reference.
      
      Unless / until we decide to revive dax mapping of raw block devices
      through the dax_inode scheme, there is no need to carry
      read_dax_sector(). Its removal in turn allows for the removal of
      bdev_direct_access() and should have been included in commit
      22375701 ("block_dev: remove DAX leftovers").
      
      Cc: Jeff Moyer <jmoyer@redhat.com>
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      a41fe02b
  26. 21 4月, 2017 1 次提交
    • D
      dax: introduce dax_direct_access() · b0686260
      Dan Williams 提交于
      Replace bdev_direct_access() with dax_direct_access() that uses
      dax_device and dax_operations instead of a block_device and
      block_device_operations for dax. Once all consumers of the old api have
      been converted bdev_direct_access() will be deleted.
      
      Given that block device partitioning decisions can cause dax page
      alignment constraints to be violated this also introduces the
      bdev_dax_pgoff() helper. It handles calculating a logical pgoff relative
      to the dax_device and also checks for page alignment.
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      b0686260
  27. 20 4月, 2017 3 次提交
    • D
      pmem: add dax_operations support · c1d6e828
      Dan Williams 提交于
      Setup a dax_device to have the same lifetime as the pmem block device
      and add a ->direct_access() method that is equivalent to
      pmem_direct_access(). Once fs/dax.c has been converted to use
      dax_operations the old pmem_direct_access() will be removed.
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      c1d6e828
    • D
      dax: introduce dax_operations · 6568b08b
      Dan Williams 提交于
      Track a set of dax_operations per dax_device that can be set at
      alloc_dax() time. These operations will be used to stop the abuse of
      block_device_operations for communicating dax capabilities to
      filesystems. It will also be used to replace the "pmem api" and move
      pmem-specific cache maintenance, and other dax-driver-specific
      filesystem-dax operations, to dax device methods. In particular this
      allows us to stop abusing __copy_user_nocache(), via memcpy_to_pmem(),
      with a driver specific replacement.
      
      This is a standalone introduction of the operations. Follow on patches
      convert each dax-driver and teach fs/dax.c to use ->direct_access() from
      dax_operations instead of block_device_operations.
      Suggested-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      6568b08b
    • D
      dax: add a facility to lookup a dax device by 'host' device name · 72058005
      Dan Williams 提交于
      For the current block_device based filesystem-dax path, we need a way
      for it to lookup the dax_device associated with a block_device. Add a
      'host' property of a dax_device that can be used for this purpose. It is
      a free form string, but for a dax_device associated with a block device
      it is the bdev name.
      
      This is a stop-gap until filesystems are able to mount on a dax-inode
      directly.
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      72058005
  28. 13 4月, 2017 1 次提交
    • D
      dax: refactor dax-fs into a generic provider of 'struct dax_device' instances · 7b6be844
      Dan Williams 提交于
      We want dax capable drivers to be able to publish a set of dax
      operations [1]. However, we do not want to further abuse block_devices
      to advertise these operations. Instead we will attach these operations
      to a dax device and add a lookup mechanism to go from block device path
      to a dax device. A dax capable driver like pmem or brd is responsible
      for registering a dax device, alongside a block device, and then a dax
      capable filesystem is responsible for retrieving the dax device by path
      name if it wants to call dax_operations.
      
      For now, we refactor the dax pseudo-fs to be a generic facility, rather
      than an implementation detail, of the device-dax use case. Where a "dax
      device" is just an inode + dax infrastructure, and "Device DAX" is a
      mapping service layered on top of that base 'struct dax_device'.
      "Filesystem DAX" is then a mapping service that layers a filesystem on
      top of that same base device. Filesystem DAX is associated with a
      block_device for now, but perhaps directly to a dax device in the
      future, or for new pmem-only filesystems.
      
      [1]: https://lkml.org/lkml/2017/1/19/880Suggested-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      7b6be844
  29. 25 2月, 2017 2 次提交
    • D
      mm: replace FAULT_FLAG_SIZE with parameter to huge_fault · c791ace1
      Dave Jiang 提交于
      Since the introduction of FAULT_FLAG_SIZE to the vm_fault flag, it has
      been somewhat painful with getting the flags set and removed at the
      correct locations.  More than one kernel oops was introduced due to
      difficulties of getting the placement correctly.
      
      Remove the flag values and introduce an input parameter to huge_fault
      that indicates the size of the page entry.  This makes the code easier
      to trace and should avoid the issues we see with the fault flags where
      removal of the flag was necessary in the fallback paths.
      
      Link: http://lkml.kernel.org/r/148615748258.43180.1690152053774975329.stgit@djiang5-desk3.ch.intel.comSigned-off-by: NDave Jiang <dave.jiang@intel.com>
      Tested-by: NDan Williams <dan.j.williams@intel.com>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Cc: Matthew Wilcox <mawilcox@microsoft.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Nilesh Choudhury <nilesh.choudhury@oracle.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c791ace1
    • D
      mm,fs,dax: change ->pmd_fault to ->huge_fault · a2d58167
      Dave Jiang 提交于
      Patch series "1G transparent hugepage support for device dax", v2.
      
      The following series implements support for 1G trasparent hugepage on
      x86 for device dax.  The bulk of the code was written by Mathew Wilcox a
      while back supporting transparent 1G hugepage for fs DAX.  I have
      forward ported the relevant bits to 4.10-rc.  The current submission has
      only the necessary code to support device DAX.
      
      Comments from Dan Williams: So the motivation and intended user of this
      functionality mirrors the motivation and users of 1GB page support in
      hugetlbfs.  Given expected capacities of persistent memory devices an
      in-memory database may want to reduce tlb pressure beyond what they can
      already achieve with 2MB mappings of a device-dax file.  We have
      customer feedback to that effect as Willy mentioned in his previous
      version of these patches [1].
      
      [1]: https://lkml.org/lkml/2016/1/31/52
      
      Comments from Nilesh @ Oracle:
      
      There are applications which have a process model; and if you assume
      10,000 processes attempting to mmap all the 6TB memory available on a
      server; we are looking at the following:
      
      processes         : 10,000
      memory            :    6TB
      pte @ 4k page size: 8 bytes / 4K of memory * #processes = 6TB / 4k * 8 * 10000 = 1.5GB * 80000 = 120,000GB
      pmd @ 2M page size: 120,000 / 512 = ~240GB
      pud @ 1G page size: 240GB / 512 = ~480MB
      
      As you can see with 2M pages, this system will use up an exorbitant
      amount of DRAM to hold the page tables; but the 1G pages finally brings
      it down to a reasonable level.  Memory sizes will keep increasing; so
      this number will keep increasing.
      
      An argument can be made to convert the applications from process model
      to thread model, but in the real world that may not be always practical.
      Hopefully this helps explain the use case where this is valuable.
      
      This patch (of 3):
      
      In preparation for adding the ability to handle PUD pages, convert
      vm_operations_struct.pmd_fault to vm_operations_struct.huge_fault.  The
      vm_fault structure is extended to include a union of the different page
      table pointers that may be needed, and three flag bits are reserved to
      indicate which type of pointer is in the union.
      
      [ross.zwisler@linux.intel.com: remove unused function ext4_dax_huge_fault()]
        Link: http://lkml.kernel.org/r/1485813172-7284-1-git-send-email-ross.zwisler@linux.intel.com
      [dave.jiang@intel.com: clear PMD or PUD size flags when in fall through path]
        Link: http://lkml.kernel.org/r/148589842696.5820.16078080610311444794.stgit@djiang5-desk3.ch.intel.com
      Link: http://lkml.kernel.org/r/148545058784.17912.6353162518188733642.stgit@djiang5-desk3.ch.intel.comSigned-off-by: NMatthew Wilcox <mawilcox@microsoft.com>
      Signed-off-by: NDave Jiang <dave.jiang@intel.com>
      Signed-off-by: NRoss Zwisler <ross.zwisler@linux.intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Jan Kara <jack@suse.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Nilesh Choudhury <nilesh.choudhury@oracle.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a2d58167