1. 29 11月, 2018 2 次提交
  2. 19 11月, 2018 1 次提交
  3. 18 11月, 2018 2 次提交
    • M
      dax: Fix huge page faults · 0e40de03
      Matthew Wilcox 提交于
      Using xas_load() with a PMD-sized xa_state would work if either a
      PMD-sized entry was present or a PTE sized entry was present in the
      first 64 entries (of the 512 PTEs in a PMD on x86).  If there was no
      PTE in the first 64 entries, grab_mapping_entry() would believe there
      were no entries present, allocate a PMD-sized entry and overwrite the
      PTE in the page cache.
      
      Use xas_find_conflict() instead which turns out to simplify
      both get_unlocked_entry() and grab_mapping_entry().  Also remove a
      WARN_ON_ONCE from grab_mapping_entry() as it will have already triggered
      in get_unlocked_entry().
      
      Fixes: cfc93c6c ("dax: Convert dax_insert_pfn_mkwrite to XArray")
      Signed-off-by: NMatthew Wilcox <willy@infradead.org>
      0e40de03
    • M
      dax: Fix dax_unlock_mapping_entry for PMD pages · fda490d3
      Matthew Wilcox 提交于
      Device DAX PMD pages do not set the PageHead bit for compound pages.
      Fix for now by retrieving the PMD bit from the entry, but eventually we
      will be passed the page size by the caller.
      Reported-by: NDan Williams <dan.j.williams@intel.com>
      Fixes: 9f32d221 ("dax: Convert dax_lock_mapping_entry to XArray")
      Signed-off-by: NMatthew Wilcox <willy@infradead.org>
      fda490d3
  4. 17 11月, 2018 3 次提交
  5. 21 10月, 2018 8 次提交
  6. 09 10月, 2018 1 次提交
    • D
      filesystem-dax: Fix dax_layout_busy_page() livelock · d7782145
      Dan Williams 提交于
      In the presence of multi-order entries the typical
      pagevec_lookup_entries() pattern may loop forever:
      
      	while (index < end && pagevec_lookup_entries(&pvec, mapping, index,
      				min(end - index, (pgoff_t)PAGEVEC_SIZE),
      				indices)) {
      		...
      		for (i = 0; i < pagevec_count(&pvec); i++) {
      			index = indices[i];
      			...
      		}
      		index++; /* BUG */
      	}
      
      The loop updates 'index' for each index found and then increments to the
      next possible page to continue the lookup. However, if the last entry in
      the pagevec is multi-order then the next possible page index is more
      than 1 page away. Fix this locally for the filesystem-dax case by
      checking for dax-multi-order entries. Going forward new users of
      multi-order entries need to be similarly careful, or we need a generic
      way to report the page increment in the radix iterator.
      
      Fixes: 5fac7408 ("mm, fs, dax: handle layout changes to pinned dax...")
      Cc: <stable@vger.kernel.org>
      Cc: Ross Zwisler <zwisler@kernel.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      d7782145
  7. 30 9月, 2018 1 次提交
    • M
      xarray: Replace exceptional entries · 3159f943
      Matthew Wilcox 提交于
      Introduce xarray value entries and tagged pointers to replace radix
      tree exceptional entries.  This is a slight change in encoding to allow
      the use of an extra bit (we can now store BITS_PER_LONG - 1 bits in a
      value entry).  It is also a change in emphasis; exceptional entries are
      intimidating and different.  As the comment explains, you can choose
      to store values or pointers in the xarray and they are both first-class
      citizens.
      Signed-off-by: NMatthew Wilcox <willy@infradead.org>
      Reviewed-by: NJosef Bacik <jbacik@fb.com>
      3159f943
  8. 28 9月, 2018 1 次提交
  9. 12 9月, 2018 1 次提交
  10. 31 7月, 2018 1 次提交
  11. 30 7月, 2018 1 次提交
  12. 24 7月, 2018 1 次提交
    • D
      filesystem-dax: Introduce dax_lock_mapping_entry() · c2a7d2a1
      Dan Williams 提交于
      In preparation for implementing support for memory poison (media error)
      handling via dax mappings, implement a lock_page() equivalent. Poison
      error handling requires rmap and needs guarantees that the page->mapping
      association is maintained / valid (inode not freed) for the duration of
      the lookup.
      
      In the device-dax case it is sufficient to simply hold a dev_pagemap
      reference. In the filesystem-dax case we need to use the entry lock.
      
      Export the entry lock via dax_lock_mapping_entry() that uses
      rcu_read_lock() to protect against the inode being freed, and
      revalidates the page->mapping association under xa_lock().
      
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Jan Kara <jack@suse.cz>
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      Signed-off-by: NDave Jiang <dave.jiang@intel.com>
      c2a7d2a1
  13. 21 7月, 2018 1 次提交
    • D
      filesystem-dax: Set page->index · 73449daf
      Dan Williams 提交于
      In support of enabling memory_failure() handling for filesystem-dax
      mappings, set ->index to the pgoff of the page. The rmap implementation
      requires ->index to bound the search through the vma interval tree. The
      index is set and cleared at dax_associate_entry() and
      dax_disassociate_entry() time respectively.
      
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      Signed-off-by: NDave Jiang <dave.jiang@intel.com>
      73449daf
  14. 08 6月, 2018 1 次提交
  15. 03 6月, 2018 1 次提交
  16. 23 5月, 2018 2 次提交
    • D
      dax: Report bytes remaining in dax_iomap_actor() · a77d4786
      Dan Williams 提交于
      In preparation for protecting the dax read(2) path from media errors
      with copy_to_iter_mcsafe() (via dax_copy_to_iter()), convert the
      implementation to report the bytes successfully transferred.
      
      Cc: <x86@kernel.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      a77d4786
    • D
      dax: Introduce a ->copy_to_iter dax operation · b3a9a0c3
      Dan Williams 提交于
      Similar to the ->copy_from_iter() operation, a platform may want to
      deploy an architecture or device specific routine for handling reads
      from a dax_device like /dev/pmemX. On x86 this routine will point to a
      machine check safe version of copy_to_iter(). For now, add the plumbing
      to device-mapper and the dax core.
      
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Mike Snitzer <snitzer@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      b3a9a0c3
  17. 22 5月, 2018 1 次提交
    • D
      mm, fs, dax: handle layout changes to pinned dax mappings · 5fac7408
      Dan Williams 提交于
      Background:
      
      get_user_pages() in the filesystem pins file backed memory pages for
      access by devices performing dma. However, it only pins the memory pages
      not the page-to-file offset association. If a file is truncated the
      pages are mapped out of the file and dma may continue indefinitely into
      a page that is owned by a device driver. This breaks coherency of the
      file vs dma, but the assumption is that if userspace wants the
      file-space truncated it does not matter what data is inbound from the
      device, it is not relevant anymore. The only expectation is that dma can
      safely continue while the filesystem reallocates the block(s).
      
      Problem:
      
      This expectation that dma can safely continue while the filesystem
      changes the block map is broken by dax. With dax the target dma page
      *is* the filesystem block. The model of leaving the page pinned for dma,
      but truncating the file block out of the file, means that the filesytem
      is free to reallocate a block under active dma to another file and now
      the expected data-incoherency situation has turned into active
      data-corruption.
      
      Solution:
      
      Defer all filesystem operations (fallocate(), truncate()) on a dax mode
      file while any page/block in the file is under active dma. This solution
      assumes that dma is transient. Cases where dma operations are known to
      not be transient, like RDMA, have been explicitly disabled via
      commits like 5f1d43de "IB/core: disable memory registration of
      filesystem-dax vmas".
      
      The dax_layout_busy_page() routine is called by filesystems with a lock
      held against mm faults (i_mmap_lock) to find pinned / busy dax pages.
      The process of looking up a busy page invalidates all mappings
      to trigger any subsequent get_user_pages() to block on i_mmap_lock.
      The filesystem continues to call dax_layout_busy_page() until it finally
      returns no more active pages. This approach assumes that the page
      pinning is transient, if that assumption is violated the system would
      have likely hung from the uncompleted I/O.
      
      Cc: Jeff Moyer <jmoyer@redhat.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Matthew Wilcox <mawilcox@microsoft.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Reported-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      5fac7408
  18. 17 4月, 2018 1 次提交
  19. 12 4月, 2018 1 次提交
  20. 03 4月, 2018 1 次提交
    • D
      fs, dax: use page->mapping to warn if truncate collides with a busy page · d2c997c0
      Dan Williams 提交于
      Catch cases where extent unmap operations encounter pages that are
      pinned / busy. Typically this is pinned pages that are under active dma.
      This warning is a canary for potential data corruption as truncated
      blocks could be allocated to a new file while the device is still
      performing i/o.
      
      Here is an example of a collision that this implementation catches:
      
       WARNING: CPU: 2 PID: 1286 at fs/dax.c:343 dax_disassociate_entry+0x55/0x80
       [..]
       Call Trace:
        __dax_invalidate_mapping_entry+0x6c/0xf0
        dax_delete_mapping_entry+0xf/0x20
        truncate_exceptional_pvec_entries.part.12+0x1af/0x200
        truncate_inode_pages_range+0x268/0x970
        ? tlb_gather_mmu+0x10/0x20
        ? up_write+0x1c/0x40
        ? unmap_mapping_range+0x73/0x140
        xfs_free_file_space+0x1b6/0x5b0 [xfs]
        ? xfs_file_fallocate+0x7f/0x320 [xfs]
        ? down_write_nested+0x40/0x70
        ? xfs_ilock+0x21d/0x2f0 [xfs]
        xfs_file_fallocate+0x162/0x320 [xfs]
        ? rcu_read_lock_sched_held+0x3f/0x70
        ? rcu_sync_lockdep_assert+0x2a/0x50
        ? __sb_start_write+0xd0/0x1b0
        ? vfs_fallocate+0x20c/0x270
        vfs_fallocate+0x154/0x270
        SyS_fallocate+0x43/0x80
        entry_SYSCALL_64_fastpath+0x1f/0x96
      
      Cc: Jeff Moyer <jmoyer@redhat.com>
      Cc: Matthew Wilcox <mawilcox@microsoft.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      d2c997c0
  21. 31 3月, 2018 1 次提交
  22. 01 2月, 2018 2 次提交
  23. 08 1月, 2018 1 次提交
  24. 16 12月, 2017 1 次提交
    • L
      Revert "mm: replace p??_write with pte_access_permitted in fault + gup paths" · f6f37321
      Linus Torvalds 提交于
      This reverts commits 5c9d2d5c, c7da82b8, and e7fe7b5c.
      
      We'll probably need to revisit this, but basically we should not
      complicate the get_user_pages_fast() case, and checking the actual page
      table protection key bits will require more care anyway, since the
      protection keys depend on the exact state of the VM in question.
      
      Particularly when doing a "remote" page lookup (ie in somebody elses VM,
      not your own), you need to be much more careful than this was.  Dave
      Hansen says:
      
       "So, the underlying bug here is that we now a get_user_pages_remote()
        and then go ahead and do the p*_access_permitted() checks against the
        current PKRU. This was introduced recently with the addition of the
        new p??_access_permitted() calls.
      
        We have checks in the VMA path for the "remote" gups and we avoid
        consulting PKRU for them. This got missed in the pkeys selftests
        because I did a ptrace read, but not a *write*. I also didn't
        explicitly test it against something where a COW needed to be done"
      
      It's also not entirely clear that it makes sense to check the protection
      key bits at this level at all.  But one possible eventual solution is to
      make the get_user_pages_fast() case just abort if it sees protection key
      bits set, which makes us fall back to the regular get_user_pages() case,
      which then has a vma and can do the check there if we want to.
      
      We'll see.
      
      Somewhat related to this all: what we _do_ want to do some day is to
      check the PAGE_USER bit - it should obviously always be set for user
      pages, but it would be a good check to have back.  Because we have no
      generic way to test for it, we lost it as part of moving over from the
      architecture-specific x86 GUP implementation to the generic one in
      commit e585513b ("x86/mm/gup: Switch GUP to the generic
      get_user_page_fast() implementation").
      
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: "Jérôme Glisse" <jglisse@redhat.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f6f37321
  25. 30 11月, 2017 1 次提交
  26. 16 11月, 2017 2 次提交
    • M
      mm, pagevec: remove cold parameter for pagevecs · 86679820
      Mel Gorman 提交于
      Every pagevec_init user claims the pages being released are hot even in
      cases where it is unlikely the pages are hot.  As no one cares about the
      hotness of pages being released to the allocator, just ditch the
      parameter.
      
      No performance impact is expected as the overhead is marginal.  The
      parameter is removed simply because it is a bit stupid to have a useless
      parameter copied everywhere.
      
      Link: http://lkml.kernel.org/r/20171018075952.10627-6-mgorman@techsingularity.netSigned-off-by: NMel Gorman <mgorman@techsingularity.net>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      86679820
    • M
      mm, truncate: do not check mapping for every page being truncated · c7df8ad2
      Mel Gorman 提交于
      During truncation, the mapping has already been checked for shmem and
      dax so it's known that workingset_update_node is required.
      
      This patch avoids the checks on mapping for each page being truncated.
      In all other cases, a lookup helper is used to determine if
      workingset_update_node() needs to be called.  The one danger is that the
      API is slightly harder to use as calling workingset_update_node directly
      without checking for dax or shmem mappings could lead to surprises.
      However, the API rarely needs to be used and hopefully the comment is
      enough to give people the hint.
      
      sparsetruncate (tiny)
                                    4.14.0-rc4             4.14.0-rc4
                                   oneirq-v1r1        pickhelper-v1r1
      Min          Time      141.00 (   0.00%)      140.00 (   0.71%)
      1st-qrtle    Time      142.00 (   0.00%)      141.00 (   0.70%)
      2nd-qrtle    Time      142.00 (   0.00%)      142.00 (   0.00%)
      3rd-qrtle    Time      143.00 (   0.00%)      143.00 (   0.00%)
      Max-90%      Time      144.00 (   0.00%)      144.00 (   0.00%)
      Max-95%      Time      147.00 (   0.00%)      145.00 (   1.36%)
      Max-99%      Time      195.00 (   0.00%)      191.00 (   2.05%)
      Max          Time      230.00 (   0.00%)      205.00 (  10.87%)
      Amean        Time      144.37 (   0.00%)      143.82 (   0.38%)
      Stddev       Time       10.44 (   0.00%)        9.00 (  13.74%)
      Coeff        Time        7.23 (   0.00%)        6.26 (  13.41%)
      Best99%Amean Time      143.72 (   0.00%)      143.34 (   0.26%)
      Best95%Amean Time      142.37 (   0.00%)      142.00 (   0.26%)
      Best90%Amean Time      142.19 (   0.00%)      141.85 (   0.24%)
      Best75%Amean Time      141.92 (   0.00%)      141.58 (   0.24%)
      Best50%Amean Time      141.69 (   0.00%)      141.31 (   0.27%)
      Best25%Amean Time      141.38 (   0.00%)      140.97 (   0.29%)
      
      As you'd expect, the gain is marginal but it can be detected.  The
      differences in bonnie are all within the noise which is not surprising
      given the impact on the microbenchmark.
      
      radix_tree_update_node_t is a callback for some radix operations that
      optionally passes in a private field.  The only user of the callback is
      workingset_update_node and as it no longer requires a mapping, the
      private field is removed.
      
      Link: http://lkml.kernel.org/r/20171018075952.10627-3-mgorman@techsingularity.netSigned-off-by: NMel Gorman <mgorman@techsingularity.net>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c7df8ad2