- 21 10月, 2018 4 次提交
-
-
由 Matthew Wilcox 提交于
Instead of using a pagevec, just use the XArray iterators. Add a conditional rescheduling point which probably should have been there in the original. Signed-off-by: NMatthew Wilcox <willy@infradead.org>
-
由 Matthew Wilcox 提交于
Add some XArray-based helper functions to replace the radix tree based metaphors currently in use. The biggest change is that converted code doesn't see its own lock bit; get_unlocked_entry() always returns an entry with the lock bit clear. So we don't have to mess around loading the current entry and clearing the lock bit; we can just store the unlocked entry that we already have. Signed-off-by: NMatthew Wilcox <willy@infradead.org>
-
由 Matthew Wilcox 提交于
Since the XArray is embedded in the struct address_space, its address contains exactly as much entropy as the address of the mapping. This patch is purely preparatory for later patches which will simplify the wait/wake interfaces. Signed-off-by: NMatthew Wilcox <willy@infradead.org>
-
由 Matthew Wilcox 提交于
Remove mentions of 'radix' and 'radix tree'. Simplify some names by dropping the word 'mapping'. Signed-off-by: NMatthew Wilcox <willy@infradead.org>
-
- 30 9月, 2018 1 次提交
-
-
由 Matthew Wilcox 提交于
Introduce xarray value entries and tagged pointers to replace radix tree exceptional entries. This is a slight change in encoding to allow the use of an extra bit (we can now store BITS_PER_LONG - 1 bits in a value entry). It is also a change in emphasis; exceptional entries are intimidating and different. As the comment explains, you can choose to store values or pointers in the xarray and they are both first-class citizens. Signed-off-by: NMatthew Wilcox <willy@infradead.org> Reviewed-by: NJosef Bacik <jbacik@fb.com>
-
- 12 9月, 2018 1 次提交
-
-
由 Matthew Wilcox 提交于
Use my_zero_pfn instead of ZERO_PAGE(), and pass the vaddr to it instead of zero so it works on MIPS and s390 who reference the vaddr to select a zero page. Cc: <stable@vger.kernel.org> Fixes: 91d25ba8 ("dax: use common 4k zero page for dax mmap reads") Signed-off-by: NMatthew Wilcox <willy@infradead.org> Reviewed-by: NRoss Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: NDan Williams <dan.j.williams@intel.com>
-
- 31 7月, 2018 1 次提交
-
-
由 Huaisheng Ye 提交于
Some functions within fs/dax don't need to get local pointer kaddr or variable pfn from direct_access. Using NULL instead of having to pass in useless pointer or variable that caller then just throw away. Signed-off-by: NHuaisheng Ye <yehs1@lenovo.com> Reviewed-by: NRoss Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: NDave Jiang <dave.jiang@intel.com>
-
- 30 7月, 2018 1 次提交
-
-
由 Ross Zwisler 提交于
Inodes using DAX should only ever have exceptional entries in their page caches. Make this clear by warning if the iteration in dax_layout_busy_page() ever sees a non-exceptional entry, and by adding a comment for the pagevec_release() call which only deals with struct page pointers. Signed-off-by: NRoss Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: NTheodore Ts'o <tytso@mit.edu> Reviewed-by: NJan Kara <jack@suse.cz>
-
- 24 7月, 2018 1 次提交
-
-
由 Dan Williams 提交于
In preparation for implementing support for memory poison (media error) handling via dax mappings, implement a lock_page() equivalent. Poison error handling requires rmap and needs guarantees that the page->mapping association is maintained / valid (inode not freed) for the duration of the lookup. In the device-dax case it is sufficient to simply hold a dev_pagemap reference. In the filesystem-dax case we need to use the entry lock. Export the entry lock via dax_lock_mapping_entry() that uses rcu_read_lock() to protect against the inode being freed, and revalidates the page->mapping association under xa_lock(). Cc: Christoph Hellwig <hch@lst.de> Cc: Matthew Wilcox <willy@infradead.org> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Jan Kara <jack@suse.cz> Signed-off-by: NDan Williams <dan.j.williams@intel.com> Signed-off-by: NDave Jiang <dave.jiang@intel.com>
-
- 21 7月, 2018 1 次提交
-
-
由 Dan Williams 提交于
In support of enabling memory_failure() handling for filesystem-dax mappings, set ->index to the pgoff of the page. The rmap implementation requires ->index to bound the search through the vma interval tree. The index is set and cleared at dax_associate_entry() and dax_disassociate_entry() time respectively. Cc: Christoph Hellwig <hch@lst.de> Cc: Matthew Wilcox <willy@infradead.org> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Reviewed-by: NJan Kara <jack@suse.cz> Signed-off-by: NDan Williams <dan.j.williams@intel.com> Signed-off-by: NDave Jiang <dave.jiang@intel.com>
-
- 08 6月, 2018 1 次提交
-
-
由 Souptick Joarder 提交于
Use new return type vm_fault_t for fault handler. For now, this is just documenting that the function returns a VM_FAULT value rather than an errno. Once all instances are converted, vm_fault_t will become a distinct type. commit 1c8f4220 ("mm: change return type to vm_fault_t") There was an existing bug inside dax_load_hole() if vm_insert_mixed had failed to allocate a page table, we'd return VM_FAULT_NOPAGE instead of VM_FAULT_OOM. With new vmf_insert_mixed() this issue is addressed. vm_insert_mixed_mkwrite has inefficiency when it returns an error value, driver has to convert it to vm_fault_t type. With new vmf_insert_mixed_mkwrite() this limitation will be addressed. Link: http://lkml.kernel.org/r/20180510181121.GA15239@jordon-HP-15-Notebook-PCSigned-off-by: NSouptick Joarder <jrdr.linux@gmail.com> Reviewed-by: NJan Kara <jack@suse.cz> Reviewed-by: NMatthew Wilcox <mawilcox@microsoft.com> Reviewed-by: NRoss Zwisler <ross.zwisler@linux.intel.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Michal Hocko <mhocko@suse.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 03 6月, 2018 1 次提交
-
-
由 Matthew Wilcox 提交于
It does not return an error, so we don't need to check the return value for IS_ERR(). Indeed, it is a bug to do so; with a sufficiently large PFN, a legitimate DAX entry may be mistaken for an error return. Signed-off-by: NMatthew Wilcox <mawilcox@microsoft.com> Signed-off-by: NDan Williams <dan.j.williams@intel.com>
-
- 23 5月, 2018 2 次提交
-
-
由 Dan Williams 提交于
In preparation for protecting the dax read(2) path from media errors with copy_to_iter_mcsafe() (via dax_copy_to_iter()), convert the implementation to report the bytes successfully transferred. Cc: <x86@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Tony Luck <tony.luck@intel.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: NDan Williams <dan.j.williams@intel.com>
-
由 Dan Williams 提交于
Similar to the ->copy_from_iter() operation, a platform may want to deploy an architecture or device specific routine for handling reads from a dax_device like /dev/pmemX. On x86 this routine will point to a machine check safe version of copy_to_iter(). For now, add the plumbing to device-mapper and the dax core. Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Mike Snitzer <snitzer@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: NDan Williams <dan.j.williams@intel.com>
-
- 22 5月, 2018 1 次提交
-
-
由 Dan Williams 提交于
Background: get_user_pages() in the filesystem pins file backed memory pages for access by devices performing dma. However, it only pins the memory pages not the page-to-file offset association. If a file is truncated the pages are mapped out of the file and dma may continue indefinitely into a page that is owned by a device driver. This breaks coherency of the file vs dma, but the assumption is that if userspace wants the file-space truncated it does not matter what data is inbound from the device, it is not relevant anymore. The only expectation is that dma can safely continue while the filesystem reallocates the block(s). Problem: This expectation that dma can safely continue while the filesystem changes the block map is broken by dax. With dax the target dma page *is* the filesystem block. The model of leaving the page pinned for dma, but truncating the file block out of the file, means that the filesytem is free to reallocate a block under active dma to another file and now the expected data-incoherency situation has turned into active data-corruption. Solution: Defer all filesystem operations (fallocate(), truncate()) on a dax mode file while any page/block in the file is under active dma. This solution assumes that dma is transient. Cases where dma operations are known to not be transient, like RDMA, have been explicitly disabled via commits like 5f1d43de "IB/core: disable memory registration of filesystem-dax vmas". The dax_layout_busy_page() routine is called by filesystems with a lock held against mm faults (i_mmap_lock) to find pinned / busy dax pages. The process of looking up a busy page invalidates all mappings to trigger any subsequent get_user_pages() to block on i_mmap_lock. The filesystem continues to call dax_layout_busy_page() until it finally returns no more active pages. This approach assumes that the page pinning is transient, if that assumption is violated the system would have likely hung from the uncompleted I/O. Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Matthew Wilcox <mawilcox@microsoft.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: "Darrick J. Wong" <darrick.wong@oracle.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Andrew Morton <akpm@linux-foundation.org> Reported-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NJan Kara <jack@suse.cz> Signed-off-by: NDan Williams <dan.j.williams@intel.com>
-
- 17 4月, 2018 1 次提交
-
-
由 Mike Rapoport 提交于
Signed-off-by: NMike Rapoport <rppt@linux.vnet.ibm.com> Signed-off-by: NJonathan Corbet <corbet@lwn.net>
-
- 12 4月, 2018 1 次提交
-
-
由 Matthew Wilcox 提交于
Remove the address_space ->tree_lock and use the xa_lock newly added to the radix_tree_root. Rename the address_space ->page_tree to ->i_pages, since we don't really care that it's a tree. [willy@infradead.org: fix nds32, fs/dax.c] Link: http://lkml.kernel.org/r/20180406145415.GB20605@bombadil.infradead.orgLink: http://lkml.kernel.org/r/20180313132639.17387-9-willy@infradead.orgSigned-off-by: NMatthew Wilcox <mawilcox@microsoft.com> Acked-by: NJeff Layton <jlayton@redhat.com> Cc: Darrick J. Wong <darrick.wong@oracle.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp> Cc: Will Deacon <will.deacon@arm.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 03 4月, 2018 1 次提交
-
-
由 Dan Williams 提交于
Catch cases where extent unmap operations encounter pages that are pinned / busy. Typically this is pinned pages that are under active dma. This warning is a canary for potential data corruption as truncated blocks could be allocated to a new file while the device is still performing i/o. Here is an example of a collision that this implementation catches: WARNING: CPU: 2 PID: 1286 at fs/dax.c:343 dax_disassociate_entry+0x55/0x80 [..] Call Trace: __dax_invalidate_mapping_entry+0x6c/0xf0 dax_delete_mapping_entry+0xf/0x20 truncate_exceptional_pvec_entries.part.12+0x1af/0x200 truncate_inode_pages_range+0x268/0x970 ? tlb_gather_mmu+0x10/0x20 ? up_write+0x1c/0x40 ? unmap_mapping_range+0x73/0x140 xfs_free_file_space+0x1b6/0x5b0 [xfs] ? xfs_file_fallocate+0x7f/0x320 [xfs] ? down_write_nested+0x40/0x70 ? xfs_ilock+0x21d/0x2f0 [xfs] xfs_file_fallocate+0x162/0x320 [xfs] ? rcu_read_lock_sched_held+0x3f/0x70 ? rcu_sync_lockdep_assert+0x2a/0x50 ? __sb_start_write+0xd0/0x1b0 ? vfs_fallocate+0x20c/0x270 vfs_fallocate+0x154/0x270 SyS_fallocate+0x43/0x80 entry_SYSCALL_64_fastpath+0x1f/0x96 Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Matthew Wilcox <mawilcox@microsoft.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Reviewed-by: NJan Kara <jack@suse.cz> Reviewed-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NDan Williams <dan.j.williams@intel.com>
-
- 31 3月, 2018 1 次提交
-
-
由 Dan Williams 提交于
In preparation for examining the busy state of dax pages in the truncate path, switch from sectors to pfns in the radix. Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Matthew Wilcox <mawilcox@microsoft.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Reviewed-by: NJan Kara <jack@suse.cz> Signed-off-by: NDan Williams <dan.j.williams@intel.com>
-
- 01 2月, 2018 2 次提交
-
-
由 Matthew Wilcox 提交于
Several users of unmap_mapping_range() would prefer to express their range in pages rather than bytes. Unfortuately, on a 32-bit kernel, you have to remember to cast your page number to a 64-bit type before shifting it, and four places in the current tree didn't remember to do that. That's a sign of a bad interface. Conveniently, unmap_mapping_range() actually converts from bytes into pages, so hoist the guts of unmap_mapping_range() into a new function unmap_mapping_pages() and convert the callers which want to use pages. Link: http://lkml.kernel.org/r/20171206142627.GD32044@bombadil.infradead.orgSigned-off-by: NMatthew Wilcox <mawilcox@microsoft.com> Reported-by: N"zhangyi (F)" <yi.zhang@huawei.com> Reviewed-by: NRoss Zwisler <ross.zwisler@linux.intel.com> Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Jan H. Schönherr 提交于
follow_pte_pmd() can theoretically return after having acquired a PMD lock, even when DAX was not compiled with CONFIG_FS_DAX_PMD. Release the PMD lock unconditionally. Link: http://lkml.kernel.org/r/20180118133839.20587-1-jschoenh@amazon.de Fixes: f729c8c9 ("dax: wrprotect pmd_t in dax_mapping_entry_mkclean") Signed-off-by: NJan H. Schönherr <jschoenh@amazon.de> Reviewed-by: NRoss Zwisler <ross.zwisler@linux.intel.com> Reviewed-by: NAndrew Morton <akpm@linux-foundation.org> Cc: Matthew Wilcox <mawilcox@microsoft.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 08 1月, 2018 1 次提交
-
-
由 Jan Kara 提交于
Ext4 needs to pass through error from its iomap handler to the page fault handler so that it can properly detect ENOSPC and force transaction commit and retry the fault (and block allocation). Add argument to dax_iomap_fault() for passing such error. Reviewed-by: NRoss Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: NJan Kara <jack@suse.cz> Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
-
- 16 12月, 2017 1 次提交
-
-
由 Linus Torvalds 提交于
This reverts commits 5c9d2d5c, c7da82b8, and e7fe7b5c. We'll probably need to revisit this, but basically we should not complicate the get_user_pages_fast() case, and checking the actual page table protection key bits will require more care anyway, since the protection keys depend on the exact state of the VM in question. Particularly when doing a "remote" page lookup (ie in somebody elses VM, not your own), you need to be much more careful than this was. Dave Hansen says: "So, the underlying bug here is that we now a get_user_pages_remote() and then go ahead and do the p*_access_permitted() checks against the current PKRU. This was introduced recently with the addition of the new p??_access_permitted() calls. We have checks in the VMA path for the "remote" gups and we avoid consulting PKRU for them. This got missed in the pkeys selftests because I did a ptrace read, but not a *write*. I also didn't explicitly test it against something where a COW needed to be done" It's also not entirely clear that it makes sense to check the protection key bits at this level at all. But one possible eventual solution is to make the get_user_pages_fast() case just abort if it sees protection key bits set, which makes us fall back to the regular get_user_pages() case, which then has a vma and can do the check there if we want to. We'll see. Somewhat related to this all: what we _do_ want to do some day is to check the PAGE_USER bit - it should obviously always be set for user pages, but it would be a good check to have back. Because we have no generic way to test for it, we lost it as part of moving over from the architecture-specific x86 GUP implementation to the generic one in commit e585513b ("x86/mm/gup: Switch GUP to the generic get_user_page_fast() implementation"). Cc: Peter Zijlstra <peterz@infradead.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: "Jérôme Glisse" <jglisse@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 30 11月, 2017 1 次提交
-
-
由 Dan Williams 提交于
The 'access_permitted' helper is used in the gup-fast path and goes beyond the simple _PAGE_RW check to also: - validate that the mapping is writable from a protection keys standpoint - validate that the pte has _PAGE_USER set since all fault paths where pmd_write is must be referencing user-memory. Link: http://lkml.kernel.org/r/151043111049.2842.15241454964150083466.stgit@dwillia2-desk3.amr.corp.intel.comSigned-off-by: NDan Williams <dan.j.williams@intel.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: "Jérôme Glisse" <jglisse@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 16 11月, 2017 3 次提交
-
-
由 Mel Gorman 提交于
Every pagevec_init user claims the pages being released are hot even in cases where it is unlikely the pages are hot. As no one cares about the hotness of pages being released to the allocator, just ditch the parameter. No performance impact is expected as the overhead is marginal. The parameter is removed simply because it is a bit stupid to have a useless parameter copied everywhere. Link: http://lkml.kernel.org/r/20171018075952.10627-6-mgorman@techsingularity.netSigned-off-by: NMel Gorman <mgorman@techsingularity.net> Acked-by: NVlastimil Babka <vbabka@suse.cz> Cc: Andi Kleen <ak@linux.intel.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Jan Kara <jack@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Mel Gorman 提交于
During truncation, the mapping has already been checked for shmem and dax so it's known that workingset_update_node is required. This patch avoids the checks on mapping for each page being truncated. In all other cases, a lookup helper is used to determine if workingset_update_node() needs to be called. The one danger is that the API is slightly harder to use as calling workingset_update_node directly without checking for dax or shmem mappings could lead to surprises. However, the API rarely needs to be used and hopefully the comment is enough to give people the hint. sparsetruncate (tiny) 4.14.0-rc4 4.14.0-rc4 oneirq-v1r1 pickhelper-v1r1 Min Time 141.00 ( 0.00%) 140.00 ( 0.71%) 1st-qrtle Time 142.00 ( 0.00%) 141.00 ( 0.70%) 2nd-qrtle Time 142.00 ( 0.00%) 142.00 ( 0.00%) 3rd-qrtle Time 143.00 ( 0.00%) 143.00 ( 0.00%) Max-90% Time 144.00 ( 0.00%) 144.00 ( 0.00%) Max-95% Time 147.00 ( 0.00%) 145.00 ( 1.36%) Max-99% Time 195.00 ( 0.00%) 191.00 ( 2.05%) Max Time 230.00 ( 0.00%) 205.00 ( 10.87%) Amean Time 144.37 ( 0.00%) 143.82 ( 0.38%) Stddev Time 10.44 ( 0.00%) 9.00 ( 13.74%) Coeff Time 7.23 ( 0.00%) 6.26 ( 13.41%) Best99%Amean Time 143.72 ( 0.00%) 143.34 ( 0.26%) Best95%Amean Time 142.37 ( 0.00%) 142.00 ( 0.26%) Best90%Amean Time 142.19 ( 0.00%) 141.85 ( 0.24%) Best75%Amean Time 141.92 ( 0.00%) 141.58 ( 0.24%) Best50%Amean Time 141.69 ( 0.00%) 141.31 ( 0.27%) Best25%Amean Time 141.38 ( 0.00%) 140.97 ( 0.29%) As you'd expect, the gain is marginal but it can be detected. The differences in bonnie are all within the noise which is not surprising given the impact on the microbenchmark. radix_tree_update_node_t is a callback for some radix operations that optionally passes in a private field. The only user of the callback is workingset_update_node and as it no longer requires a mapping, the private field is removed. Link: http://lkml.kernel.org/r/20171018075952.10627-3-mgorman@techsingularity.netSigned-off-by: NMel Gorman <mgorman@techsingularity.net> Acked-by: NJohannes Weiner <hannes@cmpxchg.org> Reviewed-by: NJan Kara <jack@suse.cz> Cc: Andi Kleen <ak@linux.intel.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Jérôme Glisse 提交于
This patch only affects users of mmu_notifier->invalidate_range callback which are device drivers related to ATS/PASID, CAPI, IOMMUv2, SVM ... and it is an optimization for those users. Everyone else is unaffected by it. When clearing a pte/pmd we are given a choice to notify the event under the page table lock (notify version of *_clear_flush helpers do call the mmu_notifier_invalidate_range). But that notification is not necessary in all cases. This patch removes almost all cases where it is useless to have a call to mmu_notifier_invalidate_range before mmu_notifier_invalidate_range_end. It also adds documentation in all those cases explaining why. Below is a more in depth analysis of why this is fine to do this: For secondary TLB (non CPU TLB) like IOMMU TLB or device TLB (when device use thing like ATS/PASID to get the IOMMU to walk the CPU page table to access a process virtual address space). There is only 2 cases when you need to notify those secondary TLB while holding page table lock when clearing a pte/pmd: A) page backing address is free before mmu_notifier_invalidate_range_end B) a page table entry is updated to point to a new page (COW, write fault on zero page, __replace_page(), ...) Case A is obvious you do not want to take the risk for the device to write to a page that might now be used by something completely different. Case B is more subtle. For correctness it requires the following sequence to happen: - take page table lock - clear page table entry and notify (pmd/pte_huge_clear_flush_notify()) - set page table entry to point to new page If clearing the page table entry is not followed by a notify before setting the new pte/pmd value then you can break memory model like C11 or C++11 for the device. Consider the following scenario (device use a feature similar to ATS/ PASID): Two address addrA and addrB such that |addrA - addrB| >= PAGE_SIZE we assume they are write protected for COW (other case of B apply too). [Time N] ----------------------------------------------------------------- CPU-thread-0 {try to write to addrA} CPU-thread-1 {try to write to addrB} CPU-thread-2 {} CPU-thread-3 {} DEV-thread-0 {read addrA and populate device TLB} DEV-thread-2 {read addrB and populate device TLB} [Time N+1] --------------------------------------------------------------- CPU-thread-0 {COW_step0: {mmu_notifier_invalidate_range_start(addrA)}} CPU-thread-1 {COW_step0: {mmu_notifier_invalidate_range_start(addrB)}} CPU-thread-2 {} CPU-thread-3 {} DEV-thread-0 {} DEV-thread-2 {} [Time N+2] --------------------------------------------------------------- CPU-thread-0 {COW_step1: {update page table point to new page for addrA}} CPU-thread-1 {COW_step1: {update page table point to new page for addrB}} CPU-thread-2 {} CPU-thread-3 {} DEV-thread-0 {} DEV-thread-2 {} [Time N+3] --------------------------------------------------------------- CPU-thread-0 {preempted} CPU-thread-1 {preempted} CPU-thread-2 {write to addrA which is a write to new page} CPU-thread-3 {} DEV-thread-0 {} DEV-thread-2 {} [Time N+3] --------------------------------------------------------------- CPU-thread-0 {preempted} CPU-thread-1 {preempted} CPU-thread-2 {} CPU-thread-3 {write to addrB which is a write to new page} DEV-thread-0 {} DEV-thread-2 {} [Time N+4] --------------------------------------------------------------- CPU-thread-0 {preempted} CPU-thread-1 {COW_step3: {mmu_notifier_invalidate_range_end(addrB)}} CPU-thread-2 {} CPU-thread-3 {} DEV-thread-0 {} DEV-thread-2 {} [Time N+5] --------------------------------------------------------------- CPU-thread-0 {preempted} CPU-thread-1 {} CPU-thread-2 {} CPU-thread-3 {} DEV-thread-0 {read addrA from old page} DEV-thread-2 {read addrB from new page} So here because at time N+2 the clear page table entry was not pair with a notification to invalidate the secondary TLB, the device see the new value for addrB before seing the new value for addrA. This break total memory ordering for the device. When changing a pte to write protect or to point to a new write protected page with same content (KSM) it is ok to delay invalidate_range callback to mmu_notifier_invalidate_range_end() outside the page table lock. This is true even if the thread doing page table update is preempted right after releasing page table lock before calling mmu_notifier_invalidate_range_end Thanks to Andrea for thinking of a problematic scenario for COW. [jglisse@redhat.com: v2] Link: http://lkml.kernel.org/r/20171017031003.7481-2-jglisse@redhat.com Link: http://lkml.kernel.org/r/20170901173011.10745-1-jglisse@redhat.comSigned-off-by: NJérôme Glisse <jglisse@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Nadav Amit <nadav.amit@gmail.com> Cc: Joerg Roedel <jroedel@suse.de> Cc: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com> Cc: David Woodhouse <dwmw2@infradead.org> Cc: Alistair Popple <alistair@popple.id.au> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Andrew Donnellan <andrew.donnellan@au1.ibm.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 15 11月, 2017 1 次提交
-
-
由 Jeff Moyer 提交于
PMD faults on a zero length file on a file system mounted with -o dax will not generate SIGBUS as expected. fd = open(...O_TRUNC); addr = mmap(NULL, 2*1024*1024, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0); *addr = 'a'; <expect SIGBUS> The problem is this code in dax_iomap_pmd_fault: max_pgoff = (i_size_read(inode) - 1) >> PAGE_SHIFT; If the inode size is zero, we end up with a max_pgoff that is way larger than 0. :) Fix it by using DIV_ROUND_UP, as is done elsewhere in the kernel. I tested this with some simple test code that ensured that SIGBUS was received where expected. Cc: <stable@vger.kernel.org> Fixes: 642261ac ("dax: add struct iomap based DAX PMD support") Signed-off-by: NJeff Moyer <jmoyer@redhat.com> Signed-off-by: NDan Williams <dan.j.williams@intel.com>
-
- 14 11月, 2017 1 次提交
-
-
由 Dan Williams 提交于
While reviewing whether MAP_SYNC should strengthen its current guarantee of syncing writes from the initiating process to also include third-party readers observing dirty metadata, Dave pointed out that the check of IOMAP_WRITE is misplaced. The policy of what to with IOMAP_F_DIRTY should be separated from the generic filesystem mechanism of reporting dirty metadata. Move this policy to the fs-dax core to simplify the per-filesystem iomap handlers, and further centralize code that implements the MAP_SYNC policy. This otherwise should not change behavior, it just makes it easier to change behavior in the future. Reviewed-by: NJan Kara <jack@suse.cz> Reviewed-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Reported-by: NDave Chinner <david@fromorbit.com> Signed-off-by: NDan Williams <dan.j.williams@intel.com>
-
- 03 11月, 2017 11 次提交
-
-
由 Jan Kara 提交于
Implement a function that filesystems can call to finish handling of synchronous page faults. It takes care of syncing appropriare file range and insertion of page table entry. Reviewed-by: NRoss Zwisler <ross.zwisler@linux.intel.com> Reviewed-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NJan Kara <jack@suse.cz> Signed-off-by: NDan Williams <dan.j.williams@intel.com>
-
由 Jan Kara 提交于
Add a flag to iomap interface informing the caller that inode needs fdstasync(2) for returned extent to become persistent and use it in DAX fault code so that we don't map such extents into page tables immediately. Instead we propagate the information that fdatasync(2) is necessary from dax_iomap_fault() with a new VM_FAULT_NEEDDSYNC flag. Filesystem fault handler is then responsible for calling fdatasync(2) and inserting pfn into page tables. Reviewed-by: NRoss Zwisler <ross.zwisler@linux.intel.com> Reviewed-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NJan Kara <jack@suse.cz> Signed-off-by: NDan Williams <dan.j.williams@intel.com>
-
由 Jan Kara 提交于
Currently we dirty radix tree entry whenever dax_insert_mapping_entry() gets called for a write fault. With synchronous page faults we would like to insert clean radix tree entry and dirty it only once we call fdatasync() and update page tables to save some unnecessary cache flushing. Add 'dirty' argument to dax_insert_mapping_entry() for that. Reviewed-by: NRoss Zwisler <ross.zwisler@linux.intel.com> Reviewed-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NJan Kara <jack@suse.cz> Signed-off-by: NDan Williams <dan.j.williams@intel.com>
-
由 Jan Kara 提交于
For synchronous page fault dax_iomap_fault() will need to return PFN which will then need to be inserted into page tables after fsync() completes. Add necessary parameter to dax_iomap_fault(). Reviewed-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NRoss Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: NJan Kara <jack@suse.cz> Signed-off-by: NDan Williams <dan.j.williams@intel.com>
-
由 Jan Kara 提交于
Add missing argument description. Reviewed-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NRoss Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: NJan Kara <jack@suse.cz> Signed-off-by: NDan Williams <dan.j.williams@intel.com>
-
由 Jan Kara 提交于
dax_pmd_insert_mapping() has only one callsite and we will need to further fine tune what it does for synchronous faults. Just inline it into the callsite so that we don't have to pass awkward bools around. Reviewed-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NRoss Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: NJan Kara <jack@suse.cz> Signed-off-by: NDan Williams <dan.j.williams@intel.com>
-
由 Jan Kara 提交于
dax_insert_mapping() has only one callsite and we will need to further fine tune what it does for synchronous faults. Just inline it into the callsite so that we don't have to pass awkward bools around. Reviewed-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NRoss Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: NJan Kara <jack@suse.cz> Signed-off-by: NDan Williams <dan.j.williams@intel.com>
-
由 Jan Kara 提交于
There are already two users and more are coming. Reviewed-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NRoss Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: NJan Kara <jack@suse.cz> Signed-off-by: NDan Williams <dan.j.williams@intel.com>
-
由 Jan Kara 提交于
There are already two users and more are coming. Reviewed-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NRoss Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: NJan Kara <jack@suse.cz> Signed-off-by: NDan Williams <dan.j.williams@intel.com>
-
由 Jan Kara 提交于
Factor out code to get pfn out of iomap that is shared between PTE and PMD fault path. Reviewed-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NRoss Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: NJan Kara <jack@suse.cz> Signed-off-by: NDan Williams <dan.j.williams@intel.com>
-
由 Jan Kara 提交于
dax_insert_mapping() has lots of arguments and a lot of them is actuall duplicated by passing vm_fault structure as well. Change the function to take the same arguments as dax_pmd_insert_mapping(). Reviewed-by: NRoss Zwisler <ross.zwisler@linux.intel.com> Reviewed-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NJan Kara <jack@suse.cz> Signed-off-by: NDan Williams <dan.j.williams@intel.com>
-