- 11 1月, 2017 4 次提交
-
-
由 Ross Zwisler 提交于
Currently dax_mapping_entry_mkclean() fails to clean and write protect the pmd_t of a DAX PMD entry during an *sync operation. This can result in data loss in the following sequence: 1) mmap write to DAX PMD, dirtying PMD radix tree entry and making the pmd_t dirty and writeable 2) fsync, flushing out PMD data and cleaning the radix tree entry. We currently fail to mark the pmd_t as clean and write protected. 3) more mmap writes to the PMD. These don't cause any page faults since the pmd_t is dirty and writeable. The radix tree entry remains clean. 4) fsync, which fails to flush the dirty PMD data because the radix tree entry was clean. 5) crash - dirty data that should have been fsync'd as part of 4) could still have been in the processor cache, and is lost. Fix this by marking the pmd_t clean and write protected in dax_mapping_entry_mkclean(), which is called as part of the fsync operation 2). This will cause the writes in step 3) above to generate page faults where we'll re-dirty the PMD radix tree entry, resulting in flushes in the fsync that happens in step 4). Fixes: 4b4bb46d ("dax: clear dirty entry tags on cache flush") Link: http://lkml.kernel.org/r/1482272586-21177-3-git-send-email-ross.zwisler@linux.intel.comSigned-off-by: NRoss Zwisler <ross.zwisler@linux.intel.com> Reviewed-by: NJan Kara <jack@suse.cz> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Christoph Hellwig <hch@lst.de> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Jan Kara <jack@suse.cz> Cc: Matthew Wilcox <mawilcox@microsoft.com> Cc: Dave Hansen <dave.hansen@intel.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Ross Zwisler 提交于
Patch series "Write protect DAX PMDs in *sync path". Currently dax_mapping_entry_mkclean() fails to clean and write protect the pmd_t of a DAX PMD entry during an *sync operation. This can result in data loss, as detailed in patch 2. This series is based on Dan's "libnvdimm-pending" branch, which is the current home for Jan's "dax: Page invalidation fixes" series. You can find a working tree here: https://git.kernel.org/cgit/linux/kernel/git/zwisler/linux.git/log/?h=dax_pmd_clean This patch (of 2): Similar to follow_pte(), follow_pte_pmd() allows either a PTE leaf or a huge page PMD leaf to be found and returned. Link: http://lkml.kernel.org/r/1482272586-21177-2-git-send-email-ross.zwisler@linux.intel.comSigned-off-by: NRoss Zwisler <ross.zwisler@linux.intel.com> Suggested-by: NDave Hansen <dave.hansen@intel.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Christoph Hellwig <hch@lst.de> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Jan Kara <jack@suse.cz> Cc: Matthew Wilcox <mawilcox@microsoft.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Aneesh Kumar K.V 提交于
With THP page cache, when trying to build a huge page from regular pte pages, we just clear the pmd entry. We will take another fault and at that point we will find the huge page in the radix tree, thereby using the huge page to complete the page fault The second fault path will allocate the needed pgtable_t page for archs like ppc64. So no need to deposit the same in collapse path. Depositing them in the collapse path resulting in a pgtable_t memory leak also giving errors like BUG: non-zero nr_ptes on freeing mm: 3 Fixes: 953c66c2 ("mm: THP page cache support for ppc64") Link: http://lkml.kernel.org/r/20161212163428.6780-2-aneesh.kumar@linux.vnet.ibm.comSigned-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Ross Zwisler 提交于
Currently in DAX if we have three read faults on the same hole address we can end up with the following: Thread 0 Thread 1 Thread 2 -------- -------- -------- dax_iomap_fault grab_mapping_entry lock_slot <locks empty DAX entry> dax_iomap_fault grab_mapping_entry get_unlocked_mapping_entry <sleeps on empty DAX entry> dax_iomap_fault grab_mapping_entry get_unlocked_mapping_entry <sleeps on empty DAX entry> dax_load_hole find_or_create_page ... page_cache_tree_insert dax_wake_mapping_entry_waiter <wakes one sleeper> __radix_tree_replace <swaps empty DAX entry with 4k zero page> <wakes> get_page lock_page ... put_locked_mapping_entry unlock_page put_page <sleeps forever on the DAX wait queue> The crux of the problem is that once we insert a 4k zero page, all locking from then on is done in terms of that 4k zero page and any additional threads sleeping on the empty DAX entry will never be woken. Fix this by waking all sleepers when we replace the DAX radix tree entry with a 4k zero page. This will allow all sleeping threads to successfully transition from locking based on the DAX empty entry to locking on the 4k zero page. With the test case reported by Xiong this happens very regularly in my test setup, with some runs resulting in 9+ threads in this deadlocked state. With this fix I've been able to run that same test dozens of times in a loop without issue. Fixes: ac401cc7 ("dax: New fault locking") Link: http://lkml.kernel.org/r/1483479365-13607-1-git-send-email-ross.zwisler@linux.intel.comSigned-off-by: NRoss Zwisler <ross.zwisler@linux.intel.com> Reported-by: NXiong Zhou <xzhou@redhat.com> Reviewed-by: NJan Kara <jack@suse.cz> Cc: <stable@vger.kernel.org> [4.7+] Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 08 1月, 2017 2 次提交
-
-
由 Johannes Weiner 提交于
Several people report seeing warnings about inconsistent radix tree nodes followed by crashes in the workingset code, which all looked like use-after-free access from the shadow node shrinker. Dave Jones managed to reproduce the issue with a debug patch applied, which confirmed that the radix tree shrinking indeed frees shadow nodes while they are still linked to the shadow LRU: WARNING: CPU: 2 PID: 53 at lib/radix-tree.c:643 delete_node+0x1e4/0x200 CPU: 2 PID: 53 Comm: kswapd0 Not tainted 4.10.0-rc2-think+ #3 Call Trace: delete_node+0x1e4/0x200 __radix_tree_delete_node+0xd/0x10 shadow_lru_isolate+0xe6/0x220 __list_lru_walk_one.isra.4+0x9b/0x190 list_lru_walk_one+0x23/0x30 scan_shadow_nodes+0x2e/0x40 shrink_slab.part.44+0x23d/0x5d0 shrink_node+0x22c/0x330 kswapd+0x392/0x8f0 This is the WARN_ON_ONCE(!list_empty(&node->private_list)) placed in the inlined radix_tree_shrink(). The problem is with 14b46879 ("mm: workingset: move shadow entry tracking to radix tree exceptional tracking"), which passes an update callback into the radix tree to link and unlink shadow leaf nodes when tree entries change, but forgot to pass the callback when reclaiming a shadow node. While the reclaimed shadow node itself is unlinked by the shrinker, its deletion from the tree can cause the left-most leaf node in the tree to be shrunk. If that happens to be a shadow node as well, we don't unlink it from the LRU as we should. Consider this tree, where the s are shadow entries: root->rnode | [0 n] | | [s ] [sssss] Now the shadow node shrinker reclaims the rightmost leaf node through the shadow node LRU: root->rnode | [0 ] | [s ] Because the parent of the deleted node is the first level below the root and has only one child in the left-most slot, the intermediate level is shrunk and the node containing the single shadow is put in its place: root->rnode | [s ] The shrinker again sees a single left-most slot in a first level node and thus decides to store the shadow in root->rnode directly and free the node - which is a leaf node on the shadow node LRU. root->rnode | s Without the update callback, the freed node remains on the shadow LRU, where it causes later shrinker runs to crash. Pass the node updater callback into __radix_tree_delete_node() in case the deletion causes the left-most branch in the tree to collapse too. Also add warnings when linked nodes are freed right away, rather than wait for the use-after-free when the list is scanned much later. Fixes: 14b46879 ("mm: workingset: move shadow entry tracking to radix tree exceptional tracking") Reported-by: NDave Chinner <david@fromorbit.com> Reported-by: NHugh Dickins <hughd@google.com> Reported-by: NAndrea Arcangeli <aarcange@redhat.com> Reported-and-tested-by: NDave Jones <davej@codemonkey.org.uk> Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org> Cc: Christoph Hellwig <hch@lst.de> Cc: Chris Leech <cleech@redhat.com> Cc: Lee Duncan <lduncan@suse.com> Cc: Jan Kara <jack@suse.cz> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox <mawilcox@linuxonhyperv.com> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Hugh Dickins 提交于
4.10-rc loadtest (even on x86, and even without THPCache) fails with "fork: Cannot allocate memory" or some such; and /proc/meminfo shows PageTables growing. Commit 953c66c2 ("mm: THP page cache support for ppc64") that got merged in rc1 removed the freeing of an unused preallocated pagetable after do_fault_around() has called map_pages(). This is usually a good optimization, so that the followup doesn't have to reallocate one; but it's not sufficient to shift the freeing into alloc_set_pte(), since there are failure cases (most commonly VM_FAULT_RETRY) which never reach finish_fault(). Check and free it at the outer level in do_fault(), then we don't need to worry in alloc_set_pte(), and can restore that to how it was (I cannot find any reason to pte_free() under lock as it was doing). And fix a separate pagetable leak, or crash, introduced by the same change, that could only show up on some ppc64: why does do_set_pmd()'s failure case attempt to withdraw a pagetable when it never deposited one, at the same time overwriting (so leaking) the vmf->prealloc_pte? Residue of an earlier implementation, perhaps? Delete it. Fixes: 953c66c2 ("mm: THP page cache support for ppc64") Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Michael Neuling <mikey@neuling.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: NHugh Dickins <hughd@google.com> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 30 12月, 2016 2 次提交
-
-
由 Olof Johansson 提交于
mm/filemap.c: In function 'clear_bit_unlock_is_negative_byte': mm/filemap.c:933:9: error: too few arguments to function 'test_bit' return test_bit(PG_waiters); ^~~~~~~~ Fixes: b91e1302 ('mm: optimize PageWaiters bit use for unlock_page()') Signed-off-by: NOlof Johansson <olof@lixom.net> Brown-paper-bag-by: NLinus Torvalds <dummy@duh.com> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Linus Torvalds 提交于
In commit 62906027 ("mm: add PageWaiters indicating tasks are waiting for a page bit") Nick Piggin made our page locking no longer unconditionally touch the hashed page waitqueue, which not only helps performance in general, but is particularly helpful on NUMA machines where the hashed wait queues can bounce around a lot. However, the "clear lock bit atomically and then test the waiters bit" sequence turns out to be much more expensive than it needs to be, because you get a nasty stall when trying to access the same word that just got updated atomically. On architectures where locking is done with LL/SC, this would be trivial to fix with a new primitive that clears one bit and tests another atomically, but that ends up not working on x86, where the only atomic operations that return the result end up being cmpxchg and xadd. The atomic bit operations return the old value of the same bit we changed, not the value of an unrelated bit. On x86, we could put the lock bit in the high bit of the byte, and use "xadd" with that bit (where the overflow ends up not touching other bits), and look at the other bits of the result. However, an even simpler model is to just use a regular atomic "and" to clear the lock bit, and then the sign bit in eflags will indicate the resulting state of the unrelated bit #7. So by moving the PageWaiters bit up to bit #7, we can atomically clear the lock bit and test the waiters bit on x86 too. And architectures with LL/SC (which is all the usual RISC suspects), the particular bit doesn't matter, so they are fine with this approach too. This avoids the extra access to the same atomic word, and thus avoids the costly stall at page unlock time. The only downside is that the interface ends up being a bit odd and specialized: clear a bit in a byte, and test the sign bit. Nick doesn't love the resulting name of the new primitive, but I'd rather make the name be descriptive and very clear about the limitation imposed by trying to work across all relevant architectures than make it be some generic thing that doesn't make the odd semantics explicit. So this introduces the new architecture primitive clear_bit_unlock_is_negative_byte(); and adds the trivial implementation for x86. We have a generic non-optimized fallback (that just does a "clear_bit()"+"test_bit(7)" combination) which can be overridden by any architecture that can do better. According to Nick, Power has the same hickup x86 has, for example, but some other architectures may not even care. All these optimizations mean that my page locking stress-test (which is just executing a lot of small short-lived shell scripts: "make test" in the git source tree) no longer makes our page locking look horribly bad. Before all these optimizations, just the unlock_page() costs were just over 3% of all CPU overhead on "make test". After this, it's down to 0.66%, so just a quarter of the cost it used to be. (The difference on NUMA is bigger, but there this micro-optimization is likely less noticeable, since the big issue on NUMA was not the accesses to 'struct page', but the waitqueue accesses that were already removed by Nick's earlier commit). Acked-by: NNick Piggin <npiggin@gmail.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Bob Peterson <rpeterso@redhat.com> Cc: Steven Whitehouse <swhiteho@redhat.com> Cc: Andrew Lutomirski <luto@kernel.org> Cc: Andreas Gruenbacher <agruenba@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 27 12月, 2016 1 次提交
-
-
由 Jan Kara 提交于
Currently invalidate_inode_pages2_range() and invalidate_mapping_pages() just delete all exceptional radix tree entries they find. For DAX this is not desirable as we track cache dirtiness in these entries and when they are evicted, we may not flush caches although it is necessary. This can for example manifest when we write to the same block both via mmap and via write(2) (to different offsets) and fsync(2) then does not properly flush CPU caches when modification via write(2) was the last one. Create appropriate DAX functions to handle invalidation of DAX entries for invalidate_inode_pages2_range() and invalidate_mapping_pages() and wire them up into the corresponding mm functions. Acked-by: NJohannes Weiner <hannes@cmpxchg.org> Reviewed-by: NRoss Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: NJan Kara <jack@suse.cz> Signed-off-by: NDan Williams <dan.j.williams@intel.com>
-
- 26 12月, 2016 2 次提交
-
-
由 Nicholas Piggin 提交于
Add a new page flag, PageWaiters, to indicate the page waitqueue has tasks waiting. This can be tested rather than testing waitqueue_active which requires another cacheline load. This bit is always set when the page has tasks on page_waitqueue(page), and is set and cleared under the waitqueue lock. It may be set when there are no tasks on the waitqueue, which will cause a harmless extra wakeup check that will clears the bit. The generic bit-waitqueue infrastructure is no longer used for pages. Instead, waitqueues are used directly with a custom key type. The generic code was not flexible enough to have PageWaiters manipulation under the waitqueue lock (which simplifies concurrency). This improves the performance of page lock intensive microbenchmarks by 2-3%. Putting two bits in the same word opens the opportunity to remove the memory barrier between clearing the lock bit and testing the waiters bit, after some work on the arch primitives (e.g., ensuring memory operand widths match and cover both bits). Signed-off-by: NNicholas Piggin <npiggin@gmail.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Bob Peterson <rpeterso@redhat.com> Cc: Steven Whitehouse <swhiteho@redhat.com> Cc: Andrew Lutomirski <luto@kernel.org> Cc: Andreas Gruenbacher <agruenba@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Nicholas Piggin 提交于
A page is not added to the swap cache without being swap backed, so PageSwapBacked mappings can use PG_owner_priv_1 for PageSwapCache. Signed-off-by: NNicholas Piggin <npiggin@gmail.com> Acked-by: NHugh Dickins <hughd@google.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Bob Peterson <rpeterso@redhat.com> Cc: Steven Whitehouse <swhiteho@redhat.com> Cc: Andrew Lutomirski <luto@kernel.org> Cc: Andreas Gruenbacher <agruenba@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 25 12月, 2016 1 次提交
-
-
由 Linus Torvalds 提交于
This was entirely automated, using the script by Al: PATT='^[[:blank:]]*#[[:blank:]]*include[[:blank:]]*<asm/uaccess.h>' sed -i -e "s!$PATT!#include <linux/uaccess.h>!" \ $(git grep -l "$PATT"|grep -v ^include/linux/uaccess.h) to do the replacement at the end of the merge window. Requested-by: NAl Viro <viro@zeniv.linux.org.uk> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 21 12月, 2016 1 次提交
-
-
由 Johannes Weiner 提交于
When FADV_DONTNEED cannot drop all pages in the range, it observes that some pages might still be on per-cpu LRU caches after recent instantiation and so initiates remote calls to all CPUs to flush their local caches. However, in most cases, the fadvise happens from the same context that instantiated the pages, and any pre-LRU pages in the specified range are most likely sitting on the local CPU's LRU cache, and so in many cases this results in unnecessary remote calls, which, in a loaded system, can hold up the fadvise() call significantly. [ I didn't record it in the extreme case we observed at Facebook, unfortunately. We had a slow-to-respond system and noticed it lru_add_drain_all() leading the profile during fadvise calls. This patch came out of thinking about the code and how we commonly call FADV_DONTNEED. FWIW, I wrote a silly directory tree walker/searcher that recurses through /usr to read and FADV_DONTNEED each file it finds. On a 2 socket 40 ht machine, over 1% is spent in lru_add_drain_all(). With the patch, that cost is gone; the local drain cost shows at 0.09%. ] Try to avoid the remote call by flushing the local LRU cache before even attempting to invalidate anything. It's a cheap operation, and the local LRU cache is the most likely to hold any pre-LRU pages in the specified fadvise range. Link: http://lkml.kernel.org/r/20161214210017.GA1465@cmpxchg.orgSigned-off-by: NJohannes Weiner <hannes@cmpxchg.org> Acked-by: NVlastimil Babka <vbabka@suse.cz> Acked-by: NMel Gorman <mgorman@suse.de> Acked-by: NHillf Danton <hillf.zj@alibaba-inc.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 15 12月, 2016 26 次提交
-
-
由 Matthew Wilcox 提交于
This is an exceptionally complicated function with just one caller (tag_pages_for_writeback). We devote a large portion of the runtime of the test suite to testing this one function which has one caller. By introducing the new function radix_tree_iter_tag_set(), we can eliminate all of the complexity while keeping the performance. The caller can now use a fairly standard radix_tree_for_each() loop, and it doesn't need to worry about tricksy things like 'start' wrapping. The test suite continues to spend a large amount of time investigating this function, but now it's testing the underlying primitives such as radix_tree_iter_resume() and the radix_tree_for_each_tagged() iterator which are also used by other parts of the kernel. Link: http://lkml.kernel.org/r/1480369871-5271-57-git-send-email-mawilcox@linuxonhyperv.comSigned-off-by: NMatthew Wilcox <willy@infradead.org> Tested-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Matthew Wilcox <mawilcox@microsoft.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Matthew Wilcox 提交于
This rather complicated function can be better implemented as an iterator. It has only one caller, so move the functionality to the only place that needs it. Update the test suite to follow the same pattern. Link: http://lkml.kernel.org/r/1480369871-5271-56-git-send-email-mawilcox@linuxonhyperv.comSigned-off-by: NMatthew Wilcox <mawilcox@microsoft.com> Acked-by: NKonstantin Khlebnikov <koct9i@gmail.com> Tested-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Matthew Wilcox <mawilcox@microsoft.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Matthew Wilcox 提交于
This fixes several interlinked problems with the iterators in the presence of multiorder entries. 1. radix_tree_iter_next() would only advance by one slot, which would result in the iterators returning the same entry more than once if there were sibling entries. 2. radix_tree_next_slot() could return an internal pointer instead of a user pointer if a tagged multiorder entry was immediately followed by an entry of lower order. 3. radix_tree_next_slot() expanded to a lot more code than it used to when multiorder support was compiled in. And I wasn't comfortable with entry_to_node() being in a header file. Fixing radix_tree_iter_next() for the presence of sibling entries necessarily involves examining the contents of the radix tree, so we now need to pass 'slot' to radix_tree_iter_next(), and we need to change the calling convention so it is called *before* dropping the lock which protects the tree. Also rename it to radix_tree_iter_resume(), as some people thought it was necessary to call radix_tree_iter_next() each time around the loop. radix_tree_next_slot() becomes closer to how it looked before multiorder support was introduced. It only checks to see if the next entry in the chunk is a sibling entry or a pointer to a node; this should be rare enough that handling this case out of line is not a performance impact (and such impact is amortised by the fact that the entry we just processed was a multiorder entry). Also, radix_tree_next_slot() used to force a new chunk lookup for untagged entries, which is more expensive than the out of line sibling entry skipping. Link: http://lkml.kernel.org/r/1480369871-5271-55-git-send-email-mawilcox@linuxonhyperv.comSigned-off-by: NMatthew Wilcox <mawilcox@microsoft.com> Tested-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Matthew Wilcox <mawilcox@microsoft.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Jan Kara 提交于
Currently PTE gets updated in wp_pfn_shared() after dax_pfn_mkwrite() has released corresponding radix tree entry lock. When we want to writeprotect PTE on cache flush, we need PTE modification to happen under radix tree entry lock to ensure consistent updates of PTE and radix tree (standard faults use page lock to ensure this consistency). So move update of PTE bit into dax_pfn_mkwrite(). Link: http://lkml.kernel.org/r/1479460644-25076-20-git-send-email-jack@suse.czSigned-off-by: NJan Kara <jack@suse.cz> Reviewed-by: NRoss Zwisler <ross.zwisler@linux.intel.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Jan Kara 提交于
DAX will need to implement its own version of page_check_address(). To avoid duplicating page table walking code, export follow_pte() which does what we need. Link: http://lkml.kernel.org/r/1479460644-25076-18-git-send-email-jack@suse.czSigned-off-by: NJan Kara <jack@suse.cz> Reviewed-by: NRoss Zwisler <ross.zwisler@linux.intel.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Jan Kara 提交于
Currently finish_mkwrite_fault() returns 0 when PTE got changed before we acquired PTE lock and VM_FAULT_WRITE when we succeeded in modifying the PTE. This is somewhat confusing since 0 generally means success, it is also inconsistent with finish_fault() which returns 0 on success. Change finish_mkwrite_fault() to return 0 on success and VM_FAULT_NOPAGE when PTE changed. Practically, there should be no behavioral difference since we bail out from the fault the same way regardless whether we return 0, VM_FAULT_NOPAGE, or VM_FAULT_WRITE. Also note that VM_FAULT_WRITE has no effect for shared mappings since the only two places that check it - KSM and GUP - care about private mappings only. Generally the meaning of VM_FAULT_WRITE for shared mappings is not well defined and we should probably clean that up. Link: http://lkml.kernel.org/r/1479460644-25076-17-git-send-email-jack@suse.czSigned-off-by: NJan Kara <jack@suse.cz> Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Jan Kara 提交于
Provide a helper function for finishing write faults due to PTE being read-only. The helper will be used by DAX to avoid the need of complicating generic MM code with DAX locking specifics. Link: http://lkml.kernel.org/r/1479460644-25076-16-git-send-email-jack@suse.czSigned-off-by: NJan Kara <jack@suse.cz> Reviewed-by: NRoss Zwisler <ross.zwisler@linux.intel.com> Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Jan Kara 提交于
wp_page_reuse() handles write shared faults which is needed only in wp_page_shared(). Move the handling only into that location to make wp_page_reuse() simpler and avoid a strange situation when we sometimes pass in locked page, sometimes unlocked etc. Link: http://lkml.kernel.org/r/1479460644-25076-15-git-send-email-jack@suse.czSigned-off-by: NJan Kara <jack@suse.cz> Reviewed-by: NRoss Zwisler <ross.zwisler@linux.intel.com> Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Jan Kara 提交于
So far we set vmf->page during WP faults only when we needed to pass it to the ->page_mkwrite handler. Set it in all the cases now and use that instead of passing page pointer explicitly around. Link: http://lkml.kernel.org/r/1479460644-25076-14-git-send-email-jack@suse.czSigned-off-by: NJan Kara <jack@suse.cz> Reviewed-by: NRoss Zwisler <ross.zwisler@linux.intel.com> Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Jan Kara 提交于
We will need more information in the ->page_mkwrite() helper for DAX to be able to fully finish faults there. Pass vm_fault structure to do_page_mkwrite() and use it there so that information propagates properly from upper layers. Link: http://lkml.kernel.org/r/1479460644-25076-13-git-send-email-jack@suse.czSigned-off-by: NJan Kara <jack@suse.cz> Reviewed-by: NRoss Zwisler <ross.zwisler@linux.intel.com> Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Jan Kara 提交于
Currently we duplicate handling of shared write faults in wp_page_reuse() and do_shared_fault(). Factor them out into a common function. Link: http://lkml.kernel.org/r/1479460644-25076-12-git-send-email-jack@suse.czSigned-off-by: NJan Kara <jack@suse.cz> Reviewed-by: NRoss Zwisler <ross.zwisler@linux.intel.com> Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Jan Kara 提交于
Move final handling of COW faults from generic code into DAX fault handler. That way generic code doesn't have to be aware of peculiarities of DAX locking so remove that knowledge and make locking functions private to fs/dax.c. Link: http://lkml.kernel.org/r/1479460644-25076-11-git-send-email-jack@suse.czSigned-off-by: NJan Kara <jack@suse.cz> Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: NRoss Zwisler <ross.zwisler@linux.intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Jan Kara 提交于
Introduce finish_fault() as a helper function for finishing page faults. It is rather thin wrapper around alloc_set_pte() but since we'd want to call this from DAX code or filesystems, it is still useful to avoid some boilerplate code. Link: http://lkml.kernel.org/r/1479460644-25076-10-git-send-email-jack@suse.czSigned-off-by: NJan Kara <jack@suse.cz> Reviewed-by: NRoss Zwisler <ross.zwisler@linux.intel.com> Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Jan Kara 提交于
Patch series "dax: Clear dirty bits after flushing caches", v5. Patchset to clear dirty bits from radix tree of DAX inodes when caches for corresponding pfns have been flushed. In principle, these patches enable handlers to easily update PTEs and do other work necessary to finish the fault without duplicating the functionality present in the generic code. I'd like to thank Kirill and Ross for reviews of the series! This patch (of 20): To allow full handling of COW faults add memcg field to struct vm_fault and a return value of ->fault() handler meaning that COW fault is fully handled and memcg charge must not be canceled. This will allow us to remove knowledge about special DAX locking from the generic fault code. Link: http://lkml.kernel.org/r/1479460644-25076-9-git-send-email-jack@suse.czSigned-off-by: NJan Kara <jack@suse.cz> Reviewed-by: NRoss Zwisler <ross.zwisler@linux.intel.com> Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Jan Kara 提交于
Add orig_pte field to vm_fault structure to allow ->page_mkwrite handlers to fully handle the fault. This also allows us to save some passing of extra arguments around. Link: http://lkml.kernel.org/r/1479460644-25076-8-git-send-email-jack@suse.czSigned-off-by: NJan Kara <jack@suse.cz> Reviewed-by: NRoss Zwisler <ross.zwisler@linux.intel.com> Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Jan Kara 提交于
Instead of creating another vm_fault structure, use the one passed to wp_pfn_shared() for passing arguments into pfn_mkwrite handler. Link: http://lkml.kernel.org/r/1479460644-25076-7-git-send-email-jack@suse.czSigned-off-by: NJan Kara <jack@suse.cz> Reviewed-by: NRoss Zwisler <ross.zwisler@linux.intel.com> Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Jan Kara 提交于
Use vm_fault structure to pass cow_page, page, and entry in and out of the function. That reduces number of __do_fault() arguments from 4 to 1. Link: http://lkml.kernel.org/r/1479460644-25076-6-git-send-email-jack@suse.czSigned-off-by: NJan Kara <jack@suse.cz> Reviewed-by: NRoss Zwisler <ross.zwisler@linux.intel.com> Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Jan Kara 提交于
Instead of creating another vm_fault structure, use the one passed to __do_fault() for passing arguments into fault handler. Link: http://lkml.kernel.org/r/1479460644-25076-5-git-send-email-jack@suse.czSigned-off-by: NJan Kara <jack@suse.cz> Reviewed-by: NRoss Zwisler <ross.zwisler@linux.intel.com> Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Jan Kara 提交于
struct vm_fault has already pgoff entry. Use it instead of passing pgoff as a separate argument and then assigning it later. Link: http://lkml.kernel.org/r/1479460644-25076-4-git-send-email-jack@suse.czSigned-off-by: NJan Kara <jack@suse.cz> Reviewed-by: NRoss Zwisler <ross.zwisler@linux.intel.com> Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Jan Kara 提交于
Every single user of vmf->virtual_address typed that entry to unsigned long before doing anything with it so the type of virtual_address does not really provide us any additional safety. Just use masked vmf->address which already has the appropriate type. Link: http://lkml.kernel.org/r/1479460644-25076-3-git-send-email-jack@suse.czSigned-off-by: NJan Kara <jack@suse.cz> Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Jan Kara 提交于
Currently we have two different structures for passing fault information around - struct vm_fault and struct fault_env. DAX will need more information in struct vm_fault to handle its faults so the content of that structure would become event closer to fault_env. Furthermore it would need to generate struct fault_env to be able to call some of the generic functions. So at this point I don't think there's much use in keeping these two structures separate. Just embed into struct vm_fault all that is needed to use it for both purposes. Link: http://lkml.kernel.org/r/1479460644-25076-2-git-send-email-jack@suse.czSigned-off-by: NJan Kara <jack@suse.cz> Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Lorenzo Stoakes 提交于
Unexport the low-level __get_user_pages_unlocked() function and replaces invocations with calls to more appropriate higher-level functions. In hva_to_pfn_slow() we are able to replace __get_user_pages_unlocked() with get_user_pages_unlocked() since we can now pass gup_flags. In async_pf_execute() and process_vm_rw_single_vec() we need to pass different tsk, mm arguments so get_user_pages_remote() is the sane replacement in these cases (having added manual acquisition and release of mmap_sem.) Additionally get_user_pages_remote() reintroduces use of the FOLL_TOUCH flag. However, this flag was originally silently dropped by commit 1e987790 ("mm/gup: Introduce get_user_pages_remote()"), so this appears to have been unintentional and reintroducing it is therefore not an issue. [akpm@linux-foundation.org: coding-style fixes] Link: http://lkml.kernel.org/r/20161027095141.2569-3-lstoakes@gmail.comSigned-off-by: NLorenzo Stoakes <lstoakes@gmail.com> Acked-by: NMichal Hocko <mhocko@suse.com> Cc: Jan Kara <jack@suse.cz> Cc: Hugh Dickins <hughd@google.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krcmar <rkrcmar@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Lorenzo Stoakes 提交于
Patch series "mm: unexport __get_user_pages_unlocked()". This patch series continues the cleanup of get_user_pages*() functions taking advantage of the fact we can now pass gup_flags as we please. It firstly adds an additional 'locked' parameter to get_user_pages_remote() to allow for its callers to utilise VM_FAULT_RETRY functionality. This is necessary as the invocation of __get_user_pages_unlocked() in process_vm_rw_single_vec() makes use of this and no other existing higher level function would allow it to do so. Secondly existing callers of __get_user_pages_unlocked() are replaced with the appropriate higher-level replacement - get_user_pages_unlocked() if the current task and memory descriptor are referenced, or get_user_pages_remote() if other task/memory descriptors are referenced (having acquiring mmap_sem.) This patch (of 2): Add a int *locked parameter to get_user_pages_remote() to allow VM_FAULT_RETRY faulting behaviour similar to get_user_pages_[un]locked(). Taking into account the previous adjustments to get_user_pages*() functions allowing for the passing of gup_flags, we are now in a position where __get_user_pages_unlocked() need only be exported for his ability to allow VM_FAULT_RETRY behaviour, this adjustment allows us to subsequently unexport __get_user_pages_unlocked() as well as allowing for future flexibility in the use of get_user_pages_remote(). [sfr@canb.auug.org.au: merge fix for get_user_pages_remote API change] Link: http://lkml.kernel.org/r/20161122210511.024ec341@canb.auug.org.au Link: http://lkml.kernel.org/r/20161027095141.2569-2-lstoakes@gmail.comSigned-off-by: NLorenzo Stoakes <lstoakes@gmail.com> Acked-by: NMichal Hocko <mhocko@suse.com> Cc: Jan Kara <jack@suse.cz> Cc: Hugh Dickins <hughd@google.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krcmar <rkrcmar@redhat.com> Signed-off-by: NStephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Alexander Duyck 提交于
Add a function that allows us to batch free a page that has multiple references outstanding. Specifically this function can be used to drop a page being used in the page frag alloc cache. With this drivers can make use of functionality similar to the page frag alloc cache without having to do any workarounds for the fact that there is no function that frees multiple references. Link: http://lkml.kernel.org/r/20161110113606.76501.70752.stgit@ahduyck-blue-test.jf.intel.comSigned-off-by: NAlexander Duyck <alexander.h.duyck@intel.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: "James E.J. Bottomley" <jejb@parisc-linux.org> Cc: Chris Metcalf <cmetcalf@mellanox.com> Cc: David Howells <dhowells@redhat.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Hans-Christian Noren Egtvedt <egtvedt@samfundet.no> Cc: Helge Deller <deller@gmx.de> Cc: James Hogan <james.hogan@imgtec.com> Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Cc: Jonas Bonn <jonas@southpole.se> Cc: Keguang Zhang <keguang.zhang@gmail.com> Cc: Ley Foon Tan <lftan@altera.com> Cc: Mark Salter <msalter@redhat.com> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Simek <monstr@monstr.eu> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Rich Felker <dalias@libc.org> Cc: Richard Kuo <rkuo@codeaurora.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Steven Miao <realmz6@gmail.com> Cc: Tobias Klauser <tklauser@distanz.ch> Cc: Vineet Gupta <vgupta@synopsys.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Michal Hocko 提交于
compaction has been disabled for GFP_NOFS and GFP_NOIO requests since the direct compaction was introduced by commit 56de7263 ("mm: compaction: direct compact when a high-order allocation fails"). The main reason is that the migration of page cache pages might recurse back to fs/io layer and we could potentially deadlock. This is overly conservative because all the anonymous memory is migrateable in the GFP_NOFS context just fine. This might be a large portion of the memory in many/most workkloads. Remove the GFP_NOFS restriction and make sure that we skip all fs pages (those with a mapping) while isolating pages to be migrated. We cannot consider clean fs pages because they might need a metadata update so only isolate pages without any mapping for nofs requests. The effect of this patch will be probably very limited in many/most workloads because higher order GFP_NOFS requests are quite rare, although different configurations might lead to very different results. David Chinner has mentioned a heavy metadata workload with 64kB block which to quote him: : Unfortunately, there was an era of cargo cult configuration tweaks in the : Ceph community that has resulted in a large number of production machines : with XFS filesystems configured this way. And a lot of them store large : numbers of small files and run under significant sustained memory : pressure. : : I slowly working towards getting rid of these high order allocations and : replacing them with the equivalent number of single page allocations, but : I haven't got that (complex) change working yet. We can do the following to simulate that workload: $ mkfs.xfs -f -n size=64k <dev> $ mount <dev> /mnt/scratch $ time ./fs_mark -D 10000 -S0 -n 100000 -s 0 -L 32 \ -d /mnt/scratch/0 -d /mnt/scratch/1 \ -d /mnt/scratch/2 -d /mnt/scratch/3 \ -d /mnt/scratch/4 -d /mnt/scratch/5 \ -d /mnt/scratch/6 -d /mnt/scratch/7 \ -d /mnt/scratch/8 -d /mnt/scratch/9 \ -d /mnt/scratch/10 -d /mnt/scratch/11 \ -d /mnt/scratch/12 -d /mnt/scratch/13 \ -d /mnt/scratch/14 -d /mnt/scratch/15 and indeed is hammers the system with many high order GFP_NOFS requests as per a simle tracepoint during the load: $ echo '!(gfp_flags & 0x80) && (gfp_flags &0x400000)' > $TRACE_MNT/events/kmem/mm_page_alloc/filter I am getting 5287609 order=0 37 order=1 1594905 order=2 3048439 order=3 6699207 order=4 66645 order=5 My testing was done in a kvm guest so performance numbers should be taken with a grain of salt but there seems to be a difference when the patch is applied: * Original kernel FSUse% Count Size Files/sec App Overhead 1 1600000 0 4300.1 20745838 3 3200000 0 4239.9 23849857 5 4800000 0 4243.4 25939543 6 6400000 0 4248.4 19514050 8 8000000 0 4262.1 20796169 9 9600000 0 4257.6 21288675 11 11200000 0 4259.7 19375120 13 12800000 0 4220.7 22734141 14 14400000 0 4238.5 31936458 16 16000000 0 4231.5 23409901 18 17600000 0 4045.3 23577700 19 19200000 0 2783.4 58299526 21 20800000 0 2678.2 40616302 23 22400000 0 2693.5 83973996 and xfs complaining about memory allocation not making progress [ 2304.372647] XFS: fs_mark(3289) possible memory allocation deadlock size 65624 in kmem_alloc (mode:0x2408240) [ 2304.443323] XFS: fs_mark(3285) possible memory allocation deadlock size 65728 in kmem_alloc (mode:0x2408240) [ 4796.772477] XFS: fs_mark(3424) possible memory allocation deadlock size 46936 in kmem_alloc (mode:0x2408240) [ 4796.775329] XFS: fs_mark(3423) possible memory allocation deadlock size 51416 in kmem_alloc (mode:0x2408240) [ 4797.388808] XFS: fs_mark(3424) possible memory allocation deadlock size 65728 in kmem_alloc (mode:0x2408240) * Patched kernel FSUse% Count Size Files/sec App Overhead 1 1600000 0 4289.1 19243934 3 3200000 0 4241.6 32828865 5 4800000 0 4248.7 32884693 6 6400000 0 4314.4 19608921 8 8000000 0 4269.9 24953292 9 9600000 0 4270.7 33235572 11 11200000 0 4346.4 40817101 13 12800000 0 4285.3 29972397 14 14400000 0 4297.2 20539765 16 16000000 0 4219.6 18596767 18 17600000 0 4273.8 49611187 19 19200000 0 4300.4 27944451 21 20800000 0 4270.6 22324585 22 22400000 0 4317.6 22650382 24 24000000 0 4065.2 22297964 So the dropdown at Count 19200000 didn't happen and there was only a single warning about allocation not making progress [ 3063.815003] XFS: fs_mark(3272) possible memory allocation deadlock size 65624 in kmem_alloc (mode:0x2408240) This suggests that the patch has helped even though there is not all that much of anonymous memory as the workload mostly generates fs metadata. I assume the success rate would be higher with more anonymous memory which should be the case in many workloads. [akpm@linux-foundation.org: fix comment] Link: http://lkml.kernel.org/r/20161012114721.31853-1-mhocko@kernel.orgSigned-off-by: NMichal Hocko <mhocko@suse.com> Acked-by: NVlastimil Babka <vbabka@suse.cz> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Joonsoo Kim <js1304@gmail.com> Cc: Dave Chinner <david@fromorbit.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Linus Torvalds 提交于
We truncated the possible read iterator to s_maxbytes in commit c2a9737f ("vfs,mm: fix a dead loop in truncate_inode_pages_range()"), but our end condition handling was wrong: it's not an error to try to read at the end of the file. Reading past the end should return EOF (0), not EINVAL. See for example https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1649342 http://lists.gnu.org/archive/html/bug-coreutils/2016-12/msg00008.html where a md5sum of a maximally sized file fails because the final read is exactly at s_maxbytes. Fixes: c2a9737f ("vfs,mm: fix a dead loop in truncate_inode_pages_range()") Reported-by: NJoseph Salisbury <joseph.salisbury@canonical.com> Cc: Wei Fang <fangwei1@huawei.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: stable@kernel.org Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 13 12月, 2016 1 次提交
-
-
由 zijun_hu 提交于
As shown by pcpu_build_alloc_info(), the number of units within a percpu group is deduced by rounding up the number of CPUs within the group to @upa boundary/ Therefore, the number of CPUs isn't equal to the units's if it isn't aligned to @upa normally. However, pcpu_page_first_chunk() uses BUG_ON() to assert that one number is equal to the other roughly, so a panic is maybe triggered by the BUG_ON() incorrectly. In order to fix this issue, the number of CPUs is rounded up then compared with units's and the BUG_ON() is replaced with a warning and return of an error code as well, to keep system alive as much as possible. Link: http://lkml.kernel.org/r/57FCF07C.2020103@zoho.comSigned-off-by: Nzijun_hu <zijun_hu@htc.com> Cc: Tejun Heo <tj@kernel.org> Cc: Christoph Lameter <cl@linux.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-