- 20 6月, 2017 2 次提交
-
-
由 Goldwyn Rodrigues 提交于
Find out if the I/O will trigger a wait due to writeback. If yes, return -EAGAIN. Return -EINVAL for buffered AIO: there are multiple causes of delay such as page locks, dirty throttling logic, page loading from disk etc. which cannot be taken care of. Reviewed-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NJan Kara <jack@suse.cz> Signed-off-by: NGoldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Goldwyn Rodrigues 提交于
filemap_range_has_page() return true if the file's mapping has a page within the range mentioned. This function will be used to check if a write() call will cause a writeback of previous writes. Reviewed-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NJan Kara <jack@suse.cz> Signed-off-by: NGoldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
- 09 5月, 2017 2 次提交
-
-
由 Tetsuo Handa 提交于
Commit afddba49 ("fs: introduce write_begin, write_end, and perform_write aops") introduced AOP_FLAG_UNINTERRUPTIBLE flag which was checked in pagecache_write_begin(), but that check was removed by 4e02ed4b ("fs: remove prepare_write/commit_write"). Between these two commits, commit d9414774 ("cifs: Convert cifs to new aops.") added a check in cifs_write_begin(), but that check was soon removed by commit a98ee8c1 ("[CIFS] fix regression in cifs_write_begin/cifs_write_end"). Therefore, AOP_FLAG_UNINTERRUPTIBLE flag is checked nowhere. Let's remove this flag. This patch has no functionality changes. Link: http://lkml.kernel.org/r/1489294781-53494-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jpSigned-off-by: NTetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Reviewed-by: NJeff Layton <jlayton@redhat.com> Reviewed-by: NChristoph Hellwig <hch@lst.de> Cc: Nick Piggin <npiggin@gmail.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Al Viro 提交于
Wrong sign of iov_iter_revert() argument. Unfortunately, slipped through the testing, since most of the time we don't do anything to the iterator afterwards and potential oops on walking the iter->iov too far backwards is too infrequent to be easily triggered. Add a sanity check in iov_iter_revert() to catch bugs like this one; fortunately, the same braino hadn't happened in other callers, but we'd better have a warning if such thing crops up. Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
- 04 5月, 2017 2 次提交
-
-
由 Andrey Ryabinin 提交于
Patch series "Properly invalidate data in the cleancache", v2. We've noticed that after direct IO write, buffered read sometimes gets stale data which is coming from the cleancache. The reason for this is that some direct write hooks call call invalidate_inode_pages2[_range]() conditionally iff mapping->nrpages is not zero, so we may not invalidate data in the cleancache. Another odd thing is that we check only for ->nrpages and don't check for ->nrexceptional, but invalidate_inode_pages2[_range] also invalidates exceptional entries as well. So we invalidate exceptional entries only if ->nrpages != 0? This doesn't feel right. - Patch 1 fixes direct IO writes by removing ->nrpages check. - Patch 2 fixes similar case in invalidate_bdev(). Note: I only fixed conditional cleancache_invalidate_inode() here. Do we also need to add ->nrexceptional check in into invalidate_bdev()? - Patches 3-4: some optimizations. This patch (of 4): Some direct IO write fs hooks call invalidate_inode_pages2[_range]() conditionally iff mapping->nrpages is not zero. This can't be right, because invalidate_inode_pages2[_range]() also invalidate data in the cleancache via cleancache_invalidate_inode() call. So if page cache is empty but there is some data in the cleancache, buffered read after direct IO write would get stale data from the cleancache. Also it doesn't feel right to check only for ->nrpages because invalidate_inode_pages2[_range] invalidates exceptional entries as well. Fix this by calling invalidate_inode_pages2[_range]() regardless of nrpages state. Note: nfs,cifs,9p doesn't need similar fix because the never call cleancache_get_page() (nor directly, nor via mpage_readpage[s]()), so they are not affected by this bug. Fixes: c515e1fd ("mm/fs: add hooks to support cleancache") Link: http://lkml.kernel.org/r/20170424164135.22350-2-aryabinin@virtuozzo.comSigned-off-by: NAndrey Ryabinin <aryabinin@virtuozzo.com> Reviewed-by: NJan Kara <jack@suse.cz> Acked-by: NKonrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Alexey Kuznetsov <kuznet@virtuozzo.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Nikolay Borisov <n.borisov.lkml@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Matthew Wilcox 提交于
The round_up() macro generates a couple of unnecessary instructions in this usage: 48cd: 49 8b 47 50 mov 0x50(%r15),%rax 48d1: 48 83 e8 01 sub $0x1,%rax 48d5: 48 0d ff 0f 00 00 or $0xfff,%rax 48db: 48 83 c0 01 add $0x1,%rax 48df: 48 c1 f8 0c sar $0xc,%rax 48e3: 48 39 c3 cmp %rax,%rbx 48e6: 72 2e jb 4916 <filemap_fault+0x96> If we change round_up() to ((x) + __round_mask(x, y)) & ~__round_mask(x, y) then GCC can see through it and remove the mask (because that would be dead code given the subsequent shift): 48cd: 49 8b 47 50 mov 0x50(%r15),%rax 48d1: 48 05 ff 0f 00 00 add $0xfff,%rax 48d7: 48 c1 e8 0c shr $0xc,%rax 48db: 48 39 c3 cmp %rax,%rbx 48de: 72 2e jb 490e <filemap_fault+0x8e> But that's problematic because we'd evaluate 'y' twice. Converting round_up into an inline function prevents it from being used in other definitions. The easiest thing to do is just change these three usages of round_up to use DIV_ROUND_UP. Also add an unlikely() because GCC's heuristic is wrong in this case. Link: http://lkml.kernel.org/r/20170207192812.5281-1-willy@infradead.orgSigned-off-by: NMatthew Wilcox <mawilcox@microsoft.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 22 4月, 2017 2 次提交
-
-
由 Al Viro 提交于
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
由 Al Viro 提交于
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
- 03 4月, 2017 1 次提交
-
-
由 mchehab@s-opensource.com 提交于
./lib/string.c:134: WARNING: Inline emphasis start-string without end-string. ./mm/filemap.c:522: WARNING: Inline interpreted text or phrase reference start-string without end-string. ./mm/filemap.c:1283: ERROR: Unexpected indentation. ./mm/filemap.c:3003: WARNING: Inline interpreted text or phrase reference start-string without end-string. ./mm/vmalloc.c:1544: WARNING: Inline emphasis start-string without end-string. ./mm/page_alloc.c:4245: ERROR: Unexpected indentation. ./ipc/util.c:676: ERROR: Unexpected indentation. ./drivers/pci/irq.c:35: WARNING: Block quote ends without a blank line; unexpected unindent. ./security/security.c:109: ERROR: Unexpected indentation. ./security/security.c:110: WARNING: Definition list ends without a blank line; unexpected unindent. ./block/genhd.c:275: WARNING: Inline strong start-string without end-string. ./block/genhd.c:283: WARNING: Inline strong start-string without end-string. ./include/linux/clk.h:134: WARNING: Inline emphasis start-string without end-string. ./include/linux/clk.h:134: WARNING: Inline emphasis start-string without end-string. ./ipc/util.c:477: ERROR: Unknown target name: "s". Signed-off-by: NMauro Carvalho Chehab <mchehab@s-opensource.com> Acked-by: NBjorn Helgaas <bhelgaas@google.com> Signed-off-by: NJonathan Corbet <corbet@lwn.net>
-
- 02 3月, 2017 1 次提交
-
-
由 Ingo Molnar 提交于
We are going to split <linux/sched/signal.h> out of <linux/sched.h>, which will have to be picked up from other headers and a couple of .c files. Create a trivial placeholder <linux/sched/signal.h> file that just maps to <linux/sched.h> to make this patch obviously correct and bisectable. Include the new header in the files that are going to need it. Acked-by: NLinus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: NIngo Molnar <mingo@kernel.org>
-
- 25 2月, 2017 2 次提交
-
-
由 Minchan Kim 提交于
With rw_page, page_endio is used for completing IO on a page and it propagates write error to the address space if the IO fails. The problem is it accesses page->mapping directly which might be okay for file-backed pages but it shouldn't for anonymous page. Otherwise, it can corrupt one of field from anon_vma under us and system goes panic randomly. swap_writepage bdev_writepage ops->rw_page I encountered the BUG during developing new zram feature and it was really hard to figure it out because it made random crash, somtime mmap_sem lockdep, sometime other places where places never related to zram/zsmalloc, and not reproducible with some configuration. When I consider how that bug is subtle and people do fast-swap test with brd, it's worth to add stable mark, I think. Fixes: dd6bd0d9 ("swap: use bdev_read_page() / bdev_write_page()") Signed-off-by: NMinchan Kim <minchan@kernel.org> Acked-by: NMichal Hocko <mhocko@suse.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: <stable@vger.kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Dave Jiang 提交于
->fault(), ->page_mkwrite(), and ->pfn_mkwrite() calls do not need to take a vma and vmf parameter when the vma already resides in vmf. Remove the vma parameter to simplify things. [arnd@arndb.de: fix ARM build] Link: http://lkml.kernel.org/r/20170125223558.1451224-1-arnd@arndb.de Link: http://lkml.kernel.org/r/148521301778.19116.10840599906674778980.stgit@djiang5-desk3.ch.intel.comSigned-off-by: NDave Jiang <dave.jiang@intel.com> Signed-off-by: NArnd Bergmann <arnd@arndb.de> Reviewed-by: NRoss Zwisler <ross.zwisler@linux.intel.com> Cc: Theodore Ts'o <tytso@mit.edu> Cc: Darrick J. Wong <darrick.wong@oracle.com> Cc: Matthew Wilcox <mawilcox@microsoft.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Jan Kara <jack@suse.com> Cc: Dan Williams <dan.j.williams@intel.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 23 2月, 2017 2 次提交
-
-
由 Randy Dunlap 提交于
Fix kernel-doc warnings in mm/filemap.c: mm/filemap.c:993: warning: No description found for parameter '__page' mm/filemap.c:993: warning: Excess function parameter 'page' description in '__lock_page' Link: http://lkml.kernel.org/r/a66fe492-518c-ad6c-5f03-5e8b721fb451@infradead.orgSigned-off-by: NRandy Dunlap <rdunlap@infradead.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Nicholas Piggin 提交于
These are no longer used outside mm/filemap.c, so un-export them and make them static where possible. These were exported specifically for NFS use in commit a4796e37 ("MM: export page_wakeup functions"). Link: http://lkml.kernel.org/r/20170103182234.30141-3-npiggin@gmail.comSigned-off-by: NNicholas Piggin <npiggin@gmail.com> Cc: Trond Myklebust <trond.myklebust@primarydata.com> Cc: Anna Schumaker <anna.schumaker@netapp.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 04 2月, 2017 1 次提交
-
-
由 Michal Hocko 提交于
do_generic_file_read() can be told to perform a large request from userspace. If the system is under OOM and the reading task is the OOM victim then it has an access to memory reserves and finishing the full request can lead to the full memory depletion which is dangerous. Make sure we rather go with a short read and allow the killed task to terminate. Link: http://lkml.kernel.org/r/20170201092706.9966-3-mhocko@kernel.orgSigned-off-by: NMichal Hocko <mhocko@suse.com> Reviewed-by: NChristoph Hellwig <hch@lst.de> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: <stable@vger.kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 11 1月, 2017 1 次提交
-
-
由 Ross Zwisler 提交于
Currently in DAX if we have three read faults on the same hole address we can end up with the following: Thread 0 Thread 1 Thread 2 -------- -------- -------- dax_iomap_fault grab_mapping_entry lock_slot <locks empty DAX entry> dax_iomap_fault grab_mapping_entry get_unlocked_mapping_entry <sleeps on empty DAX entry> dax_iomap_fault grab_mapping_entry get_unlocked_mapping_entry <sleeps on empty DAX entry> dax_load_hole find_or_create_page ... page_cache_tree_insert dax_wake_mapping_entry_waiter <wakes one sleeper> __radix_tree_replace <swaps empty DAX entry with 4k zero page> <wakes> get_page lock_page ... put_locked_mapping_entry unlock_page put_page <sleeps forever on the DAX wait queue> The crux of the problem is that once we insert a 4k zero page, all locking from then on is done in terms of that 4k zero page and any additional threads sleeping on the empty DAX entry will never be woken. Fix this by waking all sleepers when we replace the DAX radix tree entry with a 4k zero page. This will allow all sleeping threads to successfully transition from locking based on the DAX empty entry to locking on the 4k zero page. With the test case reported by Xiong this happens very regularly in my test setup, with some runs resulting in 9+ threads in this deadlocked state. With this fix I've been able to run that same test dozens of times in a loop without issue. Fixes: ac401cc7 ("dax: New fault locking") Link: http://lkml.kernel.org/r/1483479365-13607-1-git-send-email-ross.zwisler@linux.intel.comSigned-off-by: NRoss Zwisler <ross.zwisler@linux.intel.com> Reported-by: NXiong Zhou <xzhou@redhat.com> Reviewed-by: NJan Kara <jack@suse.cz> Cc: <stable@vger.kernel.org> [4.7+] Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 30 12月, 2016 2 次提交
-
-
由 Olof Johansson 提交于
mm/filemap.c: In function 'clear_bit_unlock_is_negative_byte': mm/filemap.c:933:9: error: too few arguments to function 'test_bit' return test_bit(PG_waiters); ^~~~~~~~ Fixes: b91e1302 ('mm: optimize PageWaiters bit use for unlock_page()') Signed-off-by: NOlof Johansson <olof@lixom.net> Brown-paper-bag-by: NLinus Torvalds <dummy@duh.com> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Linus Torvalds 提交于
In commit 62906027 ("mm: add PageWaiters indicating tasks are waiting for a page bit") Nick Piggin made our page locking no longer unconditionally touch the hashed page waitqueue, which not only helps performance in general, but is particularly helpful on NUMA machines where the hashed wait queues can bounce around a lot. However, the "clear lock bit atomically and then test the waiters bit" sequence turns out to be much more expensive than it needs to be, because you get a nasty stall when trying to access the same word that just got updated atomically. On architectures where locking is done with LL/SC, this would be trivial to fix with a new primitive that clears one bit and tests another atomically, but that ends up not working on x86, where the only atomic operations that return the result end up being cmpxchg and xadd. The atomic bit operations return the old value of the same bit we changed, not the value of an unrelated bit. On x86, we could put the lock bit in the high bit of the byte, and use "xadd" with that bit (where the overflow ends up not touching other bits), and look at the other bits of the result. However, an even simpler model is to just use a regular atomic "and" to clear the lock bit, and then the sign bit in eflags will indicate the resulting state of the unrelated bit #7. So by moving the PageWaiters bit up to bit #7, we can atomically clear the lock bit and test the waiters bit on x86 too. And architectures with LL/SC (which is all the usual RISC suspects), the particular bit doesn't matter, so they are fine with this approach too. This avoids the extra access to the same atomic word, and thus avoids the costly stall at page unlock time. The only downside is that the interface ends up being a bit odd and specialized: clear a bit in a byte, and test the sign bit. Nick doesn't love the resulting name of the new primitive, but I'd rather make the name be descriptive and very clear about the limitation imposed by trying to work across all relevant architectures than make it be some generic thing that doesn't make the odd semantics explicit. So this introduces the new architecture primitive clear_bit_unlock_is_negative_byte(); and adds the trivial implementation for x86. We have a generic non-optimized fallback (that just does a "clear_bit()"+"test_bit(7)" combination) which can be overridden by any architecture that can do better. According to Nick, Power has the same hickup x86 has, for example, but some other architectures may not even care. All these optimizations mean that my page locking stress-test (which is just executing a lot of small short-lived shell scripts: "make test" in the git source tree) no longer makes our page locking look horribly bad. Before all these optimizations, just the unlock_page() costs were just over 3% of all CPU overhead on "make test". After this, it's down to 0.66%, so just a quarter of the cost it used to be. (The difference on NUMA is bigger, but there this micro-optimization is likely less noticeable, since the big issue on NUMA was not the accesses to 'struct page', but the waitqueue accesses that were already removed by Nick's earlier commit). Acked-by: NNick Piggin <npiggin@gmail.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Bob Peterson <rpeterso@redhat.com> Cc: Steven Whitehouse <swhiteho@redhat.com> Cc: Andrew Lutomirski <luto@kernel.org> Cc: Andreas Gruenbacher <agruenba@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 26 12月, 2016 1 次提交
-
-
由 Nicholas Piggin 提交于
Add a new page flag, PageWaiters, to indicate the page waitqueue has tasks waiting. This can be tested rather than testing waitqueue_active which requires another cacheline load. This bit is always set when the page has tasks on page_waitqueue(page), and is set and cleared under the waitqueue lock. It may be set when there are no tasks on the waitqueue, which will cause a harmless extra wakeup check that will clears the bit. The generic bit-waitqueue infrastructure is no longer used for pages. Instead, waitqueues are used directly with a custom key type. The generic code was not flexible enough to have PageWaiters manipulation under the waitqueue lock (which simplifies concurrency). This improves the performance of page lock intensive microbenchmarks by 2-3%. Putting two bits in the same word opens the opportunity to remove the memory barrier between clearing the lock bit and testing the waiters bit, after some work on the arch primitives (e.g., ensuring memory operand widths match and cover both bits). Signed-off-by: NNicholas Piggin <npiggin@gmail.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Bob Peterson <rpeterso@redhat.com> Cc: Steven Whitehouse <swhiteho@redhat.com> Cc: Andrew Lutomirski <luto@kernel.org> Cc: Andreas Gruenbacher <agruenba@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 15 12月, 2016 2 次提交
-
-
由 Jan Kara 提交于
Currently we have two different structures for passing fault information around - struct vm_fault and struct fault_env. DAX will need more information in struct vm_fault to handle its faults so the content of that structure would become event closer to fault_env. Furthermore it would need to generate struct fault_env to be able to call some of the generic functions. So at this point I don't think there's much use in keeping these two structures separate. Just embed into struct vm_fault all that is needed to use it for both purposes. Link: http://lkml.kernel.org/r/1479460644-25076-2-git-send-email-jack@suse.czSigned-off-by: NJan Kara <jack@suse.cz> Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Linus Torvalds 提交于
We truncated the possible read iterator to s_maxbytes in commit c2a9737f ("vfs,mm: fix a dead loop in truncate_inode_pages_range()"), but our end condition handling was wrong: it's not an error to try to read at the end of the file. Reading past the end should return EOF (0), not EINVAL. See for example https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1649342 http://lists.gnu.org/archive/html/bug-coreutils/2016-12/msg00008.html where a md5sum of a maximally sized file fails because the final read is exactly at s_maxbytes. Fixes: c2a9737f ("vfs,mm: fix a dead loop in truncate_inode_pages_range()") Reported-by: NJoseph Salisbury <joseph.salisbury@canonical.com> Cc: Wei Fang <fangwei1@huawei.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: stable@kernel.org Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 13 12月, 2016 4 次提交
-
-
由 Johannes Weiner 提交于
Shadow entries in the page cache used to be accounted behind the radix tree implementation's back in the upper bits of node->count, and the radix tree code extending a single-entry tree with a shadow entry in root->rnode would corrupt that counter. As a result, we could not put shadow entries at index 0 if the tree didn't have any other entries, and that means no refault detection for any single-page file. Now that the shadow entries are tracked natively in the radix tree's exceptional counter, this is no longer necessary. Extending and shrinking the tree from and to single entries in root->rnode now does the right thing when the entry is exceptional, remove that limitation. Link: http://lkml.kernel.org/r/20161117193244.GF23430@cmpxchg.orgSigned-off-by: NJohannes Weiner <hannes@cmpxchg.org> Reviewed-by: NJan Kara <jack@suse.cz> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox <mawilcox@linuxonhyperv.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Johannes Weiner 提交于
Currently, we track the shadow entries in the page cache in the upper bits of the radix_tree_node->count, behind the back of the radix tree implementation. Because the radix tree code has no awareness of them, we rely on random subtleties throughout the implementation (such as the node->count != 1 check in the shrinking code, which is meant to exclude multi-entry nodes but also happens to skip nodes with only one shadow entry, as that's accounted in the upper bits). This is error prone and has, in fact, caused the bug fixed in d3798ae8 ("mm: filemap: don't plant shadow entries without radix tree node"). To remove these subtleties, this patch moves shadow entry tracking from the upper bits of node->count to the existing counter for exceptional entries. node->count goes back to being a simple counter of valid entries in the tree node and can be shrunk to a single byte. This vastly simplifies the page cache code. All accounting happens natively inside the radix tree implementation, and maintaining the LRU linkage of shadow nodes is consolidated into a single function in the workingset code that is called for leaf nodes affected by a change in the page cache tree. This also removes the last user of the __radix_delete_node() return value. Eliminate it. Link: http://lkml.kernel.org/r/20161117193211.GE23430@cmpxchg.orgSigned-off-by: NJohannes Weiner <hannes@cmpxchg.org> Reviewed-by: NJan Kara <jack@suse.cz> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox <mawilcox@linuxonhyperv.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Johannes Weiner 提交于
The bug in khugepaged fixed earlier in this series shows that radix tree slot replacement is fragile; and it will become more so when not only NULL<->!NULL transitions need to be caught but transitions from and to exceptional entries as well. We need checks. Re-implement radix_tree_replace_slot() on top of the sanity-checked __radix_tree_replace(). This requires existing callers to also pass the radix tree root, but it'll warn us when somebody replaces slots with contents that need proper accounting (transitions between NULL entries, real entries, exceptional entries) and where a replacement through the slot pointer would corrupt the radix tree node counts. Link: http://lkml.kernel.org/r/20161117193021.GB23430@cmpxchg.orgSigned-off-by: NJohannes Weiner <hannes@cmpxchg.org> Suggested-by: NJan Kara <jack@suse.cz> Reviewed-by: NJan Kara <jack@suse.cz> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox <mawilcox@linuxonhyperv.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Kirill A. Shutemov 提交于
Unlike THP, hugetlb pages are represented by one entry in the radix-tree. [akpm@linux-foundation.org: tweak comment] Link: http://lkml.kernel.org/r/20161110163640.126124-1-kirill.shutemov@linux.intel.comSigned-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: NJohannes Weiner <hannes@cmpxchg.org> Cc: Jan Kara <jack@suse.cz> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 12 11月, 2016 1 次提交
-
-
由 Eryu Guan 提交于
Starting from 4.9-rc1 kernel, I started noticing some test failures of sendfile(2) and splice(2) (sendfile0N and splice01 from LTP) when testing on sub-page block size filesystems (tested both XFS and ext4), these syscalls start to return EIO in the tests. e.g. sendfile02 1 TFAIL : sendfile02.c:133: sendfile(2) failed to return expected value, expected: 26, got: -1 sendfile02 2 TFAIL : sendfile02.c:133: sendfile(2) failed to return expected value, expected: 24, got: -1 sendfile02 3 TFAIL : sendfile02.c:133: sendfile(2) failed to return expected value, expected: 22, got: -1 sendfile02 4 TFAIL : sendfile02.c:133: sendfile(2) failed to return expected value, expected: 20, got: -1 This is because that in sub-page block size cases, we don't need the whole page to be uptodate, only the part we care about is uptodate is OK (if fs has ->is_partially_uptodate defined). But page_cache_pipe_buf_confirm() doesn't have the ability to check the partially-uptodate case, it needs the whole page to be uptodate. So it returns EIO in this case. This is a regression introduced by commit 82c156f8 ("switch generic_file_splice_read() to use of ->read_iter()"). Prior to the change, generic_file_splice_read() doesn't allow partially-uptodate page either, so it worked fine. Fix it by skipping the partially-uptodate check if we're working on a pipe in do_generic_file_read(), so we read the whole page from disk as long as the page is not uptodate. I think the other way to fix it is to add the ability to check & allow partially-uptodate page to page_cache_pipe_buf_confirm(), but that is much harder to do and seems gain little. Link: http://lkml.kernel.org/r/1477986187-12717-1-git-send-email-guaneryu@gmail.comSigned-off-by: NEryu Guan <guaneryu@gmail.com> Reviewed-by: NJan Kara <jack@suse.cz> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 08 11月, 2016 2 次提交
-
-
由 Ross Zwisler 提交于
DAX PMDs have been disabled since Jan Kara introduced DAX radix tree based locking. This patch allows DAX PMDs to participate in the DAX radix tree based locking scheme so that they can be re-enabled using the new struct iomap based fault handlers. There are currently three types of DAX 4k entries: 4k zero pages, 4k DAX mappings that have an associated block allocation, and 4k DAX empty entries. The empty entries exist to provide locking for the duration of a given page fault. This patch adds three equivalent 2MiB DAX entries: Huge Zero Page (HZP) entries, PMD DAX entries that have associated block allocations, and 2 MiB DAX empty entries. Unlike the 4k case where we insert a struct page* into the radix tree for 4k zero pages, for HZP we insert a DAX exceptional entry with the new RADIX_DAX_HZP flag set. This is because we use a single 2 MiB zero page in every 2MiB hole mapping, and it doesn't make sense to have that same struct page* with multiple entries in multiple trees. This would cause contention on the single page lock for the one Huge Zero Page, and it would break the page->index and page->mapping associations that are assumed to be valid in many other places in the kernel. One difficult use case is when one thread is trying to use 4k entries in radix tree for a given offset, and another thread is using 2 MiB entries for that same offset. The current code handles this by making the 2 MiB user fall back to 4k entries for most cases. This was done because it is the simplest solution, and because the use of 2MiB pages is already opportunistic. If we were to try to upgrade from 4k pages to 2MiB pages for a given range, we run into the problem of how we lock out 4k page faults for the entire 2MiB range while we clean out the radix tree so we can insert the 2MiB entry. We can solve this problem if we need to, but I think that the cases where both 2MiB entries and 4K entries are being used for the same range will be rare enough and the gain small enough that it probably won't be worth the complexity. Signed-off-by: NRoss Zwisler <ross.zwisler@linux.intel.com> Reviewed-by: NJan Kara <jack@suse.cz> Signed-off-by: NDave Chinner <david@fromorbit.com>
-
由 Ross Zwisler 提交于
DAX radix tree locking currently locks entries based on the unique combination of the 'mapping' pointer and the pgoff_t 'index' for the entry. This works for PTEs, but as we move to PMDs we will need to have all the offsets within the range covered by the PMD to map to the same bit lock. To accomplish this, for ranges covered by a PMD entry we will instead lock based on the page offset of the beginning of the PMD entry. The 'mapping' pointer is still used in the same way. Signed-off-by: NRoss Zwisler <ross.zwisler@linux.intel.com> Reviewed-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NJan Kara <jack@suse.cz> Signed-off-by: NDave Chinner <david@fromorbit.com>
-
- 07 11月, 2016 1 次提交
-
-
由 Eryu Guan 提交于
Starting from 4.9-rc1 kernel, I started noticing some test failures of sendfile(2) and splice(2) (sendfile0N and splice01 from LTP) when testing on sub-page block size filesystems (tested both XFS and ext4), these syscalls start to return EIO in the tests. e.g. sendfile02 1 TFAIL : sendfile02.c:133: sendfile(2) failed to return expected value, expected: 26, got: -1 sendfile02 2 TFAIL : sendfile02.c:133: sendfile(2) failed to return expected value, expected: 24, got: -1 sendfile02 3 TFAIL : sendfile02.c:133: sendfile(2) failed to return expected value, expected: 22, got: -1 sendfile02 4 TFAIL : sendfile02.c:133: sendfile(2) failed to return expected value, expected: 20, got: -1 This is because that in sub-page block size cases, we don't need the whole page to be uptodate, only the part we care about is uptodate is OK (if fs has ->is_partially_uptodate defined). But page_cache_pipe_buf_confirm() doesn't have the ability to check the partially-uptodate case, it needs the whole page to be uptodate. So it returns EIO in this case. This is a regression introduced by commit 82c156f8 ("switch generic_file_splice_read() to use of ->read_iter()"). Prior to the change, generic_file_splice_read() doesn't allow partially-uptodate page either, so it worked fine. Fix it by skipping the partially-uptodate check if we're working on a pipe in do_generic_file_read(), so we read the whole page from disk as long as the page is not uptodate. Signed-off-by: NEryu Guan <guaneryu@gmail.com> Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
- 28 10月, 2016 1 次提交
-
-
由 Linus Torvalds 提交于
The per-zone waitqueues exist because of a scalability issue with the page waitqueues on some NUMA machines, but it turns out that they hurt normal loads, and now with the vmalloced stacks they also end up breaking gfs2 that uses a bit_wait on a stack object: wait_on_bit(&gh->gh_iflags, HIF_WAIT, TASK_UNINTERRUPTIBLE) where 'gh' can be a reference to the local variable 'mount_gh' on the stack of fill_super(). The reason the per-zone hash table breaks for this case is that there is no "zone" for virtual allocations, and trying to look up the physical page to get at it will fail (with a BUG_ON()). It turns out that I actually complained to the mm people about the per-zone hash table for another reason just a month ago: the zone lookup also hurts the regular use of "unlock_page()" a lot, because the zone lookup ends up forcing several unnecessary cache misses and generates horrible code. As part of that earlier discussion, we had a much better solution for the NUMA scalability issue - by just making the page lock have a separate contention bit, the waitqueue doesn't even have to be looked at for the normal case. Peter Zijlstra already has a patch for that, but let's see if anybody even notices. In the meantime, let's fix the actual gfs2 breakage by simplifying the bitlock waitqueues and removing the per-zone issue. Reported-by: NAndreas Gruenbacher <agruenba@redhat.com> Tested-by: NBob Peterson <rpeterso@redhat.com> Acked-by: NMel Gorman <mgorman@techsingularity.net> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Steven Whitehouse <swhiteho@redhat.com> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 11 10月, 2016 1 次提交
-
-
由 Al Viro 提交于
by making sure we call iov_iter_advance() on original iov_iter even if direct_IO (done on its copy) has returned 0. It's a no-op for old iov_iter flavours and does the right thing (== truncation of the stuff we'd allocated, but not filled) in ITER_PIPE case. Failures (e.g. -EIO) get caught and dealt with by cleanup in generic_file_read_iter(). Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
- 08 10月, 2016 2 次提交
-
-
由 Wei Fang 提交于
We triggered a deadloop in truncate_inode_pages_range() on 32 bits architecture with the test case bellow: ... fd = open(); write(fd, buf, 4096); preadv64(fd, &iovec, 1, 0xffffffff000); ftruncate(fd, 0); ... Then ftruncate() will not return forever. The filesystem used in this case is ubifs, but it can be triggered on many other filesystems. When preadv64() is called with offset=0xffffffff000, a page with index=0xffffffff will be added to the radix tree of ->mapping. Then this page can be found in ->mapping with pagevec_lookup(). After that, truncate_inode_pages_range(), which is called in ftruncate(), will fall into an infinite loop: - find a page with index=0xffffffff, since index>=end, this page won't be truncated - index++, and index become 0 - the page with index=0xffffffff will be found again The data type of index is unsigned long, so index won't overflow to 0 on 64 bits architecture in this case, and the dead loop won't happen. Since truncate_inode_pages_range() is executed with holding lock of inode->i_rwsem, any operation related with this lock will be blocked, and a hung task will happen, e.g.: INFO: task truncate_test:3364 blocked for more than 120 seconds. ... call_rwsem_down_write_failed+0x17/0x30 generic_file_write_iter+0x32/0x1c0 ubifs_write_iter+0xcc/0x170 __vfs_write+0xc4/0x120 vfs_write+0xb2/0x1b0 SyS_write+0x46/0xa0 The page with index=0xffffffff added to ->mapping is useless. Fix this by checking the read position before allocating pages. Link: http://lkml.kernel.org/r/1475151010-40166-1-git-send-email-fangwei1@huawei.comSigned-off-by: NWei Fang <fangwei1@huawei.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: <stable@vger.kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Bart Van Assche 提交于
If a fatal signal has been received, fail immediately instead of trying to read more data. If wait_on_page_locked_killable() was interrupted then this page is most likely is not PageUptodate() and in this case do_generic_file_read() will fail after lock_page_killable(). See also commit ebded027 ("mm: filemap: avoid unnecessary calls to lock_page when waiting for IO to complete during a read") [oleg@redhat.com: changelog addition] Link: http://lkml.kernel.org/r/63068e8e-8bee-b208-8441-a3c39a9d9eb6@sandisk.comSigned-off-by: NBart Van Assche <bart.vanassche@sandisk.com> Reviewed-by: NJan Kara <jack@suse.cz> Acked-by: NOleg Nesterov <oleg@redhat.com> Acked-by: NMichal Hocko <mhocko@suse.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 06 10月, 2016 2 次提交
-
-
由 Johannes Weiner 提交于
Commit 22f2ac51 ("mm: workingset: fix crash in shadow node shrinker caused by replace_page_cache_page()") switched replace_page_cache() from raw radix tree operations to page_cache_tree_insert() but didn't take into account that the latter function, unlike the raw radix tree op, handles mapping->nrpages. As a result, that counter is bumped for each page replacement rather than balanced out even. The mapping->nrpages counter is used to skip needless radix tree walks when invalidating, truncating, syncing inodes without pages, as well as statistics for userspace. Since the error is positive, we'll do more page cache tree walks than necessary; we won't miss a necessary one. And we'll report more buffer pages to userspace than there are. The error is limited to fuse inodes. Fixes: 22f2ac51 ("mm: workingset: fix crash in shadow node shrinker caused by replace_page_cache_page()") Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Miklos Szeredi <miklos@szeredi.hu> Cc: stable@vger.kernel.org Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Johannes Weiner 提交于
When the underflow checks were added to workingset_node_shadow_dec(), they triggered immediately: kernel BUG at ./include/linux/swap.h:276! invalid opcode: 0000 [#1] SMP Modules linked in: isofs usb_storage fuse xt_CHECKSUM ipt_MASQUERADE nf_nat_masquerade_ipv4 tun nf_conntrack_netbios_ns nf_conntrack_broadcast ip6t_REJECT nf_reject_ipv6 soundcore wmi acpi_als pinctrl_sunrisepoint kfifo_buf tpm_tis industrialio acpi_pad pinctrl_intel tpm_tis_core tpm nfsd auth_rpcgss nfs_acl lockd grace sunrpc dm_crypt CPU: 0 PID: 20929 Comm: blkid Not tainted 4.8.0-rc8-00087-gbe67d60b #1 Hardware name: System manufacturer System Product Name/Z170-K, BIOS 1803 05/06/2016 task: ffff8faa93ecd940 task.stack: ffff8faa7f478000 RIP: page_cache_tree_insert+0xf1/0x100 Call Trace: __add_to_page_cache_locked+0x12e/0x270 add_to_page_cache_lru+0x4e/0xe0 mpage_readpages+0x112/0x1d0 blkdev_readpages+0x1d/0x20 __do_page_cache_readahead+0x1ad/0x290 force_page_cache_readahead+0xaa/0x100 page_cache_sync_readahead+0x3f/0x50 generic_file_read_iter+0x5af/0x740 blkdev_read_iter+0x35/0x40 __vfs_read+0xe1/0x130 vfs_read+0x96/0x130 SyS_read+0x55/0xc0 entry_SYSCALL_64_fastpath+0x13/0x8f Code: 03 00 48 8b 5d d8 65 48 33 1c 25 28 00 00 00 44 89 e8 75 19 48 83 c4 18 5b 41 5c 41 5d 41 5e 5d c3 0f 0b 41 bd ef ff ff ff eb d7 <0f> 0b e8 88 68 ef ff 0f 1f 84 00 RIP page_cache_tree_insert+0xf1/0x100 This is a long-standing bug in the way shadow entries are accounted in the radix tree nodes. The shrinker needs to know when radix tree nodes contain only shadow entries, no pages, so node->count is split in half to count shadows in the upper bits and pages in the lower bits. Unfortunately, the radix tree implementation doesn't know of this and assumes all entries are in node->count. When there is a shadow entry directly in root->rnode and the tree is later extended, the radix tree implementation will copy that entry into the new node and and bump its node->count, i.e. increases the page count bits. Once the shadow gets removed and we subtract from the upper counter, node->count underflows and triggers the warning. Afterwards, without node->count reaching 0 again, the radix tree node is leaked. Limit shadow entries to when we have actual radix tree nodes and can count them properly. That means we lose the ability to detect refaults from files that had only the first page faulted in at eviction time. Fixes: 449dd698 ("mm: keep page cache radix tree nodes in check") Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org> Reported-and-tested-by: NLinus Torvalds <torvalds@linux-foundation.org> Reviewed-by: NJan Kara <jack@suse.cz> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: stable@vger.kernel.org Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 03 10月, 2016 1 次提交
-
-
由 Christoph Hellwig 提交于
After the call to ->direct_IO the final reference to the file might have been dropped by aio_complete already, and the call to file_accessed might cause a use after free. Instead update the access time before the I/O, similar to how we update the time stamps before writes. Signed-off-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NDave Chinner <dchinner@redhat.com> Signed-off-by: NDave Chinner <david@fromorbit.com>
-
- 01 10月, 2016 1 次提交
-
-
由 Johannes Weiner 提交于
Antonio reports the following crash when using fuse under memory pressure: kernel BUG at /build/linux-a2WvEb/linux-4.4.0/mm/workingset.c:346! invalid opcode: 0000 [#1] SMP Modules linked in: all of them CPU: 2 PID: 63 Comm: kswapd0 Not tainted 4.4.0-36-generic #55-Ubuntu Hardware name: System manufacturer System Product Name/P8H67-M PRO, BIOS 3904 04/27/2013 task: ffff88040cae6040 ti: ffff880407488000 task.ti: ffff880407488000 RIP: shadow_lru_isolate+0x181/0x190 Call Trace: __list_lru_walk_one.isra.3+0x8f/0x130 list_lru_walk_one+0x23/0x30 scan_shadow_nodes+0x34/0x50 shrink_slab.part.40+0x1ed/0x3d0 shrink_zone+0x2ca/0x2e0 kswapd+0x51e/0x990 kthread+0xd8/0xf0 ret_from_fork+0x3f/0x70 which corresponds to the following sanity check in the shadow node tracking: BUG_ON(node->count & RADIX_TREE_COUNT_MASK); The workingset code tracks radix tree nodes that exclusively contain shadow entries of evicted pages in them, and this (somewhat obscure) line checks whether there are real pages left that would interfere with reclaim of the radix tree node under memory pressure. While discussing ways how fuse might sneak pages into the radix tree past the workingset code, Miklos pointed to replace_page_cache_page(), and indeed there is a problem there: it properly accounts for the old page being removed - __delete_from_page_cache() does that - but then does a raw raw radix_tree_insert(), not accounting for the replacement page. Eventually the page count bits in node->count underflow while leaving the node incorrectly linked to the shadow node LRU. To address this, make sure replace_page_cache_page() uses the tracked page insertion code, page_cache_tree_insert(). This fixes the page accounting and makes sure page-containing nodes are properly unlinked from the shadow node LRU again. Also, make the sanity checks a bit less obscure by using the helpers for checking the number of pages and shadows in a radix tree node. Fixes: 449dd698 ("mm: keep page cache radix tree nodes in check") Link: http://lkml.kernel.org/r/20160919155822.29498-1-hannes@cmpxchg.orgSigned-off-by: NJohannes Weiner <hannes@cmpxchg.org> Reported-by: NAntonio SJ Musumeci <trapexit@spawn.link> Debugged-by: NMiklos Szeredi <miklos@szeredi.hu> Cc: <stable@vger.kernel.org> [3.15+] Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 08 8月, 2016 1 次提交
-
-
由 Jens Axboe 提交于
Commit abf54548 changed it from an 'rw' flags type to the newer ops based interface, but now we're effectively leaking some bdev internals to the rest of the kernel. Since we only care about whether it's a read or a write at that level, just pass in a bool 'is_write' parameter instead. Then we can also move op_is_write() and friends back under CONFIG_BLOCK protection. Reviewed-by: NMike Christie <mchristi@redhat.com> Signed-off-by: NJens Axboe <axboe@fb.com>
-
- 05 8月, 2016 1 次提交
-
-
由 Mike Christie 提交于
The rw_page users were not converted to use bio/req ops. As a result bdev_write_page is not passing down REQ_OP_WRITE and the IOs will be sent down as reads. Signed-off-by: NMike Christie <mchristi@redhat.com> Fixes: 4e1b2d52 ("block, fs, drivers: remove REQ_OP compat defs and related code") Modified by me to: 1) Drop op_flags passing into ->rw_page(), as we don't use it. 2) Make op_is_write() and friends safe to use for !CONFIG_BLOCK Signed-off-by: NJens Axboe <axboe@fb.com>
-
- 29 7月, 2016 1 次提交
-
-
由 Miklos Szeredi 提交于
Can be used by fuse, btrfs and f2fs to replace opencoded variants. Signed-off-by: NMiklos Szeredi <mszeredi@redhat.com>
-