- 03 January 2014, 1 commit
-
-
Committed by Vlastimil Babka
Since commit ff6a6da6 ("mm: accelerate munlock() treatment of THP pages") munlock skips tail pages of a munlocked THP page. However, when the head page already has PageMlocked unset, it will not skip the tail pages. Commit 7225522b ("mm: munlock: batch non-THP page isolation and munlock+putback using pagevec") has added a PageTransHuge() check which contains VM_BUG_ON(PageTail(page)). Sasha Levin found this triggered using trinity, on the first tail page of a THP page without PageMlocked flag. This patch fixes the issue by skipping tail pages also in the case when PageMlocked flag is unset. There is still a possibility of race with THP page split between clearing PageMlocked and determining how many pages to skip. The race might result in former tail pages not being skipped, which is however no longer a bug, as during the skip the PageTail flags are cleared. However this race also affects correctness of NR_MLOCK accounting, which is to be fixed in a separate patch. Signed-off-by: NVlastimil Babka <vbabka@suse.cz> Reported-by: NSasha Levin <sasha.levin@oracle.com> Cc: Michel Lespinasse <walken@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Hugh Dickins <hughd@google.com> Cc: Bob Liu <bob.liu@oracle.com> Cc: <stable@vger.kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 01 October 2013, 1 commit
-
-
Committed by Vlastimil Babka
The function __munlock_pagevec_fill() introduced in commit 7a8010cd ("mm: munlock: manual pte walk in fast path instead of follow_page_mask()") uses pmd_addr_end() for restricting its operation within the current page table. This is insufficient on architectures/configurations where pmd is folded and pmd_addr_end() just returns the end of the full range to be walked. In this case, it allows pte++ to walk off the end of a page table, resulting in unpredictable behaviour. This patch fixes the function by using pgd_addr_end() and pud_addr_end() before pmd_addr_end(), which will yield the correct page table boundary on all configurations. This is similar to what existing page walkers do when walking each level of the page table. Additionally, the patch clarifies a comment for the get_locked_pte() call in the function. Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reported-by: Fengguang Wu <fengguang.wu@intel.com> Reviewed-by: Bob Liu <bob.liu@oracle.com> Cc: Jörn Engel <joern@logfs.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Michel Lespinasse <walken@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Rik van Riel <riel@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
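As a rough illustration of the boundary problem above, here is a small user-space model (not kernel code; the page sizes, masks, and helper names are simplified stand-ins): a real pmd_addr_end()-style clamp stops the walk at the next page-table boundary, while a folded, no-op variant lets it run to the end of the whole range.

```c
#include <stdio.h>

#define PMD_SHIFT 21                         /* 2 MB PMD entries, illustrative */
#define PMD_SIZE  (1UL << PMD_SHIFT)
#define PMD_MASK  (~(PMD_SIZE - 1))

/* Clamp to the next PMD boundary, mirroring the shape of the real macro. */
static unsigned long pmd_addr_end(unsigned long addr, unsigned long end)
{
    unsigned long boundary = (addr + PMD_SIZE) & PMD_MASK;
    return (boundary - 1 < end - 1) ? boundary : end;
}

/* On a configuration with the pmd level folded, the helper degenerates. */
static unsigned long folded_pmd_addr_end(unsigned long addr, unsigned long end)
{
    (void)addr;
    return end;
}

int main(void)
{
    unsigned long start = 0x1ff000, end = 0x600000;

    printf("real clamp:   walk stops at %#lx\n", pmd_addr_end(start, end));
    printf("folded clamp: walk stops at %#lx\n", folded_pmd_addr_end(start, end));
    /* Without an outer pgd/pud clamp, the folded case lets pte++ run all the
     * way to 'end', i.e. off the end of the current page table. */
    return 0;
}
```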
-
- 25 September 2013, 1 commit
-
-
Committed by Paul E. McKenney
There is a loop in do_mlockall() that lacks a preemption point, which means that the following can happen on non-preemptible builds of the kernel. Dave Jones reports: "My fuzz tester keeps hitting this. Every instance shows the non-irq stack came in from mlockall. I'm only seeing this on one box, but that has more ram (8gb) than my other machines, which might explain it. INFO: rcu_preempt self-detected stall on CPU { 3} (t=6500 jiffies g=470344 c=470343 q=0) sending NMI to all CPUs: NMI backtrace for cpu 3 CPU: 3 PID: 29664 Comm: trinity-child2 Not tainted 3.11.0-rc1+ #32 Call Trace: lru_add_drain_all+0x15/0x20 SyS_mlockall+0xa5/0x1a0 tracesys+0xdd/0xe2" This commit addresses this problem by inserting the required preemption point. Reported-by: NDave Jones <davej@redhat.com> Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com> Cc: Michel Lespinasse <walken@google.com> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 21 September 2013, 1 commit
-
-
Committed by Paul E. McKenney
There is a loop in do_mlockall() that lacks a preemption point, which means that the following can happen on non-preemptible builds of the kernel: > My fuzz tester keeps hitting this. Every instance shows the non-irq stack > came in from mlockall. I'm only seeing this on one box, but that has more > ram (8gb) than my other machines, which might explain it. > > Dave > > INFO: rcu_preempt self-detected stall on CPU { 3} (t=6500 jiffies g=470344 c=470343 q=0) > sending NMI to all CPUs: > NMI backtrace for cpu 3 > CPU: 3 PID: 29664 Comm: trinity-child2 Not tainted 3.11.0-rc1+ #32 > task: ffff88023e743fc0 ti: ffff88022f6f2000 task.ti: ffff88022f6f2000 > RIP: 0010:[<ffffffff810bf7d1>] [<ffffffff810bf7d1>] trace_hardirqs_off_caller+0x21/0xb0 > RSP: 0018:ffff880244e03c30 EFLAGS: 00000046 > RAX: ffff88023e743fc0 RBX: 0000000000000001 RCX: 000000000000003c > RDX: 000000000000000f RSI: 0000000000000004 RDI: ffffffff81033cab > RBP: ffff880244e03c38 R08: ffff880243288a80 R09: 0000000000000001 > R10: 0000000000000000 R11: 0000000000000001 R12: ffff880243288a80 > R13: ffff8802437eda40 R14: 0000000000080000 R15: 000000000000d010 > FS: 00007f50ae33b740(0000) GS:ffff880244e00000(0000) knlGS:0000000000000000 > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > CR2: 000000000097f000 CR3: 0000000240fa0000 CR4: 00000000001407e0 > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 > DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600 > Stack: > ffffffff810bf86d ffff880244e03c98 ffffffff81033cab 0000000000000096 > 000000000000d008 0000000300000002 0000000000000004 0000000000000003 > 0000000000002710 ffffffff81c50d00 ffffffff81c50d00 ffff880244fcde00 > Call Trace: > <IRQ> > [<ffffffff810bf86d>] ? trace_hardirqs_off+0xd/0x10 > [<ffffffff81033cab>] __x2apic_send_IPI_mask+0x1ab/0x1c0 > [<ffffffff81033cdc>] x2apic_send_IPI_all+0x1c/0x20 > [<ffffffff81030115>] arch_trigger_all_cpu_backtrace+0x65/0xa0 > [<ffffffff811144b1>] rcu_check_callbacks+0x331/0x8e0 > [<ffffffff8108bfa0>] ? hrtimer_run_queues+0x20/0x180 > [<ffffffff8109e905>] ? sched_clock_cpu+0xb5/0x100 > [<ffffffff81069557>] update_process_times+0x47/0x80 > [<ffffffff810bd115>] tick_sched_handle.isra.16+0x25/0x60 > [<ffffffff810bd231>] tick_sched_timer+0x41/0x60 > [<ffffffff8108ace1>] __run_hrtimer+0x81/0x4e0 > [<ffffffff810bd1f0>] ? tick_sched_do_timer+0x60/0x60 > [<ffffffff8108b93f>] hrtimer_interrupt+0xff/0x240 > [<ffffffff8102de84>] local_apic_timer_interrupt+0x34/0x60 > [<ffffffff81718c5f>] smp_apic_timer_interrupt+0x3f/0x60 > [<ffffffff817178ef>] apic_timer_interrupt+0x6f/0x80 > [<ffffffff8170e8e0>] ? retint_restore_args+0xe/0xe > [<ffffffff8105f101>] ? __do_softirq+0xb1/0x440 > [<ffffffff8105f64d>] irq_exit+0xcd/0xe0 > [<ffffffff81718c65>] smp_apic_timer_interrupt+0x45/0x60 > [<ffffffff817178ef>] apic_timer_interrupt+0x6f/0x80 > <EOI> > [<ffffffff8170e8e0>] ? retint_restore_args+0xe/0xe > [<ffffffff8170b830>] ? wait_for_completion_killable+0x170/0x170 > [<ffffffff8170c853>] ? preempt_schedule_irq+0x53/0x90 > [<ffffffff8170e9f6>] retint_kernel+0x26/0x30 > [<ffffffff8107a523>] ? queue_work_on+0x43/0x90 > [<ffffffff8107c369>] schedule_on_each_cpu+0xc9/0x1a0 > [<ffffffff81167770>] ? lru_add_drain+0x50/0x50 > [<ffffffff811677c5>] lru_add_drain_all+0x15/0x20 > [<ffffffff81186965>] SyS_mlockall+0xa5/0x1a0 > [<ffffffff81716e94>] tracesys+0xdd/0xe2 This commit addresses this problem by inserting the required preemption point. Reported-by: NDave Jones <davej@redhat.com> Signed-off-by: NPaul E. 
McKenney <paulmck@linux.vnet.ibm.com> Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com> Cc: Michel Lespinasse <walken@google.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org>
-
- 12 September 2013, 6 commits
-
-
Committed by Vlastimil Babka
Currently munlock_vma_pages_range() calls follow_page_mask() to obtain each individual struct page. This entails repeated full page table translations and page table lock taken for each page separately. This patch avoids the costly follow_page_mask() where possible, by iterating over ptes within single pmd under single page table lock. The first pte is obtained by get_locked_pte() for non-THP page acquired by the initial follow_page_mask(). The rest of the on-stack pagevec for munlock is filled up using pte_walk as long as pte_present() and vm_normal_page() are sufficient to obtain the struct page. After this patch, a 14% speedup was measured for munlocking a 56GB large memory area with THP disabled. Signed-off-by: NVlastimil Babka <vbabka@suse.cz> Cc: Jörn Engel <joern@logfs.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Michel Lespinasse <walken@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Rik van Riel <riel@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
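A hedged user-space sketch of the batching idea described above; the lock counter, batch size, and function names are illustrative, not the kernel's. It contrasts one lookup-and-lock per page with filling a small on-stack vector per lock round trip.

```c
#include <stdio.h>

#define PAGE_SIZE    4096UL
#define PAGEVEC_SIZE 14              /* small on-stack batch, illustrative */

static unsigned long locks_taken;

/* Stand-in for the per-page follow_page_mask() path: one lock per page. */
static void process_one_by_one(unsigned long start, unsigned long end)
{
    for (unsigned long addr = start; addr < end; addr += PAGE_SIZE)
        locks_taken++;               /* page table lock taken for each page */
}

/* Stand-in for the batched path: one lock fills up to PAGEVEC_SIZE pages. */
static void process_batched(unsigned long start, unsigned long end)
{
    unsigned long addr = start;
    while (addr < end) {
        int filled = 0;
        locks_taken++;               /* single get_locked_pte()-style lock */
        while (addr < end && filled < PAGEVEC_SIZE) {
            filled++;                /* "pte++" within the same page table */
            addr += PAGE_SIZE;
        }
        /* ...munlock the whole batch here... */
    }
}

int main(void)
{
    unsigned long start = 0, end = 1024 * PAGE_SIZE;

    locks_taken = 0; process_one_by_one(start, end);
    printf("per-page: %lu lock acquisitions\n", locks_taken);

    locks_taken = 0; process_batched(start, end);
    printf("batched:  %lu lock acquisitions\n", locks_taken);
    return 0;
}
```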
-
Committed by Vlastimil Babka
The performance of the fast path in munlock_vma_range() can be further improved by avoiding atomic ops of a redundant get_page()/put_page() pair. When calling get_page() during page isolation, we already have the pin from follow_page_mask(). This pin will be then returned by __pagevec_lru_add(), after which we do not reference the pages anymore. After this patch, an 8% speedup was measured for munlocking a 56GB large memory area with THP disabled. Signed-off-by: NVlastimil Babka <vbabka@suse.cz> Reviewed-by: NJörn Engel <joern@logfs.org> Acked-by: NMel Gorman <mgorman@suse.de> Cc: Michel Lespinasse <walken@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Rik van Riel <riel@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
Committed by Vlastimil Babka
After introducing batching by pagevecs into munlock_vma_range(), we can further improve performance by bypassing the copying into the per-cpu pagevec and the get_page/put_page pair associated with that. Instead we perform LRU putback directly from our pagevec. However, this is possible only for single-mapped pages that are evictable after munlock. Unevictable pages require rechecking after putting on the unevictable list, so for those we fall back to putback_lru_page(), which handles that. After this patch, a 13% speedup was measured for munlocking a 56GB large memory area with THP disabled. [akpm@linux-foundation.org: clarify comment] Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Jörn Engel <joern@logfs.org> Acked-by: Mel Gorman <mgorman@suse.de> Cc: Michel Lespinasse <walken@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Rik van Riel <riel@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Committed by Vlastimil Babka
Depending on the previous patch which introduced batched isolation in munlock_vma_range(), we can also batch the updates of the NR_MLOCK page stats. After the whole pagevec is processed for page isolation, the stats are updated only once with the number of successful isolations. There were however no measurable performance gains. Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Jörn Engel <joern@logfs.org> Acked-by: Mel Gorman <mgorman@suse.de> Cc: Michel Lespinasse <walken@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Rik van Riel <riel@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Committed by Vlastimil Babka
Currently, munlock_vma_range() calls munlock_vma_page on each page in a loop, which results in repeated taking and releasing of the lru_lock spinlock for isolating pages one by one. This patch batches the munlock operations using an on-stack pagevec, so that isolation is done under single lru_lock. For THP pages, the old behavior is preserved as they might be split while putting them into the pagevec. After this patch, a 9% speedup was measured for munlocking a 56GB large memory area with THP disabled. A new function __munlock_pagevec() is introduced that takes a pagevec and: 1) It clears PageMlocked and isolates all pages under lru_lock. Zone page stats can be also updated using the variant which assumes disabled interrupts. 2) It finishes the munlock and lru putback on all pages under their lock_page. Note that previously, lock_page covered also the PageMlocked clearing and page isolation, but it is not needed for those operations. Signed-off-by: NVlastimil Babka <vbabka@suse.cz> Reviewed-by: NJörn Engel <joern@logfs.org> Acked-by: NMel Gorman <mgorman@suse.de> Cc: Michel Lespinasse <walken@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Rik van Riel <riel@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
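Below is a minimal user-space model of the two-phase structure described above, with the locking reduced to comments; the struct fields and helper names are invented for illustration and are not the kernel's.

```c
#include <stdio.h>

#define BATCH 14                        /* a pagevec-sized batch, illustrative */

struct page_model { int mlocked; int isolated; int finished; };

/* Phase 1: one lru_lock round trip clears the "mlocked" flag and isolates
 * the whole batch (zone page stats could be updated here as well). */
static int phase1_isolate(struct page_model *pvec, int n)
{
    int isolated = 0;
    /* spin_lock_irq(&zone->lru_lock) would go here */
    for (int i = 0; i < n; i++) {
        pvec[i].mlocked = 0;
        pvec[i].isolated = 1;
        isolated++;
    }
    /* spin_unlock_irq(&zone->lru_lock) would go here */
    return isolated;
}

/* Phase 2: finish munlock and LRU putback page by page; only this part
 * still needs the per-page lock_page(). */
static void phase2_putback(struct page_model *pvec, int n)
{
    for (int i = 0; i < n; i++)
        pvec[i].finished = 1;           /* lock_page(); ...; unlock_page() */
}

int main(void)
{
    struct page_model pvec[BATCH];

    for (int i = 0; i < BATCH; i++)
        pvec[i] = (struct page_model){ .mlocked = 1 };

    int isolated = phase1_isolate(pvec, BATCH);
    phase2_putback(pvec, BATCH);
    printf("isolated %d pages under a single lru_lock round trip\n", isolated);
    return 0;
}
```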
-
Committed by Vlastimil Babka
In munlock_vma_range(), lru_add_drain() is currently called in a loop before each munlock_vma_page() call. This is suboptimal for performance when munlocking many pages. The benefits of the per-cpu pagevec for batching the LRU putback are removed since the pagevec only holds at most one page from the previous loop's iteration. The lru_add_drain() call also does not serve any purpose for correctness - it does not even drain the pagevecs of all CPUs. The munlock code already expects and handles situations where a page cannot be isolated from the LRU (e.g. because it is on some per-cpu pagevec). The history of the (not commented) call also suggests that it appears there as an oversight rather than intentionally. Before commit ff6a6da6 ("mm: accelerate munlock() treatment of THP pages") the call happened only once upon entering the function. The commit has moved the call into the while loop. So while the other changes in the commit improved munlock performance for THP pages, it introduced the abovementioned suboptimal per-cpu pagevec usage. Further in history, before commit 408e82b7 ("mm: munlock use follow_page"), munlock_vma_pages_range() was just a wrapper around __mlock_vma_pages_range which performed both mlock and munlock depending on a flag. However, before ba470de4 ("mmap: handle mlocked pages during map, remap, unmap") the function handled only mlock, not munlock. The lru_add_drain call thus comes from the implementation in commit b291f000 ("mlock: mlocked pages are unevictable") and was intended only for mlocking, not munlocking. The original intention of draining the LRU pagevec at mlock time was to ensure the pages were on the LRU before the lock operation so that they could be placed on the unevictable list immediately. There is very little motivation to do the same in the munlock path, particularly for every single page. This patch therefore removes the call completely. After removing the call, a 10% speedup was measured for munlock() of a 56GB large memory area with THP disabled. Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Jörn Engel <joern@logfs.org> Acked-by: Mel Gorman <mgorman@suse.de> Cc: Michel Lespinasse <walken@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Rik van Riel <riel@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
- 29 March 2013, 1 commit
-
-
Committed by Michel Lespinasse
This reverts commit 18693050 ("mm: introduce VM_POPULATE flag to better deal with racy userspace programs"). VM_POPULATE only has any effect when userspace plays racy games with vmas by trying to unmap and remap memory regions that mmap or mlock are operating on. Also, the only effect of VM_POPULATE when userspace plays such games is that it avoids populating new memory regions that get remapped into the address range that was being operated on by the original mmap or mlock calls. Let's remove VM_POPULATE as there isn't any strong argument to mandate a new vm_flag. Signed-off-by: NMichel Lespinasse <walken@google.com> Signed-off-by: NHugh Dickins <hughd@google.com> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 28 February 2013, 1 commit
-
-
Committed by Michel Lespinasse
munlock_vma_pages_range() was always incrementing addresses by PAGE_SIZE at a time. When munlocking THP pages (or the huge zero page), this resulted in taking the mm->page_table_lock 512 times in a row. We can do better by making use of the page_mask returned by follow_page_mask (for the huge zero page case), or the size of the page munlock_vma_page() operated on (for the true THP page case). Signed-off-by: NMichel Lespinasse <walken@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
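A small sketch of the stride computation this change enables, assuming 4 KB pages and 512 small pages per THP; the page_mask convention mirrors the description above, but the code is only an illustration, not kernel code.

```c
#include <stdio.h>

#define PAGE_SIZE    4096UL
#define HPAGE_PMD_NR 512UL          /* small pages per 2 MB THP, illustrative */

/* Count loop iterations (each of which used to take mm->page_table_lock). */
static unsigned long iterations(unsigned long len, unsigned long page_mask)
{
    unsigned long iters = 0;
    for (unsigned long addr = 0; addr < len; ) {
        iters++;
        /* page_mask is 0 for a normal page and HPAGE_PMD_NR - 1 for a THP
         * head, mirroring what follow_page_mask() is described as reporting. */
        addr += (page_mask + 1) * PAGE_SIZE;
    }
    return iters;
}

int main(void)
{
    unsigned long len = HPAGE_PMD_NR * PAGE_SIZE;   /* one 2 MB huge page */

    printf("PAGE_SIZE stride: %lu iterations\n", iterations(len, 0));
    printf("page_mask stride: %lu iterations\n",
           iterations(len, HPAGE_PMD_NR - 1));
    return 0;
}
```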
-
- 24 February 2013, 5 commits
-
-
Committed by Michel Lespinasse
Use long type for page counts in mm_populate() so as to avoid integer overflow when running the following test code: int main(void) { void *p = mmap(NULL, 0x100000000000, PROT_READ, MAP_PRIVATE | MAP_ANON, -1, 0); printf("p: %p\n", p); mlockall(MCL_CURRENT); printf("done\n"); return 0; } Signed-off-by: NMichel Lespinasse <walken@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
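The overflow is easy to demonstrate in user space; the sketch below assumes a 64-bit build and reuses the 0x100000000000-byte length from the test program quoted above.

```c
#include <stdio.h>

#define PAGE_SHIFT 12

int main(void)
{
    unsigned long len = 0x100000000000UL;   /* same length as the test mapping */

    int  npages_int  = len >> PAGE_SHIFT;   /* 2^32 pages truncates, typically to 0 */
    long npages_long = len >> PAGE_SHIFT;   /* fits comfortably in a long */

    printf("pages as int:  %d\n", npages_int);
    printf("pages as long: %ld\n", npages_long);
    return 0;
}
```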
-
Committed by Johannes Weiner
The fact that mlock calls get_user_pages, and get_user_pages might call mlock when expanding a stack looks like a potential recursion. However, mlock makes sure the requested range is already contained within a vma, so no stack expansion will actually happen from mlock. Should this ever change: the stack expansion mlocks only the newly expanded range and so will not result in recursive expansion. Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org> Reported-by: NAl Viro <viro@ZenIV.linux.org.uk> Cc: Hugh Dickins <hughd@google.com> Acked-by: NMichel Lespinasse <walken@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
Committed by Michel Lespinasse
The vm_populate() code populates user mappings without constantly holding the mmap_sem. This makes it susceptible to racy userspace programs: the user mappings may change while vm_populate() is running, and in this case vm_populate() may end up populating the new mapping instead of the old one. In order to reduce the possibility of userspace getting surprised by this behavior, this change introduces the VM_POPULATE vma flag which gets set on vmas we want vm_populate() to work on. This way vm_populate() may still end up populating the new mapping after such a race, but only if the new mapping is also one that the user has requested (using MAP_SHARED, MAP_LOCKED or mlock) to be populated. Signed-off-by: NMichel Lespinasse <walken@google.com> Acked-by: NRik van Riel <riel@redhat.com> Tested-by: NAndy Lutomirski <luto@amacapital.net> Cc: Greg Ungerer <gregungerer@westnet.com.au> Cc: David Howells <dhowells@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
Committed by Michel Lespinasse
In find_extend_vma(), we don't need mlock_vma_pages_range() to verify the vma type - we know we're working with a stack. So, we can call directly into __mlock_vma_pages_range(), and remove the last make_pages_present() call site. Note that we don't use mm_populate() here, so we can't release the mmap_sem while allocating new stack pages. This is deemed acceptable, because the stack vmas grow by a bounded number of pages at a time, and these are anon pages so we don't have to read from disk to populate them. Signed-off-by: NMichel Lespinasse <walken@google.com> Acked-by: NRik van Riel <riel@redhat.com> Tested-by: NAndy Lutomirski <luto@amacapital.net> Cc: Greg Ungerer <gregungerer@westnet.com.au> Cc: David Howells <dhowells@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
Committed by Michel Lespinasse
When creating new mappings using the MAP_POPULATE / MAP_LOCKED flags (or with MCL_FUTURE in effect), we want to populate the pages within the newly created vmas. This may take a while as we may have to read pages from disk, so ideally we want to do this outside of the write-locked mmap_sem region. This change introduces mm_populate(), which is used to defer populating such mappings until after the mmap_sem write lock has been released. This is implemented as a generalization of the former do_mlock_pages(), which accomplished the same task but was used during mlock() / mlockall(). Signed-off-by: Michel Lespinasse <walken@google.com> Reported-by: Andy Lutomirski <luto@amacapital.net> Acked-by: Rik van Riel <riel@redhat.com> Tested-by: Andy Lutomirski <luto@amacapital.net> Cc: Greg Ungerer <gregungerer@westnet.com.au> Cc: David Howells <dhowells@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
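For reference, a minimal user-space example of the mapping flags whose population mm_populate() now defers; error handling is kept short, and locking 1 MB assumes RLIMIT_MEMLOCK allows it.

```c
#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
    size_t len = 1UL << 20;     /* 1 MB, small enough for typical memlock limits */

    /* Ask the kernel to fault the pages in (and lock them) at mmap() time;
     * after this change that population happens outside the write-locked
     * mmap_sem section. */
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE | MAP_LOCKED,
                   -1, 0);
    if (p == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    /* MCL_FUTURE makes later mappings behave as if MAP_LOCKED were set. */
    if (mlockall(MCL_FUTURE) != 0)
        perror("mlockall");

    printf("mapped and populated %zu bytes at %p\n", len, p);
    munmap(p, len);
    return 0;
}
```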
-
- 13 February 2013, 1 commit
-
-
Committed by Gerald Schaefer
With commit 8e72033f ("thp: make MADV_HUGEPAGE check for mm->def_flags") the VM_NOHUGEPAGE flag may be set on s390 in mm->def_flags for certain processes, to prevent future thp mappings. This would be overwritten by do_mlockall(), which sets it back to 0 with an optional VM_LOCKED flag set. To fix this, instead of overwriting mm->def_flags in do_mlockall(), only the VM_LOCKED flag should be set or cleared. Signed-off-by: NGerald Schaefer <gerald.schaefer@de.ibm.com> Reported-by: NVivek Goyal <vgoyal@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: <stable@vger.kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
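A hedged sketch of the difference between overwriting mm->def_flags and only toggling VM_LOCKED; the flag constants below are illustrative values rather than anything taken from kernel headers.

```c
#include <stdio.h>

/* Illustrative bit values standing in for the kernel's VM_* constants. */
#define VM_LOCKED     0x00002000UL
#define VM_NOHUGEPAGE 0x40000000UL

static unsigned long mlockall_overwrite(unsigned long def_flags, int lock)
{
    (void)def_flags;                    /* old behaviour: existing bits ignored */
    return lock ? VM_LOCKED : 0;
}

static unsigned long mlockall_masked(unsigned long def_flags, int lock)
{
    /* fixed behaviour: only the VM_LOCKED bit is set or cleared */
    if (lock)
        return def_flags | VM_LOCKED;
    return def_flags & ~VM_LOCKED;
}

int main(void)
{
    unsigned long def_flags = VM_NOHUGEPAGE;   /* e.g. set per process on s390 */

    printf("overwrite: %#lx (VM_NOHUGEPAGE lost)\n",
           mlockall_overwrite(def_flags, 1));
    printf("masked:    %#lx (VM_NOHUGEPAGE kept)\n",
           mlockall_masked(def_flags, 1));
    return 0;
}
```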
-
- 09 October 2012, 3 commits
-
-
Committed by David Rientjes
NR_MLOCK is only accounted in single page units: there's no logic to handle transparent hugepages. This patch checks the appropriate number of pages to adjust the statistics by so that the correct amount of memory is reflected. Currently: $ grep Mlocked /proc/meminfo Mlocked: 19636 kB #define MAP_SIZE (4 << 30) /* 4GB */ void *ptr = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, 0, 0); mlock(ptr, MAP_SIZE); $ grep Mlocked /proc/meminfo Mlocked: 29844 kB munlock(ptr, MAP_SIZE); $ grep Mlocked /proc/meminfo Mlocked: 19636 kB And with this patch: $ grep Mlock /proc/meminfo Mlocked: 19636 kB mlock(ptr, MAP_SIZE); $ grep Mlock /proc/meminfo Mlocked: 4213664 kB munlock(ptr, MAP_SIZE); $ grep Mlock /proc/meminfo Mlocked: 19636 kB Signed-off-by: NDavid Rientjes <rientjes@google.com> Reported-by: NHugh Dickens <hughd@google.com> Acked-by: NHugh Dickins <hughd@google.com> Reviewed-by: NAndrea Arcangeli <aarcange@redhat.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: NMichel Lespinasse <walken@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
Committed by Hugh Dickins
We had thought that pages could no longer get freed while still marked as mlocked; but Johannes Weiner posted this program to demonstrate that truncating an mlocked private file mapping containing COWed pages is still mishandled: #include <sys/types.h> #include <sys/mman.h> #include <sys/stat.h> #include <stdlib.h> #include <unistd.h> #include <fcntl.h> #include <stdio.h> int main(void) { char *map; int fd; system("grep mlockfreed /proc/vmstat"); fd = open("chigurh", O_CREAT|O_EXCL|O_RDWR); unlink("chigurh"); ftruncate(fd, 4096); map = mmap(NULL, 4096, PROT_WRITE, MAP_PRIVATE, fd, 0); map[0] = 11; mlock(map, sizeof(fd)); ftruncate(fd, 0); close(fd); munlock(map, sizeof(fd)); munmap(map, 4096); system("grep mlockfreed /proc/vmstat"); return 0; } The anon COWed pages are not caught by truncation's clear_page_mlock() of the pagecache pages; but unmap_mapping_range() unmaps them, so we ought to look out for them there in page_remove_rmap(). Indeed, why should truncation or invalidation be doing the clear_page_mlock() when removing from pagecache? mlock is a property of mapping in userspace, not a property of pagecache: an mlocked unmapped page is nonsensical. Reported-by: NJohannes Weiner <hannes@cmpxchg.org> Signed-off-by: NHugh Dickins <hughd@google.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Rik van Riel <riel@redhat.com> Cc: Michel Lespinasse <walken@google.com> Cc: Ying Han <yinghan@google.com> Acked-by: NJohannes Weiner <hannes@cmpxchg.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
Committed by Konstantin Khlebnikov
A long time ago, in v2.4, VM_RESERVED kept the swapout process off the VMA; it has since lost its original meaning but still has some effects:

 |         effect         |           alternative flags
-+------------------------+---------------------------------------------
1| account as reserved_vm | VM_IO
2| skip in core dump      | VM_IO, VM_DONTDUMP
3| do not merge or expand | VM_IO, VM_DONTEXPAND, VM_HUGETLB, VM_PFNMAP
4| do not mlock           | VM_IO, VM_DONTEXPAND, VM_HUGETLB, VM_PFNMAP

This patch removes the reserved_vm counter from mm_struct. It seems nobody cares about it; it is not exported into userspace directly, it only reduces the total_vm shown in proc. Thus VM_RESERVED can be replaced with VM_IO or the pair VM_DONTEXPAND | VM_DONTDUMP. remap_pfn_range() and io_remap_pfn_range() set VM_IO|VM_DONTEXPAND|VM_DONTDUMP. remap_vmalloc_range() sets VM_DONTEXPAND | VM_DONTDUMP. [akpm@linux-foundation.org: drivers/vfio/pci/vfio_pci.c fixup] Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Carsten Otte <cotte@de.ibm.com> Cc: Chris Metcalf <cmetcalf@tilera.com> Cc: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Eric Paris <eparis@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Hugh Dickins <hughd@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: James Morris <james.l.morris@oracle.com> Cc: Jason Baron <jbaron@redhat.com> Cc: Kentaro Takeda <takedakn@nttdata.co.jp> Cc: Matt Helsley <matthltc@us.ibm.com> Cc: Nick Piggin <npiggin@kernel.dk> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Robert Richter <robert.richter@amd.com> Cc: Suresh Siddha <suresh.b.siddha@intel.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Venkatesh Pallipadi <venki@google.com> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
- 07 March 2012, 1 commit
-
-
Committed by Linus Torvalds
Several users of "find_vma_prev()" were not in fact interested in the previous vma if there was no primary vma to be found either. And in those cases, we're much better off just using the regular "find_vma()", and then "prev" can be looked up by just checking vma->vm_prev. The find_vma_prev() semantics are fairly subtle (see Mikulas' recent commit 83cd904d: "mm: fix find_vma_prev"), and the whole "return prev by reference" means that it generates worse code too. Thus this "let's avoid using this inconvenient and clearly too subtle interface when we don't really have to" patch. Cc: Mikulas Patocka <mpatocka@redhat.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 01 November 2011, 2 commits
-
-
Committed by Hugh Dickins
A process spent 30 minutes exiting, just munlocking the pages of a large anonymous area that had been alternately mprotected into page-sized vmas: for every single page there's an anon_vma walk through all the other little vmas to find the right one. A general fix to that would be a lot more complicated (use prio_tree on anon_vma?), but there's one very simple thing we can do to speed up the common case: if a page to be munlocked is mapped only once, then it is our vma that it is mapped into, and there's no need whatever to walk through all the others. Okay, there is a very remote race in munlock_vma_pages_range(), if between its follow_page() and lock_page(), another process were to munlock the same page, then page reclaim remove it from our vma, then another process mlock it again. We would find it with page_mapcount 1, yet it's still mlocked in another process. But never mind, that's much less likely than the down_read_trylock() failure which munlocking already tolerates (in try_to_unmap_one()): in due course page reclaim will discover and move the page to unevictable instead. [akpm@linux-foundation.org: add comment] Signed-off-by: NHugh Dickins <hughd@google.com> Cc: Michel Lespinasse <walken@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
Committed by Christoph Lameter
MCL_FUTURE does not move pages between lru lists, and draining the LRU per-cpu pagevecs is a nasty activity. Avoid doing it unnecessarily. Signed-off-by: Christoph Lameter <cl@gentwo.org> Cc: David Rientjes <rientjes@google.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Acked-by: Johannes Weiner <jweiner@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
- 31 October 2011, 1 commit
-
-
Committed by Paul Gortmaker
The files changed within are only using the EXPORT_SYMBOL macro variants. They are not using core modular infrastructure and hence don't need module.h but only the export.h header. Signed-off-by: NPaul Gortmaker <paul.gortmaker@windriver.com>
-
- 27 May 2011, 1 commit
-
-
Committed by KOSAKI Motohiro
The type of vma->vm_flags is 'unsigned long'. Neither 'int' nor 'unsigned int'. This patch fixes such misuse. Signed-off-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> [ Changed to use a typedef - we'll extend it to cover more cases later, since there has been discussion about making it a 64-bit type.. - Linus ] Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
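A small demonstration, assuming an LP64 build, of why an int is the wrong type once a flag could live above bit 31; the typedef mirrors the direction taken here, while the high bit chosen is purely hypothetical.

```c
#include <stdio.h>

typedef unsigned long vm_flags_t;    /* the patch's direction: a dedicated type */

int main(void)
{
    vm_flags_t flags = 1UL << 32;    /* a hypothetical high flag bit */

    unsigned int as_int = flags;     /* silently truncated on a 64-bit build */
    vm_flags_t   as_ul  = flags;

    printf("stored in unsigned int: %#x\n", as_int);
    printf("stored in vm_flags_t:   %#lx\n", (unsigned long)as_ul);
    return 0;
}
```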
-
- 05 May 2011, 1 commit
-
-
Committed by Linus Torvalds
The logic in __get_user_pages() used to skip the stack guard page lookup whenever the caller wasn't interested in seeing what the actual page was. But Michel Lespinasse points out that there are cases where we don't care about the physical page itself (so 'pages' may be NULL), but do want to make sure a page is mapped into the virtual address space. So using the existence of the "pages" array as an indication of whether to look up the guard page or not isn't actually so great, and we really should just use the FOLL_MLOCK bit. But because that bit was only set for the VM_LOCKED case (and not all vma's necessarily have it, even for mlock()), we couldn't do that originally. Fix that by moving the VM_LOCKED check deeper into the call-chain, which actually simplifies many things. Now mlock() gets simpler, and we can also check for FOLL_MLOCK in __get_user_pages() and the code ends up much more straightforward. Reported-and-reviewed-by: NMichel Lespinasse <walken@google.com> Cc: stable@kernel.org Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 13 April 2011, 1 commit
-
-
Committed by Linus Torvalds
Commit 53a7706d ("mlock: do not hold mmap_sem for extended periods of time") changed mlock() to care about the exact number of pages that __get_user_pages() had brought it. Before, it would only care about errors. And that doesn't work, because we also handled one page specially in __mlock_vma_pages_range(), namely the stack guard page. So when that case was handled, the number of pages that the function returned was off by one. In particular, it could be zero, and then the caller would end up not making any progress at all. Rather than try to fix up that off-by-one error for the mlock case specially, this just moves the logic to handle the stack guard page into __get_user_pages() itself, thus making all the counts come out right automatically. Reported-by: Robert Święcki <robert@swiecki.net> Cc: Hugh Dickins <hughd@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: stable@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
- 24 March 2011, 1 commit
-
-
Committed by Stephen Wilson
Morally, the presence of a gate vma is more an attribute of a particular mm than a particular task. Moreover, dropping the dependency on task_struct will help make both existing and future operations on mm's more flexible and convenient. Signed-off-by: NStephen Wilson <wilsons@start.ca> Reviewed-by: NMichel Lespinasse <walken@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
- 02 February 2011, 1 commit
-
-
Committed by Michel Lespinasse
As Tao Ma noticed, change 5ecfda04 breaks blktrace. This is because blktrace mmaps a file with PROT_WRITE permissions but without PROT_READ, so my attempt to not unnecessarity break COW during mlock ended up causing mlock to fail with a permission problem. I am proposing to let mlock ignore vma protection in all cases except PROT_NONE. In particular, mlock should not fail for PROT_WRITE regions (as in the blktrace case, which broke at 5ecfda04) or for PROT_EXEC regions (which seem to me like they were always broken). Signed-off-by: NMichel Lespinasse <walken@google.com> Acked-by: NRik van Riel <riel@redhat.com> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
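A user-space reproduction of the case this revert fixes, modelled on the blktrace setup described above; whether the mlock() calls actually succeed also depends on RLIMIT_MEMLOCK, and the PROT_NONE case is the one the commit message says should still be refused.

```c
#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
    size_t len = 8 * 4096UL;    /* 32 KB, well under typical memlock limits */

    /* Write-only mapping, as blktrace sets up for its buffers. */
    void *wr = mmap(NULL, len, PROT_WRITE,
                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    /* PROT_NONE mapping: the one case mlock() is still expected to refuse. */
    void *none = mmap(NULL, len, PROT_NONE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (wr == MAP_FAILED || none == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    printf("mlock(PROT_WRITE region): %s\n",
           mlock(wr, len) == 0 ? "ok" : "failed");
    printf("mlock(PROT_NONE region):  %s\n",
           mlock(none, len) == 0 ? "ok" : "failed");

    munlock(wr, len);
    munmap(wr, len);
    munmap(none, len);
    return 0;
}
```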
-
- 14 January 2011, 5 commits
-
-
Committed by Michel Lespinasse
__get_user_pages gets a new 'nonblocking' parameter to signal that the caller is prepared to re-acquire mmap_sem and retry the operation if needed. This is used to split off long operations if they are going to block on a disk transfer, or when we detect contention on the mmap_sem. [akpm@linux-foundation.org: remove ref to rwsem_is_contended()] Signed-off-by: NMichel Lespinasse <walken@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Rik van Riel <riel@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Nick Piggin <npiggin@kernel.dk> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: David Howells <dhowells@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
Committed by Michel Lespinasse
Use a single code path for faulting in pages during mlock. The reason to have it in this patch series is that I did not want to update both code paths in a later change that releases mmap_sem when blocking on disk. Signed-off-by: NMichel Lespinasse <walken@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Rik van Riel <riel@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Nick Piggin <npiggin@kernel.dk> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: David Howells <dhowells@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
Committed by Michel Lespinasse
Move the code to mlock pages from __mlock_vma_pages_range() to follow_page(). This allows __mlock_vma_pages_range() to not have to break down work into 16-page batches. An additional motivation for doing this within the present patch series is that it'll make it easier for a later change to drop mmap_sem when blocking on disk (we'd like to be able to resume at the page that was read from disk instead of at the start of a 16-page batch). Signed-off-by: Michel Lespinasse <walken@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Rik van Riel <riel@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Nick Piggin <npiggin@kernel.dk> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: David Howells <dhowells@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Committed by Michel Lespinasse
Currently mlock() holds mmap_sem in exclusive mode while the pages get faulted in. In the case of a large mlock, this can potentially take a very long time, during which various commands such as 'ps auxw' will block. This makes sysadmins unhappy: real 14m36.232s user 0m0.003s sys 0m0.015s (output from 'time ps auxw' while a 20GB file was being mlocked without being previously preloaded into page cache) I propose that mlock() could release mmap_sem after the VM_LOCKED bits have been set in all appropriate VMAs. Then a second pass could be done to actually mlock the pages, in small batches, releasing mmap_sem when we block on disk access or when we detect some contention. This patch: Before this change, mlock() holds mmap_sem in exclusive mode while the pages get faulted in. In the case of a large mlock, this can potentially take a very long time. Various things will block while mmap_sem is held, including 'ps auxw'. This can make sysadmins angry. I propose that mlock() could release mmap_sem after the VM_LOCKED bits have been set in all appropriate VMAs. Then a second pass could be done to actually mlock the pages with mmap_sem held for reads only. We need to recheck the vma flags after we re-acquire mmap_sem, but this is easy. In the case where a vma has been munlocked before mlock completes, pages that were already marked as PageMlocked() are handled by the munlock() call, and mlock() is careful to not mark new page batches as PageMlocked() after the munlock() call has cleared the VM_LOCKED vma flags. So, the end result will be identical to what'd happen if munlock() had executed after the mlock() call. In a later change, I will allow the second pass to release mmap_sem when blocking on disk accesses or when it is otherwise contended, so that it won't be held for long periods of time even in shared mode. Signed-off-by: NMichel Lespinasse <walken@google.com> Tested-by: NValdis Kletnieks <Valdis.Kletnieks@vt.edu> Cc: Hugh Dickins <hughd@google.com> Cc: Rik van Riel <riel@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Nick Piggin <npiggin@kernel.dk> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: David Howells <dhowells@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
Committed by Michel Lespinasse
When faulting in pages for mlock(), we want to break COW for anonymous or file pages within VM_WRITABLE, non-VM_SHARED vmas. However, there is no need to write-fault into VM_SHARED vmas since shared file pages can be mlocked first and dirtied later, when/if they actually get written to. Skipping the write fault is desirable, as we don't want to unnecessarily cause these pages to be dirtied and queued for writeback. Signed-off-by: NMichel Lespinasse <walken@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Rik van Riel <riel@redhat.com> Cc: Kosaki Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Nick Piggin <npiggin@kernel.dk> Cc: Theodore Tso <tytso@google.com> Cc: Michael Rubin <mrubin@google.com> Cc: Suleiman Souhlal <suleiman@google.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Christoph Hellwig <hch@infradead.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
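A sketch of the decision rule described above - request a write fault only for writable, non-shared mappings; the flag bit values here are stand-ins chosen for illustration, not the kernel's definitions.

```c
#include <stdio.h>

/* Illustrative stand-ins for the kernel's vma and gup flag bits. */
#define VM_WRITE   0x1u
#define VM_SHARED  0x2u
#define FOLL_WRITE 0x1u
#define FOLL_MLOCK 0x2u

/* Pick fault flags for mlock per the rule in the commit message:
 * break COW (write-fault) only for writable, non-shared mappings. */
static unsigned int mlock_gup_flags(unsigned int vm_flags)
{
    unsigned int gup_flags = FOLL_MLOCK;

    if ((vm_flags & (VM_WRITE | VM_SHARED)) == VM_WRITE)
        gup_flags |= FOLL_WRITE;
    return gup_flags;
}

int main(void)
{
    printf("private writable: FOLL_WRITE=%d\n",
           !!(mlock_gup_flags(VM_WRITE) & FOLL_WRITE));
    printf("shared writable:  FOLL_WRITE=%d\n",
           !!(mlock_gup_flags(VM_WRITE | VM_SHARED) & FOLL_WRITE));
    printf("read-only:        FOLL_WRITE=%d\n",
           !!(mlock_gup_flags(0) & FOLL_WRITE));
    return 0;
}
```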
-
- 10 September 2010, 1 commit
-
-
Committed by Stefan Bader
So it can be used by all that need to check for that. Signed-off-by: NStefan Bader <stefan.bader@canonical.com> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 21 August 2010, 1 commit
-
-
Committed by Linus Torvalds
If we've split the stack vma, only the lowest one has the guard page. Now that we have a doubly linked list of vma's, checking this is trivial. Tested-by: NIan Campbell <ijc@hellion.org.uk> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 16 August 2010, 1 commit
-
-
Committed by Linus Torvalds
This commit makes the stack guard page somewhat less visible to user space. It does this by: - not showing the guard page in /proc/<pid>/maps It looks like lvm-tools will actually read /proc/self/maps to figure out where all its mappings are, and effectively do a specialized "mlockall()" in user space. By not showing the guard page as part of the mapping (by just adding PAGE_SIZE to the start for grows-up pages), lvm-tools ends up not being aware of it. - by also teaching the _real_ mlock() functionality not to try to lock the guard page. That would just expand the mapping down to create a new guard page, so there really is no point in trying to lock it in place. It would perhaps be nice to show the guard page specially in /proc/<pid>/maps (or at least mark grow-down segments some way), but let's not open ourselves up to more breakage by user space from programs that depend on the exact details of the 'maps' file. Special thanks to Henrique de Moraes Holschuh for diving into lvm-tools source code to see what was going on with the whole new warning. Reported-and-tested-by: François Valenduc <francois.valenduc@tvcablenet.be> Reported-by: Henrique de Moraes Holschuh <hmh@hmh.eng.br> Cc: stable@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
- 26 March 2010, 1 commit
-
-
Committed by Peter Zijlstra
Support for the PMU's BTS features has been upstreamed in v2.6.32, but we still have the old and disabled ptrace-BTS, as Linus noticed it not so long ago. It's buggy: TIF_DEBUGCTLMSR is trampling all over that MSR without regard for other uses (perf) and doesn't provide the flexibility needed for perf either. Its users are ptrace-block-step and ptrace-bts, since ptrace-bts was never used and ptrace-block-step can be implemented using a much simpler approach. So axe all 3000 lines of it. That includes the *locked_memory*() APIs in mm/mlock.c as well. Reported-by: NLinus Torvalds <torvalds@linux-foundation.org> Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl> Cc: Roland McGrath <roland@redhat.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Markus Metzger <markus.t.metzger@intel.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Andrew Morton <akpm@linux-foundation.org> LKML-Reference: <20100325135413.938004390@chello.nl> Signed-off-by: NIngo Molnar <mingo@elte.hu>
-
- 07 March 2010, 1 commit
-
-
Committed by Jiri Slaby
Make sure compiler won't do weird things with limits. E.g. fetching them twice may return 2 different values after writable limits are implemented. I.e. either use rlimit helpers added in 3e10e716 ("resource: add helpers for fetching rlimits") or ACCESS_ONCE if not applicable. Signed-off-by: NJiri Slaby <jslaby@suse.cz> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
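A user-space analogue of the single-fetch rule: snapshot RLIMIT_MEMLOCK once and use that one value for every later check, rather than reading the (possibly writable) limit twice; the request size below is made up for illustration.

```c
#include <stdio.h>
#include <sys/resource.h>

int main(void)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_MEMLOCK, &rl) != 0) {
        perror("getrlimit");
        return 1;
    }

    /* Snapshot once; every later use sees the same value even if the limit
     * changes concurrently - the property the rlimit() helpers / ACCESS_ONCE
     * are described as giving the kernel code. */
    unsigned long lock_limit = (unsigned long)rl.rlim_cur;
    unsigned long locked     = 128UL << 10;       /* pretend request: 128 KB */

    printf("limit: %lu bytes, request %s\n", lock_limit,
           locked <= lock_limit ? "fits" : "exceeds limit");
    return 0;
}
```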
-