- 21 12月, 2012 1 次提交
-
-
由 Jeremy Eder 提交于
Clarify error messages and correct a few typos in the transparent hugepage sysfs init code. Signed-off-by: NJeremy Eder <jeder@redhat.com> Acked-by: NRafael Aquini <aquini@redhat.com> Acked-by: NDavid Rientjes <rientjes@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 17 12月, 2012 1 次提交
-
-
由 Hugh Dickins 提交于
Andrea's autonuma-benchmark numa01 hits kernel BUG at huge_memory.c:1474! in change_huge_pmd called from change_protection from change_prot_numa from task_numa_work. That BUG, introduced in the huge zero page commit cad7f613 ("thp: change_huge_pmd(): make sure we don't try to make a page writable") was trying to verify that newprot never adds write permission to an anonymous huge page; but Automatic NUMA Balancing's 4b10e7d5 ("mm: mempolicy: Implement change_prot_numa() in terms of change_protection()") adds a new prot_numa path into change_huge_pmd(), which makes no use of the newprot provided, and may retain the write bit in the pmd. Just move the BUG_ON(pmd_write(entry)) up into the !prot_numa block. Signed-off-by: NHugh Dickins <hughd@google.com> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 13 12月, 2012 13 次提交
-
-
由 Kirill A. Shutemov 提交于
pmd value is stable only with mm->page_table_lock taken. After taking the lock we need to check that nobody modified the pmd before changing it. Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Jiri Slaby <jslaby@suse.cz> Cc: David Rientjes <rientjes@google.com> Reviewed-by: NBob Liu <lliubbo@gmail.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Kirill A. Shutemov 提交于
By default kernel tries to use huge zero page on read page fault. It's possible to disable huge zero page by writing 0 or enable it back by writing 1: echo 0 >/sys/kernel/mm/transparent_hugepage/khugepaged/use_zero_page echo 1 >/sys/kernel/mm/transparent_hugepage/khugepaged/use_zero_page Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: "H. Peter Anvin" <hpa@linux.intel.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: David Rientjes <rientjes@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Kirill A. Shutemov 提交于
hzp_alloc is incremented every time a huge zero page is successfully allocated. It includes allocations which where dropped due race with other allocation. Note, it doesn't count every map of the huge zero page, only its allocation. hzp_alloc_failed is incremented if kernel fails to allocate huge zero page and falls back to using small pages. Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: "H. Peter Anvin" <hpa@linux.intel.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: David Rientjes <rientjes@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Kirill A. Shutemov 提交于
H. Peter Anvin doesn't like huge zero page which sticks in memory forever after the first allocation. Here's implementation of lockless refcounting for huge zero page. We have two basic primitives: {get,put}_huge_zero_page(). They manipulate reference counter. If counter is 0, get_huge_zero_page() allocates a new huge page and takes two references: one for caller and one for shrinker. We free the page only in shrinker callback if counter is 1 (only shrinker has the reference). put_huge_zero_page() only decrements counter. Counter is never zero in put_huge_zero_page() since shrinker holds on reference. Freeing huge zero page in shrinker callback helps to avoid frequent allocate-free. Refcounting has cost. On 4 socket machine I observe ~1% slowdown on parallel (40 processes) read page faulting comparing to lazy huge page allocation. I think it's pretty reasonable for synthetic benchmark. [lliubbo@gmail.com: fix mismerge] Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: "H. Peter Anvin" <hpa@linux.intel.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: David Rientjes <rientjes@google.com> Signed-off-by: NBob Liu <lliubbo@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Kirill A. Shutemov 提交于
Instead of allocating huge zero page on hugepage_init() we can postpone it until first huge zero page map. It saves memory if THP is not in use. cmpxchg() is used to avoid race on huge_zero_pfn initialization. Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: "H. Peter Anvin" <hpa@linux.intel.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: David Rientjes <rientjes@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Kirill A. Shutemov 提交于
All code paths seems covered. Now we can map huge zero page on read page fault. We setup it in do_huge_pmd_anonymous_page() if area around fault address is suitable for THP and we've got read page fault. If we fail to setup huge zero page (ENOMEM) we fallback to handle_pte_fault() as we normally do in THP. Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: "H. Peter Anvin" <hpa@linux.intel.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: David Rientjes <rientjes@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Kirill A. Shutemov 提交于
We can't split huge zero page itself (and it's bug if we try), but we can split the pmd which points to it. On splitting the pmd we create a table with all ptes set to normal zero page. [akpm@linux-foundation.org: fix build error] Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: "H. Peter Anvin" <hpa@linux.intel.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: David Rientjes <rientjes@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Kirill A. Shutemov 提交于
Pass vma instead of mm and add address parameter. In most cases we already have vma on the stack. We provides split_huge_page_pmd_mm() for few cases when we have mm, but not vma. This change is preparation to huge zero pmd splitting implementation. Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: "H. Peter Anvin" <hpa@linux.intel.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: David Rientjes <rientjes@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Kirill A. Shutemov 提交于
mprotect core never tries to make page writable using change_huge_pmd(). Let's add an assert that the assumption is true. It's important to be sure we will not make huge zero page writable. Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: "H. Peter Anvin" <hpa@linux.intel.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: David Rientjes <rientjes@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Kirill A. Shutemov 提交于
On write access to huge zero page we alloc a new huge page and clear it. If ENOMEM, graceful fallback: we create a new pmd table and set pte around fault address to newly allocated normal (4k) page. All other ptes in the pmd set to normal zero page. Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: "H. Peter Anvin" <hpa@linux.intel.com> Cc: Mel Gorman <mel@csn.ul.ie> Acked-by: NDavid Rientjes <rientjes@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Kirill A. Shutemov 提交于
It's easy to copy huge zero page. Just set destination pmd to huge zero page. It's safe to copy huge zero page since we have none yet :-p [rientjes@google.com: fix comment] Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: "H. Peter Anvin" <hpa@linux.intel.com> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: NDavid Rientjes <rientjes@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Kirill A. Shutemov 提交于
We don't have a mapped page to zap in huge zero page case. Let's just clear pmd and remove it from tlb. Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: "H. Peter Anvin" <hpa@linux.intel.com> Cc: Mel Gorman <mel@csn.ul.ie> Acked-by: NDavid Rientjes <rientjes@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Kirill A. Shutemov 提交于
During testing I noticed big (up to 2.5 times) memory consumption overhead on some workloads (e.g. ft.A from NPB) if THP is enabled. The main reason for that big difference is lacking zero page in THP case. We have to allocate a real page on read page fault. A program to demonstrate the issue: #include <assert.h> #include <stdlib.h> #include <unistd.h> #define MB 1024*1024 int main(int argc, char **argv) { char *p; int i; posix_memalign((void **)&p, 2 * MB, 200 * MB); for (i = 0; i < 200 * MB; i+= 4096) assert(p[i] == 0); pause(); return 0; } With thp-never RSS is about 400k, but with thp-always it's 200M. After the patcheset thp-always RSS is 400k too. Design overview. Huge zero page (hzp) is a non-movable huge page (2M on x86-64) filled with zeros. The way how we allocate it changes in the patchset: - [01/10] simplest way: hzp allocated on boot time in hugepage_init(); - [09/10] lazy allocation on first use; - [10/10] lockless refcounting + shrinker-reclaimable hzp; We setup it in do_huge_pmd_anonymous_page() if area around fault address is suitable for THP and we've got read page fault. If we fail to setup hzp (ENOMEM) we fallback to handle_pte_fault() as we normally do in THP. On wp fault to hzp we allocate real memory for the huge page and clear it. If ENOMEM, graceful fallback: we create a new pmd table and set pte around fault address to newly allocated normal (4k) page. All other ptes in the pmd set to normal zero page. We cannot split hzp (and it's bug if we try), but we can split the pmd which points to it. On splitting the pmd we create a table with all ptes set to normal zero page. === By hpa's request I've tried alternative approach for hzp implementation (see Virtual huge zero page patchset): pmd table with all entries set to zero page. This way should be more cache friendly, but it increases TLB pressure. The problem with virtual huge zero page: it requires per-arch enabling. We need a way to mark that pmd table has all ptes set to zero page. Some numbers to compare two implementations (on 4s Westmere-EX): Mirobenchmark1 ============== test: posix_memalign((void **)&p, 2 * MB, 8 * GB); for (i = 0; i < 100; i++) { assert(memcmp(p, p + 4*GB, 4*GB) == 0); asm volatile ("": : :"memory"); } hzp: Performance counter stats for './test_memcmp' (5 runs): 32356.272845 task-clock # 0.998 CPUs utilized ( +- 0.13% ) 40 context-switches # 0.001 K/sec ( +- 0.94% ) 0 CPU-migrations # 0.000 K/sec 4,218 page-faults # 0.130 K/sec ( +- 0.00% ) 76,712,481,765 cycles # 2.371 GHz ( +- 0.13% ) [83.31%] 36,279,577,636 stalled-cycles-frontend # 47.29% frontend cycles idle ( +- 0.28% ) [83.35%] 1,684,049,110 stalled-cycles-backend # 2.20% backend cycles idle ( +- 2.96% ) [66.67%] 134,355,715,816 instructions # 1.75 insns per cycle # 0.27 stalled cycles per insn ( +- 0.10% ) [83.35%] 13,526,169,702 branches # 418.039 M/sec ( +- 0.10% ) [83.31%] 1,058,230 branch-misses # 0.01% of all branches ( +- 0.91% ) [83.36%] 32.413866442 seconds time elapsed ( +- 0.13% ) vhzp: Performance counter stats for './test_memcmp' (5 runs): 30327.183829 task-clock # 0.998 CPUs utilized ( +- 0.13% ) 38 context-switches # 0.001 K/sec ( +- 1.53% ) 0 CPU-migrations # 0.000 K/sec 4,218 page-faults # 0.139 K/sec ( +- 0.01% ) 71,964,773,660 cycles # 2.373 GHz ( +- 0.13% ) [83.35%] 31,191,284,231 stalled-cycles-frontend # 43.34% frontend cycles idle ( +- 0.40% ) [83.32%] 773,484,474 stalled-cycles-backend # 1.07% backend cycles idle ( +- 6.61% ) [66.67%] 134,982,215,437 instructions # 1.88 insns per cycle # 0.23 stalled cycles per insn ( +- 0.11% ) [83.32%] 13,509,150,683 branches # 445.447 M/sec ( +- 0.11% ) [83.34%] 1,017,667 branch-misses # 0.01% of all branches ( +- 1.07% ) [83.32%] 30.381324695 seconds time elapsed ( +- 0.13% ) Mirobenchmark2 ============== test: posix_memalign((void **)&p, 2 * MB, 8 * GB); for (i = 0; i < 1000; i++) { char *_p = p; while (_p < p+4*GB) { assert(*_p == *(_p+4*GB)); _p += 4096; asm volatile ("": : :"memory"); } } hzp: Performance counter stats for 'taskset -c 0 ./test_memcmp2' (5 runs): 3505.727639 task-clock # 0.998 CPUs utilized ( +- 0.26% ) 9 context-switches # 0.003 K/sec ( +- 4.97% ) 4,384 page-faults # 0.001 M/sec ( +- 0.00% ) 8,318,482,466 cycles # 2.373 GHz ( +- 0.26% ) [33.31%] 5,134,318,786 stalled-cycles-frontend # 61.72% frontend cycles idle ( +- 0.42% ) [33.32%] 2,193,266,208 stalled-cycles-backend # 26.37% backend cycles idle ( +- 5.51% ) [33.33%] 9,494,670,537 instructions # 1.14 insns per cycle # 0.54 stalled cycles per insn ( +- 0.13% ) [41.68%] 2,108,522,738 branches # 601.451 M/sec ( +- 0.09% ) [41.68%] 158,746 branch-misses # 0.01% of all branches ( +- 1.60% ) [41.71%] 3,168,102,115 L1-dcache-loads # 903.693 M/sec ( +- 0.11% ) [41.70%] 1,048,710,998 L1-dcache-misses # 33.10% of all L1-dcache hits ( +- 0.11% ) [41.72%] 1,047,699,685 LLC-load # 298.854 M/sec ( +- 0.03% ) [33.38%] 2,287 LLC-misses # 0.00% of all LL-cache hits ( +- 8.27% ) [33.37%] 3,166,187,367 dTLB-loads # 903.147 M/sec ( +- 0.02% ) [33.35%] 4,266,538 dTLB-misses # 0.13% of all dTLB cache hits ( +- 0.03% ) [33.33%] 3.513339813 seconds time elapsed ( +- 0.26% ) vhzp: Performance counter stats for 'taskset -c 0 ./test_memcmp2' (5 runs): 27313.891128 task-clock # 0.998 CPUs utilized ( +- 0.24% ) 62 context-switches # 0.002 K/sec ( +- 0.61% ) 4,384 page-faults # 0.160 K/sec ( +- 0.01% ) 64,747,374,606 cycles # 2.370 GHz ( +- 0.24% ) [33.33%] 61,341,580,278 stalled-cycles-frontend # 94.74% frontend cycles idle ( +- 0.26% ) [33.33%] 56,702,237,511 stalled-cycles-backend # 87.57% backend cycles idle ( +- 0.07% ) [33.33%] 10,033,724,846 instructions # 0.15 insns per cycle # 6.11 stalled cycles per insn ( +- 0.09% ) [41.65%] 2,190,424,932 branches # 80.195 M/sec ( +- 0.12% ) [41.66%] 1,028,630 branch-misses # 0.05% of all branches ( +- 1.50% ) [41.66%] 3,302,006,540 L1-dcache-loads # 120.891 M/sec ( +- 0.11% ) [41.68%] 271,374,358 L1-dcache-misses # 8.22% of all L1-dcache hits ( +- 0.04% ) [41.66%] 20,385,476 LLC-load # 0.746 M/sec ( +- 1.64% ) [33.34%] 76,754 LLC-misses # 0.38% of all LL-cache hits ( +- 2.35% ) [33.34%] 3,309,927,290 dTLB-loads # 121.181 M/sec ( +- 0.03% ) [33.34%] 2,098,967,427 dTLB-misses # 63.41% of all dTLB cache hits ( +- 0.03% ) [33.34%] 27.364448741 seconds time elapsed ( +- 0.24% ) === I personally prefer implementation present in this patchset. It doesn't touch arch-specific code. This patch: Huge zero page (hzp) is a non-movable huge page (2M on x86-64) filled with zeros. For now let's allocate the page on hugepage_init(). We'll switch to lazy allocation later. We are not going to map the huge zero page until we can handle it properly on all code paths. is_huge_zero_{pfn,pmd}() functions will be used by following patches to check whether the pfn/pmd is huge zero page. Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: "H. Peter Anvin" <hpa@linux.intel.com> Cc: Mel Gorman <mel@csn.ul.ie> Acked-by: NDavid Rientjes <rientjes@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 12 12月, 2012 5 次提交
-
-
由 Will Deacon 提交于
On x86 memory accesses to pages without the ACCESSED flag set result in the ACCESSED flag being set automatically. With the ARM architecture a page access fault is raised instead (and it will continue to be raised until the ACCESSED flag is set for the appropriate PTE/PMD). For normal memory pages, handle_pte_fault will call pte_mkyoung (effectively setting the ACCESSED flag). For transparent huge pages, pmd_mkyoung will only be called for a write fault. This patch ensures that faults on transparent hugepages which do not result in a CoW update the access flags for the faulting pmd. Signed-off-by: NWill Deacon <will.deacon@arm.com> Cc: Chris Metcalf <cmetcalf@tilera.com> Acked-by: NKirill A. Shutemov <kirill@shutemov.name> Cc: Andrea Arcangeli <aarcange@redhat.com> Acked-by: NJohannes Weiner <hannes@cmpxchg.org> Cc: Ni zhan Chen <nizhan.chen@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Bob Liu 提交于
Introduce mk_huge_pmd() to simplify the code Signed-off-by: NBob Liu <lliubbo@gmail.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: Ni zhan Chen <nizhan.chen@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Bob Liu 提交于
Multiple places do the same check. Signed-off-by: NBob Liu <lliubbo@gmail.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: Ni zhan Chen <nizhan.chen@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Bob Liu 提交于
Several place need to find the pmd by(mm_struct, address), so introduce a function to simplify it. [akpm@linux-foundation.org: fix warning] Signed-off-by: NBob Liu <lliubbo@gmail.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: Ni zhan Chen <nizhan.chen@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Bob Liu 提交于
There are duplicated places using release_pte_pages(). And release_all_pte_pages() can be removed. Signed-off-by: NBob Liu <lliubbo@gmail.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: Ni zhan Chen <nizhan.chen@gmail.com> Acked-by: NDavid Rientjes <rientjes@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 11 12月, 2012 12 次提交
-
-
由 Ingo Molnar 提交于
rmap_walk_anon() and try_to_unmap_anon() appears to be too careful about locking the anon vma: while it needs protection against anon vma list modifications, it does not need exclusive access to the list itself. Transforming this exclusive lock to a read-locked rwsem removes a global lock from the hot path of page-migration intense threaded workloads which can cause pathological performance like this: 96.43% process 0 [kernel.kallsyms] [k] perf_trace_sched_switch | --- perf_trace_sched_switch __schedule schedule schedule_preempt_disabled __mutex_lock_common.isra.6 __mutex_lock_slowpath mutex_lock | |--50.61%-- rmap_walk | move_to_new_page | migrate_pages | migrate_misplaced_page | __do_numa_page.isra.69 | handle_pte_fault | handle_mm_fault | __do_page_fault | do_page_fault | page_fault | __memset_sse2 | | | --100.00%-- worker_thread | | | --100.00%-- start_thread | --49.39%-- page_lock_anon_vma try_to_unmap_anon try_to_unmap migrate_pages migrate_misplaced_page __do_numa_page.isra.69 handle_pte_fault handle_mm_fault __do_page_fault do_page_fault page_fault __memset_sse2 | --100.00%-- worker_thread start_thread With this change applied the profile is now nicely flat and there's no anon-vma related scheduling/blocking. Rename anon_vma_[un]lock() => anon_vma_[un]lock_write(), to make it clearer that it's an exclusive write-lock in that case - suggested by Rik van Riel. Suggested-by: NLinus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Turner <pjt@google.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com> Cc: Christoph Lameter <cl@linux.com> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: NIngo Molnar <mingo@kernel.org> Signed-off-by: NMel Gorman <mgorman@suse.de>
-
由 Ingo Molnar 提交于
Convert the struct anon_vma::mutex to an rwsem, which will help in solving a page-migration scalability problem. (Addressed in a separate patch.) The conversion is simple and straightforward: in every case where we mutex_lock()ed we'll now down_write(). Suggested-by: NLinus Torvalds <torvalds@linux-foundation.org> Reviewed-by: NRik van Riel <riel@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Turner <pjt@google.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com> Cc: Christoph Lameter <cl@linux.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: NIngo Molnar <mingo@kernel.org> Signed-off-by: NMel Gorman <mgorman@suse.de>
-
由 Mel Gorman 提交于
Note: This is very heavily based on a patch from Peter Zijlstra with fixes from Ingo Molnar, Hugh Dickins and Johannes Weiner. That patch put a lot of migration logic into mm/huge_memory.c where it does not belong. This version puts tries to share some of the migration logic with migrate_misplaced_page. However, it should be noted that now migrate.c is doing more with the pagetable manipulation than is preferred. The end result is barely recognisable so as before, the signed-offs had to be removed but will be re-added if the original authors are ok with it. Add THP migration for the NUMA working set scanning fault case. It uses the page lock to serialize. No migration pte dance is necessary because the pte is already unmapped when we decide to migrate. [dhillf@gmail.com: Fix memory leak on isolation failure] [dhillf@gmail.com: Fix transfer of last_nid information] Signed-off-by: NMel Gorman <mgorman@suse.de>
-
由 Mel Gorman 提交于
The PTE scanning rate and fault rates are two of the biggest sources of system CPU overhead with automatic NUMA placement. Ideally a proper policy would detect if a workload was properly placed, schedule and adjust the PTE scanning rate accordingly. We do not track the necessary information to do that but we at least know if we migrated or not. This patch scans slower if a page was not migrated as the result of a NUMA hinting fault up to sysctl_numa_balancing_scan_period_max which is now higher than the previous default. Once every minute it will reset the scanner in case of phase changes. This is hilariously crude and the numbers are arbitrary. Workloads will converge quite slowly in comparison to what a proper policy should be able to do. On the plus side, we will chew up less CPU for workloads that have no need for automatic balancing. Signed-off-by: NMel Gorman <mgorman@suse.de>
-
由 Hillf Danton 提交于
Pass last_nid from head page to tail page. Signed-off-by: NHillf Danton <dhillf@gmail.com> Signed-off-by: NMel Gorman <mgorman@suse.de>
-
由 Mel Gorman 提交于
It is tricky to quantify the basic cost of automatic NUMA placement in a meaningful manner. This patch adds some vmstats that can be used as part of a basic costing model. u = basic unit = sizeof(void *) Ca = cost of struct page access = sizeof(struct page) / u Cpte = Cost PTE access = Ca Cupdate = Cost PTE update = (2 * Cpte) + (2 * Wlock) where Cpte is incurred twice for a read and a write and Wlock is a constant representing the cost of taking or releasing a lock Cnumahint = Cost of a minor page fault = some high constant e.g. 1000 Cpagerw = Cost to read or write a full page = Ca + PAGE_SIZE/u Ci = Cost of page isolation = Ca + Wi where Wi is a constant that should reflect the approximate cost of the locking operation Cpagecopy = Cpagerw + (Cpagerw * Wnuma) + Ci + (Ci * Wnuma) where Wnuma is the approximate NUMA factor. 1 is local. 1.2 would imply that remote accesses are 20% more expensive Balancing cost = Cpte * numa_pte_updates + Cnumahint * numa_hint_faults + Ci * numa_pages_migrated + Cpagecopy * numa_pages_migrated Note that numa_pages_migrated is used as a measure of how many pages were isolated even though it would miss pages that failed to migrate. A vmstat counter could have been added for it but the isolation cost is pretty marginal in comparison to the overall cost so it seemed overkill. The ideal way to measure automatic placement benefit would be to count the number of remote accesses versus local accesses and do something like benefit = (remote_accesses_before - remove_access_after) * Wnuma but the information is not readily available. As a workload converges, the expection would be that the number of remote numa hints would reduce to 0. convergence = numa_hint_faults_local / numa_hint_faults where this is measured for the last N number of numa hints recorded. When the workload is fully converged the value is 1. This can measure if the placement policy is converging and how fast it is doing it. Signed-off-by: NMel Gorman <mgorman@suse.de> Acked-by: NRik van Riel <riel@redhat.com>
-
由 Peter Zijlstra 提交于
NOTE: This patch is based on "sched, numa, mm: Add fault driven placement and migration policy" but as it throws away all the policy to just leave a basic foundation I had to drop the signed-offs-by. This patch creates a bare-bones method for setting PTEs pte_numa in the context of the scheduler that when faulted later will be faulted onto the node the CPU is running on. In itself this does nothing useful but any placement policy will fundamentally depend on receiving hints on placement from fault context and doing something intelligent about it. Signed-off-by: NMel Gorman <mgorman@suse.de> Acked-by: NRik van Riel <riel@redhat.com>
-
由 Mel Gorman 提交于
This patch converts change_prot_numa() to use change_protection(). As pte_numa and friends check the PTE bits directly it is necessary for change_protection() to use pmd_mknuma(). Hence the required modifications to change_protection() are a little clumsy but the end result is that most of the numa page table helpers are just one or two instructions. Signed-off-by: NMel Gorman <mgorman@suse.de>
-
由 Mel Gorman 提交于
Note: Based on "mm/mpol: Use special PROT_NONE to migrate pages" but sufficiently different that the signed-off-bys were dropped Combine our previous _PAGE_NUMA, mpol_misplaced and migrate_misplaced_page() pieces into an effective migrate on fault scheme. Note that (on x86) we rely on PROT_NONE pages being !present and avoid the TLB flush from try_to_unmap(TTU_MIGRATION). This greatly improves the page-migration performance. Based-on-work-by: NPeter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: NMel Gorman <mgorman@suse.de>
-
由 Mel Gorman 提交于
Note: This patch started as "mm/mpol: Create special PROT_NONE infrastructure" and preserves the basic idea but steals *very* heavily from "autonuma: numa hinting page faults entry points" for the actual fault handlers without the migration parts. The end result is barely recognisable as either patch so all Signed-off and Reviewed-bys are dropped. If Peter, Ingo and Andrea are ok with this version, I will re-add the signed-offs-by to reflect the history. In order to facilitate a lazy -- fault driven -- migration of pages, create a special transient PAGE_NUMA variant, we can then use the 'spurious' protection faults to drive our migrations from. The meaning of PAGE_NUMA depends on the architecture but on x86 it is effectively PROT_NONE. Actual PROT_NONE mappings will not generate these NUMA faults for the reason that the page fault code checks the permission on the VMA (and will throw a segmentation fault on actual PROT_NONE mappings), before it ever calls handle_mm_fault. [dhillf@gmail.com: Fix typo] Signed-off-by: NMel Gorman <mgorman@suse.de> Reviewed-by: NRik van Riel <riel@redhat.com>
-
由 Andrea Arcangeli 提交于
When we split a transparent hugepage, transfer the NUMA type from the pmd to the pte if needed. Signed-off-by: NAndrea Arcangeli <aarcange@redhat.com> Signed-off-by: NMel Gorman <mgorman@suse.de> Reviewed-by: NRik van Riel <riel@redhat.com>
-
由 Mel Gorman 提交于
With transparent hugepage support, handle_mm_fault() has to be careful that a normal PMD has been established before handling a PTE fault. To achieve this, it used __pte_alloc() directly instead of pte_alloc_map as pte_alloc_map is unsafe to run against a huge PMD. pte_offset_map() is called once it is known the PMD is safe. pte_alloc_map() is smart enough to check if a PTE is already present before calling __pte_alloc but this check was lost. As a consequence, PTEs may be allocated unnecessarily and the page table lock taken. Thi useless PTE does get cleaned up but it's a performance hit which is visible in page_test from aim9. This patch simply re-adds the check normally done by pte_alloc_map to check if the PTE needs to be allocated before taking the page table lock. The effect is noticable in page_test from aim9. AIM9 2.6.38-vanilla 2.6.38-checkptenone creat-clo 446.10 ( 0.00%) 424.47 (-5.10%) page_test 38.10 ( 0.00%) 42.04 ( 9.37%) brk_test 52.45 ( 0.00%) 51.57 (-1.71%) exec_test 382.00 ( 0.00%) 456.90 (16.39%) fork_test 60.11 ( 0.00%) 67.79 (11.34%) MMTests Statistics: duration Total Elapsed Time (seconds) 611.90 612.22 (While this affects 2.6.38, it is a performance rather than a functional bug and normally outside the rules -stable. While the big performance differences are to a microbench, the difference in fork and exec performance may be significant enough that -stable wants to consider the patch) Reported-by: NRaz Ben Yehuda <raziebe@gmail.com> Signed-off-by: NMel Gorman <mgorman@suse.de> Signed-off-by: NAndrea Arcangeli <aarcange@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Rik van Riel <riel@redhat.com> [ Picked this up from the AutoNUMA tree to help it upstream and to allow apples-to-apples performance comparisons. ] Signed-off-by: NIngo Molnar <mingo@kernel.org>
-
- 15 10月, 2012 1 次提交
-
-
由 Ralf Baechle 提交于
Certain configurations won't implicitly pull in <linux/pagemap.h> resulting in the following build error: mm/huge_memory.c: In function 'release_pte_page': mm/huge_memory.c:1697:2: error: implicit declaration of function 'unlock_page' [-Werror=implicit-function-declaration] mm/huge_memory.c: In function '__collapse_huge_page_isolate': mm/huge_memory.c:1757:3: error: implicit declaration of function 'trylock_page' [-Werror=implicit-function-declaration] cc1: some warnings being treated as errors Reported-by: NDavid Daney <david.daney@cavium.com> Signed-off-by: NRalf Baechle <ralf@linux-mips.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 09 10月, 2012 7 次提交
-
-
由 David Miller 提交于
Invalidation sequences are handled in various ways on various architectures. One way, which sparc64 uses, is to let the set_*_at() functions accumulate pending flushes into a per-cpu array. Then the flush_tlb_range() et al. calls process the pending TLB flushes. In this regime, the __tlb_remove_*tlb_entry() implementations are essentially NOPs. The canonical PTE zap in mm/memory.c is: ptent = ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm); tlb_remove_tlb_entry(tlb, pte, addr); With a subsequent tlb_flush_mmu() if needed. Mirror this in the THP PMD zapping using: orig_pmd = pmdp_get_and_clear(tlb->mm, addr, pmd); page = pmd_page(orig_pmd); tlb_remove_pmd_tlb_entry(tlb, pmd, addr); And we properly accomodate TLB flush mechanims like the one described above. Signed-off-by: NDavid S. Miller <davem@davemloft.net> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 David Miller 提交于
The transparent huge page code passes a PMD pointer in as the third argument of update_mmu_cache(), which expects a PTE pointer. This never got noticed because X86 implements update_mmu_cache() as a macro and thus we don't get any type checking, and X86 is the only architecture which supports transparent huge pages currently. Before other architectures can support transparent huge pages properly we need to add a new interface which will take a PMD pointer as the third argument rather than a PTE pointer. [akpm@linux-foundation.org: implement update_mm_cache_pmd() for s390] Signed-off-by: NDavid S. Miller <davem@davemloft.net> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 David Rientjes 提交于
When a transparent hugepage is mapped and it is included in an mlock() range, follow_page() incorrectly avoids setting the page's mlock bit and moving it to the unevictable lru. This is evident if you try to mlock(), munlock(), and then mlock() a range again. Currently: #define MAP_SIZE (4 << 30) /* 4GB */ void *ptr = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, 0, 0); mlock(ptr, MAP_SIZE); $ grep -E "Unevictable|Inactive\(anon" /proc/meminfo Inactive(anon): 6304 kB Unevictable: 4213924 kB munlock(ptr, MAP_SIZE); Inactive(anon): 4186252 kB Unevictable: 19652 kB mlock(ptr, MAP_SIZE); Inactive(anon): 4198556 kB Unevictable: 21684 kB Notice that less than 2MB was added to the unevictable list; this is because these pages in the range are not transparent hugepages since the 4GB range was allocated with mmap() and has no specific alignment. If posix_memalign() were used instead, unevictable would not have grown at all on the second mlock(). The fix is to call mlock_vma_page() so that the mlock bit is set and the page is added to the unevictable list. With this patch: mlock(ptr, MAP_SIZE); Inactive(anon): 4056 kB Unevictable: 4213940 kB munlock(ptr, MAP_SIZE); Inactive(anon): 4198268 kB Unevictable: 19636 kB mlock(ptr, MAP_SIZE); Inactive(anon): 4008 kB Unevictable: 4213940 kB Signed-off-by: NDavid Rientjes <rientjes@google.com> Acked-by: NHugh Dickins <hughd@google.com> Reviewed-by: NAndrea Arcangeli <aarcange@redhat.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michel Lespinasse <walken@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Sagi Grimberg 提交于
In order to allow sleeping during mmu notifier calls, we need to avoid invoking them under the page table spinlock. This patch solves the problem by calling invalidate_page notification after releasing the lock (but before freeing the page itself), or by wrapping the page invalidation with calls to invalidate_range_begin and invalidate_range_end. To prevent accidental changes to the invalidate_range_end arguments after the call to invalidate_range_begin, the patch introduces a convention of saving the arguments in consistently named locals: unsigned long mmun_start; /* For mmu_notifiers */ unsigned long mmun_end; /* For mmu_notifiers */ ... mmun_start = ... mmun_end = ... mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end); ... mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end); The patch changes code to use this convention for all calls to mmu_notifier_invalidate_range_start/end, except those where the calls are close enough so that anyone who glances at the code can see the values aren't changing. This patchset is a preliminary step towards on-demand paging design to be added to the RDMA stack. Why do we want on-demand paging for Infiniband? Applications register memory with an RDMA adapter using system calls, and subsequently post IO operations that refer to the corresponding virtual addresses directly to HW. Until now, this was achieved by pinning the memory during the registration calls. The goal of on demand paging is to avoid pinning the pages of registered memory regions (MRs). This will allow users the same flexibility they get when swapping any other part of their processes address spaces. Instead of requiring the entire MR to fit in physical memory, we can allow the MR to be larger, and only fit the current working set in physical memory. Why should anyone care? What problems are users currently experiencing? This can make programming with RDMA much simpler. Today, developers that are working with more data than their RAM can hold need either to deregister and reregister memory regions throughout their process's life, or keep a single memory region and copy the data to it. On demand paging will allow these developers to register a single MR at the beginning of their process's life, and let the operating system manage which pages needs to be fetched at a given time. In the future, we might be able to provide a single memory access key for each process that would provide the entire process's address as one large memory region, and the developers wouldn't need to register memory regions at all. Is there any prospect that any other subsystems will utilise these infrastructural changes? If so, which and how, etc? As for other subsystems, I understand that XPMEM wanted to sleep in MMU notifiers, as Christoph Lameter wrote at http://lkml.indiana.edu/hypermail/linux/kernel/0802.1/0460.html and perhaps Andrea knows about other use cases. Scheduling in mmu notifications is required since we need to sync the hardware with the secondary page tables change. A TLB flush of an IO device is inherently slower than a CPU TLB flush, so our design works by sending the invalidation request to the device, and waiting for an interrupt before exiting the mmu notifier handler. Avi said: kvm may be a buyer. kvm::mmu_lock, which serializes guest page faults, also protects long operations such as destroying large ranges. It would be good to convert it into a spinlock, but as it is used inside mmu notifiers, this cannot be done. (there are alternatives, such as keeping the spinlock and using a generation counter to do the teardown in O(1), which is what the "may" is doing up there). [akpm@linux-foundation.orgpossible speed tweak in hugetlb_cow(), cleanups] Signed-off-by: NAndrea Arcangeli <andrea@qumranet.com> Signed-off-by: NSagi Grimberg <sagig@mellanox.com> Signed-off-by: NHaggai Eran <haggaie@mellanox.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Cc: Or Gerlitz <ogerlitz@mellanox.com> Cc: Haggai Eran <haggaie@mellanox.com> Cc: Shachar Raindel <raindel@mellanox.com> Cc: Liran Liss <liranl@mellanox.com> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Avi Kivity <avi@redhat.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Catalin Marinas 提交于
The update_mmu_cache() takes a pointer (to pte_t by default) as the last argument but the huge_memory.c passes a pmd_t value. The patch changes the argument to the pmd_t * pointer. Signed-off-by: NCatalin Marinas <catalin.marinas@arm.com> Signed-off-by: NSteve Capper <steve.capper@arm.com> Signed-off-by: NWill Deacon <will.deacon@arm.com> Cc: Arnd Bergmann <arnd@arndb.de> Reviewed-by: NKirill A. Shutemov <kirill@shutemov.name> Cc: Michal Hocko <mhocko@suse.cz> Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com> Reviewed-by: NAndrea Arcangeli <aarcange@redhat.com> Cc: Chris Metcalf <cmetcalf@tilera.com> Cc: Ralf Baechle <ralf@linux-mips.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Xiao Guangrong 提交于
If NUMA is enabled, the indicator is not reset if the previous page request failed, ausing us to trigger the BUG_ON() in khugepaged_alloc_page(). Signed-off-by: NXiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Cc: Hugh Dickins <hughd@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Michel Lespinasse <walken@google.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Michel Lespinasse 提交于
In file and anon rmap, we use interval trees to find potentially relevant vmas and then call vma_address() to find the virtual address the given page might be found at in these vmas. vma_address() used to include a check that the returned address falls within the limits of the vma, but this check isn't necessary now that we always use interval trees in rmap: the interval tree just doesn't return any vmas which this check would find to be irrelevant. As a result, we can replace the use of -EFAULT error code (which then needed to be checked in every call site) with a VM_BUG_ON(). Signed-off-by: NMichel Lespinasse <walken@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Daniel Santos <daniel.santos@pobox.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-