- 14 10月, 2014 1 次提交
-
-
由 Peter Feiner 提交于
For VMAs that don't want write notifications, PTEs created for read faults have their write bit set. If the read fault happens after VM_SOFTDIRTY is cleared, then the PTE's softdirty bit will remain clear after subsequent writes. Here's a simple code snippet to demonstrate the bug: char* m = mmap(NULL, getpagesize(), PROT_READ | PROT_WRITE, MAP_ANONYMOUS | MAP_SHARED, -1, 0); system("echo 4 > /proc/$PPID/clear_refs"); /* clear VM_SOFTDIRTY */ assert(*m == '\0'); /* new PTE allows write access */ assert(!soft_dirty(x)); *m = 'x'; /* should dirty the page */ assert(soft_dirty(x)); /* fails */ With this patch, write notifications are enabled when VM_SOFTDIRTY is cleared. Furthermore, to avoid unnecessary faults, write notifications are disabled when VM_SOFTDIRTY is set. As a side effect of enabling and disabling write notifications with care, this patch fixes a bug in mprotect where vm_page_prot bits set by drivers were zapped on mprotect. An analogous bug was fixed in mmap by commit c9d0bf24 ("mm: uncached vma support with writenotify"). Signed-off-by: NPeter Feiner <pfeiner@google.com> Reported-by: NPeter Feiner <pfeiner@google.com> Suggested-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: Jamie Liu <jamieliu@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Bjorn Helgaas <bhelgaas@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 10 10月, 2014 15 次提交
-
-
由 Peter Feiner 提交于
If a /proc/pid/pagemap read spans a [VMA, an unmapped region, then a VM_SOFTDIRTY VMA], the virtual pages in the unmapped region are reported as softdirty. Here's a program to demonstrate the bug: int main() { const uint64_t PAGEMAP_SOFTDIRTY = 1ul << 55; uint64_t pme[3]; int fd = open("/proc/self/pagemap", O_RDONLY);; char *m = mmap(NULL, 3 * getpagesize(), PROT_READ, MAP_ANONYMOUS | MAP_SHARED, -1, 0); munmap(m + getpagesize(), getpagesize()); pread(fd, pme, 24, (unsigned long) m / getpagesize() * 8); assert(pme[0] & PAGEMAP_SOFTDIRTY); /* passes */ assert(!(pme[1] & PAGEMAP_SOFTDIRTY)); /* fails */ assert(pme[2] & PAGEMAP_SOFTDIRTY); /* passes */ return 0; } (Note that all pages in new VMAs are softdirty until cleared). Tested: Used the program given above. I'm going to include this code in a selftest in the future. [n-horiguchi@ah.jp.nec.com: prevent pagemap_pte_range() from overrunning] Signed-off-by: NPeter Feiner <pfeiner@google.com> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: Jamie Liu <jamieliu@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Oleg Nesterov 提交于
9e781440 "hold task->mempolicy while numa_maps scans." fixed the race with the exiting task but this is not enough. The current code assumes that get_vma_policy(task) should either see task->mempolicy == NULL or it should be equal to ->task_mempolicy saved by hold_task_mempolicy(), so we can never race with __mpol_put(). But this can only work if we can't race with do_set_mempolicy(), and thus we can't race with another do_set_mempolicy() or do_exit() after that. However, do_set_mempolicy()->down_write(mmap_sem) can not prevent this race. This task can exec, change it's ->mm, and call do_set_mempolicy() after that; in this case they take 2 different locks. Change hold_task_mempolicy() to use get_task_policy(), it never returns NULL, and change show_numa_map() to use __get_vma_policy() or fall back to proc_priv->task_mempolicy. Note: this is the minimal fix, we will cleanup this code later. I think hold_task_mempolicy() and release_task_mempolicy() should die, we can move this logic into show_numa_map(). Or we can move get_task_policy() outside of ->mmap_sem and !CONFIG_NUMA code at least. Signed-off-by: NOleg Nesterov <oleg@redhat.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: David Rientjes <rientjes@google.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Cyrill Gorcunov <gorcunov@openvz.org> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Hugh Dickins <hughd@google.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Oleg Nesterov 提交于
- Rename vm_is_stack() to task_of_stack() and change it to return "struct task_struct *" rather than the global (and thus wrong in general) pid_t. - Add the new pid_of_stack() helper which calls task_of_stack() and uses the right namespace to report the correct pid_t. Unfortunately we need to define this helper twice, in task_mmu.c and in task_nommu.c. perhaps it makes sense to add fs/proc/util.c and move at least pid_of_stack/task_of_stack there to avoid the code duplication. - Change show_map_vma() and show_numa_map() to use the new helper. Signed-off-by: NOleg Nesterov <oleg@redhat.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Cyrill Gorcunov <gorcunov@openvz.org> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Greg Ungerer <gerg@uclinux.org> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Oleg Nesterov 提交于
m_start() can use get_proc_task() instead, and "struct inode *" provides more potentially useful info, see the next changes. Signed-off-by: NOleg Nesterov <oleg@redhat.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Cyrill Gorcunov <gorcunov@openvz.org> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Greg Ungerer <gerg@uclinux.org> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Oleg Nesterov 提交于
Change the main loop in m_start() to update m->version. Mostly for consistency, but this can help to avoid the same loop if the very 1st ->show() fails due to seq_overflow(). Signed-off-by: NOleg Nesterov <oleg@redhat.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Cyrill Gorcunov <gorcunov@openvz.org> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Oleg Nesterov 提交于
Add the "last_addr" optimization back. Like before, every ->show() method checks !seq_overflow() and sets m->version = vma->vm_start. However, it also checks that m_next_vma(vma) != NULL, otherwise it sets m->version = -1 for the lockless "EOF" fast-path in m_start(). m_start() can simply do find_vma() + m_next_vma() if last_addr is not zero, the code looks clear and simple and this case is clearly separated from "scan vmas" path. Signed-off-by: NOleg Nesterov <oleg@redhat.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Cyrill Gorcunov <gorcunov@openvz.org> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Oleg Nesterov 提交于
Extract the tail_vma/vm_next calculation from m_next() into the new trivial helper, m_next_vma(). Signed-off-by: NOleg Nesterov <oleg@redhat.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Cyrill Gorcunov <gorcunov@openvz.org> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Oleg Nesterov 提交于
Now that m->version is gone we can cleanup m_start(). In particular, - Remove the "unsigned long" typecast, m->index can't be negative or exceed ->map_count. But lets use "unsigned int pos" to make it clear that "pos < map_count" is safe. - Remove the unnecessary "vma != NULL" check in the main loop. It can't be NULL unless we have a vm bug. - This also means that "pos < map_count" case can simply return the valid vma and avoid "goto" and subsequent checks. Signed-off-by: NOleg Nesterov <oleg@redhat.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Cyrill Gorcunov <gorcunov@openvz.org> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Oleg Nesterov 提交于
m_start() carefully documents, checks, and sets "m->version = -1" if we are going to return NULL. The only problem is that we will be never called again if m_start() returns NULL, so this is simply pointless and misleading. Otoh, ->show() methods m->version = 0 if vma == tail_vma and this is just wrong, we want -1 in this case. And in fact we also want -1 if ->vm_next == NULL and ->tail_vma == NULL. And it is not used consistently, the "scan vmas" loop in m_start() should update last_addr too. Finally, imo the whole "last_addr" logic in m_start() looks horrible. find_vma(last_addr) is called unconditionally even if we are not going to use the result. But the main problem is that this code participates in tail_vma-or-NULL mess, and this looks simply unfixable. Remove this optimization. We will add it back after some cleanups. Signed-off-by: NOleg Nesterov <oleg@redhat.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Cyrill Gorcunov <gorcunov@openvz.org> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Oleg Nesterov 提交于
1. There is no reason to reset ->tail_vma in m_start(), if we return IS_ERR_OR_NULL() it won't be used. 2. m_start() also clears priv->task to ensure that m_stop() won't use the stale pointer if we fail before get_task_struct(). But this is ugly and confusing, move this initialization in m_stop(). Signed-off-by: NOleg Nesterov <oleg@redhat.com> Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: NCyrill Gorcunov <gorcunov@openvz.org> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Oleg Nesterov 提交于
1. Kill the first "vma != NULL" check. Firstly this is not possible, m_next() won't be called if ->start() or the previous ->next() returns NULL. And if it was possible the 2nd "vma != tail_vma" check is buggy, we should not wrongly return ->tail_vma. 2. Make this function readable. The logic is very simple, we should return check "vma != tail" once and return "vm_next || tail_vma". Signed-off-by: NOleg Nesterov <oleg@redhat.com> Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: NCyrill Gorcunov <gorcunov@openvz.org> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Oleg Nesterov 提交于
m_start() drops ->mmap_sem and does mmput() if it retuns vsyscall vma. This is because in this case m_stop()->vma_stop() obviously can't use gate_vma->vm_mm. Now that we have proc_maps_private->mm we can simplify this logic: - Change m_start() to return with ->mmap_sem held unless it returns IS_ERR_OR_NULL(). - Change vma_stop() to use priv->mm and avoid the ugly vma checks, this makes "vm_area_struct *vma" unnecessary. - This also allows m_start() to use vm_stop(). - Cleanup m_next() to follow the new locking rule. Note: m_stop() looks very ugly, and this temporary uglifies it even more. Fixed by the next change. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: NOleg Nesterov <oleg@redhat.com> Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: NCyrill Gorcunov <gorcunov@openvz.org> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Oleg Nesterov 提交于
A simple test-case from Kirill Shutemov cat /proc/self/maps >/dev/null chmod +x /proc/self/net/packet exec /proc/self/net/packet makes lockdep unhappy, cat/exec take seq_file->lock + cred_guard_mutex in the opposite order. It's a false positive and probably we should not allow "chmod +x" on proc files. Still I think that we should avoid mm_access() and cred_guard_mutex in sys_read() paths, security checking should happen at open time. Besides, this doesn't even look right if the task changes its ->mm between m_stop() and m_start(). Add the new "mm_struct *mm" member into struct proc_maps_private and change proc_maps_open() to initialize it using proc_mem_open(). Change m_start() to use priv->mm if atomic_inc_not_zero(mm_users) succeeds or return NULL (eof) otherwise. The only complication is that proc_maps_open() users should additionally do mmdrop() in fop->release(), add the new proc_map_release() helper for that. Note: this is the user-visible change, if the task execs after open("maps") the new ->mm won't be visible via this file. I hope this is fine, and this matches /proc/pid/mem bahaviour. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: NOleg Nesterov <oleg@redhat.com> Reported-by: N"Kirill A. Shutemov" <kirill@shutemov.name> Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: NCyrill Gorcunov <gorcunov@openvz.org> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Oleg Nesterov 提交于
do_maps_open() and numa_maps_open() are overcomplicated, they could use __seq_open_private(). Plus they do the same, just sizeof(*priv) Change them to use a new simple helper, proc_maps_open(ops, psize). This simplifies the code and allows us to do the next changes. Signed-off-by: NOleg Nesterov <oleg@redhat.com> Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: NCyrill Gorcunov <gorcunov@openvz.org> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Oleg Nesterov 提交于
get_gate_vma(priv->task->mm) looks ugly and wrong, task->mm can be NULL or it can changed by exec right after mm_access(). And in theory this race is not harmless, the task can exec and then later exit and free the new mm_struct. In this case get_task_mm(oldmm) can't help, get_gate_vma(task->mm) can read the freed/unmapped memory. I think that priv->task should simply die and hold_task_mempolicy() logic can be simplified. tail_vma logic asks for cleanups too. Signed-off-by: NOleg Nesterov <oleg@redhat.com> Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: NCyrill Gorcunov <gorcunov@openvz.org> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 26 9月, 2014 1 次提交
-
-
由 Peter Feiner 提交于
In PTE holes that contain VM_SOFTDIRTY VMAs, unmapped addresses before VM_SOFTDIRTY VMAs are reported as softdirty by /proc/pid/pagemap. This bug was introduced in commit 68b5a652 ("mm: softdirty: respect VM_SOFTDIRTY in PTE holes"). That commit made /proc/pid/pagemap look at VM_SOFTDIRTY in PTE holes but neglected to observe the start of VMAs returned by find_vma. Tested: Wrote a selftest that creates a PMD-sized VMA then unmaps the first page and asserts that the page is not softdirty. I'm going to send the pagemap selftest in a later commit. Signed-off-by: NPeter Feiner <pfeiner@google.com> Cc: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: Hugh Dickins <hughd@google.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Jamie Liu <jamieliu@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 07 8月, 2014 1 次提交
-
-
由 Peter Feiner 提交于
After a VMA is created with the VM_SOFTDIRTY flag set, /proc/pid/pagemap should report that the VMA's virtual pages are soft-dirty until VM_SOFTDIRTY is cleared (i.e., by the next write of "4" to /proc/pid/clear_refs). However, pagemap ignores the VM_SOFTDIRTY flag for virtual addresses that fall in PTE holes (i.e., virtual addresses that don't have a PMD, PUD, or PGD allocated yet). To observe this bug, use mmap to create a VMA large enough such that there's a good chance that the VMA will occupy an unused PMD, then test the soft-dirty bit on its pages. In practice, I found that a VMA that covered a PMD's worth of address space was big enough. This patch adds the necessary VMA lookup to the PTE hole callback in /proc/pid/pagemap's page walk and sets soft-dirty according to the VMAs' VM_SOFTDIRTY flag. Signed-off-by: NPeter Feiner <pfeiner@google.com> Acked-by: NCyrill Gorcunov <gorcunov@openvz.org> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: Hugh Dickins <hughd@google.com> Acked-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 07 6月, 2014 2 次提交
-
-
由 Fabian Frederick 提交于
Signed-off-by: NFabian Frederick <fabf@skynet.be> Cc: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Naoya Horiguchi 提交于
The age table walker doesn't check non-present hugetlb entry in common path, so hugetlb_entry() callbacks must check it. The reason for this behavior is that some callers want to handle it in its own way. [ I think that reason is bogus, btw - it should just do what the regular code does, which is to call the "pte_hole()" function for such hugetlb entries - Linus] However, some callers don't check it now, which causes unpredictable result, for example when we have a race between migrating hugepage and reading /proc/pid/numa_maps. This patch fixes it by adding !pte_present checks on buggy callbacks. This bug exists for years and got visible by introducing hugepage migration. ChangeLog v2: - fix if condition (check !pte_present() instead of pte_present()) Reported-by: NSasha Levin <sasha.levin@oracle.com> Signed-off-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Rik van Riel <riel@redhat.com> Cc: <stable@vger.kernel.org> [3.12+] Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> [ Backported to 3.15. Signed-off-by: Josh Boyer <jwboyer@fedoraproject.org> ] Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 05 6月, 2014 1 次提交
-
-
由 Cyrill Gorcunov 提交于
clear_refs_write() is called earlier than clear_soft_dirty() and it is more natural to clear VM_SOFTDIRTY (which belongs to VMA entry but not PTEs) that early instead of clearing it a way deeper inside call chain. Signed-off-by: NCyrill Gorcunov <gorcunov@openvz.org> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: Hugh Dickins <hughd@google.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 21 5月, 2014 1 次提交
-
-
由 Andy Lutomirski 提交于
arch_vma_name sucks. It's a silly hack, and it's annoying to implement correctly. In fact, AFAICS, even the straightforward x86 implementation is incorrect (I suspect that it breaks if the vdso mapping is split or gets remapped). This adds a new vm_ops->name operation that can replace it. The followup patches will remove all uses of arch_vma_name on x86, fixing a couple of annoyances in the process. Signed-off-by: NAndy Lutomirski <luto@amacapital.net> Link: http://lkml.kernel.org/r/2eee21791bb36a0a408c5c2bdb382a9e6a41ca4a.1400538962.git.luto@amacapital.netSigned-off-by: NH. Peter Anvin <hpa@linux.intel.com>
-
- 08 4月, 2014 1 次提交
-
-
由 Davidlohr Bueso 提交于
This patch is a continuation of efforts trying to optimize find_vma(), avoiding potentially expensive rbtree walks to locate a vma upon faults. The original approach (https://lkml.org/lkml/2013/11/1/410), where the largest vma was also cached, ended up being too specific and random, thus further comparison with other approaches were needed. There are two things to consider when dealing with this, the cache hit rate and the latency of find_vma(). Improving the hit-rate does not necessarily translate in finding the vma any faster, as the overhead of any fancy caching schemes can be too high to consider. We currently cache the last used vma for the whole address space, which provides a nice optimization, reducing the total cycles in find_vma() by up to 250%, for workloads with good locality. On the other hand, this simple scheme is pretty much useless for workloads with poor locality. Analyzing ebizzy runs shows that, no matter how many threads are running, the mmap_cache hit rate is less than 2%, and in many situations below 1%. The proposed approach is to replace this scheme with a small per-thread cache, maximizing hit rates at a very low maintenance cost. Invalidations are performed by simply bumping up a 32-bit sequence number. The only expensive operation is in the rare case of a seq number overflow, where all caches that share the same address space are flushed. Upon a miss, the proposed replacement policy is based on the page number that contains the virtual address in question. Concretely, the following results are seen on an 80 core, 8 socket x86-64 box: 1) System bootup: Most programs are single threaded, so the per-thread scheme does improve ~50% hit rate by just adding a few more slots to the cache. +----------------+----------+------------------+ | caching scheme | hit-rate | cycles (billion) | +----------------+----------+------------------+ | baseline | 50.61% | 19.90 | | patched | 73.45% | 13.58 | +----------------+----------+------------------+ 2) Kernel build: This one is already pretty good with the current approach as we're dealing with good locality. +----------------+----------+------------------+ | caching scheme | hit-rate | cycles (billion) | +----------------+----------+------------------+ | baseline | 75.28% | 11.03 | | patched | 88.09% | 9.31 | +----------------+----------+------------------+ 3) Oracle 11g Data Mining (4k pages): Similar to the kernel build workload. +----------------+----------+------------------+ | caching scheme | hit-rate | cycles (billion) | +----------------+----------+------------------+ | baseline | 70.66% | 17.14 | | patched | 91.15% | 12.57 | +----------------+----------+------------------+ 4) Ebizzy: There's a fair amount of variation from run to run, but this approach always shows nearly perfect hit rates, while baseline is just about non-existent. The amounts of cycles can fluctuate between anywhere from ~60 to ~116 for the baseline scheme, but this approach reduces it considerably. For instance, with 80 threads: +----------------+----------+------------------+ | caching scheme | hit-rate | cycles (billion) | +----------------+----------+------------------+ | baseline | 1.06% | 91.54 | | patched | 99.97% | 14.18 | +----------------+----------+------------------+ [akpm@linux-foundation.org: fix nommu build, per Davidlohr] [akpm@linux-foundation.org: document vmacache_valid() logic] [akpm@linux-foundation.org: attempt to untangle header files] [akpm@linux-foundation.org: add vmacache_find() BUG_ON] [hughd@google.com: add vmacache_valid_mm() (from Oleg)] [akpm@linux-foundation.org: coding-style fixes] [akpm@linux-foundation.org: adjust and enhance comments] Signed-off-by: NDavidlohr Bueso <davidlohr@hp.com> Reviewed-by: NRik van Riel <riel@redhat.com> Acked-by: NLinus Torvalds <torvalds@linux-foundation.org> Reviewed-by: NMichel Lespinasse <walken@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Tested-by: NHugh Dickins <hughd@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 15 11月, 2013 3 次提交
-
-
由 Tetsuo Handa 提交于
All seq_printf() users are using "%n" for calculating padding size, convert them to use seq_setwidth() / seq_pad() pair. Signed-off-by: NTetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: NKees Cook <keescook@chromium.org> Cc: Joe Perches <joe@perches.com> Cc: David Miller <davem@davemloft.net> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Kirill A. Shutemov 提交于
With split ptlock it's important to know which lock pmd_trans_huge_lock() took. This patch adds one more parameter to the function to return the lock. In most places migration to new api is trivial. Exception is move_huge_pmd(): we need to take two locks if pmd tables are different. Signed-off-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Tested-by: NAlex Thorlton <athorlton@sgi.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: "Eric W . Biederman" <ebiederm@xmission.com> Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andi Kleen <ak@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Dave Jones <davej@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kees Cook <keescook@chromium.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Robin Holt <robinmholt@gmail.com> Cc: Sedat Dilek <sedat.dilek@gmail.com> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Kirill A. Shutemov 提交于
With split page table lock for PMD level we can't hold mm->page_table_lock while updating nr_ptes. Let's convert it to atomic_long_t to avoid races. Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Tested-by: NAlex Thorlton <athorlton@sgi.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: "Eric W . Biederman" <ebiederm@xmission.com> Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andi Kleen <ak@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Dave Jones <davej@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kees Cook <keescook@chromium.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Robin Holt <robinmholt@gmail.com> Cc: Sedat Dilek <sedat.dilek@gmail.com> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 13 11月, 2013 2 次提交
-
-
由 Naoya Horiguchi 提交于
This flag shows that the VMA is "newly created" and thus represents "dirty" in the task's VM. You can clear it by "echo 4 > /proc/pid/clear_refs." Signed-off-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Pavel Emelyanov <xemul@parallels.com> Acked-by: NCyrill Gorcunov <gorcunov@openvz.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 David Rientjes 提交于
mpol_to_str() should not fail. Currently, it either fails because the string buffer is too small or because a string hasn't been defined for a mempolicy mode. If a new mempolicy mode is introduced and no string is defined for it, just warn and return "unknown". If the buffer is too small, just truncate the string and return, the same behavior as snprintf(). This also fixes a bug where there was no NULL-byte termination when doing *p++ = '=' and *p++ ':' and maxlen has been reached. Signed-off-by: NDavid Rientjes <rientjes@google.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Chen Gang <gang.chen@asianux.com> Cc: Rik van Riel <riel@redhat.com> Cc: Dave Jones <davej@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 17 10月, 2013 1 次提交
-
-
由 Cyrill Gorcunov 提交于
If a page we are inspecting is in swap we may occasionally report it as having soft dirty bit (even if it is clean). The pte_soft_dirty helper should be called on present pte only. Signed-off-by: NCyrill Gorcunov <gorcunov@openvz.org> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Matt Mackall <mpm@selenic.com> Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Peter Zijlstra <peterz@infradead.org> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Reviewed-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: <stable@vger.kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 12 9月, 2013 2 次提交
-
-
由 Chen Gang 提交于
mpol_to_str() may fail, and not fill the buffer (e.g. -EINVAL), so need check about it, or buffer may not be zero based, and next seq_printf() will cause issue. The failure return need after mpol_cond_put() to match get_vma_policy(). Signed-off-by: NChen Gang <gang.chen@asianux.com> Cc: Cyrill Gorcunov <gorcunov@gmail.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Andi Kleen <andi@firstfloor.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Cyrill Gorcunov 提交于
Pavel reported that in case if vma area get unmapped and then mapped (or expanded) in-place, the soft dirty tracker won't be able to recognize this situation since it works on pte level and ptes are get zapped on unmap, loosing soft dirty bit of course. So to resolve this situation we need to track actions on vma level, there VM_SOFTDIRTY flag comes in. When new vma area created (or old expanded) we set this bit, and keep it here until application calls for clearing soft dirty bit. Thus when user space application track memory changes now it can detect if vma area is renewed. Reported-by: NPavel Emelyanov <xemul@parallels.com> Signed-off-by: NCyrill Gorcunov <gorcunov@openvz.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Matt Mackall <mpm@selenic.com> Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Peter Zijlstra <peterz@infradead.org> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: Rob Landley <rob@landley.net> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 14 8月, 2013 3 次提交
-
-
由 yonghua zheng 提交于
Recently we met quite a lot of random kernel panic issues after enabling CONFIG_PROC_PAGE_MONITOR. After debuggind we found this has something to do with following bug in pagemap: In struct pagemapread: struct pagemapread { int pos, len; pagemap_entry_t *buffer; bool v2; }; pos is number of PM_ENTRY_BYTES in buffer, but len is the size of buffer, it is a mistake to compare pos and len in add_page_map() for checking buffer is full or not, and this can lead to buffer overflow and random kernel panic issue. Correct len to be total number of PM_ENTRY_BYTES in buffer. [akpm@linux-foundation.org: document pagemapread.pos and .len units, fix PM_ENTRY_BYTES definition] Signed-off-by: NYonghua Zheng <younghua.zheng@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Cyrill Gorcunov 提交于
Andy reported that if file page get reclaimed we lose the soft-dirty bit if it was there, so save _PAGE_BIT_SOFT_DIRTY bit when page address get encoded into pte entry. Thus when #pf happens on such non-present pte we can restore it back. Reported-by: NAndy Lutomirski <luto@amacapital.net> Signed-off-by: NCyrill Gorcunov <gorcunov@openvz.org> Acked-by: NPavel Emelyanov <xemul@parallels.com> Cc: Matt Mackall <mpm@selenic.com> Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Peter Zijlstra <peterz@infradead.org> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Cyrill Gorcunov 提交于
Andy Lutomirski reported that if a page with _PAGE_SOFT_DIRTY bit set get swapped out, the bit is getting lost and no longer available when pte read back. To resolve this we introduce _PTE_SWP_SOFT_DIRTY bit which is saved in pte entry for the page being swapped out. When such page is to be read back from a swap cache we check for bit presence and if it's there we clear it and restore the former _PAGE_SOFT_DIRTY bit back. One of the problem was to find a place in pte entry where we can save the _PTE_SWP_SOFT_DIRTY bit while page is in swap. The _PAGE_PSE was chosen for that, it doesn't intersect with swap entry format stored in pte. Reported-by: NAndy Lutomirski <luto@amacapital.net> Signed-off-by: NCyrill Gorcunov <gorcunov@openvz.org> Acked-by: NPavel Emelyanov <xemul@parallels.com> Cc: Matt Mackall <mpm@selenic.com> Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Peter Zijlstra <peterz@infradead.org> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Reviewed-by: NMinchan Kim <minchan@kernel.org> Reviewed-by: NWanpeng Li <liwanp@linux.vnet.ibm.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 04 7月, 2013 5 次提交
-
-
由 Pavel Emelyanov 提交于
In order to reuse bits from pagemap entries gracefully, we leave the entries as is but on pagemap open emit a warning in dmesg, that bits 55-60 are about to change in a couple of releases. Next, if a user issues soft-dirty clear command via the clear_refs file (it was disabled before v3.9) we assume that he's aware of the new pagemap format, note that fact and report the bits in pagemap in the new manner. The "migration strategy" looks like this then: 1. existing users are not affected -- they don't touch soft-dirty feature, thus see old bits in pagemap, but are warned and have time to fix themselves 2. those who use soft-dirty know about new pagemap format 3. some time soon we get rid of any signs of page-shift in pagemap as well as this trick with clear-soft-dirty affecting pagemap format. Signed-off-by: NPavel Emelyanov <xemul@parallels.com> Cc: Matt Mackall <mpm@selenic.com> Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Cc: Glauber Costa <glommer@parallels.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Pavel Emelyanov 提交于
The soft-dirty is a bit on a PTE which helps to track which pages a task writes to. In order to do this tracking one should 1. Clear soft-dirty bits from PTEs ("echo 4 > /proc/PID/clear_refs) 2. Wait some time. 3. Read soft-dirty bits (55'th in /proc/PID/pagemap2 entries) To do this tracking, the writable bit is cleared from PTEs when the soft-dirty bit is. Thus, after this, when the task tries to modify a page at some virtual address the #PF occurs and the kernel sets the soft-dirty bit on the respective PTE. Note, that although all the task's address space is marked as r/o after the soft-dirty bits clear, the #PF-s that occur after that are processed fast. This is so, since the pages are still mapped to physical memory, and thus all the kernel does is finds this fact out and puts back writable, dirty and soft-dirty bits on the PTE. Another thing to note, is that when mremap moves PTEs they are marked with soft-dirty as well, since from the user perspective mremap modifies the virtual memory at mremap's new address. Signed-off-by: NPavel Emelyanov <xemul@parallels.com> Cc: Matt Mackall <mpm@selenic.com> Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Cc: Glauber Costa <glommer@parallels.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Pavel Emelyanov 提交于
These bits are always constant (== PAGE_SHIFT) and just occupy space in the entry. Moreover, in next patch we will need to report one more bit in the pagemap, but all bits are already busy on it. That said, describe the pagemap entry that has 6 more free zero bits. Signed-off-by: NPavel Emelyanov <xemul@parallels.com> Cc: Matt Mackall <mpm@selenic.com> Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Cc: Glauber Costa <glommer@parallels.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Pavel Emelyanov 提交于
In the next patch the clear-refs-type will be required in clear_refs_pte_range funciton, so prepare the walk->private to carry this info. Signed-off-by: NPavel Emelyanov <xemul@parallels.com> Cc: Matt Mackall <mpm@selenic.com> Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Cc: Glauber Costa <glommer@parallels.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Pavel Emelyanov 提交于
This is the implementation of the soft-dirty bit concept that should help keep track of changes in user memory, which in turn is very-very required by the checkpoint-restore project (http://criu.org). To create a dump of an application(s) we save all the information about it to files, and the biggest part of such dump is the contents of tasks' memory. However, there are usage scenarios where it's not required to get _all_ the task memory while creating a dump. For example, when doing periodical dumps, it's only required to take full memory dump only at the first step and then take incremental changes of memory. Another example is live migration. We copy all the memory to the destination node without stopping all tasks, then stop them, check for what pages has changed, dump it and the rest of the state, then copy it to the destination node. This decreases freeze time significantly. That said, some help from kernel to watch how processes modify the contents of their memory is required. The proposal is to track changes with the help of new soft-dirty bit this way: 1. First do "echo 4 > /proc/$pid/clear_refs". At that point kernel clears the soft dirty _and_ the writable bits from all ptes of process $pid. From now on every write to any page will result in #pf and the subsequent call to pte_mkdirty/pmd_mkdirty, which in turn will set the soft dirty flag. 2. Then read the /proc/$pid/pagemap2 and check the soft-dirty bit reported there (the 55'th one). If set, the respective pte was written to since last call to clear refs. The soft-dirty bit is the _PAGE_BIT_HIDDEN one. Although it's used by kmemcheck, the latter one marks kernel pages with it, while the former bit is put on user pages so they do not conflict to each other. This patch: A new clear-refs type will be added in the next patch, so prepare code for that. [akpm@linux-foundation.org: don't assume that sizeof(enum clear_refs_types) == sizeof(int)] Signed-off-by: NPavel Emelyanov <xemul@parallels.com> Cc: Matt Mackall <mpm@selenic.com> Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Cc: Glauber Costa <glommer@parallels.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 23 2月, 2013 1 次提交
-
-
由 Al Viro 提交于
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-