- 09 9月, 2015 1 次提交
-
-
由 Konstantin Khlebnikov 提交于
This patchset makes pagemap useable again in the safe way (after row hammer bug it was made CAP_SYS_ADMIN-only). This patchset restores access for non-privileged users but hides PFNs from them. Also it adds bit 'map-exclusive' which is set if page is mapped only here: it helps in estimation of working set without exposing pfns and allows to distinguish CoWed and non-CoWed private anonymous pages. Second patch removes page-shift bits and completes migration to the new pagemap format: flags soft-dirty and mmap-exclusive are available only in the new format. This patch (of 5): This patch moves permission checks from pagemap_read() into pagemap_open(). Pointer to mm is saved in file->private_data. This reference pins only mm_struct itself. /proc/*/mem, maps, smaps already work in the same way. See http://lkml.kernel.org/r/CA+55aFyKpWrt_Ajzh1rzp_GcwZ4=6Y=kOv8hBz172CFJp6L8Tg@mail.gmail.comSigned-off-by: NKonstantin Khlebnikov <khlebnikov@yandex-team.ru> Reviewed-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com> Reviewed-by: NMark Williamson <mwilliamson@undo-software.com> Tested-by: NMark Williamson <mwilliamson@undo-software.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 05 9月, 2015 1 次提交
-
-
由 Andrea Arcangeli 提交于
These two flags gets set in vma->vm_flags to tell the VM common code if the userfaultfd is armed and in which mode (only tracking missing faults, only tracking wrprotect faults or both). If neither flags is set it means the userfaultfd is not armed on the vma. Signed-off-by: NAndrea Arcangeli <aarcange@redhat.com> Acked-by: NPavel Emelyanov <xemul@parallels.com> Cc: Sanidhya Kashyap <sanidhya.gatech@gmail.com> Cc: zhang.zhanghailiang@huawei.com Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Andres Lagar-Cavilla <andreslc@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Hugh Dickins <hughd@google.com> Cc: Peter Feiner <pfeiner@google.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: "Huangpeng (Peter)" <peter.huangpeng@huawei.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 24 6月, 2015 1 次提交
-
-
由 Miklos Szeredi 提交于
Turn seq_path(..., &file->f_path, ...); into seq_file_path(..., file, ...); Signed-off-by: NMiklos Szeredi <mszeredi@suse.cz> Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
- 18 3月, 2015 1 次提交
-
-
由 Kirill A. Shutemov 提交于
As pointed by recent post[1] on exploiting DRAM physical imperfection, /proc/PID/pagemap exposes sensitive information which can be used to do attacks. This disallows anybody without CAP_SYS_ADMIN to read the pagemap. [1] http://googleprojectzero.blogspot.com/2015/03/exploiting-dram-rowhammer-bug-to-gain.html [ Eventually we might want to do anything more finegrained, but for now this is the simple model. - Linus ] Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: NKonstantin Khlebnikov <khlebnikov@openvz.org> Acked-by: NAndy Lutomirski <luto@amacapital.net> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Mark Seaborn <mseaborn@chromium.org> Cc: stable@vger.kernel.org Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 13 2月, 2015 2 次提交
-
-
由 Rafael Aquini 提交于
The output of /proc/$pid/numa_maps is in terms of number of pages like anon=22 or dirty=54. Here's some output: 7f4680000000 default file=/hugetlb/bigfile anon=50 dirty=50 N0=50 7f7659600000 default file=/anon_hugepage\040(deleted) anon=50 dirty=50 N0=50 7fff8d425000 default stack anon=50 dirty=50 N0=50 Looks like we have a stack and a couple of anonymous hugetlbfs areas page which both use the same amount of memory. They don't. The 'bigfile' uses 1GB pages and takes up ~50GB of space. The anon_hugepage uses 2MB pages and takes up ~100MB of space while the stack uses normal 4k pages. You can go over to smaps to figure out what the page size _really_ is with KernelPageSize or MMUPageSize. But, I think this is a pretty nasty and counterintuitive interface as it stands. This patch introduces 'kernelpagesize_kB' line element to /proc/<pid>/numa_maps report file in order to help identifying the size of pages that are backing memory areas mapped by a given task. This is specially useful to help differentiating between HUGE and GIGANTIC page backed VMAs. This patch is based on Dave Hansen's proposal and reviewer's follow-ups taken from the following dicussion threads: * https://lkml.org/lkml/2011/9/21/454 * https://lkml.org/lkml/2014/12/20/66Signed-off-by: NRafael Aquini <aquini@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Dave Hansen <dave.hansen@intel.com> Acked-by: NDavid Rientjes <rientjes@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Petr Cermak 提交于
Peak resident size of a process can be reset back to the process's current rss value by writing "5" to /proc/pid/clear_refs. The driving use-case for this would be getting the peak RSS value, which can be retrieved from the VmHWM field in /proc/pid/status, per benchmark iteration or test scenario. [akpm@linux-foundation.org: clarify behaviour in documentation] Signed-off-by: NPetr Cermak <petrcermak@chromium.org> Cc: Bjorn Helgaas <bhelgaas@google.com> Cc: Primiano Tucci <primiano@chromium.org> Cc: Petr Cermak <petrcermak@chromium.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 12 2月, 2015 9 次提交
-
-
由 Kirill A. Shutemov 提交于
Currently pagewalker splits all THP pages on any clear_refs request. It's not necessary. We can handle this on PMD level. One side effect is that soft dirty will potentially see more dirty memory, since we will mark whole THP page dirty at once. Sanity checked with CRIU test suite. More testing is required. Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com> Reviewed-by: NCyrill Gorcunov <gorcunov@openvz.org> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Naoya Horiguchi 提交于
walk_page_range() silently skips vma having VM_PFNMAP set, which leads to undesirable behaviour at client end (who called walk_page_range). For example for pagemap_read(), when no callbacks are called against VM_PFNMAP vma, pagemap_read() may prepare pagemap data for next virtual address range at wrong index. That could confuse and/or break userspace applications. This patch avoid this misbehavior caused by vma(VM_PFNMAP) like follows: - for pagemap_read() which has its own ->pte_hole(), call the ->pte_hole() over vma(VM_PFNMAP), - for clear_refs and queue_pages which have their own ->tests_walk, just return 1 and skip vma(VM_PFNMAP). This is no problem because these are not interested in hole regions, - for other callers, just skip the vma(VM_PFNMAP) as a default behavior. Signed-off-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: NShiraz Hashim <shashim@codeaurora.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Naoya Horiguchi 提交于
pagewalk.c can handle vma in itself, so we don't have to pass vma via walk->private. And show_numa_map() walks pages on vma basis, so using walk_page_vma() is preferable. Signed-off-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com> Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Naoya Horiguchi 提交于
Just doing s/gather_hugetbl_stats/gather_hugetlb_stats/g, this makes code grep-friendly. Signed-off-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com> Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Naoya Horiguchi 提交于
Page table walker has the information of the current vma in mm_walk, so we don't have to call find_vma() in each pagemap_(pte|hugetlb)_range() call any longer. Currently pagemap_pte_range() does vma loop itself, so this patch reduces many lines of code. NULL-vma check is omitted because we assume that we never run these callbacks on any address outside vma. And even if it were broken, NULL pointer dereference would be detected, so we can get enough information for debugging. Signed-off-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Naoya Horiguchi 提交于
clear_refs_write() has some prechecks to determine if we really walk over a given vma. Now we have a test_walk() callback to filter vmas, so let's utilize it. Signed-off-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com> Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Naoya Horiguchi 提交于
pagewalk.c can handle vma in itself, so we don't have to pass vma via walk->private. And show_smap() walks pages on vma basis, so using walk_page_vma() is preferable. Signed-off-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com> Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Konstantin Khlebnikov 提交于
Lockless access to pte in pagemap_pte_range() might race with page migration and trigger BUG_ON(!PageLocked()) in migration_entry_to_page(): CPU A (pagemap) CPU B (migration) lock_page() try_to_unmap(page, TTU_MIGRATION...) make_migration_entry() set_pte_at() <read *pte> pte_to_pagemap_entry() remove_migration_ptes() unlock_page() if(is_migration_entry()) migration_entry_to_page() BUG_ON(!PageLocked(page)) Also lockless read might be non-atomic if pte is larger than wordsize. Other pte walkers (smaps, numa_maps, clear_refs) already lock ptes. Fixes: 052fb0d6 ("proc: report file/anon bit in /proc/pid/pagemap") Signed-off-by: NKonstantin Khlebnikov <khlebnikov@yandex-team.ru> Reported-by: NAndrey Ryabinin <a.ryabinin@samsung.com> Reviewed-by: NCyrill Gorcunov <gorcunov@openvz.org> Acked-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com> Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: <stable@vger.kernel.org> [3.5+] Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Kirill A. Shutemov 提交于
Dave noticed that unprivileged process can allocate significant amount of memory -- >500 MiB on x86_64 -- and stay unnoticed by oom-killer and memory cgroup. The trick is to allocate a lot of PMD page tables. Linux kernel doesn't account PMD tables to the process, only PTE. The use-cases below use few tricks to allocate a lot of PMD page tables while keeping VmRSS and VmPTE low. oom_score for the process will be 0. #include <errno.h> #include <stdio.h> #include <stdlib.h> #include <unistd.h> #include <sys/mman.h> #include <sys/prctl.h> #define PUD_SIZE (1UL << 30) #define PMD_SIZE (1UL << 21) #define NR_PUD 130000 int main(void) { char *addr = NULL; unsigned long i; prctl(PR_SET_THP_DISABLE); for (i = 0; i < NR_PUD ; i++) { addr = mmap(addr + PUD_SIZE, PUD_SIZE, PROT_WRITE|PROT_READ, MAP_ANONYMOUS|MAP_PRIVATE, -1, 0); if (addr == MAP_FAILED) { perror("mmap"); break; } *addr = 'x'; munmap(addr, PMD_SIZE); mmap(addr, PMD_SIZE, PROT_WRITE|PROT_READ, MAP_ANONYMOUS|MAP_PRIVATE|MAP_FIXED, -1, 0); if (addr == MAP_FAILED) perror("re-mmap"), exit(1); } printf("PID %d consumed %lu KiB in PMD page tables\n", getpid(), i * 4096 >> 10); return pause(); } The patch addresses the issue by account PMD tables to the process the same way we account PTE. The main place where PMD tables is accounted is __pmd_alloc() and free_pmd_range(). But there're few corner cases: - HugeTLB can share PMD page tables. The patch handles by accounting the table to all processes who share it. - x86 PAE pre-allocates few PMD tables on fork. - Architectures with FIRST_USER_ADDRESS > 0. We need to adjust sanity check on exit(2). Accounting only happens on configuration where PMD page table's level is present (PMD is not folded). As with nr_ptes we use per-mm counter. The counter value is used to calculate baseline for badness score by oom-killer. Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Reported-by: NDave Hansen <dave.hansen@linux.intel.com> Cc: Hugh Dickins <hughd@google.com> Reviewed-by: NCyrill Gorcunov <gorcunov@openvz.org> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: David Rientjes <rientjes@google.com> Tested-by: NSedat Dilek <sedat.dilek@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 11 2月, 2015 1 次提交
-
-
由 Kirill A. Shutemov 提交于
We have to handle non-linear mappings for /proc/PID/{smaps,clear_refs} which is unused now. Let's drop it. Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 11 12月, 2014 1 次提交
-
-
由 Kirill A. Shutemov 提交于
As a small zero page, huge zero page should not be accounted in smaps report as normal page. For small pages we rely on vm_normal_page() to filter out zero page, but vm_normal_page() is not designed to handle pmds. We only get here due hackish cast pmd to pte in smaps_pte_range() -- pte and pmd format is not necessary compatible on each and every architecture. Let's add separate codepath to handle pmds. follow_trans_huge_pmd() will detect huge zero page for us. We would need pmd_dirty() helper to do this properly. The patch adds it to THP-enabled architectures which don't yet have one. [akpm@linux-foundation.org: use do_div to fix 32-bit build] Signed-off-by: N"Kirill A. Shutemov" <kirill@shutemov.name> Reported-by: NFengguang Wu <fengguang.wu@intel.com> Tested-by: NFengwei Yin <yfw.kernel@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 18 11月, 2014 1 次提交
-
-
由 Qiaowei Ren 提交于
MPX-enabled applications using large swaths of memory can potentially have large numbers of bounds tables in process address space to save bounds information. These tables can take up huge swaths of memory (as much as 80% of the memory on the system) even if we clean them up aggressively. In the worst-case scenario, the tables can be 4x the size of the data structure being tracked. IOW, a 1-page structure can require 4 bounds-table pages. Being this huge, our expectation is that folks using MPX are going to be keen on figuring out how much memory is being dedicated to it. So we need a way to track memory use for MPX. If we want to specifically track MPX VMAs we need to be able to distinguish them from normal VMAs, and keep them from getting merged with normal VMAs. A new VM_ flag set only on MPX VMAs does both of those things. With this flag, MPX bounds-table VMAs can be distinguished from other VMAs, and userspace can also walk /proc/$pid/smaps to get memory usage for MPX. In addition to this flag, we also introduce a special ->vm_ops specific to MPX VMAs (see the patch "add MPX specific mmap interface"), but currently different ->vm_ops do not by themselves prevent VMA merging, so we still need this flag. We understand that VM_ flags are scarce and are open to other options. Signed-off-by: NQiaowei Ren <qiaowei.ren@intel.com> Signed-off-by: NDave Hansen <dave.hansen@linux.intel.com> Cc: linux-mm@kvack.org Cc: linux-mips@linux-mips.org Cc: Dave Hansen <dave@sr71.net> Link: http://lkml.kernel.org/r/20141114151825.565625B3@viggo.jf.intel.comSigned-off-by: NThomas Gleixner <tglx@linutronix.de>
-
- 14 10月, 2014 1 次提交
-
-
由 Peter Feiner 提交于
For VMAs that don't want write notifications, PTEs created for read faults have their write bit set. If the read fault happens after VM_SOFTDIRTY is cleared, then the PTE's softdirty bit will remain clear after subsequent writes. Here's a simple code snippet to demonstrate the bug: char* m = mmap(NULL, getpagesize(), PROT_READ | PROT_WRITE, MAP_ANONYMOUS | MAP_SHARED, -1, 0); system("echo 4 > /proc/$PPID/clear_refs"); /* clear VM_SOFTDIRTY */ assert(*m == '\0'); /* new PTE allows write access */ assert(!soft_dirty(x)); *m = 'x'; /* should dirty the page */ assert(soft_dirty(x)); /* fails */ With this patch, write notifications are enabled when VM_SOFTDIRTY is cleared. Furthermore, to avoid unnecessary faults, write notifications are disabled when VM_SOFTDIRTY is set. As a side effect of enabling and disabling write notifications with care, this patch fixes a bug in mprotect where vm_page_prot bits set by drivers were zapped on mprotect. An analogous bug was fixed in mmap by commit c9d0bf24 ("mm: uncached vma support with writenotify"). Signed-off-by: NPeter Feiner <pfeiner@google.com> Reported-by: NPeter Feiner <pfeiner@google.com> Suggested-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: Jamie Liu <jamieliu@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Bjorn Helgaas <bhelgaas@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 10 10月, 2014 15 次提交
-
-
由 Peter Feiner 提交于
If a /proc/pid/pagemap read spans a [VMA, an unmapped region, then a VM_SOFTDIRTY VMA], the virtual pages in the unmapped region are reported as softdirty. Here's a program to demonstrate the bug: int main() { const uint64_t PAGEMAP_SOFTDIRTY = 1ul << 55; uint64_t pme[3]; int fd = open("/proc/self/pagemap", O_RDONLY);; char *m = mmap(NULL, 3 * getpagesize(), PROT_READ, MAP_ANONYMOUS | MAP_SHARED, -1, 0); munmap(m + getpagesize(), getpagesize()); pread(fd, pme, 24, (unsigned long) m / getpagesize() * 8); assert(pme[0] & PAGEMAP_SOFTDIRTY); /* passes */ assert(!(pme[1] & PAGEMAP_SOFTDIRTY)); /* fails */ assert(pme[2] & PAGEMAP_SOFTDIRTY); /* passes */ return 0; } (Note that all pages in new VMAs are softdirty until cleared). Tested: Used the program given above. I'm going to include this code in a selftest in the future. [n-horiguchi@ah.jp.nec.com: prevent pagemap_pte_range() from overrunning] Signed-off-by: NPeter Feiner <pfeiner@google.com> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: Jamie Liu <jamieliu@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Oleg Nesterov 提交于
9e781440 "hold task->mempolicy while numa_maps scans." fixed the race with the exiting task but this is not enough. The current code assumes that get_vma_policy(task) should either see task->mempolicy == NULL or it should be equal to ->task_mempolicy saved by hold_task_mempolicy(), so we can never race with __mpol_put(). But this can only work if we can't race with do_set_mempolicy(), and thus we can't race with another do_set_mempolicy() or do_exit() after that. However, do_set_mempolicy()->down_write(mmap_sem) can not prevent this race. This task can exec, change it's ->mm, and call do_set_mempolicy() after that; in this case they take 2 different locks. Change hold_task_mempolicy() to use get_task_policy(), it never returns NULL, and change show_numa_map() to use __get_vma_policy() or fall back to proc_priv->task_mempolicy. Note: this is the minimal fix, we will cleanup this code later. I think hold_task_mempolicy() and release_task_mempolicy() should die, we can move this logic into show_numa_map(). Or we can move get_task_policy() outside of ->mmap_sem and !CONFIG_NUMA code at least. Signed-off-by: NOleg Nesterov <oleg@redhat.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: David Rientjes <rientjes@google.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Cyrill Gorcunov <gorcunov@openvz.org> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Hugh Dickins <hughd@google.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Oleg Nesterov 提交于
- Rename vm_is_stack() to task_of_stack() and change it to return "struct task_struct *" rather than the global (and thus wrong in general) pid_t. - Add the new pid_of_stack() helper which calls task_of_stack() and uses the right namespace to report the correct pid_t. Unfortunately we need to define this helper twice, in task_mmu.c and in task_nommu.c. perhaps it makes sense to add fs/proc/util.c and move at least pid_of_stack/task_of_stack there to avoid the code duplication. - Change show_map_vma() and show_numa_map() to use the new helper. Signed-off-by: NOleg Nesterov <oleg@redhat.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Cyrill Gorcunov <gorcunov@openvz.org> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Greg Ungerer <gerg@uclinux.org> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Oleg Nesterov 提交于
m_start() can use get_proc_task() instead, and "struct inode *" provides more potentially useful info, see the next changes. Signed-off-by: NOleg Nesterov <oleg@redhat.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Cyrill Gorcunov <gorcunov@openvz.org> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Greg Ungerer <gerg@uclinux.org> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Oleg Nesterov 提交于
Change the main loop in m_start() to update m->version. Mostly for consistency, but this can help to avoid the same loop if the very 1st ->show() fails due to seq_overflow(). Signed-off-by: NOleg Nesterov <oleg@redhat.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Cyrill Gorcunov <gorcunov@openvz.org> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Oleg Nesterov 提交于
Add the "last_addr" optimization back. Like before, every ->show() method checks !seq_overflow() and sets m->version = vma->vm_start. However, it also checks that m_next_vma(vma) != NULL, otherwise it sets m->version = -1 for the lockless "EOF" fast-path in m_start(). m_start() can simply do find_vma() + m_next_vma() if last_addr is not zero, the code looks clear and simple and this case is clearly separated from "scan vmas" path. Signed-off-by: NOleg Nesterov <oleg@redhat.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Cyrill Gorcunov <gorcunov@openvz.org> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Oleg Nesterov 提交于
Extract the tail_vma/vm_next calculation from m_next() into the new trivial helper, m_next_vma(). Signed-off-by: NOleg Nesterov <oleg@redhat.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Cyrill Gorcunov <gorcunov@openvz.org> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Oleg Nesterov 提交于
Now that m->version is gone we can cleanup m_start(). In particular, - Remove the "unsigned long" typecast, m->index can't be negative or exceed ->map_count. But lets use "unsigned int pos" to make it clear that "pos < map_count" is safe. - Remove the unnecessary "vma != NULL" check in the main loop. It can't be NULL unless we have a vm bug. - This also means that "pos < map_count" case can simply return the valid vma and avoid "goto" and subsequent checks. Signed-off-by: NOleg Nesterov <oleg@redhat.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Cyrill Gorcunov <gorcunov@openvz.org> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Oleg Nesterov 提交于
m_start() carefully documents, checks, and sets "m->version = -1" if we are going to return NULL. The only problem is that we will be never called again if m_start() returns NULL, so this is simply pointless and misleading. Otoh, ->show() methods m->version = 0 if vma == tail_vma and this is just wrong, we want -1 in this case. And in fact we also want -1 if ->vm_next == NULL and ->tail_vma == NULL. And it is not used consistently, the "scan vmas" loop in m_start() should update last_addr too. Finally, imo the whole "last_addr" logic in m_start() looks horrible. find_vma(last_addr) is called unconditionally even if we are not going to use the result. But the main problem is that this code participates in tail_vma-or-NULL mess, and this looks simply unfixable. Remove this optimization. We will add it back after some cleanups. Signed-off-by: NOleg Nesterov <oleg@redhat.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Cyrill Gorcunov <gorcunov@openvz.org> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Oleg Nesterov 提交于
1. There is no reason to reset ->tail_vma in m_start(), if we return IS_ERR_OR_NULL() it won't be used. 2. m_start() also clears priv->task to ensure that m_stop() won't use the stale pointer if we fail before get_task_struct(). But this is ugly and confusing, move this initialization in m_stop(). Signed-off-by: NOleg Nesterov <oleg@redhat.com> Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: NCyrill Gorcunov <gorcunov@openvz.org> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Oleg Nesterov 提交于
1. Kill the first "vma != NULL" check. Firstly this is not possible, m_next() won't be called if ->start() or the previous ->next() returns NULL. And if it was possible the 2nd "vma != tail_vma" check is buggy, we should not wrongly return ->tail_vma. 2. Make this function readable. The logic is very simple, we should return check "vma != tail" once and return "vm_next || tail_vma". Signed-off-by: NOleg Nesterov <oleg@redhat.com> Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: NCyrill Gorcunov <gorcunov@openvz.org> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Oleg Nesterov 提交于
m_start() drops ->mmap_sem and does mmput() if it retuns vsyscall vma. This is because in this case m_stop()->vma_stop() obviously can't use gate_vma->vm_mm. Now that we have proc_maps_private->mm we can simplify this logic: - Change m_start() to return with ->mmap_sem held unless it returns IS_ERR_OR_NULL(). - Change vma_stop() to use priv->mm and avoid the ugly vma checks, this makes "vm_area_struct *vma" unnecessary. - This also allows m_start() to use vm_stop(). - Cleanup m_next() to follow the new locking rule. Note: m_stop() looks very ugly, and this temporary uglifies it even more. Fixed by the next change. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: NOleg Nesterov <oleg@redhat.com> Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: NCyrill Gorcunov <gorcunov@openvz.org> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Oleg Nesterov 提交于
A simple test-case from Kirill Shutemov cat /proc/self/maps >/dev/null chmod +x /proc/self/net/packet exec /proc/self/net/packet makes lockdep unhappy, cat/exec take seq_file->lock + cred_guard_mutex in the opposite order. It's a false positive and probably we should not allow "chmod +x" on proc files. Still I think that we should avoid mm_access() and cred_guard_mutex in sys_read() paths, security checking should happen at open time. Besides, this doesn't even look right if the task changes its ->mm between m_stop() and m_start(). Add the new "mm_struct *mm" member into struct proc_maps_private and change proc_maps_open() to initialize it using proc_mem_open(). Change m_start() to use priv->mm if atomic_inc_not_zero(mm_users) succeeds or return NULL (eof) otherwise. The only complication is that proc_maps_open() users should additionally do mmdrop() in fop->release(), add the new proc_map_release() helper for that. Note: this is the user-visible change, if the task execs after open("maps") the new ->mm won't be visible via this file. I hope this is fine, and this matches /proc/pid/mem bahaviour. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: NOleg Nesterov <oleg@redhat.com> Reported-by: N"Kirill A. Shutemov" <kirill@shutemov.name> Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: NCyrill Gorcunov <gorcunov@openvz.org> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Oleg Nesterov 提交于
do_maps_open() and numa_maps_open() are overcomplicated, they could use __seq_open_private(). Plus they do the same, just sizeof(*priv) Change them to use a new simple helper, proc_maps_open(ops, psize). This simplifies the code and allows us to do the next changes. Signed-off-by: NOleg Nesterov <oleg@redhat.com> Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: NCyrill Gorcunov <gorcunov@openvz.org> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Oleg Nesterov 提交于
get_gate_vma(priv->task->mm) looks ugly and wrong, task->mm can be NULL or it can changed by exec right after mm_access(). And in theory this race is not harmless, the task can exec and then later exit and free the new mm_struct. In this case get_task_mm(oldmm) can't help, get_gate_vma(task->mm) can read the freed/unmapped memory. I think that priv->task should simply die and hold_task_mempolicy() logic can be simplified. tail_vma logic asks for cleanups too. Signed-off-by: NOleg Nesterov <oleg@redhat.com> Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: NCyrill Gorcunov <gorcunov@openvz.org> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 26 9月, 2014 1 次提交
-
-
由 Peter Feiner 提交于
In PTE holes that contain VM_SOFTDIRTY VMAs, unmapped addresses before VM_SOFTDIRTY VMAs are reported as softdirty by /proc/pid/pagemap. This bug was introduced in commit 68b5a652 ("mm: softdirty: respect VM_SOFTDIRTY in PTE holes"). That commit made /proc/pid/pagemap look at VM_SOFTDIRTY in PTE holes but neglected to observe the start of VMAs returned by find_vma. Tested: Wrote a selftest that creates a PMD-sized VMA then unmaps the first page and asserts that the page is not softdirty. I'm going to send the pagemap selftest in a later commit. Signed-off-by: NPeter Feiner <pfeiner@google.com> Cc: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: Hugh Dickins <hughd@google.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Jamie Liu <jamieliu@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 07 8月, 2014 1 次提交
-
-
由 Peter Feiner 提交于
After a VMA is created with the VM_SOFTDIRTY flag set, /proc/pid/pagemap should report that the VMA's virtual pages are soft-dirty until VM_SOFTDIRTY is cleared (i.e., by the next write of "4" to /proc/pid/clear_refs). However, pagemap ignores the VM_SOFTDIRTY flag for virtual addresses that fall in PTE holes (i.e., virtual addresses that don't have a PMD, PUD, or PGD allocated yet). To observe this bug, use mmap to create a VMA large enough such that there's a good chance that the VMA will occupy an unused PMD, then test the soft-dirty bit on its pages. In practice, I found that a VMA that covered a PMD's worth of address space was big enough. This patch adds the necessary VMA lookup to the PTE hole callback in /proc/pid/pagemap's page walk and sets soft-dirty according to the VMAs' VM_SOFTDIRTY flag. Signed-off-by: NPeter Feiner <pfeiner@google.com> Acked-by: NCyrill Gorcunov <gorcunov@openvz.org> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: Hugh Dickins <hughd@google.com> Acked-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 07 6月, 2014 2 次提交
-
-
由 Fabian Frederick 提交于
Signed-off-by: NFabian Frederick <fabf@skynet.be> Cc: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Naoya Horiguchi 提交于
The age table walker doesn't check non-present hugetlb entry in common path, so hugetlb_entry() callbacks must check it. The reason for this behavior is that some callers want to handle it in its own way. [ I think that reason is bogus, btw - it should just do what the regular code does, which is to call the "pte_hole()" function for such hugetlb entries - Linus] However, some callers don't check it now, which causes unpredictable result, for example when we have a race between migrating hugepage and reading /proc/pid/numa_maps. This patch fixes it by adding !pte_present checks on buggy callbacks. This bug exists for years and got visible by introducing hugepage migration. ChangeLog v2: - fix if condition (check !pte_present() instead of pte_present()) Reported-by: NSasha Levin <sasha.levin@oracle.com> Signed-off-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Rik van Riel <riel@redhat.com> Cc: <stable@vger.kernel.org> [3.12+] Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> [ Backported to 3.15. Signed-off-by: Josh Boyer <jwboyer@fedoraproject.org> ] Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 05 6月, 2014 1 次提交
-
-
由 Cyrill Gorcunov 提交于
clear_refs_write() is called earlier than clear_soft_dirty() and it is more natural to clear VM_SOFTDIRTY (which belongs to VMA entry but not PTEs) that early instead of clearing it a way deeper inside call chain. Signed-off-by: NCyrill Gorcunov <gorcunov@openvz.org> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: Hugh Dickins <hughd@google.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 21 5月, 2014 1 次提交
-
-
由 Andy Lutomirski 提交于
arch_vma_name sucks. It's a silly hack, and it's annoying to implement correctly. In fact, AFAICS, even the straightforward x86 implementation is incorrect (I suspect that it breaks if the vdso mapping is split or gets remapped). This adds a new vm_ops->name operation that can replace it. The followup patches will remove all uses of arch_vma_name on x86, fixing a couple of annoyances in the process. Signed-off-by: NAndy Lutomirski <luto@amacapital.net> Link: http://lkml.kernel.org/r/2eee21791bb36a0a408c5c2bdb382a9e6a41ca4a.1400538962.git.luto@amacapital.netSigned-off-by: NH. Peter Anvin <hpa@linux.intel.com>
-