- 27 7月, 2016 1 次提交
-
-
由 Kirill A. Shutemov 提交于
Let's add ShmemHugePages and ShmemPmdMapped fields into meminfo and smaps. It indicates how many times we allocate and map shmem THP. NR_ANON_TRANSPARENT_HUGEPAGES is renamed to NR_ANON_THPS. Link: http://lkml.kernel.org/r/1466021202-61880-27-git-send-email-kirill.shutemov@linux.intel.comSigned-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 11 6月, 2016 1 次提交
-
-
由 Jann Horn 提交于
This prevents stacking filesystems (ecryptfs and overlayfs) from using procfs as lower filesystem. There is too much magic going on inside procfs, and there is no good reason to stack stuff on top of procfs. (For example, procfs does access checks in VFS open handlers, and ecryptfs by design calls open handlers from a kernel thread that doesn't drop privileges or so.) Signed-off-by: NJann Horn <jannh@google.com> Cc: stable@vger.kernel.org Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 24 5月, 2016 1 次提交
-
-
由 Michal Hocko 提交于
CLEAR_REFS_MM_HIWATER_RSS and CLEAR_REFS_SOFT_DIRTY are relying on mmap_sem for write. If the waiting task gets killed by the oom killer and it would operate on the current's mm it would block oom_reaper from asynchronous address space reclaim and reduce the chances of timely OOM resolving. Wait for the lock in the killable mode and return with EINTR if the task got killed while waiting. This will also expedite the return to the userspace and do_exit even if the mm is remote. Signed-off-by: NMichal Hocko <mhocko@suse.com> Acked-by: NOleg Nesterov <oleg@redhat.com> Acked-by: NVlastimil Babka <vbabka@suse.cz> Cc: Petr Cermak <petrcermak@chromium.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 21 5月, 2016 2 次提交
-
-
由 Janis Danisevskis 提交于
The PR_DUMPABLE flag causes the pid related paths of the proc file system to be owned by ROOT. The implementation of pthread_set/getname_np however needs access to /proc/<pid>/task/<tid>/comm. If PR_DUMPABLE is false this implementation is locked out. This patch installs a special permission function for the file "comm" that grants read and write access to all threads of the same group regardless of the ownership of the inode. For all other threads the function falls back to the generic inode permission check. [akpm@linux-foundation.org: fix spello in comment] Signed-off-by: NJanis Danisevskis <jdanis@google.com> Acked-by: NKees Cook <keescook@chromium.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Colin Ian King <colin.king@canonical.com> Cc: David Rientjes <rientjes@google.com> Cc: Minfei Huang <mnfhuang@gmail.com> Cc: John Stultz <john.stultz@linaro.org> Cc: Calvin Owens <calvinowens@fb.com> Cc: Jann Horn <jann@thejh.net> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Richard W.M. Jones 提交于
It's not possible to read the process umask without also modifying it, which is what umask(2) does. A library cannot read umask safely, especially if the main program might be multithreaded. Add a new status line ("Umask") in /proc/<PID>/status. It contains the file mode creation mask (umask) in octal. It is only shown for tasks which have task->fs. This patch is adapted from one originally written by Pierre Carrier. The use case is that we have endless trouble with people setting weird umask() values (usually on the grounds of "security"), and then everything breaking. I'm on the hook to fix these. We'd like to add debugging to our program so we can dump out the umask in debug reports. Previous versions of the patch used a syscall so you could only read your own umask. That's all I need. However there was quite a lot of push-back from those, so this new version exports it in /proc. See: https://lkml.org/lkml/2016/4/13/704 [umask2] https://lkml.org/lkml/2016/4/13/487 [getumask] Signed-off-by: NRichard W.M. Jones <rjones@redhat.com> Acked-by: NKonstantin Khlebnikov <koct9i@gmail.com> Acked-by: NJerome Marchand <jmarchan@redhat.com> Acked-by: NKees Cook <keescook@chromium.org> Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: Michal Hocko <mhocko@kernel.org> Cc: Pierre Carrier <pierre@spotify.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 20 5月, 2016 1 次提交
-
-
由 Joonsoo Kim 提交于
Many developers already know that field for reference count of the struct page is _count and atomic type. They would try to handle it directly and this could break the purpose of page reference count tracepoint. To prevent direct _count modification, this patch rename it to _refcount and add warning message on the code. After that, developer who need to handle reference count will find that field should not be accessed directly. [akpm@linux-foundation.org: fix comments, per Vlastimil] [akpm@linux-foundation.org: Documentation/vm/transhuge.txt too] [sfr@canb.auug.org.au: sync ethernet driver changes] Signed-off-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: NStephen Rothwell <sfr@canb.auug.org.au> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Berg <johannes@sipsolutions.net> Cc: "David S. Miller" <davem@davemloft.net> Cc: Sunil Goutham <sgoutham@cavium.com> Cc: Chris Metcalf <cmetcalf@mellanox.com> Cc: Manish Chopra <manish.chopra@qlogic.com> Cc: Yuval Mintz <yuval.mintz@qlogic.com> Cc: Tariq Toukan <tariqt@mellanox.com> Cc: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 13 5月, 2016 1 次提交
-
-
由 Daniel Wagner 提交于
parse_crash_elf{32|64}_headers will check the headers via the elf_check_arch respectively vmcore_elf64_check_arch macro. The MIPS architecture implements those two macros differently. In order to make the differentiation more explicit, let's introduce an vmcore_elf32_check_arch to allow the archs to overwrite it. Signed-off-by: NDaniel Wagner <daniel.wagner@bmw-carit.de> Suggested-by: NMaciej W. Rozycki <macro@imgtec.com> Reviewed-by: NMaciej W. Rozycki <macro@imgtec.com> Cc: linux-kernel@vger.kernel.org Cc: linux-mips@linux-mips.org Patchwork: https://patchwork.linux-mips.org/patch/12535/Signed-off-by: NRalf Baechle <ralf@linux-mips.org>
-
- 10 5月, 2016 1 次提交
-
-
由 Robin Humble 提交于
This reverts the 4.6-rc1 commit 7e2bc81d ("proc/base: make prompt shell start from new line after executing "cat /proc/$pid/wchan") because it breaks /proc/$PID/whcan formatting in ps and top. Revert also because the patch is inconsistent - it adds a newline at the end of only the '0' wchan, and does not add a newline when /proc/$PID/wchan contains a symbol name. eg. $ ps -eo pid,stat,wchan,comm PID STAT WCHAN COMMAND ... 1189 S - dbus-launch 1190 Ssl 0 dbus-daemon 1198 Sl 0 lightdm 1299 Ss ep_pol systemd 1301 S - (sd-pam) 1304 Ss wait sh Signed-off-by: NRobin Humble <plaguedbypenguins@gmail.com> Cc: Minfei Huang <mnfhuang@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 06 5月, 2016 1 次提交
-
-
由 Mathias Krause 提交于
If /proc/<PID>/environ gets read before the envp[] array is fully set up in create_{aout,elf,elf_fdpic,flat}_tables(), we might end up trying to read more bytes than are actually written, as env_start will already be set but env_end will still be zero, making the range calculation underflow, allowing to read beyond the end of what has been written. Fix this as it is done for /proc/<PID>/cmdline by testing env_end for zero. It is, apparently, intentionally set last in create_*_tables(). This bug was found by the PaX size_overflow plugin that detected the arithmetic underflow of 'this_len = env_end - (env_start + src)' when env_end is still zero. The expected consequence is that userland trying to access /proc/<PID>/environ of a not yet fully set up process may get inconsistent data as we're in the middle of copying in the environment variables. Fixes: https://forums.grsecurity.net/viewtopic.php?f=3&t=4363 Fixes: https://bugzilla.kernel.org/show_bug.cgi?id=116461Signed-off-by: NMathias Krause <minipli@googlemail.com> Cc: Emese Revfy <re.emese@gmail.com> Cc: Pax Team <pageexec@freemail.hu> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Mateusz Guzik <mguzik@redhat.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Jarod Wilson <jarod@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 03 5月, 2016 3 次提交
-
-
由 Al Viro 提交于
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
由 Al Viro 提交于
make it usable with directory locked shared Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
由 Al Viro 提交于
... making it usable with directory locked shared Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
- 29 4月, 2016 1 次提交
-
-
由 Gerald Schaefer 提交于
In gather_pte_stats() a THP pmd is cast into a pte, which is wrong because the layouts may differ depending on the architecture. On s390 this will lead to inaccurate numa_maps accounting in /proc because of misguided pte_present() and pte_dirty() checks on the fake pte. On other architectures pte_present() and pte_dirty() may work by chance, but there may be an issue with direct-access (dax) mappings w/o underlying struct pages when HAVE_PTE_SPECIAL is set and THP is available. In vm_normal_page() the fake pte will be checked with pte_special() and because there is no "special" bit in a pmd, this will always return false and the VM_PFNMAP | VM_MIXEDMAP checking will be skipped. On dax mappings w/o struct pages, an invalid struct page pointer would then be returned that can crash the kernel. This patch fixes the numa_maps THP handling by introducing new "_pmd" variants of the can_gather_numa_stats() and vm_normal_page() functions. Signed-off-by: NGerald Schaefer <gerald.schaefer@de.ibm.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Jerome Marchand <jmarchan@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Michael Holzheu <holzheu@linux.vnet.ibm.com> Cc: <stable@vger.kernel.org> [4.3+] Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 05 4月, 2016 1 次提交
-
-
由 Kirill A. Shutemov 提交于
PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time ago with promise that one day it will be possible to implement page cache with bigger chunks than PAGE_SIZE. This promise never materialized. And unlikely will. We have many places where PAGE_CACHE_SIZE assumed to be equal to PAGE_SIZE. And it's constant source of confusion on whether PAGE_CACHE_* or PAGE_* constant should be used in a particular case, especially on the border between fs and mm. Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much breakage to be doable. Let's stop pretending that pages in page cache are special. They are not. The changes are pretty straight-forward: - <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>; - <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>; - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN}; - page_cache_get() -> get_page(); - page_cache_release() -> put_page(); This patch contains automated changes generated with coccinelle using script below. For some reason, coccinelle doesn't patch header files. I've called spatch for them manually. The only adjustment after coccinelle is revert of changes to PAGE_CAHCE_ALIGN definition: we are going to drop it later. There are few places in the code where coccinelle didn't reach. I'll fix them manually in a separate patch. Comments and documentation also will be addressed with the separate patch. virtual patch @@ expression E; @@ - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT) + E @@ expression E; @@ - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) + E @@ @@ - PAGE_CACHE_SHIFT + PAGE_SHIFT @@ @@ - PAGE_CACHE_SIZE + PAGE_SIZE @@ @@ - PAGE_CACHE_MASK + PAGE_MASK @@ expression E; @@ - PAGE_CACHE_ALIGN(E) + PAGE_ALIGN(E) @@ expression E; @@ - page_cache_get(E) + get_page(E) @@ expression E; @@ - page_cache_release(E) + put_page(E) Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: NMichal Hocko <mhocko@suse.com> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 18 3月, 2016 7 次提交
-
-
由 Dave Young 提交于
On i686 PAE enabled machine the contiguous physical area could be large and it can cause trimming down variables in below calculation in read_vmcore() and mmap_vmcore(): tsz = min_t(size_t, m->offset + m->size - *fpos, buflen); That is, the types being used is like below on i686: m->offset: unsigned long long int m->size: unsigned long long int *fpos: loff_t (long long int) buflen: size_t (unsigned int) So casting (m->offset + m->size - *fpos) by size_t means truncating a given value by 4GB. Suppose (m->offset + m->size - *fpos) being truncated to 0, buflen >0 then we will get tsz = 0. It is of course not an expected result. Similarly we could also get other truncated values less than buflen. Then the real size passed down is not correct any more. If (m->offset + m->size - *fpos) is above 4GB, read_vmcore or mmap_vmcore use the min_t result with truncated values being compared to buflen. Then, fpos proceeds with the wrong value so that we reach below bugs: 1) read_vmcore will refuse to continue so makedumpfile fails. 2) mmap_vmcore will trigger BUG_ON() in remap_pfn_range(). Use unsigned long long in min_t instead so that the variables in are not truncated. Signed-off-by: NBaoquan He <bhe@redhat.com> Signed-off-by: NDave Young <dyoung@redhat.com> Cc: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Jianyu Zhan <nasa4836@gmail.com> Cc: Minfei Huang <mhuang@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Minfei Huang 提交于
It is not elegant that prompt shell does not start from new line after executing "cat /proc/$pid/wchan". Make prompt shell start from new line. Signed-off-by: NMinfei Huang <mnfhuang@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Eric Engestrom 提交于
`proc_timers_operations` is only used when CONFIG_CHECKPOINT_RESTORE is enabled. Signed-off-by: NEric Engestrom <eric.engestrom@imgtec.com> Acked-by: NCyrill Gorcunov <gorcunov@openvz.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 John Stultz 提交于
This patch provides a proc/PID/timerslack_ns interface which exposes a task's timerslack value in nanoseconds and allows it to be changed. This allows power/performance management software to set timer slack for other threads according to its policy for the thread (such as when the thread is designated foreground vs. background activity) If the value written is non-zero, slack is set to that value. Otherwise sets it to the default for the thread. This interface checks that the calling task has permissions to to use PTRACE_MODE_ATTACH_FSCREDS on the target task, so that we can ensure arbitrary apps do not change the timer slack for other apps. Signed-off-by: NJohn Stultz <john.stultz@linaro.org> Acked-by: NKees Cook <keescook@chromium.org> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Oren Laadan <orenl@cellrox.com> Cc: Ruchi Kandoi <kandoiruchi@google.com> Cc: Rom Lemarchand <romlem@android.com> Cc: Android Kernel Team <kernel-team@android.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Igor Redko 提交于
Add a new field, VIRTIO_BALLOON_S_AVAIL, to virtio_balloon memory statistics protocol, corresponding to 'Available' in /proc/meminfo. It indicates to the hypervisor how big the balloon can be inflated without pushing the guest system to swap. This metric would be very useful in VM orchestration software to improve memory management of different VMs under overcommit. This patch (of 2): Factor out calculation of the available memory counter into a separate exportable function, in order to be able to use it in other parts of the kernel. In particular, it appears a relevant metric to report to the hypervisor via virtio-balloon statistics interface (in a followup patch). Signed-off-by: NIgor Redko <redkoi@virtuozzo.com> Signed-off-by: NDenis V. Lunev <den@openvz.org> Reviewed-by: NRoman Kagan <rkagan@virtuozzo.com> Cc: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Naoya Horiguchi 提交于
Currently /proc/kpageflags returns just KPF_COMPOUND_TAIL for slab tail pages, which is inconvenient when grasping how slab pages are distributed (userspace always needs to check which kind of tail pages by itself). This patch sets KPF_SLAB for such pages. With this patch: $ grep Slab /proc/meminfo ; tools/vm/page-types -b slab Slab: 64880 kB flags page-count MB symbolic-flags long-symbolic-flags 0x0000000000000080 16220 63 _______S__________________________________ slab total 16220 63 16220 pages equals to 64880 kB, so returned result is consistent with the global counter. Signed-off-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com> Reviewed-by: NVladimir Davydov <vdavydov@virtuozzo.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Naoya Horiguchi 提交于
Currently /proc/kpageflags returns nothing for "tail" buddy pages, which is inconvenient when grasping how free pages are distributed. This patch sets KPF_BUDDY for such pages. With this patch: $ grep MemFree /proc/meminfo ; tools/vm/page-types -b buddy MemFree: 3134992 kB flags page-count MB symbolic-flags long-symbolic-flags 0x0000000000000400 779272 3044 __________B_______________________________ buddy 0x0000000000000c00 4385 17 __________BM______________________________ buddy,mmap total 783657 3061 783657 pages is 3134628 kB (roughly consistent with the global counter,) so it's OK. [akpm@linux-foundation.org: update comment, per Naoya] Signed-off-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com> Reviewed-by: NVladimir Davydov <vdavydov@virtuozzo.com>> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 19 2月, 2016 1 次提交
-
-
由 Dave Hansen 提交于
The protection key can now be just as important as read/write permissions on a VMA. We need some debug mechanism to help figure out if it is in play. smaps seems like a logical place to expose it. arch/x86/kernel/setup.c is a bit of a weirdo place to put this code, but it already had seq_file.h and there was not a much better existing place to put it. We also use no #ifdef. If protection keys is .config'd out we will effectively get the same function as if we used the weak generic function. Signed-off-by: NDave Hansen <dave.hansen@linux.intel.com> Reviewed-by: NThomas Gleixner <tglx@linutronix.de> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Baoquan He <bhe@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Borislav Petkov <bp@suse.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Dave Hansen <dave@sr71.net> Cc: Dave Young <dyoung@redhat.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Jerome Marchand <jmarchan@redhat.com> Cc: Jiri Kosina <jkosina@suse.cz> Cc: Joerg Roedel <jroedel@suse.de> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Laurent Dufour <ldufour@linux.vnet.ibm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mark Salter <msalter@redhat.com> Cc: Mark Williamson <mwilliamson@undo-software.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: linux-kernel@vger.kernel.org Cc: linux-mm@kvack.org Link: http://lkml.kernel.org/r/20160212210227.4F8EB3F8@viggo.jf.intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
- 17 2月, 2016 1 次提交
-
-
由 Aditya Kali 提交于
Introduce the ability to create new cgroup namespace. The newly created cgroup namespace remembers the cgroup of the process at the point of creation of the cgroup namespace (referred as cgroupns-root). The main purpose of cgroup namespace is to virtualize the contents of /proc/self/cgroup file. Processes inside a cgroup namespace are only able to see paths relative to their namespace root (unless they are moved outside of their cgroupns-root, at which point they will see a relative path from their cgroupns-root). For a correctly setup container this enables container-tools (like libcontainer, lxc, lmctfy, etc.) to create completely virtualized containers without leaking system level cgroup hierarchy to the task. This patch only implements the 'unshare' part of the cgroupns. Signed-off-by: NAditya Kali <adityakali@google.com> Signed-off-by: NSerge Hallyn <serge.hallyn@canonical.com> Signed-off-by: NTejun Heo <tj@kernel.org>
-
- 04 2月, 2016 2 次提交
-
-
由 Johannes Weiner 提交于
Commit b7643757 ("procfs: mark thread stack correctly in proc/<pid>/maps") added [stack:TID] annotation to /proc/<pid>/maps. Finding the task of a stack VMA requires walking the entire thread list, turning this into quadratic behavior: a thousand threads means a thousand stacks, so the rendering of /proc/<pid>/maps needs to look at a million combinations. The cost is not in proportion to the usefulness as described in the patch. Drop the [stack:TID] annotation to make /proc/<pid>/maps (and /proc/<pid>/numa_maps) usable again for higher thread counts. The [stack] annotation inside /proc/<pid>/task/<tid>/maps is retained, as identifying the stack VMA there is an O(1) operation. Siddesh said: "The end users needed a way to identify thread stacks programmatically and there wasn't a way to do that. I'm afraid I no longer remember (or have access to the resources that would aid my memory since I changed employers) the details of their requirement. However, I did do this on my own time because I thought it was an interesting project for me and nobody really gave any feedback then as to its utility, so as far as I am concerned you could roll back the main thread maps information since the information is available in the thread-specific files" Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Siddhesh Poyarekar <siddhesh.poyarekar@gmail.com> Cc: Shaohua Li <shli@fb.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Michael Holzheu 提交于
When working with hugetlbfs ptes (which are actually pmds) is not valid to directly use pte functions like pte_present() because the hardware bit layout of pmds and ptes can be different. This is the case on s390. Therefore we have to convert the hugetlbfs ptes first into a valid pte encoding with huge_ptep_get(). Currently the /proc/<pid>/numa_maps code uses hugetlbfs ptes without huge_ptep_get(). On s390 this leads to the following two problems: 1) The pte_present() function returns false (instead of true) for PROT_NONE hugetlb ptes. Therefore PROT_NONE vmas are missing completely in the "numa_maps" output. 2) The pte_dirty() function always returns false for all hugetlb ptes. Therefore these pages are reported as "mapped=xxx" instead of "dirty=xxx". Therefore use huge_ptep_get() to correctly convert the hugetlb ptes. Signed-off-by: NMichael Holzheu <holzheu@linux.vnet.ibm.com> Reviewed-by: NGerald Schaefer <gerald.schaefer@de.ibm.com> Cc: <stable@vger.kernel.org> [4.3+] Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 23 1月, 2016 1 次提交
-
-
由 Al Viro 提交于
parallel to mutex_{lock,unlock,trylock,is_locked,lock_nested}, inode_foo(inode) being mutex_foo(&inode->i_mutex). Please, use those for access to ->i_mutex; over the coming cycle ->i_mutex will become rwsem, with ->lookup() done with it held only shared. Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
- 22 1月, 2016 1 次提交
-
-
由 Kirill A. Shutemov 提交于
After THP refcounting rework we have only two possible return values from pmd_trans_huge_lock(): success and failure. Return-by-pointer for ptl doesn't make much sense in this case. Let's convert pmd_trans_huge_lock() to return ptl on success and NULL on failure. Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Suggested-by: NLinus Torvalds <torvalds@linux-foundation.org> Cc: Minchan Kim <minchan@kernel.org> Acked-by: NMichal Hocko <mhocko@suse.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 21 1月, 2016 3 次提交
-
-
由 Mateusz Guzik 提交于
Only functions doing more than one read are modified. Consumeres happened to deal with possibly changing data, but it does not seem like a good thing to rely on. Signed-off-by: NMateusz Guzik <mguzik@redhat.com> Acked-by: NCyrill Gorcunov <gorcunov@openvz.org> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Jarod Wilson <jarod@redhat.com> Cc: Jan Stancek <jstancek@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Anshuman Khandual <anshuman.linux@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Jann Horn 提交于
By checking the effective credentials instead of the real UID / permitted capabilities, ensure that the calling process actually intended to use its credentials. To ensure that all ptrace checks use the correct caller credentials (e.g. in case out-of-tree code or newly added code omits the PTRACE_MODE_*CREDS flag), use two new flags and require one of them to be set. The problem was that when a privileged task had temporarily dropped its privileges, e.g. by calling setreuid(0, user_uid), with the intent to perform following syscalls with the credentials of a user, it still passed ptrace access checks that the user would not be able to pass. While an attacker should not be able to convince the privileged task to perform a ptrace() syscall, this is a problem because the ptrace access check is reused for things in procfs. In particular, the following somewhat interesting procfs entries only rely on ptrace access checks: /proc/$pid/stat - uses the check for determining whether pointers should be visible, useful for bypassing ASLR /proc/$pid/maps - also useful for bypassing ASLR /proc/$pid/cwd - useful for gaining access to restricted directories that contain files with lax permissions, e.g. in this scenario: lrwxrwxrwx root root /proc/13020/cwd -> /root/foobar drwx------ root root /root drwxr-xr-x root root /root/foobar -rw-r--r-- root root /root/foobar/secret Therefore, on a system where a root-owned mode 6755 binary changes its effective credentials as described and then dumps a user-specified file, this could be used by an attacker to reveal the memory layout of root's processes or reveal the contents of files he is not allowed to access (through /proc/$pid/cwd). [akpm@linux-foundation.org: fix warning] Signed-off-by: NJann Horn <jann@thejh.net> Acked-by: NKees Cook <keescook@chromium.org> Cc: Casey Schaufler <casey@schaufler-ca.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: James Morris <james.l.morris@oracle.com> Cc: "Serge E. Hallyn" <serge.hallyn@ubuntu.com> Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Willy Tarreau <w@1wt.eu> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Kirill A. Shutemov 提交于
For THP=n, HPAGE_PMD_NR in smaps_account() expands to BUILD_BUG(). That's fine since this codepath is eliminated by modern compilers. But older compilers have not that efficient dead code elimination. It causes problem at least with gcc 4.1.2 on m68k: fs/built-in.o: In function `smaps_account': task_mmu.c:(.text+0x4f8fa): undefined reference to `__compiletime_assert_471' Let's replace HPAGE_PMD_NR with 1 << compound_order(page). Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Reported-by: NGeert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 16 1月, 2016 3 次提交
-
-
由 Kirill A. Shutemov 提交于
Let's define page_mapped() to be true for compound pages if any sub-pages of the compound page is mapped (with PMD or PTE). On other hand page_mapcount() return mapcount for this particular small page. This will make cases like page_get_anon_vma() behave correctly once we allow huge pages to be mapped with PTE. Most users outside core-mm should use page_mapcount() instead of page_mapped(). Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Tested-by: NSasha Levin <sasha.levin@oracle.com> Tested-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Acked-by: NJerome Marchand <jmarchan@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Steve Capper <steve.capper@linaro.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Kirill A. Shutemov 提交于
With new refcounting we don't need to mark PMDs splitting. Let's drop code to handle this. Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Tested-by: NSasha Levin <sasha.levin@oracle.com> Tested-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Acked-by: NVlastimil Babka <vbabka@suse.cz> Acked-by: NJerome Marchand <jmarchan@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Steve Capper <steve.capper@linaro.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Kirill A. Shutemov 提交于
The goal of this patchset is to make refcounting on THP pages cheaper with simpler semantics and allow the same THP compound page to be mapped with PMD and PTEs. This is required to get reasonable THP-pagecache implementation. With the new refcounting design it's much easier to protect against split_huge_page(): simple reference on a page will make you the deal. It makes gup_fast() implementation simpler and doesn't require special-case in futex code to handle tail THP pages. It should improve THP utilization over the system since splitting THP in one process doesn't necessary lead to splitting the page in all other processes have the page mapped. The patchset drastically lower complexity of get_page()/put_page() codepaths. I encourage people look on this code before-and-after to justify time budget on reviewing this patchset. This patch (of 37): With new refcounting all subpages of the compound page are not necessary have the same mapcount. We need to take into account mapcount of every sub-page. Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Tested-by: NSasha Levin <sasha.levin@oracle.com> Tested-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Acked-by: NJerome Marchand <jmarchan@redhat.com> Acked-by: NVlastimil Babka <vbabka@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Steve Capper <steve.capper@linaro.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 15 1月, 2016 7 次提交
-
-
由 Konstantin Khlebnikov 提交于
When inspecting a vague code inside prctl(PR_SET_MM_MEM) call (which testing the RLIMIT_DATA value to figure out if we're allowed to assign new @start_brk, @brk, @start_data, @end_data from mm_struct) it's been commited that RLIMIT_DATA in a form it's implemented now doesn't do anything useful because most of user-space libraries use mmap() syscall for dynamic memory allocations. Linus suggested to convert RLIMIT_DATA rlimit into something suitable for anonymous memory accounting. But in this patch we go further, and the changes are bundled together as: * keep vma counting if CONFIG_PROC_FS=n, will be used for limits * replace mm->shared_vm with better defined mm->data_vm * account anonymous executable areas as executable * account file-backed growsdown/up areas as stack * drop struct file* argument from vm_stat_account * enforce RLIMIT_DATA for size of data areas This way code looks cleaner: now code/stack/data classification depends only on vm_flags state: VM_EXEC & ~VM_WRITE -> code (VmExe + VmLib in proc) VM_GROWSUP | VM_GROWSDOWN -> stack (VmStk) VM_WRITE & ~VM_SHARED & !stack -> data (VmData) The rest (VmSize - VmData - VmStk - VmExe - VmLib) could be called "shared", but that might be strange beast like readonly-private or VM_IO area. - RLIMIT_AS limits whole address space "VmSize" - RLIMIT_STACK limits stack "VmStk" (but each vma individually) - RLIMIT_DATA now limits "VmData" Signed-off-by: NKonstantin Khlebnikov <koct9i@gmail.com> Signed-off-by: NCyrill Gorcunov <gorcunov@openvz.org> Cc: Quentin Casasnovas <quentin.casasnovas@oracle.com> Cc: Vegard Nossum <vegard.nossum@oracle.com> Acked-by: NLinus Torvalds <torvalds@linux-foundation.org> Cc: Willy Tarreau <w@1wt.eu> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Kees Cook <keescook@google.com> Cc: Vladimir Davydov <vdavydov@virtuozzo.com> Cc: Pavel Emelyanov <xemul@virtuozzo.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Oleg Nesterov 提交于
clear_soft_dirty_pmd() is called by clear_refs_write(CLEAR_REFS_SOFT_DIRTY), VM_SOFTDIRTY was already cleared before walk_page_range(). Signed-off-by: NOleg Nesterov <oleg@redhat.com> Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: NCyrill Gorcunov <gorcunov@openvz.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Johannes Weiner 提交于
The MemAvailable item in /proc/meminfo is to give users a hint of how much memory is allocatable without causing swapping, so it excludes the zones' low watermarks as unavailable to userspace. However, for a userspace allocation, kswapd will actually reclaim until the free pages hit a combination of the high watermark and the page allocator's lowmem protection that keeps a certain amount of DMA and DMA32 memory from userspace as well. Subtract the full amount we know to be unavailable to userspace from the number of free pages when calculating MemAvailable. Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Acked-by: NMichal Hocko <mhocko@suse.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Jerome Marchand 提交于
There are several shortcomings with the accounting of shared memory (SysV shm, shared anonymous mapping, mapping of a tmpfs file). The values in /proc/<pid>/status and <...>/statm don't allow to distinguish between shmem memory and a shared mapping to a regular file, even though theirs implication on memory usage are quite different: during reclaim, file mapping can be dropped or written back on disk, while shmem needs a place in swap. Also, to distinguish the memory occupied by anonymous and file mappings, one has to read the /proc/pid/statm file, which has a field for the file mappings (again, including shmem) and total memory occupied by these mappings (i.e. equivalent to VmRSS in the <...>/status file. Getting the value for anonymous mappings only is thus not exactly user-friendly (the statm file is intended to be rather efficiently machine-readable). To address both of these shortcomings, this patch adds a breakdown of VmRSS in /proc/<pid>/status via new fields RssAnon, RssFile and RssShmem, making use of the previous preparatory patch. These fields tell the user the memory occupied by private anonymous pages, mapped regular files and shmem, respectively. Other existing fields in /status and /statm files are left without change. The /statm file can be extended in the future, if there's a need for that. Example (part of) /proc/pid/status output including the new Rss* fields: VmPeak: 2001008 kB VmSize: 2001004 kB VmLck: 0 kB VmPin: 0 kB VmHWM: 5108 kB VmRSS: 5108 kB RssAnon: 92 kB RssFile: 1324 kB RssShmem: 3692 kB VmData: 192 kB VmStk: 136 kB VmExe: 4 kB VmLib: 1784 kB VmPTE: 3928 kB VmPMD: 20 kB VmSwap: 0 kB HugetlbPages: 0 kB [vbabka@suse.cz: forward-porting, tweak changelog] Signed-off-by: NJerome Marchand <jmarchan@redhat.com> Signed-off-by: NVlastimil Babka <vbabka@suse.cz> Acked-by: NKonstantin Khlebnikov <khlebnikov@yandex-team.ru> Acked-by: NMichal Hocko <mhocko@suse.com> Acked-by: NHugh Dickins <hughd@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Jerome Marchand 提交于
Currently looking at /proc/<pid>/status or statm, there is no way to distinguish shmem pages from pages mapped to a regular file (shmem pages are mapped to /dev/zero), even though their implication in actual memory use is quite different. The internal accounting currently counts shmem pages together with regular files. As a preparation to extend the userspace interfaces, this patch adds MM_SHMEMPAGES counter to mm_rss_stat to account for shmem pages separately from MM_FILEPAGES. The next patch will expose it to userspace - this patch doesn't change the exported values yet, by adding up MM_SHMEMPAGES to MM_FILEPAGES at places where MM_FILEPAGES was used before. The only user-visible change after this patch is the OOM killer message that separates the reported "shmem-rss" from "file-rss". [vbabka@suse.cz: forward-porting, tweak changelog] Signed-off-by: NJerome Marchand <jmarchan@redhat.com> Signed-off-by: NVlastimil Babka <vbabka@suse.cz> Acked-by: NKonstantin Khlebnikov <khlebnikov@yandex-team.ru> Acked-by: NMichal Hocko <mhocko@suse.com> Acked-by: NHugh Dickins <hughd@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Vlastimil Babka 提交于
Following the previous patch, further reduction of /proc/pid/smaps cost is possible for private writable shmem mappings with unpopulated areas where the page walk invokes the .pte_hole function. We can use radix tree iterator for each such area instead of calling find_get_entry() in a loop. This is possible at the extra maintenance cost of introducing another shmem function shmem_partial_swap_usage(). To demonstrate the diference, I have measured this on a process that creates a private writable 2GB mapping of a partially swapped out /dev/shm/file (which cannot employ the optimizations from the prvious patch) and doesn't populate it at all. I time how long does it take to cat /proc/pid/smaps of this process 100 times. Before this patch: real 0m3.831s user 0m0.180s sys 0m3.212s After this patch: real 0m1.176s user 0m0.180s sys 0m0.684s The time is similar to the case where a radix tree iterator is employed on the whole mapping. Signed-off-by: NVlastimil Babka <vbabka@suse.cz> Cc: Hugh Dickins <hughd@google.com> Cc: Jerome Marchand <jmarchan@redhat.com> Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Acked-by: NMichal Hocko <mhocko@suse.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Vlastimil Babka 提交于
The previous patch has improved swap accounting for shmem mapping, which however made /proc/pid/smaps more expensive for shmem mappings, as we consult the radix tree for each pte_none entry, so the overal complexity is O(n*log(n)). We can reduce this significantly for mappings that cannot contain COWed pages, because then we can either use the statistics tha shmem object itself tracks (if the mapping contains the whole object, or the swap usage of the whole object is zero), or use the radix tree iterator, which is much more effective than repeated find_get_entry() calls. This patch therefore introduces a function shmem_swap_usage(vma) and makes /proc/pid/smaps use it when possible. Only for writable private mappings of shmem objects (i.e. tmpfs files) with the shmem object itself (partially) swapped outwe have to resort to the find_get_entry() approach. Hopefully such mappings are relatively uncommon. To demonstrate the diference, I have measured this on a process that creates a 2GB mapping and dirties single pages with a stride of 2MB, and time how long does it take to cat /proc/pid/smaps of this process 100 times. Private writable mapping of a /dev/shm/file (the most complex case): real 0m3.831s user 0m0.180s sys 0m3.212s Shared mapping of an almost full mapping of a partially swapped /dev/shm/file (which needs to employ the radix tree iterator). real 0m1.351s user 0m0.096s sys 0m0.768s Same, but with /dev/shm/file not swapped (so no radix tree walk needed) real 0m0.935s user 0m0.128s sys 0m0.344s Private anonymous mapping: real 0m0.949s user 0m0.116s sys 0m0.348s The cost is now much closer to the private anonymous mapping case, unless the shmem mapping is private and writable. Signed-off-by: NVlastimil Babka <vbabka@suse.cz> Cc: Hugh Dickins <hughd@google.com> Cc: Jerome Marchand <jmarchan@redhat.com> Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Acked-by: NMichal Hocko <mhocko@suse.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-