- 09 8月, 2016 1 次提交
-
-
由 Eric W. Biederman 提交于
Export the export the maximum number of user namespaces as /proc/sys/userns/max_user_namespaces. Acked-by: NKees Cook <keescook@chromium.org> Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
-
- 29 7月, 2016 2 次提交
-
-
由 Andy Lutomirski 提交于
We should account for stacks regardless of stack size, and we need to account in sub-page units if THREAD_SIZE < PAGE_SIZE. Change the units to kilobytes and Move it into account_kernel_stack(). Fixes: 12580e4b ("mm: memcontrol: report kernel stack usage in cgroup2 memory.stat") Link: http://lkml.kernel.org/r/9b5314e3ee5eda61b0317ec1563768602c1ef438.1468523549.git.luto@kernel.orgSigned-off-by: NAndy Lutomirski <luto@kernel.org> Cc: Vladimir Davydov <vdavydov@virtuozzo.com> Acked-by: NJohannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Reviewed-by: NJosh Poimboeuf <jpoimboe@redhat.com> Reviewed-by: NVladimir Davydov <vdavydov@virtuozzo.com> Acked-by: NMichal Hocko <mhocko@suse.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Andy Lutomirski 提交于
Currently, NR_KERNEL_STACK tracks the number of kernel stacks in a zone. This only makes sense if each kernel stack exists entirely in one zone, and allowing vmapped stacks could break this assumption. Since frv has THREAD_SIZE < PAGE_SIZE, we need to track kernel stack allocations in a unit that divides both THREAD_SIZE and PAGE_SIZE on all architectures. Keep it simple and use KiB. Link: http://lkml.kernel.org/r/083c71e642c5fa5f1b6898902e1b2db7b48940d4.1468523549.git.luto@kernel.orgSigned-off-by: NAndy Lutomirski <luto@kernel.org> Cc: Vladimir Davydov <vdavydov@virtuozzo.com> Acked-by: NJohannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Reviewed-by: NJosh Poimboeuf <jpoimboe@redhat.com> Reviewed-by: NVladimir Davydov <vdavydov@virtuozzo.com> Acked-by: NMichal Hocko <mhocko@suse.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 27 7月, 2016 1 次提交
-
-
由 Vladimir Davydov 提交于
Currently, to charge a non-slab allocation to kmemcg one has to use alloc_kmem_pages helper with __GFP_ACCOUNT flag. A page allocated with this helper should finally be freed using free_kmem_pages, otherwise it won't be uncharged. This API suits its current users fine, but it turns out to be impossible to use along with page reference counting, i.e. when an allocation is supposed to be freed with put_page, as it is the case with pipe or unix socket buffers. To overcome this limitation, this patch moves charging/uncharging to generic page allocator paths, i.e. to __alloc_pages_nodemask and free_pages_prepare, and zaps alloc/free_kmem_pages helpers. This way, one can use any of the available page allocation functions to get the allocated page charged to kmemcg - it's enough to pass __GFP_ACCOUNT, just like in case of kmalloc and friends. A charged page will be automatically uncharged on free. To make it possible, we need to mark pages charged to kmemcg somehow. To avoid introducing a new page flag, we make use of page->_mapcount for marking such pages. Since pages charged to kmemcg are not supposed to be mapped to userspace, it should work just fine. There are other (ab)users of page->_mapcount - buddy and balloon pages - but we don't conflict with them. In case kmemcg is compiled out or not used at runtime, this patch introduces no overhead to generic page allocator paths. If kmemcg is used, it will be plus one gfp flags check on alloc and plus one page->_mapcount check on free, which shouldn't hurt performance, because the data accessed are hot. Link: http://lkml.kernel.org/r/a9736d856f895bcb465d9f257b54efe32eda6f99.1464079538.git.vdavydov@virtuozzo.comSigned-off-by: NVladimir Davydov <vdavydov@virtuozzo.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 25 6月, 2016 2 次提交
-
-
由 Michael Ellerman 提交于
Commit b235beea ("Clarify naming of thread info/stack allocators") breaks the build on some powerpc configs, where THREAD_SIZE < PAGE_SIZE: kernel/fork.c:235:2: error: implicit declaration of function 'free_thread_stack' kernel/fork.c:355:8: error: assignment from incompatible pointer type stack = alloc_thread_stack_node(tsk, node); ^ Fix it by renaming free_stack() to free_thread_stack(), and updating the return type of alloc_thread_stack_node(). Fixes: b235beea ("Clarify naming of thread info/stack allocators") Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Linus Torvalds 提交于
We've had the thread info allocated together with the thread stack for most architectures for a long time (since the thread_info was split off from the task struct), but that is about to change. But the patches that move the thread info to be off-stack (and a part of the task struct instead) made it clear how confused the allocator and freeing functions are. Because the common case was that we share an allocation with the thread stack and the thread_info, the two pointers were identical. That identity then meant that we would have things like ti = alloc_thread_info_node(tsk, node); ... tsk->stack = ti; which certainly _worked_ (since stack and thread_info have the same value), but is rather confusing: why are we assigning a thread_info to the stack? And if we move the thread_info away, the "confusing" code just gets to be entirely bogus. So remove all this confusion, and make it clear that we are doing the stack allocation by renaming and clarifying the function names to be about the stack. The fact that the thread_info then shares the allocation is an implementation detail, and not really about the allocation itself. This is a pure renaming and type fix: we pass in the same pointer, it's just that we clarify what the pointer means. The ia64 code that actually only has one single allocation (for all of task_struct, thread_info and kernel thread stack) now looks a bit odd, but since "tsk->stack" is actually not even used there, that oddity doesn't matter. It would be a separate thing to clean that up, I intentionally left the ia64 changes as a pure brute-force renaming and type change. Acked-by: NAndy Lutomirski <luto@amacapital.net> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 27 5月, 2016 1 次提交
-
-
由 Michal Hocko 提交于
mmput_async is currently used only from the oom_reaper which is defined only for CONFIG_MMU. We can save work_struct in mm_struct for !CONFIG_MMU. [akpm@linux-foundation.org: fix typo, per Minchan] Link: http://lkml.kernel.org/r/20160520061658.GB19172@dhcp22.suse.czReported-by: NMinchan Kim <minchan@kernel.org> Signed-off-by: NMichal Hocko <mhocko@suse.com> Acked-by: NMinchan Kim <minchan@kernel.org> Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 24 5月, 2016 2 次提交
-
-
由 Michal Hocko 提交于
dup_mmap needs to lock current's mm mmap_sem for write. If the waiting task gets killed by the oom killer it would block oom_reaper from asynchronous address space reclaim and reduce the chances of timely OOM resolving. Wait for the lock in the killable mode and return with EINTR if the task got killed while waiting. Signed-off-by: NMichal Hocko <mhocko@suse.com> Acked-by: NVlastimil Babka <vbabka@suse.cz> Cc: Ingo Molnar <mingo@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Andi Kleen 提交于
Linux preallocates the task structs of the idle tasks for all possible CPUs. This currently means they all end up on node 0. This also implies that the cache line of MWAIT, which is around the flags field in the task struct, are all located in node 0. We see a noticeable performance improvement on Knights Landing CPUs when the cache lines used for MWAIT are located in the local nodes of the CPUs using them. I would expect this to give a (likely slight) improvement on other systems too. The patch implements placing the idle task in the node of its CPUs, by passing the right target node to copy_process() [akpm@linux-foundation.org: use NUMA_NO_NODE, not a bare -1] Link: http://lkml.kernel.org/r/1463492694-15833-1-git-send-email-andi@firstfloor.orgSigned-off-by: NAndi Kleen <ak@linux.intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 21 5月, 2016 2 次提交
-
-
由 Jiri Slaby 提交于
When using this program (as root): #include <err.h> #include <stdio.h> #include <stdlib.h> #include <unistd.h> #include <sys/io.h> #include <sys/types.h> #include <sys/wait.h> #define ITER 1000 #define FORKERS 15 #define THREADS (6000/FORKERS) // 1850 is proc max static void fork_100_wait() { unsigned a, to_wait = 0; printf("\t%d forking %d\n", THREADS, getpid()); for (a = 0; a < THREADS; a++) { switch (fork()) { case 0: usleep(1000); exit(0); break; case -1: break; default: to_wait++; break; } } printf("\t%d forked from %d, waiting for %d\n", THREADS, getpid(), to_wait); for (a = 0; a < to_wait; a++) wait(NULL); printf("\t%d waited from %d\n", THREADS, getpid()); } static void run_forkers() { pid_t forkers[FORKERS]; unsigned a; for (a = 0; a < FORKERS; a++) { switch ((forkers[a] = fork())) { case 0: fork_100_wait(); exit(0); break; case -1: err(1, "DIE fork of %d'th forker", a); break; default: break; } } for (a = 0; a < FORKERS; a++) waitpid(forkers[a], NULL, 0); } int main() { unsigned a; int ret; ret = ioperm(10, 20, 0); if (ret < 0) err(1, "ioperm"); for (a = 0; a < ITER; a++) run_forkers(); return 0; } kmemleak reports many occurences of this leak: unreferenced object 0xffff8805917c8000 (size 8192): comm "fork-leak", pid 2932, jiffies 4295354292 (age 1871.028s) hex dump (first 32 bytes): ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................ ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................ backtrace: [<ffffffff814cfbf5>] kmemdup+0x25/0x50 [<ffffffff8103ab43>] copy_thread_tls+0x6c3/0x9a0 [<ffffffff81150174>] copy_process+0x1a84/0x5790 [<ffffffff811dc375>] wake_up_new_task+0x2d5/0x6f0 [<ffffffff8115411d>] _do_fork+0x12d/0x820 ... Due to the leakage of the memory items which should have been freed in arch/x86/kernel/process.c:exit_thread(). Make sure the memory is freed when fork fails later in copy_process. This is done by calling exit_thread with the thread to kill. Signed-off-by: NJiri Slaby <jslaby@suse.cz> Cc: "David S. Miller" <davem@davemloft.net> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: "James E.J. Bottomley" <jejb@parisc-linux.org> Cc: Aurelien Jacquiot <a-jacquiot@ti.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Chen Liqin <liqin.linux@gmail.com> Cc: Chris Metcalf <cmetcalf@mellanox.com> Cc: Chris Zankel <chris@zankel.net> Cc: David Howells <dhowells@redhat.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Guan Xuetao <gxt@mprc.pku.edu.cn> Cc: Haavard Skinnemoen <hskinnemoen@gmail.com> Cc: Hans-Christian Egtvedt <egtvedt@samfundet.no> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Helge Deller <deller@gmx.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Cc: James Hogan <james.hogan@imgtec.com> Cc: Jeff Dike <jdike@addtoit.com> Cc: Jesper Nilsson <jesper.nilsson@axis.com> Cc: Jiri Slaby <jslaby@suse.cz> Cc: Jonas Bonn <jonas@southpole.se> Cc: Koichi Yasutake <yasutake.koichi@jp.panasonic.com> Cc: Lennox Wu <lennox.wu@gmail.com> Cc: Ley Foon Tan <lftan@altera.com> Cc: Mark Salter <msalter@redhat.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Matt Turner <mattst88@gmail.com> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Simek <monstr@monstr.eu> Cc: Mikael Starvik <starvik@axis.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Rich Felker <dalias@libc.org> Cc: Richard Henderson <rth@twiddle.net> Cc: Richard Kuo <rkuo@codeaurora.org> Cc: Richard Weinberger <richard@nod.at> Cc: Russell King <linux@arm.linux.org.uk> Cc: Steven Miao <realmz6@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tony Luck <tony.luck@intel.com> Cc: Vineet Gupta <vgupta@synopsys.com> Cc: Will Deacon <will.deacon@arm.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Michal Hocko 提交于
Tetsuo has properly noted that mmput slow path might get blocked waiting for another party (e.g. exit_aio waits for an IO). If that happens the oom_reaper would be put out of the way and will not be able to process next oom victim. We should strive for making this context as reliable and independent on other subsystems as much as possible. Introduce mmput_async which will perform the slow path from an async (WQ) context. This will delay the operation but that shouldn't be a problem because the oom_reaper has reclaimed the victim's address space for most cases as much as possible and the remaining context shouldn't bind too much memory anymore. The only exception is when mmap_sem trylock has failed which shouldn't happen too often. The issue is only theoretical but not impossible. Signed-off-by: NMichal Hocko <mhocko@suse.com> Reported-by: NTetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: David Rientjes <rientjes@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 03 5月, 2016 1 次提交
-
-
由 Stas Sergeev 提交于
This patch implements the SS_AUTODISARM flag that can be OR-ed with SS_ONSTACK when forming ss_flags. When this flag is set, sigaltstack will be disabled when entering the signal handler; more precisely, after saving sas to uc_stack. When leaving the signal handler, the sigaltstack is restored by uc_stack. When this flag is used, it is safe to switch from sighandler with swapcontext(). Without this flag, the subsequent signal will corrupt the state of the switched-away sighandler. To detect the support of this functionality, one can do: err = sigaltstack(SS_DISABLE | SS_AUTODISARM); if (err && errno == EINVAL) unsupported(); Signed-off-by: NStas Sergeev <stsp@list.ru> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Aleksa Sarai <cyphar@cyphar.com> Cc: Amanieu d'Antras <amanieu@gmail.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Heinrich Schuchardt <xypron.glpk@gmx.de> Cc: Jason Low <jason.low2@hp.com> Cc: Josh Triplett <josh@joshtriplett.org> Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Moore <pmoore@redhat.com> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Richard Weinberger <richard@nod.at> Cc: Sasha Levin <sasha.levin@oracle.com> Cc: Shuah Khan <shuahkh@osg.samsung.com> Cc: Tejun Heo <tj@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vladimir Davydov <vdavydov@parallels.com> Cc: linux-api@vger.kernel.org Cc: linux-kernel@vger.kernel.org Link: http://lkml.kernel.org/r/1460665206-13646-4-git-send-email-stsp@list.ruSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
- 23 3月, 2016 1 次提交
-
-
由 Dmitry Vyukov 提交于
kcov provides code coverage collection for coverage-guided fuzzing (randomized testing). Coverage-guided fuzzing is a testing technique that uses coverage feedback to determine new interesting inputs to a system. A notable user-space example is AFL (http://lcamtuf.coredump.cx/afl/). However, this technique is not widely used for kernel testing due to missing compiler and kernel support. kcov does not aim to collect as much coverage as possible. It aims to collect more or less stable coverage that is function of syscall inputs. To achieve this goal it does not collect coverage in soft/hard interrupts and instrumentation of some inherently non-deterministic or non-interesting parts of kernel is disbled (e.g. scheduler, locking). Currently there is a single coverage collection mode (tracing), but the API anticipates additional collection modes. Initially I also implemented a second mode which exposes coverage in a fixed-size hash table of counters (what Quentin used in his original patch). I've dropped the second mode for simplicity. This patch adds the necessary support on kernel side. The complimentary compiler support was added in gcc revision 231296. We've used this support to build syzkaller system call fuzzer, which has found 90 kernel bugs in just 2 months: https://github.com/google/syzkaller/wiki/Found-Bugs We've also found 30+ bugs in our internal systems with syzkaller. Another (yet unexplored) direction where kcov coverage would greatly help is more traditional "blob mutation". For example, mounting a random blob as a filesystem, or receiving a random blob over wire. Why not gcov. Typical fuzzing loop looks as follows: (1) reset coverage, (2) execute a bit of code, (3) collect coverage, repeat. A typical coverage can be just a dozen of basic blocks (e.g. an invalid input). In such context gcov becomes prohibitively expensive as reset/collect coverage steps depend on total number of basic blocks/edges in program (in case of kernel it is about 2M). Cost of kcov depends only on number of executed basic blocks/edges. On top of that, kernel requires per-thread coverage because there are always background threads and unrelated processes that also produce coverage. With inlined gcov instrumentation per-thread coverage is not possible. kcov exposes kernel PCs and control flow to user-space which is insecure. But debugfs should not be mapped as user accessible. Based on a patch by Quentin Casasnovas. [akpm@linux-foundation.org: make task_struct.kcov_mode have type `enum kcov_mode'] [akpm@linux-foundation.org: unbreak allmodconfig] [akpm@linux-foundation.org: follow x86 Makefile layout standards] Signed-off-by: NDmitry Vyukov <dvyukov@google.com> Reviewed-by: NKees Cook <keescook@chromium.org> Cc: syzkaller <syzkaller@googlegroups.com> Cc: Vegard Nossum <vegard.nossum@oracle.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Tavis Ormandy <taviso@google.com> Cc: Will Deacon <will.deacon@arm.com> Cc: Quentin Casasnovas <quentin.casasnovas@oracle.com> Cc: Kostya Serebryany <kcc@google.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Alexander Potapenko <glider@google.com> Cc: Kees Cook <keescook@google.com> Cc: Bjorn Helgaas <bhelgaas@google.com> Cc: Sasha Levin <sasha.levin@oracle.com> Cc: David Drysdale <drysdale@google.com> Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Kirill A. Shutemov <kirill@shutemov.name> Cc: Jiri Slaby <jslaby@suse.cz> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 18 3月, 2016 1 次提交
-
-
由 Vladimir Davydov 提交于
Show how much memory is allocated to kernel stacks. Signed-off-by: NVladimir Davydov <vdavydov@virtuozzo.com> Acked-by: NJohannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 17 2月, 2016 1 次提交
-
-
由 Aditya Kali 提交于
Introduce the ability to create new cgroup namespace. The newly created cgroup namespace remembers the cgroup of the process at the point of creation of the cgroup namespace (referred as cgroupns-root). The main purpose of cgroup namespace is to virtualize the contents of /proc/self/cgroup file. Processes inside a cgroup namespace are only able to see paths relative to their namespace root (unless they are moved outside of their cgroupns-root, at which point they will see a relative path from their cgroupns-root). For a correctly setup container this enables container-tools (like libcontainer, lxc, lmctfy, etc.) to create completely virtualized containers without leaking system level cgroup hierarchy to the task. This patch only implements the 'unshare' part of the cgroupns. Signed-off-by: NAditya Kali <adityakali@google.com> Signed-off-by: NSerge Hallyn <serge.hallyn@canonical.com> Signed-off-by: NTejun Heo <tj@kernel.org>
-
- 15 1月, 2016 2 次提交
-
-
由 Konstantin Khlebnikov 提交于
When inspecting a vague code inside prctl(PR_SET_MM_MEM) call (which testing the RLIMIT_DATA value to figure out if we're allowed to assign new @start_brk, @brk, @start_data, @end_data from mm_struct) it's been commited that RLIMIT_DATA in a form it's implemented now doesn't do anything useful because most of user-space libraries use mmap() syscall for dynamic memory allocations. Linus suggested to convert RLIMIT_DATA rlimit into something suitable for anonymous memory accounting. But in this patch we go further, and the changes are bundled together as: * keep vma counting if CONFIG_PROC_FS=n, will be used for limits * replace mm->shared_vm with better defined mm->data_vm * account anonymous executable areas as executable * account file-backed growsdown/up areas as stack * drop struct file* argument from vm_stat_account * enforce RLIMIT_DATA for size of data areas This way code looks cleaner: now code/stack/data classification depends only on vm_flags state: VM_EXEC & ~VM_WRITE -> code (VmExe + VmLib in proc) VM_GROWSUP | VM_GROWSDOWN -> stack (VmStk) VM_WRITE & ~VM_SHARED & !stack -> data (VmData) The rest (VmSize - VmData - VmStk - VmExe - VmLib) could be called "shared", but that might be strange beast like readonly-private or VM_IO area. - RLIMIT_AS limits whole address space "VmSize" - RLIMIT_STACK limits stack "VmStk" (but each vma individually) - RLIMIT_DATA now limits "VmData" Signed-off-by: NKonstantin Khlebnikov <koct9i@gmail.com> Signed-off-by: NCyrill Gorcunov <gorcunov@openvz.org> Cc: Quentin Casasnovas <quentin.casasnovas@oracle.com> Cc: Vegard Nossum <vegard.nossum@oracle.com> Acked-by: NLinus Torvalds <torvalds@linux-foundation.org> Cc: Willy Tarreau <w@1wt.eu> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Kees Cook <keescook@google.com> Cc: Vladimir Davydov <vdavydov@virtuozzo.com> Cc: Pavel Emelyanov <xemul@virtuozzo.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Vladimir Davydov 提交于
Mark those kmem allocations that are known to be easily triggered from userspace as __GFP_ACCOUNT/SLAB_ACCOUNT, which makes them accounted to memcg. For the list, see below: - threadinfo - task_struct - task_delay_info - pid - cred - mm_struct - vm_area_struct and vm_region (nommu) - anon_vma and anon_vma_chain - signal_struct - sighand_struct - fs_struct - files_struct - fdtable and fdtable->full_fds_bits - dentry and external_name - inode for all filesystems. This is the most tedious part, because most filesystems overwrite the alloc_inode method. The list is far from complete, so feel free to add more objects. Nevertheless, it should be close to "account everything" approach and keep most workloads within bounds. Malevolent users will be able to breach the limit, but this was possible even with the former "account everything" approach (simply because it did not account everything in fact). [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: NVladimir Davydov <vdavydov@virtuozzo.com> Acked-by: NJohannes Weiner <hannes@cmpxchg.org> Acked-by: NMichal Hocko <mhocko@suse.com> Cc: Tejun Heo <tj@kernel.org> Cc: Greg Thelen <gthelen@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 06 1月, 2016 1 次提交
-
-
In the following commit: 76751049 ("sched: Implement lockless wake-queues") we gained lockless wake-queues. The -RT kernel managed to lockup itself with those. There could be multiple attempts for task X to enqueue it for a wakeup _even_ if task X is already running. The reason is that task X could be runnable but not yet on CPU. The the task performing the wakeup did not leave the CPU it could performe multiple wakeups. With the proper timming task X could be running and enqueued for a wakeup. If this happens while X is performing a fork() then its its child will have a !NULL `wake_q` member copied. This is not a problem as long as the child task does not participate in lockless wakeups :) Signed-off-by: NSebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: Davidlohr Bueso <dbueso@suse.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Fixes: 76751049 ("sched: Implement lockless wake-queues") Link: http://lkml.kernel.org/r/20151221171710.GA5499@linutronix.deSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
- 04 12月, 2015 2 次提交
-
-
由 Frederic Weisbecker 提交于
The cputime can only be updated by the current task itself, even in vtime case. So we can safely use seqcount instead of seqlock as there is no writer concurrency involved. Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: Chris Metcalf <cmetcalf@ezchip.com> Cc: Christoph Lameter <cl@linux.com> Cc: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Luiz Capitulino <lcapitulino@redhat.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul E . McKenney <paulmck@linux.vnet.ibm.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1447948054-28668-8-git-send-email-fweisbec@gmail.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Frederic Weisbecker 提交于
VTIME_SLEEPING state happens either when: 1) The task is sleeping and no tickless delta is to be added on the task cputime stats. 2) The CPU isn't running vtime at all, so the same properties of 1) applies. Lets rename the vtime symbol to reflect both states. Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: Chris Metcalf <cmetcalf@ezchip.com> Cc: Christoph Lameter <cl@linux.com> Cc: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Luiz Capitulino <lcapitulino@redhat.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul E . McKenney <paulmck@linux.vnet.ibm.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1447948054-28668-4-git-send-email-fweisbec@gmail.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
- 03 12月, 2015 1 次提交
-
-
由 Oleg Nesterov 提交于
Now that nobody use the "priv" arg passed to can_fork/cancel_fork/fork we can kill CGROUP_CANFORK_COUNT/SUBSYS_TAG/etc and cgrp_ss_priv[] in copy_process(). Signed-off-by: NOleg Nesterov <oleg@redhat.com> Signed-off-by: NTejun Heo <tj@kernel.org>
-
- 30 11月, 2015 1 次提交
-
-
由 Oleg Nesterov 提交于
If the new child migrates to another cgroup before cgroup_post_fork() calls subsys->fork(), then both pids_can_attach() and pids_fork() will do the same pids_uncharge(old_pids) + pids_charge(pids) sequence twice. Change copy_process() to call threadgroup_change_begin/threadgroup_change_end unconditionally. percpu_down_read() is cheap and this allows other cleanups, see the next changes. Also, this way we can unify cgroup_threadgroup_rwsem and dup_mmap_sem. Signed-off-by: NOleg Nesterov <oleg@redhat.com> Acked-by: NZefan Li <lizefan@huawei.com> Signed-off-by: NTejun Heo <tj@kernel.org>
-
- 06 11月, 2015 1 次提交
-
-
由 Eric B Munson 提交于
The cost of faulting in all memory to be locked can be very high when working with large mappings. If only portions of the mapping will be used this can incur a high penalty for locking. For the example of a large file, this is the usage pattern for a large statical language model (probably applies to other statical or graphical models as well). For the security example, any application transacting in data that cannot be swapped out (credit card data, medical records, etc). This patch introduces the ability to request that pages are not pre-faulted, but are placed on the unevictable LRU when they are finally faulted in. The VM_LOCKONFAULT flag will be used together with VM_LOCKED and has no effect when set without VM_LOCKED. Setting the VM_LOCKONFAULT flag for a VMA will cause pages faulted into that VMA to be added to the unevictable LRU when they are faulted or if they are already present, but will not cause any missing pages to be faulted in. Exposing this new lock state means that we cannot overload the meaning of the FOLL_POPULATE flag any longer. Prior to this patch it was used to mean that the VMA for a fault was locked. This means we need the new FOLL_MLOCK flag to communicate the locked state of a VMA. FOLL_POPULATE will now only control if the VMA should be populated and in the case of VM_LOCKONFAULT, it will not be set. Signed-off-by: NEric B Munson <emunson@akamai.com> Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: NVlastimil Babka <vbabka@suse.cz> Cc: Michal Hocko <mhocko@suse.cz> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Guenter Roeck <linux@roeck-us.net> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Shuah Khan <shuahkh@osg.samsung.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 16 10月, 2015 1 次提交
-
-
由 Tejun Heo 提交于
cgroup_exit() is called when a task exits and disassociates the exiting task from its cgroups and half-attach it to the root cgroup. This is unnecessary and undesirable. No controller actually needs an exiting task to be disassociated with non-root cgroups. Both cpu and perf_event controllers update the association to the root cgroup from their exit callbacks just to keep consistent with the cgroup core behavior. Also, this disassociation makes it difficult to track resources held by zombies or determine where the zombies came from. Currently, pids controller is completely broken as it uncharges on exit and zombies always escape the resource restriction. With cgroup association being reset on exit, fixing it is pretty painful. There's no reason to reset cgroup membership on exit. The zombie can be removed from its css_set so that it doesn't show up on "cgroup.procs" and thus can't be migrated or interfere with cgroup removal. It can still pin and point to the css_set so that its cgroup membership is maintained. This patch makes cgroup core keep zombies associated with their cgroups at the time of exit. * Previous patches decoupled populated_cnt tracking from css_set lifetime, so a dying task can be simply unlinked from its css_set while pinning and pointing to the css_set. This keeps css_set association from task side alive while hiding it from "cgroup.procs" and populated_cnt tracking. The css_set reference is dropped when the task_struct is freed. * ->exit() callback no longer needs the css arguments as the associated css never changes once PF_EXITING is set. Removed. * cpu and perf_events controllers no longer need ->exit() callbacks. There's no reason to explicitly switch away on exit. The final schedule out is enough. The callbacks are removed. * On traditional hierarchies, nothing changes. "/proc/PID/cgroup" still reports "/" for all zombies. On the default hierarchy, "/proc/PID/cgroup" keeps reporting the cgroup that the task belonged to at the time of exit. If the cgroup gets removed before the task is reaped, " (deleted)" is appended. v2: Build brekage due to missing dummy cgroup_free() when !CONFIG_CGROUP fixed. Signed-off-by: NTejun Heo <tj@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
-
- 15 10月, 2015 1 次提交
-
-
由 Jason Low 提交于
In the next patch in this series, a new field 'checking_timer' will be added to 'struct thread_group_cputimer'. Both this and the existing 'running' integer field are just used as boolean values. To save space in the structure, we can make both of these fields booleans. This is a preparatory patch to convert the existing running integer field to a boolean. Suggested-by: NGeorge Spelvin <linux@horizon.com> Signed-off-by: NJason Low <jason.low2@hp.com> Reviewed: George Spelvin <linux@horizon.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: hideaki.kimura@hpe.com Cc: terry.rudd@hpe.com Cc: scott.norton@hpe.com Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1444849677-29330-4-git-send-email-jason.low2@hp.comSigned-off-by: NThomas Gleixner <tglx@linutronix.de>
-
- 17 9月, 2015 1 次提交
-
-
由 Tejun Heo 提交于
Note: This commit was originally committed as d59cfc09 but got reverted by 0c986253 due to the performance regression from the percpu_rwsem write down/up operations added to cgroup task migration path. percpu_rwsem changes which alleviate the performance issue are pending for v4.4-rc1 merge window. Re-apply. The cgroup side of threadgroup locking uses signal_struct->group_rwsem to synchronize against threadgroup changes. This per-process rwsem adds small overhead to thread creation, exit and exec paths, forces cgroup code paths to do lock-verify-unlock-retry dance in a couple places and makes it impossible to atomically perform operations across multiple processes. This patch replaces signal_struct->group_rwsem with a global percpu_rwsem cgroup_threadgroup_rwsem which is cheaper on the reader side and contained in cgroups proper. This patch converts one-to-one. This does make writer side heavier and lower the granularity; however, cgroup process migration is a fairly cold path, we do want to optimize thread operations over it and cgroup migration operations don't take enough time for the lower granularity to matter. Signed-off-by: NTejun Heo <tj@kernel.org> Link: http://lkml.kernel.org/g/55F8097A.7000206@de.ibm.com Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org>
-
- 16 9月, 2015 1 次提交
-
-
由 Tejun Heo 提交于
This reverts commit d59cfc09. d59cfc09 ("sched, cgroup: replace signal_struct->group_rwsem with a global percpu_rwsem") and b5ba75b5 ("cgroup: simplify threadgroup locking") changed how cgroup synchronizes against task fork and exits so that it uses global percpu_rwsem instead of per-process rwsem; unfortunately, the write [un]lock paths of percpu_rwsem always involve synchronize_rcu_expedited() which turned out to be too expensive. Improvements for percpu_rwsem are scheduled to be merged in the coming v4.4-rc1 merge window which alleviates this issue. For now, revert the two commits to restore per-process rwsem. They will be re-applied for the v4.4-rc1 merge window. Signed-off-by: NTejun Heo <tj@kernel.org> Link: http://lkml.kernel.org/g/55F8097A.7000206@de.ibm.comReported-by: NChristian Borntraeger <borntraeger@de.ibm.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: stable@vger.kernel.org # v4.2+
-
- 05 9月, 2015 2 次提交
-
-
由 Andrea Arcangeli 提交于
These two flags gets set in vma->vm_flags to tell the VM common code if the userfaultfd is armed and in which mode (only tracking missing faults, only tracking wrprotect faults or both). If neither flags is set it means the userfaultfd is not armed on the vma. Signed-off-by: NAndrea Arcangeli <aarcange@redhat.com> Acked-by: NPavel Emelyanov <xemul@parallels.com> Cc: Sanidhya Kashyap <sanidhya.gatech@gmail.com> Cc: zhang.zhanghailiang@huawei.com Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Andres Lagar-Cavilla <andreslc@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Hugh Dickins <hughd@google.com> Cc: Peter Feiner <pfeiner@google.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: "Huangpeng (Peter)" <peter.huangpeng@huawei.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Andrea Arcangeli 提交于
This adds the vm_userfaultfd_ctx to the vm_area_struct. Signed-off-by: NAndrea Arcangeli <aarcange@redhat.com> Acked-by: NPavel Emelyanov <xemul@parallels.com> Cc: Sanidhya Kashyap <sanidhya.gatech@gmail.com> Cc: zhang.zhanghailiang@huawei.com Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Andres Lagar-Cavilla <andreslc@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Hugh Dickins <hughd@google.com> Cc: Peter Feiner <pfeiner@google.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: "Huangpeng (Peter)" <peter.huangpeng@huawei.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 13 8月, 2015 2 次提交
-
-
由 Eric W. Biederman 提交于
The code that places signals in signal queues computes the uids, gids, and pids at the time the signals are enqueued. Which means that tasks that share signal queues must be in the same pid and user namespaces. Sharing signal handlers is fine, but bizarre. So make the code in fork and userns_install clearer by only testing for what is functionally necessary. Also update the comment in unshare about unsharing a user namespace to be a little more explicit and make a little more sense. Acked-by: NOleg Nesterov <oleg@redhat.com> Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
-
由 Eric W. Biederman 提交于
In the logic in the initial commit of unshare made creating a new thread group for a process, contingent upon creating a new memory address space for that process. That is wrong. Two separate processes in different thread groups can share a memory address space and clone allows creation of such proceses. This is significant because it was observed that mm_users > 1 does not mean that a process is multi-threaded, as reading /proc/PID/maps temporarily increments mm_users, which allows other processes to (accidentally) interfere with unshare() calls. Correct the check in check_unshare_flags() to test for !thread_group_empty() for CLONE_THREAD, CLONE_SIGHAND, and CLONE_VM. For sighand->count > 1 for CLONE_SIGHAND and CLONE_VM. For !current_is_single_threaded instead of mm_users > 1 for CLONE_VM. By using the correct checks in unshare this removes the possibility of an accidental denial of service attack. Additionally using the correct checks in unshare ensures that only an explicit unshare(CLONE_VM) can possibly trigger the slow path of current_is_single_threaded(). As an explict unshare(CLONE_VM) is pointless it is not expected there are many applications that make that call. Cc: stable@vger.kernel.org Fixes: b2e0d987 userns: Implement unshare of the user namespace Reported-by: NRicky Zhou <rickyz@chromium.org> Reported-by: NKees Cook <keescook@chromium.org> Reviewed-by: NKees Cook <keescook@chromium.org> Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
-
- 03 8月, 2015 1 次提交
-
-
由 Peter Zijlstra 提交于
While the current code guarantees monotonicity for stime and utime independently of one another, it does not guarantee that the sum of both is equal to the total time we started out with. This confuses things (and peoples) who look at this sum, like top, and will report >100% usage followed by a matching period of 0%. Rework the code to provide both individual monotonicity and a coherent sum. Suggested-by: NFredrik Markstrom <fredrik.markstrom@gmail.com> Reported-by: NFredrik Markstrom <fredrik.markstrom@gmail.com> Tested-by: NFredrik Markstrom <fredrik.markstrom@gmail.com> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Stanislaw Gruszka <sgruszka@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: jason.low2@hp.com Cc: linux-kernel@vger.kernel.org Signed-off-by: NIngo Molnar <mingo@kernel.org>
-
- 18 7月, 2015 2 次提交
-
-
由 Ingo Molnar 提交于
Don't burden architectures without dynamic task_struct sizing with the overhead of dynamic sizing. Also optimize the x86 code a bit by caching task_struct_size. Acked-and-Tested-by: NDave Hansen <dave.hansen@linux.intel.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Dave Hansen <dave@sr71.net> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1437128892-9831-3-git-send-email-mingo@kernel.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Dave Hansen 提交于
The FPU rewrite removed the dynamic allocations of 'struct fpu'. But, this potentially wastes massive amounts of memory (2k per task on systems that do not have AVX-512 for instance). Instead of having a separate slab, this patch just appends the space that we need to the 'task_struct' which we dynamically allocate already. This saves from doing an extra slab allocation at fork(). The only real downside here is that we have to stick everything and the end of the task_struct. But, I think the BUILD_BUG_ON()s I stuck in there should keep that from being too fragile. Signed-off-by: NDave Hansen <dave.hansen@linux.intel.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Dave Hansen <dave@sr71.net> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1437128892-9831-2-git-send-email-mingo@kernel.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
- 15 7月, 2015 1 次提交
-
-
由 Aleksa Sarai 提交于
Add a new cgroup subsystem callback can_fork that conditionally states whether or not the fork is accepted or rejected by a cgroup policy. In addition, add a cancel_fork callback so that if an error occurs later in the forking process, any state modified by can_fork can be reverted. Allow for a private opaque pointer to be passed from cgroup_can_fork to cgroup_post_fork, allowing for the fork state to be stored by each subsystem separately. Also add a tagging system for cgroup_subsys.h to allow for CGROUP_<TAG> enumerations to be be defined and used. In addition, explicitly add a CGROUP_CANFORK_COUNT macro to make arrays easier to define. This is in preparation for implementing the pids cgroup subsystem. Signed-off-by: NAleksa Sarai <cyphar@cyphar.com> Signed-off-by: NTejun Heo <tj@kernel.org>
-
- 26 6月, 2015 1 次提交
-
-
由 Josh Triplett 提交于
clone has some of the quirkiest syscall handling in the kernel, with a pile of special cases, historical curiosities, and architecture-specific calling conventions. In particular, clone with CLONE_SETTLS accepts a parameter "tls" that the C entry point completely ignores and some assembly entry points overwrite; instead, the low-level arch-specific code pulls the tls parameter out of the arch-specific register captured as part of pt_regs on entry to the kernel. That's a massive hack, and it makes the arch-specific code only work when called via the specific existing syscall entry points; because of this hack, any new clone-like system call would have to accept an identical tls argument in exactly the same arch-specific position, rather than providing a unified system call entry point across architectures. The first patch allows architectures to handle the tls argument via normal C parameter passing, if they opt in by selecting HAVE_COPY_THREAD_TLS. The second patch makes 32-bit and 64-bit x86 opt into this. These two patches came out of the clone4 series, which isn't ready for this merge window, but these first two cleanup patches were entirely uncontroversial and have acks. I'd like to go ahead and submit these two so that other architectures can begin building on top of this and opting into HAVE_COPY_THREAD_TLS. However, I'm also happy to wait and send these through the next merge window (along with v3 of clone4) if anyone would prefer that. This patch (of 2): clone with CLONE_SETTLS accepts an argument to set the thread-local storage area for the new thread. sys_clone declares an int argument tls_val in the appropriate point in the argument list (based on the various CLONE_BACKWARDS variants), but doesn't actually use or pass along that argument. Instead, sys_clone calls do_fork, which calls copy_process, which calls the arch-specific copy_thread, and copy_thread pulls the corresponding syscall argument out of the pt_regs captured at kernel entry (knowing what argument of clone that architecture passes tls in). Apart from being awful and inscrutable, that also only works because only one code path into copy_thread can pass the CLONE_SETTLS flag, and that code path comes from sys_clone with its architecture-specific argument-passing order. This prevents introducing a new version of the clone system call without propagating the same architecture-specific position of the tls argument. However, there's no reason to pull the argument out of pt_regs when sys_clone could just pass it down via C function call arguments. Introduce a new CONFIG_HAVE_COPY_THREAD_TLS for architectures to opt into, and a new copy_thread_tls that accepts the tls parameter as an additional unsigned long (syscall-argument-sized) argument. Change sys_clone's tls argument to an unsigned long (which does not change the ABI), and pass that down to copy_thread_tls. Architectures that don't opt into copy_thread_tls will continue to ignore the C argument to sys_clone in favor of the pt_regs captured at kernel entry, and thus will be unable to introduce new versions of the clone syscall. Patch co-authored by Josh Triplett and Thiago Macieira. Signed-off-by: NJosh Triplett <josh@joshtriplett.org> Acked-by: NAndy Lutomirski <luto@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Thiago Macieira <thiago.macieira@intel.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 27 5月, 2015 1 次提交
-
-
由 Tejun Heo 提交于
The cgroup side of threadgroup locking uses signal_struct->group_rwsem to synchronize against threadgroup changes. This per-process rwsem adds small overhead to thread creation, exit and exec paths, forces cgroup code paths to do lock-verify-unlock-retry dance in a couple places and makes it impossible to atomically perform operations across multiple processes. This patch replaces signal_struct->group_rwsem with a global percpu_rwsem cgroup_threadgroup_rwsem which is cheaper on the reader side and contained in cgroups proper. This patch converts one-to-one. This does make writer side heavier and lower the granularity; however, cgroup process migration is a fairly cold path, we do want to optimize thread operations over it and cgroup migration operations don't take enough time for the lower granularity to matter. Signed-off-by: NTejun Heo <tj@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org>
-
- 19 5月, 2015 1 次提交
-
-
由 David Hildenbrand 提交于
Until now, pagefault_disable()/pagefault_enabled() used the preempt count to track whether in an environment with pagefaults disabled (can be queried via in_atomic()). This patch introduces a separate counter in task_struct to count the level of pagefault_disable() calls. We'll keep manipulating the preempt count to retain compatibility to existing pagefault handlers. It is now possible to verify whether in a pagefault_disable() envionment by calling pagefault_disabled(). In contrast to in_atomic() it will not be influenced by preempt_enable()/preempt_disable(). This patch is based on a patch from Ingo Molnar. Reviewed-and-tested-by: NThomas Gleixner <tglx@linutronix.de> Signed-off-by: NDavid Hildenbrand <dahi@linux.vnet.ibm.com> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: David.Laight@ACULAB.COM Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: airlied@linux.ie Cc: akpm@linux-foundation.org Cc: benh@kernel.crashing.org Cc: bigeasy@linutronix.de Cc: borntraeger@de.ibm.com Cc: daniel.vetter@intel.com Cc: heiko.carstens@de.ibm.com Cc: herbert@gondor.apana.org.au Cc: hocko@suse.cz Cc: hughd@google.com Cc: mst@redhat.com Cc: paulus@samba.org Cc: ralf@linux-mips.org Cc: schwidefsky@de.ibm.com Cc: yang.shi@windriver.com Link: http://lkml.kernel.org/r/1431359540-32227-2-git-send-email-dahi@linux.vnet.ibm.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
- 08 5月, 2015 2 次提交
-
-
由 Jason Low 提交于
While running a database workload, we found a scalability issue with itimers. Much of the problem was caused by the thread_group_cputimer spinlock. Each time we account for group system/user time, we need to obtain a thread_group_cputimer's spinlock to update the timers. On larger systems (such as a 16 socket machine), this caused more than 30% of total time spent trying to obtain this kernel lock to update these group timer stats. This patch converts the timers to 64-bit atomic variables and use atomic add to update them without a lock. With this patch, the percent of total time spent updating thread group cputimer timers was reduced from 30% down to less than 1%. Note: On 32-bit systems using the generic 64-bit atomics, this causes sample_group_cputimer() to take locks 3 times instead of just 1 time. However, we tested this patch on a 32-bit system ARM system using the generic atomics and did not find the overhead to be much of an issue. An explanation for why this isn't an issue is that 32-bit systems usually have small numbers of CPUs, and cacheline contention from extra spinlocks called periodically is not really apparent on smaller systems. Signed-off-by: NJason Low <jason.low2@hp.com> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Acked-by: NThomas Gleixner <tglx@linutronix.de> Acked-by: NRik van Riel <riel@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Aswin Chandramouleeswaran <aswin@hp.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Mike Galbraith <umgwanakikbuti@gmail.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Preeti U Murthy <preeti@linux.vnet.ibm.com> Cc: Scott J Norton <scott.norton@hp.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Waiman Long <Waiman.Long@hp.com> Link: http://lkml.kernel.org/r/1430251224-5764-4-git-send-email-jason.low2@hp.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Jason Low 提交于
ACCESS_ONCE doesn't work reliably on non-scalar types. This patch removes the rest of the existing usages of ACCESS_ONCE() in the scheduler, and use the new READ_ONCE() and WRITE_ONCE() APIs as appropriate. Signed-off-by: NJason Low <jason.low2@hp.com> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Acked-by: NThomas Gleixner <tglx@linutronix.de> Acked-by: NRik van Riel <riel@redhat.com> Acked-by: NWaiman Long <Waiman.Long@hp.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Aswin Chandramouleeswaran <aswin@hp.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Mike Galbraith <umgwanakikbuti@gmail.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Preeti U Murthy <preeti@linux.vnet.ibm.com> Cc: Scott J Norton <scott.norton@hp.com> Cc: Steven Rostedt <rostedt@goodmis.org> Link: http://lkml.kernel.org/r/1430251224-5764-2-git-send-email-jason.low2@hp.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-