- 09 9月, 2015 1 次提交
-
-
由 Konstantin Khlebnikov 提交于
This patchset makes pagemap useable again in the safe way (after row hammer bug it was made CAP_SYS_ADMIN-only). This patchset restores access for non-privileged users but hides PFNs from them. Also it adds bit 'map-exclusive' which is set if page is mapped only here: it helps in estimation of working set without exposing pfns and allows to distinguish CoWed and non-CoWed private anonymous pages. Second patch removes page-shift bits and completes migration to the new pagemap format: flags soft-dirty and mmap-exclusive are available only in the new format. This patch (of 5): This patch moves permission checks from pagemap_read() into pagemap_open(). Pointer to mm is saved in file->private_data. This reference pins only mm_struct itself. /proc/*/mem, maps, smaps already work in the same way. See http://lkml.kernel.org/r/CA+55aFyKpWrt_Ajzh1rzp_GcwZ4=6Y=kOv8hBz172CFJp6L8Tg@mail.gmail.comSigned-off-by: NKonstantin Khlebnikov <khlebnikov@yandex-team.ru> Reviewed-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com> Reviewed-by: NMark Williamson <mwilliamson@undo-software.com> Tested-by: NMark Williamson <mwilliamson@undo-software.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 05 9月, 2015 2 次提交
-
-
由 Andrea Arcangeli 提交于
These two flags gets set in vma->vm_flags to tell the VM common code if the userfaultfd is armed and in which mode (only tracking missing faults, only tracking wrprotect faults or both). If neither flags is set it means the userfaultfd is not armed on the vma. Signed-off-by: NAndrea Arcangeli <aarcange@redhat.com> Acked-by: NPavel Emelyanov <xemul@parallels.com> Cc: Sanidhya Kashyap <sanidhya.gatech@gmail.com> Cc: zhang.zhanghailiang@huawei.com Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Andres Lagar-Cavilla <andreslc@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Hugh Dickins <hughd@google.com> Cc: Peter Feiner <pfeiner@google.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: "Huangpeng (Peter)" <peter.huangpeng@huawei.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Andy Lutomirski 提交于
Credit where credit is due: this idea comes from Christoph Lameter with a lot of valuable input from Serge Hallyn. This patch is heavily based on Christoph's patch. ===== The status quo ===== On Linux, there are a number of capabilities defined by the kernel. To perform various privileged tasks, processes can wield capabilities that they hold. Each task has four capability masks: effective (pE), permitted (pP), inheritable (pI), and a bounding set (X). When the kernel checks for a capability, it checks pE. The other capability masks serve to modify what capabilities can be in pE. Any task can remove capabilities from pE, pP, or pI at any time. If a task has a capability in pP, it can add that capability to pE and/or pI. If a task has CAP_SETPCAP, then it can add any capability to pI, and it can remove capabilities from X. Tasks are not the only things that can have capabilities; files can also have capabilities. A file can have no capabilty information at all [1]. If a file has capability information, then it has a permitted mask (fP) and an inheritable mask (fI) as well as a single effective bit (fE) [2]. File capabilities modify the capabilities of tasks that execve(2) them. A task that successfully calls execve has its capabilities modified for the file ultimately being excecuted (i.e. the binary itself if that binary is ELF or for the interpreter if the binary is a script.) [3] In the capability evolution rules, for each mask Z, pZ represents the old value and pZ' represents the new value. The rules are: pP' = (X & fP) | (pI & fI) pI' = pI pE' = (fE ? pP' : 0) X is unchanged For setuid binaries, fP, fI, and fE are modified by a moderately complicated set of rules that emulate POSIX behavior. Similarly, if euid == 0 or ruid == 0, then fP, fI, and fE are modified differently (primary, fP and fI usually end up being the full set). For nonroot users executing binaries with neither setuid nor file caps, fI and fP are empty and fE is false. As an extra complication, if you execute a process as nonroot and fE is set, then the "secure exec" rules are in effect: AT_SECURE gets set, LD_PRELOAD doesn't work, etc. This is rather messy. We've learned that making any changes is dangerous, though: if a new kernel version allows an unprivileged program to change its security state in a way that persists cross execution of a setuid program or a program with file caps, this persistent state is surprisingly likely to allow setuid or file-capped programs to be exploited for privilege escalation. ===== The problem ===== Capability inheritance is basically useless. If you aren't root and you execute an ordinary binary, fI is zero, so your capabilities have no effect whatsoever on pP'. This means that you can't usefully execute a helper process or a shell command with elevated capabilities if you aren't root. On current kernels, you can sort of work around this by setting fI to the full set for most or all non-setuid executable files. This causes pP' = pI for nonroot, and inheritance works. No one does this because it's a PITA and it isn't even supported on most filesystems. If you try this, you'll discover that every nonroot program ends up with secure exec rules, breaking many things. This is a problem that has bitten many people who have tried to use capabilities for anything useful. ===== The proposed change ===== This patch adds a fifth capability mask called the ambient mask (pA). pA does what most people expect pI to do. pA obeys the invariant that no bit can ever be set in pA if it is not set in both pP and pI. Dropping a bit from pP or pI drops that bit from pA. This ensures that existing programs that try to drop capabilities still do so, with a complication. Because capability inheritance is so broken, setting KEEPCAPS, using setresuid to switch to nonroot uids, and then calling execve effectively drops capabilities. Therefore, setresuid from root to nonroot conditionally clears pA unless SECBIT_NO_SETUID_FIXUP is set. Processes that don't like this can re-add bits to pA afterwards. The capability evolution rules are changed: pA' = (file caps or setuid or setgid ? 0 : pA) pP' = (X & fP) | (pI & fI) | pA' pI' = pI pE' = (fE ? pP' : pA') X is unchanged If you are nonroot but you have a capability, you can add it to pA. If you do so, your children get that capability in pA, pP, and pE. For example, you can set pA = CAP_NET_BIND_SERVICE, and your children can automatically bind low-numbered ports. Hallelujah! Unprivileged users can create user namespaces, map themselves to a nonzero uid, and create both privileged (relative to their namespace) and unprivileged process trees. This is currently more or less impossible. Hallelujah! You cannot use pA to try to subvert a setuid, setgid, or file-capped program: if you execute any such program, pA gets cleared and the resulting evolution rules are unchanged by this patch. Users with nonzero pA are unlikely to unintentionally leak that capability. If they run programs that try to drop privileges, dropping privileges will still work. It's worth noting that the degree of paranoia in this patch could possibly be reduced without causing serious problems. Specifically, if we allowed pA to persist across executing non-pA-aware setuid binaries and across setresuid, then, naively, the only capabilities that could leak as a result would be the capabilities in pA, and any attacker *already* has those capabilities. This would make me nervous, though -- setuid binaries that tried to privilege-separate might fail to do so, and putting CAP_DAC_READ_SEARCH or CAP_DAC_OVERRIDE into pA could have unexpected side effects. (Whether these unexpected side effects would be exploitable is an open question.) I've therefore taken the more paranoid route. We can revisit this later. An alternative would be to require PR_SET_NO_NEW_PRIVS before setting ambient capabilities. I think that this would be annoying and would make granting otherwise unprivileged users minor ambient capabilities (CAP_NET_BIND_SERVICE or CAP_NET_RAW for example) much less useful than it is with this patch. ===== Footnotes ===== [1] Files that are missing the "security.capability" xattr or that have unrecognized values for that xattr end up with has_cap set to false. The code that does that appears to be complicated for no good reason. [2] The libcap capability mask parsers and formatters are dangerously misleading and the documentation is flat-out wrong. fE is *not* a mask; it's a single bit. This has probably confused every single person who has tried to use file capabilities. [3] Linux very confusingly processes both the script and the interpreter if applicable, for reasons that elude me. The results from thinking about a script's file capabilities and/or setuid bits are mostly discarded. Preliminary userspace code is here, but it needs updating: https://git.kernel.org/cgit/linux/kernel/git/luto/util-linux-playground.git/commit/?h=cap_ambient&id=7f5afbd175d2 Here is a test program that can be used to verify the functionality (from Christoph): /* * Test program for the ambient capabilities. This program spawns a shell * that allows running processes with a defined set of capabilities. * * (C) 2015 Christoph Lameter <cl@linux.com> * Released under: GPL v3 or later. * * * Compile using: * * gcc -o ambient_test ambient_test.o -lcap-ng * * This program must have the following capabilities to run properly: * Permissions for CAP_NET_RAW, CAP_NET_ADMIN, CAP_SYS_NICE * * A command to equip the binary with the right caps is: * * setcap cap_net_raw,cap_net_admin,cap_sys_nice+p ambient_test * * * To get a shell with additional caps that can be inherited by other processes: * * ./ambient_test /bin/bash * * * Verifying that it works: * * From the bash spawed by ambient_test run * * cat /proc/$$/status * * and have a look at the capabilities. */ #include <stdlib.h> #include <stdio.h> #include <errno.h> #include <cap-ng.h> #include <sys/prctl.h> #include <linux/capability.h> /* * Definitions from the kernel header files. These are going to be removed * when the /usr/include files have these defined. */ #define PR_CAP_AMBIENT 47 #define PR_CAP_AMBIENT_IS_SET 1 #define PR_CAP_AMBIENT_RAISE 2 #define PR_CAP_AMBIENT_LOWER 3 #define PR_CAP_AMBIENT_CLEAR_ALL 4 static void set_ambient_cap(int cap) { int rc; capng_get_caps_process(); rc = capng_update(CAPNG_ADD, CAPNG_INHERITABLE, cap); if (rc) { printf("Cannot add inheritable cap\n"); exit(2); } capng_apply(CAPNG_SELECT_CAPS); /* Note the two 0s at the end. Kernel checks for these */ if (prctl(PR_CAP_AMBIENT, PR_CAP_AMBIENT_RAISE, cap, 0, 0)) { perror("Cannot set cap"); exit(1); } } int main(int argc, char **argv) { int rc; set_ambient_cap(CAP_NET_RAW); set_ambient_cap(CAP_NET_ADMIN); set_ambient_cap(CAP_SYS_NICE); printf("Ambient_test forking shell\n"); if (execv(argv[1], argv + 1)) perror("Cannot exec"); return 0; } Signed-off-by: Christoph Lameter <cl@linux.com> # Original author Signed-off-by: NAndy Lutomirski <luto@kernel.org> Acked-by: NSerge E. Hallyn <serge.hallyn@ubuntu.com> Acked-by: NKees Cook <keescook@chromium.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Aaron Jones <aaronmdjones@gmail.com> Cc: Ted Ts'o <tytso@mit.edu> Cc: Andrew G. Morgan <morgan@kernel.org> Cc: Mimi Zohar <zohar@linux.vnet.ibm.com> Cc: Austin S Hemmelgarn <ahferroin7@gmail.com> Cc: Markku Savela <msa@moth.iki.fi> Cc: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: James Morris <james.l.morris@oracle.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 18 7月, 2015 4 次提交
-
-
由 Ingo Molnar 提交于
Don't burden architectures without dynamic task_struct sizing with the overhead of dynamic sizing. Also optimize the x86 code a bit by caching task_struct_size. Acked-and-Tested-by: NDave Hansen <dave.hansen@linux.intel.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Dave Hansen <dave@sr71.net> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1437128892-9831-3-git-send-email-mingo@kernel.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Dave Hansen 提交于
The FPU rewrite removed the dynamic allocations of 'struct fpu'. But, this potentially wastes massive amounts of memory (2k per task on systems that do not have AVX-512 for instance). Instead of having a separate slab, this patch just appends the space that we need to the 'task_struct' which we dynamically allocate already. This saves from doing an extra slab allocation at fork(). The only real downside here is that we have to stick everything and the end of the task_struct. But, I think the BUILD_BUG_ON()s I stuck in there should keep that from being too fragile. Signed-off-by: NDave Hansen <dave.hansen@linux.intel.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Dave Hansen <dave@sr71.net> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1437128892-9831-2-git-send-email-mingo@kernel.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Alexey Dobriyan 提交于
/proc/*/cmdline code checks if it should look at ENVP area by checking last byte of ARGV area: rv = access_remote_vm(mm, arg_end - 1, &c, 1, 0); if (rv <= 0) goto out_free_page; If ARGV is somehow made empty (by doing execve(..., NULL, ...) or manually setting ->arg_start and ->arg_end to equal values), the decision will be based on byte which doesn't even belong to ARGV/ENVP. So, quickly check if ARGV area is empty and report 0 to match previous behaviour. Signed-off-by: NAlexey Dobriyan <adobriyan@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Iago López Galeiras 提交于
The purpose of the option was documented in Documentation/filesystems/proc.txt but the help text was missing. Add small help text that also points to the documentation. Signed-off-by: NIago López Galeiras <iago@endocode.com> Reviewed-by: NJean Delvare <jdelvare@suse.de> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 10 7月, 2015 1 次提交
-
-
由 Eric W. Biederman 提交于
Today proc and sysfs do not contain any executable files. Several applications today mount proc or sysfs without noexec and nosuid and then depend on there being no exectuables files on proc or sysfs. Having any executable files show on proc or sysfs would cause a user space visible regression, and most likely security problems. Therefore commit to never allowing executables on proc and sysfs by adding a new flag to mark them as filesystems without executables and enforce that flag. Test the flag where MNT_NOEXEC is tested today, so that the only user visible effect will be that exectuables will be treated as if the execute bit is cleared. The filesystems proc and sysfs do not currently incoporate any executable files so this does not result in any user visible effects. This makes it unnecessary to vet changes to proc and sysfs tightly for adding exectuable files or changes to chattr that would modify existing files, as no matter what the individual file say they will not be treated as exectuable files by the vfs. Not having to vet changes to closely is important as without this we are only one proc_create call (or another goof up in the implementation of notify_change) from having problematic executables on proc. Those mistakes are all too easy to make and would create a situation where there are security issues or the assumptions of some program having to be broken (and cause userspace regressions). Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
-
- 04 7月, 2015 1 次提交
-
-
由 Naveen N. Rao 提交于
Expand /proc/pid/schedstat output: - enable it on CONFIG_TASK_DELAY_ACCT=y && !CONFIG_SCHEDSTATS kernels. - dump all zeroes on kernels that are booted with the 'nodelayacct' option, which boot option disables delay accounting on CONFIG_TASK_DELAY_ACCT=y kernels. Signed-off-by: NNaveen N. Rao <naveen.n.rao@linux.vnet.ibm.com> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: a.p.zijlstra@chello.nl Cc: ricklind@us.ibm.com Link: http://lkml.kernel.org/r/5ccbef17d4bc841084ea6e6421d4e4a23b7b806f.1435654789.git.naveen.n.rao@linux.vnet.ibm.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
- 01 7月, 2015 2 次提交
-
-
由 Eric W. Biederman 提交于
Add a new function proc_create_mount_point that when used to creates a directory that can not be added to. Add a new function is_empty_pde to test if a function is a mount point. Update the code to use make_empty_dir_inode when reporting a permanently empty directory to the vfs. Update the code to not allow adding to permanently empty directories. Update /proc/openprom and /proc/fs/nfsd to be permanently empty directories. Cc: stable@vger.kernel.org Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
-
由 Eric W. Biederman 提交于
Add a magic sysctl table sysctl_mount_point that when used to create a directory forces that directory to be permanently empty. Update the code to use make_empty_dir_inode when accessing permanently empty directories. Update the code to not allow adding to permanently empty directories. Update /proc/sys/fs/binfmt_misc to be a permanently empty directory. Cc: stable@vger.kernel.org Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
-
- 26 6月, 2015 2 次提交
-
-
由 Iago López Galeiras 提交于
Commit 81841161 ("fs, proc: introduce /proc/<pid>/task/<tid>/children entry") introduced the children entry for checkpoint restore and the file is only available on kernels configured with CONFIG_EXPERT and CONFIG_CHECKPOINT_RESTORE. This is available in most distributions (Fedora, Debian, Ubuntu, CoreOS) because they usually enable CONFIG_EXPERT and CONFIG_CHECKPOINT_RESTORE. But Arch does not enable CONFIG_EXPERT or CONFIG_CHECKPOINT_RESTORE. However, the children proc file is useful outside of checkpoint restore. I would like to use it in rkt. The rkt process exec() another program it does not control, and that other program will fork()+exec() a child process. I would like to find the pid of the child process from an external tool without iterating in /proc over all processes to find which one has a parent pid equal to rkt. This commit introduces CONFIG_PROC_CHILDREN and makes CONFIG_CHECKPOINT_RESTORE select it. This allows enabling /proc/<pid>/task/<tid>/children without needing to enable CONFIG_CHECKPOINT_RESTORE and CONFIG_EXPERT. Alban tested that /proc/<pid>/task/<tid>/children is present when the kernel is configured with CONFIG_PROC_CHILDREN=y but without CONFIG_CHECKPOINT_RESTORE Signed-off-by: NIago López Galeiras <iago@endocode.com> Tested-by: NAlban Crequy <alban@endocode.com> Reviewed-by: NCyrill Gorcunov <gorcunov@openvz.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Kees Cook <keescook@chromium.org> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: Serge Hallyn <serge.hallyn@canonical.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Djalal Harouni <djalal@endocode.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Alexey Dobriyan 提交于
/proc/$PID/cmdline truncates output at PAGE_SIZE. It is easy to see with $ cat /proc/self/cmdline $(seq 1037) 2>/dev/null However, command line size was never limited to PAGE_SIZE but to 128 KB and relatively recently limitation was removed altogether. People noticed and ask questions: http://stackoverflow.com/questions/199130/how-do-i-increase-the-proc-pid-cmdline-4096-byte-limit seq file interface is not OK, because it kmalloc's for whole output and open + read(, 1) + sleep will pin arbitrary amounts of kernel memory. To not do that, limit must be imposed which is incompatible with arbitrary sized command lines. I apologize for hairy code, but this it direct consequence of command line layout in memory and hacks to support things like "init [3]". The loops are "unrolled" otherwise it is either macros which hide control flow or functions with 7-8 arguments with equal line count. There should be real setproctitle(2) or something. [akpm@linux-foundation.org: fix a billion min() warnings] Signed-off-by: NAlexey Dobriyan <adobriyan@gmail.com> Tested-by: NJarod Wilson <jarod@redhat.com> Acked-by: NJarod Wilson <jarod@redhat.com> Cc: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Jan Stancek <jstancek@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 25 6月, 2015 1 次提交
-
-
由 Chris Metcalf 提交于
Allowing watchdog threads to be parked means that we now have the opportunity of actually seeing persistent parked threads in the output of /proc/<pid>/stat and /proc/<pid>/status. The existing code reported such threads as "Running", which is kind-of true if you think of the case where we park them as part of taking cpus offline. But if we allow parking them indefinitely, "Running" is pretty misleading, so we report them as "Sleeping" instead. We could simply report them with a new string, "Parked", but it feels like it's a bit risky for userspace to see unexpected new values; the output is already documented in Documentation/filesystems/proc.txt, and it seems like a mistake to change that lightly. The scheduler does report parked tasks with a "P" in debugging output from sched_show_task() or dump_cpu_task(), but that's a different API. Similarly, the trace_ctxwake_* routines report a "P" for parked tasks, but again, different API. This change seemed slightly cleaner than updating the task_state_array to have additional rows. TASK_DEAD should be subsumed by the exit_state bits; TASK_WAKEKILL is just a modifier; and TASK_WAKING can very reasonably be reported as "Running" (as it is now). Only TASK_PARKED shows up with unreasonable output here. Signed-off-by: NChris Metcalf <cmetcalf@ezchip.com> Cc: Don Zickus <dzickus@redhat.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Ulrich Obergfell <uobergfe@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 24 6月, 2015 1 次提交
-
-
由 Miklos Szeredi 提交于
Turn seq_path(..., &file->f_path, ...); into seq_file_path(..., file, ...); Signed-off-by: NMiklos Szeredi <mszeredi@suse.cz> Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
- 14 5月, 2015 1 次提交
-
-
由 Eric W. Biederman 提交于
Fresh mounts of proc and sysfs are a very special case that works very much like a bind mount. Unfortunately the current structure can not preserve the MNT_LOCK... mount flags. Therefore refactor the logic into a form that can be modified to preserve those lock bits. Add a new filesystem flag FS_USERNS_VISIBLE that requires some mount of the filesystem be fully visible in the current mount namespace, before the filesystem may be mounted. Move the logic for calling fs_fully_visible from proc and sysfs into fs/namespace.c where it has greater access to mount namespace state. Cc: stable@vger.kernel.org Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
-
- 11 5月, 2015 3 次提交
-
-
由 Al Viro 提交于
only one instance looks at that argument at all; that sole exception wants inode rather than dentry. Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
由 Al Viro 提交于
its only use is getting passed to nd_jump_link(), which can obtain it from current->nameidata Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
由 Al Viro 提交于
a) instead of storing the symlink body (via nd_set_link()) and returning an opaque pointer later passed to ->put_link(), ->follow_link() _stores_ that opaque pointer (into void * passed by address by caller) and returns the symlink body. Returning ERR_PTR() on error, NULL on jump (procfs magic symlinks) and pointer to symlink body for normal symlinks. Stored pointer is ignored in all cases except the last one. Storing NULL for opaque pointer (or not storing it at all) means no call of ->put_link(). b) the body used to be passed to ->put_link() implicitly (via nameidata). Now only the opaque pointer is. In the cases when we used the symlink body to free stuff, ->follow_link() now should store it as opaque pointer in addition to returning it. Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
- 17 4月, 2015 1 次提交
-
-
由 Andrey Vagin 提交于
Let's show locks which are associated with a file descriptor in its fdinfo file. Currently we don't have a reliable way to determine who holds a lock. We can find some information in /proc/locks, but PID which is reported there can be wrong. For example, a process takes a lock, then forks a child and dies. In this case /proc/locks contains the parent pid, which can be reused by another process. $ cat /proc/locks ... 6: FLOCK ADVISORY WRITE 324 00:13:13431 0 EOF ... $ ps -C rpcbind PID TTY TIME CMD 332 ? 00:00:00 rpcbind $ cat /proc/332/fdinfo/4 pos: 0 flags: 0100000 mnt_id: 22 lock: 1: FLOCK ADVISORY WRITE 324 00:13:13431 0 EOF $ ls -l /proc/332/fd/4 lr-x------ 1 root root 64 Mar 5 14:43 /proc/332/fd/4 -> /run/rpcbind.lock $ ls -l /proc/324/fd/ total 0 lrwx------ 1 root root 64 Feb 27 14:50 0 -> /dev/pts/0 lrwx------ 1 root root 64 Feb 27 14:50 1 -> /dev/pts/0 lrwx------ 1 root root 64 Feb 27 14:49 2 -> /dev/pts/0 You can see that the process with the 324 pid doesn't hold the lock. This information is required for proper dumping and restoring file locks. Signed-off-by: NAndrey Vagin <avagin@openvz.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Acked-by: NJeff Layton <jlayton@poochiereds.net> Acked-by: N"J. Bruce Fields" <bfields@fieldses.org> Acked-by: NCyrill Gorcunov <gorcunov@openvz.org> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: Joe Perches <joe@perches.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 16 4月, 2015 4 次提交
-
-
由 Joe Perches 提交于
The seq_printf return value, because it's frequently misused, will eventually be converted to void. See: commit 1f33c41c ("seq_file: Rename seq_overflow() to seq_has_overflowed() and make public") Signed-off-by: NJoe Perches <joe@perches.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Rasmus Villemoes 提交于
The current semantics of string_escape_mem are inadequate for one of its current users, vsnprintf(). If that is to honour its contract, it must know how much space would be needed for the entire escaped buffer, and string_escape_mem provides no way of obtaining that (short of allocating a large enough buffer (~4 times input string) to let it play with, and that's definitely a big no-no inside vsnprintf). So change the semantics for string_escape_mem to be more snprintf-like: Return the size of the output that would be generated if the destination buffer was big enough, but of course still only write to the part of dst it is allowed to, and (contrary to snprintf) don't do '\0'-termination. It is then up to the caller to detect whether output was truncated and to append a '\0' if desired. Also, we must output partial escape sequences, otherwise a call such as snprintf(buf, 3, "%1pE", "\123") would cause printf to write a \0 to buf[2] but leaving buf[0] and buf[1] with whatever they previously contained. This also fixes a bug in the escaped_string() helper function, which used to unconditionally pass a length of "end-buf" to string_escape_mem(); since the latter doesn't check osz for being insanely large, it would happily write to dst. For example, kasprintf(GFP_KERNEL, "something and then %pE", ...); is an easy way to trigger an oops. In test-string_helpers.c, the -ENOMEM test is replaced with testing for getting the expected return value even if the buffer is too small. We also ensure that nothing is written (by relying on a NULL pointer deref) if the output size is 0 by passing NULL - this has to work for kasprintf("%pE") to work. In net/sunrpc/cache.c, I think qword_add still has the same semantics. Someone should definitely double-check this. In fs/proc/array.c, I made the minimum possible change, but longer-term it should stop poking around in seq_file internals. [andriy.shevchenko@linux.intel.com: simplify qword_add] [andriy.shevchenko@linux.intel.com: add missed curly braces] Signed-off-by: NRasmus Villemoes <linux@rasmusvillemoes.dk> Acked-by: NAndy Shevchenko <andriy.shevchenko@linux.intel.com> Signed-off-by: NAndy Shevchenko <andriy.shevchenko@linux.intel.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Chen Hanxiao 提交于
If some issues occurred inside a container guest, host user could not know which process is in trouble just by guest pid: the users of container guest only knew the pid inside containers. This will bring obstacle for trouble shooting. This patch adds four fields: NStgid, NSpid, NSpgid and NSsid: a) In init_pid_ns, nothing changed; b) In one pidns, will tell the pid inside containers: NStgid: 21776 5 1 NSpid: 21776 5 1 NSpgid: 21776 5 1 NSsid: 21729 1 0 ** Process id is 21776 in level 0, 5 in level 1, 1 in level 2. c) If pidns is nested, it depends on which pidns are you in. NStgid: 5 1 NSpid: 5 1 NSpgid: 5 1 NSsid: 1 0 ** Views from level 1 [akpm@linux-foundation.org: add CONFIG_PID_NS ifdef] Signed-off-by: NChen Hanxiao <chenhanxiao@cn.fujitsu.com> Acked-by: NSerge Hallyn <serge.hallyn@canonical.com> Acked-by: N"Eric W. Biederman" <ebiederm@xmission.com> Tested-by: NSerge Hallyn <serge.hallyn@canonical.com> Tested-by: NNathan Scott <nathans@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 David Howells 提交于
that's the bulk of filesystem drivers dealing with inodes of their own Signed-off-by: NDavid Howells <dhowells@redhat.com> Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
- 18 3月, 2015 1 次提交
-
-
由 Kirill A. Shutemov 提交于
As pointed by recent post[1] on exploiting DRAM physical imperfection, /proc/PID/pagemap exposes sensitive information which can be used to do attacks. This disallows anybody without CAP_SYS_ADMIN to read the pagemap. [1] http://googleprojectzero.blogspot.com/2015/03/exploiting-dram-rowhammer-bug-to-gain.html [ Eventually we might want to do anything more finegrained, but for now this is the simple model. - Linus ] Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: NKonstantin Khlebnikov <khlebnikov@openvz.org> Acked-by: NAndy Lutomirski <luto@amacapital.net> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Mark Seaborn <mseaborn@chromium.org> Cc: stable@vger.kernel.org Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 23 2月, 2015 1 次提交
-
-
由 Al Viro 提交于
use_pde()/unuse_pde() in ->follow_link()/->put_link() resp. Cc: stable@vger.kernel.org Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
- 18 2月, 2015 1 次提交
-
-
由 WANG Chao 提交于
When updating PT_NOTE header size (ie. p_memsz), an overflow issue happens with the following bogus note entry: n_namesz = 0xFFFFFFFF n_descsz = 0x0 n_type = 0x0 This kind of note entry should be dropped during updating p_memsz. But because n_namesz is 32bit, after (n_namesz + 3) & (~3), it's overflow to 0x0, the note entry size looks sane and reserved. When userspace (eg. crash utility) is trying to access such bogus note, it could lead to an unexpected behavior (eg. crash utility segment fault because it's reading bogus address). The source of bogus note hasn't been identified yet. At least we could drop the bogus note so user space wouldn't be surprised. Signed-off-by: NWANG Chao <chaowang@redhat.com> Cc: Dave Anderson <anderson@redhat.com> Cc: Baoquan He <bhe@redhat.com> Cc: Randy Wright <rwright@hp.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Paul Gortmaker <paul.gortmaker@windriver.com> Cc: Fabian Frederick <fabf@skynet.be> Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: Rashika Kheria <rashika.kheria@gmail.com> Cc: Greg Pearson <greg.pearson@hp.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 14 2月, 2015 1 次提交
-
-
由 Tejun Heo 提交于
printk and friends can now format bitmaps using '%*pb[l]'. cpumask and nodemask also provide cpumask_pr_args() and nodemask_pr_args() respectively which can be used to generate the two printf arguments necessary to format the specified cpu/nodemask. Signed-off-by: NTejun Heo <tj@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 13 2月, 2015 4 次提交
-
-
由 Andy Shevchenko 提交于
Instead of custom approach let's use string_escape_str() to escape a given string (task_name in this case). Signed-off-by: NAndy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Rafael Aquini 提交于
The output of /proc/$pid/numa_maps is in terms of number of pages like anon=22 or dirty=54. Here's some output: 7f4680000000 default file=/hugetlb/bigfile anon=50 dirty=50 N0=50 7f7659600000 default file=/anon_hugepage\040(deleted) anon=50 dirty=50 N0=50 7fff8d425000 default stack anon=50 dirty=50 N0=50 Looks like we have a stack and a couple of anonymous hugetlbfs areas page which both use the same amount of memory. They don't. The 'bigfile' uses 1GB pages and takes up ~50GB of space. The anon_hugepage uses 2MB pages and takes up ~100MB of space while the stack uses normal 4k pages. You can go over to smaps to figure out what the page size _really_ is with KernelPageSize or MMUPageSize. But, I think this is a pretty nasty and counterintuitive interface as it stands. This patch introduces 'kernelpagesize_kB' line element to /proc/<pid>/numa_maps report file in order to help identifying the size of pages that are backing memory areas mapped by a given task. This is specially useful to help differentiating between HUGE and GIGANTIC page backed VMAs. This patch is based on Dave Hansen's proposal and reviewer's follow-ups taken from the following dicussion threads: * https://lkml.org/lkml/2011/9/21/454 * https://lkml.org/lkml/2014/12/20/66Signed-off-by: NRafael Aquini <aquini@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Dave Hansen <dave.hansen@intel.com> Acked-by: NDavid Rientjes <rientjes@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Alexander Kuleshov 提交于
Use the PDE() helper to get proc_dir_entry instead of coding it directly. Signed-off-by: NAlexander Kuleshov <kuleshovmail@gmail.com> Acked-by: NNicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Petr Cermak 提交于
Peak resident size of a process can be reset back to the process's current rss value by writing "5" to /proc/pid/clear_refs. The driving use-case for this would be getting the peak RSS value, which can be retrieved from the VmHWM field in /proc/pid/status, per benchmark iteration or test scenario. [akpm@linux-foundation.org: clarify behaviour in documentation] Signed-off-by: NPetr Cermak <petrcermak@chromium.org> Cc: Bjorn Helgaas <bhelgaas@google.com> Cc: Primiano Tucci <primiano@chromium.org> Cc: Petr Cermak <petrcermak@chromium.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 12 2月, 2015 8 次提交
-
-
由 Kirill A. Shutemov 提交于
Currently pagewalker splits all THP pages on any clear_refs request. It's not necessary. We can handle this on PMD level. One side effect is that soft dirty will potentially see more dirty memory, since we will mark whole THP page dirty at once. Sanity checked with CRIU test suite. More testing is required. Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com> Reviewed-by: NCyrill Gorcunov <gorcunov@openvz.org> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Naoya Horiguchi 提交于
walk_page_range() silently skips vma having VM_PFNMAP set, which leads to undesirable behaviour at client end (who called walk_page_range). For example for pagemap_read(), when no callbacks are called against VM_PFNMAP vma, pagemap_read() may prepare pagemap data for next virtual address range at wrong index. That could confuse and/or break userspace applications. This patch avoid this misbehavior caused by vma(VM_PFNMAP) like follows: - for pagemap_read() which has its own ->pte_hole(), call the ->pte_hole() over vma(VM_PFNMAP), - for clear_refs and queue_pages which have their own ->tests_walk, just return 1 and skip vma(VM_PFNMAP). This is no problem because these are not interested in hole regions, - for other callers, just skip the vma(VM_PFNMAP) as a default behavior. Signed-off-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: NShiraz Hashim <shashim@codeaurora.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Naoya Horiguchi 提交于
pagewalk.c can handle vma in itself, so we don't have to pass vma via walk->private. And show_numa_map() walks pages on vma basis, so using walk_page_vma() is preferable. Signed-off-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com> Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Naoya Horiguchi 提交于
Just doing s/gather_hugetbl_stats/gather_hugetlb_stats/g, this makes code grep-friendly. Signed-off-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com> Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Naoya Horiguchi 提交于
Page table walker has the information of the current vma in mm_walk, so we don't have to call find_vma() in each pagemap_(pte|hugetlb)_range() call any longer. Currently pagemap_pte_range() does vma loop itself, so this patch reduces many lines of code. NULL-vma check is omitted because we assume that we never run these callbacks on any address outside vma. And even if it were broken, NULL pointer dereference would be detected, so we can get enough information for debugging. Signed-off-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Naoya Horiguchi 提交于
clear_refs_write() has some prechecks to determine if we really walk over a given vma. Now we have a test_walk() callback to filter vmas, so let's utilize it. Signed-off-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com> Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Naoya Horiguchi 提交于
pagewalk.c can handle vma in itself, so we don't have to pass vma via walk->private. And show_smap() walks pages on vma basis, so using walk_page_vma() is preferable. Signed-off-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com> Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Konstantin Khlebnikov 提交于
Lockless access to pte in pagemap_pte_range() might race with page migration and trigger BUG_ON(!PageLocked()) in migration_entry_to_page(): CPU A (pagemap) CPU B (migration) lock_page() try_to_unmap(page, TTU_MIGRATION...) make_migration_entry() set_pte_at() <read *pte> pte_to_pagemap_entry() remove_migration_ptes() unlock_page() if(is_migration_entry()) migration_entry_to_page() BUG_ON(!PageLocked(page)) Also lockless read might be non-atomic if pte is larger than wordsize. Other pte walkers (smaps, numa_maps, clear_refs) already lock ptes. Fixes: 052fb0d6 ("proc: report file/anon bit in /proc/pid/pagemap") Signed-off-by: NKonstantin Khlebnikov <khlebnikov@yandex-team.ru> Reported-by: NAndrey Ryabinin <a.ryabinin@samsung.com> Reviewed-by: NCyrill Gorcunov <gorcunov@openvz.org> Acked-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com> Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: <stable@vger.kernel.org> [3.5+] Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-