- 02 May, 2018 (2 commits)
-
-
Committed by Michael S. Tsirkin
This reverts commit 93c0d549c4c5a7382ad70de6b86610b7aae57406.

Unfortunately the padding will break 32-bit userspace. Ouch. Need to add some compat code; revert for now.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
Committed by Michael S. Tsirkin
There's a 32-bit hole just after type. It's best to give it a name; this way the compiler is forced to initialize it with the rest of the structure.

Reported-by: Kevin Easton <kevin@guarana.org>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
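The hole described in this message can be sketched as follows. This is only an illustration: the struct names `msg_implicit` and `msg_named`, their fields, and the helper `make_msg` are hypothetical and are not the structure this commit touches.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout: on ABIs where uint64_t is 8-byte aligned, the
 * compiler silently inserts 4 bytes of padding after `type`. */
struct msg_implicit {
    uint32_t type;
    uint64_t payload;
};

/* Same fields with the hole given a name, so initializers cover it. */
struct msg_named {
    uint32_t type;
    uint32_t padding;
    uint64_t payload;
};

/* With explicit padding the layout no longer depends on the ABI. */
static_assert(offsetof(struct msg_named, payload) == 8,
              "payload sits right after the named padding");

/* Builds a message; members omitted from a designated initializer
 * (here `padding`) are zero-initialized. */
struct msg_named make_msg(uint32_t type, uint64_t payload)
{
    struct msg_named m = { .type = type, .payload = payload };
    return m;
}
```

On i386, `uint64_t` is only 4-byte aligned inside a struct, so `msg_implicit` would be 12 bytes there while `msg_named` is 16. A size change of that kind is a plausible reason for the 32-bit breakage mentioned in the revert, though the commit message does not spell it out.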
-
- 28 April, 2018 (1 commit)
-
-
Committed by KarimAllah Ahmed
Move DISABLE_EXITS KVM capability bits to the UAPI, just like the rest of the capabilities.

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: x86@kernel.org
Cc: kvm@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: KarimAllah Ahmed <karahmed@amazon.de>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
-
- 27 April, 2018 (1 commit)
-
-
Committed by Jozsef Kadlecsik
Dominique Martinet reported a TCP hang problem when simultaneous open was used. The problem is that the tcp_conntracks state table is not smart enough to handle the case. The state table could be fixed by introducing a new state, but that would require more lines of code compared to this patch, due to the required backward compatibility with ctnetlink.

Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Reported-by: Dominique Martinet <asmadeus@codewreck.org>
Tested-by: Dominique Martinet <asmadeus@codewreck.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
-
- 26 April, 2018 (1 commit)
-
-
Committed by Thomas Gleixner
Revert commits

92af4dcb ("tracing: Unify the "boot" and "mono" tracing clocks")
127bfa5f ("hrtimer: Unify MONOTONIC and BOOTTIME clock behavior")
7250a404 ("posix-timers: Unify MONOTONIC and BOOTTIME clock behavior")
d6c7270e ("timekeeping: Remove boot time specific code")
f2d6fdbf ("Input: Evdev - unify MONOTONIC and BOOTTIME clock behavior")
d6ed449a ("timekeeping: Make the MONOTONIC clock behave like the BOOTTIME clock")
72199320 ("timekeeping: Add the new CLOCK_MONOTONIC_ACTIVE clock")

As stated in the pull request for the unification of CLOCK_MONOTONIC and CLOCK_BOOTTIME, it was clear that we might have to revert the change.

As reported by several folks, systemd and other applications rely on the documented behaviour of CLOCK_MONOTONIC on Linux and break with the above changes. After resume, daemons time out and other timeout-related issues are observed. Rafael compiled this list:

* systemd kills daemons on resume, after >WatchdogSec seconds of suspending (Genki Sky). [Verified that that's because systemd uses CLOCK_MONOTONIC and expects it to not include the suspend time.]
* systemd-journald misbehaves after resume: systemd-journald[7266]: File /var/log/journal/016627c3c4784cd4812d4b7e96a34226/system.journal corrupted or uncleanly shut down, renaming and replacing. (Mike Galbraith).
* NetworkManager reports "networking disabled" and networking is broken after resume 50% of the time (Pavel). [May be because of systemd.]
* MATE desktop dims the display and starts the screensaver right after system resume (Pavel).
* Full system hang during resume (me). [May be due to systemd or NM or both.]

That happens on Debian and openSUSE systems.

It's sad that these problems were caught neither in -next nor by those folks who expressed interest in this change.

Reported-by: Rafael J. Wysocki <rjw@rjwysocki.net>
Reported-by: Genki Sky <sky@genki.is>
Reported-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Kevin Easton <kevin@guarana.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mark Salyzyn <salyzyn@android.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
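The difference that userspace relies on can be observed with a small sketch like the one below. It assumes a Linux system where `CLOCK_BOOTTIME` is available; the helper names `clock_ns` and `suspended_ns` are made up for this illustration.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Reads clock `id` into *out as nanoseconds; returns 0 on success. */
int clock_ns(clockid_t id, int64_t *out)
{
    struct timespec ts;

    if (clock_gettime(id, &ts) != 0)
        return -1;
    *out = (int64_t)ts.tv_sec * 1000000000LL + ts.tv_nsec;
    return 0;
}

/* Stores the difference between CLOCK_BOOTTIME and CLOCK_MONOTONIC.
 * With the documented behaviour restored by this revert, BOOTTIME keeps
 * counting across suspend and MONOTONIC does not, so the difference is
 * roughly the time the system spent suspended. */
int suspended_ns(int64_t *out)
{
    int64_t mono, boot;

    if (clock_ns(CLOCK_MONOTONIC, &mono) != 0 ||
        clock_ns(CLOCK_BOOTTIME, &boot) != 0)
        return -1;
    *out = boot - mono;
    return 0;
}
```

A watchdog that measures timeouts with `CLOCK_MONOTONIC`, as the systemd entry in the list above does, would see a large jump after resume if the two clocks were unified, which matches the reported daemon kills.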
-
- 25 April, 2018 (1 commit)
-
-
Committed by Michael S. Tsirkin
Jason Wang points out that it's very hard for users to build an array of stat names. The naive thing is to use VIRTIO_BALLOON_S_NR but that breaks if we add more stats - as done e.g. recently by commit 6c64fe7f ("virtio_balloon: export hugetlb page allocation counts").

Let's add an array of reasonably readable names.

Fixes: 6c64fe7f ("virtio_balloon: export hugetlb page allocation counts")
Cc: Jason Wang <jasowang@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Jonathan Helman <jonathan.helman@oracle.com>
-
- 23 April, 2018 (1 commit)
-
-
Committed by Jason Gunthorpe
Based on discussion with Kate Stewart, this license is not a BSD-2-Clause, but is now formally identified as Linux-OpenIB by SPDX. The key difference between the licenses is in the 'warranty' paragraph.

if_infiniband.h refers to the 'OpenIB.org' license, but does not include the text; instead it links to an obsolete web site that contains a license that matches the BSD-2-Clause SPDX. There is no 'three clause' version of the OpenIB.org license.

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Doug Ledford <dledford@redhat.com>
-
- 19 April, 2018 (1 commit)
-
-
Committed by Johannes Berg
There's currently no limit on wiphy names, other than netlink message size and memory limitations, but that causes issues when, for example, the wiphy name is used in a uevent, e.g. in rfkill where we use the same name for the rfkill instance, and then the buffer there is "only" 2k for the environment variables.

This was reported by syzkaller, which used a 4k name.

Limit the name to something reasonable; I randomly picked 128.

Reported-by: syzbot+230d9e642a85d3fec29c@syzkaller.appspotmail.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
-
- 17 April, 2018 (1 commit)
-
-
Committed by Alexey Budankov
Store the preempting context switch out event into the Perf trace as a part of the PERF_RECORD_SWITCH[_CPU_WIDE] record.

The percentage of preempting and non-preempting context switches helps in understanding the nature of the workloads (CPU or IO bound) that are running on a machine.

The event is treated as a preemption when the task->state value of the thread being switched out is TASK_RUNNING. Event type encoding is implemented using the PERF_RECORD_MISC_SWITCH_OUT_PREEMPT bit.

Signed-off-by: Alexey Budankov <alexey.budankov@linux.intel.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Link: http://lkml.kernel.org/r/9ff84e83-a0ca-dd82-a6d0-cb951689be74@linux.intel.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
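A consumer of these records might compute the preemption percentage roughly as sketched below. The `DEMO_*` constants are local stand-ins whose bit positions are an assumption of this example; real code should take `PERF_RECORD_MISC_SWITCH_OUT` and `PERF_RECORD_MISC_SWITCH_OUT_PREEMPT` from `<linux/perf_event.h>`. The function name `preempt_percentage` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative stand-ins for the perf_event.h misc bits. The numeric
 * values are an assumption of this sketch, not taken from the commit. */
#define DEMO_MISC_SWITCH_OUT         (1u << 13)
#define DEMO_MISC_SWITCH_OUT_PREEMPT (1u << 14)

/* Given the `misc` field of a series of context switch records, returns
 * the percentage of switch-out records that were preemptions. */
double preempt_percentage(const uint16_t *misc, size_t n)
{
    size_t out = 0, preempt = 0;

    for (size_t i = 0; i < n; i++) {
        if (!(misc[i] & DEMO_MISC_SWITCH_OUT))
            continue; /* switch-in record: not counted */
        out++;
        if (misc[i] & DEMO_MISC_SWITCH_OUT_PREEMPT)
            preempt++;
    }
    return out ? 100.0 * (double)preempt / (double)out : 0.0;
}
```

Following the reasoning in the commit message, a high percentage would suggest a CPU-bound workload (threads are switched out while still runnable), and a low one an IO-bound workload (threads block voluntarily).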
-
- 16 April, 2018 (1 commit)
-
-
Committed by Greg Kroah-Hartman
There were some documentation locations where irda was mentioned, as well as an old MAINTAINERS entry and the networking sysctl entries. Clean these all out as this stuff really is finally gone.

Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
- 14 April, 2018 (1 commit)
-
-
Committed by Theodore Ts'o
Add a new ioctl which forces the crng to be reseeded.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
-
- 12 April, 2018 (8 commits)
-
-
Committed by Masahiro Yamada
Minor cleanups made possible by _UL and _ULL.

Link: http://lkml.kernel.org/r/1519301715-31798-5-git-send-email-yamada.masahiro@socionext.com
Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
Cc: Russell King <rmk+kernel@armlinux.org.uk>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Committed by Masahiro Yamada
ARM, ARM64 and UniCore32 duplicate the definition of UL():

  #define UL(x) _AC(x, UL)

This is not actually arch-specific, so it will be useful to move it to a common header. Currently, we only have the uapi variant for linux/const.h, so I am creating include/linux/const.h.

I also added _UL(), _ULL() and ULL() because _AC() is mostly used in the form of either _AC(..., UL) or _AC(..., ULL). I expect they will be replaced in follow-up cleanups. The underscore-prefixed ones should be used for exported headers.

Link: http://lkml.kernel.org/r/1519301715-31798-4-git-send-email-yamada.masahiro@socionext.com
Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Acked-by: Guan Xuetao <gxt@mprc.pku.edu.cn>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Acked-by: Russell King <rmk+kernel@armlinux.org.uk>
Cc: David Howells <dhowells@redhat.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
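How such suffix-pasting macros behave can be shown with a standalone sketch. The `DEMO_` prefix is used on purpose so that nothing here clashes with real kernel or libc headers; the definitions mirror the usual `_AC()` pattern but are written for this example, and `demo_high_bit` is a hypothetical helper.

```c
#include <assert.h>

/* In assembly the type suffix must be dropped; in C it is pasted onto
 * the literal, so DEMO_AC(1, UL) becomes (1UL). */
#ifdef __ASSEMBLY__
#define DEMO_AC(x, y)        x
#else
#define DEMO_AC_PASTE(x, y)  (x##y)
#define DEMO_AC(x, y)        DEMO_AC_PASTE(x, y)
#endif

/* Shorthands in the spirit of the _UL() and _ULL() helpers above. */
#define DEMO_UL(x)   DEMO_AC(x, UL)
#define DEMO_ULL(x)  DEMO_AC(x, ULL)

/* Shifting a plain int constant by 40 would overflow; the ULL suffix
 * makes the constant wide enough. */
unsigned long long demo_high_bit(void)
{
    return DEMO_ULL(1) << 40;
}
```

The two-level macro is the standard trick for letting arguments expand before `##` pastes them, which is why a single header can be shared between C and assembly sources.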
-
Committed by Masahiro Yamada
Patch series "linux/const.h: cleanups of macros such as UL(), _BITUL(), BIT() etc", v3.

ARM, ARM64 and UniCore32 define UL() as a shorthand for _AC(..., UL). More architectures may introduce it in the future.

UL() is arch-agnostic and useful, so let's move it to include/linux/const.h.

Currently, <asm/memory.h> must be included to use UL(). It pulls in more bloat just for defining some bit macros.

I posted V2 one year ago. The previous posts are:

https://patchwork.kernel.org/patch/9498273/
https://patchwork.kernel.org/patch/9498275/
https://patchwork.kernel.org/patch/9498269/
https://patchwork.kernel.org/patch/9498271/

At that time, what blocked this series was a comment from David Howells:

  You need to be very careful doing this. Some userspace stuff depends on the guard macro names on the kernel header files.

(https://patchwork.kernel.org/patch/9498275/)

Looking at the code more closely, I noticed this is not a problem. See the following line:

https://github.com/torvalds/linux/blob/v4.16-rc2/scripts/headers_install.sh#L40

scripts/headers_install.sh strips the _UAPI prefix from guard macro names.

I ran "make headers_install" and confirmed the result is what I expect.

So, we can prefix the include guard of include/uapi/linux/const.h, and add a new include/linux/const.h.

This patch (of 4):

I am going to add include/linux/const.h for the kernel space. Add _UAPI to the include guard of include/uapi/linux/const.h to prepare for that.

Please note that the guard name of the exported header will be kept as-is. So, this commit has no impact on userspace even if some userspace stuff depends on the guard macro names.

scripts/headers_install.sh processes exported headers with sed, and strips "_UAPI" from guard macro names.

  #ifndef _UAPI_LINUX_CONST_H
  #define _UAPI_LINUX_CONST_H

will be turned into

  #ifndef _LINUX_CONST_H
  #define _LINUX_CONST_H

Link: http://lkml.kernel.org/r/1519301715-31798-2-git-send-email-yamada.masahiro@socionext.com
Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Russell King <rmk+kernel@armlinux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Committed by Michal Hocko
Both load_elf_interp and load_elf_binary rely on elf_map to map segments at a controlled address, and they use MAP_FIXED to enforce that. This is, however, a dangerous thing, prone to silent data corruption which can even be exploitable.

Let's take CVE-2017-1000253 as an example. At the time (before commit eab09532: "binfmt_elf: use ELF_ET_DYN_BASE only for PIE") ELF_ET_DYN_BASE was at TASK_SIZE / 3 * 2, which is not that far away from the stack top on the 32b (legacy) memory layout (only 1GB away). Therefore we could end up mapping over the existing stack with some luck.

The issue has been fixed since then (a87938b2: "fs/binfmt_elf.c: fix bug in loading of PIE binaries"), ELF_ET_DYN_BASE moved much further from the stack (eab09532 and later by c715b72c: "mm: revert x86_64 and arm64 ELF_ET_DYN_BASE base changes") and excessive stack consumption early during execve was fully stopped by da029c11 ("exec: Limit arg stack to at most 75% of _STK_LIM"). So we should be safe and any attack should be impractical. On the other hand, this is just too subtle an assumption, so it can break quite easily and be hard to spot.

I believe that the MAP_FIXED usage in load_elf_binary (et al.) is still fundamentally dangerous. Moreover it shouldn't even be needed. We are at the early process stage, so there shouldn't be unrelated mappings (except for the stack and the loader) and mmap for a given address should succeed even without MAP_FIXED. Something is terribly wrong if this is not the case, and we should rather fail than silently corrupt the underlying mapping.

Address this issue by changing MAP_FIXED to the newly added MAP_FIXED_NOREPLACE. This means that mmap will fail if there is an existing mapping clashing with the requested one, without clobbering it.

[mhocko@suse.com: fix build]
[akpm@linux-foundation.org: coding-style fixes]
[avagin@openvz.org: don't use the same value for MAP_FIXED_NOREPLACE and MAP_SYNC]
Link: http://lkml.kernel.org/r/20171218184916.24445-1-avagin@openvz.org
Link: http://lkml.kernel.org/r/20171213092550.2774-3-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrei Vagin <avagin@openvz.org>
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Khalid Aziz <khalid.aziz@oracle.com>
Acked-by: Michael Ellerman <mpe@ellerman.id.au>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Abdul Haleem <abdhalee@linux.vnet.ibm.com>
Cc: Joel Stanley <joel@jms.id.au>
Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Committed by Michal Hocko
Patch series "mm: introduce MAP_FIXED_NOREPLACE", v2.

This started as a follow-up discussion [3][4] resulting from the runtime failure caused by hardening patch [5], which removes MAP_FIXED from the elf loader because MAP_FIXED is inherently dangerous as it might silently clobber an existing underlying mapping (e.g. stack). The reason for the failure is that some architectures enforce an alignment for the given address hint when MAP_FIXED is not used (e.g. for shared or file backed mappings).

One way around this would be excluding those archs which do alignment tricks from the hardening [6]. The patch is really trivial, but it has been objected, rightfully so, that this screams for a more generic solution. We basically want a non-destructive MAP_FIXED.

The first patch introduces MAP_FIXED_NOREPLACE, which enforces the given address but, unlike MAP_FIXED, fails with EEXIST if the given range conflicts with an existing one. The flag is introduced as a completely new one rather than a MAP_FIXED extension because of backward compatibility. We really want a never-clobber semantic even on older kernels which do not recognize the flag. Unfortunately mmap sucks wrt flags evaluation because we do not EINVAL on unknown flags. On those kernels we would simply use the traditional hint-based semantic, so the caller can still get a different address (which sucks) but at least not silently corrupt an existing mapping. I do not see a good way around that, except not exposing the new semantic to userspace at all.

It seems there are users who would like to have something like that. Jemalloc has been mentioned by Michael Ellerman [7].

Florian Weimer has mentioned the following:
: glibc ld.so currently maps DSOs without hints. This means that the kernel
: will map right next to each other, and the offsets between them are completely
: predictable. We would like to change that and supply a random address in a
: window of the address space. If there is a conflict, we do not want the
: kernel to pick a non-random address. Instead, we would try again with a
: random address.

John Hubbard has mentioned a CUDA example:
: a) Searches /proc/<pid>/maps for a "suitable" region of available
: VA space. "Suitable" generally means it has to have a base address
: within a certain limited range (a particular device model might
: have odd limitations, for example), it has to be large enough, and
: alignment has to be large enough (again, various devices may have
: constraints that lead us to do this).
:
: This is of course subject to races with other threads in the process.
:
: Let's say it finds a region starting at va.
:
: b) Next it does:
: p = mmap(va, ...)
:
: *without* setting MAP_FIXED, of course (so va is just a hint), to
: attempt to safely reserve that region. If p != va, then in most cases,
: this is a failure (almost certainly due to another thread getting a
: mapping from that region before we did), and so this layer now has to
: call munmap(), before returning a "failure: retry" to upper layers.
:
: IMPROVEMENT: --> if instead, we could call this:
:
: p = mmap(va, ... MAP_FIXED_NOREPLACE ...)
:
: , then we could skip the munmap() call upon failure. This
: is a small thing, but it is useful here. (Thanks to Piotr
: Jaroszynski and Mark Hairgrove for helping me get that detail
: exactly right, btw.)
:
: c) After that, CUDA suballocates from p, via:
:
: q = mmap(sub_region_start, ... MAP_FIXED ...)
:
: Interestingly enough, "freeing" is also done via MAP_FIXED, and
: setting PROT_NONE to the subregion. Anyway, I just included (c) for
: general interest.

Atomic address range probing in multithreaded programs in general sounds like an interesting thing to me.

The second patch simply replaces the MAP_FIXED use in the elf loader with MAP_FIXED_NOREPLACE. I believe other places which rely on MAP_FIXED should follow. Actually, real MAP_FIXED usages should be documented properly and they should be more of an exception.

[1] http://lkml.kernel.org/r/20171116101900.13621-1-mhocko@kernel.org
[2] http://lkml.kernel.org/r/20171129144219.22867-1-mhocko@kernel.org
[3] http://lkml.kernel.org/r/20171107162217.382cd754@canb.auug.org.au
[4] http://lkml.kernel.org/r/1510048229.12079.7.camel@abdul.in.ibm.com
[5] http://lkml.kernel.org/r/20171023082608.6167-1-mhocko@kernel.org
[6] http://lkml.kernel.org/r/20171113094203.aofz2e7kueitk55y@dhcp22.suse.cz
[7] http://lkml.kernel.org/r/87efp1w7vy.fsf@concordia.ellerman.id.au

This patch (of 2):

MAP_FIXED is used quite often to enforce mapping at a particular range. The main problem of this flag is, however, that it is inherently dangerous because it unmaps existing mappings covered by the requested range. This can cause silent memory corruption, in some cases with serious security implications. While the current semantic might be really desirable in many cases, there are others which would want to enforce the given range but rather see a failure than a silent memory corruption on a clashing range. Please note that there is no guarantee that a given range is obeyed by mmap even when it is free - e.g. arch specific code is allowed to apply an alignment.

Introduce a new MAP_FIXED_NOREPLACE flag for mmap to achieve this behavior. It has the same semantic as MAP_FIXED wrt. the given address request, with the single exception that it fails with EEXIST if the requested address is already covered by an existing mapping. We still rely on get_unmapped_area to handle all the arch specific MAP_FIXED treatment and check for a conflicting vma after it returns.

The flag is introduced as a completely new one rather than a MAP_FIXED extension because of backward compatibility. We really want a never-clobber semantic even on older kernels which do not recognize the flag. Unfortunately mmap sucks wrt. flags evaluation because we do not EINVAL on unknown flags. On those kernels we would simply use the traditional hint-based semantic, so the caller can still get a different address (which sucks) but at least not silently corrupt an existing mapping. I do not see a good way around that.

[mpe@ellerman.id.au: fix whitespace]
[fail on clashing range with EEXIST as per Florian Weimer]
[set MAP_FIXED before round_hint_to_min as per Khalid Aziz]
Link: http://lkml.kernel.org/r/20171213092550.2774-2-mhocko@kernel.org
Reviewed-by: Khalid Aziz <khalid.aziz@oracle.com>
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Michael Ellerman <mpe@ellerman.id.au>
Cc: Khalid Aziz <khalid.aziz@oracle.com>
Cc: Russell King - ARM Linux <linux@armlinux.org.uk>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Florian Weimer <fweimer@redhat.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Abdul Haleem <abdhalee@linux.vnet.ibm.com>
Cc: Joel Stanley <joel@jms.id.au>
Cc: Kees Cook <keescook@chromium.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Jason Evans <jasone@google.com>
Cc: David Goldblatt <davidtgoldblatt@gmail.com>
Cc: Edward Tomasz Napierała <trasz@FreeBSD.org>
Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
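From userspace, the semantics described in this message could be probed with a sketch like the following. The fallback value for `MAP_FIXED_NOREPLACE` is an assumption that holds for x86 and the generic headers only, and the function name `probe_noreplace` and its return codes are invented for this illustration.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef MAP_FIXED_NOREPLACE
/* Assumption: value used on x86 and asm-generic; other arches differ. */
#define MAP_FIXED_NOREPLACE 0x100000
#endif

/* Maps one page, then asks for the same address again with
 * MAP_FIXED_NOREPLACE.
 * Returns 0 if the kernel refused with EEXIST (flag understood),
 *         1 if the address was treated as a hint (older kernel),
 *        -1 on anything else, including a clobbered first mapping. */
int probe_noreplace(void)
{
    long page = sysconf(_SC_PAGESIZE);
    int prot = PROT_READ | PROT_WRITE;
    int flags = MAP_PRIVATE | MAP_ANONYMOUS;
    int result;

    char *first = mmap(NULL, page, prot, flags, -1, 0);
    if (first == MAP_FAILED)
        return -1;
    first[0] = 'x';

    void *second = mmap(first, page, prot, flags | MAP_FIXED_NOREPLACE, -1, 0);
    if (second == MAP_FAILED) {
        result = (errno == EEXIST) ? 0 : -1;
    } else if (second != (void *)first) {
        munmap(second, page);   /* hint-based fallback picked another spot */
        result = 1;
    } else {
        result = -1;            /* same address returned: mapping replaced */
    }

    if (first[0] != 'x')        /* the original data must have survived */
        result = -1;
    munmap(first, page);
    return result;
}
```

Both non-negative outcomes leave the first mapping intact, which is the never-clobber property the message asks for even on kernels that do not recognize the flag. A caller that needs the exact address still has to compare the returned pointer with the requested one.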
-
Committed by Davidlohr Bueso
There is a permission discrepancy when consulting msq ipc object metadata between /proc/sysvipc/msg (0444) and the MSG_STAT msgctl command. The latter does permission checks for the object vs S_IRUGO. As such there can be cases where EACCES is returned via syscall but the info is displayed anyway in the procfs files.

While this might have security implications via info leaking (albeit no writing to the msq metadata), this behavior goes way back and showing all the objects regardless of the permissions was most likely an oversight - so we are stuck with it. Furthermore, modifying either the syscall or the procfs file can cause userspace programs to break (e.g. ipcs). Some applications require getting the procfs info (without root privileges), and this can be rather slow in comparison with a syscall -- up to 500x in some reported cases for shm.

This patch introduces a new MSG_STAT_ANY command such that the msq ipc object permissions are ignored, and only audited instead. In addition, I've left the lsm security hook checks in place, as if some policy can block the call, then the user has no other choice than just parsing the procfs file.

Link: http://lkml.kernel.org/r/20180215162458.10059-4-dave@stgolabs.net
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Reported-by: Robert Kettler <robert.kettler@outlook.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Committed by Davidlohr Bueso
There is a permission discrepancy when consulting sem ipc object metadata between /proc/sysvipc/sem (0444) and the SEM_STAT semctl command. The latter does permission checks for the object vs S_IRUGO. As such there can be cases where EACCES is returned via syscall but the info is displayed anyway in the procfs files.

While this might have security implications via info leaking (albeit no writing to the sma metadata), this behavior goes way back and showing all the objects regardless of the permissions was most likely an oversight - so we are stuck with it. Furthermore, modifying either the syscall or the procfs file can cause userspace programs to break (e.g. ipcs). Some applications require getting the procfs info (without root privileges), and this can be rather slow in comparison with a syscall -- up to 500x in some reported cases for shm.

This patch introduces a new SEM_STAT_ANY command such that the sem ipc object permissions are ignored, and only audited instead. In addition, I've left the lsm security hook checks in place, as if some policy can block the call, then the user has no other choice than just parsing the procfs file.

Link: http://lkml.kernel.org/r/20180215162458.10059-3-dave@stgolabs.net
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Reported-by: Robert Kettler <robert.kettler@outlook.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Committed by Davidlohr Bueso
Patch series "sysvipc: introduce STAT_ANY commands", v2.

The following patches add the discussed (see [1]) new command for shm as well as for sems and msq, as they are subject to the same discrepancies for ipc object permission checks between the syscall and procfs. These new commands are justified in that (1) we are stuck with these semantics, as changing the syscall or procfs can break userland; and (2) some users can benefit in performance (for large amounts of shm segments, for example) from not having to parse the procfs interface.

Once merged, I will submit the necessary manpage updates. But I'm thinking something like:

: diff --git a/man2/shmctl.2 b/man2/shmctl.2
: index 7bb503999941..bb00bbe21a57 100644
: --- a/man2/shmctl.2
: +++ b/man2/shmctl.2
: @@ -41,6 +41,7 @@
: .\" 2005-04-25, mtk -- noted aberrant Linux behavior w.r.t. new
: .\" attaches to a segment that has already been marked for deletion.
: .\" 2005-08-02, mtk: Added IPC_INFO, SHM_INFO, SHM_STAT descriptions.
: +.\" 2018-02-13, dbueso: Added SHM_STAT_ANY description.
: .\"
: .TH SHMCTL 2 2017-09-15 "Linux" "Linux Programmer's Manual"
: .SH NAME
: @@ -242,6 +243,18 @@ However, the
: argument is not a segment identifier, but instead an index into
: the kernel's internal array that maintains information about
: all shared memory segments on the system.
: +.TP
: +.BR SHM_STAT_ANY " (Linux-specific)"
: +Return a
: +.I shmid_ds
: +structure as for
: +.BR SHM_STAT .
: +However, the
: +.I shm_perm.mode
: +is not checked for read access for
: +.IR shmid ,
: +resembling the behaviour of
: +/proc/sysvipc/shm.
: .PP
: The caller can prevent or allow swapping of a shared
: memory segment with the following \fIcmd\fP values:
: @@ -287,7 +300,7 @@ operation returns the index of the highest used entry in the
: kernel's internal array recording information about all
: shared memory segments.
: (This information can be used with repeated
: -.B SHM_STAT
: +.B SHM_STAT/SHM_STAT_ANY
: operations to obtain information about all shared memory segments
: on the system.)
: A successful
: @@ -328,7 +341,7 @@ isn't accessible.
: \fIshmid\fP is not a valid identifier, or \fIcmd\fP
: is not a valid command.
: Or: for a
: -.B SHM_STAT
: +.B SHM_STAT/SHM_STAT_ANY
: operation, the index value specified in
: .I shmid
: referred to an array slot that is currently unused.

This patch (of 3):

There is a permission discrepancy when consulting shm ipc object metadata between /proc/sysvipc/shm (0444) and the SHM_STAT shmctl command. The latter does permission checks for the object vs S_IRUGO. As such there can be cases where EACCES is returned via syscall but the info is displayed anyway in the procfs files.

While this might have security implications via info leaking (albeit no writing to the shm metadata), this behavior goes way back and showing all the objects regardless of the permissions was most likely an oversight - so we are stuck with it. Furthermore, modifying either the syscall or the procfs file can cause userspace programs to break (e.g. ipcs). Some applications require getting the procfs info (without root privileges), and this can be rather slow in comparison with a syscall -- up to 500x in some reported cases.

This patch introduces a new SHM_STAT_ANY command such that the shm ipc object permissions are ignored, and only audited instead. In addition, I've left the lsm security hook checks in place, as if some policy can block the call, then the user has no other choice than just parsing the procfs file.

[1] https://lkml.org/lkml/2017/12/19/220

Link: http://lkml.kernel.org/r/20180215162458.10059-2-dave@stgolabs.net
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Robert Kettler <robert.kettler@outlook.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
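The index walk that the proposed man page text describes could look roughly like the sketch below. The fallback value for `SHM_STAT_ANY` is an assumption for systems whose libc headers predate the command, and `shm_visible` is a hypothetical helper written for this example.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/ipc.h>
#include <sys/shm.h>

#ifndef SHM_STAT_ANY
/* Assumption: numeric value for libc headers that lack the command. */
#define SHM_STAT_ANY 15
#endif

/* Walks the kernel's internal shm array with `cmd` (SHM_STAT or
 * SHM_STAT_ANY) and reports whether segment `target_id` shows up:
 * 1 if found, 0 if not, -1 if the table size could not be queried. */
int shm_visible(int target_id, int cmd)
{
    struct shm_info info;
    struct shmid_ds ds;

    /* SHM_INFO returns the index of the highest used entry. */
    int max_idx = shmctl(0, SHM_INFO, (struct shmid_ds *)&info);
    if (max_idx < 0)
        return -1;

    for (int idx = 0; idx <= max_idx; idx++) {
        /* For the *_STAT commands the first argument is an index, and
         * the return value is the identifier stored in that slot.
         * Failures (unused slot, EACCES, unknown command) yield -1. */
        int id = shmctl(idx, cmd, &ds);
        if (id == target_id)
            return 1;
    }
    return 0;
}
```

For a segment created with mode 0 by an unprivileged process, one would expect `shm_visible(id, SHM_STAT)` to return 0 because of the EACCES described above, and `shm_visible(id, SHM_STAT_ANY)` to return 1 on a kernel that carries this patch, unless an LSM policy blocks the call.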
-
- 11 April, 2018 (2 commits)
-
-
Committed by Helge Deller
POSIX and common sense require that SI_USER not be a signal-specific si_code. Thus add a new FPE_CONDTRAP si_code for conditional traps.

Signed-off-by: Helge Deller <deller@gmx.de>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
-
Committed by Jonathan Helman
Export the number of successful and failed hugetlb page allocations via the virtio balloon driver. These 2 counts come directly from the vm_events HTLB_BUDDY_PGALLOC and HTLB_BUDDY_PGALLOC_FAIL.

Signed-off-by: Jonathan Helman <jonathan.helman@oracle.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Jason Wang <jasowang@redhat.com>
-
- 06 April, 2018 (4 commits)
-
-
Committed by Ariel Levkovich
This patch adds the mlx5_ib driver implementation for the device memory allocation API. It implements the ib_device callbacks for the allocation and deallocation operations, as well as support for a new mmap command which allows mapping allocated device memory to a VMA.

The change also adds reporting of the device memory maximum size and alignment parameters in the device capabilities.

The allocation/deallocation operations use new firmware commands to allocate MEMIC memory on the device.

Signed-off-by: Ariel Levkovich <lariel@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
-
Committed by Ariel Levkovich
Add a new ioctl method for the MR object - REG_DM_MR. This command can be used to register an allocated device memory buffer as an MR and receive the lkey and rkey to be used within work requests.

It is added as a new method under the MR object and uses a new ib_device callback - reg_dm_mr. The command creates a standard ib_mr object which represents the registered memory.

Signed-off-by: Ariel Levkovich <lariel@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
-
Committed by Ariel Levkovich
This change adds uverbs support for the device memory allocation/freeing commands.

A new uverbs object of type idr is defined to represent and track allocations of the new resource type per context. The API requires the provider driver to implement 2 new ib_device callbacks - one for allocation and one for deallocation - which return and accept (respectively) the ib_dm object that represents the allocated memory on the device.

The support is added via the ioctl command infrastructure only.

Signed-off-by: Ariel Levkovich <lariel@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
-
Committed by Ariel Levkovich
Add a new capability field under ib_uverbs_ex_query_device_resp - max_dm_size - which reflects the maximum amount of device memory, in bytes, that is available for allocation on a device.

Signed-off-by: Ariel Levkovich <lariel@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
-
- 05 Apr, 2018 9 commits
-
-
Committed by Matan Barak
When a Raw Ethernet QP is created, we actually create a few objects. One of these objects is a TIR. Currently, a TIR can hash (and spread the traffic) by IP or port only. Add hashing by IPSec SPI to TIR creation, along with the required UAPI bit. Signed-off-by: Matan Barak <matanb@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
-
Committed by Matan Barak
Users should be able to query for IPSec support. Add a few capability bits to the driver-specific part of alloc_ucontext. MLX5_USER_ALLOC_UCONTEXT_FLOW_ACTION_FLAGS_ESP_AES_GCM_REQ_METADATA: the payload's header is returned with metadata representing the IPSec decryption state. MLX5_USER_ALLOC_UCONTEXT_FLOW_ACTION_FLAGS_ESP_AES_GCM_RX: ESP_AES_GCM is supported in the ingress path. MLX5_USER_ALLOC_UCONTEXT_FLOW_ACTION_FLAGS_ESP_AES_GCM_TX: ESP_AES_GCM is supported in the egress path. MLX5_USER_ALLOC_UCONTEXT_FLOW_ACTION_FLAGS_ESP_AES_GCM_SPI_RSS_ONLY: the hardware doesn't support matching SPI in flow steering rules, only hashing and spreading the traffic accordingly. Signed-off-by: Aviad Yehezkel <aviadye@mellanox.com> Signed-off-by: Matan Barak <matanb@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
-
Committed by Aviad Yehezkel
Add an implementation in the mlx5 driver to create and destroy the action_xfrm object. This merely calls the accel layer. A user may pass the MLX5_IB_XFRM_FLAGS_REQUIRE_METADATA flag, which states that the user expects a metadata header to be added to the payload. This header represents information regarding the transformation's state. Reviewed-by: Yishai Hadas <yishaih@mellanox.com> Signed-off-by: Matan Barak <matanb@mellanox.com> Signed-off-by: Aviad Yehezkel <aviadye@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
-
Committed by Matan Barak
Add a new ESP steering match filter that can match against the spi and seq fields used in the IPSec protocol. Reviewed-by: Yishai Hadas <yishaih@mellanox.com> Signed-off-by: Matan Barak <matanb@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
-
Committed by Matan Barak
flow_actions of ESP type can be modified at runtime. This can be common, for example, when the ESN should be changed. Add a new UVERBS_FLOW_ACTION_ESP_MODIFY method for changing the ESP parameters of an existing ESP flow_action. The new method uses the UVERBS_FLOW_ACTION_ESP_CREATE attributes, but adds a new IB_FLOW_ACTION_ESP_FLAGS_MOD_ESP_ATTRS flag, which means ESP_ATTRS should be changed. In addition, we add a new FLOW_ACTION_ESP_REPLAY_NONE replay type that can be used to disable replay protection on a specific flow_action. Reviewed-by: Yishai Hadas <yishaih@mellanox.com> Signed-off-by: Matan Barak <matanb@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
-
Committed by Matan Barak
Binding a flow_action to a flow steering rule requires a new specification. Therefore, add the IB_FLOW_SPEC_ACTION_HANDLE flow specification. Flow steering rules can use flow_action(s), and because of that we need to avoid deleting flow_action(s) as long as they're being used. Moreover, when the attached rules are deleted, the action_handle reference count should be decremented. Introduce a new mechanism of flow resources to keep track of the attached action_handle(s). Later on, this mechanism should be extended to other attached flow steering resources, such as flow counters. Reviewed-by: Yishai Hadas <yishaih@mellanox.com> Signed-off-by: Matan Barak <matanb@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
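The reference-counting rule described above can be sketched in plain user-space C. This is only an illustration under stated assumptions: the struct and function names (flow_action_sketch, flow_rule_sketch, rule_create and so on) are hypothetical, the counter is a plain int rather than an atomic, and returning -EBUSY for an in-use action is an assumed convention, not something the commit message states.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a flow_action object; not the kernel's struct. */
struct flow_action_sketch {
    int usecnt;      /* number of steering rules currently using the action */
    bool destroyed;
};

/* Hypothetical stand-in for a steering rule and the resources it tracks. */
struct flow_rule_sketch {
    struct flow_action_sketch *action;   /* NULL if no action is attached */
};

/* Attaching an action to a rule takes a reference on the action. */
static void rule_create(struct flow_rule_sketch *rule,
                        struct flow_action_sketch *action)
{
    rule->action = action;
    if (action)
        action->usecnt++;
}

/* Deleting the rule drops the reference it held. */
static void rule_destroy(struct flow_rule_sketch *rule)
{
    if (rule->action) {
        rule->action->usecnt--;
        rule->action = NULL;
    }
}

/* An action may only be destroyed once no rule refers to it (assumed -EBUSY). */
static int flow_action_destroy(struct flow_action_sketch *action)
{
    if (action->usecnt > 0)
        return -EBUSY;
    action->destroyed = true;
    return 0;
}
```

With two rules attached to one action, flow_action_destroy() keeps failing until both rules have been destroyed, which is the behaviour the flow resources mechanism is meant to guarantee.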
-
Committed by Matan Barak
A verbs application may receive and transmit packets using a data path pipeline. Sometimes, the first stage in the receive pipeline or the last stage in the transmit pipeline involves transforming a packet, either to make it easier for later stages to process it or to prepare it for transmission over the wire. Such a transformation could be stripping/encapsulating the packet (e.g. vxlan), decrypting/encrypting it (e.g. ipsec), altering headers, doing some complex FPGA changes, etc. Some hardware can do such transformations without any software data path intervention. The flow steering API supports steering a packet (either to a QP or dropping it) and some simple packet-immutable actions (e.g. tagging a packet). Complex actions that may change the packet could bloat the flow steering API extensively. Sometimes the same action should be applied to several flows. In this case, it's easier to bind several flows to the same action and modify it than to change all matching flows. Introduce a new flow_action object that abstracts any packet transformation (out of a standard and well-defined set of actions). This flow_action object can be tied to a flow steering rule via a new specification. Currently, we support the esp flow_action, which encrypts or decrypts a packet according to the given parameters. However, we present a flexible schema that can be used for other transformation actions tied to flow rules. Reviewed-by: Yishai Hadas <yishaih@mellanox.com> Signed-off-by: Matan Barak <matanb@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
-
Committed by Matan Barak
Methods sometimes need to get one attribute out of a group of pre-defined attributes. This is enum-like behavior. Since this is a common requirement, we add a new ENUM attribute to the generic uverbs ioctl() layer. This attribute is embedded in methods, like any other attribute we currently have. ENUM attributes point to an array of standard UVERBS_ATTR_PTR_IN attributes. User space encodes the enum's attribute id in the id field and the internal PTR_IN attr id in the enum_data.elem_id field. This ENUM attribute can be shared by several attributes and can take the UVERBS_ATTR_SPEC_F_MANDATORY flag, stating that this attribute must be supported by the kernel, like any other attribute. Reviewed-by: Yishai Hadas <yishaih@mellanox.com> Signed-off-by: Matan Barak <matanb@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
-
Committed by Mike Snitzer
Commit 519049af ("dm: use blkdev_get rather than bdgrab when issuing pass-through ioctl") inadvertently introduced a regression for users of device cgroups that issue ioctls (e.g. libvirt). Using blkdev_get() in DM's passthrough ioctl support implicitly introduced a cgroup permissions check that would fail unless care were taken to add all devices in the IO stack to the device cgroup. E.g. rather than just adding the top-level DM multipath device to the cgroup, all the underlying devices would need to be allowed. Fix this, to no longer require allowing all underlying devices, by simply holding the live DM table (which includes the table's original blkdev_get() reference on the block device that the ioctl will be issued to) for the duration of the ioctl. Also, bump the DM ioctl version so a user can know that their device cgroup allow workaround is no longer needed. Reported-by: Michal Privoznik <mprivozn@redhat.com> Suggested-by: Mikulas Patocka <mpatocka@redhat.com> Fixes: 519049af ("dm: use blkdev_get rather than bdgrab when issuing pass-through ioctl") Cc: stable@vger.kernel.org # 4.16 Signed-off-by: Mike Snitzer <snitzer@redhat.com>
-
- 04 Apr, 2018 2 commits
-
-
Committed by Jason Gunthorpe
This structure is pushed down the ex and the non-ex path, so it needs to be aligned to 8 bytes to go through ex without implicit padding. Old user space will provide 4 bytes of resp on !ex and 8 bytes on ex, so take the approach of just copying the minimum length. New user space will consistently provide 8 bytes in both cases. Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
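The "copy the minimum length" approach can be sketched in user-space C as follows. The struct layout and the names resp_sketch and copy_resp are hypothetical, and memcpy() stands in for the kernel's copy to user space; the only point illustrated is that the number of bytes copied is the smaller of what the caller provided and what the response struct holds.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical response struct, explicitly padded to 8 bytes. */
struct resp_sketch {
    uint32_t value;
    uint32_t reserved;   /* explicit padding so the struct is 8 bytes long */
};

/*
 * Copy min(user_len, sizeof(*resp)) bytes into the caller's buffer and
 * return how many bytes were copied. Old callers that pass a 4-byte buffer
 * get only the first field; newer callers that pass 8 bytes get everything.
 */
static size_t copy_resp(void *user_buf, size_t user_len,
                        const struct resp_sketch *resp)
{
    size_t n = user_len < sizeof(*resp) ? user_len : sizeof(*resp);

    memcpy(user_buf, resp, n);
    return n;
}
```

Because the copy never exceeds the caller's buffer, a 4-byte buffer from old user space is not overrun, and a larger buffer is simply filled only up to the struct size.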
-
Committed by Mike Snitzer
Could be useful for a target to return stats or other information. If a target uses DMEMIT() to write anything to @result from its .message method, it must return 1 to the caller. Signed-off-by: Mike Snitzer <snitzer@redhat.com>
-
- 03 Apr, 2018 1 commit
-
-
Committed by Eric W. Biederman
The change moving addr_lsb into the _sigfault union failed to take into account that _sigfault._addr_bnd._lower being a pointer forced the entire union to have pointer alignment. The fix for _sigfault._addr_bnd._lower having pointer alignment failed to take into account that m68k has a pointer alignment less than the size of a pointer. So simply making the padding members pointers changed the location of later members in the structure. Fix this by directly computing the needed size of the padding members, and making the padding members char arrays of the needed size: sizeof(short) if __alignof__(void *) is 1, otherwise __alignof__(void *). These should be exactly the same rules the compiler would have used when computing the padding. I have tested this change by adding BUILD_BUG_ONs to m68k to verify the offset of every member of struct siginfo, and with those confirming that the offsets of the fields in struct siginfo are the same before I changed the generic _sigfault member and after the correction to the _sigfault member. I have also verified that x86, with its own BUILD_BUG_ONs to verify the offsets of the siginfo members, also compiles cleanly. Cc: stable@vger.kernel.org Reported-by: Eugene Syromiatnikov <esyr@redhat.com> Fixes: 859d880c ("signal: Correct the offset of si_pkey in struct siginfo") Fixes: b68a68d3 ("signal: Move addr_lsb into the _sigfault union for clarity") Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
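The padding rule above can be checked with a small stand-alone sketch. The struct and macro names here (fault_implicit, fault_explicit, ADDR_PAD_SIZE) are hypothetical simplifications and not the real struct siginfo. The macro uses a "less than sizeof(short)" comparison, which matches the "is 1" wording in the message whenever short is 2 bytes. __alignof__ is a GCC/Clang extension, and the _Static_assert lines play the role the BUILD_BUG_ONs play in the message.

```c
#include <stddef.h>

/* Size of the explicit padding: sizeof(short) if pointers have byte
 * alignment, otherwise the pointer alignment. Hypothetical macro name. */
#define ADDR_PAD_SIZE \
    (__alignof__(void *) < sizeof(short) ? sizeof(short) : __alignof__(void *))

/* Reference layout: addr_lsb sits outside the union and the compiler
 * inserts whatever padding the pointer-aligned union needs. */
struct fault_implicit {
    void *addr;
    short addr_lsb;
    union {
        struct { void *lower; void *upper; } addr_bnd;
        unsigned int pkey;
    } u;
};

/* Layout with addr_lsb moved into the union and the padding spelled out
 * as char arrays of the computed size. */
struct fault_explicit {
    void *addr;
    union {
        short addr_lsb;
        struct { char pad[ADDR_PAD_SIZE]; void *lower; void *upper; } addr_bnd;
        struct { char pad[ADDR_PAD_SIZE]; unsigned int pkey; } addr_pkey;
    } u;
};

/* Compile-time checks that no member moved between the two layouts. */
_Static_assert(offsetof(struct fault_implicit, addr_lsb) ==
               offsetof(struct fault_explicit, u.addr_lsb), "addr_lsb moved");
_Static_assert(offsetof(struct fault_implicit, u.addr_bnd.lower) ==
               offsetof(struct fault_explicit, u.addr_bnd.lower), "lower moved");
_Static_assert(offsetof(struct fault_implicit, u.pkey) ==
               offsetof(struct fault_explicit, u.addr_pkey.pkey), "pkey moved");
```

If the padding arrays were replaced by pointer members, the checks would still pass on targets where pointer size equals pointer alignment, which is why a target like m68k was needed to expose the earlier mistake.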
-
- 01 Apr, 2018 2 commits
-
-
Committed by Jon Maloy
gcc points out that the combined length of the fixed-length inputs to l->name is larger than the destination buffer size: net/tipc/link.c: In function 'tipc_link_create': net/tipc/link.c:465:26: error: '%s' directive writing up to 32 bytes into a region of size between 26 and 58 [-Werror=format-overflow=] sprintf(l->name, "%s:%s-%s:unknown", self_str, if_name, peer_str); net/tipc/link.c:465:2: note: 'sprintf' output 11 or more bytes (assuming 75) into a destination of size 60 sprintf(l->name, "%s:%s-%s:unknown", self_str, if_name, peer_str); A detailed analysis reveals that the theoretical maximum length of a link name is: max self_str + 1 + max if_name + 1 + max peer_str + 1 + max if_name = 16 + 1 + 15 + 1 + 16 + 1 + 15 = 65. Since we also need space for a trailing zero we now set MAX_LINK_NAME to 68. Just to be on the safe side we also replace the sprintf() call with snprintf(). Fixes: 25b0b9c4 ("tipc: handle collisions of 32-bit node address hash values") Reported-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
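The length arithmetic and the switch to snprintf() can be illustrated with a short user-space sketch. The helper name format_link_name and the constant names are hypothetical; the maxima (16, 15) and the buffer size 68 are taken from the message above, and the old size of 60 comes from the gcc note.

```c
#include <stdio.h>
#include <string.h>

/* Maxima quoted in the commit message; constant names are hypothetical. */
#define MAX_ADDR_STR   16   /* longest self_str or peer_str */
#define MAX_IF_NAME    15   /* longest interface name */
#define MAX_LINK_NAME  68   /* new destination size, room for 65 chars + NUL */

/*
 * Build "<self>:<if>-<peer>:<peer_if>" into dst.
 * Returns 0 on success, -1 if the result did not fit and was truncated.
 */
static int format_link_name(char *dst, size_t dst_len,
                            const char *self_str, const char *if_name,
                            const char *peer_str, const char *peer_if)
{
    int n = snprintf(dst, dst_len, "%s:%s-%s:%s",
                     self_str, if_name, peer_str, peer_if);

    /* snprintf returns the length it wanted to write, excluding the NUL. */
    return (n < 0 || (size_t)n >= dst_len) ? -1 : 0;
}
```

Unlike sprintf(), snprintf() never writes past dst_len bytes, so even an unexpectedly long input only truncates the name instead of overflowing the buffer; checking the return value lets the caller detect that case.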
-
Committed by Jon Maloy
The three address type structs in the user API have names that in reality reflect the specific, non-Linux environment where they were originally created. We now give them more intuitive names, in accordance with how TIPC is described in the current documentation: struct tipc_portid -> struct tipc_socket_addr, struct tipc_name -> struct tipc_service_addr, struct tipc_name_seq -> struct tipc_service_range. To avoid confusion, we also update some comments and macro names to match the new terminology. For compatibility, we add macros that map all old names to the new ones. Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
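A minimal sketch of how such compatibility macros can work is shown below. The three struct names and the old-to-new mapping come from the message above; the field names and types are assumptions made for illustration, and legacy_instance() is a hypothetical piece of "old" code that still uses the previous struct name.

```c
#include <stdint.h>

/* New names; the field layout shown here is an assumption. */
struct tipc_socket_addr {
    uint32_t ref;
    uint32_t node;
};

struct tipc_service_addr {
    uint32_t type;
    uint32_t instance;
};

struct tipc_service_range {
    uint32_t type;
    uint32_t lower;
    uint32_t upper;
};

/* Compatibility macros: code written against the old names keeps compiling. */
#define tipc_portid   tipc_socket_addr
#define tipc_name     tipc_service_addr
#define tipc_name_seq tipc_service_range

/* Hypothetical legacy code still written against the old struct name. */
static uint32_t legacy_instance(const struct tipc_name *name)
{
    return name->instance;
}
```

Because the mapping is done by the preprocessor, `struct tipc_name` and `struct tipc_service_addr` are the same type, so old and new code can exchange pointers without casts.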
-
- 31 Mar, 2018 1 commit
-
-
Committed by Andrey Ignatov
"Post-hooks" are hooks that are called right before returning from sys_bind. At this time the IP and port are already allocated and no further changes to `struct sock` can happen before returning from sys_bind, but a BPF program has a chance to inspect the socket and change the sys_bind result. Specifically, it can e.g. inspect what port was allocated and, if it doesn't satisfy some policy, force sys_bind to fail and return EPERM to the user. Another example of usage is recording the IP:port pair in a map to use it in later calls to sys_connect. E.g. if some TCP server inside a cgroup was bound to some IP:port_n, it can be recorded in a map. Later, when some TCP client inside the same cgroup tries to connect to 127.0.0.1:port_n, the BPF hook for sys_connect can override the destination and connect the application to IP:port_n instead of 127.0.0.1:port_n. That helps force all applications inside a cgroup to use the desired IP without breaking those applications if they e.g. use localhost to communicate with each other. == Implementation details == Post-hooks are implemented as two new attach types, `BPF_CGROUP_INET4_POST_BIND` and `BPF_CGROUP_INET6_POST_BIND`, for the existing prog type `BPF_PROG_TYPE_CGROUP_SOCK`. Separate attach types for IPv4 and IPv6 are introduced to avoid access to IPv6 fields in `struct sock` from `inet_bind()` and to IPv4 fields from `inet6_bind()`, since those fields might not make sense in such cases. Signed-off-by: Andrey Ignatov <rdna@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
-
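The port policy example from the post-bind commit above (inspect the allocated port and make sys_bind fail if it violates a policy) can be modelled in user-space C. This is not a BPF program: the struct bound_sock_sketch, the function post_bind4_verdict and the port range parameters are hypothetical stand-ins, used only to show the decision a program attached at `BPF_CGROUP_INET4_POST_BIND` might make. The sketch assumes the usual cgroup BPF convention that a return value of 1 allows the operation and 0 rejects it, which the commit message describes as sys_bind returning EPERM.

```c
#include <stdint.h>

/* Hypothetical user-space stand-in for the socket fields a post-bind
 * program would inspect; not the kernel's struct. */
struct bound_sock_sketch {
    uint32_t src_ip4;    /* bound IPv4 address */
    uint32_t src_port;   /* bound port, host byte order in this sketch */
};

enum { VERDICT_DENY = 0, VERDICT_ALLOW = 1 };

/*
 * Allow the bind only if the already-allocated port falls inside the
 * permitted range [lo, hi]; otherwise ask for the bind to be rejected.
 */
static int post_bind4_verdict(const struct bound_sock_sketch *sk,
                              uint32_t lo, uint32_t hi)
{
    if (sk->src_port >= lo && sk->src_port <= hi)
        return VERDICT_ALLOW;
    return VERDICT_DENY;
}
```

The same inspection point is where a real program could instead record the IP:port pair in a map for later use by a sys_connect hook, as the message describes.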