- 13 9月, 2015 2 次提交
-
-
由 Peter Zijlstra 提交于
Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Cc: linux-kernel@vger.kernel.org Signed-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Linus Torvalds 提交于
rq_data_dir() returns either READ or WRITE (0 == READ, 1 == WRITE), not a boolean value. Now, admittedly the "!= 0" doesn't really change the value (0 stays as zero, 1 stays as one), but it's not only redundant, it confuses gcc, and causes gcc to warn about the construct switch (rq_data_dir(req)) { case READ: ... case WRITE: ... that we have in a few drivers. Now, the gcc warning is silly and stupid (it seems to warn not about the switch value having a different type from the case statements, but about _any_ boolean switch value), but in this case the code itself is silly and stupid too, so let's just change it, and get rid of warnings like this: drivers/block/hd.c: In function ‘hd_request’: drivers/block/hd.c:630:11: warning: switch condition has boolean value [-Wswitch-bool] switch (rq_data_dir(req)) { The odd '!= 0' came in when "cmd_flags" got turned into a "u64" in commit 5953316d ("block: make rq->cmd_flags be 64-bit") and is presumably because the old code (that just did a logical 'and' with 1) would then end up making the type of rq_data_dir() be u64 too. But if we want to retain the old regular integer type, let's just cast the result to 'int' rather than use that rather odd '!= 0'. Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 12 9月, 2015 2 次提交
-
-
由 Joe Perches 提交于
The seq_<foo> function return values were frequently misused. See: commit 1f33c41c ("seq_file: Rename seq_overflow() to seq_has_overflowed() and make public") All uses of these return values have been removed, so convert the return types to void. Miscellanea: o Move seq_put_decimal_<type> and seq_escape prototypes closer the other seq_vprintf prototypes o Reorder seq_putc and seq_puts to return early on overflow o Add argument names to seq_vprintf and seq_printf o Update the seq_escape kernel-doc o Convert a couple of leading spaces to tabs in seq_escape Signed-off-by: NJoe Perches <joe@perches.com> Cc: Al Viro <viro@ZenIV.linux.org.uk> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Mark Brown <broonie@kernel.org> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Joerg Roedel <jroedel@suse.de> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Mathieu Desnoyers 提交于
Here is an implementation of a new system call, sys_membarrier(), which executes a memory barrier on all threads running on the system. It is implemented by calling synchronize_sched(). It can be used to distribute the cost of user-space memory barriers asymmetrically by transforming pairs of memory barriers into pairs consisting of sys_membarrier() and a compiler barrier. For synchronization primitives that distinguish between read-side and write-side (e.g. userspace RCU [1], rwlocks), the read-side can be accelerated significantly by moving the bulk of the memory barrier overhead to the write-side. The existing applications of which I am aware that would be improved by this system call are as follows: * Through Userspace RCU library (http://urcu.so) - DNS server (Knot DNS) https://www.knot-dns.cz/ - Network sniffer (http://netsniff-ng.org/) - Distributed object storage (https://sheepdog.github.io/sheepdog/) - User-space tracing (http://lttng.org) - Network storage system (https://www.gluster.org/) - Virtual routers (https://events.linuxfoundation.org/sites/events/files/slides/DPDK_RCU_0MQ.pdf) - Financial software (https://lkml.org/lkml/2015/3/23/189) Those projects use RCU in userspace to increase read-side speed and scalability compared to locking. Especially in the case of RCU used by libraries, sys_membarrier can speed up the read-side by moving the bulk of the memory barrier cost to synchronize_rcu(). * Direct users of sys_membarrier - core dotnet garbage collector (https://github.com/dotnet/coreclr/issues/198) Microsoft core dotnet GC developers are planning to use the mprotect() side-effect of issuing memory barriers through IPIs as a way to implement Windows FlushProcessWriteBuffers() on Linux. They are referring to sys_membarrier in their github thread, specifically stating that sys_membarrier() is what they are looking for. To explain the benefit of this scheme, let's introduce two example threads: Thread A (non-frequent, e.g. executing liburcu synchronize_rcu()) Thread B (frequent, e.g. executing liburcu rcu_read_lock()/rcu_read_unlock()) In a scheme where all smp_mb() in thread A are ordering memory accesses with respect to smp_mb() present in Thread B, we can change each smp_mb() within Thread A into calls to sys_membarrier() and each smp_mb() within Thread B into compiler barriers "barrier()". Before the change, we had, for each smp_mb() pairs: Thread A Thread B previous mem accesses previous mem accesses smp_mb() smp_mb() following mem accesses following mem accesses After the change, these pairs become: Thread A Thread B prev mem accesses prev mem accesses sys_membarrier() barrier() follow mem accesses follow mem accesses As we can see, there are two possible scenarios: either Thread B memory accesses do not happen concurrently with Thread A accesses (1), or they do (2). 1) Non-concurrent Thread A vs Thread B accesses: Thread A Thread B prev mem accesses sys_membarrier() follow mem accesses prev mem accesses barrier() follow mem accesses In this case, thread B accesses will be weakly ordered. This is OK, because at that point, thread A is not particularly interested in ordering them with respect to its own accesses. 2) Concurrent Thread A vs Thread B accesses Thread A Thread B prev mem accesses prev mem accesses sys_membarrier() barrier() follow mem accesses follow mem accesses In this case, thread B accesses, which are ensured to be in program order thanks to the compiler barrier, will be "upgraded" to full smp_mb() by synchronize_sched(). * Benchmarks On Intel Xeon E5405 (8 cores) (one thread is calling sys_membarrier, the other 7 threads are busy looping) 1000 non-expedited sys_membarrier calls in 33s =3D 33 milliseconds/call. * User-space user of this system call: Userspace RCU library Both the signal-based and the sys_membarrier userspace RCU schemes permit us to remove the memory barrier from the userspace RCU rcu_read_lock() and rcu_read_unlock() primitives, thus significantly accelerating them. These memory barriers are replaced by compiler barriers on the read-side, and all matching memory barriers on the write-side are turned into an invocation of a memory barrier on all active threads in the process. By letting the kernel perform this synchronization rather than dumbly sending a signal to every process threads (as we currently do), we diminish the number of unnecessary wake ups and only issue the memory barriers on active threads. Non-running threads do not need to execute such barrier anyway, because these are implied by the scheduler context switches. Results in liburcu: Operations in 10s, 6 readers, 2 writers: memory barriers in reader: 1701557485 reads, 2202847 writes signal-based scheme: 9830061167 reads, 6700 writes sys_membarrier: 9952759104 reads, 425 writes sys_membarrier (dyn. check): 7970328887 reads, 425 writes The dynamic sys_membarrier availability check adds some overhead to the read-side compared to the signal-based scheme, but besides that, sys_membarrier slightly outperforms the signal-based scheme. However, this non-expedited sys_membarrier implementation has a much slower grace period than signal and memory barrier schemes. Besides diminishing the number of wake-ups, one major advantage of the membarrier system call over the signal-based scheme is that it does not need to reserve a signal. This plays much more nicely with libraries, and with processes injected into for tracing purposes, for which we cannot expect that signals will be unused by the application. An expedited version of this system call can be added later on to speed up the grace period. Its implementation will likely depend on reading the cpu_curr()->mm without holding each CPU's rq lock. This patch adds the system call to x86 and to asm-generic. [1] http://urcu.so membarrier(2) man page: MEMBARRIER(2) Linux Programmer's Manual MEMBARRIER(2) NAME membarrier - issue memory barriers on a set of threads SYNOPSIS #include <linux/membarrier.h> int membarrier(int cmd, int flags); DESCRIPTION The cmd argument is one of the following: MEMBARRIER_CMD_QUERY Query the set of supported commands. It returns a bitmask of supported commands. MEMBARRIER_CMD_SHARED Execute a memory barrier on all threads running on the system. Upon return from system call, the caller thread is ensured that all running threads have passed through a state where all memory accesses to user-space addresses match program order between entry to and return from the system call (non-running threads are de facto in such a state). This covers threads from all pro=E2=80=90 cesses running on the system. This command returns 0. The flags argument needs to be 0. For future extensions. All memory accesses performed in program order from each targeted thread is guaranteed to be ordered with respect to sys_membarrier(). If we use the semantic "barrier()" to represent a compiler barrier forcing memory accesses to be performed in program order across the barrier, and smp_mb() to represent explicit memory barriers forcing full memory ordering across the barrier, we have the following ordering table for each pair of barrier(), sys_membarrier() and smp_mb(): The pair ordering is detailed as (O: ordered, X: not ordered): barrier() smp_mb() sys_membarrier() barrier() X X O smp_mb() X O O sys_membarrier() O O O RETURN VALUE On success, these system calls return zero. On error, -1 is returned, and errno is set appropriately. For a given command, with flags argument set to 0, this system call is guaranteed to always return the same value until reboot. ERRORS ENOSYS System call is not implemented. EINVAL Invalid arguments. Linux 2015-04-15 MEMBARRIER(2) Signed-off-by: NMathieu Desnoyers <mathieu.desnoyers@efficios.com> Reviewed-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: NJosh Triplett <josh@joshtriplett.org> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Nicholas Miell <nmiell@comcast.net> Cc: Ingo Molnar <mingo@redhat.com> Cc: Alan Cox <gnomes@lxorguk.ukuu.org.uk> Cc: Lai Jiangshan <laijs@cn.fujitsu.com> Cc: Stephen Hemminger <stephen@networkplumber.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: David Howells <dhowells@redhat.com> Cc: Pranith Kumar <bobby.prani@gmail.com> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Shuah Khan <shuahkh@osg.samsung.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 11 9月, 2015 23 次提交
-
-
由 David Disseldorp 提交于
Signed-off-by: NDavid Disseldorp <ddiss@suse.de> Acked-by: NAndy Grover <agrover@redhat.com> Signed-off-by: NNicholas Bellinger <nab@linux-iscsi.org>
-
由 Nicholas Bellinger 提交于
This patch adds an optional fabric driver provided SGL limit that target-core will honor as it's own internal I/O maximum transfer length limit, as exposed by EVPD=0xb0 block limits parameters. This is required for handling cases when host I/O transfer length exceeds the requested EVPD block limits maximum transfer length. The initial user of this logic is qla2xxx, so that we can avoid having to reject I/Os from some legacy FC hosts where EVPD=0xb0 parameters are not honored. When se_cmd payload length exceeds the provided limit in target_check_max_data_sg_nents() code, se_cmd->data_length + se_cmd->prot_length are reset with se_cmd->residual_count plus underflow bit for outgoing TFO response callbacks. It also checks for existing CDB level underflow + overflow and recalculates final residual_count as necessary. Note this patch currently assumes 1:1 mapping of PAGE_SIZE per struct scatterlist entry. Reported-by: NCraig Watson <craig.watson@vanguard-rugged.com> Cc: Craig Watson <craig.watson@vanguard-rugged.com> Tested-by: NHimanshu Madhani <himanshu.madhani@qlogic.com> Cc: Roland Dreier <roland@purestorage.com> Cc: Arun Easi <arun.easi@qlogic.com> Cc: Giridhar Malavali <giridhar.malavali@qlogic.com> Cc: Andrew Vasquez <andrew.vasquez@qlogic.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Hannes Reinecke <hare@suse.de> Cc: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: NNicholas Bellinger <nab@linux-iscsi.org>
-
由 Christoph Hellwig 提交于
Almost everyone implements dma_set_mask the same way, although some time that's hidden in ->set_dma_mask methods. This patch consolidates those into a common implementation that either calls ->set_dma_mask if present or otherwise uses the default implementation. Some architectures used to only call ->set_dma_mask after the initial checks, and those instance have been fixed to do the full work. h8300 implemented dma_set_mask bogusly as a no-ops and has been fixed. Unfortunately some architectures overload unrelated semantics like changing the dma_ops into it so we still need to allow for an architecture override for now. [jcmvbkbc@gmail.com: fix xtensa] Signed-off-by: NChristoph Hellwig <hch@lst.de> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Russell King <linux@arm.linux.org.uk> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will.deacon@arm.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Cc: Michal Simek <monstr@monstr.eu> Cc: Jonas Bonn <jonas@southpole.se> Cc: Chris Metcalf <cmetcalf@ezchip.com> Cc: Guan Xuetao <gxt@mprc.pku.edu.cn> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Andy Shevchenko <andy.shevchenko@gmail.com> Signed-off-by: NMax Filippov <jcmvbkbc@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Christoph Hellwig 提交于
Most architectures just call into ->dma_supported, but some also return 1 if the method is not present, or 0 if no dma ops are present (although that should never happeb). Consolidate this more broad version into common code. Also fix h8300 which inorrectly always returned 0, which would have been a problem if it's dma_set_mask implementation wasn't a similarly buggy noop. As a few architectures have much more elaborate implementations, we still allow for arch overrides. [jcmvbkbc@gmail.com: fix xtensa] Signed-off-by: NChristoph Hellwig <hch@lst.de> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Russell King <linux@arm.linux.org.uk> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will.deacon@arm.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Cc: Michal Simek <monstr@monstr.eu> Cc: Jonas Bonn <jonas@southpole.se> Cc: Chris Metcalf <cmetcalf@ezchip.com> Cc: Guan Xuetao <gxt@mprc.pku.edu.cn> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Andy Shevchenko <andy.shevchenko@gmail.com> Signed-off-by: NMax Filippov <jcmvbkbc@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Christoph Hellwig 提交于
Currently there are three valid implementations of dma_mapping_error: (1) call ->mapping_error (2) check for a hardcoded error code (3) always return 0 This patch provides a common implementation that calls ->mapping_error if present, then checks for DMA_ERROR_CODE if defined or otherwise returns 0. [jcmvbkbc@gmail.com: fix xtensa] Signed-off-by: NChristoph Hellwig <hch@lst.de> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Russell King <linux@arm.linux.org.uk> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will.deacon@arm.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Cc: Michal Simek <monstr@monstr.eu> Cc: Jonas Bonn <jonas@southpole.se> Cc: Chris Metcalf <cmetcalf@ezchip.com> Cc: Guan Xuetao <gxt@mprc.pku.edu.cn> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Andy Shevchenko <andy.shevchenko@gmail.com> Signed-off-by: NMax Filippov <jcmvbkbc@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Christoph Hellwig 提交于
Most architectures do not support non-coherent allocations and either define dma_{alloc,free}_noncoherent to their coherent versions or stub them out. Openrisc uses dma_{alloc,free}_attrs to implement them, and only Mips implements them directly. This patch moves the Openrisc version to common code, and handles the DMA_ATTR_NON_CONSISTENT case in the mips dma_map_ops instance. Note that actual non-coherent allocations require a dma_cache_sync implementation, so if non-coherent allocations didn't work on an architecture before this patch they still won't work after it. [jcmvbkbc@gmail.com: fix xtensa] Signed-off-by: NChristoph Hellwig <hch@lst.de> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Russell King <linux@arm.linux.org.uk> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will.deacon@arm.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Cc: Michal Simek <monstr@monstr.eu> Cc: Jonas Bonn <jonas@southpole.se> Cc: Chris Metcalf <cmetcalf@ezchip.com> Cc: Guan Xuetao <gxt@mprc.pku.edu.cn> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Andy Shevchenko <andy.shevchenko@gmail.com> Signed-off-by: NMax Filippov <jcmvbkbc@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Christoph Hellwig 提交于
Since 2009 we have a nice asm-generic header implementing lots of DMA API functions for architectures using struct dma_map_ops, but unfortunately it's still missing a lot of APIs that all architectures still have to duplicate. This series consolidates the remaining functions, although we still need arch opt outs for two of them as a few architectures have very non-standard implementations. This patch (of 5): The coherent DMA allocator works the same over all architectures supporting dma_map operations. This patch consolidates them and converges the minor differences: - the debug_dma helpers are now called from all architectures, including those that were previously missing them - dma_alloc_from_coherent and dma_release_from_coherent are now always called from the generic alloc/free routines instead of the ops dma-mapping-common.h always includes dma-coherent.h to get the defintions for them, or the stubs if the architecture doesn't support this feature - checks for ->alloc / ->free presence are removed. There is only one magic instead of dma_map_ops without them (mic_dma_ops) and that one is x86 only anyway. Besides that only x86 needs special treatment to replace a default devices if none is passed and tweak the gfp_flags. An optional arch hook is provided for that. [linux@roeck-us.net: fix build] [jcmvbkbc@gmail.com: fix xtensa] Signed-off-by: NChristoph Hellwig <hch@lst.de> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Russell King <linux@arm.linux.org.uk> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will.deacon@arm.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Cc: Michal Simek <monstr@monstr.eu> Cc: Jonas Bonn <jonas@southpole.se> Cc: Chris Metcalf <cmetcalf@ezchip.com> Cc: Guan Xuetao <gxt@mprc.pku.edu.cn> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Andy Shevchenko <andy.shevchenko@gmail.com> Signed-off-by: NGuenter Roeck <linux@roeck-us.net> Signed-off-by: NMax Filippov <jcmvbkbc@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Oleg Nesterov 提交于
Add the additional "vm_flags_t vm_flags" argument to do_mmap_pgoff(), rename it to do_mmap(), and re-introduce do_mmap_pgoff() as a simple wrapper on top of do_mmap(). Perhaps we should update the callers of do_mmap_pgoff() and kill it later. This way mpx_mmap() can simply call do_mmap(vm_flags => VM_MPX) and do not play with vm internals. After this change mmap_region() has a single user outside of mmap.c, arch/tile/mm/elf.c:arch_setup_additional_pages(). It would be nice to change arch/tile/ and unexport mmap_region(). [kirill@shutemov.name: fix build] [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: NOleg Nesterov <oleg@redhat.com> Acked-by: NDave Hansen <dave.hansen@linux.intel.com> Tested-by: NDave Hansen <dave.hansen@linux.intel.com> Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Ingo Molnar <mingo@elte.hu> Cc: Minchan Kim <minchan@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Dave Young 提交于
There are two kexec load syscalls, kexec_load another and kexec_file_load. kexec_file_load has been splited as kernel/kexec_file.c. In this patch I split kexec_load syscall code to kernel/kexec.c. And add a new kconfig option KEXEC_CORE, so we can disable kexec_load and use kexec_file_load only, or vice verse. The original requirement is from Ted Ts'o, he want kexec kernel signature being checked with CONFIG_KEXEC_VERIFY_SIG enabled. But kexec-tools use kexec_load syscall can bypass the checking. Vivek Goyal proposed to create a common kconfig option so user can compile in only one syscall for loading kexec kernel. KEXEC/KEXEC_FILE selects KEXEC_CORE so that old config files still work. Because there's general code need CONFIG_KEXEC_CORE, so I updated all the architecture Kconfig with a new option KEXEC_CORE, and let KEXEC selects KEXEC_CORE in arch Kconfig. Also updated general kernel code with to kexec_load syscall. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: NDave Young <dyoung@redhat.com> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Petr Tesarik <ptesarik@suse.cz> Cc: Theodore Ts'o <tytso@mit.edu> Cc: Josh Boyer <jwboyer@fedoraproject.org> Cc: David Howells <dhowells@redhat.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Dave Young 提交于
Split kexec_file syscall related code to another file kernel/kexec_file.c so that the #ifdef CONFIG_KEXEC_FILE in kexec.c can be dropped. Sharing variables and functions are moved to kernel/kexec_internal.h per suggestion from Vivek and Petr. [akpm@linux-foundation.org: fix bisectability] [akpm@linux-foundation.org: declare the various arch_kexec functions] [akpm@linux-foundation.org: fix build] Signed-off-by: NDave Young <dyoung@redhat.com> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Petr Tesarik <ptesarik@suse.cz> Cc: Theodore Ts'o <tytso@mit.edu> Cc: Josh Boyer <jwboyer@fedoraproject.org> Cc: David Howells <dhowells@redhat.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Andy Shevchenko 提交于
This introduces a new helper and switches current users to use it. All patches are compiled tested. kmemleak is tested via its own test suite. This patch (of 6): The new seq_hex_dump() is a complete analogue of print_hex_dump(). We have few users of this functionality already. It allows to reduce their codebase. Signed-off-by: NAndy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Joe Perches <joe@perches.com> Cc: Tadeusz Struk <tadeusz.struk@intel.com> Cc: Helge Deller <deller@gmx.de> Cc: Ingo Tuchscherer <ingo.tuchscherer@de.ibm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Vladimir Kondratiev <qca_vkondrat@qca.qualcomm.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Frederic Weisbecker 提交于
We need to launch the usermodehelper kernel threads with the widest affinity and this is partly why we use khelper. This workqueue has unbound properties and thus a wide affinity inherited by all its children. Now khelper also has special properties that we aren't much interested in: ordered and singlethread. There is really no need about ordering as all we do is creating kernel threads. This can be done concurrently. And singlethread is a useless limitation as well. The workqueue engine already proposes generic unbound workqueues that don't share these useless properties and handle well parallel jobs. The only worrysome specific is their affinity to the node of the current CPU. It's fine for creating the usermodehelper kernel threads but those inherit this affinity for longer jobs such as requesting modules. This patch proposes to use these node affine unbound workqueues assuming that a node is sufficient to handle several parallel usermodehelper requests. Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com> Cc: Rik van Riel <riel@redhat.com> Reviewed-by: NOleg Nesterov <oleg@redhat.com> Cc: Christoph Lameter <cl@linux.com> Cc: Tejun Heo <tj@kernel.org> Cc: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Kees Cook 提交于
To further clarify the purpose of the "esc" argument, rename it to "only" to reflect that it is a limit, not a list of additional characters to escape. Signed-off-by: NKees Cook <keescook@chromium.org> Suggested-by: NRasmus Villemoes <linux@rasmusvillemoes.dk> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Linus Walleij 提交于
print_hex_dump_debug() is likely supposed to be analogous to pr_debug() or dev_dbg() & friends. Currently it will adhere to dynamic debug, but will not stub out prints if CONFIG_DEBUG is not set. Let's make it do the right thing, because I am tired of having my dmesg buffer full of hex dumps on production systems. Signed-off-by: NLinus Walleij <linus.walleij@linaro.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Jason A. Donenfeld 提交于
The other two implementations of pr_debug_ratelimited include pr_fmt, along with every other pr_* function. But pr_debug_ratelimited forgot to add it with the CONFIG_DYNAMIC_DEBUG implementation. This patch unifies the behavior. Signed-off-by: NJason A. Donenfeld <Jason@zx2c4.com> Cc: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Vasily Kulikov 提交于
Signed-off-by: NVasily Kulikov <segoon@openwall.com> Cc: Solar Designer <solar@openwall.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Vasily Kulikov 提交于
Poison pointer values should be small enough to find a room in non-mmap'able/hardly-mmap'able space. E.g. on x86 "poison pointer space" is located starting from 0x0. Given unprivileged users cannot mmap anything below mmap_min_addr, it should be safe to use poison pointers lower than mmap_min_addr. The current poison pointer values of LIST_POISON{1,2} might be too big for mmap_min_addr values equal or less than 1 MB (common case, e.g. Ubuntu uses only 0x10000). There is little point to use such a big value given the "poison pointer space" below 1 MB is not yet exhausted. Changing it to a smaller value solves the problem for small mmap_min_addr setups. The values are suggested by Solar Designer: http://www.openwall.com/lists/oss-security/2015/05/02/6Signed-off-by: NVasily Kulikov <segoon@openwall.com> Cc: Solar Designer <solar@openwall.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Vladimir Davydov 提交于
As noted by Minchan, a benefit of reading idle flag from /proc/kpageflags is that one can easily filter dirty and/or unevictable pages while estimating the size of unused memory. Note that idle flag read from /proc/kpageflags may be stale in case the page was accessed via a PTE, because it would be too costly to iterate over all page mappings on each /proc/kpageflags read to provide an up-to-date value. To make sure the flag is up-to-date one has to read /sys/kernel/mm/page_idle/bitmap first. Signed-off-by: NVladimir Davydov <vdavydov@parallels.com> Reviewed-by: NAndres Lagar-Cavilla <andreslc@google.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Greg Thelen <gthelen@google.com> Cc: Michel Lespinasse <walken@google.com> Cc: David Rientjes <rientjes@google.com> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Vladimir Davydov 提交于
Knowing the portion of memory that is not used by a certain application or memory cgroup (idle memory) can be useful for partitioning the system efficiently, e.g. by setting memory cgroup limits appropriately. Currently, the only means to estimate the amount of idle memory provided by the kernel is /proc/PID/{clear_refs,smaps}: the user can clear the access bit for all pages mapped to a particular process by writing 1 to clear_refs, wait for some time, and then count smaps:Referenced. However, this method has two serious shortcomings: - it does not count unmapped file pages - it affects the reclaimer logic To overcome these drawbacks, this patch introduces two new page flags, Idle and Young, and a new sysfs file, /sys/kernel/mm/page_idle/bitmap. A page's Idle flag can only be set from userspace by setting bit in /sys/kernel/mm/page_idle/bitmap at the offset corresponding to the page, and it is cleared whenever the page is accessed either through page tables (it is cleared in page_referenced() in this case) or using the read(2) system call (mark_page_accessed()). Thus by setting the Idle flag for pages of a particular workload, which can be found e.g. by reading /proc/PID/pagemap, waiting for some time to let the workload access its working set, and then reading the bitmap file, one can estimate the amount of pages that are not used by the workload. The Young page flag is used to avoid interference with the memory reclaimer. A page's Young flag is set whenever the Access bit of a page table entry pointing to the page is cleared by writing to the bitmap file. If page_referenced() is called on a Young page, it will add 1 to its return value, therefore concealing the fact that the Access bit was cleared. Note, since there is no room for extra page flags on 32 bit, this feature uses extended page flags when compiled on 32 bit. [akpm@linux-foundation.org: fix build] [akpm@linux-foundation.org: kpageidle requires an MMU] [akpm@linux-foundation.org: decouple from page-flags rework] Signed-off-by: NVladimir Davydov <vdavydov@parallels.com> Reviewed-by: NAndres Lagar-Cavilla <andreslc@google.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Greg Thelen <gthelen@google.com> Cc: Michel Lespinasse <walken@google.com> Cc: David Rientjes <rientjes@google.com> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Vladimir Davydov 提交于
In the scope of the idle memory tracking feature, which is introduced by the following patch, we need to clear the referenced/accessed bit not only in primary, but also in secondary ptes. The latter is required in order to estimate wss of KVM VMs. At the same time we want to avoid flushing tlb, because it is quite expensive and it won't really affect the final result. Currently, there is no function for clearing pte young bit that would meet our requirements, so this patch introduces one. To achieve that we have to add a new mmu-notifier callback, clear_young, since there is no method for testing-and-clearing a secondary pte w/o flushing tlb. The new method is not mandatory and currently only implemented by KVM. Signed-off-by: NVladimir Davydov <vdavydov@parallels.com> Reviewed-by: NAndres Lagar-Cavilla <andreslc@google.com> Acked-by: NPaolo Bonzini <pbonzini@redhat.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Greg Thelen <gthelen@google.com> Cc: Michel Lespinasse <walken@google.com> Cc: David Rientjes <rientjes@google.com> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Vladimir Davydov 提交于
It is only used in mem_cgroup_try_charge, so fold it in and zap it. Signed-off-by: NVladimir Davydov <vdavydov@parallels.com> Reviewed-by: NAndres Lagar-Cavilla <andreslc@google.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Greg Thelen <gthelen@google.com> Cc: Michel Lespinasse <walken@google.com> Cc: David Rientjes <rientjes@google.com> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Vladimir Davydov 提交于
This patchset introduces a new user API for tracking user memory pages that have not been used for a given period of time. The purpose of this is to provide the userspace with the means of tracking a workload's working set, i.e. the set of pages that are actively used by the workload. Knowing the working set size can be useful for partitioning the system more efficiently, e.g. by tuning memory cgroup limits appropriately, or for job placement within a compute cluster. ==== USE CASES ==== The unified cgroup hierarchy has memory.low and memory.high knobs, which are defined as the low and high boundaries for the workload working set size. However, the working set size of a workload may be unknown or change in time. With this patch set, one can periodically estimate the amount of memory unused by each cgroup and tune their memory.low and memory.high parameters accordingly, therefore optimizing the overall memory utilization. Another use case is balancing workloads within a compute cluster. Knowing how much memory is not really used by a workload unit may help take a more optimal decision when considering migrating the unit to another node within the cluster. Also, as noted by Minchan, this would be useful for per-process reclaim (https://lwn.net/Articles/545668/). With idle tracking, we could reclaim idle pages only by smart user memory manager. ==== USER API ==== The user API consists of two new files: * /sys/kernel/mm/page_idle/bitmap. This file implements a bitmap where each bit corresponds to a page, indexed by PFN. When the bit is set, the corresponding page is idle. A page is considered idle if it has not been accessed since it was marked idle. To mark a page idle one should set the bit corresponding to the page by writing to the file. A value written to the file is OR-ed with the current bitmap value. Only user memory pages can be marked idle, for other page types input is silently ignored. Writing to this file beyond max PFN results in the ENXIO error. Only available when CONFIG_IDLE_PAGE_TRACKING is set. This file can be used to estimate the amount of pages that are not used by a particular workload as follows: 1. mark all pages of interest idle by setting corresponding bits in the /sys/kernel/mm/page_idle/bitmap 2. wait until the workload accesses its working set 3. read /sys/kernel/mm/page_idle/bitmap and count the number of bits set * /proc/kpagecgroup. This file contains a 64-bit inode number of the memory cgroup each page is charged to, indexed by PFN. Only available when CONFIG_MEMCG is set. This file can be used to find all pages (including unmapped file pages) accounted to a particular cgroup. Using /sys/kernel/mm/page_idle/bitmap, one can then estimate the cgroup working set size. For an example of using these files for estimating the amount of unused memory pages per each memory cgroup, please see the script attached below. ==== REASONING ==== The reason to introduce the new user API instead of using /proc/PID/{clear_refs,smaps} is that the latter has two serious drawbacks: - it does not count unmapped file pages - it affects the reclaimer logic The new API attempts to overcome them both. For more details on how it is achieved, please see the comment to patch 6. ==== PATCHSET STRUCTURE ==== The patch set is organized as follows: - patch 1 adds page_cgroup_ino() helper for the sake of /proc/kpagecgroup and patches 2-3 do related cleanup - patch 4 adds /proc/kpagecgroup, which reports cgroup ino each page is charged to - patch 5 introduces a new mmu notifier callback, clear_young, which is a lightweight version of clear_flush_young; it is used in patch 6 - patch 6 implements the idle page tracking feature, including the userspace API, /sys/kernel/mm/page_idle/bitmap - patch 7 exports idle flag via /proc/kpageflags ==== SIMILAR WORKS ==== Originally, the patch for tracking idle memory was proposed back in 2011 by Michel Lespinasse (see http://lwn.net/Articles/459269/). The main difference between Michel's patch and this one is that Michel implemented a kernel space daemon for estimating idle memory size per cgroup while this patch only provides the userspace with the minimal API for doing the job, leaving the rest up to the userspace. However, they both share the same idea of Idle/Young page flags to avoid affecting the reclaimer logic. ==== PERFORMANCE EVALUATION ==== SPECjvm2008 (https://www.spec.org/jvm2008/) was used to evaluate the performance impact introduced by this patch set. Three runs were carried out: - base: kernel without the patch - patched: patched kernel, the feature is not used - patched-active: patched kernel, 1 minute-period daemon is used for tracking idle memory For tracking idle memory, idlememstat utility was used: https://github.com/locker/idlememstat testcase base patched patched-active compiler 537.40 ( 0.00)% 532.26 (-0.96)% 538.31 ( 0.17)% compress 305.47 ( 0.00)% 301.08 (-1.44)% 300.71 (-1.56)% crypto 284.32 ( 0.00)% 282.21 (-0.74)% 284.87 ( 0.19)% derby 411.05 ( 0.00)% 413.44 ( 0.58)% 412.07 ( 0.25)% mpegaudio 189.96 ( 0.00)% 190.87 ( 0.48)% 189.42 (-0.28)% scimark.large 46.85 ( 0.00)% 46.41 (-0.94)% 47.83 ( 2.09)% scimark.small 412.91 ( 0.00)% 415.41 ( 0.61)% 421.17 ( 2.00)% serial 204.23 ( 0.00)% 213.46 ( 4.52)% 203.17 (-0.52)% startup 36.76 ( 0.00)% 35.49 (-3.45)% 35.64 (-3.05)% sunflow 115.34 ( 0.00)% 115.08 (-0.23)% 117.37 ( 1.76)% xml 620.55 ( 0.00)% 619.95 (-0.10)% 620.39 (-0.03)% composite 211.50 ( 0.00)% 211.15 (-0.17)% 211.67 ( 0.08)% time idlememstat: 17.20user 65.16system 2:15:23elapsed 1%CPU (0avgtext+0avgdata 8476maxresident)k 448inputs+40outputs (1major+36052minor)pagefaults 0swaps ==== SCRIPT FOR COUNTING IDLE PAGES PER CGROUP ==== #! /usr/bin/python # import os import stat import errno import struct CGROUP_MOUNT = "/sys/fs/cgroup/memory" BUFSIZE = 8 * 1024 # must be multiple of 8 def get_hugepage_size(): with open("/proc/meminfo", "r") as f: for s in f: k, v = s.split(":") if k == "Hugepagesize": return int(v.split()[0]) * 1024 PAGE_SIZE = os.sysconf("SC_PAGE_SIZE") HUGEPAGE_SIZE = get_hugepage_size() def set_idle(): f = open("/sys/kernel/mm/page_idle/bitmap", "wb", BUFSIZE) while True: try: f.write(struct.pack("Q", pow(2, 64) - 1)) except IOError as err: if err.errno == errno.ENXIO: break raise f.close() def count_idle(): f_flags = open("/proc/kpageflags", "rb", BUFSIZE) f_cgroup = open("/proc/kpagecgroup", "rb", BUFSIZE) with open("/sys/kernel/mm/page_idle/bitmap", "rb", BUFSIZE) as f: while f.read(BUFSIZE): pass # update idle flag idlememsz = {} while True: s1, s2 = f_flags.read(8), f_cgroup.read(8) if not s1 or not s2: break flags, = struct.unpack('Q', s1) cgino, = struct.unpack('Q', s2) unevictable = (flags >> 18) & 1 huge = (flags >> 22) & 1 idle = (flags >> 25) & 1 if idle and not unevictable: idlememsz[cgino] = idlememsz.get(cgino, 0) + \ (HUGEPAGE_SIZE if huge else PAGE_SIZE) f_flags.close() f_cgroup.close() return idlememsz if __name__ == "__main__": print "Setting the idle flag for each page..." set_idle() raw_input("Wait until the workload accesses its working set, " "then press Enter") print "Counting idle pages..." idlememsz = count_idle() for dir, subdirs, files in os.walk(CGROUP_MOUNT): ino = os.stat(dir)[stat.ST_INO] print dir + ": " + str(idlememsz.get(ino, 0) / 1024) + " kB" ==== END SCRIPT ==== This patch (of 8): Add page_cgroup_ino() helper to memcg. This function returns the inode number of the closest online ancestor of the memory cgroup a page is charged to. It is required for exporting information about which page is charged to which cgroup to userspace, which will be introduced by a following patch. Signed-off-by: NVladimir Davydov <vdavydov@parallels.com> Reviewed-by: NAndres Lagar-Cavilla <andreslc@google.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Greg Thelen <gthelen@google.com> Cc: Michel Lespinasse <walken@google.com> Cc: David Rientjes <rientjes@google.com> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Dan Streetman 提交于
This series makes creation of the zpool and compressor dynamic, so that they can be changed at runtime. This makes using/configuring zswap easier, as before this zswap had to be configured at boot time, using boot params. This uses a single list to track both the zpool and compressor together, although Seth had mentioned an alternative which is to track the zpools and compressors using separate lists. In the most common case, only a single zpool and single compressor, using one list is slightly simpler than using two lists, and for the uncommon case of multiple zpools and/or compressors, using one list is slightly less simple (and uses slightly more memory, probably) than using two lists. This patch (of 4): Add zpool_has_pool() function, indicating if the specified type of zpool is available (i.e. zsmalloc or zbud). This allows checking if a pool is available, without actually trying to allocate it, similar to crypto_has_alg(). This is used by a following patch to zswap that enables the dynamic runtime creation of zswap zpools. Signed-off-by: NDan Streetman <ddstreet@ieee.org> Acked-by: NSeth Jennings <sjennings@variantweb.net> Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 10 9月, 2015 5 次提交
-
-
由 Henrik Austad 提交于
IEEE 1722 describes AVB (later renamed to TSN - Time Sensitive Networking), a protocol, encapsualtion and synchronization to utilize standard networks for audio/video (and later other time-sensitive) streams. This standard uses ethertype 0x22F0. http://standards.ieee.org/develop/regauth/ethertype/eth.txt This is a respin of a previous patch ("ether: add AVB frame type ETH_P_AVB") CC: "David S. Miller" <davem@davemloft.net> CC: netdev@vger.kernel.org CC: linux-api@vger.kernel.org CC: linux-kernel@vger.kernel.org Signed-off-by: NHenrik Austad <henrik@austad.us> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Mike Frysinger 提交于
The linux/audit.h header uses EM_MICROBLAZE in order to define AUDIT_ARCH_MICROBLAZE, but it's only available in the microblaze asm headers. Move it to the common elf-em.h header so that the define can be used on non-microblaze systems. Otherwise we get build errors that EM_MICROBLAZE isn't defined when we try to use the AUDIT_ARCH_MICROBLAZE symbol. Signed-off-by: NMike Frysinger <vapier@gentoo.org> Signed-off-by: NMichal Simek <michal.simek@xilinx.com>
-
由 Daniel Borkmann 提交于
When netlink mmap on receive side is the consumer of nf queue data, it can happen that in some edge cases, we write skb shared info into the user space mmap buffer: Assume a possible rx ring frame size of only 4096, and the network skb, which is being zero-copied into the netlink skb, contains page frags with an overall skb->len larger than the linear part of the netlink skb. skb_zerocopy(), which is generic and thus not aware of the fact that shared info cannot be accessed for such skbs then tries to write and fill frags, thus leaking kernel data/pointers and in some corner cases possibly writing out of bounds of the mmap area (when filling the last slot in the ring buffer this way). I.e. the ring buffer slot is then of status NL_MMAP_STATUS_VALID, has an advertised length larger than 4096, where the linear part is visible at the slot beginning, and the leaked sizeof(struct skb_shared_info) has been written to the beginning of the next slot (also corrupting the struct nl_mmap_hdr slot header incl. status etc), since skb->end points to skb->data + ring->frame_size - NL_MMAP_HDRLEN. The fix adds and lets __netlink_alloc_skb() take the actual needed linear room for the network skb + meta data into account. It's completely irrelevant for non-mmaped netlink sockets, but in case mmap sockets are used, it can be decided whether the available skb_tailroom() is really large enough for the buffer, or whether it needs to internally fallback to a normal alloc_skb(). >From nf queue side, the information whether the destination port is an mmap RX ring is not really available without extra port-to-socket lookup, thus it can only be determined in lower layers i.e. when __netlink_alloc_skb() is called that checks internally for this. I chose to add the extra ldiff parameter as mmap will then still work: We have data_len and hlen in nfqnl_build_packet_message(), data_len is the full length (capped at queue->copy_range) for skb_zerocopy() and hlen some possible part of data_len that needs to be copied; the rem_len variable indicates the needed remaining linear mmap space. The only other workaround in nf queue internally would be after allocation time by f.e. cap'ing the data_len to the skb_tailroom() iff we deal with an mmap skb, but that would 1) expose the fact that we use a mmap skb to upper layers, and 2) trim the skb where we otherwise could just have moved the full skb into the normal receive queue. After the patch, in my test case the ring slot doesn't fit and therefore shows NL_MMAP_STATUS_COPY, where a full skb carries all the data and thus needs to be picked up via recv(). Fixes: 3ab1f683 ("nfnetlink: add support for memory mapped netlink") Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
Add Microchip LAN88XX phy driver for phylib. Signed-off-by: NWoojung Huh <woojung.huh@microchip.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Phil Sutter 提交于
This switches IPv6 policy routing to use the shared fib_default_rule_pref() function of IPv4 and DECnet. It is also used in multicast routing for IPv4 as well as IPv6. The motivation for this patch is a complaint about iproute2 behaving inconsistent between IPv4 and IPv6 when adding policy rules: Formerly, IPv6 rules were assigned a fixed priority of 0x3FFF whereas for IPv4 the assigned priority value was decreased with each rule added. Since then all users of the default_pref field have been converted to assign the generic function fib_default_rule_pref(), fib_nl_newrule() may just use it directly instead. Therefore get rid of the function pointer altogether and make fib_default_rule_pref() static, as it's not used outside fib_rules.c anymore. Signed-off-by: NPhil Sutter <phil@nwl.cc> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 09 9月, 2015 8 次提交
-
-
Add dev_pm_opp_get_suspend_opp() helper to obtain suspend opp. Acked-by: NViresh Kumar <viresh.kumar@linaro.org> Signed-off-by: NBartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com> Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
-
由 Krzysztof Kozlowski 提交于
The structure zbud_ops is not modified so make the pointer to it a pointer to const. Signed-off-by: NKrzysztof Kozlowski <k.kozlowski@samsung.com> Acked-by: NDan Streetman <ddstreet@ieee.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Krzysztof Kozlowski 提交于
The structure zpool_ops is not modified so make the pointer to it a pointer to const. Signed-off-by: NKrzysztof Kozlowski <k.kozlowski@samsung.com> Acked-by: NDan Streetman <ddstreet@ieee.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Dmitry Safonov 提交于
zswap_get_swap_cache_page and read_swap_cache_async have pretty much the same code with only significant difference in return value and usage of swap_readpage. I a helper __read_swap_cache_async() with the common code. Behavior change: now zswap_get_swap_cache_page will use radix_tree_maybe_preload instead radix_tree_preload. Looks like, this wasn't changed only by the reason of code duplication. Signed-off-by: NDmitry Safonov <0x7f454c46@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Vladimir Davydov <vdavydov@parallels.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Hugh Dickins <hughd@google.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Tejun Heo <tj@kernel.org> Cc: Jens Axboe <axboe@fb.com> Cc: Christoph Hellwig <hch@lst.de> Cc: David Herrmann <dh.herrmann@gmail.com> Cc: Seth Jennings <sjennings@variantweb.net> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Sergey Senozhatsky 提交于
Compaction returns back to zram the number of migrated objects, which is quite uninformative -- we have objects of different sizes so user space cannot obtain any valuable data from that number. Change compaction to operate in terms of pages and return back to compaction issuer the number of pages that were freed during compaction. So from now on we will export more meaningful value in zram<id>/mm_stat -- the number of freed (compacted) pages. This requires: (a) a rename of `num_migrated' to 'pages_compacted' (b) a internal API change -- return first_page's fullness_group from putback_zspage(), so we know when putback_zspage() did free_zspage(). It helps us to account compaction stats correctly. Signed-off-by: NSergey Senozhatsky <sergey.senozhatsky@gmail.com> Acked-by: NMinchan Kim <minchan@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Sergey Senozhatsky 提交于
`zs_compact_control' accounts the number of migrated objects but it has a limited lifespan -- we lose it as soon as zs_compaction() returns back to zram. It worked fine, because (a) zram had it's own counter of migrated objects and (b) only zram could trigger compaction. However, this does not work for automatic pool compaction (not issued by zram). To account objects migrated during auto-compaction (issued by the shrinker) we need to store this number in zs_pool. Define a new `struct zs_pool_stats' structure to keep zs_pool's stats there. It provides only `num_migrated', as of this writing, but it surely can be extended. A new zsmalloc zs_pool_stats() symbol exports zs_pool's stats back to caller. Use zs_pool_stats() in zram and remove `num_migrated' from zram_stats. Signed-off-by: NSergey Senozhatsky <sergey.senozhatsky@gmail.com> Suggested-by: NMinchan Kim <minchan@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Vlastimil Babka 提交于
alloc_pages_node() might fail when called with NUMA_NO_NODE and __GFP_THISNODE on a CPU belonging to a memoryless node. To make the local-node fallback more robust and prevent such situations, use numa_mem_id(), which was introduced for similar scenarios in the slab context. Suggested-by: NChristoph Lameter <cl@linux.com> Signed-off-by: NVlastimil Babka <vbabka@suse.cz> Acked-by: NDavid Rientjes <rientjes@google.com> Acked-by: NMel Gorman <mgorman@techsingularity.net> Acked-by: NChristoph Lameter <cl@linux.com> Cc: Michal Hocko <mhocko@suse.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Vlastimil Babka 提交于
Perform the same debug checks in alloc_pages_node() as are done in __alloc_pages_node(), by making the former function a wrapper of the latter one. In addition to better diagnostics in DEBUG_VM builds for situations which have been already fatal (e.g. out-of-bounds node id), there are two visible changes for potential existing buggy callers of alloc_pages_node(): - calling alloc_pages_node() with any negative nid (e.g. due to arithmetic overflow) was treated as passing NUMA_NO_NODE and fallback to local node was applied. This will now be fatal. - calling alloc_pages_node() with an offline node will now be checked for DEBUG_VM builds. Since it's not fatal if the node has been previously online, and this patch may expose some existing buggy callers, change the VM_BUG_ON in __alloc_pages_node() to VM_WARN_ON. Signed-off-by: NVlastimil Babka <vbabka@suse.cz> Acked-by: NDavid Rientjes <rientjes@google.com> Acked-by: NJohannes Weiner <hannes@cmpxchg.org> Acked-by: NChristoph Lameter <cl@linux.com> Acked-by: NMichal Hocko <mhocko@suse.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-