- 30 5月, 2012 1 次提交
-
-
由 Andrea Arcangeli 提交于
When holding the mmap_sem for reading, pmd_offset_map_lock should only run on a pmd_t that has been read atomically from the pmdp pointer, otherwise we may read only half of it leading to this crash. PID: 11679 TASK: f06e8000 CPU: 3 COMMAND: "do_race_2_panic" #0 [f06a9dd8] crash_kexec at c049b5ec #1 [f06a9e2c] oops_end at c083d1c2 #2 [f06a9e40] no_context at c0433ded #3 [f06a9e64] bad_area_nosemaphore at c043401a #4 [f06a9e6c] __do_page_fault at c0434493 #5 [f06a9eec] do_page_fault at c083eb45 #6 [f06a9f04] error_code (via page_fault) at c083c5d5 EAX: 01fb470c EBX: fff35000 ECX: 00000003 EDX: 00000100 EBP: 00000000 DS: 007b ESI: 9e201000 ES: 007b EDI: 01fb4700 GS: 00e0 CS: 0060 EIP: c083bc14 ERR: ffffffff EFLAGS: 00010246 #7 [f06a9f38] _spin_lock at c083bc14 #8 [f06a9f44] sys_mincore at c0507b7d #9 [f06a9fb0] system_call at c083becd start len EAX: ffffffda EBX: 9e200000 ECX: 00001000 EDX: 6228537f DS: 007b ESI: 00000000 ES: 007b EDI: 003d0f00 SS: 007b ESP: 62285354 EBP: 62285388 GS: 0033 CS: 0073 EIP: 00291416 ERR: 000000da EFLAGS: 00000286 This should be a longstanding bug affecting x86 32bit PAE without THP. Only archs with 64bit large pmd_t and 32bit unsigned long should be affected. With THP enabled the barrier() in pmd_none_or_trans_huge_or_clear_bad() would partly hide the bug when the pmd transition from none to stable, by forcing a re-read of the *pmd in pmd_offset_map_lock, but when THP is enabled a new set of problem arises by the fact could then transition freely in any of the none, pmd_trans_huge or pmd_trans_stable states. So making the barrier in pmd_none_or_trans_huge_or_clear_bad() unconditional isn't good idea and it would be a flakey solution. This should be fully fixed by introducing a pmd_read_atomic that reads the pmd in order with THP disabled, or by reading the pmd atomically with cmpxchg8b with THP enabled. Luckily this new race condition only triggers in the places that must already be covered by pmd_none_or_trans_huge_or_clear_bad() so the fix is localized there but this bug is not related to THP. NOTE: this can trigger on x86 32bit systems with PAE enabled with more than 4G of ram, otherwise the high part of the pmd will never risk to be truncated because it would be zero at all times, in turn so hiding the SMP race. This bug was discovered and fully debugged by Ulrich, quote: ---- [..] pmd_none_or_trans_huge_or_clear_bad() loads the content of edx and eax. 496 static inline int pmd_none_or_trans_huge_or_clear_bad(pmd_t *pmd) 497 { 498 /* depend on compiler for an atomic pmd read */ 499 pmd_t pmdval = *pmd; // edi = pmd pointer 0xc0507a74 <sys_mincore+548>: mov 0x8(%esp),%edi ... // edx = PTE page table high address 0xc0507a84 <sys_mincore+564>: mov 0x4(%edi),%edx ... // eax = PTE page table low address 0xc0507a8e <sys_mincore+574>: mov (%edi),%eax [..] Please note that the PMD is not read atomically. These are two "mov" instructions where the high order bits of the PMD entry are fetched first. Hence, the above machine code is prone to the following race. - The PMD entry {high|low} is 0x0000000000000000. The "mov" at 0xc0507a84 loads 0x00000000 into edx. - A page fault (on another CPU) sneaks in between the two "mov" instructions and instantiates the PMD. - The PMD entry {high|low} is now 0x00000003fda38067. The "mov" at 0xc0507a8e loads 0xfda38067 into eax. ---- Reported-by: NUlrich Obergfell <uobergfe@redhat.com> Signed-off-by: NAndrea Arcangeli <aarcange@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Hugh Dickins <hughd@google.com> Cc: Larry Woodman <lwoodman@redhat.com> Cc: Petr Matousek <pmatouse@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 29 5月, 2012 1 次提交
-
-
由 Avi Kivity 提交于
Prevents build failures on non-KVM archs. Tested-by: NGeert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: NAvi Kivity <avi@redhat.com>
-
- 27 5月, 2012 1 次提交
-
-
由 Linus Torvalds 提交于
This changes the interfaces in <asm/word-at-a-time.h> to be a bit more complicated, but a lot more generic. In particular, it allows us to really do the operations efficiently on both little-endian and big-endian machines, pretty much regardless of machine details. For example, if you can rely on a fast population count instruction on your architecture, this will allow you to make your optimized <asm/word-at-a-time.h> file with that. NOTE! The "generic" version in include/asm-generic/word-at-a-time.h is not truly generic, it actually only works on big-endian. Why? Because on little-endian the generic algorithms are wasteful, since you can inevitably do better. The x86 implementation is an example of that. (The only truly non-generic part of the asm-generic implementation is the "find_zero()" function, and you could make a little-endian version of it. And if the Kbuild infrastructure allowed us to pick a particular header file, that would be lovely) The <asm/word-at-a-time.h> functions are as follows: - WORD_AT_A_TIME_CONSTANTS: specific constants that the algorithm uses. - has_zero(): take a word, and determine if it has a zero byte in it. It gets the word, the pointer to the constant pool, and a pointer to an intermediate "data" field it can set. This is the "quick-and-dirty" zero tester: it's what is run inside the hot loops. - "prep_zero_mask()": take the word, the data that has_zero() produced, and the constant pool, and generate an *exact* mask of which byte had the first zero. This is run directly *outside* the loop, and allows the "has_zero()" function to answer the "is there a zero byte" question without necessarily getting exactly *which* byte is the first one to contain a zero. If you do multiple byte lookups concurrently (eg "hash_name()", which looks for both NUL and '/' bytes), after you've done the prep_zero_mask() phase, the result of those can be or'ed together to get the "either or" case. - The result from "prep_zero_mask()" can then be fed into "find_zero()" (to find the byte offset of the first byte that was zero) or into "zero_bytemask()" (to find the bytemask of the bytes preceding the zero byte). The existence of zero_bytemask() is optional, and is not necessary for the normal string routines. But dentry name hashing needs it, so if you enable DENTRY_WORD_AT_A_TIME you need to expose it. This changes the generic strncpy_from_user() function and the dentry hashing functions to use these modified word-at-a-time interfaces. This gets us back to the optimized state of the x86 strncpy that we lost in the previous commit when moving over to the generic version. Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 26 5月, 2012 1 次提交
-
-
由 Chris Metcalf 提交于
The change adds some infrastructure for managing tile pmd's more generally, using pte_pmd() and pmd_pte() methods to translate pmd values to and from ptes, since on TILEPro a pmd is really just a nested structure holding a pgd (aka pte). Several existing pmd methods are moved into this framework, and a whole raft of additional pmd accessors are defined that are used by the transparent hugepage framework. The tile PTE now has a "client2" bit. The bit is used to indicate a transparent huge page is in the process of being split into subpages. This change also fixes a generic bug where the return value of the generic pmdp_splitting_flush() was incorrect. Signed-off-by: NChris Metcalf <cmetcalf@tilera.com>
-
- 21 5月, 2012 3 次提交
-
-
由 Paul Gortmaker 提交于
There are two functions in this asm-generic file. Looking at other arch which do not use the generic version, these two fcns are within an #ifdef __KERNEL__ block, so make the generic one consistent with those. Signed-off-by: NPaul Gortmaker <paul.gortmaker@windriver.com> Signed-off-by: NAvi Kivity <avi@redhat.com>
-
由 Marek Szyprowski 提交于
The Contiguous Memory Allocator is a set of helper functions for DMA mapping framework that improves allocations of contiguous memory chunks. CMA grabs memory on system boot, marks it with MIGRATE_CMA migrate type and gives back to the system. Kernel is allowed to allocate only movable pages within CMA's managed memory so that it can be used for example for page cache when DMA mapping do not use it. On dma_alloc_from_contiguous() request such pages are migrated out of CMA area to free required contiguous block and fulfill the request. This allows to allocate large contiguous chunks of memory at any time assuming that there is enough free memory available in the system. This code is heavily based on earlier works by Michal Nazarewicz. Signed-off-by: NMarek Szyprowski <m.szyprowski@samsung.com> Signed-off-by: NKyungmin Park <kyungmin.park@samsung.com> Signed-off-by: NMichal Nazarewicz <mina86@mina86.com> Acked-by: NArnd Bergmann <arnd@arndb.de> Tested-by: NRob Clark <rob.clark@linaro.org> Tested-by: NOhad Ben-Cohen <ohad@wizery.com> Tested-by: NBenjamin Gaignard <benjamin.gaignard@linaro.org> Tested-by: NRobert Nelson <robertcnelson@gmail.com> Tested-by: NBarry Song <Baohua.Song@csr.com>
-
由 Marek Szyprowski 提交于
Add a common helper for dma-mapping core for mapping a coherent buffer to userspace. Reported-by: NSubash Patel <subashrp@gmail.com> Signed-off-by: NMarek Szyprowski <m.szyprowski@samsung.com> Acked-by: NKyungmin Park <kyungmin.park@samsung.com> Tested-By: NSubash Patel <subash.ramaswamy@linaro.org>
-
- 19 5月, 2012 3 次提交
-
-
由 Grant Likely 提交于
Commit 3d0f7cf0 "gpio: Adjust of_xlate API to support multiple GPIO chips" changed the api of gpiochip_find to drop const from the data parameter of the match hook, but didn't also drop const from data causing a build warning. Signed-off-by: NGrant Likely <grant.likely@secretlab.ca>
-
由 Grant Likely 提交于
This patch changes the of_xlate API to make it possible for multiple gpio_chips to refer to the same device tree node. This is useful for banked GPIO controllers that use multiple gpio_chips for a single device. With this change the core code will try calling of_xlate on each gpio_chip that references the device_node and will return the gpio number for the first one to return 'true'. Tested-by: NRoland Stigge <stigge@antcom.de> Acked-by: NArnd Bergmann <arnd@arndb.de> Signed-off-by: NGrant Likely <grant.likely@secretlab.ca>
-
由 Mark Brown 提交于
Allow drivers to use the modern request and configure idiom together with devres. As with plain gpio_request() and gpio_request_one() we can't implement the old school version in terms of _one() as this would force the explicit selection of a direction in gpio_request() which could break systems if we pick the wrong one. Implementing devm_gpio_request_one() in terms of devm_gpio_request() would needlessly complicate things or lead to duplication from the unmanaged version depending on how it's done. Signed-off-by: NMark Brown <broonie@opensource.wolfsonmicro.com> Acked-by: NLinus Walleij <linus.walleij@linaro.org> Signed-off-by: NGrant Likely <grant.likely@secretlab.ca>
-
- 17 5月, 2012 1 次提交
-
-
由 Steven Rostedt 提交于
Instead of just sorting the ip's of the functions per ftrace page, sort the entire list before adding them to the ftrace pages. This will allow the bsearch algorithm to be sped up as it can also sort by pages, not just records within a page. Signed-off-by: NSteven Rostedt <rostedt@goodmis.org>
-
- 01 5月, 2012 2 次提交
-
-
由 Bjorn Helgaas 提交于
A PCIe downstream port is a P2P bridge. Its secondary interface is a link that should lead only to device 0 (unless ARI is enabled)[1], so we don't probe for non-zero device numbers. Some Stratus ftServer systems have a PCIe downstream port (02:00.0) that leads to both an upstream port (03:00.0) and a downstream port (03:01.0), and 03:01.0 has important devices below it: [0000:02]-+-00.0-[03-3c]--+-00.0-[04-09]--... \-01.0-[0a-0d]--+-[USB] +-[NIC] +-... Previously, we didn't enumerate device 03:01.0, so USB and the network didn't work. This patch adds a DMI quirk to scan all device numbers, not just 0, below a downstream port. Based on a patch by Prarit Bhargava. [1] PCIe spec r3.0, sec 7.3.1 CC: Myron Stowe <mstowe@redhat.com> CC: Don Dutile <ddutile@redhat.com> CC: James Paradis <james.paradis@stratus.com> CC: Matthew Wilcox <matthew.r.wilcox@intel.com> CC: Jesse Barnes <jbarnes@virtuousgeek.org> CC: Prarit Bhargava <prarit@redhat.com> Signed-off-by: NBjorn Helgaas <bhelgaas@google.com>
-
由 H. Peter Anvin 提交于
<asm-generic/statfs.h> is exported to userspace, so using BITS_PER_LONG is invalid. We need to use __BITS_PER_LONG instead. This is kernel bugzilla 43165. Reported-by: NH.J. Lu <hjl.tools@gmail.com> Signed-off-by: NH. Peter Anvin <hpa@linux.intel.com> Link: http://lkml.kernel.org/r/1335465916-16965-1-git-send-email-hpa@linux.intel.comAcked-by: NArnd Bergmann <arnd@arndb.de> Cc: <stable@vger.kernel.org>
-
- 24 4月, 2012 1 次提交
-
-
由 H. Peter Anvin 提交于
For the particular issue of x32, which shares code with i386 in the handling of compat_siginfo_t, the use of a 64-bit clock_t bumps the sigchld structure out of alignment, which triggers a messy cascade of padding. This was already handled on the kernel compat side, but it needs handling on the user space side, which uses the generic header. To make that possible: 1. Allow __kernel_clock_t to be overridden in struct siginfo; 2. Allow there to be attributes added to struct siginfo. Reported-by: NH.J. Lu <hjl.rools@gmail.com> Cc: Bruce J. Beare <bruce.j.beare@intel.com> Cc: Arnd Bergmann <arnd@arndb.de> Link: http://lkml.kernel.org/r/CAMe9rOqF6Kh6-NK7oP0Fpzkd4SBAWU%2BG53hwBbSD4iA2UzyxuA@mail.gmail.comSigned-off-by: NH. Peter Anvin <hpa@linux.intel.com>
-
- 21 4月, 2012 1 次提交
-
-
由 Marcelo Tosatti 提交于
Needed by kvm_para_has_feature(). Reported-by: NStephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: NMarcelo Tosatti <mtosatti@redhat.com>
-
- 14 4月, 2012 3 次提交
-
-
由 Will Drewry 提交于
Adds a new return value to seccomp filters that triggers a SIGSYS to be delivered with the new SYS_SECCOMP si_code. This allows in-process system call emulation, including just specifying an errno or cleanly dumping core, rather than just dying. Suggested-by: NMarkus Gutschke <markus@chromium.org> Suggested-by: NJulien Tinnes <jln@chromium.org> Signed-off-by: NWill Drewry <wad@chromium.org> Acked-by: NEric Paris <eparis@redhat.com> v18: - acked-by, rebase - don't mention secure_computing_int() anymore v15: - use audit_seccomp/skip - pad out error spacing; clean up switch (indan@nul.nu) v14: - n/a v13: - rebase on to 88ebdda6 v12: - rebase on to linux-next v11: - clarify the comment (indan@nul.nu) - s/sigtrap/sigsys v10: - use SIGSYS, syscall_get_arch, updates arch/Kconfig note suggested-by (though original suggestion had other behaviors) v9: - changes to SIGILL v8: - clean up based on changes to dependent patches v7: - introduction Signed-off-by: NJames Morris <james.l.morris@oracle.com>
-
由 Will Drewry 提交于
This change enables SIGSYS, defines _sigfields._sigsys, and adds x86 (compat) arch support. _sigsys defines fields which allow a signal handler to receive the triggering system call number, the relevant AUDIT_ARCH_* value for that number, and the address of the callsite. SIGSYS is added to the SYNCHRONOUS_MASK because it is desirable for it to have setup_frame() called for it. The goal is to ensure that ucontext_t reflects the machine state from the time-of-syscall and not from another signal handler. The first consumer of SIGSYS would be seccomp filter. In particular, a filter program could specify a new return value, SECCOMP_RET_TRAP, which would result in the system call being denied and the calling thread signaled. This also means that implementing arch-specific support can be dependent upon HAVE_ARCH_SECCOMP_FILTER. Suggested-by: NH. Peter Anvin <hpa@zytor.com> Signed-off-by: NWill Drewry <wad@chromium.org> Acked-by: NSerge Hallyn <serge.hallyn@canonical.com> Reviewed-by: NH. Peter Anvin <hpa@zytor.com> Acked-by: NEric Paris <eparis@redhat.com> v18: - added acked by, rebase v17: - rebase and reviewed-by addition v14: - rebase/nochanges v13: - rebase on to 88ebdda6 v12: - reworded changelog (oleg@redhat.com) v11: - fix dropped words in the change description - added fallback copy_siginfo support. - added __ARCH_SIGSYS define to allow stepped arch support. v10: - first version based on suggestion Signed-off-by: NJames Morris <james.l.morris@oracle.com>
-
由 Will Drewry 提交于
Adds a stub for a function that will return the AUDIT_ARCH_* value appropriate to the supplied task based on the system call convention. For audit's use, the value can generally be hard-coded at the audit-site. However, for other functionality not inlined into syscall entry/exit, this makes that information available. seccomp_filter is the first planned consumer and, as such, the comment indicates a tie to CONFIG_HAVE_ARCH_SECCOMP_FILTER. Suggested-by: NRoland McGrath <mcgrathr@chromium.org> Signed-off-by: NWill Drewry <wad@chromium.org> Acked-by: NSerge Hallyn <serge.hallyn@canonical.com> Acked-by: NEric Paris <eparis@redhat.com> v18: comment and change reword and rebase. v14: rebase/nochanges v13: rebase on to 88ebdda6 v12: rebase on to linux-next v11: fixed improper return type v10: introduced Signed-off-by: NJames Morris <james.l.morris@oracle.com>
-
- 08 4月, 2012 1 次提交
-
-
由 Eric B Munson 提交于
When a host stops or suspends a VM it will set a flag to show this. The watchdog will use these functions to determine if a softlockup is real, or the result of a suspended VM. Signed-off-by: NEric B Munson <emunson@mgebm.net> asm-generic changes Acked-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: NMarcelo Tosatti <mtosatti@redhat.com> Signed-off-by: NAvi Kivity <avi@redhat.com>
-
- 03 4月, 2012 1 次提交
-
-
由 Paul Gortmaker 提交于
Builds of the openrisc or1ksim_defconfig show the following: In file included from arch/openrisc/include/generated/asm/cmpxchg.h:1:0, from include/asm-generic/atomic.h:18, from arch/openrisc/include/generated/asm/atomic.h:1, from include/linux/atomic.h:4, from include/linux/dcache.h:4, from fs/notify/fsnotify.c:19: include/asm-generic/cmpxchg.h: In function '__xchg': include/asm-generic/cmpxchg.h:34:20: error: expected ')' before 'u8' include/asm-generic/cmpxchg.h:34:20: warning: type defaults to 'int' in type name and many more lines of similar errors. It seems specific to the or32 because most other platforms have an arch specific component that would have already included types.h ahead of time, but the o32 does not. Cc: Arnd Bergmann <arnd@arndb.de> Cc: Jonas Bonn <jonas@southpole.se> Signed-off-by: NPaul Gortmaker <paul.gortmaker@windriver.com> Acked-by: David Howells <dhowells@redhat.com Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 29 3月, 2012 8 次提交
-
-
由 David Howells 提交于
Delete all instances of asm/system.h as they should be redundant by this point. Signed-off-by: NDavid Howells <dhowells@redhat.com>
-
由 David Howells 提交于
Remove all #inclusions of asm/system.h preparatory to splitting and killing it. Performed with the following command: perl -p -i -e 's!^#\s*include\s*<asm/system[.]h>.*\n!!' `grep -Irl '^#\s*include\s*<asm/system[.]h>' *` Signed-off-by: NDavid Howells <dhowells@redhat.com>
-
由 David Howells 提交于
asm/system.h is a cause of circular dependency problems because it contains commonly used primitive stuff like barrier definitions and uncommonly used stuff like switch_to() that might require MMU definitions. asm/system.h has been disintegrated by this point on all arches into the following common segments: (1) asm/barrier.h Moved memory barrier definitions here. (2) asm/cmpxchg.h Moved xchg() and cmpxchg() here. #included in asm/atomic.h. (3) asm/bug.h Moved die() and similar here. (4) asm/exec.h Moved arch_align_stack() here. (5) asm/elf.h Moved AT_VECTOR_SIZE_ARCH here. (6) asm/switch_to.h Moved switch_to() here. Signed-off-by: NDavid Howells <dhowells@redhat.com>
-
由 David Howells 提交于
Split arch_align_stack() out from asm-generic/system.h into its own header of asm-generic/exec.h as part of the asm/system.h disintegration. Signed-off-by: NDavid Howells <dhowells@redhat.com> Acked-by: NArnd Bergmann <arnd@arndb.de>
-
由 David Howells 提交于
Split the switch_to() wrapper out of asm-generic/system.h into its own asm-generic/system.h as part of the asm/system.h disintegration. Signed-off-by: NDavid Howells <dhowells@redhat.com> Acked-by: NArnd Bergmann <arnd@arndb.de>
-
由 David Howells 提交于
Move the asm-generic/system.h xchg() implementation to asm-generic/cmpxchg.h to simplify disintegration of asm/system.h. Signed-off-by: NDavid Howells <dhowells@redhat.com> Acked-by: NArnd Bergmann <arnd@arndb.de>
-
由 David Howells 提交于
Create asm-generic/barrier.h and move the barrier definitions from asm-generic/system.h to it. Signed-off-by: NDavid Howells <dhowells@redhat.com> Acked-by: NArnd Bergmann <arnd@arndb.de>
-
由 David Howells 提交于
Make asm-generic/cmpxchg.h #include asm-generic/cmpxchg-local.h as all arch files that #include the former also #include the latter. See: grep -rl asm-generic/cmpxchg-local[.]h arch/ | sort > b grep -rl asm-generic/cmpxchg[.]h arch/ | sort > a comm a b This simplifies the disintegration of asm-generic/system.h for arches that don't have their own. Signed-off-by: NDavid Howells <dhowells@redhat.com> Acked-by: NArnd Bergmann <arnd@arndb.de>
-
- 28 3月, 2012 1 次提交
-
-
由 Chris Metcalf 提交于
<asm-generic/unistd.h> was set up to use sys_sendfile() for the 32-bit compat API instead of sys_sendfile64(), but in fact the right thing to do is to use sys_sendfile64() in all cases. The 32-bit sendfile64() API in glibc uses the sendfile64 syscall, so it has to be capable of doing full 64-bit operations. But the sys_sendfile() kernel implementation has a MAX_NON_LFS test in it which explicitly limits the offset to 2^32. So, we need to use the sys_sendfile64() implementation in the kernel for this case. Cc: <stable@kernel.org> Acked-by: NArnd Bergmann <arnd@arndb.de> Signed-off-by: NChris Metcalf <cmetcalf@tilera.com>
-
- 26 3月, 2012 1 次提交
-
-
由 Pawel Moll 提交于
This patch adds a set of macros that can be used to declare kernel parameters to be parsed _before_ initcalls at a chosen level are executed. We rename the now-unused "flags" field of struct kernel_param as the level. It's signed, for when we use this for early params as well, in future. Linker macro collating init calls had to be modified in order to add additional symbols between levels that are later used by the init code to split the calls into blocks. Signed-off-by: NPawel Moll <pawel.moll@arm.com> Signed-off-by: NRusty Russell <rusty@rustcorp.com.au>
-
- 24 3月, 2012 2 次提交
-
-
由 Jason Baron 提交于
Since we no longer need the VM_ALWAYSDUMP flag, let's use the freed bit for 'VM_NODUMP' flag. The idea is is to add a new madvise() flag: MADV_DONTDUMP, which can be set by applications to specifically request memory regions which should not dump core. The specific application I have in mind is qemu: we can add a flag there that wouldn't dump all of guest memory when qemu dumps core. This flag might also be useful for security sensitive apps that want to absolutely make sure that parts of memory are not dumped. To clear the flag use: MADV_DODUMP. [akpm@linux-foundation.org: s/MADV_NODUMP/MADV_DONTDUMP/, s/MADV_CLEAR_NODUMP/MADV_DODUMP/, per Roland] [akpm@linux-foundation.org: fix up the architectures which broke] Signed-off-by: NJason Baron <jbaron@redhat.com> Acked-by: NRoland McGrath <roland@hack.frob.com> Cc: Chris Metcalf <cmetcalf@tilera.com> Cc: Avi Kivity <avi@redhat.com> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Richard Henderson <rth@twiddle.net> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Cc: Matt Turner <mattst88@gmail.com> Cc: "James E.J. Bottomley" <jejb@parisc-linux.org> Cc: Helge Deller <deller@gmx.de> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Jan Beulich 提交于
Due to the alignment of following variables, these typically consume more than just the single byte that 'bool' requires, and as there are a few hundred instances, the cache pollution (not so much the waste of memory) sums up. Put these variables into their own section, outside of any half way frequently used memory range. Do the same also to the __warned variable of rcu_lockdep_assert(). (Don't, however, include the ones used by printk_once() and alike, as they can potentially be hot.) Signed-off-by: NJan Beulich <jbeulich@suse.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 22 3月, 2012 1 次提交
-
-
由 Andrea Arcangeli 提交于
In some cases it may happen that pmd_none_or_clear_bad() is called with the mmap_sem hold in read mode. In those cases the huge page faults can allocate hugepmds under pmd_none_or_clear_bad() and that can trigger a false positive from pmd_bad() that will not like to see a pmd materializing as trans huge. It's not khugepaged causing the problem, khugepaged holds the mmap_sem in write mode (and all those sites must hold the mmap_sem in read mode to prevent pagetables to go away from under them, during code review it seems vm86 mode on 32bit kernels requires that too unless it's restricted to 1 thread per process or UP builds). The race is only with the huge pagefaults that can convert a pmd_none() into a pmd_trans_huge(). Effectively all these pmd_none_or_clear_bad() sites running with mmap_sem in read mode are somewhat speculative with the page faults, and the result is always undefined when they run simultaneously. This is probably why it wasn't common to run into this. For example if the madvise(MADV_DONTNEED) runs zap_page_range() shortly before the page fault, the hugepage will not be zapped, if the page fault runs first it will be zapped. Altering pmd_bad() not to error out if it finds hugepmds won't be enough to fix this, because zap_pmd_range would then proceed to call zap_pte_range (which would be incorrect if the pmd become a pmd_trans_huge()). The simplest way to fix this is to read the pmd in the local stack (regardless of what we read, no need of actual CPU barriers, only compiler barrier needed), and be sure it is not changing under the code that computes its value. Even if the real pmd is changing under the value we hold on the stack, we don't care. If we actually end up in zap_pte_range it means the pmd was not none already and it was not huge, and it can't become huge from under us (khugepaged locking explained above). All we need is to enforce that there is no way anymore that in a code path like below, pmd_trans_huge can be false, but pmd_none_or_clear_bad can run into a hugepmd. The overhead of a barrier() is just a compiler tweak and should not be measurable (I only added it for THP builds). I don't exclude different compiler versions may have prevented the race too by caching the value of *pmd on the stack (that hasn't been verified, but it wouldn't be impossible considering pmd_none_or_clear_bad, pmd_bad, pmd_trans_huge, pmd_none are all inlines and there's no external function called in between pmd_trans_huge and pmd_none_or_clear_bad). if (pmd_trans_huge(*pmd)) { if (next-addr != HPAGE_PMD_SIZE) { VM_BUG_ON(!rwsem_is_locked(&tlb->mm->mmap_sem)); split_huge_page_pmd(vma->vm_mm, pmd); } else if (zap_huge_pmd(tlb, vma, pmd, addr)) continue; /* fall through */ } if (pmd_none_or_clear_bad(pmd)) Because this race condition could be exercised without special privileges this was reported in CVE-2012-1179. The race was identified and fully explained by Ulrich who debugged it. I'm quoting his accurate explanation below, for reference. ====== start quote ======= mapcount 0 page_mapcount 1 kernel BUG at mm/huge_memory.c:1384! At some point prior to the panic, a "bad pmd ..." message similar to the following is logged on the console: mm/memory.c:145: bad pmd ffff8800376e1f98(80000000314000e7). The "bad pmd ..." message is logged by pmd_clear_bad() before it clears the page's PMD table entry. 143 void pmd_clear_bad(pmd_t *pmd) 144 { -> 145 pmd_ERROR(*pmd); 146 pmd_clear(pmd); 147 } After the PMD table entry has been cleared, there is an inconsistency between the actual number of PMD table entries that are mapping the page and the page's map count (_mapcount field in struct page). When the page is subsequently reclaimed, __split_huge_page() detects this inconsistency. 1381 if (mapcount != page_mapcount(page)) 1382 printk(KERN_ERR "mapcount %d page_mapcount %d\n", 1383 mapcount, page_mapcount(page)); -> 1384 BUG_ON(mapcount != page_mapcount(page)); The root cause of the problem is a race of two threads in a multithreaded process. Thread B incurs a page fault on a virtual address that has never been accessed (PMD entry is zero) while Thread A is executing an madvise() system call on a virtual address within the same 2 MB (huge page) range. virtual address space .---------------------. | | | | .-|---------------------| | | | | | |<-- B(fault) | | | 2 MB | |/////////////////////|-. huge < |/////////////////////| > A(range) page | |/////////////////////|-' | | | | | | '-|---------------------| | | | | '---------------------' - Thread A is executing an madvise(..., MADV_DONTNEED) system call on the virtual address range "A(range)" shown in the picture. sys_madvise // Acquire the semaphore in shared mode. down_read(¤t->mm->mmap_sem) ... madvise_vma switch (behavior) case MADV_DONTNEED: madvise_dontneed zap_page_range unmap_vmas unmap_page_range zap_pud_range zap_pmd_range // // Assume that this huge page has never been accessed. // I.e. content of the PMD entry is zero (not mapped). // if (pmd_trans_huge(*pmd)) { // We don't get here due to the above assumption. } // // Assume that Thread B incurred a page fault and .---------> // sneaks in here as shown below. | // | if (pmd_none_or_clear_bad(pmd)) | { | if (unlikely(pmd_bad(*pmd))) | pmd_clear_bad | { | pmd_ERROR | // Log "bad pmd ..." message here. | pmd_clear | // Clear the page's PMD entry. | // Thread B incremented the map count | // in page_add_new_anon_rmap(), but | // now the page is no longer mapped | // by a PMD entry (-> inconsistency). | } | } | v - Thread B is handling a page fault on virtual address "B(fault)" shown in the picture. ... do_page_fault __do_page_fault // Acquire the semaphore in shared mode. down_read_trylock(&mm->mmap_sem) ... handle_mm_fault if (pmd_none(*pmd) && transparent_hugepage_enabled(vma)) // We get here due to the above assumption (PMD entry is zero). do_huge_pmd_anonymous_page alloc_hugepage_vma // Allocate a new transparent huge page here. ... __do_huge_pmd_anonymous_page ... spin_lock(&mm->page_table_lock) ... page_add_new_anon_rmap // Here we increment the page's map count (starts at -1). atomic_set(&page->_mapcount, 0) set_pmd_at // Here we set the page's PMD entry which will be cleared // when Thread A calls pmd_clear_bad(). ... spin_unlock(&mm->page_table_lock) The mmap_sem does not prevent the race because both threads are acquiring it in shared mode (down_read). Thread B holds the page_table_lock while the page's map count and PMD table entry are updated. However, Thread A does not synchronize on that lock. ====== end quote ======= [akpm@linux-foundation.org: checkpatch fixes] Reported-by: NUlrich Obergfell <uobergfe@redhat.com> Signed-off-by: NAndrea Arcangeli <aarcange@redhat.com> Acked-by: NJohannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Hugh Dickins <hughd@google.com> Cc: Dave Jones <davej@redhat.com> Acked-by: NLarry Woodman <lwoodman@redhat.com> Acked-by: NRik van Riel <riel@redhat.com> Cc: <stable@vger.kernel.org> [2.6.38+] Cc: Mark Salter <msalter@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 05 3月, 2012 1 次提交
-
-
由 Paul Gortmaker 提交于
If a header file is making use of BUG, BUG_ON, BUILD_BUG_ON, or any other BUG variant in a static inline (i.e. not in a #define) then that header really should be including <linux/bug.h> and not just expecting it to be implicitly present. We can make this change risk-free, since if the files using these headers didn't have exposure to linux/bug.h already, they would have been causing compile failures/warnings. Signed-off-by: NPaul Gortmaker <paul.gortmaker@windriver.com>
-
- 03 3月, 2012 1 次提交
-
-
由 Grant Likely 提交于
Signed-off-by: NGrant Likely <grant.likely@secretlab.ca>
-
- 27 2月, 2012 1 次提交
-
-
由 James Bottomley 提交于
The problem in commit fea80311 Author: Randy Dunlap <rdunlap@xenotime.net> Date: Sun Jul 24 11:39:14 2011 -0700 iomap: make IOPORT/PCI mapping functions conditional is that if your architecture supplies pci_iomap/pci_iounmap, it expects always to supply them. Adding empty body defitions in the !CONFIG_PCI case, which is what this patch does, breaks the parisc compile because the functions become doubly defined. It took us a while to spot this, because we don't actually build !CONFIG_PCI very often (only if someone is brave enough to test the snake/asp machines). Since the note in the commit log says this is to fix a CONFIG_GENERIC_IOMAP issue (which it does because CONFIG_GENERIC_IOMAP supplies pci_iounmap only if CONFIG_PCI is set), there should actually have been a condition upon this. This should make sure no other architecture's !CONFIG_PCI compile breaks in the same way as parisc. The fix had to be updated to take account of the GENERIC_PCI_IOMAP separation. Reported-by: NRolf Eike Beer <eike@sf-mail.de> Signed-off-by: NJames Bottomley <JBottomley@Parallels.com>
-
- 25 2月, 2012 2 次提交
-
-
由 Oleg Nesterov 提交于
This patch is intentionally incomplete to simplify the review. It ignores ep_unregister_pollwait() which plays with the same wqh. See the next change. epoll assumes that the EPOLL_CTL_ADD'ed file controls everything f_op->poll() needs. In particular it assumes that the wait queue can't go away until eventpoll_release(). This is not true in case of signalfd, the task which does EPOLL_CTL_ADD uses its ->sighand which is not connected to the file. This patch adds the special event, POLLFREE, currently only for epoll. It expects that init_poll_funcptr()'ed hook should do the necessary cleanup. Perhaps it should be defined as EPOLLFREE in eventpoll. __cleanup_sighand() is changed to do wake_up_poll(POLLFREE) if ->signalfd_wqh is not empty, we add the new signalfd_cleanup() helper. ep_poll_callback(POLLFREE) simply does list_del_init(task_list). This make this poll entry inconsistent, but we don't care. If you share epoll fd which contains our sigfd with another process you should blame yourself. signalfd is "really special". I simply do not know how we can define the "right" semantics if it used with epoll. The main problem is, epoll calls signalfd_poll() once to establish the connection with the wait queue, after that signalfd_poll(NULL) returns the different/inconsistent results depending on who does EPOLL_CTL_MOD/signalfd_read/etc. IOW: apart from sigmask, signalfd has nothing to do with the file, it works with the current thread. In short: this patch is the hack which tries to fix the symptoms. It also assumes that nobody can take tasklist_lock under epoll locks, this seems to be true. Note: - we do not have wake_up_all_poll() but wake_up_poll() is fine, poll/epoll doesn't use WQ_FLAG_EXCLUSIVE. - signalfd_cleanup() uses POLLHUP along with POLLFREE, we need a couple of simple changes in eventpoll.c to make sure it can't be "lost". Reported-by: NMaxime Bizon <mbizon@freebox.fr> Cc: <stable@kernel.org> Signed-off-by: NOleg Nesterov <oleg@redhat.com> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Joerg Roedel 提交于
The new get_order macro introcuded in commit d66acc39 does not use parentheses around all uses of the parameter n. This causes new compile warnings, for example in the amd_iommu_init.c function: drivers/iommu/amd_iommu_init.c:561:6: warning: suggest parentheses around comparison in operand of ‘&’ [-Wparentheses] drivers/iommu/amd_iommu_init.c:561:6: warning: suggest parentheses around comparison in operand of ‘&’ [-Wparentheses] Fix those warnings by adding the missing parentheses. Reported-by: NIngo Molnar <mingo@elte.hu> Cc: David Howells <dhowells@redhat.com> Acked-by: NArnd Bergmann <arnd@arndb.de> Signed-off-by: NJoerg Roedel <joerg.roedel@amd.com> Link: http://lkml.kernel.org/r/1330088295-28732-1-git-send-email-joerg.roedel@amd.comSigned-off-by: NH. Peter Anvin <hpa@linux.intel.com>
-
- 24 2月, 2012 2 次提交
-
-
由 Ben Greear 提交于
This is useful for testing RX handling of frames with bad CRCs. Requires driver support to actually put the packet on the wire properly. Signed-off-by: NBen Greear <greearb@candelatech.com> Tested-by: NAaron Brown <aaron.f.brown@intel.com> Signed-off-by: NJeff Kirsher <jeffrey.t.kirsher@intel.com>
-
由 Bjorn Helgaas 提交于
Everybody uses the generic pcibios_resource_to_bus() supplied by the core now, so remove the ARCH_HAS_GENERIC_PCI_OFFSETS used during conversion. Signed-off-by: NBjorn Helgaas <bhelgaas@google.com>
-