- 10 10月, 2014 4 次提交
-
-
由 Joonsoo Kim 提交于
Now, we track caller if tracing or slab debugging is enabled. If they are disabled, we could save one argument passing overhead by calling __kmalloc(_node)(). But, I think that it would be marginal. Furthermore, default slab allocator, SLUB, doesn't use this technique so I think that it's okay to change this situation. After this change, we can turn on/off CONFIG_DEBUG_SLAB without full kernel build and remove some complicated '#if' defintion. It looks more benefitial to me. Signed-off-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: NChristoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Joonsoo Kim 提交于
We don't need to keep kmem_cache definition in include/linux/slab.h if we don't need to inline kmem_cache_size(). According to my code inspection, this function is only called at lc_create() in lib/lru_cache.c which may be called at initialization phase of something, so we don't need to inline it. Therfore, move it to slab_common.c and move kmem_cache definition to internal header. After this change, we can change kmem_cache definition easily without full kernel build. For instance, we can turn on/off CONFIG_SLUB_STATS without full kernel build. [akpm@linux-foundation.org: export kmem_cache_size() to modules] [rdunlap@infradead.org: add header files to fix kmemcheck.c build errors] Signed-off-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: NChristoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Signed-off-by: NRandy Dunlap <rdunlap@infradead.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Andrew Morton 提交于
False positive: mm/slab_common.c: In function 'kmem_cache_create': mm/slab_common.c:204: warning: 's' may be used uninitialized in this function Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Oleg Nesterov 提交于
- Rename vm_is_stack() to task_of_stack() and change it to return "struct task_struct *" rather than the global (and thus wrong in general) pid_t. - Add the new pid_of_stack() helper which calls task_of_stack() and uses the right namespace to report the correct pid_t. Unfortunately we need to define this helper twice, in task_mmu.c and in task_nommu.c. perhaps it makes sense to add fs/proc/util.c and move at least pid_of_stack/task_of_stack there to avoid the code duplication. - Change show_map_vma() and show_numa_map() to use the new helper. Signed-off-by: NOleg Nesterov <oleg@redhat.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Cyrill Gorcunov <gorcunov@openvz.org> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Greg Ungerer <gerg@uclinux.org> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 03 10月, 2014 4 次提交
-
-
由 Johannes Weiner 提交于
The zone allocation batches can easily underflow due to higher-order allocations or spills to remote nodes. On SMP that's fine, because underflows are expected from concurrency and dealt with by returning 0. But on UP, zone_page_state will just return a wrapped unsigned long, which will get past the <= 0 check and then consider the zone eligible until its watermarks are hit. Commit 3a025760 ("mm: page_alloc: spill to remote nodes before waking kswapd") already made the counter-resetting use atomic_long_read() to accomodate underflows from remote spills, but it didn't go all the way with it. Make it clear that these batches are expected to go negative regardless of concurrency, and use atomic_long_read() everywhere. Fixes: 81c0a2bb ("mm: page_alloc: fair zone allocator policy") Reported-by: NVlastimil Babka <vbabka@suse.cz> Reported-by: NLeon Romanovsky <leon@leon.nu> Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org> Acked-by: NMel Gorman <mgorman@suse.de> Cc: <stable@vger.kernel.org> [3.12+] Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Johannes Weiner 提交于
The cgroup iterators yield css objects that have not yet gone through css_online(), but they are not complete memcgs at this point and so the memcg iterators should not return them. Commit d8ad3055 ("mm/memcg: iteration skip memcgs not yet fully initialized") set out to implement exactly this, but it uses CSS_ONLINE, a cgroup-internal flag that does not meet the ordering requirements for memcg, and so the iterator may skip over initialized groups, or return partially initialized memcgs. The cgroup core can not reasonably provide a clear answer on whether the object around the css has been fully initialized, as that depends on controller-specific locking and lifetime rules. Thus, introduce a memcg-specific flag that is set after the memcg has been initialized in css_online(), and read before mem_cgroup_iter() callers access the memcg members. Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org> Cc: Tejun Heo <tj@kernel.org> Acked-by: NMichal Hocko <mhocko@suse.cz> Cc: Hugh Dickins <hughd@google.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: <stable@vger.kernel.org> [3.12+] Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Mel Gorman 提交于
This patch reverts 1ba6e0b5 ("mm: numa: split_huge_page: transfer the NUMA type from the pmd to the pte"). If a huge page is being split due a protection change and the tail will be in a PROT_NONE vma then NUMA hinting PTEs are temporarily created in the protected VMA. VM_RW|VM_PROTNONE |-----------------| ^ split here In the specific case above, it should get fixed up by change_pte_range() but there is a window of opportunity for weirdness to happen. Similarly, if a huge page is shrunk and split during a protection update but before pmd_numa is cleared then a pte_numa can be left behind. Instead of adding complexity trying to deal with the case, this patch will not mark PTEs NUMA when splitting a huge page. NUMA hinting faults will not be triggered which is marginal in comparison to the complexity in dealing with the corner cases during THP split. Cc: stable@vger.kernel.org Signed-off-by: NMel Gorman <mgorman@suse.de> Acked-by: NRik van Riel <riel@redhat.com> Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Mel Gorman 提交于
A migration entry is marked as write if pte_write was true at the time the entry was created. The VMA protections are not double checked when migration entries are being removed as mprotect marks write-migration-entries as read. It means that potentially we take a spurious fault to mark PTEs write again but it's straight-forward. However, there is a race between write migrations being marked read and migrations finishing. This potentially allows a PTE to be write that should have been read. Close this race by double checking the VMA permissions using maybe_mkwrite when migration completes. [torvalds@linux-foundation.org: use maybe_mkwrite] Cc: stable@vger.kernel.org Signed-off-by: NMel Gorman <mgorman@suse.de> Acked-by: NRik van Riel <riel@redhat.com> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 27 9月, 2014 2 次提交
-
-
由 Miklos Szeredi 提交于
The third argument of fuse_get_user_pages() "nbytesp" refers to the number of bytes a caller asked to pack into fuse request. This value may be lesser than capacity of fuse request or iov_iter. So fuse_get_user_pages() must ensure that *nbytesp won't grow. Now, when helper iov_iter_get_pages() performs all hard work of extracting pages from iov_iter, it can be done by passing properly calculated "maxsize" to the helper. The other caller of iov_iter_get_pages() (dio_refill_pages()) doesn't need this capability, so pass LONG_MAX as the maxsize argument here. Fixes: c9c37e2e ("fuse: switch to iov_iter_get_pages()") Reported-by: NWerner Baumann <werner.baumann@onlinehome.de> Tested-by: NMaxim Patlasov <mpatlasov@parallels.com> Signed-off-by: NMiklos Szeredi <mszeredi@suse.cz> Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
由 Miklos Szeredi 提交于
If overwriting an empty directory with rename, then need to drop the extra nlink. Test prog: #include <stdio.h> #include <fcntl.h> #include <err.h> #include <sys/stat.h> int main(void) { const char *test_dir1 = "test-dir1"; const char *test_dir2 = "test-dir2"; int res; int fd; struct stat statbuf; res = mkdir(test_dir1, 0777); if (res == -1) err(1, "mkdir(\"%s\")", test_dir1); res = mkdir(test_dir2, 0777); if (res == -1) err(1, "mkdir(\"%s\")", test_dir2); fd = open(test_dir2, O_RDONLY); if (fd == -1) err(1, "open(\"%s\")", test_dir2); res = rename(test_dir1, test_dir2); if (res == -1) err(1, "rename(\"%s\", \"%s\")", test_dir1, test_dir2); res = fstat(fd, &statbuf); if (res == -1) err(1, "fstat(%i)", fd); if (statbuf.st_nlink != 0) { fprintf(stderr, "nlink is %lu, should be 0\n", statbuf.st_nlink); return 1; } return 0; } Signed-off-by: NMiklos Szeredi <mszeredi@suse.cz> Cc: stable@vger.kernel.org Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
- 26 9月, 2014 2 次提交
-
-
由 Peter Feiner 提交于
This fixes the same bug as b43790ee ("mm: softdirty: don't forget to save file map softdiry bit on unmap") and 9aed8614 ("mm/memory.c: don't forget to set softdirty on file mapped fault") where the return value of pte_*mksoft_dirty was being ignored. To be sure that no other pte/pmd "mk" function return values were being ignored, I annotated the functions in arch/x86/include/asm/pgtable.h with __must_check and rebuilt. The userspace effect of this bug is that the softdirty mark might be lost if a file mapped pte get zapped. Signed-off-by: NPeter Feiner <pfeiner@google.com> Acked-by: NCyrill Gorcunov <gorcunov@openvz.org> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: Jamie Liu <jamieliu@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: <stable@vger.kernel.org> [3.12+] Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 David Rientjes 提交于
Since commit 45906855 ("mm/sl[aou]b: Common alignment code"), the "ralign" automatic variable in __kmem_cache_create() may be used as uninitialized. The proper alignment defaults to BYTES_PER_WORD and can be overridden by SLAB_RED_ZONE or the alignment specified by the caller. This fixes https://bugzilla.kernel.org/show_bug.cgi?id=85031Signed-off-by: NDavid Rientjes <rientjes@google.com> Reported-by: NAndrei Elovikov <a.elovikov@gmail.com> Acked-by: NChristoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: <stable@vger.kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 25 9月, 2014 3 次提交
-
-
由 NeilBrown 提交于
This will allow NFS to wait for PG_private to be cleared and, particularly, to send a wake-up when it is. Signed-off-by: NNeilBrown <neilb@suse.de> Acked-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NTrond Myklebust <trond.myklebust@primarydata.com>
-
由 NeilBrown 提交于
In commit c1221321 sched: Allow wait_on_bit_action() functions to support a timeout I suggested that a "wait_on_bit_timeout()" interface would not meet my need. This isn't true - I was just over-engineering. Including a 'private' field in wait_bit_key instead of a focused "timeout" field was just premature generalization. If some other use is ever found, it can be generalized or added later. So this patch renames "private" to "timeout" with a meaning "stop waiting when "jiffies" reaches or passes "timeout", and adds two of the many possible wait..bit..timeout() interfaces: wait_on_page_bit_killable_timeout(), which is the one I want to use, and out_of_line_wait_on_bit_timeout() which is a reasonably general example. Others can be added as needed. Acked-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: NNeilBrown <neilb@suse.de> Acked-by: NIngo Molnar <mingo@kernel.org> Signed-off-by: NTrond Myklebust <trond.myklebust@primarydata.com>
-
由 Zefan Li 提交于
When we change cpuset.memory_spread_{page,slab}, cpuset will flip PF_SPREAD_{PAGE,SLAB} bit of tsk->flags for each task in that cpuset. This should be done using atomic bitops, but currently we don't, which is broken. Tetsuo reported a hard-to-reproduce kernel crash on RHEL6, which happened when one thread tried to clear PF_USED_MATH while at the same time another thread tried to flip PF_SPREAD_PAGE/PF_SPREAD_SLAB. They both operate on the same task. Here's the full report: https://lkml.org/lkml/2014/9/19/230 To fix this, we make PF_SPREAD_PAGE and PF_SPREAD_SLAB atomic flags. v4: - updated mm/slab.c. (Fengguang Wu) - updated Documentation. Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Miao Xie <miaox@cn.fujitsu.com> Cc: Kees Cook <keescook@chromium.org> Fixes: 950592f7 ("cpusets: update tasks' page/slab spread flags in time") Cc: <stable@vger.kernel.org> # 2.6.31+ Reported-by: NTetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: NZefan Li <lizefan@huawei.com> Signed-off-by: NTejun Heo <tj@kernel.org>
-
- 24 9月, 2014 2 次提交
-
-
由 Andres Lagar-Cavilla 提交于
1. We were calling clear_flush_young_notify in unmap_one, but we are within an mmu notifier invalidate range scope. The spte exists no more (due to range_start) and the accessed bit info has already been propagated (due to kvm_pfn_set_accessed). Simply call clear_flush_young. 2. We clear_flush_young on a primary MMU PMD, but this may be mapped as a collection of PTEs by the secondary MMU (e.g. during log-dirty). This required expanding the interface of the clear_flush_young mmu notifier, so a lot of code has been trivially touched. 3. In the absence of shadow_accessed_mask (e.g. EPT A bit), we emulate the access bit by blowing the spte. This requires proper synchronizing with MMU notifier consumers, like every other removal of spte's does. Signed-off-by: NAndres Lagar-Cavilla <andreslc@google.com> Acked-by: NRik van Riel <riel@redhat.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Andres Lagar-Cavilla 提交于
When KVM handles a tdp fault it uses FOLL_NOWAIT. If the guest memory has been swapped out or is behind a filemap, this will trigger async readahead and return immediately. The rationale is that KVM will kick back the guest with an "async page fault" and allow for some other guest process to take over. If async PFs are enabled the fault is retried asap from an async workqueue. If not, it's retried immediately in the same code path. In either case the retry will not relinquish the mmap semaphore and will block on the IO. This is a bad thing, as other mmap semaphore users now stall as a function of swap or filemap latency. This patch ensures both the regular and async PF path re-enter the fault allowing for the mmap semaphore to be relinquished in the case of IO wait. Reviewed-by: NRadim Krčmář <rkrcmar@redhat.com> Signed-off-by: NAndres Lagar-Cavilla <andreslc@google.com> Acked-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
- 19 9月, 2014 1 次提交
-
-
由 Krzysztof Hałasa 提交于
dma_pool_create() needs to unlock the mutex in error case. The bug was introduced in the 3.16 by commit cc6b664a ("mm/dmapool.c: remove redundant NULL check for dev in dma_pool_create()")/ Signed-off-by: NKrzysztof Hałasa <khc@piap.pl> Cc: stable@vger.kernel.org # v3.16 Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 14 9月, 2014 1 次提交
-
-
由 Ard Biesheuvel 提交于
In order to make the static inline function is_zero_pfn() callable by modules, export its symbol dependencies 'zero_pfn' and (for s390 and mips) 'zero_page_mask'. We need this for KVM, as CONFIG_KVM is a tristate for all supported architectures except ARM and arm64, and testing a pfn whether it refers to the zero page is required to correctly distinguish the zero page from other special RAM ranges that may also have the PG_reserved bit set, but need to be treated as MMIO memory. Signed-off-by: NArd Biesheuvel <ard.biesheuvel@linaro.org> Acked-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
- 11 9月, 2014 2 次提交
-
-
由 Sasha Levin 提交于
Make sure we actually see the output of validate_mm() and browse_rb() before triggering a BUG(). pr_info isn't shown by default so the reason for the BUG() isn't obvious. Signed-off-by: NSasha Levin <sasha.levin@oracle.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Xishi Qiu 提交于
Let memblock skip the hotpluggable memory regions in __next_mem_range(), it is used to to prevent memblock from allocating hotpluggable memory for the kernel at early time. The code is the same as __next_mem_range_rev(). Clear hotpluggable flag before releasing free pages to the buddy allocator. If we don't clear hotpluggable flag in free_low_memory_core_early(), the memory which marked hotpluggable flag will not free to buddy allocator. Because __next_mem_range() will skip them. free_low_memory_core_early for_each_free_mem_range for_each_mem_range __next_mem_range [akpm@linux-foundation.org: fix warning] Signed-off-by: NXishi Qiu <qiuxishi@huawei.com> Cc: Tejun Heo <tj@kernel.org> Cc: Tang Chen <tangchen@cn.fujitsu.com> Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Cc: Wen Congyang <wency@cn.fujitsu.com> Cc: "Rafael J. Wysocki" <rjw@sisk.pl> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 09 9月, 2014 1 次提交
-
-
由 Masanari Iida 提交于
This patch fix spelling typo found in DocBook/kernel-api.xml. It is because the file is generated from the source comments, I have to fix the comments in source codes. Signed-off-by: NMasanari Iida <standby24x7@gmail.com> Acked-by: NRandy Dunlap <rdunlap@infradead.org> Signed-off-by: NJiri Kosina <jkosina@suse.cz>
-
- 05 9月, 2014 1 次提交
-
-
由 Johannes Weiner 提交于
Dave Hansen reports a massive scalability regression in an uncontained page fault benchmark with more than 30 concurrent threads, which he bisected down to 05b84301 ("mm: memcontrol: use root_mem_cgroup res_counter") and pin-pointed on res_counter spinlock contention. That change relied on the per-cpu charge caches to mostly swallow the res_counter costs, but it's apparent that the caches don't scale yet. Revert memcg back to bypassing res_counters on the root level in order to restore performance for uncontained workloads. Reported-by: NDave Hansen <dave@sr71.net> Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org> Tested-by: NDave Hansen <dave.hansen@intel.com> Acked-by: NMichal Hocko <mhocko@suse.cz> Acked-by: NVladimir Davydov <vdavydov@parallels.com> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 30 8月, 2014 5 次提交
-
-
由 Hugh Dickins 提交于
Sasha Levin has shown oopses on ffffea0003480048 and ffffea0003480008 at mm/memory.c:1132, running Trinity on different 3.16-rc-next kernels: where zap_pte_range() checks page->mapping to see if PageAnon(page). Those addresses fit struct pages for pfns d2001 and d2000, and in each dump a register or a stack slot showed d2001730 or d2000730: pte flags 0x730 are PCD ACCESSED PROTNONE SPECIAL IOMAP; and Sasha's e820 map has a hole between cfffffff and 100000000, which would need special access. Commit c46a7c81 ("x86: define _PAGE_NUMA by reusing software bits on the PMD and PTE levels") has broken vm_normal_page(): a PROTNONE SPECIAL pte no longer passes the pte_special() test, so zap_pte_range() goes on to try to access a non-existent struct page. Fix this by refining pte_special() (SPECIAL with PRESENT or PROTNONE) to complement pte_numa() (SPECIAL with neither PRESENT nor PROTNONE). A hint that this was a problem was that c46a7c81 added pte_numa() test to vm_normal_page(), and moved its is_zero_pfn() test from slow to fast path: This was papering over a pte_special() snag when the zero page was encountered during zap. This patch reverts vm_normal_page() to how it was before, relying on pte_special(). It still appears that this patch may be incomplete: aren't there other places which need to be handling PROTNONE along with PRESENT? For example, pte_mknuma() clears _PAGE_PRESENT and sets _PAGE_NUMA, but on a PROT_NONE area, that would make it pte_special(). This is side-stepped by the fact that NUMA hinting faults skipped PROT_NONE VMAs and there are no grounds where a NUMA hinting fault on a PROT_NONE VMA would be interesting. Fixes: c46a7c81 ("x86: define _PAGE_NUMA by reusing software bits on the PMD and PTE levels") Reported-by: NSasha Levin <sasha.levin@oracle.com> Tested-by: NSasha Levin <sasha.levin@oracle.com> Signed-off-by: NHugh Dickins <hughd@google.com> Signed-off-by: NMel Gorman <mgorman@suse.de> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Cyrill Gorcunov <gorcunov@gmail.com> Cc: Matthew Wilcox <matthew.r.wilcox@intel.com> Cc: <stable@vger.kernel.org> [3.16] Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Michal Hocko 提交于
spin_lock may be an empty struct for !SMP configurations and so arch_spin_is_locked may return unconditional 0 and trigger the VM_BUG_ON even when the lock is held. Replace spin_is_locked by lockdep_assert_held. We will not BUG anymore but it is questionable whether crashing makes a lot of sense in the uncharge path. Uncharge happens after the last page reference was released so nobody should touch the page and the function doesn't update any shared state except for res counter which uses synchronization of its own. Signed-off-by: NMichal Hocko <mhocko@suse.cz> Reviewed-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Kees Cook 提交于
To avoid potential format string expansion via module parameters, do not use the zpool type directly in request_module() without a format string. Additionally, to avoid arbitrary modules being loaded via zpool API (e.g. via the zswap_zpool_type module parameter) add a "zpool-" prefix to the requested module, as well as module aliases for the existing zpool types (zbud and zsmalloc). Signed-off-by: NKees Cook <keescook@chromium.org> Cc: Seth Jennings <sjennings@variantweb.net> Cc: Minchan Kim <minchan@kernel.org> Cc: Nitin Gupta <ngupta@vflare.org> Acked-by: NDan Streetman <ddstreet@ieee.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Matthew Wilcox 提交于
Commit 67f87463 ("mm: clear pmd_numa before invalidating") cleared the NUMA bit in a copy of the PMD entry, but then wrote back the original Signed-off-by: NMatthew Wilcox <matthew.r.wilcox@intel.com> Acked-by: NMel Gorman <mgorman@suse.de> Reviewed-by: NRik van Riel <riel@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Tang Chen 提交于
In memblock_find_in_range_node(), we defined ret as int. But it should be phys_addr_t because it is used to store the return value from __memblock_find_range_bottom_up(). The bug has not been triggered because when allocating low memory near the kernel end, the "int ret" won't turn out to be negative. When we started to allocate memory on other nodes, and the "int ret" could be minus. Then the kernel will panic. A simple way to reproduce this: comment out the following code in numa_init(), memblock_set_bottom_up(false); and the kernel won't boot. Reported-by: NXishi Qiu <qiuxishi@huawei.com> Signed-off-by: NTang Chen <tangchen@cn.fujitsu.com> Tested-by: NXishi Qiu <qiuxishi@huawei.com> Cc: <stable@vger.kernel.org> [3.13+] Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 18 8月, 2014 1 次提交
-
-
由 Josh Triplett 提交于
Many embedded systems will not need these syscalls, and omitting them saves space. Add a new EXPERT config option CONFIG_ADVISE_SYSCALLS (default y) to support compiling them out. bloat-o-meter: add/remove: 0/3 grow/shrink: 0/0 up/down: 0/-2250 (-2250) function old new delta sys_fadvise64 57 - -57 sys_fadvise64_64 691 - -691 sys_madvise 1502 - -1502 Signed-off-by: NJosh Triplett <josh@joshtriplett.org>
-
- 16 8月, 2014 3 次提交
-
-
由 Honggang Li 提交于
Currently, only SMP system free the percpu allocation info. Uniprocessor system should free it too. For example, one x86 UML virtual machine with 256MB memory, UML kernel wastes one page memory. Signed-off-by: NHonggang Li <enjoymindful@gmail.com> Signed-off-by: NTejun Heo <tj@kernel.org> Cc: stable@vger.kernel.org
-
由 Tejun Heo 提交于
If pcpu_map_pages() fails midway, it unmaps the already mapped pages. Currently, it doesn't flush tlb after the partial unmapping. This may be okay in most cases as the established mapping hasn't been used at that point but it can go wrong and when it goes wrong it'd be extremely difficult to track down. Flush tlb after the partial unmapping. Signed-off-by: NTejun Heo <tj@kernel.org> Cc: stable@vger.kernel.org
-
由 Tejun Heo 提交于
When pcpu_alloc_pages() fails midway, pcpu_free_pages() is invoked to free what has already been allocated. The invocation is across the whole requested range and pcpu_free_pages() will try to free all non-NULL pages; unfortunately, this is incorrect as pcpu_get_pages_and_bitmap(), unlike what its comment suggests, doesn't clear the pages array and thus the array may have entries from the previous invocations making the partial failure path free incorrect pages. Fix it by open-coding the partial freeing of the already allocated pages. Signed-off-by: NTejun Heo <tj@kernel.org> Cc: stable@vger.kernel.org
-
- 15 8月, 2014 1 次提交
-
-
由 David Rientjes 提交于
Memcg aligns memory.limit_in_bytes to PAGE_SIZE as part of the resource counter since it makes no sense to allow a partial page to be charged. As a result of the hugetlb cgroup using the resource counter, it is also aligned to PAGE_SIZE but makes no sense unless aligned to the size of the hugepage being limited. Align hugetlb cgroup limit to hugepage size. Signed-off-by: NDavid Rientjes <rientjes@google.com> Acked-by: NMichal Hocko <mhocko@suse.cz> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: Tejun Heo <tj@kernel.org> Cc: Li Zefan <lizefan@huawei.com> Cc: Michal Hocko <mhocko@suse.cz> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 12 8月, 2014 1 次提交
-
-
由 Al Viro 提交于
If DIO results in short write and sync write fails, we want to bugger off whether the DIO part has written anything or not; the logics on the return will take care of the right return value. Cc: stable@vger.kernel.org [3.16] Reported-by: NAnton Altaparmakov <aia21@cam.ac.uk> Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
- 09 8月, 2014 6 次提交
-
-
由 David Herrmann 提交于
If we set SEAL_WRITE on a file, we must make sure there cannot be any ongoing write-operations on the file. For write() calls, we simply lock the inode mutex, for mmap() we simply verify there're no writable mappings. However, there might be pages pinned by AIO, Direct-IO and similar operations via GUP. We must make sure those do not write to the memfd file after we set SEAL_WRITE. As there is no way to notify GUP users to drop pages or to wait for them to be done, we implement the wait ourself: When setting SEAL_WRITE, we check all pages for their ref-count. If it's bigger than 1, we know there's some user of the page. We then mark the page and wait for up to 150ms for those ref-counts to be dropped. If the ref-counts are not dropped in time, we refuse the seal operation. Signed-off-by: NDavid Herrmann <dh.herrmann@gmail.com> Acked-by: NHugh Dickins <hughd@google.com> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Ryan Lortie <desrt@desrt.ca> Cc: Lennart Poettering <lennart@poettering.net> Cc: Daniel Mack <zonque@gmail.com> Cc: Andy Lutomirski <luto@amacapital.net> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 David Herrmann 提交于
memfd_create() is similar to mmap(MAP_ANON), but returns a file-descriptor that you can pass to mmap(). It can support sealing and avoids any connection to user-visible mount-points. Thus, it's not subject to quotas on mounted file-systems, but can be used like malloc()'ed memory, but with a file-descriptor to it. memfd_create() returns the raw shmem file, so calls like ftruncate() can be used to modify the underlying inode. Also calls like fstat() will return proper information and mark the file as regular file. If you want sealing, you can specify MFD_ALLOW_SEALING. Otherwise, sealing is not supported (like on all other regular files). Compared to O_TMPFILE, it does not require a tmpfs mount-point and is not subject to a filesystem size limit. It is still properly accounted to memcg limits, though, and to the same overcommit or no-overcommit accounting as all user memory. Signed-off-by: NDavid Herrmann <dh.herrmann@gmail.com> Acked-by: NHugh Dickins <hughd@google.com> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Ryan Lortie <desrt@desrt.ca> Cc: Lennart Poettering <lennart@poettering.net> Cc: Daniel Mack <zonque@gmail.com> Cc: Andy Lutomirski <luto@amacapital.net> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 David Herrmann 提交于
If two processes share a common memory region, they usually want some guarantees to allow safe access. This often includes: - one side cannot overwrite data while the other reads it - one side cannot shrink the buffer while the other accesses it - one side cannot grow the buffer beyond previously set boundaries If there is a trust-relationship between both parties, there is no need for policy enforcement. However, if there's no trust relationship (eg., for general-purpose IPC) sharing memory-regions is highly fragile and often not possible without local copies. Look at the following two use-cases: 1) A graphics client wants to share its rendering-buffer with a graphics-server. The memory-region is allocated by the client for read/write access and a second FD is passed to the server. While scanning out from the memory region, the server has no guarantee that the client doesn't shrink the buffer at any time, requiring rather cumbersome SIGBUS handling. 2) A process wants to perform an RPC on another process. To avoid huge bandwidth consumption, zero-copy is preferred. After a message is assembled in-memory and a FD is passed to the remote side, both sides want to be sure that neither modifies this shared copy, anymore. The source may have put sensible data into the message without a separate copy and the target may want to parse the message inline, to avoid a local copy. While SIGBUS handling, POSIX mandatory locking and MAP_DENYWRITE provide ways to achieve most of this, the first one is unproportionally ugly to use in libraries and the latter two are broken/racy or even disabled due to denial of service attacks. This patch introduces the concept of SEALING. If you seal a file, a specific set of operations is blocked on that file forever. Unlike locks, seals can only be set, never removed. Hence, once you verified a specific set of seals is set, you're guaranteed that no-one can perform the blocked operations on this file, anymore. An initial set of SEALS is introduced by this patch: - SHRINK: If SEAL_SHRINK is set, the file in question cannot be reduced in size. This affects ftruncate() and open(O_TRUNC). - GROW: If SEAL_GROW is set, the file in question cannot be increased in size. This affects ftruncate(), fallocate() and write(). - WRITE: If SEAL_WRITE is set, no write operations (besides resizing) are possible. This affects fallocate(PUNCH_HOLE), mmap() and write(). - SEAL: If SEAL_SEAL is set, no further seals can be added to a file. This basically prevents the F_ADD_SEAL operation on a file and can be set to prevent others from adding further seals that you don't want. The described use-cases can easily use these seals to provide safe use without any trust-relationship: 1) The graphics server can verify that a passed file-descriptor has SEAL_SHRINK set. This allows safe scanout, while the client is allowed to increase buffer size for window-resizing on-the-fly. Concurrent writes are explicitly allowed. 2) For general-purpose IPC, both processes can verify that SEAL_SHRINK, SEAL_GROW and SEAL_WRITE are set. This guarantees that neither process can modify the data while the other side parses it. Furthermore, it guarantees that even with writable FDs passed to the peer, it cannot increase the size to hit memory-limits of the source process (in case the file-storage is accounted to the source). The new API is an extension to fcntl(), adding two new commands: F_GET_SEALS: Return a bitset describing the seals on the file. This can be called on any FD if the underlying file supports sealing. F_ADD_SEALS: Change the seals of a given file. This requires WRITE access to the file and F_SEAL_SEAL may not already be set. Furthermore, the underlying file must support sealing and there may not be any existing shared mapping of that file. Otherwise, EBADF/EPERM is returned. The given seals are _added_ to the existing set of seals on the file. You cannot remove seals again. The fcntl() handler is currently specific to shmem and disabled on all files. A file needs to explicitly support sealing for this interface to work. A separate syscall is added in a follow-up, which creates files that support sealing. There is no intention to support this on other file-systems. Semantics are unclear for non-volatile files and we lack any use-case right now. Therefore, the implementation is specific to shmem. Signed-off-by: NDavid Herrmann <dh.herrmann@gmail.com> Acked-by: NHugh Dickins <hughd@google.com> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Ryan Lortie <desrt@desrt.ca> Cc: Lennart Poettering <lennart@poettering.net> Cc: Daniel Mack <zonque@gmail.com> Cc: Andy Lutomirski <luto@amacapital.net> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 David Herrmann 提交于
This patch (of 6): The i_mmap_writable field counts existing writable mappings of an address_space. To allow drivers to prevent new writable mappings, make this counter signed and prevent new writable mappings if it is negative. This is modelled after i_writecount and DENYWRITE. This will be required by the shmem-sealing infrastructure to prevent any new writable mappings after the WRITE seal has been set. In case there exists a writable mapping, this operation will fail with EBUSY. Note that we rely on the fact that iff you already own a writable mapping, you can increase the counter without using the helpers. This is the same that we do for i_writecount. Signed-off-by: NDavid Herrmann <dh.herrmann@gmail.com> Acked-by: NHugh Dickins <hughd@google.com> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Ryan Lortie <desrt@desrt.ca> Cc: Lennart Poettering <lennart@poettering.net> Cc: Daniel Mack <zonque@gmail.com> Cc: Andy Lutomirski <luto@amacapital.net> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Andy Lutomirski 提交于
The core mm code will provide a default gate area based on FIXADDR_USER_START and FIXADDR_USER_END if !defined(__HAVE_ARCH_GATE_AREA) && defined(AT_SYSINFO_EHDR). This default is only useful for ia64. arm64, ppc, s390, sh, tile, 64-bit UML, and x86_32 have their own code just to disable it. arm, 32-bit UML, and x86_64 have gate areas, but they have their own implementations. This gets rid of the default and moves the code into ia64. This should save some code on architectures without a gate area: it's now possible to inline the gate_area functions in the default case. Signed-off-by: NAndy Lutomirski <luto@amacapital.net> Acked-by: NNathan Lynch <nathan_lynch@mentor.com> Acked-by: NH. Peter Anvin <hpa@linux.intel.com> Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> [in principle] Acked-by: Richard Weinberger <richard@nod.at> [for um] Acked-by: Will Deacon <will.deacon@arm.com> [for arm64] Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will.deacon@arm.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Chris Metcalf <cmetcalf@tilera.com> Cc: Jeff Dike <jdike@addtoit.com> Cc: Richard Weinberger <richard@nod.at> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Nathan Lynch <Nathan_Lynch@mentor.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Fabian Frederick 提交于
zswap_entry_cache_destroy() is only called by __init init_zswap(). This patch also fixes function name zswap_entry_cache_ s/destory/destroy Signed-off-by: NFabian Frederick <fabf@skynet.be> Acked-by: NSeth Jennings <sjennings@variantweb.net> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-