- 19 10月, 2016 10 次提交
-
-
由 Lorenzo Stoakes 提交于
This removes the 'write' argument from access_remote_vm() and replaces it with 'gup_flags' as use of this function previously silently implied FOLL_FORCE, whereas after this patch callers explicitly pass this flag. We make this explicit as use of FOLL_FORCE can result in surprising behaviour (and hence bugs) within the mm subsystem. Signed-off-by: NLorenzo Stoakes <lstoakes@gmail.com> Acked-by: NMichal Hocko <mhocko@suse.com> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Lorenzo Stoakes 提交于
This removes the 'write' argument from __access_remote_vm() and replaces it with 'gup_flags' as use of this function previously silently implied FOLL_FORCE, whereas after this patch callers explicitly pass this flag. We make this explicit as use of FOLL_FORCE can result in surprising behaviour (and hence bugs) within the mm subsystem. Signed-off-by: NLorenzo Stoakes <lstoakes@gmail.com> Acked-by: NMichal Hocko <mhocko@suse.com> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Lorenzo Stoakes 提交于
This removes the 'write' and 'force' from get_user_pages_remote() and replaces them with 'gup_flags' to make the use of FOLL_FORCE explicit in callers as use of this flag can result in surprising behaviour (and hence bugs) within the mm subsystem. Signed-off-by: NLorenzo Stoakes <lstoakes@gmail.com> Acked-by: NMichal Hocko <mhocko@suse.com> Reviewed-by: NJan Kara <jack@suse.cz> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Lorenzo Stoakes 提交于
This removes the 'write' and 'force' from get_user_pages() and replaces them with 'gup_flags' to make the use of FOLL_FORCE explicit in callers as use of this flag can result in surprising behaviour (and hence bugs) within the mm subsystem. Signed-off-by: NLorenzo Stoakes <lstoakes@gmail.com> Acked-by: NChristian König <christian.koenig@amd.com> Acked-by: NJesper Nilsson <jesper.nilsson@axis.com> Acked-by: NMichal Hocko <mhocko@suse.com> Reviewed-by: NJan Kara <jack@suse.cz> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Lorenzo Stoakes 提交于
This removes the 'write' and 'force' from get_vaddr_frames() and replaces them with 'gup_flags' to make the use of FOLL_FORCE explicit in callers as use of this flag can result in surprising behaviour (and hence bugs) within the mm subsystem. Signed-off-by: NLorenzo Stoakes <lstoakes@gmail.com> Acked-by: NMichal Hocko <mhocko@suse.com> Reviewed-by: NJan Kara <jack@suse.cz> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Lorenzo Stoakes 提交于
This removes the 'write' and 'force' use from get_user_pages_locked() and replaces them with 'gup_flags' to make the use of FOLL_FORCE explicit in callers as use of this flag can result in surprising behaviour (and hence bugs) within the mm subsystem. Signed-off-by: NLorenzo Stoakes <lstoakes@gmail.com> Acked-by: NMichal Hocko <mhocko@suse.com> Reviewed-by: NJan Kara <jack@suse.cz> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Lorenzo Stoakes 提交于
This removes the 'write' and 'force' use from get_user_pages_unlocked() and replaces them with 'gup_flags' to make the use of FOLL_FORCE explicit in callers as use of this flag can result in surprising behaviour (and hence bugs) within the mm subsystem. Signed-off-by: NLorenzo Stoakes <lstoakes@gmail.com> Reviewed-by: NJan Kara <jack@suse.cz> Acked-by: NMichal Hocko <mhocko@suse.com> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Lorenzo Stoakes 提交于
This removes the redundant 'write' and 'force' parameters from __get_user_pages_unlocked() to make the use of FOLL_FORCE explicit in callers as use of this flag can result in surprising behaviour (and hence bugs) within the mm subsystem. Signed-off-by: NLorenzo Stoakes <lstoakes@gmail.com> Acked-by: NPaolo Bonzini <pbonzini@redhat.com> Reviewed-by: NJan Kara <jack@suse.cz> Acked-by: NMichal Hocko <mhocko@suse.com> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Lorenzo Stoakes 提交于
This removes the redundant 'write' and 'force' parameters from __get_user_pages_locked() to make the use of FOLL_FORCE explicit in callers as use of this flag can result in surprising behaviour (and hence bugs) within the mm subsystem. Signed-off-by: NLorenzo Stoakes <lstoakes@gmail.com> Reviewed-by: NJan Kara <jack@suse.cz> Acked-by: NMichal Hocko <mhocko@suse.com> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Linus Torvalds 提交于
This is an ancient bug that was actually attempted to be fixed once (badly) by me eleven years ago in commit 4ceb5db9 ("Fix get_user_pages() race for write access") but that was then undone due to problems on s390 by commit f33ea7f4 ("fix get_user_pages bug"). In the meantime, the s390 situation has long been fixed, and we can now fix it by checking the pte_dirty() bit properly (and do it better). The s390 dirty bit was implemented in abf09bed ("s390/mm: implement software dirty bits") which made it into v3.9. Earlier kernels will have to look at the page state itself. Also, the VM has become more scalable, and what used a purely theoretical race back then has become easier to trigger. To fix it, we introduce a new internal FOLL_COW flag to mark the "yes, we already did a COW" rather than play racy games with FOLL_WRITE that is very fundamental, and then use the pte dirty flag to validate that the FOLL_COW flag is still valid. Reported-and-tested-by: NPhil "not Paul" Oester <kernel@linuxace.com> Acked-by: NHugh Dickins <hughd@google.com> Reviewed-by: NMichal Hocko <mhocko@suse.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Kees Cook <keescook@chromium.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Willy Tarreau <w@1wt.eu> Cc: Nick Piggin <npiggin@gmail.com> Cc: Greg Thelen <gthelen@google.com> Cc: stable@vger.kernel.org Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 12 10月, 2016 1 次提交
-
-
由 Catalin Marinas 提交于
Some of the kmemleak_*() callbacks in memblock, bootmem, CMA convert a physical address to a virtual one using __va(). However, such physical addresses may sometimes be located in highmem and using __va() is incorrect, leading to inconsistent object tracking in kmemleak. The following functions have been added to the kmemleak API and they take a physical address as the object pointer. They only perform the corresponding action if the address has a lowmem mapping: kmemleak_alloc_phys kmemleak_free_part_phys kmemleak_not_leak_phys kmemleak_ignore_phys The affected calling places have been updated to use the new kmemleak API. Link: http://lkml.kernel.org/r/1471531432-16503-1-git-send-email-catalin.marinas@arm.comSigned-off-by: NCatalin Marinas <catalin.marinas@arm.com> Reported-by: NVignesh R <vigneshr@ti.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 11 10月, 2016 1 次提交
-
-
由 Al Viro 提交于
by making sure we call iov_iter_advance() on original iov_iter even if direct_IO (done on its copy) has returned 0. It's a no-op for old iov_iter flavours and does the right thing (== truncation of the stuff we'd allocated, but not filled) in ITER_PIPE case. Failures (e.g. -EIO) get caught and dealt with by cleanup in generic_file_read_iter(). Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
- 08 10月, 2016 28 次提交
-
-
由 Andreas Gruenbacher 提交于
These inode operations are no longer used; remove them. Signed-off-by: NAndreas Gruenbacher <agruenba@redhat.com> Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
由 Joe Perches 提交于
Allow some seq_puts removals by taking a string instead of a single char. [akpm@linux-foundation.org: update vmstat_show(), per Joe] Link: http://lkml.kernel.org/r/667e1cf3d436de91a5698170a1e98d882905e956.1470704995.git.joe@perches.comSigned-off-by: NJoe Perches <joe@perches.com> Cc: Joe Perches <joe@perches.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Alexey Dobriyan 提交于
Every current KDE system has process named ksysguardd polling files below once in several seconds: $ strace -e trace=open -p $(pidof ksysguardd) Process 1812 attached open("/etc/mtab", O_RDONLY|O_CLOEXEC) = 8 open("/etc/mtab", O_RDONLY|O_CLOEXEC) = 8 open("/proc/net/dev", O_RDONLY) = 8 open("/proc/net/wireless", O_RDONLY) = -1 ENOENT (No such file or directory) open("/proc/stat", O_RDONLY) = 8 open("/proc/vmstat", O_RDONLY) = 8 Hell knows what it is doing but speed up reading /proc/vmstat by 33%! Benchmark is open+read+close 1.000.000 times. BEFORE $ perf stat -r 10 taskset -c 3 ./proc-vmstat Performance counter stats for 'taskset -c 3 ./proc-vmstat' (10 runs): 13146.768464 task-clock (msec) # 0.960 CPUs utilized ( +- 0.60% ) 15 context-switches # 0.001 K/sec ( +- 1.41% ) 1 cpu-migrations # 0.000 K/sec ( +- 11.11% ) 104 page-faults # 0.008 K/sec ( +- 0.57% ) 45,489,799,349 cycles # 3.460 GHz ( +- 0.03% ) 9,970,175,743 stalled-cycles-frontend # 21.92% frontend cycles idle ( +- 0.10% ) 2,800,298,015 stalled-cycles-backend # 6.16% backend cycles idle ( +- 0.32% ) 79,241,190,850 instructions # 1.74 insn per cycle # 0.13 stalled cycles per insn ( +- 0.00% ) 17,616,096,146 branches # 1339.956 M/sec ( +- 0.00% ) 176,106,232 branch-misses # 1.00% of all branches ( +- 0.18% ) 13.691078109 seconds time elapsed ( +- 0.03% ) ^^^^^^^^^^^^ AFTER $ perf stat -r 10 taskset -c 3 ./proc-vmstat Performance counter stats for 'taskset -c 3 ./proc-vmstat' (10 runs): 8688.353749 task-clock (msec) # 0.950 CPUs utilized ( +- 1.25% ) 10 context-switches # 0.001 K/sec ( +- 2.13% ) 1 cpu-migrations # 0.000 K/sec 104 page-faults # 0.012 K/sec ( +- 0.56% ) 30,384,010,730 cycles # 3.497 GHz ( +- 0.07% ) 12,296,259,407 stalled-cycles-frontend # 40.47% frontend cycles idle ( +- 0.13% ) 3,370,668,651 stalled-cycles-backend # 11.09% backend cycles idle ( +- 0.69% ) 28,969,052,879 instructions # 0.95 insn per cycle # 0.42 stalled cycles per insn ( +- 0.01% ) 6,308,245,891 branches # 726.058 M/sec ( +- 0.00% ) 214,685,502 branch-misses # 3.40% of all branches ( +- 0.26% ) 9.146081052 seconds time elapsed ( +- 0.07% ) ^^^^^^^^^^^ vsnprintf() is slow because: 1. format_decode() is busy looking for format specifier: 2 branches per character (not in this case, but in others) 2. approximately million branches while parsing format mini language and everywhere 3. just look at what string() does /proc/vmstat is good case because most of its content are strings Link: http://lkml.kernel.org/r/20160806125455.GA1187@p183.telecom.bySigned-off-by: NAlexey Dobriyan <adobriyan@gmail.com> Cc: Joe Perches <joe@perches.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 zhong jiang 提交于
When the huge page is added to the page cahce (huge_add_to_page_cache), the page private flag will be cleared. since this code (remove_inode_hugepages) will only be called for pages in the page cahce, PagePrivate(page) will always be false. The patch remove the code without any functional change. Link: http://lkml.kernel.org/r/1475113323-29368-1-git-send-email-zhongjiang@huawei.comSigned-off-by: Nzhong jiang <zhongjiang@huawei.com> Reviewed-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com> Reviewed-by: NMike Kravetz <mike.kravetz@oracle.com> Tested-by: NMike Kravetz <mike.kravetz@oracle.com> Acked-by: NMichal Hocko <mhocko@suse.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Michal Hocko 提交于
Currently we do warn only about allocation failures but small allocations are basically nofail and they might loop in the page allocator for a long time. Especially when the reclaim cannot make any progress - e.g. GFP_NOFS cannot invoke the oom killer and rely on a different context to make a forward progress in case there is a lot memory used by filesystems. Give us at least a clue when something like this happens and warn about allocations which take more than 10s. Print the basic allocation context information along with the cumulative time spent in the allocation as well as the allocation stack. Repeat the warning after every 10 seconds so that we know that the problem is permanent rather than ephemeral. Link: http://lkml.kernel.org/r/20160929084407.7004-3-mhocko@kernel.orgSigned-off-by: NMichal Hocko <mhocko@suse.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Dave Hansen <dave.hansen@intel.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Michal Hocko 提交于
warn_alloc_failed is currently used from the page and vmalloc allocators. This is a good reuse of the code except that vmalloc would appreciate a slightly different warning message. This is already handled by the fmt parameter except that "%s: page allocation failure: order:%u, mode:%#x(%pGg)" is printed anyway. This might be quite misleading because it might be a vmalloc failure which leads to the warning while the page allocator is not the culprit here. Fix this by always using the fmt string and only print the context that makes sense for the particular context (e.g. order makes only very little sense for the vmalloc context). Rename the function to not miss any user and also because a later patch will reuse it also for !failure cases. Link: http://lkml.kernel.org/r/20160929084407.7004-2-mhocko@kernel.orgSigned-off-by: NMichal Hocko <mhocko@suse.com> Acked-by: NVlastimil Babka <vbabka@suse.cz> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Dave Hansen <dave.hansen@intel.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Wei Fang 提交于
We triggered a deadloop in truncate_inode_pages_range() on 32 bits architecture with the test case bellow: ... fd = open(); write(fd, buf, 4096); preadv64(fd, &iovec, 1, 0xffffffff000); ftruncate(fd, 0); ... Then ftruncate() will not return forever. The filesystem used in this case is ubifs, but it can be triggered on many other filesystems. When preadv64() is called with offset=0xffffffff000, a page with index=0xffffffff will be added to the radix tree of ->mapping. Then this page can be found in ->mapping with pagevec_lookup(). After that, truncate_inode_pages_range(), which is called in ftruncate(), will fall into an infinite loop: - find a page with index=0xffffffff, since index>=end, this page won't be truncated - index++, and index become 0 - the page with index=0xffffffff will be found again The data type of index is unsigned long, so index won't overflow to 0 on 64 bits architecture in this case, and the dead loop won't happen. Since truncate_inode_pages_range() is executed with holding lock of inode->i_rwsem, any operation related with this lock will be blocked, and a hung task will happen, e.g.: INFO: task truncate_test:3364 blocked for more than 120 seconds. ... call_rwsem_down_write_failed+0x17/0x30 generic_file_write_iter+0x32/0x1c0 ubifs_write_iter+0xcc/0x170 __vfs_write+0xc4/0x120 vfs_write+0xb2/0x1b0 SyS_write+0x46/0xa0 The page with index=0xffffffff added to ->mapping is useless. Fix this by checking the read position before allocating pages. Link: http://lkml.kernel.org/r/1475151010-40166-1-git-send-email-fangwei1@huawei.comSigned-off-by: NWei Fang <fangwei1@huawei.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: <stable@vger.kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Yisheng Xie 提交于
Avoid making ifdef get pretty unwieldy if many ARCHs support gigantic page. No functional change with this patch. Link: http://lkml.kernel.org/r/1475227569-63446-2-git-send-email-xieyisheng1@huawei.comSigned-off-by: NYisheng Xie <xieyisheng1@huawei.com> Suggested-by: NMichal Hocko <mhocko@suse.com> Acked-by: NMichal Hocko <mhocko@suse.com> Acked-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com> Acked-by: NHillf Danton <hillf.zj@alibaba-inc.com> Cc: Hanjun Guo <guohanjun@huawei.com> Cc: Will Deacon <will.deacon@arm.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Sudeep Holla <sudeep.holla@arm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Rob Herring <robh+dt@kernel.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Michal Hocko 提交于
We have received a hard to explain oom report from a customer. The oom triggered regardless there is a lot of free memory: PoolThread invoked oom-killer: gfp_mask=0x280da, order=0, oom_adj=0, oom_score_adj=0 PoolThread cpuset=/ mems_allowed=0-7 Pid: 30055, comm: PoolThread Tainted: G E X 3.0.101-80-default #1 Call Trace: dump_trace+0x75/0x300 dump_stack+0x69/0x6f dump_header+0x8e/0x110 oom_kill_process+0xa6/0x350 out_of_memory+0x2b7/0x310 __alloc_pages_slowpath+0x7dd/0x820 __alloc_pages_nodemask+0x1e9/0x200 alloc_pages_vma+0xe1/0x290 do_anonymous_page+0x13e/0x300 do_page_fault+0x1fd/0x4c0 page_fault+0x25/0x30 [...] active_anon:1135959151 inactive_anon:1051962 isolated_anon:0 active_file:13093 inactive_file:222506 isolated_file:0 unevictable:262144 dirty:2 writeback:0 unstable:0 free:432672819 slab_reclaimable:7917 slab_unreclaimable:95308 mapped:261139 shmem:166297 pagetables:2228282 bounce:0 [...] Node 0 DMA free:15896kB min:0kB low:0kB high:0kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15672kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes lowmem_reserve[]: 0 2892 775542 775542 Node 0 DMA32 free:2783784kB min:28kB low:32kB high:40kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2961572kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes lowmem_reserve[]: 0 0 772650 772650 Node 0 Normal free:8120kB min:8160kB low:10200kB high:12240kB active_anon:779334960kB inactive_anon:2198744kB active_file:0kB inactive_file:180kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:791193600kB mlocked:131072kB dirty:0kB writeback:0kB mapped:372940kB shmem:361480kB slab_reclaimable:4536kB slab_unreclaimable:68472kB kernel_stack:10104kB pagetables:1414820kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:2280 all_unreclaimable? yes lowmem_reserve[]: 0 0 0 0 Node 1 Normal free:476718144kB min:8192kB low:10240kB high:12288kB active_anon:307623696kB inactive_anon:283620kB active_file:10392kB inactive_file:69908kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:794296320kB mlocked:131072kB dirty:4kB writeback:0kB mapped:257208kB shmem:189896kB slab_reclaimable:3868kB slab_unreclaimable:44756kB kernel_stack:1848kB pagetables:1369432kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no lowmem_reserve[]: 0 0 0 0 Node 2 Normal free:386002452kB min:8192kB low:10240kB high:12288kB active_anon:398563752kB inactive_anon:68184kB active_file:10292kB inactive_file:29936kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:794296320kB mlocked:131072kB dirty:0kB writeback:0kB mapped:32084kB shmem:776kB slab_reclaimable:6888kB slab_unreclaimable:60056kB kernel_stack:8208kB pagetables:1282880kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no lowmem_reserve[]: 0 0 0 0 Node 3 Normal free:196406760kB min:8192kB low:10240kB high:12288kB active_anon:587445640kB inactive_anon:164396kB active_file:5716kB inactive_file:709844kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:794296320kB mlocked:131072kB dirty:0kB writeback:0kB mapped:291776kB shmem:111416kB slab_reclaimable:5152kB slab_unreclaimable:44516kB kernel_stack:2168kB pagetables:1455956kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no lowmem_reserve[]: 0 0 0 0 Node 4 Normal free:425338880kB min:8192kB low:10240kB high:12288kB active_anon:359695204kB inactive_anon:43216kB active_file:5748kB inactive_file:14772kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:794296320kB mlocked:131072kB dirty:0kB writeback:0kB mapped:24708kB shmem:1120kB slab_reclaimable:1884kB slab_unreclaimable:41060kB kernel_stack:1856kB pagetables:1100208kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no lowmem_reserve[]: 0 0 0 0 Node 5 Normal free:11140kB min:8192kB low:10240kB high:12288kB active_anon:784240872kB inactive_anon:1217164kB active_file:28kB inactive_file:48kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:794296320kB mlocked:131072kB dirty:0kB writeback:0kB mapped:11408kB shmem:0kB slab_reclaimable:2008kB slab_unreclaimable:49220kB kernel_stack:1360kB pagetables:531600kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:1202 all_unreclaimable? yes lowmem_reserve[]: 0 0 0 0 Node 6 Normal free:243395332kB min:8192kB low:10240kB high:12288kB active_anon:542015544kB inactive_anon:40208kB active_file:968kB inactive_file:8484kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:794296320kB mlocked:131072kB dirty:0kB writeback:0kB mapped:19992kB shmem:496kB slab_reclaimable:1672kB slab_unreclaimable:37052kB kernel_stack:2088kB pagetables:750264kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no lowmem_reserve[]: 0 0 0 0 Node 7 Normal free:10768kB min:8192kB low:10240kB high:12288kB active_anon:784916936kB inactive_anon:192316kB active_file:19228kB inactive_file:56852kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:794296320kB mlocked:131072kB dirty:4kB writeback:0kB mapped:34440kB shmem:4kB slab_reclaimable:5660kB slab_unreclaimable:36100kB kernel_stack:1328kB pagetables:1007968kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no lowmem_reserve[]: 0 0 0 0 So all nodes but Node 0 have a lot of free memory which should suggest that there is an available memory especially when mems_allowed=0-7. One could speculate that a massive process has managed to terminate and free up a lot of memory while racing with the above allocation request. Although this is highly unlikely it cannot be ruled out. A further debugging, however shown that the faulting process had mempolicy (not cpuset) to bind to Node 0. We cannot see that information from the report though. mems_allowed turned out to be more confusing than really helpful. Fix this by always priting the nodemask. It is either mempolicy mask (and non-null) or the one defined by the cpusets. The new output for the above oom report would be PoolThread invoked oom-killer: gfp_mask=0x280da(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), nodemask=0, order=0, oom_adj=0, oom_score_adj=0 This patch doesn't touch show_mem and the node filtering based on the cpuset node mask because mempolicy is always a subset of cpusets and seeing the full cpuset oom context might be helpful for tunning more specific mempolicies inside cpusets (e.g. when they turn out to be too restrictive). To prevent from ugly ifdefs the mask is printed even for !NUMA configurations but this should be OK (a single node will be printed). Link: http://lkml.kernel.org/r/20160930214146.28600-1-mhocko@kernel.orgSigned-off-by: NMichal Hocko <mhocko@suse.com> Reported-by: NSellami Abdelkader <abdelkader.sellami@sap.com> Acked-by: NVlastimil Babka <vbabka@suse.cz> Cc: David Rientjes <rientjes@google.com> Cc: Sellami Abdelkader <abdelkader.sellami@sap.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Kirill A. Shutemov 提交于
Let's add comment on why we skip page_mapcount() for sl[aou]b pages. Link: http://lkml.kernel.org/r/20160922105532.GB24593@nodeSigned-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: NVlastimil Babka <vbabka@suse.cz> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Andrea Arcangeli 提交于
The old code was always doing: vma->vm_end = next->vm_end vma_rb_erase(next) // in __vma_unlink vma->vm_next = next->vm_next // in __vma_unlink next = vma->vm_next vma_gap_update(next) The new code still does the above for remove_next == 1 and 2, but for remove_next == 3 it has been changed and it does: next->vm_start = vma->vm_start vma_rb_erase(vma) // in __vma_unlink vma_gap_update(next) In the latter case, while unlinking "vma", validate_mm_rb() is told to ignore "vma" that is being removed, but next->vm_start was reduced instead. So for the new case, to avoid the false positive from validate_mm_rb, it should be "next" that is ignored when "vma" is being unlinked. "vma" and "next" in the above comment, considered pre-swap(). Link: http://lkml.kernel.org/r/1474492522-2261-4-git-send-email-aarcange@redhat.comSigned-off-by: NAndrea Arcangeli <aarcange@redhat.com> Tested-by: NShaun Tancheff <shaun.tancheff@seagate.com> Cc: Rik van Riel <riel@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Jan Vorlicek <janvorli@microsoft.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Andrea Arcangeli 提交于
The cases are three not two. Link: http://lkml.kernel.org/r/1474492522-2261-3-git-send-email-aarcange@redhat.comSigned-off-by: NAndrea Arcangeli <aarcange@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Jan Vorlicek <janvorli@microsoft.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Andrea Arcangeli 提交于
If next would be NULL we couldn't reach such code path. Link: http://lkml.kernel.org/r/1474309513-20313-2-git-send-email-aarcange@redhat.comSigned-off-by: NAndrea Arcangeli <aarcange@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Jan Vorlicek <janvorli@microsoft.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Andrea Arcangeli 提交于
The rmap_walk can access vm_page_prot (and potentially vm_flags in the pte/pmd manipulations). So it's not safe to wait the caller to update the vm_page_prot/vm_flags after vma_merge returned potentially removing the "next" vma and extending the "current" vma over the next->vm_start,vm_end range, but still with the "current" vma vm_page_prot, after releasing the rmap locks. The vm_page_prot/vm_flags must be transferred from the "next" vma to the current vma while vma_merge still holds the rmap locks. The side effect of this race condition is pte corruption during migrate as remove_migration_ptes when run on a address of the "next" vma that got removed, used the vm_page_prot of the current vma. migrate mprotect ------------ ------------- migrating in "next" vma vma_merge() # removes "next" vma and # extends "current" vma # current vma is not with # vm_page_prot updated remove_migration_ptes read vm_page_prot of current "vma" establish pte with wrong permissions vm_set_page_prot(vma) # too late! change_protection in the old vma range only, next range is not updated This caused segmentation faults and potentially memory corruption in heavy mprotect loads with some light page migration caused by compaction in the background. Hugh Dickins pointed out the comment about the Odd case 8 in vma_merge which confirms the case 8 is only buggy one where the race can trigger, in all other vma_merge cases the above cannot happen. This fix removes the oddness factor from case 8 and it converts it from: AAAA PPPPNNNNXXXX -> PPPPNNNNNNNN to: AAAA PPPPNNNNXXXX -> PPPPXXXXXXXX XXXX has the right vma properties for the whole merged vma returned by vma_adjust, so it solves the problem fully. It has the added benefits that the callers could stop updating vma properties when vma_merge succeeds however the callers are not updated by this patch (there are bits like VM_SOFTDIRTY that still need special care for the whole range, as the vma merging ignores them, but as long as they're not processed by rmap walks and instead they're accessed with the mmap_sem at least for reading, they are fine not to be updated within vma_adjust before releasing the rmap_locks). Link: http://lkml.kernel.org/r/1474309513-20313-1-git-send-email-aarcange@redhat.comSigned-off-by: NAndrea Arcangeli <aarcange@redhat.com> Reported-by: NAditya Mandaleeka <adityam@microsoft.com> Cc: Rik van Riel <riel@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Jan Vorlicek <janvorli@microsoft.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Andrea Arcangeli 提交于
mm->highest_vm_end doesn't need any update. After finally removing the oddness from vma_merge case 8 that was causing: 1) constant risk of trouble whenever anybody would check vma fields from rmap_walks, like it happened when page migration was introduced and it read the vma->vm_page_prot from a rmap_walk 2) the callers of vma_merge to re-initialize any value different from the current vma, instead of vma_merge() more reliably returning a vma that already matches all fields passed as parameter .. it is also worth to take the opportunity of cleaning up superfluous code in vma_adjust(), that if not removed adds up to the hard readability of the function. Link: http://lkml.kernel.org/r/1474492522-2261-5-git-send-email-aarcange@redhat.comSigned-off-by: NAndrea Arcangeli <aarcange@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Jan Vorlicek <janvorli@microsoft.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Andrea Arcangeli 提交于
vma->vm_page_prot is read lockless from the rmap_walk, it may be updated concurrently and this prevents the risk of reading intermediate values. Link: http://lkml.kernel.org/r/1474660305-19222-1-git-send-email-aarcange@redhat.comSigned-off-by: NAndrea Arcangeli <aarcange@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Jan Vorlicek <janvorli@microsoft.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 zhong jiang 提交于
According to Hugh's suggestion, alloc_stable_node() with GFP_KERNEL can in rare cases cause a hung task warning. At present, if alloc_stable_node() allocation fails, two break_cows may want to allocate a couple of pages, and the issue will come up when free memory is under pressure. We fix it by adding __GFP_HIGH to GFP, to grant access to memory reserves, increasing the likelihood of allocation success. [akpm@linux-foundation.org: tweak comment] Link: http://lkml.kernel.org/r/1474354484-58233-1-git-send-email-zhongjiang@huawei.comSigned-off-by: Nzhong jiang <zhongjiang@huawei.com> Suggested-by: NHugh Dickins <hughd@google.com> Acked-by: NHugh Dickins <hughd@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Yisheng Xie 提交于
Fix typo in comment. Link: http://lkml.kernel.org/r/1474788764-5774-1-git-send-email-ysxie@foxmail.comSigned-off-by: NYisheng Xie <xieyisheng1@huawei.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Gerald Schaefer 提交于
For every pfn aligned to minimum_order, dissolve_free_huge_pages() will call dissolve_free_huge_page() which takes the hugetlb spinlock, even if the page is not huge at all or a hugepage that is in-use. Improve this by doing the PageHuge() and page_count() checks already in dissolve_free_huge_pages() before calling dissolve_free_huge_page(). In dissolve_free_huge_page(), when holding the spinlock, those checks need to be revalidated. Link: http://lkml.kernel.org/r/20160926172811.94033-4-gerald.schaefer@de.ibm.comSigned-off-by: NGerald Schaefer <gerald.schaefer@de.ibm.com> Acked-by: NMichal Hocko <mhocko@suse.com> Acked-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.vnet.ibm.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Rui Teng <rui.teng@linux.vnet.ibm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Gerald Schaefer 提交于
In dissolve_free_huge_pages(), free hugepages will be dissolved without making sure that there are enough of them left to satisfy hugepage reservations. Fix this by adding a return value to dissolve_free_huge_pages() and checking h->free_huge_pages vs. h->resv_huge_pages. Note that this may lead to the situation where dissolve_free_huge_page() returns an error and all free hugepages that were dissolved before that error are lost, while the memory block still cannot be set offline. Fixes: c8721bbb ("mm: memory-hotplug: enable memory hotplug to handle hugepage") Link: http://lkml.kernel.org/r/20160926172811.94033-3-gerald.schaefer@de.ibm.comSigned-off-by: NGerald Schaefer <gerald.schaefer@de.ibm.com> Acked-by: NMichal Hocko <mhocko@suse.com> Acked-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.vnet.ibm.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Rui Teng <rui.teng@linux.vnet.ibm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Gerald Schaefer 提交于
Patch series "mm/hugetlb: memory offline issues with hugepages", v4. This addresses several issues with hugepages and memory offline. While the first patch fixes a panic, and is therefore rather important, the last patch is just a performance optimization. The second patch fixes a theoretical issue with reserved hugepages, while still leaving some ugly usability issue, see description. This patch (of 3): dissolve_free_huge_pages() will either run into the VM_BUG_ON() or a list corruption and addressing exception when trying to set a memory block offline that is part (but not the first part) of a "gigantic" hugetlb page with a size > memory block size. When no other smaller hugetlb page sizes are present, the VM_BUG_ON() will trigger directly. In the other case we will run into an addressing exception later, because dissolve_free_huge_page() will not work on the head page of the compound hugetlb page which will result in a NULL hstate from page_hstate(). To fix this, first remove the VM_BUG_ON() because it is wrong, and then use the compound head page in dissolve_free_huge_page(). This means that an unused pre-allocated gigantic page that has any part of itself inside the memory block that is going offline will be dissolved completely. Losing an unused gigantic hugepage is preferable to failing the memory offline, for example in the situation where a (possibly faulty) memory DIMM needs to go offline. Fixes: c8721bbb ("mm: memory-hotplug: enable memory hotplug to handle hugepage") Link: http://lkml.kernel.org/r/20160926172811.94033-2-gerald.schaefer@de.ibm.comSigned-off-by: NGerald Schaefer <gerald.schaefer@de.ibm.com> Acked-by: NMichal Hocko <mhocko@suse.com> Acked-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.vnet.ibm.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Rui Teng <rui.teng@linux.vnet.ibm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: <stable@vger.kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Wanlong Gao 提交于
Commit b4def350 ("mm, nobootmem: clean-up of free_low_memory_core_early()") removed the unnecessary nodeid argument, after that, this comment becomes more confused. We should move it to the right place. Fixes: b4def350 ("mm, nobootmem: clean-up of free_low_memory_core_early()") Link: http://lkml.kernel.org/r/1473996082-14603-1-git-send-email-wanlong.gao@gmail.comSigned-off-by: NWanlong Gao <wanlong.gao@gmail.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Rasmus Villemoes 提交于
Every other dentry_operations instance is const, and this one might as well be. Link: http://lkml.kernel.org/r/1473890528-7009-1-git-send-email-linux@rasmusvillemoes.dkSigned-off-by: NRasmus Villemoes <linux@rasmusvillemoes.dk> Acked-by: NHugh Dickins <hughd@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Johannes Weiner 提交于
The cgroup core and the memory controller need to track socket ownership for different purposes, but the tracking sites being entirely different is kind of ugly. Be a better citizen and rename the memory controller callbacks to match the cgroup core callbacks, then move them to the same place. [akpm@linux-foundation.org: coding-style fixes] Link: http://lkml.kernel.org/r/20160914194846.11153-3-hannes@cmpxchg.orgSigned-off-by: NJohannes Weiner <hannes@cmpxchg.org> Acked-by: NTejun Heo <tj@kernel.org> Cc: "David S. Miller" <davem@davemloft.net> Cc: Michal Hocko <mhocko@suse.cz> Cc: Vladimir Davydov <vdavydov@virtuozzo.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Andrew Morton 提交于
So they are CONFIG_DEBUG_VM-only and more informative. Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: David S. Miller <davem@davemloft.net> Cc: Hugh Dickins <hughd@google.com> Cc: Jens Axboe <axboe@fb.com> Cc: Joe Perches <joe@perches.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@suse.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Rik van Riel <riel@redhat.com> Cc: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Tetsuo Handa 提交于
Commit c32b3cbe ("oom, PM: make OOM detection in the freezer path raceless") inserted a WARN_ON() into pagefault_out_of_memory() in order to warn when we raced with disabling the OOM killer. Now, patch "oom, suspend: fix oom_killer_disable vs. pm suspend properly" introduced a timeout for oom_killer_disable(). Even if we raced with disabling the OOM killer and the system is OOM livelocked, the OOM killer will be enabled eventually (in 20 seconds by default) and the OOM livelock will be solved. Therefore, we no longer need to warn when we raced with disabling the OOM killer. Link: http://lkml.kernel.org/r/1473442120-7246-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jpSigned-off-by: NTetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Acked-by: NMichal Hocko <mhocko@suse.cz> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Vlastimil Babka 提交于
Fragmentation index and the vm.extfrag_threshold sysctl is meant as a heuristic to prevent excessive compaction for costly orders (i.e. THP). It's unlikely to make any difference for non-costly orders, especially with the default threshold. But we cannot afford any uncertainty for the non-costly orders where the only alternative to successful reclaim/compaction is OOM. After the recent patches we are guaranteed maximum effort without heuristics from compaction before deciding OOM, and fragindex is the last remaining heuristic. Therefore skip fragindex altogether for non-costly orders. Suggested-by: NMichal Hocko <mhocko@suse.com> Link: http://lkml.kernel.org/r/20160926162025.21555-5-vbabka@suse.czSigned-off-by: NVlastimil Babka <vbabka@suse.cz> Acked-by: NMichal Hocko <mhocko@suse.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: David Rientjes <rientjes@google.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Vlastimil Babka 提交于
The compaction_zonelist_suitable() function tries to determine if compaction will be able to proceed after sufficient reclaim, i.e. whether there are enough reclaimable pages to provide enough order-0 freepages for compaction. This addition of reclaimable pages to the free pages works well for the order-0 watermark check, but in the fragmentation index check we only consider truly free pages. Thus we can get fragindex value close to 0 which indicates failure do to lack of memory, and wrongly decide that compaction won't be suitable even after reclaim. Instead of trying to somehow adjust fragindex for reclaimable pages, let's just skip it from compaction_zonelist_suitable(). Link: http://lkml.kernel.org/r/20160926162025.21555-4-vbabka@suse.czSigned-off-by: NVlastimil Babka <vbabka@suse.cz> Acked-by: NMichal Hocko <mhocko@suse.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: David Rientjes <rientjes@google.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-