- 01 12月, 2018 1 次提交
-
-
由 Dmitry Vyukov 提交于
commit 61448479a9f2c954cde0cfe778cb6bec5d0a748d upstream. Slub does not call kmalloc_slab() for sizes > KMALLOC_MAX_CACHE_SIZE, instead it falls back to kmalloc_large(). For slab KMALLOC_MAX_CACHE_SIZE == KMALLOC_MAX_SIZE and it calls kmalloc_slab() for all allocations relying on NULL return value for over-sized allocations. This inconsistency leads to unwanted warnings from kmalloc_slab() for over-sized allocations for slab. Returning NULL for failed allocations is the expected behavior. Make slub and slab code consistent by checking size > KMALLOC_MAX_CACHE_SIZE in slab before calling kmalloc_slab(). While we are here also fix the check in kmalloc_slab(). We should check against KMALLOC_MAX_CACHE_SIZE rather than KMALLOC_MAX_SIZE. It all kinda worked because for slab the constants are the same, and slub always checks the size against KMALLOC_MAX_CACHE_SIZE before kmalloc_slab(). But if we get there with size > KMALLOC_MAX_CACHE_SIZE anyhow bad things will happen. For example, in case of a newly introduced bug in slub code. Also move the check in kmalloc_slab() from function entry to the size > 192 case. This partially compensates for the additional check in slab code and makes slub code a bit faster (at least theoretically). Also drop __GFP_NOWARN in the warning check. This warning means a bug in slab code itself, user-passed flags have nothing to do with it. Nothing of this affects slob. Link: http://lkml.kernel.org/r/20180927171502.226522-1-dvyukov@gmail.comSigned-off-by: NDmitry Vyukov <dvyukov@google.com> Reported-by: syzbot+87829a10073277282ad1@syzkaller.appspotmail.com Reported-by: syzbot+ef4e8fc3a06e9019bb40@syzkaller.appspotmail.com Reported-by: syzbot+6e438f4036df52cbb863@syzkaller.appspotmail.com Reported-by: syzbot+8574471d8734457d98aa@syzkaller.appspotmail.com Reported-by: syzbot+af1504df0807a083dbd9@syzkaller.appspotmail.com Acked-by: NChristoph Lameter <cl@linux.com> Acked-by: NVlastimil Babka <vbabka@suse.cz> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
-
- 21 11月, 2018 4 次提交
-
-
由 Vasily Averin 提交于
commit 873d7bcfd066663e3e50113dc4a0de19289b6354 upstream. Commit a2468cc9 ("swap: choose swap device according to numa node") changed 'avail_lists' field of 'struct swap_info_struct' to an array. In popular linux distros it increased size of swap_info_struct up to 40 Kbytes and now swap_info_struct allocation requires order-4 page. Switch to kvzmalloc allows to avoid unexpected allocation failures. Link: http://lkml.kernel.org/r/fc23172d-3c75-21e2-d551-8b1808cbe593@virtuozzo.com Fixes: a2468cc9 ("swap: choose swap device according to numa node") Signed-off-by: NVasily Averin <vvs@virtuozzo.com> Acked-by: NAaron Lu <aaron.lu@intel.com> Acked-by: NMichal Hocko <mhocko@suse.com> Reviewed-by: NAndrew Morton <akpm@linux-foundation.org> Cc: Huang Ying <ying.huang@intel.com> Cc: <stable@vger.kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
-
由 Mike Kravetz 提交于
commit 5e41540c8a0f0e98c337dda8b391e5dda0cde7cf upstream. This bug has been experienced several times by the Oracle DB team. The BUG is in remove_inode_hugepages() as follows: /* * If page is mapped, it was faulted in after being * unmapped in caller. Unmap (again) now after taking * the fault mutex. The mutex will prevent faults * until we finish removing the page. * * This race can only happen in the hole punch case. * Getting here in a truncate operation is a bug. */ if (unlikely(page_mapped(page))) { BUG_ON(truncate_op); In this case, the elevated map count is not the result of a race. Rather it was incorrectly incremented as the result of a bug in the huge pmd sharing code. Consider the following: - Process A maps a hugetlbfs file of sufficient size and alignment (PUD_SIZE) that a pmd page could be shared. - Process B maps the same hugetlbfs file with the same size and alignment such that a pmd page is shared. - Process B then calls mprotect() to change protections for the mapping with the shared pmd. As a result, the pmd is 'unshared'. - Process B then calls mprotect() again to chage protections for the mapping back to their original value. pmd remains unshared. - Process B then forks and process C is created. During the fork process, we do dup_mm -> dup_mmap -> copy_page_range to copy page tables. Copying page tables for hugetlb mappings is done in the routine copy_hugetlb_page_range. In copy_hugetlb_page_range(), the destination pte is obtained by: dst_pte = huge_pte_alloc(dst, addr, sz); If pmd sharing is possible, the returned pointer will be to a pte in an existing page table. In the situation above, process C could share with either process A or process B. Since process A is first in the list, the returned pte is a pointer to a pte in process A's page table. However, the check for pmd sharing in copy_hugetlb_page_range is: /* If the pagetables are shared don't copy or take references */ if (dst_pte == src_pte) continue; Since process C is sharing with process A instead of process B, the above test fails. The code in copy_hugetlb_page_range which follows assumes dst_pte points to a huge_pte_none pte. It copies the pte entry from src_pte to dst_pte and increments this map count of the associated page. This is how we end up with an elevated map count. To solve, check the dst_pte entry for huge_pte_none. If !none, this implies PMD sharing so do not copy. Link: http://lkml.kernel.org/r/20181105212315.14125-1-mike.kravetz@oracle.com Fixes: c5c99429 ("fix hugepages leak due to pagetable page sharing") Signed-off-by: NMike Kravetz <mike.kravetz@oracle.com> Reviewed-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Prakash Sangappa <prakash.sangappa@oracle.com> Cc: <stable@vger.kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
-
由 Michal Hocko 提交于
commit dd33ad7b251f900481701b2a82d25de583867708 upstream. We have received a bug report that unbinding a large pmem (>1TB) can result in a soft lockup: NMI watchdog: BUG: soft lockup - CPU#9 stuck for 23s! [ndctl:4365] [...] Supported: Yes CPU: 9 PID: 4365 Comm: ndctl Not tainted 4.12.14-94.40-default #1 SLE12-SP4 Hardware name: Intel Corporation S2600WFD/S2600WFD, BIOS SE5C620.86B.01.00.0833.051120182255 05/11/2018 task: ffff9cce7d4410c0 task.stack: ffffbe9eb1bc4000 RIP: 0010:__put_page+0x62/0x80 Call Trace: devm_memremap_pages_release+0x152/0x260 release_nodes+0x18d/0x1d0 device_release_driver_internal+0x160/0x210 unbind_store+0xb3/0xe0 kernfs_fop_write+0x102/0x180 __vfs_write+0x26/0x150 vfs_write+0xad/0x1a0 SyS_write+0x42/0x90 do_syscall_64+0x74/0x150 entry_SYSCALL_64_after_hwframe+0x3d/0xa2 RIP: 0033:0x7fd13166b3d0 It has been reported on an older (4.12) kernel but the current upstream code doesn't cond_resched in the hot remove code at all and the given range to remove might be really large. Fix the issue by calling cond_resched once per memory section. Link: http://lkml.kernel.org/r/20181031125840.23982-1-mhocko@kernel.orgSigned-off-by: NMichal Hocko <mhocko@suse.com> Acked-by: NJohannes Thumshirn <jthumshirn@suse.de> Cc: Dan Williams <dan.j.williams@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
-
由 Andrea Arcangeli 提交于
commit ac5b2c18911ffe95c08d69273917f90212cf5659 upstream. THP allocation might be really disruptive when allocated on NUMA system with the local node full or hard to reclaim. Stefan has posted an allocation stall report on 4.12 based SLES kernel which suggests the same issue: kvm: page allocation stalls for 194572ms, order:9, mode:0x4740ca(__GFP_HIGHMEM|__GFP_IO|__GFP_FS|__GFP_COMP|__GFP_NOMEMALLOC|__GFP_HARDWALL|__GFP_THISNODE|__GFP_MOVABLE|__GFP_DIRECT_RECLAIM), nodemask=(null) kvm cpuset=/ mems_allowed=0-1 CPU: 10 PID: 84752 Comm: kvm Tainted: G W 4.12.0+98-ph <a href="/view.php?id=1" title="[geschlossen] Integration Ramdisk" class="resolved">0000001</a> SLE15 (unreleased) Hardware name: Supermicro SYS-1029P-WTRT/X11DDW-NT, BIOS 2.0 12/05/2017 Call Trace: dump_stack+0x5c/0x84 warn_alloc+0xe0/0x180 __alloc_pages_slowpath+0x820/0xc90 __alloc_pages_nodemask+0x1cc/0x210 alloc_pages_vma+0x1e5/0x280 do_huge_pmd_wp_page+0x83f/0xf00 __handle_mm_fault+0x93d/0x1060 handle_mm_fault+0xc6/0x1b0 __do_page_fault+0x230/0x430 do_page_fault+0x2a/0x70 page_fault+0x7b/0x80 [...] Mem-Info: active_anon:126315487 inactive_anon:1612476 isolated_anon:5 active_file:60183 inactive_file:245285 isolated_file:0 unevictable:15657 dirty:286 writeback:1 unstable:0 slab_reclaimable:75543 slab_unreclaimable:2509111 mapped:81814 shmem:31764 pagetables:370616 bounce:0 free:32294031 free_pcp:6233 free_cma:0 Node 0 active_anon:254680388kB inactive_anon:1112760kB active_file:240648kB inactive_file:981168kB unevictable:13368kB isolated(anon):0kB isolated(file):0kB mapped:280240kB dirty:1144kB writeback:0kB shmem:95832kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 81225728kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no Node 1 active_anon:250583072kB inactive_anon:5337144kB active_file:84kB inactive_file:0kB unevictable:49260kB isolated(anon):20kB isolated(file):0kB mapped:47016kB dirty:0kB writeback:4kB shmem:31224kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 31897600kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no The defrag mode is "madvise" and from the above report it is clear that the THP has been allocated for MADV_HUGEPAGA vma. Andrea has identified that the main source of the problem is __GFP_THISNODE usage: : The problem is that direct compaction combined with the NUMA : __GFP_THISNODE logic in mempolicy.c is telling reclaim to swap very : hard the local node, instead of failing the allocation if there's no : THP available in the local node. : : Such logic was ok until __GFP_THISNODE was added to the THP allocation : path even with MPOL_DEFAULT. : : The idea behind the __GFP_THISNODE addition, is that it is better to : provide local memory in PAGE_SIZE units than to use remote NUMA THP : backed memory. That largely depends on the remote latency though, on : threadrippers for example the overhead is relatively low in my : experience. : : The combination of __GFP_THISNODE and __GFP_DIRECT_RECLAIM results in : extremely slow qemu startup with vfio, if the VM is larger than the : size of one host NUMA node. This is because it will try very hard to : unsuccessfully swapout get_user_pages pinned pages as result of the : __GFP_THISNODE being set, instead of falling back to PAGE_SIZE : allocations and instead of trying to allocate THP on other nodes (it : would be even worse without vfio type1 GUP pins of course, except it'd : be swapping heavily instead). Fix this by removing __GFP_THISNODE for THP requests which are requesting the direct reclaim. This effectivelly reverts 5265047a on the grounds that the zone/node reclaim was known to be disruptive due to premature reclaim when there was memory free. While it made sense at the time for HPC workloads without NUMA awareness on rare machines, it was ultimately harmful in the majority of cases. The existing behaviour is similar, if not as widespare as it applies to a corner case but crucially, it cannot be tuned around like zone_reclaim_mode can. The default behaviour should always be to cause the least harm for the common case. If there are specialised use cases out there that want zone_reclaim_mode in specific cases, then it can be built on top. Longterm we should consider a memory policy which allows for the node reclaim like behavior for the specific memory ranges which would allow a [1] http://lkml.kernel.org/r/20180820032204.9591-1-aarcange@redhat.com Mel said: : Both patches look correct to me but I'm responding to this one because : it's the fix. The change makes sense and moves further away from the : severe stalling behaviour we used to see with both THP and zone reclaim : mode. : : I put together a basic experiment with usemem configured to reference a : buffer multiple times that is 80% the size of main memory on a 2-socket : box with symmetric node sizes and defrag set to "always". The defrag : setting is not the default but it would be functionally similar to : accessing a buffer with madvise(MADV_HUGEPAGE). Usemem is configured to : reference the buffer multiple times and while it's not an interesting : workload, it would be expected to complete reasonably quickly as it fits : within memory. The results were; : : usemem : vanilla noreclaim-v1 : Amean Elapsd-1 42.78 ( 0.00%) 26.87 ( 37.18%) : Amean Elapsd-3 27.55 ( 0.00%) 7.44 ( 73.00%) : Amean Elapsd-4 5.72 ( 0.00%) 5.69 ( 0.45%) : : This shows the elapsed time in seconds for 1 thread, 3 threads and 4 : threads referencing buffers 80% the size of memory. With the patches : applied, it's 37.18% faster for the single thread and 73% faster with two : threads. Note that 4 threads showing little difference does not indicate : the problem is related to thread counts. It's simply the case that 4 : threads gets spread so their workload mostly fits in one node. : : The overall view from /proc/vmstats is more startling : : 4.19.0-rc1 4.19.0-rc1 : vanillanoreclaim-v1r1 : Minor Faults 35593425 708164 : Major Faults 484088 36 : Swap Ins 3772837 0 : Swap Outs 3932295 0 : : Massive amounts of swap in/out without the patch : : Direct pages scanned 6013214 0 : Kswapd pages scanned 0 0 : Kswapd pages reclaimed 0 0 : Direct pages reclaimed 4033009 0 : : Lots of reclaim activity without the patch : : Kswapd efficiency 100% 100% : Kswapd velocity 0.000 0.000 : Direct efficiency 67% 100% : Direct velocity 11191.956 0.000 : : Mostly from direct reclaim context as you'd expect without the patch. : : Page writes by reclaim 3932314.000 0.000 : Page writes file 19 0 : Page writes anon 3932295 0 : Page reclaim immediate 42336 0 : : Writes from reclaim context is never good but the patch eliminates it. : : We should never have default behaviour to thrash the system for such a : basic workload. If zone reclaim mode behaviour is ever desired but on a : single task instead of a global basis then the sensible option is to build : a mempolicy that enforces that behaviour. This was a severe regression compared to previous kernels that made important workloads unusable and it starts when __GFP_THISNODE was added to THP allocations under MADV_HUGEPAGE. It is not a significant risk to go to the previous behavior before __GFP_THISNODE was added, it worked like that for years. This was simply an optimization to some lucky workloads that can fit in a single node, but it ended up breaking the VM for others that can't possibly fit in a single node, so going back is safe. [mhocko@suse.com: rewrote the changelog based on the one from Andrea] Link: http://lkml.kernel.org/r/20180925120326.24392-2-mhocko@kernel.org Fixes: 5265047a ("mm, thp: really limit transparent hugepage allocation to local node") Signed-off-by: NAndrea Arcangeli <aarcange@redhat.com> Signed-off-by: NMichal Hocko <mhocko@suse.com> Reported-by: NStefan Priebe <s.priebe@profihost.ag> Debugged-by: NAndrea Arcangeli <aarcange@redhat.com> Reported-by: NAlex Williamson <alex.williamson@redhat.com> Reviewed-by: NMel Gorman <mgorman@techsingularity.net> Tested-by: NMel Gorman <mgorman@techsingularity.net> Cc: Zi Yan <zi.yan@cs.rutgers.edu> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: David Rientjes <rientjes@google.com> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: <stable@vger.kernel.org> [4.1+] Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
-
- 14 11月, 2018 3 次提交
-
-
由 Ralph Campbell 提交于
commit 86a2d59841ab0b147ffc1b7b3041af87927cf312 upstream. In hmm_mirror_unregister(), mm->hmm is set to NULL and then mmu_notifier_unregister_no_release() is called. That creates a small window where mmu_notifier can call mmu_notifier_ops with mm->hmm equal to NULL. Fix this by first unregistering mmu notifier callbacks and then setting mm->hmm to NULL. Similarly in hmm_register(), set mm->hmm before registering mmu_notifier callbacks so callback functions always see mm->hmm set. Link: http://lkml.kernel.org/r/20181019160442.18723-4-jglisse@redhat.comSigned-off-by: NRalph Campbell <rcampbell@nvidia.com> Signed-off-by: NJérôme Glisse <jglisse@redhat.com> Reviewed-by: NJohn Hubbard <jhubbard@nvidia.com> Reviewed-by: NJérôme Glisse <jglisse@redhat.com> Reviewed-by: NBalbir Singh <bsingharora@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
-
由 Ralph Campbell 提交于
commit aab8d0520e6e7c2a61f71195e6ce7007a4843afb upstream. Private ZONE_DEVICE pages use a special pte entry and thus are not present. Properly handle this case in map_pte(), it is already handled in check_pte(), the map_pte() part was lost in some rebase most probably. Without this patch the slow migration path can not migrate back to any private ZONE_DEVICE memory to regular memory. This was found after stress testing migration back to system memory. This ultimatly can lead to the CPU constantly page fault looping on the special swap entry. Link: http://lkml.kernel.org/r/20181019160442.18723-3-jglisse@redhat.comSigned-off-by: NRalph Campbell <rcampbell@nvidia.com> Signed-off-by: NJérôme Glisse <jglisse@redhat.com> Reviewed-by: NBalbir Singh <bsingharora@gmail.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: <stable@vger.kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
-
由 Mike Kravetz 提交于
commit 22146c3ce98962436e401f7b7016a6f664c9ffb5 upstream. Some test systems were experiencing negative huge page reserve counts and incorrect file block counts. This was traced to /proc/sys/vm/drop_caches removing clean pages from hugetlbfs file pagecaches. When non-hugetlbfs explicit code removes the pages, the appropriate accounting is not performed. This can be recreated as follows: fallocate -l 2M /dev/hugepages/foo echo 1 > /proc/sys/vm/drop_caches fallocate -l 2M /dev/hugepages/foo grep -i huge /proc/meminfo AnonHugePages: 0 kB ShmemHugePages: 0 kB HugePages_Total: 2048 HugePages_Free: 2047 HugePages_Rsvd: 18446744073709551615 HugePages_Surp: 0 Hugepagesize: 2048 kB Hugetlb: 4194304 kB ls -lsh /dev/hugepages/foo 4.0M -rw-r--r--. 1 root root 2.0M Oct 17 20:05 /dev/hugepages/foo To address this issue, dirty pages as they are added to pagecache. This can easily be reproduced with fallocate as shown above. Read faulted pages will eventually end up being marked dirty. But there is a window where they are clean and could be impacted by code such as drop_caches. So, just dirty them all as they are added to the pagecache. Link: http://lkml.kernel.org/r/b5be45b8-5afe-56cd-9482-28384699a049@oracle.com Fixes: 6bda666a ("hugepages: fold find_or_alloc_pages into huge_no_page()") Signed-off-by: NMike Kravetz <mike.kravetz@oracle.com> Acked-by: NMihcla Hocko <mhocko@suse.com> Reviewed-by: NKhalid Aziz <khalid.aziz@oracle.com> Cc: Hugh Dickins <hughd@google.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.vnet.ibm.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: <stable@vger.kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
-
- 18 10月, 2018 1 次提交
-
-
由 Linus Torvalds 提交于
Jann Horn points out that our TLB flushing was subtly wrong for the mremap() case. What makes mremap() special is that we don't follow the usual "add page to list of pages to be freed, then flush tlb, and then free pages". No, mremap() obviously just _moves_ the page from one page table location to another. That matters, because mremap() thus doesn't directly control the lifetime of the moved page with a freelist: instead, the lifetime of the page is controlled by the page table locking, that serializes access to the entry. As a result, we need to flush the TLB not just before releasing the lock for the source location (to avoid any concurrent accesses to the entry), but also before we release the destination page table lock (to avoid the TLB being flushed after somebody else has already done something to that page). This also makes the whole "need_flush" logic unnecessary, since we now always end up flushing the TLB for every valid entry. Reported-and-tested-by: NJann Horn <jannh@google.com> Acked-by: NWill Deacon <will.deacon@arm.com> Tested-by: NIngo Molnar <mingo@kernel.org> Acked-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
-
- 13 10月, 2018 2 次提交
-
-
由 Jérôme Glisse 提交于
Inside set_pmd_migration_entry() we are holding page table locks and thus we can not sleep so we can not call invalidate_range_start/end() So remove call to mmu_notifier_invalidate_range_start/end() because they are call inside the function calling set_pmd_migration_entry() (see try_to_unmap_one()). Link: http://lkml.kernel.org/r/20181012181056.7864-1-jglisse@redhat.comSigned-off-by: NJérôme Glisse <jglisse@redhat.com> Reported-by: NAndrea Arcangeli <aarcange@redhat.com> Reviewed-by: NZi Yan <zi.yan@cs.rutgers.edu> Acked-by: NMichal Hocko <mhocko@kernel.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: David Nellans <dnellans@nvidia.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Minchan Kim <minchan@kernel.org> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: <stable@vger.kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
-
由 Jann Horn 提交于
Daniel Micay reports that attempting to use MAP_FIXED_NOREPLACE in an application causes that application to randomly crash. The existing check for handling MAP_FIXED_NOREPLACE looks up the first VMA that either overlaps or follows the requested region, and then bails out if that VMA overlaps *the start* of the requested region. It does not bail out if the VMA only overlaps another part of the requested region. Fix it by checking that the found VMA only starts at or after the end of the requested region, in which case there is no overlap. Test case: user@debian:~$ cat mmap_fixed_simple.c #include <sys/mman.h> #include <errno.h> #include <stdio.h> #include <stdlib.h> #include <unistd.h> #ifndef MAP_FIXED_NOREPLACE #define MAP_FIXED_NOREPLACE 0x100000 #endif int main(void) { char *p; errno = 0; p = mmap((void*)0x10001000, 0x4000, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_FIXED_NOREPLACE, -1, 0); printf("p1=%p err=%m\n", p); errno = 0; p = mmap((void*)0x10000000, 0x2000, PROT_READ, MAP_PRIVATE|MAP_ANONYMOUS|MAP_FIXED_NOREPLACE, -1, 0); printf("p2=%p err=%m\n", p); char cmd[100]; sprintf(cmd, "cat /proc/%d/maps", getpid()); system(cmd); return 0; } user@debian:~$ gcc -o mmap_fixed_simple mmap_fixed_simple.c user@debian:~$ ./mmap_fixed_simple p1=0x10001000 err=Success p2=0x10000000 err=Success 10000000-10002000 r--p 00000000 00:00 0 10002000-10005000 ---p 00000000 00:00 0 564a9a06f000-564a9a070000 r-xp 00000000 fe:01 264004 /home/user/mmap_fixed_simple 564a9a26f000-564a9a270000 r--p 00000000 fe:01 264004 /home/user/mmap_fixed_simple 564a9a270000-564a9a271000 rw-p 00001000 fe:01 264004 /home/user/mmap_fixed_simple 564a9a54a000-564a9a56b000 rw-p 00000000 00:00 0 [heap] 7f8eba447000-7f8eba5dc000 r-xp 00000000 fe:01 405885 /lib/x86_64-linux-gnu/libc-2.24.so 7f8eba5dc000-7f8eba7dc000 ---p 00195000 fe:01 405885 /lib/x86_64-linux-gnu/libc-2.24.so 7f8eba7dc000-7f8eba7e0000 r--p 00195000 fe:01 405885 /lib/x86_64-linux-gnu/libc-2.24.so 7f8eba7e0000-7f8eba7e2000 rw-p 00199000 fe:01 405885 /lib/x86_64-linux-gnu/libc-2.24.so 7f8eba7e2000-7f8eba7e6000 rw-p 00000000 00:00 0 7f8eba7e6000-7f8eba809000 r-xp 00000000 fe:01 405876 /lib/x86_64-linux-gnu/ld-2.24.so 7f8eba9e9000-7f8eba9eb000 rw-p 00000000 00:00 0 7f8ebaa06000-7f8ebaa09000 rw-p 00000000 00:00 0 7f8ebaa09000-7f8ebaa0a000 r--p 00023000 fe:01 405876 /lib/x86_64-linux-gnu/ld-2.24.so 7f8ebaa0a000-7f8ebaa0b000 rw-p 00024000 fe:01 405876 /lib/x86_64-linux-gnu/ld-2.24.so 7f8ebaa0b000-7f8ebaa0c000 rw-p 00000000 00:00 0 7ffcc99fa000-7ffcc9a1b000 rw-p 00000000 00:00 0 [stack] 7ffcc9b44000-7ffcc9b47000 r--p 00000000 00:00 0 [vvar] 7ffcc9b47000-7ffcc9b49000 r-xp 00000000 00:00 0 [vdso] ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0 [vsyscall] user@debian:~$ uname -a Linux debian 4.19.0-rc6+ #181 SMP Wed Oct 3 23:43:42 CEST 2018 x86_64 GNU/Linux user@debian:~$ As you can see, the first page of the mapping at 0x10001000 was clobbered. Link: http://lkml.kernel.org/r/20181010152736.99475-1-jannh@google.com Fixes: a4ff8e86 ("mm: introduce MAP_FIXED_NOREPLACE") Signed-off-by: NJann Horn <jannh@google.com> Reported-by: NDaniel Micay <danielmicay@gmail.com> Acked-by: NMichal Hocko <mhocko@suse.com> Acked-by: NJohn Hubbard <jhubbard@nvidia.com> Acked-by: NKees Cook <keescook@chromium.org> Acked-by: NVlastimil Babka <vbabka@suse.cz> Cc: <stable@vger.kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
-
- 09 10月, 2018 1 次提交
-
-
由 Srikar Dronamraju 提交于
Remove the leftover pglist_data::numabalancing_migrate_lock and its initialization, we stopped using this lock with: efaffc5e ("mm, sched/numa: Remove rate-limiting of automatic NUMA balancing migration") [ mingo: Rewrote the changelog. ] Signed-off-by: NSrikar Dronamraju <srikar@linux.vnet.ibm.com> Acked-by: NMel Gorman <mgorman@techsingularity.net> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Linux-MM <linux-mm@kvack.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@surriel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1538824999-31230-1-git-send-email-srikar@linux.vnet.ibm.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
- 08 10月, 2018 1 次提交
-
-
由 Mike Rapoport 提交于
The commit ca460b3c ("percpu: introduce bitmap metadata blocks") introduced bitmap metadata blocks. These metadata blocks are allocated whenever a new chunk is created, but they are never freed. Fix it. Fixes: ca460b3c ("percpu: introduce bitmap metadata blocks") Signed-off-by: NMike Rapoport <rppt@linux.vnet.ibm.com> Cc: stable@vger.kernel.org Signed-off-by: NDennis Zhou <dennis@kernel.org>
-
- 06 10月, 2018 9 次提交
-
-
由 Daniel Black 提交于
Reproducer, assuming 2M of hugetlbfs available: Hugetlbfs mounted, size=2M and option user=testuser # mount | grep ^hugetlbfs hugetlbfs on /dev/hugepages type hugetlbfs (rw,pagesize=2M,user=dan) # sysctl vm.nr_hugepages=1 vm.nr_hugepages = 1 # grep Huge /proc/meminfo AnonHugePages: 0 kB ShmemHugePages: 0 kB HugePages_Total: 1 HugePages_Free: 1 HugePages_Rsvd: 0 HugePages_Surp: 0 Hugepagesize: 2048 kB Hugetlb: 2048 kB Code: #include <sys/mman.h> #include <stddef.h> #define SIZE 2*1024*1024 int main() { void *ptr; ptr = mmap(NULL, SIZE, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_HUGETLB | MAP_ANONYMOUS, -1, 0); madvise(ptr, SIZE, MADV_DONTDUMP); madvise(ptr, SIZE, MADV_DODUMP); } Compile and strace: mmap(NULL, 2097152, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_HUGETLB, -1, 0) = 0x7ff7c9200000 madvise(0x7ff7c9200000, 2097152, MADV_DONTDUMP) = 0 madvise(0x7ff7c9200000, 2097152, MADV_DODUMP) = -1 EINVAL (Invalid argument) hugetlbfs pages have VM_DONTEXPAND in the VmFlags driver pages based on author testing with analysis from Florian Weimer[1]. The inclusion of VM_DONTEXPAND into the VM_SPECIAL defination was a consequence of the large useage of VM_DONTEXPAND in device drivers. A consequence of [2] is that VM_DONTEXPAND marked pages are unable to be marked DODUMP. A user could quite legitimately madvise(MADV_DONTDUMP) their hugetlbfs memory for a while and later request that madvise(MADV_DODUMP) on the same memory. We correct this omission by allowing madvice(MADV_DODUMP) on hugetlbfs pages. [1] https://stackoverflow.com/questions/52548260/madvisedodump-on-the-same-ptr-size-as-a-successful-madvisedontdump-fails-wit [2] commit 0103bd16 ("mm: prepare VM_DONTDUMP for using in drivers") Link: http://lkml.kernel.org/r/20180930054629.29150-1-daniel@linux.ibm.com Link: https://lists.launchpad.net/maria-discuss/msg05245.html Fixes: 0103bd16 ("mm: prepare VM_DONTDUMP for using in drivers") Reported-by: NKenneth Penza <kpenza@gmail.com> Signed-off-by: NDaniel Black <daniel@linux.ibm.com> Reviewed-by: NMike Kravetz <mike.kravetz@oracle.com> Cc: Konstantin Khlebnikov <khlebnikov@openvz.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
-
由 Kirill Tkhai 提交于
do_shrink_slab() returns unsigned long value, and the placing into int variable cuts high bytes off. Then we compare ret and 0xfffffffe (since SHRINK_EMPTY is converted to ret type). Thus a large number of objects returned by do_shrink_slab() may be interpreted as SHRINK_EMPTY, if low bytes of their value are equal to 0xfffffffe. Fix that by declaration ret as unsigned long in these functions. Link: http://lkml.kernel.org/r/153813407177.17544.14888305435570723973.stgit@localhost.localdomainSigned-off-by: NKirill Tkhai <ktkhai@virtuozzo.com> Reported-by: NCyrill Gorcunov <gorcunov@openvz.org> Acked-by: NCyrill Gorcunov <gorcunov@openvz.org> Reviewed-by: NJosef Bacik <josef@toxicpanda.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Shakeel Butt <shakeelb@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
-
由 Jann Horn 提交于
5dd0b16c ("mm/vmstat: Make NR_TLB_REMOTE_FLUSH_RECEIVED available even on UP") made the availability of the NR_TLB_REMOTE_FLUSH* counters inside the kernel unconditional to reduce #ifdef soup, but (either to avoid showing dummy zero counters to userspace, or because that code was missed) didn't update the vmstat_array, meaning that all following counters would be shown with incorrect values. This only affects kernel builds with CONFIG_VM_EVENT_COUNTERS=y && CONFIG_DEBUG_TLBFLUSH=y && CONFIG_SMP=n. Link: http://lkml.kernel.org/r/20181001143138.95119-2-jannh@google.com Fixes: 5dd0b16c ("mm/vmstat: Make NR_TLB_REMOTE_FLUSH_RECEIVED available even on UP") Signed-off-by: NJann Horn <jannh@google.com> Reviewed-by: NKees Cook <keescook@chromium.org> Reviewed-by: NAndrew Morton <akpm@linux-foundation.org> Acked-by: NMichal Hocko <mhocko@suse.com> Acked-by: NRoman Gushchin <guro@fb.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Christoph Lameter <clameter@sgi.com> Cc: Kemi Wang <kemi.wang@intel.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
-
由 Jann Horn 提交于
7a9cdebd ("mm: get rid of vmacache_flush_all() entirely") removed the VMACACHE_FULL_FLUSHES statistics, but didn't remove the corresponding entry in vmstat_text. This causes an out-of-bounds access in vmstat_show(). Luckily this only affects kernels with CONFIG_DEBUG_VM_VMACACHE=y, which is probably very rare. Link: http://lkml.kernel.org/r/20181001143138.95119-1-jannh@google.com Fixes: 7a9cdebd ("mm: get rid of vmacache_flush_all() entirely") Signed-off-by: NJann Horn <jannh@google.com> Reviewed-by: NKees Cook <keescook@chromium.org> Reviewed-by: NAndrew Morton <akpm@linux-foundation.org> Acked-by: NMichal Hocko <mhocko@suse.com> Acked-by: NRoman Gushchin <guro@fb.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Christoph Lameter <clameter@sgi.com> Cc: Kemi Wang <kemi.wang@intel.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Ingo Molnar <mingo@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
-
由 Anshuman Khandual 提交于
split_huge_page_to_list() fails on HugeTLB pages. I was experimenting with moving 32MB contig HugeTLB pages on arm64 (with a debug patch applied) and hit the following stack trace when the kernel crashed. [ 3732.462797] Call trace: [ 3732.462835] split_huge_page_to_list+0x3b0/0x858 [ 3732.462913] migrate_pages+0x728/0xc20 [ 3732.462999] soft_offline_page+0x448/0x8b0 [ 3732.463097] __arm64_sys_madvise+0x724/0x850 [ 3732.463197] el0_svc_handler+0x74/0x110 [ 3732.463297] el0_svc+0x8/0xc [ 3732.463347] Code: d1000400 f90b0e60 f2fbd5a2 a94982a1 (f9000420) When unmap_and_move[_huge_page]() fails due to lack of memory, the splitting should happen only for transparent huge pages not for HugeTLB pages. PageTransHuge() returns true for both THP and HugeTLB pages. Hence the conditonal check should test PagesHuge() flag to make sure that given pages is not a HugeTLB one. Link: http://lkml.kernel.org/r/1537798495-4996-1-git-send-email-anshuman.khandual@arm.com Fixes: 94723aaf ("mm: unclutter THP migration") Signed-off-by: NAnshuman Khandual <anshuman.khandual@arm.com> Acked-by: NMichal Hocko <mhocko@suse.com> Acked-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Zi Yan <zi.yan@cs.rutgers.edu> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
-
由 YueHaibing 提交于
get_user_pages_fast() will return negative value if no pages were pinned, then be converted to a unsigned, which is compared to zero, giving the wrong result. Link: http://lkml.kernel.org/r/20180921095015.26088-1-yuehaibing@huawei.com Fixes: 09e35a4a ("mm/gup_benchmark: handle gup failures") Signed-off-by: NYueHaibing <yuehaibing@huawei.com> Reviewed-by: NAndrew Morton <akpm@linux-foundation.org> Cc: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
-
由 Kirill A. Shutemov 提交于
A transparent huge page is represented by a single entry on an LRU list. Therefore, we can only make unevictable an entire compound page, not individual subpages. If a user tries to mlock() part of a huge page, we want the rest of the page to be reclaimable. We handle this by keeping PTE-mapped huge pages on normal LRU lists: the PMD on border of VM_LOCKED VMA will be split into PTE table. Introduction of THP migration breaks[1] the rules around mlocking THP pages. If we had a single PMD mapping of the page in mlocked VMA, the page will get mlocked, regardless of PTE mappings of the page. For tmpfs/shmem it's easy to fix by checking PageDoubleMap() in remove_migration_pmd(). Anon THP pages can only be shared between processes via fork(). Mlocked page can only be shared if parent mlocked it before forking, otherwise CoW will be triggered on mlock(). For Anon-THP, we can fix the issue by munlocking the page on removing PTE migration entry for the page. PTEs for the page will always come after mlocked PMD: rmap walks VMAs from oldest to newest. Test-case: #include <unistd.h> #include <sys/mman.h> #include <sys/wait.h> #include <linux/mempolicy.h> #include <numaif.h> int main(void) { unsigned long nodemask = 4; void *addr; addr = mmap((void *)0x20000000UL, 2UL << 20, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS | MAP_LOCKED, -1, 0); if (fork()) { wait(NULL); return 0; } mlock(addr, 4UL << 10); mbind(addr, 2UL << 20, MPOL_PREFERRED | MPOL_F_RELATIVE_NODES, &nodemask, 4, MPOL_MF_MOVE); return 0; } [1] https://lkml.kernel.org/r/CAOMGZ=G52R-30rZvhGxEbkTw7rLLwBGadVYeo--iizcD3upL3A@mail.gmail.com Link: http://lkml.kernel.org/r/20180917133816.43995-1-kirill.shutemov@linux.intel.com Fixes: 616b8371 ("mm: thp: enable thp migration in generic path") Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Reported-by: NVegard Nossum <vegard.nossum@oracle.com> Reviewed-by: NZi Yan <zi.yan@cs.rutgers.edu> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: <stable@vger.kernel.org> [4.14+] Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
-
由 Mike Kravetz 提交于
When fixing an issue with PMD sharing and migration, it was discovered via code inspection that other callers of huge_pmd_unshare potentially have an issue with cache and tlb flushing. Use the routine adjust_range_if_pmd_sharing_possible() to calculate worst case ranges for mmu notifiers. Ensure that this range is flushed if huge_pmd_unshare succeeds and unmaps a PUD_SUZE area. Link: http://lkml.kernel.org/r/20180823205917.16297-3-mike.kravetz@oracle.comSigned-off-by: NMike Kravetz <mike.kravetz@oracle.com> Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Michal Hocko <mhocko@kernel.org> Cc: Jerome Glisse <jglisse@redhat.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
-
由 Mike Kravetz 提交于
The page migration code employs try_to_unmap() to try and unmap the source page. This is accomplished by using rmap_walk to find all vmas where the page is mapped. This search stops when page mapcount is zero. For shared PMD huge pages, the page map count is always 1 no matter the number of mappings. Shared mappings are tracked via the reference count of the PMD page. Therefore, try_to_unmap stops prematurely and does not completely unmap all mappings of the source page. This problem can result is data corruption as writes to the original source page can happen after contents of the page are copied to the target page. Hence, data is lost. This problem was originally seen as DB corruption of shared global areas after a huge page was soft offlined due to ECC memory errors. DB developers noticed they could reproduce the issue by (hotplug) offlining memory used to back huge pages. A simple testcase can reproduce the problem by creating a shared PMD mapping (note that this must be at least PUD_SIZE in size and PUD_SIZE aligned (1GB on x86)), and using migrate_pages() to migrate process pages between nodes while continually writing to the huge pages being migrated. To fix, have the try_to_unmap_one routine check for huge PMD sharing by calling huge_pmd_unshare for hugetlbfs huge pages. If it is a shared mapping it will be 'unshared' which removes the page table entry and drops the reference on the PMD page. After this, flush caches and TLB. mmu notifiers are called before locking page tables, but we can not be sure of PMD sharing until page tables are locked. Therefore, check for the possibility of PMD sharing before locking so that notifiers can prepare for the worst possible case. Link: http://lkml.kernel.org/r/20180823205917.16297-2-mike.kravetz@oracle.com [mike.kravetz@oracle.com: make _range_in_vma() a static inline] Link: http://lkml.kernel.org/r/6063f215-a5c8-2f0c-465a-2c515ddc952d@oracle.com Fixes: 39dde65c ("shared page table for hugetlb page") Signed-off-by: NMike Kravetz <mike.kravetz@oracle.com> Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com> Acked-by: NMichal Hocko <mhocko@suse.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Jerome Glisse <jglisse@redhat.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: <stable@vger.kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
-
- 02 10月, 2018 2 次提交
-
-
由 Mel Gorman 提交于
Rate limiting of page migrations due to automatic NUMA balancing was introduced to mitigate the worst-case scenario of migrating at high frequency due to false sharing or slowly ping-ponging between nodes. Since then, a lot of effort was spent on correctly identifying these pages and avoiding unnecessary migrations and the safety net may no longer be required. Jirka Hladky reported a regression in 4.17 due to a scheduler patch that avoids spreading STREAM tasks wide prematurely. However, once the task was properly placed, it delayed migrating the memory due to rate limiting. Increasing the limit fixed the problem for him. Currently, the limit is hard-coded and does not account for the real capabilities of the hardware. Even if an estimate was attempted, it would not properly account for the number of memory controllers and it could not account for the amount of bandwidth used for normal accesses. Rather than fudging, this patch simply eliminates the rate limiting. However, Jirka reports that a STREAM configuration using multiple processes achieved similar performance to 4.16. In local tests, this patch improved performance of STREAM relative to the baseline but it is somewhat machine-dependent. Most workloads show little or not performance difference implying that there is not a heavily reliance on the throttling mechanism and it is safe to remove. STREAM on 2-socket machine 4.19.0-rc5 4.19.0-rc5 numab-v1r1 noratelimit-v1r1 MB/sec copy 43298.52 ( 0.00%) 44673.38 ( 3.18%) MB/sec scale 30115.06 ( 0.00%) 31293.06 ( 3.91%) MB/sec add 32825.12 ( 0.00%) 34883.62 ( 6.27%) MB/sec triad 32549.52 ( 0.00%) 34906.60 ( 7.24% Signed-off-by: NMel Gorman <mgorman@techsingularity.net> Reviewed-by: NRik van Riel <riel@surriel.com> Acked-by: NPeter Zijlstra <a.p.zijlstra@chello.nl> Cc: Jirka Hladky <jhladky@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Linux-MM <linux-mm@kvack.org> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20181001100525.29789-2-mgorman@techsingularity.netSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Srikar Dronamraju 提交于
Since this spinlock will only serialize the migrate rate limiting, convert the spin_lock() to a spin_trylock(). If another thread is updating, this task can move on. Specjbb2005 results (8 warehouses) Higher bops are better 2 Socket - 2 Node Haswell - X86 JVMS Prev Current %Change 4 205332 198512 -3.32145 1 319785 313559 -1.94693 2 Socket - 4 Node Power8 - PowerNV JVMS Prev Current %Change 8 74912 74761.9 -0.200368 1 206585 214874 4.01239 2 Socket - 2 Node Power9 - PowerNV JVMS Prev Current %Change 4 189162 180536 -4.56011 1 213760 210281 -1.62753 4 Socket - 4 Node Power7 - PowerVM JVMS Prev Current %Change 8 58736.8 56511.4 -3.78877 1 105419 104899 -0.49327 Avoiding stretching of window intervals may be the reason for the regression. Also code now uses READ_ONCE/WRITE_ONCE. That may also be hurting performance to some extent. Some events stats before and after applying the patch. perf stats 8th warehouse Multi JVM 2 Socket - 2 Node Haswell - X86 Event Before After cs 14,285,708 13,818,546 migrations 1,180,621 1,149,960 faults 339,114 385,583 cache-misses 55,205,631,894 55,259,546,768 sched:sched_move_numa 843 2,257 sched:sched_stick_numa 6 9 sched:sched_swap_numa 219 512 migrate:mm_migrate_pages 365 2,225 vmstat 8th warehouse Multi JVM 2 Socket - 2 Node Haswell - X86 Event Before After numa_hint_faults 26907 72692 numa_hint_faults_local 24279 62270 numa_hit 239771 238762 numa_huge_pte_updates 0 48 numa_interleave 68 75 numa_local 239688 238676 numa_other 83 86 numa_pages_migrated 363 2225 numa_pte_updates 27415 98557 perf stats 8th warehouse Single JVM 2 Socket - 2 Node Haswell - X86 Event Before After cs 3,202,779 3,173,490 migrations 37,186 36,966 faults 106,076 108,776 cache-misses 12,024,873,744 12,200,075,320 sched:sched_move_numa 931 1,264 sched:sched_stick_numa 0 0 sched:sched_swap_numa 1 0 migrate:mm_migrate_pages 637 899 vmstat 8th warehouse Single JVM 2 Socket - 2 Node Haswell - X86 Event Before After numa_hint_faults 17409 21109 numa_hint_faults_local 14367 17120 numa_hit 73953 72934 numa_huge_pte_updates 20 42 numa_interleave 25 33 numa_local 73892 72866 numa_other 61 68 numa_pages_migrated 668 915 numa_pte_updates 27276 42326 perf stats 8th warehouse Multi JVM 2 Socket - 2 Node Power9 - PowerNV Event Before After cs 8,474,013 8,312,022 migrations 254,934 231,705 faults 320,506 310,242 cache-misses 110,580,458 402,324,573 sched:sched_move_numa 725 193 sched:sched_stick_numa 0 0 sched:sched_swap_numa 7 3 migrate:mm_migrate_pages 145 93 vmstat 8th warehouse Multi JVM 2 Socket - 2 Node Power9 - PowerNV Event Before After numa_hint_faults 22797 11838 numa_hint_faults_local 21539 11216 numa_hit 89308 90689 numa_huge_pte_updates 0 0 numa_interleave 865 1579 numa_local 88955 89634 numa_other 353 1055 numa_pages_migrated 149 92 numa_pte_updates 22930 12109 perf stats 8th warehouse Single JVM 2 Socket - 2 Node Power9 - PowerNV Event Before After cs 2,195,628 2,170,481 migrations 11,179 10,126 faults 149,656 160,962 cache-misses 8,117,515 10,834,845 sched:sched_move_numa 49 10 sched:sched_stick_numa 0 0 sched:sched_swap_numa 0 0 migrate:mm_migrate_pages 5 2 vmstat 8th warehouse Single JVM 2 Socket - 2 Node Power9 - PowerNV Event Before After numa_hint_faults 3577 403 numa_hint_faults_local 3476 358 numa_hit 26142 25898 numa_huge_pte_updates 0 0 numa_interleave 358 207 numa_local 26042 25860 numa_other 100 38 numa_pages_migrated 5 2 numa_pte_updates 3587 400 perf stats 8th warehouse Multi JVM 4 Socket - 4 Node Power7 - PowerVM Event Before After cs 100,602,296 110,339,633 migrations 4,135,630 4,139,812 faults 789,256 863,622 cache-misses 226,160,621,058 231,838,045,660 sched:sched_move_numa 1,366 2,196 sched:sched_stick_numa 16 33 sched:sched_swap_numa 374 544 migrate:mm_migrate_pages 1,350 2,469 vmstat 8th warehouse Multi JVM 4 Socket - 4 Node Power7 - PowerVM Event Before After numa_hint_faults 47857 85748 numa_hint_faults_local 39768 66831 numa_hit 240165 242213 numa_huge_pte_updates 0 0 numa_interleave 0 0 numa_local 240165 242211 numa_other 0 2 numa_pages_migrated 1224 2376 numa_pte_updates 48354 86233 perf stats 8th warehouse Single JVM 4 Socket - 4 Node Power7 - PowerVM Event Before After cs 58,515,496 59,331,057 migrations 564,845 552,019 faults 245,807 266,586 cache-misses 73,603,757,976 73,796,312,990 sched:sched_move_numa 996 981 sched:sched_stick_numa 10 54 sched:sched_swap_numa 193 286 migrate:mm_migrate_pages 646 713 vmstat 8th warehouse Single JVM 4 Socket - 4 Node Power7 - PowerVM Event Before After numa_hint_faults 13422 14807 numa_hint_faults_local 5619 5738 numa_hit 36118 36230 numa_huge_pte_updates 0 0 numa_interleave 0 0 numa_local 36116 36228 numa_other 2 2 numa_pages_migrated 616 703 numa_pte_updates 13374 14742 Suggested-by: NPeter Zijlstra <peterz@infradead.org> Signed-off-by: NSrikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: Jirka Hladky <jhladky@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Mike Galbraith <efault@gmx.de> Cc: Rik van Riel <riel@surriel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1537552141-27815-6-git-send-email-srikar@linux.vnet.ibm.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
- 21 9月, 2018 3 次提交
-
-
由 Roman Gushchin 提交于
9092c71b ("mm: use sc->priority for slab shrink targets") changed the way that the target slab pressure is calculated and made it priority-based: delta = freeable >> priority; delta *= 4; do_div(delta, shrinker->seeks); The problem is that on a default priority (which is 12) no pressure is applied at all, if the number of potentially reclaimable objects is less than 4096 (1<<12). This causes the last objects on slab caches of no longer used cgroups to (almost) never get reclaimed. It's obviously a waste of memory. It can be especially painful, if these stale objects are holding a reference to a dying cgroup. Slab LRU lists are reparented on memcg offlining, but corresponding objects are still holding a reference to the dying cgroup. If we don't scan these objects, the dying cgroup can't go away. Most likely, the parent cgroup hasn't any directly charged objects, only remaining objects from dying children cgroups. So it can easily hold a reference to hundreds of dying cgroups. If there are no big spikes in memory pressure, and new memory cgroups are created and destroyed periodically, this causes the number of dying cgroups grow steadily, causing a slow-ish and hard-to-detect memory "leak". It's not a real leak, as the memory can be eventually reclaimed, but it could not happen in a real life at all. I've seen hosts with a steadily climbing number of dying cgroups, which doesn't show any signs of a decline in months, despite the host is loaded with a production workload. It is an obvious waste of memory, and to prevent it, let's apply a minimal pressure even on small shrinker lists. E.g. if there are freeable objects, let's scan at least min(freeable, scan_batch) objects. This fix significantly improves a chance of a dying cgroup to be reclaimed, and together with some previous patches stops the steady growth of the dying cgroups number on some of our hosts. Link: http://lkml.kernel.org/r/20180905230759.12236-1-guro@fb.com Fixes: 9092c71b ("mm: use sc->priority for slab shrink targets") Signed-off-by: NRoman Gushchin <guro@fb.com> Acked-by: NRik van Riel <riel@surriel.com> Cc: Josef Bacik <jbacik@fb.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Shakeel Butt <shakeelb@google.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
-
由 Joel Fernandes (Google) 提交于
Directories and inodes don't necessarily need to be in the same lockdep class. For ex, hugetlbfs splits them out too to prevent false positives in lockdep. Annotate correctly after new inode creation. If its a directory inode, it will be put into a different class. This should fix a lockdep splat reported by syzbot: > ====================================================== > WARNING: possible circular locking dependency detected > 4.18.0-rc8-next-20180810+ #36 Not tainted > ------------------------------------------------------ > syz-executor900/4483 is trying to acquire lock: > 00000000d2bfc8fe (&sb->s_type->i_mutex_key#9){++++}, at: inode_lock > include/linux/fs.h:765 [inline] > 00000000d2bfc8fe (&sb->s_type->i_mutex_key#9){++++}, at: > shmem_fallocate+0x18b/0x12e0 mm/shmem.c:2602 > > but task is already holding lock: > 0000000025208078 (ashmem_mutex){+.+.}, at: ashmem_shrink_scan+0xb4/0x630 > drivers/staging/android/ashmem.c:448 > > which lock already depends on the new lock. > > -> #2 (ashmem_mutex){+.+.}: > __mutex_lock_common kernel/locking/mutex.c:925 [inline] > __mutex_lock+0x171/0x1700 kernel/locking/mutex.c:1073 > mutex_lock_nested+0x16/0x20 kernel/locking/mutex.c:1088 > ashmem_mmap+0x55/0x520 drivers/staging/android/ashmem.c:361 > call_mmap include/linux/fs.h:1844 [inline] > mmap_region+0xf27/0x1c50 mm/mmap.c:1762 > do_mmap+0xa10/0x1220 mm/mmap.c:1535 > do_mmap_pgoff include/linux/mm.h:2298 [inline] > vm_mmap_pgoff+0x213/0x2c0 mm/util.c:357 > ksys_mmap_pgoff+0x4da/0x660 mm/mmap.c:1585 > __do_sys_mmap arch/x86/kernel/sys_x86_64.c:100 [inline] > __se_sys_mmap arch/x86/kernel/sys_x86_64.c:91 [inline] > __x64_sys_mmap+0xe9/0x1b0 arch/x86/kernel/sys_x86_64.c:91 > do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290 > entry_SYSCALL_64_after_hwframe+0x49/0xbe > > -> #1 (&mm->mmap_sem){++++}: > __might_fault+0x155/0x1e0 mm/memory.c:4568 > _copy_to_user+0x30/0x110 lib/usercopy.c:25 > copy_to_user include/linux/uaccess.h:155 [inline] > filldir+0x1ea/0x3a0 fs/readdir.c:196 > dir_emit_dot include/linux/fs.h:3464 [inline] > dir_emit_dots include/linux/fs.h:3475 [inline] > dcache_readdir+0x13a/0x620 fs/libfs.c:193 > iterate_dir+0x48b/0x5d0 fs/readdir.c:51 > __do_sys_getdents fs/readdir.c:231 [inline] > __se_sys_getdents fs/readdir.c:212 [inline] > __x64_sys_getdents+0x29f/0x510 fs/readdir.c:212 > do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290 > entry_SYSCALL_64_after_hwframe+0x49/0xbe > > -> #0 (&sb->s_type->i_mutex_key#9){++++}: > lock_acquire+0x1e4/0x540 kernel/locking/lockdep.c:3924 > down_write+0x8f/0x130 kernel/locking/rwsem.c:70 > inode_lock include/linux/fs.h:765 [inline] > shmem_fallocate+0x18b/0x12e0 mm/shmem.c:2602 > ashmem_shrink_scan+0x236/0x630 drivers/staging/android/ashmem.c:455 > ashmem_ioctl+0x3ae/0x13a0 drivers/staging/android/ashmem.c:797 > vfs_ioctl fs/ioctl.c:46 [inline] > file_ioctl fs/ioctl.c:501 [inline] > do_vfs_ioctl+0x1de/0x1720 fs/ioctl.c:685 > ksys_ioctl+0xa9/0xd0 fs/ioctl.c:702 > __do_sys_ioctl fs/ioctl.c:709 [inline] > __se_sys_ioctl fs/ioctl.c:707 [inline] > __x64_sys_ioctl+0x73/0xb0 fs/ioctl.c:707 > do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290 > entry_SYSCALL_64_after_hwframe+0x49/0xbe > > other info that might help us debug this: > > Chain exists of: > &sb->s_type->i_mutex_key#9 --> &mm->mmap_sem --> ashmem_mutex > > Possible unsafe locking scenario: > > CPU0 CPU1 > ---- ---- > lock(ashmem_mutex); > lock(&mm->mmap_sem); > lock(ashmem_mutex); > lock(&sb->s_type->i_mutex_key#9); > > *** DEADLOCK *** > > 1 lock held by syz-executor900/4483: > #0: 0000000025208078 (ashmem_mutex){+.+.}, at: > ashmem_shrink_scan+0xb4/0x630 drivers/staging/android/ashmem.c:448 Link: http://lkml.kernel.org/r/20180821231835.166639-1-joel@joelfernandes.orgSigned-off-by: NJoel Fernandes (Google) <joel@joelfernandes.org> Reported-by: Nsyzbot <syzkaller@googlegroups.com> Reviewed-by: NNeilBrown <neilb@suse.com> Suggested-by: NNeilBrown <neilb@suse.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Hugh Dickins <hughd@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
-
由 Pasha Tatashin 提交于
Deferred struct page init is needed only on systems with large amount of physical memory to improve boot performance. 32-bit systems do not benefit from this feature. Jiri reported a problem where deferred struct pages do not work well with x86-32: [ 0.035162] Dentry cache hash table entries: 131072 (order: 7, 524288 bytes) [ 0.035725] Inode-cache hash table entries: 65536 (order: 6, 262144 bytes) [ 0.036269] Initializing CPU#0 [ 0.036513] Initializing HighMem for node 0 (00036ffe:0007ffe0) [ 0.038459] page:f6780000 is uninitialized and poisoned [ 0.038460] raw: ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff [ 0.039509] page dumped because: VM_BUG_ON_PAGE(1 && PageCompound(page)) [ 0.040038] ------------[ cut here ]------------ [ 0.040399] kernel BUG at include/linux/page-flags.h:293! [ 0.040823] invalid opcode: 0000 [#1] SMP PTI [ 0.041166] CPU: 0 PID: 0 Comm: swapper Not tainted 4.19.0-rc1_pt_jiri #9 [ 0.041694] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-20171110_100015-anatol 04/01/2014 [ 0.042496] EIP: free_highmem_page+0x64/0x80 [ 0.042839] Code: 13 46 d8 c1 e8 18 5d 83 e0 03 8d 04 c0 c1 e0 06 ff 80 ec 5f 44 d8 c3 8d b4 26 00 00 00 00 ba 08 65 28 d8 89 d8 e8 fc 71 02 00 <0f> 0b 8d 76 00 8d bc 27 00 00 00 00 ba d0 b1 26 d8 89 d8 e8 e4 71 [ 0.044338] EAX: 0000003c EBX: f6780000 ECX: 00000000 EDX: d856cbe8 [ 0.044868] ESI: 0007ffe0 EDI: d838df20 EBP: d838df00 ESP: d838defc [ 0.045372] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS: 00210086 [ 0.045913] CR0: 80050033 CR2: 00000000 CR3: 18556000 CR4: 00040690 [ 0.046413] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000 [ 0.046913] DR6: fffe0ff0 DR7: 00000400 [ 0.047220] Call Trace: [ 0.047419] add_highpages_with_active_regions+0xbd/0x10d [ 0.047854] set_highmem_pages_init+0x5b/0x71 [ 0.048202] mem_init+0x2b/0x1e8 [ 0.048460] start_kernel+0x1d2/0x425 [ 0.048757] i386_start_kernel+0x93/0x97 [ 0.049073] startup_32_smp+0x164/0x168 [ 0.049379] Modules linked in: [ 0.049626] ---[ end trace 337949378db0abbb ]--- We free highmem pages before their struct pages are initialized: mem_init() set_highmem_pages_init() add_highpages_with_active_regions() free_highmem_page() .. Access uninitialized struct page here.. Because there is no reason to have this feature on 32-bit systems, just disable it. Link: http://lkml.kernel.org/r/20180831150506.31246-1-pavel.tatashin@microsoft.com Fixes: 2e3ca40f ("mm: relax deferred struct page requirements") Signed-off-by: NPavel Tatashin <pavel.tatashin@microsoft.com> Reported-by: NJiri Slaby <jslaby@suse.cz> Acked-by: NMichal Hocko <mhocko@suse.com> Cc: <stable@vger.kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
-
- 14 9月, 2018 1 次提交
-
-
由 Linus Torvalds 提交于
Jann Horn points out that the vmacache_flush_all() function is not only potentially expensive, it's buggy too. It also happens to be entirely unnecessary, because the sequence number overflow case can be avoided by simply making the sequence number be 64-bit. That doesn't even grow the data structures in question, because the other adjacent fields are already 64-bit. So simplify the whole thing by just making the sequence number overflow case go away entirely, which gets rid of all the complications and makes the code faster too. Win-win. [ Oleg Nesterov points out that the VMACACHE_FULL_FLUSHES statistics also just goes away entirely with this ] Reported-by: NJann Horn <jannh@google.com> Suggested-by: NWill Deacon <will.deacon@arm.com> Acked-by: NDavidlohr Bueso <dave@stgolabs.net> Cc: Oleg Nesterov <oleg@redhat.com> Cc: stable@kernel.org Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 05 9月, 2018 6 次提交
-
-
由 Dave Jiang 提交于
It looks like I missed the PUD path when doing VM_MIXEDMAP removal. This can be triggered by: 1. Boot with memmap=4G!8G 2. build ndctl with destructive flag on 3. make TESTS=device-dax check [ +0.000675] kernel BUG at mm/huge_memory.c:824! Applying the same change that was applied to vmf_insert_pfn_pmd() in the original patch. Link: http://lkml.kernel.org/r/153565957352.35524.1005746906902065126.stgit@djiang5-desk3.ch.intel.com Fixes: e1fb4a08 ("dax: remove VM_MIXEDMAP for fsdax and device dax") Signed-off-by: NDave Jiang <dave.jiang@intel.com> Reported-by: NVishal Verma <vishal.l.verma@intel.com> Tested-by: NVishal Verma <vishal.l.verma@intel.com> Acked-by: NJeff Moyer <jmoyer@redhat.com> Reviewed-by: NJan Kara <jack@suse.cz> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Aneesh Kumar K.V 提交于
When scanning for movable pages, filter out Hugetlb pages if hugepage migration is not supported. Without this we hit infinte loop in __offline_pages() where we do pfn = scan_movable_pages(start_pfn, end_pfn); if (pfn) { /* We have movable pages */ ret = do_migrate_range(pfn, end_pfn); goto repeat; } Fix this by checking hugepage_migration_supported both in has_unmovable_pages which is the primary backoff mechanism for page offlining and for consistency reasons also into scan_movable_pages because it doesn't make any sense to return a pfn to non-migrateable huge page. This issue was revealed by, but not caused by 72b39cfc ("mm, memory_hotplug: do not fail offlining too early"). Link: http://lkml.kernel.org/r/20180824063314.21981-1-aneesh.kumar@linux.ibm.com Fixes: 72b39cfc ("mm, memory_hotplug: do not fail offlining too early") Signed-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Reported-by: NHaren Myneni <haren@linux.vnet.ibm.com> Acked-by: NMichal Hocko <mhocko@suse.com> Reviewed-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: <stable@vger.kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Andrew Morton 提交于
Scooped from an email from Matthew. Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Vincent Whitchurch 提交于
If kmemleak built in to the kernel, but is disabled by default, the debugfs file is never registered. Because of this, it is not possible to find out if the kernel is built with kmemleak support by checking for the presence of this file. To allow this, always register the file. After this patch, if the file doesn't exist, kmemleak is not available in the kernel. If writing "scan" or any other value than "clear" to this file results in EBUSY, then kmemleak is available but is disabled by default and can be activated via the kernel command line. Catalin: "that's also consistent with a late disabling of kmemleak when the debugfs entry sticks around." Link: http://lkml.kernel.org/r/20180824131220.19176-1-vincent.whitchurch@axis.comSigned-off-by: NVincent Whitchurch <vincent.whitchurch@axis.com> Acked-by: NCatalin Marinas <catalin.marinas@arm.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Tetsuo Handa 提交于
Commit 93065ac7 ("mm, oom: distinguish blockable mode for mmu notifiers") has added an ability to skip over vmas with blockable mmu notifiers. This however didn't call tlb_finish_mmu as it should. As a result inc_tlb_flush_pending has been called without its pairing dec_tlb_flush_pending and all callers mm_tlb_flush_pending would flush even though this is not really needed. This alone is not harmful and it seems there shouldn't be any such callers for oom victims at all but there is no real reason to skip tlb_finish_mmu on early skip either so call it. [mhocko@suse.com: new changelog] Link: http://lkml.kernel.org/r/b752d1d5-81ad-7a35-2394-7870641be51c@i-love.sakura.ne.jpSigned-off-by: NTetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Acked-by: NMichal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Johannes Weiner 提交于
When the memcg OOM killer runs out of killable tasks, it currently prints a WARN with no further OOM context. This has caused some user confusion. Warnings indicate a kernel problem. In a reported case, however, the situation was triggered by a nonsensical memcg configuration (hard limit set to 0). But without any VM context this wasn't obvious from the report, and it took some back and forth on the mailing list to identify what is actually a trivial issue. Handle this OOM condition like we handle it in the global OOM killer: dump the full OOM context and tell the user we ran out of tasks. This way the user can identify misconfigurations easily by themselves and rectify the problem - without having to go through the hassle of running into an obscure but unsettling warning, finding the appropriate kernel mailing list and waiting for a kernel developer to remote-analyze that the memcg configuration caused this. If users cannot make sense of why the OOM killer was triggered or why it failed, they will still report it to the mailing list, we know that from experience. So in case there is an actual kernel bug causing this, kernel developers will very likely hear about it. Link: http://lkml.kernel.org/r/20180821160406.22578-1-hannes@cmpxchg.orgSigned-off-by: NJohannes Weiner <hannes@cmpxchg.org> Acked-by: NMichal Hocko <mhocko@suse.com> Cc: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 01 9月, 2018 1 次提交
-
-
由 Dennis Zhou (Facebook) 提交于
Currently, blkcg destruction relies on a sequence of events: 1. Destruction starts. blkcg_css_offline() is called and blkgs release their reference to the blkcg. This immediately destroys the cgwbs (writeback). 2. With blkgs giving up their reference, the blkcg ref count should become zero and eventually call blkcg_css_free() which finally frees the blkcg. Jiufei Xue reported that there is a race between blkcg_bio_issue_check() and cgroup_rmdir(). To remedy this, blkg destruction becomes contingent on the completion of all writeback associated with the blkcg. A count of the number of cgwbs is maintained and once that goes to zero, blkg destruction can follow. This should prevent premature blkg destruction related to writeback. The new process for blkcg cleanup is as follows: 1. Destruction starts. blkcg_css_offline() is called which offlines writeback. Blkg destruction is delayed on the cgwb_refcnt count to avoid punting potentially large amounts of outstanding writeback to root while maintaining any ongoing policies. Here, the base cgwb_refcnt is put back. 2. When the cgwb_refcnt becomes zero, blkcg_destroy_blkgs() is called and handles destruction of blkgs. This is where the css reference held by each blkg is released. 3. Once the blkcg ref count goes to zero, blkcg_css_free() is called. This finally frees the blkg. It seems in the past blk-throttle didn't do the most understandable things with taking data from a blkg while associating with current. So, the simplification and unification of what blk-throttle is doing caused this. Fixes: 08e18eab ("block: add bi_blkg to the bio for cgroups") Reviewed-by: NJosef Bacik <josef@toxicpanda.com> Signed-off-by: NDennis Zhou <dennisszhou@gmail.com> Cc: Jiufei Xue <jiufei.xue@linux.alibaba.com> Cc: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Tejun Heo <tj@kernel.org> Cc: Josef Bacik <josef@toxicpanda.com> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
- 31 8月, 2018 1 次提交
-
-
由 Amir Goldstein 提交于
The implementation of readahead(2) syscall is identical to that of fadvise64(POSIX_FADV_WILLNEED) with a few exceptions: 1. readahead(2) returns -EINVAL for !mapping->a_ops and fadvise64() ignores the request and returns 0. 2. fadvise64() checks for integer overflow corner case 3. fadvise64() calls the optional filesystem fadvise() file operation Unite the two implementations by calling vfs_fadvise() from readahead(2) syscall. Check the !mapping->a_ops in readahead(2) syscall to preserve documented syscall ABI behaviour. Suggested-by: NMiklos Szeredi <mszeredi@redhat.com> Fixes: d1d04ef8 ("ovl: stack file ops") Signed-off-by: NAmir Goldstein <amir73il@gmail.com> Signed-off-by: NMiklos Szeredi <mszeredi@redhat.com>
-
- 30 8月, 2018 2 次提交
-
-
由 Amir Goldstein 提交于
This is going to be used by overlayfs and possibly useful for other filesystems. Signed-off-by: NAmir Goldstein <amir73il@gmail.com> Signed-off-by: NMiklos Szeredi <mszeredi@redhat.com>
-
由 Mukesh Ojha 提交于
The conversion of the hotplug notifiers to a state machine left the notifier.h includes around in some places. Remove them. Signed-off-by: NMukesh Ojha <mojha@codeaurora.org> Signed-off-by: NThomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/1535114033-4605-1-git-send-email-mojha@codeaurora.org
-
- 26 8月, 2018 1 次提交
-
-
由 Linus Torvalds 提交于
This is not normally noticeable, but repeated forks are unnecessarily expensive because they repeatedly dirty the parent page tables during the page table copy operation. It's trivial to just avoid write protecting the page table entry if it was already not writable. This patch was inspired by https://bugzilla.kernel.org/show_bug.cgi?id=200447 which points to an ancient "waste time re-doing fork" issue in the presence of lots of signals. That bug was fixed by Eric Biederman's signal handling series culminating in commit c3ad2c3b ("signal: Don't restart fork when signals come in"), but the unnecessary work for repeated forks is still work just fixing, particularly since the fix is trivial. Cc: Eric Biederman <ebiederm@xmission.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 24 8月, 2018 1 次提交
-
-
由 Souptick Joarder 提交于
Use new return type vm_fault_t for fault handler. For now, this is just documenting that the function returns a VM_FAULT value rather than an errno. Once all instances are converted, vm_fault_t will become a distinct type. Ref-> commit 1c8f4220 ("mm: change return type to vm_fault_t") The aim is to change the return type of finish_fault() and handle_mm_fault() to vm_fault_t type. As part of that clean up return type of all other recursively called functions have been changed to vm_fault_t type. The places from where handle_mm_fault() is getting invoked will be change to vm_fault_t type but in a separate patch. vmf_error() is the newly introduce inline function in 4.17-rc6. [akpm@linux-foundation.org: don't shadow outer local `ret' in __do_huge_pmd_anonymous_page()] Link: http://lkml.kernel.org/r/20180604171727.GA20279@jordon-HP-15-Notebook-PCSigned-off-by: NSouptick Joarder <jrdr.linux@gmail.com> Reviewed-by: NMatthew Wilcox <mawilcox@microsoft.com> Reviewed-by: NAndrew Morton <akpm@linux-foundation.org> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-