- 19 10月, 2020 1 次提交
-
-
由 Daniel Axtens 提交于
mainline inclusion from mainline-v5.5-rc3 commit be1db475 category: feature bugzilla: NA CVE: NA --------------------------- apply_to_page_range() takes an address range, and if any parts of it are not covered by the existing page table hierarchy, it allocates memory to fill them in. In some use cases, this is not what we want - we want to be able to operate exclusively on PTEs that are already in the tables. Add apply_to_existing_page_range() for this. Adjust the walker functions for apply_to_page_range to take 'create', which switches them between the old and new modes. This will be used in KASAN vmalloc. [akpm@linux-foundation.org: reduce code duplication] [akpm@linux-foundation.org: s/apply_to_existing_pages/apply_to_existing_page_range/] [akpm@linux-foundation.org: initialize __apply_to_page_range::err] Link: http://lkml.kernel.org/r/20191205140407.1874-1-dja@axtens.netSigned-off-by: NDaniel Axtens <dja@axtens.net> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Uladzislau Rezki (Sony) <urezki@gmail.com> Cc: Alexander Potapenko <glider@google.com> Cc: Daniel Axtens <dja@axtens.net> Cc: Qian Cai <cai@lca.pw> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> Signed-off-by: NRui Xiang <rui.xiang@huawei.com> Reviewed-by: NDing Tianhong <dingtianhong@huawei.com> Reviewed-by: NZefan Li <lizefan@huawei.com> Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
-
- 13 10月, 2020 1 次提交
-
-
由 Laurent Dufour 提交于
commit c1d0da83 upstream. Patch series "mm: fix memory to node bad links in sysfs", v3. Sometimes, firmware may expose interleaved memory layout like this: Early memory node ranges node 1: [mem 0x0000000000000000-0x000000011fffffff] node 2: [mem 0x0000000120000000-0x000000014fffffff] node 1: [mem 0x0000000150000000-0x00000001ffffffff] node 0: [mem 0x0000000200000000-0x000000048fffffff] node 2: [mem 0x0000000490000000-0x00000007ffffffff] In that case, we can see memory blocks assigned to multiple nodes in sysfs: $ ls -l /sys/devices/system/memory/memory21 total 0 lrwxrwxrwx 1 root root 0 Aug 24 05:27 node1 -> ../../node/node1 lrwxrwxrwx 1 root root 0 Aug 24 05:27 node2 -> ../../node/node2 -rw-r--r-- 1 root root 65536 Aug 24 05:27 online -r--r--r-- 1 root root 65536 Aug 24 05:27 phys_device -r--r--r-- 1 root root 65536 Aug 24 05:27 phys_index drwxr-xr-x 2 root root 0 Aug 24 05:27 power -r--r--r-- 1 root root 65536 Aug 24 05:27 removable -rw-r--r-- 1 root root 65536 Aug 24 05:27 state lrwxrwxrwx 1 root root 0 Aug 24 05:25 subsystem -> ../../../../bus/memory -rw-r--r-- 1 root root 65536 Aug 24 05:25 uevent -r--r--r-- 1 root root 65536 Aug 24 05:27 valid_zones The same applies in the node's directory with a memory21 link in both the node1 and node2's directory. This is wrong but doesn't prevent the system to run. However when later, one of these memory blocks is hot-unplugged and then hot-plugged, the system is detecting an inconsistency in the sysfs layout and a BUG_ON() is raised: kernel BUG at /Users/laurent/src/linux-ppc/mm/memory_hotplug.c:1084! LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries Modules linked in: rpadlpar_io rpaphp pseries_rng rng_core vmx_crypto gf128mul binfmt_misc ip_tables x_tables xfs libcrc32c crc32c_vpmsum autofs4 CPU: 8 PID: 10256 Comm: drmgr Not tainted 5.9.0-rc1+ #25 Call Trace: add_memory_resource+0x23c/0x340 (unreliable) __add_memory+0x5c/0xf0 dlpar_add_lmb+0x1b4/0x500 dlpar_memory+0x1f8/0xb80 handle_dlpar_errorlog+0xc0/0x190 dlpar_store+0x198/0x4a0 kobj_attr_store+0x30/0x50 sysfs_kf_write+0x64/0x90 kernfs_fop_write+0x1b0/0x290 vfs_write+0xe8/0x290 ksys_write+0xdc/0x130 system_call_exception+0x160/0x270 system_call_common+0xf0/0x27c This has been seen on PowerPC LPAR. The root cause of this issue is that when node's memory is registered, the range used can overlap another node's range, thus the memory block is registered to multiple nodes in sysfs. There are two issues here: (a) The sysfs memory and node's layouts are broken due to these multiple links (b) The link errors in link_mem_sections() should not lead to a system panic. To address (a) register_mem_sect_under_node should not rely on the system state to detect whether the link operation is triggered by a hot plug operation or not. This is addressed by the patches 1 and 2 of this series. Issue (b) will be addressed separately. This patch (of 2): The memmap_context enum is used to detect whether a memory operation is due to a hot-add operation or happening at boot time. Make it general to the hotplug operation and rename it as meminit_context. There is no functional change introduced by this patch Suggested-by: NDavid Hildenbrand <david@redhat.com> Signed-off-by: NLaurent Dufour <ldufour@linux.ibm.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Reviewed-by: NDavid Hildenbrand <david@redhat.com> Reviewed-by: NOscar Salvador <osalvador@suse.de> Acked-by: NMichal Hocko <mhocko@suse.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: "Rafael J . Wysocki" <rafael@kernel.org> Cc: Nathan Lynch <nathanl@linux.ibm.com> Cc: Scott Cheloha <cheloha@linux.ibm.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: <stable@vger.kernel.org> Link: https://lkml.kernel.org/r/20200915094143.79181-1-ldufour@linux.ibm.com Link: https://lkml.kernel.org/r/20200915132624.9723-1-ldufour@linux.ibm.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org> Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
-
- 10 10月, 2020 2 次提交
-
-
由 Aneesh Kumar K.V 提交于
mainline inclusion from mainline-5.1-rc1 commit 9a4e9f3b category: bugfix bugzilla: 34611 CVE: NA ------------------------------------------------- This patch updates get_user_pages_longterm to migrate pages allocated out of CMA region. This makes sure that we don't keep non-movable pages (due to page reference count) in the CMA area. This will be used by ppc64 in a later patch to avoid pinning pages in the CMA region. ppc64 uses CMA region for allocation of the hardware page table (hash page table) and not able to migrate pages out of CMA region results in page table allocation failures. One case where we hit this easy is when a guest using a VFIO passthrough device. VFIO locks all the guest's memory and if the guest memory is backed by CMA region, it becomes unmovable resulting in fragmenting the CMA and possibly preventing other guests from allocation a large enough hash page table. NOTE: We allocate the new page without using __GFP_THISNODE Link: http://lkml.kernel.org/r/20190114095438.32470-3-aneesh.kumar@linux.ibm.comSigned-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Alexey Kardashevskiy <aik@ozlabs.ru> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: David Gibson <david@gibson.dropbear.id.au> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Hocko <mhocko@kernel.org> Cc: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> (cherry picked from commit 9a4e9f3b) Signed-off-by: NKefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: NLiu Shixin <liushixin2@huawei.com> Reviewed-by: NKefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
-
由 Thomas Hellstrom 提交于
mainline inclusion from mainline-v5.5-rc1 commit bf1a12a8 category: bugfix bugzilla: 34611 CVE: NA ------------------------------------------------- The asm-generic/pgtable.h include file appears to be the correct place for the backup x_devmap() inline functions. Moving them here is also necessary if we want to include x_devmap() in the [pmd|pud]_unstable functions. So move the x_devmap() functions to asm-generic/pgtable.h Link: http://lkml.kernel.org/r/20191115115808.21181-1-thomas_os@shipmail.orgSigned-off-by: NThomas Hellstrom <thellstrom@vmware.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> Signed-off-by: NLiu Shixin <liushixin2@huawei.com> Reviewed-by: NKefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
-
- 25 8月, 2020 1 次提交
-
-
由 Ding Tianhong 提交于
ascend inclusion category: feature bugzilla: NA CVE: NA ------------------------------------------------- There are too many ascend features enable flag, all of them is used for all ascend soc till now, so use a new enable flag to enable all of them for ascend platform by default, it would clean and simplify the bootargs. Also clean some code warning. v2: modify the wrong config name. v3: modify the wrong include head file. Signed-off-by: NDing Tianhong <dingtianhong@huawei.com> Reviewed-by: NZefan Li <lizefan@huawei.com> Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
-
- 17 8月, 2020 1 次提交
-
-
由 Fang Lijun 提交于
ascend inclusion category: Bugfix bugzilla: NA CVE: NA -------------- System cann't use the cdm nodes memory, but it can mmap all nodes huge pages, so it will cause Bus error when mmap succeed but the huge pages were not enough. When set the cdmmask, users will transfer the numa id by mmap flag to map the specific numa node hugepages, if there was not enough hugepages on this node, return -ENOMEM. v2: Fix compile error when disable CONFIG_COHERENT_DEVICE Signed-off-by: NFang Lijun <fanglijun3@huawei.com> Reviewed-by: NDing Tianhong <dingtianhong@huawei.com> Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
-
- 12 8月, 2020 1 次提交
-
-
由 Bixuan Cui 提交于
ascend inclusion category: feature bugzilla: NA CVE: NA ------------------------------------------------- Add alloc and release memory functions in svm. And the physical address of the memory is within 4GB. For example: /* alloc */ fd = open("dev/svm0",); mmap(0, ALLOC_SIZE,, MAP_PA32BIT, fd, 0); /* free */ ioctl(fd, SVM_IOCTL_RELEASE_PHYS32,); close(fd); Signed-off-by: NBixuan Cui <cuibixuan@huawei.com> Reviewed-by: NZefan Li <lizefan@huawei.com> Reviewed-by: NHanjun Guo <guohanjun@huawei.com> Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
-
- 28 6月, 2020 1 次提交
-
-
由 Waiman Long 提交于
[ Upstream commit d4eaa283 ] For kvmalloc'ed data object that contains sensitive information like cryptographic keys, we need to make sure that the buffer is always cleared before freeing it. Using memset() alone for buffer clearing may not provide certainty as the compiler may compile it away. To be sure, the special memzero_explicit() has to be used. This patch introduces a new kvfree_sensitive() for freeing those sensitive data objects allocated by kvmalloc(). The relevant places where kvfree_sensitive() can be used are modified to use it. Fixes: 4f088249 ("KEYS: Avoid false positive ENOMEM error on key read") Suggested-by: NLinus Torvalds <torvalds@linux-foundation.org> Signed-off-by: NWaiman Long <longman@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Reviewed-by: NEric Biggers <ebiggers@google.com> Acked-by: NDavid Howells <dhowells@redhat.com> Cc: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com> Cc: James Morris <jmorris@namei.org> Cc: "Serge E. Hallyn" <serge@hallyn.com> Cc: Joe Perches <joe@perches.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: David Rientjes <rientjes@google.com> Cc: Uladzislau Rezki <urezki@gmail.com> Link: http://lkml.kernel.org/r/20200407200318.11711-1-longman@redhat.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org> Signed-off-by: NSasha Levin <sashal@kernel.org> Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
-
- 12 6月, 2020 1 次提交
-
-
由 Konstantin Khlebnikov 提交于
[ Upstream commit 6988f31d ] Replace superfluous VM_BUG_ON() with comment about correct usage. Technically reverts commit 1d148e21 ("mm: add VM_BUG_ON_PAGE() to page_mapcount()"), but context lines have changed. Function isolate_migratepages_block() runs some checks out of lru_lock when choose pages for migration. After checking PageLRU() it checks extra page references by comparing page_count() and page_mapcount(). Between these two checks page could be removed from lru, freed and taken by slab. As a result this race triggers VM_BUG_ON(PageSlab()) in page_mapcount(). Race window is tiny. For certain workload this happens around once a year. page:ffffea0105ca9380 count:1 mapcount:0 mapping:ffff88ff7712c180 index:0x0 compound_mapcount: 0 flags: 0x500000000008100(slab|head) raw: 0500000000008100 dead000000000100 dead000000000200 ffff88ff7712c180 raw: 0000000000000000 0000000080200020 00000001ffffffff 0000000000000000 page dumped because: VM_BUG_ON_PAGE(PageSlab(page)) ------------[ cut here ]------------ kernel BUG at ./include/linux/mm.h:628! invalid opcode: 0000 [#1] SMP NOPTI CPU: 77 PID: 504 Comm: kcompactd1 Tainted: G W 4.19.109-27 #1 Hardware name: Yandex T175-N41-Y3N/MY81-EX0-Y3N, BIOS R05 06/20/2019 RIP: 0010:isolate_migratepages_block+0x986/0x9b0 The code in isolate_migratepages_block() was added in commit 119d6d59 ("mm, compaction: avoid isolating pinned pages") before adding VM_BUG_ON into page_mapcount(). This race has been predicted in 2015 by Vlastimil Babka (see link below). [akpm@linux-foundation.org: comment tweaks, per Hugh] Fixes: 1d148e21 ("mm: add VM_BUG_ON_PAGE() to page_mapcount()") Signed-off-by: NKonstantin Khlebnikov <khlebnikov@yandex-team.ru> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Acked-by: NHugh Dickins <hughd@google.com> Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: NVlastimil Babka <vbabka@suse.cz> Cc: David Rientjes <rientjes@google.com> Cc: <stable@vger.kernel.org> Link: http://lkml.kernel.org/r/159032779896.957378.7852761411265662220.stgit@buzz Link: https://lore.kernel.org/lkml/557710E1.6060103@suse.cz/ Link: https://lore.kernel.org/linux-mm/158937872515.474360.5066096871639561424.stgit@buzz/T/ (v1) Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> Signed-off-by: NSasha Levin <sashal@kernel.org> Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
-
- 17 4月, 2020 1 次提交
-
-
由 Yunfeng Ye 提交于
euleros inclusion category: feature feature: pagecache percpu refcount bugzilla: 31398 CVE: NA ------------------------------------------------- The pagecache manages the file physical pages, and the life cycle of page is managed by atomic counting. With the increasing number of cpu cores, the cost of atomic counting is very large when reading file pagecaches at large concurrent. For example, when running nginx http application, the biggest hotspot is found in the atomic operation of find_get_entry(): 11.94% [kernel] [k] find_get_entry 7.45% [kernel] [k] do_tcp_sendpages 6.12% [kernel] [k] generic_file_buffered_read So we using the percpu refcount mechanism to fix this problem. and the test result show that the read performance of nginx http can be improved by 100%: worker original(requests/sec) percpu(requests/sec) imporve 64 759656.87 1627088.95 114.2% Notes: we use page->lru to save percpu information, so the pages with percpu attribute will not be recycled by memory recycling process, we should avoid grow the file size unlimited. Signed-off-by: NYunfeng Ye <yeyunfeng@huawei.com> Reviewed-by: NKefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
-
- 27 12月, 2019 8 次提交
-
-
由 Yang Yingliang 提交于
ascend inclusion category: bugfix bugzilla: NA CVE: NA ------------ It's specific code for ascend, so remove it. This reverts commit 25cf3abca447947a56e74ee9393943e9a3adf8a8. Signed-off-by: NYang Yingliang <yangyingliang@huawei.com> -
由 Yang Shi 提交于
commit 169226f7 upstream. We have a usecase to use tmpfs as QEMU memory backend and we would like to take the advantage of THP as well. But, our test shows the EPT is not PMD mapped even though the underlying THP are PMD mapped on host. The number showed by /sys/kernel/debug/kvm/largepage is much less than the number of PMD mapped shmem pages as the below: 7f2778200000-7f2878200000 rw-s 00000000 00:14 262232 /dev/shm/qemu_back_mem.mem.Hz2hSf (deleted) Size: 4194304 kB [snip] AnonHugePages: 0 kB ShmemPmdMapped: 579584 kB [snip] Locked: 0 kB cat /sys/kernel/debug/kvm/largepages 12 And some benchmarks do worse than with anonymous THPs. By digging into the code we figured out that commit 127393fb ("mm: thp: kvm: fix memory corruption in KVM with THP enabled") checks if there is a single PTE mapping on the page for anonymous THP when setting up EPT map. But the _mapcount < 0 check doesn't work for page cache THP since every subpage of page cache THP would get _mapcount inc'ed once it is PMD mapped, so PageTransCompoundMap() always returns false for page cache THP. This would prevent KVM from setting up PMD mapped EPT entry. So we need handle page cache THP correctly. However, when page cache THP's PMD gets split, kernel just remove the map instead of setting up PTE map like what anonymous THP does. Before KVM calls get_user_pages() the subpages may get PTE mapped even though it is still a THP since the page cache THP may be mapped by other processes at the mean time. Checking its _mapcount and whether the THP has PTE mapped or not. Although this may report some false negative cases (PTE mapped by other processes), it looks not trivial to make this accurate. With this fix /sys/kernel/debug/kvm/largepage would show reasonable pages are PMD mapped by EPT as the below: 7fbeaee00000-7fbfaee00000 rw-s 00000000 00:14 275464 /dev/shm/qemu_back_mem.mem.SKUvat (deleted) Size: 4194304 kB [snip] AnonHugePages: 0 kB ShmemPmdMapped: 557056 kB [snip] Locked: 0 kB cat /sys/kernel/debug/kvm/largepages 271 And the benchmarks are as same as anonymous THPs. [yang.shi@linux.alibaba.com: v4] Link: http://lkml.kernel.org/r/1571865575-42913-1-git-send-email-yang.shi@linux.alibaba.com Link: http://lkml.kernel.org/r/1571769577-89735-1-git-send-email-yang.shi@linux.alibaba.com Fixes: dd78fedd ("rmap: support file thp") Signed-off-by: NYang Shi <yang.shi@linux.alibaba.com> Reported-by: NGang Deng <gavin.dg@linux.alibaba.com> Tested-by: NGang Deng <gavin.dg@linux.alibaba.com> Suggested-by: NHugh Dickins <hughd@google.com> Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: <stable@vger.kernel.org> [4.8+] Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
-
由 Jiankang Chen 提交于
ascend inclusion category: bugfix bugzilla: NA CVE: NA ------------------- There is a bug in dvpp. The virtual address used by dvpp must have same higt 16bits. Add a MAP_VA32BIT for mmap. mmap(..., MAP_VA32BIT) will return the virtual address with same higt 16bits. Signed-off-by: NJiankang Chen <chenjiankang1@huawei.com> Signed-off-by: NZhou Guanghui <zhouguanghui1@huawei.com> Signed-off-by: NChen Jun <chenjun102@huawei.com> Signed-off-by: NLijun Fang <fanglijun3@huawei.com> Reviewed-by: NKefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
-
由 Lijun Fang 提交于
euler inclusion category: feature bugzilla: 11082 CVE: NA ----------------- CDM nodes should not be part of mems_allowed, However, It must be allowed to alloc from CDM node, when mpol->mode was MPOL_BIND. Signed-off-by: NLijun Fang <fanglijun3@huawei.com> Reviewed-by: Nzhong jiang <zhongjiang@huawei.com> Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
-
由 Anshuman Khandual 提交于
euler inclusion category: feature bugzilla: 11082 CVE: NA ------------------- Mark all the applicable VMAs with VM_CDM explicitly during mbind(MPOL_BIND) call if the user provided nodemask has a CDM node. Mark the corresponding VMA with VM_CDM flag if the allocated page happens to be from a CDM node. This can be expensive from performance stand point. There are multiple checks to avoid an expensive page_to_nid lookup but it can be optimized further. Signed-off-by: NAnshuman Khandual <khandual@linux.vnet.ibm.com> Signed-off-by: Nzhong jiang <zhongjiang@huawei.com> Signed-off-by: NLijun Fang <fanglijun3@huawei.com> Reviewed-by: Nzhong jiang <zhongjiang@huawei.com> Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
-
由 Linus Torvalds 提交于
mainline inclusion from mainline-5.1-rc5 commit f958d7b5 category: 13690 bugzilla: NA CVE: CVE-2019-11487 There are four commits to fix this CVE: fs: prevent page refcount overflow in pipe_buf_get mm: prevent get_user_pages() from overflowing page refcount mm: add 'try_get_page()' helper function mm: make page ref count overflow check tighter and more explicit ------------------------------------------------- We have a VM_BUG_ON() to check that the page reference count doesn't underflow (or get close to overflow) by checking the sign of the count. That's all fine, but we actually want to allow people to use a "get page ref unless it's already very high" helper function, and we want that one to use the sign of the page ref (without triggering this VM_BUG_ON). Change the VM_BUG_ON to only check for small underflows (or _very_ close to overflowing), and ignore overflows which have strayed into negative territory. Acked-by: NMatthew Wilcox <willy@infradead.org> Cc: Jann Horn <jannh@google.com> Cc: stable@kernel.org Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> Signed-off-by: NYang Yingliang <yangyingliang@huawei.com> Reviewed-by: Nzhong jiang <zhongjiang@huawei.com> Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
-
由 Linus Torvalds 提交于
mainline inclusion from mainline-5.1-rc5 commit 88b1a17d category: bugfix bugzilla: 13947 CVE: NA ------------------------------------------------ This is the same as the traditional 'get_page()' function, but instead of unconditionally incrementing the reference count of the page, it only does so if the count was "safe". It returns whether the reference count was incremented (and is marked __must_check, since the caller obviously has to be aware of it). Also like 'get_page()', you can't use this function unless you already had a reference to the page. The intent is that you can use this exactly like get_page(), but in situations where you want to limit the maximum reference count. The code currently does an unconditional WARN_ON_ONCE() if we ever hit the reference count issues (either zero or negative), as a notification that the conditional non-increment actually happened. NOTE! The count access for the "safety" check is inherently racy, but that doesn't matter since the buffer we use is basically half the range of the reference count (ie we look at the sign of the count). Acked-by: NMatthew Wilcox <willy@infradead.org> Cc: Jann Horn <jannh@google.com> Cc: stable@kernel.org Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Nzhong jiang <zhongjiang@huawei.com> Reviewed-by: NJing Xiangfeng <jingxiangfeng@huawei.com> Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
-
由 Tan Xiaojun 提交于
hulk inclusion category: feature bugzilla: 13276 CVE: NA ------------------------------- Reserve space for the structure in memory subsystem. Signed-off-by: NTan Xiaojun <tanxiaojun@huawei.com> Reviewed-by: NXie XiuQi <xiexiuqi@huawei.com> Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
-
- 29 12月, 2018 1 次提交
-
-
由 Martin Schwidefsky 提交于
[ Upstream commit 6d212db1 ] The common mm code calls mm_dec_nr_pmds() and mm_dec_nr_puds() in free_pgtables() if the address range spans a full pud or pmd. If mm_dec_nr_puds/mm_dec_nr_pmds are non-empty due to configuration settings they blindly subtract the size of the pmd or pud table from pgtable_bytes even if the pud or pmd page table layer is folded. Add explicit mm_[pmd|pud]_folded checks to the four pgtable_bytes accounting functions mm_inc_nr_puds, mm_inc_nr_pmds, mm_dec_nr_puds and mm_dec_nr_pmds. As the check for folded page tables can be overwritten by the architecture, this allows to keep a correct pgtable_bytes value for platforms that use a dynamic number of page table levels. Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: NMartin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: NSasha Levin <sashal@kernel.org>
-
- 06 10月, 2018 1 次提交
-
-
由 Mike Kravetz 提交于
The page migration code employs try_to_unmap() to try and unmap the source page. This is accomplished by using rmap_walk to find all vmas where the page is mapped. This search stops when page mapcount is zero. For shared PMD huge pages, the page map count is always 1 no matter the number of mappings. Shared mappings are tracked via the reference count of the PMD page. Therefore, try_to_unmap stops prematurely and does not completely unmap all mappings of the source page. This problem can result is data corruption as writes to the original source page can happen after contents of the page are copied to the target page. Hence, data is lost. This problem was originally seen as DB corruption of shared global areas after a huge page was soft offlined due to ECC memory errors. DB developers noticed they could reproduce the issue by (hotplug) offlining memory used to back huge pages. A simple testcase can reproduce the problem by creating a shared PMD mapping (note that this must be at least PUD_SIZE in size and PUD_SIZE aligned (1GB on x86)), and using migrate_pages() to migrate process pages between nodes while continually writing to the huge pages being migrated. To fix, have the try_to_unmap_one routine check for huge PMD sharing by calling huge_pmd_unshare for hugetlbfs huge pages. If it is a shared mapping it will be 'unshared' which removes the page table entry and drops the reference on the PMD page. After this, flush caches and TLB. mmu notifiers are called before locking page tables, but we can not be sure of PMD sharing until page tables are locked. Therefore, check for the possibility of PMD sharing before locking so that notifiers can prepare for the worst possible case. Link: http://lkml.kernel.org/r/20180823205917.16297-2-mike.kravetz@oracle.com [mike.kravetz@oracle.com: make _range_in_vma() a static inline] Link: http://lkml.kernel.org/r/6063f215-a5c8-2f0c-465a-2c515ddc952d@oracle.com Fixes: 39dde65c ("shared page table for hugetlb page") Signed-off-by: NMike Kravetz <mike.kravetz@oracle.com> Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com> Acked-by: NMichal Hocko <mhocko@suse.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Jerome Glisse <jglisse@redhat.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: <stable@vger.kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
-
- 24 8月, 2018 1 次提交
-
-
由 Souptick Joarder 提交于
Use new return type vm_fault_t for fault handler. For now, this is just documenting that the function returns a VM_FAULT value rather than an errno. Once all instances are converted, vm_fault_t will become a distinct type. Ref-> commit 1c8f4220 ("mm: change return type to vm_fault_t") The aim is to change the return type of finish_fault() and handle_mm_fault() to vm_fault_t type. As part of that clean up return type of all other recursively called functions have been changed to vm_fault_t type. The places from where handle_mm_fault() is getting invoked will be change to vm_fault_t type but in a separate patch. vmf_error() is the newly introduce inline function in 4.17-rc6. [akpm@linux-foundation.org: don't shadow outer local `ret' in __do_huge_pmd_anonymous_page()] Link: http://lkml.kernel.org/r/20180604171727.GA20279@jordon-HP-15-Notebook-PCSigned-off-by: NSouptick Joarder <jrdr.linux@gmail.com> Reviewed-by: NMatthew Wilcox <mawilcox@microsoft.com> Reviewed-by: NAndrew Morton <akpm@linux-foundation.org> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 23 8月, 2018 3 次提交
-
-
由 Oscar Salvador 提交于
Currently, whenever a new node is created/re-used from the memhotplug path, we call free_area_init_node()->free_area_init_core(). But there is some code that we do not really need to run when we are coming from such path. free_area_init_core() performs the following actions: 1) Initializes pgdat internals, such as spinlock, waitqueues and more. 2) Account # nr_all_pages and # nr_kernel_pages. These values are used later on when creating hash tables. 3) Account number of managed_pages per zone, substracting dma_reserved and memmap pages. 4) Initializes some fields of the zone structure data 5) Calls init_currently_empty_zone to initialize all the freelists 6) Calls memmap_init to initialize all pages belonging to certain zone When called from memhotplug path, free_area_init_core() only performs actions #1 and #4. Action #2 is pointless as the zones do not have any pages since either the node was freed, or we are re-using it, eitherway all zones belonging to this node should have 0 pages. For the same reason, action #3 results always in manages_pages being 0. Action #5 and #6 are performed later on when onlining the pages: online_pages()->move_pfn_range_to_zone()->init_currently_empty_zone() online_pages()->move_pfn_range_to_zone()->memmap_init_zone() This patch does two things: First, moves the node/zone initializtion to their own function, so it allows us to create a small version of free_area_init_core, where we only perform: 1) Initialization of pgdat internals, such as spinlock, waitqueues and more 4) Initialization of some fields of the zone structure data These two functions are: pgdat_init_internals() and zone_init_internals(). The second thing this patch does, is to introduce free_area_init_core_hotplug(), the memhotplug version of free_area_init_core(): Currently, we call free_area_init_node() from the memhotplug path. In there, we set some pgdat's fields, and call calculate_node_totalpages(). calculate_node_totalpages() calculates the # of pages the node has. Since the node is either new, or we are re-using it, the zones belonging to this node should not have any pages, so there is no point to calculate this now. Actually, we re-set these values to 0 later on with the calls to: reset_node_managed_pages() reset_node_present_pages() The # of pages per node and the # of pages per zone will be calculated when onlining the pages: online_pages()->move_pfn_range()->move_pfn_range_to_zone()->resize_zone_range() online_pages()->move_pfn_range()->move_pfn_range_to_zone()->resize_pgdat_range() Also, since free_area_init_core/free_area_init_node will now only get called during early init, let us replace __paginginit with __init, so their code gets freed up. [osalvador@techadventures.net: fix section usage] Link: http://lkml.kernel.org/r/20180731101752.GA473@techadventures.net [osalvador@suse.de: v6] Link: http://lkml.kernel.org/r/20180801122348.21588-6-osalvador@techadventures.net Link: http://lkml.kernel.org/r/20180730101757.28058-5-osalvador@techadventures.netSigned-off-by: NOscar Salvador <osalvador@suse.de> Reviewed-by: NPavel Tatashin <pasha.tatashin@oracle.com> Acked-by: NMichal Hocko <mhocko@suse.com> Acked-by: NVlastimil Babka <vbabka@suse.cz> Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com> Cc: Aaron Lu <aaron.lu@intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Pavel Tatashin 提交于
zone->node is configured only when CONFIG_NUMA=y, so it is a good idea to have inline functions to access this field in order to avoid ifdef's in c files. Link: http://lkml.kernel.org/r/20180730101757.28058-3-osalvador@techadventures.netSigned-off-by: NPavel Tatashin <pasha.tatashin@oracle.com> Signed-off-by: NOscar Salvador <osalvador@suse.de> Reviewed-by: NOscar Salvador <osalvador@suse.de> Acked-by: NMichal Hocko <mhocko@suse.com> Acked-by: NVlastimil Babka <vbabka@suse.cz> Cc: Aaron Lu <aaron.lu@intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Andrew Morton 提交于
Rather than in vm_area_alloc(). To ensure that the various oddball stack-based vmas are in a good state. Some of the callers were zeroing them out, others were not. Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Russell King <rmk+kernel@arm.linux.org.uk> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 18 8月, 2018 4 次提交
-
-
由 Pavel Tatashin 提交于
Rename new_sparse_init() to sparse_init() which enables it. Delete old sparse_init() and all the code that became obsolete with. [pasha.tatashin@oracle.com: remove unused sparse_mem_maps_populate_node()] Link: http://lkml.kernel.org/r/20180716174447.14529-6-pasha.tatashin@oracle.com Link: http://lkml.kernel.org/r/20180712203730.8703-6-pasha.tatashin@oracle.comSigned-off-by: NPavel Tatashin <pasha.tatashin@oracle.com> Tested-by: Michael Ellerman <mpe@ellerman.id.au> [powerpc] Tested-by: NOscar Salvador <osalvador@suse.de> Reviewed-by: NOscar Salvador <osalvador@suse.de> Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com> Cc: Abdul Haleem <abdhalee@linux.vnet.ibm.com> Cc: Baoquan He <bhe@redhat.com> Cc: Daniel Jordan <daniel.m.jordan@oracle.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: David Rientjes <rientjes@google.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Jan Kara <jack@suse.cz> Cc: Jérôme Glisse <jglisse@redhat.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Souptick Joarder <jrdr.linux@gmail.com> Cc: Steven Sistare <steven.sistare@oracle.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Yang <richard.weiyang@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Pavel Tatashin 提交于
Now that both variants of sparse memory use the same buffers to populate memory map, we can move sparse_buffer_init()/sparse_buffer_fini() to the common place. Link: http://lkml.kernel.org/r/20180712203730.8703-4-pasha.tatashin@oracle.comSigned-off-by: NPavel Tatashin <pasha.tatashin@oracle.com> Tested-by: Michael Ellerman <mpe@ellerman.id.au> [powerpc] Tested-by: NOscar Salvador <osalvador@suse.de> Reviewed-by: NAndrew Morton <akpm@linux-foundation.org> Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com> Cc: Abdul Haleem <abdhalee@linux.vnet.ibm.com> Cc: Baoquan He <bhe@redhat.com> Cc: Daniel Jordan <daniel.m.jordan@oracle.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: David Rientjes <rientjes@google.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Jan Kara <jack@suse.cz> Cc: Jérôme Glisse <jglisse@redhat.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Souptick Joarder <jrdr.linux@gmail.com> Cc: Steven Sistare <steven.sistare@oracle.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Yang <richard.weiyang@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Pavel Tatashin 提交于
Patch series "sparse_init rewrite", v6. In sparse_init() we allocate two large buffers to temporary hold usemap and memmap for the whole machine. However, we can avoid doing that if we changed sparse_init() to operated on per-node bases instead of doing it on the whole machine beforehand. As shown by Baoquan http://lkml.kernel.org/r/20180628062857.29658-1-bhe@redhat.com The buffers are large enough to cause machine stop to boot on small memory systems. Another benefit of these changes is that they also obsolete CONFIG_SPARSEMEM_ALLOC_MEM_MAP_TOGETHER. This patch (of 5): When struct pages are allocated for sparse-vmemmap VA layout, we first try to allocate one large buffer, and than if that fails allocate struct pages for each section as we go. The code that allocates buffer is uses global variables and is spread across several call sites. Cleanup the code by introducing three functions to handle the global buffer: sparse_buffer_init() initialize the buffer sparse_buffer_fini() free the remaining part of the buffer sparse_buffer_alloc() alloc from the buffer, and if buffer is empty return NULL Define these functions in sparse.c instead of sparse-vmemmap.c because later we will use them for non-vmemmap sparse allocations as well. [akpm@linux-foundation.org: use PTR_ALIGN()] [akpm@linux-foundation.org: s/BUG_ON/WARN_ON/] Link: http://lkml.kernel.org/r/20180712203730.8703-2-pasha.tatashin@oracle.comSigned-off-by: NPavel Tatashin <pasha.tatashin@oracle.com> Tested-by: Michael Ellerman <mpe@ellerman.id.au> [powerpc] Reviewed-by: NOscar Salvador <osalvador@suse.de> Tested-by: NOscar Salvador <osalvador@suse.de> Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com> Cc: Steven Sistare <steven.sistare@oracle.com> Cc: Daniel Jordan <daniel.m.jordan@oracle.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Jan Kara <jack@suse.cz> Cc: Jérôme Glisse <jglisse@redhat.com> Cc: Souptick Joarder <jrdr.linux@gmail.com> Cc: Baoquan He <bhe@redhat.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: David Rientjes <rientjes@google.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Abdul Haleem <abdhalee@linux.vnet.ibm.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Huang Ying 提交于
Huge page helps to reduce TLB miss rate, but it has higher cache footprint, sometimes this may cause some issue. For example, when copying huge page on x86_64 platform, the cache footprint is 4M. But on a Xeon E5 v3 2699 CPU, there are 18 cores, 36 threads, and only 45M LLC (last level cache). That is, in average, there are 2.5M LLC for each core and 1.25M LLC for each thread. If the cache contention is heavy when copying the huge page, and we copy the huge page from the begin to the end, it is possible that the begin of huge page is evicted from the cache after we finishing copying the end of the huge page. And it is possible for the application to access the begin of the huge page after copying the huge page. In c79b57e4 ("mm: hugetlb: clear target sub-page last when clearing huge page"), to keep the cache lines of the target subpage hot, the order to clear the subpages in the huge page in clear_huge_page() is changed to clearing the subpage which is furthest from the target subpage firstly, and the target subpage last. The similar order changing helps huge page copying too. That is implemented in this patch. Because we have put the order algorithm into a separate function, the implementation is quite simple. The patch is a generic optimization which should benefit quite some workloads, not for a specific use case. To demonstrate the performance benefit of the patch, we tested it with vm-scalability run on transparent huge page. With this patch, the throughput increases ~16.6% in vm-scalability anon-cow-seq test case with 36 processes on a 2 socket Xeon E5 v3 2699 system (36 cores, 72 threads). The test case set /sys/kernel/mm/transparent_hugepage/enabled to be always, mmap() a big anonymous memory area and populate it, then forked 36 child processes, each writes to the anonymous memory area from the begin to the end, so cause copy on write. For each child process, other child processes could be seen as other workloads which generate heavy cache pressure. At the same time, the IPC (instruction per cycle) increased from 0.63 to 0.78, and the time spent in user space is reduced ~7.2%. Link: http://lkml.kernel.org/r/20180524005851.4079-3-ying.huang@intel.comSigned-off-by: N"Huang, Ying" <ying.huang@intel.com> Reviewed-by: NMike Kravetz <mike.kravetz@oracle.com> Cc: Andi Kleen <andi.kleen@intel.com> Cc: Jan Kara <jack@suse.cz> Cc: Michal Hocko <mhocko@suse.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Hugh Dickins <hughd@google.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Shaohua Li <shli@fb.com> Cc: Christopher Lameter <cl@linux.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 02 8月, 2018 1 次提交
-
-
由 Linus Torvalds 提交于
Commit 2c4541e2 ("mm: use vma_init() to initialize VMAs on stack and data segments") tried to initialize various left-over ad-hoc vma's "properly", but actually made things worse for the temporary vma's used for TLB flushing. vma_init() doesn't actually initialize all of the vma, just a few fields, so doing something like - struct vm_area_struct vma = { .vm_mm = tlb->mm, }; + struct vm_area_struct vma; + + vma_init(&vma, tlb->mm); was actually very bad: instead of having a nicely initialized vma with every field but "vm_mm" zeroed, you'd have an entirely uninitialized vma with only a couple of fields initialized. And they weren't even fields that the code in question mostly cared about. The flush_tlb_range() function takes a "struct vma" rather than a "struct mm_struct", because a few architectures actually care about what kind of range it is - being able to only do an ITLB flush if it's a range that doesn't have data accesses enabled, for example. And all the normal users already have the vma for doing the range invalidation. But a few people want to call flush_tlb_range() with a range they just made up, so they also end up using a made-up vma. x86 just has a special "flush_tlb_mm_range()" function for this, but other architectures (arm and ia64) do the "use fake vma" thing instead, and thus got caught up in the vma_init() changes. At the same time, the TLB flushing code really doesn't care about most other fields in the vma, so vma_init() is just unnecessary and pointless. This fixes things by having an explicit "this is just an initializer for the TLB flush" initializer macro, which is used by the arm/arm64/ia64 people who mis-use this interface with just a dummy vma. Fixes: 2c4541e2 ("mm: use vma_init() to initialize VMAs on stack and data segments") Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Kirill Shutemov <kirill.shutemov@linux.intel.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: John Stultz <john.stultz@linaro.org> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 27 7月, 2018 2 次提交
-
-
由 Kirill A. Shutemov 提交于
vma_is_anonymous() relies on ->vm_ops being NULL to detect anonymous VMA. This is unreliable as ->mmap may not set ->vm_ops. False-positive vma_is_anonymous() may lead to crashes: next ffff8801ce5e7040 prev ffff8801d20eca50 mm ffff88019c1e13c0 prot 27 anon_vma ffff88019680cdd8 vm_ops 0000000000000000 pgoff 0 file ffff8801b2ec2d00 private_data 0000000000000000 flags: 0xff(read|write|exec|shared|mayread|maywrite|mayexec|mayshare) ------------[ cut here ]------------ kernel BUG at mm/memory.c:1422! invalid opcode: 0000 [#1] SMP KASAN CPU: 0 PID: 18486 Comm: syz-executor3 Not tainted 4.18.0-rc3+ #136 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 RIP: 0010:zap_pmd_range mm/memory.c:1421 [inline] RIP: 0010:zap_pud_range mm/memory.c:1466 [inline] RIP: 0010:zap_p4d_range mm/memory.c:1487 [inline] RIP: 0010:unmap_page_range+0x1c18/0x2220 mm/memory.c:1508 Call Trace: unmap_single_vma+0x1a0/0x310 mm/memory.c:1553 zap_page_range_single+0x3cc/0x580 mm/memory.c:1644 unmap_mapping_range_vma mm/memory.c:2792 [inline] unmap_mapping_range_tree mm/memory.c:2813 [inline] unmap_mapping_pages+0x3a7/0x5b0 mm/memory.c:2845 unmap_mapping_range+0x48/0x60 mm/memory.c:2880 truncate_pagecache+0x54/0x90 mm/truncate.c:800 truncate_setsize+0x70/0xb0 mm/truncate.c:826 simple_setattr+0xe9/0x110 fs/libfs.c:409 notify_change+0xf13/0x10f0 fs/attr.c:335 do_truncate+0x1ac/0x2b0 fs/open.c:63 do_sys_ftruncate+0x492/0x560 fs/open.c:205 __do_sys_ftruncate fs/open.c:215 [inline] __se_sys_ftruncate fs/open.c:213 [inline] __x64_sys_ftruncate+0x59/0x80 fs/open.c:213 do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290 entry_SYSCALL_64_after_hwframe+0x49/0xbe Reproducer: #include <stdio.h> #include <stddef.h> #include <stdint.h> #include <stdlib.h> #include <string.h> #include <sys/types.h> #include <sys/stat.h> #include <sys/ioctl.h> #include <sys/mman.h> #include <unistd.h> #include <fcntl.h> #define KCOV_INIT_TRACE _IOR('c', 1, unsigned long) #define KCOV_ENABLE _IO('c', 100) #define KCOV_DISABLE _IO('c', 101) #define COVER_SIZE (1024<<10) #define KCOV_TRACE_PC 0 #define KCOV_TRACE_CMP 1 int main(int argc, char **argv) { int fd; unsigned long *cover; system("mount -t debugfs none /sys/kernel/debug"); fd = open("/sys/kernel/debug/kcov", O_RDWR); ioctl(fd, KCOV_INIT_TRACE, COVER_SIZE); cover = mmap(NULL, COVER_SIZE * sizeof(unsigned long), PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0); munmap(cover, COVER_SIZE * sizeof(unsigned long)); cover = mmap(NULL, COVER_SIZE * sizeof(unsigned long), PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0); memset(cover, 0, COVER_SIZE * sizeof(unsigned long)); ftruncate(fd, 3UL << 20); return 0; } This can be fixed by assigning anonymous VMAs own vm_ops and not relying on it being NULL. If ->mmap() failed to set ->vm_ops, mmap_region() will set it to dummy_vm_ops. This way we will have non-NULL ->vm_ops for all VMAs. Link: http://lkml.kernel.org/r/20180724121139.62570-4-kirill.shutemov@linux.intel.comSigned-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Reported-by: syzbot+3f84280d52be9b7083cc@syzkaller.appspotmail.com Acked-by: NLinus Torvalds <torvalds@linux-foundation.org> Reviewed-by: NAndrew Morton <akpm@linux-foundation.org> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> -
由 Kirill A. Shutemov 提交于
Not all VMAs allocated with vm_area_alloc(). Some of them allocated on stack or in data segment. The new helper can be use to initialize VMA properly regardless where it was allocated. Link: http://lkml.kernel.org/r/20180724121139.62570-2-kirill.shutemov@linux.intel.comSigned-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: NLinus Torvalds <torvalds@linux-foundation.org> Reviewed-by: NAndrew Morton <akpm@linux-foundation.org> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 24 7月, 2018 1 次提交
-
-
由 Dan Williams 提交于
mce: Uncorrected hardware memory error in user-access at af34214200 {1}[Hardware Error]: It has been corrected by h/w and requires no further action mce: [Hardware Error]: Machine check events logged {1}[Hardware Error]: event severity: corrected Memory failure: 0xaf34214: reserved kernel page still referenced by 1 users [..] Memory failure: 0xaf34214: recovery action for reserved kernel page: Failed mce: Memory error not recovered In contrast to typical memory, dev_pagemap pages may be dax mapped. With dax there is no possibility to map in another page dynamically since dax establishes 1:1 physical address to file offset associations. Also dev_pagemap pages associated with NVDIMM / persistent memory devices can internal remap/repair addresses with poison. While memory_failure() assumes that it can discard typical poisoned pages and keep them unmapped indefinitely, dev_pagemap pages may be returned to service after the error is cleared. Teach memory_failure() to detect and handle MEMORY_DEVICE_HOST dev_pagemap pages that have poison consumed by userspace. Mark the memory as UC instead of unmapping it completely to allow ongoing access via the device driver (nd_pmem). Later, nd_pmem will grow support for marking the page back to WB when the error is cleared. Cc: Jan Kara <jack@suse.cz> Cc: Christoph Hellwig <hch@lst.de> Cc: Jérôme Glisse <jglisse@redhat.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: NDan Williams <dan.j.williams@intel.com> Signed-off-by: NDave Jiang <dave.jiang@intel.com>
-
- 22 7月, 2018 2 次提交
-
-
由 Linus Torvalds 提交于
Like vm_area_dup(), it initializes the anon_vma_chain head, and the basic mm pointer. The rest of the fields end up being different for different users, although the plan is to also initialize the 'vm_ops' field to a dummy entry. Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> -
由 Linus Torvalds 提交于
The vm_area_struct is one of the most fundamental memory management objects, but the management of it is entirely open-coded evertwhere, ranging from allocation and freeing (using kmem_cache_[z]alloc and kmem_cache_free) to initializing all the fields. We want to unify this in order to end up having some unified initialization of the vmas, and the first step to this is to at least have basic allocation functions. Right now those functions are literally just wrappers around the kmem_cache_*() calls. This is a purely mechanical conversion: # new vma: kmem_cache_zalloc(vm_area_cachep, GFP_KERNEL) -> vm_area_alloc() # copy old vma kmem_cache_alloc(vm_area_cachep, GFP_KERNEL) -> vm_area_dup(old) # free vma kmem_cache_free(vm_area_cachep, vma) -> vm_area_free(vma) to the point where the old vma passed in to the vm_area_dup() function isn't even used yet (because I've left all the old manual initialization alone). Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 17 7月, 2018 1 次提交
-
-
由 Pavel Tatashin 提交于
Moving zero_resv_unavail before memmap_init_zone(), caused a regression on x86-32. The cause is that we access struct pages before they are allocated when CONFIG_FLAT_NODE_MEM_MAP is used. free_area_init_nodes() zero_resv_unavail() mm_zero_struct_page(pfn_to_page(pfn)); <- struct page is not alloced free_area_init_node() if CONFIG_FLAT_NODE_MEM_MAP alloc_node_mem_map() memblock_virt_alloc_node_nopanic() <- struct page alloced here On the other hand memblock_virt_alloc_node_nopanic() zeroes all the memory that it returns, so we do not need to do zero_resv_unavail() here. Fixes: e181ae0c ("mm: zero unavailable pages before memmap init") Signed-off-by: NPavel Tatashin <pasha.tatashin@oracle.com> Tested-by: NMatt Hart <matt@mattface.org> Acked-by: NMichal Hocko <mhocko@suse.com> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 13 6月, 2018 1 次提交
-
-
由 Kees Cook 提交于
The kv*alloc()-family was missing kvcalloc(). Adding this allows for 2-argument multiplication conversions of kvzalloc(a * b, ...) into kvcalloc(a, b, ...). Signed-off-by: NKees Cook <keescook@chromium.org>
-
- 08 6月, 2018 4 次提交
-
-
由 Sahara 提交于
When commit bd33ef36 ("mm: enable page poisoning early at boot") got rid of the PAGE_EXT_DEBUG_POISON, page_is_poisoned in the header left behind. This patch cleans up the leftovers under the table. Link: http://lkml.kernel.org/r/1528101069-21637-1-git-send-email-kpark3469@gmail.comSigned-off-by: NSahara <keun-o.park@darkmatter.ae> Acked-by: NMichal Hocko <mhocko@suse.com> Reviewed-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Matthew Wilcox 提交于
Define a new PageTable bit in the page_type and use it to mark pages in use as page tables. This can be helpful when debugging crashdumps or analysing memory fragmentation. Add a KPF flag to report these pages to userspace and update page-types.c to interpret that flag. Note that only pages currently accounted as NR_PAGETABLES are tracked as PageTable; this does not include pgd/p4d/pud/pmd pages. Those will be the subject of a later patch. Link: http://lkml.kernel.org/r/20180518194519.3820-4-willy@infradead.orgSigned-off-by: NMatthew Wilcox <mawilcox@microsoft.com> Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: NVlastimil Babka <vbabka@suse.cz> Cc: Christoph Lameter <cl@linux.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Jérôme Glisse <jglisse@redhat.com> Cc: Lai Jiangshan <jiangshanlai@gmail.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Souptick Joarder 提交于
Use new return type vm_fault_t for fault handler in struct vm_operations_struct. For now, this is just documenting that the function returns a VM_FAULT value rather than an errno. Once all instances are converted, vm_fault_t will become a distinct type. Link: http://lkml.kernel.org/r/20180511190542.GA2412@jordon-HP-15-Notebook-PCSigned-off-by: NSouptick Joarder <jrdr.linux@gmail.com> Reviewed-by: NMatthew Wilcox <mawilcox@microsoft.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Jan Kara <jack@suse.cz> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Rik van Riel <riel@redhat.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Hugh Dickins <hughd@google.com> Cc: Pavel Tatashin <pasha.tatashin@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Souptick Joarder 提交于
Use new return type vm_fault_t for fault handler. For now, this is just documenting that the function returns a VM_FAULT value rather than an errno. Once all instances are converted, vm_fault_t will become a distinct type. commit 1c8f4220 ("mm: change return type to vm_fault_t") There was an existing bug inside dax_load_hole() if vm_insert_mixed had failed to allocate a page table, we'd return VM_FAULT_NOPAGE instead of VM_FAULT_OOM. With new vmf_insert_mixed() this issue is addressed. vm_insert_mixed_mkwrite has inefficiency when it returns an error value, driver has to convert it to vm_fault_t type. With new vmf_insert_mixed_mkwrite() this limitation will be addressed. Link: http://lkml.kernel.org/r/20180510181121.GA15239@jordon-HP-15-Notebook-PCSigned-off-by: NSouptick Joarder <jrdr.linux@gmail.com> Reviewed-by: NJan Kara <jack@suse.cz> Reviewed-by: NMatthew Wilcox <mawilcox@microsoft.com> Reviewed-by: NRoss Zwisler <ross.zwisler@linux.intel.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Michal Hocko <mhocko@suse.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-