- 09 9月, 2017 25 次提交
-
-
由 Jérôme Glisse 提交于
Platform with advance system bus (like CAPI or CCIX) allow device memory to be accessible from CPU in a cache coherent fashion. Add a new type of ZONE_DEVICE to represent such memory. The use case are the same as for the un-addressable device memory but without all the corners cases. Link: http://lkml.kernel.org/r/20170817000548.32038-19-jglisse@redhat.comSigned-off-by: NJérôme Glisse <jglisse@redhat.com> Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Balbir Singh <bsingharora@gmail.com> Cc: David Nellans <dnellans@nvidia.com> Cc: Evgeny Baskakov <ebaskakov@nvidia.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Mark Hairgrove <mhairgrove@nvidia.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Sherry Cheung <SCheung@nvidia.com> Cc: Subhash Gutti <sgutti@nvidia.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Bob Liu <liubo95@huawei.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Jérôme Glisse 提交于
This allows callers of migrate_vma() to allocate new page for empty CPU page table entry (pte_none or back by zero page). This is only for anonymous memory and it won't allow new page to be instanced if the userfaultfd is armed. This is useful to device driver that want to migrate a range of virtual address and would rather allocate new memory than having to fault later on. Link: http://lkml.kernel.org/r/20170817000548.32038-18-jglisse@redhat.comSigned-off-by: NJérôme Glisse <jglisse@redhat.com> Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: David Nellans <dnellans@nvidia.com> Cc: Evgeny Baskakov <ebaskakov@nvidia.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Mark Hairgrove <mhairgrove@nvidia.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Sherry Cheung <SCheung@nvidia.com> Cc: Subhash Gutti <sgutti@nvidia.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Bob Liu <liubo95@huawei.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Jérôme Glisse 提交于
Allow to unmap and restore special swap entry of un-addressable ZONE_DEVICE memory. Link: http://lkml.kernel.org/r/20170817000548.32038-17-jglisse@redhat.comSigned-off-by: NJérôme Glisse <jglisse@redhat.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: David Nellans <dnellans@nvidia.com> Cc: Evgeny Baskakov <ebaskakov@nvidia.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Mark Hairgrove <mhairgrove@nvidia.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Sherry Cheung <SCheung@nvidia.com> Cc: Subhash Gutti <sgutti@nvidia.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Bob Liu <liubo95@huawei.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Jérôme Glisse 提交于
Common case for migration of virtual address range is page are map only once inside the vma in which migration is taking place. Because we already walk the CPU page table for that range we can directly do the unmap there and setup special migration swap entry. Link: http://lkml.kernel.org/r/20170817000548.32038-16-jglisse@redhat.comSigned-off-by: NJérôme Glisse <jglisse@redhat.com> Signed-off-by: NEvgeny Baskakov <ebaskakov@nvidia.com> Signed-off-by: NJohn Hubbard <jhubbard@nvidia.com> Signed-off-by: NMark Hairgrove <mhairgrove@nvidia.com> Signed-off-by: NSherry Cheung <SCheung@nvidia.com> Signed-off-by: NSubhash Gutti <sgutti@nvidia.com> Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: David Nellans <dnellans@nvidia.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Bob Liu <liubo95@huawei.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Jérôme Glisse 提交于
This patch add a new memory migration helpers, which migrate memory backing a range of virtual address of a process to different memory (which can be allocated through special allocator). It differs from numa migration by working on a range of virtual address and thus by doing migration in chunk that can be large enough to use DMA engine or special copy offloading engine. Expected users are any one with heterogeneous memory where different memory have different characteristics (latency, bandwidth, ...). As an example IBM platform with CAPI bus can make use of this feature to migrate between regular memory and CAPI device memory. New CPU architecture with a pool of high performance memory not manage as cache but presented as regular memory (while being faster and with lower latency than DDR) will also be prime user of this patch. Migration to private device memory will be useful for device that have large pool of such like GPU, NVidia plans to use HMM for that. Link: http://lkml.kernel.org/r/20170817000548.32038-15-jglisse@redhat.comSigned-off-by: NJérôme Glisse <jglisse@redhat.com> Signed-off-by: NEvgeny Baskakov <ebaskakov@nvidia.com> Signed-off-by: NJohn Hubbard <jhubbard@nvidia.com> Signed-off-by: NMark Hairgrove <mhairgrove@nvidia.com> Signed-off-by: NSherry Cheung <SCheung@nvidia.com> Signed-off-by: NSubhash Gutti <sgutti@nvidia.com> Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: David Nellans <dnellans@nvidia.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Bob Liu <liubo95@huawei.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Jérôme Glisse 提交于
Introduce a new migration mode that allow to offload the copy to a device DMA engine. This changes the workflow of migration and not all address_space migratepage callback can support this. This is intended to be use by migrate_vma() which itself is use for thing like HMM (see include/linux/hmm.h). No additional per-filesystem migratepage testing is needed. I disables MIGRATE_SYNC_NO_COPY in all problematic migratepage() callback and i added comment in those to explain why (part of this patch). The commit message is unclear it should say that any callback that wish to support this new mode need to be aware of the difference in the migration flow from other mode. Some of these callbacks do extra locking while copying (aio, zsmalloc, balloon, ...) and for DMA to be effective you want to copy multiple pages in one DMA operations. But in the problematic case you can not easily hold the extra lock accross multiple call to this callback. Usual flow is: For each page { 1 - lock page 2 - call migratepage() callback 3 - (extra locking in some migratepage() callback) 4 - migrate page state (freeze refcount, update page cache, buffer head, ...) 5 - copy page 6 - (unlock any extra lock of migratepage() callback) 7 - return from migratepage() callback 8 - unlock page } The new mode MIGRATE_SYNC_NO_COPY: 1 - lock multiple pages For each page { 2 - call migratepage() callback 3 - abort in all problematic migratepage() callback 4 - migrate page state (freeze refcount, update page cache, buffer head, ...) } // finished all calls to migratepage() callback 5 - DMA copy multiple pages 6 - unlock all the pages To support MIGRATE_SYNC_NO_COPY in the problematic case we would need a new callback migratepages() (for instance) that deals with multiple pages in one transaction. Because the problematic cases are not important for current usage I did not wanted to complexify this patchset even more for no good reason. Link: http://lkml.kernel.org/r/20170817000548.32038-14-jglisse@redhat.comSigned-off-by: NJérôme Glisse <jglisse@redhat.com> Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: David Nellans <dnellans@nvidia.com> Cc: Evgeny Baskakov <ebaskakov@nvidia.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Mark Hairgrove <mhairgrove@nvidia.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Sherry Cheung <SCheung@nvidia.com> Cc: Subhash Gutti <sgutti@nvidia.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Bob Liu <liubo95@huawei.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Jérôme Glisse 提交于
This introduce a dummy HMM device class so device driver can use it to create hmm_device for the sole purpose of registering device memory. It is useful to device driver that want to manage multiple physical device memory under same struct device umbrella. Link: http://lkml.kernel.org/r/20170817000548.32038-13-jglisse@redhat.comSigned-off-by: NJérôme Glisse <jglisse@redhat.com> Signed-off-by: NEvgeny Baskakov <ebaskakov@nvidia.com> Signed-off-by: NJohn Hubbard <jhubbard@nvidia.com> Signed-off-by: NMark Hairgrove <mhairgrove@nvidia.com> Signed-off-by: NSherry Cheung <SCheung@nvidia.com> Signed-off-by: NSubhash Gutti <sgutti@nvidia.com> Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: David Nellans <dnellans@nvidia.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Bob Liu <liubo95@huawei.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Jérôme Glisse 提交于
This introduce a simple struct and associated helpers for device driver to use when hotpluging un-addressable device memory as ZONE_DEVICE. It will find a unuse physical address range and trigger memory hotplug for it which allocates and initialize struct page for the device memory. Device driver should use this helper during device initialization to hotplug the device memory. It should only need to remove the memory once the device is going offline (shutdown or hotremove). There should not be any userspace API to hotplug memory expect maybe for host device driver to allow to add more memory to a guest device driver. Device's memory is manage by the device driver and HMM only provides helpers to that effect. Link: http://lkml.kernel.org/r/20170817000548.32038-12-jglisse@redhat.comSigned-off-by: NJérôme Glisse <jglisse@redhat.com> Signed-off-by: NEvgeny Baskakov <ebaskakov@nvidia.com> Signed-off-by: NJohn Hubbard <jhubbard@nvidia.com> Signed-off-by: NMark Hairgrove <mhairgrove@nvidia.com> Signed-off-by: NSherry Cheung <SCheung@nvidia.com> Signed-off-by: NSubhash Gutti <sgutti@nvidia.com> Signed-off-by: NBalbir Singh <bsingharora@gmail.com> Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: David Nellans <dnellans@nvidia.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Bob Liu <liubo95@huawei.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Jérôme Glisse 提交于
HMM pages (private or public device pages) are ZONE_DEVICE page and thus need special handling when it comes to lru or refcount. This patch make sure that memcontrol properly handle those when it face them. Those pages are use like regular pages in a process address space either as anonymous page or as file back page. So from memcg point of view we want to handle them like regular page for now at least. Link: http://lkml.kernel.org/r/20170817000548.32038-11-jglisse@redhat.comSigned-off-by: NJérôme Glisse <jglisse@redhat.com> Acked-by: NBalbir Singh <bsingharora@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: David Nellans <dnellans@nvidia.com> Cc: Evgeny Baskakov <ebaskakov@nvidia.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Mark Hairgrove <mhairgrove@nvidia.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Sherry Cheung <SCheung@nvidia.com> Cc: Subhash Gutti <sgutti@nvidia.com> Cc: Bob Liu <liubo95@huawei.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Jérôme Glisse 提交于
HMM pages (private or public device pages) are ZONE_DEVICE page and thus you can not use page->lru fields of those pages. This patch re-arrange the uncharge to allow single page to be uncharge without modifying the lru field of the struct page. There is no change to memcontrol logic, it is the same as it was before this patch. Link: http://lkml.kernel.org/r/20170817000548.32038-10-jglisse@redhat.comSigned-off-by: NJérôme Glisse <jglisse@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: David Nellans <dnellans@nvidia.com> Cc: Evgeny Baskakov <ebaskakov@nvidia.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Mark Hairgrove <mhairgrove@nvidia.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Sherry Cheung <SCheung@nvidia.com> Cc: Subhash Gutti <sgutti@nvidia.com> Cc: Bob Liu <liubo95@huawei.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Jérôme Glisse 提交于
A ZONE_DEVICE page that reach a refcount of 1 is free ie no longer have any user. For device private pages this is important to catch and thus we need to special case put_page() for this. Link: http://lkml.kernel.org/r/20170817000548.32038-9-jglisse@redhat.comSigned-off-by: NJérôme Glisse <jglisse@redhat.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: David Nellans <dnellans@nvidia.com> Cc: Evgeny Baskakov <ebaskakov@nvidia.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Mark Hairgrove <mhairgrove@nvidia.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Sherry Cheung <SCheung@nvidia.com> Cc: Subhash Gutti <sgutti@nvidia.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Bob Liu <liubo95@huawei.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Jérôme Glisse 提交于
HMM (heterogeneous memory management) need struct page to support migration from system main memory to device memory. Reasons for HMM and migration to device memory is explained with HMM core patch. This patch deals with device memory that is un-addressable memory (ie CPU can not access it). Hence we do not want those struct page to be manage like regular memory. That is why we extend ZONE_DEVICE to support different types of memory. A persistent memory type is define for existing user of ZONE_DEVICE and a new device un-addressable type is added for the un-addressable memory type. There is a clear separation between what is expected from each memory type and existing user of ZONE_DEVICE are un-affected by new requirement and new use of the un-addressable type. All specific code path are protect with test against the memory type. Because memory is un-addressable we use a new special swap type for when a page is migrated to device memory (this reduces the number of maximum swap file). The main two additions beside memory type to ZONE_DEVICE is two callbacks. First one, page_free() is call whenever page refcount reach 1 (which means the page is free as ZONE_DEVICE page never reach a refcount of 0). This allow device driver to manage its memory and associated struct page. The second callback page_fault() happens when there is a CPU access to an address that is back by a device page (which are un-addressable by the CPU). This callback is responsible to migrate the page back to system main memory. Device driver can not block migration back to system memory, HMM make sure that such page can not be pin into device memory. If device is in some error condition and can not migrate memory back then a CPU page fault to device memory should end with SIGBUS. [arnd@arndb.de: fix warning] Link: http://lkml.kernel.org/r/20170823133213.712917-1-arnd@arndb.de Link: http://lkml.kernel.org/r/20170817000548.32038-8-jglisse@redhat.comSigned-off-by: NJérôme Glisse <jglisse@redhat.com> Signed-off-by: NArnd Bergmann <arnd@arndb.de> Acked-by: NDan Williams <dan.j.williams@intel.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: David Nellans <dnellans@nvidia.com> Cc: Evgeny Baskakov <ebaskakov@nvidia.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Mark Hairgrove <mhairgrove@nvidia.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Sherry Cheung <SCheung@nvidia.com> Cc: Subhash Gutti <sgutti@nvidia.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Bob Liu <liubo95@huawei.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Jérôme Glisse 提交于
This handles page fault on behalf of device driver, unlike handle_mm_fault() it does not trigger migration back to system memory for device memory. Link: http://lkml.kernel.org/r/20170817000548.32038-6-jglisse@redhat.comSigned-off-by: NJérôme Glisse <jglisse@redhat.com> Signed-off-by: NEvgeny Baskakov <ebaskakov@nvidia.com> Signed-off-by: NJohn Hubbard <jhubbard@nvidia.com> Signed-off-by: NMark Hairgrove <mhairgrove@nvidia.com> Signed-off-by: NSherry Cheung <SCheung@nvidia.com> Signed-off-by: NSubhash Gutti <sgutti@nvidia.com> Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: David Nellans <dnellans@nvidia.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Bob Liu <liubo95@huawei.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Jérôme Glisse 提交于
This does not use existing page table walker because we want to share same code for our page fault handler. Link: http://lkml.kernel.org/r/20170817000548.32038-5-jglisse@redhat.comSigned-off-by: NJérôme Glisse <jglisse@redhat.com> Signed-off-by: NEvgeny Baskakov <ebaskakov@nvidia.com> Signed-off-by: NJohn Hubbard <jhubbard@nvidia.com> Signed-off-by: NMark Hairgrove <mhairgrove@nvidia.com> Signed-off-by: NSherry Cheung <SCheung@nvidia.com> Signed-off-by: NSubhash Gutti <sgutti@nvidia.com> Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: David Nellans <dnellans@nvidia.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Bob Liu <liubo95@huawei.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Jérôme Glisse 提交于
This is a heterogeneous memory management (HMM) process address space mirroring. In a nutshell this provide an API to mirror process address space on a device. This boils down to keeping CPU and device page table synchronize (we assume that both device and CPU are cache coherent like PCIe device can be). This patch provide a simple API for device driver to achieve address space mirroring thus avoiding each device driver to grow its own CPU page table walker and its own CPU page table synchronization mechanism. This is useful for NVidia GPU >= Pascal, Mellanox IB >= mlx5 and more hardware in the future. [jglisse@redhat.com: fix hmm for "mmu_notifier kill invalidate_page callback"] Link: http://lkml.kernel.org/r/20170830231955.GD9445@redhat.com Link: http://lkml.kernel.org/r/20170817000548.32038-4-jglisse@redhat.comSigned-off-by: NJérôme Glisse <jglisse@redhat.com> Signed-off-by: NEvgeny Baskakov <ebaskakov@nvidia.com> Signed-off-by: NJohn Hubbard <jhubbard@nvidia.com> Signed-off-by: NMark Hairgrove <mhairgrove@nvidia.com> Signed-off-by: NSherry Cheung <SCheung@nvidia.com> Signed-off-by: NSubhash Gutti <sgutti@nvidia.com> Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: David Nellans <dnellans@nvidia.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Bob Liu <liubo95@huawei.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Jérôme Glisse 提交于
HMM provides 3 separate types of functionality: - Mirroring: synchronize CPU page table and device page table - Device memory: allocating struct page for device memory - Migration: migrating regular memory to device memory This patch introduces some common helpers and definitions to all of those 3 functionality. Link: http://lkml.kernel.org/r/20170817000548.32038-3-jglisse@redhat.comSigned-off-by: NJérôme Glisse <jglisse@redhat.com> Signed-off-by: NEvgeny Baskakov <ebaskakov@nvidia.com> Signed-off-by: NJohn Hubbard <jhubbard@nvidia.com> Signed-off-by: NMark Hairgrove <mhairgrove@nvidia.com> Signed-off-by: NSherry Cheung <SCheung@nvidia.com> Signed-off-by: NSubhash Gutti <sgutti@nvidia.com> Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: David Nellans <dnellans@nvidia.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Bob Liu <liubo95@huawei.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Naoya Horiguchi 提交于
This patch enables thp migration for memory hotremove. Link: http://lkml.kernel.org/r/20170717193955.20207-11-zi.yan@sent.comSigned-off-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: NZi Yan <zi.yan@cs.rutgers.edu> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: David Nellans <dnellans@nvidia.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Minchan Kim <minchan@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Naoya Horiguchi 提交于
This patch enables thp migration for move_pages(2). Link: http://lkml.kernel.org/r/20170717193955.20207-10-zi.yan@sent.comSigned-off-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: NZi Yan <zi.yan@cs.rutgers.edu> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: David Nellans <dnellans@nvidia.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Minchan Kim <minchan@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Naoya Horiguchi 提交于
This patch enables thp migration for mbind(2) and migrate_pages(2). Signed-off-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: NZi Yan <zi.yan@cs.rutgers.edu> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: David Nellans <dnellans@nvidia.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Minchan Kim <minchan@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Naoya Horiguchi 提交于
Soft dirty bit is designed to keep tracked over page migration. This patch makes it work in the same manner for thp migration too. Signed-off-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: NZi Yan <zi.yan@cs.rutgers.edu> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: David Nellans <dnellans@nvidia.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Minchan Kim <minchan@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Zi Yan 提交于
When THP migration is being used, memory management code needs to handle pmd migration entries properly. This patch uses !pmd_present() or is_swap_pmd() (depending on whether pmd_none() needs separate code or not) to check pmd migration entries at the places where a pmd entry is present. Since pmd-related code uses split_huge_page(), split_huge_pmd(), pmd_trans_huge(), pmd_trans_unstable(), or pmd_none_or_trans_huge_or_clear_bad(), this patch: 1. adds pmd migration entry split code in split_huge_pmd(), 2. takes care of pmd migration entries whenever pmd_trans_huge() is present, 3. makes pmd_none_or_trans_huge_or_clear_bad() pmd migration entry aware. Since split_huge_page() uses split_huge_pmd() and pmd_trans_unstable() is equivalent to pmd_none_or_trans_huge_or_clear_bad(), we do not change them. Until this commit, a pmd entry should be: 1. pointing to a pte page, 2. is_swap_pmd(), 3. pmd_trans_huge(), 4. pmd_devmap(), or 5. pmd_none(). Signed-off-by: NZi Yan <zi.yan@cs.rutgers.edu> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: David Nellans <dnellans@nvidia.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Minchan Kim <minchan@kernel.org> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Zi Yan 提交于
Add thp migration's core code, including conversions between a PMD entry and a swap entry, setting PMD migration entry, removing PMD migration entry, and waiting on PMD migration entries. This patch makes it possible to support thp migration. If you fail to allocate a destination page as a thp, you just split the source thp as we do now, and then enter the normal page migration. If you succeed to allocate destination thp, you enter thp migration. Subsequent patches actually enable thp migration for each caller of page migration by allowing its get_new_page() callback to allocate thps. [zi.yan@cs.rutgers.edu: fix gcc-4.9.0 -Wmissing-braces warning] Link: http://lkml.kernel.org/r/A0ABA698-7486-46C3-B209-E95A9048B22C@cs.rutgers.edu [akpm@linux-foundation.org: fix x86_64 allnoconfig warning] Signed-off-by: NZi Yan <zi.yan@cs.rutgers.edu> Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: David Nellans <dnellans@nvidia.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Minchan Kim <minchan@kernel.org> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Naoya Horiguchi 提交于
Introduce CONFIG_ARCH_ENABLE_THP_MIGRATION to limit thp migration functionality to x86_64, which should be safer at the first step. Link: http://lkml.kernel.org/r/20170717193955.20207-5-zi.yan@sent.comSigned-off-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: NZi Yan <zi.yan@cs.rutgers.edu> Reviewed-by: NAnshuman Khandual <khandual@linux.vnet.ibm.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: David Nellans <dnellans@nvidia.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Minchan Kim <minchan@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Naoya Horiguchi 提交于
TTU_MIGRATION is used to convert pte into migration entry until thp split completes. This behavior conflicts with thp migration added later patches, so let's introduce a new TTU flag specifically for freezing. try_to_unmap() is used both for thp split (via freeze_page()) and page migration (via __unmap_and_move()). In freeze_page(), ttu_flag given for head page is like below (assuming anonymous thp): (TTU_IGNORE_MLOCK | TTU_IGNORE_ACCESS | TTU_RMAP_LOCKED | \ TTU_MIGRATION | TTU_SPLIT_HUGE_PMD) and ttu_flag given for tail pages is: (TTU_IGNORE_MLOCK | TTU_IGNORE_ACCESS | TTU_RMAP_LOCKED | \ TTU_MIGRATION) __unmap_and_move() calls try_to_unmap() with ttu_flag: (TTU_MIGRATION | TTU_IGNORE_MLOCK | TTU_IGNORE_ACCESS) Now I'm trying to insert a branch for thp migration at the top of try_to_unmap_one() like below static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma, unsigned long address, void *arg) { ... /* PMD-mapped THP migration entry */ if (!pvmw.pte && (flags & TTU_MIGRATION)) { if (!PageAnon(page)) continue; set_pmd_migration_entry(&pvmw, page); continue; } ... } so try_to_unmap() for tail pages called by thp split can go into thp migration code path (which converts *pmd* into migration entry), while the expectation is to freeze thp (which converts *pte* into migration entry.) I detected this failure as a "bad page state" error in a testcase where split_huge_page() is called from queue_pages_pte_range(). Link: http://lkml.kernel.org/r/20170717193955.20207-4-zi.yan@sent.comSigned-off-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: NZi Yan <zi.yan@cs.rutgers.edu> Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: David Nellans <dnellans@nvidia.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Minchan Kim <minchan@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Naoya Horiguchi 提交于
Patch series "mm: page migration enhancement for thp", v9. Motivations: 1. THP migration becomes important in the upcoming heterogeneous memory systems. As David Nellans from NVIDIA pointed out from other threads (http://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1349227.html), future GPUs or other accelerators will have their memory managed by operating systems. Moving data into and out of these memory nodes efficiently is critical to applications that use GPUs or other accelerators. Existing page migration only supports base pages, which has a very low memory bandwidth utilization. My experiments (see below) show THP migration can migrate pages more efficiently. 2. Base page migration vs THP migration throughput. Here are cross-socket page migration results from calling move_pages() syscall: In x86_64, a Intel two-socket E5-2640v3 box, - single 4KB base page migration takes 62.47 us, using 0.06 GB/s BW, - single 2MB THP migration takes 658.54 us, using 2.97 GB/s BW, - 512 4KB base page migration takes 1987.38 us, using 0.98 GB/s BW. In ppc64, a two-socket Power8 box, - single 64KB base page migration takes 49.3 us, using 1.24 GB/s BW, - single 16MB THP migration takes 2202.17 us, using 7.10 GB/s BW, - 256 64KB base page migration takes 2543.65 us, using 6.14 GB/s BW. THP migration can give us 3x and 1.15x throughput over base page migration in x86_64 and ppc64 respectivley. You can test it out by using the code here: https://github.com/x-y-z/thp-migration-bench 3. Existing page migration splits THP before migration and cannot guarantee the migrated pages are still contiguous. Contiguity is always what GPUs and accelerators look for. Without THP migration, khugepaged needs to do extra work to reassemble the migrated pages back to THPs. This patch (of 10): Introduce a separate check routine related to MPOL_MF_INVERT flag. This patch just does cleanup, no behavioral change. Link: http://lkml.kernel.org/r/20170717193955.20207-2-zi.yan@sent.comSigned-off-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: NZi Yan <zi.yan@cs.rutgers.edu> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: David Nellans <dnellans@nvidia.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 07 9月, 2017 15 次提交
-
-
由 Rik van Riel 提交于
Introduce MADV_WIPEONFORK semantics, which result in a VMA being empty in the child process after fork. This differs from MADV_DONTFORK in one important way. If a child process accesses memory that was MADV_WIPEONFORK, it will get zeroes. The address ranges are still valid, they are just empty. If a child process accesses memory that was MADV_DONTFORK, it will get a segmentation fault, since those address ranges are no longer valid in the child after fork. Since MADV_DONTFORK also seems to be used to allow very large programs to fork in systems with strict memory overcommit restrictions, changing the semantics of MADV_DONTFORK might break existing programs. MADV_WIPEONFORK only works on private, anonymous VMAs. The use case is libraries that store or cache information, and want to know that they need to regenerate it in the child process after fork. Examples of this would be: - systemd/pulseaudio API checks (fail after fork) (replacing a getpid check, which is too slow without a PID cache) - PKCS#11 API reinitialization check (mandated by specification) - glibc's upcoming PRNG (reseed after fork) - OpenSSL PRNG (reseed after fork) The security benefits of a forking server having a re-inialized PRNG in every child process are pretty obvious. However, due to libraries having all kinds of internal state, and programs getting compiled with many different versions of each library, it is unreasonable to expect calling programs to re-initialize everything manually after fork. A further complication is the proliferation of clone flags, programs bypassing glibc's functions to call clone directly, and programs calling unshare, causing the glibc pthread_atfork hook to not get called. It would be better to have the kernel take care of this automatically. The patch also adds MADV_KEEPONFORK, to undo the effects of a prior MADV_WIPEONFORK. This is similar to the OpenBSD minherit syscall with MAP_INHERIT_ZERO: https://man.openbsd.org/minherit.2 [akpm@linux-foundation.org: numerically order arch/parisc/include/uapi/asm/mman.h #defines] Link: http://lkml.kernel.org/r/20170811212829.29186-3-riel@redhat.comSigned-off-by: NRik van Riel <riel@redhat.com> Reported-by: NFlorian Weimer <fweimer@redhat.com> Reported-by: NColm MacCártaigh <colm@allcosts.net> Reviewed-by: NMike Kravetz <mike.kravetz@oracle.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Helge Deller <deller@gmx.de> Cc: Kees Cook <keescook@chromium.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Drewry <wad@chromium.org> Cc: <linux-api@vger.kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Huang Ying 提交于
Huge page helps to reduce TLB miss rate, but it has higher cache footprint, sometimes this may cause some issue. For example, when clearing huge page on x86_64 platform, the cache footprint is 2M. But on a Xeon E5 v3 2699 CPU, there are 18 cores, 36 threads, and only 45M LLC (last level cache). That is, in average, there are 2.5M LLC for each core and 1.25M LLC for each thread. If the cache pressure is heavy when clearing the huge page, and we clear the huge page from the begin to the end, it is possible that the begin of huge page is evicted from the cache after we finishing clearing the end of the huge page. And it is possible for the application to access the begin of the huge page after clearing the huge page. To help the above situation, in this patch, when we clear a huge page, the order to clear sub-pages is changed. In quite some situation, we can get the address that the application will access after we clear the huge page, for example, in a page fault handler. Instead of clearing the huge page from begin to end, we will clear the sub-pages farthest from the the sub-page to access firstly, and clear the sub-page to access last. This will make the sub-page to access most cache-hot and sub-pages around it more cache-hot too. If we cannot know the address the application will access, the begin of the huge page is assumed to be the the address the application will access. With this patch, the throughput increases ~28.3% in vm-scalability anon-w-seq test case with 72 processes on a 2 socket Xeon E5 v3 2699 system (36 cores, 72 threads). The test case creates 72 processes, each process mmap a big anonymous memory area and writes to it from the begin to the end. For each process, other processes could be seen as other workload which generates heavy cache pressure. At the same time, the cache miss rate reduced from ~33.4% to ~31.7%, the IPC (instruction per cycle) increased from 0.56 to 0.74, and the time spent in user space is reduced ~7.9% Christopher Lameter suggests to clear bytes inside a sub-page from end to begin too. But tests show no visible performance difference in the tests. May because the size of page is small compared with the cache size. Thanks Andi Kleen to propose to use address to access to determine the order of sub-pages to clear. The hugetlbfs access address could be improved, will do that in another patch. [ying.huang@intel.com: improve readability of clear_huge_page()] Link: http://lkml.kernel.org/r/20170830051842.1397-1-ying.huang@intel.com Link: http://lkml.kernel.org/r/20170815014618.15842-1-ying.huang@intel.comSuggested-by: NAndi Kleen <andi.kleen@intel.com> Signed-off-by: N"Huang, Ying" <ying.huang@intel.com> Acked-by: NJan Kara <jack@suse.cz> Reviewed-by: NMichal Hocko <mhocko@suse.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Nadia Yvette Chambers <nyc@holomorphy.com> Cc: Matthew Wilcox <mawilcox@microsoft.com> Cc: Hugh Dickins <hughd@google.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Shaohua Li <shli@fb.com> Cc: Christopher Lameter <cl@linux.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Andrea Arcangeli 提交于
This is purely required because exit_aio() may block and exit_mmap() may never start, if the oom_reap_task cannot start running on a mm with mm_users == 0. At the same time if the OOM reaper doesn't wait at all for the memory of the current OOM candidate to be freed by exit_mmap->unmap_vmas, it would generate a spurious OOM kill. If it wasn't because of the exit_aio or similar blocking functions in the last mmput, it would be enough to change the oom_reap_task() in the case it finds mm_users == 0, to wait for a timeout or to wait for __mmput to set MMF_OOM_SKIP itself, but it's not just exit_mmap the problem here so the concurrency of exit_mmap and oom_reap_task is apparently warranted. It's a non standard runtime, exit_mmap() runs without mmap_sem, and oom_reap_task runs with the mmap_sem for reading as usual (kind of MADV_DONTNEED). The race between the two is solved with a combination of tsk_is_oom_victim() (serialized by task_lock) and MMF_OOM_SKIP (serialized by a dummy down_write/up_write cycle on the same lines of the ksm_exit method). If the oom_reap_task() may be running concurrently during exit_mmap, exit_mmap will wait it to finish in down_write (before taking down mm structures that would make the oom_reap_task fail with use after free). If exit_mmap comes first, oom_reap_task() will skip the mm if MMF_OOM_SKIP is already set and in turn all memory is already freed and furthermore the mm data structures may already have been taken down by free_pgtables. [aarcange@redhat.com: incremental one liner] Link: http://lkml.kernel.org/r/20170726164319.GC29716@redhat.com [rientjes@google.com: remove unused mmput_async] Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1708141733130.50317@chino.kir.corp.google.com [aarcange@redhat.com: microoptimization] Link: http://lkml.kernel.org/r/20170817171240.GB5066@redhat.com Link: http://lkml.kernel.org/r/20170726162912.GA29716@redhat.com Fixes: 26db62f1 ("oom: keep mm of the killed task available") Signed-off-by: NAndrea Arcangeli <aarcange@redhat.com> Signed-off-by: NDavid Rientjes <rientjes@google.com> Reported-by: NDavid Rientjes <rientjes@google.com> Tested-by: NDavid Rientjes <rientjes@google.com> Reviewed-by: NMichal Hocko <mhocko@suse.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Aaron Lu 提交于
If the system has more than one swap device and swap device has the node information, we can make use of this information to decide which swap device to use in get_swap_pages() to get better performance. The current code uses a priority based list, swap_avail_list, to decide which swap device to use and if multiple swap devices share the same priority, they are used round robin. This patch changes the previous single global swap_avail_list into a per-numa-node list, i.e. for each numa node, it sees its own priority based list of available swap devices. Swap device's priority can be promoted on its matching node's swap_avail_list. The current swap device's priority is set as: user can set a >=0 value, or the system will pick one starting from -1 then downwards. The priority value in the swap_avail_list is the negated value of the swap device's due to plist being sorted from low to high. The new policy doesn't change the semantics for priority >=0 cases, the previous starting from -1 then downwards now becomes starting from -2 then downwards and -1 is reserved as the promoted value. Take 4-node EX machine as an example, suppose 4 swap devices are available, each sit on a different node: swapA on node 0 swapB on node 1 swapC on node 2 swapD on node 3 After they are all swapped on in the sequence of ABCD. Current behaviour: their priorities will be: swapA: -1 swapB: -2 swapC: -3 swapD: -4 And their position in the global swap_avail_list will be: swapA -> swapB -> swapC -> swapD prio:1 prio:2 prio:3 prio:4 New behaviour: their priorities will be(note that -1 is skipped): swapA: -2 swapB: -3 swapC: -4 swapD: -5 And their positions in the 4 swap_avail_lists[nid] will be: swap_avail_lists[0]: /* node 0's available swap device list */ swapA -> swapB -> swapC -> swapD prio:1 prio:3 prio:4 prio:5 swap_avali_lists[1]: /* node 1's available swap device list */ swapB -> swapA -> swapC -> swapD prio:1 prio:2 prio:4 prio:5 swap_avail_lists[2]: /* node 2's available swap device list */ swapC -> swapA -> swapB -> swapD prio:1 prio:2 prio:3 prio:5 swap_avail_lists[3]: /* node 3's available swap device list */ swapD -> swapA -> swapB -> swapC prio:1 prio:2 prio:3 prio:4 To see the effect of the patch, a test that starts N process, each mmap a region of anonymous memory and then continually write to it at random position to trigger both swap in and out is used. On a 2 node Skylake EP machine with 64GiB memory, two 170GB SSD drives are used as swap devices with each attached to a different node, the result is: runtime=30m/processes=32/total test size=128G/each process mmap region=4G kernel throughput vanilla 13306 auto-binding 15169 +14% runtime=30m/processes=64/total test size=128G/each process mmap region=2G kernel throughput vanilla 11885 auto-binding 14879 +25% [aaron.lu@intel.com: v2] Link: http://lkml.kernel.org/r/20170814053130.GD2369@aaronlu.sh.intel.com Link: http://lkml.kernel.org/r/20170816024439.GA10925@aaronlu.sh.intel.com [akpm@linux-foundation.org: use kmalloc_array()] Link: http://lkml.kernel.org/r/20170814053130.GD2369@aaronlu.sh.intel.com Link: http://lkml.kernel.org/r/20170816024439.GA10925@aaronlu.sh.intel.comSigned-off-by: NAaron Lu <aaron.lu@intel.com> Cc: "Chen, Tim C" <tim.c.chen@intel.com> Cc: Huang Ying <ying.huang@intel.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Michal Hocko 提交于
TIF_MEMDIE is set only to the tasks whick were either directly selected by the OOM killer or passed through mark_oom_victim from the allocator path. tsk_is_oom_victim is more generic and allows to identify all tasks (threads) which share the mm with the oom victim. Please note that the freezer still needs to check TIF_MEMDIE because we cannot thaw tasks which do not participage in oom_victims counting otherwise a !TIF_MEMDIE task could interfere after oom_disbale returns. Link: http://lkml.kernel.org/r/20170810075019.28998-3-mhocko@kernel.orgSigned-off-by: NMichal Hocko <mhocko@suse.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Roman Gushchin <guro@fb.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Michal Hocko 提交于
For ages we have been relying on TIF_MEMDIE thread flag to mark OOM victims and then, among other things, to give these threads full access to memory reserves. There are few shortcomings of this implementation, though. First of all and the most serious one is that the full access to memory reserves is quite dangerous because we leave no safety room for the system to operate and potentially do last emergency steps to move on. Secondly this flag is per task_struct while the OOM killer operates on mm_struct granularity so all processes sharing the given mm are killed. Giving the full access to all these task_structs could lead to a quick memory reserves depletion. We have tried to reduce this risk by giving TIF_MEMDIE only to the main thread and the currently allocating task but that doesn't really solve this problem while it surely opens up a room for corner cases - e.g. GFP_NO{FS,IO} requests might loop inside the allocator without access to memory reserves because a particular thread was not the group leader. Now that we have the oom reaper and that all oom victims are reapable after 1b51e65e ("oom, oom_reaper: allow to reap mm shared by the kthreads") we can be more conservative and grant only partial access to memory reserves because there are reasonable chances of the parallel memory freeing. We still want some access to reserves because we do not want other consumers to eat up the victim's freed memory. oom victims will still contend with __GFP_HIGH users but those shouldn't be so aggressive to starve oom victims completely. Introduce ALLOC_OOM flag and give all tsk_is_oom_victim tasks access to the half of the reserves. This makes the access to reserves independent on which task has passed through mark_oom_victim. Also drop any usage of TIF_MEMDIE from the page allocator proper and replace it by tsk_is_oom_victim as well which will make page_alloc.c completely TIF_MEMDIE free finally. CONFIG_MMU=n doesn't have oom reaper so let's stick to the original ALLOC_NO_WATERMARKS approach. There is a demand to make the oom killer memcg aware which will imply many tasks killed at once. This change will allow such a usecase without worrying about complete memory reserves depletion. Link: http://lkml.kernel.org/r/20170810075019.28998-2-mhocko@kernel.orgSigned-off-by: NMichal Hocko <mhocko@suse.com> Acked-by: NMel Gorman <mgorman@techsingularity.net> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Roman Gushchin <guro@fb.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Vitaly Wool 提交于
It's been noted that z3fold doesn't scale well when it's run in a large number of threads on many cores, which can be easily reproduced with fio 'randrw' test with --numjobs=32. E.g. the result for 1 cluster (4 cores) is: Run status group 0 (all jobs): READ: io=244785MB, aggrb=496883KB/s, minb=15527KB/s, ... WRITE: io=246735MB, aggrb=500841KB/s, minb=15651KB/s, ... While for 8 cores (2 clusters) the result is: Run status group 0 (all jobs): READ: io=244785MB, aggrb=265942KB/s, minb=8310KB/s, ... WRITE: io=246735MB, aggrb=268060KB/s, minb=8376KB/s, ... The bottleneck here is the pool lock which many threads become waiting upon. To reduce that spin lock contention, z3fold can operate only on the lists local to the current CPU whenever possible. Due to the nature of z3fold unbuddied list handling (it only takes the first entry off the list on a hot path), if the z3fold pool is big enough and balanced well enough, limiting search to only local unbuddied list doesn't lead to a significant compression ratio degrade (2.57x vs 2.65x in our measurements). This patch also introduces two worker threads: one for async in-page object layout optimization and one for releasing freed pages. This is done to speed up z3fold_free() which is often on a hot path. The fio results for 8-core case are now the following: Run status group 0 (all jobs): READ: io=244785MB, aggrb=1568.3MB/s, minb=50182KB/s, ... WRITE: io=246735MB, aggrb=1580.8MB/s, minb=50582KB/s, ... So we're in for almost 6x performance increase. Link: http://lkml.kernel.org/r/20170806181443.f9b65018f8bde25ef990f9e8@gmail.comSigned-off-by: NVitaly Wool <vitalywool@gmail.com> Cc: Dan Streetman <ddstreet@ieee.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Huang Ying 提交于
VMA based swap readahead will readahead the virtual pages that is continuous in the virtual address space. While the original swap readahead will readahead the swap slots that is continuous in the swap device. Although VMA based swap readahead is more correct for the swap slots to be readahead, it will trigger more small random readings, which may cause the performance of HDD (hard disk) to degrade heavily, and may finally exceed the benefit. To avoid the issue, in this patch, if the HDD is used as swap, the VMA based swap readahead will be disabled, and the original swap readahead will be used instead. Link: http://lkml.kernel.org/r/20170807054038.1843-6-ying.huang@intel.comSigned-off-by: N"Huang, Ying" <ying.huang@intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Rik van Riel <riel@redhat.com> Cc: Shaohua Li <shli@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Fengguang Wu <fengguang.wu@intel.com> Cc: Tim Chen <tim.c.chen@intel.com> Cc: Dave Hansen <dave.hansen@intel.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Huang Ying 提交于
The sysfs interface to control the VMA based swap readahead is added as follow, /sys/kernel/mm/swap/vma_ra_enabled Enable the VMA based swap readahead algorithm, or use the original global swap readahead algorithm. /sys/kernel/mm/swap/vma_ra_max_order Set the max order of the readahead window size for the VMA based swap readahead algorithm. The corresponding ABI documentation is added too. Link: http://lkml.kernel.org/r/20170807054038.1843-5-ying.huang@intel.comSigned-off-by: N"Huang, Ying" <ying.huang@intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Rik van Riel <riel@redhat.com> Cc: Shaohua Li <shli@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Fengguang Wu <fengguang.wu@intel.com> Cc: Tim Chen <tim.c.chen@intel.com> Cc: Dave Hansen <dave.hansen@intel.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Huang Ying 提交于
The swap readahead is an important mechanism to reduce the swap in latency. Although pure sequential memory access pattern isn't very popular for anonymous memory, the space locality is still considered valid. In the original swap readahead implementation, the consecutive blocks in swap device are readahead based on the global space locality estimation. But the consecutive blocks in swap device just reflect the order of page reclaiming, don't necessarily reflect the access pattern in virtual memory. And the different tasks in the system may have different access patterns, which makes the global space locality estimation incorrect. In this patch, when page fault occurs, the virtual pages near the fault address will be readahead instead of the swap slots near the fault swap slot in swap device. This avoid to readahead the unrelated swap slots. At the same time, the swap readahead is changed to work on per-VMA from globally. So that the different access patterns of the different VMAs could be distinguished, and the different readahead policy could be applied accordingly. The original core readahead detection and scaling algorithm is reused, because it is an effect algorithm to detect the space locality. The test and result is as follow, Common test condition ===================== Test Machine: Xeon E5 v3 (2 sockets, 72 threads, 32G RAM) Swap device: NVMe disk Micro-benchmark with combined access pattern ============================================ vm-scalability, sequential swap test case, 4 processes to eat 50G virtual memory space, repeat the sequential memory writing until 300 seconds. The first round writing will trigger swap out, the following rounds will trigger sequential swap in and out. At the same time, run vm-scalability random swap test case in background, 8 processes to eat 30G virtual memory space, repeat the random memory write until 300 seconds. This will trigger random swap-in in the background. This is a combined workload with sequential and random memory accessing at the same time. The result (for sequential workload) is as follow, Base Optimized ---- --------- throughput 345413 KB/s 414029 KB/s (+19.9%) latency.average 97.14 us 61.06 us (-37.1%) latency.50th 2 us 1 us latency.60th 2 us 1 us latency.70th 98 us 2 us latency.80th 160 us 2 us latency.90th 260 us 217 us latency.95th 346 us 369 us latency.99th 1.34 ms 1.09 ms ra_hit% 52.69% 99.98% The original swap readahead algorithm is confused by the background random access workload, so readahead hit rate is lower. The VMA-base readahead algorithm works much better. Linpack ======= The test memory size is bigger than RAM to trigger swapping. Base Optimized ---- --------- elapsed_time 393.49 s 329.88 s (-16.2%) ra_hit% 86.21% 98.82% The score of base and optimized kernel hasn't visible changes. But the elapsed time reduced and readahead hit rate improved, so the optimized kernel runs better for startup and tear down stages. And the absolute value of readahead hit rate is high, shows that the space locality is still valid in some practical workloads. Link: http://lkml.kernel.org/r/20170807054038.1843-4-ying.huang@intel.comSigned-off-by: N"Huang, Ying" <ying.huang@intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Rik van Riel <riel@redhat.com> Cc: Shaohua Li <shli@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Fengguang Wu <fengguang.wu@intel.com> Cc: Tim Chen <tim.c.chen@intel.com> Cc: Dave Hansen <dave.hansen@intel.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Huang Ying 提交于
In the original implementation, it is possible that the existing pages in the swap cache (not newly readahead) could be marked as the readahead pages. This will cause the statistics of swap readahead be wrong and influence the swap readahead algorithm too. This is fixed via marking a page as the readahead page only if it is newly allocated and read from the disk. When testing with linpack, after the fixing the swap readahead hit rate increased from ~66% to ~86%. Link: http://lkml.kernel.org/r/20170807054038.1843-3-ying.huang@intel.comSigned-off-by: N"Huang, Ying" <ying.huang@intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Rik van Riel <riel@redhat.com> Cc: Shaohua Li <shli@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Fengguang Wu <fengguang.wu@intel.com> Cc: Tim Chen <tim.c.chen@intel.com> Cc: Dave Hansen <dave.hansen@intel.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Huang Ying 提交于
Patch series "mm, swap: VMA based swap readahead", v4. The swap readahead is an important mechanism to reduce the swap in latency. Although pure sequential memory access pattern isn't very popular for anonymous memory, the space locality is still considered valid. In the original swap readahead implementation, the consecutive blocks in swap device are readahead based on the global space locality estimation. But the consecutive blocks in swap device just reflect the order of page reclaiming, don't necessarily reflect the access pattern in virtual memory space. And the different tasks in the system may have different access patterns, which makes the global space locality estimation incorrect. In this patchset, when page fault occurs, the virtual pages near the fault address will be readahead instead of the swap slots near the fault swap slot in swap device. This avoid to readahead the unrelated swap slots. At the same time, the swap readahead is changed to work on per-VMA from globally. So that the different access patterns of the different VMAs could be distinguished, and the different readahead policy could be applied accordingly. The original core readahead detection and scaling algorithm is reused, because it is an effect algorithm to detect the space locality. In addition to the swap readahead changes, some new sysfs interface is added to show the efficiency of the readahead algorithm and some other swap statistics. This new implementation will incur more small random read, on SSD, the improved correctness of estimation and readahead target should beat the potential increased overhead, this is also illustrated in the test results below. But on HDD, the overhead may beat the benefit, so the original implementation will be used by default. The test and result is as follow, Common test condition ===================== Test Machine: Xeon E5 v3 (2 sockets, 72 threads, 32G RAM) Swap device: NVMe disk Micro-benchmark with combined access pattern ============================================ vm-scalability, sequential swap test case, 4 processes to eat 50G virtual memory space, repeat the sequential memory writing until 300 seconds. The first round writing will trigger swap out, the following rounds will trigger sequential swap in and out. At the same time, run vm-scalability random swap test case in background, 8 processes to eat 30G virtual memory space, repeat the random memory write until 300 seconds. This will trigger random swap-in in the background. This is a combined workload with sequential and random memory accessing at the same time. The result (for sequential workload) is as follow, Base Optimized ---- --------- throughput 345413 KB/s 414029 KB/s (+19.9%) latency.average 97.14 us 61.06 us (-37.1%) latency.50th 2 us 1 us latency.60th 2 us 1 us latency.70th 98 us 2 us latency.80th 160 us 2 us latency.90th 260 us 217 us latency.95th 346 us 369 us latency.99th 1.34 ms 1.09 ms ra_hit% 52.69% 99.98% The original swap readahead algorithm is confused by the background random access workload, so readahead hit rate is lower. The VMA-base readahead algorithm works much better. Linpack ======= The test memory size is bigger than RAM to trigger swapping. Base Optimized ---- --------- elapsed_time 393.49 s 329.88 s (-16.2%) ra_hit% 86.21% 98.82% The score of base and optimized kernel hasn't visible changes. But the elapsed time reduced and readahead hit rate improved, so the optimized kernel runs better for startup and tear down stages. And the absolute value of readahead hit rate is high, shows that the space locality is still valid in some practical workloads. This patch (of 5): The statistics for total readahead pages and total readahead hits are recorded and exported via the following sysfs interface. /sys/kernel/mm/swap/ra_hits /sys/kernel/mm/swap/ra_total With them, the efficiency of the swap readahead could be measured, so that the swap readahead algorithm and parameters could be tuned accordingly. [akpm@linux-foundation.org: don't display swap stats if CONFIG_SWAP=n] Link: http://lkml.kernel.org/r/20170807054038.1843-2-ying.huang@intel.comSigned-off-by: N"Huang, Ying" <ying.huang@intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Rik van Riel <riel@redhat.com> Cc: Shaohua Li <shli@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Fengguang Wu <fengguang.wu@intel.com> Cc: Tim Chen <tim.c.chen@intel.com> Cc: Dave Hansen <dave.hansen@intel.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Byungchul Park 提交于
Although llist provides proper APIs, they are not used. Make them used. Link: http://lkml.kernel.org/r/1502095374-16112-1-git-send-email-byungchul.park@lge.comSigned-off-by: NByungchul Park <byungchul.park@lge.com> Cc: zijun_hu <zijun_hu@htc.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Joel Fernandes <joelaf@google.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 SeongJae Park 提交于
Comment for pagetypeinfo_showblockcount() is mistakenly duplicated from pagetypeinfo_show_free()'s comment. This commit fixes it. Link: http://lkml.kernel.org/r/20170809185816.11244-1-sj38.park@gmail.com Fixes: 467c996c ("Print out statistics in relation to fragmentation avoidance to /proc/pagetypeinfo") Signed-off-by: NSeongJae Park <sj38.park@gmail.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Mike Kravetz 提交于
This patch came out of discussions in this e-mail thread: http://lkml.kernel.org/r/1499357846-7481-1-git-send-email-mike.kravetz%40oracle.com The Oracle JVM team is developing a new garbage collection model. This new model requires multiple mappings of the same anonymous memory. One straight forward way to accomplish this is with memfd_create. They can use the returned fd to create multiple mappings of the same memory. The JVM today has an option to use (static hugetlb) huge pages. If this option is specified, they would like to use the same garbage collection model requiring multiple mappings to the same memory. Using hugetlbfs, it is possible to explicitly mount a filesystem and specify file paths in order to get an fd that can be used for multiple mappings. However, this introduces additional system admin work and coordination. Ideally they would like to get a hugetlbfs fd without requiring explicit mounting of a filesystem. Today, mmap and shmget can make use of hugetlbfs without explicitly mounting a filesystem. The patch adds this functionality to memfd_create. Add a new flag MFD_HUGETLB to memfd_create() that will specify the file to be created resides in the hugetlbfs filesystem. This is the generic hugetlbfs filesystem not associated with any specific mount point. As with other system calls that request hugetlbfs backed pages, there is the ability to encode huge page size in the flag arguments. hugetlbfs does not support sealing operations, therefore specifying MFD_ALLOW_SEALING with MFD_HUGETLB will result in EINVAL. Of course, the memfd_man page would need updating if this type of functionality moves forward. Link: http://lkml.kernel.org/r/1502149672-7759-2-git-send-email-mike.kravetz@oracle.comSigned-off-by: NMike Kravetz <mike.kravetz@oracle.com> Acked-by: NMichal Hocko <mhocko@suse.com> Cc: Hugh Dickins <hughd@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-