- 27 7月, 2016 35 次提交
-
-
由 Aneesh Kumar K.V 提交于
This allows an arch which needs to do special handing with respect to different page size when flushing tlb to implement the same in mmu gather. Link: http://lkml.kernel.org/r/1465049193-22197-3-git-send-email-aneesh.kumar@linux.vnet.ibm.comSigned-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Hugh Dickins <hughd@google.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Mel Gorman <mgorman@suse.de> Cc: David Rientjes <rientjes@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Minchan Kim <minchan.kim@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Aneesh Kumar K.V 提交于
This updates the generic and arch specific implementation to return true if we need to do a tlb flush. That means if a __tlb_remove_page indicate a flush is needed, the page we try to remove need to be tracked and added again after the flush. We need to track it because we have already update the pte to none and we can't just loop back. This change is done to enable us to do a tlb_flush when we try to flush a range that consists of different page sizes. For architectures like ppc64, we can do a range based tlb flush and we need to track page size for that. When we try to remove a huge page, we will force a tlb flush and starts a new mmu gather. [aneesh.kumar@linux.vnet.ibm.com: mm-change-the-interface-for-__tlb_remove_page-v3] Link: http://lkml.kernel.org/r/1465049193-22197-2-git-send-email-aneesh.kumar@linux.vnet.ibm.com Link: http://lkml.kernel.org/r/1464860389-29019-2-git-send-email-aneesh.kumar@linux.vnet.ibm.comSigned-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Hugh Dickins <hughd@google.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Mel Gorman <mgorman@suse.de> Cc: David Rientjes <rientjes@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Minchan Kim <minchan.kim@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Aneesh Kumar K.V 提交于
For hugetlb like THP (and unlike regular page), we do tlb flush after dropping ptl. Because of the above, we don't need to track force_flush like we do now. Instead we can simply call tlb_remove_page() which will do the flush if needed. No functionality change in this patch. Link: http://lkml.kernel.org/r/1465049193-22197-1-git-send-email-aneesh.kumar@linux.vnet.ibm.comSigned-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Naoya Horiguchi 提交于
split_huge_pmd() doesn't guarantee that the pmd is normal pmd pointing to pte entries, which can be checked with pmd_trans_unstable(). Some callers make this assertion and some do it differently and some not, so let's do it in a unified manner. Link: http://lkml.kernel.org/r/1464741400-12143-1-git-send-email-n-horiguchi@ah.jp.nec.comSigned-off-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Hugh Dickins <hughd@google.com> Cc: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Joonsoo Kim 提交于
When there is an isolated_page, post_alloc_hook() is called with page but __free_pages() is called with isolated_page. Since they are the same so no problem but it's very confusing. To reduce it, this patch changes isolated_page to boolean type and uses page variable consistently. Link: http://lkml.kernel.org/r/1466150259-27727-10-git-send-email-iamjoonsoo.kim@lge.comSigned-off-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: NVlastimil Babka <vbabka@suse.cz> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Joonsoo Kim 提交于
This patch is motivated from Hugh and Vlastimil's concern [1]. There are two ways to get freepage from the allocator. One is using normal memory allocation API and the other is __isolate_free_page() which is internally used for compaction and pageblock isolation. Later usage is rather tricky since it doesn't do whole post allocation processing done by normal API. One problematic thing I already know is that poisoned page would not be checked if it is allocated by __isolate_free_page(). Perhaps, there would be more. We could add more debug logic for allocated page in the future and this separation would cause more problem. I'd like to fix this situation at this time. Solution is simple. This patch commonize some logic for newly allocated page and uses it on all sites. This will solve the problem. [1] http://marc.info/?i=alpine.LSU.2.11.1604270029350.7066%40eggly.anvils%3E [iamjoonsoo.kim@lge.com: mm-page_alloc-introduce-post-allocation-processing-on-page-allocator-v3] Link: http://lkml.kernel.org/r/1464230275-25791-7-git-send-email-iamjoonsoo.kim@lge.com Link: http://lkml.kernel.org/r/1466150259-27727-9-git-send-email-iamjoonsoo.kim@lge.com Link: http://lkml.kernel.org/r/1464230275-25791-7-git-send-email-iamjoonsoo.kim@lge.comSigned-off-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: NVlastimil Babka <vbabka@suse.cz> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Minchan Kim <minchan@kernel.org> Cc: Alexander Potapenko <glider@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Joonsoo Kim 提交于
Currently, we store each page's allocation stacktrace on corresponding page_ext structure and it requires a lot of memory. This causes the problem that memory tight system doesn't work well if page_owner is enabled. Moreover, even with this large memory consumption, we cannot get full stacktrace because we allocate memory at boot time and just maintain 8 stacktrace slots to balance memory consumption. We could increase it to more but it would make system unusable or change system behaviour. To solve the problem, this patch uses stackdepot to store stacktrace. It obviously provides memory saving but there is a drawback that stackdepot could fail. stackdepot allocates memory at runtime so it could fail if system has not enough memory. But, most of allocation stack are generated at very early time and there are much memory at this time. So, failure would not happen easily. And, one failure means that we miss just one page's allocation stacktrace so it would not be a big problem. In this patch, when memory allocation failure happens, we store special stracktrace handle to the page that is failed to save stacktrace. With it, user can guess memory usage properly even if failure happens. Memory saving looks as following. (4GB memory system with page_owner) (before the patch -> after the patch) static allocation: 92274688 bytes -> 25165824 bytes dynamic allocation after boot + kernel build: 0 bytes -> 327680 bytes total: 92274688 bytes -> 25493504 bytes 72% reduction in total. Note that implementation looks complex than someone would imagine because there is recursion issue. stackdepot uses page allocator and page_owner is called at page allocation. Using stackdepot in page_owner could re-call page allcator and then page_owner. That is a recursion. To detect and avoid it, whenever we obtain stacktrace, recursion is checked and page_owner is set to dummy information if found. Dummy information means that this page is allocated for page_owner feature itself (such as stackdepot) and it's understandable behavior for user. [iamjoonsoo.kim@lge.com: mm-page_owner-use-stackdepot-to-store-stacktrace-v3] Link: http://lkml.kernel.org/r/1464230275-25791-6-git-send-email-iamjoonsoo.kim@lge.com Link: http://lkml.kernel.org/r/1466150259-27727-7-git-send-email-iamjoonsoo.kim@lge.com Link: http://lkml.kernel.org/r/1464230275-25791-6-git-send-email-iamjoonsoo.kim@lge.comSigned-off-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: NVlastimil Babka <vbabka@suse.cz> Acked-by: NMichal Hocko <mhocko@suse.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Minchan Kim <minchan@kernel.org> Cc: Alexander Potapenko <glider@google.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Joonsoo Kim 提交于
split_page() calls set_page_owner() to set up page_owner to each pages. But, it has a drawback that head page and the others have different stacktrace because callsite of set_page_owner() is slightly differnt. To avoid this problem, this patch copies head page's page_owner to the others. It needs to introduce new function, split_page_owner() but it also remove the other function, get_page_owner_gfp() so looks good to do. Link: http://lkml.kernel.org/r/1464230275-25791-4-git-send-email-iamjoonsoo.kim@lge.comSigned-off-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: NVlastimil Babka <vbabka@suse.cz> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Minchan Kim <minchan@kernel.org> Cc: Alexander Potapenko <glider@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Joonsoo Kim 提交于
Currently, copy_page_owner() doesn't copy all the owner information. It skips last_migrate_reason because copy_page_owner() is used for migration and it will be properly set soon. But, following patch will use copy_page_owner() and this skip will cause the problem that allocated page has uninitialied last_migrate_reason. To prevent it, this patch also copy last_migrate_reason in copy_page_owner(). Link: http://lkml.kernel.org/r/1464230275-25791-3-git-send-email-iamjoonsoo.kim@lge.comSigned-off-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: NVlastimil Babka <vbabka@suse.cz> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Minchan Kim <minchan@kernel.org> Cc: Alexander Potapenko <glider@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Joonsoo Kim 提交于
It's not necessary to initialized page_owner with holding the zone lock. It would cause more contention on the zone lock although it's not a big problem since it is just debug feature. But, it is better than before so do it. This is also preparation step to use stackdepot in page owner feature. Stackdepot allocates new pages when there is no reserved space and holding the zone lock in this case will cause deadlock. Link: http://lkml.kernel.org/r/1464230275-25791-2-git-send-email-iamjoonsoo.kim@lge.comSigned-off-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: NVlastimil Babka <vbabka@suse.cz> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Minchan Kim <minchan@kernel.org> Cc: Alexander Potapenko <glider@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Joonsoo Kim 提交于
We don't need to split freepages with holding the zone lock. It will cause more contention on zone lock so not desirable. [rientjes@google.com: if __isolate_free_page() fails, avoid adding to freelist so we don't call map_pages() with it] Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1606211447001.43430@chino.kir.corp.google.com Link: http://lkml.kernel.org/r/1464230275-25791-1-git-send-email-iamjoonsoo.kim@lge.comSigned-off-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: NVlastimil Babka <vbabka@suse.cz> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Minchan Kim <minchan@kernel.org> Cc: Alexander Potapenko <glider@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: NDavid Rientjes <rientjes@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Minchan Kim 提交于
Static check warns using tag as bit shifter. It doesn't break current working but not good for redability. Let's use OBJ_TAG_BIT as bit shifter instead of OBJ_ALLOCATED_TAG. Link: http://lkml.kernel.org/r/20160607045146.GF26230@bboxSigned-off-by: NMinchan Kim <minchan@kernel.org> Reported-by: NDan Carpenter <dan.carpenter@oracle.com> Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Minchan Kim 提交于
This patch introduces run-time migration feature for zspage. For migration, VM uses page.lru field so it would be better to not use page.next field which is unified with page.lru for own purpose. For that, firstly, we can get first object offset of the page via runtime calculation instead of using page.index so we can use page.index as link for page chaining instead of page.next. In case of huge object, it stores handle to page.index instead of next link of page chaining because huge object doesn't need to next link for page chaining. So get_next_page need to identify huge object to return NULL. For it, this patch uses PG_owner_priv_1 flag of the page flag. For migration, it supports three functions * zs_page_isolate It isolates a zspage which includes a subpage VM want to migrate from class so anyone cannot allocate new object from the zspage. We could try to isolate a zspage by the number of subpage so subsequent isolation trial of other subpage of the zpsage shouldn't fail. For that, we introduce zspage.isolated count. With that, zs_page_isolate can know whether zspage is already isolated or not for migration so if it is isolated for migration, subsequent isolation trial can be successful without trying further isolation. * zs_page_migrate First of all, it holds write-side zspage->lock to prevent migrate other subpage in zspage. Then, lock all objects in the page VM want to migrate. The reason we should lock all objects in the page is due to race between zs_map_object and zs_page_migrate. zs_map_object zs_page_migrate pin_tag(handle) obj = handle_to_obj(handle) obj_to_location(obj, &page, &obj_idx); write_lock(&zspage->lock) if (!trypin_tag(handle)) goto unpin_object zspage = get_zspage(page); read_lock(&zspage->lock); If zs_page_migrate doesn't do trypin_tag, zs_map_object's page can be stale by migration so it goes crash. If it locks all of objects successfully, it copies content from old page to new one, finally, create new zspage chain with new page. And if it's last isolated subpage in the zspage, put the zspage back to class. * zs_page_putback It returns isolated zspage to right fullness_group list if it fails to migrate a page. If it find a zspage is ZS_EMPTY, it queues zspage freeing to workqueue. See below about async zspage freeing. This patch introduces asynchronous zspage free. The reason to need it is we need page_lock to clear PG_movable but unfortunately, zs_free path should be atomic so the apporach is try to grab page_lock. If it got page_lock of all of pages successfully, it can free zspage immediately. Otherwise, it queues free request and free zspage via workqueue in process context. If zs_free finds the zspage is isolated when it try to free zspage, it delays the freeing until zs_page_putback finds it so it will free free the zspage finally. In this patch, we expand fullness_list from ZS_EMPTY to ZS_FULL. First of all, it will use ZS_EMPTY list for delay freeing. And with adding ZS_FULL list, it makes to identify whether zspage is isolated or not via list_empty(&zspage->list) test. [minchan@kernel.org: zsmalloc: keep first object offset in struct page] Link: http://lkml.kernel.org/r/1465788015-23195-1-git-send-email-minchan@kernel.org [minchan@kernel.org: zsmalloc: zspage sanity check] Link: http://lkml.kernel.org/r/20160603010129.GC3304@bbox Link: http://lkml.kernel.org/r/1464736881-24886-12-git-send-email-minchan@kernel.orgSigned-off-by: NMinchan Kim <minchan@kernel.org> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Minchan Kim 提交于
Zsmalloc stores first free object's <PFN, obj_idx> position into freeobj in each zspage. If we change it with index from first_page instead of position, it makes page migration simple because we don't need to correct other entries for linked list if a page is migrated out. Link: http://lkml.kernel.org/r/1464736881-24886-11-git-send-email-minchan@kernel.orgSigned-off-by: NMinchan Kim <minchan@kernel.org> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Minchan Kim 提交于
Currently, putback_zspage does free zspage under class->lock if fullness become ZS_EMPTY but it makes trouble to implement locking scheme for new zspage migration. So, this patch is to separate free_zspage from putback_zspage and free zspage out of class->lock which is preparation for zspage migration. Link: http://lkml.kernel.org/r/1464736881-24886-10-git-send-email-minchan@kernel.orgSigned-off-by: NMinchan Kim <minchan@kernel.org> Reviewed-by: NSergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Minchan Kim 提交于
We have squeezed meta data of zspage into first page's descriptor. So, to get meta data from subpage, we should get first page first of all. But it makes trouble to implment page migration feature of zsmalloc because any place where to get first page from subpage can be raced with first page migration. IOW, first page it got could be stale. For preventing it, I have tried several approahces but it made code complicated so finally, I concluded to separate metadata from first page. Of course, it consumes more memory. IOW, 16bytes per zspage on 32bit at the moment. It means we lost 1% at *worst case*(40B/4096B) which is not bad I think at the cost of maintenance. Link: http://lkml.kernel.org/r/1464736881-24886-9-git-send-email-minchan@kernel.orgSigned-off-by: NMinchan Kim <minchan@kernel.org> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Minchan Kim 提交于
For page migration, we need to create page chain of zspage dynamically so this patch factors it out from alloc_zspage. Link: http://lkml.kernel.org/r/1464736881-24886-8-git-send-email-minchan@kernel.orgSigned-off-by: NMinchan Kim <minchan@kernel.org> Reviewed-by: NSergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Minchan Kim 提交于
Upcoming patch will change how to encode zspage meta so for easy review, this patch wraps code to access metadata as accessor. Link: http://lkml.kernel.org/r/1464736881-24886-7-git-send-email-minchan@kernel.orgSigned-off-by: NMinchan Kim <minchan@kernel.org> Reviewed-by: NSergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Minchan Kim 提交于
Use kernel standard bit spin-lock instead of custom mess. Even, it has a bug which doesn't disable preemption. The reason we don't have any problem is that we have used it during preemption disable section by class->lock spinlock. So no need to go to stable. Link: http://lkml.kernel.org/r/1464736881-24886-6-git-send-email-minchan@kernel.orgSigned-off-by: NMinchan Kim <minchan@kernel.org> Reviewed-by: NSergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Minchan Kim 提交于
Every zspage in a size_class has same number of max objects so we could move it to a size_class. Link: http://lkml.kernel.org/r/1464736881-24886-5-git-send-email-minchan@kernel.orgSigned-off-by: NMinchan Kim <minchan@kernel.org> Reviewed-by: NSergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Minchan Kim 提交于
Now, VM has a feature to migrate non-lru movable pages so balloon doesn't need custom migration hooks in migrate.c and compaction.c. Instead, this patch implements the page->mapping->a_ops-> {isolate|migrate|putback} functions. With that, we could remove hooks for ballooning in general migration functions and make balloon compaction simple. [akpm@linux-foundation.org: compaction.h requires that the includer first include node.h] Link: http://lkml.kernel.org/r/1464736881-24886-4-git-send-email-minchan@kernel.orgSigned-off-by: NGioh Kim <gi-oh.kim@profitbricks.com> Signed-off-by: NMinchan Kim <minchan@kernel.org> Acked-by: NVlastimil Babka <vbabka@suse.cz> Cc: Rafael Aquini <aquini@redhat.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Minchan Kim 提交于
We have allowed migration for only LRU pages until now and it was enough to make high-order pages. But recently, embedded system(e.g., webOS, android) uses lots of non-movable pages(e.g., zram, GPU memory) so we have seen several reports about troubles of small high-order allocation. For fixing the problem, there were several efforts (e,g,. enhance compaction algorithm, SLUB fallback to 0-order page, reserved memory, vmalloc and so on) but if there are lots of non-movable pages in system, their solutions are void in the long run. So, this patch is to support facility to change non-movable pages with movable. For the feature, this patch introduces functions related to migration to address_space_operations as well as some page flags. If a driver want to make own pages movable, it should define three functions which are function pointers of struct address_space_operations. 1. bool (*isolate_page) (struct page *page, isolate_mode_t mode); What VM expects on isolate_page function of driver is to return *true* if driver isolates page successfully. On returing true, VM marks the page as PG_isolated so concurrent isolation in several CPUs skip the page for isolation. If a driver cannot isolate the page, it should return *false*. Once page is successfully isolated, VM uses page.lru fields so driver shouldn't expect to preserve values in that fields. 2. int (*migratepage) (struct address_space *mapping, struct page *newpage, struct page *oldpage, enum migrate_mode); After isolation, VM calls migratepage of driver with isolated page. The function of migratepage is to move content of the old page to new page and set up fields of struct page newpage. Keep in mind that you should indicate to the VM the oldpage is no longer movable via __ClearPageMovable() under page_lock if you migrated the oldpage successfully and returns 0. If driver cannot migrate the page at the moment, driver can return -EAGAIN. On -EAGAIN, VM will retry page migration in a short time because VM interprets -EAGAIN as "temporal migration failure". On returning any error except -EAGAIN, VM will give up the page migration without retrying in this time. Driver shouldn't touch page.lru field VM using in the functions. 3. void (*putback_page)(struct page *); If migration fails on isolated page, VM should return the isolated page to the driver so VM calls driver's putback_page with migration failed page. In this function, driver should put the isolated page back to the own data structure. 4. non-lru movable page flags There are two page flags for supporting non-lru movable page. * PG_movable Driver should use the below function to make page movable under page_lock. void __SetPageMovable(struct page *page, struct address_space *mapping) It needs argument of address_space for registering migration family functions which will be called by VM. Exactly speaking, PG_movable is not a real flag of struct page. Rather than, VM reuses page->mapping's lower bits to represent it. #define PAGE_MAPPING_MOVABLE 0x2 page->mapping = page->mapping | PAGE_MAPPING_MOVABLE; so driver shouldn't access page->mapping directly. Instead, driver should use page_mapping which mask off the low two bits of page->mapping so it can get right struct address_space. For testing of non-lru movable page, VM supports __PageMovable function. However, it doesn't guarantee to identify non-lru movable page because page->mapping field is unified with other variables in struct page. As well, if driver releases the page after isolation by VM, page->mapping doesn't have stable value although it has PAGE_MAPPING_MOVABLE (Look at __ClearPageMovable). But __PageMovable is cheap to catch whether page is LRU or non-lru movable once the page has been isolated. Because LRU pages never can have PAGE_MAPPING_MOVABLE in page->mapping. It is also good for just peeking to test non-lru movable pages before more expensive checking with lock_page in pfn scanning to select victim. For guaranteeing non-lru movable page, VM provides PageMovable function. Unlike __PageMovable, PageMovable functions validates page->mapping and mapping->a_ops->isolate_page under lock_page. The lock_page prevents sudden destroying of page->mapping. Driver using __SetPageMovable should clear the flag via __ClearMovablePage under page_lock before the releasing the page. * PG_isolated To prevent concurrent isolation among several CPUs, VM marks isolated page as PG_isolated under lock_page. So if a CPU encounters PG_isolated non-lru movable page, it can skip it. Driver doesn't need to manipulate the flag because VM will set/clear it automatically. Keep in mind that if driver sees PG_isolated page, it means the page have been isolated by VM so it shouldn't touch page.lru field. PG_isolated is alias with PG_reclaim flag so driver shouldn't use the flag for own purpose. [opensource.ganesh@gmail.com: mm/compaction: remove local variable is_lru] Link: http://lkml.kernel.org/r/20160618014841.GA7422@leo-test Link: http://lkml.kernel.org/r/1464736881-24886-3-git-send-email-minchan@kernel.orgSigned-off-by: NGioh Kim <gi-oh.kim@profitbricks.com> Signed-off-by: NMinchan Kim <minchan@kernel.org> Signed-off-by: NGanesh Mahendran <opensource.ganesh@gmail.com> Acked-by: NVlastimil Babka <vbabka@suse.cz> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Rik van Riel <riel@redhat.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Hugh Dickins <hughd@google.com> Cc: Rafael Aquini <aquini@redhat.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: John Einar Reitan <john.reitan@foss.arm.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Minchan Kim 提交于
Recently, I got many reports about perfermance degradation in embedded system(Android mobile phone, webOS TV and so on) and easy fork fail. The problem was fragmentation caused by zram and GPU driver mainly. With memory pressure, their pages were spread out all of pageblock and it cannot be migrated with current compaction algorithm which supports only LRU pages. In the end, compaction cannot work well so reclaimer shrinks all of working set pages. It made system very slow and even to fail to fork easily which requires order-[2 or 3] allocations. Other pain point is that they cannot use CMA memory space so when OOM kill happens, I can see many free pages in CMA area, which is not memory efficient. In our product which has big CMA memory, it reclaims zones too exccessively to allocate GPU and zram page although there are lots of free space in CMA so system becomes very slow easily. To solve these problem, this patch tries to add facility to migrate non-lru pages via introducing new functions and page flags to help migration. struct address_space_operations { .. .. bool (*isolate_page)(struct page *, isolate_mode_t); void (*putback_page)(struct page *); .. } new page flags PG_movable PG_isolated For details, please read description in "mm: migrate: support non-lru movable page migration". Originally, Gioh Kim had tried to support this feature but he moved so I took over the work. I took many code from his work and changed a little bit and Konstantin Khlebnikov helped Gioh a lot so he should deserve to have many credit, too. And I should mention Chulmin who have tested this patchset heavily so I can find many bugs from him. :) Thanks, Gioh, Konstantin and Chulmin! This patchset consists of five parts. 1. clean up migration mm: use put_page to free page instead of putback_lru_page 2. add non-lru page migration feature mm: migrate: support non-lru movable page migration 3. rework KVM memory-ballooning mm: balloon: use general non-lru movable page feature 4. zsmalloc refactoring for preparing page migration zsmalloc: keep max_object in size_class zsmalloc: use bit_spin_lock zsmalloc: use accessor zsmalloc: factor page chain functionality out zsmalloc: introduce zspage structure zsmalloc: separate free_zspage from putback_zspage zsmalloc: use freeobj for index 5. zsmalloc page migration zsmalloc: page migration support zram: use __GFP_MOVABLE for memory allocation This patch (of 12): Procedure of page migration is as follows: First of all, it should isolate a page from LRU and try to migrate the page. If it is successful, it releases the page for freeing. Otherwise, it should put the page back to LRU list. For LRU pages, we have used putback_lru_page for both freeing and putback to LRU list. It's okay because put_page is aware of LRU list so if it releases last refcount of the page, it removes the page from LRU list. However, It makes unnecessary operations (e.g., lru_cache_add, pagevec and flags operations. It would be not significant but no worth to do) and harder to support new non-lru page migration because put_page isn't aware of non-lru page's data structure. To solve the problem, we can add new hook in put_page with PageMovable flags check but it can increase overhead in hot path and needs new locking scheme to stabilize the flag check with put_page. So, this patch cleans it up to divide two semantic(ie, put and putback). If migration is successful, use put_page instead of putback_lru_page and use putback_lru_page only on failure. That makes code more readable and doesn't add overhead in put_page. Comment from Vlastimil "Yeah, and compaction (perhaps also other migration users) has to drain the lru pvec... Getting rid of this stuff is worth even by itself." Link: http://lkml.kernel.org/r/1464736881-24886-2-git-send-email-minchan@kernel.orgSigned-off-by: NMinchan Kim <minchan@kernel.org> Acked-by: NVlastimil Babka <vbabka@suse.cz> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Hugh Dickins <hughd@google.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Vladimir Davydov 提交于
It's a part of oom context just like allocation order and nodemask, so let's move it to oom_control instead of passing it in the argument list. Link: http://lkml.kernel.org/r/40e03fd7aaf1f55c75d787128d6d17c5a71226c2.1464358556.git.vdavydov@virtuozzo.comSigned-off-by: NVladimir Davydov <vdavydov@virtuozzo.com> Acked-by: NMichal Hocko <mhocko@suse.com> Acked-by: NJohannes Weiner <hannes@cmpxchg.org> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: David Rientjes <rientjes@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Vladimir Davydov 提交于
Not used since oom_lock was instroduced. Link: http://lkml.kernel.org/r/1464358093-22663-1-git-send-email-vdavydov@virtuozzo.comSigned-off-by: NVladimir Davydov <vdavydov@virtuozzo.com> Acked-by: NMichal Hocko <mhocko@suse.com> Acked-by: NJohannes Weiner <hannes@cmpxchg.org> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Reza Arbab 提交于
When memory is onlined, we are only able to rezone from ZONE_MOVABLE to ZONE_KERNEL, or from (ZONE_MOVABLE - 1) to ZONE_MOVABLE. To be more flexible, use the following criteria instead; to online memory from zone X into zone Y, * Any zones between X and Y must be unused. * If X is lower than Y, the onlined memory must lie at the end of X. * If X is higher than Y, the onlined memory must lie at the start of X. Add zone_can_shift() to make this determination. Link: http://lkml.kernel.org/r/1462816419-4479-3-git-send-email-arbab@linux.vnet.ibm.comSigned-off-by: NReza Arbab <arbab@linux.vnet.ibm.com> Reviewd-by: NYasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Daniel Kiper <daniel.kiper@oracle.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Tang Chen <tangchen@cn.fujitsu.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: David Vrabel <david.vrabel@citrix.com> Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Andrew Banman <abanman@sgi.com> Cc: Chen Yucong <slaoub@gmail.com> Cc: Yasunori Goto <y-goto@jp.fujitsu.com> Cc: Zhang Zhen <zhenzhang.zhang@huawei.com> Cc: Shaohua Li <shaohua.li@intel.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Reza Arbab 提交于
Add move_pfn_range(), a wrapper to call move_pfn_range_left() or move_pfn_range_right(). No functional change. This will be utilized by a later patch. Link: http://lkml.kernel.org/r/1462816419-4479-2-git-send-email-arbab@linux.vnet.ibm.comSigned-off-by: NReza Arbab <arbab@linux.vnet.ibm.com> Reviewed-by: NYasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Daniel Kiper <daniel.kiper@oracle.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Tang Chen <tangchen@cn.fujitsu.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: David Vrabel <david.vrabel@citrix.com> Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Andrew Banman <abanman@sgi.com> Cc: Chen Yucong <slaoub@gmail.com> Cc: Yasunori Goto <y-goto@jp.fujitsu.com> Cc: Zhang Zhen <zhenzhang.zhang@huawei.com> Cc: Shaohua Li <shaohua.li@intel.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Oliver O'Halloran 提交于
As a part of memory initialisation the architecture passes an array to free_area_init_nodes() which specifies the max PFN of each memory zone. This array is not necessarily monotonic (due to unused zones) so this array is parsed to build monotonic lists of the min and max PFN for each zone. ZONE_MOVABLE is special cased here as its limits are managed by the mm subsystem rather than the architecture. Unfortunately, this special casing is broken when ZONE_MOVABLE is the not the last zone in the zone list. The core of the issue is: if (i == ZONE_MOVABLE) continue; arch_zone_lowest_possible_pfn[i] = arch_zone_highest_possible_pfn[i-1]; As ZONE_MOVABLE is skipped the lowest_possible_pfn of the next zone will be set to zero. This patch fixes this bug by adding explicitly tracking where the next zone should start rather than relying on the contents arch_zone_highest_possible_pfn[]. Thie is low priority. To get bitten by this you need to enable a zone that appears after ZONE_MOVABLE in the zone_type enum. As far as I can tell this means running a kernel with ZONE_DEVICE or ZONE_CMA enabled, so I can't see this affecting too many people. I only noticed this because I've been fiddling with ZONE_DEVICE on powerpc and 4.6 broke my test kernel. This bug, in conjunction with the changes in Taku Izumi's kernelcore=mirror patch (d91749c1) and powerpc being the odd architecture which initialises max_zone_pfn[] to ~0ul instead of 0 caused all of system memory to be placed into ZONE_DEVICE at boot, followed a panic since device memory cannot be used for kernel allocations. I've already submitted a patch to fix the powerpc specific bits, but I figured this should be fixed too. Link: http://lkml.kernel.org/r/1462435033-15601-1-git-send-email-oohall@gmail.comSigned-off-by: NOliver O'Halloran <oohall@gmail.com> Cc: Anton Blanchard <anton@samba.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Li RongQing 提交于
It seems like this parameter has never been used since being introduced by 90254a65 ("memcg: clean up move charge"). Not a big deal because I assume the function would get inlined into the caller anyway but why not get rid of it. [mhocko@suse.com: wrote changelog] Link: http://lkml.kernel.org/r/20160525151831.GJ20132@dhcp22.suse.cz Link: http://lkml.kernel.org/r/1464145026-26693-1-git-send-email-roy.qing.li@gmail.comSigned-off-by: NLi RongQing <roy.qing.li@gmail.com> Acked-by: NMichal Hocko <mhocko@suse.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Wei Yongjun 提交于
Using list_move() instead of list_del() + list_add() to avoid needlessly poisoning the next and prev values. Link: http://lkml.kernel.org/r/1468929772-9174-1-git-send-email-weiyj_lk@163.comSigned-off-by: NWei Yongjun <yongjun_wei@trendmicro.com.cn> Acked-by: NDavid Rientjes <rientjes@google.com> Acked-by: NChristoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Michal Hocko 提交于
Both SLAB and SLUB BUG() when a caller provides an invalid gfp_mask. This is a rather harsh way to announce a non-critical issue. Allocator is free to ignore invalid flags. Let's simply replace BUG() by dump_stack to tell the offender and fixup the mask to move on with the allocation request. This is an example for kmalloc(GFP_KERNEL|__GFP_HIGHMEM) from a test module: Unexpected gfp: 0x2 (__GFP_HIGHMEM). Fixing up to gfp: 0x24000c0 (GFP_KERNEL). Fix your code! CPU: 0 PID: 2916 Comm: insmod Tainted: G O 4.6.0-slabgfp2-00002-g4cdfc2ef4892-dirty #936 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Debian-1.8.2-1 04/01/2014 Call Trace: dump_stack+0x67/0x90 cache_alloc_refill+0x201/0x617 kmem_cache_alloc_trace+0xa7/0x24a ? 0xffffffffa0005000 mymodule_init+0x20/0x1000 [test_slab] do_one_initcall+0xe7/0x16c ? rcu_read_lock_sched_held+0x61/0x69 ? kmem_cache_alloc_trace+0x197/0x24a do_init_module+0x5f/0x1d9 load_module+0x1a3d/0x1f21 ? retint_kernel+0x2d/0x2d SyS_init_module+0xe8/0x10e ? SyS_init_module+0xe8/0x10e do_syscall_64+0x68/0x13f entry_SYSCALL64_slow_path+0x25/0x25 Link: http://lkml.kernel.org/r/1465548200-11384-2-git-send-email-mhocko@kernel.orgSigned-off-by: NMichal Hocko <mhocko@suse.com> Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Michal Hocko 提交于
printk offers %pGg for quite some time so let's use it to get a human readable list of invalid flags. The original output would be [ 429.191962] gfp: 2 after the change [ 429.191962] Unexpected gfp: 0x2 (__GFP_HIGHMEM) Link: http://lkml.kernel.org/r/1465548200-11384-1-git-send-email-mhocko@kernel.orgSigned-off-by: NMichal Hocko <mhocko@suse.com> Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Thomas Garnier 提交于
Implements freelist randomization for the SLUB allocator. It was previous implemented for the SLAB allocator. Both use the same configuration option (CONFIG_SLAB_FREELIST_RANDOM). The list is randomized during initialization of a new set of pages. The order on different freelist sizes is pre-computed at boot for performance. Each kmem_cache has its own randomized freelist. This security feature reduces the predictability of the kernel SLUB allocator against heap overflows rendering attacks much less stable. For example these attacks exploit the predictability of the heap: - Linux Kernel CAN SLUB overflow (https://goo.gl/oMNWkU) - Exploiting Linux Kernel Heap corruptions (http://goo.gl/EXLn95) Performance results: slab_test impact is between 3% to 4% on average for 100000 attempts without smp. It is a very focused testing, kernbench show the overall impact on the system is way lower. Before: Single thread testing ===================== 1. Kmalloc: Repeatedly allocate then free test 100000 times kmalloc(8) -> 49 cycles kfree -> 77 cycles 100000 times kmalloc(16) -> 51 cycles kfree -> 79 cycles 100000 times kmalloc(32) -> 53 cycles kfree -> 83 cycles 100000 times kmalloc(64) -> 62 cycles kfree -> 90 cycles 100000 times kmalloc(128) -> 81 cycles kfree -> 97 cycles 100000 times kmalloc(256) -> 98 cycles kfree -> 121 cycles 100000 times kmalloc(512) -> 95 cycles kfree -> 122 cycles 100000 times kmalloc(1024) -> 96 cycles kfree -> 126 cycles 100000 times kmalloc(2048) -> 115 cycles kfree -> 140 cycles 100000 times kmalloc(4096) -> 149 cycles kfree -> 171 cycles 2. Kmalloc: alloc/free test 100000 times kmalloc(8)/kfree -> 70 cycles 100000 times kmalloc(16)/kfree -> 70 cycles 100000 times kmalloc(32)/kfree -> 70 cycles 100000 times kmalloc(64)/kfree -> 70 cycles 100000 times kmalloc(128)/kfree -> 70 cycles 100000 times kmalloc(256)/kfree -> 69 cycles 100000 times kmalloc(512)/kfree -> 70 cycles 100000 times kmalloc(1024)/kfree -> 73 cycles 100000 times kmalloc(2048)/kfree -> 72 cycles 100000 times kmalloc(4096)/kfree -> 71 cycles After: Single thread testing ===================== 1. Kmalloc: Repeatedly allocate then free test 100000 times kmalloc(8) -> 57 cycles kfree -> 78 cycles 100000 times kmalloc(16) -> 61 cycles kfree -> 81 cycles 100000 times kmalloc(32) -> 76 cycles kfree -> 93 cycles 100000 times kmalloc(64) -> 83 cycles kfree -> 94 cycles 100000 times kmalloc(128) -> 106 cycles kfree -> 107 cycles 100000 times kmalloc(256) -> 118 cycles kfree -> 117 cycles 100000 times kmalloc(512) -> 114 cycles kfree -> 116 cycles 100000 times kmalloc(1024) -> 115 cycles kfree -> 118 cycles 100000 times kmalloc(2048) -> 147 cycles kfree -> 131 cycles 100000 times kmalloc(4096) -> 214 cycles kfree -> 161 cycles 2. Kmalloc: alloc/free test 100000 times kmalloc(8)/kfree -> 66 cycles 100000 times kmalloc(16)/kfree -> 66 cycles 100000 times kmalloc(32)/kfree -> 66 cycles 100000 times kmalloc(64)/kfree -> 66 cycles 100000 times kmalloc(128)/kfree -> 65 cycles 100000 times kmalloc(256)/kfree -> 67 cycles 100000 times kmalloc(512)/kfree -> 67 cycles 100000 times kmalloc(1024)/kfree -> 64 cycles 100000 times kmalloc(2048)/kfree -> 67 cycles 100000 times kmalloc(4096)/kfree -> 67 cycles Kernbench, before: Average Optimal load -j 12 Run (std deviation): Elapsed Time 101.873 (1.16069) User Time 1045.22 (1.60447) System Time 88.969 (0.559195) Percent CPU 1112.9 (13.8279) Context Switches 189140 (2282.15) Sleeps 99008.6 (768.091) After: Average Optimal load -j 12 Run (std deviation): Elapsed Time 102.47 (0.562732) User Time 1045.3 (1.34263) System Time 88.311 (0.342554) Percent CPU 1105.8 (6.49444) Context Switches 189081 (2355.78) Sleeps 99231.5 (800.358) Link: http://lkml.kernel.org/r/1464295031-26375-3-git-send-email-thgarnie@google.comSigned-off-by: NThomas Garnier <thgarnie@google.com> Reviewed-by: NKees Cook <keescook@chromium.org> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Thomas Garnier 提交于
The kernel heap allocators are using a sequential freelist making their allocation predictable. This predictability makes kernel heap overflow easier to exploit. An attacker can careful prepare the kernel heap to control the following chunk overflowed. For example these attacks exploit the predictability of the heap: - Linux Kernel CAN SLUB overflow (https://goo.gl/oMNWkU) - Exploiting Linux Kernel Heap corruptions (http://goo.gl/EXLn95) ***Problems that needed solving: - Randomize the Freelist (singled linked) used in the SLUB allocator. - Ensure good performance to encourage usage. - Get best entropy in early boot stage. ***Parts: - 01/02 Reorganize the SLAB Freelist randomization to share elements with the SLUB implementation. - 02/02 The SLUB Freelist randomization implementation. Similar approach than the SLAB but tailored to the singled freelist used in SLUB. ***Performance data: slab_test impact is between 3% to 4% on average for 100000 attempts without smp. It is a very focused testing, kernbench show the overall impact on the system is way lower. Before: Single thread testing ===================== 1. Kmalloc: Repeatedly allocate then free test 100000 times kmalloc(8) -> 49 cycles kfree -> 77 cycles 100000 times kmalloc(16) -> 51 cycles kfree -> 79 cycles 100000 times kmalloc(32) -> 53 cycles kfree -> 83 cycles 100000 times kmalloc(64) -> 62 cycles kfree -> 90 cycles 100000 times kmalloc(128) -> 81 cycles kfree -> 97 cycles 100000 times kmalloc(256) -> 98 cycles kfree -> 121 cycles 100000 times kmalloc(512) -> 95 cycles kfree -> 122 cycles 100000 times kmalloc(1024) -> 96 cycles kfree -> 126 cycles 100000 times kmalloc(2048) -> 115 cycles kfree -> 140 cycles 100000 times kmalloc(4096) -> 149 cycles kfree -> 171 cycles 2. Kmalloc: alloc/free test 100000 times kmalloc(8)/kfree -> 70 cycles 100000 times kmalloc(16)/kfree -> 70 cycles 100000 times kmalloc(32)/kfree -> 70 cycles 100000 times kmalloc(64)/kfree -> 70 cycles 100000 times kmalloc(128)/kfree -> 70 cycles 100000 times kmalloc(256)/kfree -> 69 cycles 100000 times kmalloc(512)/kfree -> 70 cycles 100000 times kmalloc(1024)/kfree -> 73 cycles 100000 times kmalloc(2048)/kfree -> 72 cycles 100000 times kmalloc(4096)/kfree -> 71 cycles After: Single thread testing ===================== 1. Kmalloc: Repeatedly allocate then free test 100000 times kmalloc(8) -> 57 cycles kfree -> 78 cycles 100000 times kmalloc(16) -> 61 cycles kfree -> 81 cycles 100000 times kmalloc(32) -> 76 cycles kfree -> 93 cycles 100000 times kmalloc(64) -> 83 cycles kfree -> 94 cycles 100000 times kmalloc(128) -> 106 cycles kfree -> 107 cycles 100000 times kmalloc(256) -> 118 cycles kfree -> 117 cycles 100000 times kmalloc(512) -> 114 cycles kfree -> 116 cycles 100000 times kmalloc(1024) -> 115 cycles kfree -> 118 cycles 100000 times kmalloc(2048) -> 147 cycles kfree -> 131 cycles 100000 times kmalloc(4096) -> 214 cycles kfree -> 161 cycles 2. Kmalloc: alloc/free test 100000 times kmalloc(8)/kfree -> 66 cycles 100000 times kmalloc(16)/kfree -> 66 cycles 100000 times kmalloc(32)/kfree -> 66 cycles 100000 times kmalloc(64)/kfree -> 66 cycles 100000 times kmalloc(128)/kfree -> 65 cycles 100000 times kmalloc(256)/kfree -> 67 cycles 100000 times kmalloc(512)/kfree -> 67 cycles 100000 times kmalloc(1024)/kfree -> 64 cycles 100000 times kmalloc(2048)/kfree -> 67 cycles 100000 times kmalloc(4096)/kfree -> 67 cycles Kernbench, before: Average Optimal load -j 12 Run (std deviation): Elapsed Time 101.873 (1.16069) User Time 1045.22 (1.60447) System Time 88.969 (0.559195) Percent CPU 1112.9 (13.8279) Context Switches 189140 (2282.15) Sleeps 99008.6 (768.091) After: Average Optimal load -j 12 Run (std deviation): Elapsed Time 102.47 (0.562732) User Time 1045.3 (1.34263) System Time 88.311 (0.342554) Percent CPU 1105.8 (6.49444) Context Switches 189081 (2355.78) Sleeps 99231.5 (800.358) This patch (of 2): This commit reorganizes the previous SLAB freelist randomization to prepare for the SLUB implementation. It moves functions that will be shared to slab_common. The entropy functions are changed to align with the SLUB implementation, now using get_random_(int|long) functions. These functions were chosen because they provide a bit more entropy early on boot and better performance when specific arch instructions are not available. [akpm@linux-foundation.org: fix build] Link: http://lkml.kernel.org/r/1464295031-26375-2-git-send-email-thgarnie@google.comSigned-off-by: NThomas Garnier <thgarnie@google.com> Reviewed-by: NKees Cook <keescook@chromium.org> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Dave Chinner 提交于
wait_sb_inodes() currently does a walk of all inodes in the filesystem to find dirty one to wait on during sync. This is highly inefficient and wastes a lot of CPU when there are lots of clean cached inodes that we don't need to wait on. To avoid this "all inode" walk, we need to track inodes that are currently under writeback that we need to wait for. We do this by adding inodes to a writeback list on the sb when the mapping is first tagged as having pages under writeback. wait_sb_inodes() can then walk this list of "inodes under IO" and wait specifically just for the inodes that the current sync(2) needs to wait for. Define a couple helpers to add/remove an inode from the writeback list and call them when the overall mapping is tagged for or cleared from writeback. Update wait_sb_inodes() to walk only the inodes under writeback due to the sync. With this change, filesystem sync times are significantly reduced for fs' with largely populated inode caches and otherwise no other work to do. For example, on a 16xcpu 2GHz x86-64 server, 10TB XFS filesystem with a ~10m entry inode cache, sync times are reduced from ~7.3s to less than 0.1s when the filesystem is fully clean. Link: http://lkml.kernel.org/r/1466594593-6757-2-git-send-email-bfoster@redhat.comSigned-off-by: NDave Chinner <dchinner@redhat.com> Signed-off-by: NJosef Bacik <jbacik@fb.com> Signed-off-by: NBrian Foster <bfoster@redhat.com> Reviewed-by: NJan Kara <jack@suse.cz> Tested-by: NHolger Hoffstätte <holger.hoffstaette@applied-asynchrony.com> Cc: Al Viro <viro@ZenIV.linux.org.uk> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 23 7月, 2016 1 次提交
-
-
由 Johannes Weiner 提交于
The memory controller has quite a bit of state that usually outlives the cgroup and pins its CSS until said state disappears. At the same time it imposes a 16-bit limit on the CSS ID space to economically store IDs in the wild. Consequently, when we use cgroups to contain frequent but small and short-lived jobs that leave behind some page cache, we quickly run into the 64k limitations of outstanding CSSs. Creating a new cgroup fails with -ENOSPC while there are only a few, or even no user-visible cgroups in existence. Although pinning CSSs past cgroup removal is common, there are only two instances that actually need an ID after a cgroup is deleted: cache shadow entries and swapout records. Cache shadow entries reference the ID weakly and can deal with the CSS having disappeared when it's looked up later. They pose no hurdle. Swap-out records do need to pin the css to hierarchically attribute swapins after the cgroup has been deleted; though the only pages that remain swapped out after offlining are tmpfs/shmem pages. And those references are under the user's control, so they are manageable. This patch introduces a private 16-bit memcg ID and switches swap and cache shadow entries over to using that. This ID can then be recycled after offlining when the CSS remains pinned only by objects that don't specifically need it. This script demonstrates the problem by faulting one cache page in a new cgroup and deleting it again: set -e mkdir -p pages for x in `seq 128000`; do [ $((x % 1000)) -eq 0 ] && echo $x mkdir /cgroup/foo echo $$ >/cgroup/foo/cgroup.procs echo trex >pages/$x echo $$ >/cgroup/cgroup.procs rmdir /cgroup/foo done When run on an unpatched kernel, we eventually run out of possible IDs even though there are no visible cgroups: [root@ham ~]# ./cssidstress.sh [...] 65000 mkdir: cannot create directory '/cgroup/foo': No space left on device After this patch, the IDs get released upon cgroup destruction and the cache and css objects get released once memory reclaim kicks in. [hannes@cmpxchg.org: init the IDR] Link: http://lkml.kernel.org/r/20160621154601.GA22431@cmpxchg.org Fixes: b2052564 ("mm: memcontrol: continue cache reclaim from offlined groups") Link: http://lkml.kernel.org/r/20160617162516.GD19084@cmpxchg.orgSigned-off-by: NJohannes Weiner <hannes@cmpxchg.org> Reported-by: NJohn Garcia <john.garcia@mesosphere.io> Reviewed-by: NVladimir Davydov <vdavydov@virtuozzo.com> Acked-by: NTejun Heo <tj@kernel.org> Cc: Nikolay Borisov <kernel@kyup.com> Cc: <stable@vger.kernel.org> [3.19+] Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 15 7月, 2016 4 次提交
-
-
由 Anton Blanchard 提交于
Commit 612e4493 ("mm: workingset: eviction buckets for bigmem/lowbit machines") added a printk without a log level. Quieten it by using pr_info(). Link: http://lkml.kernel.org/r/1466982072-29836-2-git-send-email-anton@ozlabs.orgSigned-off-by: NAnton Blanchard <anton@samba.org> Acked-by: NJohannes Weiner <hannes@cmpxchg.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Hugh Dickins 提交于
The VM_BUG_ON_PAGE in page_move_anon_rmap() is more trouble than it's worth: the syzkaller fuzzer hit it again. It's still wrong for some THP cases, because linear_page_index() was never intended to apply to addresses before the start of a vma. That's easily fixed with a signed long cast inside linear_page_index(); and Dmitry has tested such a patch, to verify the false positive. But why extend linear_page_index() just for this case? when the avoidance in page_move_anon_rmap() has already grown ugly, and there's no reason for the check at all (nothing else there is using address or index). Remove address arg from page_move_anon_rmap(), remove VM_BUG_ON_PAGE, remove CONFIG_DEBUG_VM PageTransHuge adjustment. And one more thing: should the compound_head(page) be done inside or outside page_move_anon_rmap()? It's usually pushed down to the lowest level nowadays (and mm/memory.c shows no other explicit use of it), so I think it's better done in page_move_anon_rmap() than by caller. Fixes: 0798d3c0 ("mm: thp: avoid false positive VM_BUG_ON_PAGE in page_move_anon_rmap()") Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1607120444540.12528@eggly.anvilsSigned-off-by: NHugh Dickins <hughd@google.com> Reported-by: NDmitry Vyukov <dvyukov@google.com> Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Mika Westerberg <mika.westerberg@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: <stable@vger.kernel.org> [4.5+] Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Naoya Horiguchi 提交于
The previous patch addresses the race between split_huge_pmd_address() and someone changing the pmd. The fix is only for splitting of normal thp (i.e. pmd-mapped thp,) and for splitting of pte-mapped thp there still is the similar race. For splitting pte-mapped thp, the pte's conversion is done by try_to_unmap_one(TTU_MIGRATION). This function checks page_check_address() to get the target pte, but it can return NULL under some race, leading to VM_BUG_ON() in freeze_page(). Fortunately, page_check_address() already has an argument to decide whether we do a quick/racy check or not, so let's flip it when called from freeze_page(). Link: http://lkml.kernel.org/r/1466990929-7452-2-git-send-email-n-horiguchi@ah.jp.nec.comSigned-off-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Naoya Horiguchi 提交于
I found a race condition triggering VM_BUG_ON() in freeze_page(), when running a testcase with 3 processes: - process 1: keep writing thp, - process 2: keep clearing soft-dirty bits from virtual address of process 1 - process 3: call migratepages for process 1, The kernel message is like this: kernel BUG at /src/linux-dev/mm/huge_memory.c:3096! invalid opcode: 0000 [#1] SMP Modules linked in: cfg80211 rfkill crc32c_intel ppdev serio_raw pcspkr virtio_balloon virtio_console parport_pc parport pvpanic acpi_cpufreq tpm_tis tpm i2c_piix4 virtio_blk virtio_net ata_generic pata_acpi floppy virtio_pci virtio_ring virtio CPU: 0 PID: 28863 Comm: migratepages Not tainted 4.6.0-v4.6-160602-0827-+ #2 Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 task: ffff880037320000 ti: ffff88007cdd0000 task.ti: ffff88007cdd0000 RIP: 0010:[<ffffffff811f8e06>] [<ffffffff811f8e06>] split_huge_page_to_list+0x496/0x590 RSP: 0018:ffff88007cdd3b70 EFLAGS: 00010202 RAX: 0000000000000001 RBX: ffff88007c7b88c0 RCX: 0000000000000000 RDX: 0000000000000000 RSI: 0000000700000200 RDI: ffffea0003188000 RBP: ffff88007cdd3bb8 R08: 0000000000000001 R09: 00003ffffffff000 R10: ffff880000000000 R11: ffffc000001fffff R12: ffffea0003188000 R13: ffffea0003188000 R14: 0000000000000000 R15: 0400000000000080 FS: 00007f8ec241d740(0000) GS:ffff88007dc00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f8ec1f3ed20 CR3: 000000003707b000 CR4: 00000000000006f0 Call Trace: ? list_del+0xd/0x30 queue_pages_pte_range+0x4d1/0x590 __walk_page_range+0x204/0x4e0 walk_page_range+0x71/0xf0 queue_pages_range+0x75/0x90 ? queue_pages_hugetlb+0x190/0x190 ? new_node_page+0xc0/0xc0 ? change_prot_numa+0x40/0x40 migrate_to_node+0x71/0xd0 do_migrate_pages+0x1c3/0x210 SyS_migrate_pages+0x261/0x290 entry_SYSCALL_64_fastpath+0x1a/0xa4 Code: e8 b0 87 fb ff 0f 0b 48 c7 c6 30 32 9f 81 e8 a2 87 fb ff 0f 0b 48 c7 c6 b8 46 9f 81 e8 94 87 fb ff 0f 0b 85 c0 0f 84 3e fd ff ff <0f> 0b 85 c0 0f 85 a6 00 00 00 48 8b 75 c0 4c 89 f7 41 be f0 ff RIP split_huge_page_to_list+0x496/0x590 I'm not sure of the full scenario of the reproduction, but my debug showed that split_huge_pmd_address(freeze=true) returned without running main code of pmd splitting because pmd_present(*pmd) in precheck somehow returned 0. If this happens, the subsequent try_to_unmap() fails and returns non-zero (because page_mapcount() still > 0), and finally VM_BUG_ON() fires. This patch tries to fix it by prechecking pmd state inside ptl. Link: http://lkml.kernel.org/r/1466990929-7452-1-git-send-email-n-horiguchi@ah.jp.nec.comSigned-off-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-