- 10 Dec, 2015 1 commit
-
-
Committed by Ard Biesheuvel
This introduces the MEMBLOCK_NOMAP attribute and the required plumbing to make it usable as an indicator that some parts of normal memory should not be covered by the kernel direct mapping. It is up to the arch to actually honor the attribute when laying out this mapping, but the memblock code itself is modified to disregard these regions for allocations and other general use.

Cc: linux-mm@kvack.org
Cc: Alexander Kuleshov <kuleshovmail@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Matt Fleming <matt@codeblueprint.co.uk>
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Will Deacon <will.deacon@arm.com>
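As a rough illustration of "disregard these regions for allocations", here is a minimal user-space sketch. The struct layout, the flag value and every `sk_` name are assumptions made for this example; they are not the memblock data structures or API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag bit; the real MEMBLOCK_NOMAP value is not taken from the source. */
#define SK_REGION_NOMAP 0x4u

struct sk_region {
    uint64_t base;
    uint64_t size;
    unsigned int flags;
};

/*
 * Return the base of the first region that is large enough and not
 * marked NOMAP, or UINT64_MAX if no region qualifies.
 */
static uint64_t sk_find_usable(const struct sk_region *r, size_t n, uint64_t want)
{
    assert(r != NULL || n == 0);

    for (size_t i = 0; i < n; i++) {
        if (r[i].flags & SK_REGION_NOMAP)
            continue;   /* excluded from the direct mapping: never hand it out */
        if (r[i].size >= want)
            return r[i].base;
    }
    return UINT64_MAX;
}
```

The point of the sketch is only that the skip happens inside the generic search, so callers do not need to know about the attribute.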
-
- 23 Nov, 2015 5 commits
-
-
Committed by Jesper Dangaard Brouer
Adjust the kmem_cache_alloc_bulk API before we have any real users.

Adjust the API to return type 'int' instead of the previous type 'bool'. This is done to allow future extension of the bulk alloc API.

A future extension could be to allow SLUB to stop at a page boundary, when specified by a flag, and then return the number of objects.

The advantage of this approach is that it would make it easier to run bulk alloc without local IRQs disabled, using a cmpxchg to "steal" the entire c->freelist or page->freelist. To avoid overshooting we would stop processing at a slab-page boundary. Otherwise we always end up returning some objects at the cost of another cmpxchg.

To stay compatible with future users of this API linking against an older kernel when using the new flag, we need to return the number of allocated objects with this API change.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
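To make the 'bool' to 'int' change concrete, the following user-space sketch shows a bulk allocator that reports how many objects it produced. It is built on malloc() and uses hypothetical `sk_` names; it is not the kernel's kmem_cache_alloc_bulk() and only mirrors the return convention described above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/*
 * Fill p[0..nr-1] with objects of obj_size bytes.
 * Returns the number of objects allocated. On failure every object
 * allocated so far is released and 0 is returned.
 */
static int sk_alloc_bulk(size_t obj_size, size_t nr, void **p)
{
    size_t i;

    assert(p != NULL || nr == 0);

    for (i = 0; i < nr; i++) {
        p[i] = malloc(obj_size);
        if (!p[i]) {
            while (i--)         /* unwind p[i-1] down to p[0] */
                free(p[i]);
            return 0;
        }
    }
    return (int)i;
}

static void sk_free_bulk(size_t nr, void **p)
{
    for (size_t i = 0; i < nr; i++)
        free(p[i]);
}
```

A count return leaves room for a later mode that stops early and returns fewer than nr objects, which a boolean could not express.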
-
Committed by Jesper Dangaard Brouer
The initial implementation missed kmem cgroup support in the kmem_cache_free_bulk() call; add it.

If CONFIG_MEMCG_KMEM is not enabled, the compiler should be smart enough to not add any asm code.

Incoming bulk free objects can belong to different kmem cgroups, and the object free call can happen at a later point outside memcg context. Thus, we need to keep the original kmem_cache in order to correctly verify whether a memcg object matches its "root_cache" (s->memcg_params.root_cache).

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Committed by Jesper Dangaard Brouer
The call slab_pre_alloc_hook() interacts with kmemcg and is not allowed to be called several times inside the bulk alloc for loop, due to the call to memcg_kmem_get_cache().

This would result in hitting the VM_BUG_ON in __memcg_kmem_get_cache.

As suggested by Vladimir Davydov, change slab_post_alloc_hook() to be able to handle an array of objects.

A subtle detail is that the loop iterator "i" in slab_post_alloc_hook() must have the same type (size_t) as the size argument. This makes it easier for the compiler to realize that it can remove the loop when all debug statements inside the loop evaluate to nothing. Note, this is only an issue because the kernel is compiled with the GCC option -fno-strict-overflow.

In slab_alloc_node() the compiler inlines and optimizes the invocation of slab_post_alloc_hook(s, flags, 1, &object) by removing the loop and accessing the object directly.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Reported-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Suggested-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
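The shape of a post-alloc hook that takes an array can be sketched as below. This is a user-space stand-in with hypothetical names (`sk_post_alloc_hook`, `SK_DEBUG`); the memset() merely plays the role of per-object debug work and is not what the kernel hook does.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical debug switch; with 0 the loop body compiles to nothing. */
#define SK_DEBUG 1

static void sk_post_alloc_hook(size_t size, void **p, size_t obj_size)
{
    size_t i;   /* same type as 'size', so the bound check needs no conversion */

    assert(p != NULL || size == 0);

    for (i = 0; i < size; i++) {
#if SK_DEBUG
        memset(p[i], 0x5a, obj_size);   /* stand-in for per-object debug work */
#else
        (void)obj_size;
#endif
    }
}
```

A single-object caller passes a size of 1 and the address of its object pointer, which is the pattern the message describes for slab_alloc_node().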
-
Committed by Jesper Dangaard Brouer
This change focuses on improving the speed of object freeing in the "slowpath" of kmem_cache_free_bulk.

The calls slab_free (fastpath) and __slab_free (slowpath) have been extended with support for bulk free, which amortizes the overhead of the (locked) cmpxchg_double.

To use the new bulking feature, we build what I call a detached freelist. The detached freelist takes advantage of three properties:

 1) the free function call owns the object that is about to be freed, thus writing into this memory is synchronization-free.
 2) many freelists can co-exist side-by-side in the same slab-page, each with a separate head pointer.
 3) it is the visibility of the head pointer that needs synchronization.

Given these properties, the brilliant part is that the detached freelist can be constructed without any need for synchronization. The freelist is constructed directly in the page objects. The detached freelist is allocated on the stack of the function call kmem_cache_free_bulk. Thus, the freelist head pointer is not visible to other CPUs.

All objects in a SLUB freelist must belong to the same slab-page. Thus, constructing the detached freelist is about matching objects that belong to the same slab-page. The bulk free array is scanned in a progressive manner with a limited look-ahead facility.

Kmem debug support is handled in the call of slab_free().

Notice that kmem_cache_free_bulk no longer needs to disable IRQs. This only slowed down single free bulk by approx 3 cycles.

Performance data:
 Benchmarked[1] obj size 256 bytes on CPU i7-4790K @ 4.00GHz

SLUB fastpath single object quick reuse: 47 cycles(tsc) 11.931 ns

To get stable and comparable numbers, the kernel has been booted with "slab_nomerge" (this also improves performance for larger bulk sizes).

Performance data, compared against fallback bulking:

bulk -  fallback bulk            - improvement with this patch
   1 -  62 cycles(tsc) 15.662 ns -  49 cycles(tsc) 12.407 ns - improved 21.0%
   2 -  55 cycles(tsc) 13.935 ns -  30 cycles(tsc)  7.506 ns - improved 45.5%
   3 -  53 cycles(tsc) 13.341 ns -  23 cycles(tsc)  5.865 ns - improved 56.6%
   4 -  52 cycles(tsc) 13.081 ns -  20 cycles(tsc)  5.048 ns - improved 61.5%
   8 -  50 cycles(tsc) 12.627 ns -  18 cycles(tsc)  4.659 ns - improved 64.0%
  16 -  49 cycles(tsc) 12.412 ns -  17 cycles(tsc)  4.495 ns - improved 65.3%
  30 -  49 cycles(tsc) 12.484 ns -  18 cycles(tsc)  4.533 ns - improved 63.3%
  32 -  50 cycles(tsc) 12.627 ns -  18 cycles(tsc)  4.707 ns - improved 64.0%
  34 -  96 cycles(tsc) 24.243 ns -  23 cycles(tsc)  5.976 ns - improved 76.0%
  48 -  83 cycles(tsc) 20.818 ns -  21 cycles(tsc)  5.329 ns - improved 74.7%
  64 -  74 cycles(tsc) 18.700 ns -  20 cycles(tsc)  5.127 ns - improved 73.0%
 128 -  90 cycles(tsc) 22.734 ns -  27 cycles(tsc)  6.833 ns - improved 70.0%
 158 -  99 cycles(tsc) 24.776 ns -  30 cycles(tsc)  7.583 ns - improved 69.7%
 250 - 104 cycles(tsc) 26.089 ns -  37 cycles(tsc)  9.280 ns - improved 64.4%

Performance data, compared against current in-kernel bulking:

bulk - curr in-kernel  - improvement with this patch
   1 -  46 cycles(tsc) -  49 cycles(tsc) - improved (cycles:-3) -6.5%
   2 -  27 cycles(tsc) -  30 cycles(tsc) - improved (cycles:-3) -11.1%
   3 -  21 cycles(tsc) -  23 cycles(tsc) - improved (cycles:-2) -9.5%
   4 -  18 cycles(tsc) -  20 cycles(tsc) - improved (cycles:-2) -11.1%
   8 -  17 cycles(tsc) -  18 cycles(tsc) - improved (cycles:-1) -5.9%
  16 -  18 cycles(tsc) -  17 cycles(tsc) - improved (cycles: 1) 5.6%
  30 -  18 cycles(tsc) -  18 cycles(tsc) - improved (cycles: 0) 0.0%
  32 -  18 cycles(tsc) -  18 cycles(tsc) - improved (cycles: 0) 0.0%
  34 -  78 cycles(tsc) -  23 cycles(tsc) - improved (cycles:55) 70.5%
  48 -  60 cycles(tsc) -  21 cycles(tsc) - improved (cycles:39) 65.0%
  64 -  49 cycles(tsc) -  20 cycles(tsc) - improved (cycles:29) 59.2%
 128 -  69 cycles(tsc) -  27 cycles(tsc) - improved (cycles:42) 60.9%
 158 -  79 cycles(tsc) -  30 cycles(tsc) - improved (cycles:49) 62.0%
 250 -  86 cycles(tsc) -  37 cycles(tsc) - improved (cycles:49) 57.0%

Performance with normal SLUB merging is significantly slower for larger bulking. The advantage of slab_nomerge is believed to come (primarily) from not having to share the per-CPU data-structures, as tuning the per-CPU size can achieve similar performance.

bulk - slab_nomerge    - normal SLUB merge
   1 -  49 cycles(tsc) -  49 cycles(tsc) - merge slower with cycles:0
   2 -  30 cycles(tsc) -  30 cycles(tsc) - merge slower with cycles:0
   3 -  23 cycles(tsc) -  23 cycles(tsc) - merge slower with cycles:0
   4 -  20 cycles(tsc) -  20 cycles(tsc) - merge slower with cycles:0
   8 -  18 cycles(tsc) -  18 cycles(tsc) - merge slower with cycles:0
  16 -  17 cycles(tsc) -  17 cycles(tsc) - merge slower with cycles:0
  30 -  18 cycles(tsc) -  23 cycles(tsc) - merge slower with cycles:5
  32 -  18 cycles(tsc) -  22 cycles(tsc) - merge slower with cycles:4
  34 -  23 cycles(tsc) -  22 cycles(tsc) - merge slower with cycles:-1
  48 -  21 cycles(tsc) -  22 cycles(tsc) - merge slower with cycles:1
  64 -  20 cycles(tsc) -  48 cycles(tsc) - merge slower with cycles:28
 128 -  27 cycles(tsc) -  57 cycles(tsc) - merge slower with cycles:30
 158 -  30 cycles(tsc) -  59 cycles(tsc) - merge slower with cycles:29
 250 -  37 cycles(tsc) -  56 cycles(tsc) - merge slower with cycles:19

Joint work with Alexander Duyck.

[1] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/mm/slab_bulk_test01.c

[akpm@linux-foundation.org: BUG_ON -> WARN_ON;return]
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
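The detached-freelist idea can be sketched in user space as follows. This is an illustration under stated assumptions, not the kernel code: a "slab-page" is taken to be an aligned 4096-byte block, the link pointer lives in the first word of each object, all `sk_` names are hypothetical, and the whole array is scanned instead of using the limited look-ahead the message mentions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SK_SLAB_SIZE 4096u   /* assumption: one slab is one aligned 4 KiB block */

struct sk_detached_freelist {
    void *head;   /* first object of the private list */
    void *tail;   /* last object of the private list */
    int cnt;      /* number of objects linked so far */
};

static uintptr_t sk_slab_of(const void *obj)
{
    return (uintptr_t)obj & ~(uintptr_t)(SK_SLAB_SIZE - 1);
}

/* The caller owns the object, so writing its link word needs no locking. */
static void sk_set_next(void *obj, void *next)
{
    memcpy(obj, &next, sizeof(next));
}

static void *sk_get_next(const void *obj)
{
    void *next;

    memcpy(&next, obj, sizeof(next));
    return next;
}

/*
 * Link every object that shares a slab with the last non-NULL entry of
 * p[] into 'df'. Consumed entries are set to NULL. Returns how many
 * entries are left for a later pass.
 */
static size_t sk_build_detached(size_t n, void **p, struct sk_detached_freelist *df)
{
    size_t pending = 0;
    uintptr_t slab = 0;

    assert(df != NULL && (p != NULL || n == 0));
    df->head = df->tail = NULL;
    df->cnt = 0;

    while (n--) {
        void *obj = p[n];

        if (!obj)
            continue;
        if (!df->head) {
            slab = sk_slab_of(obj);          /* first object picks the slab */
            sk_set_next(obj, NULL);
            df->head = df->tail = obj;
        } else if (sk_slab_of(obj) == slab) {
            sk_set_next(obj, df->head);      /* push in front of the list */
            df->head = obj;
        } else {
            pending++;                       /* other slab: leave for later */
            continue;
        }
        p[n] = NULL;
        df->cnt++;
    }
    return pending;
}
```

A caller would loop until the function reports nothing pending, handing each finished (head, tail, cnt) triple to a single synchronized free operation.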
-
Committed by Jesper Dangaard Brouer
Make it possible to free a freelist with several objects by adjusting the API of slab_free() and __slab_free() to take a head, a tail and an objects counter (cnt).

A NULL tail indicates a single-object free of the head object. This allows the compiler's inlining and constant propagation in slab_free() and slab_free_freelist_hook() to avoid adding any overhead in the single-object case.

This allows a freelist with several objects (all within the same slab-page) to be freed using a single locked cmpxchg_double in __slab_free() and an unlocked cmpxchg_double in slab_free().

Object debugging on the free path is also extended to handle these freelists. When CONFIG_SLUB_DEBUG is enabled it will also detect if objects don't belong to the same slab-page.

These changes are needed for the next patch to bulk free the detached freelists it introduces and constructs.

Micro benchmarking showed no performance reduction due to this change when debugging is turned off (compiled with CONFIG_SLUB_DEBUG).

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
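The head/tail/cnt calling convention can be illustrated with C11 atomics. This is a simplified sketch with hypothetical names: the kernel updates the freelist pointer and the counters together with cmpxchg_double, whereas here the counter is a separate atomic and only the list splice uses compare-and-swap.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct sk_obj {
    struct sk_obj *next;
};

struct sk_slab {
    _Atomic(struct sk_obj *) freelist;
    atomic_int inuse;
};

/*
 * Return 'cnt' objects, already linked head -> ... -> tail, to the slab.
 * A NULL tail means that head is a single object.
 */
static void sk_slab_free(struct sk_slab *s, struct sk_obj *head,
                         struct sk_obj *tail, int cnt)
{
    struct sk_obj *old;

    assert(s != NULL && head != NULL && cnt > 0);
    if (!tail)
        tail = head;

    old = atomic_load(&s->freelist);
    do {
        tail->next = old;   /* hang the existing list behind our chain */
    } while (!atomic_compare_exchange_weak(&s->freelist, &old, head));

    /* Simplification: the kernel folds this into the same cmpxchg_double. */
    atomic_fetch_sub(&s->inuse, cnt);
}
```

However long the chain is, the shared list head is touched by one successful compare-and-swap, which is where the amortization comes from.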
-
- 21 Nov, 2015 7 commits
-
-
Committed by Jesper Dangaard Brouer
The #ifdef of CONFIG_SLUB_DEBUG is located very far from the associated #else. For readability mark it with a comment.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Alexander Duyck <alexander.h.duyck@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Committed by Christoph Lameter
Use the new function that can do allocation while interrupts are disabled. This avoids irq on/off sequences.

Signed-off-by: Christoph Lameter <cl@linux.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Alexander Duyck <alexander.h.duyck@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Committed by Christoph Lameter
Bulk alloc needs a function like that because it enables interrupts before calling __slab_alloc, which promptly disables them again using the expensive local_irq_save().

Signed-off-by: Christoph Lameter <cl@linux.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Alexander Duyck <alexander.h.duyck@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Committed by Andrey Ryabinin
Kmemleak reports the following leak:

    unreferenced object 0xfffffbfff41ea000 (size 20480):
      comm "modprobe", pid 65199, jiffies 4298875551 (age 542.568s)
      hex dump (first 32 bytes):
        00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
        00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
      backtrace:
        [<ffffffff82354f5e>] kmemleak_alloc+0x4e/0xc0
        [<ffffffff8152e718>] __vmalloc_node_range+0x4b8/0x740
        [<ffffffff81574072>] kasan_module_alloc+0x72/0xc0
        [<ffffffff810efe68>] module_alloc+0x78/0xb0
        [<ffffffff812f6a24>] module_alloc_update_bounds+0x14/0x70
        [<ffffffff812f8184>] layout_and_allocate+0x16f4/0x3c90
        [<ffffffff812faa1f>] load_module+0x2ff/0x6690
        [<ffffffff813010b6>] SyS_finit_module+0x136/0x170
        [<ffffffff8239bbc9>] system_call_fastpath+0x16/0x1b
        [<ffffffffffffffff>] 0xffffffffffffffff

kasan_module_alloc() allocates shadow memory for a module and frees it on module unloading. It doesn't store the pointer to the allocated shadow memory because it can be calculated from the shadowed address, i.e. kasan_mem_to_shadow(addr).

Since kmemleak cannot find a pointer to the allocated shadow, it thinks that the memory leaked.

Use kmemleak_ignore() to tell kmemleak that this is not a leak and that shadow memory doesn't contain any pointers.

Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Committed by Yang Shi
When building the kernel with gcc 5.2, the warning below is raised:

    mm/page-writeback.c: In function 'balance_dirty_pages.isra.10':
    mm/page-writeback.c:1545:17: warning: 'm_dirty' may be used uninitialized in this function [-Wmaybe-uninitialized]
       unsigned long m_dirty, m_thresh, m_bg_thresh;

m_dirty, m_thresh and m_bg_thresh are initialized in the "if (mdtc)" block, so if mdtc is null they won't be initialized before being used. Initialize m_dirty to zero, and also initialize m_thresh and m_bg_thresh for consistency.

They are used later by the if condition:

    !mdtc || m_dirty <= dirty_freerun_ceiling(m_thresh, m_bg_thresh)

If mdtc is null, dirty_freerun_ceiling will not be called at all, so the initialization does not change any behavior other than silencing the compile warning.

(akpm: the patch actually reduces .text size by ~20 bytes on gcc-4.x.y)

[akpm@linux-foundation.org: add comment]
Signed-off-by: Yang Shi <yang.shi@linaro.org>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Committed by Jason J. Herne
MADV_NOHUGEPAGE processing is too restrictive. kvm already disables hugepage, but hugepage_madvise() takes the error path when we ask to turn on the MADV_NOHUGEPAGE bit and the bit is already on. This causes Qemu's new postcopy migration feature to fail on s390 because its first action is to madvise the guest address space as NOHUGEPAGE. This patch modifies the code so that the operation now succeeds without error.

For consistency reasons do the same for MADV_HUGEPAGE.

Signed-off-by: Jason J. Herne <jjherne@linux.vnet.ibm.com>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Christian Borntraeger <borntraeger@de.ibm.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
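The behaviour described above, where requesting a flag that is already set succeeds instead of failing, can be sketched like this. The flag bits, the enum and the function name are hypothetical stand-ins chosen for the example; the real hugepage_madvise() has additional checks that are not modelled here.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical flag bits and advice codes; not the kernel's values. */
#define SK_VM_HUGEPAGE   0x1ul
#define SK_VM_NOHUGEPAGE 0x2ul

enum sk_advice { SK_MADV_HUGEPAGE, SK_MADV_NOHUGEPAGE };

static int sk_hugepage_madvise(unsigned long *vm_flags, enum sk_advice advice)
{
    assert(vm_flags != NULL);

    switch (advice) {
    case SK_MADV_HUGEPAGE:
        *vm_flags &= ~SK_VM_NOHUGEPAGE;
        *vm_flags |= SK_VM_HUGEPAGE;     /* already set: harmless no-op */
        return 0;
    case SK_MADV_NOHUGEPAGE:
        *vm_flags &= ~SK_VM_HUGEPAGE;
        *vm_flags |= SK_VM_NOHUGEPAGE;   /* already set: harmless no-op */
        return 0;
    }
    return -EINVAL;                      /* unknown advice value */
}
```

Treating the request as idempotent means a caller such as a migration tool does not need to know what an earlier component already asked for.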
-
Committed by Jerome Marchand
Commit 71394fe5 ("mm: vmalloc: add flag preventing guard hole allocation") missed a spot. Currently remove_vm_area() decreases vm->size to "remove" the guard hole page, even when it isn't present. All but one of its users just free the vm_struct right away and never access vm->size anyway.

Don't touch the size in remove_vm_area() and have __vunmap() use the proper get_vm_area_size() helper.

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
Acked-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
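What a size helper of this kind has to do can be sketched as follows. The struct, the flag value and the names are hypothetical, and the 4096-byte page size is an assumption; the sketch only shows why subtracting a guard page unconditionally is wrong once a "no guard" flag exists.

```c
#include <assert.h>
#include <stddef.h>

#define SK_PAGE_SIZE   4096u
#define SK_VM_NO_GUARD 0x40u   /* hypothetical bit value */

struct sk_vm_area {
    size_t size;          /* total size, including the guard page if present */
    unsigned int flags;
};

/* Usable size of the area: drop the guard page only if one was added. */
static size_t sk_vm_area_size(const struct sk_vm_area *area)
{
    assert(area != NULL);

    if (area->flags & SK_VM_NO_GUARD)
        return area->size;

    assert(area->size >= SK_PAGE_SIZE);
    return area->size - SK_PAGE_SIZE;
}
```

Keeping the stored size untouched and deriving the usable size on demand avoids the double adjustment the message describes.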
-
- 19 Nov, 2015 1 commit
-
-
Committed by Yigal Korman
DAX handling of COW faults has a wrong locking sequence:

    dax_fault does i_mmap_lock_read
    do_cow_fault does i_mmap_unlock_write

Ross's commit[1] missed a fix[2] that Kirill added to Matthew's commit[3].

The original COW locking logic was introduced by Matthew here[4].

This should be applied to v4.3 as well.

[1] 0f90cc66 mm, dax: fix DAX deadlocks
[2] 52a2b53f mm, dax: use i_mmap_unlock_write() in do_cow_fault()
[3] 84317297 dax: fix race between simultaneous faults
[4] 2e4cdab0 mm: allow page fault handlers to perform the COW

Cc: <stable@vger.kernel.org>
Cc: Boaz Harrosh <boaz@plexistor.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Jan Kara <jack@suse.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
Acked-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Yigal Korman <yigal@plexistor.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
-
- 11 Nov, 2015 2 commits
-
-
Committed by Naoya Horiguchi
Recently alloc_buddy_huge_page() was renamed to __alloc_buddy_huge_page(), so let's sync the comments.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Committed by Tony Luck
In commit a1c34a3b ("mm: Don't offset memmap for flatmem") Laura fixed a problem for Srinivas relating to the bottom 2MB of RAM on an ARM IFC6410 board.

One small wrinkle on ia64 is that it allocates the node_mem_map earlier in arch code, so it skips the block of code where "offset" is initialized.

Move the initialization of start and offset before the check for the node_mem_map so that they will always be available in the latter part of the function.

Tested-by: Laura Abbott <laura@labbott.name>
Fixes: a1c34a3b (mm: Don't offset memmap for flatmem)
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
- 07 Nov, 2015 24 commits
-
-
Committed by Kirill A. Shutemov
Let's try to be consistent about the data type of page order.

[sfr@canb.auug.org.au: fix build (type of pageblock_order)]
[hughd@google.com: some configs end up with MAX_ORDER and pageblock_order having different types]
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Committed by Kirill A. Shutemov
Hugh has pointed out that the compound_head() call can be unsafe in some contexts. Here is one example:

    CPU0                                    CPU1

    isolate_migratepages_block()
      page_count()
        compound_head()
          !!PageTail() == true
                                            put_page()
                                              tail->first_page = NULL
          head = tail->first_page
                                            alloc_pages(__GFP_COMP)
                                              prep_compound_page()
                                                tail->first_page = head
                                                __SetPageTail(p);
          !!PageTail() == true
        <head == NULL dereferencing>

The race is purely theoretical. I don't think it's possible to trigger it in practice. But who knows.

We can fix the race by changing how we encode PageTail() and compound_head() within struct page, so that both can be updated in one shot.

The patch introduces page->compound_head into the third double word block, in front of compound_dtor and compound_order. Bit 0 encodes PageTail(), and the remaining bits are a pointer to the head page if bit zero is set.

The patch moves page->pmd_huge_pte out of the word, just in case an architecture defines pgtable_t as something that can have bit 0 set.

hugetlb_cgroup uses page->lru.next in the second tail page to store a pointer to struct hugetlb_cgroup. The patch switches it to use page->private in the second tail page instead. The space is free since ->first_page is removed from the union.

The patch also opens the possibility of removing the HUGETLB_CGROUP_MIN_ORDER limitation, since there's now space in the first tail page to store the struct hugetlb_cgroup pointer. But that's out of scope for this patch.

That means page->compound_head shares storage space with:

 - page->lru.next;
 - page->next;
 - page->rcu_head.next;

That's too long a list to be absolutely sure, but it looks like nobody uses bit 0 of the word.

page->rcu_head.next is guaranteed[1] to have bit 0 clean as long as we use call_rcu(), call_rcu_bh(), call_rcu_sched(), or call_srcu(). But a future call_rcu_lazy() is not allowed, as it makes use of the bit and we could get a false positive PageTail().

[1] http://lkml.kernel.org/g/20150827163634.GD4029@linux.vnet.ibm.com

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
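The "bit 0 plus pointer in one word" encoding can be sketched in user space as below. The struct and function names are hypothetical and the sketch uses plain loads and stores; it only illustrates why a single read yields a consistent (tail flag, head pointer) pair.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct sk_page {
    uintptr_t compound_head;   /* bit 0: tail marker, other bits: head pointer */
};

static void sk_set_compound_head(struct sk_page *tail, struct sk_page *head)
{
    /* Relies on struct alignment leaving bit 0 of the address clear. */
    assert(((uintptr_t)head & 1u) == 0);
    tail->compound_head = (uintptr_t)head | 1u;   /* flag and pointer in one store */
}

static void sk_clear_compound_head(struct sk_page *page)
{
    page->compound_head = 0;
}

static bool sk_page_tail(const struct sk_page *page)
{
    return (page->compound_head & 1u) != 0;
}

static struct sk_page *sk_compound_head(struct sk_page *page)
{
    uintptr_t word = page->compound_head;   /* one read covers flag and pointer */

    if (word & 1u)
        return (struct sk_page *)(word - 1);
    return page;
}
```

Because the flag and the pointer can no longer be observed separately, the window in which a reader sees "is a tail" but a NULL head pointer disappears.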
-
Committed by Kirill A. Shutemov
The patch halves the space occupied by compound_dtor and compound_order in struct page.

For compound_order, it's a trivial long -> short conversion.

For get_compound_page_dtor(), we now use a hardcoded table for destructor lookup and store its index in struct page instead of a direct pointer to the destructor. It shouldn't be much trouble to maintain the table: we currently have only two destructors and NULL.

This patch frees up one word in tail pages for reuse. This is preparation for the next patch.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
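Storing a table index instead of a function pointer might look like the sketch below. All names, the enum and the `freed_by` field (which exists only so the example can observe which destructor ran) are hypothetical; the table contents are not the kernel's.

```c
#include <assert.h>
#include <stddef.h>

struct sk_page {
    unsigned short dtor_idx;   /* index into sk_dtors[], not a pointer */
    unsigned short order;
    int freed_by;              /* illustration only: records which dtor ran */
};

typedef void sk_dtor_fn(struct sk_page *);

enum sk_dtor_id { SK_NULL_DTOR, SK_COMPOUND_DTOR, SK_HUGETLB_DTOR, SK_NR_DTORS };

static void sk_free_compound(struct sk_page *p) { p->freed_by = SK_COMPOUND_DTOR; }
static void sk_free_huge(struct sk_page *p)     { p->freed_by = SK_HUGETLB_DTOR; }

/* Hardcoded lookup table: two destructors plus NULL. */
static sk_dtor_fn * const sk_dtors[SK_NR_DTORS] = {
    [SK_NULL_DTOR]     = NULL,
    [SK_COMPOUND_DTOR] = sk_free_compound,
    [SK_HUGETLB_DTOR]  = sk_free_huge,
};

static void sk_set_dtor(struct sk_page *p, enum sk_dtor_id id)
{
    assert(id < SK_NR_DTORS);
    p->dtor_idx = (unsigned short)id;
}

static sk_dtor_fn *sk_get_dtor(const struct sk_page *p)
{
    assert(p->dtor_idx < SK_NR_DTORS);
    return sk_dtors[p->dtor_idx];
}
```

Two short fields fit where one pointer-sized field used to be, which is the space saving the message refers to.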
-
Committed by Kirill A. Shutemov
We are going to rework how compound_head() works. It will not use page->first_page as we have it now.

The only other user of page->first_page beyond compound pages is zsmalloc.

Let's use page->private instead of page->first_page here. It occupies the same storage space.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Committed by Kirill A. Shutemov
We have a properly typed page->rcu_head, so there is no need to cast page->lru.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Committed by Sergey Senozhatsky
Each `struct size_class' contains `struct zs_size_stat': an array of NR_ZS_STAT_TYPE `unsigned long'. For zsmalloc built without CONFIG_ZSMALLOC_STAT this results in a waste of `2 * sizeof(unsigned long)' per class.

The patch removes unneeded `struct zs_size_stat' members by redefining NR_ZS_STAT_TYPE (max stat idx in array).

Since both NR_ZS_STAT_TYPE and zs_stat_type are compile-time constants, GCC can eliminate zs_stat_inc()/zs_stat_dec() calls whose zs_stat_type is at or above NR_ZS_STAT_TYPE: CLASS_ALMOST_EMPTY and CLASS_ALMOST_FULL at the moment.

    ./scripts/bloat-o-meter mm/zsmalloc.o.old mm/zsmalloc.o.new
    add/remove: 0/0 grow/shrink: 0/3 up/down: 0/-39 (-39)
    function                  old     new   delta
    fix_fullness_group         97      94      -3
    insert_zspage             100      86     -14
    remove_zspage             141     119     -22

To summarize:
 a) each class now uses less memory
 b) we avoid a number of dec/inc stats (a minor optimization, but still).

The gain will increase once we introduce additional stats.

A simple IO test.

    iozone -t 4 -R -r 32K -s 60M -I +Z

                            patched          base
    "  Initial write "   4145599.06    4127509.75
    "        Rewrite "   4146225.94    4223618.50
    "           Read "  17157606.00   17211329.50
    "        Re-read "  17380428.00   17267650.50
    "   Reverse Read "  16742768.00   16162732.75
    "    Stride read "  16586245.75   16073934.25
    "    Random read "  16349587.50   15799401.75
    " Mixed workload "  10344230.62    9775551.50
    "   Random write "   4277700.62    4260019.69
    "         Pwrite "   4302049.12    4313703.88
    "          Pread "   6164463.16    6126536.72
    "         Fwrite "   7131195.00    6952586.00
    "          Fread "  12682602.25   12619207.50

Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
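The compile-time trimming of the stats array can be sketched as follows. `SK_STATS_ENABLED` is a hypothetical stand-in for the config option, and the enum order and all `sk_` names are assumptions made for this example rather than a copy of mm/zsmalloc.c.

```c
#include <assert.h>
#include <stddef.h>

#define SK_STATS_ENABLED 0   /* hypothetical stand-in for the config option */

enum sk_stat_type {
    SK_OBJ_ALLOCATED,
    SK_OBJ_USED,
    SK_CLASS_ALMOST_FULL,
    SK_CLASS_ALMOST_EMPTY,
};

#if SK_STATS_ENABLED
#define SK_NR_STAT_TYPE (SK_CLASS_ALMOST_EMPTY + 1)
#else
#define SK_NR_STAT_TYPE (SK_OBJ_USED + 1)   /* drop the two optional counters */
#endif

struct sk_size_class {
    unsigned long stats[SK_NR_STAT_TYPE];
};

/*
 * With a constant 'type' the bound check is decided at compile time, so
 * calls for trimmed-away counters can be optimized out entirely.
 */
static inline void sk_stat_inc(struct sk_size_class *c, enum sk_stat_type type,
                               unsigned long cnt)
{
    assert(c != NULL);
    if ((int)type < SK_NR_STAT_TYPE)
        c->stats[type] += cnt;
}
```

Callers keep the same call sites in both configurations; only the array size and the reachable branch change.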
-
Committed by Hui Zhu
Signed-off-by: Hui Zhu <zhuhui@xiaomi.com>
Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Committed by Sergey Senozhatsky
We don't let the user disable the shrinker in zsmalloc (once it's been enabled), so there is no need to check ->shrinker_enabled in zs_shrinker_count(), at the moment at least.

Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Committed by Sergey Senozhatsky
A cosmetic change.

Commit c60369f0 ("staging: zsmalloc: prevent mappping in interrupt context") added an in_interrupt() check to zs_map_object() and a 'hardirq.h' include; but the in_interrupt() macro is defined in 'preempt.h', not in 'hardirq.h', so include it instead.

Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Committed by Hui Zhu
In obj_malloc():

    if (!class->huge)
        /* record handle in the header of allocated chunk */
        link->handle = handle;
    else
        /* record handle in first_page->private */
        set_page_private(first_page, handle);

In the hugepage case we save the handle to private directly.

But in obj_to_head():

    if (class->huge) {
        VM_BUG_ON(!is_first_page(page));
        return *(unsigned long *)page_private(page);
    } else
        return *(unsigned long *)obj;

It is used as a pointer.

The reason there has been no problem until now is that a huge-class page is born with ZS_FULL, so it can't be migrated. However, we need this patch for future work: "VM-aware zsmalloced page migration" to reduce external fragmentation.

Signed-off-by: Hui Zhu <zhuhui@xiaomi.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
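The mismatch described above is between storing a handle value and later treating it as an address. A consistent read side could be sketched as follows; the struct and function names are hypothetical, and this is a guess at the shape of the fix rather than the patched kernel code.

```c
#include <assert.h>
#include <stddef.h>

struct sk_page {
    unsigned long private_word;   /* huge class: holds the handle value itself */
};

/*
 * Return the handle for an object. For a huge class the handle was stored
 * directly in the page, so it is returned as a value; otherwise it sits in
 * the header word at the start of the object.
 */
static unsigned long sk_obj_to_head(int huge, const struct sk_page *page,
                                    const unsigned long *obj)
{
    if (huge) {
        assert(page != NULL);
        return page->private_word;   /* a value, not something to dereference */
    }
    assert(obj != NULL);
    return *obj;
}
```

Reading the word back the same way it was written is what keeps the two paths symmetric.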
-
Committed by Hui Zhu
[akpm@linux-foundation.org: fix grammar]
Signed-off-by: Hui Zhu <zhuhui@xiaomi.com>
Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Committed by Sergey Senozhatsky
Constify `struct zs_pool' ->name.

[akpm@linux-foundation.org: constify zpool_create_pool()'s `type' arg also]
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Acked-by: Dan Streetman <ddstreet@ieee.org>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Committed by Dan Streetman
Make the return type of zpool_get_type const; the string belongs to the zpool driver and should not be modified. Remove the redundant type field in struct zpool; it is private to zpool.c and isn't needed since ->driver->type can be used directly. Add comments indicating that strings must be null-terminated.

Signed-off-by: Dan Streetman <ddstreet@ieee.org>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Seth Jennings <sjennings@variantweb.net>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Committed by Dan Streetman
Instead of using a fixed-length string for the zswap params, use charp. This simplifies the code and uses less memory, as most zswap param strings will be shorter than the current maximum length.

Signed-off-by: Dan Streetman <ddstreet@ieee.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Seth Jennings <sjennings@variantweb.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Committed by Alexey Klimov
The entry variable is re-initialized on the next line, so there is no need to initialize it with NULL.

Signed-off-by: Alexey Klimov <alexey.klimov@linaro.org>
Cc: Seth Jennings <sjennings@variantweb.net>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Committed by Andrew Morton
gcc version 5.2.1 20151010 (Debian 5.2.1-22)

$ size mm/memcontrol.o mm/memcontrol.o.before
   text    data     bss     dec     hex filename
  35535    7908      64   43507    a9f3 mm/memcontrol.o
  35762    7908      64   43734    aad6 mm/memcontrol.o.before

Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Committed by Aaron Tomlin
The "vma" parameter to khugepaged_alloc_page() is unused. It has to remain unused or the drop read lock 'mmap_sem' optimisation introduced by commit 8b164568 ("mm, THP: don't hold mmap_sem in khugepaged when allocating THP") wouldn't be safe. So let's remove it.

Signed-off-by: Aaron Tomlin <atomlin@redhat.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Committed by Michal Hocko
Many places use mapping_gfp_mask to restrict a more generic gfp mask for allocations that are not directly related to the page cache but are performed in the same context.

Let's introduce a helper function which makes the restriction explicit and easier to track. This patch doesn't introduce any functional changes.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Michal Hocko <mhocko@suse.com>
Suggested-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
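The restriction described in this message amounts to a bitwise AND of the mapping's mask with the caller's more generic mask. The following is a minimal user-space sketch of that idea, not the kernel's code: the `demo_` names, the `struct demo_mapping` stand-in, and the bit values are all hypothetical and chosen only for illustration.

```c
#include <stdint.h>

typedef uint32_t gfp_t;

/* Hypothetical bit values for illustration only; the kernel's real
 * definitions live in include/linux/gfp.h and differ between versions. */
#define DEMO_GFP_IO      ((gfp_t)0x1u)
#define DEMO_GFP_FS      ((gfp_t)0x2u)
#define DEMO_GFP_RECLAIM ((gfp_t)0x4u)
#define DEMO_GFP_KERNEL  (DEMO_GFP_RECLAIM | DEMO_GFP_IO | DEMO_GFP_FS)

/* Stand-in for a page cache mapping: only the per-mapping gfp mask. */
struct demo_mapping {
    gfp_t gfp_mask;
};

/* Restrict a generic gfp mask to the bits the mapping also allows. */
static gfp_t demo_mapping_gfp_constraint(const struct demo_mapping *mapping,
                                         gfp_t gfp_mask)
{
    return mapping->gfp_mask & gfp_mask;
}
```

With a mapping that forbids the FS bit, passing `DEMO_GFP_KERNEL` through the helper yields a mask without that bit, so the caller no longer has to open-code the AND at every call site.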
-
Committed by Mel Gorman
Andrew stated the following

	We have quite a history of remote parts of the kernel using weird/wrong/inexplicable combinations of __GFP_ flags. I tend to think that this is because we didn't adequately explain the interface.

	And I don't think that gfp.h really improved much in this area as a result of this patchset. Could you go through it some time and decide if we've adequately documented all this stuff?

This patch first moves some GFP flag combinations that are part of the MM internals to mm/internal.h. The rest of the patch documents the __GFP_FOO bits under various headings and then documents the flag combinations. It will not help callers that are brain damaged but the clarity might motivate some fixes and avoid future mistakes.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vitaly Wool <vitalywool@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Committed by Mel Gorman
The primary purpose of watermarks is to ensure that reclaim can always make forward progress in PF_MEMALLOC context (kswapd and direct reclaim). These assume that order-0 allocations are all that is necessary for forward progress.

High-order watermarks serve a different purpose. Kswapd had no high-order awareness before they were introduced (https://lkml.kernel.org/r/413AA7B2.4000907@yahoo.com.au). This was particularly important when there were high-order atomic requests. The watermarks both gave kswapd awareness and made a reserve for those atomic requests.

There are two important side-effects of this. The most important is that a non-atomic high-order request can fail even though free pages are available and the order-0 watermarks are ok. The second is that high-order watermark checks are expensive as the free list counts up to the requested order must be examined.

With the introduction of MIGRATE_HIGHATOMIC it is no longer necessary to have high-order watermarks. Kswapd and compaction still need high-order awareness which is handled by checking that at least one suitable high-order page is free.

With the patch applied, there was little difference in the allocation failure rates as the atomic reserves are small relative to the number of allocation attempts. The expected impact is that there will never be an allocation failure report that shows suitable pages on the free lists.

The one potential side-effect of this is that in a vanilla kernel, the watermark checks may have kept a free page for an atomic allocation. Now, we are 100% relying on the HighAtomic reserves and an early allocation to have allocated them. If the first high-order atomic allocation is after the system is already heavily fragmented then it'll fail.

[akpm@linux-foundation.org: simplify __zone_watermark_ok(), per Vlastimil]
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Vitaly Wool <vitalywool@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
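The check described in this message (an order-0 watermark test, followed for high-order requests by "is at least one suitable high-order page free?") can be sketched in user-space C as follows. This is a simplified illustration rather than the kernel's __zone_watermark_ok(): the `demo_` names and the `struct demo_wm_zone` layout are hypothetical, and details such as lowmem reserves, the HighAtomic reserve accounting, and per-migratetype free lists are deliberately left out.

```c
#include <stdbool.h>

#define DEMO_MAX_ORDER 11 /* assumption: orders 0..10 */

/* Hypothetical, heavily reduced view of a zone. */
struct demo_wm_zone {
    unsigned long free_pages;              /* total free base pages */
    unsigned long nr_free[DEMO_MAX_ORDER]; /* free blocks per order */
};

/* Return true if an allocation of the given order may proceed. */
static bool demo_watermark_ok(const struct demo_wm_zone *z,
                              unsigned int order, unsigned long mark)
{
    unsigned int o;

    /* Order-0 style check: free base pages must exceed the watermark. */
    if (z->free_pages <= mark)
        return false;

    /* Order-0 requests need nothing more. */
    if (order == 0)
        return true;

    /* High-order request: one free block of sufficient order is enough. */
    for (o = order; o < DEMO_MAX_ORDER; o++) {
        if (z->nr_free[o] > 0)
            return true;
    }
    return false;
}
```

In this sketch a high-order request fails only when the order-0 watermark is breached or no block of the requested order or larger is free, which matches the expectation stated above that a failure report should never show suitable pages on the free lists.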
-
Committed by Mel Gorman
High-order watermark checking exists for two reasons -- kswapd high-order awareness and protection for high-order atomic requests. Historically the kernel depended on MIGRATE_RESERVE to preserve min_free_kbytes as high-order free pages for as long as possible. This patch introduces MIGRATE_HIGHATOMIC that reserves pageblocks for high-order atomic allocations on demand and avoids using those blocks for order-0 allocations. This is more flexible and reliable than MIGRATE_RESERVE was.

A MIGRATE_HIGHATOMIC pageblock is created when an atomic high-order allocation request steals a pageblock but limits the total number to 1% of the zone. Callers that speculatively abuse atomic allocations for long-lived high-order allocations to access the reserve will quickly fail. Note that SLUB is currently not such an abuser as it reclaims at least once. It is possible that the pageblock stolen has few suitable high-order pages and will need to steal again in the near future but there would need to be strong justification to search all pageblocks for an ideal candidate.

The pageblocks are unreserved if an allocation fails after a direct reclaim attempt.

The watermark checks account for the reserved pageblocks when the allocation request is not a high-order atomic allocation.

The reserved pageblocks can not be used for order-0 allocations. This may allow temporary wastage until a failed reclaim reassigns the pageblock. This is deliberate as the intent of the reservation is to satisfy a limited number of atomic high-order short-lived requests if the system requires them.

The stutter benchmark was used to evaluate this but while it was running there was a systemtap script that randomly allocated between 1 high-order page and 12.5% of memory's worth of order-3 pages using GFP_ATOMIC. This is much larger than the potential reserve and it does not attempt to be realistic. It is intended to stress random high-order allocations from an unknown source and to show that there is a reduction in failures without introducing an anomaly where atomic allocations are more reliable than regular allocations. The amount of memory reserved varied throughout the workload as reserves were created and reclaimed under memory pressure.

The allocation failures once the workload warmed up were as follows;

4.2-rc5-vanilla		70%
4.2-rc5-atomic-reserve	56%

The failure rate was also measured while building multiple kernels. The failure rate was 14% but is 6% with this patch applied.

Overall, this is a small reduction but the reserves are small relative to the number of allocation requests. In early versions of the patch, the failure rate reduced by a much larger amount but that required much larger reserves and perversely made atomic allocations seem more reliable than regular allocations.

[yalin.wang2010@gmail.com: fix redundant check and a memory leak]
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Vitaly Wool <vitalywool@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: yalin wang <yalin.wang2010@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
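The reservation policy described in this message (reserve a pageblock on demand, cap the reserve at 1% of the zone, release a pageblock when an allocation still fails after direct reclaim) can be illustrated with a small bookkeeping sketch. Everything here is an assumption made for illustration: the `demo_` names, the struct fields, the pageblock size of 512 pages, and the exact form of the limit check are not taken from the patch itself.

```c
#include <stdbool.h>

/* Assumption: 2 MiB pageblocks made of 4 KiB pages. */
#define DEMO_PAGEBLOCK_NR_PAGES 512UL

/* Hypothetical per-zone accounting for the high-order atomic reserve. */
struct demo_ha_zone {
    unsigned long managed_pages;       /* pages managed by the zone */
    unsigned long nr_reserved_pages;   /* pages currently reserved  */
};

/* Try to reserve one more pageblock; refuse if that would exceed 1%. */
static bool demo_reserve_highatomic(struct demo_ha_zone *z)
{
    unsigned long limit = z->managed_pages / 100;

    if (z->nr_reserved_pages + DEMO_PAGEBLOCK_NR_PAGES > limit)
        return false;

    z->nr_reserved_pages += DEMO_PAGEBLOCK_NR_PAGES;
    return true;
}

/* Give one pageblock back, e.g. after a failed direct reclaim attempt. */
static void demo_unreserve_highatomic(struct demo_ha_zone *z)
{
    if (z->nr_reserved_pages >= DEMO_PAGEBLOCK_NR_PAGES)
        z->nr_reserved_pages -= DEMO_PAGEBLOCK_NR_PAGES;
}
```

For a zone of 102400 pages the sketch allows two pageblocks (1024 pages) to be reserved and refuses a third until one is unreserved, which is the "small reserve relative to the number of requests" behaviour the message relies on.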
-
Committed by Mel Gorman
MIGRATE_RESERVE preserves an old property of the buddy allocator that existed prior to fragmentation avoidance -- min_free_kbytes worth of pages tended to remain contiguous until the only alternative was to fail the allocation. At the time it was discovered that high-order atomic allocations relied on this property so MIGRATE_RESERVE was introduced. A later patch will introduce an alternative MIGRATE_HIGHATOMIC so this patch deletes MIGRATE_RESERVE and supporting code so it'll be easier to review. Note that this patch in isolation may look like a false regression if someone was bisecting high-order atomic allocation failures.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Vitaly Wool <vitalywool@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Committed by Mel Gorman
The zonelist cache (zlc) was introduced to skip over zones that were recently known to be full. This avoided expensive operations such as the cpuset checks, watermark calculations and zone_reclaim. The situation today is different and the complexity of zlc is harder to justify.

1) The cpuset checks are no-ops unless a cpuset is active and in general are a lot cheaper.

2) zone_reclaim is now disabled by default and I suspect that was a large source of the cost that zlc wanted to avoid. When it is enabled, it's known to be a major source of stalling when nodes fill up and it's unwise to hit every other user with the overhead.

3) Watermark checks are expensive to calculate for high-order allocation requests. Later patches in this series will reduce the cost of the watermark checking.

4) The most important issue is that in the current implementation it is possible for a failed THP allocation to mark a zone full for order-0 allocations and cause a fallback to remote nodes.

The last issue could be addressed with additional complexity but as the benefit of zlc is questionable, it is better to remove it. If stalls due to zone_reclaim are ever reported then an alternative would be to introduce deferring logic based on a timeout inside zone_reclaim itself and leave the page allocator fast paths alone.

The impact on page-allocator microbenchmarks is negligible as they don't hit the paths where the zlc comes into play. Most page-reclaim related workloads showed no noticeable difference as a result of the removal.

The impact was noticeable in a workload called "stutter". One part uses a lot of anonymous memory, a second measures mmap latency and a third copies a large file. In an ideal world the latency application would not notice the mmap latency. On a 2-node machine the results of this patch are

stutter
                               4.3.0-rc1              4.3.0-rc1
                                baseline               nozlc-v4
Min         mmap      20.9243 (  0.00%)      20.7716 (  0.73%)
1st-qrtle   mmap      22.0612 (  0.00%)      22.0680 ( -0.03%)
2nd-qrtle   mmap      22.3291 (  0.00%)      22.3809 ( -0.23%)
3rd-qrtle   mmap      25.2244 (  0.00%)      25.2396 ( -0.06%)
Max-90%     mmap      48.0995 (  0.00%)      28.3713 ( 41.02%)
Max-93%     mmap      52.5557 (  0.00%)      36.0170 ( 31.47%)
Max-95%     mmap      55.8173 (  0.00%)      47.3163 ( 15.23%)
Max-99%     mmap      67.3781 (  0.00%)      70.1140 ( -4.06%)
Max         mmap   24447.6375 (  0.00%)   12915.1356 ( 47.17%)
Mean        mmap      33.7883 (  0.00%)      27.7944 ( 17.74%)
Best99%Mean mmap      27.7825 (  0.00%)      25.2767 (  9.02%)
Best95%Mean mmap      26.3912 (  0.00%)      23.7994 (  9.82%)
Best90%Mean mmap      24.9886 (  0.00%)      23.2251 (  7.06%)
Best50%Mean mmap      22.0157 (  0.00%)      22.0261 ( -0.05%)
Best10%Mean mmap      21.6705 (  0.00%)      21.6083 (  0.29%)
Best5%Mean  mmap      21.5581 (  0.00%)      21.4611 (  0.45%)
Best1%Mean  mmap      21.3079 (  0.00%)      21.1631 (  0.68%)

Note that the maximum stall latency went from 24 seconds to 12 which is still bad but an improvement. The mileage varies considerably: a 2-node machine on an earlier test went from 494 seconds to 47 seconds and a 4-node machine that tested an earlier version of this patch went from a worst case stall time of 6 seconds to 67ms. The nature of the benchmark is inherently unpredictable as it is hammering the system and the mileage will vary between machines.

There is a secondary impact with potentially more direct reclaim because zones are now being considered instead of being skipped by zlc. In this particular test run it did not occur so will not be described. However, in at least one test the following was observed

1. Direct reclaim rates were higher. This was likely due to direct reclaim being entered instead of the zlc disabling a zone and busy looping. Busy looping may have the effect of allowing kswapd to make more progress and in some cases may be better overall. If this is found then the correct action is to put direct reclaimers to sleep on a waitqueue and allow kswapd to make forward progress. Busy looping on the zlc is even worse than when the allocator used to blindly call congestion_wait().

2. There was higher swap activity as direct reclaim was active.

3. Direct reclaim efficiency was lower. This is related to 1 as more scanning activity also encountered more pages that could not be immediately reclaimed.

In that case, the direct page scan and reclaim rates are noticeable but it is not considered a problem for a few reasons

1. The test is primarily concerned with latency. The mmap attempts are also faulted which means there are THP allocation requests. The ZLC could cause zones to be disabled causing the process to busy loop instead of reclaiming. This looks like elevated direct reclaim activity but it's the correct action to take based on what processes requested.

2. The test hammers reclaim and compaction heavily. The number of successful THP faults is highly variable but affects the reclaim stats. It's not a realistic or reasonable measure of page reclaim activity.

3. No other page-reclaim intensive workload that was tested showed a problem.

4. If a workload is identified that benefitted from the busy looping then it should be fixed by having direct reclaimers sleep on a wait queue until woken by kswapd instead of busy looping. We had this class of problem before when congestion_waits() with a fixed timeout was a brain damaged decision but happened to benefit some workloads.

If a workload is identified that relied on the zlc to busy loop then it should be fixed correctly and have a direct reclaimer sleep on a waitqueue until woken by kswapd.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Christoph Lameter <cl@linux.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vitaly Wool <vitalywool@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Committed by Mel Gorman
__GFP_WAIT was used to signal that the caller was in atomic context and could not sleep. Now it is possible to distinguish between true atomic context and callers that are not willing to sleep. The latter should clear __GFP_DIRECT_RECLAIM so kswapd will still wake. As clearing __GFP_WAIT behaves differently, there is a risk that people will clear the wrong flags. This patch renames __GFP_WAIT to __GFP_RECLAIM to clearly indicate what it does -- setting it allows all reclaim activity, clearing it prevents it.

[akpm@linux-foundation.org: fix build]
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Christoph Lameter <cl@linux.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Vitaly Wool <vitalywool@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
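The difference between clearing only the direct-reclaim bit and clearing the whole reclaim mask can be shown with a small user-space sketch. It is an illustration under stated assumptions, not kernel code: the bit values and `demo_` names are hypothetical, and the kswapd bit (named __GFP_KSWAPD_RECLAIM in the kernel) is not mentioned in the message above and is assumed here as the counterpart of __GFP_DIRECT_RECLAIM.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t gfp_t;

/* Hypothetical bit values for illustration only. */
#define DEMO_GFP_DIRECT_RECLAIM ((gfp_t)0x1u) /* caller may reclaim itself */
#define DEMO_GFP_KSWAPD_RECLAIM ((gfp_t)0x2u) /* caller may wake kswapd    */
#define DEMO_GFP_RECLAIM \
    (DEMO_GFP_DIRECT_RECLAIM | DEMO_GFP_KSWAPD_RECLAIM)

/* True if the allocation is allowed to enter direct reclaim (may sleep). */
static bool demo_may_direct_reclaim(gfp_t flags)
{
    return (flags & DEMO_GFP_DIRECT_RECLAIM) != 0;
}

/* True if the allocation is allowed to wake the background reclaimer. */
static bool demo_may_wake_kswapd(gfp_t flags)
{
    return (flags & DEMO_GFP_KSWAPD_RECLAIM) != 0;
}
```

A caller that merely does not want to sleep would use `flags & ~DEMO_GFP_DIRECT_RECLAIM`, which keeps background reclaim available; clearing `DEMO_GFP_RECLAIM` as a whole disables both, which is the mistake the rename is meant to make harder.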
-