- 09 Oct 2012, 12 commits
-
-
Committed by Mel Gorman

Allocation success rates have been far lower since 3.4 due to commit fe2c2a10 ("vmscan: reclaim at order 0 when compaction is enabled"). This commit was introduced for good reasons and it was known in advance that the success rates would suffer, but it was justified on the grounds that the high allocation success rates were achieved by aggressive reclaim. Success rates are expected to suffer even more in 3.6 due to commit 7db8889a ("mm: have order > 0 compaction start off where it left"), which testing has shown to severely reduce allocation success rates under load, to 0% in one case.

This series aims to improve the allocation success rates without regressing the benefits of commit fe2c2a10. The series is based on latest mmotm and takes into account that the __GFP_NO_KSWAPD flag is going away.

Patch 1 updates a stale comment seeing as I was in the general area.

Patch 2 updates reclaim/compaction to reclaim pages scaled on the number of recent failures.

Patch 3 captures suitable high-order pages freed by compaction to reduce races with parallel allocation requests.

Patch 4 fixes the upstream commit [7db8889a: mm: have order > 0 compaction start off where it left] to enable compaction again.

Patch 5 identifies when compaction is taking too long due to contention and aborts.

STRESS-HIGHALLOC          3.6-rc1-akpm      full-series
Pass 1          36.00 ( 0.00%)     51.00 (15.00%)
Pass 2          42.00 ( 0.00%)     63.00 (21.00%)
while Rested    86.00 ( 0.00%)     86.00 ( 0.00%)

From http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__stress-highalloc-performance-ext3/hydra/comparison.html I know that the allocation success rates in 3.3.6 were 78% in comparison to 36% in the current akpm tree. With the full series applied, the success rates are up to around 51% with some variability in the results. This is not as high a success rate, but it does not reclaim excessively, which is a key point.

MMTests Statistics: vmstat
Page Ins                         3050912     3078892
Page Outs                        8033528     8039096
Swap Ins                               0           0
Swap Outs                              0           0

Note that swap in/out rates remain at 0. In 3.3.6 with 78% success rates there were 71881 pages swapped out.

Direct pages scanned               70942      122976
Kswapd pages scanned             1366300     1520122
Kswapd pages reclaimed           1366214     1484629
Direct pages reclaimed             70936      105716
Kswapd efficiency                    99%         97%
Kswapd velocity                 1072.550    1182.615
Direct efficiency                    99%         85%
Direct velocity                   55.690      95.672

The kswapd velocity changes very little as expected. kswapd velocity is around the 1000 pages/sec mark, whereas in kernel 3.3.6 with the high allocation success rates it was 8140 pages/second. Direct velocity is higher as a result of patch 2 of the series, but this is expected and is acceptable. The direct reclaim and kswapd velocities change very little.

If these get accepted for merging then there is a difficulty in how they should be handled. 7db8889a ("mm: have order > 0 compaction start off where it left") is broken but it is already in 3.6-rc1 and needs to be fixed. However, if just patch 4 from this series is applied then Jim Schutt's workload is known to break again, as his workload also requires patch 5. While it would be preferred to have all these patches in 3.6 to improve compaction in general, it would at least be acceptable if just patches 4 and 5 were merged to 3.6 to fix a known problem without breaking compaction completely. On the face of it, that would force the __GFP_NO_KSWAPD patches to be merged at the same time, but I can do a version of this series with the __GFP_NO_KSWAPD change reverted and then rebase it on top of this series. That might be best overall, because I note that the __GFP_NO_KSWAPD patch should have removed deferred_compaction from page_alloc.c but didn't, and fixing that causes collisions with this series.

This patch:

The comment about order applied when the check was order > PAGE_ALLOC_COSTLY_ORDER, which has not been the case since c5a73c3d ("thp: use compaction for all allocation orders"). Fixing the comment while I'm in the general area.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Committed by Hugh Dickins

People get confused by find_vma_prepare(), because it doesn't care about what it returns in its output args, when its callers won't be interested.

Clarify by passing in the end-of-range address too, and returning failure if any existing vma overlaps the new range, instead of returning an ambiguous vma which most callers then must check. find_vma_links() is a clearer name.

This does revert 2.6.27's dfe195fb ("mm: fix uninitialized variables for find_vma_prepare callers"), but it looks like gcc 4.3.0 was one of those releases too eager to shout about uninitialized variables: only copy_vma() warns with 4.5.1 and 4.7.1, which a BUG on error silences.

[hughd@google.com: fix warning, remove BUG()]
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Benny Halevy <bhalevy@tonian.com>
Acked-by: Hillf Danton <dhillf@gmail.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
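As a rough illustration of the resulting calling convention, a sketch of how a caller would use the helper; the exact types and error handling are assumptions, not a copy of mm/mmap.c:

    struct vm_area_struct *prev;
    struct rb_node **rb_link, *rb_parent;

    /* Fail up front if any existing vma overlaps [addr, addr + len). */
    if (find_vma_links(mm, addr, addr + len, &prev, &rb_link, &rb_parent))
        return -ENOMEM;
    /* Otherwise the output args describe exactly where to link the new vma. */
    vma_link(mm, vma, prev, rb_link, rb_parent);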
-
Committed by Robin Dong

When writing a new file with a 2048-byte buffer, such as write(fd, buffer, 2048), it will call generic_perform_write() twice for every page:

    write_begin
    mark_page_accessed(page)
    write_end

    write_begin
    mark_page_accessed(page)
    write_end

Pages 1-13 will be added to lru-pvecs in write_begin() and will *NOT* be added to active_list even though they have been accessed twice, because they are not PageLRU(page). But when the 14th page comes, all pages in lru-pvecs will be moved to inactive_list (by __lru_cache_add()) in the first write_begin(), so the 14th page *is* PageLRU(page). And after the second write_end() only the 14th page will be in active_list.

In a Hadoop environment we do run into this situation: after writing a file, we find out that only the 14th, 28th, 42nd... pages are in active_list and the others are in inactive_list. Now kswapd works, shrinks the inactive_list, and the file only has the 14th, 28th... pages in memory; the readahead request size is broken down to only 52k (13*4k), and the system's performance falls dramatically.

This problem can also be replayed by the steps below (the machine has 8G memory), as shown in the sketch after this list:

1. dd if=/dev/zero of=/test/file.out bs=1024 count=1048576
2. cat another 7.5G file to /dev/null
3. vmtouch -m 1G -v /test/file.out, it will show:

/test/file.out
[oooooooooooooooooooOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO] 187847/262144

The 'o' means some pages are in memory but some are not.

The solution for this problem is simple: the 14th page should be added to lru_add_pvecs before mark_page_accessed(), just as the other pages are.

[akpm@linux-foundation.org: tweak comment]
[akpm@linux-foundation.org: grab better comment from the v3 patch]
Signed-off-by: Robin Dong <sanbai@taobao.com>
Reviewed-by: Minchan Kim <minchan@kernel.org>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Reviewed-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
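For reference, the access pattern that triggers the behaviour described above can be reproduced with a plain user-space loop; the file name and sizes here are arbitrary:

    /*
     * Writes 2048 bytes at a time, so every 4096-byte page sees two
     * write_begin()/write_end() rounds and two mark_page_accessed() calls.
     */
    #include <fcntl.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        char buffer[2048];
        int fd = open("/test/file.out", O_WRONLY | O_CREAT | O_TRUNC, 0644);

        if (fd < 0)
            return 1;
        memset(buffer, 0, sizeof(buffer));
        for (int i = 0; i < 1024 * 1024; i++)   /* roughly 2GB of data */
            if (write(fd, buffer, sizeof(buffer)) != sizeof(buffer))
                break;
        close(fd);
        return 0;
    }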
-
Committed by Konstantin Khlebnikov

A long time ago, in v2.4, VM_RESERVED kept the swapout process off the VMA; currently it has lost its original meaning but still has some effects:

 | effect                 | alternative flags
-+------------------------+---------------------------------------------
1| account as reserved_vm | VM_IO
2| skip in core dump      | VM_IO, VM_DONTDUMP
3| do not merge or expand | VM_IO, VM_DONTEXPAND, VM_HUGETLB, VM_PFNMAP
4| do not mlock           | VM_IO, VM_DONTEXPAND, VM_HUGETLB, VM_PFNMAP

This patch removes the reserved_vm counter from mm_struct. Seems like nobody cares about it; it is not exported to userspace directly, it only reduces the total_vm shown in proc.

Thus VM_RESERVED can be replaced with VM_IO or the pair VM_DONTEXPAND | VM_DONTDUMP.

remap_pfn_range() and io_remap_pfn_range() set VM_IO | VM_DONTEXPAND | VM_DONTDUMP. remap_vmalloc_range() sets VM_DONTEXPAND | VM_DONTDUMP.

[akpm@linux-foundation.org: drivers/vfio/pci/vfio_pci.c fixup]
Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Carsten Otte <cotte@de.ibm.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Eric Paris <eparis@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Morris <james.l.morris@oracle.com>
Cc: Jason Baron <jbaron@redhat.com>
Cc: Kentaro Takeda <takedakn@nttdata.co.jp>
Cc: Matt Helsley <matthltc@us.ibm.com>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Robert Richter <robert.richter@amd.com>
Cc: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Venkatesh Pallipadi <venki@google.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
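A hedged illustration of the replacement rule above; the driver name and mmap handler are invented for the example, only the flag usage is the point (callers of remap_pfn_range()/io_remap_pfn_range() need nothing extra, since those helpers now set the flags themselves):

    static int mydrv_mmap(struct file *file, struct vm_area_struct *vma)
    {
        /* was: vma->vm_flags |= VM_RESERVED; */
        vma->vm_flags |= VM_IO | VM_DONTEXPAND | VM_DONTDUMP;
        return 0;
    }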
-
Committed by Konstantin Khlebnikov

Rename VM_NODUMP into VM_DONTDUMP: this name matches other negative flags: VM_DONTEXPAND, VM_DONTCOPY. Currently this flag is used only for sys_madvise. The next patch will use it for replacing the outdated flag VM_RESERVED.

Also forbid madvise(MADV_DODUMP) for special kernel mappings VM_SPECIAL (VM_IO | VM_DONTEXPAND | VM_RESERVED | VM_PFNMAP).

Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Carsten Otte <cotte@de.ibm.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Eric Paris <eparis@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Morris <james.l.morris@oracle.com>
Cc: Jason Baron <jbaron@redhat.com>
Cc: Kentaro Takeda <takedakn@nttdata.co.jp>
Cc: Matt Helsley <matthltc@us.ibm.com>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Robert Richter <robert.richter@amd.com>
Cc: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Venkatesh Pallipadi <venki@google.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
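A minimal sketch of the MADV_DODUMP restriction described above, assuming the usual flag-twiddling switch in madvise_behavior(); this is not a verbatim copy of mm/madvise.c:

    case MADV_DONTDUMP:
        new_flags |= VM_DONTDUMP;
        break;
    case MADV_DODUMP:
        /* refuse to re-enable dumping for special kernel mappings */
        if (new_flags & VM_SPECIAL) {
            error = -EINVAL;
            goto out;
        }
        new_flags &= ~VM_DONTDUMP;
        break;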
-
Committed by Konstantin Khlebnikov

Currently the kernel sets mm->exe_file during sys_execve() and then tracks the number of vmas with the VM_EXECUTABLE flag in mm->num_exe_file_vmas; as soon as this counter drops to zero the kernel resets mm->exe_file to NULL. Plus it resets mm->exe_file at the last mmput() when mm->mm_users drops to zero.

A VMA with the VM_EXECUTABLE flag appears after mapping a file with the MAP_EXECUTABLE flag; such vmas can appear only at sys_execve() or after vma splitting, because sys_mmap ignores this flag. Usually the binfmt module sets mm->exe_file and mmaps executable vmas with this file; they hold mm->exe_file while the task is running.

Comment from v2.6.25-6245-g925d1c40 ("procfs task exe symlink"), where all this stuff was introduced:

> The kernel implements readlink of /proc/pid/exe by getting the file from
> the first executable VMA. Then the path to the file is reconstructed and
> reported as the result.
>
> Because of the VMA walk the code is slightly different on nommu systems.
> This patch avoids separate /proc/pid/exe code on nommu systems. Instead of
> walking the VMAs to find the first executable file-backed VMA we store a
> reference to the exec'd file in the mm_struct.
>
> That reference would prevent the filesystem holding the executable file
> from being unmounted even after unmapping the VMAs. So we track the number
> of VM_EXECUTABLE VMAs and drop the new reference when the last one is
> unmapped. This avoids pinning the mounted filesystem.

exe_file's vma accounting is hooked into every file mmap/unmap and vma split/merge just to fix some hypothetical pinning of the fs from umounting by an mm which has already unmapped all its executable files but is still alive. Seems like currently nobody depends on this behaviour. We can try to remove this logic and keep mm->exe_file until the final mmput().

mm->exe_file is still protected with mm->mmap_sem, because we want to change it via the new sys_prctl(PR_SET_MM_EXE_FILE). Also via this syscall a task can change its mm->exe_file and unpin the mountpoint explicitly.

Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Carsten Otte <cotte@de.ibm.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Eric Paris <eparis@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Morris <james.l.morris@oracle.com>
Cc: Jason Baron <jbaron@redhat.com>
Cc: Kentaro Takeda <takedakn@nttdata.co.jp>
Cc: Matt Helsley <matthltc@us.ibm.com>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Robert Richter <robert.richter@amd.com>
Cc: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Venkatesh Pallipadi <venki@google.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
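For reference, the explicit unpinning mentioned above is driven from user space via prctl(); a rough sketch, where the path is arbitrary and the fallback constant values are assumptions taken from the uapi header:

    #include <fcntl.h>
    #include <sys/prctl.h>
    #include <unistd.h>

    #ifndef PR_SET_MM
    #define PR_SET_MM           35
    #endif
    #ifndef PR_SET_MM_EXE_FILE
    #define PR_SET_MM_EXE_FILE  13
    #endif

    int main(void)
    {
        /* Point /proc/self/exe at another binary, dropping the old reference. */
        int fd = open("/usr/bin/true", O_RDONLY);

        if (fd < 0)
            return 1;
        if (prctl(PR_SET_MM, PR_SET_MM_EXE_FILE, (unsigned long)fd, 0, 0))
            return 1;
        close(fd);
        return 0;
    }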
-
Committed by Konstantin Khlebnikov

Move the actual pte filling for non-linear file mappings into the new special vma operation: ->remap_pages().

Filesystems must implement this method to get non-linear mapping support; if a filesystem uses filemap_fault(), then generic_file_remap_pages() can be used. Now device drivers can implement this method and obtain nonlinear vma support.

Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Carsten Otte <cotte@de.ibm.com>
Cc: Chris Metcalf <cmetcalf@tilera.com> #arch/tile
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Eric Paris <eparis@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Morris <james.l.morris@oracle.com>
Cc: Jason Baron <jbaron@redhat.com>
Cc: Kentaro Takeda <takedakn@nttdata.co.jp>
Cc: Matt Helsley <matthltc@us.ibm.com>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Robert Richter <robert.richter@amd.com>
Cc: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Venkatesh Pallipadi <venki@google.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
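A sketch of what a filemap_fault()-based filesystem would wire up under this scheme; the "myfs" names are illustrative, not taken from any real filesystem:

    static const struct vm_operations_struct myfs_file_vm_ops = {
        .fault       = filemap_fault,
        /* non-linear (remap_file_pages) support via the generic helper */
        .remap_pages = generic_file_remap_pages,
    };

    static int myfs_file_mmap(struct file *file, struct vm_area_struct *vma)
    {
        file_accessed(file);
        vma->vm_ops = &myfs_file_vm_ops;
        return 0;
    }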
-
Committed by Konstantin Khlebnikov

Merge VM_INSERTPAGE into VM_MIXEDMAP. VM_MIXEDMAP VMAs can mix pure-pfn ptes, special ptes and normal ptes.

Now copy_page_range() always copies a VM_MIXEDMAP VMA on fork, like VM_PFNMAP. If a driver populates the whole VMA at mmap() time it probably does not expect page faults.

This patch removes the special check from vma_wants_writenotify() which disabled page write tracking for VMAs populated via vm_insert_page(). The BDI below the mapped file should not use dirty accounting; moreover, do_wp_page() can handle this.

vm_insert_page() still marks the vma on first use. Usually it is called from an f_op->mmap() handler under the mm->mmap_sem write-lock, so it is able to change vma->vm_flags. A caller must set VM_MIXEDMAP at mmap time if it wants to call this function from other places, for example from a page-fault handler.

Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Carsten Otte <cotte@de.ibm.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Eric Paris <eparis@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Morris <james.l.morris@oracle.com>
Cc: Jason Baron <jbaron@redhat.com>
Cc: Kentaro Takeda <takedakn@nttdata.co.jp>
Cc: Matt Helsley <matthltc@us.ibm.com>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Robert Richter <robert.richter@amd.com>
Cc: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Venkatesh Pallipadi <venki@google.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
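A hedged sketch of the calling pattern described in the last paragraph; the "mydrv" names and the mydrv_lookup_page() helper are invented for illustration:

    static int mydrv_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
    {
        /* mydrv_lookup_page() is a hypothetical helper returning a kernel page */
        struct page *page = mydrv_lookup_page(vma, vmf->pgoff);

        if (!page)
            return VM_FAULT_SIGBUS;
        if (vm_insert_page(vma, (unsigned long)vmf->virtual_address, page))
            return VM_FAULT_SIGBUS;
        return VM_FAULT_NOPAGE;   /* pte was installed by vm_insert_page() */
    }

    static const struct vm_operations_struct mydrv_vm_ops = {
        .fault = mydrv_fault,
    };

    static int mydrv_mmap(struct file *file, struct vm_area_struct *vma)
    {
        /*
         * VM_MIXEDMAP must be set here, at mmap time, because the later
         * vm_insert_page() calls happen from the fault path where
         * vma->vm_flags can no longer be changed safely.
         */
        vma->vm_flags |= VM_MIXEDMAP;
        vma->vm_ops = &mydrv_vm_ops;
        return 0;
    }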
-
Committed by Konstantin Khlebnikov

Combine several arch-specific vma flags into one.

before patch:

            0x00000200      0x01000000      0x20000000      0x40000000
    x86     VM_NOHUGEPAGE   VM_HUGEPAGE     -               VM_PAT
    powerpc -               -               VM_SAO          -
    parisc  VM_GROWSUP      -               -               -
    ia64    VM_GROWSUP      -               -               -
    nommu   -               VM_MAPPED_COPY  -               -
    others  -               -               -               -

after patch:

            0x00000200      0x01000000      0x20000000      0x40000000
    x86     -               VM_PAT          VM_HUGEPAGE     VM_NOHUGEPAGE
    powerpc -               VM_SAO          -               -
    parisc  -               VM_GROWSUP      -               -
    ia64    -               VM_GROWSUP      -               -
    nommu   -               VM_MAPPED_COPY  -               -
    others  -               VM_ARCH_1       -               -

And voila! One completely free bit.

Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Carsten Otte <cotte@de.ibm.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Eric Paris <eparis@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Morris <james.l.morris@oracle.com>
Cc: Jason Baron <jbaron@redhat.com>
Cc: Kentaro Takeda <takedakn@nttdata.co.jp>
Cc: Matt Helsley <matthltc@us.ibm.com>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Robert Richter <robert.richter@amd.com>
Cc: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Venkatesh Pallipadi <venki@google.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
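A sketch of how the per-arch overlay could be spelled in a header, assuming the bit value from the table above; this is illustrative, not a verbatim copy of include/linux/mm.h:

    #define VM_ARCH_1   0x01000000  /* architecture-specific flag */

    #if defined(CONFIG_X86)
    # define VM_PAT     VM_ARCH_1   /* PAT reserves whole VMA at once (x86) */
    #elif defined(CONFIG_PPC)
    # define VM_SAO     VM_ARCH_1   /* Strong Access Ordering (powerpc) */
    #elif defined(CONFIG_PARISC)
    # define VM_GROWSUP VM_ARCH_1
    #elif defined(CONFIG_IA64)
    # define VM_GROWSUP VM_ARCH_1
    #else
    # define VM_GROWSUP VM_NONE
    #endif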
-
Committed by Konstantin Khlebnikov

Replace the generic vma-flag VM_PFN_AT_MMAP with x86-only VM_PAT.

We can toss the mapping address from remap_pfn_range() into track_pfn_vma_new(), and collect all PAT-related logic together in arch/x86/.

This patch also restores the original frustration-free is_cow_mapping() check in remap_pfn_range(), as it was before commit v2.6.28-rc8-88-g3c8bb73a ("x86: PAT: store vm_pgoff for all linear_over_vma_region mappings - v3").

The is_linear_pfn_mapping() checks can be removed from mm/huge_memory.c, because that case is already handled by VM_PFNMAP in the VM_NO_THP bit-mask.

[suresh.b.siddha@intel.com: Reset the VM_PAT flag as part of untrack_pfn_vma()]
Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: Venkatesh Pallipadi <venki@google.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Carsten Otte <cotte@de.ibm.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Eric Paris <eparis@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: James Morris <james.l.morris@oracle.com>
Cc: Jason Baron <jbaron@redhat.com>
Cc: Kentaro Takeda <takedakn@nttdata.co.jp>
Cc: Matt Helsley <matthltc@us.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Robert Richter <robert.richter@amd.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Venkatesh Pallipadi <venki@google.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Committed by Suresh Siddha

With PAT enabled, vm_insert_pfn() looks up the existing pfn memory attribute and uses it. The expectation is that the driver reserves the memory attributes for the pfn before calling vm_insert_pfn().

remap_pfn_range() (when called for the whole vma) will set up a new attribute (based on the prot argument) for the specified pfn range. This addresses the legacy usage which typically calls remap_pfn_range() with a desired memory attribute. For ranges smaller than the vma size (which is typically not the case), remap_pfn_range() will use the existing memory attribute for the pfn range.

Expose two different APIs for these different behaviors: track_pfn_insert() for tracking the pfn attribute set by vm_insert_pfn(), and track_pfn_remap() for remap_pfn_range().

This cleanup also prepares the ground for the track/untrack pfn vma routines to take over the ownership of setting the PAT-specific vm_flag in the 'vma'.

[khlebnikov@openvz.org: Clear checks in track_pfn_remap()]
[akpm@linux-foundation.org: tweak a few comments]
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Cc: Venkatesh Pallipadi <venki@google.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Carsten Otte <cotte@de.ibm.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Eric Paris <eparis@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: James Morris <james.l.morris@oracle.com>
Cc: Jason Baron <jbaron@redhat.com>
Cc: Kentaro Takeda <takedakn@nttdata.co.jp>
Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
Cc: Matt Helsley <matthltc@us.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Robert Richter <robert.richter@amd.com>
Cc: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
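The split roughly corresponds to prototypes along these lines; this is a sketch of the shape of the interface, not necessarily the exact upstream signatures:

    /* Called from remap_pfn_range(): may reserve an attribute for the range. */
    int track_pfn_remap(struct vm_area_struct *vma, pgprot_t *prot,
                        unsigned long pfn, unsigned long addr,
                        unsigned long size);

    /* Called from vm_insert_pfn(): only looks up and applies the existing one. */
    int track_pfn_insert(struct vm_area_struct *vma, pgprot_t *prot,
                         unsigned long pfn);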
-
Committed by Rik van Riel

When transparent huge pages were introduced, memory compaction and swap storms were an issue, and the kernel had to be careful to not make THP allocations cause pageout or compaction.

Now that we have working compaction deferral, kswapd is smart enough to invoke compaction, and the quadratic behaviour around isolate_free_pages has been fixed, it should be safe to remove __GFP_NO_KSWAPD.

[minchan@kernel.org: Comment fix]
[mgorman@suse.de: Avoid direct reclaim for deferred compaction]
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
- 06 Oct 2012, 1 commit
-
-
Committed by Andi Kleen

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
- 03 Oct 2012, 2 commits
-
-
Committed by Tetsuo Handa

Fix a build failure with CONFIG_DEBUG_SLAB=y && CONFIG_DEBUG_PAGEALLOC=y caused by commit 8a13a4cc ("mm/sl[aou]b: Shrink __kmem_cache_create() parameter lists").

mm/slab.c: In function '__kmem_cache_create':
mm/slab.c:2474: error: 'align' undeclared (first use in this function)
mm/slab.c:2474: error: (Each undeclared identifier is reported only once
mm/slab.c:2474: error: for each function it appears in.)
make[1]: *** [mm/slab.o] Error 1
make: *** [mm] Error 2

Acked-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
-
Committed by Fengguang Wu

Acked-by: Glauber Costa <glommer@parallels.com>
Acked-by: Christoph Lameter <cl@linux.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
-
- 29 Sep 2012, 1 commit
-
-
Committed by Pekka Enberg
This reverts commit 1e5965bf. Ezequiel Garcia has a better fix.
-
- 28 Sep 2012, 1 commit
-
-
Committed by Andrea Arcangeli

Speculative pagecache lookups can elevate the refcount from under us, so avoid the false positive. If the refcount is < 2 we'll be notified by a VM_BUG_ON in put_page_testzero, as there are two put_page(src_page) calls in a row before returning from this function.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Petr Holasek <pholasek@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
- 27 Sep 2012, 4 commits
-
-
Committed by Al Viro

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
-
Committed by Al Viro

Simplifies a bunch of callers...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
-
Committed by Al Viro

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
-
Committed by Al Viro

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
-
- 26 Sep 2012, 3 commits
-
-
Committed by David Rientjes

On Sat, 8 Sep 2012, Ezequiel Garcia wrote:

> @@ -454,15 +455,35 @@ void *__kmalloc_node(size_t size, gfp_t gfp, int node)
> gfp |= __GFP_COMP;
> ret = slob_new_pages(gfp, order, node);
>
> - trace_kmalloc_node(_RET_IP_, ret,
> + trace_kmalloc_node(caller, ret,
> size, PAGE_SIZE << order, gfp, node);
> }
>
> kmemleak_alloc(ret, size, 1, gfp);
> return ret;
> }
> +
> +void *__kmalloc_node(size_t size, gfp_t gfp, int node)
> +{
> + return __do_kmalloc_node(size, gfp, node, _RET_IP_);
> +}
> EXPORT_SYMBOL(__kmalloc_node);
>
> +#ifdef CONFIG_TRACING
> +void *__kmalloc_track_caller(size_t size, gfp_t gfp, unsigned long caller)
> +{
> + return __do_kmalloc_node(size, gfp, NUMA_NO_NODE, caller);
> +}
> +
> +#ifdef CONFIG_NUMA
> +void *__kmalloc_node_track_caller(size_t size, gfp_t gfpflags,
> + int node, unsigned long caller)
> +{
> + return __do_kmalloc_node(size, gfp, node, caller);
> +}
> +#endif

This breaks Pekka's slab/next tree with this:

mm/slob.c: In function '__kmalloc_node_track_caller':
mm/slob.c:488: error: 'gfp' undeclared (first use in this function)
mm/slob.c:488: error: (Each undeclared identifier is reported only once
mm/slob.c:488: error: for each function it appears in.)

mm, slob: fix build breakage in __kmalloc_node_track_caller

"mm, slob: Add support for kmalloc_track_caller()" breaks the build because gfp is undeclared. Fix it.

Acked-by: Ezequiel Garcia <elezegarcia@gmail.com>
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
-
Committed by Ezequiel Garcia

The bug was introduced in commit 4052147c ("mm, slab: Match SLAB and SLUB kmem_cache_alloc_xxx_trace() prototype").

Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Ezequiel Garcia <elezegarcia@gmail.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
-
Committed by Ezequiel Garcia

The bug was introduced by commit 7c0cb9c6 ("mm, slab: Replace 'caller' type, void* -> unsigned long").

Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Ezequiel Garcia <elezegarcia@gmail.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
-
- 25 Sep 2012, 7 commits
-
-
Committed by Ezequiel Garcia

This patch does not fix anything, and its only goal is to enable us to obtain some common code between SLAB and SLUB. Neither behavior nor produced code is affected.

Cc: Christoph Lameter <cl@linux.com>
Signed-off-by: Ezequiel Garcia <elezegarcia@gmail.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
-
Committed by Ezequiel Garcia

This patch does not fix anything and its only goal is to produce common code between SLAB and SLUB.

Cc: Christoph Lameter <cl@linux.com>
Signed-off-by: Ezequiel Garcia <elezegarcia@gmail.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
-
Committed by Ezequiel Garcia

This long (seemingly unnecessary) patch does not fix anything and its only goal is to produce common code between SLAB and SLUB.

Cc: Christoph Lameter <cl@linux.com>
Signed-off-by: Ezequiel Garcia <elezegarcia@gmail.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
-
Committed by Ezequiel Garcia

This allows us to use _RET_IP_ instead of __builtin_return_address(0), thus achieving implementation consistency in all three allocators. Though maybe a nitpick, the real goal behind this patch is to be able to obtain common code between SLAB and SLUB.

Signed-off-by: Ezequiel Garcia <elezegarcia@gmail.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
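For illustration, the change amounts to passing the shared instruction-pointer macro instead of open-coding the builtin; a hedged sketch of the before/after shape (__do_kmalloc() stands in for the allocator's internal helper):

    /* before: each allocator open-codes the builtin (a void * value) */
    void *__kmalloc(size_t size, gfp_t flags)
    {
        return __do_kmalloc(size, flags, __builtin_return_address(0));
    }

    /* after: the shared _RET_IP_ macro (an unsigned long) is used everywhere */
    void *__kmalloc(size_t size, gfp_t flags)
    {
        return __do_kmalloc(size, flags, _RET_IP_);
    }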
-
Committed by Ezequiel Garcia

Currently slob falls back to regular kmalloc for this case. With this patch kmalloc_track_caller() is correctly implemented, thus tracing the specified caller. This is important to accurately trace allocations performed by krealloc, kstrdup, kmemdup, etc.

Signed-off-by: Ezequiel Garcia <elezegarcia@gmail.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
-
Committed by Ezequiel Garcia

This function is seldom used, and can be simply replaced with cachep->size.

Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Ezequiel Garcia <elezegarcia@gmail.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
-
Committed by Ezequiel Garcia

Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Ezequiel Garcia <elezegarcia@gmail.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
-
- 23 Sep 2012, 1 commit
-
-
Committed by Michael Wang

This patch replaces list_for_each_continue_rcu() with list_for_each_entry_continue_rcu() to save a few lines of code and allow removing list_for_each_continue_rcu().

Signed-off-by: Michael Wang <wangyun@linux.vnet.ibm.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
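The conversion looks roughly like the following; the struct, list, and process() names are placeholders rather than the exact kmemleak code, and both "pos" and "it" are assumed to already point at the element to continue after, under rcu_read_lock():

    struct item {
        struct list_head node;
        int value;
    };

    /* before: node-based iterator plus a manual list_entry() per element */
    list_for_each_continue_rcu(pos, &item_list) {
        struct item *it = list_entry(pos, struct item, node);
        process(it);
    }

    /* after: the entry-based iterator does the container_of() itself */
    list_for_each_entry_continue_rcu(it, &item_list, node) {
        process(it);
    }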
-
- 21 Sep 2012, 2 commits
-
-
Committed by Dan Magenheimer

Tmem, as originally specified, assumes that "get" operations performed on persistent pools never flush the page of data out of tmem on a successful get, waiting instead for a flush operation. This is intended to mimic the model of a swap disk, where a disk read is non-destructive. Unlike a disk, however, freeing up the RAM can be valuable.

Over the years that frontswap was in the review process, several reviewers (and notably Hugh Dickins in 2010) pointed out that this would result, at least temporarily, in two copies of the data in RAM: one (compressed for zcache) copy in tmem, and one copy in the swap cache. We wondered if this could be done differently, at least optionally.

This patch allows tmem backends to instruct the frontswap code that this backend performs exclusive gets. Zcache2 already contains hooks to support this feature. Other backends are completely unaffected unless/until they are updated to support this feature.

While it is not clear that exclusive gets are a performance win on all workloads at all times, this small patch allows for experimentation by backends.

P.S. Let's not quibble about the naming of "get" vs "read" vs "load" etc. The naming is currently horribly inconsistent between cleancache and frontswap and existing tmem backends, so it will need to be straightened out as a separate patch. "Get" is used by the tmem architecture spec, existing backends, and all documentation and presentation material, so I am using it in this patch.

Signed-off-by: Dan Magenheimer <dan.magenheimer@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
-
Committed by Zhenzhong Duan

pages_to_unuse is set to 0 to unuse all frontswap pages, but that doesn't happen, since a wrong condition in frontswap_shrink() cancels it.

-v2: Add a comment to explain the return value of __frontswap_shrink(), as suggested by Dan Carpenter, thanks.

Signed-off-by: Zhenzhong Duan <zhenzhong.duan@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
-
- 19 Sep 2012, 2 commits
-
-
Committed by Dave Jones

It doesn't seem worth adding a new taint flag for this, so just re-use the one from 'bad page'.

Acked-by: Christoph Lameter <cl@linux.com> # SLUB
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Dave Jones <davej@redhat.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
-
Committed by Christoph Lameter

On Tue, 11 Sep 2012, Stephen Rothwell wrote:

> After merging the final tree, today's linux-next build (sparc64 defconfig)
> produced this warning:
>
> mm/slab.c:808:13: warning: '__slab_error' defined but not used [-Wunused-function]
>
> Introduced by commit 945cf2b6 ("mm/sl[aou]b: Extract a common
> function for kmem_cache_destroy"). All uses of slab_error() are now
> guarded by DEBUG.

There is no use case left for slab builds without DEBUG.

Signed-off-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
-
- 18 Sep 2012, 4 commits
-
-
Committed by qiuxishi

There may be a bug when registering section info. For example, on my Itanium platform, the pfn range of node0 includes the other nodes, so the other nodes' section info will be double registered, and the memmap's page count will be 3.

node0: start_pfn=0x100,    spanned_pfn=0x20fb00, present_pfn=0x7f8a3 => 0x000100-0x20fc00
node1: start_pfn=0x80000,  spanned_pfn=0x80000,  present_pfn=0x80000 => 0x080000-0x100000
node2: start_pfn=0x100000, spanned_pfn=0x80000,  present_pfn=0x80000 => 0x100000-0x180000
node3: start_pfn=0x180000, spanned_pfn=0x80000,  present_pfn=0x80000 => 0x180000-0x200000

free_all_bootmem_node()
  register_page_bootmem_info_node()
    register_page_bootmem_info_section()

When hot removing memory, we can't free the memmap's pages because page_count() is 2 after put_page_bootmem().

sparse_remove_one_section()
  free_section_usemap()
    free_map_bootmem()
      put_page_bootmem()

[akpm@linux-foundation.org: add code comment]
Signed-off-by: Xishi Qiu <qiuxishi@huawei.com>
Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
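The fix, roughly, is to make register_page_bootmem_info_node() skip sections whose pfns actually belong to another node; a sketch of the section loop under that assumption (not the verbatim upstream hunk):

    for (; pfn < end_pfn; pfn += PAGES_PER_SECTION) {
        /*
         * Some platforms report overlapping node pfn ranges, so a pfn
         * inside this node's span may really live on another node.
         * Only register sections whose pages belong to this node.
         */
        if (pfn_valid(pfn) && pfn_to_nid(pfn) == node)
            register_page_bootmem_info_section(pfn);
    }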
-
Committed by Li Haifeng

The heuristic method for the buddy was introduced by commit 43506fad ("mm/page_alloc.c: simplify calculation of combined index of adjacent buddy lists"). But the page address of the higher page's buddy was wrongly calculated, which causes page_is_buddy() to fail forever. IOW, the heuristic method is effectively disabled by the wrong page address of the higher page's buddy.

Calculating the page address of the higher page's buddy should be based on higher_page, offset by the difference between the index of the higher page's buddy and the index of the higher page.

Signed-off-by: Haifeng Li <omycle@gmail.com>
Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Cc: KyongHo Cho <pullip.cho@samsung.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Johannes Weiner <jweiner@redhat.com>
Cc: <stable@vger.kernel.org> [2.6.38+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
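Concretely, the candidate buddy of the next order must be derived from higher_page, not from page; a hedged sketch of the corrected calculation in __free_one_page(), where the names follow the changelog and the surrounding code may differ slightly from the real source:

    struct page *higher_page, *higher_buddy;

    combined_idx = buddy_idx & page_idx;
    higher_page = page + (combined_idx - page_idx);
    buddy_idx = __find_buddy_index(combined_idx, order + 1);
    /* was (buggy): higher_buddy = page + (buddy_idx - combined_idx); */
    higher_buddy = higher_page + (buddy_idx - combined_idx);
    if (page_is_buddy(higher_page, higher_buddy, order + 1)) {
        list_add_tail(&page->lru,
                      &zone->free_area[order].free_list[migratetype]);
        goto out;
    }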
-
Committed by Joonsoo Kim

get_partial() is currently not checking pfmemalloc_match(), meaning that it is possible for pfmemalloc pages to leak to non-pfmemalloc users. This is a problem in the following situation.

Assume that there is a request from a normal allocation and there are no objects in the per-cpu cache and no node-partial slab. In this case, slab_alloc enters the slow path and new_slab_objects() is called, which may return a PFMEMALLOC page. As the current user is not allowed to access a PFMEMALLOC page, deactivate_slab() is called ([5091b74a: mm: slub: optimise the SLUB fast path to avoid pfmemalloc checks]) and an object is returned from the PFMEMALLOC page.

Next time, when we get another request from a normal allocation, slab_alloc() enters the slow path and calls new_slab_objects(). In new_slab_objects(), we call get_partial() and get a partial slab which was just deactivated but is a pfmemalloc page. We extract one object from it and re-deactivate. "deactivate -> re-get in get_partial -> re-deactivate" occurs repeatedly.

As a result, access to the PFMEMALLOC page is not properly restricted and it can cause a performance degradation due to frequent deactivation.

This patch changes get_partial_node() to take pfmemalloc_match() into account and prevents the "deactivate -> re-get in get_partial()" scenario. Instead, new_slab() is called.

Signed-off-by: Joonsoo Kim <js1304@gmail.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: David Miller <davem@davemloft.net>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
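A minimal sketch of the added check in get_partial_node(), under the assumption that the surrounding loop is SLUB's usual partial-list walk; the acquire_slab() call shape may differ from the real source:

    list_for_each_entry_safe(page, page2, &n->partial, lru) {
        void *t;

        /* don't hand a pfmemalloc slab back to a !pfmemalloc request */
        if (!pfmemalloc_match(page, flags))
            continue;

        t = acquire_slab(s, n, page, object == NULL);
        if (!t)
            break;
        /* existing partial-list handling continues unchanged */
    }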
-
Committed by Joonsoo Kim

In the array cache there is an object at index 0; check it.

Signed-off-by: Joonsoo Kim <js1304@gmail.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: David Miller <davem@davemloft.net>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-