- 07 11月, 2015 9 次提交
-
-
由 Mel Gorman 提交于
The primary purpose of watermarks is to ensure that reclaim can always make forward progress in PF_MEMALLOC context (kswapd and direct reclaim). These assume that order-0 allocations are all that is necessary for forward progress. High-order watermarks serve a different purpose. Kswapd had no high-order awareness before they were introduced (https://lkml.kernel.org/r/413AA7B2.4000907@yahoo.com.au). This was particularly important when there were high-order atomic requests. The watermarks both gave kswapd awareness and made a reserve for those atomic requests. There are two important side-effects of this. The most important is that a non-atomic high-order request can fail even though free pages are available and the order-0 watermarks are ok. The second is that high-order watermark checks are expensive as the free list counts up to the requested order must be examined. With the introduction of MIGRATE_HIGHATOMIC it is no longer necessary to have high-order watermarks. Kswapd and compaction still need high-order awareness which is handled by checking that at least one suitable high-order page is free. With the patch applied, there was little difference in the allocation failure rates as the atomic reserves are small relative to the number of allocation attempts. The expected impact is that there will never be an allocation failure report that shows suitable pages on the free lists. The one potential side-effect of this is that in a vanilla kernel, the watermark checks may have kept a free page for an atomic allocation. Now, we are 100% relying on the HighAtomic reserves and an early allocation to have allocated them. If the first high-order atomic allocation is after the system is already heavily fragmented then it'll fail. [akpm@linux-foundation.org: simplify __zone_watermark_ok(), per Vlastimil] Signed-off-by: NMel Gorman <mgorman@techsingularity.net> Acked-by: NMichal Hocko <mhocko@suse.com> Acked-by: NJohannes Weiner <hannes@cmpxchg.org> Acked-by: NVlastimil Babka <vbabka@suse.cz> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Vitaly Wool <vitalywool@gmail.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Mel Gorman 提交于
High-order watermark checking exists for two reasons -- kswapd high-order awareness and protection for high-order atomic requests. Historically the kernel depended on MIGRATE_RESERVE to preserve min_free_kbytes as high-order free pages for as long as possible. This patch introduces MIGRATE_HIGHATOMIC that reserves pageblocks for high-order atomic allocations on demand and avoids using those blocks for order-0 allocations. This is more flexible and reliable than MIGRATE_RESERVE was. A MIGRATE_HIGHORDER pageblock is created when an atomic high-order allocation request steals a pageblock but limits the total number to 1% of the zone. Callers that speculatively abuse atomic allocations for long-lived high-order allocations to access the reserve will quickly fail. Note that SLUB is currently not such an abuser as it reclaims at least once. It is possible that the pageblock stolen has few suitable high-order pages and will need to steal again in the near future but there would need to be strong justification to search all pageblocks for an ideal candidate. The pageblocks are unreserved if an allocation fails after a direct reclaim attempt. The watermark checks account for the reserved pageblocks when the allocation request is not a high-order atomic allocation. The reserved pageblocks can not be used for order-0 allocations. This may allow temporary wastage until a failed reclaim reassigns the pageblock. This is deliberate as the intent of the reservation is to satisfy a limited number of atomic high-order short-lived requests if the system requires them. The stutter benchmark was used to evaluate this but while it was running there was a systemtap script that randomly allocated between 1 high-order page and 12.5% of memory's worth of order-3 pages using GFP_ATOMIC. This is much larger than the potential reserve and it does not attempt to be realistic. It is intended to stress random high-order allocations from an unknown source, show that there is a reduction in failures without introducing an anomaly where atomic allocations are more reliable than regular allocations. The amount of memory reserved varied throughout the workload as reserves were created and reclaimed under memory pressure. The allocation failures once the workload warmed up were as follows; 4.2-rc5-vanilla 70% 4.2-rc5-atomic-reserve 56% The failure rate was also measured while building multiple kernels. The failure rate was 14% but is 6% with this patch applied. Overall, this is a small reduction but the reserves are small relative to the number of allocation requests. In early versions of the patch, the failure rate reduced by a much larger amount but that required much larger reserves and perversely made atomic allocations seem more reliable than regular allocations. [yalin.wang2010@gmail.com: fix redundant check and a memory leak] Signed-off-by: NMel Gorman <mgorman@techsingularity.net> Acked-by: NVlastimil Babka <vbabka@suse.cz> Acked-by: NMichal Hocko <mhocko@suse.com> Acked-by: NJohannes Weiner <hannes@cmpxchg.org> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Vitaly Wool <vitalywool@gmail.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Nyalin wang <yalin.wang2010@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Mel Gorman 提交于
MIGRATE_RESERVE preserves an old property of the buddy allocator that existed prior to fragmentation avoidance -- min_free_kbytes worth of pages tended to remain contiguous until the only alternative was to fail the allocation. At the time it was discovered that high-order atomic allocations relied on this property so MIGRATE_RESERVE was introduced. A later patch will introduce an alternative MIGRATE_HIGHATOMIC so this patch deletes MIGRATE_RESERVE and supporting code so it'll be easier to review. Note that this patch in isolation may look like a false regression if someone was bisecting high-order atomic allocation failures. Signed-off-by: NMel Gorman <mgorman@techsingularity.net> Acked-by: NVlastimil Babka <vbabka@suse.cz> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Vitaly Wool <vitalywool@gmail.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Mel Gorman 提交于
The zonelist cache (zlc) was introduced to skip over zones that were recently known to be full. This avoided expensive operations such as the cpuset checks, watermark calculations and zone_reclaim. The situation today is different and the complexity of zlc is harder to justify. 1) The cpuset checks are no-ops unless a cpuset is active and in general are a lot cheaper. 2) zone_reclaim is now disabled by default and I suspect that was a large source of the cost that zlc wanted to avoid. When it is enabled, it's known to be a major source of stalling when nodes fill up and it's unwise to hit every other user with the overhead. 3) Watermark checks are expensive to calculate for high-order allocation requests. Later patches in this series will reduce the cost of the watermark checking. 4) The most important issue is that in the current implementation it is possible for a failed THP allocation to mark a zone full for order-0 allocations and cause a fallback to remote nodes. The last issue could be addressed with additional complexity but as the benefit of zlc is questionable, it is better to remove it. If stalls due to zone_reclaim are ever reported then an alternative would be to introduce deferring logic based on a timeout inside zone_reclaim itself and leave the page allocator fast paths alone. The impact on page-allocator microbenchmarks is negligible as they don't hit the paths where the zlc comes into play. Most page-reclaim related workloads showed no noticeable difference as a result of the removal. The impact was noticeable in a workload called "stutter". One part uses a lot of anonymous memory, a second measures mmap latency and a third copies a large file. In an ideal world the latency application would not notice the mmap latency. On a 2-node machine the results of this patch are stutter 4.3.0-rc1 4.3.0-rc1 baseline nozlc-v4 Min mmap 20.9243 ( 0.00%) 20.7716 ( 0.73%) 1st-qrtle mmap 22.0612 ( 0.00%) 22.0680 ( -0.03%) 2nd-qrtle mmap 22.3291 ( 0.00%) 22.3809 ( -0.23%) 3rd-qrtle mmap 25.2244 ( 0.00%) 25.2396 ( -0.06%) Max-90% mmap 48.0995 ( 0.00%) 28.3713 ( 41.02%) Max-93% mmap 52.5557 ( 0.00%) 36.0170 ( 31.47%) Max-95% mmap 55.8173 ( 0.00%) 47.3163 ( 15.23%) Max-99% mmap 67.3781 ( 0.00%) 70.1140 ( -4.06%) Max mmap 24447.6375 ( 0.00%) 12915.1356 ( 47.17%) Mean mmap 33.7883 ( 0.00%) 27.7944 ( 17.74%) Best99%Mean mmap 27.7825 ( 0.00%) 25.2767 ( 9.02%) Best95%Mean mmap 26.3912 ( 0.00%) 23.7994 ( 9.82%) Best90%Mean mmap 24.9886 ( 0.00%) 23.2251 ( 7.06%) Best50%Mean mmap 22.0157 ( 0.00%) 22.0261 ( -0.05%) Best10%Mean mmap 21.6705 ( 0.00%) 21.6083 ( 0.29%) Best5%Mean mmap 21.5581 ( 0.00%) 21.4611 ( 0.45%) Best1%Mean mmap 21.3079 ( 0.00%) 21.1631 ( 0.68%) Note that the maximum stall latency went from 24 seconds to 12 which is still bad but an improvement. The milage varies considerably 2-node machine on an earlier test went from 494 seconds to 47 seconds and a 4-node machine that tested an earlier version of this patch went from a worst case stall time of 6 seconds to 67ms. The nature of the benchmark is inherently unpredictable as it is hammering the system and the milage will vary between machines. There is a secondary impact with potentially more direct reclaim because zones are now being considered instead of being skipped by zlc. In this particular test run it did not occur so will not be described. However, in at least one test the following was observed 1. Direct reclaim rates were higher. This was likely due to direct reclaim being entered instead of the zlc disabling a zone and busy looping. Busy looping may have the effect of allowing kswapd to make more progress and in some cases may be better overall. If this is found then the correct action is to put direct reclaimers to sleep on a waitqueue and allow kswapd make forward progress. Busy looping on the zlc is even worse than when the allocator used to blindly call congestion_wait(). 2. There was higher swap activity as direct reclaim was active. 3. Direct reclaim efficiency was lower. This is related to 1 as more scanning activity also encountered more pages that could not be immediately reclaimed In that case, the direct page scan and reclaim rates are noticeable but it is not considered a problem for a few reasons 1. The test is primarily concerned with latency. The mmap attempts are also faulted which means there are THP allocation requests. The ZLC could cause zones to be disabled causing the process to busy loop instead of reclaiming. This looks like elevated direct reclaim activity but it's the correct action to take based on what processes requested. 2. The test hammers reclaim and compaction heavily. The number of successful THP faults is highly variable but affects the reclaim stats. It's not a realistic or reasonable measure of page reclaim activity. 3. No other page-reclaim intensive workload that was tested showed a problem. 4. If a workload is identified that benefitted from the busy looping then it should be fixed by having direct reclaimers sleep on a wait queue until woken by kswapd instead of busy looping. We had this class of problem before when congestion_waits() with a fixed timeout was a brain damaged decision but happened to benefit some workloads. If a workload is identified that relied on the zlc to busy loop then it should be fixed correctly and have a direct reclaimer sleep on a waitqueue until woken by kswapd. Signed-off-by: NMel Gorman <mgorman@techsingularity.net> Acked-by: NDavid Rientjes <rientjes@google.com> Acked-by: NChristoph Lameter <cl@linux.com> Acked-by: NVlastimil Babka <vbabka@suse.cz> Acked-by: NMichal Hocko <mhocko@suse.com> Acked-by: NJohannes Weiner <hannes@cmpxchg.org> Cc: Vitaly Wool <vitalywool@gmail.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Mel Gorman 提交于
__GFP_WAIT was used to signal that the caller was in atomic context and could not sleep. Now it is possible to distinguish between true atomic context and callers that are not willing to sleep. The latter should clear __GFP_DIRECT_RECLAIM so kswapd will still wake. As clearing __GFP_WAIT behaves differently, there is a risk that people will clear the wrong flags. This patch renames __GFP_WAIT to __GFP_RECLAIM to clearly indicate what it does -- setting it allows all reclaim activity, clearing them prevents it. [akpm@linux-foundation.org: fix build] [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: NMel Gorman <mgorman@techsingularity.net> Acked-by: NMichal Hocko <mhocko@suse.com> Acked-by: NVlastimil Babka <vbabka@suse.cz> Acked-by: NJohannes Weiner <hannes@cmpxchg.org> Cc: Christoph Lameter <cl@linux.com> Acked-by: NDavid Rientjes <rientjes@google.com> Cc: Vitaly Wool <vitalywool@gmail.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Mel Gorman 提交于
mm, page_alloc: distinguish between being unable to sleep, unwilling to sleep and avoiding waking kswapd __GFP_WAIT has been used to identify atomic context in callers that hold spinlocks or are in interrupts. They are expected to be high priority and have access one of two watermarks lower than "min" which can be referred to as the "atomic reserve". __GFP_HIGH users get access to the first lower watermark and can be called the "high priority reserve". Over time, callers had a requirement to not block when fallback options were available. Some have abused __GFP_WAIT leading to a situation where an optimisitic allocation with a fallback option can access atomic reserves. This patch uses __GFP_ATOMIC to identify callers that are truely atomic, cannot sleep and have no alternative. High priority users continue to use __GFP_HIGH. __GFP_DIRECT_RECLAIM identifies callers that can sleep and are willing to enter direct reclaim. __GFP_KSWAPD_RECLAIM to identify callers that want to wake kswapd for background reclaim. __GFP_WAIT is redefined as a caller that is willing to enter direct reclaim and wake kswapd for background reclaim. This patch then converts a number of sites o __GFP_ATOMIC is used by callers that are high priority and have memory pools for those requests. GFP_ATOMIC uses this flag. o Callers that have a limited mempool to guarantee forward progress clear __GFP_DIRECT_RECLAIM but keep __GFP_KSWAPD_RECLAIM. bio allocations fall into this category where kswapd will still be woken but atomic reserves are not used as there is a one-entry mempool to guarantee progress. o Callers that are checking if they are non-blocking should use the helper gfpflags_allow_blocking() where possible. This is because checking for __GFP_WAIT as was done historically now can trigger false positives. Some exceptions like dm-crypt.c exist where the code intent is clearer if __GFP_DIRECT_RECLAIM is used instead of the helper due to flag manipulations. o Callers that built their own GFP flags instead of starting with GFP_KERNEL and friends now also need to specify __GFP_KSWAPD_RECLAIM. The first key hazard to watch out for is callers that removed __GFP_WAIT and was depending on access to atomic reserves for inconspicuous reasons. In some cases it may be appropriate for them to use __GFP_HIGH. The second key hazard is callers that assembled their own combination of GFP flags instead of starting with something like GFP_KERNEL. They may now wish to specify __GFP_KSWAPD_RECLAIM. It's almost certainly harmless if it's missed in most cases as other activity will wake kswapd. Signed-off-by: NMel Gorman <mgorman@techsingularity.net> Acked-by: NVlastimil Babka <vbabka@suse.cz> Acked-by: NMichal Hocko <mhocko@suse.com> Acked-by: NJohannes Weiner <hannes@cmpxchg.org> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Vitaly Wool <vitalywool@gmail.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Mel Gorman 提交于
File-backed pages that will be immediately written are balanced between zones. This heuristic tries to avoid having a single zone filled with recently dirtied pages but the checks are unnecessarily expensive. Move consider_zone_balanced into the alloc_context instead of checking bitmaps multiple times. The patch also gives the parameter a more meaningful name. Signed-off-by: NMel Gorman <mgorman@techsingularity.net> Acked-by: NDavid Rientjes <rientjes@google.com> Acked-by: NMichal Hocko <mhocko@suse.com> Acked-by: NVlastimil Babka <vbabka@suse.cz> Acked-by: NJohannes Weiner <hannes@cmpxchg.org> Cc: Christoph Lameter <cl@linux.com> Cc: Vitaly Wool <vitalywool@gmail.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Mel Gorman 提交于
Overall, the intent of this series is to remove the zonelist cache which was introduced to avoid high overhead in the page allocator. Once this is done, it is necessary to reduce the cost of watermark checks. The series starts with minor micro-optimisations. Next it notes that GFP flags that affect watermark checks are abused. __GFP_WAIT historically identified callers that could not sleep and could access reserves. This was later abused to identify callers that simply prefer to avoid sleeping and have other options. A patch distinguishes between atomic callers, high-priority callers and those that simply wish to avoid sleep. The zonelist cache has been around for a long time but it is of dubious merit with a lot of complexity and some issues that are explained. The most important issue is that a failed THP allocation can cause a zone to be treated as "full". This potentially causes unnecessary stalls, reclaim activity or remote fallbacks. The issues could be fixed but it's not worth it. The series places a small number of other micro-optimisations on top before examining GFP flags watermarks. High-order watermarks enforcement can cause high-order allocations to fail even though pages are free. The watermark checks both protect high-order atomic allocations and make kswapd aware of high-order pages but there is a much better way that can be handled using migrate types. This series uses page grouping by mobility to reserve pageblocks for high-order allocations with the size of the reservation depending on demand. kswapd awareness is maintained by examining the free lists. By patch 12 in this series, there are no high-order watermark checks while preserving the properties that motivated the introduction of the watermark checks. This patch (of 10): No user of zone_watermark_ok_safe() specifies alloc_flags. This patch removes the unnecessary parameter. Signed-off-by: NMel Gorman <mgorman@techsingularity.net> Acked-by: NDavid Rientjes <rientjes@google.com> Acked-by: NVlastimil Babka <vbabka@suse.cz> Acked-by: NMichal Hocko <mhocko@suse.com> Reviewed-by: NChristoph Lameter <cl@linux.com> Acked-by: NJohannes Weiner <hannes@cmpxchg.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Yaowei Bai 提交于
Introduce is_sysrq_oom helper function indicating oom kill triggered by sysrq to improve readability. No functional changes. Signed-off-by: NYaowei Bai <bywxiaobai@163.com> Acked-by: NDavid Rientjes <rientjes@google.com> Acked-by: NMichal Hocko <mhocko@suse.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 06 11月, 2015 31 次提交
-
-
由 Eric B Munson 提交于
The previous patch introduced a flag that specified pages in a VMA should be placed on the unevictable LRU, but they should not be made present when the area is created. This patch adds the ability to set this state via the new mlock system calls. We add MLOCK_ONFAULT for mlock2 and MCL_ONFAULT for mlockall. MLOCK_ONFAULT will set the VM_LOCKONFAULT modifier for VM_LOCKED. MCL_ONFAULT should be used as a modifier to the two other mlockall flags. When used with MCL_CURRENT, all current mappings will be marked with VM_LOCKED | VM_LOCKONFAULT. When used with MCL_FUTURE, the mm->def_flags will be marked with VM_LOCKED | VM_LOCKONFAULT. When used with both MCL_CURRENT and MCL_FUTURE, all current mappings and mm->def_flags will be marked with VM_LOCKED | VM_LOCKONFAULT. Prior to this patch, mlockall() will unconditionally clear the mm->def_flags any time it is called without MCL_FUTURE. This behavior is maintained after adding MCL_ONFAULT. If a call to mlockall(MCL_FUTURE) is followed by mlockall(MCL_CURRENT), the mm->def_flags will be cleared and new VMAs will be unlocked. This remains true with or without MCL_ONFAULT in either mlockall() invocation. munlock() will unconditionally clear both vma flags. munlockall() unconditionally clears for VMA flags on all VMAs and in the mm->def_flags field. Signed-off-by: NEric B Munson <emunson@akamai.com> Acked-by: NMichal Hocko <mhocko@suse.com> Acked-by: NVlastimil Babka <vbabka@suse.cz> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Guenter Roeck <linux@roeck-us.net> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Shuah Khan <shuahkh@osg.samsung.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Eric B Munson 提交于
The cost of faulting in all memory to be locked can be very high when working with large mappings. If only portions of the mapping will be used this can incur a high penalty for locking. For the example of a large file, this is the usage pattern for a large statical language model (probably applies to other statical or graphical models as well). For the security example, any application transacting in data that cannot be swapped out (credit card data, medical records, etc). This patch introduces the ability to request that pages are not pre-faulted, but are placed on the unevictable LRU when they are finally faulted in. The VM_LOCKONFAULT flag will be used together with VM_LOCKED and has no effect when set without VM_LOCKED. Setting the VM_LOCKONFAULT flag for a VMA will cause pages faulted into that VMA to be added to the unevictable LRU when they are faulted or if they are already present, but will not cause any missing pages to be faulted in. Exposing this new lock state means that we cannot overload the meaning of the FOLL_POPULATE flag any longer. Prior to this patch it was used to mean that the VMA for a fault was locked. This means we need the new FOLL_MLOCK flag to communicate the locked state of a VMA. FOLL_POPULATE will now only control if the VMA should be populated and in the case of VM_LOCKONFAULT, it will not be set. Signed-off-by: NEric B Munson <emunson@akamai.com> Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: NVlastimil Babka <vbabka@suse.cz> Cc: Michal Hocko <mhocko@suse.cz> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Guenter Roeck <linux@roeck-us.net> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Shuah Khan <shuahkh@osg.samsung.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Eric B Munson 提交于
With the refactored mlock code, introduce a new system call for mlock. The new call will allow the user to specify what lock states are being added. mlock2 is trivial at the moment, but a follow on patch will add a new mlock state making it useful. Signed-off-by: NEric B Munson <emunson@akamai.com> Acked-by: NMichal Hocko <mhocko@suse.com> Acked-by: NVlastimil Babka <vbabka@suse.cz> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Guenter Roeck <linux@roeck-us.net> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Shuah Khan <shuahkh@osg.samsung.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Eric B Munson 提交于
mlock() allows a user to control page out of program memory, but this comes at the cost of faulting in the entire mapping when it is allocated. For large mappings where the entire area is not necessary this is not ideal. Instead of forcing all locked pages to be present when they are allocated, this set creates a middle ground. Pages are marked to be placed on the unevictable LRU (locked) when they are first used, but they are not faulted in by the mlock call. This series introduces a new mlock() system call that takes a flags argument along with the start address and size. This flags argument gives the caller the ability to request memory be locked in the traditional way, or to be locked after the page is faulted in. A new MCL flag is added to mirror the lock on fault behavior from mlock() in mlockall(). There are two main use cases that this set covers. The first is the security focussed mlock case. A buffer is needed that cannot be written to swap. The maximum size is known, but on average the memory used is significantly less than this maximum. With lock on fault, the buffer is guaranteed to never be paged out without consuming the maximum size every time such a buffer is created. The second use case is focussed on performance. Portions of a large file are needed and we want to keep the used portions in memory once accessed. This is the case for large graphical models where the path through the graph is not known until run time. The entire graph is unlikely to be used in a given invocation, but once a node has been used it needs to stay resident for further processing. Given these constraints we have a number of options. We can potentially waste a large amount of memory by mlocking the entire region (this can also cause a significant stall at startup as the entire file is read in). We can mlock every page as we access them without tracking if the page is already resident but this introduces large overhead for each access. The third option is mapping the entire region with PROT_NONE and using a signal handler for SIGSEGV to mprotect(PROT_READ) and mlock() the needed page. Doing this page at a time adds a significant performance penalty. Batching can be used to mitigate this overhead, but in order to safely avoid trying to mprotect pages outside of the mapping, the boundaries of each mapping to be used in this way must be tracked and available to the signal handler. This is precisely what the mm system in the kernel should already be doing. For mlock(MLOCK_ONFAULT) the user is charged against RLIMIT_MEMLOCK as if mlock(MLOCK_LOCKED) or mmap(MAP_LOCKED) was used, so when the VMA is created not when the pages are faulted in. For mlockall(MCL_ONFAULT) the user is charged as if MCL_FUTURE was used. This decision was made to keep the accounting checks out of the page fault path. To illustrate the benefit of this set I wrote a test program that mmaps a 5 GB file filled with random data and then makes 15,000,000 accesses to random addresses in that mapping. The test program was run 20 times for each setup. Results are reported for two program portions, setup and execution. The setup phase is calling mmap and optionally mlock on the entire region. For most experiments this is trivial, but it highlights the cost of faulting in the entire region. Results are averages across the 20 runs in milliseconds. mmap with mlock(MLOCK_LOCKED) on entire range: Setup avg: 8228.666 Processing avg: 8274.257 mmap with mlock(MLOCK_LOCKED) before each access: Setup avg: 0.113 Processing avg: 90993.552 mmap with PROT_NONE and signal handler and batch size of 1 page: With the default value in max_map_count, this gets ENOMEM as I attempt to change the permissions, after upping the sysctl significantly I get: Setup avg: 0.058 Processing avg: 69488.073 mmap with PROT_NONE and signal handler and batch size of 8 pages: Setup avg: 0.068 Processing avg: 38204.116 mmap with PROT_NONE and signal handler and batch size of 16 pages: Setup avg: 0.044 Processing avg: 29671.180 mmap with mlock(MLOCK_ONFAULT) on entire range: Setup avg: 0.189 Processing avg: 17904.899 The signal handler in the batch cases faulted in memory in two steps to avoid having to know the start and end of the faulting mapping. The first step covers the page that caused the fault as we know that it will be possible to lock. The second step speculatively tries to mlock and mprotect the batch size - 1 pages that follow. There may be a clever way to avoid this without having the program track each mapping to be covered by this handeler in a globally accessible structure, but I could not find it. It should be noted that with a large enough batch size this two step fault handler can still cause the program to crash if it reaches far beyond the end of the mapping. These results show that if the developer knows that a majority of the mapping will be used, it is better to try and fault it in at once, otherwise mlock(MLOCK_ONFAULT) is significantly faster. The performance cost of these patches are minimal on the two benchmarks I have tested (stream and kernbench). The following are the average values across 20 runs of stream and 10 runs of kernbench after a warmup run whose results were discarded. Avg throughput in MB/s from stream using 1000000 element arrays Test 4.2-rc1 4.2-rc1+lock-on-fault Copy: 10,566.5 10,421 Scale: 10,685 10,503.5 Add: 12,044.1 11,814.2 Triad: 12,064.8 11,846.3 Kernbench optimal load 4.2-rc1 4.2-rc1+lock-on-fault Elapsed Time 78.453 78.991 User Time 64.2395 65.2355 System Time 9.7335 9.7085 Context Switches 22211.5 22412.1 Sleeps 14965.3 14956.1 This patch (of 6): Extending the mlock system call is very difficult because it currently does not take a flags argument. A later patch in this set will extend mlock to support a middle ground between pages that are locked and faulted in immediately and unlocked pages. To pave the way for the new system call, the code needs some reorganization so that all the actual entry point handles is checking input and translating to VMA flags. Signed-off-by: NEric B Munson <emunson@akamai.com> Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: NMichal Hocko <mhocko@suse.com> Acked-by: NVlastimil Babka <vbabka@suse.cz> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Guenter Roeck <linux@roeck-us.net> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Shuah Khan <shuahkh@osg.samsung.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Andrey Ryabinin 提交于
Currently we already taint the kernel in some cases. E.g. if we hit some bug in slub memory we call object_err() which will taint the kernel with TAINT_BAD_PAGE flag. But for other kind of bugs kernel left untainted. Always taint with TAINT_BAD_PAGE if kasan found some bug. This is useful for automated testing. Signed-off-by: NAndrey Ryabinin <aryabinin@virtuozzo.com> Cc: Alexander Potapenko <glider@google.com> Reviewed-by: NDmitry Vyukov <dvyukov@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Andrey Ryabinin 提交于
It's recommended to have slub's user tracking enabled with CONFIG_KASAN, because: a) User tracking disables slab merging which improves detecting out-of-bounds accesses. b) User tracking metadata acts as redzone which also improves detecting out-of-bounds accesses. c) User tracking provides additional information about object. This information helps to understand bugs. Currently it is not enabled by default. Besides recompiling the kernel with KASAN and reinstalling it, user also have to change the boot cmdline, which is not very handy. Enable slub user tracking by default with KASAN=y, since there is no good reason to not do this. [akpm@linux-foundation.org: little fixes, per David] Signed-off-by: NAndrey Ryabinin <aryabinin@virtuozzo.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Xishi Qiu 提交于
Use IS_ALIGNED() to determine whether the shadow span two bytes. It generates less code and more readable. Also add some comments in shadow check functions. Signed-off-by: NXishi Qiu <qiuxishi@huawei.com> Acked-by: NAndrey Ryabinin <aryabinin@virtuozzo.com> Cc: Andrey Konovalov <adech.fo@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Wang Long 提交于
The current KASAN code can not find the following out-of-bounds bugs: char *ptr; ptr = kmalloc(8, GFP_KERNEL); memset(ptr+7, 0, 2); the cause of the problem is the type conversion error in *memory_is_poisoned_n* function. So this patch fix that. Signed-off-by: NWang Long <long.wanglong@huawei.com> Acked-by: NAndrey Ryabinin <aryabinin@virtuozzo.com> Cc: Vladimir Murzin <vladimir.murzin@arm.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Andrey Konovalov 提交于
Update the reference to the kasan prototype repository on github, since it was renamed. Signed-off-by: NAndrey Konovalov <andreyknvl@google.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Alexander Potapenko <glider@google.com> Cc: Konstantin Serebryany <kcc@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Andrey Konovalov 提交于
We decided to use KASAN as the short name of the tool and KernelAddressSanitizer as the full one. Update log messages according to that. Signed-off-by: NAndrey Konovalov <andreyknvl@google.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Alexander Potapenko <glider@google.com> Cc: Konstantin Serebryany <kcc@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Andrey Konovalov 提交于
Makes KASAN accurately determine the type of the bad access. If the shadow byte value is in the [0, KASAN_SHADOW_SCALE_SIZE) range we can look at the next shadow byte to determine the type of the access. Signed-off-by: NAndrey Konovalov <andreyknvl@google.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Alexander Potapenko <glider@google.com> Cc: Konstantin Serebryany <kcc@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Andrey Konovalov 提交于
Update the names of the bad access types to better reflect the type of the access that happended and make these error types "literals" that can be used for classification and deduplication in scripts. Signed-off-by: NAndrey Konovalov <andreyknvl@google.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Alexander Potapenko <glider@google.com> Cc: Konstantin Serebryany <kcc@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Andrey Konovalov 提交于
Each access with address lower than kasan_shadow_to_mem(KASAN_SHADOW_START) is reported as user-memory-access. This is not always true, the accessed address might not be in user space. Fix this by reporting such accesses as null-ptr-derefs or wild-memory-accesses. There's another reason for this change. For userspace ASan we have a bunch of systems that analyze error types for the purpose of classification and deduplication. Sooner of later we will write them to KASAN as well. Then clearly and explicitly stated error types will bring value. Signed-off-by: NAndrey Konovalov <andreyknvl@google.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Alexander Potapenko <glider@google.com> Cc: Konstantin Serebryany <kcc@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Aneesh Kumar K.V 提交于
When we end up calling kasan_report in real mode, our shadow mapping for the spinlock variable will show poisoned. This will result in us calling kasan_report_error with lock_report spin lock held. To prevent this disable kasan reporting when we are priting error w.r.t kasan. Signed-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Reviewed-by: NAndrey Ryabinin <ryabinin.a.a@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Aneesh Kumar K.V 提交于
We can't use generic functions like print_hex_dump to access kasan shadow region. This require us to setup another kasan shadow region for the address passed (kasan shadow address). Some architectures won't be able to do that. Hence make a copy of the shadow region row and pass that to generic functions. Signed-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Reviewed-by: NAndrey Ryabinin <ryabinin.a.a@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Aneesh Kumar K.V 提交于
Use is_module_address instead Signed-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Reviewed-by: NAndrey Ryabinin <ryabinin.a.a@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Aneesh Kumar K.V 提交于
The function only disable/enable reporting. In the later patch we will be adding a kasan early enable/disable. Rename kasan_enabled to properly reflect its function. Signed-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Reviewed-by: NAndrey Ryabinin <ryabinin.a.a@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Hugh Dickins 提交于
LKP reports that v4.2 commit afa2db2f ("tmpfs: truncate prealloc blocks past i_size") causes a 14.5% slowdown in the AIM9 creat-clo benchmark. creat-clo does just what you'd expect from the name, and creat's O_TRUNC on 0-length file does indeed get into more overhead now shmem_setattr() tests "0 <= 0" instead of "0 < 0". I'm not sure how much we care, but I think it would not be too VW-like to add in a check for whether any pages (or swap) are allocated: if none are allocated, there's none to remove from the radix_tree. At first I thought that check would be good enough for the unmaps too, but no: we should not skip the unlikely case of unmapping pages beyond the new EOF, which were COWed from holes which have now been reclaimed, leaving none. This gives me an 8.5% speedup: on Haswell instead of LKP's Westmere, and running a debug config before and after: I hope those account for the lesser speedup. And probably someone has a benchmark where a thousand threads keep on stat'ing the same file repeatedly: forestall that report by adjusting v4.3 commit 44a30220 ("shmem: recalculate file inode when fstat") not to take the spinlock in shmem_getattr() when there's no work to do. Signed-off-by: NHugh Dickins <hughd@google.com> Reported-by: NYing Huang <ying.huang@linux.intel.com> Tested-by: NYing Huang <ying.huang@linux.intel.com> Cc: Josef Bacik <jbacik@fb.com> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Michal Hocko 提交于
Commit 424cdc14 ("memcg: convert threshold to bytes") has fixed a regression introduced by 3e32cb2e ("mm: memcontrol: lockless page counters") where thresholds were silently converted to use page units rather than bytes when interpreting the user input. The fix is not complete, though, as properly pointed out by Ben Hutchings during stable backport review. The page count is converted to bytes but unsigned long is used to hold the value which would be obviously not sufficient for 32b systems with more than 4G thresholds. The same applies to usage as taken from mem_cgroup_usage which might overflow. Let's remove this bytes vs. pages internal tracking differences and handle thresholds in page units internally. Chage mem_cgroup_usage() to return the value in page units and revert 424cdc14 because this should be sufficient for the consistent handling. mem_cgroup_read_u64 as the only users of mem_cgroup_usage outside of the threshold handling code is converted to give the proper in bytes result. It is doing that already for page_counter output so this is more consistent as well. The value presented to the userspace is still in bytes units. Fixes: 424cdc14 ("memcg: convert threshold to bytes") Fixes: 3e32cb2e ("mm: memcontrol: lockless page counters") Signed-off-by: NMichal Hocko <mhocko@suse.com> Reported-by: NBen Hutchings <ben@decadent.org.uk> Reviewed-by: NVladimir Davydov <vdavydov@virtuozzo.com> Acked-by: NJohannes Weiner <hannes@cmpxchg.org> Cc: <stable@vger.kernel.org> From: Michal Hocko <mhocko@kernel.org> Subject: memcg-fix-thresholds-for-32b-architectures-fix Cc: Ben Hutchings <ben@decadent.org.uk> Cc: Vladimir Davydov <vdavydov@virtuozzo.com> Cc: Johannes Weiner <hannes@cmpxchg.org> From: Andrew Morton <akpm@linux-foundation.org> Subject: memcg-fix-thresholds-for-32b-architectures-fix-fix don't attempt to inline mem_cgroup_usage() The compiler ignores the inline anwyay. And __always_inlining it adds 600 bytes of goop to the .o file. Cc: Ben Hutchings <ben@decadent.org.uk> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Vladimir Davydov <vdavydov@virtuozzo.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Johannes Weiner 提交于
page_counter_try_charge() currently returns 0 on success and -ENOMEM on failure, which is surprising behavior given the function name. Make it follow the expected pattern of try_stuff() functions that return a boolean true to indicate success, or false for failure. Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org> Acked-by: NMichal Hocko <mhocko@suse.com> Cc: Vladimir Davydov <vdavydov@virtuozzo.com Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Johannes Weiner 提交于
memory.current on the root level doesn't add anything that wouldn't be more accurate and detailed using system statistics. It already doesn't include slabs, and it'll be a pain to keep in sync when further memory types are accounted in the memory controller. Remove it. Note that this applies to the new unified hierarchy interface only. Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org> Acked-by: NMichal Hocko <mhocko@suse.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Dave Hansen 提交于
My recent patch "mm, hugetlb: use memory policy when available" added some bloat to hugetlb.o. This patch aims to get some of the bloat back, especially when NUMA is not in play. It does this with an implicit #ifdef and marking some things static that should have been static in my first patch. It also makes the warnings only VM_WARN_ON()s. They were responsible for a pretty big chunk of the bloat. Doing this gets our NUMA=n text size back to a wee bit _below_ where we started before the original patch. It also shaves a bit of space off the NUMA=y case, but not much. Enforcing the mempolicy definitely takes some text and it's hard to avoid. size(1) output: text data bss dec hex filename 30745 3433 2492 36670 8f3e hugetlb.o.nonuma.baseline 31305 3755 2492 37552 92b0 hugetlb.o.nonuma.patch1 30713 3433 2492 36638 8f1e hugetlb.o.nonuma.patch2 (this patch) 25235 473 41276 66984 105a8 hugetlb.o.numa.baseline 25715 475 41276 67466 1078a hugetlb.o.numa.patch1 25491 473 41276 67240 106a8 hugetlb.o.numa.patch2 (this patch) Signed-off-by: NDave Hansen <dave.hansen@linux.intel.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Dave Hansen 提交于
I have a hugetlbfs user which is never explicitly allocating huge pages with 'nr_hugepages'. They only set 'nr_overcommit_hugepages' and then let the pages be allocated from the buddy allocator at fault time. This works, but they noticed that mbind() was not doing them any good and the pages were being allocated without respect for the policy they specified. The code in question is this: > struct page *alloc_huge_page(struct vm_area_struct *vma, ... > page = dequeue_huge_page_vma(h, vma, addr, avoid_reserve, gbl_chg); > if (!page) { > page = alloc_buddy_huge_page(h, NUMA_NO_NODE); dequeue_huge_page_vma() is smart and will respect the VMA's memory policy. But, it only grabs _existing_ huge pages from the huge page pool. If the pool is empty, we fall back to alloc_buddy_huge_page() which obviously can't do anything with the VMA's policy because it isn't even passed the VMA. Almost everybody preallocates huge pages. That's probably why nobody has ever noticed this. Looking back at the git history, I don't think this _ever_ worked from when alloc_buddy_huge_page() was introduced in 7893d1d5, 8 years ago. The fix is to pass vma/addr down in to the places where we actually call in to the buddy allocator. It's fairly straightforward plumbing. This has been lightly tested. Signed-off-by: NDave Hansen <dave.hansen@linux.intel.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: David Rientjes <rientjes@google.com> Acked-by: NVlastimil Babka <vbabka@suse.cz> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Alexander Kuleshov 提交于
There are no users of the node_hstates array outside of the mm/hugetlb.c. So let's make it static. Signed-off-by: NAlexander Kuleshov <kuleshovmail@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Rasmus Villemoes 提交于
As far as I can tell, strncpy_from_unsafe never returns -EFAULT. ret is the result of a __copy_from_user_inatomic(), which is 0 for success and positive (in this case necessarily 1) for access error - it is never negative. So we were always returning the length of the, possibly truncated, destination string. Signed-off-by: NRasmus Villemoes <linux@rasmusvillemoes.dk> Acked-by: NAlexei Starovoitov <ast@kernel.org> Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Cc: Namhyung Kim <namhyung@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Andrew Morton 提交于
mm/cma.c: In function 'cma_alloc': mm/cma.c:366: warning: 'pfn' may be used uninitialized in this function The patch actually improves the tracing a bit: if alloc_contig_range() fails, tracing will display the offending pfn rather than -1. Cc: Stefan Strogin <stefan.strogin@gmail.com> Cc: Michal Nazarewicz <mpn@google.com> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Laurent Pinchart <laurent.pinchart+renesas@ideasonboard.com> Cc: Thierry Reding <treding@nvidia.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Hugh Dickins 提交于
clear_page_dirty_for_io() has accumulated writeback and memcg subtleties since v2.6.16 first introduced page migration; and the set_page_dirty() which completed its migration of PageDirty, later had to be moderated to __set_page_dirty_nobuffers(); then PageSwapBacked had to skip that too. No actual problems seen with this procedure recently, but if you look into what the clear_page_dirty_for_io(page)+set_page_dirty(newpage) is actually achieving, it turns out to be nothing more than moving the PageDirty flag, and its NR_FILE_DIRTY stat from one zone to another. It would be good to avoid a pile of irrelevant decrementations and incrementations, and improper event counting, and unnecessary descent of the radix_tree under tree_lock (to set the PAGECACHE_TAG_DIRTY which radix_tree_replace_slot() left in place anyway). Do the NR_FILE_DIRTY movement, like the other stats movements, while interrupts still disabled in migrate_page_move_mapping(); and don't even bother if the zone is the same. Do the PageDirty movement there under tree_lock too, where old page is frozen and newpage not yet visible: bearing in mind that as soon as newpage becomes visible in radix_tree, an un-page-locked set_page_dirty() might interfere (or perhaps that's just not possible: anything doing so should already hold an additional reference to the old page, preventing its migration; but play safe). But we do still need to transfer PageDirty in migrate_page_copy(), for those who don't go the mapping route through migrate_page_move_mapping(). Signed-off-by: NHugh Dickins <hughd@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Rik van Riel <riel@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Sasha Levin <sasha.levin@oracle.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Hugh Dickins 提交于
We have had trouble in the past from the way in which page migration's newpage is initialized in dribs and drabs - see commit 8bdd6380 ("mm: fix direct reclaim writeback regression") which proposed a cleanup. We have no actual problem now, but I think the procedure would be clearer (and alternative get_new_page pools safer to implement) if we assert that newpage is not touched until we are sure that it's going to be used - except for taking the trylock on it in __unmap_and_move(). So shift the early initializations from move_to_new_page() into migrate_page_move_mapping(), mapping and NULL-mapping paths. Similarly migrate_huge_page_move_mapping(), but its NULL-mapping path can just be deleted: you cannot reach hugetlbfs_migrate_page() with a NULL mapping. Adjust stages 3 to 8 in the Documentation file accordingly. Signed-off-by: NHugh Dickins <hughd@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Rik van Riel <riel@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Sasha Levin <sasha.levin@oracle.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Hugh Dickins 提交于
Hitherto page migration has avoided using a migration entry for a swapcache page mapped into userspace, apparently for historical reasons. So any page blessed with swapcache would entail a minor fault when it's next touched, which page migration otherwise tries to avoid. Swapcache in an mlocked area is rare, so won't often matter, but still better fixed. Just rearrange the block in try_to_unmap_one(), to handle TTU_MIGRATION before checking PageAnon, that's all (apart from some reindenting). Well, no, that's not quite all: doesn't this by the way fix a soft_dirty bug, that page migration of a file page was forgetting to transfer the soft_dirty bit? Probably not a serious bug: if I understand correctly, soft_dirty afficionados usually have to handle file pages separately anyway; but we publish the bit in /proc/<pid>/pagemap on file mappings as well as anonymous, so page migration ought not to perturb it. Signed-off-by: NHugh Dickins <hughd@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Rik van Riel <riel@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Sasha Levin <sasha.levin@oracle.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: NCyrill Gorcunov <gorcunov@openvz.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Hugh Dickins 提交于
__unmap_and_move() contains a long stale comment on page_get_anon_vma() and PageSwapCache(), with an odd control flow that's hard to follow. Mostly this reflects our confusion about the lifetime of an anon_vma, in the early days of page migration, before we could take a reference to one. Nowadays this seems quite straightforward: cut it all down to essentials. I cannot see the relevance of swapcache here at all, so don't treat it any differently: I believe the old comment reflects in part our anon_vma confusions, and in part the original v2.6.16 page migration technique, which used actual swap to migrate anon instead of swap-like migration entries. Why should a swapcache page not be migrated with the aid of migration entry ptes like everything else? So lose that comment now, and enable migration entries for swapcache in the next patch. Signed-off-by: NHugh Dickins <hughd@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Rik van Riel <riel@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Sasha Levin <sasha.levin@oracle.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Hugh Dickins 提交于
Clean up page migration a little more by calling remove_migration_ptes() from the same level, on success or on failure, from __unmap_and_move() or from unmap_and_move_huge_page(). Don't reset page->mapping of a PageAnon old page in move_to_new_page(), leave that to when the page is freed. Except for here in page migration, it has been an invariant that a PageAnon (bit set in page->mapping) page stays PageAnon until it is freed, and I think we're safer to keep to that. And with the above rearrangement, it's necessary because zap_pte_range() wants to identify whether a migration entry represents a file or an anon page, to update the appropriate rss stats without waiting on it. Signed-off-by: NHugh Dickins <hughd@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Rik van Riel <riel@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Sasha Levin <sasha.levin@oracle.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-