- 15 1月, 2016 40 次提交
-
-
由 Alexander Kuleshov 提交于
Remove rgnbase and rgnsize variables from memblock_overlaps_region(). We use these variables only for passing to the memblock_addrs_overlap() function and that's all. Let's remove them. Signed-off-by: NAlexander Kuleshov <kuleshovmail@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Michal Hocko 提交于
__GFP_NOFAIL is a big hammer used to ensure that the allocation request can never fail. This is a strong requirement and as such it also deserves a special treatment when the system is OOM. The primary problem here is that the allocation request might have come with some locks held and the oom victim might be blocked on the same locks. This is basically an OOM deadlock situation. This patch tries to reduce the risk of such a deadlocks by giving __GFP_NOFAIL allocations a special treatment and let them dive into memory reserves after oom killer invocation. This should help them to make a progress and release resources they are holding. The OOM victim should compensate for the reserves consumption. Signed-off-by: NMichal Hocko <mhocko@suse.com> Suggested-by: NAndrea Arcangeli <aarcange@redhat.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Johannes Weiner <hannes@cmpxchg.org> Acked-by: NDavid Rientjes <rientjes@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Geliang Tang 提交于
Use list_for_each_entry instead of list_for_each + list_entry to simplify the code. Signed-off-by: NGeliang Tang <geliangtang@163.com> Acked-by: NMichal Hocko <mhocko@suse.com> Acked-by: NMel Gorman <mgorman@techsingularity.net> Acked-by: NDavid Rientjes <rientjes@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Geliang Tang 提交于
To make the intention clearer, use list_{first,last}_entry instead of list_entry. Signed-off-by: NGeliang Tang <geliangtang@163.com> Acked-by: NMichal Hocko <mhocko@suse.com> Acked-by: NMel Gorman <mgorman@techsingularity.net> Acked-by: NDavid Rientjes <rientjes@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Mel Gorman 提交于
Commit 0aaa29a5 ("mm, page_alloc: reserve pageblocks for high-order atomic allocations on demand") added an unnecessary and unused parameter to __rmqueue. It was a parameter that was used in an earlier version of the patch and then left behind. This patch cleans it up. Signed-off-by: NMel Gorman <mgorman@techsingularity.net> Acked-by: NVlastimil Babka <vbabka@suse.cz> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Johannes Weiner 提交于
The dirty balance reserve that dirty throttling has to consider is merely memory not available to userspace allocations. There is nothing writeback-specific about it. Generalize the name so that it's reusable outside of that context. Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Acked-by: NMichal Hocko <mhocko@suse.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Michal Hocko 提交于
page_cache_read has been historically using page_cache_alloc_cold to allocate a new page. This means that mapping_gfp_mask is used as the base for the gfp_mask. Many filesystems are setting this mask to GFP_NOFS to prevent from fs recursion issues. page_cache_read is called from the vm_operations_struct::fault() context during the page fault. This context doesn't need the reclaim protection normally. ceph and ocfs2 which call filemap_fault from their fault handlers seem to be OK because they are not taking any fs lock before invoking generic implementation. xfs which takes XFS_MMAPLOCK_SHARED is safe from the reclaim recursion POV because this lock serializes truncate and punch hole with the page faults and it doesn't get involved in the reclaim. There is simply no reason to deliberately use a weaker allocation context when a __GFP_FS | __GFP_IO can be used. The GFP_NOFS protection might be even harmful. There is a push to fail GFP_NOFS allocations rather than loop within allocator indefinitely with a very limited reclaim ability. Once we start failing those requests the OOM killer might be triggered prematurely because the page cache allocation failure is propagated up the page fault path and end up in pagefault_out_of_memory. We cannot play with mapping_gfp_mask directly because that would be racy wrt. parallel page faults and it might interfere with other users who really rely on NOFS semantic from the stored gfp_mask. The mask is also inode proper so it would even be a layering violation. What we can do instead is to push the gfp_mask into struct vm_fault and allow fs layer to overwrite it should the callback need to be called with a different allocation context. Initialize the default to (mapping_gfp_mask | __GFP_FS | __GFP_IO) because this should be safe from the page fault path normally. Why do we care about mapping_gfp_mask at all then? Because this doesn't hold only reclaim protection flags but it also might contain zone and movability restrictions (GFP_DMA32, __GFP_MOVABLE and others) so we have to respect those. Signed-off-by: NMichal Hocko <mhocko@suse.com> Reported-by: NTetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Acked-by: NJan Kara <jack@suse.com> Acked-by: NVlastimil Babka <vbabka@suse.cz> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Mel Gorman <mgorman@suse.de> Cc: Dave Chinner <david@fromorbit.com> Cc: Mark Fasheh <mfasheh@suse.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Yaowei Bai 提交于
sysctl_compaction_handler() is the handler function for compact_memory tunable knob under /proc/sys/vm, add the missing knob name to make this more accurate in comment. No functional change. Signed-off-by: NYaowei Bai <baiyaowei@cmss.chinamobile.com> Acked-by: NVlastimil Babka <vbabka@suse.cz> Acked-by: NMichal Nazarewicz <mina86@mina86.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Daniel Cashman 提交于
Address Space Layout Randomization (ASLR) provides a barrier to exploitation of user-space processes in the presence of security vulnerabilities by making it more difficult to find desired code/data which could help an attack. This is done by adding a random offset to the location of regions in the process address space, with a greater range of potential offset values corresponding to better protection/a larger search-space for brute force, but also to greater potential for fragmentation. The offset added to the mmap_base address, which provides the basis for the majority of the mappings for a process, is set once on process exec in arch_pick_mmap_layout() and is done via hard-coded per-arch values, which reflect, hopefully, the best compromise for all systems. The trade-off between increased entropy in the offset value generation and the corresponding increased variability in address space fragmentation is not absolute, however, and some platforms may tolerate higher amounts of entropy. This patch introduces both new Kconfig values and a sysctl interface which may be used to change the amount of entropy used for offset generation on a system. The direct motivation for this change was in response to the libstagefright vulnerabilities that affected Android, specifically to information provided by Google's project zero at: http://googleprojectzero.blogspot.com/2015/09/stagefrightened.html The attack presented therein, by Google's project zero, specifically targeted the limited randomness used to generate the offset added to the mmap_base address in order to craft a brute-force-based attack. Concretely, the attack was against the mediaserver process, which was limited to respawning every 5 seconds, on an arm device. The hard-coded 8 bits used resulted in an average expected success rate of defeating the mmap ASLR after just over 10 minutes (128 tries at 5 seconds a piece). With this patch, and an accompanying increase in the entropy value to 16 bits, the same attack would take an average expected time of over 45 hours (32768 tries), which makes it both less feasible and more likely to be noticed. The introduced Kconfig and sysctl options are limited by per-arch minimum and maximum values, the minimum of which was chosen to match the current hard-coded value and the maximum of which was chosen so as to give the greatest flexibility without generating an invalid mmap_base address, generally a 3-4 bits less than the number of bits in the user-space accessible virtual address space. When decided whether or not to change the default value, a system developer should consider that mmap_base address could be placed anywhere up to 2^(value) bits away from the non-randomized location, which would introduce variable-sized areas above and below the mmap_base address such that the maximum vm_area_struct size may be reduced, preventing very large allocations. This patch (of 4): ASLR only uses as few as 8 bits to generate the random offset for the mmap base address on 32 bit architectures. This value was chosen to prevent a poorly chosen value from dividing the address space in such a way as to prevent large allocations. This may not be an issue on all platforms. Allow the specification of a minimum number of bits so that platforms desiring greater ASLR protection may determine where to place the trade-off. Signed-off-by: NDaniel Cashman <dcashman@google.com> Cc: Russell King <linux@arm.linux.org.uk> Acked-by: NKees Cook <keescook@chromium.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Don Zickus <dzickus@redhat.com> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Heinrich Schuchardt <xypron.glpk@gmx.de> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: David Rientjes <rientjes@google.com> Cc: Mark Salyzyn <salyzyn@android.com> Cc: Jeff Vander Stoep <jeffv@google.com> Cc: Nick Kralevich <nnk@google.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will.deacon@arm.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Hector Marco-Gisbert <hecmargi@upv.es> Cc: Borislav Petkov <bp@suse.de> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Piotr Kwapulinski 提交于
The following flag comparison in mmap_region makes no sense: if (!(vm_flags & MAP_FIXED)) return -ENOMEM; The condition is always false and thus the above "return -ENOMEM" is never executed. The vm_flags must not be compared with MAP_FIXED flag. The vm_flags may only be compared with VM_* flags. MAP_FIXED has the same value as VM_MAYREAD. Hitting the rlimit is a slow path and find_vma_intersection should realize that there is no overlapping VMA for !MAP_FIXED case pretty quickly. Signed-off-by: NPiotr Kwapulinski <kwapulinski.piotr@gmail.com> Acked-by: NMichal Hocko <mhocko@suse.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Chris Metcalf <cmetcalf@ezchip.com> Reviewed-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Michal Hocko 提交于
zone_reclaimable_pages counts how many pages are reclaimable in the given zone. This currently includes all pages on file lrus and anon lrus if there is an available swap storage. We do not consider NR_ISOLATED_{ANON,FILE} counters though which is not correct because these counters reflect temporarily isolated pages which are still reclaimable because they either get back to their LRU or get freed either by the page reclaim or page migration. The number of these pages might be sufficiently high to confuse users of zone_reclaimable_pages (e.g. mbind can migrate large ranges of memory at once). Signed-off-by: NMichal Hocko <mhocko@suse.com> Suggested-by: NJohannes Weiner <hannes@cmpxchg.org> Acked-by: NJohannes Weiner <hannes@cmpxchg.org> Reviewed-by: NVladimir Davydov <vdavydov@virtuozzo.com> Acked-by: NDavid Rientjes <rientjes@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Mel Gorman <mgorman@suse.de> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Vladimir Davydov 提交于
There are two bits defined for cg_proto->flags - MEMCG_SOCK_ACTIVATED and MEMCG_SOCK_ACTIVE - both are set in tcp_update_limit, but the former is never cleared while the latter can be cleared by unsetting the limit. This allows to disable tcp socket accounting for new sockets after it was enabled by writing -1 to memory.kmem.tcp.limit_in_bytes while still guaranteeing that memcg_socket_limit_enabled static key will be decremented on memcg destruction. This functionality looks dubious, because it is not clear what a use case would be. By enabling tcp accounting a user accepts the price. If they then find the performance degradation unacceptable, they can always restart their workload with tcp accounting disabled. It does not seem there is any need to flip it while the workload is running. Besides, it contradicts to how kmem accounting API works: writing whatever to memory.kmem.limit_in_bytes enables kmem accounting for the cgroup in question, after which it cannot be disabled. Therefore one might expect that writing -1 to memory.kmem.tcp.limit_in_bytes just enables socket accounting w/o limiting it, which might be useful by itself, but it isn't true. Since this API peculiarity is not documented anywhere, I propose to drop it. This will allow to simplify the code by dropping cg_proto->flags. Signed-off-by: NVladimir Davydov <vdavydov@virtuozzo.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Vladimir Davydov 提交于
We assume there is enough inactive page cache if the size of inactive file lru is greater than the size of active file lru, in which case we force-scan file lru ignoring anonymous pages. While this logic works fine when there are plenty of page cache pages, it fails if the size of file lru is small (several MB): in this case (lru_size >> prio) will be 0 for normal scan priorities, as a result, if inactive file lru happens to be larger than active file lru, anonymous pages of a cgroup will never get evicted unless the system experiences severe memory pressure, even if there are gigabytes of unused anonymous memory there, which is unfair in respect to other cgroups, whose workloads might be page cache oriented. This patch attempts to fix this by elaborating the "enough inactive page cache" check: it makes it not only check that inactive lru size > active lru size, but also that we will scan something from the cgroup at the current scan priority. If these conditions do not hold, we proceed to SCAN_FRACT as usual. Signed-off-by: NVladimir Davydov <vdavydov@virtuozzo.com> Acked-by: NJohannes Weiner <hannes@cmpxchg.org> Acked-by: NMichal Hocko <mhocko@suse.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 David Rientjes 提交于
VM_VPAGES is unnecessary, it's easier to check is_vmalloc_addr() when reading /proc/vmallocinfo. [akpm@linux-foundation.org: remove VM_VPAGES reference via kvfree()] Signed-off-by: NDavid Rientjes <rientjes@google.com> Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Geliang Tang 提交于
Simplify the code with list_first_entry_or_null(). Signed-off-by: NGeliang Tang <geliangtang@163.com> Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Jerome Marchand 提交于
Currently looking at /proc/<pid>/status or statm, there is no way to distinguish shmem pages from pages mapped to a regular file (shmem pages are mapped to /dev/zero), even though their implication in actual memory use is quite different. The internal accounting currently counts shmem pages together with regular files. As a preparation to extend the userspace interfaces, this patch adds MM_SHMEMPAGES counter to mm_rss_stat to account for shmem pages separately from MM_FILEPAGES. The next patch will expose it to userspace - this patch doesn't change the exported values yet, by adding up MM_SHMEMPAGES to MM_FILEPAGES at places where MM_FILEPAGES was used before. The only user-visible change after this patch is the OOM killer message that separates the reported "shmem-rss" from "file-rss". [vbabka@suse.cz: forward-porting, tweak changelog] Signed-off-by: NJerome Marchand <jmarchan@redhat.com> Signed-off-by: NVlastimil Babka <vbabka@suse.cz> Acked-by: NKonstantin Khlebnikov <khlebnikov@yandex-team.ru> Acked-by: NMichal Hocko <mhocko@suse.com> Acked-by: NHugh Dickins <hughd@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Vlastimil Babka 提交于
Following the previous patch, further reduction of /proc/pid/smaps cost is possible for private writable shmem mappings with unpopulated areas where the page walk invokes the .pte_hole function. We can use radix tree iterator for each such area instead of calling find_get_entry() in a loop. This is possible at the extra maintenance cost of introducing another shmem function shmem_partial_swap_usage(). To demonstrate the diference, I have measured this on a process that creates a private writable 2GB mapping of a partially swapped out /dev/shm/file (which cannot employ the optimizations from the prvious patch) and doesn't populate it at all. I time how long does it take to cat /proc/pid/smaps of this process 100 times. Before this patch: real 0m3.831s user 0m0.180s sys 0m3.212s After this patch: real 0m1.176s user 0m0.180s sys 0m0.684s The time is similar to the case where a radix tree iterator is employed on the whole mapping. Signed-off-by: NVlastimil Babka <vbabka@suse.cz> Cc: Hugh Dickins <hughd@google.com> Cc: Jerome Marchand <jmarchan@redhat.com> Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Acked-by: NMichal Hocko <mhocko@suse.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Vlastimil Babka 提交于
The previous patch has improved swap accounting for shmem mapping, which however made /proc/pid/smaps more expensive for shmem mappings, as we consult the radix tree for each pte_none entry, so the overal complexity is O(n*log(n)). We can reduce this significantly for mappings that cannot contain COWed pages, because then we can either use the statistics tha shmem object itself tracks (if the mapping contains the whole object, or the swap usage of the whole object is zero), or use the radix tree iterator, which is much more effective than repeated find_get_entry() calls. This patch therefore introduces a function shmem_swap_usage(vma) and makes /proc/pid/smaps use it when possible. Only for writable private mappings of shmem objects (i.e. tmpfs files) with the shmem object itself (partially) swapped outwe have to resort to the find_get_entry() approach. Hopefully such mappings are relatively uncommon. To demonstrate the diference, I have measured this on a process that creates a 2GB mapping and dirties single pages with a stride of 2MB, and time how long does it take to cat /proc/pid/smaps of this process 100 times. Private writable mapping of a /dev/shm/file (the most complex case): real 0m3.831s user 0m0.180s sys 0m3.212s Shared mapping of an almost full mapping of a partially swapped /dev/shm/file (which needs to employ the radix tree iterator). real 0m1.351s user 0m0.096s sys 0m0.768s Same, but with /dev/shm/file not swapped (so no radix tree walk needed) real 0m0.935s user 0m0.128s sys 0m0.344s Private anonymous mapping: real 0m0.949s user 0m0.116s sys 0m0.348s The cost is now much closer to the private anonymous mapping case, unless the shmem mapping is private and writable. Signed-off-by: NVlastimil Babka <vbabka@suse.cz> Cc: Hugh Dickins <hughd@google.com> Cc: Jerome Marchand <jmarchan@redhat.com> Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Acked-by: NMichal Hocko <mhocko@suse.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Yaowei Bai 提交于
Make memmap_valid_within return bool due to this particular function only using either one or zero as its return value. No functional change. Signed-off-by: NYaowei Bai <baiyaowei@cmss.chinamobile.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Geliang Tang 提交于
To make the intention clearer, use list_{next,first}_entry instead of list_entry. Signed-off-by: NGeliang Tang <geliangtang@163.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Michal Hocko 提交于
__alloc_pages_slowpath is looping over ALLOC_NO_WATERMARKS requests if __GFP_NOFAIL is requested. This is fragile because we are basically relying on somebody else to make the reclaim (be it the direct reclaim or OOM killer) for us. The caller might be holding resources (e.g. locks) which block other other reclaimers from making any progress for example. Remove the retry loop and rely on __alloc_pages_slowpath to invoke all allowed reclaim steps and retry logic. We have to be careful about __GFP_NOFAIL allocations from the PF_MEMALLOC context even though this is a very bad idea to begin with because no progress can be gurateed at all. We shouldn't break the __GFP_NOFAIL semantic here though. It could be argued that this is essentially GFP_NOWAIT context which we do not support but PF_MEMALLOC is much harder to check for existing users because they might happen deep down the code path performed much later after setting the flag so we cannot really rule out there is no kernel path triggering this combination. Signed-off-by: NMichal Hocko <mhocko@suse.com> Acked-by: NMel Gorman <mgorman@suse.de> Acked-by: NDavid Rientjes <rientjes@google.com> Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Acked-by: NVlastimil Babka <vbabka@suse.cz> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Michal Hocko 提交于
__alloc_pages_high_priority doesn't do anything special other than it calls get_page_from_freelist and loops around GFP_NOFAIL allocation until it succeeds. It would be better if the first part was done in __alloc_pages_slowpath where we modify the zonelist because this would be easier to read and understand. Opencoding the function into its only caller allows to simplify it a bit as well. This patch doesn't introduce any functional changes. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: NMichal Hocko <mhocko@suse.com> Acked-by: NMel Gorman <mgorman@suse.de> Acked-by: NDavid Rientjes <rientjes@google.com> Acked-by: NVlastimil Babka <vbabka@suse.cz> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Yaowei Bai 提交于
Hardcoding index to zonelists array in gfp_zonelist() is not a good idea, let's enumerate it to improve readability. No functional change. [akpm@linux-foundation.org: coding-style fixes] [akpm@linux-foundation.org: fix CONFIG_NUMA=n build] [n-horiguchi@ah.jp.nec.com: fix warning in comparing enumerator] Signed-off-by: NYaowei Bai <baiyaowei@cmss.chinamobile.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: David Rientjes <rientjes@google.com> Signed-off-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Yaowei Bai 提交于
Make memblock_is_memory() and memblock_is_reserved return bool to improve readability due to these particular functions only using either one or zero as their return value. No functional change. Signed-off-by: NYaowei Bai <baiyaowei@cmss.chinamobile.com> Acked-by: NMichal Hocko <mhocko@suse.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 yalin wang 提交于
Move node_id zone_idx shrink flags into trace function, so thay we don't need caculate these args if the trace is disabled, and will make this function have less arguments. Signed-off-by: Nyalin wang <yalin.wang2010@gmail.com> Reviewed-by: NSteven Rostedt <rostedt@goodmis.org> Acked-by: NVlastimil Babka <vbabka@suse.cz> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Joonsoo Kim 提交于
Now, we have tracepoint in test_pages_isolated() to notify pfn which cannot be isolated. But, in alloc_contig_range(), some error path doesn't call test_pages_isolated() so it's still hard to know exact pfn that causes allocation failure. This patch change this situation by calling test_pages_isolated() in almost error path. In allocation failure case, some overhead is added by this change, but, allocation failure is really rare event so it would not matter. In fatal signal pending case, we don't call test_pages_isolated() because this failure is intentional one. There was a bogus outer_start problem due to unchecked buddy order and this patch also fix it. Before this patch, it didn't matter, because end result is same thing. But, after this patch, tracepoint will report failed pfn so it should be accurate. Signed-off-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: NVlastimil Babka <vbabka@suse.cz> Acked-by: NMichal Nazarewicz <mina86@mina86.com> Cc: David Rientjes <rientjes@google.com> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Joonsoo Kim 提交于
cma allocation should be guranteeded to succeed. But sometimes it can fail in the current implementation. To track down the problem, we need to know which page is problematic and this new tracepoint will report it. Signed-off-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: NMichal Nazarewicz <mina86@mina86.com> Acked-by: NDavid Rientjes <rientjes@google.com> Cc: Minchan Kim <minchan@kernel.org> Acked-by: NVlastimil Babka <vbabka@suse.cz> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Joonsoo Kim 提交于
This is preparation step to report test failed pfn in new tracepoint to analyze cma allocation failure problem. There is no functional change in this patch. Signed-off-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: NDavid Rientjes <rientjes@google.com> Acked-by: NMichal Nazarewicz <mina86@mina86.com> Cc: Minchan Kim <minchan@kernel.org> Acked-by: NVlastimil Babka <vbabka@suse.cz> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Nathan Zimmer 提交于
When running the SPECint_rate gcc on some very large boxes it was noticed that the system was spending lots of time in mpol_shared_policy_lookup(). The gamess benchmark can also show it and is what I mostly used to chase down the issue since the setup for that I found to be easier. To be clear the binaries were on tmpfs because of disk I/O requirements. We then used text replication to avoid icache misses and having all the copies banging on the memory where the instruction code resides. This results in us hitting a bottleneck in mpol_shared_policy_lookup() since lookup is serialised by the shared_policy lock. I have only reproduced this on very large (3k+ cores) boxes. The problem starts showing up at just a few hundred ranks getting worse until it threatens to livelock once it gets large enough. For example on the gamess benchmark at 128 ranks this area consumes only ~1% of time, at 512 ranks it consumes nearly 13%, and at 2k ranks it is over 90%. To alleviate the contention in this area I converted the spinlock to an rwlock. This allows a large number of lookups to happen simultaneously. The results were quite good reducing this consumtion at max ranks to around 2%. [akpm@linux-foundation.org: tidy up code comments] Signed-off-by: NNathan Zimmer <nzimmer@sgi.com> Acked-by: NDavid Rientjes <rientjes@google.com> Acked-by: NVlastimil Babka <vbabka@suse.cz> Cc: Nadia Yvette Chambers <nyc@holomorphy.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Mel Gorman <mgorman@suse.de> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 yalin wang 提交于
Move trace_reclaim_flags() into trace function, so that we don't need caculate these flags if the trace is disabled. Signed-off-by: Nyalin wang <yalin.wang2010@gmail.com> Reviewed-by: NSteven Rostedt <rostedt@goodmis.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Chen Gang 提交于
Simplify may_expand_vm(). [akpm@linux-foundation.org: further simplification, per Naoya Horiguchi] Signed-off-by: NChen Gang <gang.chen.5i5j@gmail.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Alexey Klimov 提交于
Before usage page pointer initialized by NULL is reinitialized by follow_page_mask(). Drop useless init of page pointer in the beginning of loop. Signed-off-by: NAlexey Klimov <klimov.linux@gmail.com> Acked-by: NVlastimil Babka <vbabka@suse.cz> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Vladimir Davydov 提交于
Mark those kmem allocations that are known to be easily triggered from userspace as __GFP_ACCOUNT/SLAB_ACCOUNT, which makes them accounted to memcg. For the list, see below: - threadinfo - task_struct - task_delay_info - pid - cred - mm_struct - vm_area_struct and vm_region (nommu) - anon_vma and anon_vma_chain - signal_struct - sighand_struct - fs_struct - files_struct - fdtable and fdtable->full_fds_bits - dentry and external_name - inode for all filesystems. This is the most tedious part, because most filesystems overwrite the alloc_inode method. The list is far from complete, so feel free to add more objects. Nevertheless, it should be close to "account everything" approach and keep most workloads within bounds. Malevolent users will be able to breach the limit, but this was possible even with the former "account everything" approach (simply because it did not account everything in fact). [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: NVladimir Davydov <vdavydov@virtuozzo.com> Acked-by: NJohannes Weiner <hannes@cmpxchg.org> Acked-by: NMichal Hocko <mhocko@suse.com> Cc: Tejun Heo <tj@kernel.org> Cc: Greg Thelen <gthelen@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Vladimir Davydov 提交于
Make vmalloc family functions allocate vmalloc area pages with alloc_kmem_pages so that if __GFP_ACCOUNT is set they will be accounted to memcg. This is needed, at least, to account alloc_fdmem allocations. Signed-off-by: NVladimir Davydov <vdavydov@virtuozzo.com> Acked-by: NJohannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Tejun Heo <tj@kernel.org> Cc: Greg Thelen <gthelen@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Vladimir Davydov 提交于
Currently, if we want to account all objects of a particular kmem cache, we have to pass __GFP_ACCOUNT to each kmem_cache_alloc call, which is inconvenient. This patch introduces SLAB_ACCOUNT flag which if passed to kmem_cache_create will force accounting for every allocation from this cache even if __GFP_ACCOUNT is not passed. This patch does not make any of the existing caches use this flag - it will be done later in the series. Note, a cache with SLAB_ACCOUNT cannot be merged with a cache w/o SLAB_ACCOUNT, because merged caches share the same kmem_cache struct and hence cannot have different sets of SLAB_* flags. Thus using this flag will probably reduce the number of merged slabs even if kmem accounting is not used (only compiled in). Signed-off-by: NVladimir Davydov <vdavydov@virtuozzo.com> Suggested-by: NTejun Heo <tj@kernel.org> Acked-by: NJohannes Weiner <hannes@cmpxchg.org> Acked-by: NMichal Hocko <mhocko@suse.com> Cc: Greg Thelen <gthelen@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Vladimir Davydov 提交于
Black-list kmem accounting policy (aka __GFP_NOACCOUNT) turned out to be fragile and difficult to maintain, because there seem to be many more allocations that should not be accounted than those that should be. Besides, false accounting an allocation might result in much worse consequences than not accounting at all, namely increased memory consumption due to pinned dead kmem caches. So this patch switches kmem accounting to the white-policy: now only those kmem allocations that are marked as __GFP_ACCOUNT are accounted to memcg. Currently, no kmem allocations are marked like this. The following patches will mark several kmem allocations that are known to be easily triggered from userspace and therefore should be accounted to memcg. Signed-off-by: NVladimir Davydov <vdavydov@virtuozzo.com> Acked-by: NJohannes Weiner <hannes@cmpxchg.org> Acked-by: NMichal Hocko <mhocko@suse.com> Cc: Tejun Heo <tj@kernel.org> Cc: Greg Thelen <gthelen@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Vladimir Davydov 提交于
This reverts commit 8f4fc071 ("gfp: add __GFP_NOACCOUNT"). Black-list kmem accounting policy (aka __GFP_NOACCOUNT) turned out to be fragile and difficult to maintain, because there seem to be many more allocations that should not be accounted than those that should be. Besides, false accounting an allocation might result in much worse consequences than not accounting at all, namely increased memory consumption due to pinned dead kmem caches. So it was decided to switch to the white-list policy. This patch reverts bits introducing the black-list policy. The white-list policy will be introduced later in the series. Signed-off-by: NVladimir Davydov <vdavydov@virtuozzo.com> Acked-by: NJohannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Tejun Heo <tj@kernel.org> Cc: Greg Thelen <gthelen@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Geliang Tang 提交于
Add a new helper function get_first_slab() that get the first slab from a kmem_cache_node. Signed-off-by: NGeliang Tang <geliangtang@163.com> Acked-by: NChristoph Lameter <cl@linux.com> Acked-by: NDavid Rientjes <rientjes@google.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Geliang Tang 提交于
Simplify the code with list_for_each_entry(). Signed-off-by: NGeliang Tang <geliangtang@163.com> Acked-by: NChristoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Geliang Tang 提交于
Simplify the code with list_first_entry_or_null(). Signed-off-by: NGeliang Tang <geliangtang@163.com> Acked-by: NChristoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-