- 25 9月, 2019 13 次提交
-
-
由 Matthew Wilcox (Oracle) 提交于
Patch series "Make working with compound pages easier", v2. These three patches add three helpers and convert the appropriate places to use them. This patch (of 3): It's unnecessarily hard to find out the size of a potentially huge page. Replace 'PAGE_SIZE << compound_order(page)' with page_size(page). Link: http://lkml.kernel.org/r/20190721104612.19120-2-willy@infradead.orgSigned-off-by: NMatthew Wilcox (Oracle) <willy@infradead.org> Acked-by: NMichal Hocko <mhocko@suse.com> Reviewed-by: NAndrew Morton <akpm@linux-foundation.org> Reviewed-by: NIra Weiny <ira.weiny@intel.com> Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 YueHaibing 提交于
Fixes gcc '-Wunused-but-set-variable' warning: mm/rmap.c: In function page_mkclean_one: mm/rmap.c:906:17: warning: variable cstart set but not used [-Wunused-but-set-variable] It is not used any more since commit cdb07bde ("mm/rmap.c: remove redundant variable cend") Link: http://lkml.kernel.org/r/20190724141453.38536-1-yuehaibing@huawei.comSigned-off-by: NYueHaibing <yuehaibing@huawei.com> Reported-by: NHulk Robot <hulkci@huawei.com> Reviewed-by: NMike Kravetz <mike.kravetz@oracle.com> Reviewed-by: NKirill Tkhai <ktkhai@virtuozzo.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Christophe JAILLET 提交于
s/posioned/poisoned/ Link: http://lkml.kernel.org/r/20190721180908.6534-1-christophe.jaillet@wanadoo.frSigned-off-by: NChristophe JAILLET <christophe.jaillet@wanadoo.fr> Reviewed-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Walter Wu 提交于
Add memory corruption identification at bug report for software tag-based mode. The report shows whether it is "use-after-free" or "out-of-bound" error instead of "invalid-access" error. This will make it easier for programmers to see the memory corruption problem. We extend the slab to store five old free pointer tag and free backtrace, we can check if the tagged address is in the slab record and make a good guess if the object is more like "use-after-free" or "out-of-bound". therefore every slab memory corruption can be identified whether it's "use-after-free" or "out-of-bound". [aryabinin@virtuozzo.com: simplify & clenup code] Link: https://lkml.kernel.org/r/3318f9d7-a760-3cc8-b700-f06108ae745f@virtuozzo.com] Link: http://lkml.kernel.org/r/20190821180332.11450-1-aryabinin@virtuozzo.comSigned-off-by: NWalter Wu <walter-zh.wu@mediatek.com> Signed-off-by: NAndrey Ryabinin <aryabinin@virtuozzo.com> Acked-by: NAndrey Konovalov <andreyknvl@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Alexander Potapenko <glider@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Qian Cai 提交于
The only way to obtain the current memory pool size for a running kernel is to check the kernel config file which is inconvenient. Record it in the kernel messages. [akpm@linux-foundation.org: s/memory pool size/memory pool/available/, per Catalin] Link: http://lkml.kernel.org/r/1565809631-28933-1-git-send-email-cai@lca.pwSigned-off-by: NQian Cai <cai@lca.pw> Acked-by: NCatalin Marinas <catalin.marinas@arm.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Catalin Marinas 提交于
Currently kmemleak uses a static early_log buffer to trace all memory allocation/freeing before the slab allocator is initialised. Such early log is replayed during kmemleak_init() to properly initialise the kmemleak metadata for objects allocated up that point. With a memory pool that does not rely on the slab allocator, it is possible to skip this early log entirely. In order to remove the early logging, consider kmemleak_enabled == 1 by default while the kmem_cache availability is checked directly on the object_cache and scan_area_cache variables. The RCU callback is only invoked after object_cache has been initialised as we wouldn't have any concurrent list traversal before this. In order to reduce the number of callbacks before kmemleak is fully initialised, move the kmemleak_init() call to mm_init(). [akpm@linux-foundation.org: coding-style fixes] [akpm@linux-foundation.org: remove WARN_ON(), per Catalin] Link: http://lkml.kernel.org/r/20190812160642.52134-4-catalin.marinas@arm.comSigned-off-by: NCatalin Marinas <catalin.marinas@arm.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Qian Cai <cai@lca.pw> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Catalin Marinas 提交于
Add a memory pool for struct kmemleak_object in case the normal kmem_cache_alloc() fails under the gfp constraints passed by the caller. The mem_pool[] array size is currently fixed at 16000. We are not using the existing mempool kernel API since this requires the slab allocator to be available (for pool->elements allocation). A subsequent kmemleak patch will replace the static early log buffer with the pool allocation introduced here and this functionality is required to be available before the slab was initialised. Link: http://lkml.kernel.org/r/20190812160642.52134-3-catalin.marinas@arm.comSigned-off-by: NCatalin Marinas <catalin.marinas@arm.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Qian Cai <cai@lca.pw> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Catalin Marinas 提交于
Patch series "mm: kmemleak: Use a memory pool for kmemleak object allocations", v3. Following the discussions on v2 of this patch(set) [1], this series takes slightly different approach: - it implements its own simple memory pool that does not rely on the slab allocator - drops the early log buffer logic entirely since it can now allocate metadata from the memory pool directly before kmemleak is fully initialised - CONFIG_DEBUG_KMEMLEAK_EARLY_LOG_SIZE option is renamed to CONFIG_DEBUG_KMEMLEAK_MEM_POOL_SIZE - moves the kmemleak_init() call earlier (mm_init()) - to avoid a separate memory pool for struct scan_area, it makes the tool robust when such allocations fail as scan areas are rather an optimisation [1] http://lkml.kernel.org/r/20190727132334.9184-1-catalin.marinas@arm.com This patch (of 3): Object scan areas are an optimisation aimed to decrease the false positives and slightly improve the scanning time of large objects known to only have a few specific pointers. If a struct scan_area fails to allocate, kmemleak can still function normally by scanning the full object. Introduce an OBJECT_FULL_SCAN flag and mark objects as such when scan_area allocation fails. Link: http://lkml.kernel.org/r/20190812160642.52134-2-catalin.marinas@arm.comSigned-off-by: NCatalin Marinas <catalin.marinas@arm.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Qian Cai <cai@lca.pw> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Qian Cai 提交于
tid_to_cpu() and tid_to_event() are only used in note_cmpxchg_failure() when SLUB_DEBUG_CMPXCHG=y, so when SLUB_DEBUG_CMPXCHG=n by default, Clang will complain that those unused functions. Link: http://lkml.kernel.org/r/1568752232-5094-1-git-send-email-cai@lca.pwSigned-off-by: NQian Cai <cai@lca.pw> Acked-by: NDavid Rientjes <rientjes@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Waiman Long 提交于
The memcg_cache_params structure is only embedded into the kmem_cache of slab and slub allocators as defined in slab_def.h and slub_def.h and used internally by mm code. There is no needed to expose it in a public header. So move it from include/linux/slab.h to mm/slab.h. It is just a refactoring patch with no code change. In fact both the slub_def.h and slab_def.h should be moved into the mm directory as well, but that will probably cause many merge conflicts. Link: http://lkml.kernel.org/r/20190718180827.18758-1-longman@redhat.comSigned-off-by: NWaiman Long <longman@redhat.com> Acked-by: NDavid Rientjes <rientjes@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Roman Gushchin <guro@fb.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Shakeel Butt <shakeelb@google.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Waiman Long 提交于
Currently, a value of '1" is written to /sys/kernel/slab/<slab>/shrink file to shrink the slab by flushing out all the per-cpu slabs and free slabs in partial lists. This can be useful to squeeze out a bit more memory under extreme condition as well as making the active object counts in /proc/slabinfo more accurate. This usually applies only to the root caches, as the SLUB_MEMCG_SYSFS_ON option is usually not enabled and "slub_memcg_sysfs=1" not set. Even if memcg sysfs is turned on, it is too cumbersome and impractical to manage all those per-memcg sysfs files in a real production system. So there is no practical way to shrink memcg caches. Fix this by enabling a proper write to the shrink sysfs file of the root cache to scan all the available memcg caches and shrink them as well. For a non-root memcg cache (when SLUB_MEMCG_SYSFS_ON or slub_memcg_sysfs is on), only that cache will be shrunk when written. On a 2-socket 64-core 256-thread arm64 system with 64k page after a parallel kernel build, the the amount of memory occupied by slabs before shrinking slabs were: # grep task_struct /proc/slabinfo task_struct 53137 53192 4288 61 4 : tunables 0 0 0 : slabdata 872 872 0 # grep "^S[lRU]" /proc/meminfo Slab: 3936832 kB SReclaimable: 399104 kB SUnreclaim: 3537728 kB After shrinking slabs (by echoing "1" to all shrink files): # grep "^S[lRU]" /proc/meminfo Slab: 1356288 kB SReclaimable: 263296 kB SUnreclaim: 1092992 kB # grep task_struct /proc/slabinfo task_struct 2764 6832 4288 61 4 : tunables 0 0 0 : slabdata 112 112 0 Link: http://lkml.kernel.org/r/20190723151445.7385-1-longman@redhat.comSigned-off-by: NWaiman Long <longman@redhat.com> Acked-by: NRoman Gushchin <guro@fb.com> Acked-by: NChristoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Shakeel Butt <shakeelb@google.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Vitaly Wool 提交于
z3fold_page_reclaim()'s retry mechanism is broken: on a second iteration it will have zhdr from the first one so that zhdr is no longer in line with struct page. That leads to crashes when the system is stressed. Fix that by moving zhdr assignment up. While at it, protect against using already freed handles by using own local slots structure in z3fold_page_reclaim(). Link: http://lkml.kernel.org/r/20190908162919.830388dc7404d1e2c80f4095@gmail.comSigned-off-by: NVitaly Wool <vitalywool@gmail.com> Reported-by: NMarkus Linnala <markus.linnala@gmail.com> Reported-by: NChris Murphy <bugzilla@colorremedies.com> Reported-by: NAgustin Dall'Alba <agustin@dallalba.com.ar> Cc: "Maciej S. Szmigiero" <mail@maciej.szmigiero.name> Cc: Shakeel Butt <shakeelb@google.com> Cc: Henry Burns <henrywolfeburns@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Vitaly Wool 提交于
With the original commit applied, z3fold_zpool_destroy() may get blocked on wait_event() for indefinite time. Revert this commit for the time being to get rid of this problem since the issue the original commit addresses is less severe. Link: http://lkml.kernel.org/r/20190910123142.7a9c8d2de4d0acbc0977c602@gmail.com Fixes: d776aaa9 ("mm/z3fold.c: fix race between migration and destruction") Reported-by: NAgustín Dall'Alba <agustin@dallalba.com.ar> Signed-off-by: NVitaly Wool <vitalywool@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Vitaly Wool <vitalywool@gmail.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Jonathan Adams <jwadams@google.com> Cc: Henry Burns <henrywolfeburns@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 13 9月, 2019 5 次提交
-
-
由 David Howells 提交于
Convert the ramfs, shmem, tmpfs, devtmpfs and rootfs filesystems to the new internal mount API as the old one will be obsoleted and removed. This allows greater flexibility in communication of mount parameters between userspace, the VFS and the filesystem. See Documentation/filesystems/mount_api.txt for more information. Note that tmpfs is slightly tricky as it can contain embedded commas, so it can't be trivially split up using strsep() to break on commas in generic_parse_monolithic(). Instead, tmpfs has to supply its own generic parser. However, if tmpfs changes, then devtmpfs and rootfs, which are wrappers around tmpfs or ramfs, must change too - and thus so must ramfs, so these had to be converted also. [AV: rewritten] Signed-off-by: NDavid Howells <dhowells@redhat.com> cc: Hugh Dickins <hughd@google.com> cc: linux-mm@kvack.org Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
由 Al Viro 提交于
This thing will eventually become our ->parse_param(), while shmem_parse_options() - ->parse_monolithic(). At that point shmem_parse_options() will start calling vfs_parse_fs_string(), rather than calling shmem_parse_one() directly. Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
由 Al Viro 提交于
mechanical move. Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
由 Al Viro 提交于
just use ctx->mpol (note that callers always set ctx->mpol to NULL when calling that). Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
由 Al Viro 提交于
... and copy the data from it into sbinfo in the callers. For use by remount we need to keep track whether there'd been options setting max_inodes, max_blocks and huge resp. and do the sanity checks (and copying) only if such options had been seen. uid/gid/mode is ignored by remount and NULL mpol is already explicitly treated as "ignore it", so we don't need to keep track of those. Note: theoretically, mpol_parse_string() may return NULL not in case of error (for default policy), so the assumption that NULL mpol means "change nothing" is incorrect. However, that's the mainline behaviour and any changes belong in a separate patch. If we go for that, we'll need to keep track of having encountered mpol= option too. [changes in remount logics from Hugh Dickins folded] Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
- 07 9月, 2019 6 次提交
-
-
由 Daniel Vetter 提交于
We need to make sure implementations don't cheat and don't have a possible schedule/blocking point deeply burried where review can't catch it. I'm not sure whether this is the best way to make sure all the might_sleep() callsites trigger, and it's a bit ugly in the code flow. But it gets the job done. Inspired by an i915 patch series which did exactly that, because the rules haven't been entirely clear to us. Link: https://lore.kernel.org/r/20190826201425.17547-5-daniel.vetter@ffwll.ch Reviewed-by: Christian König <christian.koenig@amd.com> (v1) Reviewed-by: Jérôme Glisse <jglisse@redhat.com> (v4) Signed-off-by: NDaniel Vetter <daniel.vetter@intel.com> Reviewed-by: NJason Gunthorpe <jgg@mellanox.com> Signed-off-by: NJason Gunthorpe <jgg@mellanox.com>
-
由 Christoph Hellwig 提交于
Use lockdep to check for held locks instead of using home grown asserts. Link: https://lore.kernel.org/r/20190828141955.22210-4-hch@lst.deSigned-off-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NThomas Hellstrom <thellstrom@vmware.com> Reviewed-by: NSteven Price <steven.price@arm.com> Reviewed-by: NJason Gunthorpe <jgg@mellanox.com> Signed-off-by: NJason Gunthorpe <jgg@mellanox.com>
-
由 Christoph Hellwig 提交于
The mm_walk structure currently mixed data and code. Split out the operations vectors into a new mm_walk_ops structure, and while we are changing the API also declare the mm_walk structure inside the walk_page_range and walk_page_vma functions. Based on patch from Linus Torvalds. Link: https://lore.kernel.org/r/20190828141955.22210-3-hch@lst.deSigned-off-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NThomas Hellstrom <thellstrom@vmware.com> Reviewed-by: NSteven Price <steven.price@arm.com> Reviewed-by: NJason Gunthorpe <jgg@mellanox.com> Signed-off-by: NJason Gunthorpe <jgg@mellanox.com>
-
由 Christoph Hellwig 提交于
Add a new header for the two handful of users of the walk_page_range / walk_page_vma interface instead of polluting all users of mm.h with it. Link: https://lore.kernel.org/r/20190828141955.22210-2-hch@lst.deSigned-off-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NThomas Hellstrom <thellstrom@vmware.com> Reviewed-by: NSteven Price <steven.price@arm.com> Reviewed-by: NJason Gunthorpe <jgg@mellanox.com> Signed-off-by: NJason Gunthorpe <jgg@mellanox.com>
-
由 Daniel Vetter 提交于
We want to teach lockdep that mmu notifiers can be called from direct reclaim paths, since on many CI systems load might never reach that level (e.g. when just running fuzzer or small functional tests). I've put the annotation into mmu_notifier_register since only when we have mmu notifiers registered is there any point in teaching lockdep about them. Also, we already have a kmalloc(, GFP_KERNEL), so this is safe. Link: https://lore.kernel.org/r/20190826201425.17547-3-daniel.vetter@ffwll.chSuggested-by: NJason Gunthorpe <jgg@mellanox.com> Reviewed-by: NJason Gunthorpe <jgg@mellanox.com> Signed-off-by: NDaniel Vetter <daniel.vetter@intel.com> Signed-off-by: NJason Gunthorpe <jgg@mellanox.com>
-
由 Daniel Vetter 提交于
This is a similar idea to the fs_reclaim fake lockdep lock. It's fairly easy to provoke a specific notifier to be run on a specific range: Just prep it, and then munmap() it. A bit harder, but still doable, is to provoke the mmu notifiers for all the various callchains that might lead to them. But both at the same time is really hard to reliably hit, especially when you want to exercise paths like direct reclaim or compaction, where it's not easy to control what exactly will be unmapped. By introducing a lockdep map to tie them all together we allow lockdep to see a lot more dependencies, without having to actually hit them in a single challchain while testing. On Jason's suggestion this is is rolled out for both invalidate_range_start and invalidate_range_end. They both have the same calling context, hence we can share the same lockdep map. Note that the annotation for invalidate_ranage_start is outside of the mm_has_notifiers(), to make sure lockdep is informed about all paths leading to this context irrespective of whether mmu notifiers are present for a given context. We don't do that on the invalidate_range_end side to avoid paying the overhead twice, there the lockdep annotation is pushed down behind the mm_has_notifiers() check. Link: https://lore.kernel.org/r/20190826201425.17547-2-daniel.vetter@ffwll.chReviewed-by: NJason Gunthorpe <jgg@mellanox.com> Signed-off-by: NDaniel Vetter <daniel.vetter@intel.com> Signed-off-by: NJason Gunthorpe <jgg@mellanox.com>
-
- 06 9月, 2019 1 次提交
-
-
由 Al Viro 提交于
... have callers use shmem_mount() Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
- 05 9月, 2019 1 次提交
-
-
由 Gustavo A. R. Silva 提交于
One of the more common cases of allocation size calculations is finding the size of a structure that has a zero-sized array at the end, along with memory for some number of elements for that array. For example: struct pcpu_alloc_info { ... struct pcpu_group_info groups[]; }; Make use of the struct_size() helper instead of an open-coded version in order to avoid any potential type mistakes. So, replace the following form: sizeof(*ai) + nr_groups * sizeof(ai->groups[0]) with: struct_size(ai, groups, nr_groups) This code was detected with the help of Coccinelle. Signed-off-by: NGustavo A. R. Silva <gustavo@embeddedor.com> Signed-off-by: NDennis Zhou <dennis@kernel.org>
-
- 04 9月, 2019 2 次提交
-
-
由 Nadav Amit 提交于
There is no reason to print warnings when balloon page allocation fails, as they are expected and can be handled gracefully. Since VMware balloon now uses balloon-compaction infrastructure, and suppressed these warnings before, it is also beneficial to suppress these warnings to keep the same behavior that the balloon had before. Cc: Jason Wang <jasowang@redhat.com> Signed-off-by: NNadav Amit <namit@vmware.com> Signed-off-by: NMichael S. Tsirkin <mst@redhat.com> Reviewed-by: NDavid Hildenbrand <david@redhat.com>
-
由 Christoph Hellwig 提交于
The arm architecture had a VM_ARM_DMA_CONSISTENT flag to mark DMA coherent remapping for a while. Lift this flag to common code so that we can use it generically. We also check it in the only place VM_USERMAP is directly check so that we can entirely replace that flag as well (although I'm not even sure why we'd want to allow remapping DMA appings, but I'd rather not change behavior). Signed-off-by: NChristoph Hellwig <hch@lst.de>
-
- 03 9月, 2019 1 次提交
-
-
由 Matt Fleming 提交于
SD_BALANCE_{FORK,EXEC} and SD_WAKE_AFFINE are stripped in sd_init() for any sched domains with a NUMA distance greater than 2 hops (RECLAIM_DISTANCE). The idea being that it's expensive to balance across domains that far apart. However, as is rather unfortunately explained in: commit 32e45ff4 ("mm: increase RECLAIM_DISTANCE to 30") the value for RECLAIM_DISTANCE is based on node distance tables from 2011-era hardware. Current AMD EPYC machines have the following NUMA node distances: node distances: node 0 1 2 3 4 5 6 7 0: 10 16 16 16 32 32 32 32 1: 16 10 16 16 32 32 32 32 2: 16 16 10 16 32 32 32 32 3: 16 16 16 10 32 32 32 32 4: 32 32 32 32 10 16 16 16 5: 32 32 32 32 16 10 16 16 6: 32 32 32 32 16 16 10 16 7: 32 32 32 32 16 16 16 10 where 2 hops is 32. The result is that the scheduler fails to load balance properly across NUMA nodes on different sockets -- 2 hops apart. For example, pinning 16 busy threads to NUMA nodes 0 (CPUs 0-7) and 4 (CPUs 32-39) like so, $ numactl -C 0-7,32-39 ./spinner 16 causes all threads to fork and remain on node 0 until the active balancer kicks in after a few seconds and forcibly moves some threads to node 4. Override node_reclaim_distance for AMD Zen. Signed-off-by: NMatt Fleming <matt@codeblueprint.co.uk> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Acked-by: NMel Gorman <mgorman@techsingularity.net> Cc: Borislav Petkov <bp@alien8.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@surriel.com> Cc: Suravee.Suthikulpanit@amd.com Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Thomas.Lendacky@amd.com Cc: Tony Luck <tony.luck@intel.com> Link: https://lkml.kernel.org/r/20190808195301.13222-3-matt@codeblueprint.co.ukSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
- 31 8月, 2019 8 次提交
-
-
由 Jan Kara 提交于
Filesystems will need to call this function from their fadvise handlers. CC: stable@vger.kernel.org Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com> Signed-off-by: NJan Kara <jack@suse.cz> Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
-
由 Jan Kara 提交于
Currently handling of MADV_WILLNEED hint calls directly into readahead code. Handle it by calling vfs_fadvise() instead so that filesystem can use its ->fadvise() callback to acquire necessary locks or otherwise prepare for the request. Suggested-by: NAmir Goldstein <amir73il@gmail.com> Reviewed-by: NBoaz Harrosh <boazh@netapp.com> CC: stable@vger.kernel.org Signed-off-by: NJan Kara <jack@suse.cz> Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com> Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
-
由 Shakeel Butt 提交于
Instead of using raw_cpu_read() use per_cpu() to read the actual data of the corresponding cpu otherwise we will be reading the data of the current cpu for the number of online CPUs. Link: http://lkml.kernel.org/r/20190829203110.129263-1-shakeelb@google.com Fixes: bb65f89b ("mm: memcontrol: flush percpu vmevents before releasing memcg") Fixes: c350a99e ("mm: memcontrol: flush percpu vmstats before releasing memcg") Signed-off-by: NShakeel Butt <shakeelb@google.com> Acked-by: NRoman Gushchin <guro@fb.com> Acked-by: NMichal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Michal Hocko 提交于
Adric Blake has noticed[1] the following warning: WARNING: CPU: 7 PID: 175 at mm/vmscan.c:245 set_task_reclaim_state+0x1e/0x40 [...] Call Trace: mem_cgroup_shrink_node+0x9b/0x1d0 mem_cgroup_soft_limit_reclaim+0x10c/0x3a0 balance_pgdat+0x276/0x540 kswapd+0x200/0x3f0 ? wait_woken+0x80/0x80 kthread+0xfd/0x130 ? balance_pgdat+0x540/0x540 ? kthread_park+0x80/0x80 ret_from_fork+0x35/0x40 ---[ end trace 727343df67b2398a ]--- which tells us that soft limit reclaim is about to overwrite the reclaim_state configured up in the call chain (kswapd in this case but the direct reclaim is equally possible). This means that reclaim stats would get misleading once the soft reclaim returns and another reclaim is done. Fix the warning by dropping set_task_reclaim_state from the soft reclaim which is always called with reclaim_state set up. [1] http://lkml.kernel.org/r/CAE1jjeePxYPvw1mw2B3v803xHVR_BNnz0hQUY_JDMN8ny29M6w@mail.gmail.com Link: http://lkml.kernel.org/r/20190828071808.20410-1-mhocko@kernel.orgSigned-off-by: NMichal Hocko <mhocko@suse.com> Reported-by: NAdric Blake <promarbler14@gmail.com> Acked-by: NYafang Shao <laoar.shao@gmail.com> Acked-by: NYang Shi <yang.shi@linux.alibaba.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Hillf Danton <hdanton@sina.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Gustavo A. R. Silva 提交于
Fix lock/unlock imbalance by unlocking *zhdr* before return. Addresses Coverity ID 1452811 ("Missing unlock") Link: http://lkml.kernel.org/r/20190826030634.GA4379@embeddedor Fixes: d776aaa9 ("mm/z3fold.c: fix race between migration and destruction") Signed-off-by: NGustavo A. R. Silva <gustavo@embeddedor.com> Reviewed-by: NAndrew Morton <akpm@linux-foundation.org> Cc: Henry Burns <henrywolfeburns@gmail.com> Cc: Vitaly Wool <vitalywool@gmail.com> Cc: Shakeel Butt <shakeelb@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Roman Gushchin 提交于
mm, memcg: partially revert "mm/memcontrol.c: keep local VM counters in sync with the hierarchical ones" Commit 766a4c19 ("mm/memcontrol.c: keep local VM counters in sync with the hierarchical ones") effectively decreased the precision of per-memcg vmstats_local and per-memcg-per-node lruvec percpu counters. That's good for displaying in memory.stat, but brings a serious regression into the reclaim process. One issue I've discovered and debugged is the following: lruvec_lru_size() can return 0 instead of the actual number of pages in the lru list, preventing the kernel to reclaim last remaining pages. Result is yet another dying memory cgroups flooding. The opposite is also happening: scanning an empty lru list is the waste of cpu time. Also, inactive_list_is_low() can return incorrect values, preventing the active lru from being scanned and freed. It can fail both because the size of active and inactive lists are inaccurate, and because the number of workingset refaults isn't precise. In other words, the result is pretty random. I'm not sure, if using the approximate number of slab pages in count_shadow_number() is acceptable, but issues described above are enough to partially revert the patch. Let's keep per-memcg vmstat_local batched (they are only used for displaying stats to the userspace), but keep lruvec stats precise. This change fixes the dead memcg flooding on my setup. Link: http://lkml.kernel.org/r/20190817004726.2530670-1-guro@fb.com Fixes: 766a4c19 ("mm/memcontrol.c: keep local VM counters in sync with the hierarchical ones") Signed-off-by: NRoman Gushchin <guro@fb.com> Acked-by: NYafang Shao <laoar.shao@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Andrew Morton 提交于
Fixes: 701d6785 ("mm/zsmalloc.c: fix race condition in zs_destroy_pool") Link: http://lkml.kernel.org/r/201908251039.5oSbEEUT%25lkp@intel.comReported-by: Nkbuild test robot <lkp@intel.com> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Henry Burns <henrywolfeburns@gmail.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Shakeel Butt <shakeelb@google.com> Cc: Jonathan Adams <jwadams@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Roman Gushchin 提交于
I've noticed that the "slab" value in memory.stat is sometimes 0, even if some children memory cgroups have a non-zero "slab" value. The following investigation showed that this is the result of the kmem_cache reparenting in combination with the per-cpu batching of slab vmstats. At the offlining some vmstat value may leave in the percpu cache, not being propagated upwards by the cgroup hierarchy. It means that stats on ancestor levels are lower than actual. Later when slab pages are released, the precise number of pages is substracted on the parent level, making the value negative. We don't show negative values, 0 is printed instead. To fix this issue, let's flush percpu slab memcg and lruvec stats on memcg offlining. This guarantees that numbers on all ancestor levels are accurate and match the actual number of outstanding slab pages. Link: http://lkml.kernel.org/r/20190819202338.363363-3-guro@fb.com Fixes: fb2f2b0a ("mm: memcg/slab: reparent memcg kmem_caches on cgroup removal") Signed-off-by: NRoman Gushchin <guro@fb.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 30 8月, 2019 1 次提交
-
-
由 Tejun Heo 提交于
cgroup foreign inode handling has quite a bit of heuristics and internal states which sometimes makes it difficult to understand what's going on. Add tracepoints to improve visibility. Signed-off-by: NTejun Heo <tj@kernel.org> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
- 28 8月, 2019 2 次提交
-
-
由 Christoph Hellwig 提交于
No modular code uses these, which makes a lot of sense given the wrappers around them are only called by core mm code. Link: https://lore.kernel.org/r/20190828142109.29012-1-hch@lst.deSigned-off-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NJason Gunthorpe <jgg@mellanox.com> Signed-off-by: NJason Gunthorpe <jgg@mellanox.com>
-
由 Ralph Campbell 提交于
Normally, callers to handle_mm_fault() are supposed to check the vma->vm_flags first. hmm_range_fault() checks for VM_READ but doesn't check for VM_WRITE if the caller requests a page to be faulted in with write permission (via the hmm_range.pfns[] value). If the vma is write protected, this can result in an infinite loop: hmm_range_fault() walk_page_range() ... hmm_vma_walk_hole() hmm_vma_walk_hole_() hmm_vma_do_fault() handle_mm_fault(FAULT_FLAG_WRITE) /* returns VM_FAULT_WRITE */ /* returns -EBUSY */ /* returns -EBUSY */ /* returns -EBUSY */ /* loops on -EBUSY and range->valid */ Prevent this by checking for vma->vm_flags & VM_WRITE before calling handle_mm_fault(). Link: https://lore.kernel.org/r/20190823221753.2514-3-rcampbell@nvidia.comSigned-off-by: NRalph Campbell <rcampbell@nvidia.com> Reviewed-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NJason Gunthorpe <jgg@mellanox.com> Signed-off-by: NJason Gunthorpe <jgg@mellanox.com>
-