- 18 9月, 2012 1 次提交
-
-
由 Joonsoo Kim 提交于
get_partial() is currently not checking pfmemalloc_match() meaning that it is possible for pfmemalloc pages to leak to non-pfmemalloc users. This is a problem in the following situation. Assume that there is a request from normal allocation and there are no objects in the per-cpu cache and no node-partial slab. In this case, slab_alloc enters the slow path and new_slab_objects() is called which may return a PFMEMALLOC page. As the current user is not allowed to access PFMEMALLOC page, deactivate_slab() is called ([5091b74a: mm: slub: optimise the SLUB fast path to avoid pfmemalloc checks]) and returns an object from PFMEMALLOC page. Next time, when we get another request from normal allocation, slab_alloc() enters the slow-path and calls new_slab_objects(). In new_slab_objects(), we call get_partial() and get a partial slab which was just deactivated but is a pfmemalloc page. We extract one object from it and re-deactivate. "deactivate -> re-get in get_partial -> re-deactivate" occures repeatedly. As a result, access to PFMEMALLOC page is not properly restricted and it can cause a performance degradation due to frequent deactivation. deactivation frequently. This patch changes get_partial_node() to take pfmemalloc_match() into account and prevents the "deactivate -> re-get in get_partial() scenario. Instead, new_slab() is called. Signed-off-by: NJoonsoo Kim <js1304@gmail.com> Acked-by: NDavid Rientjes <rientjes@google.com> Signed-off-by: NMel Gorman <mgorman@suse.de> Cc: David Miller <davem@davemloft.net> Cc: Chuck Lever <chuck.lever@oracle.com> Cc: Pekka Enberg <penberg@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 01 8月, 2012 2 次提交
-
-
由 Christoph Lameter 提交于
This patch removes the check for pfmemalloc from the alloc hotpath and puts the logic after the election of a new per cpu slab. For a pfmemalloc page we do not use the fast path but force the use of the slow path which is also used for the debug case. This has the side-effect of weakening pfmemalloc processing in the following way; 1. A process that is allocating for network swap calls __slab_alloc. pfmemalloc_match is true so the freelist is loaded and c->freelist is now pointing to a pfmemalloc page. 2. A process that is attempting normal allocations calls slab_alloc, finds the pfmemalloc page on the freelist and uses it because it did not check pfmemalloc_match() The patch allows non-pfmemalloc allocations to use pfmemalloc pages with the kmalloc slabs being the most vunerable caches on the grounds they are most likely to have a mix of pfmemalloc and !pfmemalloc requests. A later patch will still protect the system as processes will get throttled if the pfmemalloc reserves get depleted but performance will not degrade as smoothly. [mgorman@suse.de: Expanded changelog] Signed-off-by: NChristoph Lameter <cl@linux.com> Signed-off-by: NMel Gorman <mgorman@suse.de> Cc: David Miller <davem@davemloft.net> Cc: Neil Brown <neilb@suse.de> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Christie <michaelc@cs.wisc.edu> Cc: Eric B Munson <emunson@mgebm.net> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc> Cc: Mel Gorman <mgorman@suse.de> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Mel Gorman 提交于
When a user or administrator requires swap for their application, they create a swap partition and file, format it with mkswap and activate it with swapon. Swap over the network is considered as an option in diskless systems. The two likely scenarios are when blade servers are used as part of a cluster where the form factor or maintenance costs do not allow the use of disks and thin clients. The Linux Terminal Server Project recommends the use of the Network Block Device (NBD) for swap according to the manual at https://sourceforge.net/projects/ltsp/files/Docs-Admin-Guide/LTSPManual.pdf/download There is also documentation and tutorials on how to setup swap over NBD at places like https://help.ubuntu.com/community/UbuntuLTSP/EnableNBDSWAP The nbd-client also documents the use of NBD as swap. Despite this, the fact is that a machine using NBD for swap can deadlock within minutes if swap is used intensively. This patch series addresses the problem. The core issue is that network block devices do not use mempools like normal block devices do. As the host cannot control where they receive packets from, they cannot reliably work out in advance how much memory they might need. Some years ago, Peter Zijlstra developed a series of patches that supported swap over an NFS that at least one distribution is carrying within their kernels. This patch series borrows very heavily from Peter's work to support swapping over NBD as a pre-requisite to supporting swap-over-NFS. The bulk of the complexity is concerned with preserving memory that is allocated from the PFMEMALLOC reserves for use by the network layer which is needed for both NBD and NFS. Patch 1 adds knowledge of the PFMEMALLOC reserves to SLAB and SLUB to preserve access to pages allocated under low memory situations to callers that are freeing memory. Patch 2 optimises the SLUB fast path to avoid pfmemalloc checks Patch 3 introduces __GFP_MEMALLOC to allow access to the PFMEMALLOC reserves without setting PFMEMALLOC. Patch 4 opens the possibility for softirqs to use PFMEMALLOC reserves for later use by network packet processing. Patch 5 only sets page->pfmemalloc when ALLOC_NO_WATERMARKS was required Patch 6 ignores memory policies when ALLOC_NO_WATERMARKS is set. Patches 7-12 allows network processing to use PFMEMALLOC reserves when the socket has been marked as being used by the VM to clean pages. If packets are received and stored in pages that were allocated under low-memory situations and are unrelated to the VM, the packets are dropped. Patch 11 reintroduces __skb_alloc_page which the networking folk may object to but is needed in some cases to propogate pfmemalloc from a newly allocated page to an skb. If there is a strong objection, this patch can be dropped with the impact being that swap-over-network will be slower in some cases but it should not fail. Patch 13 is a micro-optimisation to avoid a function call in the common case. Patch 14 tags NBD sockets as being SOCK_MEMALLOC so they can use PFMEMALLOC if necessary. Patch 15 notes that it is still possible for the PFMEMALLOC reserve to be depleted. To prevent this, direct reclaimers get throttled on a waitqueue if 50% of the PFMEMALLOC reserves are depleted. It is expected that kswapd and the direct reclaimers already running will clean enough pages for the low watermark to be reached and the throttled processes are woken up. Patch 16 adds a statistic to track how often processes get throttled Some basic performance testing was run using kernel builds, netperf on loopback for UDP and TCP, hackbench (pipes and sockets), iozone and sysbench. Each of them were expected to use the sl*b allocators reasonably heavily but there did not appear to be significant performance variances. For testing swap-over-NBD, a machine was booted with 2G of RAM with a swapfile backed by NBD. 8*NUM_CPU processes were started that create anonymous memory mappings and read them linearly in a loop. The total size of the mappings were 4*PHYSICAL_MEMORY to use swap heavily under memory pressure. Without the patches and using SLUB, the machine locks up within minutes and runs to completion with them applied. With SLAB, the story is different as an unpatched kernel run to completion. However, the patched kernel completed the test 45% faster. MICRO 3.5.0-rc2 3.5.0-rc2 vanilla swapnbd Unrecognised test vmscan-anon-mmap-write MMTests Statistics: duration Sys Time Running Test (seconds) 197.80 173.07 User+Sys Time Running Test (seconds) 206.96 182.03 Total Elapsed Time (seconds) 3240.70 1762.09 This patch: mm: sl[au]b: add knowledge of PFMEMALLOC reserve pages Allocations of pages below the min watermark run a risk of the machine hanging due to a lack of memory. To prevent this, only callers who have PF_MEMALLOC or TIF_MEMDIE set and are not processing an interrupt are allowed to allocate with ALLOC_NO_WATERMARKS. Once they are allocated to a slab though, nothing prevents other callers consuming free objects within those slabs. This patch limits access to slab pages that were alloced from the PFMEMALLOC reserves. When this patch is applied, pages allocated from below the low watermark are returned with page->pfmemalloc set and it is up to the caller to determine how the page should be protected. SLAB restricts access to any page with page->pfmemalloc set to callers which are known to able to access the PFMEMALLOC reserve. If one is not available, an attempt is made to allocate a new page rather than use a reserve. SLUB is a bit more relaxed in that it only records if the current per-CPU page was allocated from PFMEMALLOC reserve and uses another partial slab if the caller does not have the necessary GFP or process flags. This was found to be sufficient in tests to avoid hangs due to SLUB generally maintaining smaller lists than SLAB. In low-memory conditions it does mean that !PFMEMALLOC allocators can fail a slab allocation even though free objects are available because they are being preserved for callers that are freeing pages. [a.p.zijlstra@chello.nl: Original implementation] [sebastian@breakpoint.cc: Correct order of page flag clearing] Signed-off-by: NMel Gorman <mgorman@suse.de> Cc: David Miller <davem@davemloft.net> Cc: Neil Brown <neilb@suse.de> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Christie <michaelc@cs.wisc.edu> Cc: Eric B Munson <emunson@mgebm.net> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc> Cc: Mel Gorman <mgorman@suse.de> Cc: Christoph Lameter <cl@linux.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 11 7月, 2012 1 次提交
-
-
由 David Rientjes 提交于
kmemcheck_alloc_shadow() requires irqs to be enabled, so wait to disable them until after its called for __GFP_WAIT allocations. This fixes a warning for such allocations: WARNING: at kernel/lockdep.c:2739 lockdep_trace_alloc+0x14e/0x1c0() Acked-by: NFengguang Wu <fengguang.wu@intel.com> Acked-by: NSteven Rostedt <rostedt@goodmis.org> Tested-by: NFengguang Wu <fengguang.wu@intel.com> Signed-off-by: NDavid Rientjes <rientjes@google.com> Signed-off-by: NPekka Enberg <penberg@kernel.org>
-
- 09 7月, 2012 5 次提交
-
-
由 Christoph Lameter 提交于
Move the mutex handling into the common kmem_cache_create() function. Then we can also move more checks out of SLAB's kmem_cache_create() into the common code. Reviewed-by: NGlauber Costa <glommer@parallels.com> Signed-off-by: NChristoph Lameter <cl@linux.com> Signed-off-by: NPekka Enberg <penberg@kernel.org>
-
由 Christoph Lameter 提交于
Use the mutex definition from SLAB and make it the common way to take a sleeping lock. This has the effect of using a mutex instead of a rw semaphore for SLUB. SLOB gains the use of a mutex for kmem_cache_create serialization. Not needed now but SLOB may acquire some more features later (like slabinfo / sysfs support) through the expansion of the common code that will need this. Reviewed-by: NGlauber Costa <glommer@parallels.com> Reviewed-by: NJoonsoo Kim <js1304@gmail.com> Signed-off-by: NChristoph Lameter <cl@linux.com> Signed-off-by: NPekka Enberg <penberg@kernel.org>
-
由 Christoph Lameter 提交于
All allocators have some sort of support for the bootstrap status. Setup a common definition for the boot states and make all slab allocators use that definition. Reviewed-by: NGlauber Costa <glommer@parallels.com> Reviewed-by: NJoonsoo Kim <js1304@gmail.com> Signed-off-by: NChristoph Lameter <cl@linux.com> Signed-off-by: NPekka Enberg <penberg@kernel.org>
-
由 Christoph Lameter 提交于
Kmem_cache_create() does a variety of sanity checks but those vary depending on the allocator. Use the strictest tests and put them into a slab_common file. Make the tests conditional on CONFIG_DEBUG_VM. This patch has the effect of adding sanity checks for SLUB and SLOB under CONFIG_DEBUG_VM and removes the checks in SLAB for !CONFIG_DEBUG_VM. Signed-off-by: NChristoph Lameter <cl@linux.com> Signed-off-by: NPekka Enberg <penberg@kernel.org>
-
由 Julia Lawall 提交于
If list_for_each_entry, etc complete a traversal of the list, the iterator variable ends up pointing to an address at an offset from the list head, and not a meaningful structure. Thus this value should not be used after the end of the iterator. The patch replaces s->name by al->name, which is referenced nearby. This problem was found using Coccinelle (http://coccinelle.lip6.fr/). Signed-off-by: NJulia Lawall <Julia.Lawall@lip6.fr> Signed-off-by: NPekka Enberg <penberg@kernel.org>
-
- 20 6月, 2012 3 次提交
-
-
由 Joonsoo Kim 提交于
Current implementation of unfreeze_partials() is so complicated, but benefit from it is insignificant. In addition many code in do {} while loop have a bad influence to a fail rate of cmpxchg_double_slab. Under current implementation which test status of cpu partial slab and acquire list_lock in do {} while loop, we don't need to acquire a list_lock and gain a little benefit when front of the cpu partial slab is to be discarded, but this is a rare case. In case that add_partial is performed and cmpxchg_double_slab is failed, remove_partial should be called case by case. I think that these are disadvantages of current implementation, so I do refactoring unfreeze_partials(). Minimizing code in do {} while loop introduce a reduced fail rate of cmpxchg_double_slab. Below is output of 'slabinfo -r kmalloc-256' when './perf stat -r 33 hackbench 50 process 4000 > /dev/null' is done. ** before ** Cmpxchg_double Looping ------------------------ Locked Cmpxchg Double redos 182685 Unlocked Cmpxchg Double redos 0 ** after ** Cmpxchg_double Looping ------------------------ Locked Cmpxchg Double redos 177995 Unlocked Cmpxchg Double redos 1 We can see cmpxchg_double_slab fail rate is improved slightly. Bolow is output of './perf stat -r 30 hackbench 50 process 4000 > /dev/null'. ** before ** Performance counter stats for './hackbench 50 process 4000' (30 runs): 108517.190463 task-clock # 7.926 CPUs utilized ( +- 0.24% ) 2,919,550 context-switches # 0.027 M/sec ( +- 3.07% ) 100,774 CPU-migrations # 0.929 K/sec ( +- 4.72% ) 124,201 page-faults # 0.001 M/sec ( +- 0.15% ) 401,500,234,387 cycles # 3.700 GHz ( +- 0.24% ) <not supported> stalled-cycles-frontend <not supported> stalled-cycles-backend 250,576,913,354 instructions # 0.62 insns per cycle ( +- 0.13% ) 45,934,956,860 branches # 423.297 M/sec ( +- 0.14% ) 188,219,787 branch-misses # 0.41% of all branches ( +- 0.56% ) 13.691837307 seconds time elapsed ( +- 0.24% ) ** after ** Performance counter stats for './hackbench 50 process 4000' (30 runs): 107784.479767 task-clock # 7.928 CPUs utilized ( +- 0.22% ) 2,834,781 context-switches # 0.026 M/sec ( +- 2.33% ) 93,083 CPU-migrations # 0.864 K/sec ( +- 3.45% ) 123,967 page-faults # 0.001 M/sec ( +- 0.15% ) 398,781,421,836 cycles # 3.700 GHz ( +- 0.22% ) <not supported> stalled-cycles-frontend <not supported> stalled-cycles-backend 250,189,160,419 instructions # 0.63 insns per cycle ( +- 0.09% ) 45,855,370,128 branches # 425.436 M/sec ( +- 0.10% ) 169,881,248 branch-misses # 0.37% of all branches ( +- 0.43% ) 13.596272341 seconds time elapsed ( +- 0.22% ) No regression is found, but rather we can see slightly better result. Acked-by: NChristoph Lameter <cl@linux.com> Signed-off-by: NJoonsoo Kim <js1304@gmail.com> Signed-off-by: NPekka Enberg <penberg@kernel.org>
-
由 Joonsoo Kim 提交于
get_freelist(), unfreeze_partials() are only called with interrupt disabled, so __cmpxchg_double_slab() is suitable. Acked-by: NChristoph Lameter <cl@linux.com> Signed-off-by: NJoonsoo Kim <js1304@gmail.com> Signed-off-by: NPekka Enberg <penberg@kernel.org>
-
由 Andi Kleen 提交于
slab_node() could access current->mempolicy from interrupt context. However there's a race condition during exit where the mempolicy is first freed and then the pointer zeroed. Using this from interrupts seems bogus anyways. The interrupt will interrupt a random process and therefore get a random mempolicy. Many times, this will be idle's, which noone can change. Just disable this here and always use local for slab from interrupts. I also cleaned up the callers of slab_node a bit which always passed the same argument. I believe the original mempolicy code did that in fact, so it's likely a regression. v2: send version with correct logic v3: simplify. fix typo. Reported-by: NArun Sharma <asharma@fb.com> Cc: penberg@kernel.org Cc: cl@linux.com Signed-off-by: NAndi Kleen <ak@linux.intel.com> [tdmackey@twitter.com: Rework control flow based on feedback from cl@linux.com, fix logic, and cleanup current task_struct reference] Acked-by: NDavid Rientjes <rientjes@google.com> Acked-by: NChristoph Lameter <cl@linux.com> Acked-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: NDavid Mackey <tdmackey@twitter.com> Signed-off-by: NPekka Enberg <penberg@kernel.org>
-
- 14 6月, 2012 1 次提交
-
-
由 Christoph Lameter 提交于
Define a struct that describes common fields used in all slab allocators. A slab allocator either uses the common definition (like SLOB) or is required to provide members of kmem_cache with the definition given. After that it will be possible to share code that only operates on those fields of kmem_cache. The patch basically takes the slob definition of kmem cache and uses the field namees for the other allocators. It also standardizes the names used for basic object lengths in allocators: object_size Struct size specified at kmem_cache_create. Basically the payload expected to be used by the subsystem. size The size of memory allocator for each object. This size is larger than object_size and includes padding, alignment and extra metadata for each object (f.e. for debugging and rcu). Signed-off-by: NChristoph Lameter <cl@linux.com> Signed-off-by: NPekka Enberg <penberg@kernel.org>
-
- 01 6月, 2012 9 次提交
-
-
由 Christoph Lameter 提交于
Avoid passing the kmem_cache_cpu pointer to node_match. This makes the node_match function more generic and easier to understand. Acked-by: NDavid Rientjes <rientjes@google.com> Signed-off-by: NChristoph Lameter <cl@linux.com> Signed-off-by: NPekka Enberg <penberg@kernel.org>
-
由 Christoph Lameter 提交于
Store the value of c->page to avoid additional fetches from per cpu data. Acked-by: NDavid Rientjes <rientjes@google.com> Signed-off-by: NChristoph Lameter <cl@linux.com> Signed-off-by: NPekka Enberg <penberg@kernel.org>
-
由 Christoph Lameter 提交于
Processing on fields of kmem_cache_cpu is cleaner if code working on fields of this struct is taken out of deactivate_slab(). Acked-by: NDavid Rientjes <rientjes@google.com> Signed-off-by: NChristoph Lameter <cl@linux.com> Signed-off-by: NPekka Enberg <penberg@kernel.org>
-
由 Christoph Lameter 提交于
The node field is always page_to_nid(c->page). So its rather easy to replace. Note that there maybe slightly more overhead in various hot paths due to the need to shift the bits from page->flags. However, that is mostly compensated for by a smaller footprint of the kmem_cache_cpu structure (this patch reduces that to 3 words per cache) which allows better caching. Signed-off-by: NChristoph Lameter <cl@linux.com> Signed-off-by: NPekka Enberg <penberg@kernel.org>
-
由 Christoph Lameter 提交于
Moving the attempt to get a slab page from the partial lists simplifies __slab_alloc which is rather complicated. Signed-off-by: NChristoph Lameter <cl@linux.com> Signed-off-by: NPekka Enberg <penberg@kernel.org>
-
由 Christoph Lameter 提交于
Simplify control flow a bit avoiding nesting. Signed-off-by: NChristoph Lameter <cl@linux.com> Signed-off-by: NPekka Enberg <penberg@kernel.org>
-
由 Christoph Lameter 提交于
Avoid the loop in acquire slab and simply fail if there is a conflict. This will cause the next page on the list to be considered. Acked-by: NDavid Rientjes <rientjes@google.com> Signed-off-by: NChristoph Lameter <cl@linux.com> Signed-off-by: NPekka Enberg <penberg@kernel.org>
-
由 Christoph Lameter 提交于
Verify that objects returned from __slab_alloc come from slab pages in the correct state. Signed-off-by: NChristoph Lameter <cl@linux.com> Signed-off-by: NPekka Enberg <penberg@kernel.org>
-
由 Christoph Lameter 提交于
The variable "object" really refers to a list of objects that we are handling. Signed-off-by: NChristoph Lameter <cl@linux.com> Signed-off-by: NPekka Enberg <penberg@kernel.org>
-
- 18 5月, 2012 3 次提交
-
-
由 Joonsoo Kim 提交于
To set page-flag, using SetPageXXXX() and __SetPageXXXX() is more understandable and maintainable. So change it. Signed-off-by: NJoonsoo Kim <js1304@gmail.com> Signed-off-by: NPekka Enberg <penberg@kernel.org>
-
由 Joonsoo Kim 提交于
In the case which is below, 1. acquire slab for cpu partial list 2. free object to it by remote cpu 3. page->freelist = t then memory leak is occurred. Change acquire_slab() not to zap freelist when it works for cpu partial list. I think it is a sufficient solution for fixing a memory leak. Below is output of 'slabinfo -r kmalloc-256' when './perf stat -r 30 hackbench 50 process 4000 > /dev/null' is done. ***Vanilla*** Sizes (bytes) Slabs Debug Memory ------------------------------------------------------------------------ Object : 256 Total : 468 Sanity Checks : Off Total: 3833856 SlabObj: 256 Full : 111 Redzoning : Off Used : 2004992 SlabSiz: 8192 Partial: 302 Poisoning : Off Loss : 1828864 Loss : 0 CpuSlab: 55 Tracking : Off Lalig: 0 Align : 8 Objects: 32 Tracing : Off Lpadd: 0 ***Patched*** Sizes (bytes) Slabs Debug Memory ------------------------------------------------------------------------ Object : 256 Total : 300 Sanity Checks : Off Total: 2457600 SlabObj: 256 Full : 204 Redzoning : Off Used : 2348800 SlabSiz: 8192 Partial: 33 Poisoning : Off Loss : 108800 Loss : 0 CpuSlab: 63 Tracking : Off Lalig: 0 Align : 8 Objects: 32 Tracing : Off Lpadd: 0 Total and loss number is the impact of this patch. Cc: <stable@vger.kernel.org> Acked-by: NChristoph Lameter <cl@linux.com> Signed-off-by: NJoonsoo Kim <js1304@gmail.com> Signed-off-by: NPekka Enberg <penberg@kernel.org>
-
由 majianpeng 提交于
I found some kernel messages such as: SLUB raid5-md127: kmem_cache_destroy called for cache that still has objects. Pid: 6143, comm: mdadm Tainted: G O 3.4.0-rc6+ #75 Call Trace: kmem_cache_destroy+0x328/0x400 free_conf+0x2d/0xf0 [raid456] stop+0x41/0x60 [raid456] md_stop+0x1a/0x60 [md_mod] do_md_stop+0x74/0x470 [md_mod] md_ioctl+0xff/0x11f0 [md_mod] blkdev_ioctl+0xd8/0x7a0 block_ioctl+0x3b/0x40 do_vfs_ioctl+0x96/0x560 sys_ioctl+0x91/0xa0 system_call_fastpath+0x16/0x1b Then using kmemleak I found these messages: unreferenced object 0xffff8800b6db7380 (size 112): comm "mdadm", pid 5783, jiffies 4294810749 (age 90.589s) hex dump (first 32 bytes): 01 01 db b6 ad 4e ad de ff ff ff ff ff ff ff ff .....N.......... ff ff ff ff ff ff ff ff 98 40 4a 82 ff ff ff ff .........@J..... backtrace: kmemleak_alloc+0x21/0x50 kmem_cache_alloc+0xeb/0x1b0 kmem_cache_open+0x2f1/0x430 kmem_cache_create+0x158/0x320 setup_conf+0x649/0x770 [raid456] run+0x68b/0x840 [raid456] md_run+0x529/0x940 [md_mod] do_md_run+0x18/0xc0 [md_mod] md_ioctl+0xba8/0x11f0 [md_mod] blkdev_ioctl+0xd8/0x7a0 block_ioctl+0x3b/0x40 do_vfs_ioctl+0x96/0x560 sys_ioctl+0x91/0xa0 system_call_fastpath+0x16/0x1b This bug was introduced by commit a8364d55 ("slub: only IPI CPUs that have per cpu obj to flush"), which did not include checks for per cpu partial pages being present on a cpu. Signed-off-by: Nmajianpeng <majianpeng@gmail.com> Cc: Gilad Ben-Yossef <gilad@benyossef.com> Acked-by: NChristoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Tested-by: NJeff Layton <jlayton@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 16 5月, 2012 2 次提交
-
-
由 Joonsoo Kim 提交于
We don't use the argument since commit 3b89d7d8 ('slub: move min_partial to struct kmem_cache'), so remove it Acked-by: NChristoph Lameter <cl@linux.com> Acked-by: NDavid Rientjes <rientjes@google.com> Signed-off-by: NJoonsoo Kim <js1304@gmail.com> Signed-off-by: NPekka Enberg <penberg@kernel.org>
-
由 Joonsoo Kim 提交于
Memory allocated by kstrdup should be freed, when kmalloc(kmem_size, GFP_KERNEL) is failed. Acked-by: NChristoph Lameter <cl@linux.com> Acked-by: NDavid Rientjes <rientjes@google.com> Signed-off-by: NJoonsoo Kim <js1304@gmail.com> Signed-off-by: NPekka Enberg <penberg@kernel.org>
-
- 08 5月, 2012 1 次提交
-
-
由 Joonsoo Kim 提交于
Commit 497b66f2 ('slub: return object pointer from get_partial() / new_slab().') changed return type of some functions. This updates missing part. Signed-off-by: NJoonsoo Kim <js1304@gmail.com> Signed-off-by: NPekka Enberg <penberg@kernel.org>
-
- 29 3月, 2012 1 次提交
-
-
由 Gilad Ben-Yossef 提交于
flush_all() is called for each kmem_cache_destroy(). So every cache being destroyed dynamically ends up sending an IPI to each CPU in the system, regardless if the cache has ever been used there. For example, if you close the Infinband ipath driver char device file, the close file ops calls kmem_cache_destroy(). So running some infiniband config tool on one a single CPU dedicated to system tasks might interrupt the rest of the 127 CPUs dedicated to some CPU intensive or latency sensitive task. I suspect there is a good chance that every line in the output of "git grep kmem_cache_destroy linux/ | grep '\->'" has a similar scenario. This patch attempts to rectify this issue by sending an IPI to flush the per cpu objects back to the free lists only to CPUs that seem to have such objects. The check which CPU to IPI is racy but we don't care since asking a CPU without per cpu objects to flush does no damage and as far as I can tell the flush_all by itself is racy against allocs on remote CPUs anyway, so if you required the flush_all to be determinstic, you had to arrange for locking regardless. Without this patch the following artificial test case: $ cd /sys/kernel/slab $ for DIR in *; do cat $DIR/alloc_calls > /dev/null; done produces 166 IPIs on an cpuset isolated CPU. With it it produces none. The code path of memory allocation failure for CPUMASK_OFFSTACK=y config was tested using fault injection framework. Signed-off-by: NGilad Ben-Yossef <gilad@benyossef.com> Acked-by: NChristoph Lameter <cl@linux.com> Cc: Chris Metcalf <cmetcalf@tilera.com> Acked-by: NPeter Zijlstra <a.p.zijlstra@chello.nl> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Russell King <linux@arm.linux.org.uk> Cc: Pekka Enberg <penberg@kernel.org> Cc: Matt Mackall <mpm@selenic.com> Cc: Sasha Levin <levinsasha928@gmail.com> Cc: Rik van Riel <riel@redhat.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Avi Kivity <avi@redhat.com> Cc: Michal Nazarewicz <mina86@mina86.org> Cc: Kosaki Motohiro <kosaki.motohiro@gmail.com> Cc: Milton Miller <miltonm@bga.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 22 3月, 2012 1 次提交
-
-
由 Mel Gorman 提交于
Commit c0ff7453 ("cpuset,mm: fix no node to alloc memory when changing cpuset's mems") wins a super prize for the largest number of memory barriers entered into fast paths for one commit. [get|put]_mems_allowed is incredibly heavy with pairs of full memory barriers inserted into a number of hot paths. This was detected while investigating at large page allocator slowdown introduced some time after 2.6.32. The largest portion of this overhead was shown by oprofile to be at an mfence introduced by this commit into the page allocator hot path. For extra style points, the commit introduced the use of yield() in an implementation of what looks like a spinning mutex. This patch replaces the full memory barriers on both read and write sides with a sequence counter with just read barriers on the fast path side. This is much cheaper on some architectures, including x86. The main bulk of the patch is the retry logic if the nodemask changes in a manner that can cause a false failure. While updating the nodemask, a check is made to see if a false failure is a risk. If it is, the sequence number gets bumped and parallel allocators will briefly stall while the nodemask update takes place. In a page fault test microbenchmark, oprofile samples from __alloc_pages_nodemask went from 4.53% of all samples to 1.15%. The actual results were 3.3.0-rc3 3.3.0-rc3 rc3-vanilla nobarrier-v2r1 Clients 1 UserTime 0.07 ( 0.00%) 0.08 (-14.19%) Clients 2 UserTime 0.07 ( 0.00%) 0.07 ( 2.72%) Clients 4 UserTime 0.08 ( 0.00%) 0.07 ( 3.29%) Clients 1 SysTime 0.70 ( 0.00%) 0.65 ( 6.65%) Clients 2 SysTime 0.85 ( 0.00%) 0.82 ( 3.65%) Clients 4 SysTime 1.41 ( 0.00%) 1.41 ( 0.32%) Clients 1 WallTime 0.77 ( 0.00%) 0.74 ( 4.19%) Clients 2 WallTime 0.47 ( 0.00%) 0.45 ( 3.73%) Clients 4 WallTime 0.38 ( 0.00%) 0.37 ( 1.58%) Clients 1 Flt/sec/cpu 497620.28 ( 0.00%) 520294.53 ( 4.56%) Clients 2 Flt/sec/cpu 414639.05 ( 0.00%) 429882.01 ( 3.68%) Clients 4 Flt/sec/cpu 257959.16 ( 0.00%) 258761.48 ( 0.31%) Clients 1 Flt/sec 495161.39 ( 0.00%) 517292.87 ( 4.47%) Clients 2 Flt/sec 820325.95 ( 0.00%) 850289.77 ( 3.65%) Clients 4 Flt/sec 1020068.93 ( 0.00%) 1022674.06 ( 0.26%) MMTests Statistics: duration Sys Time Running Test (seconds) 135.68 132.17 User+Sys Time Running Test (seconds) 164.2 160.13 Total Elapsed Time (seconds) 123.46 120.87 The overall improvement is small but the System CPU time is much improved and roughly in correlation to what oprofile reported (these performance figures are without profiling so skew is expected). The actual number of page faults is noticeably improved. For benchmarks like kernel builds, the overall benefit is marginal but the system CPU time is slightly reduced. To test the actual bug the commit fixed I opened two terminals. The first ran within a cpuset and continually ran a small program that faulted 100M of anonymous data. In a second window, the nodemask of the cpuset was continually randomised in a loop. Without the commit, the program would fail every so often (usually within 10 seconds) and obviously with the commit everything worked fine. With this patch applied, it also worked fine so the fix should be functionally equivalent. Signed-off-by: NMel Gorman <mgorman@suse.de> Cc: Miao Xie <miaox@cn.fujitsu.com> Cc: David Rientjes <rientjes@google.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Christoph Lameter <cl@linux.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 18 2月, 2012 1 次提交
-
-
由 Alex Shi 提交于
This patch split the cpu_partial_free into 2 parts: cpu_partial_node, PCP refilling times from node partial; and same name cpu_partial_free, PCP refilling times in slab_free slow path. A new statistic 'cpu_partial_drain' is added to get PCP drain to node partial times. These info are useful when do PCP tunning. The slabinfo.c code is unchanged, since cpu_partial_node is not on slow path. Signed-off-by: NAlex Shi <alex.shi@intel.com> Acked-by: NChristoph Lameter <cl@linux.com> Signed-off-by: NPekka Enberg <penberg@kernel.org>
-
- 10 2月, 2012 1 次提交
-
-
由 Christoph Lameter 提交于
Otherwise m68k breaks: On Mon, 30 Jan 2012, Geert Uytterhoeven wrote: > m68k/allmodconfig at http://kisskb.ellerman.id.au/kisskb/buildresult/5527349/ > > mm/slub.c:274: error: implicit declaration of function 'prefetch' > > Sorry, didn't notice it earlier due to other build breakage in -next. Reported-by: NGeert Uytterhoeven <geert@linux-m68k.org> Acked-by: NGeert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: NChristoph Lameter <cl@linux.com> Signed-off-by: NPekka Enberg <penberg@kernel.org>
-
- 06 2月, 2012 1 次提交
-
-
由 Christoph Lameter 提交于
sysfs_slab_add() calls various sysfs functions that actually may end up in userspace doing all sorts of things. Release the slub_lock after adding the kmem_cache structure to the list. At that point the address of the kmem_cache is not known so we are guaranteed exlusive access to the following modifications to the kmem_cache structure. If the sysfs_slab_add fails then reacquire the slub_lock to remove the kmem_cache structure from the list. Cc: <stable@vger.kernel.org> # 3.3+ Reported-by: NSasha Levin <levinsasha928@gmail.com> Acked-by: NEric Dumazet <eric.dumazet@gmail.com> Signed-off-by: NChristoph Lameter <cl@linux.com> Signed-off-by: NPekka Enberg <penberg@kernel.org>
-
- 25 1月, 2012 1 次提交
-
-
由 Eric Dumazet 提交于
Recycling a page is a problem, since freelist link chain is hot on cpu(s) which freed objects, and possibly very cold on cpu currently owning slab. Adding a prefetch of cache line containing the pointer to next object in slab_alloc() helps a lot in many workloads, in particular on assymetric ones (allocations done on one cpu, frees on another cpus). Added cost is three machine instructions only. Examples on my dual socket quad core ht machine (Intel CPU E5540 @2.53GHz) (16 logical cpus, 2 memory nodes), 64bit kernel. Before patch : # perf stat -r 32 hackbench 50 process 4000 >/dev/null Performance counter stats for 'hackbench 50 process 4000' (32 runs): 327577,471718 task-clock # 15,821 CPUs utilized ( +- 0,64% ) 28 866 491 context-switches # 0,088 M/sec ( +- 1,80% ) 1 506 929 CPU-migrations # 0,005 M/sec ( +- 3,24% ) 127 151 page-faults # 0,000 M/sec ( +- 0,16% ) 829 399 813 448 cycles # 2,532 GHz ( +- 0,64% ) 580 664 691 740 stalled-cycles-frontend # 70,01% frontend cycles idle ( +- 0,71% ) 197 431 700 448 stalled-cycles-backend # 23,80% backend cycles idle ( +- 1,03% ) 503 548 648 975 instructions # 0,61 insns per cycle # 1,15 stalled cycles per insn ( +- 0,46% ) 95 780 068 471 branches # 292,389 M/sec ( +- 0,48% ) 1 426 407 916 branch-misses # 1,49% of all branches ( +- 1,35% ) 20,705679994 seconds time elapsed ( +- 0,64% ) After patch : # perf stat -r 32 hackbench 50 process 4000 >/dev/null Performance counter stats for 'hackbench 50 process 4000' (32 runs): 286236,542804 task-clock # 15,786 CPUs utilized ( +- 1,32% ) 19 703 372 context-switches # 0,069 M/sec ( +- 4,99% ) 1 658 249 CPU-migrations # 0,006 M/sec ( +- 6,62% ) 126 776 page-faults # 0,000 M/sec ( +- 0,12% ) 724 636 593 213 cycles # 2,532 GHz ( +- 1,32% ) 499 320 714 837 stalled-cycles-frontend # 68,91% frontend cycles idle ( +- 1,47% ) 156 555 126 809 stalled-cycles-backend # 21,60% backend cycles idle ( +- 2,22% ) 463 897 792 661 instructions # 0,64 insns per cycle # 1,08 stalled cycles per insn ( +- 0,94% ) 87 717 352 563 branches # 306,451 M/sec ( +- 0,99% ) 941 738 280 branch-misses # 1,07% of all branches ( +- 3,35% ) 18,132070670 seconds time elapsed ( +- 1,30% ) Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com> Acked-by: NChristoph Lameter <cl@linux.com> CC: Matt Mackall <mpm@selenic.com> CC: David Rientjes <rientjes@google.com> CC: "Alex,Shi" <alex.shi@intel.com> CC: Shaohua Li <shaohua.li@intel.com> Signed-off-by: NPekka Enberg <penberg@kernel.org>
-
- 13 1月, 2012 2 次提交
-
-
由 Heiko Carstens 提交于
Move CMPXCHG_DOUBLE and rename it to HAVE_CMPXCHG_DOUBLE so architectures can simply select the option if it is supported. Signed-off-by: NHeiko Carstens <heiko.carstens@de.ibm.com> Acked-by: NChristoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Heiko Carstens 提交于
While implementing cmpxchg_double() on s390 I realized that we don't set CONFIG_CMPXCHG_LOCAL despite the fact that we have support for it. However setting that option will increase the size of struct page by eight bytes on 64 bit, which we certainly do not want. Also, it doesn't make sense that a present cpu feature should increase the size of struct page. Besides that it looks like the dependency to CMPXCHG_LOCAL is wrong and that it should depend on CMPXCHG_DOUBLE instead. This patch: If an architecture supports CMPXCHG_LOCAL this shouldn't result automatically in larger struct pages if the SLUB allocator is used. Instead introduce a new config option "HAVE_ALIGNED_STRUCT_PAGE" which can be selected if a double word aligned struct page is required. Also update x86 Kconfig so that it should work as before. Signed-off-by: NHeiko Carstens <heiko.carstens@de.ibm.com> Acked-by: NChristoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 11 1月, 2012 2 次提交
-
-
由 Stanislaw Gruszka 提交于
Disable slub debug facilities and allocate slabs at minimal order when debug_guardpage_minorder > 0 to increase probability to catch random memory corruption by cpu exception. Signed-off-by: NStanislaw Gruszka <sgruszka@redhat.com> Cc: "Rafael J. Wysocki" <rjw@sisk.pl> Cc: Andrea Arcangeli <aarcange@redhat.com> Acked-by: NChristoph Lameter <cl@linux.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Stanislaw Gruszka <sgruszka@redhat.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 David Rientjes 提交于
For caches with debugging enabled, "slub: Switch per cpu partial page support off for debugging" changes cpu_partial to 0. It shouldn't be tunable from userspace for such caches, otherwise the same accounting issues arise during validation. This patch disallows tuning /sys/kernel/slab/cache/cpu_partial to be non- zero for caches with debugging enabled. Acked-by: NChristoph Lameter <cl@linux.com> Signed-off-by: NDavid Rientjes <rientjes@google.com> Signed-off-by: NPekka Enberg <penberg@kernel.org>
-
- 04 1月, 2012 1 次提交
-
-
由 Jan Beulich 提交于
Just like the per-CPU ones they had several problems/shortcomings: Only the first memory operand was mentioned in the asm() operands, and the 2x64-bit version didn't have a memory clobber while the 2x32-bit one did. The former allowed the compiler to not recognize the need to re-load the data in case it had it cached in some register, while the latter was overly destructive. The types of the local copies of the old and new values were incorrect (the types of the pointed-to variables should be used here, to make sure the respective old/new variable types are compatible). The __dummy/__junk variables were pointless, given that local copies of the inputs already existed (and can hence be used for discarded outputs). The 32-bit variant of cmpxchg_double_local() referenced cmpxchg16b_local(). At once also: - change the return value type to what it really is: 'bool' - unify 32- and 64-bit variants - abstract out the common part of the 'normal' and 'local' variants Signed-off-by: NJan Beulich <jbeulich@suse.com> Cc: Christoph Lameter <cl@linux.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/r/4F01F12A020000780006A19B@nat28.tlf.novell.comSigned-off-by: NIngo Molnar <mingo@elte.hu>
-
- 23 12月, 2011 1 次提交
-
-
由 Christoph Lameter 提交于
We simply say that regular this_cpu use must be safe regardless of preemption and interrupt state. That has no material change for x86 and s390 implementations of this_cpu operations. However, arches that do not provide their own implementation for this_cpu operations will now get code generated that disables interrupts instead of preemption. -tj: This is part of on-going percpu API cleanup. For detailed discussion of the subject, please refer to the following thread. http://thread.gmane.org/gmane.linux.kernel/1222078Signed-off-by: NChristoph Lameter <cl@linux.com> Signed-off-by: NTejun Heo <tj@kernel.org> LKML-Reference: <alpine.DEB.2.00.1112221154380.11787@router.home>
-