提交 · 96d17b7be0a9849d381442030886211dbb2a7061 · openeuler / raspberrypi-kernel

05 9月, 2012 5 次提交

mm/sl[aou]b: Get rid of __kmem_cache_destroy · 12c3667f

由 Christoph Lameter 提交于 9月 04, 2012

What is done there can be done in __kmem_cache_shutdown.

This affects RCU handling somewhat. On rcu free all slab allocators do
not refer to other management structures than the kmem_cache structure.
Therefore these other structures can be freed before the rcu deferred
free to the page allocator occurs.
Reviewed-by: NJoonsoo Kim <js1304@gmail.com>
Signed-off-by: NChristoph Lameter <cl@linux.com>
Signed-off-by: NPekka Enberg <penberg@kernel.org>

12c3667f

mm/sl[aou]b: Move freeing of kmem_cache structure to common code · 8f4c765c

由 Christoph Lameter 提交于 9月 05, 2012

The freeing action is basically the same in all slab allocators.
Move to the common kmem_cache_destroy() function.
Reviewed-by: NGlauber Costa <glommer@parallels.com>
Reviewed-by: NJoonsoo Kim <js1304@gmail.com>
Signed-off-by: NChristoph Lameter <cl@linux.com>
Signed-off-by: NPekka Enberg <penberg@kernel.org>

8f4c765c

mm/sl[aou]b: Use "kmem_cache" name for slab cache with kmem_cache struct · 9b030cb8

由 Christoph Lameter 提交于 9月 05, 2012

Make all allocators use the "kmem_cache" slabname for the "kmem_cache"
structure.
Reviewed-by: NGlauber Costa <glommer@parallels.com>
Reviewed-by: NJoonsoo Kim <js1304@gmail.com>
Signed-off-by: NChristoph Lameter <cl@linux.com>
Signed-off-by: NPekka Enberg <penberg@kernel.org>

9b030cb8

mm/sl[aou]b: Extract a common function for kmem_cache_destroy · 945cf2b6

由 Christoph Lameter 提交于 9月 04, 2012

kmem_cache_destroy does basically the same in all allocators.

Extract common code which is easy since we already have common mutex
handling.
Reviewed-by: NGlauber Costa <glommer@parallels.com>
Signed-off-by: NChristoph Lameter <cl@linux.com>
Signed-off-by: NPekka Enberg <penberg@kernel.org>

945cf2b6

mm/sl[aou]b: Move list_add() to slab_common.c · 7c9adf5a

由 Christoph Lameter 提交于 9月 04, 2012

Move the code to append the new kmem_cache to the list of slab caches to
the kmem_cache_create code in the shared code.

This is possible now since the acquisition of the mutex was moved into
kmem_cache_create().
Acked-by: NDavid Rientjes <rientjes@google.com>
Reviewed-by: NGlauber Costa <glommer@parallels.com>
Reviewed-by: NJoonsoo Kim <js1304@gmail.com>
Signed-off-by: NChristoph Lameter <cl@linux.com>
Signed-off-by: NPekka Enberg <penberg@kernel.org>

7c9adf5a

17 8月, 2012 1 次提交

mm, slab: remove page_get_cache · 5b74beb4

由 David Rientjes 提交于 8月 16, 2012

page_get_cache() isn't called from anything, so remove it.
Signed-off-by: NDavid Rientjes <rientjes@google.com>
Signed-off-by: NPekka Enberg <penberg@kernel.org>

5b74beb4

16 8月, 2012 1 次提交

slab: do not call compound_head() in page_get_cache() · 48f24741

由 Michel Lespinasse 提交于 8月 14, 2012

page_get_cache() does not need to call compound_head(), as its unique
caller virt_to_slab() already makes sure to return a head page.

Additionally, removing the compound_head() call makes page_get_cache()
consistent with page_get_slab().
Signed-off-by: NMichel Lespinasse <walken@google.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Pekka Enberg <penberg@kernel.org>
Acked-by: NDavid Rientjes <rientjes@google.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NPekka Enberg <penberg@kernel.org>

48f24741

01 8月, 2012 3 次提交

mm: micro-optimise slab to avoid a function call · 381760ea

由 Mel Gorman 提交于 7月 31, 2012

Getting and putting objects in SLAB currently requires a function call but
the bulk of the work is related to PFMEMALLOC reserves which are only
consumed when network-backed storage is critical.  Use an inline function
to determine if the function call is required.
Signed-off-by: NMel Gorman <mgorman@suse.de>
Cc: David Miller <davem@davemloft.net>
Cc: Neil Brown <neilb@suse.de>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Christie <michaelc@cs.wisc.edu>
Cc: Eric B Munson <emunson@mgebm.net>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Christoph Lameter <cl@linux.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

381760ea

mm: introduce __GFP_MEMALLOC to allow access to emergency reserves · b37f1dd0

由 Mel Gorman 提交于 7月 31, 2012

__GFP_MEMALLOC will allow the allocation to disregard the watermarks, much
like PF_MEMALLOC.  It allows one to pass along the memalloc state in
object related allocation flags as opposed to task related flags, such as
sk->sk_allocation.  This removes the need for ALLOC_PFMEMALLOC as callers
using __GFP_MEMALLOC can get the ALLOC_NO_WATERMARK flag which is now
enough to identify allocations related to page reclaim.
Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: NMel Gorman <mgorman@suse.de>
Cc: David Miller <davem@davemloft.net>
Cc: Neil Brown <neilb@suse.de>
Cc: Mike Christie <michaelc@cs.wisc.edu>
Cc: Eric B Munson <emunson@mgebm.net>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Christoph Lameter <cl@linux.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

b37f1dd0

mm: sl[au]b: add knowledge of PFMEMALLOC reserve pages · 072bb0aa

由 Mel Gorman 提交于 7月 31, 2012

When a user or administrator requires swap for their application, they
create a swap partition and file, format it with mkswap and activate it
with swapon.  Swap over the network is considered as an option in diskless
systems.  The two likely scenarios are when blade servers are used as part
of a cluster where the form factor or maintenance costs do not allow the
use of disks and thin clients.

The Linux Terminal Server Project recommends the use of the Network Block
Device (NBD) for swap according to the manual at
https://sourceforge.net/projects/ltsp/files/Docs-Admin-Guide/LTSPManual.pdf/download
There is also documentation and tutorials on how to setup swap over NBD at
places like https://help.ubuntu.com/community/UbuntuLTSP/EnableNBDSWAP The
nbd-client also documents the use of NBD as swap.  Despite this, the fact
is that a machine using NBD for swap can deadlock within minutes if swap
is used intensively.  This patch series addresses the problem.

The core issue is that network block devices do not use mempools like
normal block devices do.  As the host cannot control where they receive
packets from, they cannot reliably work out in advance how much memory
they might need.  Some years ago, Peter Zijlstra developed a series of
patches that supported swap over an NFS that at least one distribution is
carrying within their kernels.  This patch series borrows very heavily
from Peter's work to support swapping over NBD as a pre-requisite to
supporting swap-over-NFS.  The bulk of the complexity is concerned with
preserving memory that is allocated from the PFMEMALLOC reserves for use
by the network layer which is needed for both NBD and NFS.

Patch 1 adds knowledge of the PFMEMALLOC reserves to SLAB and SLUB to
	preserve access to pages allocated under low memory situations
	to callers that are freeing memory.

Patch 2 optimises the SLUB fast path to avoid pfmemalloc checks

Patch 3 introduces __GFP_MEMALLOC to allow access to the PFMEMALLOC
	reserves without setting PFMEMALLOC.

Patch 4 opens the possibility for softirqs to use PFMEMALLOC reserves
	for later use by network packet processing.

Patch 5 only sets page->pfmemalloc when ALLOC_NO_WATERMARKS was required

Patch 6 ignores memory policies when ALLOC_NO_WATERMARKS is set.

Patches 7-12 allows network processing to use PFMEMALLOC reserves when
	the socket has been marked as being used by the VM to clean pages. If
	packets are received and stored in pages that were allocated under
	low-memory situations and are unrelated to the VM, the packets
	are dropped.

	Patch 11 reintroduces __skb_alloc_page which the networking
	folk may object to but is needed in some cases to propogate
	pfmemalloc from a newly allocated page to an skb. If there is a
	strong objection, this patch can be dropped with the impact being
	that swap-over-network will be slower in some cases but it should
	not fail.

Patch 13 is a micro-optimisation to avoid a function call in the
	common case.

Patch 14 tags NBD sockets as being SOCK_MEMALLOC so they can use
	PFMEMALLOC if necessary.

Patch 15 notes that it is still possible for the PFMEMALLOC reserve
	to be depleted. To prevent this, direct reclaimers get throttled on
	a waitqueue if 50% of the PFMEMALLOC reserves are depleted.  It is
	expected that kswapd and the direct reclaimers already running
	will clean enough pages for the low watermark to be reached and
	the throttled processes are woken up.

Patch 16 adds a statistic to track how often processes get throttled

Some basic performance testing was run using kernel builds, netperf on
loopback for UDP and TCP, hackbench (pipes and sockets), iozone and
sysbench.  Each of them were expected to use the sl*b allocators
reasonably heavily but there did not appear to be significant performance
variances.

For testing swap-over-NBD, a machine was booted with 2G of RAM with a
swapfile backed by NBD.  8*NUM_CPU processes were started that create
anonymous memory mappings and read them linearly in a loop.  The total
size of the mappings were 4*PHYSICAL_MEMORY to use swap heavily under
memory pressure.

Without the patches and using SLUB, the machine locks up within minutes
and runs to completion with them applied.  With SLAB, the story is
different as an unpatched kernel run to completion.  However, the patched
kernel completed the test 45% faster.

MICRO
                                         3.5.0-rc2 3.5.0-rc2
					 vanilla     swapnbd
Unrecognised test vmscan-anon-mmap-write
MMTests Statistics: duration
Sys Time Running Test (seconds)             197.80    173.07
User+Sys Time Running Test (seconds)        206.96    182.03
Total Elapsed Time (seconds)               3240.70   1762.09

This patch: mm: sl[au]b: add knowledge of PFMEMALLOC reserve pages

Allocations of pages below the min watermark run a risk of the machine
hanging due to a lack of memory.  To prevent this, only callers who have
PF_MEMALLOC or TIF_MEMDIE set and are not processing an interrupt are
allowed to allocate with ALLOC_NO_WATERMARKS.  Once they are allocated to
a slab though, nothing prevents other callers consuming free objects
within those slabs.  This patch limits access to slab pages that were
alloced from the PFMEMALLOC reserves.

When this patch is applied, pages allocated from below the low watermark
are returned with page->pfmemalloc set and it is up to the caller to
determine how the page should be protected.  SLAB restricts access to any
page with page->pfmemalloc set to callers which are known to able to
access the PFMEMALLOC reserve.  If one is not available, an attempt is
made to allocate a new page rather than use a reserve.  SLUB is a bit more
relaxed in that it only records if the current per-CPU page was allocated
from PFMEMALLOC reserve and uses another partial slab if the caller does
not have the necessary GFP or process flags.  This was found to be
sufficient in tests to avoid hangs due to SLUB generally maintaining
smaller lists than SLAB.

In low-memory conditions it does mean that !PFMEMALLOC allocators can fail
a slab allocation even though free objects are available because they are
being preserved for callers that are freeing pages.

[a.p.zijlstra@chello.nl: Original implementation]
[sebastian@breakpoint.cc: Correct order of page flag clearing]
Signed-off-by: NMel Gorman <mgorman@suse.de>
Cc: David Miller <davem@davemloft.net>
Cc: Neil Brown <neilb@suse.de>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Christie <michaelc@cs.wisc.edu>
Cc: Eric B Munson <emunson@mgebm.net>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Christoph Lameter <cl@linux.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

072bb0aa

09 7月, 2012 4 次提交

mm, sl[aou]b: Move kmem_cache_create mutex handling to common code · 20cea968

由 Christoph Lameter 提交于 7月 06, 2012

Move the mutex handling into the common kmem_cache_create()
function.

Then we can also move more checks out of SLAB's kmem_cache_create()
into the common code.
Reviewed-by: NGlauber Costa <glommer@parallels.com>
Signed-off-by: NChristoph Lameter <cl@linux.com>
Signed-off-by: NPekka Enberg <penberg@kernel.org>

20cea968

mm, sl[aou]b: Use a common mutex definition · 18004c5d

由 Christoph Lameter 提交于 7月 06, 2012

Use the mutex definition from SLAB and make it the common way to take a sleeping lock.

This has the effect of using a mutex instead of a rw semaphore for SLUB.

SLOB gains the use of a mutex for kmem_cache_create serialization.
Not needed now but SLOB may acquire some more features later (like slabinfo
/ sysfs support) through the expansion of the common code that will
need this.
Reviewed-by: NGlauber Costa <glommer@parallels.com>
Reviewed-by: NJoonsoo Kim <js1304@gmail.com>
Signed-off-by: NChristoph Lameter <cl@linux.com>
Signed-off-by: NPekka Enberg <penberg@kernel.org>

18004c5d

mm, sl[aou]b: Common definition for boot state of the slab allocators · 97d06609

由 Christoph Lameter 提交于 7月 06, 2012

All allocators have some sort of support for the bootstrap status.

Setup a common definition for the boot states and make all slab
allocators use that definition.
Reviewed-by: NGlauber Costa <glommer@parallels.com>
Reviewed-by: NJoonsoo Kim <js1304@gmail.com>
Signed-off-by: NChristoph Lameter <cl@linux.com>
Signed-off-by: NPekka Enberg <penberg@kernel.org>

97d06609

mm, sl[aou]b: Extract common code for kmem_cache_create() · 039363f3

由 Christoph Lameter 提交于 7月 06, 2012

Kmem_cache_create() does a variety of sanity checks but those
vary depending on the allocator. Use the strictest tests and put them into
a slab_common file. Make the tests conditional on CONFIG_DEBUG_VM.

This patch has the effect of adding sanity checks for SLUB and SLOB
under CONFIG_DEBUG_VM and removes the checks in SLAB for !CONFIG_DEBUG_VM.
Signed-off-by: NChristoph Lameter <cl@linux.com>
Signed-off-by: NPekka Enberg <penberg@kernel.org>

039363f3

02 7月, 2012 4 次提交

slab: move FULL state transition to an initcall · a164f896

由 Glauber Costa 提交于 6月 21, 2012

During kmem_cache_init_late(), we transition to the LATE state,
and after some more work, to the FULL state, its last state

This is quite different from slub, that will only transition to
its last state (previously SYSFS), in a (late)initcall, after a lot
more of the kernel is ready.

This means that in slab, we have no way to taking actions dependent
on the initialization of other pieces of the kernel that are supposed
to start way after kmem_init_late(), such as cgroups initialization.

To achieve more consistency in this behavior, that patch only
transitions to the UP state in kmem_init_late. In my analysis,
setup_cpu_cache() should be happy to test for >= UP, instead of
== FULL. It also has passed some tests I've made.

We then only mark FULL state after the reap timers are in place,
meaning that no further setup is expected.
Signed-off-by: NGlauber Costa <glommer@parallels.com>
Acked-by: NChristoph Lameter <cl@linux.com>
Acked-by: NDavid Rientjes <rientjes@google.com>
Signed-off-by: NPekka Enberg <penberg@kernel.org>

a164f896

slab: Fix a typo in commit 8c138b "slab: Get rid of obj_size macro" · d97d476b

由 Feng Tang 提交于 7月 02, 2012

Commit  8c138b only sits in Pekka's and linux-next tree now, which tries
to replace obj_size(cachep) with cachep->object_size, but has a typo in
kmem_cache_free() by using "size" instead of "object_size", which casues
some regressions.
Reported-and-tested-by: NFengguang Wu <wfg@linux.intel.com>
Signed-off-by: NFeng Tang <feng.tang@intel.com>
Cc: Christoph Lameter <cl@linux.com>
Acked-by: NGlauber Costa <glommer@parallels.com>
Signed-off-by: NPekka Enberg <penberg@kernel.org>

d97d476b

mm, slab: Build fix for recent kmem_cache changes · 0672aa7c

由 Thierry Reding 提交于 6月 22, 2012

Commit 3b0efdfa ("mm, sl[aou]b: Extract common fields from struct
kmem_cache") renamed the kmem_cache structure's "next" field to "list"
but forgot to update one instance in leaks_show().
Signed-off-by: NThierry Reding <thierry.reding@avionic-design.de>
Signed-off-by: NPekka Enberg <penberg@kernel.org>

0672aa7c

slab: rename gfpflags to allocflags · a618e89f

由 Glauber Costa 提交于 6月 14, 2012

A consistent name with slub saves us an acessor function.
In both caches, this field represents the same thing. We would
like to use it from the mem_cgroup code.
Signed-off-by: NGlauber Costa <glommer@parallels.com>
Acked-by: NChristoph Lameter <cl@linux.com>
CC: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: NPekka Enberg <penberg@kernel.org>

a618e89f

20 6月, 2012 1 次提交

slab/mempolicy: always use local policy from interrupt context · e7b691b0

由 Andi Kleen 提交于 6月 09, 2012

slab_node() could access current->mempolicy from interrupt context.
However there's a race condition during exit where the mempolicy
is first freed and then the pointer zeroed.

Using this from interrupts seems bogus anyways. The interrupt
will interrupt a random process and therefore get a random
mempolicy. Many times, this will be idle's, which noone can change.

Just disable this here and always use local for slab
from interrupts. I also cleaned up the callers of slab_node a bit
which always passed the same argument.

I believe the original mempolicy code did that in fact,
so it's likely a regression.

v2: send version with correct logic
v3: simplify. fix typo.
Reported-by: NArun Sharma <asharma@fb.com>
Cc: penberg@kernel.org
Cc: cl@linux.com
Signed-off-by: NAndi Kleen <ak@linux.intel.com>
[tdmackey@twitter.com: Rework control flow based on feedback from
cl@linux.com, fix logic, and cleanup current task_struct reference]
Acked-by: NDavid Rientjes <rientjes@google.com>
Acked-by: NChristoph Lameter <cl@linux.com>
Acked-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: NDavid Mackey <tdmackey@twitter.com>
Signed-off-by: NPekka Enberg <penberg@kernel.org>

e7b691b0

14 6月, 2012 4 次提交

slab: Get rid of obj_size macro · 8c138bc0

由 Christoph Lameter 提交于 6月 13, 2012

The size of the slab object is frequently needed. Since we now
have a size field directly in the kmem_cache structure there is no
need anymore of the obj_size macro/function.
Signed-off-by: NChristoph Lameter <cl@linux.com>
Signed-off-by: NPekka Enberg <penberg@kernel.org>

8c138bc0

mm, sl[aou]b: Extract common fields from struct kmem_cache · 3b0efdfa

由 Christoph Lameter 提交于 6月 13, 2012

Define a struct that describes common fields used in all slab allocators.
A slab allocator either uses the common definition (like SLOB) or is
required to provide members of kmem_cache with the definition given.

After that it will be possible to share code that
only operates on those fields of kmem_cache.

The patch basically takes the slob definition of kmem cache and
uses the field namees for the other allocators.

It also standardizes the names used for basic object lengths in
allocators:

object_size	Struct size specified at kmem_cache_create. Basically
		the payload expected to be used by the subsystem.

size		The size of memory allocator for each object. This size
		is larger than object_size and includes padding, alignment
		and extra metadata for each object (f.e. for debugging
		and rcu).
Signed-off-by: NChristoph Lameter <cl@linux.com>
Signed-off-by: NPekka Enberg <penberg@kernel.org>

3b0efdfa

slab: Remove some accessors · 35026088

由 Christoph Lameter 提交于 6月 13, 2012

Those are rather trivial now and its better to see inline what is
really going on.
Signed-off-by: NChristoph Lameter <cl@linux.com>
Signed-off-by: NPekka Enberg <penberg@kernel.org>

35026088

slab: Use page struct fields instead of casting · e571b0ad

由 Christoph Lameter 提交于 6月 13, 2012

Add fields to the page struct so that it is properly documented that
slab overlays the lru fields.

This cleans up some casts in slab.
Reviewed-by: NGlauber Costa <glommer@parallels.com>
Reviewed-by: NJoonsoo Kim <js1304@gmail.com>
Signed-off-by: NChristoph Lameter <cl@linux.com>
Signed-off-by: NPekka Enberg <penberg@kernel.org>

e571b0ad

22 3月, 2012 1 次提交

cpuset: mm: reduce large amounts of memory barrier related damage v3 · cc9a6c87

由 Mel Gorman 提交于 3月 21, 2012

Commit c0ff7453 ("cpuset,mm: fix no node to alloc memory when
changing cpuset's mems") wins a super prize for the largest number of
memory barriers entered into fast paths for one commit.

[get|put]_mems_allowed is incredibly heavy with pairs of full memory
barriers inserted into a number of hot paths.  This was detected while
investigating at large page allocator slowdown introduced some time
after 2.6.32.  The largest portion of this overhead was shown by
oprofile to be at an mfence introduced by this commit into the page
allocator hot path.

For extra style points, the commit introduced the use of yield() in an
implementation of what looks like a spinning mutex.

This patch replaces the full memory barriers on both read and write
sides with a sequence counter with just read barriers on the fast path
side.  This is much cheaper on some architectures, including x86.  The
main bulk of the patch is the retry logic if the nodemask changes in a
manner that can cause a false failure.

While updating the nodemask, a check is made to see if a false failure
is a risk.  If it is, the sequence number gets bumped and parallel
allocators will briefly stall while the nodemask update takes place.

In a page fault test microbenchmark, oprofile samples from
__alloc_pages_nodemask went from 4.53% of all samples to 1.15%.  The
actual results were

                             3.3.0-rc3          3.3.0-rc3
                             rc3-vanilla        nobarrier-v2r1
    Clients   1 UserTime       0.07 (  0.00%)   0.08 (-14.19%)
    Clients   2 UserTime       0.07 (  0.00%)   0.07 (  2.72%)
    Clients   4 UserTime       0.08 (  0.00%)   0.07 (  3.29%)
    Clients   1 SysTime        0.70 (  0.00%)   0.65 (  6.65%)
    Clients   2 SysTime        0.85 (  0.00%)   0.82 (  3.65%)
    Clients   4 SysTime        1.41 (  0.00%)   1.41 (  0.32%)
    Clients   1 WallTime       0.77 (  0.00%)   0.74 (  4.19%)
    Clients   2 WallTime       0.47 (  0.00%)   0.45 (  3.73%)
    Clients   4 WallTime       0.38 (  0.00%)   0.37 (  1.58%)
    Clients   1 Flt/sec/cpu  497620.28 (  0.00%) 520294.53 (  4.56%)
    Clients   2 Flt/sec/cpu  414639.05 (  0.00%) 429882.01 (  3.68%)
    Clients   4 Flt/sec/cpu  257959.16 (  0.00%) 258761.48 (  0.31%)
    Clients   1 Flt/sec      495161.39 (  0.00%) 517292.87 (  4.47%)
    Clients   2 Flt/sec      820325.95 (  0.00%) 850289.77 (  3.65%)
    Clients   4 Flt/sec      1020068.93 (  0.00%) 1022674.06 (  0.26%)
    MMTests Statistics: duration
    Sys Time Running Test (seconds)             135.68    132.17
    User+Sys Time Running Test (seconds)         164.2    160.13
    Total Elapsed Time (seconds)                123.46    120.87

The overall improvement is small but the System CPU time is much
improved and roughly in correlation to what oprofile reported (these
performance figures are without profiling so skew is expected).  The
actual number of page faults is noticeably improved.

For benchmarks like kernel builds, the overall benefit is marginal but
the system CPU time is slightly reduced.

To test the actual bug the commit fixed I opened two terminals.  The
first ran within a cpuset and continually ran a small program that
faulted 100M of anonymous data.  In a second window, the nodemask of the
cpuset was continually randomised in a loop.

Without the commit, the program would fail every so often (usually
within 10 seconds) and obviously with the commit everything worked fine.
With this patch applied, it also worked fine so the fix should be
functionally equivalent.
Signed-off-by: NMel Gorman <mgorman@suse.de>
Cc: Miao Xie <miaox@cn.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Christoph Lameter <cl@linux.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

cc9a6c87

10 3月, 2012 1 次提交

mm: SLAB Out-of-memory diagnostics · 8bdec192

由 Rafael Aquini 提交于 3月 09, 2012

Following the example at mm/slub.c, add out-of-memory diagnostics to the
SLAB allocator to help on debugging certain OOM conditions.

An example print out looks like this:

  <snip page allocator out-of-memory message>
  SLAB: Unable to allocate memory on node 0 (gfp=0x11200)
    cache: bio-0, object size: 192, order: 0
    node 0: slabs: 3/3, objs: 60/60, free: 0
Signed-off-by: NRafael Aquini <aquini@redhat.com>
Acked-by: NRik van Riel <riel@redhat.com>
Acked-by: NDavid Rientjes <rientjes@google.com>
Signed-off-by: NPekka Enberg <penberg@kernel.org>

8bdec192

23 1月, 2012 1 次提交

slab, cleanup: remove unneeded return · 42c8c99c

由 Zhao Jin 提交于 8月 27, 2011

The procedure ends right after the if-statement, so remove ``return''.
Also move the last common statement outside.
Signed-off-by: NZhao Jin <cronozhj@gmail.com>
Acked-by: NDavid Rientjes <rientjes@google.com>
Acked-by: NChristoph Lameter <cl@linux.com>
Signed-off-by: NPekka Enberg <penberg@kernel.org>

42c8c99c

10 1月, 2012 1 次提交

tracing/mm: Move include of trace/events/kmem.h out of header into slab.c · 4dee6b64

由 Steven Rostedt 提交于 1月 09, 2012

Including trace/events/*.h TRACE_EVENT() macro headers in other headers
can cause strange side effects if another trace/event/*.h header
includes that header. Having trace/events/kmem.h inside slab_def.h
caused a compile error in sparc64 when changes were done to some header
files. Moving the kmem.h trace header out of slab.h and into slab.c
fixes the problem.

Note, both slub.c and slob.c already include the trace/events/kmem.h
file. Only slab.c had it missing.

Link: http://lkml.kernel.org/r/20120105190405.1e3191fb5a43b2a0f1655e1f@canb.auug.org.auReported-by: NStephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: NSteven Rostedt <rostedt@goodmis.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

4dee6b64

05 12月, 2011 1 次提交

slab, lockdep: Fix silly bug · 52cef189

由 Peter Zijlstra 提交于 11月 28, 2011

Commit 30765b92 ("slab, lockdep: Annotate the locks before using
them") moves the init_lock_keys() call from after g_cpucache_up =
FULL, to before it. And overlooks the fact that init_node_lock_keys()
tests for it and ignores everything !FULL.

Introduce a LATE stage and change the lockdep test to be <LATE.
Acked-by: NChristoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: stable@kernel.org
Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: NIngo Molnar <mingo@elte.hu>

52cef189

17 11月, 2011 1 次提交

slab: add taint flag outputting to debug paths. · face37f5

由 Dave Jones 提交于 11月 15, 2011

When we get corruption reports, it's useful to see if the kernel was
tainted, to rule out problems we can't do anything about.
Signed-off-by: NDave Jones <davej@redhat.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NPekka Enberg <penberg@kernel.org>

face37f5

11 11月, 2011 2 次提交

slab: introduce slab_max_order kernel parameter · 3df1cccd

由 David Rientjes 提交于 10月 18, 2011

Introduce new slab_max_order kernel parameter which is the equivalent of
slub_max_order.

For immediate purposes, allows users to override the heuristic that sets
the max order to 1 by default if they have more than 32MB of RAM.  This
may result in page allocation failures if there is substantial
fragmentation.

Another usecase would be to increase the max order for better
performance.
Acked-by: NChristoph Lameter <cl@linux.com>
Signed-off-by: NDavid Rientjes <rientjes@google.com>
Signed-off-by: NPekka Enberg <penberg@kernel.org>

3df1cccd

slab: rename slab_break_gfp_order to slab_max_order · 543585cc

由 David Rientjes 提交于 10月 18, 2011

slab_break_gfp_order is more appropriately named slab_max_order since it
enforces the maximum order size of slabs as long as a single object will
still fit.

Also rename BREAK_GFP_ORDER_{LO,HI} accordingly.
Acked-by: NChristoph Lameter <cl@linux.com>
Signed-off-by: NDavid Rientjes <rientjes@google.com>
Signed-off-by: NPekka Enberg <penberg@kernel.org>

543585cc

28 9月, 2011 1 次提交

mm: restrict access to slab files under procfs and sysfs · ab067e99

由 Vasiliy Kulikov 提交于 9月 27, 2011

Historically /proc/slabinfo and files under /sys/kernel/slab/* have
world read permissions and are accessible to the world.  slabinfo
contains rather private information related both to the kernel and
userspace tasks.  Depending on the situation, it might reveal either
private information per se or information useful to make another
targeted attack.  Some examples of what can be learned by
reading/watching for /proc/slabinfo entries:

1) dentry (and different *inode*) number might reveal other processes fs
activity.  The number of dentry "active objects" doesn't strictly show
file count opened/touched by a process, however, there is a good
correlation between them.  The patch "proc: force dcache drop on
unauthorized access" relies on the privacy of dentry count.

2) different inode entries might reveal the same information as (1), but
these are more fine granted counters.  If a filesystem is mounted in a
private mount point (or even a private namespace) and fs type differs from
other mounted fs types, fs activity in this mount point/namespace is
revealed.  If there is a single ecryptfs mount point, the whole fs
activity of a single user is revealed.  Number of files in ecryptfs
mount point is a private information per se.

3) fuse_* reveals number of files / fs activity of a user in a user
private mount point.  It is approx. the same severity as ecryptfs
infoleak in (2).

4) sysfs_dir_cache similar to (2) reveals devices' addition/removal,
which can be otherwise hidden by "chmod 0700 /sys/".  With 0444 slabinfo
the precise number of sysfs files is known to the world.

5) buffer_head might reveal some kernel activity.  With other
information leaks an attacker might identify what specific kernel
routines generate buffer_head activity.

6) *kmalloc* infoleaks are very situational.  Attacker should watch for
the specific kmalloc size entry and filter the noise related to the unrelated
kernel activity.  If an attacker has relatively silent victim system, he
might get rather precise counters.

Additional information sources might significantly increase the slabinfo
infoleak benefits.  E.g. if an attacker knows that the processes
activity on the system is very low (only core daemons like syslog and
cron), he may run setxid binaries / trigger local daemon activity /
trigger network services activity / await sporadic cron jobs activity
/ etc. and get rather precise counters for fs and network activity of
these privileged tasks, which is unknown otherwise.

Also hiding slabinfo and /sys/kernel/slab/* is a one step to complicate
exploitation of kernel heap overflows (and possibly, other bugs).  The
related discussion:

http://thread.gmane.org/gmane.linux.kernel/1108378

To keep compatibility with old permission model where non-root
monitoring daemon could watch for kernel memleaks though slabinfo one
should do:

    groupadd slabinfo
    usermod -a -G slabinfo $MONITOR_USER

And add the following commands to init scripts (to mountall.conf in
Ubuntu's upstart case):

    chmod g+r /proc/slabinfo /sys/kernel/slab/*/*
    chgrp slabinfo /proc/slabinfo /sys/kernel/slab/*/*
Signed-off-by: NVasiliy Kulikov <segoon@openwall.com>
Reviewed-by: NKees Cook <kees@ubuntu.com>
Reviewed-by: NDave Hansen <dave@linux.vnet.ibm.com>
Acked-by: NChristoph Lameter <cl@gentwo.org>
Acked-by: NDavid Rientjes <rientjes@google.com>
CC: Valdis.Kletnieks@vt.edu
CC: Linus Torvalds <torvalds@linux-foundation.org>
CC: Alan Cox <alan@linux.intel.com>
Signed-off-by: NPekka Enberg <penberg@kernel.org>

ab067e99

04 8月, 2011 2 次提交

slab, lockdep: Annotate the locks before using them · 30765b92

由 Peter Zijlstra 提交于 7月 28, 2011

Fernando found we hit the regular OFF_SLAB 'recursion' before we
annotate the locks, cure this.

The relevant portion of the stack-trace:

> [    0.000000]  [<c085e24f>] rt_spin_lock+0x50/0x56
> [    0.000000]  [<c04fb406>] __cache_free+0x43/0xc3
> [    0.000000]  [<c04fb23f>] kmem_cache_free+0x6c/0xdc
> [    0.000000]  [<c04fb2fe>] slab_destroy+0x4f/0x53
> [    0.000000]  [<c04fb396>] free_block+0x94/0xc1
> [    0.000000]  [<c04fc551>] do_tune_cpucache+0x10b/0x2bb
> [    0.000000]  [<c04fc8dc>] enable_cpucache+0x7b/0xa7
> [    0.000000]  [<c0bd9d3c>] kmem_cache_init_late+0x1f/0x61
> [    0.000000]  [<c0bba687>] start_kernel+0x24c/0x363
> [    0.000000]  [<c0bba0ba>] i386_start_kernel+0xa9/0xaf
Reported-by: NFernando Lopez-Lezcano <nando@ccrma.Stanford.EDU>
Acked-by: NPekka Enberg <penberg@kernel.org>
Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1311888176.2617.379.camel@laptopSigned-off-by: NIngo Molnar <mingo@elte.hu>

30765b92

slab, lockdep: Annotate slab -> rcu -> debug_object -> slab · 83835b3d

由 Peter Zijlstra 提交于 7月 22, 2011

Lockdep thinks there's lock recursion through:

	kmem_cache_free()
	  cache_flusharray()
	    spin_lock(&l3->list_lock)  <----------------.
	    free_block()                                |
	      slab_destroy()                            |
		call_rcu()                              |
		  debug_object_activate()               |
		    debug_object_init()                 |
		      __debug_object_init()             |
			kmem_cache_alloc()              |
			  cache_alloc_refill()          |
			    spin_lock(&l3->list_lock) --'

Now debug objects doesn't use SLAB_DESTROY_BY_RCU and hence there is no
actual possibility of recursing. Luckily debug objects marks it slab
with SLAB_DEBUG_OBJECTS so we can identify the thing.

Mark all SLAB_DEBUG_OBJECTS (all one!) slab caches with a special
lockdep key so that lockdep sees its a different cachep.

Also add a WARN on trying to create a SLAB_DESTROY_BY_RCU |
SLAB_DEBUG_OBJECTS cache, to avoid possible future trouble.
Reported-and-tested-by: NSebastian Siewior <sebastian@breakpoint.cc>
[ fixes to the initial patch ]
Reported-by: NThomas Gleixner <tglx@linutronix.de>
Acked-by: NPekka Enberg <penberg@kernel.org>
Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1311341165.27400.58.camel@twinsSigned-off-by: NIngo Molnar <mingo@elte.hu>

83835b3d

01 8月, 2011 1 次提交

slab: use print_hex_dump · fdde6abb

由 Sebastian Andrzej Siewior 提交于 7月 29, 2011

Less code and the advantage of ascii dump.

before:
| Slab corruption: names_cache start=c5788000, len=4096
| 000: 6b 6b 01 00 00 00 56 00 00 00 24 00 00 00 2a 00
| 010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
| 020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ff ff
| 030: ff ff ff ff e2 b4 17 18 c7 e4 08 06 00 01 08 00
| 040: 06 04 00 01 e2 b4 17 18 c7 e4 0a 00 00 01 00 00
| 050: 00 00 00 00 0a 00 00 02 6b 6b 6b 6b 6b 6b 6b 6b

after:
| Slab corruption: size-4096 start=c38a9000, len=4096
| 000: 6b 6b 01 00 00 00 56 00 00 00 24 00 00 00 2a 00  kk....V...$...*.
| 010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
| 020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ff ff  ................
| 030: ff ff ff ff d2 56 5f aa db 9c 08 06 00 01 08 00  .....V_.........
| 040: 06 04 00 01 d2 56 5f aa db 9c 0a 00 00 01 00 00  .....V_.........
| 050: 00 00 00 00 0a 00 00 02 6b 6b 6b 6b 6b 6b 6b 6b  ........kkkkkkkk
Acked-by: NChristoph Lameter <cl@linux.com>
Signed-off-by: NSebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: NPekka Enberg <penberg@kernel.org>

fdde6abb

31 7月, 2011 1 次提交

slab: use NUMA_NO_NODE · eacbbae3

由 Andrew Morton 提交于 7月 28, 2011

Use the nice enumerated constant.

Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NPekka Enberg <penberg@kernel.org>

eacbbae3

28 7月, 2011 1 次提交

slab: remove one NR_CPUS dependency · acfe7d74

由 Eric Dumazet 提交于 7月 25, 2011

Reduce high order allocations in do_tune_cpucache() for some setups.
(NR_CPUS=4096 -> we need 64KB)
Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
Acked-by: NChristoph Lameter <cl@linux.com>
Signed-off-by: NPekka Enberg <penberg@kernel.org>

acfe7d74

22 7月, 2011 1 次提交

slab: fix DEBUG_SLAB warning · 7ea466f2

由 Tetsuo Handa 提交于 7月 21, 2011

In commit c225150b "slab: fix DEBUG_SLAB build",
"if ((unsigned long)objp & (ARCH_SLAB_MINALIGN-1))" is always true if
ARCH_SLAB_MINALIGN == 0. Do not print warning if ARCH_SLAB_MINALIGN == 0.
Signed-off-by: NTetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: NPekka Enberg <penberg@kernel.org>

7ea466f2

21 7月, 2011 1 次提交

slab: shrink sizeof(struct kmem_cache) · b56efcf0

由 Eric Dumazet 提交于 7月 20, 2011

Reduce high order allocations for some setups.
(NR_CPUS=4096 -> we need 64KB per kmem_cache struct)

We now allocate exact needed size (using nr_cpu_ids and nr_node_ids)

This also makes code a bit smaller on x86_64, since some field offsets
are less than the 127 limit :

Before patch :
# size mm/slab.o
   text    data     bss     dec     hex filename
  22605  361665      32  384302   5dd2e mm/slab.o

After patch :
# size mm/slab.o
   text    data     bss     dec     hex filename
  22349	 353473	   8224	 384046	  5dc2e	mm/slab.o

CC: Andrew Morton <akpm@linux-foundation.org>
Reported-by: NKonstantin Khlebnikov <khlebnikov@openvz.org>
Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
Acked-by: NChristoph Lameter <cl@linux.com>
Signed-off-by: NPekka Enberg <penberg@kernel.org>

b56efcf0

18 7月, 2011 1 次提交

slab: fix DEBUG_SLAB build · c225150b

由 Hugh Dickins 提交于 7月 11, 2011

Fix CONFIG_SLAB=y CONFIG_DEBUG_SLAB=y build error and warnings.

Now that ARCH_SLAB_MINALIGN defaults to __alignof__(unsigned long long),
it is always defined (when slab.h included), but cannot be used in #if:
mm/slab.c: In function `cache_alloc_debugcheck_after':
mm/slab.c:3156:5: warning: "__alignof__" is not defined
mm/slab.c:3156:5: error: missing binary operator before token "("
make[1]: *** [mm/slab.o] Error 1

So just remove the #if and #endif lines, but then 64-bit build warns:
mm/slab.c: In function `cache_alloc_debugcheck_after':
mm/slab.c:3156:6: warning: cast from pointer to integer of different size
mm/slab.c:3158:10: warning: format `%d' expects type `int', but argument
                            3 has type `long unsigned int'
Fix those with casts, whatever the actual type of ARCH_SLAB_MINALIGN.
Acked-by: NChristoph Lameter <cl@linux.com>
Signed-off-by: NHugh Dickins <hughd@google.com>
Signed-off-by: NPekka Enberg <penberg@kernel.org>

c225150b