1. 25 Oct, 2013 (14 commits)
  2. 29 Aug, 2013 (1 commit)
• memcg: check that kmem_cache has memcg_params before accessing it · 6f6b8951
Authored by Andrey Vagin
If the system has had a few memory cgroups and all of them have been
destroyed, memcg_limited_groups_array_size still has a non-zero value, but
all new caches are created without memcg_params, because
memcg_kmem_enabled() returns false.
      
We try to enumerate child caches in a few places, and all of them are
potentially dangerous.
      
For example, my kernel is compiled with CONFIG_SLAB, and it crashed when I
tried to mount an NFS share after a few experiments with kmemcg.
      
        BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
        IP: [<ffffffff8118166a>] do_tune_cpucache+0x8a/0xd0
        PGD b942a067 PUD b999f067 PMD 0
        Oops: 0000 [#1] SMP
        Modules linked in: fscache(+) ip6table_filter ip6_tables iptable_filter ip_tables i2c_piix4 pcspkr virtio_net virtio_balloon i2c_core floppy
        CPU: 0 PID: 357 Comm: modprobe Not tainted 3.11.0-rc7+ #59
        Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
        task: ffff8800b9f98240 ti: ffff8800ba32e000 task.ti: ffff8800ba32e000
        RIP: 0010:[<ffffffff8118166a>]  [<ffffffff8118166a>] do_tune_cpucache+0x8a/0xd0
        RSP: 0018:ffff8800ba32fb70  EFLAGS: 00010246
        RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000006
        RDX: 0000000000000000 RSI: ffff8800b9f98910 RDI: 0000000000000246
        RBP: ffff8800ba32fba0 R08: 0000000000000002 R09: 0000000000000004
        R10: 0000000000000001 R11: 0000000000000001 R12: 0000000000000010
        R13: 0000000000000008 R14: 00000000000000d0 R15: ffff8800375d0200
        FS:  00007f55f1378740(0000) GS:ffff8800bfa00000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
        CR2: 00007f24feba57a0 CR3: 0000000037b51000 CR4: 00000000000006f0
        Call Trace:
          enable_cpucache+0x49/0x100
          setup_cpu_cache+0x215/0x280
          __kmem_cache_create+0x2fa/0x450
          kmem_cache_create_memcg+0x214/0x350
          kmem_cache_create+0x2b/0x30
          fscache_init+0x19b/0x230 [fscache]
          do_one_initcall+0xfa/0x1b0
          load_module+0x1c41/0x26d0
          SyS_finit_module+0x86/0xb0
          system_call_fastpath+0x16/0x1b
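
A minimal sketch of the kind of guard the fix adds, assuming a per-memcg
cache lookup helper in the style of the 3.11-era mm/ code (the helper name
and placement here are illustrative, not the verbatim patch):

  /* Sketch: bail out instead of dereferencing a missing memcg_params. */
  static inline struct kmem_cache *
  cache_from_memcg(struct kmem_cache *s, int idx)
  {
          if (!s->memcg_params)   /* cache created while kmemcg was off */
                  return NULL;
          return s->memcg_params->memcg_caches[idx];
  }

Callers that enumerate child caches then treat NULL as "no per-memcg
cache" instead of oopsing as in the trace above.
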
Signed-off-by: Andrey Vagin <avagin@openvz.org>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Glauber Costa <glommer@openvz.org>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
3. 28 Aug, 2013 (1 commit)
  4. 25 Aug, 2013 (1 commit)
  5. 24 Aug, 2013 (1 commit)
  6. 16 Aug, 2013 (1 commit)
• Fix TLB gather virtual address range invalidation corner cases · 2b047252
Authored by Linus Torvalds
      Ben Tebulin reported:
      
       "Since v3.7.2 on two independent machines a very specific Git
repository fails in 9/10 cases on git-fsck due to SHA1/memory
        failures.  This only occurs on a very specific repository and can be
        reproduced stably on two independent laptops.  Git mailing list ran
        out of ideas and for me this looks like some very exotic kernel issue"
      
      and bisected the failure to the backport of commit 53a59fc6 ("mm:
      limit mmu_gather batching to fix soft lockups on !CONFIG_PREEMPT").
      
      That commit itself is not actually buggy, but what it does is to make it
      much more likely to hit the partial TLB invalidation case, since it
      introduces a new case in tlb_next_batch() that previously only ever
      happened when running out of memory.
      
      The real bug is that the TLB gather virtual memory range setup is subtly
      buggered.  It was introduced in commit 597e1c35 ("mm/mmu_gather:
      enable tlb flush range in generic mmu_gather"), and the range handling
      was already fixed at least once in commit e6c495a9 ("mm: fix the TLB
      range flushed when __tlb_remove_page() runs out of slots"), but that fix
      was not complete.
      
      The problem with the TLB gather virtual address range is that it isn't
      set up by the initial tlb_gather_mmu() initialization (which didn't get
      the TLB range information), but it is set up ad-hoc later by the
      functions that actually flush the TLB.  And so any such case that forgot
      to update the TLB range entries would potentially miss TLB invalidates.
      
      Rather than try to figure out exactly which particular ad-hoc range
      setup was missing (I personally suspect it's the hugetlb case in
      zap_huge_pmd(), which didn't have the same logic as zap_pte_range()
      did), this patch just gets rid of the problem at the source: make the
      TLB range information available to tlb_gather_mmu(), and initialize it
      when initializing all the other tlb gather fields.
      
      This makes the patch larger, but conceptually much simpler.  And the end
      result is much more understandable; even if you want to play games with
      partial ranges when invalidating the TLB contents in chunks, now the
      range information is always there, and anybody who doesn't want to
      bother with it won't introduce subtle bugs.
      
      Ben verified that this fixes his problem.
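
In outline, the change moves the range into the initializer so that no
flush path can forget it. A sketch of the shape of the fix, assuming the
mmu_gather fields of the time (details are illustrative, not the verbatim
patch):

  /* Sketch: take the VA range at setup instead of patching it in later. */
  void tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm,
                      unsigned long start, unsigned long end)
  {
          tlb->mm     = mm;
          tlb->start  = start;  /* the range is now always initialized, */
          tlb->end    = end;    /* so partial flushes can rely on it    */
          tlb->fullmm = !(start | (end + 1));  /* 0..-1 means whole mm  */
          /* ... remaining mmu_gather fields set up as before ... */
  }
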
Reported-bisected-and-tested-by: Ben Tebulin <tebulin@googlemail.com>
Build-testing-by: Stephen Rothwell <sfr@canb.auug.org.au>
Build-testing-by: Richard Weinberger <richard.weinberger@gmail.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Peter Zijlstra <peterz@infradead.org>
      Cc: stable@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
7. 14 Aug, 2013 (3 commits)
  8. 09 Aug, 2013 (1 commit)
• Revert "slub: do not put a slab to cpu partial list when cpu_partial is 0" · 37090506
Authored by Linus Torvalds
      This reverts commit 318df36e.
      
      This commit caused Steven Rostedt's hackbench runs to run out of memory
      due to a leak.  As noted by Joonsoo Kim, it is buggy in the following
      scenario:
      
       "I guess, you may set 0 to all kmem caches's cpu_partial via sysfs,
        doesn't it?
      
        In this case, memory leak is possible in following case.  Code flow of
        possible leak is follwing case.
      
         * in __slab_free()
         1. (!new.inuse || !prior) && !was_frozen
         2. !kmem_cache_debug && !prior
         3. new.frozen = 1
         4. after cmpxchg_double_slab, run the (!n) case with new.frozen=1
         5. with this patch, put_cpu_partial() doesn't do anything,
            because this cache's cpu_partial is 0
         6. return
      
In step 5, the leak occurs"
      
      And Steven does indeed have cpu_partial set to 0 due to RT testing.
      
      Joonsoo is cooking up a patch, but everybody agrees that reverting this
      for now is the right thing to do.
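
The leak is easiest to see at the put_cpu_partial() end: the reverted
commit made it return early when cpu_partial is 0, so a page that
__slab_free() had just frozen (step 3 above) was never queued anywhere.
Roughly, as a sketch rather than the verbatim reverted hunk:

  /* Sketch of the reverted early return in put_cpu_partial(). */
  static void put_cpu_partial(struct kmem_cache *s, struct page *page,
                              int drain)
  {
          if (!s->cpu_partial)
                  return;  /* reverted: drops a page just frozen by
                            * __slab_free(), leaking it */
          /* ... normal path: link page onto the per-cpu partial list ... */
  }
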
Reported-and-bisected-by: Steven Rostedt <rostedt@goodmis.org>
Acked-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
9. 05 Aug, 2013 (1 commit)
  10. 01 Aug, 2013 (7 commits)
  11. 17 Jul, 2013 (1 commit)
  12. 15 Jul, 2013 (2 commits)
• kernel: delete __cpuinit usage from all core kernel files · 0db0628d
Authored by Paul Gortmaker
The __cpuinit type of throwaway sections might have made sense
some time ago when RAM was more constrained, but now the savings
do not offset the cost and complications.  The fix in
commit 5e427ec2 ("x86: Fix bit corruption at CPU resume time")
is a good example of the nasty type of bug that can be created
with improper use of the various __init prefixes.
      
      After a discussion on LKML[1] it was decided that cpuinit should go
      the way of devinit and be phased out.  Once all the users are gone,
      we can then finally remove the macros themselves from linux/init.h.
      
      This removes all the uses of the __cpuinit macros from C files in
      the core kernel directories (kernel, init, lib, mm, and include)
      that don't really have a specific maintainer.
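
Each conversion is mechanical: drop the annotation so the function stays
in normal .text instead of a discardable section. An illustrative
before/after (the callback name here is hypothetical, not taken from this
commit):

  /* before: placed in a throwaway .cpuinit section */
  static int __cpuinit my_cpu_callback(struct notifier_block *nb,
                                       unsigned long action, void *hcpu);

  /* after: a plain function, always resident */
  static int my_cpu_callback(struct notifier_block *nb,
                             unsigned long action, void *hcpu);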
      
[1] https://lkml.org/lkml/2013/5/20/589
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
• slub: Check for page NULL before doing the node_match check · c25f195e
Authored by Steven Rostedt
      In the -rt kernel (mrg), we hit the following dump:
      
      BUG: unable to handle kernel NULL pointer dereference at           (null)
      IP: [<ffffffff811573f1>] kmem_cache_alloc_node+0x51/0x180
      PGD a2d39067 PUD b1641067 PMD 0
      Oops: 0000 [#1] PREEMPT SMP
      Modules linked in: sunrpc cpufreq_ondemand ipv6 tg3 joydev sg serio_raw pcspkr k8temp amd64_edac_mod edac_core i2c_piix4 e100 mii shpchp ext4 mbcache jbd2 sd_mod crc_t10dif sr_mod cdrom sata_svw ata_generic pata_acpi pata_serverworks radeon ttm drm_kms_helper drm hwmon i2c_algo_bit i2c_core dm_mirror dm_region_hash dm_log dm_mod
      CPU 3
      Pid: 20878, comm: hackbench Not tainted 3.6.11-rt25.14.el6rt.x86_64 #1 empty empty/Tyan Transport GT24-B3992
      RIP: 0010:[<ffffffff811573f1>]  [<ffffffff811573f1>] kmem_cache_alloc_node+0x51/0x180
      RSP: 0018:ffff8800a9b17d70  EFLAGS: 00010213
      RAX: 0000000000000000 RBX: 0000000001200011 RCX: ffff8800a06d8000
      RDX: 0000000004d92a03 RSI: 00000000000000d0 RDI: ffff88013b805500
      RBP: ffff8800a9b17dc0 R08: ffff88023fd14d10 R09: ffffffff81041cbd
      R10: 00007f4e3f06e9d0 R11: 0000000000000246 R12: ffff88013b805500
      R13: ffff8801ff46af40 R14: 0000000000000001 R15: 0000000000000000
      FS:  00007f4e3f06e700(0000) GS:ffff88023fd00000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      CR2: 0000000000000000 CR3: 00000000a2d3a000 CR4: 00000000000007e0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      Process hackbench (pid: 20878, threadinfo ffff8800a9b16000, task ffff8800a06d8000)
      Stack:
       ffff8800a9b17da0 ffffffff81202e08 ffff8800a9b17de0 000000d001200011
       0000000001200011 0000000001200011 0000000000000000 0000000000000000
       00007f4e3f06e9d0 0000000000000000 ffff8800a9b17e60 ffffffff81041cbd
      Call Trace:
       [<ffffffff81202e08>] ? current_has_perm+0x68/0x80
       [<ffffffff81041cbd>] copy_process+0xdd/0x15b0
       [<ffffffff810a2125>] ? rt_up_read+0x25/0x30
       [<ffffffff8104369a>] do_fork+0x5a/0x360
       [<ffffffff8107c66b>] ? migrate_enable+0xeb/0x220
       [<ffffffff8100b068>] sys_clone+0x28/0x30
       [<ffffffff81527423>] stub_clone+0x13/0x20
       [<ffffffff81527152>] ? system_call_fastpath+0x16/0x1b
      Code: 89 fc 89 75 cc 41 89 d6 4d 8b 04 24 65 4c 03 04 25 48 ae 00 00 49 8b 50 08 4d 8b 28 49 8b 40 10 4d 85 ed 74 12 41 83 fe ff 74 27 <48> 8b 00 48 c1 e8 3a 41 39 c6 74 1b 8b 75 cc 4c 89 c9 44 89 f2
      RIP  [<ffffffff811573f1>] kmem_cache_alloc_node+0x51/0x180
       RSP <ffff8800a9b17d70>
      CR2: 0000000000000000
      ---[ end trace 0000000000000002 ]---
      
      Now, this uses SLUB pretty much unmodified, but as it is the -rt kernel
      with CONFIG_PREEMPT_RT set, spinlocks are mutexes, although they do
      disable migration. But the SLUB code is relatively lockless, and the
      spin_locks there are raw_spin_locks (not converted to mutexes), thus I
      believe this bug can happen in mainline without -rt features. The -rt
      patch is just good at triggering mainline bugs ;-)
      
      Anyway, looking at where this crashed, it seems that the page variable
      can be NULL when passed to the node_match() function (which does not
check if it is NULL).  When this happens, we get the above panic.
      
      As page is only used in slab_alloc() to check if the node matches, if
      it's NULL I'm assuming that we can say it doesn't and call the
      __slab_alloc() code. Is this a correct assumption?
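
The resulting fix is a one-condition change in the allocation fast path:
treat a NULL page like a node mismatch and fall back to the slow path.
Approximately (a sketch of the guard with the surrounding fast-path code
elided; names follow the SLUB code of the time):

  /* Sketch: never let a NULL page reach node_match(), which would
   * dereference it; take the slow path instead. */
  if (unlikely(!object || !page || !node_match(page, node)))
          object = __slab_alloc(s, gfpflags, node, addr, c);
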
Acked-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
13. 11 Jul, 2013 (3 commits)
• mm: remove free_area_cache · 98d1e64f
Authored by Michel Lespinasse
      Since all architectures have been converted to use vm_unmapped_area(),
      there is no remaining use for the free_area_cache.
Signed-off-by: Michel Lespinasse <walken@google.com>
Acked-by: Rik van Riel <riel@redhat.com>
      Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Helge Deller <deller@gmx.de>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Richard Henderson <rth@twiddle.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
• zswap: add to mm/ · 2b281117
Authored by Seth Jennings
      zswap is a thin backend for frontswap that takes pages that are in the
      process of being swapped out and attempts to compress them and store
      them in a RAM-based memory pool.  This can result in a significant I/O
      reduction on the swap device and, in the case where decompressing from
      RAM is faster than reading from the swap device, can also improve
      workload performance.
      
      It also has support for evicting swap pages that are currently
      compressed in zswap to the swap device on an LRU(ish) basis.  This
      functionality makes zswap a true cache in that, once the cache is full,
      the oldest pages can be moved out of zswap to the swap device so newer
      pages can be compressed and stored in zswap.
      
This patch adds the zswap driver to mm/.
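
Structurally, zswap is a set of frontswap callbacks registered at init
time; the table below shows the shape of the hook-up (a sketch following
the zswap code added here; treat the exact member set as illustrative of
the 3.11-era frontswap API):

  /* Sketch: zswap plugs into the swap path via frontswap callbacks. */
  static struct frontswap_ops zswap_frontswap_ops = {
          .store           = zswap_frontswap_store,  /* compress and stash    */
          .load            = zswap_frontswap_load,   /* decompress on swap-in */
          .invalidate_page = zswap_frontswap_invalidate_page,
          .invalidate_area = zswap_frontswap_invalidate_area,
          .init            = zswap_frontswap_init,
  };
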
Signed-off-by: Seth Jennings <sjenning@linux.vnet.ibm.com>
Acked-by: Rik van Riel <riel@redhat.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Dan Magenheimer <dan.magenheimer@oracle.com>
      Cc: Robert Jennings <rcj@linux.vnet.ibm.com>
      Cc: Jenifer Hopper <jhopper@us.ibm.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Johannes Weiner <jweiner@redhat.com>
      Cc: Larry Woodman <lwoodman@redhat.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Dave Hansen <dave@sr71.net>
      Cc: Joe Perches <joe@perches.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Cody P Schafer <cody@linux.vnet.ibm.com>
Cc: Hugh Dickins <hughd@google.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
• zbud: add to mm/ · 4e2e2770
Authored by Seth Jennings
zbud is a special-purpose allocator for storing compressed pages.  It
is designed to store up to two compressed pages per physical page.
While this design limits storage density, it has simple and
deterministic reclaim properties that make it preferable to a
higher-density approach when reclaim will be used.
      
      zbud works by storing compressed pages, or "zpages", together in pairs
      in a single memory page called a "zbud page".  The first buddy is "left
justified" at the beginning of the zbud page, and the last buddy is
      "right justified" at the end of the zbud page.  The benefit is that if
      either buddy is freed, the freed buddy space, coalesced with whatever
      slack space that existed between the buddies, results in the largest
      possible free region within the zbud page.
      
      zbud also provides an attractive lower bound on density.  The ratio of
zpages to zbud pages cannot be less than 1.  This ensures that zbud can
      never "do harm" by using more pages to store zpages than the
      uncompressed zpages would have used on their own.
      
      This implementation is a rewrite of the zbud allocator internally used
by zcache in the drivers/staging tree.  The rewrite was necessary to
      remove some of the zcache specific elements that were ingrained
      throughout and provide a generic allocation interface that can later be
      used by zsmalloc and others.
      
      This patch adds zbud to mm/ for later use by zswap.
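
The generic interface is deliberately small and handle-based, so the
allocator keeps control of the underlying pages. A sketch of the API this
patch introduces (signatures are a best-effort reconstruction, not quoted
from the diff):

  /* Sketch of the zbud allocation API: opaque handles, not pointers. */
  struct zbud_pool *zbud_create_pool(gfp_t gfp, struct zbud_ops *ops);
  void zbud_destroy_pool(struct zbud_pool *pool);
  int zbud_alloc(struct zbud_pool *pool, int size, gfp_t gfp,
                 unsigned long *handle);
  void zbud_free(struct zbud_pool *pool, unsigned long handle);
  int zbud_reclaim_page(struct zbud_pool *pool, int retries);
  void *zbud_map(struct zbud_pool *pool, unsigned long handle);   /* pin   */
  void zbud_unmap(struct zbud_pool *pool, unsigned long handle);  /* unpin */
  u64 zbud_get_pool_size(struct zbud_pool *pool);
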
Signed-off-by: Seth Jennings <sjenning@linux.vnet.ibm.com>
Acked-by: Rik van Riel <riel@redhat.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Dan Magenheimer <dan.magenheimer@oracle.com>
      Cc: Robert Jennings <rcj@linux.vnet.ibm.com>
      Cc: Jenifer Hopper <jhopper@us.ibm.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Johannes Weiner <jweiner@redhat.com>
      Cc: Larry Woodman <lwoodman@redhat.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Dave Hansen <dave@sr71.net>
      Cc: Joe Perches <joe@perches.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Cody P Schafer <cody@linux.vnet.ibm.com>
Cc: Hugh Dickins <hughd@google.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Bob Liu <bob.liu@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
14. 10 Jul, 2013 (3 commits)