提交 · 2831096e21503897ee474c23131c3feb8db0ffb1 · openanolis / cloud-kernel

11 12月, 2014 40 次提交

exit: reparent: cleanup the usage of reparent_leader() · 2831096e

由 Oleg Nesterov 提交于 12月 10, 2014

1. Now that reparent_leader() doesn't abuse ->sibling we can shift
   list_move_tail() from reparent_leader() to forget_original_parent()
   and turn it into a single list_splice_tail_init(). This also makes
   BUG_ON(!list_empty()) and list_for_each_entry_safe() unnecessary.

2. This also allows to shift the same_thread_group() check, it looks
   a bit more clear in the caller.
Signed-off-by: NOleg Nesterov <oleg@redhat.com>
Cc: Aaron Tomlin <atomlin@redhat.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>,
Cc: Sterling Alexander <stalexan@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Roland McGrath <roland@hack.frob.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

2831096e

exit: reparent: cleanup the changing of ->parent · 57a05918

由 Oleg Nesterov 提交于 12月 10, 2014

1. Cosmetic, but "if (t->parent == father)" looks a bit confusing.
   We need to change t->parent if and only if t is not traced.

2. If we actually want this BUG_ON() to ensure that parent/ptrace
   match each other, then we should also take ptrace_reparented()
   case into account too.

3. Change this code to use for_each_thread() instead of deprecated
   while_each_thread().

[dan.carpenter@oracle.com: silence a bogus static checker warning]
Signed-off-by: NOleg Nesterov <oleg@redhat.com>
Cc: Aaron Tomlin <atomlin@redhat.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>,
Cc: Sterling Alexander <stalexan@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Roland McGrath <roland@hack.frob.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

57a05918

exit: reparent: use ->ptrace_entry rather than ->sibling for EXIT_DEAD tasks · dc2fd4b0

由 Oleg Nesterov 提交于 12月 10, 2014

reparent_leader() reuses ->sibling as a list node to add an EXIT_DEAD task
into dead_children list we are going to release.  This obviously removes
the dead task from its real_parent->children list and this is even good;
the parent can do nothing with the EXIT_DEAD reparented zombie, it only
makes do_wait() slower.

But, this also means that it can not be reparented once again, so if its
new parent dies too nobody will update ->parent/real_parent, they can
point to the freed memory even before release_task() we are going to call,
this breaks the code which relies on pid_alive() to access
->real_parent/parent.

Fortunately this is mostly theoretical, this can only happen if init or
PR_SET_CHILD_SUBREAPER process ignores SIGCHLD and the new parent
sub-thread exits right after we drop tasklist_lock.

Change this code to use ->ptrace_entry instead, we know that the child is
not traced so nobody can ever use this member.  This also allows to unify
this logic with exit_ptrace(), see the next changes.

Note: we really need to change release_task() to nullify real_parent/
parent/group_leader pointers, but we need to change the current users
first somehow.  And it would be better to reap this zombie immediately but
release_task_locked() we need is complicated by proc_flush_task().
Signed-off-by: NOleg Nesterov <oleg@redhat.com>
Cc: Aaron Tomlin <atomlin@redhat.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>,
Cc: Sterling Alexander <stalexan@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Roland McGrath <roland@hack.frob.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

dc2fd4b0

sched_show_task: fix unsafe usage of ->real_parent · a90e984c

由 Oleg Nesterov 提交于 12月 10, 2014

rcu_read_lock() can not protect p->real_parent if release_task(p) was
already called, change sched_show_task() to check pis_alive() like other
users do.

Note: we need some helpers to cleanup the code like this.  And it seems
that that the usage of cpu_curr(cpu) in dump_cpu_task() is not safe too.
Signed-off-by: NOleg Nesterov <oleg@redhat.com>
Cc: Aaron Tomlin <atomlin@redhat.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>,
Cc: Sterling Alexander <stalexan@redhat.com>
Acked-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
Cc: Roland McGrath <roland@hack.frob.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

a90e984c

proc: task_state: ptrace_parent() doesn't need pid_alive() check · abdba6e9

由 Oleg Nesterov 提交于 12月 10, 2014

p->ptrace != 0 means that release_task(p) was not called, so pid_alive()
buys nothing and we can remove this check.  Other callers already use it
directly without additional checks.

Note: with or without this patch ptrace_parent() can return the pointer to
the freed task, this will be explained/fixed later.
Signed-off-by: NOleg Nesterov <oleg@redhat.com>
Cc: Aaron Tomlin <atomlin@redhat.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>,
Cc: Sterling Alexander <stalexan@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Roland McGrath <roland@hack.frob.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

abdba6e9

proc: task_state: move the main seq_printf() outside of rcu_read_lock() · b0fafc11

由 Oleg Nesterov 提交于 12月 10, 2014

task_state() does seq_printf() under rcu_read_lock(), but this is only
needed for task_tgid_nr_ns() and task_numa_group_id().  We can calculate
tgid/ngid and drop rcu lock.
Signed-off-by: NOleg Nesterov <oleg@redhat.com>
Cc: Aaron Tomlin <atomlin@redhat.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>,
Cc: Sterling Alexander <stalexan@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Roland McGrath <roland@hack.frob.com>
Reviewed-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

b0fafc11

proc: task_state: deuglify the max_fds calculation · 0f4a0d53

由 Oleg Nesterov 提交于 12月 10, 2014

1. The usage of fdt looks very ugly, it can't be NULL if ->files is
   not NULL. We can use "unsigned int max_fds" instead.

2. This also allows to move seq_printf(max_fds) outside of task_lock()
   and join it with the previous seq_printf(). See also the next patch.
Signed-off-by: NOleg Nesterov <oleg@redhat.com>
Cc: Aaron Tomlin <atomlin@redhat.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>,
Cc: Sterling Alexander <stalexan@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Roland McGrath <roland@hack.frob.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

0f4a0d53

proc: task_state: read cred->group_info outside of task_lock() · 4af1036d

由 Oleg Nesterov 提交于 12月 10, 2014

task_state() reads cred->group_info under task_lock() because a long ago
it was task_struct->group_info and it was actually protected by
task->alloc_lock.  Today this task_unlock() after rcu_read_unlock() just
adds the confusion, move task_unlock() up.
Signed-off-by: NOleg Nesterov <oleg@redhat.com>
Cc: Aaron Tomlin <atomlin@redhat.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>,
Cc: Sterling Alexander <stalexan@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Roland McGrath <roland@hack.frob.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

4af1036d

fs/proc.c: use rb_entry_safe() instead of rb_entry() · 2fc1e948

由 Nicolas Dichtel 提交于 12月 10, 2014

Better to use existing macro that rewriting them.
Signed-off-by: NNicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

2fc1e948

procfs: fix error handling of proc_register() · b208d54b

由 Debabrata Banerjee 提交于 12月 10, 2014

proc_register() error paths are leaking inodes and directory refcounts.
Signed-off-by: NDebabrata Banerjee <dbanerje@akamai.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Acked-by: NNicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

b208d54b

fs/proc: use a rb tree for the directory entries · 710585d4

由 Nicolas Dichtel 提交于 12月 10, 2014

When a lot of netdevices are created, one of the bottleneck is the
creation of proc entries.  This serie aims to accelerate this part.

The current implementation for the directories in /proc is using a single
linked list.  This is slow when handling directories with large numbers of
entries (eg netdevice-related entries when lots of tunnels are opened).

This patch replaces this linked list by a red-black tree.

Here are some numbers:

dummy30000.batch contains 30 000 times 'link add type dummy'.

Before the patch:
  $ time ip -b dummy30000.batch
  real    2m31.950s
  user    0m0.440s
  sys     2m21.440s
  $ time rmmod dummy
  real    1m35.764s
  user    0m0.000s
  sys     1m24.088s

After the patch:
  $ time ip -b dummy30000.batch
  real    2m0.874s
  user    0m0.448s
  sys     1m49.720s
  $ time rmmod dummy
  real    1m13.988s
  user    0m0.000s
  sys     1m1.008s

The idea of improving this part was suggested by Thierry Herbelot.

[akpm@linux-foundation.org: initialise proc_root.subdir at compile time]
Signed-off-by: NNicolas Dichtel <nicolas.dichtel@6wind.com>
Acked-by: NDavid S. Miller <davem@davemloft.net>
Cc: Thierry Herbelot <thierry.herbelot@6wind.com>.
Acked-by: N"Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

710585d4

mm: move page->mem_cgroup bad page handling into generic code · 9edad6ea

由 Johannes Weiner 提交于 12月 10, 2014

Now that the external page_cgroup data structure and its lookup is
gone, let the generic bad_page() check for page->mem_cgroup sanity.
Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
Acked-by: NMichal Hocko <mhocko@suse.cz>
Acked-by: NVladimir Davydov <vdavydov@parallels.com>
Acked-by: NDavid S. Miller <davem@davemloft.net>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Tejun Heo <tj@kernel.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

9edad6ea

mm: page_cgroup: rename file to mm/swap_cgroup.c · 5d1ea48b

由 Johannes Weiner 提交于 12月 10, 2014

Now that the external page_cgroup data structure and its lookup is gone,
the only code remaining in there is swap slot accounting.

Rename it and move the conditional compilation into mm/Makefile.
Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
Acked-by: NMichal Hocko <mhocko@suse.cz>
Acked-by: NVladimir Davydov <vdavydov@parallels.com>
Acked-by: NDavid S. Miller <davem@davemloft.net>
Acked-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Tejun Heo <tj@kernel.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

5d1ea48b

mm: embed the memcg pointer directly into struct page · 1306a85a

由 Johannes Weiner 提交于 12月 10, 2014

Memory cgroups used to have 5 per-page pointers.  To allow users to
disable that amount of overhead during runtime, those pointers were
allocated in a separate array, with a translation layer between them and
struct page.

There is now only one page pointer remaining: the memcg pointer, that
indicates which cgroup the page is associated with when charged.  The
complexity of runtime allocation and the runtime translation overhead is
no longer justified to save that *potential* 0.19% of memory.  With
CONFIG_SLUB, page->mem_cgroup actually sits in the doubleword padding
after the page->private member and doesn't even increase struct page,
and then this patch actually saves space.  Remaining users that care can
still compile their kernels without CONFIG_MEMCG.

     text    data     bss     dec     hex     filename
  8828345 1725264  983040 11536649 b00909  vmlinux.old
  8827425 1725264  966656 11519345 afc571  vmlinux.new

[mhocko@suse.cz: update Documentation/cgroups/memory.txt]
Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
Acked-by: NMichal Hocko <mhocko@suse.cz>
Acked-by: NVladimir Davydov <vdavydov@parallels.com>
Acked-by: NDavid S. Miller <davem@davemloft.net>
Acked-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Vladimir Davydov <vdavydov@parallels.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: NKonstantin Khlebnikov <koct9i@gmail.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

1306a85a

mm: memcontrol: remove stale page_cgroup_lock comment · 22811c6b

由 Johannes Weiner 提交于 12月 10, 2014

There is no cgroup-specific page lock anymore.
Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
Acked-by: NMichal Hocko <mhocko@suse.cz>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

22811c6b

mm/frontswap.c: fix the condition in BUG_ON · a1ad2897

由 Li Haifeng 提交于 12月 10, 2014

The largest index of swap device is MAX_SWAPFILES-1.  So the type should
be less than MAX_SWAPFILES.
Signed-off-by: NHaifeng Li <omycle@gmail.com>
Acked-by: NKonrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

a1ad2897

mm: fix a spelling mistake · 26086de3

由 Wei Yuan 提交于 12月 10, 2014

Signed-off-by Wei Yuan <weiyuan.wei@huawei.com>
Acked-by: NRik van Riel <riel@redhat.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

26086de3

mm: hugetlb: fix __unmap_hugepage_range() · 569f48b8

由 Hillf Danton 提交于 12月 10, 2014

First, after flushing TLB, we have no need to scan pte from start again.
Second, before bail out loop, the address is forwarded one step.
Signed-off-by: NHillf Danton <hillf.zj@alibaba-inc.com>
Reviewed-by: NMichal Hocko <mhocko@suse.cz>
Acked-by: NDavid Rientjes <rientjes@google.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

569f48b8

mm, memcg: fix potential undefined behaviour in page stat accounting · e4bd6a02

由 Michal Hocko 提交于 12月 10, 2014

Since commit d7365e78 ("mm: memcontrol: fix missed end-writeback
page accounting") mem_cgroup_end_page_stat consumes locked and flags
variables directly rather than via pointers which might trigger C
undefined behavior as those variables are initialized only in the slow
path of mem_cgroup_begin_page_stat.

Although mem_cgroup_end_page_stat handles parameters correctly and
touches them only when they hold a sensible value it is caller which
loads a potentially uninitialized value which then might allow compiler
to do crazy things.

I haven't seen any warning from gcc and it seems that the current
version (4.9) doesn't exploit this type undefined behavior but Sasha has
reported the following:

  UBSan: Undefined behaviour in mm/rmap.c:1084:2
  load of value 255 is not a valid value for type '_Bool'
  CPU: 4 PID: 8304 Comm: rngd Not tainted 3.18.0-rc2-next-20141029-sasha-00039-g77ed13d-dirty #1427
  Call Trace:
    dump_stack (lib/dump_stack.c:52)
    ubsan_epilogue (lib/ubsan.c:159)
    __ubsan_handle_load_invalid_value (lib/ubsan.c:482)
    page_remove_rmap (mm/rmap.c:1084 mm/rmap.c:1096)
    unmap_page_range (./arch/x86/include/asm/atomic.h:27 include/linux/mm.h:463 mm/memory.c:1146 mm/memory.c:1258 mm/memory.c:1279 mm/memory.c:1303)
    unmap_single_vma (mm/memory.c:1348)
    unmap_vmas (mm/memory.c:1377 (discriminator 3))
    exit_mmap (mm/mmap.c:2837)
    mmput (kernel/fork.c:659)
    do_exit (./arch/x86/include/asm/thread_info.h:168 kernel/exit.c:462 kernel/exit.c:747)
    do_group_exit (include/linux/sched.h:775 kernel/exit.c:873)
    SyS_exit_group (kernel/exit.c:901)
    tracesys_phase2 (arch/x86/kernel/entry_64.S:529)

Fix this by using pointer parameters for both locked and flags and be
more robust for future compiler changes even though the current code is
implemented correctly.
Signed-off-by: NMichal Hocko <mhocko@suse.cz>
Reported-by: NSasha Levin <sasha.levin@oracle.com>
Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

e4bd6a02

mm: fix huge zero page accounting in smaps report · c164e038

由 Kirill A. Shutemov 提交于 12月 10, 2014

As a small zero page, huge zero page should not be accounted in smaps
report as normal page.

For small pages we rely on vm_normal_page() to filter out zero page, but
vm_normal_page() is not designed to handle pmds.  We only get here due
hackish cast pmd to pte in smaps_pte_range() -- pte and pmd format is not
necessary compatible on each and every architecture.

Let's add separate codepath to handle pmds.  follow_trans_huge_pmd() will
detect huge zero page for us.

We would need pmd_dirty() helper to do this properly.  The patch adds it
to THP-enabled architectures which don't yet have one.

[akpm@linux-foundation.org: use do_div to fix 32-bit build]
Signed-off-by: N"Kirill A. Shutemov" <kirill@shutemov.name>
Reported-by: NFengguang Wu <fengguang.wu@intel.com>
Tested-by: NFengwei Yin <yfw.kernel@gmail.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

c164e038

mm: memcontrol: drop bogus RCU locking from mem_cgroup_same_or_subtree() · 2314b42d

由 Johannes Weiner 提交于 12月 10, 2014

None of the mem_cgroup_same_or_subtree() callers actually require it to
take the RCU lock, either because they hold it themselves or they have css
references.  Remove it.

To make the API change clear, rename the leftover helper to
mem_cgroup_is_descendant() to match cgroup_is_descendant().
Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
Reviewed-by: NVladimir Davydov <vdavydov@parallels.com>
Acked-by: NMichal Hocko <mhocko@suse.cz>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

2314b42d

mm: memcontrol: pull the NULL check from __mem_cgroup_same_or_subtree() · 413918bb

由 Johannes Weiner 提交于 12月 10, 2014

The NULL in mm_match_cgroup() comes from a possibly exiting mm->owner.  It
makes a lot more sense to check where it's looked up, rather than check
for it in __mem_cgroup_same_or_subtree() where it's unexpected.

No other callsite passes NULL to __mem_cgroup_same_or_subtree().
Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
Reviewed-by: NVladimir Davydov <vdavydov@parallels.com>
Acked-by: NMichal Hocko <mhocko@suse.cz>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

413918bb

mm: memcontrol: remove bogus NULL check after mem_cgroup_from_task() · c01f46c7

由 Johannes Weiner 提交于 12月 10, 2014

That function acts like a typecast - unless NULL is passed in, no NULL can
come out.  task_in_mem_cgroup() callers don't pass NULL tasks.
Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
Reviewed-by: NVladimir Davydov <vdavydov@parallels.com>
Acked-by: Michal Hocko <mhocko@suse.cz
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

c01f46c7

mm: memcontrol: shorten the page statistics update slowpath · 312722cb

由 Johannes Weiner 提交于 12月 10, 2014

While moving charges from one memcg to another, page stat updates must
acquire the old memcg's move_lock to prevent double accounting.  That
situation is denoted by an increased memcg->move_accounting.  However, the
charge moving code declares this way too early for now, even before
summing up the RSS and pre-allocating destination charges.

Shorten this slowpath mode by increasing memcg->move_accounting only right
before walking the task's address space with the intention of actually
moving the pages.
Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
Acked-by: NMichal Hocko <mhocko@suse.cz>
Reviewed-by: NVladimir Davydov <vdavydov@parallels.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

312722cb

thp: do not mark zero-page pmd write-protected explicitly · e544a4e7

由 Kirill A. Shutemov 提交于 12月 10, 2014

Zero pages can be used only in anonymous mappings, which never have
writable vma->vm_page_prot: see protection_map in mm/mmap.c and __PX1X
definitions.

Let's drop redundant pmd_wrprotect() in set_huge_zero_page().
Signed-off-by: N"Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

e544a4e7

memcg: use generic slab iterators for showing slabinfo · b047501c

由 Vladimir Davydov 提交于 12月 10, 2014

Let's use generic slab_start/next/stop for showing memcg caches info.  In
contrast to the current implementation, this will work even if all memcg
caches' info doesn't fit into a seq buffer (a page), plus it simply looks
neater.

Actually, the main reason I do this isn't mere cleanup.  I'm going to zap
the memcg_slab_caches list, because I find it useless provided we have the
slab_caches list, and this patch is a step in this direction.

It should be noted that before this patch an attempt to read
memory.kmem.slabinfo of a cgroup that doesn't have kmem limit set resulted
in -EIO, while after this patch it will silently show nothing except the
header, but I don't think it will frustrate anyone.
Signed-off-by: NVladimir Davydov <vdavydov@parallels.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

b047501c

memcg: remove mem_cgroup_reclaimable check from soft reclaim · 4ef461e8

由 Vladimir Davydov 提交于 12月 10, 2014

mem_cgroup_reclaimable() checks whether a cgroup has reclaimable pages on
*any* NUMA node.  However, the only place where it's called is
mem_cgroup_soft_reclaim(), which tries to reclaim memory from a *specific*
zone.  So the way it is used is incorrect - it will return true even if
the cgroup doesn't have pages on the zone we're scanning.

I think we can get rid of this check completely, because
mem_cgroup_shrink_node_zone(), which is called by
mem_cgroup_soft_reclaim() if mem_cgroup_reclaimable() returns true, is
equivalent to shrink_lruvec(), which exits almost immediately if the
lruvec passed to it is empty.  So there's no need to optimize anything
here.  Besides, we don't have such a check in the general scan path
(shrink_zone) either.
Signed-off-by: NVladimir Davydov <vdavydov@parallels.com>
Acked-by: NMichal Hocko <mhocko@suse.cz>
Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

4ef461e8

mm, hugetlb: correct bit shift in hstate_sizelog() · 97ad2be1

由 Sasha Levin 提交于 12月 10, 2014

hstate_sizelog() would shift left an int rather than long, triggering
undefined behaviour and passing an incorrect value when the requested
page size was more than 4GB, thus breaking >4GB pages.
Signed-off-by: NSasha Levin <sasha.levin@oracle.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Andrey Ryabinin <a.ryabinin@samsung.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

97ad2be1

mm: memcontrol: fold mem_cgroup_start_move()/mem_cgroup_end_move() · 247b1447

由 Johannes Weiner 提交于 12月 10, 2014

Having these functions and their documentation split out and somewhere
makes it harder, not easier, to follow what's going on.

Inline them directly where charge moving is prepared and finished, and put
an explanation right next to it.
Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Vladimir Davydov <vdavydov@parallels.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

247b1447

mm: memcontrol: don't pass a NULL memcg to mem_cgroup_end_move() · 4e2f245d

由 Johannes Weiner 提交于 12月 10, 2014

mem_cgroup_end_move() checks if the passed memcg is NULL, along with a
lengthy comment to explain why this seemingly non-sensical situation is
even possible.

Check in cancel_attach() itself whether can_attach() set up the move
context or not, it's a lot more obvious from there.  Then remove the check
and comment in mem_cgroup_end_move().
Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
Acked-by: NVladimir Davydov <vdavydov@parallels.com>
Acked-by: NMichal Hocko <mhocko@suse.cz>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

4e2f245d

mm: memcontrol: inline memcg->move_lock locking · 354a4783

由 Johannes Weiner 提交于 12月 10, 2014

The wrappers around taking and dropping the memcg->move_lock spinlock add
nothing of value.  Inline the spinlock calls into the callsites.
Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
Acked-by: NVladimir Davydov <vdavydov@parallels.com>
Acked-by: NMichal Hocko <mhocko@suse.cz>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

354a4783

mm: memcontrol: remove unnecessary PCG_USED pc->mem_cgroup valid flag · 29833315

由 Johannes Weiner 提交于 12月 10, 2014

pc->mem_cgroup had to be left intact after uncharge for the final LRU
removal, and !PCG_USED indicated whether the page was uncharged.  But
since commit 0a31bc97 ("mm: memcontrol: rewrite uncharge API") pages
are uncharged after the final LRU removal.  Uncharge can simply clear
the pointer and the PCG_USED/PageCgroupUsed sites can test that instead.

Because this is the last page_cgroup flag, this patch reduces the memcg
per-page overhead to a single pointer.

[akpm@linux-foundation.org: remove unneeded initialization of `memcg', per Michal]
Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
Cc: Hugh Dickins <hughd@google.com>
Acked-by: NMichal Hocko <mhocko@suse.cz>
Reviewed-by: NVladimir Davydov <vdavydov@parallels.com>
Acked-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

29833315

mm: memcontrol: remove unnecessary PCG_MEM memory charge flag · f4aaa8b4

由 Johannes Weiner 提交于 12月 10, 2014

PCG_MEM is a remnant from an earlier version of 0a31bc97 ("mm:
memcontrol: rewrite uncharge API"), used to tell whether migration cleared
a charge while leaving pc->mem_cgroup valid and PCG_USED set.  But in the
final version, mem_cgroup_migrate() directly uncharges the source page,
rendering this distinction unnecessary.  Remove it.
Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
Cc: Hugh Dickins <hughd@google.com>
Acked-by: NMichal Hocko <mhocko@suse.cz>
Reviewed-by: NVladimir Davydov <vdavydov@parallels.com>
Acked-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

f4aaa8b4

mm: memcontrol: remove unnecessary PCG_MEMSW memory+swap charge flag · 18eca2e6

由 Johannes Weiner 提交于 12月 10, 2014

Now that mem_cgroup_swapout() fully uncharges the page, every page that is
still in use when reaching mem_cgroup_uncharge() is known to carry both
the memory and the memory+swap charge.  Simplify the uncharge path and
remove the PCG_MEMSW page flag accordingly.
Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
Cc: Hugh Dickins <hughd@google.com>
Acked-by: NMichal Hocko <mhocko@suse.cz>
Reviewed-by: NVladimir Davydov <vdavydov@parallels.com>
Acked-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

18eca2e6

mm: memcontrol: uncharge pages on swapout · 7bdd143c

由 Johannes Weiner 提交于 12月 10, 2014

This series gets rid of the remaining page_cgroup flags, thus cutting the
memcg per-page overhead down to one pointer.

This patch (of 4):

mem_cgroup_swapout() is called with exclusive access to the page at the
end of the page's lifetime.  Instead of clearing the PCG_MEMSW flag and
deferring the uncharge, just do it right away.  This allows follow-up
patches to simplify the uncharge code.
Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
Cc: Hugh Dickins <hughd@google.com>
Acked-by: NMichal Hocko <mhocko@suse.cz>
Acked-by: NVladimir Davydov <vdavydov@parallels.com>
Reviewed-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

7bdd143c

mm: memcontrol: micro-optimize mem_cgroup_split_huge_fixup() · b9982f8d

由 Michal Hocko 提交于 12月 10, 2014

Don't call lookup_page_cgroup() when memcg is disabled.

Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vladimir Davydov <vdavydov@parallels.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

b9982f8d

memcg: remove activate_kmem_mutex · 8c0145b6

由 Vladimir Davydov 提交于 12月 10, 2014

The activate_kmem_mutex is used to serialize memcg.kmem.limit updates, but
we already serialize them with memcg_limit_mutex so let's remove the
former.
Signed-off-by: NVladimir Davydov <vdavydov@parallels.com>
Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
Acked-by: NMichal Hocko <mhocko@suse.cz>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

8c0145b6

mm: memcontrol: clarify migration where old page is uncharged · 7d5e3245

由 Johannes Weiner 提交于 12月 10, 2014

Better explain re-entrant migration when compaction races with reclaim,
and also mention swapcache readahead pages as possible uncharged migration
sources.
Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
Cc: Hugh Dickins <hughd@google.com>
Acked-by: NMichal Hocko <mhocko@suse.cz>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

7d5e3245

mm: memcontrol: update mem_cgroup_page_lruvec() documentation · dfe0e773

由 Johannes Weiner 提交于 12月 10, 2014

Commit 7512102c ("memcg: fix GPF when cgroup removal races with last
exit") added a pc->mem_cgroup reset into mem_cgroup_page_lruvec() to
prevent a crash where an anon page gets uncharged on unmap, the memcg is
released, and then the final LRU isolation on free dereferences the
stale pc->mem_cgroup pointer.

But since commit 0a31bc97 ("mm: memcontrol: rewrite uncharge API"),
pages are only uncharged AFTER that final LRU isolation, which
guarantees the memcg's lifetime until then.  pc->mem_cgroup now only
needs to be reset for swapcache readahead pages.

Update the comment and callsite requirements accordingly.
Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
Cc: Hugh Dickins <hughd@google.com>
Acked-by: NMichal Hocko <mhocko@suse.cz>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

dfe0e773

memcg: simplify unreclaimable groups handling in soft limit reclaim · bc2f2e7f

由 Vladimir Davydov 提交于 12月 10, 2014

If we fail to reclaim anything from a cgroup during a soft reclaim pass
we want to get the next largest cgroup exceeding its soft limit. To
achieve this, we should obviously remove the current group from the tree
and then pick the largest group. Currently we have a weird loop instead.
Let's simplify it.
Signed-off-by: NVladimir Davydov <vdavydov@parallels.com>
Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
Acked-by: NMichal Hocko <mhocko@suse.cz>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

bc2f2e7f

openanolis / cloud-kernel 1 年多 前同步成功

openanolis / cloud-kernel
1 年多前同步成功