提交 · fabc3fdde00b54825ba23230aedbf88a735b4e49 · openanolis / cloud-kernel

09 9月, 2015 40 次提交

memcg: get rid of mem_cgroup_root_css for !CONFIG_MEMCG · fabc3fdd

由 Michal Hocko 提交于 9月 08, 2015

The only user is cgwb_bdi_init and that one depends on
CONFIG_CGROUP_WRITEBACK which in turn depends on CONFIG_MEMCG so it
doesn't make much sense to definte an empty stub for !CONFIG_MEMCG.
Moreover ERR_PTR(-EINVAL) is ugly and would lead to runtime crashes if
used in unguarded code paths.  Better fail during compilation.
Signed-off-by: NMichal Hocko <mhocko@suse.cz>
Reviewed-by: NVladimir Davydov <vdavydov@parallels.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

fabc3fdd

memcg: export struct mem_cgroup · 33398cf2

由 Michal Hocko 提交于 9月 08, 2015

mem_cgroup structure is defined in mm/memcontrol.c currently which means
that the code outside of this file has to use external API even for
trivial access stuff.

This patch exports mm_struct with its dependencies and makes some of the
exported functions inlines.  This even helps to reduce the code size a bit
(make defconfig + CONFIG_MEMCG=y)

  text		data    bss     dec     	 hex 	filename
  12355346        1823792 1089536 15268674         e8fb42 vmlinux.before
  12354970        1823792 1089536 15268298         e8f9ca vmlinux.after

This is not much (370B) but better than nothing.

We also save a function call in some hot paths like callers of
mem_cgroup_count_vm_event which is used for accounting.

The patch doesn't introduce any functional changes.

[vdavykov@parallels.com: inline memcg_kmem_is_active]
[vdavykov@parallels.com: do not expose type outside of CONFIG_MEMCG]
[akpm@linux-foundation.org: memcontrol.h needs eventfd.h for eventfd_ctx]
[akpm@linux-foundation.org: export mem_cgroup_from_task() to modules]
Signed-off-by: NMichal Hocko <mhocko@suse.cz>
Reviewed-by: NVladimir Davydov <vdavydov@parallels.com>
Suggested-by: NJohannes Weiner <hannes@cmpxchg.org>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

33398cf2

sparc32: do not include swap.h from pgtable_32.h · b3d9ed3f

由 Michal Hocko 提交于 9月 08, 2015

"memcg: export struct mem_cgroup" will add includes into
linux/memcontrol.h which lead to further header dependency issues as
reported by Guenter Roeck:

  In file included from include/linux/highmem.h:7:0,
                   from include/linux/bio.h:23,
                   from include/linux/writeback.h:192,
                   from include/linux/memcontrol.h:30,
                   from include/linux/swap.h:8,
                   from ./arch/sparc/include/asm/pgtable_32.h:17,
                   from ./arch/sparc/include/asm/pgtable.h:6,
                   from arch/sparc/kernel/traps_32.c:23:
  include/linux/mm.h: In function 'is_vmalloc_addr':
  include/linux/mm.h:371:17: error: 'VMALLOC_START' undeclared (first use in this function)
  include/linux/mm.h:371:17: note: each undeclared identifier is reported only once for each function it appears in
  include/linux/mm.h:371:41: error: 'VMALLOC_END' undeclared (first use in this function)
  include/linux/mm.h: In function 'maybe_mkwrite':
  include/linux/mm.h:556:3: error: implicit declaration of function 'pte_mkwrite'

The issue is that pgtable_32.h depends on swap.h to get swap_entry_t but
that goes all the way down to linux/mm.h which wants to have VMALLOC_*
which is defined later in pgtable_32.h, though.

swap_entry_t is defined in include/mm_types.h so it should be sufficient
to include this header without more dependencies.
Signed-off-by: NMichal Hocko <mhocko@suse.com>
Reported-by: NGuenter Roeck <linux@roeck-us.net>
Tested-by: NGuenter Roeck <linux@roeck-us.net>
Cc: David Miller <davem@davemloft.net>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

b3d9ed3f

mm/dmapool: allow NULL `pool' pointer in dma_pool_destroy() · 44d7175d

由 Sergey Senozhatsky 提交于 9月 08, 2015

dma_pool_destroy() does not tolerate a NULL dma_pool pointer argument and
performs a NULL-pointer dereference.  This requires additional attention
and effort from developers/reviewers and forces all dma_pool_destroy()
callers to do a NULL check

    if (pool)
        dma_pool_destroy(pool);

Or, otherwise, be invalid dma_pool_destroy() users.

Tweak dma_pool_destroy() and NULL-check the pointer there.

Proposed by Andrew Morton.

Link: https://lkml.org/lkml/2015/6/8/583Signed-off-by: NSergey Senozhatsky <sergey.senozhatsky@gmail.com>
Acked-by: NDavid Rientjes <rientjes@google.com>
Cc: Julia Lawall <julia.lawall@lip6.fr>
Cc: Joe Perches <joe@perches.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

44d7175d

mm/mempool: allow NULL `pool' pointer in mempool_destroy() · 4e3ca3e0

由 Sergey Senozhatsky 提交于 9月 08, 2015

mempool_destroy() does not tolerate a NULL mempool_t pointer argument and
performs a NULL-pointer dereference.  This requires additional attention
and effort from developers/reviewers and forces all mempool_destroy()
callers to do a NULL check

    if (pool)
        mempool_destroy(pool);

Or, otherwise, be invalid mempool_destroy() users.

Tweak mempool_destroy() and NULL-check the pointer there.

Proposed by Andrew Morton.

Link: https://lkml.org/lkml/2015/6/8/583Signed-off-by: NSergey Senozhatsky <sergey.senozhatsky@gmail.com>
Acked-by: NDavid Rientjes <rientjes@google.com>
Cc: Julia Lawall <julia.lawall@lip6.fr>
Cc: Joe Perches <joe@perches.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

4e3ca3e0

mm/slab_common: allow NULL cache pointer in kmem_cache_destroy() · 3942d299

由 Sergey Senozhatsky 提交于 9月 08, 2015

kmem_cache_destroy() does not tolerate a NULL kmem_cache pointer argument
and performs a NULL-pointer dereference.  This requires additional
attention and effort from developers/reviewers and forces all
kmem_cache_destroy() callers (200+ as of 4.1) to do a NULL check

    if (cache)
        kmem_cache_destroy(cache);

Or, otherwise, be invalid kmem_cache_destroy() users.

Tweak kmem_cache_destroy() and NULL-check the pointer there.

Proposed by Andrew Morton.

Link: https://lkml.org/lkml/2015/6/8/583Signed-off-by: NSergey Senozhatsky <sergey.senozhatsky@gmail.com>
Acked-by: NDavid Rientjes <rientjes@google.com>
Cc: Julia Lawall <julia.lawall@lip6.fr>
Cc: Joe Perches <joe@perches.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

3942d299

mm, oom: remove unnecessary variable · 75e8f8b2

由 David Rientjes 提交于 9月 08, 2015

The "killed" variable in out_of_memory() can be removed since the call to
oom_kill_process() where we should block to allow the process time to
exit is obvious.
Signed-off-by: NDavid Rientjes <rientjes@google.com>
Acked-by: NMichal Hocko <mhocko@suse.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

75e8f8b2

mm, oom: add description of struct oom_control · 8989e4c7

由 David Rientjes 提交于 9月 08, 2015

Describe the purpose of struct oom_control and what each member does.

Also make gfp_mask and order const since they are never manipulated or
passed to functions that discard the qualifier.
Signed-off-by: NDavid Rientjes <rientjes@google.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

8989e4c7

mm, oom: do not panic for oom kills triggered from sysrq · 071a4bef

由 David Rientjes 提交于 9月 08, 2015

Sysrq+f is used to kill a process either for debug or when the VM is
otherwise unresponsive.

It is not intended to trigger a panic when no process may be killed.

Avoid panicking the system for sysrq+f when no processes are killed.
Signed-off-by: NDavid Rientjes <rientjes@google.com>
Suggested-by: NMichal Hocko <mhocko@suse.cz>
Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

071a4bef

mm, oom: pass an oom order of -1 when triggered by sysrq · 54e9e291

由 David Rientjes 提交于 9月 08, 2015

The force_kill member of struct oom_control isn't needed if an order of -1
is used instead.  This is the same as order == -1 in struct
compact_control which requires full memory compaction.

This patch introduces no functional change.
Signed-off-by: NDavid Rientjes <rientjes@google.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
Cc: Michal Hocko <mhocko@suse.cz>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

54e9e291

mm, oom: organize oom context into struct · 6e0fc46d

由 David Rientjes 提交于 9月 08, 2015

There are essential elements to an oom context that are passed around to
multiple functions.

Organize these elements into a new struct, struct oom_control, that
specifies the context for an oom condition.

This patch introduces no functional change.
Signed-off-by: NDavid Rientjes <rientjes@google.com>
Acked-by: NMichal Hocko <mhocko@suse.cz>
Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

6e0fc46d

mm: make set_recommended_min_free_kbytes() return void · 2c0b80d4

由 Nicholas Krause 提交于 9月 08, 2015

This makes set_recommended_min_free_kbytes() have a return type of void as
it cannot fail.
Signed-off-by: NNicholas Krause <xerofoify@gmail.com>
Acked-by: NMichal Hocko <mhocko@suse.cz>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

2c0b80d4

mm: improve __GFP_NORETRY comment based on implementation · 28c015d0

由 David Rientjes 提交于 9月 08, 2015

Explicitly state that __GFP_NORETRY will attempt direct reclaim and
memory compaction before returning NULL and that the oom killer is not
called in the current implementation of the page allocator.

[akpm@linux-foundation.org: s/has/have/]
Signed-off-by: NDavid Rientjes <rientjes@google.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

28c015d0

fs: do not prefault sys_write() user buffer pages · 998ef75d

由 Dave Hansen 提交于 9月 08, 2015

=== Short summary ====

iov_iter_fault_in_readable() works around a really rare case and we can
avoid the deadlock it addresses in another way: disable page faults and
work around copy failures by faulting after the copy in a slow path
instead of before in a hot one.

I have a little microbenchmark that does repeated, small writes to tmpfs.
This patch speeds that micro up by 6.2%.

=== Long version ===

When doing a sys_write() we have a source buffer in userspace and then a
target file page.

If both of those are the same physical page, there is a potential deadlock
that we avoid.  It would happen something like this:

1. We start the write to the file
2. Allocate page cache page and set it !Uptodate
3. Touch the userspace buffer to copy in the user data
4. Page fault (since source of the write not yet mapped)
5. Page fault code tries to lock the page and deadlocks

(more details on this below)

To avoid this, we prefault the page to guarantee that this fault does not
occur.  But, this prefault comes at a cost.  It is one of the most
expensive things that we do in a hot write() path (especially if we
compare it to the read path).  It is working around a pretty rare case.

To fix this, it's pretty simple.  We move the "prefault" code to run after
we attempt the copy.  We explicitly disable page faults _during_ the copy,
detect the copy failure, then execute the "prefault" ouside of where the
page lock needs to be held.

iov_iter_copy_from_user_atomic() actually already has an implicit
pagefault_disable() inside of it (at least on x86), but we add an explicit
one.  I don't think we can depend on every kmap_atomic() implementation to
pagefault_disable() for eternity.

===================================================

The stack trace when this happens looks like this:

  wait_on_page_bit_killable+0xc0/0xd0
  __lock_page_or_retry+0x84/0xa0
  filemap_fault+0x1ed/0x3d0
  __do_fault+0x41/0xc0
  handle_mm_fault+0x9bb/0x1210
  __do_page_fault+0x17f/0x3d0
  do_page_fault+0xc/0x10
  page_fault+0x22/0x30
  generic_perform_write+0xca/0x1a0
  __generic_file_write_iter+0x190/0x1f0
  ext4_file_write_iter+0xe9/0x460
  __vfs_write+0xaa/0xe0
  vfs_write+0xa6/0x1a0
  SyS_write+0x46/0xa0
  entry_SYSCALL_64_fastpath+0x12/0x6a
  0xffffffffffffffff

(Note, this does *NOT* happen in practice today because
 the kmap_atomic() does a pagefault_disable().  The trace
 above was obtained by taking out the pagefault_disable().)

You can trigger the deadlock with this little code snippet:

	fd = open("foo", O_RDWR);
	fdmap = mmap(NULL, len, PROT_WRITE|PROT_READ, MAP_SHARED, fd, 0);
	write(fd, &fdmap[0], 1);
Signed-off-by: NDave Hansen <dave.hansen@linux.intel.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Jens Axboe <axboe@fb.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: NeilBrown <neilb@suse.de>
Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
Cc: Paul Cassella <cassella@cray.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Andi Kleen <ak@linux.intel.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

998ef75d

mm: /proc/pid/smaps:: show proportional swap share of the mapping · 8334b962

由 Minchan Kim 提交于 9月 08, 2015

We want to know per-process workingset size for smart memory management
on userland and we use swap(ex, zram) heavily to maximize memory
efficiency so workingset includes swap as well as RSS.

On such system, if there are lots of shared anonymous pages, it's really
hard to figure out exactly how many each process consumes memory(ie, rss
+ wap) if the system has lots of shared anonymous memory(e.g, android).

This patch introduces SwapPss field on /proc/<pid>/smaps so we can get
more exact workingset size per process.

Bongkyu tested it. Result is below.

1. 50M used swap
SwapTotal: 461976 kB
SwapFree: 411192 kB

$ adb shell cat /proc/*/smaps | grep "SwapPss:" | awk '{sum += $2} END {print sum}';
48236
$ adb shell cat /proc/*/smaps | grep "Swap:" | awk '{sum += $2} END {print sum}';
141184

2. 240M used swap
SwapTotal: 461976 kB
SwapFree: 216808 kB

$ adb shell cat /proc/*/smaps | grep "SwapPss:" | awk '{sum += $2} END {print sum}';
230315
$ adb shell cat /proc/*/smaps | grep "Swap:" | awk '{sum += $2} END {print sum}';
1387744

[akpm@linux-foundation.org: simplify kunmap_atomic() call]
Signed-off-by: NMinchan Kim <minchan@kernel.org>
Reported-by: NBongkyu Kim <bongkyu.kim@lge.com>
Tested-by: NBongkyu Kim <bongkyu.kim@lge.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Jerome Marchand <jmarchan@redhat.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

8334b962

memtest: remove unused header files · 3115aec4

由 Vladimir Murzin 提交于 9月 08, 2015

memtest does not require these headers to be included.
Signed-off-by: NVladimir Murzin <vladimir.murzin@arm.com>
Cc: Leon Romanovsky <leon@leon.nu>
Acked-by: NDavid Rientjes <rientjes@google.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

3115aec4

memtest: cleanup log messages · f373bafc

由 Vladimir Murzin 提交于 9月 08, 2015

- prefer pr_info(...  to printk(KERN_INFO ...
- use %pa for phys_addr_t
- use cpu_to_be64 while printing pattern in reserve_bad_mem()
Signed-off-by: NVladimir Murzin <vladimir.murzin@arm.com>
Cc: Leon Romanovsky <leon@leon.nu>
Acked-by: NDavid Rientjes <rientjes@google.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

f373bafc

memtest: use kstrtouint instead of simple_strtoul · 06f80596

由 Vladimir Murzin 提交于 9月 08, 2015

Since simple_strtoul is obsolete and memtest_pattern is type of int, use
kstrtouint instead.
Signed-off-by: NVladimir Murzin <vladimir.murzin@arm.com>
Cc: Leon Romanovsky <leon@leon.nu>
Acked-by: NDavid Rientjes <rientjes@google.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

06f80596

pagemap: update documentation · 83b4b0bb

由 Konstantin Khlebnikov 提交于 9月 08, 2015

Notes about recent changes.

[akpm@linux-foundation.org: various tweaks]
Signed-off-by: NKonstantin Khlebnikov <khlebnikov@yandex-team.ru>
Cc: Mark Williamson <mwilliamson@undo-software.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

83b4b0bb

pagemap: add mmap-exclusive bit for marking pages mapped only here · 77bb499b

由 Konstantin Khlebnikov 提交于 9月 08, 2015

This patch sets bit 56 in pagemap if this page is mapped only once.  It
allows to detect exclusively used pages without exposing PFN:

present file exclusive state
0       0    0         non-present
1       1    0         file page mapped somewhere else
1       1    1         file page mapped only here
1       0    0         anon non-CoWed page (shared with parent/child)
1       0    1         anon CoWed page (or never forked)

CoWed pages in (MAP_FILE | MAP_PRIVATE) areas are anon in this context.

MMap-exclusive bit doesn't reflect potential page-sharing via swapcache:
page could be mapped once but has several swap-ptes which point to it.
Application could detect that by swap bit in pagemap entry and touch that
pte via /proc/pid/mem to get real information.

See http://lkml.kernel.org/r/CAEVpBa+_RyACkhODZrRvQLs80iy0sqpdrd0AaP_-tgnX3Y9yNQ@mail.gmail.com

Requested by Mark Williamson.

[akpm@linux-foundation.org: fix spello]
Signed-off-by: NKonstantin Khlebnikov <khlebnikov@yandex-team.ru>
Reviewed-by: NMark Williamson <mwilliamson@undo-software.com>
Tested-by: NMark Williamson <mwilliamson@undo-software.com>
Reviewed-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

77bb499b

pagemap: hide physical addresses from non-privileged users · 1c90308e

由 Konstantin Khlebnikov 提交于 9月 08, 2015

This patch makes pagemap readable for normal users and hides physical
addresses from them. For some use-cases PFN isn't required at all.

See http://lkml.kernel.org/r/1425935472-17949-1-git-send-email-kirill@shutemov.name

Fixes: ab676b7d ("pagemap: do not leak physical addresses to non-privileged userspace")
Signed-off-by: NKonstantin Khlebnikov <khlebnikov@yandex-team.ru>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Reviewed-by: NMark Williamson <mwilliamson@undo-software.com>
Tested-by: NMark Williamson <mwilliamson@undo-software.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

1c90308e

pagemap: rework hugetlb and thp report · 356515e7

由 Konstantin Khlebnikov 提交于 9月 08, 2015

This patch moves pmd dissection out of reporting loop: huge pages are
reported as bunch of normal pages with contiguous PFNs.

Add missing "FILE" bit in hugetlb vmas.
Signed-off-by: NKonstantin Khlebnikov <khlebnikov@yandex-team.ru>
Reviewed-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Reviewed-by: NMark Williamson <mwilliamson@undo-software.com>
Tested-by: NMark Williamson <mwilliamson@undo-software.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

356515e7

pagemap: switch to the new format and do some cleanup · deb94544

由 Konstantin Khlebnikov 提交于 9月 08, 2015

This patch removes page-shift bits (scheduled to remove since 3.11) and
completes migration to the new bit layout. Also it cleans messy macro.
Signed-off-by: NKonstantin Khlebnikov <khlebnikov@yandex-team.ru>
Reviewed-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Mark Williamson <mwilliamson@undo-software.com>
Tested-by: NMark Williamson <mwilliamson@undo-software.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

deb94544

pagemap: check permissions and capabilities at open time · a06db751

由 Konstantin Khlebnikov 提交于 9月 08, 2015

This patchset makes pagemap useable again in the safe way (after row
hammer bug it was made CAP_SYS_ADMIN-only). This patchset restores access
for non-privileged users but hides PFNs from them.

Also it adds bit 'map-exclusive' which is set if page is mapped only here:
it helps in estimation of working set without exposing pfns and allows to
distinguish CoWed and non-CoWed private anonymous pages.

Second patch removes page-shift bits and completes migration to the new
pagemap format: flags soft-dirty and mmap-exclusive are available only in
the new format.

This patch (of 5):

This patch moves permission checks from pagemap_read() into pagemap_open().

Pointer to mm is saved in file->private_data. This reference pins only
mm_struct itself. /proc/*/mem, maps, smaps already work in the same way.

See http://lkml.kernel.org/r/CA+55aFyKpWrt_Ajzh1rzp_GcwZ4=6Y=kOv8hBz172CFJp6L8Tg@mail.gmail.comSigned-off-by: NKonstantin Khlebnikov <khlebnikov@yandex-team.ru>
Reviewed-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Reviewed-by: NMark Williamson <mwilliamson@undo-software.com>
Tested-by: NMark Williamson <mwilliamson@undo-software.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

a06db751

mm: remove put_page_unless_one() · b5e3aa0a

由 Vineet Gupta 提交于 9月 08, 2015

It has no callers.
Signed-off-by: NVineet Gupta <vgupta@synopsys.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

b5e3aa0a

mm/memblock.c: WARN_ON when flags differs from overlap region · 4fcab5f4

由 Wei Yang 提交于 9月 08, 2015

Each memblock_region has flags to indicates the type of this range. For
the overlap case, memblock_add_range() inserts the lower part and leave the
upper part as indicated in the overlapped region.

If the flags of the new range differs from the overlapped region, the
information recorded is not correct.

This patch adds a WARN_ON when the flags of the new range differs from the
overlapped region.
Signed-off-by: NWei Yang <weiyang@linux.vnet.ibm.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

4fcab5f4

mm/page_alloc.c: remove unused variable in free_area_init_core() · 7f3eb55b

由 Wei Yang 提交于 9月 08, 2015

Commit febd5949 ("mm/memory hotplug: init the zone's size when
calculating node totalpages") refines the function
free_area_init_core().

After doing so, these two parameters are not used anymore.

This patch removes these two parameters.
Signed-off-by: NWei Yang <weiyang@linux.vnet.ibm.com>
Cc: Gu Zheng <guz.fnst@cn.fujitsu.com>
Acked-by: NDavid Rientjes <rientjes@google.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

7f3eb55b

mm/page_alloc.c: refine the calculation of highest possible node id · 904a9553

由 Wei Yang 提交于 9月 08, 2015

nr_node_ids records the highest possible node id, which is calculated by
scanning the bitmap node_states[N_POSSIBLE].  Current implementation
scan the bitmap from the beginning, which will scan the whole bitmap.

This patch reverses the order by scanning from the end with
find_last_bit().
Signed-off-by: NWei Yang <weiyang@linux.vnet.ibm.com>
Cc: Tejun Heo <tj@kernel.org>
Acked-by: NDavid Rientjes <rientjes@google.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

904a9553

mm, dax: use i_mmap_unlock_write() in do_cow_fault() · 52a2b53f

由 Kirill A. Shutemov 提交于 9月 08, 2015

__dax_fault() takes i_mmap_lock for write. Let's pair it with write
unlock on do_cow_fault() side.
Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: NMatthew Wilcox <willy@linux.intel.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

52a2b53f

mm: take i_mmap_lock in unmap_mapping_range() for DAX · 46c043ed

由 Kirill A. Shutemov 提交于 9月 08, 2015

DAX is not so special: we need i_mmap_lock to protect mapping->i_mmap.

__dax_pmd_fault() uses unmap_mapping_range() shoot out zero page from
all mappings.  We need to drop i_mmap_lock there to avoid lock deadlock.

Re-aquiring the lock should be fine since we check i_size after the
point.
Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Matthew Wilcox <willy@linux.intel.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

46c043ed

dax: use linear_page_index() · 3fdd1b47

由 Matthew Wilcox 提交于 9月 08, 2015

I was basically open-coding it (thanks to copying code from do_fault()
which probably also needs to be fixed).
Signed-off-by: NMatthew Wilcox <willy@linux.intel.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

3fdd1b47

dax: ensure that zero pages are removed from other processes · 73a6ec47

由 Matthew Wilcox 提交于 9月 08, 2015

If the first access to a huge page was a store, there would be no existing
zero pmd in this process's page tables. There could be a zero pmd in
another process's page tables, if it had done a load. We can detect this
case by noticing that the buffer_head returned from the filesystem is New,
and ensure that other processes mapping this huge page have their page
tables flushed.
Signed-off-by: NMatthew Wilcox <willy@linux.intel.com>
Reported-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

73a6ec47

dax: don't use set_huge_zero_page() · d295e341

由 Kirill A. Shutemov 提交于 9月 08, 2015

This is another place where DAX assumed that pgtable_t was a pointer.
Open code the important parts of set_huge_zero_page() in DAX and make
set_huge_zero_page() static again.
Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: NMatthew Wilcox <willy@linux.intel.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

d295e341

thp: fix zap_huge_pmd() for DAX · da146769

由 Kirill A. Shutemov 提交于 9月 08, 2015

The original DAX code assumed that pgtable_t was a pointer, which isn't
true on all architectures.  Restructure the code to not rely on that
assumption.

[willy@linux.intel.com: further fixes integrated into this patch]
Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: NMatthew Wilcox <willy@linux.intel.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

da146769

thp: decrement refcount on huge zero page if it is split · 5b701b84

由 Kirill A. Shutemov 提交于 9月 08, 2015

The DAX code neglected to put the refcount on the huge zero page.
Also we must notify on splits.
Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: NMatthew Wilcox <willy@linux.intel.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

5b701b84

dax: fix race between simultaneous faults · 84317297

由 Matthew Wilcox 提交于 9月 08, 2015

If two threads write-fault on the same hole at the same time, the winner
of the race will return to userspace and complete their store, only to
have the loser overwrite their store with zeroes. Fix this for now by
taking the i_mmap_sem for write instead of read, and do so outside the
call to get_block(). Now the loser of the race will see the block has
already been zeroed, and will not zero it again.

This severely limits our scalability. I have ideas for improving it, but
those can wait for a later patch.
Signed-off-by: NMatthew Wilcox <willy@linux.intel.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

84317297

ext4: start transaction before calling into DAX · 01a33b4a

由 Matthew Wilcox 提交于 9月 08, 2015

Jan Kara pointed out that in the case where we are writing to a hole, we
can end up with a lock inversion between the page lock and the journal
lock.  We can avoid this by starting the transaction in ext4 before
calling into DAX.  The journal lock nests inside the superblock
pagefault lock, so we have to duplicate that code from dax_fault, like
XFS does.
Signed-off-by: NMatthew Wilcox <willy@linux.intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

01a33b4a

ext4: add ext4_get_block_dax() · ed923b57

由 Matthew Wilcox 提交于 9月 08, 2015

DAX wants different semantics from any currently-existing ext4 get_block
callback.  Unlike ext4_get_block_write(), it needs to honour the
'create' flag, and unlike ext4_get_block(), it needs to be able to
return unwritten extents.  So introduce a new ext4_get_block_dax() which
has those semantics.

We could also change ext4_get_block_write() to honour the 'create' flag,
but that might have consequences on other users that I do not currently
understand.
Signed-off-by: NMatthew Wilcox <willy@linux.intel.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

ed923b57

dax: improve comment about truncate race · 84c4e5e6

由 Matthew Wilcox 提交于 9月 08, 2015

Jan Kara pointed out I should be more explicit here about the perils of
racing against truncate.  The comment is mostly the same as for the PTE
case.
Signed-off-by: NMatthew Wilcox <willy@linux.intel.com>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

84c4e5e6

thp: change insert_pfn's return type to void · ae18d6dc

由 Matthew Wilcox 提交于 9月 08, 2015

It would make more sense to have all the return values from
vmf_insert_pfn_pmd() encoded in one place instead of having to follow
the convention into insert_pfn().  Suggested by Jeff Moyer.
Signed-off-by: NMatthew Wilcox <willy@linux.intel.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

ae18d6dc

openanolis / cloud-kernel 1 年多 前同步成功

openanolis / cloud-kernel
1 年多前同步成功