提交 · e30825f1869a75b29a69dc8e0aaaaccc492092cf · openanolis / cloud-kernel

14 12月, 2014 35 次提交

mm/debug-pagealloc: prepare boottime configurable on/off · e30825f1

由 Joonsoo Kim 提交于 12月 12, 2014

Until now, debug-pagealloc needs extra flags in struct page, so we need to
recompile whole source code when we decide to use it.  This is really
painful, because it takes some time to recompile and sometimes rebuild is
not possible due to third party module depending on struct page.  So, we
can't use this good feature in many cases.

Now, we have the page extension feature that allows us to insert extra
flags to outside of struct page.  This gets rid of third party module
issue mentioned above.  And, this allows us to determine if we need extra
memory for this page extension in boottime.  With these property, we can
avoid using debug-pagealloc in boottime with low computational overhead in
the kernel built with CONFIG_DEBUG_PAGEALLOC.  This will help our
development process greatly.

This patch is the preparation step to achive above goal.  debug-pagealloc
originally uses extra field of struct page, but, after this patch, it will
use field of struct page_ext.  Because memory for page_ext is allocated
later than initialization of page allocator in CONFIG_SPARSEMEM, we should
disable debug-pagealloc feature temporarily until initialization of
page_ext.  This patch implements this.
Signed-off-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Dave Hansen <dave@sr71.net>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Jungsoo Son <jungsoo.son@lge.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

e30825f1

mm/page_ext: resurrect struct page extending code for debugging · eefa864b

由 Joonsoo Kim 提交于 12月 12, 2014

When we debug something, we'd like to insert some information to every
page.  For this purpose, we sometimes modify struct page itself.  But,
this has drawbacks.  First, it requires re-compile.  This makes us
hesitate to use the powerful debug feature so development process is
slowed down.  And, second, sometimes it is impossible to rebuild the
kernel due to third party module dependency.  At third, system behaviour
would be largely different after re-compile, because it changes size of
struct page greatly and this structure is accessed by every part of
kernel.  Keeping this as it is would be better to reproduce errornous
situation.

This feature is intended to overcome above mentioned problems.  This
feature allocates memory for extended data per page in certain place
rather than the struct page itself.  This memory can be accessed by the
accessor functions provided by this code.  During the boot process, it
checks whether allocation of huge chunk of memory is needed or not.  If
not, it avoids allocating memory at all.  With this advantage, we can
include this feature into the kernel in default and can avoid rebuild and
solve related problems.

Until now, memcg uses this technique.  But, now, memcg decides to embed
their variable to struct page itself and it's code to extend struct page
has been removed.  I'd like to use this code to develop debug feature, so
this patch resurrect it.

To help these things to work well, this patch introduces two callbacks for
clients.  One is the need callback which is mandatory if user wants to
avoid useless memory allocation at boot-time.  The other is optional, init
callback, which is used to do proper initialization after memory is
allocated.  Detailed explanation about purpose of these functions is in
code comment.  Please refer it.

Others are completely same with previous extension code in memcg.
Signed-off-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Dave Hansen <dave@sr71.net>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Jungsoo Son <jungsoo.son@lge.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

eefa864b

mm, gfp: escalatedly define GFP_HIGHUSER and GFP_HIGHUSER_MOVABLE · 2d48366b

由 Jianyu Zhan 提交于 12月 12, 2014

GFP_USER, GFP_HIGHUSER and GFP_HIGHUSER_MOVABLE are escalatedly confined
defined, also implied by their names:

GFP_USER                                  = GFP_USER
GFP_USER + __GFP_HIGHMEM                  = GFP_HIGHUSER
GFP_USER + __GFP_HIGHMEM + __GFP_MOVABLE  = GFP_HIGHUSER_MOVABLE

So just make GFP_HIGHUSER and GFP_HIGHUSER_MOVABLE escalatedly defined to
reflect this fact.  It also makes the definition clear and texturally warn
on any furture break-up of this escalated relastionship.
Signed-off-by: NJianyu Zhan <jianyu.zhan@emc.com>
Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: NDavid Rientjes <rientjes@google.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

2d48366b

include/linux/kmemleak.h: needs slab.h · 66f2ca7e

由 Andrew Morton 提交于 12月 12, 2014

include/linux/kmemleak.h: In function 'kmemleak_alloc_recursive':
include/linux/kmemleak.h:43: error: 'SLAB_NOLEAKTRACE' undeclared (first use in this function)

Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

66f2ca7e

mm/memcontrol.c: remove the unused arg in __memcg_kmem_get_cache() · 056b7cce

由 Zhang Zhen 提交于 12月 12, 2014

The gfp was passed in but never used in this function.
Signed-off-by: NZhang Zhen <zhenzhang.zhang@huawei.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

056b7cce

mm: move swp_entry_t definition to include/linux/mm_types.h · bd6dace7

由 Tejun Heo 提交于 12月 12, 2014

swp_entry_t being defined in include/linux/swap.h instead of
include/linux/mm_types.h causes cyclic include dependency later when
include/linux/page_cgroup.h is included from writeback path.  Move the
definition to include/linux/mm_types.h.

While at it, reformat the comment above it.
Signed-off-by: NTejun Heo <tj@kernel.org>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

bd6dace7

memory-hotplug: remove redundant call of page_to_pfn · 71fbd556

由 Zhang Zhen 提交于 12月 12, 2014

This is just a small optimization.  The start_pfn can be obtained directly
by phys_index << PFN_SECTION_SHIFT.  So the call of page_to_pfn() is
redundant and remove it.
Signed-off-by: NZhang Zhen <zhenzhang.zhang@huawei.com>
Acked-by: NYasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Acked-by: NDavid Rientjes <rientjes@google.com>
Cc: Dave Hansen <dave@sr71.net>
Cc: Wang Nan <wangnan0@huawei.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

71fbd556

iommu/amd: use handle_mm_fault directly · 9dc00f4c

由 Jesse Barnes 提交于 12月 12, 2014

This could be useful for debug in the future if we want to track
major/minor faults more closely, and also avoids the put_page trick we
used with gup.

In order to do this, we also track the task struct in the PASID state
structure.  This lets us update the appropriate task stats after the fault
has been handled, and may aid with debug in the future as well.
Signed-off-by: NJesse Barnes <jbarnes@virtuousgeek.org>
Tested-by: NOded Gabbay <oded.gabbay@amd.com>
Cc: Joerg Roedel <jroedel@suse.de>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

9dc00f4c

mm: export find_extend_vma() and handle_mm_fault() for driver use · e1d6d01a

由 Jesse Barnes 提交于 12月 12, 2014

This lets drivers like the AMD IOMMUv2 driver handle faults a bit more
simply, rather than doing tricks with page refs and get_user_pages().
Signed-off-by: NJesse Barnes <jbarnes@virtuousgeek.org>
Cc: Oded Gabbay <oded.gabbay@amd.com>
Cc: Joerg Roedel <jroedel@suse.de>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

e1d6d01a

hugetlb: hugetlb_register_all_nodes(): add __init marker · 7d9ca000

由 Luiz Capitulino 提交于 12月 12, 2014

This function is only called during initialization.
Signed-off-by: NLuiz Capitulino <lcapitulino@redhat.com>
Cc: Andi Kleen <andi@firstfloor.org>
Acked-by: NDavid Rientjes <rientjes@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Acked-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

7d9ca000

hugetlb: alloc_bootmem_huge_page(): use IS_ALIGNED() · df994ead

由 Luiz Capitulino 提交于 12月 12, 2014

No reason to duplicate the code of an existing macro.
Signed-off-by: NLuiz Capitulino <lcapitulino@redhat.com>
Cc: Andi Kleen <andi@firstfloor.org>
Acked-by: NDavid Rientjes <rientjes@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Acked-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

df994ead

hugetlb: fix hugepages= entry in kernel-parameters.txt · 27ec26ec

由 Luiz Capitulino 提交于 12月 12, 2014

The hugepages= entry in kernel-parameters.txt states that 1GB pages can
only be allocated at boot time and not freed afterwards.  This is not
true since commit 944d9fec ("hugetlb: add support for gigantic page
allocation at runtime"), at least for x86_64.

Instead of adding arch-specifc observations to the hugepages= entry,
this commit just drops the out of date information.  Further information
about arch-specific support and available features can be obtained in
the hugetlb documentation.
Signed-off-by: NLuiz Capitulino <lcapitulino@redhat.com>
Cc: Andi Kleen <andi@firstfloor.org>
Acked-by: NDavid Rientjes <rientjes@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Acked-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

27ec26ec

memcg: turn memcg_kmem_skip_account into a bit field · 6f185c29

由 Vladimir Davydov 提交于 12月 12, 2014

It isn't supposed to stack, so turn it into a bit-field to save 4 bytes on
the task_struct.

Also, remove the memcg_stop/resume_kmem_account helpers - it is clearer to
set/clear the flag inline.  Regarding the overwhelming comment to the
helpers, which is removed by this patch too, we already have a compact yet
accurate explanation in memcg_schedule_cache_create, no need in yet
another one.
Signed-off-by: NVladimir Davydov <vdavydov@parallels.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

6f185c29

memcg: only check memcg_kmem_skip_account in __memcg_kmem_get_cache · 4e701d7b

由 Vladimir Davydov 提交于 12月 12, 2014

__memcg_kmem_get_cache can recurse if it calls kmalloc (which it does if
the cgroup's kmem cache doesn't exist), because kmalloc may call
__memcg_kmem_get_cache internally again.  To avoid the recursion, we use
the task_struct->memcg_kmem_skip_account flag.

However, there's no need checking the flag in memcg_kmem_newpage_charge,
because there's no way how this function could result in recursion, if
called from memcg_kmem_get_cache.  So let's remove the redundant code.
Signed-off-by: NVladimir Davydov <vdavydov@parallels.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

4e701d7b

memcg: zap kmem_account_flags · 900a38f0

由 Vladimir Davydov 提交于 12月 12, 2014

The only such flag is KMEM_ACCOUNTED_ACTIVE, but it's set iff
mem_cgroup->kmemcg_id is initialized, so we can check kmemcg_id instead of
having a separate flags field.
Signed-off-by: NVladimir Davydov <vdavydov@parallels.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

900a38f0

mm: mincore: add hwpoison page handle · c313dc5d

由 Weijie Yang 提交于 12月 12, 2014

When the encountered pte is a swap entry, the current code handles two
cases: migration and normal swapentry, but we have a third case: hwpoison
page.

This patch adds hwpoison page handle, consider hwpoison page incore as
same as migration.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: NWeijie Yang <weijie.yang@samsung.com>
Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@suse.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Rik van Riel <riel@redhat.com>
Acked-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

c313dc5d

mm/rmap: calculate page offset when needed · b258d860

由 Davidlohr Bueso 提交于 12月 12, 2014

Call page_to_pgoff() to get the page offset once we are sure we actually
need it, and any very obvious initial function checks have passed.
Trivial micro-optimization, and potentially save some cycles.
Signed-off-by: NDavidlohr Bueso <dbueso@suse.de>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

b258d860

mm/debug-pagealloc: cleanup page guard code · 2847cf95

由 Joonsoo Kim 提交于 12月 12, 2014

Page guard is used by debug-pagealloc feature.  Currently, it is
open-coded, but, I think that more abstraction of it makes core page
allocator code more readable.

There is no functional difference.
Signed-off-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: NVlastimil Babka <vbabka@suse.cz>
Cc: Gioh Kim <gioh.kim@lge.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

2847cf95

mm/memblock.c: refactor functions to set/clear MEMBLOCK_HOTPLUG · 4308ce17

由 Tony Luck 提交于 12月 12, 2014

There is a lot of duplication in the rubric around actually setting or
clearing a mem region flag.  Create a new helper function to do this and
reduce each of memblock_mark_hotplug() and memblock_clear_hotplug() to a
single line.

This will be useful if someone were to add a new mem region flag - which
I hope to be doing some day soon. But it looks like a plausible cleanup
even without that - so I'd like to get it out of the way now.
Signed-off-by: NTony Luck <tony.luck@intel.com>
Cc: Santosh Shilimkar <santosh.shilimkar@ti.com>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Grygorii Strashko <grygorii.strashko@ti.com>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: Philipp Hachtmann <phacht@linux.vnet.ibm.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Emil Medve <Emilian.Medve@freescale.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

4308ce17

memcg: do not abuse memcg_kmem_skip_account · 95fc3c50

由 Vladimir Davydov 提交于 12月 12, 2014

task_struct->memcg_kmem_skip_account was initially introduced to avoid
recursion during kmem cache creation: memcg_kmem_get_cache, which is
called by kmem_cache_alloc to determine the per-memcg cache to account
allocation to, may issue lazy cache creation if the needed cache doesn't
exist, which means issuing yet another kmem_cache_alloc.  We can't just
pass a flag to the nested kmem_cache_alloc disabling kmem accounting,
because there are hidden allocations, e.g.  in INIT_WORK.  So we
introduced a flag on the task_struct, memcg_kmem_skip_account, making
memcg_kmem_get_cache return immediately.

By its nature, the flag may also be used to disable accounting for
allocations shared among different cgroups, and currently it is used this
way in memcg_activate_kmem.  Using it like this looks like abusing it to
me.  If we want to disable accounting for some allocations (which we will
definitely want one day), we should either add GFP_NO_MEMCG or GFP_MEMCG
flag in order to blacklist/whitelist some allocations.

For now, let's simply remove memcg_stop/resume_kmem_account from
memcg_activate_kmem.
Signed-off-by: NVladimir Davydov <vdavydov@parallels.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

95fc3c50

memcg: don't check mm in __memcg_kmem_{get_cache,newpage_charge} · 9d100c5e

由 Vladimir Davydov 提交于 12月 12, 2014

We already assured the current task has mm in memcg_kmem_should_charge,
no need to double check.
Signed-off-by: NVladimir Davydov <vdavydov@parallels.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

9d100c5e

memcg: __mem_cgroup_free: remove stale disarm_static_keys comment · bfda7e8f

由 Vladimir Davydov 提交于 12月 12, 2014

cpuset code stopped using cgroup_lock in favor of cpuset_mutex long ago.
Signed-off-by: NVladimir Davydov <vdavydov@parallels.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

bfda7e8f

mm: cma: align to physical address, not CMA region position · b5be83e3

由 Gregory Fong 提交于 12月 12, 2014

The alignment in cma_alloc() was done w.r.t. the bitmap.  This is a
problem when, for example:

- a device requires 16M (order 12) alignment
- the CMA region is not 16 M aligned

In such a case, can result with the CMA region starting at, say,
0x2f800000 but any allocation you make from there will be aligned from
there.  Requesting an allocation of 32 M with 16 M alignment will result
in an allocation from 0x2f800000 to 0x31800000, which doesn't work very
well if your strange device requires 16M alignment.

Change to use bitmap_find_next_zero_area_off() to account for the
difference in alignment at reserve-time and alloc-time.
Signed-off-by: NGregory Fong <gregory.0xf0@gmail.com>
Acked-by: NMichal Nazarewicz <mina86@mina86.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Kukjin Kim <kgene.kim@samsung.com>
Cc: Laurent Pinchart <laurent.pinchart@ideasonboard.com>
Cc: Laura Abbott <lauraa@codeaurora.org>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

b5be83e3

lib: bitmap: add alignment offset for bitmap_find_next_zero_area() · 5e19b013

由 Michal Nazarewicz 提交于 12月 12, 2014

Add a bitmap_find_next_zero_area_off() function which works like
bitmap_find_next_zero_area() function except it allows an offset to be
specified when alignment is checked.  This lets caller request a bit such
that its number plus the offset is aligned according to the mask.

[gregory.0xf0@gmail.com: Retrieved from https://patchwork.linuxtv.org/patch/6254/ and updated documentation]
Signed-off-by: NMichal Nazarewicz <mina86@mina86.com>
Signed-off-by: NKyungmin Park <kyungmin.park@samsung.com>
Signed-off-by: NMarek Szyprowski <m.szyprowski@samsung.com>
Signed-off-by: NGregory Fong <gregory.0xf0@gmail.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Kukjin Kim <kgene.kim@samsung.com>
Cc: Laurent Pinchart <laurent.pinchart@ideasonboard.com>
Cc: Laura Abbott <lauraa@codeaurora.org>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

5e19b013

mm/memory.c: share the i_mmap_rwsem · c8475d14

由 Davidlohr Bueso 提交于 12月 12, 2014

The unmap_mapping_range family of functions do the unmapping of user pages
(ultimately via zap_page_range_single) without touching the actual
interval tree, thus share the lock.
Signed-off-by: NDavidlohr Bueso <dbueso@suse.de>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Acked-by: NHugh Dickins <hughd@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Acked-by: NMel Gorman <mgorman@suse.de>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

c8475d14

mm/nommu: share the i_mmap_rwsem · 1acf2e04

由 Davidlohr Bueso 提交于 12月 12, 2014

Shrinking/truncate logic can call nommu_shrink_inode_mappings() to verify
that any shared mappings of the inode in question aren't broken (dead
zone).  afaict the only user being ramfs to handle the size change
attribute.

Pretty much a no-brainer to share the lock.
Signed-off-by: NDavidlohr Bueso <dbueso@suse.de>
Acked-by: N"Kirill A. Shutemov" <kirill@shutemov.name>
Acked-by: NHugh Dickins <hughd@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Acked-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Acked-by: NMel Gorman <mgorman@suse.de>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

1acf2e04

mm/memory-failure: share the i_mmap_rwsem · d28eb9c8

由 Davidlohr Bueso 提交于 12月 12, 2014

No brainer conversion: collect_procs_file() only schedules a process for
later kill, share the lock, similarly to the anon vma variant.
Signed-off-by: NDavidlohr Bueso <dbueso@suse.de>
Acked-by: N"Kirill A. Shutemov" <kirill@shutemov.name>
Acked-by: NHugh Dickins <hughd@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Acked-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Acked-by: NMel Gorman <mgorman@suse.de>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

d28eb9c8

mm/xip: share the i_mmap_rwsem · 874bfcaf

由 Davidlohr Bueso 提交于 12月 12, 2014

__xip_unmap() will remove the xip sparse page from the cache and take down
pte mapping, without altering the interval tree, thus share the
i_mmap_rwsem when searching for the ptes to unmap.

Additionally, tidy up the function a bit and make variables only local to
the interval tree walk loop.
Signed-off-by: NDavidlohr Bueso <dbueso@suse.de>
Acked-by: N"Kirill A. Shutemov" <kirill@shutemov.name>
Acked-by: NHugh Dickins <hughd@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Acked-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Acked-by: NMel Gorman <mgorman@suse.de>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

874bfcaf

uprobes: share the i_mmap_rwsem · 4a23717a

由 Davidlohr Bueso 提交于 12月 12, 2014

Both register and unregister call build_map_info() in order to create the
list of mappings before installing or removing breakpoints for every mm
which maps file backed memory.  As such, there is no reason to hold the
i_mmap_rwsem exclusively, so share it and allow concurrent readers to
build the mapping data.
Signed-off-by: NDavidlohr Bueso <dbueso@suse.de>
Acked-by: NSrikar Dronamraju <srikar@linux.vnet.ibm.com>
Acked-by: N"Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Oleg Nesterov <oleg@redhat.com>
Acked-by: NHugh Dickins <hughd@google.com>
Acked-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Acked-by: NMel Gorman <mgorman@suse.de>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

4a23717a

mm/rmap: share the i_mmap_rwsem · 3dec0ba0

由 Davidlohr Bueso 提交于 12月 12, 2014

Similarly to the anon memory counterpart, we can share the mapping's lock
ownership as the interval tree is not modified when doing doing the walk,
only the file page.
Signed-off-by: NDavidlohr Bueso <dbueso@suse.de>
Acked-by: NRik van Riel <riel@redhat.com>
Acked-by: N"Kirill A. Shutemov" <kirill@shutemov.name>
Acked-by: NHugh Dickins <hughd@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Acked-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Acked-by: NMel Gorman <mgorman@suse.de>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

3dec0ba0

mm: convert i_mmap_mutex to rwsem · c8c06efa

由 Davidlohr Bueso 提交于 12月 12, 2014

The i_mmap_mutex is a close cousin of the anon vma lock, both protecting
similar data, one for file backed pages and the other for anon memory.  To
this end, this lock can also be a rwsem.  In addition, there are some
important opportunities to share the lock when there are no tree
modifications.

This conversion is straightforward.  For now, all users take the write
lock.

[sfr@canb.auug.org.au: update fremap.c]
Signed-off-by: NDavidlohr Bueso <dbueso@suse.de>
Reviewed-by: NRik van Riel <riel@redhat.com>
Acked-by: N"Kirill A. Shutemov" <kirill@shutemov.name>
Acked-by: NHugh Dickins <hughd@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Acked-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Acked-by: NMel Gorman <mgorman@suse.de>
Signed-off-by: NStephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

c8c06efa

mm: use new helper functions around the i_mmap_mutex · 83cde9e8

由 Davidlohr Bueso 提交于 12月 12, 2014

Convert all open coded mutex_lock/unlock calls to the
i_mmap_[lock/unlock]_write() helpers.
Signed-off-by: NDavidlohr Bueso <dbueso@suse.de>
Acked-by: NRik van Riel <riel@redhat.com>
Acked-by: N"Kirill A. Shutemov" <kirill@shutemov.name>
Acked-by: NHugh Dickins <hughd@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Acked-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Acked-by: NMel Gorman <mgorman@suse.de>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

83cde9e8

mm,fs: introduce helpers around the i_mmap_mutex · 8b28f621

由 Davidlohr Bueso 提交于 12月 12, 2014

This series is a continuation of the conversion of the i_mmap_mutex to
rwsem, following what we have for the anon memory counterpart.  With
Hugh's feedback from the first iteration.

Ultimately, the most obvious paths that require exclusive ownership of the
lock is when we modify the VMA interval tree, via
vma_interval_tree_insert() and vma_interval_tree_remove() families.  Cases
such as unmapping, where the ptes content is changed but the tree remains
untouched should make it safe to share the i_mmap_rwsem.

As such, the code of course is straightforward, however the devil is very
much in the details.  While its been tested on a number of workloads
without anything exploding, I would not be surprised if there are some
less documented/known assumptions about the lock that could suffer from
these changes.  Or maybe I'm just missing something, but either way I
believe its at the point where it could use more eyes and hopefully some
time in linux-next.

Because the lock type conversion is the heart of this patchset,
its worth noting a few comparisons between mutex vs rwsem (xadd):

  (i) Same size, no extra footprint.

  (ii) Both have CONFIG_XXX_SPIN_ON_OWNER capabilities for
       exclusive lock ownership.

  (iii) Both can be slightly unfair wrt exclusive ownership, with
        writer lock stealing properties, not necessarily respecting
        FIFO order for granting the lock when contended.

  (iv) Mutexes can be slightly faster than rwsems when
       the lock is non-contended.

  (v) Both suck at performance for debug (slowpaths), which
      shouldn't matter anyway.

Sharing the lock is obviously beneficial, and sem writer ownership is
close enough to mutexes.  The biggest winner of these changes is
migration.

As for concrete numbers, the following performance results are for a
4-socket 60-core IvyBridge-EX with 130Gb of RAM.

Both alltests and disk (xfs+ramdisk) workloads of aim7 suite do quite well
with this set, with a steady ~60% throughput (jpm) increase for alltests
and up to ~30% for disk for high amounts of concurrency.  Lower counts of
workload users (< 100) does not show much difference at all, so at least
no regressions.

                    3.18-rc1            3.18-rc1-i_mmap_rwsem
alltests-100     17918.72 (  0.00%)    28417.97 ( 58.59%)
alltests-200     16529.39 (  0.00%)    26807.92 ( 62.18%)
alltests-300     16591.17 (  0.00%)    26878.08 ( 62.00%)
alltests-400     16490.37 (  0.00%)    26664.63 ( 61.70%)
alltests-500     16593.17 (  0.00%)    26433.72 ( 59.30%)
alltests-600     16508.56 (  0.00%)    26409.20 ( 59.97%)
alltests-700     16508.19 (  0.00%)    26298.58 ( 59.31%)
alltests-800     16437.58 (  0.00%)    26433.02 ( 60.81%)
alltests-900     16418.35 (  0.00%)    26241.61 ( 59.83%)
alltests-1000    16369.00 (  0.00%)    26195.76 ( 60.03%)
alltests-1100    16330.11 (  0.00%)    26133.46 ( 60.03%)
alltests-1200    16341.30 (  0.00%)    26084.03 ( 59.62%)
alltests-1300    16304.75 (  0.00%)    26024.74 ( 59.61%)
alltests-1400    16231.08 (  0.00%)    25952.35 ( 59.89%)
alltests-1500    16168.06 (  0.00%)    25850.58 ( 59.89%)
alltests-1600    16142.56 (  0.00%)    25767.42 ( 59.62%)
alltests-1700    16118.91 (  0.00%)    25689.58 ( 59.38%)
alltests-1800    16068.06 (  0.00%)    25599.71 ( 59.32%)
alltests-1900    16046.94 (  0.00%)    25525.92 ( 59.07%)
alltests-2000    16007.26 (  0.00%)    25513.07 ( 59.38%)

disk-100          7582.14 (  0.00%)     7257.48 ( -4.28%)
disk-200          6962.44 (  0.00%)     7109.15 (  2.11%)
disk-300          6435.93 (  0.00%)     6904.75 (  7.28%)
disk-400          6370.84 (  0.00%)     6861.26 (  7.70%)
disk-500          6353.42 (  0.00%)     6846.71 (  7.76%)
disk-600          6368.82 (  0.00%)     6806.75 (  6.88%)
disk-700          6331.37 (  0.00%)     6796.01 (  7.34%)
disk-800          6324.22 (  0.00%)     6788.00 (  7.33%)
disk-900          6253.52 (  0.00%)     6750.43 (  7.95%)
disk-1000         6242.53 (  0.00%)     6855.11 (  9.81%)
disk-1100         6234.75 (  0.00%)     6858.47 ( 10.00%)
disk-1200         6312.76 (  0.00%)     6845.13 (  8.43%)
disk-1300         6309.95 (  0.00%)     6834.51 (  8.31%)
disk-1400         6171.76 (  0.00%)     6787.09 (  9.97%)
disk-1500         6139.81 (  0.00%)     6761.09 ( 10.12%)
disk-1600         4807.12 (  0.00%)     6725.33 ( 39.90%)
disk-1700         4669.50 (  0.00%)     5985.38 ( 28.18%)
disk-1800         4663.51 (  0.00%)     5972.99 ( 28.08%)
disk-1900         4674.31 (  0.00%)     5949.94 ( 27.29%)
disk-2000         4668.36 (  0.00%)     5834.93 ( 24.99%)

In addition, a 67.5% increase in successfully migrated NUMA pages, thus
improving node locality.

The patch layout is simple but designed for bisection (in case reversion
is needed if the changes break upstream) and easier review:

o Patches 1-4 convert the i_mmap lock from mutex to rwsem.
o Patches 5-10 share the lock in specific paths, each patch
  details the rationale behind why it should be safe.

This patchset has been tested with: postgres 9.4 (with brand new hugetlb
support), hugetlbfs test suite (all tests pass, in fact more tests pass
with these changes than with an upstream kernel), ltp, aim7 benchmarks,
memcached and iozone with the -B option for mmap'ing.  *Untested* paths
are nommu, memory-failure, uprobes and xip.

This patch (of 8):

Various parts of the kernel acquire and release this mutex, so add
i_mmap_lock_write() and immap_unlock_write() helper functions that will
encapsulate this logic.  The next patch will make use of these.
Signed-off-by: NDavidlohr Bueso <dbueso@suse.de>
Reviewed-by: NRik van Riel <riel@redhat.com>
Acked-by: N"Kirill A. Shutemov" <kirill@shutemov.name>
Acked-by: NHugh Dickins <hughd@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Acked-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Acked-by: NMel Gorman <mgorman@suse.de>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

8b28f621

MAINTAINERS: update Xiubo's email address · b4b98297

由 Xiubo Li 提交于 12月 12, 2014

My current email address will be gone shortly, update my email to be a
gmail one.
Signed-off-by: NXiubo Li <Li.Xiubo@freescale.com>
Cc: Timur Tabi <timur@tabi.org>
Cc: Takashi Iwai <tiwai@suse.de>
Acked-by: NNicolin Chen <nicoleotsuka@gmail.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

b4b98297

rtc: snvs: fix build with CONFIG_PM_SLEEP disabled · 88221c32

由 Guenter Roeck 提交于 12月 12, 2014

Commit 7654e9d4 ("drivers/rtc/rtc-snvs: fix suspend/resume")
replaces SIMPLE_DEV_PM_OPS with direct declaration of snvs_rtc_pm_ops,
but does so outside #ifdef CONFIG_PM_SLEEP.  This causes the driver
build to fail if CONFIG_PM_SLEEP is not configured.

Fixes: 7654e9d4 ("drivers/rtc/rtc-snvs: fix suspend/resume")
Signed-off-by: NGuenter Roeck <linux@roeck-us.net>
Cc: Sanchayan Maity <maitysanchayan@gmail.com>
Cc: Alessandro Zummo <a.zummo@towertech.it>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

88221c32

13 12月, 2014 5 次提交

Merge tag 'please-pull-morepstore' of git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux · 6ce4436c

由 Linus Torvalds 提交于 12月 12, 2014

Pull pstore update #2 from Tony Luck:
 "Couple of pstore-ram enhancements to allow use of different memory
  attributes"

* tag 'please-pull-morepstore' of git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux:
  pstore-ram: Allow optional mapping with pgprot_noncached
  pstore-ram: Fix hangs by using write-combine mappings

6ce4436c

Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs · bdeb03ca

由 Linus Torvalds 提交于 12月 12, 2014

Pull btrfs update from Chris Mason:
 "From a feature point of view, most of the code here comes from Miao
  Xie and others at Fujitsu to implement scrubbing and replacing devices
  on raid56.  This has been in development for a while, and it's a big
  improvement.

  Filipe and Josef have a great assortment of fixes, many of which solve
  problems corruptions either after a crash or in error conditions.  I
  still have a round two from Filipe for next week that solves
  corruptions with discard and block group removal"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (62 commits)
  Btrfs: make get_caching_control unconditionally return the ctl
  Btrfs: fix unprotected deletion from pending_chunks list
  Btrfs: fix fs mapping extent map leak
  Btrfs: fix memory leak after block remove + trimming
  Btrfs: make btrfs_abort_transaction consider existence of new block groups
  Btrfs: fix race between writing free space cache and trimming
  Btrfs: fix race between fs trimming and block group remove/allocation
  Btrfs, replace: enable dev-replace for raid56
  Btrfs: fix freeing used extents after removing empty block group
  Btrfs: fix crash caused by block group removal
  Btrfs: fix invalid block group rbtree access after bg is removed
  Btrfs, raid56: fix use-after-free problem in the final device replace procedure on raid56
  Btrfs, replace: write raid56 parity into the replace target device
  Btrfs, replace: write dirty pages into the replace target device
  Btrfs, raid56: support parity scrub on raid56
  Btrfs, raid56: use a variant to record the operation type
  Btrfs, scrub: repair the common data on RAID5/6 if it is corrupted
  Btrfs, raid56: don't change bbio and raid_map
  Btrfs: remove unnecessary code of stripe_index assignment in __btrfs_map_block
  Btrfs: remove noused bbio_ret in __btrfs_map_block in condition
  ...

bdeb03ca

Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/hid · 0349678c

由 Linus Torvalds 提交于 12月 12, 2014

Pull HID updates from Jiri Kosina:
 - i2c-hid race condition fix from Jean-Baptiste Maneyrol
 - Logitech driver now supports vendor-specific HID++ protocol, allowing
   us to deliver a full multitouch support on wider range of Logitech
   touchpads.  Written by Benjamin Tissoires
 - MS Surface Pro 3 Type Cover support added by Alan Wu
 - RMI touchpad support improvements from Andrew Duggan
 - a lot of updates to Wacom driver from Jason Gerecke and Ping Cheng
 - various small fixes all over the place

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/hid: (56 commits)
  HID: rmi: The address of query8 must be calculated based on which query registers are present
  HID: rmi: Check for additional ACM registers appended to F11 data report
  HID: i2c-hid: prevent buffer overflow in early IRQ
  HID: logitech-hidpp: disable io in probe error path
  HID: logitech-hidpp: add boundary check for name retrieval
  HID: logitech-hidpp: check name retrieval return code
  HID: logitech-hidpp: do not return the name length
  HID: wacom: Report input events for each finger on generic devices
  HID: wacom: Initialize MT slots for generic devices at post_parse_hid
  HID: wacom: Update maximum X/Y accounding to outbound offset
  HID: wacom: Add support for DTU-1031X
  HID: wacom: add defines for new Cintiq and DTU outbound tracking
  HID: wacom: fix freeze on open when autosuspend is on
  HID: wacom: re-add accidentally dropped Lenovo PID
  HID: make hid_report_len as a static inline function in hid.h
  HID: wacom: Consult the application usage when determining field type
  HID: wacom: PAD is independent with pen/touch
  HID: multitouch: Add quirk for VTL touch panels
  HID: i2c-hid: fix race condition reading reports
  HID: wacom: Add angular resolution data to some ABS axes
  ...

0349678c

Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial · a7cb7bb6

由 Linus Torvalds 提交于 12月 12, 2014

Pull trivial tree update from Jiri Kosina:
 "Usual stuff: documentation updates, printk() fixes, etc"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (24 commits)
  intel_ips: fix a type in error message
  cpufreq: cpufreq-dt: Move newline to end of error message
  ps3rom: fix error return code
  treewide: fix typo in printk and Kconfig
  ARM: dts: bcm63138: change "interupts" to "interrupts"
  Replace mentions of "list_struct" to "list_head"
  kernel: trace: fix printk message
  scsi: mpt2sas: fix ioctl in comment
  zbud, zswap: change module author email
  clocksource: Fix 'clcoksource' typo in comment
  arm: fix wording of "Crotex" in CONFIG_ARCH_EXYNOS3 help
  gpio: msm-v1: make boolean argument more obvious
  usb: Fix typo in usb-serial-simple.c
  PCI: Fix comment typo 'COMFIG_PM_OPS'
  powerpc: Fix comment typo 'CONIFG_8xx'
  powerpc: Fix comment typos 'CONFiG_ALTIVEC'
  clk: st: Spelling s/stucture/structure/
  isci: Spelling s/stucture/structure/
  usb: gadget: zero: Spelling s/infrastucture/infrastructure/
  treewide: Fix company name in module descriptions
  ...

a7cb7bb6

Merge tag 'upstream-3.19-rc1' of git://git.infradead.org/linux-ubifs · ccb5a491

由 Linus Torvalds 提交于 12月 12, 2014

Pull UBI/UBIFS updates from Artem Bityutskiy:
 "This includes the following UBI/UBIFS changes:
   - UBI debug messages now include the UBI device number.  This change
     is responsible for the big diffstat since it touched every
     debugging print statement.
   - An Xattr bug-fix which fixes SELinux support
   - Several error path fixes in UBI/UBIFS"

* tag 'upstream-3.19-rc1' of git://git.infradead.org/linux-ubifs:
  UBI: Fix invalid vfree()
  UBI: Fix double free after do_sync_erase()
  UBIFS: fix a couple bugs in UBIFS xattr length calculation
  UBI: vtbl: Use ubi_eba_atomic_leb_change()
  UBI: Extend UBI layer debug/messaging capabilities
  UBIFS: fix budget leak in error path

ccb5a491

openanolis / cloud-kernel 接近 2 年 前同步成功

openanolis / cloud-kernel
接近 2 年前同步成功