1. 24 January 2014, 40 commits
    • add generic fixmap.h · d57c33c5
      Committed by Mark Salter
      Many architectures provide an asm/fixmap.h which defines support for
      compile-time 'special' virtual mappings which need to be made before
      paging_init() has run.  This support is also used for early ioremap on
      x86.  Much of this support is identical across the architectures.  This
      patch consolidates all of the common bits into asm-generic/fixmap.h
      which is intended to be included from arch/*/include/asm/fixmap.h.
      Signed-off-by: NMark Salter <msalter@redhat.com>
      Acked-by: NArnd Bergmann <arnd@arndb.de>
      Acked-by: NRalf Baechle <ralf@linux-mips.org>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: Richard Kuo <rkuo@codeaurora.org>
      Cc: James Hogan <james.hogan@imgtec.com>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jeff Dike <jdike@addtoit.com>
      Cc: Paul Mundt <lethal@linux-sh.org>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Jonas Bonn <jonas.bonn@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d57c33c5
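      A minimal sketch of what an architecture header gains from this: it defines its own
      FIXADDR_TOP and fixed_addresses enum, then pulls in the shared helpers.  The constant
      and the enum slot below are illustrative assumptions, not taken from any real architecture.

        /* arch/<arch>/include/asm/fixmap.h -- illustrative only */
        #ifndef _ASM_FIXMAP_H
        #define _ASM_FIXMAP_H

        #define FIXADDR_TOP	0xfffff000UL	/* assumed arch-specific constant */

        enum fixed_addresses {
        	FIX_EARLYCON_BASE,		/* hypothetical example slot */
        	__end_of_fixed_addresses
        };

        #define FIXADDR_SIZE	(__end_of_fixed_addresses << PAGE_SHIFT)
        #define FIXADDR_START	(FIXADDR_TOP - FIXADDR_SIZE)

        /* the common fix_to_virt()/virt_to_fix() helpers now come from here */
        #include <asm-generic/fixmap.h>

        #endif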
    • logfs: check for the return value after calling find_or_create_page() · 2252b62a
      Committed by Younger Liu
      In get_mapping_page(), after calling find_or_create_page(), the return
      value should be checked.
      
       This patch was previously posted at
      http://www.spinics.net/lists/linux-fsdevel/msg66948.html but has not
      been applied until now.
      Signed-off-by: NYounger Liu <liuyiyang@hisense.com>
      Cc: Younger Liu <younger.liucn@gmail.com>
      Cc: Vyacheslav Dubeyko <slava@dubeyko.com>
      Reviewed-by: NPrasad Joshi <prasadjoshi.linux@gmail.com>
      Cc: Jörn Engel <joern@logfs.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2252b62a
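      The shape of the described fix is simply a NULL check on the lookup; a hedged
      sketch of the pattern, not the exact logfs code:

        page = find_or_create_page(mapping, index, GFP_NOFS);
        if (!page)			/* previously assumed to always succeed */
        	return NULL;		/* propagate the allocation failure */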
    • drivers/block/Kconfig: update RAM block device module name · a3b25d9b
      Committed by Fabian Frederick
      The RAM block device support module was renamed to brd.ko some years
      ago, with an "rd" alias to match the previous module name.  This patch
      updates its Kconfig definition accordingly.
      Signed-off-by: NFabian Frederick <fabf@skynet.be>
      Acked-by: NRandy Dunlap <rdunlap@infradead.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a3b25d9b
    • drivers/mailbox/omap: make mbox->irq signed for error handling · 4a102b4d
      Committed by Dan Carpenter
      There is a bug in omap2_mbox_probe() where we try to do:
      
      		mbox->irq = platform_get_irq(pdev, info->irq_id);
      		if (mbox->irq < 0) {
      
      The problem is that mbox->irq is unsigned so the error handling doesn't
      work.  I've changed it to a signed integer.
      Signed-off-by: NDan Carpenter <dan.carpenter@oracle.com>
      Cc: Suman Anna <s-anna@ti.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Omar Ramirez Luna <omar.ramirez@copitl.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4a102b4d
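      The failure mode is easy to reproduce outside the kernel: a negative error code
      stored in an unsigned field can never test as less than zero.  A small standalone
      sketch of the bug and the fix:

        #include <stdio.h>

        struct mbox {
        	unsigned int irq_unsigned;	/* buggy: error codes are lost */
        	int irq_signed;			/* fixed: negative values survive */
        };

        int main(void)
        {
        	struct mbox m;

        	m.irq_unsigned = -22;		/* e.g. -EINVAL from platform_get_irq() */
        	m.irq_signed = -22;

        	if (m.irq_unsigned < 0)		/* never true: unsigned comparison */
        		printf("unsigned: error detected\n");
        	if (m.irq_signed < 0)		/* works as intended */
        		printf("signed: error detected\n");
        	return 0;
        }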
    • asm/types.h: Remove include/asm-generic/int-l64.h · 0c79a8e2
      Committed by Geert Uytterhoeven
      Now all 64-bit architectures have been converted to int-ll64.h, we can
      remove int-l64.h in kernelspace.
      
      For backwards compatibility, alpha, ia64, mips64, and powerpc64 still
      use int-l64.h in userspace.
      
      This is the (reworked for UAPI) non-documentation part of the more than
      two-year-old "asm/types.h: All architectures use int-ll64.h in
      kernelspace" (https://lkml.org/lkml/2011/8/13/104).
      
      Since <asm/types.h> (from include/uapi/asm-generic/types.h) is used for
      both kernel and user space, include/asm-generic/int-ll64.h cannot just
      become include/asm-generic/types.h, as Arnd suggested.
      Signed-off-by: NGeert Uytterhoeven <geert@linux-m68k.org>
      Acked-by: NArnd Bergmann <arnd@arndb.de>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Randy Dunlap <rdunlap@xenotime.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0c79a8e2
    • mm: ignore VM_SOFTDIRTY on VMA merging · 34228d47
      Committed by Cyrill Gorcunov
      The VM_SOFTDIRTY bit affects the vma merge routine: if two VMAs have all
      bits in vm_flags matched except the dirty bit, the kernel can no longer
      merge them and is forced to generate new VMAs instead.
      
      This may eventually lead to a situation where a userspace application
      reaches the vm.max_map_count limit and, in the worst case, crashes:
      
       | (gimp:11768): GLib-ERROR **: gmem.c:110: failed to allocate 4096 bytes
       |
       | (file-tiff-load:12038): LibGimpBase-WARNING **: file-tiff-load: gimp_wire_read(): error
       | xinit: connection to X server lost
       |
       | waiting for X server to shut down
       | /usr/lib64/gimp/2.0/plug-ins/file-tiff-load terminated: Hangup
       | /usr/lib64/gimp/2.0/plug-ins/script-fu terminated: Hangup
       | /usr/lib64/gimp/2.0/plug-ins/script-fu terminated: Hangup
      
        https://bugzilla.kernel.org/show_bug.cgi?id=67651
        https://bugzilla.gnome.org/show_bug.cgi?id=719619#c0
      
      The initial problem came from a missing VM_SOFTDIRTY in the do_brk()
      routine, but even if we set VM_SOFTDIRTY there, there is still a way to
      prevent VMAs from merging: one can call
      
       | echo 4 > /proc/$PID/clear_refs
      
      which clears VM_SOFTDIRTY on every VMA in the memory map; a subsequent
      do_brk() will then try to extend the old VMA, find that the dirty bit
      doesn't match, and generate a new VMA instead.
      
      As discussed with Pavel, the right approach is to ignore the
      VM_SOFTDIRTY bit when trying to merge VMAs and, if the merge succeeds,
      mark the extended VMA with the dirty bit where needed.
      Signed-off-by: NCyrill Gorcunov <gorcunov@openvz.org>
      Reported-by: NBastian Hougaard <gnome@rvzt.net>
      Reported-by: NMel Gorman <mgorman@suse.de>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      34228d47
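      One way to express the described change is to mask VM_SOFTDIRTY out of the flag
      comparison in the mergeability check; a hedged sketch (the helper name here is made
      up, the real change lives in mm/mmap.c's merge path):

        static inline bool flags_mergeable(unsigned long a, unsigned long b)
        {
        	/*
        	 * Soft-dirty tracking must not prevent merging: the merged VMA
        	 * is simply treated as dirty wherever either half was.
        	 */
        	return (a & ~VM_SOFTDIRTY) == (b & ~VM_SOFTDIRTY);
        }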
    • mm/rmap: fix coccinelle warnings · 871beb8c
      Committed by Fengguang Wu
      mm/rmap.c:851:9-10: WARNING: return of 0/1 in function 'invalid_mkclean_vma' with return type bool
      
       Return statements in functions returning bool should use
       true/false instead of 1/0.
      
      Generated by: coccinelle/misc/boolreturn.cocci
      Signed-off-by: NFengguang Wu <fengguang.wu@intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      871beb8c
    • mm/swapfile.c: do not skip lowest_bit in scan_swap_map() scan loop · a5998061
      Committed by Jamie Liu
      In the second half of scan_swap_map()'s scan loop, offset is set to
      si->lowest_bit and then incremented before entering the loop for the
      first time, causing si->swap_map[si->lowest_bit] to be skipped.
      Signed-off-by: NJamie Liu <jamieliu@google.com>
      Cc: Shaohua Li <shli@fusionio.com>
      Acked-by: NHugh Dickins <hughd@google.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Akinobu Mita <akinobu.mita@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a5998061
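      The off-by-one is easier to see in isolation: resetting the cursor to lowest_bit
      and then incrementing it before the first check means the lowest slot is never
      examined.  A standalone sketch (an array stands in for si->swap_map; names are
      illustrative):

        #include <stdio.h>

        int main(void)
        {
        	unsigned char swap_map[8] = { 0 };	/* 0 == free slot */
        	unsigned int lowest_bit = 2, highest_bit = 5, offset;

        	/* Buggy shape: the increment happens before the first check. */
        	offset = lowest_bit;
        	while (++offset <= highest_bit)
        		if (!swap_map[offset]) {
        			printf("buggy scan picks slot %u, slot %u skipped\n",
        			       offset, lowest_bit);
        			break;
        		}

        	/* Fixed shape: examine lowest_bit first, advance afterwards. */
        	for (offset = lowest_bit; offset <= highest_bit; offset++)
        		if (!swap_map[offset]) {
        			printf("fixed scan picks slot %u\n", offset);
        			break;
        		}
        	return 0;
        }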
    • memcg: remove unused code from kmem_cache_destroy_work_func · 0d8a4a37
      Committed by Vladimir Davydov
      Signed-off-by: NVladimir Davydov <vdavydov@parallels.com>
      Reviewed-by: NMichal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0d8a4a37
    • mm: improve documentation of page_order · 6c14466c
      Committed by Mel Gorman
      Developers occasionally try to optimise PFN scanners by using
      page_order but miss that, in general, it requires zone->lock.  This has
      happened twice for compaction.c and been rejected both times.  This
      patch clarifies the documentation of page_order and adds a note to
      compaction.c explaining why page_order is not used there.
      
      [akpm@linux-foundation.org: tweaks]
      [lauraa@codeaurora.org: Corrected a page_zone(page)->lock reference]
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Acked-by: NRafael Aquini <aquini@redhat.com>
      Acked-by: NMinchan Kim <minchan@kernel.org>
      Cc: Laura Abbott <lauraa@codeaurora.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6c14466c
    • memcg: fix css reference leak and endless loop in mem_cgroup_iter · 0eef6156
      Committed by Michal Hocko
      Commit 19f39402 ("memcg: simplify mem_cgroup_iter") has reorganized
      mem_cgroup_iter code in order to simplify it.  A part of that change was
      dropping an optimization which didn't call css_tryget on the root of the
      walked tree.  The patch however didn't change the css_put part in
      mem_cgroup_iter which excludes root.
      
      This wasn't an issue at the time because __mem_cgroup_iter_next bailed
      out for root early without taking a reference as cgroup iterators
      (css_next_descendant_pre) didn't visit root themselves.
      
      Nevertheless, cgroup iterators were reworked to visit root by commit
      bd8815a6 ("cgroup: make css_for_each_descendant() and friends
      include the origin css in the iteration"), when the root bypass was
      dropped in __mem_cgroup_iter_next.  This means that css_put is not
      called for root, and so the css, along with the mem_cgroup and other
      cgroup-internal objects tied to the css lifetime, is never freed.
      
      Fix the issue by reintroducing the root check in __mem_cgroup_iter_next
      and not taking a css reference for it.
      
      This reference-counting magic also protects us from another issue: an
      endless loop, reported by Hugh Dickins, when reclaim races with root
      removal and the css_tryget called internally by the iterator fails.
      There would be no other nodes to visit, so __mem_cgroup_iter_next would
      return NULL and mem_cgroup_iter would interpret it as "start looping
      from root again", looping forever internally.
      Signed-off-by: NMichal Hocko <mhocko@suse.cz>
      Reported-by: NHugh Dickins <hughd@google.com>
      Tested-by: NHugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: <stable@vger.kernel.org>	[3.12+]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0eef6156
    • memcg: fix endless loop caused by mem_cgroup_iter · ecc736fc
      Committed by Michal Hocko
      Hugh has reported an endless loop when the hardlimit reclaim sees the
      same group all the time.  This might happen when the reclaim races with
      the memcg removal.
      
      shrink_zone
                                                      [rmdir root]
        mem_cgroup_iter(root, NULL, reclaim)
          // prev = NULL
          rcu_read_lock()
          mem_cgroup_iter_load
            last_visited = iter->last_visited   // gets root || NULL
            css_tryget(last_visited)            // failed
            last_visited = NULL                 [1]
          memcg = root = __mem_cgroup_iter_next(root, NULL)
          mem_cgroup_iter_update
            iter->last_visited = root;
          reclaim->generation = iter->generation
      
       mem_cgroup_iter(root, root, reclaim)
         // prev = root
         rcu_read_lock
          mem_cgroup_iter_load
            last_visited = iter->last_visited   // gets root
            css_tryget(last_visited)            // failed
          [1]
      
      The issue seemed to be introduced by commit 5f578161 ("memcg: relax
      memcg iter caching") which has replaced unconditional css_get/css_put by
      css_tryget/css_put for the cached iterator.
      
      This patch fixes the issue by skipping css_tryget on the root of the
      tree walk in mem_cgroup_iter_load and symmetrically doesn't release it
      in mem_cgroup_iter_update.
      Signed-off-by: NMichal Hocko <mhocko@suse.cz>
      Reported-by: NHugh Dickins <hughd@google.com>
      Tested-by: NHugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: <stable@vger.kernel.org>	[3.10+]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ecc736fc
    • mm, oom: prefer thread group leaders for display purposes · d49ad935
      Committed by David Rientjes
      When two threads have the same badness score, it's preferable to kill
      the thread group leader so that the actual process name is printed to
      the kernel log rather than the thread group name which may be shared
      amongst several processes.
      
      This was the behavior when select_bad_process() used to do
      for_each_process(), but it now iterates threads instead and leads to
      ambiguity.
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Greg Thelen <gthelen@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d49ad935
    • doc/kmemcheck: add kmemcheck to kernel-parameters · c3ac14b2
      Committed by Xishi Qiu
      Add "kmemcheck=xx" to Documentation/kernel-parameters.txt.
      Signed-off-by: NXishi Qiu <qiuxishi@huawei.com>
      Cc: Vegard Nossum <vegardno@ifi.uio.no>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Rob Landley <rob@landley.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c3ac14b2
    • mm/memcg: iteration skip memcgs not yet fully initialized · d8ad3055
      Committed by Hugh Dickins
      It is surprising that the mem_cgroup iterator can return memcgs which
      have not yet been fully initialized.  By accident (or trial and error?)
      this appears not to present an actual problem; but it may be better to
      prevent such surprises, by skipping memcgs not yet online.
      Signed-off-by: NHugh Dickins <hughd@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: <stable@vger.kernel.org>	[3.12+]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d8ad3055
    • mm/memcg: fix last_dead_count memory wastage · d2ab70aa
      Committed by Hugh Dickins
      Shorten mem_cgroup_reclaim_iter.last_dead_count from unsigned long to
      int: it's assigned from an int and compared with an int, and adjacent to
      an unsigned int: so there's no point to it being unsigned long, which
      wasted 104 bytes in every mem_cgroup_per_zone.
      Signed-off-by: NHugh Dickins <hughd@google.com>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d2ab70aa
    • mm: audit/fix non-modular users of module_init in core code · a64fb3cd
      Committed by Paul Gortmaker
      Code that is obj-y (always built-in) or dependent on a bool Kconfig
      (built-in or absent) can never be modular.  So using module_init as an
      alias for __initcall can be somewhat misleading.
      
      Fix these up now, so that we can relocate module_init from init.h into
      module.h in the future.  If we don't do this, we'd have to add module.h
      to obviously non-modular code, and that would be a worse thing.
      
      The audit targets the following module_init users for change:
       mm/ksm.c                       bool KSM
       mm/mmap.c                      bool MMU
       mm/huge_memory.c               bool TRANSPARENT_HUGEPAGE
       mm/mmu_notifier.c              bool MMU_NOTIFIER
      
      Note that direct use of __initcall is discouraged in favour of one of
      the priority-categorized subgroups.  As __initcall gets mapped onto
      device_initcall, our use of subsys_initcall (which makes sense for these
      files) will thus change this registration from level 6 (device) to
      level 4 (subsys), i.e. slightly earlier.
      
      However no observable impact of that difference has been observed during
      testing.
      
      One might think that core_initcall (l2) or postcore_initcall (l3) would
      be more appropriate for anything in mm/ but if we look at some actual
      init functions themselves, we see things like:
      
      mm/huge_memory.c --> hugepage_init     --> hugepage_init_sysfs
      mm/mmap.c        --> init_user_reserve --> sysctl_user_reserve_kbytes
      mm/ksm.c         --> ksm_init          --> sysfs_create_group
      
      and hence the choice of subsys_initcall (l4) seems reasonable, and at
      the same time minimizes the risk of changing the priority too
      drastically all at once.  We can adjust further in the future.
      
      Also, several instances of missing ";" at EOL are fixed.
      Signed-off-by: NPaul Gortmaker <paul.gortmaker@windriver.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a64fb3cd
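      The mechanical part of the change, using mm/ksm.c (named in the list above) as the
      example; the exact before/after lines are an assumption:

        /* Before: reads as if KSM could be a module, which it cannot. */
        module_init(ksm_init)

        /* After: explicitly registered in the subsys (level 4) group. */
        subsys_initcall(ksm_init);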
    • mm/mm_init.c: make creation of the mm_kobj happen earlier than device_initcall · da29bd36
      Committed by Paul Gortmaker
      The use of __initcall is to be eventually replaced by choosing one from
      the prioritized groupings laid out in init.h header:
      
      	pure_initcall               0
      	core_initcall               1
      	postcore_initcall           2
      	arch_initcall               3
      	subsys_initcall             4
      	fs_initcall                 5
      	device_initcall             6
      	late_initcall               7
      
      In the interim, all __initcall are mapped onto device_initcall, which as
      can be seen above, comes quite late in the ordering.
      
      Currently the mm_kobj is created with __initcall in mm_sysfs_init().
      This means that any other initcalls that want to reference the mm_kobj
      have to be device_initcall (or later), otherwise we will for example,
      trip the BUG_ON(!kobj) in sysfs's internal_create_group().  This
      unfairly restricts those users; for example something that clearly makes
      sense to be an arch_initcall will not be able to choose that.
      
      However, upon examination, it is only this way for historical reasons
      (i.e.  simply not reprioritized yet).  We see that sysfs is ready quite
      earlier in init/main.c via:
      
       vfs_caches_init
       |_ mnt_init
          |_ sysfs_init
      
      well ahead of the processing of the prioritized calls listed above.
      
      So we can recategorize mm_sysfs_init to be a pure_initcall, which in
      turn allows any mm_kobj initcall users a wider range (1 --> 7) of
      initcall priorities to choose from.
      Signed-off-by: NPaul Gortmaker <paul.gortmaker@windriver.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      da29bd36
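      A sketch of the reprioritization; mm_sysfs_init() and mm_kobj are the symbols named
      above, while the function body shown is an assumption of roughly what it does:

        static int __init mm_sysfs_init(void)
        {
        	mm_kobj = kobject_create_and_add("mm", kernel_kobj);
        	if (!mm_kobj)
        		return -ENOMEM;
        	return 0;
        }

        /* was __initcall(mm_sysfs_init), i.e. level 6 (device_initcall) */
        pure_initcall(mm_sysfs_init);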
    • mm: show message when updating min_free_kbytes in thp · 42aa83cb
      Committed by Han Pingtian
      min_free_kbytes may be raised during THP's initialization, which can
      silently change a value that was set by the user.  Printing a message
      in that case clarifies the confusion.
      
      Per Michal Hocko's suggestion, only show this message when changing a
      value which was set by the user.
      
      Per Dave Hansen's suggestion, show the old value of min_free_kbytes,
      giving the user the chance to restore it.
      Signed-off-by: NHan Pingtian <hanpt@linux.vnet.ibm.com>
      Reviewed-by: NMichal Hocko <mhocko@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      42aa83cb
    • mm/memory_hotplug.c: move register_memory_resource out of the lock_memory_hotplug · ac13c462
      Committed by Nathan Zimmer
      We don't need to do register_memory_resource() under
      lock_memory_hotplug() since it has its own lock and doesn't make any
      callbacks.
      
      Also register_memory_resource return NULL on failure so we don't have
      anything to cleanup at this point.
      
      The reason for this RFC is that I was doing some experiments with
      hotplugging of memory on some of our larger systems.  While it seems to
      work, it can be quite slow.  With some preliminary digging I found that
      lock_memory_hotplug is clearly ripe for breakup.
      
      It could be broken up per nid or something but it also covers the
      online_page_callback.  The online_page_callback shouldn't be very hard
      to break out.
      
      There is also the issue of various structures (wmarks come to mind)
      that are only updated under lock_memory_hotplug and would need to be
      dealt with.
      
      Cc: Tang Chen <tangchen@cn.fujitsu.com>
      Cc: Wen Congyang <wency@cn.fujitsu.com>
      Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reviewed-by: NYasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com>
      Cc: Hedi <hedi@sgi.com>
      Cc: Mike Travis <travis@sgi.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ac13c462
    • mm/nobootmem: free_all_bootmem again · 354f17e1
      Committed by Philipp Hachtmann
      get_allocated_memblock_reserved_regions_info() should work if it is
      compiled in.  Extended the ifdef around
      get_allocated_memblock_memory_regions_info() to include
      get_allocated_memblock_reserved_regions_info() as well.  Similar changes
      in nobootmem.c/free_low_memory_core_early() where the two functions are
      called.
      
      [akpm@linux-foundation.org: cleanup]
      Signed-off-by: NPhilipp Hachtmann <phacht@linux.vnet.ibm.com>
      Cc: qiuxishi <qiuxishi@huawei.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Daeseok Youn <daeseok.youn@gmail.com>
      Cc: Jiang Liu <liuj97@gmail.com>
      Acked-by: NYinghai Lu <yinghai@kernel.org>
      Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Cc: Santosh Shilimkar <santosh.shilimkar@ti.com>
      Cc: Grygorii Strashko <grygorii.strashko@ti.com>
      Cc: Tang Chen <tangchen@cn.fujitsu.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      354f17e1
    • mm: vmscan: call NUMA-unaware shrinkers irrespective of nodemask · ec97097b
      Committed by Vladimir Davydov
      If a shrinker is not NUMA-aware, shrink_slab() should call it exactly
      once with nid=0, but currently it is not true: if node 0 is not set in
      the nodemask or if it is not online, we will not call such shrinkers at
      all.  As a result some slabs will be left untouched under some
      circumstances.  Let us fix it.
      Signed-off-by: NVladimir Davydov <vdavydov@parallels.com>
      Reported-by: NDave Chinner <dchinner@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Glauber Costa <glommer@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ec97097b
    • mm: vmscan: shrink all slab objects if tight on memory · 0b1fb40a
      Committed by Vladimir Davydov
      When reclaiming kmem, we currently don't scan slabs that have less than
      batch_size objects (see shrink_slab_node()):
      
              while (total_scan >= batch_size) {
                      shrinkctl->nr_to_scan = batch_size;
                      shrinker->scan_objects(shrinker, shrinkctl);
                      total_scan -= batch_size;
              }
      
      If there are only a few shrinkers available, such a behavior won't cause
      any problems, because the batch_size is usually small, but if we have a
      lot of slab shrinkers, which is perfectly possible since FS shrinkers
      are now per-superblock, we can end up with hundreds of megabytes of
      practically unreclaimable kmem objects.  For instance, mounting a
      thousand ext2 FS images with a hundred files in each and iterating
      over all the files using du(1) will result in about 200 MB of FS caches
      that cannot be dropped even with the aid of the vm.drop_caches sysctl!
      
      This problem was initially pointed out by Glauber Costa [*].  Glauber
      proposed to fix it by making the shrink_slab() always take at least one
      pass, to put it simply, turning the scan loop above to a do{}while()
      loop.  However, this proposal was rejected, because it could result in
      more aggressive and frequent slab shrinking even under low memory
      pressure when total_scan is naturally very small.
      
      This patch is a slightly modified version of Glauber's approach.
      Similarly to Glauber's patch, it makes shrink_slab() scan less than
      batch_size objects, but only if the total number of objects we want to
      scan (total_scan) is greater than the total number of objects available
      (max_pass).  Since total_scan is biased as half max_pass if the current
      delta change is small:
      
              if (delta < max_pass / 4)
                      total_scan = min(total_scan, max_pass / 2);
      
      this is only possible if we are scanning at high prio.  That said, this
      patch shouldn't change the vmscan behaviour if the memory pressure is
      low, but if we are tight on memory, we will do our best by trying to
      reclaim all available objects, which sounds reasonable.
      
      [*] http://www.spinics.net/lists/cgroups/msg06913.html
      Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Dave Chinner <dchinner@redhat.com>
      Cc: Glauber Costa <glommer@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0b1fb40a
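      Against the loop quoted above, the described change amounts to also entering the
      loop once total_scan reaches max_pass and scanning min(batch_size, total_scan)
      objects per pass; a hedged paraphrase, not the literal diff:

        while (total_scan >= batch_size ||
               total_scan >= max_pass) {
        	unsigned long nr_to_scan = min(batch_size, total_scan);

        	shrinkctl->nr_to_scan = nr_to_scan;
        	shrinker->scan_objects(shrinker, shrinkctl);
        	total_scan -= nr_to_scan;
        }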
    • sched/numa: fix setting of cpupid on page migration twice · baae911b
      Committed by Wanpeng Li
      Commit 7851a45c ("mm: numa: Copy cpupid on page migration") copies
      over the cpupid at page migration time.  It is unnecessary to set it
      again in migrate_misplaced_transhuge_page().
      Signed-off-by: NWanpeng Li <liwanp@linux.vnet.ibm.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      baae911b
    • mm: do_mincore() cleanup · c980e66a
      Committed by Jianguo Wu
      Two cleanups:
      1. Remove redundant code for hugetlb pages.
      2. end = pmd_addr_end(addr, end) restricts [addr, end) to within
         PMD_SIZE, which may increase the number of do_mincore() calls;
         remove it.
      Signed-off-by: NJianguo Wu <wujianguo@huawei.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: qiuxishi <qiuxishi@huawei.com>
      Reviewed-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c980e66a
    • include/linux/genalloc.h: spinlock_t needs spinlock_types.h · b30afea0
      Committed by Shawn Guo
      When compiling a C file that includes genalloc.h without
      spinlock_types.h having been included first, we see the compile error
      below.
      
       include/linux/genalloc.h:54:2: error: unknown type name `spinlock_t'
      
      Include spinlock_types.h from genalloc.h to fix the problem.
      Signed-off-by: NShawn Guo <shawn.guo@linaro.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b30afea0
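      The fix itself is a one-line include.  Using spinlock_types.h rather than
      spinlock.h keeps the header dependency minimal, since only the spinlock_t type
      definition is needed; a sketch:

        /* include/linux/genalloc.h */
        #include <linux/spinlock_types.h>	/* provides spinlock_t */

        struct gen_pool {
        	spinlock_t lock;
        	...
        };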
    • Documentation/trace/postprocess/trace-vmscan-postprocess.pl: fix the traceevent regex · bd727816
      Committed by Vinayak Menon
      When irq, preempt and lockdep fields are printed (field 3 in the example
      below) in the trace output, the script fails.
      
      An example entry:
        kswapd0-610   [000] ...1   158.112152: mm_vmscan_kswapd_wake: nid=0 order=0
      Signed-off-by: NVinayak Menon <vinayakm.list@gmail.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      bd727816
    • mm: prevent setting of a value less than 0 to min_free_kbytes · da8c757b
      Committed by Han Pingtian
      If we echo -1 > /proc/sys/vm/min_free_kbytes, the system will hang.
      Changing proc_dointvec() to proc_dointvec_minmax() in
      min_free_kbytes_sysctl_handler() prevents this from happening.
      
      mhocko said:
      
      : You can still do echo $BIG_VALUE > /proc/sys/vm/min_free_kbytes and make
      : your machine unusable but I agree that proc_dointvec_minmax is more
      : suitable here as we already have:
      :
      : 	.proc_handler   = min_free_kbytes_sysctl_handler,
      : 	.extra1         = &zero,
      :
      : It used to work properly but then 6fce56ec ("sysctl: Remove references
      : to ctl_name and strategy from the generic sysctl table") has removed
      : sysctl_intvec strategy and so extra1 is ignored.
      Signed-off-by: NHan Pingtian <hanpt@linux.vnet.ibm.com>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Acked-by: NDavid Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      da8c757b
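      Roughly what the described change to the handler looks like; this is a hedged
      sketch with the surrounding logic abbreviated:

        int min_free_kbytes_sysctl_handler(struct ctl_table *table, int write,
        	void __user *buffer, size_t *length, loff_t *ppos)
        {
        	int rc;

        	/* was proc_dointvec(): negative values were silently accepted */
        	rc = proc_dointvec_minmax(table, write, buffer, length, ppos);
        	if (rc)
        		return rc;

        	if (write)
        		setup_per_zone_wmarks();
        	return 0;
        }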
    • mm: new_vma_page() cannot see NULL vma for hugetlb pages · cc81717e
      Committed by Michal Hocko
      Commit 11c731e8 ("mm/mempolicy: fix !vma in new_vma_page()") has
      removed BUG_ON(!vma) from new_vma_page which is partially correct
      because page_address_in_vma will return EFAULT for non-linear mappings
      and at least shared shmem might be mapped this way.
      
      The patch also tried to prevent NULL ptr for hugetlb pages which is not
      correct AFAICS because hugetlb pages cannot be mapped as VM_NONLINEAR
      and other conditions in page_address_in_vma seem to be legit and catch
      real bugs.
      
      This patch restores BUG_ON for PageHuge to catch potential issues when
      the to-be-migrated page is not setup properly.
      Signed-off-by: NMichal Hocko <mhocko@suse.cz>
      Reviewed-by: NBob Liu <bob.liu@oracle.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      cc81717e
    • mm/memory-failure.c: shift page lock from head page to tail page after thp split · 54b9dd14
      Committed by Naoya Horiguchi
      After thp split in hwpoison_user_mappings(), we hold page lock on the
      raw error page only between try_to_unmap, hence we are in danger of a
      race condition.
      
      I found in the RHEL7 MCE-relay testing that we have "bad page" error
      when a memory error happens on a thp tail page used by qemu-kvm:
      
        Triggering MCE exception on CPU 10
        mce: [Hardware Error]: Machine check events logged
        MCE exception done on CPU 10
        MCE 0x38c535: Killing qemu-kvm:8418 due to hardware memory corruption
        MCE 0x38c535: dirty LRU page recovery: Recovered
        qemu-kvm[8418]: segfault at 20 ip 00007ffb0f0f229a sp 00007fffd6bc5240 error 4 in qemu-kvm[7ffb0ef14000+420000]
        BUG: Bad page state in process qemu-kvm  pfn:38c400
        page:ffffea000e310000 count:0 mapcount:0 mapping:          (null) index:0x7ffae3c00
        page flags: 0x2fffff0008001d(locked|referenced|uptodate|dirty|swapbacked)
        Modules linked in: hwpoison_inject mce_inject vhost_net macvtap macvlan ...
        CPU: 0 PID: 8418 Comm: qemu-kvm Tainted: G   M        --------------   3.10.0-54.0.1.el7.mce_test_fixed.x86_64 #1
        Hardware name: NEC NEC Express5800/R120b-1 [N8100-1719F]/MS-91E7-001, BIOS 4.6.3C19 02/10/2011
        Call Trace:
          dump_stack+0x19/0x1b
          bad_page.part.59+0xcf/0xe8
          free_pages_prepare+0x148/0x160
          free_hot_cold_page+0x31/0x140
          free_hot_cold_page_list+0x46/0xa0
          release_pages+0x1c1/0x200
          free_pages_and_swap_cache+0xad/0xd0
          tlb_flush_mmu.part.46+0x4c/0x90
          tlb_finish_mmu+0x55/0x60
          exit_mmap+0xcb/0x170
          mmput+0x67/0xf0
          vhost_dev_cleanup+0x231/0x260 [vhost_net]
          vhost_net_release+0x3f/0x90 [vhost_net]
          __fput+0xe9/0x270
          ____fput+0xe/0x10
          task_work_run+0xc4/0xe0
          do_exit+0x2bb/0xa40
          do_group_exit+0x3f/0xa0
          get_signal_to_deliver+0x1d0/0x6e0
          do_signal+0x48/0x5e0
          do_notify_resume+0x71/0xc0
          retint_signal+0x48/0x8c
      
      The reason for this bug is that a page fault happens before unlocking the
      head page at the end of memory_failure().  This strange page fault is
      trying to access address 0x20 and I'm not sure why qemu-kvm does
      this, but anyway, as a result, the SIGSEGV makes qemu-kvm exit and on the
      way we catch the bad-page bug/warning because we try to free a locked
      page (which was the former head page).
      
      To fix this, this patch shifts the page lock from the head page to the
      tail page just after the thp split.  The SIGSEGV still happens, but it
      affects only the error-affected VMs, not the whole system.
      Signed-off-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Cc: <stable@vger.kernel.org>        [3.9+] # a3e0f9e4 "mm/memory-failure.c: transfer page count from head page to tail page after split thp"
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      54b9dd14
    • numa: add a sysctl for numa_balancing · 54a43d54
      Committed by Andi Kleen
      Add a working sysctl to enable/disable automatic numa memory balancing
      at runtime.
      
      This allows us to track down performance problems with this feature and
      is generally a good idea.
      
      This was possible earlier through debugfs, but only with special
      debugging options set.  Also fix the boot message.
      
      [akpm@linux-foundation.org: s/sched_numa_balancing/sysctl_numa_balancing/]
      Signed-off-by: NAndi Kleen <ak@linux.intel.com>
      Acked-by: NMel Gorman <mgorman@suse.de>
      Cc: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      54a43d54
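      A rough sketch of the kind of sysctl table entry this adds; the handler name follows
      the akpm note above, and the field values are assumptions:

        #ifdef CONFIG_NUMA_BALANCING
        	{
        		.procname	= "numa_balancing",
        		.data		= NULL,	/* handler queries/sets the real state */
        		.maxlen		= sizeof(unsigned int),
        		.mode		= 0644,
        		.proc_handler	= sysctl_numa_balancing,
        		.extra1		= &zero,
        		.extra2		= &one,
        	},
        #endif /* CONFIG_NUMA_BALANCING */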
    • mm: free memblock.memory in free_all_bootmem · 5e270e25
      Committed by Philipp Hachtmann
      When calling free_all_bootmem() the free areas under memblock's control
      are released to the buddy allocator.  Additionally the reserved list is
      freed if it was reallocated by memblock.  The same should apply for the
      memory list.
      Signed-off-by: NPhilipp Hachtmann <phacht@linux.vnet.ibm.com>
      Reviewed-by: NTejun Heo <tj@kernel.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Tang Chen <tangchen@cn.fujitsu.com>
      Cc: Toshi Kani <toshi.kani@hp.com>
      Cc: Jianguo Wu <wujianguo@huawei.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5e270e25
    • mm/nobootmem.c: add return value check in __alloc_memory_core_early() · 87379ec8
      Committed by Philipp Hachtmann
      When memblock_reserve() fails because memblock.reserved.regions cannot
      be resized, the caller (e.g.  alloc_bootmem()) is not informed of the
      failed allocation.  Therefore alloc_bootmem() silently returns the same
      pointer again and again.
      
      This patch adds a check for the return value of memblock_reserve() in
      __alloc_memory_core_early().
      Signed-off-by: NPhilipp Hachtmann <phacht@linux.vnet.ibm.com>
      Reviewed-by: NTejun Heo <tj@kernel.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Tang Chen <tangchen@cn.fujitsu.com>
      Cc: Toshi Kani <toshi.kani@hp.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      87379ec8
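      The described check, sketched in isolation (the surrounding allocator code is
      omitted and variable names are assumed):

        /* inside __alloc_memory_core_early(), once a candidate range is found */
        if (memblock_reserve(addr, size))	/* return value was ignored before */
        	return NULL;			/* report the failure to the caller */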
    • memcg: rework memcg_update_kmem_limit synchronization · d6441637
      Committed by Vladimir Davydov
      Currently we take both the memcg_create_mutex and the set_limit_mutex
      when we enable kmem accounting for a memory cgroup, which makes kmem
      activation events serialize with both memcg creations and other memcg
      limit updates (memory.limit, memory.memsw.limit).  However, there is no
      point in such strict synchronization rules there.
      
      First, the set_limit_mutex was introduced to keep the memory.limit and
      memory.memsw.limit values in sync.  Since memory.kmem.limit can be set
      independently of them, it is better to introduce a separate mutex to
      synchronize against concurrent kmem limit updates.
      
      Second, we take the memcg_create_mutex in order to make sure all
      children of this memcg will be kmem-active as well.  For achieving that,
      it is enough to hold this mutex only while checking if
      memcg_has_children() though.  This guarantees that if a child is added
      after we checked that the memcg has no children, the newly added cgroup
      will see its parent kmem-active (of course if the latter succeeded), and
      call kmem activation for itself.
      
      This patch simplifies the locking rules of memcg_update_kmem_limit()
      according to these considerations.
      
      [vdavydov@parallels.com: fix unintialized var warning]
      Signed-off-by: NVladimir Davydov <vdavydov@parallels.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Glauber Costa <glommer@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d6441637
    • memcg: remove KMEM_ACCOUNTED_ACTIVATED flag · 6de64beb
      Committed by Vladimir Davydov
      Currently we have two state bits in mem_cgroup::kmem_account_flags
      regarding kmem accounting activation, ACTIVATED and ACTIVE.  We start
      kmem accounting only if both flags are set (memcg_can_account_kmem()),
      plus throughout the code there are several places where we check only
      the ACTIVE flag, but we never check the ACTIVATED flag alone.  These
      flags are both set from memcg_update_kmem_limit() under the
      set_limit_mutex, the ACTIVE flag always being set after ACTIVATED, and
      they never get cleared.  That said, checking if both flags are set is
      equivalent to checking only for the ACTIVE flag, and since there are no
      ACTIVATED-only flag checks, we can safely remove the ACTIVATED flag and
      nothing will change.
      
      Let's try to understand what was the reason for introducing these flags.
      The purpose of the ACTIVE flag is clear - it states that kmem should be
      accounting to the cgroup.  The only requirement for it is that it should
      be set after we have fully initialized kmem accounting bits for the
      cgroup and patched all static branches relating to kmem accounting.
      Since we always check if static branch is enabled before actually
      considering if we should account (otherwise we wouldn't benefit from
      static branching), this guarantees us that we won't skip a commit or
      uncharge after a charge due to an unpatched static branch.
      
      Now let's move on to the ACTIVATED bit.  As I showed at the beginning of
      this message, it is absolutely useless, and removing it will change
      nothing.  So what was the reason for introducing it?
      
      The ACTIVATED flag was introduced by commit a8964b9b ("memcg: use
      static branches when code not in use") in order to guarantee that
      static_key_slow_inc(&memcg_kmem_enabled_key) would be called only once
      for each memory cgroup when its kmem accounting was activated.  The
      point was that at that time the memcg_update_kmem_limit() function's
      work-flow looked like this:
      
              bool must_inc_static_branch = false;
      
              cgroup_lock();
              mutex_lock(&set_limit_mutex);
              if (!memcg->kmem_account_flags && val != RESOURCE_MAX) {
                      /* The kmem limit is set for the first time */
                      ret = res_counter_set_limit(&memcg->kmem, val);
      
                      memcg_kmem_set_activated(memcg);
                      must_inc_static_branch = true;
              } else
                      ret = res_counter_set_limit(&memcg->kmem, val);
              mutex_unlock(&set_limit_mutex);
              cgroup_unlock();
      
              if (must_inc_static_branch) {
                      /* We can't do this under cgroup_lock */
                      static_key_slow_inc(&memcg_kmem_enabled_key);
                      memcg_kmem_set_active(memcg);
              }
      
      So that without the ACTIVATED flag we could race with other threads
      trying to set the limit and increment the static branching ref-counter
      more than once.  Today we call the whole memcg_update_kmem_limit()
      function under the set_limit_mutex and this race is impossible.
      
      Now that we understand why the ACTIVATED bit was introduced, why we
      don't need it any more, and that removing it will change nothing anyway,
      let's get rid of it.
      Signed-off-by: NVladimir Davydov <vdavydov@parallels.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Glauber Costa <glommer@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6de64beb
    • memcg, slab: RCU protect memcg_params for root caches · f8570263
      Committed by Vladimir Davydov
      We relocate root cache's memcg_params whenever we need to grow the
      memcg_caches array to accommodate all kmem-active memory cgroups.
      Currently on relocation we free the old version immediately, which can
      lead to use-after-free, because the memcg_caches array is accessed
      lock-free (see cache_from_memcg_idx()).  This patch fixes this by making
      memcg_params RCU-protected for root caches.
      Signed-off-by: NVladimir Davydov <vdavydov@parallels.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Glauber Costa <glommer@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Christoph Lameter <cl@linux.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f8570263
    • slab: do not panic if we fail to create memcg cache · f717eb3a
      Committed by Vladimir Davydov
      There is no point in flooding logs with warnings or especially crashing
      the system if we fail to create a cache for a memcg.  In this case we
      will be accounting the memcg allocations to the root cgroup until we
      succeed in creating its own cache, but it isn't that critical.
      Signed-off-by: NVladimir Davydov <vdavydov@parallels.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Glauber Costa <glommer@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Christoph Lameter <cl@linux.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f717eb3a
    • memcg: get rid of kmem_cache_dup() · 842e2873
      Committed by Vladimir Davydov
      kmem_cache_dup() is only called from memcg_create_kmem_cache().  The
      latter, in fact, does nothing besides this, so let's fold
      kmem_cache_dup() into memcg_create_kmem_cache().
      
      This patch also makes the memcg_cache_mutex private to
      memcg_create_kmem_cache(), because it is not used anywhere else.
      Signed-off-by: NVladimir Davydov <vdavydov@parallels.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Glauber Costa <glommer@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      842e2873
    • memcg, slab: fix races in per-memcg cache creation/destruction · 2edefe11
      Committed by Vladimir Davydov
      We obtain a per-memcg cache from a root kmem_cache by dereferencing an
      entry of the root cache's memcg_params::memcg_caches array.  If we find
      no cache for a memcg there on allocation, we initiate the memcg cache
      creation (see memcg_kmem_get_cache()).  The cache creation proceeds
      asynchronously in memcg_create_kmem_cache() in order to avoid lock
      clashes, so there can be several threads trying to create the same
      kmem_cache concurrently, but only one of them may succeed.  However, due
      to a race in the code, it is not always true.  The point is that the
      memcg_caches array can be relocated when we activate kmem accounting for
      a memcg (see memcg_update_all_caches(), memcg_update_cache_size()).  If
      memcg_update_cache_size() and memcg_create_kmem_cache() proceed
      concurrently as described below, we can leak a kmem_cache.
      
      Assume two threads schedule creation of the same kmem_cache.  One of them
      successfully creates it.  Another one should fail then, but if
      memcg_create_kmem_cache() interleaves with memcg_update_cache_size() as
      follows, it won't:
      
        memcg_create_kmem_cache()             memcg_update_cache_size()
        (called w/o mutexes held)             (called with slab_mutex,
                                               set_limit_mutex held)
        -------------------------             -------------------------
      
        mutex_lock(&memcg_cache_mutex)
      
                                              s->memcg_params=kzalloc(...)
      
        new_cachep=cache_from_memcg_idx(cachep,idx)
        // new_cachep==NULL => proceed to creation
      
                                              s->memcg_params->memcg_caches[i]
                                                  =cur_params->memcg_caches[i]
      
        // kmem_cache_create_memcg takes slab_mutex
        // so we will hang around until
        // memcg_update_cache_size finishes, but
        // nothing will prevent it from succeeding so
        // memcg_caches[idx] will be overwritten in
        // memcg_register_cache!
      
        new_cachep = kmem_cache_create_memcg(...)
        mutex_unlock(&memcg_cache_mutex)
      
      Let's fix this by moving the check for existence of the memcg cache to
      kmem_cache_create_memcg() to be called under the slab_mutex and make it
      return NULL if so.
      
      A similar race is possible when destroying a memcg cache (see
      kmem_cache_destroy()).  Since memcg_unregister_cache(), which clears the
      pointer in the memcg_caches array, is called w/o protection, we can race
      with memcg_update_cache_size() and omit clearing the pointer.  Therefore
      memcg_unregister_cache() should be moved before we release the
      slab_mutex.
      Signed-off-by: NVladimir Davydov <vdavydov@parallels.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Glauber Costa <glommer@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Christoph Lameter <cl@linux.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2edefe11
    • memcg: fix possible NULL deref while traversing memcg_slab_caches list · 96403da2
      Committed by Vladimir Davydov
      All caches of the same memory cgroup are linked in the memcg_slab_caches
      list via kmem_cache::memcg_params::list.  This list is traversed, for
      example, when we read memory.kmem.slabinfo.
      
      Since the list actually consists of memcg_cache_params objects, we have
      to convert an element of the list to a kmem_cache object using
      memcg_params_to_cache(), which obtains the pointer to the cache from the
      memcg_params::memcg_caches array of the corresponding root cache.  That
      said the pointer to a kmem_cache in its parent's memcg_params must be
      initialized before adding the cache to the list, and cleared only after
      it has been unlinked.  Currently it is vice-versa, which can result in a
      NULL ptr dereference while traversing the memcg_slab_caches list.  This
      patch restores the correct order.
      Signed-off-by: NVladimir Davydov <vdavydov@parallels.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Glauber Costa <glommer@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      96403da2