1. 17 October 2013, 8 commits
    • fs: buffer: move allocation failure loop into the allocator · 84235de3
      Authored by Johannes Weiner
      Buffer allocation has a very crude indefinite loop around waking the
      flusher threads and performing global NOFS direct reclaim because it
      cannot handle allocation failures.
      
      The most immediate problem with this is that the allocation may fail due
      to a memory cgroup limit, where flushers + direct reclaim might not make
      any progress towards resolving the situation at all, because, unlike the
      global case, a memory cgroup may have no cache at all, only anonymous
      pages and no swap.  This situation leads to a reclaim livelock with
      insane IO from waking the flushers and thrashing unrelated filesystem
      cache in a tight loop.
      
      Use __GFP_NOFAIL allocations for buffers for now.  This makes sure that
      any looping happens in the page allocator, which knows how to
      orchestrate kswapd, direct reclaim, and the flushers sensibly.  It also
      allows memory cgroups to detect allocations that can't handle failure
      and will allow them to ultimately bypass the limit if reclaim cannot
      make progress.
      Reported-by: azurIt <azurit@pobox.sk>
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: <stable@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      84235de3
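      A minimal sketch of the resulting allocation pattern, assuming an
      illustrative helper (the function name and arguments below are
      hypothetical, not the actual fs/buffer.c code): instead of the caller
      looping around reclaim, the allocation itself is marked __GFP_NOFAIL
      and the page allocator owns the retry.
      
      /* Illustrative sketch only -- not the literal patch. */
      static struct page *grab_buffer_page(struct address_space *mapping,
                                           pgoff_t index)
      {
              gfp_t gfp = mapping_gfp_mask(mapping) & ~__GFP_FS;
      
              /*
               * Let the page allocator loop instead of the buffer code: it
               * wakes kswapd, runs direct reclaim and throttles on the
               * flushers, and memcg can spot the no-fail allocation and
               * bypass its limit if reclaim cannot make progress.
               */
              gfp |= __GFP_NOFAIL;
              return find_or_create_page(mapping, index, gfp);
      }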
    • mm: memcg: handle non-error OOM situations more gracefully · 49426420
      Authored by Johannes Weiner
      Commit 3812c8c8 ("mm: memcg: do not trap chargers with full
      callstack on OOM") assumed that only a few places that can trigger a
      memcg OOM situation do not return VM_FAULT_OOM, like optional page cache
      readahead.  But there are many more and it's impractical to annotate
      them all.
      
      First of all, we don't want to invoke the OOM killer when the failed
      allocation is gracefully handled, so defer the actual kill to the end of
      the fault handling as well.  This simplifies the code quite a bit as an
      added bonus.
      
      Second, since a failed allocation might not be the abrupt end of the
      fault, the memcg OOM handler needs to be re-entrant until the fault
      finishes for subsequent allocation attempts.  If an allocation is
      attempted after the task already OOMed, allow it to bypass the limit so
      that it can quickly finish the fault and invoke the OOM killer.
      Reported-by: azurIt <azurit@pobox.sk>
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: <stable@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      49426420
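      A conceptual sketch of the flow described above, with hypothetical
      type, field and function names (the real patch hangs similar state off
      task_struct and runs the synchronize step from the page fault code):
      
      /* Hypothetical per-task state; everything below is illustrative. */
      struct memcg_oom_state {
              bool oomed;             /* hit the memcg limit during this fault */
      };
      
      void memcg_invoke_oom_killer(void);     /* stand-in for the real killer */
      
      static int memcg_try_charge(struct memcg_oom_state *st, bool over_limit)
      {
              if (!over_limit)
                      return 0;       /* charge succeeded */
              if (st->oomed)          /* already OOMed in this fault: bypass */
                      return 0;       /* the limit so the fault finishes fast */
              st->oomed = true;       /* remember the OOM, but do not kill yet */
              return -ENOMEM;         /* let the caller unwind gracefully */
      }
      
      /* Run once at the very end of the fault, not from the charge path. */
      static void memcg_oom_synchronize(struct memcg_oom_state *st)
      {
              if (st->oomed)
                      memcg_invoke_oom_killer();
      }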
    • mm: hugetlb: initialize PG_reserved for tail pages of gigantic compound pages · ef5a22be
      Authored by Andrea Arcangeli
      Commit 11feeb49 ("kvm: optimize away THP checks in
      kvm_is_mmio_pfn()") introduced a memory leak when KVM is run on gigantic
      compound pages.
      
      That commit depends on the assumption that PG_reserved is identical for
      all head and tail pages of a compound page, so that if get_user_pages
      returns a tail page we don't need to check the head page in order to
      know whether we are dealing with a reserved page that requires different
      refcounting.
      
      The assumption that PG_reserved is the same for head and tail pages is
      certainly correct for THP and regular hugepages, but gigantic hugepages
      allocated through bootmem don't have PG_reserved cleared on their tail
      pages (the clearing of PG_reserved is done later, and only if the
      gigantic hugepage is freed).
      
      This patch corrects the gigantic compound page initialization so that we
      can retain the optimization in 11feeb49.  The cacheline was already
      modified in order to set PG_tail so this won't affect the boot time of
      large memory systems.
      
      [akpm@linux-foundation.org: tweak comment layout and grammar]
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Reported-by: andy123 <ajs124.ajs124@gmail.com>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: Gleb Natapov <gleb@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Hugh Dickins <hughd@google.com>
      Acked-by: Rafael Aquini <aquini@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ef5a22be
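      A sketch of the initialization loop, modeled on
      prep_compound_gigantic_page() but simplified (the real loop walks
      mem_map carefully across sections rather than using p++):
      
      static void init_gigantic_page(struct page *head, unsigned int order)
      {
              int i, nr_pages = 1 << order;
              struct page *p = head + 1;
      
              set_compound_order(head, order);
              __SetPageHead(head);
              for (i = 1; i < nr_pages; i++, p++) {
                      __SetPageTail(p);
                      /*
                       * Bootmem leaves PG_reserved set on the tail pages of
                       * gigantic hugepages; clear it here, in the same pass
                       * that already dirties this cacheline to mark the tail.
                       */
                      __ClearPageReserved(p);
                      set_page_count(p, 0);
                      p->first_page = head;
              }
      }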
    • mm/zswap: bugfix: memory leak when re-swapon · aa9bca05
      Authored by Weijie Yang
      zswap_tree is not freed on swapoff, and it is kmalloc'ed again on
      swapon, so the old tree is leaked.
      
      Free the memory of zswap_tree in zswap_frontswap_invalidate_area().
      Signed-off-by: Weijie Yang <weijie.yang@samsung.com>
      Reviewed-by: Bob Liu <bob.liu@oracle.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Reviewed-by: Minchan Kim <minchan@kernel.org>
      Cc: <stable@vger.kernel.org>
      From: Weijie Yang <weijie.yang@samsung.com>
      Subject: mm/zswap: bugfix: memory leak when invalidate and reclaim occur concurrently
      
      Consider the following scenario:
      thread 0: reclaims entry x (takes a refcount, but has not yet called zswap_get_swap_cache_page)
      thread 1: calls zswap_frontswap_invalidate_page to invalidate entry x.
      	when it finishes, entry x and its zbud are not freed because the refcount != 0
      	now swap_map[x] = 0
      thread 0: now calls zswap_get_swap_cache_page
      	swapcache_prepare returns -ENOENT because entry x is no longer in use
      	zswap_get_swap_cache_page returns ZSWAP_SWAPCACHE_NOMEM
      	zswap_writeback_entry does nothing except put the refcount
      Now the memory of zswap_entry x and its zpage is leaked.
      
      Changes:
       - check the refcount in the fail path and free the memory if it is no
         longer referenced.
      
       - use ZSWAP_SWAPCACHE_FAIL instead of ZSWAP_SWAPCACHE_NOMEM, since the
         fail path can be reached not only because of nomem but also because
         of invalidation.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: Weijie Yang <weijie.yang@samsung.com>
      Reviewed-by: Bob Liu <bob.liu@oracle.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: <stable@vger.kernel.org>
      Acked-by: Seth Jennings <sjenning@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      aa9bca05
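      A sketch of the reworked failure path in the writeback code; the helper
      names mirror mm/zswap.c of that era, but the fragment below is
      illustrative rather than the literal diff:
      
      fail:
              spin_lock(&tree->lock);
              refcount = zswap_entry_put(entry);      /* drop our reference */
              if (refcount <= 0) {
                      /*
                       * A concurrent invalidate already removed the entry
                       * from the rbtree while we held the last reference,
                       * so freeing the entry and its zbud allocation is
                       * now this path's job.
                       */
                      zswap_free_entry(tree, entry);
              }
              spin_unlock(&tree->lock);
              return ret;
      
      The re-swapon leak itself is closed by having
      zswap_frontswap_invalidate_area() kfree() the tree and clear the
      per-type tree pointer once all entries are gone.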
    • mm: migration: do not lose soft dirty bit if page is in migration state · c3d16e16
      Authored by Cyrill Gorcunov
      If page migration is enabled in the config and a page is in the middle
      of migration, we may lose the soft dirty bit.  If fork or mprotect is
      called on such migrating pages, then once migration completes the
      corresponding pte entries do not carry the soft dirty bit.  Fix this by
      adding an appropriate test on swap entries.
      Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Matt Mackall <mpm@selenic.com>
      Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c3d16e16
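      A fragment sketching the added test, modeled on the fork/copy_one_pte()
      path (the same idea applies wherever a migration swap pte is
      rewritten); it relies on variables from the surrounding function and is
      illustrative only:
      
      if (unlikely(!pte_present(pte))) {
              swp_entry_t entry = pte_to_swp_entry(pte);
      
              if (is_write_migration_entry(entry) && is_cow_mapping(vm_flags)) {
                      make_migration_entry_read(&entry);
                      pte = swp_entry_to_pte(entry);
                      /*
                       * While it migrates, the page is represented only by
                       * this swap pte, so soft-dirty must be carried on the
                       * swap pte itself or it is lost after migration.
                       */
                      if (pte_swp_soft_dirty(*src_pte))
                              pte = pte_swp_mksoft_dirty(pte);
                      set_pte_at(src_mm, addr, src_pte, pte);
              }
      }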
    • mm/hugetlb.c: correct missing private flag clearing · 16c794b4
      Authored by Joonsoo Kim
      We should clear the page's private flag when returning the page to the
      hugepage pool.  Otherwise, a marked hugepage can be allocated to a user
      who tries to allocate a non-reserved hugepage.  If this user fails to
      map the hugepage, it is returned to the hugepage pool; since the page
      still has the private flag set, resv_huge_pages is mistakenly increased.
      This patch fixes that situation.
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Davidlohr Bueso <davidlohr.bueso@hp.com>
      Cc: David Gibson <david@gibson.dropbear.id.au>
      Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Hillf Danton <dhillf@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      16c794b4
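      A fragment modeled on free_huge_page(), showing where the missing clear
      belongs (illustrative, not the literal diff):
      
      restore_reserve = PagePrivate(page);
      /*
       * Clear the "consumed a reservation" marker now: if the flag survived
       * the return to the pool, a later unrelated free of this page would
       * bump resv_huge_pages again.
       */
      ClearPagePrivate(page);
      
      spin_lock(&hugetlb_lock);
      if (restore_reserve)
              h->resv_huge_pages++;
      enqueue_huge_page(h, page);     /* back onto the hugepage free list */
      spin_unlock(&hugetlb_lock);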
    • mm/vmscan.c: don't forget to free shrinker->nr_deferred · ae393321
      Authored by Andrew Vagin
      This leak was added by commit 1d3d4437 ("vmscan: per-node deferred
      work").
      
      unreferenced object 0xffff88006ada3bd0 (size 8):
        comm "criu", pid 14781, jiffies 4295238251 (age 105.641s)
        hex dump (first 8 bytes):
          00 00 00 00 00 00 00 00                          ........
        backtrace:
          [<ffffffff8170caee>] kmemleak_alloc+0x5e/0xc0
          [<ffffffff811c0527>] __kmalloc+0x247/0x310
          [<ffffffff8117848c>] register_shrinker+0x3c/0xa0
          [<ffffffff811e115b>] sget+0x5ab/0x670
          [<ffffffff812532f4>] proc_mount+0x54/0x170
          [<ffffffff811e1893>] mount_fs+0x43/0x1b0
          [<ffffffff81202dd2>] vfs_kern_mount+0x72/0x110
          [<ffffffff81202e89>] kern_mount_data+0x19/0x30
          [<ffffffff812530a0>] pid_ns_prepare_proc+0x20/0x40
          [<ffffffff81083c56>] alloc_pid+0x466/0x4a0
          [<ffffffff8105aeda>] copy_process+0xc6a/0x1860
          [<ffffffff8105beab>] do_fork+0x8b/0x370
          [<ffffffff8105c1a6>] SyS_clone+0x16/0x20
          [<ffffffff8171f739>] stub_clone+0x69/0x90
          [<ffffffffffffffff>] 0xffffffffffffffff
      Signed-off-by: Andrew Vagin <avagin@openvz.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Glauber Costa <glommer@openvz.org>
      Cc: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ae393321
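      The shape of the fix, modeled on mm/vmscan.c: the per-node nr_deferred
      array that register_shrinker() kmallocs has to be freed again when the
      shrinker is unregistered.
      
      void unregister_shrinker(struct shrinker *shrinker)
      {
              down_write(&shrinker_rwsem);
              list_del(&shrinker->list);
              up_write(&shrinker_rwsem);
              kfree(shrinker->nr_deferred);   /* previously leaked */
      }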
    • mm, memcg: protect mem_cgroup_read_events for cpu hotplug · 9c567512
      Authored by David Rientjes
      for_each_online_cpu() needs the protection of {get,put}_online_cpus() so
      cpu_online_mask doesn't change during the iteration.
      
      cpu_hotplug.lock is held while a cpu is going down; it is a coarse lock
      that is used kernel-wide to synchronize cpu hotplug activity.  Memcg has
      a cpu hotplug notifier, called while there may not be any cpu hotplug
      refcounts, which drains per-cpu event counts into
      memcg->nocpu_base.events to maintain a cumulative event count as cpus
      disappear.  Without get_online_cpus() in mem_cgroup_read_events(), it is
      possible to account for the event count on a dying cpu twice, which can
      inflate the reported value significantly.
      
      In fact, all use of memcg->pcp_counter_lock should be nested within
      {get,put}_online_cpus().
      
      This fixes that issue and ensures the reported statistics are not vastly
      over-reported during cpu hotplug.
      Signed-off-by: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9c567512
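      A sketch of the reader with the hotplug protection in place, modeled on
      mem_cgroup_read_events():
      
      static unsigned long mem_cgroup_read_events(struct mem_cgroup *memcg,
                                                  enum mem_cgroup_events_index idx)
      {
              unsigned long val = 0;
              int cpu;
      
              get_online_cpus();      /* pin cpu_online_mask for the walk */
              for_each_online_cpu(cpu)
                      val += per_cpu(memcg->stat->events[idx], cpu);
      #ifdef CONFIG_HOTPLUG_CPU
              spin_lock(&memcg->pcp_counter_lock);
              val += memcg->nocpu_base.events[idx];   /* counts from dead cpus */
              spin_unlock(&memcg->pcp_counter_lock);
      #endif
              put_online_cpus();
              return val;
      }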
  2. 03 October 2013, 1 commit
    • powerpc: Fix memory hotplug with sparse vmemmap · f7e3334a
      Authored by Nathan Fontenot
      A previous commit, 46723bfa..., introduced a new config option
      HAVE_BOOTMEM_INFO_NODE that ended up breaking memory hot-remove for ppc
      when sparse vmemmap is not defined.
      
      This patch defines HAVE_BOOTMEM_INFO_NODE for ppc and adds the call to
      register_page_bootmem_info_node. Without this we get a BUG_ON for memory
      hot remove in put_page_bootmem().
      
      This also adds a stub for register_page_bootmem_memmap to allow ppc to build
      with sparse vmemmap defined. Leaving this as a stub is fine since the same
      vmemmap addresses are also handled in vmemmap_populate and as such are
      properly mapped.
      Signed-off-by: Nathan Fontenot <nfont@linux.vnet.ibm.com>
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      CC: <stable@vger.kernel.org> [v3.9+]
      f7e3334a
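      A sketch of the stub side of this change (the Kconfig part simply
      selects HAVE_BOOTMEM_INFO_NODE for the affected ppc configurations);
      the empty body is intentional because vmemmap_populate() already maps
      these addresses:
      
      void register_page_bootmem_memmap(unsigned long section_nr,
                                        struct page *start_page,
                                        unsigned long size)
      {
              /* Intentionally a no-op on ppc with sparse vmemmap. */
      }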
  3. 01 October 2013, 9 commits
  4. 28 September 2013, 1 commit
    • slab_common: Do not check for duplicate slab names · 3e374919
      Authored by Christoph Lameter
      SLUB can alias multiple kmem_cache_create() requests to one slab cache
      in order to save memory and increase cache hotness.  As a result, the
      name of the slab can be stale.  Only check the name for duplicates if we
      are in debug mode, where we do not merge multiple caches.
      
      This fixes the following problem reported by Jonathan Brassow:
      
        The problem with kmem_cache* is this:
      
        *) Assume CONFIG_SLUB is set
        1) kmem_cache_create(name="foo-a")
        - creates new kmem_cache structure
        2) kmem_cache_create(name="foo-b")
        - If identical cache characteristics, it will be merged with the previously
          created cache associated with "foo-a".  The cache's refcount will be
          incremented and an alias will be created via sysfs_slab_alias().
        3) kmem_cache_destroy(<ptr>)
        - Attempting to destroy cache associated with "foo-a", but instead the
          refcount is simply decremented.  I don't even think the sysfs aliases are
          ever removed...
        4) kmem_cache_create(name="foo-a")
        - This FAILS because kmem_cache_sanity_check collides with the existing
          name ("foo-a") associated with the non-removed cache.
      
        This is a problem for RAID (specifically dm-raid) because the name used
        for the kmem_cache_create is ("raid%d-%p", level, mddev).  If the cache
        persists for long enough, the memory address of an old mddev will be
        reused for a new mddev - causing an identical formulation of the cache
        name.  Even though kmem_cache_destroy had long ago been used to delete
        the old cache, the merging of caches has caused the name and cache of
        that old instance to be preserved and causes a collision (and thus
        failure) in kmem_cache_create().  I see this regularly in my testing.
      Reported-by: Jonathan Brassow <jbrassow@redhat.com>
      Signed-off-by: Christoph Lameter <cl@linux.com>
      Signed-off-by: Pekka Enberg <penberg@kernel.org>
      3e374919
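      A sketch of the idea with a hypothetical helper name (the real check
      lives in kmem_cache_sanity_check() in mm/slab_common.c): the
      duplicate-name walk only runs in the debug configuration, where caches
      are never merged.
      
      static int check_duplicate_name(const char *name)
      {
      #ifdef CONFIG_DEBUG_VM
              struct kmem_cache *s;
      
              /*
               * Only meaningful when caches are never merged: with SLUB
               * aliasing, a stale name on a merged cache is expected and
               * must not make kmem_cache_create() fail.
               */
              list_for_each_entry(s, &slab_caches, list) {
                      if (!strcmp(s->name, name)) {
                              pr_err("kmem_cache_create(%s): name already exists\n",
                                     name);
                              return -EEXIST;
                      }
              }
      #endif
              return 0;
      }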
  5. 25 September 2013, 9 commits
  6. 13 September 2013, 12 commits