1. 31 Oct, 2012 3 commits
  2. 24 Oct, 2012 4 commits
  3. 19 Oct, 2012 1 commit
    • J
      slub: remove one code path and reduce lock contention in __slab_free() · 837d678d
      Joonsoo Kim committed
      When we free an object, there are some cases in which we need
      to take the node lock. This step is necessary to prevent a race.
      After taking the lock, we try cmpxchg_double_slab().
      However, cmpxchg_double_slab() can still fail while we hold the
      lock. The following example shows how.
      
      CPU A               CPU B
      need lock
      ...                 need lock
      ...                 lock!!
      lock..but spin      free success
      spin...             unlock
      lock!!
      free fail
      
      In this case, CPU A retries while still holding the lock.
      I think that for CPU A in this case,
      "release the lock first, and re-take it if necessary" is the preferable way.
      
      There are two reasons for this.
      
      First, it makes __slab_free()'s logic somewhat simpler.
      With this patch, 'was_frozen = 1' is always handled without taking the lock,
      so we can remove one code path.
      
      Second, it may reduce lock contention.
      On a retry, the state of the slab has usually already changed,
      so in almost every case the lock is no longer needed.
      The "release the lock first, and re-take it if necessary" policy
      helps here.
      Signed-off-by: Joonsoo Kim <js1304@gmail.com>
      Acked-by: Christoph Lameter <cl@linux.com>
      Signed-off-by: Pekka Enberg <penberg@kernel.org>
      837d678d
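The retry policy described in this commit can be sketched in user space. The following is a minimal illustration only, not the kernel's __slab_free(): the names toy_slab, toy_lock(), toy_needs_lock() and toy_slab_free() are hypothetical, a C11 atomic_flag stands in for the node lock, and the rule "only the free that empties the slab needs the lock" is an assumption made to keep the example small.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a slab page; none of these names are kernel APIs. */
struct toy_slab {
	atomic_uint inuse;        /* objects still allocated, updated lock-free */
	atomic_flag node_lock;    /* toy spinlock standing in for the node lock */
	unsigned int lock_taken;  /* how often the lock was acquired (demo only) */
};

static void toy_slab_init(struct toy_slab *s, unsigned int inuse)
{
	atomic_init(&s->inuse, inuse);
	atomic_flag_clear(&s->node_lock);
	s->lock_taken = 0;
}

static void toy_lock(struct toy_slab *s)
{
	while (atomic_flag_test_and_set(&s->node_lock))
		;	/* spin until the flag was previously clear */
	s->lock_taken++;
}

static void toy_unlock(struct toy_slab *s)
{
	atomic_flag_clear(&s->node_lock);
}

/* Assumption for this sketch: only the free that empties the slab needs the lock. */
static bool toy_needs_lock(unsigned int inuse)
{
	return inuse == 1;
}

/* Free one object; the caller guarantees that inuse > 0. */
static void toy_slab_free(struct toy_slab *s)
{
	for (;;) {
		unsigned int old = atomic_load(&s->inuse);
		bool locked = toy_needs_lock(old);

		if (locked)
			toy_lock(s);

		if (atomic_compare_exchange_strong(&s->inuse, &old, old - 1)) {
			/* List manipulation would happen here while locked. */
			if (locked)
				toy_unlock(s);
			return;
		}

		/* Lost the race: release the lock first, then re-evaluate. */
		if (locked)
			toy_unlock(s);
	}
}
```

Because the decision whether to lock is recomputed on every iteration, a retry that finds the slab in a new state simply proceeds without the lock, which is the effect the commit message describes.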
  4. 13 Oct, 2012 2 commits
    • J
      vfs: make path_openat take a struct filename pointer · 669abf4e
      Jeff Layton committed
      ...and fix up the callers. For do_file_open_root, just declare a
      struct filename on the stack and fill out the .name field. For
      do_filp_open, make it also take a struct filename pointer, and fix up its
      callers to call it appropriately.
      
      For filp_open, add a variant that takes a struct filename pointer and turn
      filp_open into a wrapper around it.
      Signed-off-by: Jeff Layton <jlayton@redhat.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      669abf4e
    • J
      vfs: define struct filename and have getname() return it · 91a27b2a
      Jeff Layton committed
      getname() is intended to copy pathname strings from userspace into a
      kernel buffer. The result is just a string in kernel space. It would
      however be quite helpful to be able to attach some ancillary info to
      the string.
      
      For instance, we could attach some audit-related info to reduce the
      amount of audit-related processing needed. When auditing is enabled,
      we could also call getname() on the string more than once and not
      need to recopy it from userspace.
      
      This patchset converts the getname()/putname() interfaces to return
      a struct instead of a string. For now, the struct just tracks the
      string in kernel space and the original userland pointer for it.
      
      Later, we'll add other information to the struct as it becomes
      convenient.
      Signed-off-by: Jeff Layton <jlayton@redhat.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      91a27b2a
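As a rough user-space analogue of the idea in this commit, the sketch below pairs a private copy of a path string with the caller's original pointer. It is not the kernel's definition of struct filename: toy_filename, toy_getname() and toy_putname() are hypothetical names, and malloc()/memcpy() stand in for the kernel's allocation and copy-from-userspace steps.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical analogue: the copied string plus the pointer it came from. */
struct toy_filename {
	const char *name;  /* private copy of the path string */
	const char *uptr;  /* original caller-supplied pointer */
};

/* Copy the path and remember where it came from; returns NULL on failure. */
static struct toy_filename *toy_getname(const char *user_path)
{
	struct toy_filename *fn;
	size_t len;

	if (!user_path)
		return NULL;

	len = strlen(user_path) + 1;
	fn = malloc(sizeof(*fn) + len);  /* struct and string in one allocation */
	if (!fn)
		return NULL;

	memcpy((char *)(fn + 1), user_path, len);
	fn->name = (const char *)(fn + 1);
	fn->uptr = user_path;
	return fn;
}

/* Release everything toy_getname() allocated. */
static void toy_putname(struct toy_filename *fn)
{
	free(fn);
}
```

Returning a struct rather than a bare string leaves room to attach further fields later without changing every caller again, which is the motivation the commit message gives.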
  5. 10 Oct, 2012 2 commits
    • J
      mm, slab: release slab_mutex earlier in kmem_cache_destroy() · 210ed9de
      Jiri Kosina committed
      Commit 1331e7a1 ("rcu: Remove _rcu_barrier() dependency on
      __stop_machine()") introduced slab_mutex -> cpu_hotplug.lock dependency
      through kmem_cache_destroy() -> rcu_barrier() -> _rcu_barrier() ->
      get_online_cpus().
      
      Lockdep thinks that this might actually result in ABBA deadlock,
      and reports it as below:
      
      === [ cut here ] ===
       ======================================================
       [ INFO: possible circular locking dependency detected ]
       3.6.0-rc5-00004-g0d8ee37e #143 Not tainted
       -------------------------------------------------------
       kworker/u:2/40 is trying to acquire lock:
        (rcu_sched_state.barrier_mutex){+.+...}, at: [<ffffffff810f2126>] _rcu_barrier+0x26/0x1e0
      
       but task is already holding lock:
        (slab_mutex){+.+.+.}, at: [<ffffffff81176e15>] kmem_cache_destroy+0x45/0xe0
      
       which lock already depends on the new lock.
      
       the existing dependency chain (in reverse order) is:
      
       -> #2 (slab_mutex){+.+.+.}:
              [<ffffffff810ae1e2>] validate_chain+0x632/0x720
              [<ffffffff810ae5d9>] __lock_acquire+0x309/0x530
              [<ffffffff810ae921>] lock_acquire+0x121/0x190
              [<ffffffff8155d4cc>] __mutex_lock_common+0x5c/0x450
              [<ffffffff8155d9ee>] mutex_lock_nested+0x3e/0x50
              [<ffffffff81558cb5>] cpuup_callback+0x2f/0xbe
              [<ffffffff81564b83>] notifier_call_chain+0x93/0x140
              [<ffffffff81076f89>] __raw_notifier_call_chain+0x9/0x10
              [<ffffffff8155719d>] _cpu_up+0xba/0x14e
              [<ffffffff815572ed>] cpu_up+0xbc/0x117
              [<ffffffff81ae05e3>] smp_init+0x6b/0x9f
              [<ffffffff81ac47d6>] kernel_init+0x147/0x1dc
              [<ffffffff8156ab44>] kernel_thread_helper+0x4/0x10
      
       -> #1 (cpu_hotplug.lock){+.+.+.}:
              [<ffffffff810ae1e2>] validate_chain+0x632/0x720
              [<ffffffff810ae5d9>] __lock_acquire+0x309/0x530
              [<ffffffff810ae921>] lock_acquire+0x121/0x190
              [<ffffffff8155d4cc>] __mutex_lock_common+0x5c/0x450
              [<ffffffff8155d9ee>] mutex_lock_nested+0x3e/0x50
              [<ffffffff81049197>] get_online_cpus+0x37/0x50
              [<ffffffff810f21bb>] _rcu_barrier+0xbb/0x1e0
              [<ffffffff810f22f0>] rcu_barrier_sched+0x10/0x20
              [<ffffffff810f2309>] rcu_barrier+0x9/0x10
              [<ffffffff8118c129>] deactivate_locked_super+0x49/0x90
              [<ffffffff8118cc01>] deactivate_super+0x61/0x70
              [<ffffffff811aaaa7>] mntput_no_expire+0x127/0x180
              [<ffffffff811ab49e>] sys_umount+0x6e/0xd0
              [<ffffffff81569979>] system_call_fastpath+0x16/0x1b
      
       -> #0 (rcu_sched_state.barrier_mutex){+.+...}:
              [<ffffffff810adb4e>] check_prev_add+0x3de/0x440
              [<ffffffff810ae1e2>] validate_chain+0x632/0x720
              [<ffffffff810ae5d9>] __lock_acquire+0x309/0x530
              [<ffffffff810ae921>] lock_acquire+0x121/0x190
              [<ffffffff8155d4cc>] __mutex_lock_common+0x5c/0x450
              [<ffffffff8155d9ee>] mutex_lock_nested+0x3e/0x50
              [<ffffffff810f2126>] _rcu_barrier+0x26/0x1e0
              [<ffffffff810f22f0>] rcu_barrier_sched+0x10/0x20
              [<ffffffff810f2309>] rcu_barrier+0x9/0x10
              [<ffffffff81176ea1>] kmem_cache_destroy+0xd1/0xe0
              [<ffffffffa04c3154>] nf_conntrack_cleanup_net+0xe4/0x110 [nf_conntrack]
              [<ffffffffa04c31aa>] nf_conntrack_cleanup+0x2a/0x70 [nf_conntrack]
              [<ffffffffa04c42ce>] nf_conntrack_net_exit+0x5e/0x80 [nf_conntrack]
              [<ffffffff81454b79>] ops_exit_list+0x39/0x60
              [<ffffffff814551ab>] cleanup_net+0xfb/0x1b0
              [<ffffffff8106917b>] process_one_work+0x26b/0x4c0
              [<ffffffff81069f3e>] worker_thread+0x12e/0x320
              [<ffffffff8106f73e>] kthread+0x9e/0xb0
              [<ffffffff8156ab44>] kernel_thread_helper+0x4/0x10
      
       other info that might help us debug this:
      
       Chain exists of:
         rcu_sched_state.barrier_mutex --> cpu_hotplug.lock --> slab_mutex
      
        Possible unsafe locking scenario:
      
              CPU0                    CPU1
              ----                    ----
         lock(slab_mutex);
                                      lock(cpu_hotplug.lock);
                                      lock(slab_mutex);
         lock(rcu_sched_state.barrier_mutex);
      
        *** DEADLOCK ***
      === [ cut here ] ===
      
      This is actually a false positive. Lockdep has no way of knowing the fact
      that the ABBA can actually never happen, because of special semantics of
      cpu_hotplug.refcount and its handling in cpu_hotplug_begin(); the mutual
      exclusion there is not achieved through mutex, but through
      cpu_hotplug.refcount.
      
      The "neither cpu_up() nor cpu_down() will proceed past cpu_hotplug_begin()
      until everyone who called get_online_cpus() has called put_online_cpus()"
      semantics are totally invisible to lockdep.
      
      This patch therefore moves the unlock of slab_mutex so that rcu_barrier()
      is being called with it unlocked. It has two advantages:
      
      - it slightly reduces hold time of slab_mutex; as it's used to protect
        the cachep list, it's not necessary to hold it over kmem_cache_free()
        call any more
      - it silences the lockdep false positive warning, as it avoids lockdep ever
        learning about slab_mutex -> cpu_hotplug.lock dependency
      Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Jiri Kosina <jkosina@suse.cz>
      Signed-off-by: Pekka Enberg <penberg@kernel.org>
      210ed9de
    • H
      tmpfs,ceph,gfs2,isofs,reiserfs,xfs: fix fh_len checking · 35c2a7f4
      Hugh Dickins committed
      Fuzzing with trinity oopsed on the 1st instruction of shmem_fh_to_dentry(),
      	u64 inum = fid->raw[2];
      which is unhelpfully reported as at the end of shmem_alloc_inode():
      
      BUG: unable to handle kernel paging request at ffff880061cd3000
      IP: [<ffffffff812190d0>] shmem_alloc_inode+0x40/0x40
      Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
      Call Trace:
       [<ffffffff81488649>] ? exportfs_decode_fh+0x79/0x2d0
       [<ffffffff812d77c3>] do_handle_open+0x163/0x2c0
       [<ffffffff812d792c>] sys_open_by_handle_at+0xc/0x10
       [<ffffffff83a5f3f8>] tracesys+0xe1/0xe6
      
      Right, tmpfs is being stupid to access fid->raw[2] before validating that
      fh_len includes it: the buffer kmalloc'ed by do_sys_name_to_handle() may
      fall at the end of a page, and the next page may not be present.
      
      But some other filesystems (ceph, gfs2, isofs, reiserfs, xfs) are being
      careless about fh_len too, in fh_to_dentry() and/or fh_to_parent(), and
      could oops in the same way: add the missing fh_len checks to those.
      Reported-by: Sasha Levin <levinsasha928@gmail.com>
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Sage Weil <sage@inktank.com>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: stable@vger.kernel.org
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      35c2a7f4
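The kind of check this commit adds can be sketched as follows. This is an illustration under stated assumptions rather than the actual filesystem code: toy_decode_inum() is a hypothetical helper, fh_len is taken to count 32-bit words, and the layout (raw[1] as the low half and raw[2] as the high half of the inode number) is assumed for the example.

```c
#include <stdint.h>

/*
 * Decode a 64-bit inode number from a file handle made of 32-bit words.
 * Assumed layout (illustrative): raw[1] = low half, raw[2] = high half.
 * fh_len counts 32-bit words, not bytes.
 * Returns 0 on success, -1 if the handle is too short.
 */
static int toy_decode_inum(const uint32_t *raw, int fh_len, uint64_t *inum)
{
	/* Validate the length before touching raw[2]. */
	if (fh_len < 3)
		return -1;

	*inum = ((uint64_t)raw[2] << 32) | raw[1];
	return 0;
}
```

The point is only the ordering: the length test comes before the first read of raw[2], so a short handle that ends at a page boundary is rejected instead of being dereferenced.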
  6. 09 Oct, 2012 28 commits
    • D
      mm: thp: Use more portable PMD clearing sequence in zap_huge_pmd(). · f5c8ad47
      David Miller committed
      Invalidation sequences are handled in various ways on various
      architectures.
      
      One way, which sparc64 uses, is to let the set_*_at() functions accumulate
      pending flushes into a per-cpu array.  Then the flush_tlb_range() et al.
      calls process the pending TLB flushes.
      
      In this regime, the __tlb_remove_*tlb_entry() implementations are
      essentially NOPs.
      
      The canonical PTE zap in mm/memory.c is:
      
      			ptent = ptep_get_and_clear_full(mm, addr, pte,
      							tlb->fullmm);
      			tlb_remove_tlb_entry(tlb, pte, addr);
      
      With a subsequent tlb_flush_mmu() if needed.
      
      Mirror this in the THP PMD zapping using:
      
      		orig_pmd = pmdp_get_and_clear(tlb->mm, addr, pmd);
      		page = pmd_page(orig_pmd);
      		tlb_remove_pmd_tlb_entry(tlb, pmd, addr);
      
      And we properly accommodate TLB flush mechanisms like the one described
      above.
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f5c8ad47
    • D
      mm: Add and use update_mmu_cache_pmd() in transparent huge page code. · b113da65
      David Miller committed
      The transparent huge page code passes a PMD pointer in as the third
      argument of update_mmu_cache(), which expects a PTE pointer.
      
      This never got noticed because X86 implements update_mmu_cache() as a
      macro and thus we don't get any type checking, and X86 is the only
      architecture which supports transparent huge pages currently.
      
      Before other architectures can support transparent huge pages properly we
      need to add a new interface which will take a PMD pointer as the third
      argument rather than a PTE pointer.
      
      [akpm@linux-foundation.org: implement update_mmu_cache_pmd() for s390]
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b113da65
    • Y
      memory-hotplug: suppress "Trying to free nonexistent resource... · d760afd4
      Yasuaki Ishimatsu committed
      memory-hotplug: suppress "Trying to free nonexistent resource <XXXXXXXXXXXXXXXX-YYYYYYYYYYYYYYYY>" warning
      
      When our x86 box calls __remove_pages(), release_mem_region() emits many
      warnings, and the x86 box cannot unregister iomem_resource:
      
        "Trying to free nonexistent resource <XXXXXXXXXXXXXXXX-YYYYYYYYYYYYYYYY>"
      
      Commit de7f0cba ("memory hotplug: release memory regions in
      PAGES_PER_SECTION chunks") changed release_mem_region() to be called once
      per PAGES_PER_SECTION chunk, because powerpc registers iomem_resource in
      PAGES_PER_SECTION chunks.  But when memory is hot-added on an x86 box,
      iomem_resource is registered per _CRS, not per PAGES_PER_SECTION chunk,
      so the x86 box fails to unregister iomem_resource.
      
      The patch fixes the problem.
      Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Jiang Liu <liuj97@gmail.com>
      Cc: Len Brown <len.brown@intel.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Wen Congyang <wency@cn.fujitsu.com>
      Cc: Dave Hansen <dave@linux.vnet.ibm.com>
      Cc: Nathan Fontenot <nfont@austin.ibm.com>
      Cc: Badari Pulavarty <pbadari@us.ibm.com>
      Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d760afd4
    • A
      mm: document PageHuge somewhat · 7795912c
      Andrew Morton committed
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7795912c
    • K
      mm: use %pK for /proc/vmallocinfo · 45ec1690
      Kees Cook committed
      In the paranoid case of sysctl kernel.kptr_restrict=2, mask the kernel
      virtual addresses in /proc/vmallocinfo too.
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Reported-by: Brad Spengler <spender@grsecurity.net>
      Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      45ec1690
    • D
      mm, thp: fix mlock statistics · 8449d21f
      David Rientjes committed
      NR_MLOCK is only accounted in single page units: there's no logic to
      handle transparent hugepages.  This patch checks the appropriate number of
      pages to adjust the statistics by so that the correct amount of memory is
      reflected.
      
      Currently:
      
      		$ grep Mlocked /proc/meminfo
      		Mlocked:           19636 kB
      
      	#define MAP_SIZE	(4UL << 30)	/* 4GB */
      
      	void *ptr = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE,
      			 MAP_PRIVATE | MAP_ANONYMOUS, 0, 0);
      	mlock(ptr, MAP_SIZE);
      
      		$ grep Mlocked /proc/meminfo
      		Mlocked:           29844 kB
      
      	munlock(ptr, MAP_SIZE);
      
      		$ grep Mlocked /proc/meminfo
      		Mlocked:           19636 kB
      
      And with this patch:
      
      		$ grep Mlock /proc/meminfo
      		Mlocked:           19636 kB
      
      	mlock(ptr, MAP_SIZE);
      
      		$ grep Mlock /proc/meminfo
      		Mlocked:         4213664 kB
      
      	munlock(ptr, MAP_SIZE);
      
      		$ grep Mlock /proc/meminfo
      		Mlocked:           19636 kB
      Signed-off-by: David Rientjes <rientjes@google.com>
      Reported-by: Hugh Dickins <hughd@google.com>
      Acked-by: Hugh Dickins <hughd@google.com>
      Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Michel Lespinasse <walken@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8449d21f
    • D
      mm, thp: fix mapped pages avoiding unevictable list on mlock · b676b293
      David Rientjes committed
      When a transparent hugepage is mapped and it is included in an mlock()
      range, follow_page() incorrectly avoids setting the page's mlock bit and
      moving it to the unevictable lru.
      
      This is evident if you try to mlock(), munlock(), and then mlock() a
      range again.  Currently:
      
      	#define MAP_SIZE	(4UL << 30)	/* 4GB */
      
      	void *ptr = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE,
      			 MAP_PRIVATE | MAP_ANONYMOUS, 0, 0);
      	mlock(ptr, MAP_SIZE);
      
      		$ grep -E "Unevictable|Inactive\(anon" /proc/meminfo
      		Inactive(anon):     6304 kB
      		Unevictable:     4213924 kB
      
      	munlock(ptr, MAP_SIZE);
      
      		Inactive(anon):  4186252 kB
      		Unevictable:       19652 kB
      
      	mlock(ptr, MAP_SIZE);
      
      		Inactive(anon):  4198556 kB
      		Unevictable:       21684 kB
      
      Notice that less than 2MB was added to the unevictable list; this is
      because these pages in the range are not transparent hugepages since the
      4GB range was allocated with mmap() and has no specific alignment.  If
      posix_memalign() were used instead, unevictable would not have grown at
      all on the second mlock().
      
      The fix is to call mlock_vma_page() so that the mlock bit is set and the
      page is added to the unevictable list.  With this patch:
      
      	mlock(ptr, MAP_SIZE);
      
      		Inactive(anon):     4056 kB
      		Unevictable:     4213940 kB
      
      	munlock(ptr, MAP_SIZE);
      
      		Inactive(anon):  4198268 kB
      		Unevictable:       19636 kB
      
      	mlock(ptr, MAP_SIZE);
      
      		Inactive(anon):     4008 kB
      		Unevictable:     4213940 kB
      Signed-off-by: David Rientjes <rientjes@google.com>
      Acked-by: Hugh Dickins <hughd@google.com>
      Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michel Lespinasse <walken@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b676b293
    • W
      memory-hotplug: update memory block's state and notify userspace · e90bdb7f
      Wen Congyang committed
      remove_memory() is called when a memory device is hot removed.  But
      even though it offlines the memory, userspace is not notified.  So this
      patch updates the memory block's state and sends a notification to
      userspace.
      
      Additionally, a memory device may contain more than one memory block.
      If one of the memory blocks has already been offlined, __offline_pages()
      will fail.  So we should offline one memory block at a time.
      
      Thus remove_memory() also checks each memory block's state, so there is
      no need to check the memory block's state before calling remove_memory().
      Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
      Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Jiang Liu <liuj97@gmail.com>
      Cc: Len Brown <len.brown@intel.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e90bdb7f
    • W
      memory-hotplug: preparation to notify memory block's state at memory hot remove · a16cee10
      Wen Congyang committed
      remove_memory() is called in two cases:
      1. echo offline >/sys/devices/system/memory/memoryXX/state
      2. hot remove a memory device
      
      In the 1st case, the memory block's state is changed and a notification
      of the state change is sent to userland after remove_memory() is
      called.  So the user can notice that the memory block has changed.
      
      But in the 2nd case, the memory block's state is not changed and no
      notification is sent to userspace even though remove_memory() is
      called.  So the user cannot notice that the memory block has changed.
      
      To prepare for adding the notification at memory hot remove, the patch
      makes the following changes:
      1st case uses offline_pages() for offlining memory.
      2nd case uses remove_memory() for offlining memory, changing the memory
          block's state, and sending the notification.
      
      The patch does not yet implement the notification in remove_memory().
      Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
      Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Jiang Liu <liuj97@gmail.com>
      Cc: Len Brown <len.brown@intel.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a16cee10
    • R
      mm: avoid section mismatch warning for memblock_type_name · c2233116
      Raghavendra D Prabhu committed
      The following section mismatch warning is emitted during the build:
      
          WARNING: vmlinux.o(.text+0x32408f): Section mismatch in reference from the function memblock_type_name() to the variable .meminit.data:memblock
          The function memblock_type_name() references
          the variable __meminitdata memblock.
          This is often because memblock_type_name lacks a __meminitdata
          annotation or the annotation of memblock is wrong.
      
      This is because memblock_type_name() references the memblock variable,
      which has the __meminitdata attribute.  Hence the warning (even if the
      function is inline).
      
      [akpm@linux-foundation.org: remove inline]
      Signed-off-by: Raghavendra D Prabhu <rprabhu@wnohang.net>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c2233116
    • M
      cma: decrease cc.nr_migratepages after reclaiming pagelist · beb51eaa
      Minchan Kim committed
      reclaim_clean_pages_from_list() reclaims clean pages before migration, so
      cc.nr_migratepages should be updated.  This causes no problem at present,
      but the value would be wrong if we tried to use it in the future.
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      beb51eaa
    • M
      CMA: migrate mlocked pages · e46a2879
      Minchan Kim committed
      Presently CMA cannot migrate mlocked pages so it ends up failing to allocate
      contiguous memory space.
      
      This patch makes mlocked pages be migrated out.  Of course, it can affect
      realtime processes but in CMA usecase, contiguous memory allocation failing
      is far worse than access latency to an mlocked page being variable while
      CMA is running.  If someone wants to make the system realtime, he shouldn't
      enable CMA because stalls can still happen at random times.
      
      [akpm@linux-foundation.org: tweak comment text, per Mel]
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e46a2879
    • R
    • H
      mm: remove unevictable_pgs_mlockfreed · 8befedfe
      Hugh Dickins committed
      Simply remove UNEVICTABLE_MLOCKFREED and unevictable_pgs_mlockfreed line
      from /proc/vmstat: Johannes and Mel point out that it was very unlikely to
      have been used by any tool, and of course we can restore it easily enough
      if that turns out to be wrong.
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Ying Han <yinghan@google.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8befedfe
    • M
      memory-hotplug: fix zone stat mismatch · 5a883813
      Minchan Kim committed
      During memory-hotplug, I found NR_ISOLATED_[ANON|FILE] are increasing,
      causing the kernel to hang.  When the system doesn't have enough free
      pages, it enters reclaim but never reclaims any pages because
      too_many_isolated() returns true, so it loops forever.
      
      The cause is that when we do memory-hotadd after memory-remove,
      __zone_pcp_update() clears a zone's ZONE_STAT_ITEMS in setup_pageset()
      although the vm_stat_diff of all CPUs still have values.
      
      In addition, when we offline all pages of the zone, we reset them in
      zone_pcp_reset() without draining, so we lose some zone stat items.
      Reviewed-by: Wen Congyang <wency@cn.fujitsu.com>
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5a883813
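Why resetting per-CPU deltas without draining skews a counter can be shown with a toy model. This sketch does not reproduce the kernel's vmstat code: toy_zone_stat and its helpers are hypothetical, with a global value plus one pending delta per CPU standing in for a zone counter and the per-CPU vm_stat_diff values.

```c
#define TOY_NR_CPUS 2

/* Hypothetical counter: a folded global value plus pending per-CPU deltas. */
struct toy_zone_stat {
	long global;
	long cpu_diff[TOY_NR_CPUS];
};

/* Record a change on one CPU; it stays pending until drained. */
static void toy_mod_stat(struct toy_zone_stat *z, int cpu, long delta)
{
	z->cpu_diff[cpu] += delta;
}

/* Fold every pending per-CPU delta into the global value. */
static void toy_drain(struct toy_zone_stat *z)
{
	for (int cpu = 0; cpu < TOY_NR_CPUS; cpu++) {
		z->global += z->cpu_diff[cpu];
		z->cpu_diff[cpu] = 0;
	}
}

/* Buggy reset: pending deltas are thrown away. */
static void toy_reset_no_drain(struct toy_zone_stat *z)
{
	for (int cpu = 0; cpu < TOY_NR_CPUS; cpu++)
		z->cpu_diff[cpu] = 0;
}

/* Safer reset: fold pending deltas first, then clear the per-CPU state. */
static void toy_reset_with_drain(struct toy_zone_stat *z)
{
	toy_drain(z);
	toy_reset_no_drain(z);
}
```

If an increment has already been folded into the global value while the matching decrement is still pending on another CPU, toy_reset_no_drain() leaves the global value permanently too high, which mirrors the ever-growing NR_ISOLATED counts described above.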
    • M
      mm: revert 0def08e3 ("mm/mempolicy.c: check return code of check_range") · 08270807
      Minchan Kim committed
      Revert commit 0def08e3 because, given its current use cases, check_range()
      cannot fail in migrate_to_node().
      
      Quote from Johannes
      
      : I think it makes sense to revert.  Not because of the semantics, but I
      : just don't see how check_range() could even fail for this callsite:
      :
      : 1. we pass mm->mmap->vm_start in there, so we should not fail due to
      :    find_vma()
      :
      : 2. we pass MPOL_MF_DISCONTIG_OK, so the discontig checks do not apply
      :    and so can not fail
      :
      : 3. we pass MPOL_MF_MOVE | MPOL_MF_MOVE_ALL, the page table loops will
      :    continue until addr == end, so we never fail with -EIO
      
      I also added a new VM_BUG_ON to catch any future migrate_to_node() use case
      that might pass MPOL_MF_STRICT.
      Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Vasiliy Kulikov <segooon@gmail.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      08270807
    • H
      mm: wrap calls to set_pte_at_notify with invalidate_range_start and invalidate_range_end · 6bdb913f
      Haggai Eran committed
      In order to allow sleeping during invalidate_page mmu notifier calls, we
      need to avoid calling when holding the PT lock.  In addition to its direct
      calls, invalidate_page can also be called as a substitute for a change_pte
      call, in case the notifier client hasn't implemented change_pte.
      
      This patch drops the invalidate_page call from change_pte, and instead
      wraps all calls to change_pte with invalidate_range_start and
      invalidate_range_end calls.
      
      Note that change_pte still cannot sleep after this patch, and that clients
      implementing change_pte should not take action on it in case the number of
      outstanding invalidate_range_start calls is larger than one, otherwise
      they might miss a later invalidation.
      Signed-off-by: Haggai Eran <haggaie@mellanox.com>
      Cc: Andrea Arcangeli <andrea@qumranet.com>
      Cc: Sagi Grimberg <sagig@mellanox.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
      Cc: Or Gerlitz <ogerlitz@mellanox.com>
      Cc: Haggai Eran <haggaie@mellanox.com>
      Cc: Shachar Raindel <raindel@mellanox.com>
      Cc: Liran Liss <liranl@mellanox.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Avi Kivity <avi@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6bdb913f
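The rule in the last paragraph of this commit message (a change_pte handler should not act while more than one invalidate_range_start is outstanding) might look like this in a notifier client. The sketch is hypothetical throughout: toy_mmu_client and the toy_* functions are not the mmu notifier API, and a real client would presumably also need locking and a way to drop or refault its mapping when the update is skipped.

```c
#include <stdbool.h>

/* Hypothetical notifier client state. */
struct toy_mmu_client {
	int active_invalidations;  /* outstanding range_start calls */
	unsigned long mapped_pte;  /* the client's secondary mapping */
};

static void toy_range_start(struct toy_mmu_client *c)
{
	c->active_invalidations++;
}

static void toy_range_end(struct toy_mmu_client *c)
{
	c->active_invalidations--;
}

/*
 * Apply a PTE change only when the surrounding start/end pair is the sole
 * outstanding invalidation. Returns true if the update was applied.
 */
static bool toy_change_pte(struct toy_mmu_client *c, unsigned long new_pte)
{
	if (c->active_invalidations > 1)
		return false;  /* another invalidation is in flight: do nothing */

	c->mapped_pte = new_pte;
	return true;
}
```

Since every change_pte call is now wrapped in its own start/end pair, a count of exactly one means no other invalidation overlaps it, and a higher count means a later invalidation could be missed if the client installed the new entry.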
    • S
      mm: move all mmu notifier invocations to be done outside the PT lock · 2ec74c3e
      Sagi Grimberg committed
      In order to allow sleeping during mmu notifier calls, we need to avoid
      invoking them under the page table spinlock.  This patch solves the
      problem by calling the invalidate_page notification after releasing the lock
      (but before freeing the page itself), or by wrapping the page invalidation
      with calls to invalidate_range_start and invalidate_range_end.
      
      To prevent accidental changes to the invalidate_range_end arguments after
      the call to invalidate_range_start, the patch introduces a convention of
      saving the arguments in consistently named locals:
      
      	unsigned long mmun_start;	/* For mmu_notifiers */
      	unsigned long mmun_end;	/* For mmu_notifiers */
      
      	...
      
      	mmun_start = ...
      	mmun_end = ...
      	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
      
      	...
      
      	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
      
      The patch changes code to use this convention for all calls to
      mmu_notifier_invalidate_range_start/end, except those where the calls are
      close enough so that anyone who glances at the code can see the values
      aren't changing.
      
      This patchset is a preliminary step towards an on-demand paging design to
      be added to the RDMA stack.
      
      Why do we want on-demand paging for InfiniBand?
      
        Applications register memory with an RDMA adapter using system calls,
        and subsequently post IO operations that refer to the corresponding
        virtual addresses directly to HW.  Until now, this was achieved by
        pinning the memory during the registration calls.  The goal of on-demand
        paging is to avoid pinning the pages of registered memory regions (MRs).
        This will allow users the same flexibility they get when swapping any
        other part of their processes' address spaces.  Instead of requiring the
        entire MR to fit in physical memory, we can allow the MR to be larger,
        and only fit the current working set in physical memory.
      
      Why should anyone care?  What problems are users currently experiencing?
      
        This can make programming with RDMA much simpler.  Today, developers
        that are working with more data than their RAM can hold need either to
        deregister and reregister memory regions throughout their process's
        life, or keep a single memory region and copy the data to it.  On-demand
        paging will allow these developers to register a single MR at the
        beginning of their process's life, and let the operating system manage
        which pages need to be fetched at a given time.  In the future, we
        might be able to provide a single memory access key for each process
        that would provide the entire process's address space as one large memory
        region, and the developers wouldn't need to register memory regions at
        all.
      
      Is there any prospect that any other subsystems will utilise these
      infrastructural changes?  If so, which and how, etc?
      
        As for other subsystems, I understand that XPMEM wanted to sleep in
        MMU notifiers, as Christoph Lameter wrote at
        http://lkml.indiana.edu/hypermail/linux/kernel/0802.1/0460.html and
        perhaps Andrea knows about other use cases.
      
        Scheduling in mmu notifications is required since we need to sync the
        hardware with changes to the secondary page tables.  A TLB flush of an IO
        device is inherently slower than a CPU TLB flush, so our design works by
        sending the invalidation request to the device, and waiting for an
        interrupt before exiting the mmu notifier handler.
      
      Avi said:
      
        kvm may be a buyer.  kvm::mmu_lock, which serializes guest page
        faults, also protects long operations such as destroying large ranges.
        It would be good to convert it into a spinlock, but as it is used inside
        mmu notifiers, this cannot be done.
      
        (there are alternatives, such as keeping the spinlock and using a
        generation counter to do the teardown in O(1), which is what the "may"
        is doing up there).
      
      [akpm@linux-foundation.org: possible speed tweak in hugetlb_cow(), cleanups]
      Signed-off-by: Andrea Arcangeli <andrea@qumranet.com>
      Signed-off-by: Sagi Grimberg <sagig@mellanox.com>
      Signed-off-by: Haggai Eran <haggaie@mellanox.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
      Cc: Or Gerlitz <ogerlitz@mellanox.com>
      Cc: Haggai Eran <haggaie@mellanox.com>
      Cc: Shachar Raindel <raindel@mellanox.com>
      Cc: Liran Liss <liranl@mellanox.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Avi Kivity <avi@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2ec74c3e
    • M
      hugetlb: do not use vma_hugecache_offset() for vma_prio_tree_foreach · 36e4f20a
      Committed by Michal Hocko
      Commit 0c176d52 ("mm: hugetlb: fix pgoff computation when unmapping
      page from vma") fixed the pgoff calculation but replaced it with
      vma_hugecache_offset(), which is not appropriate for offsets passed to
      vma_prio_tree_foreach() because that expects an index in page units
      rather than in huge_page_shift units.
      
      Johannes said:
      
      : The resulting index may not be too big, but it can be too small: assume
      : hpage size of 2M and the address to unmap to be 0x200000.  This is regular
      : page index 512 and hpage index 1.  If you have a VMA that maps the file
      : only starting at the second huge page, that VMA's vm_pgoff will be 512 but
      : you ask for offset 1 and miss it even though it does map the page of
      : interest.  hugetlb_cow() will try to unmap, miss the vma, and retry the
      : cow until the allocation succeeds or the skipped vma(s) go away.
      Signed-off-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: Hillf Danton <dhillf@gmail.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      36e4f20a
    • D
      mm, numa: reclaim from all nodes within reclaim distance · 957f822a
      Committed by David Rientjes
      RECLAIM_DISTANCE represents the distance between nodes at which it is
      deemed too costly to allocate from; it's preferred to try to reclaim from
      a local zone before falling back to allocating on a remote node with such
      a distance.
      
      To do this, zone_reclaim_mode is set if the distance between any two
      nodes on the system is greater than this distance.  This, however, ends
      up causing the page allocator to reclaim from every zone regardless of
      its affinity.
      
      What we really want is to reclaim only from zones that are closer than
      RECLAIM_DISTANCE.  This patch adds a nodemask to each node that
      represents the set of nodes that are within this distance.  During the
      zone iteration, if the bit for a zone's node is set for the local node,
      then reclaim is attempted; otherwise, the zone is skipped.
      
      [akpm@linux-foundation.org: fix CONFIG_NUMA=n build]
      Signed-off-by: David Rientjes <rientjes@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      957f822a
    • H
      mm: remove free_page_mlock · a0c5e813
      Committed by Hugh Dickins
      We should not be seeing non-0 unevictable_pgs_mlockfreed any longer.  So
      remove free_page_mlock() from the page freeing paths: __PG_MLOCKED is
      already in PAGE_FLAGS_CHECK_AT_FREE, so free_pages_check() will now be
      checking it, reporting "BUG: Bad page state" if it's ever found set.
      Add comments noting that UNEVICTABLE_MLOCKFREED and
      unevictable_pgs_mlockfreed are now always 0.
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Acked-by: Mel Gorman <mel@csn.ul.ie>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Ying Han <yinghan@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a0c5e813
    • H
      mm: use clear_page_mlock() in page_remove_rmap() · e6c509f8
      Committed by Hugh Dickins
      We had thought that pages could no longer get freed while still marked as
      mlocked; but Johannes Weiner posted this program to demonstrate that
      truncating an mlocked private file mapping containing COWed pages is still
      mishandled:
      
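      #include <sys/types.h>
      #include <sys/mman.h>
      #include <sys/stat.h>
      #include <stdlib.h>
      #include <unistd.h>
      #include <fcntl.h>
      #include <stdio.h>
      
      int main(void)
      {
      	char *map;
      	int fd;
      
      	/* Show the counter before the test. */
      	system("grep mlockfreed /proc/vmstat");
      	/* O_CREAT requires a mode argument. */
      	fd = open("chigurh", O_CREAT|O_EXCL|O_RDWR, 0600);
      	unlink("chigurh");
      	ftruncate(fd, 4096);
      	map = mmap(NULL, 4096, PROT_WRITE, MAP_PRIVATE, fd, 0);
      	map[0] = 11;	/* write fault: COW gives the mapping an anon page */
      	mlock(map, sizeof(fd));	/* any non-zero length locks the whole page */
      	ftruncate(fd, 0);	/* truncate while the COWed page is mlocked */
      	close(fd);
      	munlock(map, sizeof(fd));
      	munmap(map, 4096);
      	/* Show the counter again; an increase means a page was freed while mlocked. */
      	system("grep mlockfreed /proc/vmstat");
      	return 0;
      }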
      
      The anon COWed pages are not caught by truncation's clear_page_mlock() of
      the pagecache pages; but unmap_mapping_range() unmaps them, so we ought to
      look out for them there in page_remove_rmap().  Indeed, why should
      truncation or invalidation be doing the clear_page_mlock() when removing
      from pagecache?  mlock is a property of mapping in userspace, not a
      property of pagecache: an mlocked unmapped page is nonsensical.
      Reported-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Ying Han <yinghan@google.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e6c509f8
    • H
      mm: remove vma arg from page_evictable · 39b5f29a
      Committed by Hugh Dickins
      page_evictable(page, vma) is an irritant: almost all its callers pass
      NULL for vma.  Remove the vma arg and use mlocked_vma_newpage(vma, page)
      explicitly in the couple of places it's needed.  But in those places we
      don't even need page_evictable() itself!  They're dealing with a freshly
      allocated anonymous page, which has no "mapping" and cannot be mlocked yet.
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Acked-by: Mel Gorman <mel@csn.ul.ie>
      Cc: Rik van Riel <riel@redhat.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Ying Han <yinghan@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      39b5f29a
    • H
      mm: fix invalidate_complete_page2() lock ordering · ec4d9f62
      Committed by Hugh Dickins
      In fuzzing with trinity, lockdep protested "possible irq lock inversion
      dependency detected" when isolate_lru_page() reenabled interrupts while
      still holding the supposedly irq-safe tree_lock:
      
      invalidate_inode_pages2
        invalidate_complete_page2
          spin_lock_irq(&mapping->tree_lock)
          clear_page_mlock
            isolate_lru_page
              spin_unlock_irq(&zone->lru_lock)
      
      isolate_lru_page() is correct to enable interrupts unconditionally:
      invalidate_complete_page2() is incorrect to call clear_page_mlock() while
      holding tree_lock, which is supposed to nest inside lru_lock.
      
      Both truncate_complete_page() and invalidate_complete_page() call
      clear_page_mlock() before taking tree_lock to remove the page from the
      radix_tree.  I guess invalidate_complete_page2() preferred to test
      PageDirty (again) under tree_lock before committing to the munlock; but
      since the page has already been unmapped, its state is already somewhat
      inconsistent, and is no worse if clear_page_mlock() is moved up.
      Reported-by: Sasha Levin <levinsasha928@gmail.com>
      Deciphered-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Acked-by: Mel Gorman <mel@csn.ul.ie>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Ying Han <yinghan@google.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ec4d9f62
    • M
      memcg: move mem_cgroup_is_root upwards · 7ffc0edc
      Committed by Michal Hocko
      kmem code uses this function and it is better to not use forward
      declarations for static inline functions as some (older) compilers don't
      like it:
      
      gcc version 4.3.4 [gcc-4_3-branch revision 152973] (SUSE Linux)
      
        mm/memcontrol.c:421: warning: `mem_cgroup_is_root' declared inline after being called
        mm/memcontrol.c:421: warning: previous declaration of `mem_cgroup_is_root' was here
      Signed-off-by: Michal Hocko <mhocko@suse.cz>
      Cc: Glauber Costa <glommer@parallels.com>
      Cc: Sachin Kamat <sachin.kamat@linaro.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7ffc0edc
    • M
      memcg: cleanup kmem tcp ifdefs · 4bd2c1ee
      Committed by Michal Hocko
      TCP kmem accounting is currently guarded by CONFIG_MEMCG_KMEM ifdefs but
      the code is not used if !CONFIG_INET so we should rather test for both.
      The same applies to net/sock.h, net/ip.h and net/tcp_memcontrol.h but
      let's keep those outside of any ifdefs because it is considered safer with
      respect to future maintainability.
      
      Tested with
      - CONFIG_INET && CONFIG_MEMCG_KMEM
      - !CONFIG_INET && CONFIG_MEMCG_KMEM
      - CONFIG_INET && !CONFIG_MEMCG_KMEM
      - !CONFIG_INET && !CONFIG_MEMCG_KMEM
      Signed-off-by: Sachin Kamat <sachin.kamat@linaro.org>
      Signed-off-by: Michal Hocko <mhocko@suse.cz>
      Cc: Glauber Costa <glommer@parallels.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4bd2c1ee
    • J
      mm: fix-up zone present pages · 7f1290f2
      Committed by Jianguo Wu
      I think zone->present_pages indicates the pages that the buddy system can
      manage, so it should be:
      
      	zone->present_pages = spanned pages - absent pages - bootmem pages,
      
      but is now:
      	zone->present_pages = spanned pages - absent pages - memmap pages.
      
      spanned pages: total size, including holes.
      absent pages: holes.
      bootmem pages: pages used in system boot, managed by bootmem allocator.
      memmap pages: pages used by page structs.
      
      This may make zone->present_pages smaller than it should be.  For example,
      if NUMA node 1 has ZONE_NORMAL and ZONE_MOVABLE, its memmap and other
      bootmem are allocated from ZONE_MOVABLE, so ZONE_NORMAL's present_pages
      should be spanned pages - absent pages, but memmap pages are currently
      subtracted from it as well (in free_area_init_core()) even though they are
      actually allocated from ZONE_MOVABLE.  When all memory of a zone is
      offlined, zone->present_pages then drops below 0; because present_pages
      is an unsigned long, it wraps to a very large integer.  That in turn makes
      zone->watermark[WMARK_MIN] a large integer (setup_per_zone_wmarks()),
      then makes totalreserve_pages a large integer
      (calculate_totalreserve_pages()), and finally causes memory allocation to
      fail when forking a process (__vm_enough_memory()).
      
      [root@localhost ~]# dmesg
      -bash: fork: Cannot allocate memory
      
      I think the bug described in
      
        http://marc.info/?l=linux-mm&m=134502182714186&w=2
      
      is also caused by wrong zone present pages.
      
      This patch fixes up zone->present_pages when memory is freed to the
      buddy system on x86_64 and IA64 platforms.
      Signed-off-by: Jianguo Wu <wujianguo@huawei.com>
      Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
      Reported-by: Petr Tesarik <ptesarik@suse.cz>
      Tested-by: Petr Tesarik <ptesarik@suse.cz>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7f1290f2
    • R
      mm: enable CONFIG_COMPACTION by default · 05106e6a
      Committed by Rik van Riel
      Now that lumpy reclaim has been removed, compaction is the only way left
      to free up contiguous memory areas.  It is time to just enable
      CONFIG_COMPACTION by default.
      Signed-off-by: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Acked-by: Rafael Aquini <aquini@redhat.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      05106e6a