1. 04 Apr, 2014 40 commits
    • F
      sys_sysfs: Add CONFIG_SYSFS_SYSCALL · 6af9f7bf
      Fabian Frederick committed on
      sys_sysfs is an obsolete system call no longer supported by libc.
      
       - This patch adds a default CONFIG_SYSFS_SYSCALL=y
      
       - Option can be turned off in expert mode.
      
       - cond_syscall added to kernel/sys_ni.c
      
      [akpm@linux-foundation.org: tweak Kconfig help text]
      Signed-off-by: Fabian Frederick <fabf@skynet.be>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6af9f7bf
    • R
      include/linux/syscalls.h: add sys32_quotactl() prototype · e3a0cfdc
      Rashika Kheria committed on
      This eliminates the following warning in quota/compat.c:
      
        fs/quota/compat.c:43:17: warning: no previous prototype for `sys32_quotactl' [-Wmissing-prototypes]
      Signed-off-by: Rashika Kheria <rashika.kheria@gmail.com>
      Reviewed-by: Josh Triplett <josh@joshtriplett.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e3a0cfdc
    • R
      mm/readahead.c: fix readahead failure for memoryless NUMA nodes and limit readahead pages · 6d2be915
      Raghavendra K T committed on
      Currently max_sane_readahead() returns zero on a CPU whose NUMA node
      has no local memory, which leads to readahead failure.  Fix this
      by returning the minimum of (requested pages, 512).  Users running
      applications that need readahead, such as streaming applications, on a
      memoryless CPU see a considerable boost in performance.
      
      Result:
      
      fadvise experiment with FADV_WILLNEED on a PPC machine having memoryless
      CPU with 1GB testfile (12 iterations) yielded around 46.66% improvement.
      
      fadvise experiment with FADV_WILLNEED on a x240 machine with 1GB
      testfile 32GB* 4G RAM numa machine (12 iterations) showed no impact on
      the normal NUMA cases w/ patch.
      
        Kernel       Avg  Stddev
        base      7.4975   3.92%
        patched   7.4174   3.26%
      
      [Andrew: making return value PAGE_SIZE independent]
      Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
      Acked-by: Jan Kara <jack@suse.cz>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6d2be915
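The capping rule described in this entry can be sketched in user-space C as follows. This is only an illustration, not the kernel code: the `sketch_` names and `SKETCH_PAGE_SIZE` are hypothetical, and expressing the cap in bytes so that it does not depend on the page size is an assumption based on the "PAGE_SIZE independent" note.

```c
/* Assumed page size for this sketch; a real kernel takes it from the architecture. */
#define SKETCH_PAGE_SIZE 4096UL

/* Cap of 512 pages of 4 KiB (2 MiB in total), expressed in SKETCH_PAGE_SIZE units. */
#define SKETCH_MAX_READAHEAD ((512UL * 4096UL) / SKETCH_PAGE_SIZE)

/*
 * Return how many pages may be read ahead: the requested number, capped at
 * the maximum.  The result no longer depends on how much free memory the
 * local NUMA node has, so a memoryless node does not yield zero.
 */
unsigned long sketch_max_sane_readahead(unsigned long nr_requested)
{
    return nr_requested < SKETCH_MAX_READAHEAD ? nr_requested
                                               : SKETCH_MAX_READAHEAD;
}
```

With a 4 KiB page size, a request for 100 pages is returned unchanged, while a request for 100000 pages is capped at 512.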
    • V
      slub: do not drop slab_mutex for sysfs_slab_add · 421af243
      Vladimir Davydov committed on
      We release the slab_mutex while calling sysfs_slab_add from
      __kmem_cache_create since commit 66c4c35c ("slub: Do not hold
      slub_lock when calling sysfs_slab_add()"), because kobject_uevent called
      by sysfs_slab_add might block waiting for the usermode helper to exec,
      which would result in a deadlock if we took the slab_mutex while
      executing it.
      
      However, apart from complicating synchronization rules, releasing the
      slab_mutex on kmem cache creation can result in a kmemcg-related race.
      The point is that we check if the memcg cache exists before going to
      __kmem_cache_create, but register the new cache in memcg subsys after
      it.  Since we can drop the mutex there, several threads can see that the
      memcg cache does not exist and proceed to creating it, which is wrong.
      
      Fortunately, recently kobject_uevent was patched to call the usermode
      helper with the UMH_NO_WAIT flag, making the deadlock impossible.
      Therefore there is no point in releasing the slab_mutex while calling
      sysfs_slab_add, so let's simplify kmem_cache_create synchronization and
      fix the kmemcg-race mentioned above by holding the slab_mutex during the
      whole cache creation path.
      Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
      Acked-by: Christoph Lameter <cl@linux.com>
      Cc: Greg KH <greg@kroah.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      421af243
    • V
      kobject: don't block for each kobject_uevent · bcccff93
      Vladimir Davydov committed on
      Currently kobject_uevent has somewhat unpredictable semantics.  The
      point is, since it may call a usermode helper and wait for it to execute
      (UMH_WAIT_EXEC), it is impossible to say for sure what lock dependencies
      it will introduce for the caller - strictly speaking it depends on what
      fs the binary is located on and the set of locks fork may take.  There
      are quite a few users of kobject_uevent that do not take this into
      account and call it with various mutexes held, e.g. rtnl_mutex and
      net_mutex, which might potentially lead to a deadlock.
      
      Since there is actually no reason to wait for the usermode helper to
      execute there, let's make kobject_uevent start the helper asynchronously
      with the aid of the UMH_NO_WAIT flag.
      
      Personally, I'm interested in this, because I really want kobject_uevent
      to be called under the slab_mutex in the slub implementation as it used
      to be some time ago, because it greatly simplifies synchronization and
      automatically fixes a kmemcg-related race.  However, there was a
      deadlock detected on an attempt to call kobject_uevent under the
      slab_mutex (see https://lkml.org/lkml/2012/1/14/45), which was reported
      to be fixed by releasing the slab_mutex for kobject_uevent.
      
      Unfortunately, there was no information about who exactly blocked on the
      slab_mutex causing the usermode helper to stall, neither have I managed
      to find this out or reproduce the issue.
      
      BTW, this is not the first attempt to make kobject_uevent use
      UMH_NO_WAIT.  A previous one was made by commit f520360d ("kobject:
      don't block for each kobject_uevent"), but it was wrong (it passed
      arguments allocated on the stack to an async thread), so it was reverted
      in 05f54c13 ("Revert "kobject: don't block for each kobject_uevent".").
      That attempt was aimed at speeding up the boot process, though.
      Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
      Cc: Greg KH <greg@kroah.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bcccff93
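The flaw attributed to the earlier attempt (stack-allocated arguments handed to an asynchronous thread) is a general lifetime problem: the caller's stack frame may be gone by the time the helper reads the arguments. A minimal user-space sketch of the usual remedy, copying the argument vector to the heap before handing it over, might look like this. The `sketch_` helpers are hypothetical and are not taken from the kernel patch.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: deep-copy a NULL-terminated argument vector to the heap. */
char **sketch_dup_argv(char *const argv[])
{
    size_t n = 0;
    while (argv[n] != NULL)
        n++;

    /* calloc zero-fills, so the terminating slot copy[n] starts out as NULL. */
    char **copy = calloc(n + 1, sizeof(*copy));
    if (copy == NULL)
        return NULL;

    for (size_t i = 0; i < n; i++) {
        size_t len = strlen(argv[i]) + 1;
        copy[i] = malloc(len);
        if (copy[i] == NULL) {
            /* Undo the partial copy on allocation failure. */
            while (i > 0)
                free(copy[--i]);
            free(copy);
            return NULL;
        }
        memcpy(copy[i], argv[i], len);
    }
    return copy;
}

/* Release a vector produced by sketch_dup_argv(); the async side would call this. */
void sketch_free_argv(char **argv)
{
    if (argv == NULL)
        return;
    for (size_t i = 0; argv[i] != NULL; i++)
        free(argv[i]);
    free(argv);
}
```

The asynchronous worker then owns the copy and frees it when done, so the caller can return immediately without waiting.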
    • D
      drop_caches: add some documentation and info message · 5509a5d2
      Dave Hansen committed on
      There is plenty of anecdotal evidence and a load of blog posts
      suggesting that using "drop_caches" periodically keeps your system
      running in "tip top shape".  Perhaps adding some kernel documentation
      will increase the amount of accurate data on its use.
      
      If we are not shrinking caches effectively, then we have real bugs.
      Using drop_caches will simply mask the bugs and make them harder to
      find, but certainly does not fix them, nor is it an appropriate
      "workaround" to limit the size of the caches.  On the contrary, there
      have been bug reports on issues that turned out to be misguided use of
      cache dropping.
      
      Dropping caches is a very drastic and disruptive operation that is good
      for debugging and running tests, but if it creates bug reports from
      production use, kernel developers should be aware of its use.
      
      Add a bit more documentation about it, a syslog message to track down
      abusers, and vmstat drop counters to help analyze problem reports.
      
      [akpm@linux-foundation.org: checkpatch fixes]
      [hannes@cmpxchg.org: add runtime suppression control]
      Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com>
      Signed-off-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5509a5d2
    • S
      mm: remove read_cache_page_async() · 67f9fd91
      Sasha Levin committed on
      This patch removes read_cache_page_async() which wasn't really needed
      anywhere and simplifies the code around it a bit.
      
      read_cache_page_async() is useful when we want to read a page into the
      cache without waiting for it to complete.  This happens when the
      appropriate callback 'filler' doesn't complete its read operation and
      releases the page lock immediately, and instead queues a different
      completion routine to do that.  This never actually happened anywhere in
      the code.
      
      read_cache_page_async() had 3 different callers:
      
      - read_cache_page() which is the sync version, it would just wait for
        the requested read to complete using wait_on_page_read().
      
      - JFFS2 would call it from jffs2_gc_fetch_page(), but the filler
        function it supplied doesn't do any async reads, and would complete
        before the filler function returns - making it actually a sync read.
      
      - CRAMFS would call it using the read_mapping_page_async() wrapper, with
        a similar story to JFFS2 - the filler function doesn't do anything that
        resembles async reads and would always complete before the filler function
        returns.
      
      To sum it up, the code in mm/filemap.c never took advantage of having
      read_cache_page_async().  While there are filler callbacks that do async
      reads (such as the block one), we always called them through
      read_cache_page().
      
      This patch adds a mandatory wait for read to complete when adding a new
      page to the cache, and removes read_cache_page_async() and its wrappers.
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      67f9fd91
    • K
      mm, thp: drop do_huge_pmd_wp_zero_page_fallback() · e9b71ca9
      Kirill A. Shutemov committed on
      I've realized that there's no need for do_huge_pmd_wp_zero_page_fallback().
      We can just split the zero page with split_huge_page_pmd() and return
      VM_FAULT_FALLBACK.  handle_pte_fault() will handle the write-protection
      fault for us.
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e9b71ca9
    • K
      mm: consolidate code to setup pte · 3bb97794
      Kirill A. Shutemov committed on
      Extract and consolidate the code that sets up the pte from do_read_fault(),
      do_cow_fault() and do_shared_fault().
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3bb97794
    • K
      mm: consolidate code to call vm_ops->page_mkwrite() · fb09a464
      Kirill A. Shutemov committed on
      There are two functions which need to call vm_ops->page_mkwrite():
      do_shared_fault() and do_wp_page().  We can consolidate preparation
      code.
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      fb09a464
    • K
      mm: introduce do_shared_fault() and drop do_fault() · f0c6d4d2
      Kirill A. Shutemov committed on
      Introduce do_shared_fault().  The function does what do_fault() does for
      write faults to shared mappings.
      
      Unlike do_fault(), do_shared_fault() is relatively clean and
      straight-forward.
      
      Old do_fault() is not needed anymore.  Let it die.
      
      [lliubbo@gmail.com: fix NULL pointer dereference]
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: Bob Liu <bob.liu@oracle.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f0c6d4d2
    • K
      mm: introduce do_cow_fault() · ec47c3b9
      Kirill A. Shutemov committed on
      Introduce do_cow_fault().  The function does what do_fault() does for
      write page faults to private mappings.
      
      Unlike do_fault(), do_cow_fault() is relatively clean and
      straight-forward.
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ec47c3b9
    • K
      mm: introduce do_read_fault() · e655fb29
      Kirill A. Shutemov committed on
      Introduce do_read_fault().  The function does what do_fault() does for
      read page faults.
      
      Unlike do_fault(), do_read_fault() is pretty clean and straightforward.
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e655fb29
    • K
      mm: do_fault(): extract to call vm_ops->do_fault() to separate function · 7eae74af
      Kirill A. Shutemov committed on
      Extract the code that calls vm_ops->do_fault(), along with basic error
      handling, into a separate function.  The code will be reused.
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7eae74af
    • K
      mm: rename __do_fault() -> do_fault() · 80d7ef66
      Kirill A. Shutemov committed on
      Current __do_fault() is awful and unmaintainable.  These patches try to
      sort it out by splitting __do_fault() into three distinct codepaths:
      
       - to handle read page fault;
       - to handle write page fault to private mappings;
       - to handle write page fault to shared mappings;
      
      I also found page refcount leak in PageHWPoison() path of __do_fault().
      
      This patch (of 7):
      
      do_fault() is unused: no reason for underscores.
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      80d7ef66
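The three-way split described in this series can be pictured as a small dispatcher. The sketch below is a user-space model under stated assumptions: the enum, the function name, and the two boolean inputs are hypothetical stand-ins for the fault flags and mapping flags the kernel actually inspects.

```c
#include <stdbool.h>

/* Which of the three codepaths from the series would handle a fault. */
enum sketch_fault_path {
    SKETCH_READ_FAULT,   /* read page fault: do_read_fault()                 */
    SKETCH_COW_FAULT,    /* write fault, private mapping: do_cow_fault()     */
    SKETCH_SHARED_FAULT  /* write fault, shared mapping: do_shared_fault()   */
};

/* Pick the codepath from the access type and the kind of mapping. */
enum sketch_fault_path sketch_classify_fault(bool is_write, bool mapping_is_shared)
{
    if (!is_write)
        return SKETCH_READ_FAULT;   /* reads never need a private copy */
    if (!mapping_is_shared)
        return SKETCH_COW_FAULT;    /* private write: copy the page first */
    return SKETCH_SHARED_FAULT;     /* shared write: modify the page in place */
}
```

Keeping the decision in one place is what lets each of the three handlers stay short and free of the other cases' special handling.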
    • R
      include/linux/mm.h: remove ifdef condition · c558784f
      Rashika Kheria committed on
      The ifdef conditions in include/linux/mm.h present three cases:
      
       - !defined(CONFIG_HAVE_MEMBLOCK_NODE_MAP) && !defined(CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID)
      
         There is no actual definition of function but include/linux/mm.h has a
         static inline stub defined.
      
       - defined(CONFIG_HAVE_MEMBLOCK_NODE_MAP) && !defined(CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID)
      
         linux/mm.h does not define a prototype, but mm/page_alloc.c defines
         the function.
      
         Hence, compiler reports the following warning:
      
           mm/page_alloc.c:4300:15: warning: no previous prototype for `__early_pfn_to_nid' [-Wmissing-prototypes]
      
       - defined(CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID)
      
         The architecture defines the function, and linux/mm.h has a
         prototype.
      
      Thus, join the conditions of cases 2 and 3, i.e. eliminate the ifdef
      condition of CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID, to eliminate the missing
      prototype warning from file mm/page_alloc.c.
      Signed-off-by: Rashika Kheria <rashika.kheria@gmail.com>
      Reviewed-by: Josh Triplett <josh@joshtriplett.org>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c558784f
    • R
      mm/nobootmem.c: mark function as static · de498507
      Rashika Kheria committed on
      Mark function as static in nobootmem.c because it is not used outside
      this file.
      
      This eliminates the following warning in mm/nobootmem.c:
      
        mm/nobootmem.c:324:15: warning: no previous prototype for `___alloc_bootmem_node' [-Wmissing-prototypes]
      Signed-off-by: Rashika Kheria <rashika.kheria@gmail.com>
      Reviewed-by: Josh Triplett <josh@joshtriplett.org>
      Acked-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      de498507
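The -Wmissing-prototypes warnings quoted in these entries are raised when a non-static function is defined without a prior declaration. The two usual remedies, marking the function static or declaring a prototype first, can be sketched as below. The function names are hypothetical and only illustrate the pattern; they are not the kernel functions named in the warnings.

```c
/* Remedy 1: the helper is only used in this file, so give it internal linkage. */
static int sketch_local_helper(int x)
{
    return x + 1;
}

/*
 * Remedy 2: the function is used from other files, so a prototype must be
 * visible before the definition.  It would normally live in a shared header.
 */
int sketch_exported(int x);

int sketch_exported(int x)
{
    return sketch_local_helper(x) * 2;
}
```

Compiling this with `-Wmissing-prototypes` should produce no warning, whereas dropping either the `static` keyword or the standalone declaration would bring the warning back for that function.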
    • R
      mm/page_cgroup.c: mark functions as static · d20199e1
      Rashika Kheria committed on
      Mark functions as static in page_cgroup.c because they are not used
      outside this file.
      
      This eliminates the following warning in mm/page_cgroup.c:
      
        mm/page_cgroup.c:177:6: warning: no previous prototype for `__free_page_cgroup' [-Wmissing-prototypes]
        mm/page_cgroup.c:190:15: warning: no previous prototype for `online_page_cgroup' [-Wmissing-prototypes]
        mm/page_cgroup.c:225:15: warning: no previous prototype for `offline_page_cgroup' [-Wmissing-prototypes]
      Signed-off-by: Rashika Kheria <rashika.kheria@gmail.com>
      Reviewed-by: Josh Triplett <josh@joshtriplett.org>
      Acked-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d20199e1
    • R
      mm/process_vm_access.c: mark function as static · 2eb2e141
      Rashika Kheria committed on
      Mark function as static in process_vm_access.c because it is not used
      outside this file.
      
      This eliminates the following warning in mm/process_vm_access.c:
      
        mm/process_vm_access.c:416:1: warning: no previous prototype for `compat_process_vm_rw' [-Wmissing-prototypes]
      
      [akpm@linux-foundation.org: remove unneeded asmlinkage - compat_process_vm_rw isn't referenced from asm]
      Signed-off-by: Rashika Kheria <rashika.kheria@gmail.com>
      Reviewed-by: Josh Triplett <josh@joshtriplett.org>
      Acked-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2eb2e141
    • R
      mm/mmap.c: mark function as static · eafd4dc4
      Rashika Kheria committed on
      Mark function as static in mmap.c because it is not used outside this
      file.
      
      This eliminates the following warning in mm/mmap.c:
      
        mm/mmap.c:407:6: warning: no previous prototype for `validate_mm' [-Wmissing-prototypes]
      Signed-off-by: Rashika Kheria <rashika.kheria@gmail.com>
      Reviewed-by: Josh Triplett <josh@joshtriplett.org>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      eafd4dc4
    • R
      mm/memory.c: mark functions as static · b19a9939
      Rashika Kheria committed on
      Mark functions as static in memory.c because they are not used outside
      this file.
      
      This eliminates the following warnings in mm/memory.c:
      
        mm/memory.c:3530:5: warning: no previous prototype for `numa_migrate_prep' [-Wmissing-prototypes]
        mm/memory.c:3545:5: warning: no previous prototype for `do_numa_page' [-Wmissing-prototypes]
      Signed-off-by: Rashika Kheria <rashika.kheria@gmail.com>
      Reviewed-by: Josh Triplett <josh@joshtriplett.org>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b19a9939
    • R
      mm/compaction.c: mark function as static · 74e77fb9
      Rashika Kheria committed on
      Mark function as static in compaction.c because it is not used outside
      this file.
      
      This eliminates the following warning from mm/compaction.c:
      
        mm/compaction.c:1190:9: warning: no previous prototype for `sysfs_compact_node' [-Wmissing-prototypes]
      Signed-off-by: Rashika Kheria <rashika.kheria@gmail.com>
      Reviewed-by: Josh Triplett <josh@joshtriplett.org>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      74e77fb9
    • D
      mm, compaction: avoid isolating pinned pages · 119d6d59
      David Rientjes committed on
      Page migration will fail for memory that is pinned with, for
      example, get_user_pages().  In this case, it is unnecessary to take
      zone->lru_lock or to isolate the page and pass it to page migration,
      which will ultimately fail.
      
      This is a racy check, the page can still change from under us, but in
      that case we'll just fail later when attempting to move the page.
      
      This avoids very expensive memory compaction when faulting transparent
      hugepages after pinning a lot of memory with a Mellanox driver.
      
      On a 128GB machine and pinning ~120GB of memory, before this patch we
      see the enormous disparity in the number of page migration failures
      because of the pinning (from /proc/vmstat):
      
      	compact_pages_moved 8450
      	compact_pagemigrate_failed 15614415
      
      0.05% of pages isolated are successfully migrated and explicitly
      triggering memory compaction takes 102 seconds.  After the patch:
      
      	compact_pages_moved 9197
      	compact_pagemigrate_failed 7
      
      99.9% of pages isolated are now successfully migrated in this
      configuration and memory compaction takes less than one second.
      Signed-off-by: David Rientjes <rientjes@google.com>
      Acked-by: Hugh Dickins <hughd@google.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Greg Thelen <gthelen@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      119d6d59
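The entry does not spell out the check itself. One plausible shape for such a cheap, racy test is to compare a page's total reference count with the number of references that come from page-table mappings: extra references suggest that something is pinning the page. The sketch below models that idea in user space. The struct, its fields, and the exact condition are assumptions made for illustration and are not quoted from the patch.

```c
#include <stdbool.h>

/* Hypothetical, simplified view of the per-page state such a check would read. */
struct sketch_page {
    bool has_mapping;  /* true for file-backed (page cache) pages        */
    int  refcount;     /* total number of references held on the page    */
    int  mapcount;     /* references that come from page-table mappings  */
};

/*
 * Return true when an anonymous page appears to be pinned, i.e. it has more
 * references than mappings.  No lock is taken, so the answer may already be
 * stale when it is used; a wrong guess only means migration fails later.
 */
bool sketch_page_looks_pinned(const struct sketch_page *page)
{
    return !page->has_mapping && page->refcount > page->mapcount;
}
```

A compaction scanner built this way would skip pages for which the function returns true before taking any lock, which matches the drop in failed migrations reported above.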
    • D
      mm, hugetlb: mark some bootstrap functions as __init · f412c97a
      David Rientjes committed on
      Both prep_compound_huge_page() and prep_compound_gigantic_page() are
      only called at bootstrap and can be marked as __init.
      
      The __SetPageTail(page) in prep_compound_gigantic_page() happening
      before page->first_page is initialized is not concerning since this is
      bootstrap.
      Signed-off-by: David Rientjes <rientjes@google.com>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Reviewed-by: Davidlohr Bueso <davidlohr@hp.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f412c97a
    • J
      mm: keep page cache radix tree nodes in check · 449dd698
      Johannes Weiner committed on
      Previously, page cache radix tree nodes were freed after reclaim emptied
      out their page pointers.  But now reclaim stores shadow entries in their
      place, which are only reclaimed when the inodes themselves are
      reclaimed.  This is problematic for bigger files that are still in use
      after they have a significant amount of their cache reclaimed, without
      any of those pages actually refaulting.  The shadow entries will just
      sit there and waste memory.  In the worst case, the shadow entries will
      accumulate until the machine runs out of memory.
      
      To get this under control, the VM will track radix tree nodes
      exclusively containing shadow entries on a per-NUMA node list.  Per-NUMA
      rather than global because we expect the radix tree nodes themselves to
      be allocated node-locally and we want to reduce cross-node references of
      otherwise independent cache workloads.  A simple shrinker will then
      reclaim these nodes on memory pressure.
      
      A few things need to be stored in the radix tree node to implement the
      shadow node LRU and allow tree deletions coming from the list:
      
      1. There is no index available that would describe the reverse path
         from the node up to the tree root, which is needed to perform a
         deletion.  To solve this, encode in each node its offset inside the
         parent.  This can be stored in the unused upper bits of the same
         member that stores the node's height at no extra space cost.
      
      2. The number of shadow entries needs to be counted in addition to the
         regular entries, to quickly detect when the node is ready to go to
         the shadow node LRU list.  The current entry count is an unsigned
         int but the maximum number of entries is 64, so a shadow counter
         can easily be stored in the unused upper bits.
      
      3. Tree modification needs tree lock and tree root, which are located
         in the address space, so store an address_space backpointer in the
         node.  The parent pointer of the node is in a union with the 2-word
         rcu_head, so the backpointer comes at no extra cost as well.
      
      4. The node needs to be linked to an LRU list, which requires a list
         head inside the node.  This does increase the size of the node, but
         it does not change the number of objects that fit into a slab page.
      
      [akpm@linux-foundation.org: export the right function]
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Reviewed-by: Minchan Kim <minchan@kernel.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Bob Liu <bob.liu@oracle.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Luigi Semenzato <semenzato@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Metin Doslu <metin@citusdata.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Ozgun Erdogan <ozgun@citusdata.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Roman Gushchin <klamm@yandex-team.ru>
      Cc: Ryan Mallon <rmallon@gmail.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      449dd698
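Point 2 above, keeping a shadow counter in the unused upper bits of the entry count, can be illustrated with a small bit-packing sketch. The shift width and all `sketch_` names are assumptions: with at most 64 entries per node, 7 low bits are enough for the regular count, and the remaining bits of a 32-bit word are free for the shadow count.

```c
#include <stdint.h>
#include <stdbool.h>

/* Low 7 bits hold the regular entry count (0..64); the upper bits hold shadows. */
#define SKETCH_COUNT_SHIFT 7u
#define SKETCH_COUNT_MASK  ((1u << SKETCH_COUNT_SHIFT) - 1u)

/* Combine both counters into one word. */
static inline uint32_t sketch_pack(uint32_t entries, uint32_t shadows)
{
    return (shadows << SKETCH_COUNT_SHIFT) | (entries & SKETCH_COUNT_MASK);
}

/* Number of regular entries stored in the packed word. */
static inline uint32_t sketch_entries(uint32_t packed)
{
    return packed & SKETCH_COUNT_MASK;
}

/* Number of shadow entries stored in the packed word. */
static inline uint32_t sketch_shadows(uint32_t packed)
{
    return packed >> SKETCH_COUNT_SHIFT;
}

/* A node qualifies for the shadow-node list when it holds shadow entries only. */
static inline bool sketch_only_shadows(uint32_t packed)
{
    return sketch_entries(packed) == 0 && sketch_shadows(packed) != 0;
}
```

Because both counters share one word, checking whether a node has become shadow-only is a pair of mask-and-shift operations and costs no extra space in the node.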
    • J
      lib: radix_tree: tree node interface · 139e5616
      Johannes Weiner committed on
      Make struct radix_tree_node part of the public interface and provide API
      functions to create, look up, and delete whole nodes.  Refactor the
      existing insert, look up, delete functions on top of these new node
      primitives.
      
      This will allow the VM to track and garbage collect page cache radix
      tree nodes.
      
      [sasha.levin@oracle.com: return correct error code on insertion failure]
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Bob Liu <bob.liu@oracle.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Luigi Semenzato <semenzato@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Metin Doslu <metin@citusdata.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Ozgun Erdogan <ozgun@citusdata.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Roman Gushchin <klamm@yandex-team.ru>
      Cc: Ryan Mallon <rmallon@gmail.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      139e5616
    • J
      mm: thrash detection-based file cache sizing · a528910e
      Johannes Weiner committed on
      The VM maintains cached filesystem pages on two types of lists.  One
      list holds the pages recently faulted into the cache, the other list
      holds pages that have been referenced repeatedly on that first list.
      The idea is to prefer reclaiming young pages over those that have shown
      to benefit from caching in the past.  We call the recently used [...] but
      ultimately was not significantly better than a FIFO policy and still
      thrashed cache based on eviction speed, rather than actual demand for
      cache.
      
      This patch solves one half of the problem by decoupling the ability to
      detect working set changes from the inactive list size.  By maintaining
      a history of recently evicted file pages it can detect frequently used
      pages with an arbitrarily small inactive list size, and subsequently
      apply pressure on the active list based on actual demand for cache, not
      just overall eviction speed.
      
      Every zone maintains a counter that tracks inactive list aging speed.
      When a page is evicted, a snapshot of this counter is stored in the
      now-empty page cache radix tree slot.  On refault, the minimum access
      distance of the page can be assessed, to evaluate whether the page
      should be part of the active list or not.
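      
      The counter snapshot and refault check described above can be sketched
      roughly as follows.  This is a minimal user-space illustration, not the
      patch's code: the struct, its field names and the helper names are
      hypothetical, and the comparison against the active list size is an
      assumption that the text above does not spell out.

```c
#include <stdbool.h>

/* Hypothetical stand-in for a zone; field names are illustrative only. */
struct zone_sketch {
    unsigned long inactive_age;   /* advances as the inactive list ages */
    unsigned long nr_active_file; /* current size of the active list */
};

/*
 * On eviction: return the current counter value as the snapshot that
 * would be stored in the now-empty page cache slot, then advance it.
 */
static unsigned long evict_snapshot(struct zone_sketch *z)
{
    return z->inactive_age++;
}

/*
 * On refault: the unsigned difference between the counter now and the
 * snapshot is the minimum access distance, even across wraparound.
 */
static unsigned long refault_distance(const struct zone_sketch *z,
                                      unsigned long snapshot)
{
    return z->inactive_age - snapshot;
}

/* Assumed policy: activate when the distance fits in the active list. */
static bool should_activate(const struct zone_sketch *z,
                            unsigned long snapshot)
{
    return refault_distance(z, snapshot) <= z->nr_active_file;
}
```

      A page that refaults soon after its eviction yields a small distance
      and is activated; one that returns after many further evictions is not.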
      
      This fixes the VM's blindness towards working set changes in excess of
      the inactive list.  And it's the foundation to further improve the
      protection ability and reduce the minimum inactive list size of 50%.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Reviewed-by: Minchan Kim <minchan@kernel.org>
      Reviewed-by: Bob Liu <bob.liu@oracle.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Luigi Semenzato <semenzato@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Metin Doslu <metin@citusdata.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Ozgun Erdogan <ozgun@citusdata.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Roman Gushchin <klamm@yandex-team.ru>
      Cc: Ryan Mallon <rmallon@gmail.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a528910e
    • J
      mm + fs: store shadow entries in page cache · 91b0abe3
      Johannes Weiner committed
      Reclaim will be leaving shadow entries in the page cache radix tree upon
      evicting the real page.  As those pages are found from the LRU, an
      iput() can lead to the inode being freed concurrently.  At this point,
      reclaim must no longer install shadow pages because the inode freeing
      code needs to ensure the page tree is really empty.
      
      Add an address_space flag, AS_EXITING, that the inode freeing code sets
      under the tree lock before doing the final truncate.  Reclaim will check
      for this flag before installing shadow pages.
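      
      A rough sketch of the check described above, with locking omitted.  The
      struct and function names are hypothetical; a single boolean stands in
      for the AS_EXITING flag and one pointer for a radix tree slot.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical one-slot mapping; the real address_space is far richer. */
struct mapping_sketch {
    bool exiting; /* stands in for AS_EXITING */
    void *slot;   /* stands in for one radix tree slot */
};

/* Inode freeing side: set the flag before the final truncate. */
static void mark_exiting(struct mapping_sketch *m)
{
    m->exiting = true;
}

/*
 * Reclaim side: the page leaves the slot.  A shadow entry replaces it
 * only while the mapping is not exiting; otherwise the slot is left
 * empty so that the final truncate finds an empty tree.
 */
static void evict_page(struct mapping_sketch *m, void *shadow)
{
    m->slot = m->exiting ? NULL : shadow;
}
```
      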
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Reviewed-by: Minchan Kim <minchan@kernel.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Bob Liu <bob.liu@oracle.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Luigi Semenzato <semenzato@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Metin Doslu <metin@citusdata.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Ozgun Erdogan <ozgun@citusdata.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Roman Gushchin <klamm@yandex-team.ru>
      Cc: Ryan Mallon <rmallon@gmail.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      91b0abe3
    • J
      mm + fs: prepare for non-page entries in page cache radix trees · 0cd6144a
      Johannes Weiner committed
      shmem mappings already contain exceptional entries where swap slot
      information is remembered.
      
      To be able to store eviction information for regular page cache, prepare
      every site dealing with the radix trees directly to handle entries other
      than pages.
      
      The common lookup functions will filter out non-page entries and return
      NULL for page cache holes, just as before.  But provide a raw version of
      the API which returns non-page entries as well, and switch shmem over to
      use it.
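      
      The split between a filtering lookup and a raw lookup might look
      roughly like this.  It is an illustrative sketch over a flat array: the
      tag bit value, the helper names and the array-based "tree" are all
      assumptions made for the example, not the kernel's definitions.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative tag bit; aligned page pointers never have it set. */
#define EXCEPTIONAL_BIT 2u

/* Encode a small value as a tagged, non-page slot entry. */
static void *make_exceptional(unsigned long value)
{
    return (void *)(uintptr_t)((value << 2) | EXCEPTIONAL_BIT);
}

static bool is_exceptional(const void *entry)
{
    return ((uintptr_t)entry & EXCEPTIONAL_BIT) != 0;
}

/* Raw lookup: returns whatever the slot holds, page or not. */
static void *lookup_entry(void **slots, size_t nslots, size_t index)
{
    return index < nslots ? slots[index] : NULL;
}

/* Common lookup: non-page entries are reported as holes (NULL). */
static void *lookup_page(void **slots, size_t nslots, size_t index)
{
    void *entry = lookup_entry(slots, nslots, index);

    return is_exceptional(entry) ? NULL : entry;
}
```

      Callers that only care about pages keep seeing NULL for such slots,
      while a user like shmem can call the raw variant and decode the entry.
      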
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Reviewed-by: Minchan Kim <minchan@kernel.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Bob Liu <bob.liu@oracle.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Luigi Semenzato <semenzato@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Metin Doslu <metin@citusdata.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Ozgun Erdogan <ozgun@citusdata.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Roman Gushchin <klamm@yandex-team.ru>
      Cc: Ryan Mallon <rmallon@gmail.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0cd6144a
    • J
      mm: filemap: move radix tree hole searching here · e7b563bb
      Johannes Weiner committed
      The radix tree hole searching code is only used for page cache, for
      example the readahead code trying to get a picture of the area
      surrounding a fault.
      
      It sufficed to rely on the radix tree definition of holes, which is
      "empty tree slot".  But this is about to change, as shadow page
      descriptors will be stored in the page cache after the actual pages get
      evicted from memory.
      
      Move the functions over to mm/filemap.c and make them native page cache
      operations, where they can later be adapted to handle the new definition
      of "page cache hole".
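      
      For illustration, a forward hole search under the old definition (a
      hole is an empty slot) could be sketched as below.  The function name,
      the flat array and the return convention for "no hole found" are
      assumptions of this example rather than the moved kernel code.

```c
#include <stddef.h>

/*
 * Return the first index in [start, start + max_scan) whose slot is
 * empty.  Indices beyond the array count as holes.  If every scanned
 * slot is occupied, return start + max_scan, which lies outside the
 * scanned range.
 */
static size_t next_hole_sketch(void *const *slots, size_t nslots,
                               size_t start, size_t max_scan)
{
    size_t i;

    for (i = 0; i < max_scan; i++) {
        size_t index = start + i;

        if (index >= nslots || slots[index] == NULL)
            return index;
    }
    return start + max_scan;
}
```

      Once the search is a page cache operation, only the emptiness test
      would need to change when shadow entries also count as holes.
      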
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Reviewed-by: Minchan Kim <minchan@kernel.org>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Bob Liu <bob.liu@oracle.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Luigi Semenzato <semenzato@google.com>
      Cc: Metin Doslu <metin@citusdata.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Ozgun Erdogan <ozgun@citusdata.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Roman Gushchin <klamm@yandex-team.ru>
      Cc: Ryan Mallon <rmallon@gmail.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e7b563bb
    • J
      mm: shmem: save one radix tree lookup when truncating swapped pages · 6dbaf22c
      Johannes Weiner committed
      Page cache radix tree slots are usually stabilized by the page lock, but
      shmem's swap cookies have no such thing.  Because the overall truncation
      loop is lockless, the swap entry is currently confirmed by a tree lookup
      and then deleted by another tree lookup under the same tree lock region.
      
      Use radix_tree_delete_item() instead, which does the verification and
      deletion with only one lookup.  This also allows removing the
      delete-only special case from shmem_radix_tree_replace().
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Minchan Kim <minchan@kernel.org>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Bob Liu <bob.liu@oracle.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Luigi Semenzato <semenzato@google.com>
      Cc: Metin Doslu <metin@citusdata.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Ozgun Erdogan <ozgun@citusdata.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Roman Gushchin <klamm@yandex-team.ru>
      Cc: Ryan Mallon <rmallon@gmail.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6dbaf22c
    • J
      lib: radix-tree: add radix_tree_delete_item() · 53c59f26
      Johannes Weiner committed
      Provide a function that does not just delete an entry at a given index,
      but also allows passing in an expected item.  Delete only if that item
      is still located at the specified index.
      
      This is handy when lockless tree traversals want to delete entries as
      well because they don't have to do a second, locked lookup to verify
      the slot has not changed under them before deleting the entry.
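      
      The compare-then-delete semantics can be illustrated on a flat array
      standing in for the tree.  The function below is a hypothetical sketch,
      not radix_tree_delete_item() itself; treating a NULL expected item as
      "delete unconditionally" is an assumption of the example.

```c
#include <stddef.h>

/*
 * Delete the entry at index only if it is still the expected item.
 * Returns the deleted entry, or NULL if nothing was deleted.
 * Assumption of this sketch: expected == NULL deletes unconditionally.
 */
static void *delete_item_sketch(void **slots, size_t nslots,
                                size_t index, void *expected)
{
    void *entry;

    if (index >= nslots)
        return NULL;

    entry = slots[index];
    if (entry == NULL)
        return NULL;
    if (expected != NULL && entry != expected)
        return NULL; /* the slot changed since the lockless lookup */

    slots[index] = NULL;
    return entry;
}
```

      A lockless walker remembers the entry it saw, takes the lock, and makes
      this one call instead of a lookup followed by a separate delete.
      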
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Minchan Kim <minchan@kernel.org>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Bob Liu <bob.liu@oracle.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Luigi Semenzato <semenzato@google.com>
      Cc: Metin Doslu <metin@citusdata.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Ozgun Erdogan <ozgun@citusdata.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Roman Gushchin <klamm@yandex-team.ru>
      Cc: Ryan Mallon <rmallon@gmail.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      53c59f26
    • J
      fs: cachefiles: use add_to_page_cache_lru() · 55881bc7
      Johannes Weiner committed
      This code used to have its own lru cache pagevec up until a0b8cab3 ("mm:
      remove lru parameter from __pagevec_lru_add and remove parts of pagevec
      API").  Now it's just add_to_page_cache() followed by lru_cache_add(),
      might as well use add_to_page_cache_lru() directly.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Reviewed-by: Minchan Kim <minchan@kernel.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Bob Liu <bob.liu@oracle.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Luigi Semenzato <semenzato@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Metin Doslu <metin@citusdata.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Ozgun Erdogan <ozgun@citusdata.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Roman Gushchin <klamm@yandex-team.ru>
      Cc: Ryan Mallon <rmallon@gmail.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: David Howells <dhowells@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      55881bc7
    • J
      mm: vmstat: fix UP zone state accounting · 6a3ed212
      Johannes Weiner committed
      Summary:
      
      The VM maintains cached filesystem pages on two types of lists.  One
      list holds the pages recently faulted into the cache, the other list
      holds pages that have been referenced repeatedly on that first list.
      The idea is to prefer reclaiming young pages over those that have shown
      to benefit from caching in the past.  We call the recently used list
      "inactive list" and the frequently used list "active list".
      
      Currently, the VM aims for a 1:1 ratio between the lists, which is the
      "perfect" trade-off between the ability to *protect* frequently used
      pages and the ability to *detect* frequently used pages.  This means
      that working set changes bigger than half of cache memory go undetected
      and thrash indefinitely, whereas working sets bigger than half of cache
      memory are unprotected against used-once streams that don't even need
      caching.
      
      This happens on file servers and media streaming servers, where the
      popular files and file sections change over time.  Even though the
      individual files might be smaller than half of memory, concurrent access
      to many of them may still result in their inter-reference distance being
      greater than half of memory.  It's also been reported as a problem on
      database workloads that switch back and forth between tables that are
      bigger than half of memory.  In these cases the VM never recognizes the
      new working set and will for the remainder of the workload thrash disk
      data which could easily live in memory.
      
      Historically, every reclaim scan of the inactive list also took a
      smaller number of pages from the tail of the active list and moved them
      to the head of the inactive list.  This model gave established working
      sets more grace time in the face of temporary use-once streams, but
      ultimately was not significantly better than a FIFO policy and still
      thrashed cache based on eviction speed, rather than actual demand for
      cache.
      
      This series solves the problem by maintaining a history of pages evicted
      from the inactive list, enabling the VM to detect frequently used pages
      regardless of inactive list size and facilitate working set transitions.
      
      Tests:
      
      The reported database workload is easily demonstrated on an 8G machine
      with two filesets of 6G each.  This fio workload operates on one set first,
      then switches to the other.  The VM should obviously always cache the
      set that the workload is currently using.
      
      This test is based on a problem encountered by Citus Data customers:
        http://citusdata.com/blog/72-linux-memory-manager-and-your-big-data
      
      unpatched:
        db1: READ: io=98304MB, aggrb=885559KB/s, minb=885559KB/s, maxb=885559KB/s, mint= 113672msec, maxt= 113672msec
        db2: READ: io=98304MB, aggrb= 66169KB/s, minb= 66169KB/s, maxb= 66169KB/s, mint=1521302msec, maxt=1521302msec
        sdb: ios=835750/4, merge=2/1, ticks=4659739/60016, in_queue=4719203, util=98.92%
      
        real    27m15.541s
        user    0m19.059s
        sys     0m51.459s
      
      patched:
        db1: READ: io=98304MB, aggrb=877783KB/s, minb=877783KB/s, maxb=877783KB/s, mint=114679msec, maxt=114679msec
        db2: READ: io=98304MB, aggrb=397449KB/s, minb=397449KB/s, maxb=397449KB/s, mint=253273msec, maxt=253273msec
        sdb: ios=170587/4, merge=2/1, ticks=954910/61123, in_queue=1015923, util=90.40%
      
        real    6m8.630s
        user    0m14.714s
        sys     0m31.233s
      
      As can be seen, the unpatched kernel simply never adapts to the
      workingset change and db2 is stuck indefinitely with secondary storage
      speed.  The patched kernel needs 2-3 iterations over db2 before it
      replaces db1 and reaches full memory speed.  Given the unbounded
      negative effect of the existing VM behavior, these patches should be
      considered correctness fixes rather than performance optimizations.
      
      Another test resembles a fileserver or streaming server workload, where
      data in excess of memory size is accessed at different frequencies.
      There is very hot data accessed at a high frequency.  Machines should be
      fitted so that the hot set of such a workload can be fully cached or all
      bets are off.  Then there is a very big (compared to available memory)
      set of data that is used-once or at a very low frequency; this is what
      drives the inactive list and does not really benefit from caching.
      Lastly, there is a big set of warm data in between that is accessed at
      medium frequencies and benefits from caching the pages between the first
      and last streamer of each burst.
      
      unpatched:
         hot: READ: io=128000MB, aggrb=160693KB/s, minb=160693KB/s, maxb=160693KB/s, mint=815665msec, maxt=815665msec
        warm: READ: io= 81920MB, aggrb=109853KB/s, minb= 27463KB/s, maxb= 29244KB/s, mint=717110msec, maxt=763617msec
        cold: READ: io= 30720MB, aggrb= 35245KB/s, minb= 35245KB/s, maxb= 35245KB/s, mint=892530msec, maxt=892530msec
         sdb: ios=797960/4, merge=11763/1, ticks=4307910/796, in_queue=4308380, util=100.00%
      
      patched:
         hot: READ: io=128000MB, aggrb=160678KB/s, minb=160678KB/s, maxb=160678KB/s, mint=815740msec, maxt=815740msec
        warm: READ: io= 81920MB, aggrb=147747KB/s, minb= 36936KB/s, maxb= 40960KB/s, mint=512000msec, maxt=567767msec
        cold: READ: io= 30720MB, aggrb= 40960KB/s, minb= 40960KB/s, maxb= 40960KB/s, mint=768000msec, maxt=768000msec
         sdb: ios=596514/4, merge=9341/1, ticks=2395362/997, in_queue=2396484, util=79.18%
      
      In both kernels, the hot set is propagated to the active list and then
      served from cache.
      
      In both kernels, the beginning of the warm set is propagated to the
      active list as well, but in the unpatched case the active list
      eventually takes up half of memory and no new pages from the warm set
      get activated, despite repeated access, and despite most of the active
      list soon being stale.  The patched kernel on the other hand detects the
      thrashing and manages to keep this cache window rolling through the data
      set.  This frees up enough IO bandwidth that the cold set is served at
      full speed as well and disk utilization even drops by 20%.
      
      For reference, this same test was performed with the traditional
      demotion mechanism, where deactivation is coupled to inactive list
      reclaim.  However, this had the same outcome as the unpatched kernel:
      while the warm set does indeed get activated continuously, it is forced
      out of the active list by inactive list pressure, which is dictated
      primarily by the unrelated cold set.  The warm set is evicted before
      subsequent streamers can benefit from it, even though there would be
      enough space available to cache the pages of interest.
      
      Costs:
      
      Page reclaim used to shrink the radix trees but now the tree nodes are
      reused for shadow entries, where the cost depends heavily on the page
      cache access patterns.  However, with workloads that maintain spatial or
      temporal locality, the shadow entries are either refaulted quickly or
      reclaimed along with the inode object itself.  Workloads that will
      experience a memory cost increase are those that don't really benefit
      from caching in the first place.
      
      A more predictable alternative would be a fixed-cost separate pool of
      shadow entries, but this would incur relatively higher memory cost for
      well-behaved workloads for the benefit of corner cases.  It would also
      make the shadow entry lookup more costly compared to storing them
      directly in the cache structure.
      
      Future:
      
      To simplify the merging process, this patch set is implementing thrash
      detection on a global per-zone level only for now, but the design is
      such that it can be extended to memory cgroups as well.  All we need to
      do is store the unique cgroup ID along the node and zone identifier
      inside the eviction cookie to identify the lruvec.
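      
      One way such a cookie could be packed is sketched below.  The bit
      widths, the struct and the function names are hypothetical choices for
      the example; the series itself only states that the identifiers are
      stored alongside each other in the eviction cookie.

```c
#include <stdint.h>

/* Assumed field widths, chosen only for this illustration. */
#define ZONE_BITS  2u
#define NODE_BITS  6u
#define MEMCG_BITS 16u

struct cookie_fields {
    uint64_t eviction; /* eviction counter snapshot */
    unsigned memcg_id; /* unique cgroup ID */
    unsigned node;     /* NUMA node identifier */
    unsigned zone;     /* zone identifier within the node */
};

/* Pack all fields into one word; eviction keeps the remaining 40 bits. */
static uint64_t pack_cookie(struct cookie_fields f)
{
    uint64_t v = f.eviction;

    v = (v << MEMCG_BITS) | (f.memcg_id & ((1u << MEMCG_BITS) - 1));
    v = (v << NODE_BITS) | (f.node & ((1u << NODE_BITS) - 1));
    v = (v << ZONE_BITS) | (f.zone & ((1u << ZONE_BITS) - 1));
    return v;
}

/* Reverse of pack_cookie(): peel the fields off from the low bits up. */
static struct cookie_fields unpack_cookie(uint64_t v)
{
    struct cookie_fields f;

    f.zone = (unsigned)(v & ((1u << ZONE_BITS) - 1));
    v >>= ZONE_BITS;
    f.node = (unsigned)(v & ((1u << NODE_BITS) - 1));
    v >>= NODE_BITS;
    f.memcg_id = (unsigned)(v & ((1u << MEMCG_BITS) - 1));
    v >>= MEMCG_BITS;
    f.eviction = v;
    return f;
}
```

      Every bit given to an identifier is taken from the eviction counter, so
      wider IDs shorten the history that a cookie can represent.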
      
      Right now we have a fixed ratio (50:50) between inactive and active list
      but we already have complaints about working sets exceeding half of
      memory being pushed out of the cache by simple streaming in the
      background.  Ultimately, we want to adjust this ratio and allow for a
      much smaller inactive list.  These patches are an essential step in this
      direction because they decouple the VM's ability to detect working set
      changes from the inactive list size.  This would allow us to base the
      inactive list size on the combined readahead window size for example and
      potentially protect a much bigger working set.
      
      It's also a big step towards activating pages with a reuse distance
      larger than memory, as long as they are the most frequently used pages
      in the workload.  This will require knowing more about the access
      frequency of active pages than what we measure right now, so it's also
      deferred in this series.
      
      Another possibility opened up by thrashing information would be to revisit
      the idea of local reclaim in the form of zero-config memory control
      groups.  Instead of having allocating tasks go straight to global
      reclaim, they could try to reclaim the pages in the memcg they are part
      of first as long as the group is not thrashing.  This would allow a user
      to drop e.g.  a back-up job in an otherwise unconfigured memcg and it
      would only inflate (and possibly do global reclaim) until it has enough
      memory to do proper readahead.  But once it reaches that point and stops
      thrashing it would just recycle its own used-once pages without kicking
      out the cache of any other tasks in the system more than necessary.
      
      This patch (of 10):
      
      Fengguang Wu's build testing spotted problems with inc_zone_state() and
      dec_zone_state() on UP configurations in out-of-tree patches.
      
      inc_zone_state() is declared but not defined, dec_zone_state() is
      missing entirely.
      
      Just like with *_zone_page_state(), they can be defined like their
      preemption-unsafe counterparts on UP.
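      
      In spirit, the fix amounts to aliasing the regular names to the plain
      variants on UP.  The sketch below uses hypothetical stand-in types and
      function names so that it compiles outside the kernel; it is not the
      actual vmstat header.

```c
/* Illustrative stand-ins; the kernel's real types and items differ. */
enum up_stat_item { UP_NR_FILE_PAGES, UP_NR_STAT_ITEMS };

struct up_zone {
    long vm_stat[UP_NR_STAT_ITEMS];
};

/* Plain variants: the caller is responsible for any needed exclusion. */
static inline void unsafe_inc_zone_state(struct up_zone *zone,
                                         enum up_stat_item item)
{
    zone->vm_stat[item]++;
}

static inline void unsafe_dec_zone_state(struct up_zone *zone,
                                         enum up_stat_item item)
{
    zone->vm_stat[item]--;
}

/* On UP, the regular names are defined as the plain variants. */
#define inc_zone_state unsafe_inc_zone_state
#define dec_zone_state unsafe_dec_zone_state
```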
      
      [akpm@linux-foundation.org: make it build]
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Bob Liu <bob.liu@oracle.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Luigi Semenzato <semenzato@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Metin Doslu <metin@citusdata.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Ozgun Erdogan <ozgun@citusdata.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Roman Gushchin <klamm@yandex-team.ru>
      Cc: Ryan Mallon <rmallon@gmail.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6a3ed212
    • V
      mm: vmscan: shrink_slab: rename max_pass -> freeable · d5bc5fd3
      Vladimir Davydov committed
      The name `max_pass' is misleading, because this variable actually keeps
      the estimated number of freeable objects, not the maximal number of
      objects we can scan in this pass, which can be twice that.  Rename it to
      reflect its actual meaning.
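      
      The distinction between the two quantities can be shown with a small
      hypothetical helper: the scan target for a pass is capped at twice the
      freeable estimate, so the estimate itself is not the scan maximum.  The
      function name is invented for this sketch.

```c
/*
 * Cap the number of objects to scan in one pass at twice the estimated
 * number of freeable objects.
 */
static unsigned long clamp_scan_sketch(unsigned long total_scan,
                                       unsigned long freeable)
{
    unsigned long limit = freeable * 2;

    return total_scan > limit ? limit : total_scan;
}
```
      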
      Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d5bc5fd3
    • D
      mm, hugetlb: improve page-fault scalability · 8382d914
      Davidlohr Bueso committed
      The kernel can currently only handle a single hugetlb page fault at a
      time.  This is due to a single mutex that serializes the entire path.
      This lock protects from spurious OOM errors under conditions of low
      availability of free hugepages.  This problem is specific to hugepages,
      because it is normal to want to use every single hugepage in the system
      - with normal pages we simply assume there will always be a few spare
      pages which can be used temporarily until the race is resolved.
      
      Address this problem by using a table of mutexes, allowing a better
      chance of parallelization, where each hugepage is individually
      serialized.  The hash key is selected depending on the mapping type.
      For shared ones it consists of the address space and file offset being
      faulted; while for private ones the mm and virtual address are used.
      The size of the table is selected based on a compromise of collisions
      and memory footprint of a series of database workloads.
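      
      The key selection can be sketched as a function that maps a fault to a
      slot in the mutex table.  Everything here is illustrative: the table
      size, the ad hoc mixing function and the parameter names are
      assumptions, and the address and offset are taken to be already
      expressed in hugepage units.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed table size; a power of two so that masking works. */
#define FAULT_MUTEX_TABLE_SIZE 64u

/* Ad hoc 64-bit mixer, used only to spread keys for this example. */
static uint32_t mix64(uint64_t x)
{
    x ^= x >> 33;
    x *= 0xff51afd7ed558ccdULL;
    x ^= x >> 33;
    return (uint32_t)x;
}

/*
 * Shared mappings hash (address space, file offset); private mappings
 * hash (mm, virtual address).  The result indexes the mutex table.
 */
static uint32_t fault_mutex_index(bool shared,
                                  const void *mapping, uint64_t file_offset,
                                  const void *mm, uint64_t vaddr)
{
    uint64_t owner = (uint64_t)(uintptr_t)(shared ? mapping : mm);
    uint64_t where = shared ? file_offset : vaddr;

    return mix64(owner ^ mix64(where)) & (FAULT_MUTEX_TABLE_SIZE - 1);
}
```

      Faults on the same hugepage always map to the same slot and therefore
      serialize, while faults on different hugepages usually take different
      mutexes and can proceed in parallel.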
      
      Large database workloads that make heavy use of hugepages can be
      particularly exposed to this issue, causing start-up times to be
      painfully slow.  This patch reduces the startup time of a 10 Gb Oracle
      DB (with ~5000 faults) from 37.5 secs to 25.7 secs.  Larger workloads
      will naturally benefit even more.
      
      NOTE:
      The only downside to this patch, detected by Joonsoo Kim, is that a
      small race is possible in private mappings: A child process (with its
      own mm, after cow) can instantiate a page that is already being handled
      by the parent in a cow fault.  When low on pages, this can trigger
      spurious OOMs.  I have not been able to think of an efficient way of
      handling this...  but do we really care about such a tiny window? We
      already maintain another theoretical race with normal pages.  If we do,
      one possible way is to maintain the single hash for private mappings --
      any workloads that *really* suffer from this scaling problem should
      already use shared mappings.
      
      [akpm@linux-foundation.org: remove stray + characters, go BUG if hugetlb_init() kmalloc fails]
      Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: David Gibson <david@gibson.dropbear.id.au>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8382d914
    • J
      mm, hugetlb: use vma_resv_map() map types · 4e35f483
      Joonsoo Kim committed
      Until now, we get a resv_map in two ways, depending on the mapping type.
      This makes the code dirty and unreadable.  Unify it.
      
      [davidlohr@hp.com: code cleanups]
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>
      Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4e35f483
    • J
      mm, hugetlb: remove resv_map_put · f031dd27
      Joonsoo Kim committed
      This is a preparation patch to unify the use of vma_resv_map()
      regardless of the map type.  This patch prepares it by removing
      resv_map_put(), which only works for HPAGE_RESV_OWNER's resv_map, not
      for all resv_maps.
      
      [davidlohr@hp.com: update changelog]
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>
      Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f031dd27
    • D
      mm, hugetlb: fix race in region tracking · 7b24d861
      Davidlohr Bueso committed
      There is a race condition if the same file is mapped by different processes.
      Region tracking is protected by mmap_sem and hugetlb_instantiation_mutex.
      When we do mmap, we don't grab a hugetlb_instantiation_mutex, but only
      mmap_sem (exclusively).  This doesn't prevent other tasks from modifying
      the region structure, so it can be modified by two processes
      concurrently.
      
      To solve this, introduce a spinlock to resv_map and make the region
      manipulation functions grab it before they do the actual work.
      
      [davidlohr@hp.com: updated changelog]
      Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Suggested-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: David Gibson <david@gibson.dropbear.id.au>
      Cc: David Gibson <david@gibson.dropbear.id.au>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7b24d861
    • J
      mm, hugetlb: improve, cleanup resv_map parameters · 1406ec9b
      Joonsoo Kim committed
      To change the protection method for region tracking to a fine-grained
      one, we pass the resv_map, instead of the list_head, to the region
      manipulation functions.
      
      This introduces no functional change; it only prepares for the next
      step.
      
      [davidlohr@hp.com: update changelog]
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>
      Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1406ec9b