1. 22 3月, 2012 40 次提交
    • A
      mm/memcontrol.c: remove redundant BUG_ON() in mem_cgroup_usage_unregister_event() · 45f3e385
      Anton Vorontsov 提交于
      In the following code:
      
      	if (type == _MEM)
      		thresholds = &memcg->thresholds;
      	else if (type == _MEMSWAP)
      		thresholds = &memcg->memsw_thresholds;
      	else
      		BUG();
      
      	BUG_ON(!thresholds);
      
      The BUG_ON() seems redundant.
      Signed-off-by: NAnton Vorontsov <anton.vorontsov@linaro.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      45f3e385
    • A
      mm/memcontrol.c: s/stealed/stolen/ · 13fd1dd9
      Andrew Morton 提交于
      A grammatical fix.
      
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      13fd1dd9
    • K
      memcg: fix performance of mem_cgroup_begin_update_page_stat() · 4331f7d3
      KAMEZAWA Hiroyuki 提交于
      mem_cgroup_begin_update_page_stat() should be very fast because it's
      called very frequently.  Now, it needs to look up page_cgroup and its
      memcg....this is slow.
      
      This patch adds a global variable to check "any memcg is moving or not".
      With this, the caller doesn't need to visit page_cgroup and memcg.
      
      Here is a test result.  A test program makes page faults onto a file,
      MAP_SHARED and makes each page's page_mapcount(page) > 1, and free the
      range by madvise() and page fault again.  This program causes 26214400
      times of page fault onto a file(size was 1G.) and shows shows the cost of
      mem_cgroup_begin_update_page_stat().
      
      Before this patch for mem_cgroup_begin_update_page_stat()
      
          [kamezawa@bluextal test]$ time ./mmap 1G
      
          real    0m21.765s
          user    0m5.999s
          sys     0m15.434s
      
          27.46%     mmap  mmap               [.] reader
          21.15%     mmap  [kernel.kallsyms]  [k] page_fault
           9.17%     mmap  [kernel.kallsyms]  [k] filemap_fault
           2.96%     mmap  [kernel.kallsyms]  [k] __do_fault
           2.83%     mmap  [kernel.kallsyms]  [k] __mem_cgroup_begin_update_page_stat
      
      After this patch
      
          [root@bluextal test]# time ./mmap 1G
      
          real    0m21.373s
          user    0m6.113s
          sys     0m15.016s
      
      In usual path, calls to __mem_cgroup_begin_update_page_stat() goes away.
      
      Note: we may be able to remove this optimization in future if
            we can get pointer to memcg directly from struct page.
      
      [akpm@linux-foundation.org: don't return a void]
      Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: NGreg Thelen <gthelen@google.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Ying Han <yinghan@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4331f7d3
    • K
      memcg: remove PCG_FILE_MAPPED · 2ff76f11
      KAMEZAWA Hiroyuki 提交于
      With the new lock scheme for updating memcg's page stat, we don't need a
      flag PCG_FILE_MAPPED which was duplicated information of page_mapped().
      
      [hughd@google.com: cosmetic fix]
      [hughd@google.com: add comment to MEM_CGROUP_CHARGE_TYPE_MAPPED case in __mem_cgroup_uncharge_common()]
      Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: NGreg Thelen <gthelen@google.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Ying Han <yinghan@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2ff76f11
    • K
      memcg: use new logic for page stat accounting · 89c06bd5
      KAMEZAWA Hiroyuki 提交于
      Now, page-stat-per-memcg is recorded into per page_cgroup flag by
      duplicating page's status into the flag.  The reason is that memcg has a
      feature to move a page from a group to another group and we have race
      between "move" and "page stat accounting",
      
      Under current logic, assume CPU-A and CPU-B.  CPU-A does "move" and CPU-B
      does "page stat accounting".
      
      When CPU-A goes 1st,
      
                  CPU-A                           CPU-B
                                          update "struct page" info.
          move_lock_mem_cgroup(memcg)
          see pc->flags
          copy page stat to new group
          overwrite pc->mem_cgroup.
          move_unlock_mem_cgroup(memcg)
                                          move_lock_mem_cgroup(mem)
                                          set pc->flags
                                          update page stat accounting
                                          move_unlock_mem_cgroup(mem)
      
      stat accounting is guarded by move_lock_mem_cgroup() and "move" logic
      (CPU-A) doesn't see changes in "struct page" information.
      
      But it's costly to have the same information both in 'struct page' and
      'struct page_cgroup'.  And, there is a potential problem.
      
      For example, assume we have PG_dirty accounting in memcg.
      PG_..is a flag for struct page.
      PCG_ is a flag for struct page_cgroup.
      (This is just an example. The same problem can be found in any
       kind of page stat accounting.)
      
      	  CPU-A                               CPU-B
            TestSet PG_dirty
            (delay)                        TestClear PG_dirty
                                           if (TestClear(PCG_dirty))
                                                memcg->nr_dirty--
            if (TestSet(PCG_dirty))
                memcg->nr_dirty++
      
      Here, memcg->nr_dirty = +1, this is wrong.  This race was reported by Greg
      Thelen <gthelen@google.com>.  Now, only FILE_MAPPED is supported but
      fortunately, it's serialized by page table lock and this is not real bug,
      _now_,
      
      If this potential problem is caused by having duplicated information in
      struct page and struct page_cgroup, we may be able to fix this by using
      original 'struct page' information.  But we'll have a problem in "move
      account"
      
      Assume we use only PG_dirty.
      
               CPU-A                   CPU-B
          TestSet PG_dirty
          (delay)                    move_lock_mem_cgroup()
                                     if (PageDirty(page))
                                            new_memcg->nr_dirty++
                                     pc->mem_cgroup = new_memcg;
                                     move_unlock_mem_cgroup()
          move_lock_mem_cgroup()
          memcg = pc->mem_cgroup
          new_memcg->nr_dirty++
      
      accounting information may be double-counted.  This was original reason to
      have PCG_xxx flags but it seems PCG_xxx has another problem.
      
      I think we need a bigger lock as
      
           move_lock_mem_cgroup(page)
           TestSetPageDirty(page)
           update page stats (without any checks)
           move_unlock_mem_cgroup(page)
      
      This fixes both of problems and we don't have to duplicate page flag into
      page_cgroup.  Please note: move_lock_mem_cgroup() is held only when there
      are possibility of "account move" under the system.  So, in most path,
      status update will go without atomic locks.
      
      This patch introduces mem_cgroup_begin_update_page_stat() and
      mem_cgroup_end_update_page_stat() both should be called at modifying
      'struct page' information if memcg takes care of it.  as
      
           mem_cgroup_begin_update_page_stat()
           modify page information
           mem_cgroup_update_page_stat()
           => never check any 'struct page' info, just update counters.
           mem_cgroup_end_update_page_stat().
      
      This patch is slow because we need to call begin_update_page_stat()/
      end_update_page_stat() regardless of accounted will be changed or not.  A
      following patch adds an easy optimization and reduces the cost.
      
      [akpm@linux-foundation.org: s/lock/locked/]
      [hughd@google.com: fix deadlock by avoiding stat lock when anon]
      Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Greg Thelen <gthelen@google.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Ying Han <yinghan@google.com>
      Signed-off-by: NHugh Dickins <hughd@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      89c06bd5
    • K
      memcg: remove PCG_MOVE_LOCK flag from page_cgroup · 312734c0
      KAMEZAWA Hiroyuki 提交于
      PCG_MOVE_LOCK is used for bit spinlock to avoid race between overwriting
      pc->mem_cgroup and page statistics accounting per memcg.  This lock helps
      to avoid the race but the race is very rare because moving tasks between
      cgroup is not a usual job.  So, it seems using 1bit per page is too
      costly.
      
      This patch changes this lock as per-memcg spinlock and removes
      PCG_MOVE_LOCK.
      
      If smaller lock is required, we'll be able to add some hashes but I'd like
      to start from this.
      Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: NGreg Thelen <gthelen@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Ying Han <yinghan@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      312734c0
    • K
      memcg: simplify move_account() check · 619d094b
      KAMEZAWA Hiroyuki 提交于
      In memcg, for avoiding take-lock-irq-off at accessing page_cgroup, a
      logic, flag + rcu_read_lock(), is used.  This works as following
      
           CPU-A                     CPU-B
                                   rcu_read_lock()
          set flag
                                   if(flag is set)
                                         take heavy lock
                                   do job.
          synchronize_rcu()        rcu_read_unlock()
          take heavy lock.
      
      In recent discussion, it's argued that using per-cpu value for this flag
      just complicates the code because 'set flag' is very rare.
      
      This patch changes 'flag' implementation from percpu to atomic_t.  This
      will be much simpler.
      Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: NGreg Thelen <gthelen@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Ying Han <yinghan@google.com>
      Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      619d094b
    • K
      memcg: remove EXPORT_SYMBOL(mem_cgroup_update_page_stat) · 9e335790
      KAMEZAWA Hiroyuki 提交于
      As described in the log, I guess EXPORT was for preparing dirty
      accounting.  But _now_, we don't need to export this.  Remove this for
      now.
      Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reviewed-by: NGreg Thelen <gthelen@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Ying Han <yinghan@google.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9e335790
    • K
      memcg: remove PCG_CACHE page_cgroup flag · b2402857
      KAMEZAWA Hiroyuki 提交于
      We record 'the page is cache' with the PCG_CACHE bit in page_cgroup.
      Here, "CACHE" means anonymous user pages (and SwapCache).  This doesn't
      include shmem.
      
      Considering callers, at charge/uncharge, the caller should know what the
      page is and we don't need to record it by using one bit per page.
      
      This patch removes PCG_CACHE bit and make callers of
      mem_cgroup_charge_statistics() to specify what the page is.
      
      About page migration: Mapping of the used page is not touched during migra
      tion (see page_remove_rmap) so we can rely on it and push the correct
      charge type down to __mem_cgroup_uncharge_common from end_migration for
      unused page.  The force flag was misleading was abused for skipping the
      needless page_mapped() / PageCgroupMigration() check, as we know the
      unused page is no longer mapped and cleared the migration flag just a few
      lines up.  But doing the checks is no biggie and it's not worth adding
      another flag just to skip them.
      
      [akpm@linux-foundation.org: checkpatch fixes]
      [hughd@google.com: fix PageAnon uncharging]
      Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ying Han <yinghan@google.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b2402857
    • K
      memcg: remove unnecessary thp check in page stat accounting · 0e79dedd
      KAMEZAWA Hiroyuki 提交于
      Commit e94c8a9c ("memcg: make mem_cgroup_split_huge_fixup() more
      efficient") removed move_lock_page_cgroup().  So we do not have to check
      PageTransHuge in mem_cgroup_update_page_stat() and fallback into the
      locked accounting because both move_account() and thp split are done
      with compound_lock so they cannot race.
      
      The race between update vs.  move is protected by mem_cgroup_stealed.
      
      PageTransHuge pages shouldn't appear in this code path currently because
      we are tracking only file pages at the moment but later we are planning
      to track also other pages (e.g.  mlocked ones).
      Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Reviewed-by: NAcked-by: Michal Hocko <mhocko@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Acked-by: Ying Han<yinghan@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0e79dedd
    • H
      memcg: remove redundant returns · 1f2b71f4
      Hugh Dickins 提交于
      Remove redundant returns from ends of functions, and one blank line.
      Signed-off-by: NHugh Dickins <hughd@google.com>
      Reviewed-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Acked-by: NKirill A. Shutemov <kirill@shutemov.name>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Acked-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1f2b71f4
    • H
      memcg: enum lru_list lru · f156ab93
      Hugh Dickins 提交于
      Mostly we use "enum lru_list lru": change those few "l"s to "lru"s.
      Signed-off-by: NHugh Dickins <hughd@google.com>
      Reviewed-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Acked-by: NKirill A. Shutemov <kirill@shutemov.name>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Acked-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f156ab93
    • H
      memcg: lru_size instead of MEM_CGROUP_ZSTAT · 1eb49272
      Hugh Dickins 提交于
      I never understood why we need a MEM_CGROUP_ZSTAT(mz, idx) macro to
      obscure the LRU counts.  For easier searching? So call it lru_size
      rather than bare count (lru_length sounds better, but would be wrong,
      since each huge page raises lru_size hugely).
      Signed-off-by: NHugh Dickins <hughd@google.com>
      Acked-by: NKirill A. Shutemov <kirill@shutemov.name>
      Acked-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1eb49272
    • H
      memcg: replace mem and mem_cont stragglers · d79154bb
      Hugh Dickins 提交于
      Replace mem and mem_cont stragglers in memcontrol.c by memcg.
      Signed-off-by: NHugh Dickins <hughd@google.com>
      Reviewed-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Acked-by: NKirill A. Shutemov <kirill@shutemov.name>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Acked-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d79154bb
    • S
      swap: don't do discard if no discard option added · 052b1987
      Shaohua Li 提交于
      When swapon() was not passed the SWAP_FLAG_DISCARD option, sys_swapon()
      will still perform a discard operation.  This can cause problems if
      discard is slow or buggy.
      
      Reverse the order of the check so that a discard operation is performed
      only if the sys_swapon() caller is attempting to enable discard.
      Signed-off-by: NShaohua Li <shli@fusionio.com>
      Reported-by: NHolger Kiehl <Holger.Kiehl@dwd.de>
      Tested-by: NHolger Kiehl <Holger.Kiehl@dwd.de>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      052b1987
    • K
      mm: forbid lumpy-reclaim in shrink_active_list() · 1480de03
      Konstantin Khlebnikov 提交于
      Reset the reclaim mode in shrink_active_list() to RECLAIM_MODE_SINGLE |
      RECLAIM_MODE_ASYNC.  (sync/async sign is used only in shrink_page_list
      and does not affect shrink_active_list)
      
      Currenly shrink_active_list() sometimes works in lumpy-reclaim mode, if
      RECLAIM_MODE_LUMPYRECLAIM is left over from an earlier
      shrink_inactive_list().  Meanwhile, in age_active_anon()
      sc->reclaim_mode is totally zero.  So the current behavior is too
      complex and confusing, and this looks like bug.
      
      In general, shrink_active_list() populates the inactive list for the
      next shrink_inactive_list().  Lumpy shring_inactive_list() isolates
      pages around the chosen one from both the active and inactive lists.
      So, there is no reason for lumpy isolation in shrink_active_list().
      
      See also: https://lkml.org/lkml/2012/3/15/583Signed-off-by: NKonstantin Khlebnikov <khlebnikov@openvz.org>
      Proposed-by: NHugh Dickins <hughd@google.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1480de03
    • K
      mmap.c: fix comment for __insert_vm_struct() · 88f6b4c3
      Kautuk Consul 提交于
      The comment above __insert_vm_struct seems to suggest that this function
      is also going to link the VMA with the anon_vma, but this is not true.
      This function only links the VMA to the mm->mm_rb tree and the mm->mmap
      linked list.
      
      [akpm@linux-foundation.org: improve comment layout and text]
      Signed-off-by: NKautuk Consul <consul.kautuk@gmail.com>
      Acked-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      88f6b4c3
    • K
      page_alloc: remove unused find_zone_movable_pfns_for_nodes() argument · b224ef85
      Kautuk Consul 提交于
      find_zone_movable_pfns_for_nodes() does not use its argument.
      Signed-off-by: NKautuk Consul <consul.kautuk@gmail.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b224ef85
    • K
      page_alloc.c: remove add_from_early_node_map() · 8d13bddd
      Kautuk Consul 提交于
      add_from_early_node_map() is unused.
      Signed-off-by: NKautuk Consul <consul.kautuk@gmail.com>
      Acked-by: NDavid Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8d13bddd
    • S
      hugetlbfs: fix alignment of huge page requests · 40716e29
      Steven Truelove 提交于
      When calling shmget() with SHM_HUGETLB, shmget aligns the request size to
      PAGE_SIZE, but this is not sufficient.
      
      Modify hugetlb_file_setup() to align requests to the huge page size, and
      to accept an address argument so that all alignment checks can be
      performed in hugetlb_file_setup(), rather than in its callers.  Change
      newseg() and mmap_pgoff() to match the new prototype and eliminate a now
      redundant alignment check.
      
      [akpm@linux-foundation.org: fix build]
      Signed-off-by: NSteven Truelove <steven.truelove@utoronto.ca>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      40716e29
    • D
      mm, counters: fold __sync_task_rss_stat() into sync_mm_rss() · ea48cf78
      David Rientjes 提交于
      There's no difference between sync_mm_rss() and __sync_task_rss_stat(),
      so fold the latter into the former.
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      Acked-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ea48cf78
    • D
      mm, counters: remove task argument to sync_mm_rss() and __sync_task_rss_stat() · 05af2e10
      David Rientjes 提交于
      sync_mm_rss() can only be used for current to avoid race conditions in
      iterating and clearing its per-task counters.  Remove the task argument
      for it and its helper function, __sync_task_rss_stat(), to avoid thinking
      it can be used safely for anything other than current.
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      Acked-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      05af2e10
    • D
      hugepages: fix use after free bug in "quota" handling · 90481622
      David Gibson 提交于
      hugetlbfs_{get,put}_quota() are badly named.  They don't interact with the
      general quota handling code, and they don't much resemble its behaviour.
      Rather than being about maintaining limits on on-disk block usage by
      particular users, they are instead about maintaining limits on in-memory
      page usage (including anonymous MAP_PRIVATE copied-on-write pages)
      associated with a particular hugetlbfs filesystem instance.
      
      Worse, they work by having callbacks to the hugetlbfs filesystem code from
      the low-level page handling code, in particular from free_huge_page().
      This is a layering violation of itself, but more importantly, if the
      kernel does a get_user_pages() on hugepages (which can happen from KVM
      amongst others), then the free_huge_page() can be delayed until after the
      associated inode has already been freed.  If an unmount occurs at the
      wrong time, even the hugetlbfs superblock where the "quota" limits are
      stored may have been freed.
      
      Andrew Barry proposed a patch to fix this by having hugepages, instead of
      storing a pointer to their address_space and reaching the superblock from
      there, had the hugepages store pointers directly to the superblock,
      bumping the reference count as appropriate to avoid it being freed.
      Andrew Morton rejected that version, however, on the grounds that it made
      the existing layering violation worse.
      
      This is a reworked version of Andrew's patch, which removes the extra, and
      some of the existing, layering violation.  It works by introducing the
      concept of a hugepage "subpool" at the lower hugepage mm layer - that is a
      finite logical pool of hugepages to allocate from.  hugetlbfs now creates
      a subpool for each filesystem instance with a page limit set, and a
      pointer to the subpool gets added to each allocated hugepage, instead of
      the address_space pointer used now.  The subpool has its own lifetime and
      is only freed once all pages in it _and_ all other references to it (i.e.
      superblocks) are gone.
      
      subpools are optional - a NULL subpool pointer is taken by the code to
      mean that no subpool limits are in effect.
      
      Previous discussion of this bug found in:  "Fix refcounting in hugetlbfs
      quota handling.". See:  https://lkml.org/lkml/2011/8/11/28 or
      http://marc.info/?l=linux-mm&m=126928970510627&w=1
      
      v2: Fixed a bug spotted by Hillf Danton, and removed the extra parameter to
      alloc_huge_page() - since it already takes the vma, it is not necessary.
      Signed-off-by: NAndrew Barry <abarry@cray.com>
      Signed-off-by: NDavid Gibson <david@gibson.dropbear.id.au>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Hillf Danton <dhillf@gmail.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      90481622
    • B
      ksm: cleanup: introduce find_mergeable_vma() · ef694222
      Bob Liu 提交于
      There are multiple places which perform the same check.  Add a new
      find_mergeable_vma() to handle this.
      Signed-off-by: NBob Liu <lliubbo@gmail.com>
      Acked-by: NHugh Dickins <hughd@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ef694222
    • M
      cpuset: mm: reduce large amounts of memory barrier related damage v3 · cc9a6c87
      Mel Gorman 提交于
      Commit c0ff7453 ("cpuset,mm: fix no node to alloc memory when
      changing cpuset's mems") wins a super prize for the largest number of
      memory barriers entered into fast paths for one commit.
      
      [get|put]_mems_allowed is incredibly heavy with pairs of full memory
      barriers inserted into a number of hot paths.  This was detected while
      investigating at large page allocator slowdown introduced some time
      after 2.6.32.  The largest portion of this overhead was shown by
      oprofile to be at an mfence introduced by this commit into the page
      allocator hot path.
      
      For extra style points, the commit introduced the use of yield() in an
      implementation of what looks like a spinning mutex.
      
      This patch replaces the full memory barriers on both read and write
      sides with a sequence counter with just read barriers on the fast path
      side.  This is much cheaper on some architectures, including x86.  The
      main bulk of the patch is the retry logic if the nodemask changes in a
      manner that can cause a false failure.
      
      While updating the nodemask, a check is made to see if a false failure
      is a risk.  If it is, the sequence number gets bumped and parallel
      allocators will briefly stall while the nodemask update takes place.
      
      In a page fault test microbenchmark, oprofile samples from
      __alloc_pages_nodemask went from 4.53% of all samples to 1.15%.  The
      actual results were
      
                                   3.3.0-rc3          3.3.0-rc3
                                   rc3-vanilla        nobarrier-v2r1
          Clients   1 UserTime       0.07 (  0.00%)   0.08 (-14.19%)
          Clients   2 UserTime       0.07 (  0.00%)   0.07 (  2.72%)
          Clients   4 UserTime       0.08 (  0.00%)   0.07 (  3.29%)
          Clients   1 SysTime        0.70 (  0.00%)   0.65 (  6.65%)
          Clients   2 SysTime        0.85 (  0.00%)   0.82 (  3.65%)
          Clients   4 SysTime        1.41 (  0.00%)   1.41 (  0.32%)
          Clients   1 WallTime       0.77 (  0.00%)   0.74 (  4.19%)
          Clients   2 WallTime       0.47 (  0.00%)   0.45 (  3.73%)
          Clients   4 WallTime       0.38 (  0.00%)   0.37 (  1.58%)
          Clients   1 Flt/sec/cpu  497620.28 (  0.00%) 520294.53 (  4.56%)
          Clients   2 Flt/sec/cpu  414639.05 (  0.00%) 429882.01 (  3.68%)
          Clients   4 Flt/sec/cpu  257959.16 (  0.00%) 258761.48 (  0.31%)
          Clients   1 Flt/sec      495161.39 (  0.00%) 517292.87 (  4.47%)
          Clients   2 Flt/sec      820325.95 (  0.00%) 850289.77 (  3.65%)
          Clients   4 Flt/sec      1020068.93 (  0.00%) 1022674.06 (  0.26%)
          MMTests Statistics: duration
          Sys Time Running Test (seconds)             135.68    132.17
          User+Sys Time Running Test (seconds)         164.2    160.13
          Total Elapsed Time (seconds)                123.46    120.87
      
      The overall improvement is small but the System CPU time is much
      improved and roughly in correlation to what oprofile reported (these
      performance figures are without profiling so skew is expected).  The
      actual number of page faults is noticeably improved.
      
      For benchmarks like kernel builds, the overall benefit is marginal but
      the system CPU time is slightly reduced.
      
      To test the actual bug the commit fixed I opened two terminals.  The
      first ran within a cpuset and continually ran a small program that
      faulted 100M of anonymous data.  In a second window, the nodemask of the
      cpuset was continually randomised in a loop.
      
      Without the commit, the program would fail every so often (usually
      within 10 seconds) and obviously with the commit everything worked fine.
      With this patch applied, it also worked fine so the fix should be
      functionally equivalent.
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Cc: Miao Xie <miaox@cn.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Christoph Lameter <cl@linux.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      cc9a6c87
    • D
      mm, memcg: pass charge order to oom killer · e845e199
      David Rientjes 提交于
      The oom killer typically displays the allocation order at the time of oom
      as a part of its diangostic messages (for global, cpuset, and mempolicy
      ooms).
      
      The memory controller may also pass the charge order to the oom killer so
      it can emit the same information.  This is useful in determining how large
      the memory allocation is that triggered the oom killer.
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Acked-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e845e199
    • C
      mm/vmscan.c: fix spelling error · c7cfa37b
      Copot Alexandru 提交于
      s/noticable/noticeable/
      Signed-off-by: NCopot Alexandru <alex.mihai.c@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c7cfa37b
    • A
      mm: update stale lock ordering comment for memory-failure.c · 9a3c531d
      Andi Kleen 提交于
      When i_mmap_lock changed to a mutex the locking order in memory failure
      was changed to take the sleeping lock first.  But the big fat mm lock
      ordering comment (BFMLO) wasn't updated.  Do this here.
      
      Pointed out by Andrew.
      Signed-off-by: NAndi Kleen <ak@linux.intel.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9a3c531d
    • F
      mm: use global_dirty_limit in throttle_vm_writeout() · 47a13333
      Fengguang Wu 提交于
      When starting a memory hog task, a desktop box w/o swap is found to go
      unresponsive for a long time.  It's solely caused by lots of congestion
      waits in throttle_vm_writeout():
      
       gnome-system-mo-4201 553.073384: congestion_wait: throttle_vm_writeout+0x70/0x7f shrink_mem_cgroup_zone+0x48f/0x4a1
       gnome-system-mo-4201 553.073386: writeback_congestion_wait: usec_timeout=100000 usec_delayed=100000
                 gtali-4237 553.080377: congestion_wait: throttle_vm_writeout+0x70/0x7f shrink_mem_cgroup_zone+0x48f/0x4a1
                 gtali-4237 553.080378: writeback_congestion_wait: usec_timeout=100000 usec_delayed=100000
                  Xorg-3483 553.103375: congestion_wait: throttle_vm_writeout+0x70/0x7f shrink_mem_cgroup_zone+0x48f/0x4a1
                  Xorg-3483 553.103377: writeback_congestion_wait: usec_timeout=100000 usec_delayed=100000
      
      The root cause is, the dirty threshold is knocked down a lot by the memory
      hog task.  Fixed by using global_dirty_limit which decreases gradually on
      such events and can guarantee we stay above (the also decreasing) nr_dirty
      in the progress of following down to the new dirty threshold.
      Signed-off-by: NFengguang Wu <fengguang.wu@intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Ying Han <yinghan@google.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reviewed-by: NRik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      47a13333
    • F
      mm: don't set __GFP_WRITE on ramfs/sysfs writes · 1010bb1b
      Fengguang Wu 提交于
      There is not much point in skipping zones during allocation based on the
      dirty usage which they'll never contribute to.  And we'd like to avoid
      page reclaim waits when writing to ramfs/sysfs etc.
      Signed-off-by: NFengguang Wu <fengguang.wu@intel.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Ying Han <yinghan@google.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Acked-by: NMinchan Kim <minchan.kim@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1010bb1b
    • N
      bootmem/sparsemem: remove limit constraint in alloc_bootmem_section · f5bf18fa
      Nishanth Aravamudan 提交于
      While testing AMS (Active Memory Sharing) / CMO (Cooperative Memory
      Overcommit) on powerpc, we tripped the following:
      
        kernel BUG at mm/bootmem.c:483!
        cpu 0x0: Vector: 700 (Program Check) at [c000000000c03940]
            pc: c000000000a62bd8: .alloc_bootmem_core+0x90/0x39c
            lr: c000000000a64bcc: .sparse_early_usemaps_alloc_node+0x84/0x29c
            sp: c000000000c03bc0
           msr: 8000000000021032
          current = 0xc000000000b0cce0
          paca    = 0xc000000001d80000
            pid   = 0, comm = swapper
        kernel BUG at mm/bootmem.c:483!
        enter ? for help
        [c000000000c03c80] c000000000a64bcc
        .sparse_early_usemaps_alloc_node+0x84/0x29c
        [c000000000c03d50] c000000000a64f10 .sparse_init+0x12c/0x28c
        [c000000000c03e20] c000000000a474f4 .setup_arch+0x20c/0x294
        [c000000000c03ee0] c000000000a4079c .start_kernel+0xb4/0x460
        [c000000000c03f90] c000000000009670 .start_here_common+0x1c/0x2c
      
      This is
      
              BUG_ON(limit && goal + size > limit);
      
      and after some debugging, it seems that
      
      	goal = 0x7ffff000000
      	limit = 0x80000000000
      
      and sparse_early_usemaps_alloc_node ->
      sparse_early_usemaps_alloc_pgdat_section calls
      
      	return alloc_bootmem_section(usemap_size() * count, section_nr);
      
      This is on a system with 8TB available via the AMS pool, and as a quirk
      of AMS in firmware, all of that memory shows up in node 0.  So, we end
      up with an allocation that will fail the goal/limit constraints.
      
      In theory, we could "fall-back" to alloc_bootmem_node() in
      sparse_early_usemaps_alloc_node(), but since we actually have HOTREMOVE
      defined, we'll BUG_ON() instead.  A simple solution appears to be to
      unconditionally remove the limit condition in alloc_bootmem_section,
      meaning allocations are allowed to cross section boundaries (necessary
      for systems of this size).
      
      Johannes Weiner pointed out that if alloc_bootmem_section() no longer
      guarantees section-locality, we need check_usemap_section_nr() to print
      possible cross-dependencies between node descriptors and the usemaps
      allocated through it.  That makes the two loops in
      sparse_early_usemaps_alloc_node() identical, so re-factor the code a
      bit.
      
      [akpm@linux-foundation.org: code simplification]
      Signed-off-by: NNishanth Aravamudan <nacc@us.ibm.com>
      Cc: Dave Hansen <haveblue@us.ibm.com>
      Cc: Anton Blanchard <anton@au1.ibm.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Ben Herrenschmidt <benh@kernel.crashing.org>
      Cc: Robert Jennings <rcj@linux.vnet.ibm.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NMel Gorman <mgorman@suse.de>
      Cc: <stable@vger.kernel.org>	[3.3.1]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f5bf18fa
    • K
      mm: drain percpu lru add/rotate page-vectors on cpu hot-unplug · f0cb3c76
      Konstantin Khlebnikov 提交于
      This cpu hotplug hook was accidentally removed in commit 00a62ce9
      ("mm: fix Committed_AS underflow on large NR_CPUS environment")
      
      The visible effect of this accident: some pages are borrowed in per-cpu
      page-vectors.  Truncate can deal with it, but these pages cannot be
      reused while this cpu is offline.  So this is like a temporary memory
      leak.
      Signed-off-by: NKonstantin Khlebnikov <khlebnikov@openvz.org>
      Cc: Dave Hansen <dave@linux.vnet.ibm.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Eric B Munson <ebmunson@us.ibm.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f0cb3c76
    • C
      mm: fix move/migrate_pages() race on task struct · 3268c63e
      Christoph Lameter 提交于
      Migration functions perform the rcu_read_unlock too early.  As a result
      the task pointed to may change from under us.  This can result in an oops,
      as reported by Dave Hansen in https://lkml.org/lkml/2012/2/23/302.
      
      The following patch extend the period of the rcu_read_lock until after the
      permissions checks are done.  We also take a refcount so that the task
      reference is stable when calling security check functions and performing
      cpuset node validation (which takes a mutex).
      
      The refcount is dropped before actual page migration occurs so there is no
      change to the refcounts held during page migration.
      
      Also move the determination of the mm of the task struct to immediately
      before the do_migrate*() calls so that it is clear that we switch from
      handling the task during permission checks to the mm for the actual
      migration.  Since the determination is only done once and we then no
      longer use the task_struct we can be sure that we operate on a specific
      address space that will not change from under us.
      
      [akpm@linux-foundation.org: checkpatch fixes]
      Signed-off-by: NChristoph Lameter <cl@linux.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Reported-by: NDave Hansen <dave@linux.vnet.ibm.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3268c63e
    • D
      thp: allow a hwpoisoned head page to be put back to LRU · 385de357
      Dean Nelson 提交于
      Andrea Arcangeli pointed out to me that a check in __memory_failure()
      which was intended to prevent THP tail pages from being checked for the
      absence of the PG_lru flag (something that is always the case), was also
      preventing THP head pages from being checked.
      
      A THP head page could actually benefit from the call to shake_page() by
      ending up being put back to a LRU, provided it had been waiting in a
      pagevec array.
      
      Andrea suggested that the "!PageTransCompound(p)" in the if-statement
      should be replaced by a "!PageTransTail(p)", thus allowing THP head pages
      to be checked and possibly shaken.
      Signed-off-by: NDean Nelson <dnelson@redhat.com>
      Cc: Jin Dongming <jin.dongming@np.css.fujitsu.com>
      Reviewed-by: NAndrea Arcangeli <aarcange@redhat.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      385de357
    • J
      tmpfs: security xattr setting on inode creation · 6d9d88d0
      Jarkko Sakkinen 提交于
      Adds to generic xattr support introduced in Linux 3.0 by implementing
      initxattrs callback.  This enables consulting of security attributes from
      LSM and EVM when inode is created.
      
      [hughd@google.com: moved under CONFIG_TMPFS_XATTR, with memcpy in shmem_xattr_alloc]
      Signed-off-by: NJarkko Sakkinen <jarkko.sakkinen@intel.com>
      Reviewed-by: NJames Morris <james.l.morris@oracle.com>
      Signed-off-by: NHugh Dickins <hughd@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6d9d88d0
    • D
      mm, oom: force oom kill on sysrq+f · 08ab9b10
      David Rientjes 提交于
      The oom killer chooses not to kill a thread if:
      
       - an eligible thread has already been oom killed and has yet to exit,
         and
      
       - an eligible thread is exiting but has yet to free all its memory and
         is not the thread attempting to currently allocate memory.
      
      SysRq+F manually invokes the global oom killer to kill a memory-hogging
      task.  This is normally done as a last resort to free memory when no
      progress is being made or to test the oom killer itself.
      
      For both uses, we always want to kill a thread and never defer.  This
      patch causes SysRq+F to always kill an eligible thread and can be used to
      force a kill even if another oom killed thread has failed to exit.
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      Acked-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Acked-by: NPekka Enberg <penberg@kernel.org>
      Acked-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      08ab9b10
    • S
      procfs: mark thread stack correctly in proc/<pid>/maps · b7643757
      Siddhesh Poyarekar 提交于
      Stack for a new thread is mapped by userspace code and passed via
      sys_clone.  This memory is currently seen as anonymous in
      /proc/<pid>/maps, which makes it difficult to ascertain which mappings
      are being used for thread stacks.  This patch uses the individual task
      stack pointers to determine which vmas are actually thread stacks.
      
      For a multithreaded program like the following:
      
      	#include <pthread.h>
      
      	void *thread_main(void *foo)
      	{
      		while(1);
      	}
      
      	int main()
      	{
      		pthread_t t;
      		pthread_create(&t, NULL, thread_main, NULL);
      		pthread_join(t, NULL);
      	}
      
      proc/PID/maps looks like the following:
      
          00400000-00401000 r-xp 00000000 fd:0a 3671804                            /home/siddhesh/a.out
          00600000-00601000 rw-p 00000000 fd:0a 3671804                            /home/siddhesh/a.out
          019ef000-01a10000 rw-p 00000000 00:00 0                                  [heap]
          7f8a44491000-7f8a44492000 ---p 00000000 00:00 0
          7f8a44492000-7f8a44c92000 rw-p 00000000 00:00 0
          7f8a44c92000-7f8a44e3d000 r-xp 00000000 fd:00 2097482                    /lib64/libc-2.14.90.so
          7f8a44e3d000-7f8a4503d000 ---p 001ab000 fd:00 2097482                    /lib64/libc-2.14.90.so
          7f8a4503d000-7f8a45041000 r--p 001ab000 fd:00 2097482                    /lib64/libc-2.14.90.so
          7f8a45041000-7f8a45043000 rw-p 001af000 fd:00 2097482                    /lib64/libc-2.14.90.so
          7f8a45043000-7f8a45048000 rw-p 00000000 00:00 0
          7f8a45048000-7f8a4505f000 r-xp 00000000 fd:00 2099938                    /lib64/libpthread-2.14.90.so
          7f8a4505f000-7f8a4525e000 ---p 00017000 fd:00 2099938                    /lib64/libpthread-2.14.90.so
          7f8a4525e000-7f8a4525f000 r--p 00016000 fd:00 2099938                    /lib64/libpthread-2.14.90.so
          7f8a4525f000-7f8a45260000 rw-p 00017000 fd:00 2099938                    /lib64/libpthread-2.14.90.so
          7f8a45260000-7f8a45264000 rw-p 00000000 00:00 0
          7f8a45264000-7f8a45286000 r-xp 00000000 fd:00 2097348                    /lib64/ld-2.14.90.so
          7f8a45457000-7f8a4545a000 rw-p 00000000 00:00 0
          7f8a45484000-7f8a45485000 rw-p 00000000 00:00 0
          7f8a45485000-7f8a45486000 r--p 00021000 fd:00 2097348                    /lib64/ld-2.14.90.so
          7f8a45486000-7f8a45487000 rw-p 00022000 fd:00 2097348                    /lib64/ld-2.14.90.so
          7f8a45487000-7f8a45488000 rw-p 00000000 00:00 0
          7fff6273b000-7fff6275c000 rw-p 00000000 00:00 0                          [stack]
          7fff627ff000-7fff62800000 r-xp 00000000 00:00 0                          [vdso]
          ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0                  [vsyscall]
      
      Here, one could guess that 7f8a44492000-7f8a44c92000 is a stack since
      the earlier vma that has no permissions (7f8a44e3d000-7f8a4503d000) but
      that is not always a reliable way to find out which vma is a thread
      stack.  Also, /proc/PID/maps and /proc/PID/task/TID/maps has the same
      content.
      
      With this patch in place, /proc/PID/task/TID/maps are treated as 'maps
      as the task would see it' and hence, only the vma that that task uses as
      stack is marked as [stack].  All other 'stack' vmas are marked as
      anonymous memory.  /proc/PID/maps acts as a thread group level view,
      where all thread stack vmas are marked as [stack:TID] where TID is the
      process ID of the task that uses that vma as stack, while the process
      stack is marked as [stack].
      
      So /proc/PID/maps will look like this:
      
          00400000-00401000 r-xp 00000000 fd:0a 3671804                            /home/siddhesh/a.out
          00600000-00601000 rw-p 00000000 fd:0a 3671804                            /home/siddhesh/a.out
          019ef000-01a10000 rw-p 00000000 00:00 0                                  [heap]
          7f8a44491000-7f8a44492000 ---p 00000000 00:00 0
          7f8a44492000-7f8a44c92000 rw-p 00000000 00:00 0                          [stack:1442]
          7f8a44c92000-7f8a44e3d000 r-xp 00000000 fd:00 2097482                    /lib64/libc-2.14.90.so
          7f8a44e3d000-7f8a4503d000 ---p 001ab000 fd:00 2097482                    /lib64/libc-2.14.90.so
          7f8a4503d000-7f8a45041000 r--p 001ab000 fd:00 2097482                    /lib64/libc-2.14.90.so
          7f8a45041000-7f8a45043000 rw-p 001af000 fd:00 2097482                    /lib64/libc-2.14.90.so
          7f8a45043000-7f8a45048000 rw-p 00000000 00:00 0
          7f8a45048000-7f8a4505f000 r-xp 00000000 fd:00 2099938                    /lib64/libpthread-2.14.90.so
          7f8a4505f000-7f8a4525e000 ---p 00017000 fd:00 2099938                    /lib64/libpthread-2.14.90.so
          7f8a4525e000-7f8a4525f000 r--p 00016000 fd:00 2099938                    /lib64/libpthread-2.14.90.so
          7f8a4525f000-7f8a45260000 rw-p 00017000 fd:00 2099938                    /lib64/libpthread-2.14.90.so
          7f8a45260000-7f8a45264000 rw-p 00000000 00:00 0
          7f8a45264000-7f8a45286000 r-xp 00000000 fd:00 2097348                    /lib64/ld-2.14.90.so
          7f8a45457000-7f8a4545a000 rw-p 00000000 00:00 0
          7f8a45484000-7f8a45485000 rw-p 00000000 00:00 0
          7f8a45485000-7f8a45486000 r--p 00021000 fd:00 2097348                    /lib64/ld-2.14.90.so
          7f8a45486000-7f8a45487000 rw-p 00022000 fd:00 2097348                    /lib64/ld-2.14.90.so
          7f8a45487000-7f8a45488000 rw-p 00000000 00:00 0
          7fff6273b000-7fff6275c000 rw-p 00000000 00:00 0                          [stack]
          7fff627ff000-7fff62800000 r-xp 00000000 00:00 0                          [vdso]
          ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0                  [vsyscall]
      
      Thus marking all vmas that are used as stacks by the threads in the
      thread group along with the process stack.  The task level maps will
      however like this:
      
          00400000-00401000 r-xp 00000000 fd:0a 3671804                            /home/siddhesh/a.out
          00600000-00601000 rw-p 00000000 fd:0a 3671804                            /home/siddhesh/a.out
          019ef000-01a10000 rw-p 00000000 00:00 0                                  [heap]
          7f8a44491000-7f8a44492000 ---p 00000000 00:00 0
          7f8a44492000-7f8a44c92000 rw-p 00000000 00:00 0                          [stack]
          7f8a44c92000-7f8a44e3d000 r-xp 00000000 fd:00 2097482                    /lib64/libc-2.14.90.so
          7f8a44e3d000-7f8a4503d000 ---p 001ab000 fd:00 2097482                    /lib64/libc-2.14.90.so
          7f8a4503d000-7f8a45041000 r--p 001ab000 fd:00 2097482                    /lib64/libc-2.14.90.so
          7f8a45041000-7f8a45043000 rw-p 001af000 fd:00 2097482                    /lib64/libc-2.14.90.so
          7f8a45043000-7f8a45048000 rw-p 00000000 00:00 0
          7f8a45048000-7f8a4505f000 r-xp 00000000 fd:00 2099938                    /lib64/libpthread-2.14.90.so
          7f8a4505f000-7f8a4525e000 ---p 00017000 fd:00 2099938                    /lib64/libpthread-2.14.90.so
          7f8a4525e000-7f8a4525f000 r--p 00016000 fd:00 2099938                    /lib64/libpthread-2.14.90.so
          7f8a4525f000-7f8a45260000 rw-p 00017000 fd:00 2099938                    /lib64/libpthread-2.14.90.so
          7f8a45260000-7f8a45264000 rw-p 00000000 00:00 0
          7f8a45264000-7f8a45286000 r-xp 00000000 fd:00 2097348                    /lib64/ld-2.14.90.so
          7f8a45457000-7f8a4545a000 rw-p 00000000 00:00 0
          7f8a45484000-7f8a45485000 rw-p 00000000 00:00 0
          7f8a45485000-7f8a45486000 r--p 00021000 fd:00 2097348                    /lib64/ld-2.14.90.so
          7f8a45486000-7f8a45487000 rw-p 00022000 fd:00 2097348                    /lib64/ld-2.14.90.so
          7f8a45487000-7f8a45488000 rw-p 00000000 00:00 0
          7fff6273b000-7fff6275c000 rw-p 00000000 00:00 0
          7fff627ff000-7fff62800000 r-xp 00000000 00:00 0                          [vdso]
          ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0                  [vsyscall]
      
      where only the vma that is being used as a stack by *that* task is
      marked as [stack].
      
      Analogous changes have been made to /proc/PID/smaps,
      /proc/PID/numa_maps, /proc/PID/task/TID/smaps and
      /proc/PID/task/TID/numa_maps. Relevant snippets from smaps and
      numa_maps:
      
          [siddhesh@localhost ~ ]$ pgrep a.out
          1441
          [siddhesh@localhost ~ ]$ cat /proc/1441/smaps | grep "\[stack"
          7f8a44492000-7f8a44c92000 rw-p 00000000 00:00 0                          [stack:1442]
          7fff6273b000-7fff6275c000 rw-p 00000000 00:00 0                          [stack]
          [siddhesh@localhost ~ ]$ cat /proc/1441/task/1442/smaps | grep "\[stack"
          7f8a44492000-7f8a44c92000 rw-p 00000000 00:00 0                          [stack]
          [siddhesh@localhost ~ ]$ cat /proc/1441/task/1441/smaps | grep "\[stack"
          7fff6273b000-7fff6275c000 rw-p 00000000 00:00 0                          [stack]
          [siddhesh@localhost ~ ]$ cat /proc/1441/numa_maps | grep "stack"
          7f8a44492000 default stack:1442 anon=2 dirty=2 N0=2
          7fff6273a000 default stack anon=3 dirty=3 N0=3
          [siddhesh@localhost ~ ]$ cat /proc/1441/task/1442/numa_maps | grep "stack"
          7f8a44492000 default stack anon=2 dirty=2 N0=2
          [siddhesh@localhost ~ ]$ cat /proc/1441/task/1441/numa_maps | grep "stack"
          7fff6273a000 default stack anon=3 dirty=3 N0=3
      
      [akpm@linux-foundation.org: checkpatch fixes]
      [akpm@linux-foundation.org: fix build]
      Signed-off-by: NSiddhesh Poyarekar <siddhesh.poyarekar@gmail.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Jamie Lokier <jamie@shareable.org>
      Cc: Mike Frysinger <vapier@gentoo.org>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Matt Mackall <mpm@selenic.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b7643757
    • H
      mm: hugetlb: bail out unmapping after serving reference page · 9e81130b
      Hillf Danton 提交于
      When unmapping a given VM range, we could bail out if a reference page is
      supplied and is unmapped, which is a minor optimization.
      Signed-off-by: NHillf Danton <dhillf@gmail.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9e81130b
    • H
      vmscan: handle isolated pages with lru lock released · d563c050
      Hillf Danton 提交于
      When shrinking inactive lru list, isolated pages are queued on locally
      private list, so the lock-hold time could be reduced if pages are counted
      without lock protection.
      
      To achieve that, firstly updating reclaim stat is delayed until the
      putback stage, after reacquiring the lru lock.
      
      Secondly, operations related to vm and zone stats are now proteced with
      preemption disabled as they are per-cpu operations.
      Signed-off-by: NHillf Danton <dhillf@gmail.com>
      Acked-by: NHugh Dickins <hughd@google.com>
      Reviewed-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d563c050
    • K
      rmap: anon_vma_prepare: Reduce code duplication by calling anon_vma_chain_link · 6583a843
      Kautuk Consul 提交于
      Reduce code duplication by calling anon_vma_chain_link() from
      anon_vma_prepare().
      
      Also move anon_vmal_chain_link() to a more suitable location in the file.
      Signed-off-by: NKautuk Consul <consul.kautuk@gmail.com>
      Acked-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Hugh Dickins <hughd@google.com>
      Reviewed-by: NKAMEZWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Acked-by: NDavid Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6583a843