1. 01 Aug, 2012 (40 commits)
    • M
      mm: remove redundant initialization · 6527af5d
      Minchan Kim authored
      pg_data_t is zeroed before reaching free_area_init_core(), so remove the
      now unnecessary initializations.
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6527af5d
    • M
      mm: warn if pg_data_t isn't initialized with zero · 88fdf75d
      Minchan Kim authored
      Warn if memory-hotplug/boot code doesn't initialize pg_data_t with zero
      when it is allocated.  Arch code and memory hotplug already initialize
      pg_data_t.  So this warning should never happen.  I select fields randomly
      near the beginning, middle and end of pg_data_t for checking.
      
      This patch isn't about performance; it removes initialization code that
      would otherwise have to be added whenever a new field is added to
      pg_data_t or zone.
      
      Andrew originally suggested clearing out pg_data_t in the MM core, but
      Tejun didn't like that because in the future some arches may initialize
      some fields in arch code and pass them on to the generic MM code, so
      blindly clearing it out in the MM core would be very annoying.
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      88fdf75d
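
      A minimal userspace sketch of the zero-check pattern described in 88fdf75d
      above, assuming a stripped-down pgdat structure; the field names stand in
      for the fields picked near the beginning, middle and end of the real
      pg_data_t and are illustrative only:

        #include <stdio.h>

        /* Toy stand-in for pg_data_t; the real layout differs. */
        struct pgdat {
            int nr_zones;            /* near the beginning */
            unsigned long start_pfn; /* somewhere in the middle */
            void *kswapd;            /* near the end */
        };

        /* Warn, rather than silently clear, if the memory was not zeroed. */
        static void check_pgdat_zeroed(const struct pgdat *p)
        {
            if (p->nr_zones || p->start_pfn || p->kswapd)
                fprintf(stderr, "WARNING: pgdat was not zero-initialized\n");
        }

        int main(void)
        {
            struct pgdat good = { 0 };
            struct pgdat bad = { .nr_zones = 3 };

            check_pgdat_zeroed(&good); /* silent */
            check_pgdat_zeroed(&bad);  /* warns */
            return 0;
        }
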
    • T
      memcg: fix memory accounting scalability in shrink_page_list · 69980e31
      Tim Chen authored
      I noticed in a multi-process parallel file-reading benchmark I ran on an
      8-socket machine that throughput slowed down by a factor of 8 when I ran
      the benchmark inside a cgroup container.  I traced the problem to the
      following code path (see below), hit when we are trying to reclaim memory
      from file cache.  The res_counter_uncharge function is called on every
      page that's reclaimed and creates heavy lock contention.  The patch below
      allows the reclaimed pages to be uncharged from the resource counter in
      batches and recovers the regression.
      
      Tim
      
           40.67%           usemem  [kernel.kallsyms]                   [k] _raw_spin_lock
                            |
                            --- _raw_spin_lock
                               |
                               |--92.61%-- res_counter_uncharge
                               |          |
                               |          |--100.00%-- __mem_cgroup_uncharge_common
                               |          |          |
                               |          |          |--100.00%-- mem_cgroup_uncharge_cache_page
                               |          |          |          __remove_mapping
                               |          |          |          shrink_page_list
                               |          |          |          shrink_inactive_list
                               |          |          |          shrink_mem_cgroup_zone
                               |          |          |          shrink_zone
                               |          |          |          do_try_to_free_pages
                               |          |          |          try_to_free_pages
                               |          |          |          __alloc_pages_nodemask
                               |          |          |          alloc_pages_current
      Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
      Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      69980e31
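
      An illustrative sketch of the batching idea in 69980e31, assuming a
      hypothetical page counter protected by a pthread mutex: instead of taking
      the lock once per reclaimed page, reclaim accumulates a count and
      uncharges once per batch.

        #include <pthread.h>
        #include <stdio.h>

        static pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;
        static unsigned long charged_pages = 1024;

        /* Contended path: one lock round-trip per reclaimed page. */
        static void uncharge_one(void)
        {
            pthread_mutex_lock(&counter_lock);
            charged_pages--;
            pthread_mutex_unlock(&counter_lock);
        }

        /* Batched path: one lock round-trip per batch of reclaimed pages. */
        static void uncharge_batch(unsigned long nr_pages)
        {
            pthread_mutex_lock(&counter_lock);
            charged_pages -= nr_pages;
            pthread_mutex_unlock(&counter_lock);
        }

        int main(void)
        {
            for (int i = 0; i < 32; i++) /* old behaviour: 32 lock acquisitions */
                uncharge_one();
            uncharge_batch(32);          /* new behaviour: 1 lock acquisition */
            printf("charged pages left: %lu\n", charged_pages);
            return 0;
        }
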
    • G
      mm/sparse: remove index_init_lock · c1c95183
      Gavin Shan authored
      sparse_index_init() uses the index_init_lock spinlock to protect root
      mem_section assignment.  The lock is not necessary anymore because the
      function is called only during boot (during paging init which is executed
      only from a single CPU) and from the hotplug code (by add_memory() via
      arch_add_memory()) which uses mem_hotplug_mutex.
      
      The lock was introduced by 28ae55c9 ("sparsemem extreme: hotplug
      preparation") and sparse_index_init() was used only during boot at that
      time.
      
      Later, when the hotplug code (and add_memory()) was introduced, there was
      no synchronization, so it was probably possible to online more sections
      from the same root concurrently (though I am not 100% sure about
      that).  The first
      synchronization has been added by 6ad696d2 ("mm: allow memory hotplug and
      hibernation in the same kernel") which was later replaced by the
      mem_hotplug_mutex - 20d6c96b ("mem-hotplug: introduce
      {un}lock_memory_hotplug()").
      
      Let's remove the lock as it is not needed and it makes the code more
      confusing.
      
      [mhocko@suse.cz: changelog]
      Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Cc: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c1c95183
    • G
      mm/sparse: more checks on mem_section number · db36a461
      Gavin Shan authored
      __section_nr() retrieves a memory section's number from its descriptor.
      It's possible that the specified memory section descriptor doesn't exist
      in the global array, so add a check for that and report an error in that
      case.
      Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      db36a461
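
      A small sketch of the kind of validation db36a461 describes, assuming a
      toy descriptor array; returning -1 when the descriptor is not found in
      the array mirrors "report an error for a wrong case":

        #include <stdio.h>

        #define NR_SECTIONS 4

        struct mem_section_stub { unsigned long map; };

        static struct mem_section_stub sections[NR_SECTIONS];

        /* Return the descriptor's index, or -1 if it is not in the array. */
        static long section_nr(const struct mem_section_stub *ms)
        {
            for (long i = 0; i < NR_SECTIONS; i++)
                if (&sections[i] == ms)
                    return i;
            return -1; /* caller should treat this as a bug */
        }

        int main(void)
        {
            struct mem_section_stub bogus;

            printf("valid: %ld\n", section_nr(&sections[2])); /* 2 */
            printf("bogus: %ld\n", section_nr(&bogus));       /* -1 */
            return 0;
        }
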
    • G
      mm/sparse: optimize sparse_index_alloc · 5b760e64
      Gavin Shan authored
      With CONFIG_SPARSEMEM_EXTREME, the two levels of memory section
      descriptors are allocated from slab or bootmem.  When allocating from
      slab, let the slab/bootmem allocator clear the memory chunk; we needn't
      clear it explicitly.
      Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5b760e64
    • W
      memcg: add mem_cgroup_from_css() helper · b2145145
      Wanpeng Li authored
      Add a mem_cgroup_from_css() helper to replace open-coded invocations of
      container_of().  This clarifies the code and adds a little more type safety.
      
      [akpm@linux-foundation.org: fix extensive breakage]
      Signed-off-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Gavin Shan <shangw@linux.vnet.ibm.com>
      Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b2145145
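
      The helper named in b2145145 is essentially a typed container_of() wrapper.
      Below is a self-contained userspace sketch of the same pattern, with a
      local container_of definition and toy css/memcg types standing in for the
      kernel structures:

        #include <stddef.h>
        #include <stdio.h>

        #define container_of(ptr, type, member) \
            ((type *)((char *)(ptr) - offsetof(type, member)))

        struct cgroup_subsys_state { int refcnt; };

        struct mem_cgroup {
            long limit;
            struct cgroup_subsys_state css; /* embedded member */
        };

        /* Typed accessor instead of open-coding container_of() at call sites. */
        static struct mem_cgroup *mem_cgroup_from_css(struct cgroup_subsys_state *s)
        {
            return container_of(s, struct mem_cgroup, css);
        }

        int main(void)
        {
            struct mem_cgroup memcg = { .limit = 4096 };
            struct cgroup_subsys_state *s = &memcg.css;

            printf("limit via css: %ld\n", mem_cgroup_from_css(s)->limit);
            return 0;
        }
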
    • H
      memcg: further prevent OOM with too many dirty pages · c3b94f44
      Hugh Dickins authored
      The may_enter_fs test turns out to be too restrictive: though I saw no
      problem with it when testing on 3.5-rc6, it very soon OOMed when I tested
      on 3.5-rc6-mm1.  I don't know what the difference there is, perhaps I just
      slightly changed the way I started off the testing: dd if=/dev/zero
      of=/mnt/temp bs=1M count=1024; rm -f /mnt/temp; sync repeatedly, in 20M
      memory.limit_in_bytes cgroup to ext4 on USB stick.
      
      ext4 (and gfs2 and xfs) turn out to allocate new pages for writing with
      AOP_FLAG_NOFS: that seems a little worrying, and it's unclear to me why
      the transaction needs to be started even before allocating pagecache
      memory.  But it may not be worth worrying about these days: if direct
      reclaim avoids FS writeback, does __GFP_FS now mean anything?
      
      Anyway, we insisted on the may_enter_fs test to avoid hangs with the loop
      device; but since that also masks off __GFP_IO, we can test for __GFP_IO
      directly, ignoring may_enter_fs and __GFP_FS.
      
      But even so, the test still OOMs sometimes: when originally testing on
      3.5-rc6, it OOMed about one time in five or ten; when testing just now on
      3.5-rc6-mm1, it OOMed on the first iteration.
      
      This residual problem comes from an accumulation of pages under ordinary
      writeback, not marked PageReclaim, so rightly not causing the memcg check
      to wait on their writeback: these too can prevent shrink_page_list() from
      freeing any pages, so many times that memcg reclaim fails and OOMs.
      
      Deal with these in the same way as direct reclaim now deals with dirty FS
      pages: mark them PageReclaim.  It is appropriate to rotate these to tail
      of list when writepage completes, but more importantly, the PageReclaim
      flag makes memcg reclaim wait on them if encountered again.  Increment
      NR_VMSCAN_IMMEDIATE?  That's arguable: I chose not.
      
      Setting PageReclaim here may occasionally race with end_page_writeback()
      clearing it: lru_deactivate_fn() already faced the same race, and
      correctly concluded that the window is small and the issue non-critical.
      
      With these changes, the test runs indefinitely without OOMing on ext4,
      ext3 and ext2: I'll move on to test with other filesystems later.
      
      Trivia: invert conditions for a clearer block without an else, and goto
      keep_locked to do the unlock_page.
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Ying Han <yinghan@google.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c3b94f44
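
      A hedged sketch of the control flow that c3b94f44 (and e62e384e below)
      describe for shrink_page_list(), using plain booleans in place of the real
      page flags and gfp bits; the helper name is illustrative, not the kernel's:

        #include <stdbool.h>
        #include <stdio.h>

        struct fake_page {
            bool writeback; /* PageWriteback() */
            bool reclaim;   /* PageReclaim()   */
        };

        /* What memcg reclaim does with a page found under writeback. */
        static void memcg_handle_writeback(struct fake_page *page, bool gfp_io)
        {
            if (!page->writeback)
                return;

            if (page->reclaim && gfp_io) {
                /* Seen before: writeback is too slow, throttle the reclaimer. */
                printf("wait_on_page_writeback()\n");
            } else {
                /* First encounter: tag it so a later pass may wait on it. */
                page->reclaim = true;
                printf("SetPageReclaim(), keep the page on the list\n");
            }
        }

        int main(void)
        {
            struct fake_page page = { .writeback = true, .reclaim = false };

            memcg_handle_writeback(&page, true); /* tags PageReclaim */
            memcg_handle_writeback(&page, true); /* waits on writeback */
            return 0;
        }
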
    • M
      memcg: prevent OOM with too many dirty pages · e62e384e
      Michal Hocko authored
      The current implementation of dirty pages throttling is not memcg aware
      which makes it easy to have memcg LRUs full of dirty pages.  Without
      throttling, these LRUs can be scanned faster than the rate of writeback,
      leading to memcg OOM conditions when the hard limit is small.
      
      This patch fixes the problem by throttling the allocating process
      (possibly a writer) during the hard limit reclaim by waiting on
      PageReclaim pages.  We are waiting only for PageReclaim pages because
      those are the pages that made one full round over LRU and that means that
      the writeback is much slower than scanning.
      
      The solution is far from being ideal - long term solution is memcg aware
      dirty throttling - but it is meant to be a band aid until we have a real
      fix.  We are seeing this happening during nightly backups which are placed
      into containers to prevent from eviction of the real working set.
      
      The change affects only memcg reclaim and only when we encounter
      PageReclaim pages, which is a signal that reclaim is not keeping up
      with the writers so somebody should be throttled.  This could be
      potentially unfair because it could be somebody else from the group who
      gets throttled on behalf of the writer but as writers need to allocate as
      well, and they allocate at a higher rate, the probability that only innocent
      processes would be penalized is not that high.
      
      I have tested this change by a simple dd copying /dev/zero to tmpfs or
      ext3 running under small memcg (1G copy under 5M, 60M, 300M and 2G
      containers) and dd got killed by OOM killer every time.  With the patch I
      could run the dd with the same size under 5M controller without any OOM.
      The issue is more visible with slower devices for output.
      
      * With the patch
      ================
      * tmpfs size=2G
      ---------------
      $ vim cgroup_cache_oom_test.sh
      $ ./cgroup_cache_oom_test.sh 5M
      using Limit 5M for group
      1000+0 records in
      1000+0 records out
      1048576000 bytes (1.0 GB) copied, 30.4049 s, 34.5 MB/s
      $ ./cgroup_cache_oom_test.sh 60M
      using Limit 60M for group
      1000+0 records in
      1000+0 records out
      1048576000 bytes (1.0 GB) copied, 31.4561 s, 33.3 MB/s
      $ ./cgroup_cache_oom_test.sh 300M
      using Limit 300M for group
      1000+0 records in
      1000+0 records out
      1048576000 bytes (1.0 GB) copied, 20.4618 s, 51.2 MB/s
      $ ./cgroup_cache_oom_test.sh 2G
      using Limit 2G for group
      1000+0 records in
      1000+0 records out
      1048576000 bytes (1.0 GB) copied, 1.42172 s, 738 MB/s
      
      * ext3
      ------
      $ ./cgroup_cache_oom_test.sh 5M
      using Limit 5M for group
      1000+0 records in
      1000+0 records out
      1048576000 bytes (1.0 GB) copied, 27.9547 s, 37.5 MB/s
      $ ./cgroup_cache_oom_test.sh 60M
      using Limit 60M for group
      1000+0 records in
      1000+0 records out
      1048576000 bytes (1.0 GB) copied, 30.3221 s, 34.6 MB/s
      $ ./cgroup_cache_oom_test.sh 300M
      using Limit 300M for group
      1000+0 records in
      1000+0 records out
      1048576000 bytes (1.0 GB) copied, 24.5764 s, 42.7 MB/s
      $ ./cgroup_cache_oom_test.sh 2G
      using Limit 2G for group
      1000+0 records in
      1000+0 records out
      1048576000 bytes (1.0 GB) copied, 3.35828 s, 312 MB/s
      
      * Without the patch
      ===================
      * tmpfs size=2G
      ---------------
      $ ./cgroup_cache_oom_test.sh 5M
      using Limit 5M for group
      ./cgroup_cache_oom_test.sh: line 46:  4668 Killed                  dd if=/dev/zero of=$OUT/zero bs=1M count=$count
      $ ./cgroup_cache_oom_test.sh 60M
      using Limit 60M for group
      1000+0 records in
      1000+0 records out
      1048576000 bytes (1.0 GB) copied, 25.4989 s, 41.1 MB/s
      $ ./cgroup_cache_oom_test.sh 300M
      using Limit 300M for group
      1000+0 records in
      1000+0 records out
      1048576000 bytes (1.0 GB) copied, 24.3928 s, 43.0 MB/s
      $ ./cgroup_cache_oom_test.sh 2G
      using Limit 2G for group
      1000+0 records in
      1000+0 records out
      1048576000 bytes (1.0 GB) copied, 1.49797 s, 700 MB/s
      
      * ext3
      ------
      $ ./cgroup_cache_oom_test.sh 5M
      using Limit 5M for group
      ./cgroup_cache_oom_test.sh: line 46:  4689 Killed                  dd if=/dev/zero of=$OUT/zero bs=1M count=$count
      $ ./cgroup_cache_oom_test.sh 60M
      using Limit 60M for group
      ./cgroup_cache_oom_test.sh: line 46:  4692 Killed                  dd if=/dev/zero of=$OUT/zero bs=1M count=$count
      $ ./cgroup_cache_oom_test.sh 300M
      using Limit 300M for group
      1000+0 records in
      1000+0 records out
      1048576000 bytes (1.0 GB) copied, 20.248 s, 51.8 MB/s
      $ ./cgroup_cache_oom_test.sh 2G
      using Limit 2G for group
      1000+0 records in
      1000+0 records out
      1048576000 bytes (1.0 GB) copied, 2.85201 s, 368 MB/s
      
      [akpm@linux-foundation.org: tweak changelog, reordered the test to optimize for CONFIG_CGROUP_MEM_RES_CTLR=n]
      [hughd@google.com: fix deadlock with loop driver]
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Ying Han <yinghan@google.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Reviewed-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Fengguang Wu <fengguang.wu@intel.com>
      Signed-off-by: Michal Hocko <mhocko@suse.cz>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e62e384e
    • X
      mm: mmu_notifier: fix freed page still mapped in secondary MMU · 3ad3d901
      Xiao Guangrong authored
      mmu_notifier_release() is called when the process is exiting.  It will
      delete all the mmu notifiers.  But at this time the page belonging to the
      process is still present in page tables and is present on the LRU list, so
      this race will happen:
      
            CPU 0                 CPU 1
      mmu_notifier_release:    try_to_unmap:
         hlist_del_init_rcu(&mn->hlist);
                                  ptep_clear_flush_notify:
                                      mmu notifier not found
                                  free page  !!!!!!
                                  /*
                                   * At the point, the page has been
                                   * freed, but it is still mapped in
                                   * the secondary MMU.
                                   */
      
        mn->ops->release(mn, mm);
      
      Then the box is not stable and sometimes we can get this bug:
      
      [  738.075923] BUG: Bad page state in process migrate-perf  pfn:03bec
      [  738.075931] page:ffffea00000efb00 count:0 mapcount:0 mapping:          (null) index:0x8076
      [  738.075936] page flags: 0x20000000000014(referenced|dirty)
      
      The same issue is present in mmu_notifier_unregister().
      
      We can call ->release before deleting the notifier to ensure the page has
      been unmapped from the secondary MMU before it is freed.
      Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
      Cc: Avi Kivity <avi@redhat.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3ad3d901
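
      An illustrative sketch of the ordering fix in 3ad3d901, with a tiny
      callback table standing in for mmu_notifier_ops: ->release() runs before
      the notifier is unhooked, so the secondary MMU drops its mappings before
      any page can be freed.

        #include <stdio.h>

        struct notifier {
            void (*release)(struct notifier *n);
            int hooked; /* stands in for hlist membership */
        };

        static void secondary_mmu_release(struct notifier *n)
        {
            printf("secondary MMU: unmap everything\n");
        }

        /*
         * The buggy order unhooked the notifier first, so a racing unmap found
         * no notifier and could free a page still mapped by the secondary MMU.
         * The fixed order, sketched here, calls ->release() before unhooking.
         */
        static void notifier_release(struct notifier *n)
        {
            if (n->release)
                n->release(n); /* 1. flush the secondary MMU */
            n->hooked = 0;     /* 2. only now is it safe to unhook */
        }

        int main(void)
        {
            struct notifier n = { .release = secondary_mmu_release, .hooked = 1 };

            notifier_release(&n);
            printf("notifier hooked: %d\n", n.hooked);
            return 0;
        }
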
    • J
      mm: memcg: only check anon swapin page charges for swap cache · bdf4f4d2
      Johannes Weiner authored
      shmem knows for sure that the page is in swap cache when attempting to
      charge a page, because the cache charge entry function has a check for it.
      Only anon pages may be removed from swap cache already when trying to
      charge their swapin.
      
      Adjust the comment, though: '4969c119 mm: fix swapin race condition' added
      a stable PageSwapCache check under the page lock in the do_swap_page()
      before calling the memory controller, so it's unuse_pte()'s pte_same()
      that may fail.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Wanpeng Li <liwp.linux@gmail.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bdf4f4d2
    • J
      mm: memcg: only check swap cache pages for repeated charging · 90deb788
      Johannes Weiner authored
      Only anon and shmem pages in the swap cache are attempted to be charged
      multiple times, from every swap pte fault or from shmem_unuse().  No other
      pages require checking PageCgroupUsed().
      
      Charging pages in the swap cache is also serialized by the page lock, and
      since both the try_charge and commit_charge are called under the same page
      lock section, the PageCgroupUsed() check might as well happen before the
      counter charging, let alone reclaim.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Wanpeng Li <liwp.linux@gmail.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      90deb788
    • J
      mm: memcg: split swapin charge function into private and public part · 0435a2fd
      Johannes Weiner authored
      When shmem is charged upon swapin, it does not need to check twice whether
      the memory controller is enabled.
      
      Also, shmem pages do not have to be checked for everything that regular
      anon pages have to be checked for, so let shmem use the internal version
      directly and allow future patches to move around checks that are only
      required when swapping in anon pages.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Wanpeng Li <liwp.linux@gmail.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0435a2fd
    • J
      mm: memcg: remove needless !mm fixup to init_mm when charging · 24467cac
      Johannes Weiner authored
      It does not matter to __mem_cgroup_try_charge() if the passed mm is NULL
      or init_mm, it will charge the root memcg in either case.
      
      Also fix up the comment in __mem_cgroup_try_charge() that claimed the
      init_mm would be charged when no mm was passed.  It's not really
      incorrect, but confusing.  Clarify that the root memcg is charged in this
      case.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Wanpeng Li <liwp.linux@gmail.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      24467cac
    • J
      mm: memcg: remove unneeded shmem charge type · 62ba7442
      Johannes Weiner authored
      shmem page charges have not needed a separate charge type to tell them
      from regular file pages since 08e552c6 ("memcg: synchronized LRU").
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Wanpeng Li <liwp.linux@gmail.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      62ba7442
    • J
      mm: memcg: move swapin charge functions above callsites · 827a03d2
      Johannes Weiner authored
      Charging cache pages may require swapin in the shmem case.  Save the
      forward declaration and just move the swapin functions above the cache
      charging functions.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Wanpeng Li <liwp.linux@gmail.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      827a03d2
    • J
      mm: memcg: only check for PageSwapCache when uncharging anon · 7d188958
      Johannes Weiner authored
      Only anon pages that are uncharged at the time of the last page table
      mapping vanishing may be in swapcache.
      
      When shmem pages, file pages, swap-freed anon pages, or just migrated
      pages are uncharged, they are known for sure to be not in swapcache.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Wanpeng Li <liwp.linux@gmail.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7d188958
    • J
      mm: memcg: push down PageSwapCache check into uncharge entry functions · 0c59b89c
      Johannes Weiner authored
      Not all uncharge paths need to check if the page is swapcache, some of
      them can know for sure.
      
      Push down the check into all callsites of uncharge_common() so that the
      patch that removes some of them is more obvious.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Wanpeng Li <liwp.linux@gmail.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0c59b89c
    • J
      mm: swapfile: clean up unuse_pte race handling · 5d84c776
      Johannes Weiner authored
      The conditional mem_cgroup_cancel_charge_swapin() is a leftover from when
      the function would continue to reestablish the page even after
      mem_cgroup_try_charge_swapin() failed.  After 85d9fc89 "memcg: fix refcnt
      handling at swapoff", the condition is always true when this code is
      reached.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Wanpeng Li <liwp.linux@gmail.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5d84c776
    • J
      mm: memcg: fix compaction/migration failing due to memcg limits · 0030f535
      Johannes Weiner authored
      Compaction (and page migration in general) can currently be hindered
      through pages being owned by memory cgroups that are at their limits and
      unreclaimable.
      
      The reason is that the replacement page is being charged against the limit
      while the page being replaced is also still charged.  But this seems
      unnecessary, given that only one of the two pages will still be in use
      after migration finishes.
      
      This patch changes the memcg migration sequence so that the replacement
      page is not charged.  Whatever page is still in use after successful or
      failed migration gets to keep the charge of the page that was going to be
      replaced.
      
      The replacement page will still show up temporarily in the rss/cache
      statistics, this can be fixed in a later patch as it's less urgent.
      Reported-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Wanpeng Li <liwp.linux@gmail.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0030f535
    • M
      swapfile: avoid dereferencing bd_disk during swap_entry_free for network storage · 73744923
      Mel Gorman authored
      Commit b3a27d ("swap: Add swap slot free callback to
      block_device_operations") dereferences p->bdev->bd_disk but this is a NULL
      dereference if using swap-over-NFS.  This patch checks SWP_BLKDEV on the
      swap_info_struct before dereferencing.
      
      With reference to this callback, Christoph Hellwig stated "Please just
      remove the callback entirely.  It has no user outside the staging tree and
      was added clearly against the rules for that staging tree".  This would
      also be my preference but there was not an obvious way of keeping zram in
      staging/ happy.
      Signed-off-by: Xiaotian Feng <dfeng@redhat.com>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Eric B Munson <emunson@mgebm.net>
      Cc: Eric Paris <eparis@redhat.com>
      Cc: James Morris <jmorris@namei.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Mike Christie <michaelc@cs.wisc.edu>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
      Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      73744923
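
      A minimal sketch of the guard 73744923 adds, with stub structures in place
      of swap_info_struct and block_device; the point is simply to test the flag
      before following the bd_disk pointer chain, which is not valid for
      swap-over-NFS:

        #include <stdio.h>

        #define SWP_BLKDEV 0x1u

        struct gendisk { const char *name; };
        struct block_device { struct gendisk *bd_disk; };

        struct swap_info {
            unsigned int flags;
            struct block_device *bdev; /* NULL-ish for swap-over-NFS */
        };

        static void swap_entry_free(const struct swap_info *si)
        {
            /* Only block-device-backed swap has a gendisk to poke. */
            if ((si->flags & SWP_BLKDEV) && si->bdev && si->bdev->bd_disk)
                printf("notify %s\n", si->bdev->bd_disk->name);
            else
                printf("file/network backed swap: skip the disk callback\n");
        }

        int main(void)
        {
            struct gendisk disk = { "sda" };
            struct block_device bdev = { &disk };
            struct swap_info blk = { SWP_BLKDEV, &bdev };
            struct swap_info nfs = { 0, NULL };

            swap_entry_free(&blk);
            swap_entry_free(&nfs);
            return 0;
        }
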
    • M
      mm: add support for direct_IO to highmem pages · 5a178119
      Mel Gorman authored
      The patch "mm: add support for a filesystem to activate swap files and use
      direct_IO for writing swap pages" added support for using direct_IO to
      write swap pages but it is insufficient for highmem pages.
      
      To support highmem pages, this patch kmap()s the page before calling the
      direct_IO() handler.  As direct_IO deals with virtual addresses, an
      additional helper is necessary for get_kernel_pages() to look up the
      struct page for a kmap virtual address.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Eric B Munson <emunson@mgebm.net>
      Cc: Eric Paris <eparis@redhat.com>
      Cc: James Morris <jmorris@namei.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Mike Christie <michaelc@cs.wisc.edu>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
      Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
      Cc: Xiaotian Feng <dfeng@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5a178119
    • M
      mm: swap: implement generic handler for swap_activate · a509bc1a
      Mel Gorman authored
      The version of swap_activate introduced is sufficient for swap-over-NFS
      but would not provide enough information to implement a generic handler.
      This patch shuffles things slightly to ensure the same information is
      available for aops->swap_activate() as is available to the core.
      
      No functionality change.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Eric B Munson <emunson@mgebm.net>
      Cc: Eric Paris <eparis@redhat.com>
      Cc: James Morris <jmorris@namei.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Mike Christie <michaelc@cs.wisc.edu>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
      Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
      Cc: Xiaotian Feng <dfeng@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a509bc1a
    • M
      mm: add support for a filesystem to activate swap files and use direct_IO for writing swap pages · 62c230bc
      Mel Gorman authored
      Currently swapfiles are managed entirely by the core VM by using ->bmap to
      allocate space and write to the blocks directly.  This effectively ensures
      that the underlying blocks are allocated and avoids the need for the swap
      subsystem to locate what physical blocks store offsets within a file.
      
      If the swap subsystem is to use the filesystem information to locate the
      blocks, it is critical that information such as block groups, block
      bitmaps and the block descriptor table that map the swap file be kept
      resident in memory.  This patch adds address_space_operations that the VM
      can call when activating or deactivating swap backed by a file.
      
        int swap_activate(struct file *);
        int swap_deactivate(struct file *);
      
      The ->swap_activate() method is used to communicate to the file that the
      VM relies on it, and the address_space should take adequate measures such
      as reserving space in the underlying device, reserving memory for mempools
      and pinning information such as the block descriptor table in memory.  The
      ->swap_deactivate() method is called on sys_swapoff() if ->swap_activate()
      returned success.
      
      After a successful ->swap_activate(), the swapfile is marked SWP_FILE and
      swapper_space.a_ops will proxy to sis->swap_file->f_mapping->a_ops, using
      ->direct_IO to write swapcache pages and ->readpage to read them.
      
      It is perfectly possible that direct_IO be used to read the swap pages but
      it is an unnecessary complication.  Similarly, it is possible that
      ->writepage be used instead of direct_io to write the pages but filesystem
      developers have stated that calling writepage from the VM is undesirable
      for a variety of reasons and using direct_IO opens up the possibility of
      writing back batches of swap pages in the future.
      
      [a.p.zijlstra@chello.nl: Original patch]
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Eric B Munson <emunson@mgebm.net>
      Cc: Eric Paris <eparis@redhat.com>
      Cc: James Morris <jmorris@namei.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Mike Christie <michaelc@cs.wisc.edu>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
      Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
      Cc: Xiaotian Feng <dfeng@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      62c230bc
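
      A rough sketch of the shape of the hook described in 62c230bc, using a
      trimmed-down ops table; the two method prototypes match the ones quoted in
      the message, while the struct names and the bodies are illustrative:

        #include <stdio.h>

        struct file { const char *name; };

        /* Trimmed-down stand-in for address_space_operations. */
        struct aops {
            int (*swap_activate)(struct file *f);
            int (*swap_deactivate)(struct file *f);
        };

        static int nfs_like_swap_activate(struct file *f)
        {
            printf("%s: reserve space, pin block map, size mempools\n", f->name);
            return 0;
        }

        static int nfs_like_swap_deactivate(struct file *f)
        {
            printf("%s: release swap reservations\n", f->name);
            return 0;
        }

        static const struct aops swap_capable_aops = {
            .swap_activate   = nfs_like_swap_activate,
            .swap_deactivate = nfs_like_swap_deactivate,
        };

        int main(void)
        {
            struct file swapfile = { "swapfile" };

            if (swap_capable_aops.swap_activate(&swapfile) == 0) {
                printf("swapon: mark SWP_FILE, write via ->direct_IO\n");
                swap_capable_aops.swap_deactivate(&swapfile); /* on swapoff */
            }
            return 0;
        }
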
    • M
      mm: add get_kernel_page[s] for pinning of kernel addresses for I/O · 18022c5d
      Mel Gorman authored
      This patch adds two new APIs get_kernel_pages() and get_kernel_page() that
      may be used to pin a vector of kernel addresses for IO.  The initial user
      is expected to be NFS for allowing pages to be written to swap using
      aops->direct_IO().  Strictly speaking, swap-over-NFS only needs to pin one
      page for IO but it makes sense to express the API in terms of a vector and
      add a helper for pinning single pages.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Eric B Munson <emunson@mgebm.net>
      Cc: Eric Paris <eparis@redhat.com>
      Cc: James Morris <jmorris@namei.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Mike Christie <michaelc@cs.wisc.edu>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
      Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
      Cc: Xiaotian Feng <dfeng@redhat.com>
      Cc: Mark Salter <msalter@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      18022c5d
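
      A sketch of the vector-plus-single-page shape of the API described in
      18022c5d; the exact kernel signature may differ, and the page lookup is
      faked with a static array since translating kernel virtual addresses is
      out of scope here:

        #include <stddef.h>
        #include <stdio.h>

        struct kvec { void *iov_base; size_t iov_len; };
        struct page_stub { int id; };

        static struct page_stub fake_pages[8];

        /* "Pin" a vector of kernel buffers; returns how many pages were taken. */
        static int get_kernel_pages(const struct kvec *kiov, int nr_segs,
                                    struct page_stub **pages)
        {
            int seg;

            for (seg = 0; seg < nr_segs && seg < 8; seg++) {
                if (kiov[seg].iov_len > 4096) /* one page per segment only */
                    break;
                fake_pages[seg].id = seg;     /* real code: virt_to_page() */
                pages[seg] = &fake_pages[seg];
            }
            return seg;
        }

        /* Convenience wrapper for the common single-page case. */
        static int get_kernel_page(void *start, struct page_stub **pages)
        {
            struct kvec kiov = { .iov_base = start, .iov_len = 4096 };

            return get_kernel_pages(&kiov, 1, pages);
        }

        int main(void)
        {
            char buf[4096];
            struct page_stub *pages[1];

            printf("pinned %d page(s)\n", get_kernel_page(buf, pages));
            return 0;
        }
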
    • M
      mm: methods for teaching filesystems about PG_swapcache pages · f981c595
      Mel Gorman authored
      In order to teach filesystems to handle swap cache pages, three new page
      functions are introduced:
      
        pgoff_t page_file_index(struct page *);
        loff_t page_file_offset(struct page *);
        struct address_space *page_file_mapping(struct page *);
      
      page_file_index() - gives the offset of this page in the file in
      PAGE_CACHE_SIZE blocks.  Like page->index is for mapped pages, this
      function also gives the correct index for PG_swapcache pages.
      
      page_file_offset() - uses page_file_index(), so that it will give the
      expected result, even for PG_swapcache pages.
      
      page_file_mapping() - gives the mapping backing the actual page; that is
      for swap cache pages it will give swap_file->f_mapping.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Eric B Munson <emunson@mgebm.net>
      Cc: Eric Paris <eparis@redhat.com>
      Cc: James Morris <jmorris@namei.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Mike Christie <michaelc@cs.wisc.edu>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
      Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
      Cc: Xiaotian Feng <dfeng@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f981c595
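
      A compact sketch of the intent of the three helpers in f981c595, with a
      toy page structure; PG_swapcache is modelled as a boolean and the swap
      file's mapping as a stored pointer, which is enough to show how swap cache
      pages get redirected to the swap file's mapping and index:

        #include <stdbool.h>
        #include <stdio.h>

        struct address_space_stub { const char *owner; };

        struct toy_page {
            bool swapcache;                       /* PG_swapcache            */
            unsigned long index;                  /* page->index             */
            unsigned long swap_offset;            /* offset in the swap file */
            struct address_space_stub *mapping;   /* page->mapping           */
        };

        static struct address_space_stub file_mapping = { "regular file" };
        static struct address_space_stub swap_mapping = { "swap_file->f_mapping" };

        static struct address_space_stub *page_file_mapping(const struct toy_page *p)
        {
            return p->swapcache ? &swap_mapping : p->mapping;
        }

        static unsigned long page_file_index(const struct toy_page *p)
        {
            return p->swapcache ? p->swap_offset : p->index;
        }

        static unsigned long long page_file_offset(const struct toy_page *p)
        {
            return (unsigned long long)page_file_index(p) * 4096; /* page size */
        }

        int main(void)
        {
            struct toy_page p = { .swapcache = true, .index = 7,
                                  .swap_offset = 42, .mapping = &file_mapping };

            printf("mapping: %s, offset: %llu\n",
                   page_file_mapping(&p)->owner, page_file_offset(&p));
            return 0;
        }
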
    • M
      mm: account for the number of times direct reclaimers get throttled · 68243e76
      Mel Gorman authored
      Under significant pressure when writing back to network-backed storage,
      direct reclaimers may get throttled.  This is expected to be a short-lived
      event and the processes get woken up again but processes do get stalled.
      This patch counts how many times such stalling occurs.  It's up to the
      administrator whether to reduce these stalls by increasing
      min_free_kbytes.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Cc: David Miller <davem@davemloft.net>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Mike Christie <michaelc@cs.wisc.edu>
      Cc: Eric B Munson <emunson@mgebm.net>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Christoph Lameter <cl@linux.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      68243e76
    • M
      mm: throttle direct reclaimers if PF_MEMALLOC reserves are low and swap is... · 5515061d
      Mel Gorman authored
      mm: throttle direct reclaimers if PF_MEMALLOC reserves are low and swap is backed by network storage
      
      If swap is backed by network storage such as NBD, there is a risk that a
      large number of reclaimers can hang the system by consuming all
      PF_MEMALLOC reserves.  To avoid these hangs, the administrator must tune
      min_free_kbytes in advance which is a bit fragile.
      
      This patch throttles direct reclaimers if half the PF_MEMALLOC reserves
      are in use.  If the system is routinely getting throttled the system
      administrator can increase min_free_kbytes so degradation is smoother but
      the system will keep running.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Cc: David Miller <davem@davemloft.net>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Mike Christie <michaelc@cs.wisc.edu>
      Cc: Eric B Munson <emunson@mgebm.net>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Christoph Lameter <cl@linux.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5515061d
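
      A sketch of the throttling predicate 5515061d describes, with made-up
      numbers: reclaimers are allowed through while free pages stay above half
      of the min reserve, and are expected to sleep on a waitqueue otherwise
      until kswapd makes progress.

        #include <stdbool.h>
        #include <stdio.h>

        struct zone_stats {
            unsigned long free_pages;
            unsigned long min_reserve; /* the PF_MEMALLOC / min watermark pool */
        };

        /* True when a direct reclaimer should be throttled (reserve half used). */
        static bool should_throttle(const struct zone_stats *z)
        {
            return z->free_pages < z->min_reserve / 2;
        }

        int main(void)
        {
            struct zone_stats healthy  = { .free_pages = 900, .min_reserve = 1000 };
            struct zone_stats squeezed = { .free_pages = 300, .min_reserve = 1000 };

            printf("healthy:  throttle=%d\n", should_throttle(&healthy));
            printf("squeezed: throttle=%d (sleep until kswapd frees pages)\n",
                   should_throttle(&squeezed));
            return 0;
        }
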
    • M
      mm: micro-optimise slab to avoid a function call · 381760ea
      Mel Gorman authored
      Getting and putting objects in SLAB currently requires a function call but
      the bulk of the work is related to PFMEMALLOC reserves which are only
      consumed when network-backed storage is critical.  Use an inline function
      to determine if the function call is required.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Cc: David Miller <davem@davemloft.net>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Mike Christie <michaelc@cs.wisc.edu>
      Cc: Eric B Munson <emunson@mgebm.net>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Christoph Lameter <cl@linux.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      381760ea
    • M
      netvm: allow skb allocation to use PFMEMALLOC reserves · c93bdd0e
      Mel Gorman authored
      Change the skb allocation API to indicate RX usage and use this to fall
      back to the PFMEMALLOC reserve when needed.  SKBs allocated from the
      reserve are tagged in skb->pfmemalloc.  If an SKB is allocated from the
      reserve and the socket is later found to be unrelated to page reclaim, the
      packet is dropped so that the memory remains available for page reclaim.
      Network protocols are expected to recover from this packet loss.
      
      [a.p.zijlstra@chello.nl: Ideas taken from various patches]
      [davem@davemloft.net: Use static branches, coding style corrections]
      [sebastian@breakpoint.cc: Avoid unnecessary cast, fix !CONFIG_NET build]
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Acked-by: David S. Miller <davem@davemloft.net>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Mike Christie <michaelc@cs.wisc.edu>
      Cc: Eric B Munson <emunson@mgebm.net>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Christoph Lameter <cl@linux.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c93bdd0e
    • M
      mm: ignore mempolicies when using ALLOC_NO_WATERMARK · 183f6371
      Mel Gorman authored
      The reserve is proportionally distributed over all !highmem zones in the
      system.  So we need to allow an emergency allocation access to all zones.
      In order to do that we need to break out of any mempolicy boundaries we
      might have.
      
      In my opinion that does not break mempolicies as those are user oriented
      and not system oriented.  That is, system allocations are not guaranteed
      to be within mempolicy boundaries.  For instance IRQs do not even have a
      mempolicy.
      
      So breaking out of mempolicy boundaries for 'rare' emergency allocations,
      which are always system allocations (as opposed to user) is ok.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Cc: David Miller <davem@davemloft.net>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Mike Christie <michaelc@cs.wisc.edu>
      Cc: Eric B Munson <emunson@mgebm.net>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Christoph Lameter <cl@linux.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      183f6371
    • M
      mm: only set page->pfmemalloc when ALLOC_NO_WATERMARKS was used · cfd19c5a
      Mel Gorman authored
      __alloc_pages_slowpath() is called when the number of free pages is below
      the low watermark.  If the caller is entitled to use ALLOC_NO_WATERMARKS
      then the page will be marked page->pfmemalloc.  This protects more pages
      than are strictly necessary as we only need to protect pages allocated
      below the min watermark (the pfmemalloc reserves).
      
      This patch only sets page->pfmemalloc when ALLOC_NO_WATERMARKS was
      required to allocate the page.
      
      [rientjes@google.com: David noticed the problem during review]
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Cc: David Miller <davem@davemloft.net>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Mike Christie <michaelc@cs.wisc.edu>
      Cc: Eric B Munson <emunson@mgebm.net>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Christoph Lameter <cl@linux.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      cfd19c5a
    • M
      mm: allow PF_MEMALLOC from softirq context · 907aed48
      Mel Gorman authored
      This is needed to allow network softirq packet processing to make use of
      PF_MEMALLOC.
      
      Currently softirq context cannot use PF_MEMALLOC due to it not being
      associated with a task, and therefore not having task flags to fiddle with
      - thus the gfp to alloc flag mapping ignores the task flags when in
      interrupts (hard or soft) context.
      
      Allowing softirqs to make use of PF_MEMALLOC therefore requires some
      trickery.  This patch borrows the task flags from whatever process happens
      to be preempted by the softirq.  It then modifies the gfp to alloc flags
      mapping to not exclude task flags in softirq context, and modify the
      softirq code to save, clear and restore the PF_MEMALLOC flag.
      
      The save and clear, ensures the preempted task's PF_MEMALLOC flag doesn't
      leak into the softirq.  The restore ensures a softirq's PF_MEMALLOC flag
      cannot leak back into the preempted process.  This should be safe due to
      the following reasons
      
      Softirqs can run on multiple CPUs sure but the same task should not be
      	executing the same softirq code. Neither should the softirq
      	handler be preempted by any other softirq handler so the flags
      	should not leak to an unrelated softirq.
      
      Softirqs re-enable hardware interrupts in __do_softirq() so can be
      	preempted by hardware interrupts so PF_MEMALLOC is inherited
      	by the hard IRQ. However, this is similar to a process in
      	reclaim being preempted by a hardirq. While PF_MEMALLOC is
      	set, gfp_to_alloc_flags() distinguishes between hard and
      	soft irqs and avoids giving a hardirq the ALLOC_NO_WATERMARKS
      	flag.
      
      If the softirq is deferred to ksoftirqd then its flags may be used
              instead of a normal task's but as the softirq cannot be preempted,
              the PF_MEMALLOC flag does not leak to other code by accident.
      
      [davem@davemloft.net: Document why PF_MEMALLOC is safe]
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Cc: David Miller <davem@davemloft.net>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Mike Christie <michaelc@cs.wisc.edu>
      Cc: Eric B Munson <emunson@mgebm.net>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Christoph Lameter <cl@linux.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      907aed48
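
      A small sketch of the save/clear/restore idiom 907aed48 describes, using a
      plain flags word for the borrowed task flags; restoring only the saved bit
      is what prevents a leak in either direction.

        #include <stdio.h>

        #define PF_MEMALLOC 0x0800u /* illustrative value */

        static unsigned long current_flags = PF_MEMALLOC; /* task was in reclaim */

        static void do_softirq_work(void)
        {
            printf("softirq runs with PF_MEMALLOC=%d\n",
                   !!(current_flags & PF_MEMALLOC));
        }

        static void run_softirq(void)
        {
            /* Save the borrowed task's PF_MEMALLOC, clear it for the softirq. */
            unsigned long saved = current_flags & PF_MEMALLOC;

            current_flags &= ~(unsigned long)PF_MEMALLOC;
            do_softirq_work();

            /* Restore exactly what was saved so nothing leaks either way. */
            current_flags = (current_flags & ~(unsigned long)PF_MEMALLOC) | saved;
        }

        int main(void)
        {
            run_softirq();
            printf("task PF_MEMALLOC restored=%d\n",
                   !!(current_flags & PF_MEMALLOC));
            return 0;
        }
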
    • M
      mm: introduce __GFP_MEMALLOC to allow access to emergency reserves · b37f1dd0
      Mel Gorman authored
      __GFP_MEMALLOC will allow the allocation to disregard the watermarks, much
      like PF_MEMALLOC.  It allows one to pass along the memalloc state in
      object related allocation flags as opposed to task related flags, such as
      sk->sk_allocation.  This removes the need for ALLOC_PFMEMALLOC as callers
      using __GFP_MEMALLOC can get the ALLOC_NO_WATERMARK flag which is now
      enough to identify allocations related to page reclaim.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Cc: David Miller <davem@davemloft.net>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Mike Christie <michaelc@cs.wisc.edu>
      Cc: Eric B Munson <emunson@mgebm.net>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Christoph Lameter <cl@linux.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b37f1dd0
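
      A sketch of the gfp-to-alloc-flags mapping implied by b37f1dd0, with
      illustrative flag values (not the kernel's): either __GFP_MEMALLOC on the
      allocation or PF_MEMALLOC on the task, outside interrupt context, is
      enough to grant ALLOC_NO_WATERMARKS.

        #include <stdbool.h>
        #include <stdio.h>

        #define GFP_MEMALLOC_BIT     0x1u /* stands in for __GFP_MEMALLOC      */
        #define PF_MEMALLOC_BIT      0x2u /* stands in for PF_MEMALLOC         */
        #define ALLOC_NO_WATERMARKS  0x4u /* stands in for the alloc-time flag */

        static unsigned int gfp_to_alloc_flags(unsigned int gfp_mask,
                                               unsigned int task_flags,
                                               bool in_interrupt)
        {
            unsigned int alloc_flags = 0;

            if (gfp_mask & GFP_MEMALLOC_BIT)
                alloc_flags |= ALLOC_NO_WATERMARKS;      /* per-allocation */
            else if (!in_interrupt && (task_flags & PF_MEMALLOC_BIT))
                alloc_flags |= ALLOC_NO_WATERMARKS;      /* per-task       */

            return alloc_flags;
        }

        int main(void)
        {
            printf("__GFP_MEMALLOC:     %u\n",
                   gfp_to_alloc_flags(GFP_MEMALLOC_BIT, 0, false));
            printf("PF_MEMALLOC in irq: %u\n",
                   gfp_to_alloc_flags(0, PF_MEMALLOC_BIT, true));
            return 0;
        }
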
    • C
      mm: slub: optimise the SLUB fast path to avoid pfmemalloc checks · 5091b74a
      Christoph Lameter authored
      This patch removes the check for pfmemalloc from the alloc hotpath and
      puts the logic after the election of a new per cpu slab.  For a pfmemalloc
      page we do not use the fast path but force the use of the slow path which
      is also used for the debug case.
      
      This has the side-effect of weakening pfmemalloc processing in the
      following way;
      
      1. A process that is allocating for network swap calls __slab_alloc.
         pfmemalloc_match is true so the freelist is loaded and c->freelist is
         now pointing to a pfmemalloc page.
      
      2. A process that is attempting normal allocations calls slab_alloc,
         finds the pfmemalloc page on the freelist and uses it because it did
         not check pfmemalloc_match()
      
      The patch allows non-pfmemalloc allocations to use pfmemalloc pages with
      the kmalloc slabs being the most vulnerable caches on the grounds they
      are most likely to have a mix of pfmemalloc and !pfmemalloc requests. A
      later patch will still protect the system as processes will get throttled
      if the pfmemalloc reserves get depleted but performance will not degrade
      as smoothly.
      
      [mgorman@suse.de: Expanded changelog]
      Signed-off-by: Christoph Lameter <cl@linux.com>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Cc: David Miller <davem@davemloft.net>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Mike Christie <michaelc@cs.wisc.edu>
      Cc: Eric B Munson <emunson@mgebm.net>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
      Cc: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5091b74a
    • M
      mm: sl[au]b: add knowledge of PFMEMALLOC reserve pages · 072bb0aa
      Mel Gorman authored
      When a user or administrator requires swap for their application, they
      create a swap partition and file, format it with mkswap and activate it
      with swapon.  Swap over the network is considered as an option in diskless
      systems.  The two likely scenarios are when blade servers are used as part
      of a cluster where the form factor or maintenance costs do not allow the
      use of disks and thin clients.
      
      The Linux Terminal Server Project recommends the use of the Network Block
      Device (NBD) for swap according to the manual at
      https://sourceforge.net/projects/ltsp/files/Docs-Admin-Guide/LTSPManual.pdf/download
      There is also documentation and tutorials on how to setup swap over NBD at
      places like https://help.ubuntu.com/community/UbuntuLTSP/EnableNBDSWAP The
      nbd-client also documents the use of NBD as swap.  Despite this, the fact
      is that a machine using NBD for swap can deadlock within minutes if swap
      is used intensively.  This patch series addresses the problem.
      
      The core issue is that network block devices do not use mempools like
      normal block devices do.  As the host cannot control where they receive
      packets from, they cannot reliably work out in advance how much memory
      they might need.  Some years ago, Peter Zijlstra developed a series of
      patches that supported swap over an NFS that at least one distribution is
      carrying within their kernels.  This patch series borrows very heavily
      from Peter's work to support swapping over NBD as a pre-requisite to
      supporting swap-over-NFS.  The bulk of the complexity is concerned with
      preserving memory that is allocated from the PFMEMALLOC reserves for use
      by the network layer which is needed for both NBD and NFS.
      
      Patch 1 adds knowledge of the PFMEMALLOC reserves to SLAB and SLUB to
      	preserve access to pages allocated under low memory situations
      	to callers that are freeing memory.
      
      Patch 2 optimises the SLUB fast path to avoid pfmemalloc checks
      
      Patch 3 introduces __GFP_MEMALLOC to allow access to the PFMEMALLOC
      	reserves without setting PFMEMALLOC.
      
      Patch 4 opens the possibility for softirqs to use PFMEMALLOC reserves
      	for later use by network packet processing.
      
      Patch 5 only sets page->pfmemalloc when ALLOC_NO_WATERMARKS was required
      
      Patch 6 ignores memory policies when ALLOC_NO_WATERMARKS is set.
      
      Patches 7-12 allows network processing to use PFMEMALLOC reserves when
      	the socket has been marked as being used by the VM to clean pages. If
      	packets are received and stored in pages that were allocated under
      	low-memory situations and are unrelated to the VM, the packets
      	are dropped.
      
      	Patch 11 reintroduces __skb_alloc_page which the networking
      	folk may object to but is needed in some cases to propagate
      	pfmemalloc from a newly allocated page to an skb. If there is a
      	strong objection, this patch can be dropped with the impact being
      	that swap-over-network will be slower in some cases but it should
      	not fail.
      
      Patch 13 is a micro-optimisation to avoid a function call in the
      	common case.
      
      Patch 14 tags NBD sockets as being SOCK_MEMALLOC so they can use
      	PFMEMALLOC if necessary.
      
       Patch 15 notes that it is still possible for the PFMEMALLOC reserve
       	to be depleted. To prevent this, direct reclaimers get throttled on
       	a waitqueue if 50% of the PFMEMALLOC reserves are depleted.  It is
       	expected that kswapd and the direct reclaimers already running
       	will clean enough pages for the low watermark to be reached, at
       	which point the throttled processes are woken up (see the sketch
       	after this list).
      
      Patch 16 adds a statistic to track how often processes get throttled
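
       As a rough userspace model of that throttling idea (illustrative only,
       with invented names; the kernel implementation uses its own waitqueue
       and is not reproduced here), direct reclaimers block until enough of
       the reserve has been refilled:

       /* Userspace model of the Patch 15 throttling idea; not kernel code. */
       #include <pthread.h>

       static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
       static pthread_cond_t pfmemalloc_wait = PTHREAD_COND_INITIALIZER;
       static long reserve_pages = 1024;	/* assumed size of the reserve */
       static long reserve_free = 1024;	/* how much of it is still free */

       /* called on the direct reclaim path before doing more work */
       void throttle_direct_reclaim_model(void)
       {
       	pthread_mutex_lock(&lock);
       	/* block while more than 50% of the reserve has been consumed */
       	while (reserve_free < reserve_pages / 2)
       		pthread_cond_wait(&pfmemalloc_wait, &lock);
       	pthread_mutex_unlock(&lock);
       }

       /* called when kswapd or other reclaimers return pages to the reserve */
       void reserve_pages_freed_model(long nr)
       {
       	pthread_mutex_lock(&lock);
       	reserve_free += nr;
       	if (reserve_free >= reserve_pages / 2)
       		pthread_cond_broadcast(&pfmemalloc_wait);
       	pthread_mutex_unlock(&lock);
       }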
      
      Some basic performance testing was run using kernel builds, netperf on
      loopback for UDP and TCP, hackbench (pipes and sockets), iozone and
       sysbench.  Each of them was expected to use the sl*b allocators
       reasonably heavily, but there did not appear to be any significant
       performance variance.
      
      For testing swap-over-NBD, a machine was booted with 2G of RAM with a
      swapfile backed by NBD.  8*NUM_CPU processes were started that create
      anonymous memory mappings and read them linearly in a loop.  The total
       size of the mappings was 4*PHYSICAL_MEMORY to use swap heavily under
      memory pressure.
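
       For illustration only, a stress program of roughly that shape could
       look like the sketch below (standalone C for Linux; the pass count,
       error handling and sizing arithmetic are invented here, and this is
       not the harness that was actually used):

       /* Hypothetical sketch of the swap stress test described above. */
       #include <string.h>
       #include <unistd.h>
       #include <sys/mman.h>
       #include <sys/wait.h>

       int main(void)
       {
       	long ncpus = sysconf(_SC_NPROCESSORS_ONLN);
       	long page = sysconf(_SC_PAGESIZE);
       	long phys = sysconf(_SC_PHYS_PAGES);
       	int nproc = 8 * ncpus;
       	/* total mapping size is 4 * physical memory, split over processes */
       	size_t len = (size_t)phys * page * 4 / nproc;

       	for (int i = 0; i < nproc; i++) {
       		if (fork() == 0) {
       			volatile char *buf = mmap(NULL, len,
       					PROT_READ | PROT_WRITE,
       					MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
       			if (buf == MAP_FAILED)
       				_exit(1);
       			memset((void *)buf, 1, len);	/* fault pages in */
       			for (int pass = 0; pass < 4; pass++)	/* linear reads */
       				for (size_t off = 0; off < len; off += page)
       					(void)buf[off];
       			_exit(0);
       		}
       	}
       	while (wait(NULL) > 0)
       		;
       	return 0;
       }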
      
       Without the patches and using SLUB, the machine locks up within minutes;
       with them applied, it runs to completion.  With SLAB, the story is
       different, as an unpatched kernel also runs to completion.  However,
       the patched kernel completed the test 45% faster.
      
      MICRO
                                                3.5.0-rc2 3.5.0-rc2
                                                  vanilla   swapnbd
      Unrecognised test vmscan-anon-mmap-write
      MMTests Statistics: duration
      Sys Time Running Test (seconds)             197.80    173.07
      User+Sys Time Running Test (seconds)        206.96    182.03
      Total Elapsed Time (seconds)               3240.70   1762.09
      
      This patch: mm: sl[au]b: add knowledge of PFMEMALLOC reserve pages
      
      Allocations of pages below the min watermark run a risk of the machine
      hanging due to a lack of memory.  To prevent this, only callers who have
      PF_MEMALLOC or TIF_MEMDIE set and are not processing an interrupt are
      allowed to allocate with ALLOC_NO_WATERMARKS.  Once they are allocated to
      a slab though, nothing prevents other callers consuming free objects
      within those slabs.  This patch limits access to slab pages that were
       allocated from the PFMEMALLOC reserves.
      
      When this patch is applied, pages allocated from below the low watermark
      are returned with page->pfmemalloc set and it is up to the caller to
      determine how the page should be protected.  SLAB restricts access to any
       page with page->pfmemalloc set to callers which are known to be able to
      access the PFMEMALLOC reserve.  If one is not available, an attempt is
      made to allocate a new page rather than use a reserve.  SLUB is a bit more
      relaxed in that it only records if the current per-CPU page was allocated
      from PFMEMALLOC reserve and uses another partial slab if the caller does
      not have the necessary GFP or process flags.  This was found to be
      sufficient in tests to avoid hangs due to SLUB generally maintaining
      smaller lists than SLAB.
      
      In low-memory conditions it does mean that !PFMEMALLOC allocators can fail
      a slab allocation even though free objects are available because they are
      being preserved for callers that are freeing pages.
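
       The gating can be modelled in a few lines of standalone C (a simplified
       illustration only; struct slab_page_model, caller_has_memalloc_access()
       and the flag bit are invented names, not the mm/slab.c or mm/slub.c
       implementation):

       /* Simplified standalone model of the gating; not the kernel code. */
       #include <stdbool.h>

       struct slab_page_model {
       	bool pfmemalloc;	/* page came from the PFMEMALLOC reserves */
       	int free_objects;
       };

       /* stand-in for "caller has PF_MEMALLOC/TIF_MEMDIE or a memalloc flag" */
       bool caller_has_memalloc_access(unsigned int flags)
       {
       	return flags & 0x1;	/* hypothetical flag bit */
       }

       /*
        * Hand out an object only if the caller may use this page.  A caller
        * without memalloc access is refused objects from a pfmemalloc page
        * even when free objects exist and must fall back to allocating a
        * fresh page, mirroring the behaviour described above.
        */
       int alloc_object_from_page(struct slab_page_model *page, unsigned int flags)
       {
       	if (page->pfmemalloc && !caller_has_memalloc_access(flags))
       		return -1;	/* try to get a !pfmemalloc page instead */
       	if (page->free_objects == 0)
       		return -1;
       	page->free_objects--;
       	return 0;
       }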
      
      [a.p.zijlstra@chello.nl: Original implementation]
      [sebastian@breakpoint.cc: Correct order of page flag clearing]
       Signed-off-by: Mel Gorman <mgorman@suse.de>
      Cc: David Miller <davem@davemloft.net>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Mike Christie <michaelc@cs.wisc.edu>
      Cc: Eric B Munson <emunson@mgebm.net>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Christoph Lameter <cl@linux.com>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
       Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      072bb0aa
    • M
      memory-hotplug: fix kswapd looping forever problem · 702d1a6e
       Minchan Kim committed
       When hotplug offlining happens on zone A, it starts to mark freed pages
       as MIGRATE_ISOLATE type in the buddy allocator to prevent further
       allocation.  (MIGRATE_ISOLATE is a rather ironic type because the pages
       appear to be in the buddy allocator but we can't allocate them.)
      
       When a memory shortage happens during hotplug offlining, the current
       task starts to reclaim and then wakes up kswapd.  Kswapd checks the
       watermark and then goes to sleep, because the current
       zone_watermark_ok_safe doesn't take the MIGRATE_ISOLATE freed page
       count into account.  The current task continues to reclaim in the
       direct reclaim path without kswapd's help.  The problem is that
       zone->all_unreclaimable is set only by kswapd, so the current task can
       end up looping forever, like below.
      
      __alloc_pages_slowpath
      restart:
      	wake_all_kswapd
      rebalance:
      	__alloc_pages_direct_reclaim
      		do_try_to_free_pages
      			if global_reclaim && !all_unreclaimable
      				return 1; /* It means we did did_some_progress */
      	skip __alloc_pages_may_oom
      	should_alloc_retry
      		goto rebalance;
      
       If we apply KOSAKI's patch[1], which doesn't depend on kswapd to set
       zone->all_unreclaimable, we can solve this problem by killing some task
       in the direct reclaim path.  But it still doesn't wake up kswapd, which
       could remain a problem if another subsystem needs a GFP_ATOMIC request.
       So kswapd should take MIGRATE_ISOLATE into account when it calculates
       free pages BEFORE going to sleep.
      
       This patch counts the number of MIGRATE_ISOLATE pageblocks, and
       zone_watermark_ok_safe will take them into account if the system has
       such blocks (fortunately, this is very rare, so the overhead is not a
       concern and kswapd is never a hot path).
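
       In outline, the accounting amounts to something like the following
       simplified model (standalone C with invented names such as
       isolated_free_pages; the real change lives in zone_watermark_ok_safe()
       and the pageblock isolation code):

       /* Simplified model of the safe watermark check; not mm/page_alloc.c. */
       #include <stdbool.h>

       struct zone_model {
       	long free_pages;		/* what the vmstat counters report */
       	long isolated_free_pages;	/* free pages in MIGRATE_ISOLATE blocks */
       	long min_watermark;
       	long lowmem_reserve;
       };

       /*
        * Free pages sitting in MIGRATE_ISOLATE pageblocks look free to the
        * counters but cannot actually be allocated, so discount them before
        * deciding that the zone is balanced and kswapd may go to sleep.
        */
       bool zone_watermark_ok_safe_model(const struct zone_model *z)
       {
       	long usable = z->free_pages - z->isolated_free_pages;

       	return usable > z->min_watermark + z->lowmem_reserve;
       }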
      
       Copied/modified from Mel's comment:
       "
       The ideal solution would be "allocating" the pageblock.
       It would keep the free space accounting as it is, but historically
       memory hotplug didn't allocate pages because it would be difficult to
       detect if a pageblock was isolated or part of some balloon.
       Allocating just full pageblocks would work around this; however,
       it would play very badly with CMA.
       "
      
      [1] http://lkml.org/lkml/2012/6/14/74
      
      [akpm@linux-foundation.org: simplify nr_zone_isolate_freepages(), rework zone_watermark_ok_safe() comment, simplify set_pageblock_isolate() and restore_pageblock_isolate()]
      [akpm@linux-foundation.org: fix CONFIG_MEMORY_ISOLATION=n build]
       Signed-off-by: Minchan Kim <minchan@kernel.org>
       Suggested-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
       Tested-by: Aaditya Kumar <aaditya.kumar.30@gmail.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michal Hocko <mhocko@suse.cz>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
       Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      702d1a6e
    • M
      mm: fix free page check in zone_watermark_ok() · 2cfed075
       Minchan Kim committed
       __zone_watermark_ok currently compares free_pages, which is a signed
       type, with z->lowmem_reserve[classzone_idx], which is unsigned.  This
       might lead to sign overflow if free_pages doesn't satisfy the given
       order (or was already negative), and we then rely on the following
       order loop to fix it up (which doesn't work for order-0).  Let's fix
       the type conversion and not rely on the given value of free_pages or
       on follow-up fixups.
      
       This is fixed now because "memory-hotplug: fix kswapd looping forever
       problem" depends on it.
      
       As a benefit, this patch no longer relies on the loop to exit
       __zone_watermark_ok in the high-order case and makes the first test
       effective (i.e., if (free_pages <= min + lowmem_reserve)).
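
       The pitfall is easy to reproduce in standalone C (illustrative only;
       the names mirror the ones above, and forcing the comparison into
       signed arithmetic is roughly what the type fix amounts to):

       #include <stdio.h>

       int main(void)
       {
       	long free_pages = -10;		/* free_pages can go negative */
       	unsigned long min = 128;
       	unsigned long lowmem_reserve = 256;

       	/*
       	 * The usual arithmetic conversions turn free_pages into a huge
       	 * unsigned value here, so the "not enough free pages" branch is
       	 * wrongly skipped.
       	 */
       	if (free_pages <= min + lowmem_reserve)
       		printf("unsigned compare: watermark not ok (correct)\n");
       	else
       		printf("unsigned compare: watermark ok (WRONG)\n");

       	/* comparing in signed arithmetic gives the expected result */
       	if (free_pages <= (long)(min + lowmem_reserve))
       		printf("signed compare: watermark not ok (correct)\n");
       	else
       		printf("signed compare: watermark ok (WRONG)\n");
       	return 0;
       }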
      
       Aaditya reported this problem when he tested my hotplug patch.
       Reported-by: Aaditya Kumar <aaditya.kumar@ap.sony.com>
       Tested-by: Aaditya Kumar <aaditya.kumar@ap.sony.com>
       Signed-off-by: Aaditya Kumar <aaditya.kumar@ap.sony.com>
       Signed-off-by: Minchan Kim <minchan@kernel.org>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Mel Gorman <mgorman@suse.de>
       Reviewed-by: Michal Hocko <mhocko@suse.cz>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
       Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2cfed075
    • M
      mm: factor out memory isolate functions · ee6f509c
       Minchan Kim committed
       mm/page_alloc.c has some memory isolation functions, but they are used
       only when we enable CONFIG_{CMA|MEMORY_HOTPLUG|MEMORY_FAILURE}.  So
       let's make them configurable via a new CONFIG_MEMORY_ISOLATION option,
       which reduces binary size when none of those are enabled and lets code
       simply check CONFIG_MEMORY_ISOLATION instead of
       "defined CONFIG_{CMA|MEMORY_HOTPLUG|MEMORY_FAILURE}".
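
       On the C side the effect can be sketched as below (the declaration is
       a placeholder rather than the real interface; on the Kconfig side the
       three options would presumably select the new symbol):

       /* Illustrative placeholder declarations only, not the real API. */

       /* before: each user repeats the full condition */
       #if defined(CONFIG_CMA) || defined(CONFIG_MEMORY_HOTPLUG) || \
           defined(CONFIG_MEMORY_FAILURE)
       void isolation_helper_placeholder(void);
       #endif

       /* after: a single symbol, enabled whenever one of the three is */
       #ifdef CONFIG_MEMORY_ISOLATION
       void isolation_helper_placeholder(void);
       #endif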
       Signed-off-by: Minchan Kim <minchan@kernel.org>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
       Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michal Hocko <mhocko@suse.cz>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
       Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ee6f509c
    • D
      mm, memcg: move all oom handling to memcontrol.c · 876aafbf
       David Rientjes committed
      By globally defining check_panic_on_oom(), the memcg oom handler can be
      moved entirely to mm/memcontrol.c.  This removes the ugly #ifdef in the
      oom killer and cleans up the code.
       Signed-off-by: David Rientjes <rientjes@google.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
       Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
       Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      876aafbf