1. 30 May 2012 (26 commits)
    • mm/memcg: get_lru_size not get_lruvec_size · 4d7dcca2
      Committed by Hugh Dickins
      Konstantin just introduced mem_cgroup_get_lruvec_size() and
      get_lruvec_size(), I'm about to add mem_cgroup_update_lru_size(): but
      we're dealing with the same thing, lru_size[lru].  We ought to agree on
      the naming, and I do think lru_size is the more correct: so rename his
      ones to get_lru_size().
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/vmscan: kill struct mem_cgroup_zone · f9be23d6
      Committed by Konstantin Khlebnikov
      Kill struct mem_cgroup_zone and rename shrink_mem_cgroup_zone() to
      shrink_lruvec(); it always shrinks one lruvec, which it takes as an
      argument.
      Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/vmscan: push lruvec pointer into should_continue_reclaim() · 90bdcfaf
      Committed by Konstantin Khlebnikov
      Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/vmscan: push lruvec pointer into get_scan_count() · 90126375
      Committed by Konstantin Khlebnikov
      Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/vmscan: push lruvec pointer into shrink_list() · 1a93be0e
      Committed by Konstantin Khlebnikov
      Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/vmscan: push lruvec pointer into inactive_list_is_low() · c56d5c7d
      Committed by Konstantin Khlebnikov
      Switch mem_cgroup_inactive_anon_is_low() to lruvec pointers;
      mem_cgroup_get_lruvec_size() is more efficient than
      mem_cgroup_zone_nr_lru_pages().
      Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/vmscan: replace zone_nr_lru_pages() with get_lruvec_size() · 074291fe
      Committed by Konstantin Khlebnikov
      If the memory cgroup is enabled we always use lruvecs, which are embedded
      into struct mem_cgroup_per_zone, so we can reach the lru_size counters
      via container_of() (see the sketch below).
      Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
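      A minimal standalone C sketch of the container_of() pattern this commit
      relies on.  The structures here are simplified stand-ins for the kernel's
      struct mem_cgroup_per_zone and struct lruvec; this is an illustration,
      not the actual memcontrol.c code.
        #include <stddef.h>
        #include <stdio.h>

        /* Simplified stand-ins for the kernel structures. */
        enum lru_list { LRU_INACTIVE_ANON, LRU_ACTIVE_ANON,
                        LRU_INACTIVE_FILE, LRU_ACTIVE_FILE, NR_LRU_LISTS };

        struct lruvec {
                int placeholder;        /* the real struct holds the LRU list heads */
        };

        struct mem_cgroup_per_zone {
                struct lruvec lruvec;   /* embedded, so container_of() can recover it */
                unsigned long lru_size[NR_LRU_LISTS];
        };

        /* container_of(): recover the enclosing object from a pointer to a member. */
        #define container_of(ptr, type, member) \
                ((type *)((char *)(ptr) - offsetof(type, member)))

        static unsigned long get_lru_size(struct lruvec *lruvec, enum lru_list lru)
        {
                struct mem_cgroup_per_zone *mz =
                        container_of(lruvec, struct mem_cgroup_per_zone, lruvec);
                return mz->lru_size[lru];
        }

        int main(void)
        {
                struct mem_cgroup_per_zone mz = { .lru_size = { 0, 0, 42, 7 } };

                printf("%lu\n", get_lru_size(&mz.lruvec, LRU_INACTIVE_FILE));
                return 0;
        }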
    • mm/vmscan: push lruvec pointer into putback_inactive_pages() · 27ac81d8
      Committed by Konstantin Khlebnikov
      As zone_reclaim_stat is now located in the lruvec, we can reach it
      directly.
      Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/vmscan: remove update_isolated_counts() · 95d918fc
      Committed by Konstantin Khlebnikov
      update_isolated_counts() is no longer required, because lumpy-reclaim was
      removed.  The insanity is over; now there is only one kind of inactive page.
      Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/vmscan: push zone pointer into shrink_page_list() · 6a18adb3
      Committed by Konstantin Khlebnikov
      It doesn't need a pointer to the cgroup - a pointer to the zone is enough.
      This patch also kills the "mz" argument of page_check_references() - it is
      unused after "mm: memcg: count pte references from every member of the
      reclaimed hierarchy".
      Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/vmscan: push lruvec pointer into isolate_lru_pages() · 5dc35979
      Committed by Konstantin Khlebnikov
      Move the mem_cgroup_zone_lruvec() call from isolate_lru_pages() into
      shrink_[in]active_list().  Further patches push it to shrink_zone() step
      by step.
      Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/vmscan: store "priority" in struct scan_control · 9e3b2f8c
      Committed by Konstantin Khlebnikov
      In memory reclaim some functions have too many arguments - "priority" is
      one of them.  It can be stored in struct scan_control, which is constructed
      at the same level.  Instead of an open-coded loop we set the initial
      sc.priority, and do_try_to_free_pages() decreases it down to zero (see the
      sketch below).
      Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
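      A small self-contained sketch of the pattern described above: keep
      "priority" in the control structure instead of threading it through every
      call, and let the top-level loop count it down from DEF_PRIORITY to zero.
      The reclaim step is a trivial stand-in, not the real shrink_zones().
        #include <stdio.h>

        #define DEF_PRIORITY 12          /* as in the kernel: start by scanning a small fraction */

        struct scan_control {
                unsigned long nr_to_reclaim;
                unsigned long nr_reclaimed;
                int priority;            /* stored here instead of being an argument */
        };

        /* Stand-in for shrink_zones(): pretend lower priority reclaims more pages. */
        static void shrink_zones(struct scan_control *sc)
        {
                sc->nr_reclaimed += (unsigned long)(DEF_PRIORITY - sc->priority + 1) * 4;
        }

        /* Shape of the do_try_to_free_pages() loop: decrease sc->priority to zero. */
        static unsigned long do_try_to_free_pages(struct scan_control *sc)
        {
                for (sc->priority = DEF_PRIORITY; sc->priority >= 0; sc->priority--) {
                        shrink_zones(sc);
                        if (sc->nr_reclaimed >= sc->nr_to_reclaim)
                                break;
                }
                return sc->nr_reclaimed;
        }

        int main(void)
        {
                struct scan_control sc = { .nr_to_reclaim = 32 };
                unsigned long total = do_try_to_free_pages(&sc);

                printf("reclaimed %lu pages, stopped at priority %d\n", total, sc.priority);
                return 0;
        }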
    • mm/memcg: use vm_swappiness from target memory cgroup · 3d58ab5c
      Committed by Konstantin Khlebnikov
      Use the vm_swappiness value from the memory cgroup that triggered this
      memory reclaim.  This is more reasonable and allows us to kill one
      argument (see the sketch below).

      [akpm@linux-foundation.org: fix build (patch skew)]
      Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Glauber Costa <glommer@parallels.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
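      A hedged sketch of the swappiness selection this commit describes: take
      the value from the memcg whose limit triggered reclaim, otherwise fall
      back to the global vm_swappiness.  The structure layout and helper below
      are illustrative simplifications, not the exact kernel definitions.
        #include <stdio.h>

        static int vm_swappiness = 60;                  /* /proc/sys/vm/swappiness default */

        struct mem_cgroup { int swappiness; };          /* memory.swappiness of the cgroup */

        struct scan_control {
                struct mem_cgroup *target_mem_cgroup;   /* NULL for global reclaim */
        };

        /* Pick swappiness from the memcg that triggered this reclaim, if any. */
        static int vmscan_swappiness(const struct scan_control *sc)
        {
                if (!sc->target_mem_cgroup)
                        return vm_swappiness;
                return sc->target_mem_cgroup->swappiness;
        }

        int main(void)
        {
                struct mem_cgroup memcg = { .swappiness = 10 };
                struct scan_control global = { NULL }, limit = { &memcg };

                printf("global: %d, memcg limit: %d\n",
                       vmscan_swappiness(&global), vmscan_swappiness(&limit));
                return 0;
        }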
    • mm/memcg: kill mem_cgroup_lru_del() · bbf808ed
      Committed by Konstantin Khlebnikov
      This patch kills mem_cgroup_lru_del(); we can use
      mem_cgroup_lru_del_list() instead.  On 0-order isolation we already have
      the right lru list id.
      Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Glauber Costa <glommer@parallels.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: remove lru type checks from __isolate_lru_page() · f3fd4a61
      Committed by Konstantin Khlebnikov
      After the patch "mm: forbid lumpy-reclaim in shrink_active_list()" we can
      completely remove the anon/file and active/inactive lru type filters from
      __isolate_lru_page(), because isolation for 0-order reclaim always
      isolates pages from the right lru list.  Page isolation for lumpy
      shrink_inactive_list() or memory compaction is in any case allowed to
      take pages from all evictable lru lists.
      Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Hugh Dickins <hughd@google.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: Glauber Costa <glommer@parallels.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: push lru index into shrink_[in]active_list() · 3cb99451
      Committed by Konstantin Khlebnikov
      Let's pass the lru index down the call stack to isolate_lru_pages(); this
      is better than reconstructing it from individual bits.

      [akpm@linux-foundation.org: fix kerneldoc, per Minchan]
      Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: Hugh Dickins <hughd@google.com>
      Cc: Glauber Costa <glommer@parallels.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/memcg: move reclaim_stat into lruvec · 89abfab1
      Committed by Hugh Dickins
      With mem_cgroup_disabled() now explicit, it becomes clear that the
      zone_reclaim_stat structure actually belongs in lruvec, per-zone when
      memcg is disabled but per-memcg per-zone when it's enabled.
      
      We can delete mem_cgroup_get_reclaim_stat(), and change
      update_page_reclaim_stat() to update just the one set of stats, the one
      which get_scan_count() will actually use.
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Reviewed-by: Minchan Kim <minchan@kernel.org>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Cc: Glauber Costa <glommer@parallels.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/memcg: scanning_global_lru means mem_cgroup_disabled · c3c787e8
      Committed by Hugh Dickins
      Although one has to admire the skill with which it has been concealed,
      scanning_global_lru(mz) is actually just an interesting way to test
      mem_cgroup_disabled().  Too many developer hours have been wasted on
      confusing it with global_reclaim(): just use mem_cgroup_disabled().
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: Glauber Costa <glommer@parallels.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: avoid swapping out with swappiness==0 · fe35004f
      Committed by Satoru Moriya
      Sometimes we'd like to avoid swapping out anonymous memory.  In
      particular, we want to avoid swapping out pages of important processes or
      process groups while there is a reasonable amount of pagecache in RAM, so
      that we can satisfy our customers' requirements.

      OTOH, we can control how aggressively the kernel will swap memory pages
      with /proc/sys/vm/swappiness for global reclaim and
      /sys/fs/cgroup/memory/memory.swappiness for each memcg.

      But with the current reclaim implementation, the kernel may swap out even
      if we set swappiness=0 and there is pagecache in RAM.

      This patch changes the behavior with swappiness==0.  If we set
      swappiness==0, the kernel does not swap out at all (for global reclaim,
      until the amount of free pages plus file-backed pages in a zone has been
      reduced to something very very small, i.e. nr_free + nr_filebacked < high
      watermark).  A simplified sketch of the resulting decision follows this
      entry.
      Signed-off-by: Satoru Moriya <satoru.moriya@hds.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Acked-by: Jerome Marchand <jmarchan@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
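      A simplified sketch of the swappiness==0 rule described above for global
      reclaim.  The zone fields and the helper are illustrative stand-ins; the
      real get_scan_count() logic has more cases than this.
        #include <stdbool.h>
        #include <stdio.h>

        struct zone_state {
                unsigned long nr_free;       /* free pages in the zone */
                unsigned long nr_file;       /* file-backed (page cache) pages */
                unsigned long high_wmark;    /* the zone's high watermark */
        };

        /* Sketch of the swappiness==0 rule for global reclaim: leave anon pages
         * alone while free + file-backed pages still exceed the high watermark;
         * only then fall back to scanning anon as well. */
        static bool may_scan_anon(const struct zone_state *z, int swappiness)
        {
                if (swappiness > 0)
                        return true;
                return z->nr_free + z->nr_file <= z->high_wmark;
        }

        int main(void)
        {
                struct zone_state z = { .nr_free = 1000, .nr_file = 5000, .high_wmark = 2000 };

                printf("plenty of page cache, swappiness=0: scan anon? %d\n",
                       may_scan_anon(&z, 0));
                z.nr_file = 500;             /* page cache nearly exhausted */
                printf("cache exhausted,     swappiness=0: scan anon? %d\n",
                       may_scan_anon(&z, 0));
                return 0;
        }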
    • mm: consider all swapped back pages in used-once logic · e4898273
      Committed by Michal Hocko
      Commit 64574746 ("vmscan: detect mapped file pages used only once")
      made mapped pages have another round in the inactive list because they
      might be just short lived and so we could consider them again next time.
      This heuristic helps to reduce pressure on the active list with streaming
      IO workloads.

      This patch fixes a regression introduced by that commit for heavy
      shmem-based workloads, because unlike anon pages, which are excluded from
      this heuristic since they are usually long lived, shmem pages are handled
      as regular page cache.
      
      This doesn't work quite well, unfortunately, if the workload is mostly
      backed by shmem (in memory database sitting on 80% of memory) with a
      streaming IO in the background (backup - up to 20% of memory).  Anon
      inactive list is full of (dirty) shmem pages when watermarks are hit.
      Shmem pages are kept in the inactive list (they are referenced) in the
      first round and it is hard to reclaim anything else so we reach lower
      scanning priorities very quickly which leads to an excessive swap out.
      
      Let's fix this by excluding all swap-backed pages (they tend to be long
      lived compared to the regular page cache anyway) from the used-once
      heuristic, and rather activate them if they are referenced (see the
      sketch below).
      
      The customer's workload is shmem backed database (80% of RAM) and they
      are measuring transactions/s with an IO in the background (20%).
      Transactions touch more or less random rows in the table.  The
      transaction rate fell by a factor of 3 (in the worst case) because of
      commit 64574746.  This patch restores the previous numbers.
      Signed-off-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Cc: <stable@vger.kernel.org>  [2.6.34+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
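      A hedged sketch of the resulting used-once decision: referenced
      swap-backed pages (anon and shmem alike) are activated, while referenced
      regular page-cache pages still get one more round on the inactive list.
      The real page_check_references() has additional cases (for example for
      executable file pages); the names below only mirror vmscan.c loosely.
        #include <stdbool.h>
        #include <stdio.h>

        enum page_references {
                PAGEREF_RECLAIM,          /* not referenced: reclaim it */
                PAGEREF_KEEP,             /* referenced once: one more inactive round */
                PAGEREF_ACTIVATE,         /* promote to the active list */
        };

        struct page_info {
                bool referenced;          /* pte/young references found by the rmap walk */
                bool swap_backed;         /* anon or shmem/tmpfs page */
        };

        /* Sketch of the post-patch decision: all swap-backed pages that were
         * referenced are activated instead of getting the used-once second chance. */
        static enum page_references page_check_references(const struct page_info *p)
        {
                if (!p->referenced)
                        return PAGEREF_RECLAIM;
                if (p->swap_backed)
                        return PAGEREF_ACTIVATE;
                return PAGEREF_KEEP;      /* streaming file cache: used-once heuristic */
        }

        int main(void)
        {
                struct page_info shmem = { .referenced = true, .swap_backed = true };
                struct page_info file  = { .referenced = true, .swap_backed = false };

                printf("shmem -> %d, file -> %d\n",
                       page_check_references(&shmem), page_check_references(&file));
                return 0;
        }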
    • mm: rename is_mlocked_vma() to mlocked_vma_newpage() · 096a7cf4
      Committed by Ying Han
      Andrew pointed out that is_mlocked_vma() is misnamed.  From a function
      with a name like that one would expect a bool return and no side effects.

      Since it is called on the fault path for a new page, rename it in this
      patch.
      Signed-off-by: Ying Han <yinghan@google.com>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reviewed-by: Minchan Kim <minchan@kernel.org>
      [akpm@linux-foundation.org: s/mlock_vma_newpage/mlocked_vma_newpage/, per Minchan]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: memcg: count pte references from every member of the reclaimed hierarchy · c3ac9a8a
      Committed by Johannes Weiner
      The rmap walker checking page table references has historically ignored
      references from VMAs that were not part of the memcg that was being
      reclaimed during memcg hard limit reclaim.
      
      When transitioning global reclaim to memcg hierarchy reclaim, I missed
      that bit and now references from outside a memcg are ignored even during
      global reclaim.
      
      Reverting back to the traditional behaviour (count all references during
      global reclaim and only mind references of the memcg being reclaimed
      during limit reclaim) would be one option.
      
      However, the more generic idea is to ignore references exactly when
      they are outside the hierarchy that is currently under reclaim, because
      only then will their reclamation be of any use to help the pressure
      situation.  It makes no sense to ignore references from a sibling memcg
      and then evict a page that will be immediately refaulted by that sibling
      which contributes to the same usage of the common ancestor under
      reclaim.
      
      The solution: make the rmap walker ignore references from VMAs that are
      not part of the hierarchy that is being reclaimed (see the sketch below).
      
      Flat limit reclaim will stay the same; hierarchical limit reclaim will
      mind references only to pages that the hierarchy owns.  Global
      reclaim, since it reclaims from all memcgs, will be fixed to regard all
      references.
      
      [akpm@linux-foundation.org: name the args in the declaration]
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reported-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Acked-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
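      A minimal illustration of the membership test the commit describes: a
      reference counts only if the referencing VMA's memcg lies inside the
      hierarchy under reclaim, and global reclaim counts everything.  The
      structure and helper are hypothetical simplifications, not the kernel's
      mem_cgroup API.
        #include <stdbool.h>
        #include <stdio.h>

        /* Minimal memcg hierarchy: each cgroup only knows its parent. */
        struct memcg {
                const char *name;
                struct memcg *parent;
        };

        /* Is 'memcg' inside the hierarchy rooted at 'root'?  (root == NULL
         * stands for global reclaim, where every reference counts.) */
        static bool in_reclaimed_hierarchy(struct memcg *memcg, struct memcg *root)
        {
                if (!root)
                        return true;
                for (; memcg; memcg = memcg->parent)
                        if (memcg == root)
                                return true;
                return false;
        }

        int main(void)
        {
                struct memcg a = { "a", NULL }, b = { "b", &a }, c = { "c", NULL };

                /* reclaiming hierarchy "a": references from "b" count, from "c" do not */
                printf("b: %d, c: %d, global: %d\n",
                       in_reclaimed_hierarchy(&b, &a),
                       in_reclaimed_hierarchy(&c, &a),
                       in_reclaimed_hierarchy(&c, NULL));
                return 0;
        }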
    • mm: vmscan: remove reclaim_mode_t · 23b9da55
      Committed by Mel Gorman
      There is little motivation for reclaim_mode_t once RECLAIM_MODE_[A]SYNC
      and lumpy reclaim have been removed.  This patch gets rid of
      reclaim_mode_t as well and improves the documentation about what
      reclaim/compaction is and when it is triggered.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Rik van Riel <riel@redhat.com>
      Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ying Han <yinghan@google.com>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: vmscan: do not stall on writeback during memory compaction · 41ac1999
      Committed by Mel Gorman
      This patch stops reclaim/compaction from entering sync reclaim, which was
      only intended for lumpy reclaim and was used here by oversight.  Page
      migration has its own logic for stalling on writeback pages if necessary,
      and memory compaction is already using it.
      
      Waiting on page writeback is bad for a number of reasons but the primary
      one is that waiting on writeback to a slow device like USB can take a
      considerable length of time.  Page reclaim instead uses
      wait_iff_congested() to throttle if too many dirty pages are being
      scanned.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Rik van Riel <riel@redhat.com>
      Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ying Han <yinghan@google.com>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: vmscan: remove lumpy reclaim · c53919ad
      Committed by Mel Gorman
      This series removes lumpy reclaim and some stalling logic that was
      unintentionally being used by memory compaction.  The end result is that
      stalling on dirty pages during page reclaim now depends on
      wait_iff_congested().
      
      Four kernels were compared
      
        3.3.0     vanilla
        3.4.0-rc2 vanilla
        3.4.0-rc2 lumpyremove-v2 is patch one from this series
        3.4.0-rc2 nosync-v2r3 is the full series
      
      Removing lumpy reclaim saves almost 900 bytes of text whereas the full
      series removes 1200 bytes.
      
             text     data      bss       dec     hex  filename
          6740375  1927944  2260992  10929311  a6c49f  vmlinux-3.4.0-rc2-vanilla
          6739479  1927944  2260992  10928415  a6c11f  vmlinux-3.4.0-rc2-lumpyremove-v2
          6739159  1927944  2260992  10928095  a6bfdf  vmlinux-3.4.0-rc2-nosync-v2
      
      There are behaviour changes in the series and so tests were run with
      monitoring of ftrace events.  This disrupts results so the performance
      results are distorted but the new behaviour should be clearer.
      
      fs-mark running in a threaded configuration showed little of interest as
      it did not push reclaim aggressively
      
        FS-Mark Multi Threaded
                                3.3.0-vanilla       rc2-vanilla       lumpyremove-v2r3       nosync-v2r3
        Files/s  min           3.20 ( 0.00%)        3.20 ( 0.00%)        3.20 ( 0.00%)        3.20 ( 0.00%)
        Files/s  mean          3.20 ( 0.00%)        3.20 ( 0.00%)        3.20 ( 0.00%)        3.20 ( 0.00%)
        Files/s  stddev        0.00 ( 0.00%)        0.00 ( 0.00%)        0.00 ( 0.00%)        0.00 ( 0.00%)
        Files/s  max           3.20 ( 0.00%)        3.20 ( 0.00%)        3.20 ( 0.00%)        3.20 ( 0.00%)
        Overhead min      508667.00 ( 0.00%)   521350.00 (-2.49%)   544292.00 (-7.00%)   547168.00 (-7.57%)
        Overhead mean     551185.00 ( 0.00%)   652690.73 (-18.42%)   991208.40 (-79.83%)   570130.53 (-3.44%)
        Overhead stddev    18200.69 ( 0.00%)   331958.29 (-1723.88%)  1579579.43 (-8578.68%)     9576.81 (47.38%)
        Overhead max      576775.00 ( 0.00%)  1846634.00 (-220.17%)  6901055.00 (-1096.49%)   585675.00 (-1.54%)
        MMTests Statistics: duration
        Sys Time Running Test (seconds)             309.90    300.95    307.33    298.95
        User+Sys Time Running Test (seconds)        319.32    309.67    315.69    307.51
        Total Elapsed Time (seconds)               1187.85   1193.09   1191.98   1193.73
      
        MMTests Statistics: vmstat
        Page Ins                                       80532       82212       81420       79480
        Page Outs                                  111434984   111456240   111437376   111582628
        Swap Ins                                           0           0           0           0
        Swap Outs                                          0           0           0           0
        Direct pages scanned                           44881       27889       27453       34843
        Kswapd pages scanned                        25841428    25860774    25861233    25843212
        Kswapd pages reclaimed                      25841393    25860741    25861199    25843179
        Direct pages reclaimed                         44881       27889       27453       34843
        Kswapd efficiency                                99%         99%         99%         99%
        Kswapd velocity                            21754.791   21675.460   21696.029   21649.127
        Direct efficiency                               100%        100%        100%        100%
        Direct velocity                               37.783      23.375      23.031      29.188
        Percentage direct scans                           0%          0%          0%          0%
      
      ftrace showed that there was no stalling on writeback or pages submitted
      for IO from reclaim context.
      
      postmark was similar and while it was more interesting, it also did not
      push reclaim heavily.
      
        POSTMARK
                                             3.3.0-vanilla       rc2-vanilla  lumpyremove-v2r3       nosync-v2r3
        Transactions per second:               16.00 ( 0.00%)    20.00 (25.00%)    18.00 (12.50%)    17.00 ( 6.25%)
        Data megabytes read per second:        18.80 ( 0.00%)    24.27 (29.10%)    22.26 (18.40%)    20.54 ( 9.26%)
        Data megabytes written per second:     35.83 ( 0.00%)    46.25 (29.08%)    42.42 (18.39%)    39.14 ( 9.24%)
        Files created alone per second:        28.00 ( 0.00%)    38.00 (35.71%)    34.00 (21.43%)    30.00 ( 7.14%)
        Files create/transact per second:       8.00 ( 0.00%)    10.00 (25.00%)     9.00 (12.50%)     8.00 ( 0.00%)
        Files deleted alone per second:       556.00 ( 0.00%)  1224.00 (120.14%)  3062.00 (450.72%)  6124.00 (1001.44%)
        Files delete/transact per second:       8.00 ( 0.00%)    10.00 (25.00%)     9.00 (12.50%)     8.00 ( 0.00%)
      
        MMTests Statistics: duration
        Sys Time Running Test (seconds)             113.34    107.99    109.73    108.72
        User+Sys Time Running Test (seconds)        145.51    139.81    143.32    143.55
        Total Elapsed Time (seconds)               1159.16    899.23    980.17   1062.27
      
        MMTests Statistics: vmstat
        Page Ins                                    13710192    13729032    13727944    13760136
        Page Outs                                   43071140    42987228    42733684    42931624
        Swap Ins                                           0           0           0           0
        Swap Outs                                          0           0           0           0
        Direct pages scanned                               0           0           0           0
        Kswapd pages scanned                         9941613     9937443     9939085     9929154
        Kswapd pages reclaimed                       9940926     9936751     9938397     9928465
        Direct pages reclaimed                             0           0           0           0
        Kswapd efficiency                                99%         99%         99%         99%
        Kswapd velocity                             8576.567   11051.058   10140.164    9347.109
        Direct efficiency                               100%        100%        100%        100%
        Direct velocity                                0.000       0.000       0.000       0.000
      
      It looks like here that the full series regresses performance but as
      ftrace showed no usage of wait_iff_congested() or sync reclaim I am
      assuming it's a disruption due to monitoring.  Other data such as memory
      usage, page IO, swap IO all looked similar.
      
      Running a benchmark with a plain DD showed nothing very interesting.
      The full series stalled in wait_iff_congested() slightly less but stall
      times on vanilla kernels were marginal.
      
      Running a benchmark that hammered on file-backed mappings showed stalls
      due to congestion but not in sync writebacks
      
        MICRO
                                             3.3.0-vanilla       rc2-vanilla  lumpyremove-v2r3       nosync-v2r3
        MMTests Statistics: duration
        Sys Time Running Test (seconds)             308.13    294.50    298.75    299.53
        User+Sys Time Running Test (seconds)        330.45    316.28    318.93    320.79
        Total Elapsed Time (seconds)               1814.90   1833.88   1821.14   1832.91
      
        MMTests Statistics: vmstat
        Page Ins                                      108712      120708       97224      110344
        Page Outs                                  155514576   156017404   155813676   156193256
        Swap Ins                                           0           0           0           0
        Swap Outs                                          0           0           0           0
        Direct pages scanned                         2599253     1550480     2512822     2414760
        Kswapd pages scanned                        69742364    71150694    68839041    69692533
        Kswapd pages reclaimed                      34824488    34773341    34796602    34799396
        Direct pages reclaimed                         53693       94750       61792       75205
        Kswapd efficiency                                49%         48%         50%         49%
        Kswapd velocity                            38427.662   38797.901   37799.972   38022.889
        Direct efficiency                                 2%          6%          2%          3%
        Direct velocity                             1432.174     845.464    1379.807    1317.446
        Percentage direct scans                           3%          2%          3%          3%
        Page writes by reclaim                             0           0           0           0
        Page writes file                                   0           0           0           0
        Page writes anon                                   0           0           0           0
        Page reclaim immediate                             0           0           0        1218
        Page rescued immediate                             0           0           0           0
        Slabs scanned                                  15360       16384       13312       16384
        Direct inode steals                                0           0           0           0
        Kswapd inode steals                             4340        4327        1630        4323
      
        FTrace Reclaim Statistics: congestion_wait
        Direct number congest     waited                 0          0          0          0
        Direct time   congest     waited               0ms        0ms        0ms        0ms
        Direct full   congest     waited                 0          0          0          0
        Direct number conditional waited               900        870        754        789
        Direct time   conditional waited               0ms        0ms        0ms       20ms
        Direct full   conditional waited                 0          0          0          0
        KSwapd number congest     waited              2106       2308       2116       1915
        KSwapd time   congest     waited          139924ms   157832ms   125652ms   132516ms
        KSwapd full   congest     waited              1346       1530       1202       1278
        KSwapd number conditional waited             12922      16320      10943      14670
        KSwapd time   conditional waited               0ms        0ms        0ms        0ms
        KSwapd full   conditional waited                 0          0          0          0
      
      Reclaim statistics are not radically changed.  The stall times in kswapd
      are massive but it is clear that it is due to calls to congestion_wait()
      and that is almost certainly the call in balance_pgdat().  Otherwise
      stalls due to dirty pages are non-existent.
      
      I ran a benchmark that stressed high-order allocation.  This is a very
      artificial load but it was used in the past to evaluate lumpy reclaim and
      compaction.  Generally I look at allocation success rates and latency
      figures.
      
        STRESS-HIGHALLOC
                         3.3.0-vanilla       rc2-vanilla  lumpyremove-v2r3       nosync-v2r3
        Pass 1          81.00 ( 0.00%)    28.00 (-53.00%)    24.00 (-57.00%)    28.00 (-53.00%)
        Pass 2          82.00 ( 0.00%)    39.00 (-43.00%)    38.00 (-44.00%)    43.00 (-39.00%)
        while Rested    88.00 ( 0.00%)    87.00 (-1.00%)    88.00 ( 0.00%)    88.00 ( 0.00%)
      
        MMTests Statistics: duration
        Sys Time Running Test (seconds)             740.93    681.42    685.14    684.87
        User+Sys Time Running Test (seconds)       2922.65   3269.52   3281.35   3279.44
        Total Elapsed Time (seconds)               1161.73   1152.49   1159.55   1161.44
      
        MMTests Statistics: vmstat
        Page Ins                                     4486020     2807256     2855944     2876244
        Page Outs                                    7261600     7973688     7975320     7986120
        Swap Ins                                       31694           0           0           0
        Swap Outs                                      98179           0           0           0
        Direct pages scanned                           53494       57731       34406      113015
        Kswapd pages scanned                         6271173     1287481     1278174     1219095
        Kswapd pages reclaimed                       2029240     1281025     1260708     1201583
        Direct pages reclaimed                          1468       14564       16649       92456
        Kswapd efficiency                                32%         99%         98%         98%
        Kswapd velocity                             5398.133    1117.130    1102.302    1049.641
        Direct efficiency                                 2%         25%         48%         81%
        Direct velocity                               46.047      50.092      29.672      97.306
        Percentage direct scans                           0%          4%          2%          8%
        Page writes by reclaim                       1616049           0           0           0
        Page writes file                             1517870           0           0           0
        Page writes anon                               98179           0           0           0
        Page reclaim immediate                        103778       27339        9796       17831
        Page rescued immediate                             0           0           0           0
        Slabs scanned                                1096704      986112      980992      998400
        Direct inode steals                              223      215040      216736      247881
        Kswapd inode steals                           175331       61548       68444       63066
        Kswapd skipped wait                            21991           0           1           0
        THP fault alloc                                    1         135         125         134
        THP collapse alloc                               393         311         228         236
        THP splits                                        25          13           7           8
        THP fault fallback                                 0           0           0           0
        THP collapse fail                                  3           5           7           7
        Compaction stalls                                865        1270        1422        1518
        Compaction success                               370         401         353         383
        Compaction failures                              495         869        1069        1135
        Compaction pages moved                        870155     3828868     4036106     4423626
        Compaction move failure                        26429       23865       29742       27514
      
      Success rates are completely hosed for 3.4-rc2, which is almost certainly
      due to commit fe2c2a10 ("vmscan: reclaim at order 0 when compaction
      is enabled").  I expected this would happen for kswapd and impair
      allocation success rates (https://lkml.org/lkml/2012/1/25/166) but I did
      not anticipate this much of a difference: 80% less scanning, 37% less
      reclaim by kswapd.
      
      In comparison, reclaim/compaction is not aggressive and gives up easily
      which is the intended behaviour.  hugetlbfs uses __GFP_REPEAT and would
      be much more aggressive about reclaim/compaction than THP allocations
      are.  The stress test above is allocating like neither THP nor hugetlbfs
      but is much closer to THP.
      
      Mainline is now impaired in terms of high order allocation under heavy
      load although I do not know to what degree as I did not test with
      __GFP_REPEAT.  Keep this in mind for bugs related to hugepage pool
      resizing, THP allocation and high order atomic allocation failures from
      network devices.
      
      In terms of congestion throttling, I see the following for this test
      
        FTrace Reclaim Statistics: congestion_wait
        Direct number congest     waited                 3          0          0          0
        Direct time   congest     waited               0ms        0ms        0ms        0ms
        Direct full   congest     waited                 0          0          0          0
        Direct number conditional waited               957        512       1081       1075
        Direct time   conditional waited               0ms        0ms        0ms        0ms
        Direct full   conditional waited                 0          0          0          0
        KSwapd number congest     waited                36          4          3          5
        KSwapd time   congest     waited            3148ms      400ms      300ms      500ms
        KSwapd full   congest     waited                30          4          3          5
        KSwapd number conditional waited             88514        197        332        542
        KSwapd time   conditional waited            4980ms        0ms        0ms        0ms
        KSwapd full   conditional waited                49          0          0          0
      
      The "conditional waited" times are the most interesting as this is
      directly impacted by the number of dirty pages encountered during scan.
      As lumpy reclaim is no longer scanning contiguous ranges, it is finding
      fewer dirty pages.  This brings wait times from about 5 seconds to 0.
      kswapd itself is still calling congestion_wait() so it'll still stall but
      it's a lot less.
      
      In terms of the type of IO we were doing, I see this
      
        FTrace Reclaim Statistics: mm_vmscan_writepage
        Direct writes anon  sync                         0          0          0          0
        Direct writes anon  async                        0          0          0          0
        Direct writes file  sync                         0          0          0          0
        Direct writes file  async                        0          0          0          0
        Direct writes mixed sync                         0          0          0          0
        Direct writes mixed async                        0          0          0          0
        KSwapd writes anon  sync                         0          0          0          0
        KSwapd writes anon  async                    91682          0          0          0
        KSwapd writes file  sync                         0          0          0          0
        KSwapd writes file  async                   822629          0          0          0
        KSwapd writes mixed sync                         0          0          0          0
        KSwapd writes mixed async                        0          0          0          0
      
      In 3.2, kswapd was doing a bunch of async writes of pages but
      reclaim/compaction was never reaching a point where it was doing sync
      IO.  This does not guarantee that reclaim/compaction was not calling
      wait_on_page_writeback() but I would consider it unlikely.  It indicates
      that merging patches 2 and 3 to stop reclaim/compaction calling
      wait_on_page_writeback() should be safe.
      
      This patch:
      
      Lumpy reclaim had a purpose but in the mind of some, it was to kick the
      system so hard it thrashed.  For others the purpose was to complicate
      vmscan.c.  Over time it was given softer shoes and a nicer attitude but
      memory compaction needs to step up and replace it, so this patch sends
      lumpy reclaim to the farm.
      
      The tracepoint format changes for isolating LRU pages with this patch
      applied.  Furthermore reclaim/compaction can no longer queue dirty pages
      in pageout() if the underlying BDI is congested.  Lumpy reclaim used
      this logic and reclaim/compaction was using it in error.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Rik van Riel <riel@redhat.com>
      Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ying Han <yinghan@google.com>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: remove swap token code · e709ffd6
      Committed by Rik van Riel
      The swap token code no longer fits in with the current VM model.  It
      does not play well with cgroups or the better NUMA placement code in
      development, since we have only one swap token globally.
      
      It also has the potential to mess with scalability of the system, by
      increasing the number of non-reclaimable pages on the active and
      inactive anon LRU lists.
      
      Last but not least, the swap token code has been broken for a year
      without complaints, as reported by Konstantin Khlebnikov.  This suggests
      we no longer have much use for it.
      
      The days of sub-1G memory systems with heavy use of swap are over.  If
      we ever need thrashing reducing code in the future, we will have to
      implement something that does scale.
      Signed-off-by: Rik van Riel <riel@redhat.com>
      Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Hugh Dickins <hughd@google.com>
      Acked-by: Bob Picco <bpicco@meloft.net>
      Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  2. 26 April 2012 (1 commit)
  3. 13 April 2012 (1 commit)
  4. 25 March 2012 (1 commit)
  5. 23 March 2012 (1 commit)
  6. 22 March 2012 (10 commits)
    • mm: forbid lumpy-reclaim in shrink_active_list() · 1480de03
      Committed by Konstantin Khlebnikov
      Reset the reclaim mode in shrink_active_list() to RECLAIM_MODE_SINGLE |
      RECLAIM_MODE_ASYNC.  (The sync/async flag is used only in
      shrink_page_list() and does not affect shrink_active_list().)

      Currently shrink_active_list() sometimes works in lumpy-reclaim mode, if
      RECLAIM_MODE_LUMPYRECLAIM is left over from an earlier
      shrink_inactive_list().  Meanwhile, in age_active_anon()
      sc->reclaim_mode is totally zero.  So the current behavior is too
      complex and confusing, and this looks like a bug.

      In general, shrink_active_list() populates the inactive list for the
      next shrink_inactive_list().  Lumpy shrink_inactive_list() isolates
      pages around the chosen one from both the active and inactive lists.
      So, there is no reason for lumpy isolation in shrink_active_list().

      See also: https://lkml.org/lkml/2012/3/15/583
      Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Proposed-by: Hugh Dickins <hughd@google.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • cpuset: mm: reduce large amounts of memory barrier related damage v3 · cc9a6c87
      Committed by Mel Gorman
      Commit c0ff7453 ("cpuset,mm: fix no node to alloc memory when
      changing cpuset's mems") wins a super prize for the largest number of
      memory barriers entered into fast paths for one commit.
      
      [get|put]_mems_allowed is incredibly heavy with pairs of full memory
      barriers inserted into a number of hot paths.  This was detected while
      investigating at large page allocator slowdown introduced some time
      after 2.6.32.  The largest portion of this overhead was shown by
      oprofile to be at an mfence introduced by this commit into the page
      allocator hot path.
      
      For extra style points, the commit introduced the use of yield() in an
      implementation of what looks like a spinning mutex.
      
      This patch replaces the full memory barriers on both the read and write
      sides with a sequence counter, with just read barriers on the fast path
      side.  This is much cheaper on some architectures, including x86.  The
      main bulk of the patch is the retry logic if the nodemask changes in a
      manner that can cause a false failure (see the sketch below).
      
      While updating the nodemask, a check is made to see if a false failure
      is a risk.  If it is, the sequence number gets bumped and parallel
      allocators will briefly stall while the nodemask update takes place.
      
      In a page fault test microbenchmark, oprofile samples from
      __alloc_pages_nodemask went from 4.53% of all samples to 1.15%.  The
      actual results were
      
                                   3.3.0-rc3          3.3.0-rc3
                                   rc3-vanilla        nobarrier-v2r1
          Clients   1 UserTime       0.07 (  0.00%)   0.08 (-14.19%)
          Clients   2 UserTime       0.07 (  0.00%)   0.07 (  2.72%)
          Clients   4 UserTime       0.08 (  0.00%)   0.07 (  3.29%)
          Clients   1 SysTime        0.70 (  0.00%)   0.65 (  6.65%)
          Clients   2 SysTime        0.85 (  0.00%)   0.82 (  3.65%)
          Clients   4 SysTime        1.41 (  0.00%)   1.41 (  0.32%)
          Clients   1 WallTime       0.77 (  0.00%)   0.74 (  4.19%)
          Clients   2 WallTime       0.47 (  0.00%)   0.45 (  3.73%)
          Clients   4 WallTime       0.38 (  0.00%)   0.37 (  1.58%)
          Clients   1 Flt/sec/cpu  497620.28 (  0.00%) 520294.53 (  4.56%)
          Clients   2 Flt/sec/cpu  414639.05 (  0.00%) 429882.01 (  3.68%)
          Clients   4 Flt/sec/cpu  257959.16 (  0.00%) 258761.48 (  0.31%)
          Clients   1 Flt/sec      495161.39 (  0.00%) 517292.87 (  4.47%)
          Clients   2 Flt/sec      820325.95 (  0.00%) 850289.77 (  3.65%)
          Clients   4 Flt/sec      1020068.93 (  0.00%) 1022674.06 (  0.26%)
          MMTests Statistics: duration
          Sys Time Running Test (seconds)             135.68    132.17
          User+Sys Time Running Test (seconds)         164.2    160.13
          Total Elapsed Time (seconds)                123.46    120.87
      
      The overall improvement is small but the System CPU time is much
      improved and roughly in correlation to what oprofile reported (these
      performance figures are without profiling so skew is expected).  The
      actual number of page faults is noticeably improved.
      
      For benchmarks like kernel builds, the overall benefit is marginal but
      the system CPU time is slightly reduced.
      
      To test the actual bug the commit fixed I opened two terminals.  The
      first ran within a cpuset and continually ran a small program that
      faulted 100M of anonymous data.  In a second window, the nodemask of the
      cpuset was continually randomised in a loop.
      
      Without the commit, the program would fail every so often (usually
      within 10 seconds) and obviously with the commit everything worked fine.
      With this patch applied, it also worked fine so the fix should be
      functionally equivalent.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Cc: Miao Xie <miaox@cn.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Christoph Lameter <cl@linux.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
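      A toy, single-threaded sketch of the read-retry shape this commit
      introduces for mems_allowed.  The real patch uses the kernel's seqcount
      primitives, which supply the memory barriers a concurrent version needs;
      everything below is illustrative.
        #include <stdbool.h>
        #include <stdio.h>

        /* Writers bump the counter to an odd value while updating and to an
         * even value when done; readers retry if it was odd or has moved. */
        static unsigned seq;
        static int mems_allowed[2] = { 0, 1 };   /* stand-in for the nodemask */

        static unsigned read_begin(void)   { return seq; }
        static bool read_retry(unsigned s) { return (s & 1) || seq != s; }

        static void writer_update(int a, int b)
        {
                seq++;                           /* odd: update in progress */
                mems_allowed[0] = a;
                mems_allowed[1] = b;
                seq++;                           /* even: update complete */
        }

        static void reader_snapshot(int out[2])
        {
                unsigned s;

                do {
                        s = read_begin();
                        out[0] = mems_allowed[0];
                        out[1] = mems_allowed[1];
                } while (read_retry(s));         /* retry instead of taking a lock */
        }

        int main(void)
        {
                int snap[2];

                writer_update(2, 3);
                reader_snapshot(snap);
                printf("nodes: %d %d\n", snap[0], snap[1]);
                return 0;
        }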
    • mm/vmscan.c: fix spelling error · c7cfa37b
      Committed by Copot Alexandru
      s/noticable/noticeable/
      Signed-off-by: Copot Alexandru <alex.mihai.c@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • vmscan: handle isolated pages with lru lock released · d563c050
      Committed by Hillf Danton
      When shrinking the inactive lru list, isolated pages are queued on a
      locally private list, so the lock-hold time can be reduced if pages are
      counted without lock protection.

      To achieve that, first, updating the reclaim stat is delayed until the
      putback stage, after reacquiring the lru lock.

      Second, operations related to vm and zone stats are now protected with
      preemption disabled, as they are per-cpu operations.
      Signed-off-by: Hillf Danton <dhillf@gmail.com>
      Acked-by: Hugh Dickins <hughd@google.com>
      Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: vmscan: forcibly scan highmem if there are too many buffer_heads pinning highmem · cc715d99
      Committed by Mel Gorman
      Stuart Foster reported on bugzilla that copying large amounts of data
      from NTFS caused an OOM kill on 32-bit x86 with 16G of memory.  Andrew
      Morton correctly identified that the problem was that NTFS was using
      512-byte blocks, meaning each page had 8 buffer_heads in low memory
      pinning it.
      
      In the past, direct reclaim used to scan highmem even if the allocating
      process did not specify __GFP_HIGHMEM, but not any more.  kswapd will no
      longer reclaim from zones that are above the high watermark.  The intention
      in both cases was to minimise unnecessary reclaim.  The downside is on
      machines with large amounts of highmem that lowmem can be fully consumed
      by buffer_heads with nothing trying to free them.
      
      The following patch is based on a suggestion by Andrew Morton to extend
      the buffer_heads_over_limit case to force kswapd and direct reclaim to
      scan the highmem zone regardless of the allocation request or watermarks
      (see the sketch below).
      
      Addresses https://bugzilla.kernel.org/show_bug.cgi?id=42578
      
      [hughd@google.com: move buffer_heads_over_limit check up]
      [akpm@linux-foundation.org: buffer_heads_over_limit is unlikely]
      Reported-by: Stuart Foster <smf.linux@ntlworld.com>
      Tested-by: Stuart Foster <smf.linux@ntlworld.com>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: stable <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
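      A sketch of the behaviour described above, under the assumption that it
      can be expressed as widening the reclaim gfp mask: when buffer_heads are
      over the limit, scan highmem regardless of what the allocation asked for,
      so the pages pinning lowmem can be freed.  Flag values and the helper are
      illustrative, not the kernel's.
        #include <stdbool.h>
        #include <stdio.h>

        #define GFP_KERNEL      0x0u
        #define __GFP_HIGHMEM   0x1u   /* illustrative flag values only */

        /* Widen the reclaim scan to highmem when lowmem is full of buffer_heads. */
        static unsigned reclaim_gfp_mask(unsigned alloc_gfp, bool buffer_heads_over_limit)
        {
                if (buffer_heads_over_limit)
                        return alloc_gfp | __GFP_HIGHMEM;
                return alloc_gfp;
        }

        int main(void)
        {
                printf("normal: %#x, over limit: %#x\n",
                       reclaim_gfp_mask(GFP_KERNEL, false),
                       reclaim_gfp_mask(GFP_KERNEL, true));
                return 0;
        }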
    • vmscan: only defer compaction for failed order and higher · aff62249
      Committed by Rik van Riel
      Currently a failed order-9 (transparent hugepage) compaction can lead to
      memory compaction being temporarily disabled for a memory zone, even if
      we only need compaction for an order-2 allocation, e.g. for jumbo frame
      networking.

      The fix is relatively straightforward: keep track of the highest order at
      which compaction is succeeding, and only defer compaction for orders at
      which compaction is failing (see the sketch below).
      Signed-off-by: Rik van Riel <riel@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Acked-by: Mel Gorman <mel@csn.ul.ie>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Hillf Danton <dhillf@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
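      An illustrative sketch of the deferral rule described above: remember the
      lowest order at which compaction recently failed and defer only requests
      at or above it; a success raises the threshold again.  Names and the
      exact update rules are simplifications, not the kernel's implementation.
        #include <stdbool.h>
        #include <stdio.h>

        struct zone_compact {
                int compact_order_failed;        /* lowest failing order so far */
        };

        /* Defer only allocations at or above the order that has been failing. */
        static bool compaction_deferred(const struct zone_compact *z, int order)
        {
                return order >= z->compact_order_failed;
        }

        static void compaction_failed(struct zone_compact *z, int order)
        {
                if (order < z->compact_order_failed)
                        z->compact_order_failed = order;
        }

        static void compaction_succeeded(struct zone_compact *z, int order)
        {
                if (order >= z->compact_order_failed)
                        z->compact_order_failed = order + 1;
        }

        int main(void)
        {
                struct zone_compact z = { .compact_order_failed = 11 };

                compaction_failed(&z, 9);        /* a THP-sized attempt failed */
                printf("defer order-9: %d, defer order-2: %d\n",
                       compaction_deferred(&z, 9), compaction_deferred(&z, 2));
                compaction_succeeded(&z, 9);     /* later success lifts the deferral */
                printf("after success, defer order-9: %d\n", compaction_deferred(&z, 9));
                return 0;
        }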
    • vmscan: kswapd carefully call compaction · 7be62de9
      Committed by Rik van Riel
      With CONFIG_COMPACTION enabled, kswapd does not try to free contiguous
      free pages, even when it is woken for a higher order request.
      
      This could be bad for, e.g., jumbo frame network allocations, which are
      done from interrupt context and cannot compact memory themselves.
      Higher-than-before allocation failure rates in the network receive path
      have been observed in kernels with compaction enabled.
      
      Teach kswapd to defragment the memory zones in a node, but only if
      required and compaction is not deferred in a zone.
      
      [akpm@linux-foundation.org: reduce scope of zones_need_compaction]
      Signed-off-by: Rik van Riel <riel@redhat.com>
      Acked-by: Mel Gorman <mel@csn.ul.ie>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Hillf Danton <dhillf@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • vmscan: reclaim at order 0 when compaction is enabled · fe2c2a10
      Committed by Rik van Riel
      When built with CONFIG_COMPACTION, kswapd should not try to free
      contiguous pages, because it is not trying hard enough to have a real
      chance at being successful, but still disrupts the LRU enough to break
      other things.
      
      Do not do higher order page isolation unless we really are in lumpy
      reclaim mode.
      
      Stop reclaiming pages once we have enough free pages that compaction can
      deal with things, and we hit the normal order 0 watermarks used by kswapd.
      
      Also remove a line of code that increments balanced right before exiting
      the function.
      Signed-off-by: Rik van Riel <riel@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Acked-by: Mel Gorman <mel@csn.ul.ie>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Hillf Danton <dhillf@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: vmscan: fix misused nr_reclaimed in shrink_mem_cgroup_zone() · c38446cc
      Committed by Hillf Danton
      The value of nr_reclaimed is the number of pages reclaimed in the current
      round of the loop, whereas nr_to_reclaim should be compared with the
      number of pages reclaimed in all rounds.

      In each round of the loop, reclaimed pages are counted against the reclaim
      goal, and the loop stops once the goal is achieved (see the sketch below).
      Signed-off-by: Hillf Danton <dhillf@gmail.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
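      A minimal sketch of the fixed accounting: the loop compares the reclaim
      target against a running total over all rounds, not the pages reclaimed
      in the latest round.  The per-round shrink step is a stand-in.
        #include <stdio.h>

        /* Stand-in for one round of shrinking: pretend a few pages get freed. */
        static unsigned long shrink_one_round(int round)
        {
                return 8 + (unsigned long)round;
        }

        static unsigned long shrink_until_goal(unsigned long nr_to_reclaim)
        {
                unsigned long nr_reclaimed = 0;  /* running total, not per-round */
                int round = 0;

                do {
                        nr_reclaimed += shrink_one_round(round++);
                } while (nr_reclaimed < nr_to_reclaim);

                return nr_reclaimed;
        }

        int main(void)
        {
                printf("reclaimed %lu pages\n", shrink_until_goal(32));
                return 0;
        }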
    • mm/vmscan.c: cleanup with s/reclaim_mode/isolate_mode/ · 61317289
      Committed by Hillf Danton
      With tons of reclaim_mode (defined as one field of struct scan_control)
      already in the file, it is clearer to rename the local reclaim_mode when
      setting up the isolation mode.
      Signed-off-by: Hillf Danton <dhillf@gmail.com>
      Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>