1. 01 Nov, 2011 40 commits
    • R
      vmscan: limit direct reclaim for higher order allocations · e0887c19
      Rik van Riel committed
      When suffering from memory fragmentation due to unfreeable pages, THP page
      faults will repeatedly try to compact memory.  Due to the unfreeable
      pages, compaction fails.
      
      Needless to say, at that point page reclaim also fails to create free
      contiguous 2MB areas.  However, that doesn't stop the current code from
      trying, over and over again, and freeing a minimum of 4MB (2UL <<
      sc->order pages) at every single invocation.
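
      As a worked check of the 4MB figure above, here is a small user-space
      sketch.  It assumes 4KB base pages and order 9 for a 2MB THP allocation;
      the helper names are hypothetical and are not kernel code.

      ```c
      /* Assumption: 4KB base pages, as on x86. */
      #define SKETCH_PAGE_SIZE 4096UL

      /* Pages the old code tried to free per invocation: 2UL << order. */
      static unsigned long reclaim_target_pages(unsigned int order)
      {
      	return 2UL << order;
      }

      /* The same reclaim target expressed in bytes. */
      static unsigned long reclaim_target_bytes(unsigned int order)
      {
      	return reclaim_target_pages(order) * SKETCH_PAGE_SIZE;
      }
      ```

      With order 9 this gives 1024 pages, which is 4MB at 4KB per page.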
      
      This resulted in my 12GB system having 2-3GB free memory, a corresponding
      amount of used swap and very sluggish response times.
      
      This can be avoided by having the direct reclaim code not reclaim from
      zones that already have plenty of free memory available for compaction.
      
      If compaction still fails due to unmovable memory, doing additional
      reclaim will only hurt the system, not help.
      
      [jweiner@redhat.com: change comment to explain the order check]
      Signed-off-by: Rik van Riel <riel@redhat.com>
      Acked-by: Johannes Weiner <jweiner@redhat.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
      Signed-off-by: Johannes Weiner <jweiner@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e0887c19
    • M
      vmscan: add barrier to prevent evictable page in unevictable list · 21ee9f39
      Minchan Kim committed
      When a race between putback_lru_page() and shmem_lock with lock=0 happens,
      program execution order is as follows, but clear_bit in processor #1 could
      be reordered right before spin_unlock of processor #1.  Then, the page
      would be stranded on the unevictable list.
      
      spin_lock
      SetPageLRU
      spin_unlock
                                      clear_bit(AS_UNEVICTABLE)
                                      spin_lock
                                      if PageLRU()
                                              if !test_bit(AS_UNEVICTABLE)
                                              	move evictable list
      smp_mb
      if !test_bit(AS_UNEVICTABLE)
              move evictable list
                                      spin_unlock
      
      But, pagevec_lookup() in scan_mapping_unevictable_pages() has
      rcu_read_[un]lock() so it could protect reordering before reaching
      test_bit(AS_UNEVICTABLE) on processor #1 so this problem never happens.
      But it's an unexpected side effect and we should solve this problem
      properly.
      
      This patch adds a barrier after mapping_clear_unevictable.
      
      I did not hit this problem in practice; I found it during review.
      Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
      Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Acked-by: Johannes Weiner <jweiner@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      21ee9f39
    • H
      mm/huge_memory.c: quiet sparse noise · 2f1da642
      H Hartley Sweeten committed
      Quiet the sparse noise:
      
      warning: symbol 'khugepaged_scan' was not declared. Should it be static?
      warning: context imbalance in 'khugepaged_scan_mm_slot' - unexpected unlock
      Signed-off-by: H Hartley Sweeten <hsweeten@visionengravers.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <jweiner@redhat.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2f1da642
    • H
      mm/mempolicy.c: quiet sparse noise · e754d79d
      H Hartley Sweeten committed
      Quiet the sparse noise:
      
      warning: symbol 'default_policy' was not declared. Should it be static?
      Signed-off-by: H Hartley Sweeten <hsweeten@visionengravers.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Stephen Wilson <wilsons@start.ca>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e754d79d
    • H
      mm/thrash.c: quiet sparse noise · 22d5368a
      H Hartley Sweeten committed
      Quiet the following sparse noise:
      
      warning: symbol 'swap_token_memcg' was not declared. Should it be static?
      Signed-off-by: H Hartley Sweeten <hsweeten@visionengravers.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      22d5368a
    • H
      mm/memblock.c: quiet sparse noise · 2d7d3eb2
      H Hartley Sweeten committed
      Quiet the following sparse noise in this file:
      
      warning: symbol 'memblock_overlaps_region' was not declared. Should it be static?
      Signed-off-by: H Hartley Sweeten <hsweeten@visionengravers.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: "H. Peter Anvin" <hpa@linux.intel.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Tomi Valkeinen <tomi.valkeinen@nokia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2d7d3eb2
    • J
      mm: disable user interface to manually rescue unevictable pages · 264e56d8
      Johannes Weiner committed
      At one point, anonymous pages were supposed to go on the unevictable list
      when no swap space was configured, and the idea was to manually rescue
      those pages after adding swap and making them evictable again.  But
      nowadays, swap-backed pages on the anon LRU list are not scanned without
      available swap space anyway, so there is no point in moving them to a
      separate list anymore.
      
      The manual rescue could also be used in case pages were stranded on the
      unevictable list due to race conditions.  But the code has been around for
      a while now and newly discovered bugs should be properly reported and
      dealt with instead of relying on such a manual fixup.
      
      In addition to the lack of a usecase, the sysfs interface to rescue pages
      from a specific NUMA node has been broken since its introduction, so it's
      unlikely that anybody ever relied on that.
      
      This patch removes the functionality behind the sysctl and the
      node-interface and emits a one-time warning when somebody tries to access
      either of them.
      Signed-off-by: Johannes Weiner <jweiner@redhat.com>
      Reported-by: Kautuk Consul <consul.kautuk@gmail.com>
      Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
      Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      264e56d8
    • K
      vmscan.c: fix invalid strict_strtoul() check in write_scan_unevictable_node() · 3f380998
      Kautuk Consul committed
      write_scan_unevictable_node() checks the value req returned by
      strict_strtoul() and returns 1 if req is 0.
      
      However, when strict_strtoul() returns 0, it means successful conversion
      of buf to unsigned long.
      
      Due to this, the function did not proceed to scan the zones for
      unevictable pages even when a valid value was written to the
      scan_unevictable_pages sysfs file.
      
      Change this check slightly to check for invalid value in buf as well as 0
      value stored in res after successful conversion via strict_strtoul.  In
      both cases, we do not perform the scanning of this node's zones.
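
      The corrected check can be sketched in user space as follows.  This is
      only an illustration: sketch_strict_strtoul() is a hypothetical stand-in
      built on strtoul() that mimics the "0 means success" convention, and
      should_scan() is an invented name, not the kernel function.

      ```c
      #include <errno.h>
      #include <stdlib.h>

      /*
       * Hypothetical stand-in for the kernel's strict_strtoul():
       * returns 0 on success and a negative errno value on failure.
       */
      static int sketch_strict_strtoul(const char *buf, int base,
      				 unsigned long *res)
      {
      	char *end;

      	errno = 0;
      	*res = strtoul(buf, &end, base);
      	if (errno != 0 || end == buf)
      		return -EINVAL;
      	/* Allow one trailing newline, as sysfs writes usually carry one. */
      	if (*end == '\n')
      		end++;
      	return *end == '\0' ? 0 : -EINVAL;
      }

      /* Returns 1 if a scan should run, 0 if the write is a no-op. */
      static int should_scan(const char *buf)
      {
      	unsigned long res;
      	int err = sketch_strict_strtoul(buf, 10, &res);

      	/* Skip on a conversion error or on a successfully parsed zero. */
      	if (err || res == 0)
      		return 0;
      	return 1;
      }
      ```

      The point of the sketch is that the return code and the parsed value
      are tested separately: a zero return code alone means "parsed fine".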
      Signed-off-by: Kautuk Consul <consul.kautuk@gmail.com>
      Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Johannes Weiner <jweiner@redhat.com>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3f380998
    • L
      mm: fix kunmap_high() comment · 4e9dc5df
      Li Haifeng committed
      Signed-off-by: Li Haifeng <omycle@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4e9dc5df
    • K
      mm: compaction: make compact_zone_order() static · d43a87e6
      Kyungmin Park committed
      There's no compact_zone_order() user outside file scope, so make it static.
      Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d43a87e6
    • D
      HWPOISON: convert pr_debug()s to pr_info()s · dd73e85f
      Dean Nelson committed
      Commit fb46e735 ("HWPOISON: Convert pr_debugs to pr_info") authored
      by Andi Kleen converted a number of pr_debug()s to pr_info()s.
      
      About the same time additional code with pr_debug()s was added by two
      other commits 8c6c2ecb ("HWPOSION, hugetlb: recover from free hugepage
      error when !MF_COUNT_INCREASED") and d950b958 ("HWPOISON, hugetlb:
      soft offlining for hugepage").  And these pr_debug()s failed to get
      converted to pr_info()s.
      
      This patch converts them as well.  And does some minor related whitespace
      cleanup.
      Signed-off-by: Dean Nelson <dnelson@redhat.com>
      Reviewed-by: Andi Kleen <ak@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      dd73e85f
    • T
      fs/buffer.c: add device information for error output in __find_get_block_slow() · 72a2ebd8
      Tao Ma committed
      On the ext4 mailing list[1], we got some reports about errors in
      __find_get_block_slow(), but the information is very limited.
      
      If the device information is given, we can know the name of the sick
      volume.  Furthermore, we can get the corresponding status of that
      block (group, inode block, etc.) by analyzing the disk layout.
      
      [1] http://marc.info/?l=linux-ext4&m=131379831421147&w=2
      Signed-off-by: Tao Ma <boyu.mt@taobao.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      72a2ebd8
    • K
      mm/mmap.c: eliminate the ret variable from mm_take_all_locks() · 584cff54
      Kautuk Consul committed
      The ret variable is really not needed in mm_take_all_locks().
      Signed-off-by: Kautuk Consul <consul.kautuk@gmail.com>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      584cff54
    • M
      vmscan: fix shrinker callback bug in fs/super.c · 09f363c7
      Mikulas Patocka committed
      The callback must not return -1 when nr_to_scan is zero. Fix the bug in
      fs/super.c and add this requirement to the callback specification.
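
      A minimal user-space sketch of the callback rule, under stated
      assumptions: the struct and function names below are hypothetical
      simplifications, and the may_enter_fs flag is an assumed example of a
      condition under which a callback would refuse to scan.

      ```c
      /* Hypothetical, simplified stand-ins for the kernel's shrinker types. */
      struct sketch_shrink_control {
      	unsigned long nr_to_scan;
      	int may_enter_fs;	/* assumed "scanning is allowed" condition */
      };

      struct sketch_cache {
      	int nr_objects;
      };

      static int sketch_shrink(struct sketch_cache *cache,
      			 const struct sketch_shrink_control *sc)
      {
      	/* A zero nr_to_scan is only a query: always report the count. */
      	if (sc->nr_to_scan == 0)
      		return cache->nr_objects;

      	/* -1 is reserved for "cannot scan right now". */
      	if (!sc->may_enter_fs)
      		return -1;

      	/* Free up to nr_to_scan objects and report what remains. */
      	if (sc->nr_to_scan >= (unsigned long)cache->nr_objects)
      		cache->nr_objects = 0;
      	else
      		cache->nr_objects -= (int)sc->nr_to_scan;
      	return cache->nr_objects;
      }
      ```

      Ordering the query check first is what keeps -1 from leaking out when
      nr_to_scan is zero.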
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      09f363c7
    • A
      mm-add-comment-explaining-task-state-setting-in-bdi_forker_thread-fix · 20c8c628
      Andrew Morton committed
      fiddle wording
      
      Cc: Jan Kara <jack@suse.cz>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      20c8c628
    • W
      ksm: fix the comment of try_to_unmap_one() · 99ef0315
      Wanlong Gao committed
      try_to_unmap_one() is called by try_to_unmap_ksm(), too.
      Signed-off-by: Wanlong Gao <gaowanlong@cn.fujitsu.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      99ef0315
    • J
      mm/vmalloc.c: report more vmalloc failures · de7d2b56
      Joe Perches committed
      Some vmalloc failure paths do not report OOM conditions.
      
      Add warn_alloc_failed, which also does a dump_stack, to those failure
      paths.
      
      This allows more site specific vmalloc failure logging message printks to
      be removed.
      Signed-off-by: Joe Perches <joe@perches.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      de7d2b56
    • A
      kswapd: assign new_order and new_classzone_idx after wakeup in sleeping · f0dfcde0
      Alex,Shi committed
      There are 2 places where kswapd reads pgdat.  One is on return from a
      successful balance, the other is on wakeup from kswapd sleeping.  The
      new_order and new_classzone_idx represent the balance input order and
      classzone_idx.
      
      But currently new_order and new_classzone_idx are not assigned after
      kswapd_try_to_sleep(), which causes a bug in the following scenario.
      
      1: after a successful balance, kswapd goes to sleep, and new_order = 0;
         new_classzone_idx = __MAX_NR_ZONES - 1;
      
      2: kswapd is woken up with order = 3 and classzone_idx = ZONE_NORMAL
      
      3: while balance_pgdat() is running, a new balance wakeup happens with
         order = 5 and classzone_idx = ZONE_NORMAL
      
      4: the first wakeup (order = 3) finishes successfully and returns
         order = 3, but new_order is still 0, so this balancing is treated as
         a failed balance.  The second, tighter balancing is then missed.
      
      So, to avoid the above problem, the new_order and new_classzone_idx need
      to be assigned for later successful comparison.
      Signed-off-by: Alex Shi <alex.shi@intel.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
      Tested-by: Pádraig Brady <P@draigBrady.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f0dfcde0
    • J
      mm/memblock.c: small function definition fixes · d1f0ece6
      Jonghwan Choi committed
      warning: function 'memblock_memory_can_coalesce'
      with external linkage has definition.
      Signed-off-by: Jonghwan Choi <jhbird.choi@samsung.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d1f0ece6
    • A
      kswapd: avoid unnecessary rebalance after an unsuccessful balancing · d2ebd0f6
      Alex,Shi committed
      In commit 215ddd66 ("mm: vmscan: only read new_classzone_idx from pgdat
      when reclaiming successfully"), Mel Gorman said kswapd should sleep
      after an unsuccessful balancing if a tighter reclaim request is pending
      during the balancing.  But in the following scenario, kswapd does
      something that does not match this expectation.  The patch fixes this
      issue.
      
      1, Read pgdat request A (classzone_idx, order = 3)
      2, balance_pgdat()
      3, During pgdat, a new pgdat request B (classzone_idx, order = 5) is placed
      4, balance_pgdat() returns but failed since returned order = 0
      5, pgdat of request A is assigned to balance_pgdat(), which balances
         again, whereas the expected behavior is for kswapd to try to sleep.
      Signed-off-by: Alex Shi <alex.shi@intel.com>
      Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Tested-by: Pádraig Brady <P@draigBrady.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d2ebd0f6
    • A
      debug-pagealloc: add support for highmem pages · 64212ec5
      Akinobu Mita committed
      This adds support for highmem pages poisoning and verification to the
      debug-pagealloc feature for no-architecture support.
      
      [akpm@linux-foundation.org: remove unneeded preempt_disable/enable]
      Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      64212ec5
    • J
      mm: neaten warn_alloc_failed · 3ee9a4f0
      Joe Perches committed
      Add __attribute__((format (printf...) to the function to validate format
      and arguments.  Use vsprintf extension %pV to avoid any possible message
      interleaving.  Coalesce format string.  Convert printks/pr_warning to
      pr_warn.
      
      [akpm@linux-foundation.org: use the __printf() macro]
      Signed-off-by: Joe Perches <joe@perches.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3ee9a4f0
    • S
      include/asm-generic/page.h: calculate virt_to_page and page_to_virt via predefined macro · 06d5e032
      Sonic Zhang committed
      On NOMMU architectures, if physical memory doesn't start from 0,
      ARCH_PFN_OFFSET is defined to generate page index in mem_map array.
      Because virtual address is equal to physical address, PAGE_OFFSET is
      always 0.  virt_to_page and page_to_virt should not index page by
      PAGE_OFFSET directly.
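
      The indexing described above can be modelled in user space.  This is a
      sketch under assumptions: 4KB pages, RAM starting at physical address
      0x20000000 (so the PFN offset is 0x20000), and a 16-entry mem_map; all
      names carry a sketch_ prefix because they are hypothetical.

      ```c
      /* Assumptions: 4KB pages, RAM starting at 0x20000000. */
      #define SKETCH_PAGE_SHIFT	12
      #define SKETCH_ARCH_PFN_OFFSET	0x20000UL

      struct sketch_page {
      	int flags;
      };

      /* mem_map[0] describes the first page of RAM, not physical page 0. */
      static struct sketch_page sketch_mem_map[16];

      /* On NOMMU, virtual == physical, so the pfn comes from the address. */
      static struct sketch_page *sketch_virt_to_page(unsigned long addr)
      {
      	unsigned long pfn = addr >> SKETCH_PAGE_SHIFT;

      	return sketch_mem_map + (pfn - SKETCH_ARCH_PFN_OFFSET);
      }

      static unsigned long sketch_page_to_virt(const struct sketch_page *page)
      {
      	unsigned long pfn = (unsigned long)(page - sketch_mem_map) +
      			    SKETCH_ARCH_PFN_OFFSET;

      	return pfn << SKETCH_PAGE_SHIFT;
      }
      ```

      Going through the page frame number keeps the mem_map index correct
      even though PAGE_OFFSET is 0 and RAM does not start at 0.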
      Signed-off-by: Sonic Zhang <sonic.zhang@analog.com>
      Cc: Greg Ungerer <gerg@snapgear.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: David Howells <dhowells@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      06d5e032
    • A
      thp: mremap support and TLB optimization · 37a1c49a
      Andrea Arcangeli committed
      This adds THP support to mremap (decreases the number of split_huge_page()
      calls).
      
      Here are also some benchmarks with a proggy like this:
      
      ===
      #define _GNU_SOURCE
      #include <sys/mman.h>
      #include <stdlib.h>
      #include <stdio.h>
      #include <string.h>
      #include <sys/time.h>
      
      #define SIZE (5UL*1024*1024*1024)
      
      int main()
      {
      	static struct timeval oldstamp, newstamp;
      	long diffsec;
      	char *p, *p2, *p3, *p4;
      	if (posix_memalign((void **)&p, 2*1024*1024, SIZE))
      		perror("memalign"), exit(1);
      	if (posix_memalign((void **)&p2, 2*1024*1024, SIZE))
      		perror("memalign"), exit(1);
      	if (posix_memalign((void **)&p3, 2*1024*1024, 4096))
      		perror("memalign"), exit(1);
      
      	memset(p, 0xff, SIZE);
      	memset(p2, 0xff, SIZE);
      	memset(p3, 0x77, 4096);
      	gettimeofday(&oldstamp, NULL);
      	p4 = mremap(p, SIZE, SIZE, MREMAP_FIXED|MREMAP_MAYMOVE, p3);
      	gettimeofday(&newstamp, NULL);
      	diffsec = newstamp.tv_sec - oldstamp.tv_sec;
      	diffsec = newstamp.tv_usec - oldstamp.tv_usec + 1000000 * diffsec;
      	printf("usec %ld\n", diffsec);
      	if (p4 == MAP_FAILED || p4 != p3)
      	//if (p == MAP_FAILED)
      		perror("mremap"), exit(1);
      	if (memcmp(p4, p2, SIZE))
      		printf("mremap bug\n"), exit(1);
      	printf("ok\n");
      
      	return 0;
      }
      ===
      
      THP on
      
       Performance counter stats for './largepage13' (3 runs):
      
                69195836 dTLB-loads                 ( +-   3.546% )  (scaled from 50.30%)
                   60708 dTLB-load-misses           ( +-  11.776% )  (scaled from 52.62%)
               676266476 dTLB-stores                ( +-   5.654% )  (scaled from 69.54%)
                   29856 dTLB-store-misses          ( +-   4.081% )  (scaled from 89.22%)
              1055848782 iTLB-loads                 ( +-   4.526% )  (scaled from 80.18%)
                    8689 iTLB-load-misses           ( +-   2.987% )  (scaled from 58.20%)
      
              7.314454164  seconds time elapsed   ( +-   0.023% )
      
      THP off
      
       Performance counter stats for './largepage13' (3 runs):
      
              1967379311 dTLB-loads                 ( +-   0.506% )  (scaled from 60.59%)
                 9238687 dTLB-load-misses           ( +-  22.547% )  (scaled from 61.87%)
              2014239444 dTLB-stores                ( +-   0.692% )  (scaled from 60.40%)
                 3312335 dTLB-store-misses          ( +-   7.304% )  (scaled from 67.60%)
              6764372065 iTLB-loads                 ( +-   0.925% )  (scaled from 79.00%)
                    8202 iTLB-load-misses           ( +-   0.475% )  (scaled from 70.55%)
      
              9.693655243  seconds time elapsed   ( +-   0.069% )
      
      grep thp /proc/vmstat
      thp_fault_alloc 35849
      thp_fault_fallback 0
      thp_collapse_alloc 3
      thp_collapse_alloc_failed 0
      thp_split 0
      
      thp_split 0 confirms no thp split despite plenty of hugepages allocated.
      
      The measurement of only the mremap time (so excluding the 3 long
      memset and final long 10GB memory accessing memcmp):
      
      THP on
      
      usec 14824
      usec 14862
      usec 14859
      
      THP off
      
      usec 256416
      usec 255981
      usec 255847
      
      With an older kernel without the mremap optimizations (the below patch
      optimizes the non THP version too).
      
      THP on
      
      usec 392107
      usec 390237
      usec 404124
      
      THP off
      
      usec 444294
      usec 445237
      usec 445820
      
      I guess with a threaded program that sends more IPI on large SMP it'd
      create an even larger difference.
      
      All debug options are off except DEBUG_VM to avoid skewing the
      results.
      
      The only limitation of the native 2M mremap shown above is that both the
      source and destination addresses must be 2M aligned, or the hugepmd
      can't be moved without a split; that is a hardware limitation.
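
      The alignment condition can be written as a small check.  This is a
      user-space sketch with hypothetical names, assuming a 2MB huge page
      size; it is not the kernel's own test.

      ```c
      /* Assumption: 2MB huge pages, as in the benchmark above. */
      #define SKETCH_HPAGE_PMD_SIZE	(2UL * 1024 * 1024)
      #define SKETCH_HPAGE_PMD_MASK	(~(SKETCH_HPAGE_PMD_SIZE - 1))

      /* Nonzero if both addresses sit on a 2MB boundary. */
      static int can_move_huge_pmd(unsigned long old_addr,
      			     unsigned long new_addr)
      {
      	return !(old_addr & ~SKETCH_HPAGE_PMD_MASK) &&
      	       !(new_addr & ~SKETCH_HPAGE_PMD_MASK);
      }
      ```

      If either address has any of its low 21 bits set, the huge mapping
      would have to be split before it could be moved.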
      
      [akpm@linux-foundation.org: coding-style nitpicking]
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Acked-by: Johannes Weiner <jweiner@redhat.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      37a1c49a
    • A
      mremap: avoid sending one IPI per page · 7b6efc2b
      Andrea Arcangeli committed
      This replaces ptep_clear_flush() with ptep_get_and_clear() and a single
      flush_tlb_range() at the end of the loop, to avoid sending one IPI for
      each page.
      
      The mmu_notifier_invalidate_range_start/end section is enlarged
      accordingly but this is not going to fundamentally change things.  It was
      more by accident that the region under mremap was for the most part still
      available for secondary MMUs: the primary MMU was never allowed to
      reliably access that region for the duration of the mremap (modulo
      trapping SIGSEGV on the old address range which sounds impractical and
      flaky).  If users want secondary MMUs not to lose access to a large
      region under mremap they should reduce the mremap size accordingly in
      userland and run multiple calls.  Overall this will run faster so it's
      actually going to reduce the time the region is under mremap for the
      primary MMU which should provide a net benefit to apps.
      
      For KVM this is a noop because the guest physical memory is never
      mremapped; there's just no point in ever moving it while the guest runs.
      One target of this optimization is JVM GC (so unrelated to the mmu
      notifier logic).
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Acked-by: Johannes Weiner <jweiner@redhat.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7b6efc2b
    • A
      mremap: check for overflow using deltas · ebed4846
      Andrea Arcangeli committed
      Using "- 1" relies on the old_end to be page aligned and PAGE_SIZE > 1,
      those are reasonable requirements but the check remains obscure and it
      looks more like an off by one error than an overflow check.  This I feel
      will improve readability.
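
      One way to phrase such a delta-based check is sketched below.  It is a
      user-space illustration with hypothetical names, assuming a 2MB PMD
      span; the commit message does not show the actual code, so this should
      not be read as the patch itself.

      ```c
      /* Assumption for the sketch: a 2MB PMD span. */
      #define SKETCH_PMD_SIZE	(1UL << 21)
      #define SKETCH_PMD_MASK	(~(SKETCH_PMD_SIZE - 1))

      /*
       * Bytes to move in one step: up to the next PMD boundary, capped at
       * old_end.  Comparing distances from old_addr stays correct even if
       * "next" wraps around to 0 at the top of the address space.
       */
      static unsigned long step_extent(unsigned long old_addr,
      				 unsigned long old_end)
      {
      	unsigned long next = (old_addr + SKETCH_PMD_SIZE) & SKETCH_PMD_MASK;
      	unsigned long extent = next - old_addr;

      	if (extent > old_end - old_addr)
      		extent = old_end - old_addr;
      	return extent;
      }
      ```

      Because both sides of the comparison are distances rather than
      absolute addresses, no "- 1" adjustment is needed to survive the
      wraparound case.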
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Acked-by: Johannes Weiner <jweiner@redhat.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ebed4846
    • S
      memblock: add NO_BOOTMEM config symbol · 66616720
      Sam Ravnborg committed
      With the NO_BOOTMEM symbol added architectures may now use the following
      syntax to tell that they do not need bootmem:
      
      	select NO_BOOTMEM
      
      This is much more convenient than adding a new kconfig symbol which was
      otherwise required.
      
      Adding this symbol does not conflict with the architectures that already
      define their own symbol.
      Signed-off-by: Sam Ravnborg <sam@ravnborg.org>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Acked-by: Tejun Heo <tj@kernel.org>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      66616720
    • S
      memblock: add memblock_start_of_DRAM() · 0a93ebef
      Sam Ravnborg committed
      SPARC32 requires access to the start address.  Add a new helper
      memblock_start_of_DRAM() to give access to the address of the first
      memblock - which contains the lowest address.
      
      The awkward name was chosen to match the already present
      memblock_end_of_DRAM().
      Signed-off-by: Sam Ravnborg <sam@ravnborg.org>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Acked-by: Tejun Heo <tj@kernel.org>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0a93ebef
    • M
      mm: avoid null pointer access in vm_struct via /proc/vmallocinfo · f5252e00
      Mitsuo Hayasaka committed
      The /proc/vmallocinfo shows information about vmalloc allocations in
      vmlist, which is a linked list of vm_struct.  It may, however, access the
      pages field of a vm_struct for which no page has been allocated yet.
      This results in a null pointer access and leads to a kernel panic.
      
      Why this happens: In __vmalloc_node_range() called from vmalloc(), newly
      allocated vm_struct is added to vmlist at __get_vm_area_node() and then,
      some fields of vm_struct such as nr_pages and pages are set at
      __vmalloc_area_node().  In other words, it is added to vmlist before it is
      fully initialized.  At the same time, when the /proc/vmallocinfo is read,
      it accesses the pages field of vm_struct according to the nr_pages field
      at show_numa_info().  Thus, a null pointer access happens.
      
      The patch adds the newly allocated vm_struct to the vmlist *after* it is
      fully initialized.  So, it can avoid accessing the pages field with
      unallocated page when show_numa_info() is called.
      Signed-off-by: Mitsuo Hayasaka <mitsuo.hayasaka.hu@hitachi.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Namhyung Kim <namhyung@gmail.com>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
      Cc: <stable@kernel.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f5252e00
    • A
      mm/debug-pagealloc.c: use memchr_inv · 8c5fb8ea
      Akinobu Mita committed
      Use newly introduced memchr_inv() for page verification.
      Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8c5fb8ea
    • A
      lib/string.c: introduce memchr_inv() · 79824820
      Akinobu Mita committed
      memchr_inv() is mainly used to check whether the whole buffer is filled
      with just a specified byte.
      
      The function name and prototype are stolen from logfs and the
      implementation is from SLUB.
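
      To illustrate the behaviour described above, here is a deliberately
      simple byte-at-a-time sketch.  It follows the memchr_inv() prototype but
      is not the SLUB-derived implementation; the sketch_ prefix marks it as
      a hypothetical user-space stand-in.

      ```c
      #include <stddef.h>

      /*
       * Return a pointer to the first byte in [start, start + bytes) that
       * differs from c, or NULL if every byte equals c.
       */
      static void *sketch_memchr_inv(const void *start, int c, size_t bytes)
      {
      	const unsigned char *p = start;
      	unsigned char value = (unsigned char)c;

      	while (bytes--) {
      		if (*p != value)
      			return (void *)p;
      		p++;
      	}
      	return NULL;
      }
      ```

      A NULL result therefore means "the whole buffer is filled with c",
      which is the check a page poisoning verifier needs.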
      Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
      Acked-by: Christoph Lameter <cl@linux-foundation.org>
      Acked-by: Pekka Enberg <penberg@kernel.org>
      Cc: Matt Mackall <mpm@selenic.com>
      Acked-by: Joern Engel <joern@logfs.org>
      Cc: Marcin Slusarz <marcin.slusarz@gmail.com>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      79824820
    • A
      mm/debug-pagealloc.c: use plain __ratelimit() instead of printk_ratelimit() · 77311139
      Akinobu Mita committed
      printk_ratelimit() should not be used, because it shares ratelimiting
      state with all other unrelated printk_ratelimit() callsites.
      Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      77311139
    • S
      vmscan: count pages into balanced for zone with good watermark · 16fb9512
      Shaohua Li committed
      It's possible a zone watermark is ok when entering the balance_pgdat()
      loop, while the zone is within the requested classzone_idx.  Count pages
      from this zone into `balanced'.  In this way, we can skip shrinking zones
      too much for high order allocation.
      Signed-off-by: Shaohua Li <shaohua.li@intel.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      16fb9512
    • M
      mm: vmscan: immediately reclaim end-of-LRU dirty pages when writeback completes · 49ea7eb6
      Mel Gorman committed
      When direct reclaim encounters a dirty page, it gets recycled around the
      LRU for another cycle.  This patch marks the page PageReclaim similar to
      deactivate_page() so that the page gets reclaimed almost immediately after
      the page gets cleaned.  This is to avoid reclaiming clean pages that are
      younger than a dirty page encountered at the end of the LRU that might
      have been something like a use-once page.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Johannes Weiner <jweiner@redhat.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Alex Elder <aelder@sgi.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: Chris Mason <chris.mason@oracle.com>
      Cc: Dave Hansen <dave@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      49ea7eb6
    • M
      mm: vmscan: throttle reclaim if encountering too many dirty pages under writeback · 92df3a72
      Mel Gorman committed
      Workloads that are allocating frequently and writing files place a large
      number of dirty pages on the LRU.  With use-once logic, it is possible for
      them to reach the end of the LRU quickly requiring the reclaimer to scan
      more to find clean pages.  Ordinarily, processes that are dirtying memory
      will get throttled by dirty balancing but this is a global heuristic and
      does not take into account that LRUs are maintained on a per-zone basis.
      This can lead to a situation whereby reclaim is scanning heavily, skipping
      over a large number of pages under writeback and recycling them around the
      LRU consuming CPU.
      
      This patch checks how many of the pages isolated from the LRU were dirty
      and under writeback.  If a sufficient percentage of them are under
      writeback, the process will be throttled if a backing device or the zone
      is congested.  Note that this applies whether it is anonymous or
      file-backed pages that are under writeback, meaning that swapping is
      potentially throttled.  This is intentional: if the swap device is
      congested, scanning more pages and dispatching more IO is not going to
      help matters.
      
      The percentage that must be in writeback depends on the priority.  At
      default priority, all of them must be.  At DEF_PRIORITY-1, 50% of them
      must be, at DEF_PRIORITY-2, 25%, and so on.  That is, as pressure
      increases, so does the likelihood that the process will get throttled to
      allow the flusher threads to make some progress.
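
      The halving per priority step can be sketched as a right shift of the
      number of isolated pages.  This is a minimal userspace model, not the
      patch itself: should_throttle() is a hypothetical name, DEF_PRIORITY is
      assumed to be 12, and the congestion check described above is left out.

```c
#include <assert.h>
#include <stdbool.h>

#define DEF_PRIORITY 12  /* assumed default scan priority */

/*
 * Hypothetical helper: decide whether the share of isolated pages under
 * writeback is high enough to consider throttling at this priority.
 * A lower priority value means higher reclaim pressure.
 */
static bool should_throttle(unsigned long nr_taken,
			    unsigned long nr_writeback, int priority)
{
	assert(priority >= 0 && priority <= DEF_PRIORITY);

	/* 100% at DEF_PRIORITY, 50% at DEF_PRIORITY-1, 25% at DEF_PRIORITY-2 */
	unsigned long threshold = nr_taken >> (DEF_PRIORITY - priority);

	return nr_writeback > 0 && nr_writeback >= threshold;
}
```

      With 32 isolated pages, all 32 must be under writeback at DEF_PRIORITY,
      16 suffice at DEF_PRIORITY-1 and 8 at DEF_PRIORITY-2.  Once the shift
      drives the threshold to zero, a single page under writeback is enough.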
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
      Acked-by: Johannes Weiner <jweiner@redhat.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Alex Elder <aelder@sgi.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: Chris Mason <chris.mason@oracle.com>
      Cc: Dave Hansen <dave@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      92df3a72
    • M
      mm: vmscan: do not writeback filesystem pages in kswapd except in high priority · f84f6e2b
      Mel Gorman committed
      It is preferable that no dirty pages are dispatched for cleaning from the
      page reclaim path.  At normal priorities, this patch prevents kswapd
      writing pages.
      
      However, page reclaim does have a requirement that pages be freed in a
      particular zone.  If it is failing to make sufficient progress (reclaiming
      < SWAP_CLUSTER_MAX at any priority), the priority is raised to
      scan more pages.  A priority of DEF_PRIORITY - 3 is considered to be the
      point where kswapd is getting into trouble reclaiming pages.  If this
      priority is reached, kswapd will dispatch pages for writing.
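
      The priority cut-off can be written as a one-line predicate.  This is a
      sketch under stated assumptions: kswapd_may_write_file_page() is a
      hypothetical name, DEF_PRIORITY is assumed to be 12, and "raising" the
      priority is modelled as the numeric value counting down towards 0.

```c
#include <assert.h>
#include <stdbool.h>

#define DEF_PRIORITY 12  /* assumed default scan priority */

/*
 * Hypothetical helper: may kswapd dispatch a dirty file page for writing?
 * Only once the scan priority has reached DEF_PRIORITY - 3 or beyond.
 */
static bool kswapd_may_write_file_page(int priority)
{
	assert(priority >= 0 && priority <= DEF_PRIORITY);

	return priority <= DEF_PRIORITY - 3;
}
```

      Under these assumptions, priorities 12, 11 and 10 leave dirty file pages
      to the flusher threads, while 9 and below allow kswapd to write them.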
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Johannes Weiner <jweiner@redhat.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Alex Elder <aelder@sgi.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: Chris Mason <chris.mason@oracle.com>
      Cc: Dave Hansen <dave@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f84f6e2b
    • M
      ext4: warn if direct reclaim tries to writeback pages · 966dbde2
      Mel Gorman committed
      Direct reclaim should never writeback pages.  Warn if an attempt is made.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Johannes Weiner <jweiner@redhat.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Alex Elder <aelder@sgi.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: Chris Mason <chris.mason@oracle.com>
      Cc: Dave Hansen <dave@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      966dbde2
    • M
      xfs: warn if direct reclaim tries to writeback pages · 94054fa3
      Mel Gorman committed
      Direct reclaim should never writeback pages.  For now, handle the
      situation and warn about it.  Ultimately, this will be a BUG_ON.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Johannes Weiner <jweiner@redhat.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Alex Elder <aelder@sgi.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: Chris Mason <chris.mason@oracle.com>
      Cc: Dave Hansen <dave@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      94054fa3
    • M
      mm: vmscan: remove dead code related to lumpy reclaim waiting on pages under writeback · a18bba06
      Mel Gorman committed
      Lumpy reclaim worked with two passes - the first which queued pages for IO
      and the second which waited on writeback.  As direct reclaim can no longer
      write pages there is some dead code.  This patch removes it but direct
      reclaim will continue to wait on pages under writeback while in
      synchronous reclaim mode.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Johannes Weiner <jweiner@redhat.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Alex Elder <aelder@sgi.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: Chris Mason <chris.mason@oracle.com>
      Cc: Dave Hansen <dave@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a18bba06
    • M
      mm: vmscan: do not writeback filesystem pages in direct reclaim · ee72886d
      Mel Gorman committed
      Testing from the XFS folk revealed that there is still too much I/O from
      the end of the LRU in kswapd.  Previously it was considered acceptable by
      VM people for a small number of pages to be written back from reclaim with
      testing generally showing about 0.3% of pages reclaimed were written back
      (higher if memory was low).  That writing back a small number of pages is
      ok has been heavily disputed for quite some time and Dave Chinner
      explained it well;
      
      	It doesn't have to be a very high number to be a problem. IO
      	is orders of magnitude slower than the CPU time it takes to
      	flush a page, so the cost of making a bad flush decision is
      	very high. And single page writeback from the LRU is almost
      	always a bad flush decision.
      
      To complicate matters, filesystems respond very differently to requests
      from reclaim according to Christoph Hellwig;
      
      	xfs tries to write it back if the requester is kswapd
      	ext4 ignores the request if it's a delayed allocation
      	btrfs ignores the request
      
      As a result, each filesystem has different performance characteristics
      when under memory pressure and there are many pages being dirtied.  In
      some cases, the request is ignored entirely so the VM cannot depend on the
      IO being dispatched.
      
      The objective of this series is to reduce writing of filesystem-backed
      pages from reclaim, play nicely with writeback that is already in progress
      and throttle reclaim appropriately when writeback pages are encountered.
      The assumption is that the flushers will always write pages faster than if
      reclaim issues the IO.
      
      A secondary goal is to avoid the problem whereby direct reclaim splices
      two potentially deep call stacks together.
      
      There is a potential new problem as reclaim has less control over how long
      before a page in a particular zone or container is cleaned and direct
      reclaimers depend on kswapd or flusher threads to do the necessary work.
      However, as filesystems sometimes ignore direct reclaim requests already,
      it is not expected to be a serious issue.
      
      Patch 1 disables writeback of filesystem pages from direct reclaim
      	entirely. Anonymous pages are still written.
      
      Patch 2 removes dead code in lumpy reclaim as it is no longer able
      	to synchronously write pages. This hurts lumpy reclaim but
      	there is an expectation that compaction is used for hugepage
      	allocations these days and lumpy reclaim's days are numbered.
      
      Patches 3-4 add warnings to XFS and ext4 if called from
      	direct reclaim. With patch 1, this "never happens" and is
      	intended to catch regressions in this logic in the future.
      
      Patch 5 disables writeback of filesystem pages from kswapd unless
      	the priority is raised to the point where kswapd is considered
      	to be in trouble.
      
      Patch 6 throttles reclaimers if too many dirty pages are being
      	encountered and the zones or backing devices are congested.
      
      Patch 7 invalidates dirty pages found at the end of the LRU so they
      	are reclaimed quickly after being written back rather than
      	waiting for a reclaimer to find them
      
      I consider this series to be orthogonal to the writeback work but it is
      worth noting that the writeback work affects the viability of patch 8 in
      particular.
      
      I tested this on ext4 and xfs using fs_mark, a simple writeback test based
      on dd and a micro benchmark that does a streaming write to a large mapping
      (exercises use-once LRU logic) followed by streaming writes to a mix of
      anonymous and file-backed mappings.  The command line for fs_mark when
      booted with 512M looked something like
      
      ./fs_mark -d  /tmp/fsmark-2676  -D  100  -N  150  -n  150  -L  25  -t  1  -S0  -s  10485760
      
      The number of files was adjusted depending on the amount of available
      memory so that the files created was about 3xRAM.  For multiple threads,
      the -d switch is specified multiple times.
      
      The test machine is x86-64 with an older generation of AMD processor with
      4 cores.  The underlying storage was 4 disks configured as RAID-0 as this
      was the best configuration of storage I had available.  Swap is on a
      separate disk.  Dirty ratio was tuned to 40% instead of the default of
      20%.
      
      Testing was run with and without monitors to both verify that the patches
      were operating as expected and that any performance gain was real and not
      due to interference from monitors.
      
      Here is a summary of results based on testing XFS.
      
      512M1P-xfs           Files/s  mean                 32.69 ( 0.00%)     34.44 ( 5.08%)
      512M1P-xfs           Elapsed Time fsmark                    51.41     48.29
      512M1P-xfs           Elapsed Time simple-wb                114.09    108.61
      512M1P-xfs           Elapsed Time mmap-strm                113.46    109.34
      512M1P-xfs           Kswapd efficiency fsmark                 62%       63%
      512M1P-xfs           Kswapd efficiency simple-wb              56%       61%
      512M1P-xfs           Kswapd efficiency mmap-strm              44%       42%
      512M-xfs             Files/s  mean                 30.78 ( 0.00%)     35.94 (14.36%)
      512M-xfs             Elapsed Time fsmark                    56.08     48.90
      512M-xfs             Elapsed Time simple-wb                112.22     98.13
      512M-xfs             Elapsed Time mmap-strm                219.15    196.67
      512M-xfs             Kswapd efficiency fsmark                 54%       56%
      512M-xfs             Kswapd efficiency simple-wb              54%       55%
      512M-xfs             Kswapd efficiency mmap-strm              45%       44%
      512M-4X-xfs          Files/s  mean                 30.31 ( 0.00%)     33.33 ( 9.06%)
      512M-4X-xfs          Elapsed Time fsmark                    63.26     55.88
      512M-4X-xfs          Elapsed Time simple-wb                100.90     90.25
      512M-4X-xfs          Elapsed Time mmap-strm                261.73    255.38
      512M-4X-xfs          Kswapd efficiency fsmark                 49%       50%
      512M-4X-xfs          Kswapd efficiency simple-wb              54%       56%
      512M-4X-xfs          Kswapd efficiency mmap-strm              37%       36%
      512M-16X-xfs         Files/s  mean                 60.89 ( 0.00%)     65.22 ( 6.64%)
      512M-16X-xfs         Elapsed Time fsmark                    67.47     58.25
      512M-16X-xfs         Elapsed Time simple-wb                103.22     90.89
      512M-16X-xfs         Elapsed Time mmap-strm                237.09    198.82
      512M-16X-xfs         Kswapd efficiency fsmark                 45%       46%
      512M-16X-xfs         Kswapd efficiency simple-wb              53%       55%
      512M-16X-xfs         Kswapd efficiency mmap-strm              33%       33%
      
      Up until 512-4X, the FSmark improvements were statistically significant.
      For the 4X and 16X tests the results were within standard deviations but
      just barely.  The time to completion for all tests is improved which is an
      important result.  In general, kswapd efficiency is not affected by
      skipping dirty pages.
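
      The bracketed percentages in the Files/s rows are consistent with the
      change being expressed relative to the patched value rather than the
      baseline.  For example, 32.69 -> 34.44 gives 5.08% this way but about
      5.35% relative to the baseline.  This is inferred from the numbers, not
      stated in the changelog, and gain_percent() below is a hypothetical
      helper written only to check the arithmetic.

```c
#include <assert.h>

/*
 * Hypothetical helper: improvement of `patched` over `baseline`,
 * expressed as a percentage of the patched value.
 */
static double gain_percent(double baseline, double patched)
{
	assert(patched > 0.0);

	return (patched - baseline) / patched * 100.0;
}
```

      Applied to the 512M-xfs row, gain_percent(30.78, 35.94) comes out at
      roughly 14.36, which matches the table.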
      
      1024M1P-xfs          Files/s  mean                 39.09 ( 0.00%)     41.15 ( 5.01%)
      1024M1P-xfs          Elapsed Time fsmark                    84.14     80.41
      1024M1P-xfs          Elapsed Time simple-wb                210.77    184.78
      1024M1P-xfs          Elapsed Time mmap-strm                162.00    160.34
      1024M1P-xfs          Kswapd efficiency fsmark                 69%       75%
      1024M1P-xfs          Kswapd efficiency simple-wb              71%       77%
      1024M1P-xfs          Kswapd efficiency mmap-strm              43%       44%
      1024M-xfs            Files/s  mean                 35.45 ( 0.00%)     37.00 ( 4.19%)
      1024M-xfs            Elapsed Time fsmark                    94.59     91.00
      1024M-xfs            Elapsed Time simple-wb                229.84    195.08
      1024M-xfs            Elapsed Time mmap-strm                405.38    440.29
      1024M-xfs            Kswapd efficiency fsmark                 79%       71%
      1024M-xfs            Kswapd efficiency simple-wb              74%       74%
      1024M-xfs            Kswapd efficiency mmap-strm              39%       42%
      1024M-4X-xfs         Files/s  mean                 32.63 ( 0.00%)     35.05 ( 6.90%)
      1024M-4X-xfs         Elapsed Time fsmark                   103.33     97.74
      1024M-4X-xfs         Elapsed Time simple-wb                204.48    178.57
      1024M-4X-xfs         Elapsed Time mmap-strm                528.38    511.88
      1024M-4X-xfs         Kswapd efficiency fsmark                 81%       70%
      1024M-4X-xfs         Kswapd efficiency simple-wb              73%       72%
      1024M-4X-xfs         Kswapd efficiency mmap-strm              39%       38%
      1024M-16X-xfs        Files/s  mean                 42.65 ( 0.00%)     42.97 ( 0.74%)
      1024M-16X-xfs        Elapsed Time fsmark                   103.11     99.11
      1024M-16X-xfs        Elapsed Time simple-wb                200.83    178.24
      1024M-16X-xfs        Elapsed Time mmap-strm                397.35    459.82
      1024M-16X-xfs        Kswapd efficiency fsmark                 84%       69%
      1024M-16X-xfs        Kswapd efficiency simple-wb              74%       73%
      1024M-16X-xfs        Kswapd efficiency mmap-strm              39%       40%
      
      All FSMark tests up to 16X had statistically significant improvements.
      For the most part, tests are completing faster with the exception of the
      streaming writes to a mixture of anonymous and file-backed mappings which
      were slower in two cases.
      
      In the cases where the mmap-strm tests were slower, there was more
      swapping due to dirty pages being skipped.  The number of additional pages
      swapped is almost identical to the reduction in pages written from
      reclaim.  In other words, roughly the same number of pages were reclaimed
      but swapping was slower.  As the test is a bit unrealistic and stresses
      memory heavily, the small shift is acceptable.
      
      4608M1P-xfs          Files/s  mean                 29.75 ( 0.00%)     30.96 ( 3.91%)
      4608M1P-xfs          Elapsed Time fsmark                   512.01    492.15
      4608M1P-xfs          Elapsed Time simple-wb                618.18    566.24
      4608M1P-xfs          Elapsed Time mmap-strm                488.05    465.07
      4608M1P-xfs          Kswapd efficiency fsmark                 93%       86%
      4608M1P-xfs          Kswapd efficiency simple-wb              88%       84%
      4608M1P-xfs          Kswapd efficiency mmap-strm              46%       45%
      4608M-xfs            Files/s  mean                 27.60 ( 0.00%)     28.85 ( 4.33%)
      4608M-xfs            Elapsed Time fsmark                   555.96    532.34
      4608M-xfs            Elapsed Time simple-wb                659.72    571.85
      4608M-xfs            Elapsed Time mmap-strm               1082.57   1146.38
      4608M-xfs            Kswapd efficiency fsmark                 89%       91%
      4608M-xfs            Kswapd efficiency simple-wb              88%       82%
      4608M-xfs            Kswapd efficiency mmap-strm              48%       46%
      4608M-4X-xfs         Files/s  mean                 26.00 ( 0.00%)     27.47 ( 5.35%)
      4608M-4X-xfs         Elapsed Time fsmark                   592.91    564.00
      4608M-4X-xfs         Elapsed Time simple-wb                616.65    575.07
      4608M-4X-xfs         Elapsed Time mmap-strm               1773.02   1631.53
      4608M-4X-xfs         Kswapd efficiency fsmark                 90%       94%
      4608M-4X-xfs         Kswapd efficiency simple-wb              87%       82%
      4608M-4X-xfs         Kswapd efficiency mmap-strm              43%       43%
      4608M-16X-xfs        Files/s  mean                 26.07 ( 0.00%)     26.42 ( 1.32%)
      4608M-16X-xfs        Elapsed Time fsmark                   602.69    585.78
      4608M-16X-xfs        Elapsed Time simple-wb                606.60    573.81
      4608M-16X-xfs        Elapsed Time mmap-strm               1549.75   1441.86
      4608M-16X-xfs        Kswapd efficiency fsmark                 98%       98%
      4608M-16X-xfs        Kswapd efficiency simple-wb              88%       82%
      4608M-16X-xfs        Kswapd efficiency mmap-strm              44%       42%
      
      Unlike the other tests, the fsmark results are not statistically
      significant but the min and max times are both improved and for the most
      part, tests completed faster.
      
      There are other indications that this is an improvement as well.  For
      example, in the vast majority of cases, there were fewer pages scanned by
      direct reclaim implying in many cases that stalls due to direct reclaim
      are reduced.  kswapd is scanning more due to skipping dirty pages which is
      unfortunate but the CPU usage is still acceptable.
      
      In an earlier set of tests, I used blktrace and in almost all cases
      throughput throughout the entire test was higher.  However, I ended up
      discarding those results as recording blktrace data was too heavy for my
      liking.
      
      On a laptop, I plugged in a USB stick and ran a similar set of tests
      using it as backing storage.  A desktop environment was running and for
      the entire duration of the tests, firefox and gnome terminal were
      launching and exiting to vaguely simulate a user.
      
      1024M-xfs            Files/s  mean               0.41 ( 0.00%)        0.44 ( 6.82%)
      1024M-xfs            Elapsed Time fsmark               2053.52   1641.03
      1024M-xfs            Elapsed Time simple-wb            1229.53    768.05
      1024M-xfs            Elapsed Time mmap-strm            4126.44   4597.03
      1024M-xfs            Kswapd efficiency fsmark              84%       85%
      1024M-xfs            Kswapd efficiency simple-wb           92%       81%
      1024M-xfs            Kswapd efficiency mmap-strm           60%       51%
      1024M-xfs            Avg wait ms fsmark                5404.53   4473.87
      1024M-xfs            Avg wait ms simple-wb             2541.35   1453.54
      1024M-xfs            Avg wait ms mmap-strm             3400.25   3852.53
      
      The mmap-strm results were hurt because firefox launching had a tendency
      to push the test out of memory.  On the positive side, firefox launched
      marginally faster with the patches applied.  Time to completion for many
      tests was faster but, more importantly, the "Avg wait" time as measured by
      iostat was far lower implying the system would be more responsive.  It was
      also the case that "Avg wait ms" on the root filesystem was lower.  I
      tested it manually and while the system felt slightly more responsive
      while copying data to a USB stick, it was marginal enough that it could be
      my imagination.
      
      This patch: do not writeback filesystem pages in direct reclaim.
      
      When kswapd is failing to keep zones above the min watermark, a process
      will enter direct reclaim in the same manner kswapd does.  If a dirty page
      is encountered during the scan, this page is written to backing storage
      using mapping->writepage.
      
      This causes two problems.  First, it can result in very deep call stacks,
      particularly if the target storage or filesystem are complex.  Some
      filesystems ignore write requests from direct reclaim as a result.  The
      second is that a single-page flush is inefficient in terms of IO.  While
      there is an expectation that the elevator will merge requests, this does
      not always happen.  Quoting Christoph Hellwig;
      
      	The elevator has a relatively small window it can operate on,
      	and can never fix up a bad large scale writeback pattern.
      
      This patch prevents direct reclaim writing back filesystem pages by
      checking if current is kswapd.  Anonymous pages are still written to swap
      as there is no equivalent of a flusher thread for anonymous pages.
      If the dirty pages cannot be written back, they are placed back on the LRU
      lists.  There is now a direct dependency on dirty page balancing to
      prevent too many pages in the system being dirtied which would prevent
      reclaim making forward progress.
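
      The rule described above can be modelled as a small decision function.
      This is a userspace sketch only: enum page_kind and reclaim_may_write()
      are hypothetical names, and the is_kswapd flag stands in for the check
      on whether current is kswapd.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical classification of a dirty page found during reclaim. */
enum page_kind { PAGE_ANON, PAGE_FILE };

/*
 * Hypothetical helper: may the reclaimer write this dirty page itself?
 * Anonymous pages always go to swap because no flusher thread handles
 * them; file pages are written only when the reclaimer is kswapd.
 */
static bool reclaim_may_write(enum page_kind kind, bool is_kswapd)
{
	switch (kind) {
	case PAGE_ANON:
		return true;
	case PAGE_FILE:
		return is_kswapd;
	}
	assert(!"unknown page kind");
	return false;
}
```

      A dirty file page seen by direct reclaim yields false here, which
      corresponds to the page being placed back on the LRU lists.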
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Johannes Weiner <jweiner@redhat.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Alex Elder <aelder@sgi.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: Chris Mason <chris.mason@oracle.com>
      Cc: Dave Hansen <dave@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ee72886d