1. 17 Jun, 2009 40 commits
    •
      vmscan: do not unconditionally treat zones that fail zone_reclaim() as full · fa5e084e
      Mel Gorman committed
      On NUMA machines, the administrator can configure zone_reclaim_mode, which
      is a more targeted form of direct reclaim.  On machines with large NUMA
      distances, for example, zone_reclaim_mode defaults to 1, meaning that
      clean unmapped pages will be reclaimed if the zone watermarks are not
      being met.  The problem is that zone_reclaim() failing at all means the
      zone gets marked full.
      
      This can cause situations where a zone is usable, but is being skipped
      because it has been considered full.  Take a situation where a large tmpfs
      mount is occupying a large percentage of memory overall.  The pages do not
      get cleaned or reclaimed by zone_reclaim(), but the zone gets marked full
      and the zonelist cache considers them not worth trying in the future.
      
      This patch makes zone_reclaim() return more fine-grained information about
      what occurred when zone_reclaim() failed.  The zone only gets marked full
      if it really is unreclaimable.  If the scan did not occur, or if not enough
      pages were reclaimed with the limited reclaim_mode, then the zone is simply
      skipped.
      
      There is a side-effect to this patch.  Currently, if zone_reclaim()
      successfully reclaimed SWAP_CLUSTER_MAX, an allocation attempt would go
      ahead.  With this patch applied, zone watermarks are rechecked after
      zone_reclaim() does some work.
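
      The outcome handling described above can be modelled roughly as follows.
      This is a user-space Python sketch, not kernel code: the result names and
      the try_zone() helper are illustrative assumptions chosen to mirror the
      description, and the real allocator logic is more involved.

```python
from enum import Enum


class ReclaimResult(Enum):
    """Hypothetical names for the fine-grained zone_reclaim() outcomes."""
    NOSCAN = "scan did not occur"          # e.g. unsuitable GFP mask or reclaim mode
    FULL = "zone is unreclaimable"         # scanned, nothing could be freed
    SOME = "not enough pages reclaimed"    # scanned, fell short of the target
    SUCCESS = "enough pages reclaimed"


def try_zone(result, watermark_ok_after_reclaim):
    """Return (allocate_from_zone, mark_zone_full) for a single zone."""
    if result is ReclaimResult.NOSCAN:
        return (False, False)              # skip the zone, do not mark it full
    if result is ReclaimResult.FULL:
        return (False, True)               # only this case marks the zone full
    # Some work was done: recheck the watermarks instead of assuming success.
    if watermark_ok_after_reclaim:
        return (True, False)
    return (False, False)                  # skip, but keep the zone usable later


for outcome in ReclaimResult:
    print(outcome.name, try_zone(outcome, watermark_ok_after_reclaim=True))
```

      Only the FULL outcome feeds the zonelist cache; every other failure leaves
      the zone eligible for later allocation attempts.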
      
      This bug was introduced by commit 9276b1bc
      ("memory page_alloc zonelist caching speedup") way back in 2.6.19 when the
      zonelist_cache was introduced.  It was not intended that zone_reclaim()
      aggressively consider the zone to be full when it failed as full direct
      reclaim can still be an option.  Due to the age of the bug, it should be
      considered a -stable candidate.
      Signed-off-by: Mel Gorman <mel@csn.ul.ie>
      Reviewed-by: Wu Fengguang <fengguang.wu@intel.com>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: <stable@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    •
      vmscan: properly account for the number of page cache pages zone_reclaim() can reclaim · 90afa5de
      Mel Gorman committed
      A bug was brought to my attention against a distro kernel but it affects
      mainline and I believe problems like this have been reported in various
      guises on the mailing lists although I don't have specific examples at the
      moment.
      
      The reported problem was that malloc() stalled for a long time (minutes in
      some cases) if a large tmpfs mount was occupying a large percentage of
      memory overall.  The pages did not get cleaned or reclaimed by
      zone_reclaim() because the zone_reclaim_mode was unsuitable, but the lists
      were uselessly and frequently scanned, making the CPU spin at near 100%.
      
      This patchset intends to address that bug and bring the behaviour of
      zone_reclaim() more in line with expectations which were noticed during
      investigation.  It is based on top of mmotm and takes advantage of
      Kosaki's work with respect to zone_reclaim().
      
      Patch 1 fixes the heuristics that zone_reclaim() uses to determine if the
      	scan should go ahead. The broken heuristic is what was causing the
      	malloc() stall as it uselessly scanned the LRU constantly. Currently,
      	zone_reclaim is assuming zone_reclaim_mode is 1 and historically it
      	could not deal with tmpfs pages at all. This fixes up the heuristic so
      	that an unnecessary scan is more likely to be correctly avoided.
      
      Patch 2 notes that zone_reclaim() returning a failure automatically means
      	the zone is marked full. This is not always true. It could have
      	failed because the GFP mask or zone_reclaim_mode were unsuitable.
      
      Patch 3 introduces a counter zreclaim_failed that will increment each
      	time the zone_reclaim scan-avoidance heuristics fail. If that
      	counter is rapidly increasing, then zone_reclaim_mode should be
      	set to 0 as a temporary resolution and a bug reported because
      	the scan-avoidance heuristic is still broken.
      
      This patch:
      
      On NUMA machines, the administrator can configure zone_reclaim_mode, which
      is a more targeted form of direct reclaim.  On machines with large NUMA
      distances, for example, zone_reclaim_mode defaults to 1, meaning that
      clean unmapped pages will be reclaimed if the zone watermarks are not
      being met.
      
      There is a heuristic that determines if the scan is worthwhile but the
      problem is that the heuristic is not being properly applied and is
      basically assuming zone_reclaim_mode is 1 if it is enabled.  The lack of
      proper detection can manifest as high CPU usage as the LRU list is scanned
      uselessly.
      
      Historically, once enabled it was depending on NR_FILE_PAGES which may
      include swapcache pages that the reclaim_mode cannot deal with.  Patch
      vmscan-change-the-number-of-the-unmapped-files-in-zone-reclaim.patch by
      Kosaki Motohiro noted that zone_page_state(zone, NR_FILE_PAGES) included
      pages that were not file-backed such as swapcache and made a calculation
      based on the inactive, active and mapped files.  This is far superior when
      zone_reclaim_mode==1 but if RECLAIM_SWAP is set, then NR_FILE_PAGES is a
      reasonable starting figure.
      
      This patch alters how zone_reclaim() works out how many pages it might be
      able to reclaim given the current reclaim_mode.  If RECLAIM_SWAP is set in
      the reclaim_mode, it considers NR_FILE_PAGES as potential candidates;
      otherwise it uses NR_{INACTIVE,ACTIVE}_FILE - NR_FILE_MAPPED to discount
      swapcache and other non-file-backed pages.  If RECLAIM_WRITE is not set,
      then NR_FILE_DIRTY number of pages are not candidates.  If RECLAIM_SWAP is
      not set, then NR_FILE_MAPPED are not.
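
      The estimate described above can be sketched in user-space Python as
      follows.  This is a model under stated assumptions, not the kernel
      function: the helper name and parameters are hypothetical, and the
      RECLAIM_* bit values are the ones documented for the zone_reclaim_mode
      sysctl as far as I recall them.

```python
# Assumed bit values of zone_reclaim_mode (1 = zone reclaim on,
# 2 = may write dirty pages, 4 = may unmap/swap pages).
RECLAIM_ZONE = 1 << 0
RECLAIM_WRITE = 1 << 1
RECLAIM_SWAP = 1 << 2


def pagecache_reclaimable(mode, nr_file_pages, nr_inactive_file,
                          nr_active_file, nr_file_mapped, nr_file_dirty):
    """Estimate how many page cache pages zone reclaim could free."""
    if mode & RECLAIM_SWAP:
        # Mapped and swap-backed pages can be handled: start from everything.
        candidates = nr_file_pages
    else:
        # Only unmapped file LRU pages count; clamp at 0 to avoid underflow.
        file_lru = nr_inactive_file + nr_active_file
        candidates = max(file_lru - nr_file_mapped, 0)

    excluded = 0
    if not (mode & RECLAIM_WRITE):
        excluded += nr_file_dirty          # dirty pages cannot be written back

    # Never subtract more than there is.
    return candidates - min(excluded, candidates)


# A zone dominated by tmpfs: many "file pages", few of them on the file LRU.
print(pagecache_reclaimable(RECLAIM_ZONE, nr_file_pages=1000,
                            nr_inactive_file=60, nr_active_file=40,
                            nr_file_mapped=30, nr_file_dirty=20))
```

      With a small estimate like this one, the scan can be avoided instead of
      walking an LRU that holds almost nothing reclaimable.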
      
      [kosaki.motohiro@jp.fujitsu.com: Estimate unmapped pages minus tmpfs pages]
      [fengguang.wu@intel.com: Fix underflow problem in Kosaki's estimate]
      Signed-off-by: Mel Gorman <mel@csn.ul.ie>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Acked-by: Christoph Lameter <cl@linux-foundation.org>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: <stable@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    •
      writeback: skip new or to-be-freed inodes · 84a89245
      Wu Fengguang committed
      1) I_FREEING tests should be coupled with I_CLEAR
      
      The two I_FREEING tests are racy because clear_inode() can set i_state to
      I_CLEAR between the clear of I_SYNC and the test of I_FREEING.
      
      2) skip I_WILL_FREE inodes in generic_sync_sb_inodes() to avoid possible
         races with generic_forget_inode()
      
      generic_forget_inode() sets I_WILL_FREE and calls writeback on its own, so
      generic_sync_sb_inodes() should not try to step in and create possible races:
      
        generic_forget_inode
          inode->i_state |= I_WILL_FREE;
          spin_unlock(&inode_lock);
                                             generic_sync_sb_inodes()
                                               spin_lock(&inode_lock);
                                               __iget(inode);
                                               __writeback_single_inode
                                                 // see non zero i_count
       may WARN here ==>                         WARN_ON(inode->i_state & I_WILL_FREE);
                                               spin_unlock(&inode_lock);
       may call generic_forget_inode again ==> iput(inode);
      
      The above race and warning didn't turn up because writeback_inodes() holds
      the s_umount lock, so generic_forget_inode() finds MS_ACTIVE and returns
      early.  But we are not sure the UBIFS calls and future callers will
      guarantee that.  So skip I_WILL_FREE inodes for the sake of safety.
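
      The two state checks discussed in this changelog can be illustrated with
      a small Python model.  The bit values and helper names below are
      hypothetical (the real flags are defined in the kernel headers with
      different values); only the combination of flags is taken from the text
      and the commit title.

```python
# Hypothetical bit values, for illustration only.
I_NEW = 1 << 0
I_WILL_FREE = 1 << 1
I_FREEING = 1 << 2
I_CLEAR = 1 << 3


def skip_in_sync_loop(i_state):
    """Point 2: leave new or to-be-freed inodes alone in the sync loop."""
    return bool(i_state & (I_NEW | I_WILL_FREE))


def being_freed(i_state):
    """Point 1: test I_FREEING together with I_CLEAR."""
    return bool(i_state & (I_FREEING | I_CLEAR))


# clear_inode() may have replaced the state with I_CLEAR in the meantime:
state = I_CLEAR
print("I_FREEING alone:", bool(state & I_FREEING))   # misses the inode
print("coupled test:   ", being_freed(state))        # catches it
```

      The first print shows why testing I_FREEING alone is racy: once the state
      has become I_CLEAR, the single-flag test no longer matches.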
      
      Cc: Eric Sandeen <sandeen@sandeen.net>
      Acked-by: Jeff Layton <jlayton@redhat.com>
      Cc: Masayoshi MIZUMA <m.mizuma@jp.fujitsu.com>
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Artem Bityutskiy <dedekind1@gmail.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Acked-by: Jan Kara <jack@suse.cz>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    •
      oom: only oom kill exiting tasks with attached memory · 81236810
      David Rientjes committed
      When a task is chosen for oom kill and is found to be PF_EXITING,
      __oom_kill_task() is called to elevate the task's timeslice and give it
      access to memory reserves so that it may quickly exit.
      
      This privilege is unnecessary, however, if the task has already detached
      its mm.  Although it's possible for the mm to become detached later since
      task_lock() is not held, __oom_kill_task() will simply be a no-op in such
      circumstances.
      
      Subsequently, it is no longer necessary to warn about killing mm-less
      tasks since it is a no-op.
      Signed-off-by: David Rientjes <rientjes@google.com>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    •
      vmscan: handle may_swap more strictly · 9198e96c
      Daisuke Nishimura committed
      Commit 2e2e4259 ("vmscan,memcg: reintroduce sc->may_swap") added the
      may_swap flag and handles it in get_scan_ratio().
      
      But the result of get_scan_ratio() is ignored when priority == 0, so the anon
      LRU is scanned even if may_swap == 0 or nr_swap_pages == 0.  IMHO, this is
      not the expected behavior.

      For memcg in particular, this behavior causes a great many pages to be
      swapped out in vain when OOM is invoked by the mem+swap limit.

      This patch handles the may_swap flag more strictly.
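
      A rough Python model of the stricter behavior follows.  It is a sketch
      under assumptions: scan_targets() and its parameters are hypothetical, and
      the (anon%, file%) pair stands in for whatever get_scan_ratio() would
      return.

```python
def scan_targets(lru_sizes, priority, may_swap, nr_swap_pages, ratio):
    """Return pages to scan per LRU type.

    lru_sizes: {"anon": pages, "file": pages}
    ratio:     (anon_percent, file_percent) from the balancing heuristic
    """
    noswap = (not may_swap) or nr_swap_pages <= 0
    if noswap:
        # No point scanning anon pages that cannot be swapped out.
        percent = {"anon": 0, "file": 100}
    else:
        percent = {"anon": ratio[0], "file": ratio[1]}

    targets = {}
    for lru, size in lru_sizes.items():
        scan = size
        # At priority 0 the ratio used to be ignored entirely; with the
        # stricter handling it still applies whenever swapping is impossible.
        if priority or noswap:
            scan >>= priority
            scan = scan * percent[lru] // 100
        targets[lru] = scan
    return targets


sizes = {"anon": 1000, "file": 2000}
print(scan_targets(sizes, priority=0, may_swap=False, nr_swap_pages=500, ratio=(50, 50)))
print(scan_targets(sizes, priority=0, may_swap=True, nr_swap_pages=500, ratio=(50, 50)))
```

      In the first call the anon LRU gets a scan target of 0 even at
      priority 0, which is the case the changelog complains about.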
      Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    •
      vmscan: merge duplicate code in shrink_active_list() · 3eb4140f
      Wu Fengguang committed
      The "move pages to active list" and "move pages to inactive list" code
      blocks are mostly identical and can be served by a function.
      
      Thanks to Andrew Morton for pointing this out.
      
      Note that buffer_heads_over_limit check will also be carried out for
      re-activated pages, which is slightly different from pre-2.6.28 kernels.
      Also, Rik's "vmscan: evict use-once pages first" patch could totally stop
      scans of the active file list when memory pressure is low.  So the net effect
      could be that the number of buffer heads is now more likely to grow large.
      
      However that's fine according to Johannes' comments:
      
        I don't think that this could be harmful.  We just preserve the buffer
        mappings of what we consider the working set and with low memory
        pressure, as you say, this set is not big.
      
        As to stripping of reactivated pages: the only pages we re-activate
        for now are those VM_EXEC mapped ones.  Since we don't expect IO from
        or to these pages, removing the buffer mappings in case they grow too
        large should be okay, I guess.
      
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Acked-by: Peter Zijlstra <peterz@infradead.org>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
      Reviewed-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    •
      vmscan: make mapped executable pages the first class citizen · 8cab4754
      Wu Fengguang committed
      Protect referenced PROT_EXEC mapped pages from being deactivated.
      
      PROT_EXEC (or its internal representation, VM_EXEC) pages normally belong to
      currently running executables and their linked libraries, so they should be
      cached aggressively to provide a good user experience.
      
      Thanks to Johannes Weiner for the advice to reuse the VMA walk in
      page_referenced() to get the PROT_EXEC bit.
      
      [more details]
      
      ( The consequences of this patch will have to be discussed together with
        Rik van Riel's recent patch "vmscan: evict use-once pages first". )
      
      ( Some of the good points and insights are taken into this changelog.
        Thanks to all the involved people for the great LKML discussions. )
      
      the problem
      ===========
      
      For a typical desktop, the most precious working set is composed of
      *actively accessed*
      	(1) memory mapped executables
      	(2) and their anonymous pages
      	(3) and other files
      	(4) and the dcache/icache/.. slabs
      while the least important data are
      	(5) infrequently used or use-once files
      
      For a typical desktop, one major problem is large, bursty amounts of (5)
      use-once files flushing out the working set.

      Inside the working set, (4) dcache/icache have already been too sticky ;-)
      So we only have to care about (2) anonymous and (1)(3) file pages.
      
      anonymous pages
      ===============
      
      Anonymous pages are effectively immune to the streaming IO attack, because we
      now have separate file/anon LRU lists. When the use-once files crowd into the
      file LRU, the list's "quality" is significantly lowered. Therefore the scan
      balance policy in get_scan_ratio() will choose to scan the (low quality) file
      LRU much more frequently than the anon LRU.
      
      file pages
      ==========
      
      Rik proposed to *not* scan the active file LRU when the inactive list grows
      larger than the active list. This guarantees that when there is use-once
      streaming IO and the working set is not too large (so that active_size <
      inactive_size), the active file LRU will *not* be scanned at all. So a
      not-too-large working set can be well protected.

      But there are also situations where the file working set is a bit large, so
      that (active_size >= inactive_size), or the streaming IO is not purely
      use-once. In these cases, the active list will be scanned slowly, and because
      the current shrink_active_list() policy is to deactivate active pages
      regardless of their referenced bits, the deactivated pages become susceptible
      to the streaming IO attack: the inactive list could be scanned fast (500MB /
      50MBps = 10s), so the deactivated pages don't have enough time to get
      re-referenced, given that a user tends to switch between windows at intervals
      ranging from seconds to minutes.
      
      This patch holds mapped executable pages in the active list as long as they
      are referenced during each full scan of the active list.  Because the active
      list is normally scanned much more slowly, they get a longer grace time (e.g.
      100s) for further references, which better matches the pace of user operations.
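
      The per-page decision described above might be modelled like this.  It is
      an illustrative Python sketch, not the kernel code: the helper name is
      hypothetical, the VM_EXEC bit value is an assumption, and restricting the
      rule to file-backed pages is inferred from the "file pages" heading of
      this section.

```python
VM_EXEC = 0x4  # assumed bit value, used here only for illustration


def keep_on_active_list(referenced, vm_flags, is_anon):
    """Decide whether an active page is spared from deactivation.

    referenced: the page was referenced since the last active-list scan
    vm_flags:   OR of the flags of the VMAs that referenced the page
    is_anon:    True for anonymous pages, which this rule does not cover
    """
    return referenced and bool(vm_flags & VM_EXEC) and not is_anon


# A referenced, executable, file-backed page stays active...
print(keep_on_active_list(True, VM_EXEC, is_anon=False))
# ...while an unreferenced one is deactivated like any other page.
print(keep_on_active_list(False, VM_EXEC, is_anon=False))
```

      Every other page keeps the existing policy and is deactivated regardless
      of its referenced bit.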
      
      Therefore this patch greatly prolongs the in-cache time of executable code
      under moderate memory pressure.
      
      	before patch: guaranteed to be cached if reference intervals < I
      	after  patch: guaranteed to be cached if reference intervals < I+A
      		      (except when randomly reclaimed by the lumpy reclaim)
      where
      	A = time to fully scan the   active file LRU
      	I = time to fully scan the inactive file LRU
      
      Note that normally A >> I.
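
      As a worked example of the I versus I+A bound, the following Python
      sketch plugs in the illustrative figures used earlier in this changelog
      (a 10s inactive-list scan and a 100s grace time).  The function name is
      hypothetical and the numbers are examples, not measurements.

```python
def stays_cached(reference_interval_s, inactive_scan_s, active_scan_s, patched):
    """Apply the bound: cached if the reference interval is below I (or I + A)."""
    limit = inactive_scan_s
    if patched:
        limit += active_scan_s
    return reference_interval_s < limit


I = 500 / 50   # 500MB inactive list scanned at 50MB/s -> 10 seconds
A = 100        # example grace time for a full active-list scan, in seconds

# A user returning to a window after one minute:
print("before patch:", stays_cached(60, I, A, patched=False))
print("after patch: ", stays_cached(60, I, A, patched=True))
```

      With these figures a 60s reference interval falls outside I but well
      inside I+A, which is the range of window-switching intervals the patch
      aims to cover.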
      
      side effects
      ============
      
      This patch is safe in general: it restores the pre-2.6.28 mmap() behavior,
      but in a much smaller and well-targeted scope.

      One may worry about someone abusing the PROT_EXEC heuristic.  But as
      Andrew Morton stated, there are other tricks for getting that sort of boost.

      Another concern is the PROT_EXEC mapped pages growing large in rare cases,
      and therefore hurting reclaim efficiency. But a sane application targeted at a
      large audience will never use PROT_EXEC for data mappings. If some home-made
      application tries to abuse that bit, it should be aware of the consequences.
      If it is abused to the scale of 2/3 of total memory, it gains nothing but
      overhead.
      
      benchmarks
      ==========
      
      1) memory tight desktop
      
      1.1) brief summary
      
      - clock time and major faults are reduced by 50%;
      - pswpin numbers are reduced to ~1/3.
      
      That means X desktop responsiveness is doubled under high memory/swap pressure.
      
      1.2) test scenario
      
      - nfsroot gnome desktop with 512M physical memory
      - run some programs, and switch between the existing windows
        after starting each new program.
      
      1.3) progress timing (seconds)
      
        before       after    programs
          0.02        0.02    N xeyes
          0.75        0.76    N firefox
          2.02        1.88    N nautilus
          3.36        3.17    N nautilus --browser
          5.26        4.89    N gthumb
          7.12        6.47    N gedit
          9.22        8.16    N xpdf /usr/share/doc/shared-mime-info/shared-mime-info-spec.pdf
         13.58       12.55    N xterm
         15.87       14.57    N mlterm
         18.63       17.06    N gnome-terminal
         21.16       18.90    N urxvt
         26.24       23.48    N gnome-system-monitor
         28.72       26.52    N gnome-help
         32.15       29.65    N gnome-dictionary
         39.66       36.12    N /usr/games/sol
         43.16       39.27    N /usr/games/gnometris
         48.65       42.56    N /usr/games/gnect
         53.31       47.03    N /usr/games/gtali
         58.60       52.05    N /usr/games/iagno
         65.77       55.42    N /usr/games/gnotravex
         70.76       61.47    N /usr/games/mahjongg
         76.15       67.11    N /usr/games/gnome-sudoku
         86.32       75.15    N /usr/games/glines
         92.21       79.70    N /usr/games/glchess
        103.79       88.48    N /usr/games/gnomine
        113.84       96.51    N /usr/games/gnotski
        124.40      102.19    N /usr/games/gnibbles
        137.41      114.93    N /usr/games/gnobots2
        155.53      125.02    N /usr/games/blackjack
        179.85      135.11    N /usr/games/same-gnome
        224.49      154.50    N /usr/bin/gnome-window-properties
        248.44      162.09    N /usr/bin/gnome-default-applications-properties
        282.62      173.29    N /usr/bin/gnome-at-properties
        323.72      188.21    N /usr/bin/gnome-typing-monitor
        363.99      199.93    N /usr/bin/gnome-at-visual
        394.21      206.95    N /usr/bin/gnome-sound-properties
        435.14      224.49    N /usr/bin/gnome-at-mobility
        463.05      234.11    N /usr/bin/gnome-keybinding-properties
        503.75      248.59    N /usr/bin/gnome-about-me
        554.00      276.27    N /usr/bin/gnome-display-properties
        615.48      304.39    N /usr/bin/gnome-network-preferences
        693.03      342.01    N /usr/bin/gnome-mouse-properties
        759.90      388.58    N /usr/bin/gnome-appearance-properties
        937.90      508.47    N /usr/bin/gnome-control-center
       1109.75      587.57    N /usr/bin/gnome-keyboard-properties
       1399.05      758.16    N : oocalc
       1524.64      830.03    N : oodraw
       1684.31      900.03    N : ooimpress
       1874.04      993.91    N : oomath
       2115.12     1081.89    N : ooweb
       2369.02     1161.99    N : oowriter
      
      Note that the last ": oo*" commands are actually commented out.
      
      1.4) vmstat numbers (some relevant ones are marked with *)
      
                                  before    after
       nr_free_pages              1293      3898
       nr_inactive_anon           59956     53460
       nr_active_anon             26815     30026
       nr_inactive_file           2657      3218
       nr_active_file             2019      2806
       nr_unevictable             4         4
       nr_mlock                   4         4
       nr_anon_pages              26706     27859
      *nr_mapped                  3542      4469
       nr_file_pages              72232     67681
       nr_dirty                   1         0
       nr_writeback               123       19
       nr_slab_reclaimable        3375      3534
       nr_slab_unreclaimable      11405     10665
       nr_page_table_pages        8106      7864
       nr_unstable                0         0
       nr_bounce                  0         0
      *nr_vmscan_write            394776    230839
       nr_writeback_temp          0         0
       numa_hit                   6843353   3318676
       numa_miss                  0         0
       numa_foreign               0         0
       numa_interleave            1719      1719
       numa_local                 6843353   3318676
       numa_other                 0         0
      *pgpgin                     5954683   2057175
      *pgpgout                    1578276   922744
      *pswpin                     1486615   512238
      *pswpout                    394568    230685
       pgalloc_dma                277432    56602
       pgalloc_dma32              6769477   3310348
       pgalloc_normal             0         0
       pgalloc_movable            0         0
       pgfree                     7048396   3371118
       pgactivate                 2036343   1471492
       pgdeactivate               2189691   1612829
       pgfault                    3702176   3100702
      *pgmajfault                 452116    201343
       pgrefill_dma               12185     7127
       pgrefill_dma32             334384    653703
       pgrefill_normal            0         0
       pgrefill_movable           0         0
       pgsteal_dma                74214     22179
       pgsteal_dma32              3334164   1638029
       pgsteal_normal             0         0
       pgsteal_movable            0         0
       pgscan_kswapd_dma          1081421   1216199
       pgscan_kswapd_dma32        58979118  46002810
       pgscan_kswapd_normal       0         0
       pgscan_kswapd_movable      0         0
       pgscan_direct_dma          2015438   1086109
       pgscan_direct_dma32        55787823  36101597
       pgscan_direct_normal       0         0
       pgscan_direct_movable      0         0
       pginodesteal               3461      7281
       slabs_scanned              564864    527616
       kswapd_steal               2889797   1448082
       kswapd_inodesteal          14827     14835
       pageoutrun                 43459     21562
       allocstall                 9653      4032
       pgrotated                  384216    228631
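
      The figures in the brief summary can be cross-checked against the tables
      above with a few lines of Python.  The values are copied from the
      "before" and "after" columns; taking the last row of the progress timing
      table as the total clock time is an assumption of this sketch.

```python
# (before, after) pairs copied from the tables above.
measurements = {
    "clock time (s)": (2369.02, 1161.99),   # last row of the timing table
    "pgmajfault": (452116, 201343),
    "pswpin": (1486615, 512238),
}


def after_over_before(pair):
    """Return the after/before ratio for one measurement."""
    before, after = pair
    return after / before


for name, pair in measurements.items():
    print(f"{name}: after/before = {after_over_before(pair):.2f}")
```

      The ratios come out near one half for clock time and major faults and
      near one third for pswpin, in line with the summary.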
      
      1.5) free numbers at the end of the tests
      
      before patch:
                                   total       used       free     shared    buffers     cached
                      Mem:           474        467          7          0          0        236
                      -/+ buffers/cache:        230        243
                      Swap:         1023        418        605
      
      after patch:
                                   total       used       free     shared    buffers     cached
                      Mem:           474        457         16          0          0        236
                      -/+ buffers/cache:        221        253
                      Swap:         1023        404        619
      
      2) memory flushing in a file server
      
      2.1) brief summary
      
      The number of major faults drops from 50 to 3 during 10% cache hot reads.
      
      That means this patch successfully stops major faults when the active file
      list is slowly scanned while there is partially cache hot streaming IO.
      
      2.2) test scenario
      
      Do 100000 pread(size=110 pages, offset=(i*100) pages), where 10% of the
      pages will be activated:
      
              for i in `seq 0 100 10000000`; do echo $i 110;  done > pattern-hot-10
              iotrace.rb --load pattern-hot-10 --play /b/sparse
              vmmon  nr_mapped nr_active_file nr_inactive_file   pgmajfault pgdeactivate pgfree
      
      and monitor /proc/vmstat during the time. The test box has 2G memory.
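
      To see where the 10% figure comes from, the access pattern generated by
      the seq loop above can be reproduced in Python.  This only models the
      pattern file; iotrace.rb itself is not reimplemented, and the helper name
      is hypothetical.

```python
def read_pattern(last_offset=10_000_000, stride=100, size=110):
    """Mirror `seq 0 100 10000000`: one (offset, size) pair per read, in pages."""
    return [(offset, size) for offset in range(0, last_offset + 1, stride)]


pattern = read_pattern()

# Each read covers 110 pages but the next one starts only 100 pages later,
# so consecutive reads share the last 10 pages of the earlier read.
first, second = pattern[0], pattern[1]
overlap_pages = first[0] + first[1] - second[0]

print("reads:", len(pattern))
print("pages read twice per 100-page stride:", overlap_pages)
```

      Ten of every hundred pages are touched a second time, and it is that
      second access that makes them candidates for activation.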
      
      I carried out tests on a freshly booted console as well as an X desktop, and
      fetched the vmstat numbers at
      
      (1) begin:     shortly after the big read IO starts;
      (2) end:       just before the big read IO stops;
      (3) restore:   the big read IO stops and the zsh working set restored
      (4) restore X: after IO, switch back and forth between the urxvt and firefox
                     windows to restore their working set.
      
      2.3) console mode results
      
              nr_mapped   nr_active_file nr_inactive_file       pgmajfault     pgdeactivate           pgfree
      
      2.6.29 VM_EXEC protection ON:
      begin:       2481             2237             8694              630                0           574299
      end:          275           231976           233914              633           776271         20933042
      restore:      370           232154           234524              691           777183         20958453
      
      2.6.29 VM_EXEC protection ON (second run):
      begin:       2434             2237             8493              629                0           574195
      end:          284           231970           233536              632           771918         20896129
      restore:      399           232218           234789              690           774526         20957909
      
      2.6.30-rc4-mm VM_EXEC protection OFF:
      begin:       2479             2344             9659              210                0           579643
      end:          284           232010           234142              260           772776         20917184
      restore:      379           232159           234371              301           774888         20967849
      
      The above console numbers show that
      
      - The startup pgmajfault of 2.6.30-rc4-mm is merely 1/3 that of 2.6.29.
        I'd attribute that improvement to the mmap readahead improvements :-)
      
      - The pgmajfault increment during the file copy is 633-630=3 vs 260-210=50.
        That's a huge improvement - which means with the VM_EXEC protection logic,
        active mmap pages are pretty safe even under partially cache hot streaming IO.
      
      - when the active:inactive file lru size reaches 1:1, their scan rates are 1:20.8
        under 10% cache hot IO. (computed with formula Dpgdeactivate:Dpgfree)
        That roughly means the active mmap pages get 20.8 times more chances to get
        re-referenced to stay in memory.
      
      - The absolute nr_mapped drops considerably to 1/9 during the big IO, and the
        dropped pages are mostly inactive ones. The patch has almost no impact in
        this aspect, that means it won't unnecessarily increase memory pressure.
        (In contrast, your 20% mmap protection ratio will keep them all, and
        therefore eliminate the extra 41 major faults to restore working set
        of zsh etc.)
      
      The iotrace.rb read throughput is
      	151.194384MB/s 284.198252s 100001x 450560b --load pattern-hot-10 --play /b/sparse
      which means the inactive list is rotated at the speed of 250MB/s,
      so a full scan of which takes about 3.5 seconds, while a full scan
      of active file list takes about 77 seconds.
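
      The throughput and scan-time figures above can be sanity-checked with
      some arithmetic.  This Python sketch assumes 4 KiB pages (450560 bytes
      per read divided by 110 pages), reads "MB" as 2^20 bytes, and takes the
      inactive list size from the "end" row of the first console run; the
      250MB/s rotation speed is used as stated, not derived.

```python
PAGE_SIZE = 450560 // 110      # bytes per page implied by the iotrace.rb line
MIB = 1024 * 1024

reads = 100001
bytes_per_read = 450560
elapsed_s = 284.198252

# Read throughput reported by iotrace.rb, recomputed from its own fields.
throughput_mib_s = reads * bytes_per_read / elapsed_s / MIB


def full_scan_seconds(list_pages, rate_mib_s):
    """Time to rotate through a list of the given size at the given rate."""
    return list_pages * PAGE_SIZE / MIB / rate_mib_s


inactive_scan_s = full_scan_seconds(233914, 250)

print(f"throughput: {throughput_mib_s:.2f} MiB/s")
print(f"inactive list full scan: {inactive_scan_s:.1f} s")
```

      The recomputed throughput matches the reported 151.19MB/s, and the
      inactive-list scan time lands in the same range as the "about 3.5
      seconds" quoted above.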
      
      2.4) X mode results
      
      We can reach roughly the same conclusions for X desktop:
      
              nr_mapped   nr_active_file nr_inactive_file       pgmajfault     pgdeactivate           pgfree
      
      2.6.30-rc4-mm VM_EXEC protection ON:
      begin:       9740             8920            64075              561                0           678360
      end:          768           218254           220029              565           798953         21057006
      restore:      857           218543           220987              606           799462         21075710
      restore X:   2414           218560           225344              797           799462         21080795
      
      2.6.30-rc4-mm VM_EXEC protection OFF:
      begin:       9368             5035            26389              554                0           633391
      end:          770           218449           221230              661           646472         17832500
      restore:     1113           218466           220978              710           649881         17905235
      restore X:   2687           218650           225484              947           802700         21083584
      
      - the absolute nr_mapped drops considerably (to 1/13 of the original size)
        during the streaming IO.
      - the delta of pgmajfault is 4 vs 107 during IO (565-561 vs 661-554), or
        236 vs 393 during the whole process.
      
      Cc: Elladan <elladan@eskimo.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Acked-by: Rik van Riel <riel@redhat.com>
      Acked-by: Peter Zijlstra <peterz@infradead.org>
      Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Reviewed-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    •
      vmscan: report vm_flags in page_referenced() · 6fe6b7e3
      Wu Fengguang committed
      Collect vma->vm_flags of the VMAs that actually referenced the page.
      
      This is preparing for more informed reclaim heuristics, e.g. to protect
      executable file pages more aggressively.  For now only the VM_EXEC bit
      will be used by the caller.
      
      Thanks to Johannes, Peter and Minchan for all the good tips.
      Acked-by: Peter Zijlstra <peterz@infradead.org>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
      Reviewed-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6fe6b7e3
    • M
      mm: add a gfp-translate script to help understand page allocation failure reports · 608e8e66
      Mel Gorman committed
      The page allocation failure messages include a line that looks like
      
      page allocation failure. order:1, mode:0x4020
      
      The mode is easy to translate but irritating for the lazy and a bit error
      prone.  This patch adds a very simple helper script gfp-translate for the
      mode: portion of the page allocation failure messages.  An example usage
      looks like
      
        mel@machina:~/linux-2.6 $ scripts/gfp-translate 0x4020
        Source: /home/mel/linux-2.6
        Parsing: 0x4020
        #define __GFP_HIGH	(0x20)	/* Should access emergency pools? */
        #define __GFP_COMP	(0x4000) /* Add compound page metadata */
      
      The script is not a work of art but it has come in handy for me a few
      times so I thought I would share.
      
      [akpm@linux-foundation.org: clarify an error message]
      Signed-off-by: Mel Gorman <mel@csn.ul.ie>
      Acked-by: Rik van Riel <riel@redhat.com>
      Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      608e8e66
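The translation that gfp-translate performs can be sketched in a few lines of C. This is only an illustration, not the script itself: the table holds just the two flags that appear in the example output above, and the names `gfp_flag_names` and `gfp_translate_sketch` are hypothetical.

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical table: only the two flags shown in the example output. */
struct gfp_flag_name {
	unsigned int bit;
	const char *name;
};

static const struct gfp_flag_name gfp_flag_names[] = {
	{ 0x20,   "__GFP_HIGH" },
	{ 0x4000, "__GFP_COMP" },
};

/*
 * Print the name of every known flag set in mode and return the bits
 * that were set but are not present in the table.
 */
static unsigned int gfp_translate_sketch(unsigned int mode)
{
	unsigned int unknown = mode;
	size_t i;

	for (i = 0; i < sizeof(gfp_flag_names) / sizeof(gfp_flag_names[0]); i++) {
		if (mode & gfp_flag_names[i].bit) {
			printf("%s (0x%x)\n", gfp_flag_names[i].name,
			       gfp_flag_names[i].bit);
			unknown &= ~gfp_flag_names[i].bit;
		}
	}
	return unknown;
}
```

Calling `gfp_translate_sketch(0x4020)` names both flags and returns 0, meaning no bits were left unexplained.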
    • S
      mm cleanup: shmem_file_setup: 'char *' -> 'const char *' for name argument · 168f5ac6
      Sergei Trofimovich committed
      Since shmem_file_setup() does not modify, allocate, free, or pass on the
      given filename, mark it as const.
      Signed-off-by: Sergei Trofimovich <slyfox@inbox.ru>
      Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      168f5ac6
    • M
      mm: remove file argument from swap_readpage() · aca8bf32
      Minchan Kim committed
      The file argument resulted from address_space's readpage a long time ago.
      
      We don't use it any more.  Let's remove the unnecessary argument.
      Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
      Acked-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      aca8bf32
    • M
      mm: remove annotation of gfp_mask in add_to_swap · 8192da6a
      Minchan Kim committed
      Hugh removed add_to_swap's gfp_mask argument (in "mm: remove gfp_mask from
      add_to_swap"), so we have to remove the function's gfp_mask annotation too.
      Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
      Acked-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8192da6a
    • Y
      page-allocator: clear N_HIGH_MEMORY map before we set it again · 73d60b7f
      Yinghai Lu committed
      SRAT tables may contain nodes of very small size.  The arch code may
      decide not to activate such a node.  However, currently the early boot
      code sets N_HIGH_MEMORY for such nodes.  These nodes therefore seem to be
      active although they have no present pages.
      
      For 64-bit, N_HIGH_MEMORY == N_NORMAL_MEMORY, so that works for 64-bit too.
      Signed-off-by: Yinghai Lu <Yinghai@kernel.org>
      Tested-by: Jack Steiner <steiner@sgi.com>
      Acked-by: Christoph Lameter <cl@linux-foundation.org>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      73d60b7f
    • M
      mm: remove __invalidate_mapping_pages variant · 28697355
      Mike Waychison committed
      Remove the atomic __invalidate_mapping_pages variant now that its sole
      caller can sleep (fixed in eccb95ce ("vfs: fix
      lock inversion in drop_pagecache_sb()")).
      
      This fixes softlockups that can occur while in the drop_caches path.
      Signed-off-by: Mike Waychison <mikew@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Acked-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      28697355
    • D
      oom: invoke oom killer for __GFP_NOFAIL · 82553a93
      David Rientjes committed
      The oom killer must be invoked regardless of the order if the allocation
      is __GFP_NOFAIL; otherwise the allocator will loop forever when reclaim
      fails to free some memory.
      
      Cc: Nick Piggin <npiggin@suse.de>
      Acked-by: Rik van Riel <riel@redhat.com>
      Acked-by: Mel Gorman <mel@csn.ul.ie>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Dave Hansen <dave@linux.vnet.ibm.com>
      Signed-off-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      82553a93
    • D
      oom: avoid unnecessary mm locking and scanning for OOM_DISABLE · 4d8b9135
      David Rientjes committed
      This moves the check for OOM_DISABLE to the badness heuristic so it is
      only necessary to hold task_lock() once.  If the mm is OOM_DISABLE, the
      score is 0, which is also correctly exported via /proc/pid/oom_score.
      This requires that tasks with badness scores of 0 are prohibited from
      being oom killed, which makes sense since they would not allow for future
      memory freeing anyway.
      
      Since the oom_adj value is a characteristic of an mm and not a task, it is
      no longer necessary to check the oom_adj value for threads sharing the
      same memory (except when simply issuing SIGKILLs for threads in other
      thread groups).
      
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4d8b9135
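The rule described in the commit above (an OOM_DISABLE mm scores 0, and a score of 0 is never selected) can be sketched roughly as follows. The structures, the function names, and the use of total VM size as the whole score are simplifying assumptions made for illustration; the kernel's badness heuristic weighs more factors.

```c
#include <stddef.h>

#define OOM_DISABLE_SKETCH (-17)	/* assumed sentinel value for this sketch */

/* Hypothetical, stripped-down stand-ins for mm_struct and task_struct. */
struct mm_sketch {
	int oom_adj;			/* lives in the mm, shared by its threads */
	unsigned long total_vm;
};

struct task_sketch {
	struct mm_sketch *mm;		/* NULL for kernel threads */
};

/* A task whose mm is OOM_DISABLE (or that has no mm) scores 0. */
static unsigned long badness_sketch(const struct task_sketch *t)
{
	if (!t->mm || t->mm->oom_adj == OOM_DISABLE_SKETCH)
		return 0;
	return t->mm->total_vm;
}

/* Pick the highest-scoring task; tasks scoring 0 are never chosen. */
static const struct task_sketch *
select_victim_sketch(const struct task_sketch *tasks, size_t n)
{
	const struct task_sketch *chosen = NULL;
	unsigned long best = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		unsigned long points = badness_sketch(&tasks[i]);

		if (points > best) {
			best = points;
			chosen = &tasks[i];
		}
	}
	return chosen;
}
```

Because the check lives in the scoring function, two threads sharing one OOM_DISABLE mm are both skipped without inspecting each thread separately.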
    • D
      oom: move oom_adj value from task_struct to mm_struct · 2ff05b2b
      David Rientjes committed
      The per-task oom_adj value is a characteristic of its mm more than the
      task itself since it's not possible to oom kill any thread that shares the
      mm.  If a task were to be killed while attached to an mm that could not be
      freed because another thread were set to OOM_DISABLE, it would have
      needlessly been terminated since there is no potential for future memory
      freeing.
      
      This patch moves oomkilladj (now more appropriately named oom_adj) from
      struct task_struct to struct mm_struct.  This requires task_lock() on a
      task to check its oom_adj value to protect against exec, but it's already
      necessary to take the lock when dereferencing the mm to find the total VM
      size for the badness heuristic.
      
      This fixes a livelock if the oom killer chooses a task and another thread
      sharing the same memory has an oom_adj value of OOM_DISABLE.  This occurs
      because oom_kill_task() repeatedly returns 1 and refuses to kill the
      chosen task while select_bad_process() will repeatedly choose the same
      task during the next retry.
      
      Taking task_lock() in select_bad_process() to check for OOM_DISABLE and in
      oom_kill_task() to check for threads sharing the same memory will be
      removed in the next patch in this series where it will no longer be
      necessary.
      
      Writing to /proc/pid/oom_adj for a kthread will now return -EINVAL since
      these threads are immune from oom killing already.  They simply report an
      oom_adj value of OOM_DISABLE.
      
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2ff05b2b
    • K
      mm: reuse unused swap entry if necessary · c9e44410
      KAMEZAWA Hiroyuki committed
      Presently we can tell from swap_map alone, without looking up the swap
      cache, that a swap entry is used only as SwapCache.
      
      This gives us a chance to reuse swap-cache-only swap entries in
      get_swap_pages().
      
      This patch tries to free swap-cache-only swap entries if there is not
      enough swap.
      
      Note: We hit the following path when the swap_cluster code cannot find a
      free cluster.  So vm_swap_full() is not the only condition that allows the
      kernel to reclaim unused swap.
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: Balbir Singh <balbir@in.ibm.com>
      Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Dhaval Giani <dhaval@linux.vnet.ibm.com>
      Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp>
      Tested-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c9e44410
    • K
      mm: modify swap_map and add SWAP_HAS_CACHE flag · 355cfa73
      KAMEZAWA Hiroyuki committed
      This is a part of the patches for fixing memcg's swap accounting leak.
      But, IMHO, it is not a bad patch even without memcg.
      
      There are 2 kinds of references to swap.
       - reference from swap entry
       - reference from swap cache
      
      Then,
      
       - If there is swap cache && swap's refcnt is 1, there is only swap cache.
        (*) swapcount(entry) == 1 && find_get_page(swapper_space, entry) != NULL
      
      This counting logic has worked well for a long time.  But considering
      that we cannot tell from swap_map[] whether there is a _real_ reference
      or not, the current usage of the counter is not very good.
      
      This patch adds a flag SWAP_HAS_CACHE and records whether a swap entry
      has a cache or not.  This will remove the -1 magic used in swapfile.c
      and help to avoid unnecessary find_get_page() calls.
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Tested-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Dhaval Giani <dhaval@linux.vnet.ibm.com>
      Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      355cfa73
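One way to picture the SWAP_HAS_CACHE idea is a swap_map entry that keeps the reference count in its low bits and the "has swap cache" flag in the top bit. The bit layout and every `_SKETCH`/`_sketch` name below are assumptions made for illustration, not necessarily what the patch defines.

```c
/* Assumed layout of one 16-bit swap_map entry for this sketch. */
#define SWAP_HAS_CACHE_SKETCH	0x8000u	/* entry is referenced by swap cache */
#define SWAP_COUNT_MASK_SKETCH	0x7fffu	/* remaining bits: real reference count */

/* Number of real (non-cache) references to the swap entry. */
static inline unsigned int swap_count_sketch(unsigned short ent)
{
	return ent & SWAP_COUNT_MASK_SKETCH;
}

/* Nonzero if the entry has a swap cache page. */
static inline int swap_has_cache_sketch(unsigned short ent)
{
	return (ent & SWAP_HAS_CACHE_SKETCH) != 0;
}

/*
 * Nonzero if only the swap cache still references the entry, which can
 * now be decided from swap_map alone, without find_get_page().
 */
static inline int swap_cache_only_sketch(unsigned short ent)
{
	return ent == SWAP_HAS_CACHE_SKETCH;
}
```

Under this layout, a "swap-cache-only" entry is exactly the kind that a later patch in the series can consider reclaiming when swap runs short.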
    • K
      mm: add swap cache interface for swap reference · cb4b86ba
      KAMEZAWA Hiroyuki committed
      In a following patch, the usage of swap cache is recorded into swap_map.
      This patch is for necessary interface changes to do that.
      
      2 interfaces:
      
        - swapcache_prepare()
        - swapcache_free()
      
      are added for allocating/freeing the swap-cache refcnt on existing swap
      entries.  The implementation itself is not changed by this patch.  In
      adding swapcache_free(), memcg's hook code is moved under
      swapcache_free().  This is better than using scattered hooks.
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reviewed-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Acked-by: Balbir Singh <balbir@in.ibm.com>
      Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Dhaval Giani <dhaval@linux.vnet.ibm.com>
      Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      cb4b86ba
    • K
      mm: remove CONFIG_UNEVICTABLE_LRU config option · 68377659
      KOSAKI Motohiro committed
      Currently, nobody wants to turn UNEVICTABLE_LRU off.  Thus this
      configurability is unnecessary.
      Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Andi Kleen <andi@firstfloor.org>
      Acked-by: Minchan Kim <minchan.kim@gmail.com>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Cc: Matt Mackall <mpm@selenic.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      68377659
    • M
      page-allocator: reset wmark_min and inactive ratio of zone when hotplug happens · bce7394a
      Minchan Kim committed
      Solve two problems.
      
      Whenever memory hotplug successfully happens, zone->present_pages
      has to be changed.
      
      1) Now memory hotplug calls setup_per_zone_wmark_min only when
         online_pages is called, not offline_pages.
      
         It breaks balance.
      
      2) If zone->present_pages is changed, we also have to change
         zone->inactive_ratio.  That's because inactive_ratio depends on
         zone->present_pages.
      Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
      Acked-by: Yasunori Goto <y-goto@jp.fujitsu.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bce7394a
    • M
      page-allocator: add inactive ratio calculation function of each zone · 96cb4df5
      Minchan Kim committed
      Factor the per-zone arithmetic inside setup_per_zone_inactive_ratio()'s
      loop into a separate function, calculate_zone_inactive_ratio().  This
      function will be used in a later patch.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
      Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      96cb4df5
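The commit message does not spell out the per-zone arithmetic that was factored out. As a rough sketch, assuming the ratio is the integer square root of ten times the zone size in gigabytes, with a floor of 1 for zones smaller than 1GB, it could look like this. The function names and the naive square root are illustrative only.

```c
/* Naive integer square root, adequate for the small inputs used here. */
static unsigned long int_sqrt_sketch(unsigned long x)
{
	unsigned long r = 0;

	while ((r + 1) * (r + 1) <= x)
		r++;
	return r;
}

/*
 * Assumed formula: ratio = int_sqrt(10 * zone size in GB), or 1 when the
 * zone is smaller than 1GB, so the square root is only computed if needed.
 */
static unsigned int zone_inactive_ratio_sketch(unsigned long present_pages,
					       unsigned int page_shift)
{
	unsigned long gb = present_pages >> (30 - page_shift);

	return gb ? (unsigned int)int_sqrt_sketch(10 * gb) : 1;
}
```

With 4KB pages (`page_shift` of 12), a 1GB zone would get a ratio of 3 and a 10GB zone a ratio of 10 under this assumption.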
    • M
      page-allocator: clean up functions related to pages_min · bc75d33f
      Minchan Kim committed
      Change the names of two functions. It doesn't affect behavior.
      
      Presently, setup_per_zone_pages_min() changes low, high of zone as well as
      min.  So a better name is setup_per_zone_wmarks().  That's because Mel
      changed zone->pages_[high/low/min] to zone->watermark array in "page
      allocator: replace the watermark-related union in struct zone with a
      watermark[] array".
      
       * setup_per_zone_pages_min => setup_per_zone_wmarks
      
      Of course, we have to change init_per_zone_pages_min, too.  There is no
      pages_min any more.
      
       * init_per_zone_pages_min => init_per_zone_wmark_min
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
      Acked-by: Mel Gorman <mel@csn.ul.ie>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bc75d33f
    • C
      page-allocator: use integer fields lookup for gfp_zone and check for errors in... · b70d94ee
      Christoph Lameter committed
      page-allocator: use integer fields lookup for gfp_zone and check for errors in flags passed to the page allocator
      
      This simplifies the code in gfp_zone() and also keeps the ability of the
      compiler to use constant folding to get rid of gfp_zone processing.
      
      The lookup of the zone is done using a bitfield stored in an integer.  So
      the code in gfp_zone is a simple extraction of bits from a constant
      bitfield.  The compiler is generating a load of a constant into a register
      and then performs a shift and mask operation to get the zone from a gfp_t.
      No cachelines are touched and no branches have to be predicted by the
      compiler.
      
      We are doing some macro tricks here to convince the compiler to always do
      the constant folding if possible.
      Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reviewed-by: Mel Gorman <mel@csn.ul.ie>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b70d94ee
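The "bitfield stored in an integer" lookup described above can be sketched as a constant table with one small slot per combination of zone flags, read with a shift and a mask. The zone numbering, flag bits, and all `_S`/`_sketch` names are hypothetical, and the bad-combination bitmap is only one possible reading of "check for errors in flags".

```c
/* Hypothetical zone numbering and flag bits, chosen only for this sketch. */
enum zone_sketch { ZONE_DMA_S = 0, ZONE_NORMAL_S = 1, ZONE_HIGHMEM_S = 2 };

#define GFP_DMA_S	0x1u
#define GFP_HIGHMEM_S	0x2u
#define GFP_ZONEMASK_S	(GFP_DMA_S | GFP_HIGHMEM_S)
#define ZONES_SHIFT_S	2	/* bits needed to store one zone number */

/* One ZONES_SHIFT_S-wide slot per combination of zone flag bits. */
#define GFP_ZONE_TABLE_S ( \
	((unsigned int)ZONE_NORMAL_S  << (0 * ZONES_SHIFT_S)) | \
	((unsigned int)ZONE_DMA_S     << (GFP_DMA_S * ZONES_SHIFT_S)) | \
	((unsigned int)ZONE_HIGHMEM_S << (GFP_HIGHMEM_S * ZONES_SHIFT_S)))

/* Bitmap of combinations that make no sense (DMA and HIGHMEM together). */
#define GFP_ZONE_BAD_S	(1u << (GFP_DMA_S | GFP_HIGHMEM_S))

/* Return the zone for the given flags, or -1 for an invalid combination. */
static inline int gfp_zone_sketch(unsigned int flags)
{
	unsigned int bits = flags & GFP_ZONEMASK_S;

	if ((GFP_ZONE_BAD_S >> bits) & 1)
		return -1;
	return (int)((GFP_ZONE_TABLE_S >> (bits * ZONES_SHIFT_S)) &
		     ((1u << ZONES_SHIFT_S) - 1));
}
```

Since the table is a compile-time constant, a call with constant flags can be folded away entirely, which is the property the commit message emphasizes.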
    • M
      mm: check the argument of kunmap on architectures without highmem · 31c91132
      Matthew Wilcox committed
      If you're using a non-highmem architecture, passing an argument with the
      wrong type to kunmap() doesn't give you a warning because the ifdef
      doesn't check the type.
      
      Using a static inline function solves the problem nicely.
      Reported-by: David Woodhouse <dwmw2@infradead.org>
      Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      31c91132
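The difference between the two approaches can be illustrated with a small sketch. The macro form below is an assumption about what a no-op definition might look like, and `struct page_sketch` and both names are hypothetical stand-ins.

```c
/* Hypothetical stand-in for struct page. */
struct page_sketch {
	unsigned long flags;
};

/* Macro form: the argument is discarded, so any type compiles silently. */
#define kunmap_macro_sketch(page) do { (void)(page); } while (0)

/*
 * Static inline form: still a no-op, but the compiler now checks that the
 * caller really passes a pointer to struct page_sketch.
 */
static inline void kunmap_inline_sketch(struct page_sketch *page)
{
	(void)page;
}
```

With the macro, `kunmap_macro_sketch("not a page")` builds without complaint; the same argument passed to `kunmap_inline_sketch()` would draw an incompatible-pointer diagnostic.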
    • M
      vmscan: prevent shrinking of active anon lru list in case of no swap space V3 · 69c85481
      MinChan Kim committed
      shrink_zone() can deactivate active anon pages even if we don't have a
      swap device.  Many embedded products don't have a swap device.  So the
      deactivation of anon pages is unnecessary.
      
      This patch prevents unnecessary deactivation of anon lru pages.  But it
      doesn't prevent aging of anon pages to swap out.
      Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
      Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      69c85481
    • B
      migration: only migrate_prep() once per move_pages() · 35282a2d
      Brice Goglin committed
      migrate_prep() is fairly expensive (72us on 16-core barcelona 1.9GHz).
      Commit 3140a227 improved move_pages()
      throughput by breaking it into chunks, but it also made migrate_prep() be
      called once per chunk (every 128pages or so) instead of once per
      move_pages().
      
      This patch reverts to calling migrate_prep() only once per move_pages()
      call, as we did before 2.6.29.  It is also a followup to commit
      0aedadf9 ("mm: move migrate_prep out from
      under mmap_sem").
      
      This improves migration throughput on the above machine from 600MB/s to
      750MB/s.
      Signed-off-by: Brice Goglin <Brice.Goglin@inria.fr>
      Acked-by: Christoph Lameter <cl@linux-foundation.org>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      35282a2d
    • R
      mm, PM/Freezer: Disable OOM killer when tasks are frozen · 7f33d49a
      Rafael J. Wysocki committed
      Currently, the following scenario appears to be possible in theory:
      
      * Tasks are frozen for hibernation or suspend.
      * Free pages are almost exhausted.
      * A certain piece of code in the suspend code path attempts to allocate
        some memory using GFP_KERNEL and an allocation order less than or
        equal to PAGE_ALLOC_COSTLY_ORDER.
      * __alloc_pages_internal() cannot find a free page so it invokes the
        OOM killer.
      * The OOM killer attempts to kill a task, but the task is frozen, so
        it doesn't die immediately.
      * __alloc_pages_internal() jumps to 'restart', unsuccessfully tries
        to find a free page and invokes the OOM killer.
      * No progress can be made.
      
      Although it is now hard to trigger during hibernation due to the memory
      shrinking carried out by the hibernation code, it is theoretically
      possible to trigger during suspend after the memory shrinking has been
      removed from that code path.  Moreover, since memory allocations are
      going to be used for the hibernation memory shrinking, it will be even
      more likely to happen during hibernation.
      
      To prevent it from happening, introduce the oom_killer_disabled switch
      that will cause __alloc_pages_internal() to fail in the situations in
      which the OOM killer would have been called and make the freezer set
      this switch after tasks have been successfully frozen.
      
      [akpm@linux-foundation.org: be nicer to the namespace]
      Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
      Cc: Fengguang Wu <fengguang.wu@gmail.com>
      Cc: David Rientjes <rientjes@google.com>
      Acked-by: Pavel Machek <pavel@ucw.cz>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7f33d49a
    • N
      mm: madvise(): correct return code · 75927af8
      Nick Piggin committed
      The posix_madvise() function succeeds (and does nothing) when called with
      parameters (NULL, 0, -1); according to LSB tests, it should fail with
      EINVAL because -1 is not a valid flag.
      
      When called with a valid address and size, it correctly fails.
      
      So perform an initial check for valid flags first.
      Reported-by: Jiri Dluhos <jdluhos@novell.com>
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Reviewed-and-Tested-by: WANG Cong <xiyou.wangcong@gmail.com>
      Cc: Michael Kerrisk <mtk.manpages@googlemail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      75927af8
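The ordering fix in the commit above, validating the advice value before any early-success path, can be sketched like this. The advice enum, the helper names, and the zero-length shortcut are simplified assumptions rather than the kernel's actual code.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical advice values for this sketch. */
enum advice_sketch {
	ADV_NORMAL_S,
	ADV_RANDOM_S,
	ADV_SEQUENTIAL_S,
	ADV_WILLNEED_S,
	ADV_DONTNEED_S,
};

/* Return nonzero only for advice values the sketch knows about. */
static int advice_valid_sketch(int behavior)
{
	switch (behavior) {
	case ADV_NORMAL_S:
	case ADV_RANDOM_S:
	case ADV_SEQUENTIAL_S:
	case ADV_WILLNEED_S:
	case ADV_DONTNEED_S:
		return 1;
	default:
		return 0;
	}
}

static int madvise_sketch(void *start, size_t len, int behavior)
{
	/* Check the flag first, so (NULL, 0, -1) fails with -EINVAL. */
	if (!advice_valid_sketch(behavior))
		return -EINVAL;

	/* A zero-length range is a successful no-op. */
	if (len == 0)
		return 0;

	(void)start;	/* a real implementation would walk the range here */
	return 0;
}
```

If the zero-length shortcut came first, `madvise_sketch(NULL, 0, -1)` would return 0, which is the behavior the LSB test flagged.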
    • A
      page-allocator: warn if __GFP_NOFAIL is used for a large allocation · dab48dab
      Andrew Morton committed
      __GFP_NOFAIL is a bad fiction.  Allocations _can_ fail, and callers should
      detect and suitably handle this (and not by lamely moving the infinite
      loop up to the caller level either).
      
      Attempting to use __GFP_NOFAIL for a higher-order allocation is even
      worse, so add a once-off runtime check for this to slap people around for
      even thinking about trying it.
      
      Cc: David Rientjes <rientjes@google.com>
      Acked-by: Mel Gorman <mel@csn.ul.ie>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      dab48dab
    • M
      videobuf-dma-contig: zero copy USERPTR support · 720b17e7
      Magnus Damm committed
      Since videobuf-dma-contig is designed to handle physically contiguous
      memory, this patch modifies the videobuf-dma-contig code to only accept a
      user space pointer to physically contiguous memory.  For now only
      VM_PFNMAP vmas are supported, so forget hotplug.
      
      On SuperH Mobile we use this with our sh_mobile_ceu_camera driver together
      with various multimedia accelerator blocks that are exported to user space
      using UIO.  The UIO kernel code exports physically contiguous memory to
      user space and lets the user space application mmap() this memory and pass
      a pointer using the USERPTR interface for V4L2 zero copy operation.
      
      With this approach we support zero copy capture, hardware scaling and
      various forms of hardware encoding and decoding.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: Magnus Damm <damm@igel.co.jp>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Paul Mundt <lethal@linux-sh.org>
      Acked-by: Mauro Carvalho Chehab <mchehab@infradead.org>
      Cc: Hans Verkuil <hverkuil@xs4all.nl>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      720b17e7
    • J
      mm: introduce follow_pfn() · 3b6748e2
      Johannes Weiner committed
      Analogous to follow_phys(), add a helper that looks up the PFN at a
      user virtual address in an IO mapping or a raw PFN mapping.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Christoph Hellwig <hch@infradead.org>
      Acked-by: Magnus Damm <magnus.damm@gmail.com>
      Cc: Hans Verkuil <hverkuil@xs4all.nl>
      Cc: Paul Mundt <lethal@linux-sh.org>
      Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3b6748e2
    • J
      mm: use generic follow_pte() in follow_phys() · 03668a4d
      Johannes Weiner committed
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Christoph Hellwig <hch@infradead.org>
      Acked-by: Magnus Damm <magnus.damm@gmail.com>
      Cc: Hans Verkuil <hverkuil@xs4all.nl>
      Cc: Paul Mundt <lethal@linux-sh.org>
      Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      03668a4d
    • J
      mm: introduce follow_pte() · f8ad0f49
      Johannes Weiner committed
      A generic read-only page table lookup helper that maps an address space
      and an address in it to a pte.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Christoph Hellwig <hch@infradead.org>
      Acked-by: Magnus Damm <magnus.damm@gmail.com>
      Cc: Hans Verkuil <hverkuil@xs4all.nl>
      Cc: Paul Mundt <lethal@linux-sh.org>
      Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f8ad0f49
    • C
      mm: setup_per_zone_inactive_ratio - fix comment and make it __init · e9bb35df
      Cyrill Gorcunov committed
      The caller of setup_per_zone_inactive_ratio is an __init function, so
      there is no need to keep the callee after it has completed either.  Also
      fix a comment.
      Acked-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
      Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e9bb35df
    • C
      mm: setup_per_zone_inactive_ratio - do not call for int_sqrt if not needed · 5c87eada
      Cyrill Gorcunov committed
      int_sqrt() returns 0 if its argument is zero, so call it only if needed.
      Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5c87eada
    • W
      vmscan: ZVC updates in shrink_active_list() can be done once · af166777
      Wu Fengguang committed
      This effectively lifts the unit of updates to nr_inactive_* and
      pgdeactivate from PAGEVEC_SIZE=14 to SWAP_CLUSTER_MAX=32, or
      MAX_ORDER_NR_PAGES=1024 for zone_reclaim().
      
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Rik van Riel <riel@redhat.com>
      Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      af166777
    • W
      vmscan: don't export nr_saved_scan in /proc/zoneinfo · 08d9ae7c
      Wu Fengguang committed
      The lru->nr_saved_scan's are not meaningful counters even for kernel
      developers.  They typically are smaller than 32 and are always 0 for large
      lists.  So remove them from /proc/zoneinfo.
      
      Hopefully this interface change won't break too many scripts.
      /proc/zoneinfo is too unstructured to be script friendly, and I wonder
      whether the affected scripts - if there are any - are still bleeding since
      the not-so-long-ago commit "vmscan: split LRU lists into anon & file sets",
      which also touched the "scanned" line :)
      
      If we are to re-export accumulated vmscan counts in the future, they can
      go to new lines in /proc/zoneinfo instead of the current form, or to
      /sys/devices/system/node/node0/meminfo?
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Acked-by: Christoph Lameter <cl@linux-foundation.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      08d9ae7c
    • W
      vmscan: cleanup the scan batching code · 6e08a369
      Wu Fengguang committed
      The vmscan batching logic is convoluted.  Move it into a standalone
      function nr_scan_try_batch() and document it.  No behavior change.
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6e08a369
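The commit message names nr_scan_try_batch() but does not show it. A plausible sketch of such a batching helper, assuming it accumulates small scan requests until they reach a batch size such as SWAP_CLUSTER_MAX, might look like this; the signature is an assumption and the `_sketch` suffix marks it as illustrative.

```c
/*
 * Accumulate nr_to_scan into *nr_saved_scan.  Once the saved total
 * reaches batch_size, hand the whole total back to the caller and reset
 * the counter; otherwise return 0 and keep saving.
 */
static unsigned long nr_scan_try_batch_sketch(unsigned long nr_to_scan,
					      unsigned long *nr_saved_scan,
					      unsigned long batch_size)
{
	unsigned long nr;

	*nr_saved_scan += nr_to_scan;
	nr = *nr_saved_scan;

	if (nr >= batch_size)
		*nr_saved_scan = 0;
	else
		nr = 0;

	return nr;
}
```

With a batch size of 32, requests of 10, 10 and 15 pages would return 0, 0 and then 35, so the list is scanned once in a larger batch rather than three times in small ones.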