1. 04 Jul, 2013 1 commit
    • M
      mm: vmscan: limit the number of pages kswapd reclaims at each priority · 75485363
      Mel Gorman committed
      This series does not fix all the current known problems with reclaim but
      it addresses one important swapping bug when there is background IO.
      
      Changelog since V3
       - Drop the slab shrink changes in light of Glauber's series and
         discussions that highlighted a number of potential problems
         with the patch.						(mel)
       - Rebased to 3.10-rc1
      
      Changelog since V2
       - Preserve ratio properly for proportional scanning		(kamezawa)
      
      Changelog since V1
       - Rename ZONE_DIRTY to ZONE_TAIL_LRU_DIRTY			(andi)
       - Reformat comment in shrink_page_list				(andi)
       - Clarify some comments					(dhillf)
       - Rework how the proportional scanning is preserved
       - Add PageReclaim check before kswapd starts writeback
       - Reset sc.nr_reclaimed on every full zone scan
      
      Kswapd and page reclaim behaviour has been screwy in one way or the
      other for a long time.  Very broadly speaking it worked in the far past
      because machines were limited in memory so it did not have that many
      pages to scan and it stalled in congestion_wait() frequently to prevent it
      going completely nuts.  In recent times it has behaved very
      unsatisfactorily with some of the problems compounded by the removal of
      stall logic and the introduction of transparent hugepage support with
      high-order reclaims.
      
      There are many variations of bugs that are rooted in this area.  One
      example is reports of large copy operations or backups causing the
      machine to grind to a halt or applications being pushed to swap.  Sometimes in
      low memory situations a large percentage of memory suddenly gets
      reclaimed.  In other cases an application starts and kswapd hits 100%
      CPU usage for prolonged periods of time and so on.  There is now talk of
      introducing features like an extra free kbytes tunable to work around
      aspects of the problem instead of trying to deal with it.  It's
      compounded by the problem that it can be very workload and machine
      specific.
      
      This series aims at addressing some of the worst of these problems
      without attempting to fundamentally alter how page reclaim works.
      
      Patches 1-2 limit the number of pages kswapd reclaims while still obeying
      	the anon/file proportion of the LRUs it should be scanning.
      
      Patches 3-4 control how and when kswapd raises its scanning priority and
      	deletes the scanning restart logic which is tricky to follow.
      
      Patch 5 notes that it is too easy for kswapd to reach priority 0 when
      	scanning and then reclaim the world. Down with that sort of thing.
      
      Patch 6 notes that kswapd starts writeback based on scanning priority which
      	is not necessarily related to dirty pages. It will have kswapd
      	writeback pages if a number of unqueued dirty pages have been
      	recently encountered at the tail of the LRU.
      
      Patch 7 notes that sometimes kswapd should stall waiting on IO to complete
      	to reduce LRU churn and the likelihood that it'll reclaim young
      	clean pages or push applications to swap. It will cause kswapd
      	to block on IO if it detects that pages being reclaimed under
      	writeback are recycling through the LRU before the IO completes.
      
      Patches 8-9 are cosmetic but balance_pgdat() is easier to follow after they
      	are applied.
      
      This was tested using memcached+memcachetest while some background IO
      was in progress as implemented by the parallel IO tests in MM
      Tests.
      
      memcachetest benchmarks how many operations/second memcached can service
      and it is run multiple times.  It starts with no background IO and then
      re-runs the test with larger amounts of IO in the background to roughly
      simulate a large copy in progress.  The expectation is that the IO
      should have little or no impact on memcachetest which is running
      entirely in memory.
      
                                              3.10.0-rc1                  3.10.0-rc1
                                                 vanilla            lessdisrupt-v4
      Ops memcachetest-0M             22155.00 (  0.00%)          22180.00 (  0.11%)
      Ops memcachetest-715M           22720.00 (  0.00%)          22355.00 ( -1.61%)
      Ops memcachetest-2385M           3939.00 (  0.00%)          23450.00 (495.33%)
      Ops memcachetest-4055M           3628.00 (  0.00%)          24341.00 (570.92%)
      Ops io-duration-0M                  0.00 (  0.00%)              0.00 (  0.00%)
      Ops io-duration-715M               12.00 (  0.00%)              7.00 ( 41.67%)
      Ops io-duration-2385M             118.00 (  0.00%)             21.00 ( 82.20%)
      Ops io-duration-4055M             162.00 (  0.00%)             36.00 ( 77.78%)
      Ops swaptotal-0M                    0.00 (  0.00%)              0.00 (  0.00%)
      Ops swaptotal-715M             140134.00 (  0.00%)             18.00 ( 99.99%)
      Ops swaptotal-2385M            392438.00 (  0.00%)              0.00 (  0.00%)
      Ops swaptotal-4055M            449037.00 (  0.00%)          27864.00 ( 93.79%)
      Ops swapin-0M                       0.00 (  0.00%)              0.00 (  0.00%)
      Ops swapin-715M                     0.00 (  0.00%)              0.00 (  0.00%)
      Ops swapin-2385M               148031.00 (  0.00%)              0.00 (  0.00%)
      Ops swapin-4055M               135109.00 (  0.00%)              0.00 (  0.00%)
      Ops minorfaults-0M            1529984.00 (  0.00%)        1530235.00 ( -0.02%)
      Ops minorfaults-715M          1794168.00 (  0.00%)        1613750.00 ( 10.06%)
      Ops minorfaults-2385M         1739813.00 (  0.00%)        1609396.00 (  7.50%)
      Ops minorfaults-4055M         1754460.00 (  0.00%)        1614810.00 (  7.96%)
      Ops majorfaults-0M                  0.00 (  0.00%)              0.00 (  0.00%)
      Ops majorfaults-715M              185.00 (  0.00%)            180.00 (  2.70%)
      Ops majorfaults-2385M           24472.00 (  0.00%)            101.00 ( 99.59%)
      Ops majorfaults-4055M           22302.00 (  0.00%)            229.00 ( 98.97%)
      
      Note how the vanilla kernel's performance collapses when there is enough
      IO taking place in the background.  This drop in performance is part of
      what users complain of when they start backups.  Note how the swapin and
      major fault figures indicate that processes were being pushed to swap
      prematurely.  With the series applied, there is no noticeable performance
      drop and while there is still some swap activity, it's tiny.
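The percentages in parentheses appear to be relative gains against the vanilla column, with the sign flipped for metrics where lower is better. A minimal Python sketch under that assumption (the helper name `gain_percent` is ours and is not part of MM Tests):

```python
def gain_percent(baseline, patched, higher_is_better=True):
    """Relative change versus baseline; positive means the patched kernel did better."""
    if baseline == 0:
        return 0.0  # rows with a zero baseline are reported as 0.00%
    delta = patched - baseline if higher_is_better else baseline - patched
    return round(100.0 * delta / baseline, 2)


# memcachetest rows: more operations per second is better.
print(gain_percent(3939.00, 23450.00))
# io-duration rows: a shorter duration is better.
print(gain_percent(118.00, 21.00, higher_is_better=False))
```

This reproduces the memcachetest and io-duration rows. It does not explain every row: swaptotal-2385M is reported as 0.00% even though the vanilla value is non-zero, so the reporting tool evidently treats a zero result differently.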
      
      20 iterations of this test were run in total and averaged.  Every 5
      iterations, additional IO was generated in the background using dd to
      measure how the workload was impacted.  The 0M, 715M, 2385M and 4055M
      sub-blocks refer to the amount of IO going on in the background at each
      iteration.  So memcachetest-2385M is reporting how many
      transactions/second memcachetest recorded on average over 5 iterations
      while there was 2385M of IO going on in the background.  There are six
      blocks of information reported here:
      
      memcachetest is the transactions/second reported by memcachetest. In
      	the vanilla kernel note that performance drops from around
      	22K/sec to just under 4K/second when there is 2385M of IO going
      	on in the background. This is one type of performance collapse
      	users complain about if a large cp or backup starts in the
      	background
      
      io-duration refers to how long it takes for the background IO to
      	complete. It shows that with the patched kernel the IO
      	completes faster while not interfering with the memcache
      	workload
      
      swaptotal is the total amount of swap traffic. With the patched kernel,
      	the total amount of swapping is much reduced although it is
      	still not zero.
      
      swapin in this case is an indication as to whether we are swap thrashing.
      	The closer the swapin/swapout ratio is to 1, the worse the
      	thrashing is.  Note with the patched kernel that there is no swapin
      	activity indicating that all the pages swapped were really inactive
      	unused pages.
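The swapin/swapout ratio mentioned here can be computed from the "Swap Ins" and "Swap Outs" counters reported for the same run. A small Python sketch (the function name `swapin_ratio` is an illustrative choice, not a kernel or MM Tests interface):

```python
def swapin_ratio(swap_ins, swap_outs):
    """Swapins per swapout; values near 1 suggest pages are evicted and then needed again."""
    if swap_outs == 0:
        return 0.0
    return swap_ins / swap_outs


vanilla = swapin_ratio(283406, 698469)  # vanilla "Swap Ins" / "Swap Outs"
patched = swapin_ratio(0, 27882)        # lessdisrupt-v4 counters
print(f"vanilla {vanilla:.2f}, patched {patched:.2f}")
```

Roughly four in ten swapped-out pages came back on the vanilla kernel, against none with the series applied.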
      
      minorfaults are just minor faults. An increased number of minor faults
      	can indicate that page reclaim is unmapping the pages but not
      	swapping them out before they are faulted back in. With the
      	patched kernel, there is only a small change in minor faults
      
      majorfaults are just major faults in the target workload and a high
      	number can indicate that a workload is being prematurely
      	swapped. With the patched kernel, major faults are much reduced. As
      	there are no swapins recorded, the workload is not being swapped. The
      	likely explanation is that libraries or configuration files used by
      	the workload during startup get paged out by the background IO.
      
      Overall with the series applied, there is no noticeable performance drop
      due to background IO and while there is still some swap activity, it's
      tiny and the lack of swapins implies that the swapped pages were inactive
      and unused.
      
                                  3.10.0-rc1  3.10.0-rc1
                                     vanilla lessdisrupt-v4
      Page Ins                       1234608      101892
      Page Outs                     12446272    11810468
      Swap Ins                        283406           0
      Swap Outs                       698469       27882
      Direct pages scanned                 0      136480
      Kswapd pages scanned           6266537     5369364
      Kswapd pages reclaimed         1088989      930832
      Direct pages reclaimed               0      120901
      Kswapd efficiency                  17%         17%
      Kswapd velocity               5398.371    4635.115
      Direct efficiency                 100%         88%
      Direct velocity                  0.000     117.817
      Percentage direct scans             0%          2%
      Page writes by reclaim         1655843     4009929
      Page writes file                957374     3982047
      Page writes anon                698469       27882
      Page reclaim immediate            5245        1745
      Page rescued immediate               0           0
      Slabs scanned                    33664       25216
      Direct inode steals                  0           0
      Kswapd inode steals              19409         778
      Kswapd skipped wait                  0           0
      THP fault alloc                     35          30
      THP collapse alloc                 472         401
      THP splits                          27          22
      THP fault fallback                   0           0
      THP collapse fail                    0           1
      Compaction stalls                    0           4
      Compaction success                   0           0
      Compaction failures                  0           4
      Page migrate success                 0           0
      Page migrate failure                 0           0
      Compaction pages isolated            0           0
      Compaction migrate scanned           0           0
      Compaction free scanned              0           0
      Compaction cost                      0           0
      NUMA PTE updates                     0           0
      NUMA hint faults                     0           0
      NUMA hint local faults               0           0
      NUMA pages migrated                  0           0
      AutoNUMA cost                        0           0
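The derived rows in this table (efficiency and percentage of direct scans) look like simple ratios of the raw counters. A Python sketch that reproduces them, assuming the report truncates to whole percentages and shows 100% when nothing was scanned (both assumptions are inferred from the numbers, not stated in the commit message):

```python
def efficiency(reclaimed, scanned):
    """Share of scanned pages that were reclaimed, as a truncated percentage."""
    if scanned == 0:
        return 100  # assumed convention: no scanning means nothing was wasted
    return int(100 * reclaimed / scanned)


def direct_scan_share(direct_scanned, kswapd_scanned):
    """Share of all scanned pages that were scanned by direct reclaim."""
    total = direct_scanned + kswapd_scanned
    return 0 if total == 0 else int(100 * direct_scanned / total)


print(efficiency(1088989, 6266537))         # Kswapd efficiency, vanilla
print(efficiency(120901, 136480))           # Direct efficiency, lessdisrupt-v4
print(direct_scan_share(136480, 5369364))   # Percentage direct scans, lessdisrupt-v4
```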
      
      Unfortunately, note that there is a small amount of direct reclaim due to
      kswapd no longer reclaiming the world.  ftrace indicates that the direct
      reclaim stalls are mostly harmless with the vast bulk of the stalls
      incurred by dd
      
           23 tclsh-3367
           38 memcachetest-13733
           49 memcachetest-12443
           57 tee-3368
         1541 dd-13826
         1981 dd-12539
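To put a number on "the vast bulk", the per-task stall counts listed above can be summed per command. A short Python sketch (the dictionary simply restates the list; `share_by_command` is a hypothetical helper):

```python
stalls = {
    "tclsh-3367": 23,
    "memcachetest-13733": 38,
    "memcachetest-12443": 49,
    "tee-3368": 57,
    "dd-13826": 1541,
    "dd-12539": 1981,
}


def share_by_command(stalls, command):
    """Percentage of all direct reclaim stalls incurred by tasks named `command`."""
    total = sum(stalls.values())
    # Task labels have the form "<command>-<pid>".
    matched = sum(n for task, n in stalls.items() if task.rsplit("-", 1)[0] == command)
    return 100.0 * matched / total


print(f"dd: {share_by_command(stalls, 'dd'):.1f}% of stalls")
```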
      
      A consequence of the direct reclaim for dd is that the processes for the
      IO workload may show a higher system CPU usage.  There is also a risk that
      kswapd not reclaiming the world may mean that it stays awake balancing
      zones, does not stall on the appropriate events and continually scans
      pages it cannot reclaim consuming CPU.  This will be visible as continued
      high CPU usage but in my own tests I only saw a single spike lasting less
      than a second and I did not observe any problems related to reclaim while
      running the series on my desktop.
      
      This patch:
      
      The number of pages kswapd can reclaim is bound by the number of pages it
      scans which is related to the size of the zone and the scanning priority.
      In many cases the priority remains low because it's reset every
      SWAP_CLUSTER_MAX reclaimed pages but in the event kswapd scans a large
      number of pages it cannot reclaim, it will raise the priority and
      potentially discard a large percentage of the zone as sc->nr_to_reclaim is
      ULONG_MAX.  The user-visible effect is a reclaim "spike" where a large
      percentage of memory is suddenly freed.  It would be bad enough if this
      was just unused memory but because of how anon/file pages are balanced it
      is possible that applications get pushed to swap unnecessarily.
      
      This patch limits the number of pages kswapd will reclaim to the high
      watermark.  Reclaim will still overshoot due to it not being a hard limit
      as shrink_lruvec() will ignore the sc.nr_to_reclaim at DEF_PRIORITY but it
      prevents kswapd reclaiming the world at higher priorities.  The number of
      pages it reclaims is not adjusted for high-order allocations as kswapd
      will reclaim excessively if it is to balance zones for high-order
      allocations.
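A toy Python model of the limit described above. It is only a sketch: the constants are the usual kernel values as far as we know, the `max()` with SWAP_CLUSTER_MAX and the example zone sizes are assumptions, and every scanned page is treated as reclaimable. Note that a numerically lower priority value means more aggressive scanning.

```python
SWAP_CLUSTER_MAX = 32      # reclaim batch size (assumed value)
DEF_PRIORITY = 12          # starting scan priority (assumed value)
ULONG_MAX = 2**64 - 1      # assumes a 64-bit unsigned long


def scan_target(lru_pages, priority):
    """Pages scanned in one pass: the LRU size shifted right by the priority."""
    return lru_pages >> priority


def kswapd_reclaim_limit(high_wmark_pages, patched=True):
    """Reclaim goal for a zone; before the patch it is effectively unlimited."""
    if not patched:
        return ULONG_MAX
    return max(SWAP_CLUSTER_MAX, high_wmark_pages)


def pages_reclaimable(lru_pages, priority, high_wmark_pages, patched=True):
    """Upper bound on pages reclaimed in one pass under the assumptions above."""
    return min(scan_target(lru_pages, priority),
               kswapd_reclaim_limit(high_wmark_pages, patched))


# Hypothetical zone: one million LRU pages, high watermark of 4096 pages.
print(pages_reclaimable(1_000_000, 0, 4096, patched=False))  # whole zone at priority 0
print(pages_reclaimable(1_000_000, 0, 4096, patched=True))   # capped near the watermark
```

The model ignores the overshoot mentioned above, where shrink_lruvec() does not honour the limit at DEF_PRIORITY.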
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Jiri Slaby <jslaby@suse.cz>
      Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
      Tested-by: Zlatko Calusic <zcalusic@bitsync.net>
      Cc: dormando <dormando@rydia.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  2. 30 Apr, 2013 3 commits
    • S
      mm: thp: add split tail pages to shrink page list in page reclaim · 5bc7b8ac
      Shaohua Li committed
      In page reclaim, a huge page is split.  split_huge_page() adds the tail
      pages to the LRU list.  Since we are reclaiming a huge page, it's better
      to reclaim all subpages of the huge page instead of just the head page.
      This patch adds the split tail pages to the shrink page list so the tail
      pages can be reclaimed soon.
      
      Before this patch, running a swap workload:
        thp_fault_alloc 3492
        thp_fault_fallback 608
        thp_collapse_alloc 6
        thp_collapse_alloc_failed 0
        thp_split 916
      
      With this patch:
        thp_fault_alloc 4085
        thp_fault_fallback 16
        thp_collapse_alloc 90
        thp_collapse_alloc_failed 0
        thp_split 1272
      
      Fallback allocation is reduced a lot.
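The reduction can be expressed as a fallback rate per THP fault. The Python sketch below assumes that every THP fault attempt ends up in exactly one of thp_fault_alloc or thp_fault_fallback; `fallback_rate` is an illustrative helper, not a kernel interface.

```python
def fallback_rate(fault_alloc, fault_fallback):
    """Percentage of THP fault attempts that fell back to small pages."""
    attempts = fault_alloc + fault_fallback
    return 0.0 if attempts == 0 else 100.0 * fault_fallback / attempts


before = fallback_rate(3492, 608)  # counters before the patch
after = fallback_rate(4085, 16)    # counters with the patch
print(f"before {before:.1f}%, after {after:.1f}%")
```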
      
      [akpm@linux-foundation.org: fix CONFIG_SWAP=n build]
      Signed-off-by: Shaohua Li <shli@fusionio.com>
      Acked-by: Rik van Riel <riel@redhat.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • A
      memcg: add memory.pressure_level events · 70ddf637
      Anton Vorontsov committed
      With this patch userland applications that want to maintain the
      interactivity/memory allocation cost can use the pressure level
      notifications.  The levels are defined like this:
      
      The "low" level means that the system is reclaiming memory for new
      allocations.  Monitoring this reclaiming activity might be useful for
      maintaining cache level.  Upon notification, the program (typically
      "Activity Manager") might analyze vmstat and act in advance (i.e.
      prematurely shut down unimportant services).
      
      The "medium" level means that the system is experiencing medium memory
      pressure: the system might be swapping, paging out active file
      caches, etc.  Upon this event applications may decide to further analyze
      vmstat/zoneinfo/memcg or internal memory usage statistics and free any
      resources that can be easily reconstructed or re-read from a disk.
      
      The "critical" level means that the system is actively thrashing, it is
      about to run out of memory (OOM) or even the in-kernel OOM killer is on
      its way to trigger.  Applications should do whatever they can to help the
      system.  It might be too late to consult with vmstat or any other
      statistics, so it's advisable to take an immediate action.
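A listener has to name one of these three levels when it registers. The Python sketch below assumes the cgroup v1 eventfd interface, where a line of the form `<event_fd> <fd of memory.pressure_level> <level>` is written to `cgroup.event_control`; the commit message itself does not spell this out, and `event_control_line` is a hypothetical helper.

```python
LEVELS = ("low", "medium", "critical")  # the levels defined by this patch


def event_control_line(event_fd, pressure_fd, level):
    """Build the registration line for cgroup.event_control (assumed v1 format)."""
    if level not in LEVELS:
        raise ValueError(f"unknown pressure level: {level!r}")
    return f"{event_fd} {pressure_fd} {level}"


# The file descriptor numbers here are placeholders for illustration.
print(event_control_line(5, 6, "medium"))
```

In a real listener the first descriptor would come from eventfd(2) and the second from opening memory.pressure_level in the cgroup of interest; the process then blocks reading the eventfd until a notification arrives.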
      
      The events are propagated upward until the event is handled, i.e.  the
      events are not pass-through.  Here is what this means: for example you
      have three cgroups: A->B->C.  Now you set up an event listener on
      cgroups A, B and C, and suppose group C experiences some pressure.  In
      this situation, only group C will receive the notification, i.e.  groups
      A and B will not receive it.  This is done to avoid excessive
      "broadcasting" of messages, which disturbs the system and which is
      especially bad if we are low on memory or thrashing.  So, organize the
      cgroups wisely, or propagate the events manually (or, ask us to
      implement the pass-through events, explaining why you would need them.)
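The propagation rule can be modelled in a few lines of Python. This is a toy model of the described behaviour, not the kernel implementation; the `Cgroup` class and `notify` function are invented for illustration.

```python
class Cgroup:
    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.listeners = []  # callables registered for pressure events


def notify(origin, level):
    """Deliver an event to the nearest group with listeners, walking upward.

    Returns the name of the group that handled the event, or None.
    """
    group = origin
    while group is not None:
        if group.listeners:
            for callback in group.listeners:
                callback(group.name, level)
            return group.name  # handled: ancestors are not notified
        group = group.parent
    return None


# Hierarchy A->B->C with a listener on every group, as in the example above.
received = []
a = Cgroup("A")
b = Cgroup("B", parent=a)
c = Cgroup("C", parent=b)
for group in (a, b, c):
    group.listeners.append(lambda name, level: received.append((name, level)))

print(notify(c, "medium"), received)
```

Only C's listener fires. If C had no listener, the same event would be delivered to B instead.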
      
      Performance wise, the memory pressure notifications feature itself is
      lightweight and does not require much bookkeeping, in contrast to the
      rest of memcg features.  Unfortunately, as of the current memcg
      implementation, page accounting is an inseparable part and cannot be
      turned off.  The good news is that there are some efforts[1] to improve
      the situation; plus, implementing the same, fully API-compatible[2]
      interface for CONFIG_MEMCG=n case (e.g.  embedded) is also a viable
      option, so it will not require any changes on the userland side.
      
      [1] http://permalink.gmane.org/gmane.linux.kernel.cgroups/6291
      [2] http://lkml.org/lkml/2013/2/21/454
      
      [akpm@linux-foundation.org: coding-style fixes]
      [akpm@linux-foundation.org: fix CONFIG_CGROUPS=n warnings]
      Signed-off-by: Anton Vorontsov <anton.vorontsov@linaro.org>
      Acked-by: Kirill A. Shutemov <kirill@shutemov.name>
      Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Glauber Costa <glommer@parallels.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Luiz Capitulino <lcapitulino@redhat.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Leonid Moiseichuk <leonid.moiseichuk@nokia.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
      Cc: John Stultz <john.stultz@linaro.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • H
      mm/vmscan.c: minor cleanup for kswapd · 2d42a40d
      Hillf Danton committed
      Local variable total_scanned is no longer used.
      Signed-off-by: Hillf Danton <dhillf@gmail.com>
      Acked-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  3. 18 Apr, 2013 1 commit
  4. 24 Feb, 2013 16 commits
    • Z
      vmscan: change type of vm_total_pages to unsigned long · b21e0b90
      Zhang Yanfei committed
      This variable is calculated from nr_free_pagecache_pages so
      change its type to unsigned long.
      Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • M
      mm: use up free swap space before reaching OOM kill · 0e50ce3b
      Minchan Kim committed
      Recently, Luigi reported that there is a lot of free swap space left when
      OOM happens.  It's easily reproduced on zram-over-swap, where many
      instances of memory hogs are running and laptop_mode is enabled.  He said
      there was no problem when he disabled laptop_mode.  What I found when
      investigating the problem is as follows.
      
      Assumption for easy explanation: there are no page cache pages in the
      system because they have all been reclaimed already.
      
      1. try_to_free_pages disables may_writepage when laptop_mode is enabled.
      2. shrink_inactive_list isolates victim pages from the inactive anon lru list.
      3. shrink_page_list adds them to the swapcache via add_to_swap but it doesn't
         page them out because sc->may_writepage is 0, so the pages are rotated
         back into the inactive anon lru list. add_to_swap made the pages dirty
         via SetPageDirty.
      4. Step 3 couldn't reclaim any pages, so do_try_to_free_pages raises the
         priority and retries reclaim.
      5. shrink_inactive_list tries to isolate victim pages from the inactive anon
         lru list but fails because it isolates pages in ISOLATE_CLEAN mode and
         the list is full of pages dirtied in step 3, so it just returns without
         any reclaim progress.
      6. do_try_to_free_pages doesn't set may_writepage because total_scanned is
         zero: sc->nr_scanned is increased by shrink_page_list, but
         shrink_page_list is not called in step 5 because no pages were isolated.
      
      The above loop continues until OOM happens.
      
      The problem didn't happen before [1] was merged because the old logic's
      isolation in shrink_inactive_list succeeded and shrink_page_list was
      called to page the pages out.  It still failed to page them out because
      of may_writepage, but the important point is that sc->nr_scanned was
      increased even though the pages couldn't be swapped out, so
      do_try_to_free_pages could set may_writepage.
      
      Since commit f80c0673 ("mm: zone_reclaim: make isolate_lru_page()
      filter-aware") was introduced, it's no longer a good idea to depend
      only on the number of scanned pages for setting may_writepage.  So this
      patch adds a new trigger point for setting may_writepage: a priority below
      DEF_PRIORITY - 2, which is used to indicate significant memory pressure
      in the VM.  It's a good fit for our purpose, since it is better to lose
      power saving or suffer clickety than to trigger OOM killing.
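The failure loop and the new trigger can be illustrated with a toy Python model of a single direct reclaim call over an inactive anon list that is already full of dirty pages. This is a deliberately simplified sketch, not the kernel code: the function name mirrors the kernel's, but the threshold value and the all-or-nothing isolation are assumptions made for illustration.

```python
DEF_PRIORITY = 12          # kernel's starting scan priority
WRITEBACK_THRESHOLD = 48   # assumed: nr_to_reclaim (32) plus half of it


def try_to_free_pages(dirty_anon_pages, laptop_mode, with_fix):
    """Return the number of pages reclaimed; 0 means the caller heads toward OOM."""
    may_writepage = not laptop_mode
    total_scanned = 0
    for priority in range(DEF_PRIORITY, -1, -1):
        # New trigger: allow writeback once the priority drops far enough.
        if with_fix and priority < DEF_PRIORITY - 2:
            may_writepage = True
        # Without may_writepage only clean pages are isolated, and every
        # page on this list is dirty, so nothing is isolated or scanned.
        isolated = dirty_anon_pages if may_writepage else 0
        total_scanned += isolated
        if isolated:
            return isolated  # pages can be written to swap and freed
        # Old trigger: never fires here because total_scanned stays at zero.
        if total_scanned > WRITEBACK_THRESHOLD:
            may_writepage = True
    return 0


print(try_to_free_pages(1000, laptop_mode=True, with_fix=False))  # no progress
print(try_to_free_pages(1000, laptop_mode=True, with_fix=True))   # swap is used
```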
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Reported-by: Luigi Semenzato <semenzato@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • J
      mm: refactor inactive_file_is_low() to use get_lru_size() · e3790144
      Johannes Weiner committed
      An inactive file list is considered low when its active counterpart is
      bigger, regardless of whether it is a global zone LRU list or a memcg
      zone LRU list.  The only difference is in how the LRU size is assessed.
      
      get_lru_size() does the right thing for both global and memcg reclaim
      situations.
      
      Get rid of inactive_file_is_low_global() and
      mem_cgroup_inactive_file_is_low() by using get_lru_size() and compare
      the numbers in common code.
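The unified check amounts to one comparison on top of a size lookup. A minimal Python sketch, in which a dictionary stands in for the lruvec and `get_lru_size` is a stand-in for the kernel helper of the same name:

```python
def get_lru_size(lruvec, lru):
    """Stand-in for the kernel helper: size of one LRU list, global or memcg."""
    return lruvec[lru]


def inactive_file_is_low(lruvec):
    """The inactive file list is low when the active file list is bigger."""
    inactive = get_lru_size(lruvec, "inactive_file")
    active = get_lru_size(lruvec, "active_file")
    return active > inactive


# Hypothetical list sizes in pages.
print(inactive_file_is_low({"inactive_file": 100, "active_file": 400}))
```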
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • S
      swap: add per-partition lock for swapfile · ec8acf20
      Shaohua Li committed
      swap_lock is heavily contended when I test swap to 3 fast SSDs (it is even
      slightly slower than swap to 2 such SSDs).  The main contention comes
      from swap_info_get().  This patch tries to close the gap by adding a new
      per-partition lock.
      
      Global data like nr_swapfiles, total_swap_pages, least_priority and
      swap_list are still protected by swap_lock.
      
      nr_swap_pages is an atomic now, so it can be changed without swap_lock.  In
      theory, it's possible that get_swap_page() finds no swap pages when there
      actually are free swap pages, but that does not sound like a big problem.
      
      Accessing partition specific data (like scan_swap_map and so on) is only
      protected by swap_info_struct.lock.
      
      Changing swap_info_struct.flags requires holding both swap_lock and
      swap_info_struct.lock, because scan_swap_map() will check it.  Reading the
      flags is OK with either lock held.
      
      If both swap_lock and swap_info_struct.lock must be held, we always take
      the former first to avoid deadlock.
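The locking rules can be sketched with ordinary mutexes. The Python model below is purely illustrative (the kernel uses spinlocks, and the `SwapInfo` class, `set_flags` and `read_flags` are invented names); it only shows the ordering: global lock first, per-partition lock second.

```python
import threading

swap_lock = threading.Lock()  # global: nr_swapfiles, swap_list and so on


class SwapInfo:
    def __init__(self):
        self.lock = threading.Lock()  # per-partition lock
        self.flags = 0


def set_flags(si, flags):
    """Changing flags needs both locks; the global lock is always taken first."""
    with swap_lock:
        with si.lock:
            si.flags = flags


def read_flags(si):
    """Reading flags is safe under either lock; here the per-partition one."""
    with si.lock:
        return si.flags


si = SwapInfo()
set_flags(si, 0x2)
print(hex(read_flags(si)))
```

Because every path that needs both locks takes them in the same order, two threads can never each hold one lock while waiting for the other.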
      
      swap_entry_free() can change swap_list.  To delete that code, we add a
      new highest_priority_index.  Whenever get_swap_page() is called, we
      check it.  If it's valid, we use it.
      
      It's a pity get_swap_page() still holds swap_lock.  But in practice,
      swap_lock isn't heavily contended in my test with this patch (or I can
      say there are other, much heavier bottlenecks like TLB flush).  And
      BTW, it looks like get_swap_page() doesn't really need the lock.  We never
      free swap_info[] and we check the SWAP_WRITEOK flag.  The only risk without
      the lock is that we could swap out to some low priority swap, but we can
      quickly recover after several rounds of swap, so it does not sound like a
      big deal to me.  But I'd prefer to fix this if it's a real problem.
      
      "swap: make each swap partition have one address_space" improved the
      swapout speed from 1.7G/s to 2G/s.  This patch further improves the
      speed to 2.3G/s, so around 15% improvement.  It's a multi-process test,
      so TLB flush isn't the biggest bottleneck before the patches.
      
      [arnd@arndb.de: fix it for nommu]
      [hughd@google.com: add missing unlock]
      [minchan@kernel.org: get rid of lockdep whinge on sys_swapon]
      Signed-off-by: Shaohua Li <shli@fusionio.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Seth Jennings <sjenning@linux.vnet.ibm.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
      Cc: Dan Magenheimer <dan.magenheimer@oracle.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • M
      mm: teach mm by current context info to not do I/O during memory allocation · 21caf2fc
      Ming Lei committed
      This patch introduces PF_MEMALLOC_NOIO as a process flag (the 'flags' field
      of 'struct task_struct'), so that the flag can be set by a task to avoid
      doing I/O inside memory allocation in the task's context.
      
      The patch tries to solve a deadlock problem caused by block devices;
      the problem may happen at least in the situations below:
      
      - during block device runtime resume, if memory allocation with
        GFP_KERNEL is called inside the runtime resume callback of any one of its
        ancestors (or the block device itself), the deadlock may be triggered
        inside the memory allocation since it might not complete until the block
        device becomes active and the involved page I/O finishes.  The situation
        was first pointed out by Alan Stern.  It is not a good approach to
        convert all GFP_KERNEL[1] in the path into GFP_NOIO because several
        subsystems may be involved (for example, PCI, USB and SCSI may be
        involved for a usb mass storage device, and network devices are involved
        too in the iSCSI case)
      
      - during block device runtime suspend, because runtime resume needs to
        wait for completion of the concurrent runtime suspend.
      
      - during error handling of a usb mass storage device, a USB bus reset will
        be performed on the device, so there shouldn't be any memory allocation
        with GFP_KERNEL during USB bus reset, otherwise a deadlock similar to the
        above may be triggered.  Unfortunately, any usb device may include a
        mass storage interface in theory, so it requires all usb interface
        drivers to handle the situation.  In fact, most usb drivers don't know
        how to handle bus reset on the device and don't provide .pre_reset() and
        .post_reset() callbacks at all, so USB core has to unbind and bind the
        driver for these devices.  So it is still not practical to resort to
        GFP_NOIO for solving the problem.
      
      Also the introduced solution can be used by block subsystem or block
      drivers too, for example, set the PF_MEMALLOC_NOIO flag before doing
      actual I/O transfer.
      
      It is not a good idea to convert all these GFP_KERNEL allocations in the
      affected path into GFP_NOIO because the functions doing them may be
      implemented as libraries and called in many other contexts.
      
      In fact, memalloc_noio_flags() allows some of the current static
      GFP_NOIO allocations to be converted back into GFP_KERNEL in other
      non-affected contexts.  At least almost all GFP_NOIO in the USB subsystem
      can be converted into GFP_KERNEL after applying the approach, making
      allocation with GFP_NOIO happen only in runtime resume/bus reset/block
      I/O transfer contexts generally.
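The idea of a task-scoped flag that narrows the allocation mask can be modelled in Python. This is a sketch under stated assumptions: the bit values are illustrative rather than the kernel's, the `Task` class and the `memalloc_noio` context manager are invented, and the helper is assumed to clear only the I/O bit.

```python
from contextlib import contextmanager

# Illustrative bit values, not the kernel's.
GFP_WAIT, GFP_IO, GFP_FS = 0x1, 0x2, 0x4
GFP_KERNEL = GFP_WAIT | GFP_IO | GFP_FS
PF_MEMALLOC_NOIO = 0x1


class Task:
    def __init__(self):
        self.flags = 0


def memalloc_noio_flags(task, gfp):
    """Strip the I/O bit from the mask when the task has opted out of I/O."""
    if task.flags & PF_MEMALLOC_NOIO:
        gfp &= ~GFP_IO
    return gfp


@contextmanager
def memalloc_noio(task):
    """Set PF_MEMALLOC_NOIO for a region and restore the previous state after."""
    old = task.flags & PF_MEMALLOC_NOIO
    task.flags |= PF_MEMALLOC_NOIO
    try:
        yield
    finally:
        task.flags = (task.flags & ~PF_MEMALLOC_NOIO) | old


current = Task()
with memalloc_noio(current):
    # Any GFP_KERNEL allocation made in this region loses its I/O bit.
    inside = memalloc_noio_flags(current, GFP_KERNEL)
outside = memalloc_noio_flags(current, GFP_KERNEL)
print(bool(inside & GFP_IO), bool(outside & GFP_IO))
```

Callers deep in a library keep passing GFP_KERNEL unchanged; only the context that must avoid I/O sets the flag, which is why the static GFP_NOIO conversions become unnecessary elsewhere.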
      
      [1], several GFP_KERNEL allocation examples in runtime resume path
      
      - pci subsystem
      acpi_os_allocate
      	<-acpi_ut_allocate
      		<-ACPI_ALLOCATE_ZEROED
      			<-acpi_evaluate_object
      				<-__acpi_bus_set_power
      					<-acpi_bus_set_power
      						<-acpi_pci_set_power_state
      							<-platform_pci_set_power_state
      								<-pci_platform_power_transition
      									<-__pci_complete_power_transition
      										<-pci_set_power_state
      											<-pci_restore_standard_config
      												<-pci_pm_runtime_resume
      - usb subsystem
      usb_get_status
      	<-finish_port_resume
      		<-usb_port_resume
      			<-generic_resume
      				<-usb_resume_device
      					<-usb_resume_both
      						<-usb_runtime_resume
      
      - some individual usb drivers
      usblp, uvc, gspca, most of dvb-usb-v2 media drivers, cpia2, az6007, ....
      
      That is just what I have found.  Unfortunately, such allocations can
      currently only be found by manual inspection, and there are probably
      many more that have not been found, since any function in the resume
      path (call tree) may allocate memory with GFP_KERNEL.
      Signed-off-by: Ming Lei <ming.lei@canonical.com>
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Cc: Alan Stern <stern@rowland.harvard.edu>
      Cc: Oliver Neukum <oneukum@suse.de>
      Cc: Jiri Kosina <jiri.kosina@suse.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
      Cc: Greg KH <greg@kroah.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: David Decotigny <david.decotigny@google.com>
      Cc: Tom Herbert <therbert@google.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      21caf2fc
    • Z
      mm: don't wait on congested zones in balance_pgdat() · 258401a6
      Zlatko Calusic committed
      From: Zlatko Calusic <zlatko.calusic@iskon.hr>
      
      Commit 92df3a72 ("mm: vmscan: throttle reclaim if encountering too
      many dirty pages under writeback") introduced waiting on congested zones
      based on a sane algorithm in shrink_inactive_list().
      
      What this means is that there's no more need for throttling and
      additional heuristics in balance_pgdat().  So, let's remove it and tidy
      up the code.
      Signed-off-by: Zlatko Calusic <zlatko.calusic@iskon.hr>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Johannes Weiner <jweiner@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      258401a6
    • J
      mm: use zone->present_pages instead of zone->managed_pages where appropriate · b40da049
      Jiang Liu committed
      Now we have zone->managed_pages for "pages managed by the buddy system
      in the zone", so replace zone->present_pages with zone->managed_pages if
      what the user really wants is number of allocatable pages.
      Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
      Cc: Wen Congyang <wency@cn.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Jiang Liu <jiang.liu@huawei.com>
      Cc: Maciej Rutecki <maciej.rutecki@gmail.com>
      Cc: Chris Clayton <chris2553@googlemail.com>
      Cc: "Rafael J . Wysocki" <rjw@sisk.pl>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Jianguo Wu <wujianguo@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b40da049
    • Z
      mm: avoid calling pgdat_balanced() needlessly · dafcb73e
      Zlatko Calusic committed
      Now that balance_pgdat() is slightly tidied up, thanks to more capable
      pgdat_balanced(), it's become obvious that pgdat_balanced() is called to
      check the status, then break the loop if pgdat is balanced, just to be
      immediately called again.  The second call is completely unnecessary, of
      course.
      
      The patch introduces pgdat_is_balanced boolean, which helps resolve the
      above suboptimal behavior, with the added benefit of slightly better
      documenting one other place in the function where we jump and skip lots
      of code.
      Signed-off-by: Zlatko Calusic <zlatko.calusic@iskon.hr>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      dafcb73e
    • M
      memcg,vmscan: do not break out targeted reclaim without reclaimed pages · a394cb8e
      Michal Hocko committed
      Targeted (hard or soft limit) reclaim has traditionally tried to scan one
      group with decreasing priority until nr_to_reclaim (SWAP_CLUSTER_MAX
      pages) is reclaimed or all priorities are exhausted.  The reclaim is
      then retried until the limit is met.
      
      This approach, however, doesn't work well with deeper hierarchies where
      groups higher in the hierarchy do not have any or only very few pages
      (this usually happens if those groups do not have any tasks and they
      have only re-parented pages after some of their children are removed).
      Those groups are reclaimed with decreasing priority pointlessly as there
      is nothing to reclaim from them.
      
      The easiest fix is to break out of the memcg iteration loop in
      shrink_zone only if the whole hierarchy has been visited or sufficient
      pages have been reclaimed.  This is also more natural because the
      reclaimer expects that the hierarchy under the given root is reclaimed.
      As a result we can simplify the soft limit reclaim which does its own
      iteration.
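      
      The break condition described above can be sketched as a loop over a
      flattened hierarchy.  This is a simplified model, not the shrink_zone
      code: the struct and function names are hypothetical, and "reclaiming"
      a group is reduced to taking every page it reports as reclaimable.

```c
#include <stddef.h>

/* Hypothetical stand-in for one memory cgroup in the hierarchy. */
struct model_group {
	unsigned long reclaimable;
};

/*
 * Walk the groups under the reclaim root in order.  The walk stops only
 * when enough pages were reclaimed or every group has been visited, so
 * empty groups near the top no longer end the pass early.
 * Returns the number of pages reclaimed; *visited reports how many
 * groups were scanned.
 */
static unsigned long model_shrink_hierarchy(struct model_group *groups,
					    size_t nr_groups,
					    unsigned long nr_to_reclaim,
					    size_t *visited)
{
	unsigned long nr_reclaimed = 0;
	size_t i = 0;

	while (i < nr_groups) {
		/* Pretend every reclaimable page of this group is freed. */
		nr_reclaimed += groups[i].reclaimable;
		groups[i].reclaimable = 0;
		i++;

		if (nr_reclaimed >= nr_to_reclaim)
			break;
	}

	*visited = i;
	return nr_reclaimed;
}
```

      With two empty parent groups followed by a populated child, the walk
      reaches the child in the same pass instead of lowering the priority on
      the empty parents first.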
      
      [yinghan@google.com: break out of the hierarchy loop only if nr_reclaimed exceeded nr_to_reclaim]
      [akpm@linux-foundation.org: use conventional comparison order]
      Signed-off-by: Michal Hocko <mhocko@suse.cz>
      Reported-by: Ying Han <yinghan@google.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Tejun Heo <htejun@gmail.com>
      Cc: Glauber Costa <glommer@parallels.com>
      Cc: Li Zefan <lizefan@huawei.com>
      Signed-off-by: Ying Han <yinghan@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a394cb8e
    • A
      mm/vmscan.c:__zone_reclaim(): replace max_t() with max() · 62b726c1
      Andrew Morton committed
      "mm: vmscan: save work scanning (almost) empty LRU lists" made
      SWAP_CLUSTER_MAX an unsigned long.
      
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Satoru Moriya <satoru.moriya@hds.com>
      Cc: Simon Jeons <simon.jeons@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      62b726c1
    • J
      mm: vmscan: compaction works against zones, not lruvecs · 9b4f98cd
      Johannes Weiner committed
      The restart logic for when reclaim operates back to back with compaction
      is currently applied on the lruvec level.  But this does not make sense,
      because the container of interest for compaction is a zone as a whole,
      not the zone pages that are part of a certain memory cgroup.
      
      Negative impact is bounded.  For one, the code checks that the lruvec
      has enough reclaim candidates, so it does not risk getting stuck on a
      condition that can not be fulfilled.  And the unfairness of hammering on
      one particular memory cgroup to make progress in a zone will be
      amortized by the round robin manner in which reclaim goes through the
      memory cgroups.  Still, this can lead to unnecessary allocation
      latencies when the code elects to restart on a hard to reclaim or small
      group when there are other, more reclaimable groups in the zone.
      
      Move this logic to the zone level and restart reclaim for all memory
      cgroups in a zone when compaction requires more free pages from it.
      
      [akpm@linux-foundation.org: no need for min_t]
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Satoru Moriya <satoru.moriya@hds.com>
      Cc: Simon Jeons <simon.jeons@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9b4f98cd
    • J
      mm: vmscan: clean up get_scan_count() · 9a265114
      Johannes Weiner committed
      Reclaim pressure balance between anon and file pages is calculated
      through a tuple of numerators and a shared denominator.
      
      Exceptional cases that want to force-scan anon or file pages configure
      the numerators and denominator such that one list is preferred, which is
      not necessarily the most obvious way:
      
          fraction[0] = 1;
          fraction[1] = 0;
          denominator = 1;
          goto out;
      
      Make this easier by making the force-scan cases explicit and use the
      fractionals only in case they are calculated from reclaim history.
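      
      A minimal sketch of the difference, assuming index 0 of the fraction
      pair stands for anonymous pages and index 1 for file pages.  The enum
      and function names below are hypothetical and only model the idea of
      naming the force-scan cases instead of encoding them as fractions.

```c
#include <assert.h>

/* Hypothetical names for the explicit scan modes. */
enum model_scan_balance {
	MODEL_SCAN_EQUAL,	/* scan each list relative to its size */
	MODEL_SCAN_FRACT,	/* apply the calculated fractions */
	MODEL_SCAN_ANON,	/* force-scan anonymous pages only */
	MODEL_SCAN_FILE		/* force-scan file pages only */
};

/* Number of pages to scan from one list of lru_size pages. */
static unsigned long model_scan_target(enum model_scan_balance mode,
				       int is_file,
				       unsigned long lru_size,
				       const unsigned long long fraction[2],
				       unsigned long long denominator)
{
	switch (mode) {
	case MODEL_SCAN_EQUAL:
		return lru_size;
	case MODEL_SCAN_FRACT:
		assert(denominator != 0);
		return (unsigned long)((unsigned long long)lru_size *
				       fraction[is_file ? 1 : 0] / denominator);
	case MODEL_SCAN_ANON:
		return is_file ? 0 : lru_size;
	case MODEL_SCAN_FILE:
		return is_file ? lru_size : 0;
	}
	return 0;
}
```

      In this model the old encoding (fractions 1 and 0 over a denominator
      of 1) and the explicit anon-only mode give the same scan targets, but
      only the latter states the intent at the call site.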
      
      [akpm@linux-foundation.org: avoid using uninitialized_var()]
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Satoru Moriya <satoru.moriya@hds.com>
      Cc: Simon Jeons <simon.jeons@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9a265114
    • J
      mm: vmscan: improve comment on low-page cache handling · 11d16c25
      Johannes Weiner committed
      Fix comment style and elaborate on why anonymous memory is force-scanned
      when file cache runs low.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Satoru Moriya <satoru.moriya@hds.com>
      Cc: Simon Jeons <simon.jeons@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      11d16c25
    • J
      mm: vmscan: clarify how swappiness, highest priority, memcg interact · 10316b31
      Johannes Weiner committed
      A swappiness of 0 has a slightly different meaning for global reclaim
      (may swap if file cache really low) and memory cgroup reclaim (never
      swap, ever).
      
      In addition, global reclaim at highest priority will scan all LRU lists
      equal to their size and ignore other balancing heuristics.  UNLESS
      swappiness forbids swapping, then the lists are balanced based on recent
      reclaim effectiveness.  UNLESS file cache is running low, then anonymous
      pages are force-scanned.
      
      This (total mess of a) behaviour is implicit and not obvious from the
      way the code is organized.  At least make it apparent in the code flow
      and document the conditions.  It will be easier to come up with sane
      semantics later.
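      
      One way to read the conditions listed above is as an ordered series of
      checks.  The sketch below is an interpretation of that paragraph, not
      the actual get_scan_count() code: the names are hypothetical, and it
      assumes that priority 0 denotes the highest priority.

```c
#include <stdbool.h>

/* Hypothetical labels for the outcomes named in the text. */
enum model_balance {
	MODEL_FILE_ONLY,	/* never swap */
	MODEL_EQUAL,		/* scan all lists equal to their size */
	MODEL_ANON_FORCED,	/* anonymous pages are force-scanned */
	MODEL_BY_HISTORY	/* balance on recent reclaim effectiveness */
};

static enum model_balance model_pick_balance(bool global_reclaim,
					     int swappiness,
					     int priority,
					     bool file_cache_low)
{
	/* Memory cgroup reclaim with swappiness 0: never swap, ever. */
	if (!global_reclaim && swappiness == 0)
		return MODEL_FILE_ONLY;

	/* Global reclaim at highest priority ignores other heuristics... */
	if (global_reclaim && priority == 0 && swappiness != 0)
		return MODEL_EQUAL;

	/* ...unless swappiness forbids it; a low file cache then forces anon. */
	if (global_reclaim && file_cache_low)
		return MODEL_ANON_FORCED;

	/* Otherwise balance on recent reclaim effectiveness. */
	return MODEL_BY_HISTORY;
}
```

      Writing the checks out in order makes the precedence visible: in this
      reading, a swappiness of 0 still lets global reclaim swap once the
      file cache runs low, while memory cgroup reclaim never does.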
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Reviewed-by: Satoru Moriya <satoru.moriya@hds.com>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Simon Jeons <simon.jeons@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      10316b31
    • J
      mm: vmscan: save work scanning (almost) empty LRU lists · d778df51
      Johannes Weiner committed
      In certain cases (kswapd reclaim, memcg target reclaim), a fixed minimum
      amount of pages is scanned from the LRU lists on each iteration, to make
      progress.
      
      Do not make this minimum bigger than the respective LRU list size,
      however, and save some busy work trying to isolate and reclaim pages
      that are not there.
      
      Empty LRU lists are quite common with memory cgroups in NUMA
      environments because there exists a set of LRU lists for each zone for
      each memory cgroup, while the memory of a single cgroup is expected to
      stay on just one node.  The number of expected empty LRU lists is thus
      
        memcgs * (nodes - 1) * lru types
      
      Each attempt to reclaim from an empty LRU list does expensive size
      comparisons between lists, acquires the zone's lru lock etc.  Avoid
      that.
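      
      Both points can be put into a few lines of C.  This is a sketch with
      hypothetical names; the minimum of 32 pages is an assumed value for
      SWAP_CLUSTER_MAX and is not stated in this message.

```c
/* Assumed value of the fixed minimum scan size. */
#define MODEL_SWAP_CLUSTER_MAX 32UL

/* Forced minimum scan size, capped at what the list actually holds. */
static unsigned long model_min_scan(unsigned long lru_size)
{
	return lru_size < MODEL_SWAP_CLUSTER_MAX ? lru_size
						 : MODEL_SWAP_CLUSTER_MAX;
}

/*
 * Expected number of empty LRU lists when each memory cgroup keeps its
 * memory on a single node: memcgs * (nodes - 1) * lru types.
 */
static unsigned long model_expected_empty_lrus(unsigned long memcgs,
					       unsigned long nodes,
					       unsigned long lru_types)
{
	if (nodes == 0)
		return 0;
	return memcgs * (nodes - 1) * lru_types;
}
```

      For an empty list the capped minimum is zero, so no isolation attempt
      is made at all.  With made-up figures of 100 cgroups, 4 nodes and 5
      list types, the formula gives 1500 lists that would otherwise be
      scanned for nothing.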
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Satoru Moriya <satoru.moriya@hds.com>
      Cc: Simon Jeons <simon.jeons@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d778df51
    • J
      mm: memcg: only evict file pages when we have plenty · 7c5bd705
      Johannes Weiner committed
      Commit e9868505 ("mm, vmscan: only evict file pages when we have
      plenty") makes a point of not going for anonymous memory while there is
      still enough inactive cache around.
      
      The check was added only for global reclaim, but it is just as useful to
      reduce swapping in memory cgroup reclaim:
      
          200M-memcg-defconfig-j2
      
                                           vanilla                   patched
          Real time              454.06 (  +0.00%)         453.71 (  -0.08%)
          User time              668.57 (  +0.00%)         668.73 (  +0.02%)
          System time            128.92 (  +0.00%)         129.53 (  +0.46%)
          Swap in               1246.80 (  +0.00%)         814.40 ( -34.65%)
          Swap out              1198.90 (  +0.00%)         827.00 ( -30.99%)
          Pages allocated   16431288.10 (  +0.00%)    16434035.30 (  +0.02%)
          Major faults           681.50 (  +0.00%)         593.70 ( -12.86%)
          THP faults             237.20 (  +0.00%)         242.40 (  +2.18%)
          THP collapse           241.20 (  +0.00%)         248.50 (  +3.01%)
          THP splits             157.30 (  +0.00%)         161.40 (  +2.59%)
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: Rik van Riel <riel@redhat.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Satoru Moriya <satoru.moriya@hds.com>
      Cc: Simon Jeons <simon.jeons@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7c5bd705
  5. 04 Jan, 2013 1 commit
    • G
      MM: vmscan: remove __devinit attribute. · fcb35a9b
      Greg Kroah-Hartman committed
      CONFIG_HOTPLUG is going away as an option.  As a result, the __dev*
      markings need to be removed.
      
      This change removes the use of __devinit from the file.
      
      Based on patches originally written by Bill Pemberton, but redone by me
      in order to handle some of the coding style issues better, by hand.
      
      Cc: Bill Pemberton <wfp5p@virginia.edu>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      fcb35a9b
  6. 29 Dec, 2012 1 commit
  7. 24 Dec, 2012 1 commit
  8. 20 Dec, 2012 1 commit
    • Z
      mm: do not sleep in balance_pgdat if there's no i/o congestion · cda73a10
      Zlatko Calusic 提交于
      On a 4GB RAM machine, where Normal zone is much smaller than DMA32 zone,
      the Normal zone gets fragmented in time.  This requires relatively more
      pressure in balance_pgdat to get the zone above the required watermark.
      Unfortunately, the congestion_wait() call in there slows it down for a
      completely wrong reason, expecting that there's a lot of
      writeback/swapout, even when there's none (much more common).  After a
      few days, when fragmentation progresses, this flawed logic translates to
      very high CPU iowait times, even though there's no I/O congestion at
      all.  If THP is enabled, the problem occurs sooner, but I was able to
      see it even on !THP kernels, just by giving it a bit more time to occur.
      
      The proper way to deal with this is to not wait, unless there's
      congestion.  Thanks to Mel Gorman, we already have the function that
      perfectly fits the job.  The patch was tested on a machine which nicely
      revealed the problem after only 1 day of uptime, and it's been working
      great.
      Signed-off-by: Zlatko Calusic <zlatko.calusic@iskon.hr>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      cda73a10
  9. 19 Dec, 2012 2 commits
  10. 13 Dec, 2012 1 commit
  11. 12 Dec, 2012 3 commits
  12. 09 Dec, 2012 1 commit
  13. 07 Dec, 2012 1 commit
  14. 01 Dec, 2012 1 commit
  15. 27 Nov, 2012 1 commit
    • M
      mm: vmscan: check for fatal signals iff the process was throttled · 50694c28
      Mel Gorman committed
      Commit 5515061d ("mm: throttle direct reclaimers if PF_MEMALLOC
      reserves are low and swap is backed by network storage") introduced a
      check for fatal signals after a process gets throttled for network
      storage.  The intention was that if a process was throttled and got
      killed that it should not trigger the OOM killer.  As pointed out by
      Minchan Kim and David Rientjes, this check is in the wrong place and too
      broad.  If a system is in an OOM situation and a process is exiting, it
      can loop in __alloc_pages_slowpath(), calling direct reclaim in a
      loop.  As the fatal signal is pending, direct reclaim returns 1 as if it
      were making forward progress, and the process can effectively deadlock.
      
      This patch moves the fatal_signal_pending() check after throttling to
      throttle_direct_reclaim() where it belongs.  If the process is killed
      while throttled, it will return immediately without direct reclaim
      except now it will have TIF_MEMDIE set and will use the PFMEMALLOC
      reserves.
      
      Minchan pointed out that it may be better to direct reclaim before
      returning, because there may be pages that can easily be reclaimed and
      that would avoid using the reserves.  However, we do no such targeted
      reclaim and there is no guarantee that suitable pages are available.
      As this throttling is expected to happen when swap-over-NFS is used,
      there is a possibility that the process will instead swap, which may
      allocate network buffers from the PFMEMALLOC reserves.  Hence, in the
      swap-over-NFS case where a process can be throttled and killed, it can
      either use the reserves to exit or potentially use the reserves to swap
      a few pages and then exit.  This patch takes the option of using the
      reserves if necessary to allow the process to exit quickly.
      
      If this patch passes review it should be considered a -stable candidate
      for 3.6.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Luigi Semenzato <semenzato@google.com>
      Cc: Dan Magenheimer <dan.magenheimer@oracle.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Sonny Rao <sonnyrao@google.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      50694c28
  16. 17 Nov, 2012 1 commit
    • M
      mm: revert "mm: vmscan: scale number of pages reclaimed by reclaim/compaction based on failures" · 96710098
      Mel Gorman committed
      Jiri Slaby reported the following:
      
      	(It's an effective revert of "mm: vmscan: scale number of pages
      	reclaimed by reclaim/compaction based on failures".) Given kswapd
      	had hours of runtime in ps/top output yesterday in the morning
      	and after the revert it's now 2 minutes in sum for the last 24h,
      	I would say, it's gone.
      
      The intention of the patch in question was to compensate for the loss of
      lumpy reclaim.  Part of the reason lumpy reclaim worked is because it
      aggressively reclaimed pages and this patch was meant to be a sane
      compromise.
      
      When compaction fails, it gets deferred, and both compaction and
      reclaim/compaction are deferred to avoid excessive reclaim.  However, since
      commit c6543459 ("mm: remove __GFP_NO_KSWAPD"), kswapd is woken up
      each time and continues reclaiming which was not taken into account when
      the patch was developed.
      
      Attempts to address the problem ended up just changing the shape of the
      problem instead of fixing it.  The release window gets closer and while
      a THP allocation failing is not a major problem, kswapd chewing up a lot
      of CPU is.
      
      This patch reverts commit 83fde0f2 ("mm: vmscan: scale number of
      pages reclaimed by reclaim/compaction based on failures") and will be
      revisited in the future.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Cc: Zdenek Kabelac <zkabelac@redhat.com>
      Tested-by: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
      Cc: Jiri Slaby <jirislaby@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Jiri Slaby <jslaby@suse.cz>
      Cc: Johannes Hirte <johannes.hirte@fem.tu-ilmenau.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      96710098
  17. 09 Nov, 2012 1 commit
  18. 09 Oct, 2012 3 commits
    • M
      CMA: migrate mlocked pages · e46a2879
      Minchan Kim committed
      Presently CMA cannot migrate mlocked pages so it ends up failing to allocate
      contiguous memory space.
      
      This patch makes mlocked pages be migrated out.  Of course, it can affect
      realtime processes but in CMA usecase, contiguous memory allocation failing
      is far worse than access latency to an mlocked page being variable while
      CMA is running.  If someone wants to make the system realtime, he shouldn't
      enable CMA because stalls can still happen at random times.
      
      [akpm@linux-foundation.org: tweak comment text, per Mel]
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e46a2879
    • H
      mm: remove vma arg from page_evictable · 39b5f29a
      Hugh Dickins committed
      page_evictable(page, vma) is an irritant: almost all its callers pass
      NULL for vma.  Remove the vma arg and use mlocked_vma_newpage(vma, page)
      explicitly in the couple of places it's needed.  But in those places we
      don't even need page_evictable() itself!  They're dealing with a freshly
      allocated anonymous page, which has no "mapping" and cannot be mlocked yet.
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Acked-by: Mel Gorman <mel@csn.ul.ie>
      Cc: Rik van Riel <riel@redhat.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Ying Han <yinghan@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      39b5f29a
    • M
      mm: compaction: clear PG_migrate_skip based on compaction and reclaim activity · 62997027
      Mel Gorman committed
      Compaction caches if a pageblock was scanned and no pages were isolated so
      that the pageblocks can be skipped in the future to reduce scanning.  This
      information is not cleared by the page allocator based on activity due to
      the impact it would have to the page allocator fast paths.  Hence there is
      a requirement that something clear the cache or pageblocks will be skipped
      forever.  Currently the cache is cleared if there were a number of recent
      allocation failures and it has not been cleared within the last 5 seconds.
      Time-based decisions like this are terrible as they have no relationship
      to VM activity and are basically a big hammer.
      
      Unfortunately, accurate heuristics would add cost to some hot paths so
      this patch implements a rough heuristic.  There are two cases where the
      cache is cleared.
      
      1. If a !kswapd process completes a compaction cycle (migrate and free
         scanner meet), the zone is marked compact_blockskip_flush. When kswapd
         goes to sleep, it will clear the cache. This is expected to be the
         common case where the cache is cleared. It does not really matter if
         kswapd happens to be asleep or going to sleep when the flag is set as
         it will be woken on the next allocation request.
      
      2. If there have been multiple failures recently and compaction just
         finished being deferred then a process will clear the cache and start a
         full scan.  This situation happens if there are multiple high-order
         allocation requests under heavy memory pressure.
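      
      The two cases above can be modelled with a per-zone flag and an array
      of skip hints.  This is a simplified userspace sketch: all names are
      hypothetical, the zone is reduced to eight pageblocks, and the
      "deferral just ended" condition is passed in as a plain boolean.

```c
#include <stdbool.h>
#include <stddef.h>

#define MODEL_NR_PAGEBLOCKS 8

/* Hypothetical stand-in for the per-zone compaction state. */
struct model_zone {
	bool skip[MODEL_NR_PAGEBLOCKS];	/* cached "nothing isolated here" hints */
	bool blockskip_flush;		/* a full cycle asked for a reset */
};

/* Forget all skip hints so the next scan covers every pageblock. */
static void model_reset_skip_hints(struct model_zone *z)
{
	size_t i;

	for (i = 0; i < MODEL_NR_PAGEBLOCKS; i++)
		z->skip[i] = false;
	z->blockskip_flush = false;
}

/* Case 1, first half: a non-kswapd process completed a compaction cycle. */
static void model_compaction_cycle_done(struct model_zone *z, bool is_kswapd)
{
	if (!is_kswapd)
		z->blockskip_flush = true;
}

/* Case 1, second half: kswapd clears the cache when it goes to sleep. */
static void model_kswapd_going_to_sleep(struct model_zone *z)
{
	if (z->blockskip_flush)
		model_reset_skip_hints(z);
}

/* Case 2: compaction starts right after its deferral period ended. */
static void model_compaction_start(struct model_zone *z,
				   bool deferral_just_ended)
{
	if (deferral_just_ended)
		model_reset_skip_hints(z);
}

/* Helper for inspection: how many pageblocks would currently be skipped. */
static size_t model_count_skipped(const struct model_zone *z)
{
	size_t i, n = 0;

	for (i = 0; i < MODEL_NR_PAGEBLOCKS; i++)
		if (z->skip[i])
			n++;
	return n;
}
```

      Deferring the reset to the point where kswapd goes to sleep keeps the
      work out of the allocator fast paths, which is the cost the heuristic
      is trying to avoid.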
      
      The clearing of the PG_migrate_skip bits and other scans is inherently
      racy but the race is harmless.  For allocations that can fail such as THP,
      they will simply fail.  For requests that cannot fail, they will retry the
      allocation.  Tests indicated that scanning rates were roughly similar to
      when the time-based heuristic was used and the allocation success rates
      were similar.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Richard Davies <richard@arachsys.com>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: Avi Kivity <avi@redhat.com>
      Cc: Rafael Aquini <aquini@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      62997027