    vmscan: properly account for the number of page cache pages zone_reclaim() can reclaim · 90afa5de
    Committed by Mel Gorman
    A bug was brought to my attention against a distro kernel but it affects
    mainline and I believe problems like this have been reported in various
    guises on the mailing lists although I don't have specific examples at the
    moment.
    
    The reported problem was that malloc() stalled for a long time (minutes in
    some cases) if a large tmpfs mount was occupying a large percentage of
    memory overall.  The pages did not get cleaned or reclaimed by
    zone_reclaim() because the zone_reclaim_mode was unsuitable, but the lists
    were scanned uselessly and frequently, making the CPU spin at near 100%.
    
    This patchset intends to address that bug and to bring the behaviour of
    zone_reclaim() more in line with expectations, fixing related problems
    that were noticed during the investigation.  It is based on top of mmotm
    and takes advantage of Kosaki's work with respect to zone_reclaim().
    
    Patch 1 fixes the heuristics that zone_reclaim() uses to determine if the
    	scan should go ahead. The broken heuristic is what was causing the
    	malloc() stall as it uselessly scanned the LRU constantly. Currently,
    	zone_reclaim is assuming zone_reclaim_mode is 1 and historically it
    	could not deal with tmpfs pages at all. This fixes up the heuristic so
    	that an unnecessary scan is more likely to be correctly avoided.
    
    Patch 2 notes that zone_reclaim() returning a failure automatically means
    	the zone is marked full. This is not always true. It could have
    	failed because the GFP mask or zone_reclaim_mode were unsuitable.
    
    Patch 3 introduces a counter zreclaim_failed that will increment each
    	time the zone_reclaim scan-avoidance heuristics fail. If that
    	counter is rapidly increasing, then zone_reclaim_mode should be
    	set to 0 as a temporary resolution and a bug reported because
    	the scan-avoidance heuristic is still broken.
    
    This patch:
    
    On NUMA machines, the administrator can configure zone_reclaim_mode, which
    enables a more targeted form of direct reclaim.  On machines with large
    NUMA distances, for example, zone_reclaim_mode defaults to 1, meaning that
    clean unmapped pages will be reclaimed if the zone watermarks are not
    being met.
    
    There is a heuristic that determines if the scan is worthwhile but the
    problem is that the heuristic is not being properly applied and is
    basically assuming zone_reclaim_mode is 1 if it is enabled.  The lack of
    proper detection can manifest as high CPU usage as the LRU list is scanned
    uselessly.
    
    Historically, once enabled, the heuristic depended on NR_FILE_PAGES, which
    may include swapcache pages that the reclaim_mode cannot deal with.  Patch
    vmscan-change-the-number-of-the-unmapped-files-in-zone-reclaim.patch by
    Kosaki Motohiro noted that zone_page_state(zone, NR_FILE_PAGES) included
    pages that were not file-backed, such as swapcache, and made a calculation
    based on the inactive, active and mapped files instead.  This is far
    superior when zone_reclaim_mode == 1, but if RECLAIM_SWAP is set, then
    NR_FILE_PAGES is a reasonable starting figure.
    
    This patch alters how zone_reclaim() works out how many pages it might be
    able to reclaim given the current reclaim_mode.  If RECLAIM_SWAP is set in
    the reclaim_mode, it considers NR_FILE_PAGES as potential candidates;
    otherwise it uses NR_INACTIVE_FILE + NR_ACTIVE_FILE - NR_FILE_MAPPED to
    discount swapcache and other non-file-backed pages.  If RECLAIM_WRITE is
    not set, then NR_FILE_DIRTY pages are not candidates.  If RECLAIM_SWAP is
    not set, then NR_FILE_MAPPED pages are not candidates either.
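
    The estimate described above can be sketched as a small userspace model.
    This is an illustration only, not the kernel code: struct zone_counters
    and both function names are hypothetical stand-ins for the per-zone
    counters read through zone_page_state(), and the RECLAIM_* bit values are
    assumed to follow the documented vm.zone_reclaim_mode sysctl bits.

```c
/* Mode bits, assumed to match the vm.zone_reclaim_mode sysctl values. */
#define RECLAIM_ZONE  (1u << 0)	/* zone reclaim is enabled */
#define RECLAIM_WRITE (1u << 1)	/* reclaim may write out dirty pages */
#define RECLAIM_SWAP  (1u << 2)	/* reclaim may unmap and swap pages */

/* Hypothetical snapshot of one zone's counters (illustration only). */
struct zone_counters {
	unsigned long nr_file_pages;	/* includes swapcache and tmpfs pages */
	unsigned long nr_inactive_file;	/* pages on the inactive file LRU */
	unsigned long nr_active_file;	/* pages on the active file LRU */
	unsigned long nr_file_mapped;	/* file pages mapped into page tables */
	unsigned long nr_file_dirty;	/* dirty file pages */
};

/* File LRU pages minus mapped pages, clamped at zero to avoid underflow. */
static unsigned long unmapped_file_pages(const struct zone_counters *z)
{
	unsigned long file_lru = z->nr_inactive_file + z->nr_active_file;

	return file_lru > z->nr_file_mapped ? file_lru - z->nr_file_mapped : 0;
}

/* Estimate how many page cache pages could be reclaimed in this mode. */
static unsigned long pagecache_reclaimable(const struct zone_counters *z,
					   unsigned int mode)
{
	unsigned long reclaimable;
	unsigned long delta = 0;

	/* With RECLAIM_SWAP every file page is a potential candidate. */
	if (mode & RECLAIM_SWAP)
		reclaimable = z->nr_file_pages;
	else
		reclaimable = unmapped_file_pages(z);

	/* Without RECLAIM_WRITE dirty pages cannot be cleaned, so drop them. */
	if (!(mode & RECLAIM_WRITE))
		delta = z->nr_file_dirty;

	/* Guard against underflow when the discount exceeds the estimate. */
	if (delta > reclaimable)
		delta = reclaimable;

	return reclaimable - delta;
}
```

    In this model, a zone whose NR_FILE_PAGES is large but whose file LRU
    lists are nearly empty yields an estimate close to zero when only
    RECLAIM_ZONE is set, which is the case where a scan is pointless.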
    
    [kosaki.motohiro@jp.fujitsu.com: Estimate unmapped pages minus tmpfs pages]
    [fengguang.wu@intel.com: Fix underflow problem in Kosaki's estimate]
    Signed-off-by: Mel Gorman <mel@csn.ul.ie>
    Reviewed-by: Rik van Riel <riel@redhat.com>
    Acked-by: Christoph Lameter <cl@linux-foundation.org>
    Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
    Cc: Wu Fengguang <fengguang.wu@intel.com>
    Cc: <stable@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>