    vmscan: raise the bar to PAGEOUT_IO_SYNC stalls · e31f3698
    Wu Fengguang committed
    Fix "system goes unresponsive under memory pressure and lots of
    dirty/writeback pages" bug.
    
    	http://lkml.org/lkml/2010/4/4/86
    
    In the above thread, Andreas Mohr described that
    
    	Invoking any command locked up for minutes (note that I'm
    	talking about attempted additional I/O to the _other_,
    	_unaffected_ main system HDD - such as loading some shell
    	binaries -, NOT the external SSD18M!!).
    
    This happens when both of the following conditions are met:
    - under memory pressure
    - writing heavily to a slow device
    
    OOM also happens in Andreas' system.  The OOM trace shows that 3 processes
    are stuck in wait_on_page_writeback() in the direct reclaim path.  One in
    do_fork() and the other two in unix_stream_sendmsg().  They are blocked on
    this condition:
    
    	(sc->order && priority < DEF_PRIORITY - 2)
    
    which was introduced in commit 78dc583d (vmscan: low order lumpy reclaim
    also should use PAGEOUT_IO_SYNC) one year ago.  That condition may be too
    permissive.  In Andreas' case, 512MB/1024 = 512KB.  If the direct reclaim
    for the order-1 fork() allocation runs into a range of 512KB
    hard-to-reclaim LRU pages, it will be stalled.
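
The 512KB figure follows from how reclaim sizes its scan window: each pass covers roughly the LRU size shifted right by the current priority. A minimal sketch of that arithmetic, assuming DEF_PRIORITY is 12 as in mainline. The helper names are hypothetical and are not kernel functions:

```c
#include <stdbool.h>
#include <stdint.h>

#define DEF_PRIORITY 12 /* mainline value; an assumption of this sketch */

/* Hypothetical helper: bytes of LRU covered by one pass at this priority. */
static uint64_t scan_window_bytes(uint64_t lru_bytes, int priority)
{
	return lru_bytes >> priority;
}

/* The condition quoted above, restated as a standalone predicate. */
static bool old_sync_stall(int order, int priority)
{
	return order && priority < DEF_PRIORITY - 2;
}
```

With 512MB of LRU pages, `scan_window_bytes(512ULL << 20, DEF_PRIORITY - 2)` gives 512KB, and `old_sync_stall(1, 9)` is true for an order-1 allocation such as the fork() case described here.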
    
    It's a severe problem in three ways.
    
    Firstly, it can easily happen in daily desktop usage.  vmscan priority can
    easily go below (DEF_PRIORITY - 2) on _local_ memory pressure.  Even if
    the system has 50% globally reclaimable pages, it still has good
    opportunity to have 0.1% sized hard-to-reclaim ranges.  For example, a
    simple dd can easily create a big range (up to 20%) of dirty pages in the
    LRU lists.  And order-1 to order-3 allocations are more than common with
    SLUB.  Try "grep -v '1 :' /proc/slabinfo" to get the list of high order
    slab caches.  For example, the order-1 radix_tree_node slab cache may
    stall applications at swap-in time; the order-3 inode cache on most
    filesystems may stall applications when trying to read some file; the
    order-2 proc_inode_cache may stall applications when trying to open a
    /proc file.
    
    Secondly, once triggered, it will stall unrelated processes (not doing IO
    at all) in the system.  This "one slow USB device stalls the whole system"
    avalanching effect is very bad.
    
    Thirdly, once stalled, the stall time could be intolerably long for the
    users.  When there are 20MB of queued writeback pages and USB 1.1 is
    writing them at 1MB/s, wait_on_page_writeback() will be stuck for up to
    20 seconds.  Not to mention it may be called multiple times.
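
The 20-second bound is simply queue size divided by device throughput. A rough sketch, assuming a constant write rate and a waiter queued behind the whole backlog. The helper name is hypothetical:

```c
#include <stdint.h>

/* Hypothetical helper: seconds until a writeback queue drains at a fixed rate. */
static double drain_seconds(uint64_t queued_bytes, uint64_t bytes_per_sec)
{
	return (double)queued_bytes / (double)bytes_per_sec;
}
```

For the example in the text, `drain_seconds(20ULL << 20, 1ULL << 20)` evaluates to 20.0. Each further call to wait_on_page_writeback() can add a similar delay.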
    
    So raise the bar to only enable PAGEOUT_IO_SYNC when priority goes below
    DEF_PRIORITY/3, or 6.25% LRU size.  As the default dirty throttle ratio is
    20%, it will hardly be triggered by pure dirty pages.  We'd better treat
    PAGEOUT_IO_SYNC as some last resort workaround -- its stall time is so
    uncomfortably long (easily goes beyond 1s).
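
To see where 6.25% comes from, compare the share of the LRU covered at each threshold. This is a sketch of the arithmetic only, not the patch itself. It assumes DEF_PRIORITY is 12 and that the scan window is the LRU size shifted right by the priority. The helper name is hypothetical:

```c
#define DEF_PRIORITY 12 /* mainline value; an assumption of this sketch */

/* Hypothetical helper: percentage of the LRU covered by one pass. */
static double scan_window_percent(int priority)
{
	return 100.0 / (double)(1UL << priority);
}
```

`scan_window_percent(DEF_PRIORITY / 3)` is 6.25, while the old threshold `scan_window_percent(DEF_PRIORITY - 2)` is just under 0.1, matching the "0.1% sized hard-to-reclaim ranges" mentioned earlier.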
    
    The bar is only raised for (order < PAGE_ALLOC_COSTLY_ORDER) allocations,
    which are easy to satisfy in 1TB memory boxes.  So, although 6.25% of
    memory could be an awful lot of pages to scan on a system with 1TB of
    memory, it won't really have to busy scan that much.
    
    Andreas tested an older version of this patch and reported that it mostly
    fixed his problem.  Mel Gorman helped improve it and KOSAKI Motohiro will
    fix it further in the next patch.
    
    Reported-by: Andreas Mohr <andi@lisas.de>
    Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
    Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
    Signed-off-by: Mel Gorman <mel@csn.ul.ie>
    Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
    Cc: Rik van Riel <riel@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>