1. 07 1月, 2009 40 次提交
    • H
      badpage: KERN_ALERT BUG instead of KERN_EMERG · 1e9e6365
      Hugh Dickins 提交于
      bad_page() and rmap Eeek messages have said KERN_EMERG for a few years,
      which I've followed in print_bad_pte().  These are serious system errors,
      on a par with BUGs, but they're not quite emergencies, and we do our best
      to carry on: say KERN_ALERT "BUG: " like the x86 oops does.
      
      And remove the "Trying to fix it up, but a reboot is needed" line: it's
      not untrue, but I hope the KERN_ALERT "BUG: " conveys as much.
      Signed-off-by: NHugh Dickins <hugh@veritas.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1e9e6365
    • H
      badpage: ratelimit print_bad_pte and bad_page · d936cf9b
      Hugh Dickins 提交于
      print_bad_pte() and bad_page() might each need ratelimiting - especially
      for their dump_stacks, almost never of interest, yet not quite
      dispensible.  Correlating corruption across neighbouring entries can be
      very helpful, so allow a burst of 60 reports before keeping quiet for the
      remainder of that minute (or allow a steady drip of one report per
      second).
      Signed-off-by: NHugh Dickins <hugh@veritas.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d936cf9b
    • H
      badpage: remove vma from page_remove_rmap · edc315fd
      Hugh Dickins 提交于
      Remove page_remove_rmap()'s vma arg, which was only for the Eeek message.
      And remove the BUG_ON(page_mapcount(page) == 0) from CONFIG_DEBUG_VM's
      page_dup_rmap(): we're trying to be more resilient about that than BUGs.
      Signed-off-by: NHugh Dickins <hugh@veritas.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      edc315fd
    • H
      badpage: zap print_bad_pte on swap and file · 2509ef26
      Hugh Dickins 提交于
      Complete zap_pte_range()'s coverage of bad pagetable entries by calling
      print_bad_pte() on a pte_file in a linear vma and on a bad swap entry.
      That needs free_swap_and_cache() to tell it, which will also have shown
      one of those "swap_free" errors (but with much less information).
      
      Similar checks in fork's copy_one_pte()?  No, that would be more noisy
      than helpful: we'll see them when parent and child exec or exit.
      
      Where do_nonlinear_fault() calls print_bad_pte(): omit !VM_CAN_NONLINEAR
      case, that could only be a bug in sys_remap_file_pages(), not a bad pte.
      VM_FAULT_OOM rather than VM_FAULT_SIGBUS?  Well, okay, that is consistent
      with what happens if do_swap_page() operates a bad swap entry; but don't
      we have patches to be more careful about killing when VM_FAULT_OOM?
      Signed-off-by: NHugh Dickins <hugh@veritas.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2509ef26
    • H
      badpage: vm_normal_page use print_bad_pte · 22b31eec
      Hugh Dickins 提交于
      print_bad_pte() is so far being called only when zap_pte_range() finds
      negative page_mapcount, or there's a fault on a pte_file where it does not
      belong.  That's weak coverage when we suspect pagetable corruption.
      
      Originally, it was called when vm_normal_page() found an invalid pfn: but
      pfn_valid is expensive on some architectures and configurations, so 2.6.24
      put that under CONFIG_DEBUG_VM (which doesn't help in the field), then
      2.6.26 replaced it by a VM_BUG_ON (likewise).
      
      Reinstate the print_bad_pte() in vm_normal_page(), but use a cheaper test
      than pfn_valid(): memmap_init_zone() (used in bootup and hotplug) keep a
      __read_mostly note of the highest_memmap_pfn, vm_normal_page() then check
      pfn against that.  We could call this pfn_plausible() or pfn_sane(), but I
      doubt we'll need it elsewhere: of course it's not reliable, but gives much
      stronger pagetable validation on many boxes.
      
      Also use print_bad_pte() when the pte_special bit is found outside a
      VM_PFNMAP or VM_MIXEDMAP area, instead of VM_BUG_ON.
      Signed-off-by: NHugh Dickins <hugh@veritas.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      22b31eec
    • H
      badpage: replace page_remove_rmap Eeek and BUG · 3dc14741
      Hugh Dickins 提交于
      Now that bad pages are kept out of circulation, there is no need for the
      infamous page_remove_rmap() BUG() - once that page is freed, its negative
      mapcount will issue a "Bad page state" message and the page won't be
      freed.  Removing the BUG() allows more info, on subsequent pages, to be
      gathered.
      
      We do have more info about the page at this point than bad_page() can know
      - notably, what the pmd is, which might pinpoint something like low 64kB
      corruption - but page_remove_rmap() isn't given the address to find that.
      
      In practice, there is only one call to page_remove_rmap() which has ever
      reported anything, that from zap_pte_range() (usually on exit, sometimes
      on munmap).  It has all the info, so remove page_remove_rmap()'s "Eeek"
      message and leave it all to zap_pte_range().
      
      mm/memory.c already has a hardly used print_bad_pte() function, showing
      some of the appropriate info: extend it to show what we want for the rmap
      case: pte info, page info (when there is a page) and vma info to compare.
      zap_pte_range() already knows the pmd, but print_bad_pte() is easier to
      use if it works that out for itself.
      
      Some of this info is also shown in bad_page()'s "Bad page state" message.
      Keep them separate, but adjust them to match each other as far as
      possible.  Say "Bad page map" in print_bad_pte(), and add a TAINT_BAD_PAGE
      there too.
      
      print_bad_pte() show current->comm unconditionally (though it should get
      repeated in the usually irrelevant stack trace): sorry, I misled Nick
      Piggin to make it conditional on vm_mm == current->mm, but current->mm is
      already NULL in the exit case.  Usually current->comm is good, though
      exceptionally it may not be that of the mm (when "swapoff" for example).
      Signed-off-by: NHugh Dickins <hugh@veritas.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3dc14741
    • H
      badpage: keep any bad page out of circulation · 8cc3b392
      Hugh Dickins 提交于
      Until now the bad_page() checkers have special-cased PageReserved, keeping
      those pages out of circulation thereafter.  Now extend the special case to
      all: we want to keep ANY page with bad state out of circulation - the
      "free" page may well be in use by something.
      
      Leave the bad state of those pages untouched, for examination by
      debuggers; except for PageBuddy - leaving that set would risk bringing the
      page back.
      Signed-off-by: NHugh Dickins <hugh@veritas.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8cc3b392
    • H
      badpage: simplify page_alloc flag check+clear · 79f4b7bf
      Hugh Dickins 提交于
      Simplify the PAGE_FLAGS checking and clearing when freeing and allocating
      a page: check the same flags as before when freeing, clear ALL the flags
      (unless PageReserved) when freeing, check ALL flags off when allocating.
      Signed-off-by: NHugh Dickins <hugh@veritas.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      79f4b7bf
    • D
      fs: truncate blocks outside i_size after O_DIRECT write error · 0f64415d
      Dmitri Monakhov 提交于
      In case of error extending write may have instantiated a few blocks
      outside i_size.  We need to trim these blocks.  We have to do it
      *regardless* to blocksize.  At least ext2, ext3 and reiserfs interpret
      (i_size < biggest block) condition as error.  Fsck will complain about
      wrong i_size.  Then fsck will fix the error by changing i_size according
      to the biggest block.  This is bad because this blocks contain garbage
      from previous write attempt.  And result in data corruption.
      
      ####TESTCASE_BEGIN
      $touch /mnt/test/BIG_FILE
      ## at this moment /mnt/test/BIG_FILE size and blocks equal to zero
      open("/mnt/test/BIG_FILE", O_WRONLY|O_CREAT|O_DIRECT, 0666) = 3
      write(3, "aaaaaaaaaaaa"..., 104857600) = -1 ENOSPC (No space left on device)
      ## size and block sould't be changed because write op failed.
      $stat /mnt/test/BIG_FILE
      File: `/mnt/test/BIG_FILE'
      Size: 0 Blocks: 110896 IO Block: 1024 regular empty file
      <<<<<<<<^^^^^^^^^^^^^^^^^^^^^^^^^^^^^file size is less than biggest block idx
      Device: fe07h/65031d Inode: 14 Links: 1
      Access: (0644/-rw-r--r--) Uid: ( 0/ root) Gid: ( 0/ root)
      Access: 2007-01-24 20:03:38.000000000 +0300
      Modify: 2007-01-24 20:03:38.000000000 +0300
      Change: 2007-01-24 20:03:39.000000000 +0300
      
      #fsck.ext3 -f /dev/VG/test
      e2fsck 1.39 (29-May-2006)
      Pass 1: Checking inodes, blocks, and sizes
      Inode 14, i_size is 0, should be 56556544. Fix<y>? yes
      Pass 2: Checking directory structure
      ....
      #####TESTCASE_ENDdiff --git a/fs/direct-io.c b/fs/direct-io.c
      index af0558d..4e88bea 100644
      
      [akpm@linux-foundation.org: use i_size_read()]
      Signed-off-by: NDmitri Monakhov <dmonakhov@openvz.org>
      Cc: Zach Brown <zach.brown@oracle.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: Badari Pulavarty <pbadari@us.ibm.com>
      Cc: Chris Mason <chris.mason@oracle.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0f64415d
    • K
      mm: kill zone_is_near_oom() · 09f445e7
      KOSAKI Motohiro 提交于
      zone_is_near_oom() is unused.
      Signed-off-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      09f445e7
    • K
      vmscan: improve reclaim throughput to bail out patch · 01dbe5c9
      KOSAKI Motohiro 提交于
      The vmscan bail out patch move nr_reclaimed variable to struct
      scan_control.  Unfortunately, indirect access can easily happen cache
      miss.
      
      if heavy memory pressure happend, that's ok.
      cache miss already plenty. it is not observable.
      
      but, if memory pressure is lite, performance degression is obserbable.
      
      I compared following three pattern (it was mesured 10 times each)
      
      hackbench 125 process 3000
      hackbench 130 process 3000
      hackbench 135 process 3000
      
                  2.6.28-rc6                       bail-out
      
      	125	130	135		125	130	135
            ==============================================================
      	71.866	75.86	81.274		93.414	73.254	193.382
      	74.145	78.295	77.27		74.897	75.021	80.17
      	70.305	77.643	75.855		70.134	77.571	79.896
      	74.288	73.986	75.955		77.222	78.48	80.619
      	72.029	79.947	78.312		75.128	82.172	79.708
      	71.499	77.615	77.042		74.177	76.532	77.306
      	76.188	74.471	83.562		73.839	72.43	79.833
      	73.236	75.606	78.743		76.001	76.557	82.726
      	69.427	77.271	76.691		76.236	79.371	103.189
      	72.473	76.978	80.643		69.128	78.932	75.736
      
      avg	72.545	76.767	78.534		76.017	77.03	93.256
      std	1.89	1.71	2.41		6.29	2.79	34.16
      min	69.427	73.986	75.855		69.128	72.43	75.736
      max	76.188	79.947	83.562		93.414	82.172	193.382
      
      about 4-5% degression.
      
      Then, this patch introduces a temporary local variable.
      
      result:
      
                  2.6.28-rc6                       this patch
      
      num	125	130	135		125	130	135
            ==============================================================
      	71.866	75.86	81.274		67.302	68.269	77.161
      	74.145	78.295	77.27   	72.616	72.712	79.06
      	70.305	77.643	75.855  	72.475	75.712	77.735
      	74.288	73.986	75.955  	69.229	73.062	78.814
      	72.029	79.947	78.312  	71.551	74.392	78.564
      	71.499	77.615	77.042  	69.227	74.31	78.837
      	76.188	74.471	83.562  	70.759	75.256	76.6
      	73.236	75.606	78.743  	69.966	76.001	78.464
      	69.427	77.271	76.691  	69.068	75.218	80.321
      	72.473	76.978	80.643  	72.057	77.151	79.068
      
      avg	72.545	76.767	78.534 		70.425	74.2083	78.462
      std 	1.89	1.71	2.41    	1.66	2.34	1.00
      min 	69.427	73.986	75.855  	67.302	68.269	76.6
      max 	76.188	79.947	83.562  	72.616	77.151	80.321
      
      OK. the degression is disappeared.
      Signed-off-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Acked-by: NRik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      01dbe5c9
    • R
      vmscan: bail out of direct reclaim after swap_cluster_max pages · a79311c1
      Rik van Riel 提交于
      When the VM is under pressure, it can happen that several direct reclaim
      processes are in the pageout code simultaneously.  It also happens that
      the reclaiming processes run into mostly referenced, mapped and dirty
      pages in the first round.
      
      This results in multiple direct reclaim processes having a lower
      pageout priority, which corresponds to a higher target of pages to
      scan.
      
      This in turn can result in each direct reclaim process freeing
      many pages.  Together, they can end up freeing way too many pages.
      
      This kicks useful data out of memory (in some cases more than half
      of all memory is swapped out).  It also impacts performance by
      keeping tasks stuck in the pageout code for too long.
      
      A 30% improvement in hackbench has been observed with this patch.
      
      The fix is relatively simple: in shrink_zone() we can check how many
      pages we have already freed, direct reclaim tasks break out of the
      scanning loop if they have already freed enough pages and have reached
      a lower priority level.
      
      We do not break out of shrink_zone() when priority == DEF_PRIORITY,
      to ensure that equal pressure is applied to every zone in the common
      case.
      
      However, in order to do this we do need to know how many pages we already
      freed, so move nr_reclaimed into scan_control.
      
      akpm: a historical interlude...
      
      We tried this in 2004:
      
      :commit e468e46a9bea3297011d5918663ce6d19094cf87
      :Author: akpm <akpm>
      :Date:   Thu Jun 24 15:53:52 2004 +0000
      :
      :[PATCH] vmscan.c: dont reclaim too many pages
      :
      :    The shrink_zone() logic can, under some circumstances, cause far too many
      :    pages to be reclaimed.  Say, we're scanning at high priority and suddenly hit
      :    a large number of reclaimable pages on the LRU.
      :    Change things so we bale out when SWAP_CLUSTER_MAX pages have been reclaimed.
      
      And we reverted it in 2006:
      
      :commit 210fe530
      :Author: Andrew Morton <akpm@osdl.org>
      :Date:   Fri Jan 6 00:11:14 2006 -0800
      :
      :    [PATCH] vmscan: balancing fix
      :
      :    Revert a patch which went into 2.6.8-rc1.  The changelog for that patch was:
      :
      :      The shrink_zone() logic can, under some circumstances, cause far too many
      :      pages to be reclaimed.  Say, we're scanning at high priority and suddenly
      :      hit a large number of reclaimable pages on the LRU.
      :
      :      Change things so we bale out when SWAP_CLUSTER_MAX pages have been
      :      reclaimed.
      :
      :    Problem is, this change caused significant imbalance in inter-zone scan
      :    balancing by truncating scans of larger zones.
      :
      :    Suppose, for example, ZONE_HIGHMEM is 10x the size of ZONE_NORMAL.  The zone
      :    balancing algorithm would require that if we're scanning 100 pages of
      :    ZONE_HIGHMEM, we should scan 10 pages of ZONE_NORMAL.  But this logic will
      :    cause the scanning of ZONE_HIGHMEM to bale out after only 32 pages are
      :    reclaimed.  Thus effectively causing smaller zones to be scanned relatively
      :    harder than large ones.
      :
      :    Now I need to remember what the workload was which caused me to write this
      :    patch originally, then fix it up in a different way...
      
      And we haven't demonstrated that whatever problem caused that reversion is
      not being reintroduced by this change in 2008.
      Signed-off-by: NRik van Riel <riel@redhat.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a79311c1
    • H
      hugetlb: fix sparse warnings · ebdd4aea
      Hannes Eder 提交于
      Fix the following sparse warnings:
      
        mm/hugetlb.c:375:3: warning: returning void-valued expression
        mm/hugetlb.c:408:3: warning: returning void-valued expression
      Signed-off-by: NHannes Eder <hannes@hanneseder.net>
      Acked-by: NNishanth Aravamudan <nacc@us.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ebdd4aea
    • H
      swapfile: let others seed random · f0d7a4b3
      Hugh Dickins 提交于
      Remove the srandom32((u32)get_seconds()) from non-rotational swapon:
      there's been a coincidental discussion of earlier randomization, assume
      that goes ahead, let swapon be a client rather than stirring for itself.
      Signed-off-by: NHugh Dickins <hugh@veritas.com>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Cc: Donjun Shin <djshin90@gmail.com>
      Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
      Cc: Jens Axboe <jens.axboe@oracle.com>
      Cc: Joern Engel <joern@logfs.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Matthew Wilcox <matthew@wil.cx>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Tejun Heo <teheo@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f0d7a4b3
    • H
      swapfile: change discard pgoff_t to sector_t · 858a2990
      Hugh Dickins 提交于
      Change pgoff_t nr_blocks in discard_swap() and discard_swap_cluster() to
      sector_t: given the constraints on swap offsets (in particular, the 5 bits
      of swap type accommodated in the same unsigned long), pgoff_t was actually
      safe as is, but it certainly looked worrying when shifted left.
      
      [akpm@linux-foundation.org: fix shift overflow]
      Signed-off-by: NHugh Dickins <hugh@veritas.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Cc: Jens Axboe <jens.axboe@oracle.com>
      Cc: Matthew Wilcox <matthew@wil.cx>
      Cc: Joern Engel <joern@logfs.org>
      Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
      Cc: Donjun Shin <djshin90@gmail.com>
      Cc: Tejun Heo <teheo@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      858a2990
    • H
      swapfile: swap allocation cycle if nonrot · c60aa176
      Hugh Dickins 提交于
      Though attempting to find free clusters (Andrea), swap allocation has
      always restarted its searches from the beginning of the swap area (sct),
      to reduce seek times between swap pages, by not scattering them all over
      the partition.
      
      But on a solidstate swap device, seeks are cheap, and block remapping to
      level the wear may be limited by zones: in that case it's better to cycle
      around the whole partition.
      Signed-off-by: NHugh Dickins <hugh@veritas.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Cc: Jens Axboe <jens.axboe@oracle.com>
      Cc: Matthew Wilcox <matthew@wil.cx>
      Cc: Joern Engel <joern@logfs.org>
      Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
      Cc: Donjun Shin <djshin90@gmail.com>
      Cc: Tejun Heo <teheo@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c60aa176
    • H
      swapfile: swapon randomize if nonrot · 20137a49
      Hugh Dickins 提交于
      Swap allocation has always started from the beginning of the swap area;
      but if we're dealing with a solidstate swap device which can only remap
      blocks within limited zones, that would sooner wear out the first zone.
      
      Therefore sys_swapon() test whether blk_queue is non-rotational, and if so
      randomize the cluster_next starting position for allocation.
      
      If blk_queue is nonrot, note SWP_SOLIDSTATE for later use, and report it
      with an "SS" at the right end of the kernel's "Adding ...  swap" message
      (so that if it's both nonrot and discardable, "SSD" will be shown there).
      Perhaps something should be shown in /proc/swaps (swapon -s), but we have
      to be more cautious before making any addition to that format.
      Signed-off-by: NHugh Dickins <hugh@veritas.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Cc: Jens Axboe <jens.axboe@oracle.com>
      Cc: Matthew Wilcox <matthew@wil.cx>
      Cc: Joern Engel <joern@logfs.org>
      Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
      Cc: Donjun Shin <djshin90@gmail.com>
      Cc: Tejun Heo <teheo@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      20137a49
    • H
      swapfile: swap allocation use discard · 7992fde7
      Hugh Dickins 提交于
      When scan_swap_map() finds a free cluster of swap pages to allocate,
      discard the old contents of the cluster if the device supports discard.
      But don't bother when swap is so fragmented that we allocate single pages.
      
      Be careful about racing allocations made while we're scanning for a
      cluster; and hold up allocations made while we're discarding.
      Signed-off-by: NHugh Dickins <hugh@veritas.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Cc: Jens Axboe <jens.axboe@oracle.com>
      Cc: Matthew Wilcox <matthew@wil.cx>
      Cc: Joern Engel <joern@logfs.org>
      Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
      Cc: Donjun Shin <djshin90@gmail.com>
      Cc: Tejun Heo <teheo@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7992fde7
    • H
      swapfile: swapon use discard (trim) · 6a6ba831
      Hugh Dickins 提交于
      When adding swap, all the old data on swap can be forgotten: sys_swapon()
      discard all but the header page of the swap partition (or every extent but
      the header of the swap file), to give a solidstate swap device the
      opportunity to optimize its wear-levelling.
      
      If that succeeds, note SWP_DISCARDABLE for later use, and report it with a
      "D" at the right end of the kernel's "Adding ...  swap" message.  Perhaps
      something should be shown in /proc/swaps (swapon -s), but we have to be
      more cautious before making any addition to that format.
      Signed-off-by: NHugh Dickins <hugh@veritas.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Cc: Jens Axboe <jens.axboe@oracle.com>
      Cc: Matthew Wilcox <matthew@wil.cx>
      Cc: Joern Engel <joern@logfs.org>
      Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
      Cc: Donjun Shin <djshin90@gmail.com>
      Cc: Tejun Heo <teheo@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6a6ba831
    • H
      swapfile: rearrange scan and swap_info · ebebbbe9
      Hugh Dickins 提交于
      Before making functional changes, rearrange scan_swap_map() to simplify
      subsequent diffs.  Actually, there is one functional change in there:
      leave cluster_nr negative while scanning for a new cluster - resetting it
      early increased the likelihood that when we have difficulty finding a free
      cluster, another task may come in and try doing exactly the same - just a
      waste of cpu.
      
      Before making functional changes, rearrange struct swap_info_struct
      slightly: flags will be needed as an unsigned long (for wait_on_bit), next
      is a good int to pair with prio, old_block_size is uninteresting so shift
      it to the end.
      Signed-off-by: NHugh Dickins <hugh@veritas.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ebebbbe9
    • H
      swapfile: remove v0 SWAP-SPACE message · 81e33971
      Hugh Dickins 提交于
      The kernel has not supported v0 SWAP-SPACE since 2.5.22: I think we can
      now safely drop its "version 0 swap is no longer supported" message - just
      say "Unable to find swap-space signature" as usual.  This removes one
      level of indentation from a stretch of sys_swapon().
      
      I'd have liked to be specific, saying "Unable to find SWAPSPACE2
      signature", but it's just too confusing that the version 1 signature shows
      the number 2.
      
      Irrelevant nearby cleanup: kmap(page) already gives page_address(page).
      Signed-off-by: NHugh Dickins <hugh@veritas.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      81e33971
    • H
      swapfile: remove surplus whitespace · 886bb7e9
      Hugh Dickins 提交于
      Remove trailing whitespace from swapfile.c, and odd swap_show() alignment.
      Signed-off-by: NHugh Dickins <hugh@veritas.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      886bb7e9
    • H
      swapfile: remove SWP_ACTIVE mask · 22c6f8fd
      Hugh Dickins 提交于
      Remove the SWP_ACTIVE mask: it just obscures the SWP_WRITEOK flag.
      Signed-off-by: NHugh Dickins <hugh@veritas.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      22c6f8fd
    • H
      swapfile: swapon needs larger size type · 73fd8748
      Hugh Dickins 提交于
      sys_swapon()'s swapfilesize (better renamed swapfilepages) is declared as
      an int, but should be an unsigned long like the maxpages it's compared
      against: on 64-bit (with 4kB pages) a swapfile of 2^44 bytes was rejected
      with "Swap area shorter than signature indicates".
      Signed-off-by: NHugh Dickins <hugh@veritas.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      73fd8748
    • K
      mm: make vread() and vwrite() declaration · 69beeb1d
      KOSAKI Motohiro 提交于
      Sparse output following warnings.
      
      mm/vmalloc.c:1436:6: warning: symbol 'vread' was not declared. Should it be static?
      mm/vmalloc.c:1474:6: warning: symbol 'vwrite' was not declared. Should it be static?
      
      However, it is used by /dev/kmem. fixed here.
      Signed-off-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      69beeb1d
    • K
      mm: make setup_per_zone_inactive_ratio() static · efab8186
      KOSAKI Motohiro 提交于
      Sparse output following warning.
      
      mm/page_alloc.c:4301:6: warning: symbol 'setup_per_zone_inactive_ratio' was not declared. Should it be static?
      
      cleanup here.
      Signed-off-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      efab8186
    • K
      mm: make scan_zone_unevictable_pages() static · 14b90b22
      KOSAKI Motohiro 提交于
      sparse output following warning
      
      	mm/vmscan.c:2507:6: warning: symbol 'scan_zone_unevictable_pages' was not declared. Should it be static?
      
      cleanup here.
      Signed-off-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      14b90b22
    • K
      mm: make scan_all_zones_unevictable_pages() static · ff30153b
      KOSAKI Motohiro 提交于
      sparse output following warning.
      
      	mm/vmscan.c:2549:6: warning: symbol 'scan_all_zones_unevictable_pages' was not declared. Should it be static?
      
      cleanup here.
      Signed-off-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ff30153b
    • K
      mm: make mem_cgroup_resize_limit() static · d38d2a75
      KOSAKI Motohiro 提交于
      Sparse output following warnings.
      
      mm/memcontrol.c:782:5: warning: symbol 'mem_cgroup_resize_limit' was not
      declared.  Should it be static?
      
      cleanup here.
      Signed-off-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d38d2a75
    • K
      mm: make maddr __iomem · 2bc7273b
      KOSAKI Motohiro 提交于
      sparse output following warnings.
      
      mm/memory.c:2936:8: warning: incorrect type in assignment (different address spaces)
      mm/memory.c:2936:8:    expected void *maddr
      mm/memory.c:2936:8:    got void [noderef] <asn:2>
      
      cleanup here.
      Signed-off-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2bc7273b
    • K
      mm: make init_section_page_cgroup() static · feb16694
      KOSAKI Motohiro 提交于
      Sparse output following warning.
      
      mm/page_cgroup.c:100:15: warning: symbol 'init_section_page_cgroup' was
      not declared.  Should it be static?
      
      cleanup here.
      Signed-off-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      feb16694
    • K
      memcg: reclaim shouldn't change zone->recent_rotated statistics · 077cbc58
      KOSAKI Motohiro 提交于
      memcg reclaim shouldn't change zone->recent_rotated statistics.  If
      memcgroup reclaim changes zone statistics, global reclaim can get a bit
      confused.
      Signed-off-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Acked-by: NRik van Riel <riel@redhat.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      077cbc58
    • H
      mm: optimize get_scan_ratio for no swap · b962716b
      Hugh Dickins 提交于
      Rik suggests a simplified get_scan_ratio() for !CONFIG_SWAP.  Yes, the gcc
      optimizer gives us that, when nr_swap_pages is #defined as 0L.  Move usual
      declaration to swapfile.c: it never belonged in page_alloc.c.
      Signed-off-by: NHugh Dickins <hugh@veritas.com>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Acked-by: NRik van Riel <riel@redhat.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Robin Holt <holt@sgi.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b962716b
    • H
      mm: add add_to_swap stub · 60371d97
      Hugh Dickins 提交于
      If we add a failing stub for add_to_swap(), then we can remove the #ifdef
      CONFIG_SWAP from mm/vmscan.c.
      
      This was intended as a source cleanup, but looking more closely, it turns
      out that the !CONFIG_SWAP case was going to keep_locked for an anonymous
      page, whereas now it goes to the more suitable activate_locked, like the
      CONFIG_SWAP nr_swap_pages 0 case.
      Signed-off-by: NHugh Dickins <hugh@veritas.com>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Acked-by: NRik van Riel <riel@redhat.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Robin Holt <holt@sgi.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      60371d97
    • H
      mm: remove gfp_mask from add_to_swap · ac47b003
      Hugh Dickins 提交于
      Remove gfp_mask argument from add_to_swap(): it's misleading because its
      only caller, shrink_page_list(), is not atomic at that point; and in due
      course (implementing discard) we'll sometimes want to allocate some memory
      with GFP_NOIO (as is used in swap_writepage) when allocating swap.
      
      No change to the gfp_mask passed down to add_to_swap_cache(): still use
      __GFP_HIGH without __GFP_WAIT (with nomemalloc and nowarn as before):
      though it's not obvious if that's the best combination to ask for here.
      Signed-off-by: NHugh Dickins <hugh@veritas.com>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Robin Holt <holt@sgi.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ac47b003
    • H
      mm: remove try_to_munlock from vmscan · 63d6c5ad
      Hugh Dickins 提交于
      An unfortunate feature of the Unevictable LRU work was that reclaiming an
      anonymous page involved an extra scan through the anon_vma: to check that
      the page is evictable before allocating swap, because the swap could not
      be freed reliably soon afterwards.
      
      Now try_to_free_swap() has replaced remove_exclusive_swap_page(), that's
      not an issue any more: remove try_to_munlock() call from
      shrink_page_list(), leaving it to try_to_munmap() to discover if the page
      is one to be culled to the unevictable list - in which case then
      try_to_free_swap().
      
      Update unevictable-lru.txt to remove comments on the try_to_munlock() in
      shrink_page_list(), and shorten some lines over 80 columns.
      Signed-off-by: NHugh Dickins <hugh@veritas.com>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Acked-by: NRik van Riel <riel@redhat.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Robin Holt <holt@sgi.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      63d6c5ad
    • H
      mm: try_to_unuse check removing right swap · 68bdc8d6
      Hugh Dickins 提交于
      There's a possible race in try_to_unuse() which Nick Piggin led me to two
      years ago.  Where it does lock_page() after read_swap_cache_async(), what
      if another task removed that page from swapcache just before we locked it?
      
      It would sail though the (*swap_map > 1) tests doing nothing (because it
      could not have been removed from swapcache before its swap references were
      gone), until it reaches the delete_from_swap_cache(page) near the bottom.
      
      Now imagine that this page has been allocated to swap on a different swap
      area while we dropped page lock (perhaps at the top, perhaps in unuse_mm):
      we could wrongly remove from swap cache before the page has been written
      to swap, so a subsequent do_swap_page() would read in stale data from
      swap.
      
      I think this case could not happen before: remove_exclusive_swap_page()
      refused while page count was raised.  But now with reuse_swap_page() and
      try_to_free_swap() removing from swap cache without minding page count, I
      think it could happen - the previous patch argued that it was safe because
      try_to_unuse() already ignored page count, but overlooked that it might be
      breaking the assumptions in try_to_unuse() itself.
      Signed-off-by: NHugh Dickins <hugh@veritas.com>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Robin Holt <holt@sgi.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      68bdc8d6
    • H
      mm: try_to_free_swap replaces remove_exclusive_swap_page · a2c43eed
      Hugh Dickins 提交于
      remove_exclusive_swap_page(): its problem is in living up to its name.
      
      It doesn't matter if someone else has a reference to the page (raised
      page_count); it doesn't matter if the page is mapped into userspace
      (raised page_mapcount - though that hints it may be worth keeping the
      swap): all that matters is that there be no more references to the swap
      (and no writeback in progress).
      
      swapoff (try_to_unuse) has been removing pages from swapcache for years,
      with no concern for page count or page mapcount, and we used to have a
      comment in lookup_swap_cache() recognizing that: if you go for a page of
      swapcache, you'll get the right page, but it could have been removed from
      swapcache by the time you get page lock.
      
      So, give up asking for exclusivity: get rid of
      remove_exclusive_swap_page(), and remove_exclusive_swap_page_ref() and
      remove_exclusive_swap_page_count() which were spawned for the recent LRU
      work: replace them by the simpler try_to_free_swap() which just checks
      page_swapcount().
      
      Similarly, remove the page_count limitation from free_swap_and_count(),
      but assume that it's worth holding on to the swap if page is mapped and
      swap nowhere near full.  Add a vm_swap_full() test in free_swap_cache()?
      It would be consistent, but I think we probably have enough for now.
      Signed-off-by: NHugh Dickins <hugh@veritas.com>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Robin Holt <holt@sgi.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a2c43eed
    • H
      mm: reuse_swap_page replaces can_share_swap_page · 7b1fe597
      Hugh Dickins 提交于
      A good place to free up old swap is where do_wp_page(), or do_swap_page(),
      is about to redirty the page: the data on disk is then stale and won't be
      read again; and if we do decide to write the page out later, using the
      previous swap location makes an unnecessary disk seek very likely.
      
      So give can_share_swap_page() the side-effect of delete_from_swap_cache()
      when it safely can.  And can_share_swap_page() was always a misleading
      name, the more so if it has a side-effect: rename it reuse_swap_page().
      
      Irrelevant cleanup nearby: remove swap_token_default_timeout definition
      from swap.h: it's used nowhere.
      Signed-off-by: NHugh Dickins <hugh@veritas.com>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Acked-by: NRik van Riel <riel@redhat.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Robin Holt <holt@sgi.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7b1fe597
    • H
      mm: wp lock page before deciding cow · ab967d86
      Hugh Dickins 提交于
      An application may rely on get_user_pages() to give it pages writable from
      userspace and shared with a driver, GUP breaking COW if necessary.  It may
      mprotect() the pages' writability, off and on, from time to time.
      
      Normally this works fine (so long as the app does not fork); but just
      occasionally, under memory pressure, a readonly pte in a newly writable
      area is COWed unnecessarily, breaking the link with the driver: because
      do_wp_page() does trylock_page, and falls back to COW whenever that fails.
      
      For reliable behaviour in the unshared case, when the trylock_page fails,
      now unlock pagetable, lock page and relock pagetable, before deciding
      whether Copy-On-Write is really necessary.
      
      Reported-by: Zhou Yingchao
      Signed-off-by: NHugh Dickins <hugh@veritas.com>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Robin Holt <holt@sgi.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ab967d86