1. 20 October 2008, 6 commits
    • mlock: mlocked pages are unevictable · b291f000
      Nick Piggin committed
      Make sure that mlocked pages also live on the unevictable LRU, so kswapd
      will not scan them over and over again.
      
      This is achieved through various strategies:
      
      1) add yet another page flag--PG_mlocked--to indicate that
         the page is locked for efficient testing in vmscan and,
         optionally, fault path.  This allows early culling of
         unevictable pages, preventing them from getting to
         page_referenced()/try_to_unmap().  Also allows separate
         accounting of mlock'd pages, as Nick's original patch
         did.
      
         Note:  Nick's original mlock patch used a PG_mlocked
         flag.  I had removed this in favor of the PG_unevictable
         flag + an mlock_count [new page struct member].  I
         restored the PG_mlocked flag to eliminate the new
         count field.
      
      2) add the mlock/unevictable infrastructure to mm/mlock.c,
         with internal APIs in mm/internal.h.  This is a rework
         of Nick's original patch to these files, taking into
         account that mlocked pages are now kept on unevictable
         LRU list.
      
      3) update vmscan.c:page_evictable() to check PageMlocked()
         and, if vma passed in, the vm_flags.  Note that the vma
         will only be passed in for new pages in the fault path;
         and then only if the "cull unevictable pages in fault
         path" patch is included.
      
      4) add try_to_unlock() to rmap.c to walk a page's rmap and
         ClearPageMlocked() if no other vmas have it mlocked.
         Reuses as much of try_to_unmap() as possible.  This
         effectively replaces the use of one of the lru list links
         as an mlock count.  If this mechanism lets pages in mlocked
         vmas leak through w/o PG_mlocked set [I don't know that it
         does], we should catch them later in try_to_unmap().  One
         hopes this will be rare, as it will be relatively expensive.
      
      Original mm/internal.h, mm/rmap.c and mm/mlock.c changes:
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      
      splitlru: introduce __get_user_pages():
      
        The new munlock processing needs GUP_FLAGS_IGNORE_VMA_PERMISSIONS,
        because the current get_user_pages() cannot grab PROT_NONE pages and
        therefore PROT_NONE pages could not be munlocked.
      
      [akpm@linux-foundation.org: fix this for pagemap-pass-mm-into-pagewalkers.patch]
      [akpm@linux-foundation.org: untangle patch interdependencies]
      [akpm@linux-foundation.org: fix things after out-of-order merging]
      [hugh@veritas.com: fix page-flags mess]
      [lee.schermerhorn@hp.com: fix munlock page table walk - now requires 'mm']
      [kosaki.motohiro@jp.fujitsu.com: build fix]
      [kosaki.motohiro@jp.fujitsu.com: fix truncate race and several comments]
      [kosaki.motohiro@jp.fujitsu.com: splitlru: introduce __get_user_pages()]
      Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: Dave Hansen <dave@linux.vnet.ibm.com>
      Cc: Matt Mackall <mpm@selenic.com>
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b291f000
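      A minimal sketch of the culling test described in points 1) and 3) above.
      PageMlocked(), VM_LOCKED and the optional vma argument follow the commit
      message; the body is illustrative, not the verbatim kernel patch:

        /*
         * Sketch: "is this page evictable?" in the spirit of
         * vmscan.c:page_evictable().  PageMlocked() and the VM_LOCKED test on
         * the (optional) vma allow early culling of mlocked pages before
         * page_referenced()/try_to_unmap() are reached.
         */
        static int page_evictable_sketch(struct page *page,
                                         struct vm_area_struct *vma)
        {
                if (PageMlocked(page))
                        return 0;       /* already marked unevictable */
                if (vma && (vma->vm_flags & VM_LOCKED))
                        return 0;       /* faulted into an mlocked vma */
                return 1;
        }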
    • Unevictable LRU Page Statistics · 7b854121
      Lee Schermerhorn committed
      Report unevictable pages per zone and system wide.
      
      Kosaki Motohiro added support for memory controller unevictable
      statistics.
      
      [riel@redhat.com: fix printk in show_free_areas()]
      [akpm@linux-foundation.org: fix units in /proc/vmstats]
      Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Signed-off-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Debugged-by: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7b854121
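      As a rough illustration of the kind of reporting this adds (NR_UNEVICTABLE
      is the counter name used in this series, but the print format here is an
      assumption, not copied from the patch):

        /* Sketch: a /proc/meminfo-style line for the global unevictable count. */
        seq_printf(m, "Unevictable:    %8lu kB\n",
                   global_page_state(NR_UNEVICTABLE) << (PAGE_SHIFT - 10));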
    • vmscan: second chance replacement for anonymous pages · 556adecb
      Rik van Riel committed
      We avoid evicting and scanning anonymous pages for the most part, but
      under some workloads we can end up with most of memory filled with
      anonymous pages.  At that point, we suddenly need to clear the referenced
      bits on all of memory, which can take ages on very large memory systems.
      
      We can reduce the maximum number of pages that need to be scanned by not
      taking the referenced state into account when deactivating an anonymous
      page.  After all, every anonymous page starts out referenced, so why
      check?
      
      If an anonymous page gets referenced again before it reaches the end of
      the inactive list, we move it back to the active list.
      
      To keep the maximum amount of necessary work reasonable, we scale the
      active to inactive ratio with the size of memory, using the formula
      active:inactive ratio = sqrt(memory in GB * 10).
      
      Kswapd CPU use now seems to scale by the amount of pageout bandwidth,
      instead of by the amount of memory present in the system.
      
      [kamezawa.hiroyu@jp.fujitsu.com: fix OOM with memcg]
      [kamezawa.hiroyu@jp.fujitsu.com: memcg: lru scan fix]
      Signed-off-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      556adecb
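      A sketch of the sizing rule described above, scaling the active:inactive
      ratio with the square root of zone size; int_sqrt(), present_pages and the
      per-zone inactive_ratio field are assumed as in the kernel of that era:

        /* Sketch: active:inactive ratio ~ sqrt(10 * zone size in GB). */
        static void set_inactive_ratio_sketch(struct zone *zone)
        {
                unsigned int gb, ratio;

                gb = zone->present_pages >> (30 - PAGE_SHIFT);  /* zone size in GB */
                ratio = int_sqrt(10 * gb);
                if (!ratio)
                        ratio = 1;      /* never let the inactive list vanish */

                zone->inactive_ratio = ratio;
        }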
    • vmscan: split LRU lists into anon & file sets · 4f98a2fe
      Rik van Riel committed
      Split the LRU lists in two, one set for pages that are backed by real file
      systems ("file") and one for pages that are backed by memory and swap
      ("anon").  The latter includes tmpfs.
      
      The advantage of doing this is that the VM will not have to scan over lots
      of anonymous pages (which we generally do not want to swap out), just to
      find the page cache pages that it should evict.
      
      This patch has the infrastructure and a basic policy to balance how much
      we scan the anon lists and how much we scan the file lists.  The big
      policy changes are in separate patches.
      
      [lee.schermerhorn@hp.com: collect lru meminfo statistics from correct offset]
      [kosaki.motohiro@jp.fujitsu.com: prevent incorrect oom under split_lru]
      [kosaki.motohiro@jp.fujitsu.com: fix pagevec_move_tail() doesn't treat unevictable page]
      [hugh@veritas.com: memcg swapbacked pages active]
      [hugh@veritas.com: splitlru: BDI_CAP_SWAP_BACKED]
      [akpm@linux-foundation.org: fix /proc/vmstat units]
      [nishimura@mxp.nes.nec.co.jp: memcg: fix handling of shmem migration]
      [kosaki.motohiro@jp.fujitsu.com: adjust Quicklists field of /proc/meminfo]
      [kosaki.motohiro@jp.fujitsu.com: fix style issue of get_scan_ratio()]
      Signed-off-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4f98a2fe
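      The effect on page placement can be sketched as follows; the LRU_* list
      names are those of the split-LRU series, and the helper itself is
      illustrative rather than the exact function added by the patch:

        /*
         * Sketch: pick the LRU list for a page.  Swap-backed pages (anon,
         * tmpfs) go to the anon lists, everything else to the file lists;
         * PageActive() selects active vs. inactive.
         */
        static enum lru_list page_lru_sketch(struct page *page)
        {
                if (PageSwapBacked(page))
                        return PageActive(page) ? LRU_ACTIVE_ANON
                                                : LRU_INACTIVE_ANON;
                return PageActive(page) ? LRU_ACTIVE_FILE : LRU_INACTIVE_FILE;
        }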
    • define page_file_cache() function · b2e18538
      Rik van Riel committed
      Define page_file_cache() function to answer the question:
      	is page backed by a file?
      
      Originally part of Rik van Riel's split-lru patch.  Extracted to make
      available for other, independent reclaim patches.
      
      Moved inline function to linux/mm_inline.h where it will be needed by
      subsequent "split LRU" and "noreclaim" patches.
      
      Unfortunately this needs to use a page flag, since the PG_swapbacked state
      needs to be preserved all the way to the point where the page is last
      removed from the LRU.  Trying to derive the status from other info in the
      page resulted in wrong VM statistics in earlier split VM patchsets.
      
      The total number of page flags in use on a 32 bit machine after this patch
      is 19.
      
      [akpm@linux-foundation.org: fix up out-of-order merge fallout]
      [hugh@veritas.com: splitlru: shmem_getpage SetPageSwapBacked sooner]
      Signed-off-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Signed-off-by: MinChan Kim <minchan.kim@gmail.com>
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b2e18538
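      A minimal sketch of the helper this commit describes, keyed off the new
      PG_swapbacked flag (the exact name and return convention in the real patch
      may differ slightly):

        /*
         * Sketch: is this page backed by a real file (page cache) rather than
         * by memory/swap?  Relies on PG_swapbacked being set when the page is
         * created and preserved until it finally leaves the LRU.
         */
        static inline int page_file_cache_sketch(struct page *page)
        {
                return !PageSwapBacked(page);
        }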
    • vmscan: Use an indexed array for LRU variables · b69408e8
      Christoph Lameter committed
      Currently we are defining explicit variables for the inactive and active
      list.  An indexed array can be more generic and avoid repeating similar
      code in several places in the reclaim code.
      
      We are saving a few bytes in terms of code size:
      
      Before:
      
         text    data     bss     dec     hex filename
      4097753  573120 4092484 8763357  85b7dd vmlinux
      
      After:
      
         text    data     bss     dec     hex filename
      4097729  573120 4092484 8763333  85b7c5 vmlinux
      
      Having an easy way to add new lru lists may ease future work on the
      reclaim code.
      Signed-off-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
      Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b69408e8
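      The shape of the change can be sketched like this; the enum values mirror
      the anon/file split from the commit above, and the struct and macro names
      here are illustrative, not the exact identifiers in the patch:

        /* Sketch: one indexed array of LRU lists instead of explicit members. */
        enum lru_list_sketch {
                LRU_INACTIVE_ANON,
                LRU_ACTIVE_ANON,
                LRU_INACTIVE_FILE,
                LRU_ACTIVE_FILE,
                NR_LRU_LISTS
        };

        struct lru_zone_sketch {
                struct list_head lists[NR_LRU_LISTS];
                unsigned long    nr_pages[NR_LRU_LISTS];
        };

        /* Reclaim code can now loop over every list instead of repeating
         * near-identical code for each one. */
        #define for_each_lru_sketch(l) for ((l) = 0; (l) < NR_LRU_LISTS; (l)++)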
  2. 17 October 2008, 1 commit
  3. 03 October 2008, 1 commit
    • mm: handle initialising compound pages at orders greater than MAX_ORDER · 6babc32c
      Andy Whitcroft committed
      When we initialise a compound page we initialise the page flags and head
      page pointer for all base pages spanned by that page.  When we initialise
      a gigantic page (a page of order greater than or equal to MAX_ORDER) we
      have to initialise more than MAX_ORDER_NR_PAGES pages.  Currently we
      assume that all elements of the mem_map in this page are contiguous in
      memory.  However this is only guaranteed out to MAX_ORDER_NR_PAGES pages,
      and with SPARSEMEM enabled they will not be contiguous.  This leads us to
      walk off the end of the first section and scribble on everything which
      follows, BAD.
      
      When we reach a MAX_ORDER_NR_PAGES boundary we must locate the next
      section of the mem_map.  As gigantic pages can only be maximally aligned,
      we know this will occur at exact multiples of MAX_ORDER_NR_PAGES pages from
      the start of the page.
      
      This is a bug fix for the gigantic page support in hugetlbfs.
      
      Credit to Mel Gorman for spotting the issue.
      Signed-off-by: Andy Whitcroft <apw@shadowen.org>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Jon Tollefson <kniht@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6babc32c
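      A sketch of the boundary handling described above: when initialising the
      tail pages of a gigantic page, re-derive the struct page pointer from the
      pfn at every MAX_ORDER_NR_PAGES boundary instead of assuming the mem_map
      is contiguous.  Illustrative only; the tail-page setup details are assumed,
      not copied from the fix:

        /*
         * Sketch: initialise the tail pages of an order-'order' compound page.
         * With SPARSEMEM the mem_map is only guaranteed contiguous within
         * MAX_ORDER_NR_PAGES, so step via pfn_to_page() at each boundary.
         */
        static void prep_compound_tails_sketch(struct page *head, unsigned int order)
        {
                unsigned long i, nr_pages = 1UL << order;
                struct page *p = head + 1;

                for (i = 1; i < nr_pages; i++, p++) {
                        if ((i & (MAX_ORDER_NR_PAGES - 1)) == 0)
                                p = pfn_to_page(page_to_pfn(head) + i);
                        __SetPageTail(p);
                        p->first_page = head;
                }
        }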
  4. 03 September 2008, 2 commits
  5. 13 August 2008, 1 commit
  6. 31 July 2008, 1 commit
  7. 28 July 2008, 1 commit
  8. 25 July 2008, 12 commits
  9. 08 July 2008, 6 commits
  10. 04 July 2008, 1 commit
  11. 26 June 2008, 1 commit
  12. 10 June 2008, 2 commits
  13. 03 June 2008, 1 commit
  14. 25 May 2008, 3 commits
    • memory hotplug: fix early allocation handling · cd94b9db
      Heiko Carstens committed
      Trying to add memory via add_memory() from within an initcall function
      results in
      
      bootmem alloc of 163840 bytes failed!
      Kernel panic - not syncing: Out of memory
      
      This is caused by zone_wait_table_init() which uses system_state to decide
      if it should use the bootmem allocator or not.
      
      When initcalls are handled the system_state is still SYSTEM_BOOTING but
      the bootmem allocator doesn't work anymore.  So the allocation will fail.
      
      To fix this, use slab_is_available() as the indicator instead, as we do
      everywhere else.
      
      [akpm@linux-foundation.org: coding-style fix]
      Reviewed-by: Andy Whitcroft <apw@shadowen.org>
      Cc: Dave Hansen <haveblue@us.ibm.com>
      Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: Yasunori Goto <y-goto@jp.fujitsu.com>
      Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      cd94b9db
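      The shape of the fix, sketched: choose the allocator by what is actually
      usable rather than by system_state.  The surrounding variables are
      illustrative; only the slab_is_available() vs. bootmem decision is the
      point:

        /*
         * Sketch: allocate the zone wait table.  Early in boot only bootmem
         * works; once the slab is up (including memory added from an
         * initcall), use the regular allocators.
         */
        if (slab_is_available())
                table = vmalloc(table_size);
        else
                table = alloc_bootmem_node(pgdat, table_size);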
    • zonelists: handle a node zonelist with no applicable entries · 7eb54824
      Andy Whitcroft committed
      When booting 2.6.26-rc3 on a multi-node x86_32 numa system we are seeing
      panics when trying node local allocations:
      
       BUG: unable to handle kernel NULL pointer dereference at 0000034c
       IP: [<c1042507>] get_page_from_freelist+0x4a/0x18e
       *pdpt = 00000000013a7001 *pde = 0000000000000000
       Oops: 0000 [#1] SMP
       Modules linked in:
      
       Pid: 0, comm: swapper Not tainted (2.6.26-rc3-00003-g5abc28d #82)
       EIP: 0060:[<c1042507>] EFLAGS: 00010282 CPU: 0
       EIP is at get_page_from_freelist+0x4a/0x18e
       EAX: c1371ed8 EBX: 00000000 ECX: 00000000 EDX: 00000000
       ESI: f7801180 EDI: 00000000 EBP: 00000000 ESP: c1371ec0
        DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
       Process swapper (pid: 0, ti=c1370000 task=c12f5b40 task.ti=c1370000)
       Stack: 00000000 00000000 00000000 00000000 000612d0 000412d0 00000000 000412d0
              f7801180 f7c0101c f7c01018 c10426e4 f7c01018 00000001 00000044 00000000
              00000001 c12f5b40 00000001 00000010 00000000 000412d0 00000286 000412d0
       Call Trace:
        [<c10426e4>] __alloc_pages_internal+0x99/0x378
        [<c10429ca>] __alloc_pages+0x7/0x9
        [<c105e0e8>] kmem_getpages+0x66/0xef
        [<c105ec55>] cache_grow+0x8f/0x123
        [<c105f117>] ____cache_alloc_node+0xb9/0xe4
        [<c105f427>] kmem_cache_alloc_node+0x92/0xd2
        [<c122118c>] setup_cpu_cache+0xaf/0x177
        [<c105e6ca>] kmem_cache_create+0x2c8/0x353
        [<c13853af>] kmem_cache_init+0x1ce/0x3ad
        [<c13755c5>] start_kernel+0x178/0x1ee
      
      This occurs when we are scanning the zonelists looking for a ZONE_NORMAL
      page.  In this system there is only ZONE_DMA and ZONE_NORMAL memory on
      node 0, all other nodes are mapped above 4GB physical.  Here is a dump
      of the zonelists from this system:
      
          zonelists pgdat=c1400000
           0: c14006c0:2 f7c006c0:2 f7e006c0:2 c1400360:1 c1400000:0
           1: c14006c0:2 c1400360:1 c1400000:0
          zonelists pgdat=f7c00000
           0: f7c006c0:2 f7e006c0:2 c14006c0:2 c1400360:1 c1400000:0
           1: f7c006c0:2
          zonelists pgdat=f7e00000
           0: f7e006c0:2 c14006c0:2 f7c006c0:2 c1400360:1 c1400000:0
           1: f7e006c0:2
      
      When performing a node local allocation we call get_page_from_freelist()
      looking for a page.  It in turn calls first_zones_zonelist() which returns
      a preferred_zone.  Where there are no applicable zones this will be NULL.
      However we use this unconditionally, leading to this panic.
      
      Where there are no applicable zones there is no possibility of a successful
      allocation, so simply fail the allocation.
      Signed-off-by: Andy Whitcroft <apw@shadowen.org>
      Acked-by: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7eb54824
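      The essence of the fix, sketched with the function names that appear in
      the trace above (illustrative, not the exact diff):

        /*
         * Sketch: in get_page_from_freelist(), the preferred zone comes from
         * first_zones_zonelist().  If the zonelist has no applicable zone,
         * there is nothing to allocate from, so fail the allocation instead
         * of dereferencing a NULL preferred_zone.
         */
        (void)first_zones_zonelist(zonelist, high_zoneidx, nodemask,
                                   &preferred_zone);
        if (!preferred_zone)
                return NULL;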
    • mm: don't drop a partial page in a zone's memory map size · f7232154
      Johannes Weiner committed
      In a zone's count of present pages, account for all pages occupied by the
      memory map, including a trailing partial page.
      Signed-off-by: Johannes Weiner <hannes@saeurebad.de>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f7232154
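      A sketch of the accounting change: round the memory map size up to whole
      pages before subtracting it from the zone's page count (the variable names
      here are assumptions):

        /*
         * Sketch: pages consumed by the struct page array for this zone,
         * counting a trailing partial page as a whole page.
         */
        memmap_pages = PAGE_ALIGN(spanned_pages * sizeof(struct page)) >> PAGE_SHIFT;
        realsize -= memmap_pages;       /* what remains for actual data */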
  15. 15 May 2008, 1 commit
    • memory_hotplug: always initialize pageblock bitmap · 76cdd58e
      Heiko Carstens committed
      Trying to online a new memory section that was added via memory hotplug
      sometimes results in crashes when the new pages are added via __free_page.
      The reason is that the pageblock bitmap isn't initialized and hence
      contains random data.  That means get_pageblock_migratetype() also
      returns random values, and therefore
      
      	list_add(&page->lru,
      		&zone->free_area[order].free_list[migratetype]);
      
      in __free_one_page() tries to do a list_add to something that isn't even
      necessarily a list.
      
      This happens since 86051ca5 ("mm: fix
      usemap initialization") which makes sure that the pageblock bitmap gets
      only initialized for pages present in a zone.  Unfortunately for hot-added
      memory the zones "grow" after the memmap and the pageblock bitmap have
      been initialized.  This means that the new pages have an uninitialized
      bitmap.  To solve this, the calls to grow_zone_span() and grow_pgdat_span()
      are moved to __add_zone() just before the initialization happens.
      
      The patch also moves the two functions since __add_zone() is the only
      caller and I didn't want to add a forward declaration.
      Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Cc: Dave Hansen <haveblue@us.ibm.com>
      Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
      Cc: <stable@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      76cdd58e
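      The ordering the patch establishes can be sketched as follows, using the
      function names from the message above (illustrative, not the verbatim
      diff):

        /*
         * Sketch of __add_zone(): grow the zone and pgdat spans first, so the
         * hot-added pfn range lies inside the zone, and only then initialise
         * the memmap, which (since 86051ca5) sets up the pageblock bitmap
         * only for pages that lie within a zone.
         */
        grow_zone_span(zone, phys_start_pfn, phys_start_pfn + nr_pages);
        grow_pgdat_span(zone->zone_pgdat, phys_start_pfn, phys_start_pfn + nr_pages);
        memmap_init_zone(nr_pages, nid, zone_idx(zone), phys_start_pfn,
                         MEMMAP_HOTPLUG);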