1. 22 9月, 2009 1 次提交
  2. 17 6月, 2009 4 次提交
  3. 18 5月, 2009 1 次提交
    • M
      [ARM] Double check memmap is actually valid with a memmap has unexpected holes V2 · eb33575c
      Mel Gorman 提交于
      pfn_valid() is meant to be able to tell if a given PFN has valid memmap
      associated with it or not. In FLATMEM, it is expected that holes always
      have valid memmap as long as there is valid PFNs either side of the hole.
      In SPARSEMEM, it is assumed that a valid section has a memmap for the
      entire section.
      
      However, ARM and maybe other embedded architectures in the future free
      memmap backing holes to save memory on the assumption the memmap is never
      used. The page_zone linkages are then broken even though pfn_valid()
      returns true. A walker of the full memmap must then do this additional
      check to ensure the memmap they are looking at is sane by making sure the
      zone and PFN linkages are still valid. This is expensive, but walkers of
      the full memmap are extremely rare.
      
      This was caught before for FLATMEM and hacked around but it hits again for
      SPARSEMEM because the page_zone linkages can look ok where the PFN linkages
      are totally screwed. This looks like a hatchet job but the reality is that
      any clean solution would end up consumning all the memory saved by punching
      these unexpected holes in the memmap. For example, we tried marking the
      memmap within the section invalid but the section size exceeds the size of
      the hole in most cases so pfn_valid() starts returning false where valid
      memmap exists. Shrinking the size of the section would increase memory
      consumption offsetting the gains.
      
      This patch identifies when an architecture is punching unexpected holes
      in the memmap that the memory model cannot automatically detect and sets
      ARCH_HAS_HOLES_MEMORYMODEL. At the moment, this is restricted to EP93xx
      which is the model sub-architecture this has been reported on but may expand
      later. When set, walkers of the full memmap must call memmap_valid_within()
      for each PFN and passing in what it expects the page and zone to be for
      that PFN. If it finds the linkages to be broken, it assumes the memmap is
      invalid for that PFN.
      Signed-off-by: NMel Gorman <mel@csn.ul.ie>
      Signed-off-by: NRussell King <rmk+kernel@arm.linux.org.uk>
      eb33575c
  4. 01 4月, 2009 1 次提交
  5. 13 3月, 2009 1 次提交
    • R
      numa, cpumask: move numa_node_id default implementation to topology.h · 082edb7b
      Rusty Russell 提交于
      Impact: cleanup, potential bugfix
      
      Not sure what changed to expose this, but clearly that numa_node_id()
      doesn't belong in mmzone.h (the inline in gfp.h is probably overkill, too).
      
      In file included from include/linux/topology.h:34,
                       from arch/x86/mm/numa.c:2:
      /home/rusty/patches-cpumask/linux-2.6/arch/x86/include/asm/topology.h:64:1: warning: "numa_node_id" redefined
      In file included from include/linux/topology.h:32,
                       from arch/x86/mm/numa.c:2:
      include/linux/mmzone.h:770:1: warning: this is the location of the previous definition
      Signed-off-by: NRusty Russell <rusty@rustcorp.com.au>
      Cc: Mike Travis <travis@sgi.com>
      LKML-Reference: <200903132343.37661.rusty@rustcorp.com.au>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      082edb7b
  6. 19 2月, 2009 1 次提交
  7. 09 1月, 2009 1 次提交
  8. 20 10月, 2008 6 次提交
    • K
      memcg: allocate all page_cgroup at boot · 52d4b9ac
      KAMEZAWA Hiroyuki 提交于
      Allocate all page_cgroup at boot and remove page_cgroup poitner from
      struct page.  This patch adds an interface as
      
       struct page_cgroup *lookup_page_cgroup(struct page*)
      
      All FLATMEM/DISCONTIGMEM/SPARSEMEM  and MEMORY_HOTPLUG is supported.
      
      Remove page_cgroup pointer reduces the amount of memory by
       - 4 bytes per PAGE_SIZE.
       - 8 bytes per PAGE_SIZE
      if memory controller is disabled. (even if configured.)
      
      On usual 8GB x86-32 server, this saves 8MB of NORMAL_ZONE memory.
      On my x86-64 server with 48GB of memory, this saves 96MB of memory.
      I think this reduction makes sense.
      
      By pre-allocation, kmalloc/kfree in charge/uncharge are removed.
      This means
        - we're not necessary to be afraid of kmalloc faiulre.
          (this can happen because of gfp_mask type.)
        - we can avoid calling kmalloc/kfree.
        - we can avoid allocating tons of small objects which can be fragmented.
        - we can know what amount of memory will be used for this extra-lru handling.
      
      I added printk message as
      
      	"allocated %ld bytes of page_cgroup"
              "please try cgroup_disable=memory option if you don't want"
      
      maybe enough informative for users.
      Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reviewed-by: NBalbir Singh <balbir@linux.vnet.ibm.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      52d4b9ac
    • N
      vmstat: mlocked pages statistics · 5344b7e6
      Nick Piggin 提交于
      Add NR_MLOCK zone page state, which provides a (conservative) count of
      mlocked pages (actually, the number of mlocked pages moved off the LRU).
      
      Reworked by lts to fit in with the modified mlock page support in the
      Reclaim Scalability series.
      
      [kosaki.motohiro@jp.fujitsu.com: fix incorrect Mlocked field of /proc/meminfo]
      [lee.schermerhorn@hp.com: mlocked-pages: add event counting with statistics]
      Signed-off-by: NNick Piggin <npiggin@suse.de>
      Signed-off-by: NLee Schermerhorn <lee.schermerhorn@hp.com>
      Signed-off-by: NRik van Riel <riel@redhat.com>
      Signed-off-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5344b7e6
    • L
      Unevictable LRU Infrastructure · 894bc310
      Lee Schermerhorn 提交于
      When the system contains lots of mlocked or otherwise unevictable pages,
      the pageout code (kswapd) can spend lots of time scanning over these
      pages.  Worse still, the presence of lots of unevictable pages can confuse
      kswapd into thinking that more aggressive pageout modes are required,
      resulting in all kinds of bad behaviour.
      
      Infrastructure to manage pages excluded from reclaim--i.e., hidden from
      vmscan.  Based on a patch by Larry Woodman of Red Hat.  Reworked to
      maintain "unevictable" pages on a separate per-zone LRU list, to "hide"
      them from vmscan.
      
      Kosaki Motohiro added the support for the memory controller unevictable
      lru list.
      
      Pages on the unevictable list have both PG_unevictable and PG_lru set.
      Thus, PG_unevictable is analogous to and mutually exclusive with
      PG_active--it specifies which LRU list the page is on.
      
      The unevictable infrastructure is enabled by a new mm Kconfig option
      [CONFIG_]UNEVICTABLE_LRU.
      
      A new function 'page_evictable(page, vma)' in vmscan.c tests whether or
      not a page may be evictable.  Subsequent patches will add the various
      !evictable tests.  We'll want to keep these tests light-weight for use in
      shrink_active_list() and, possibly, the fault path.
      
      To avoid races between tasks putting pages [back] onto an LRU list and
      tasks that might be moving the page from non-evictable to evictable state,
      the new function 'putback_lru_page()' -- inverse to 'isolate_lru_page()'
      -- tests the "evictability" of a page after placing it on the LRU, before
      dropping the reference.  If the page has become unevictable,
      putback_lru_page() will redo the 'putback', thus moving the page to the
      unevictable list.  This way, we avoid "stranding" evictable pages on the
      unevictable list.
      
      [akpm@linux-foundation.org: fix fallout from out-of-order merge]
      [riel@redhat.com: fix UNEVICTABLE_LRU and !PROC_PAGE_MONITOR build]
      [nishimura@mxp.nes.nec.co.jp: remove redundant mapping check]
      [kosaki.motohiro@jp.fujitsu.com: unevictable-lru-infrastructure: putback_lru_page()/unevictable page handling rework]
      [kosaki.motohiro@jp.fujitsu.com: kill unnecessary lock_page() in vmscan.c]
      [kosaki.motohiro@jp.fujitsu.com: revert migration change of unevictable lru infrastructure]
      [kosaki.motohiro@jp.fujitsu.com: revert to unevictable-lru-infrastructure-kconfig-fix.patch]
      [kosaki.motohiro@jp.fujitsu.com: restore patch failure of vmstat-unevictable-and-mlocked-pages-vm-events.patch]
      Signed-off-by: NLee Schermerhorn <lee.schermerhorn@hp.com>
      Signed-off-by: NRik van Riel <riel@redhat.com>
      Signed-off-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Debugged-by: NBenjamin Kidwell <benjkidwell@yahoo.com>
      Signed-off-by: NDaisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      894bc310
    • R
      vmscan: second chance replacement for anonymous pages · 556adecb
      Rik van Riel 提交于
      We avoid evicting and scanning anonymous pages for the most part, but
      under some workloads we can end up with most of memory filled with
      anonymous pages.  At that point, we suddenly need to clear the referenced
      bits on all of memory, which can take ages on very large memory systems.
      
      We can reduce the maximum number of pages that need to be scanned by not
      taking the referenced state into account when deactivating an anonymous
      page.  After all, every anonymous page starts out referenced, so why
      check?
      
      If an anonymous page gets referenced again before it reaches the end of
      the inactive list, we move it back to the active list.
      
      To keep the maximum amount of necessary work reasonable, we scale the
      active to inactive ratio with the size of memory, using the formula
      active:inactive ratio = sqrt(memory in GB * 10).
      
      Kswapd CPU use now seems to scale by the amount of pageout bandwidth,
      instead of by the amount of memory present in the system.
      
      [kamezawa.hiroyu@jp.fujitsu.com: fix OOM with memcg]
      [kamezawa.hiroyu@jp.fujitsu.com: memcg: lru scan fix]
      Signed-off-by: NRik van Riel <riel@redhat.com>
      Signed-off-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      556adecb
    • R
      vmscan: split LRU lists into anon & file sets · 4f98a2fe
      Rik van Riel 提交于
      Split the LRU lists in two, one set for pages that are backed by real file
      systems ("file") and one for pages that are backed by memory and swap
      ("anon").  The latter includes tmpfs.
      
      The advantage of doing this is that the VM will not have to scan over lots
      of anonymous pages (which we generally do not want to swap out), just to
      find the page cache pages that it should evict.
      
      This patch has the infrastructure and a basic policy to balance how much
      we scan the anon lists and how much we scan the file lists.  The big
      policy changes are in separate patches.
      
      [lee.schermerhorn@hp.com: collect lru meminfo statistics from correct offset]
      [kosaki.motohiro@jp.fujitsu.com: prevent incorrect oom under split_lru]
      [kosaki.motohiro@jp.fujitsu.com: fix pagevec_move_tail() doesn't treat unevictable page]
      [hugh@veritas.com: memcg swapbacked pages active]
      [hugh@veritas.com: splitlru: BDI_CAP_SWAP_BACKED]
      [akpm@linux-foundation.org: fix /proc/vmstat units]
      [nishimura@mxp.nes.nec.co.jp: memcg: fix handling of shmem migration]
      [kosaki.motohiro@jp.fujitsu.com: adjust Quicklists field of /proc/meminfo]
      [kosaki.motohiro@jp.fujitsu.com: fix style issue of get_scan_ratio()]
      Signed-off-by: NRik van Riel <riel@redhat.com>
      Signed-off-by: NLee Schermerhorn <Lee.Schermerhorn@hp.com>
      Signed-off-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: NHugh Dickins <hugh@veritas.com>
      Signed-off-by: NDaisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4f98a2fe
    • C
      vmscan: Use an indexed array for LRU variables · b69408e8
      Christoph Lameter 提交于
      Currently we are defining explicit variables for the inactive and active
      list.  An indexed array can be more generic and avoid repeating similar
      code in several places in the reclaim code.
      
      We are saving a few bytes in terms of code size:
      
      Before:
      
         text    data     bss     dec     hex filename
      4097753  573120 4092484 8763357  85b7dd vmlinux
      
      After:
      
         text    data     bss     dec     hex filename
      4097729  573120 4092484 8763333  85b7c5 vmlinux
      
      Having an easy way to add new lru lists may ease future work on the
      reclaim code.
      Signed-off-by: NRik van Riel <riel@redhat.com>
      Signed-off-by: NLee Schermerhorn <lee.schermerhorn@hp.com>
      Signed-off-by: NChristoph Lameter <cl@linux-foundation.org>
      Signed-off-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b69408e8
  9. 14 9月, 2008 1 次提交
    • M
      mm: mark the correct zone as full when scanning zonelists · 5bead2a0
      Mel Gorman 提交于
      The iterator for_each_zone_zonelist() uses a struct zoneref *z cursor when
      scanning zonelists to keep track of where in the zonelist it is.  The
      zoneref that is returned corresponds to the the next zone that is to be
      scanned, not the current one.  It was intended to be treated as an opaque
      list.
      
      When the page allocator is scanning a zonelist, it marks elements in the
      zonelist corresponding to zones that are temporarily full.  As the
      zonelist is being updated, it uses the cursor here;
      
        if (NUMA_BUILD)
              zlc_mark_zone_full(zonelist, z);
      
      This is intended to prevent rescanning in the near future but the zoneref
      cursor does not correspond to the zone that has been found to be full.
      This is an easy misunderstanding to make so this patch corrects the
      problem by changing zoneref cursor to be the current zone being scanned
      instead of the next one.
      Signed-off-by: NMel Gorman <mel@csn.ul.ie>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: <stable@kernel.org>		[2.6.26.x]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5bead2a0
  10. 25 5月, 2008 1 次提交
  11. 30 4月, 2008 2 次提交
  12. 28 4月, 2008 8 次提交
    • Y
      memory hotplug: register section/node id to free · 04753278
      Yasunori Goto 提交于
      This patch set is to free pages which is allocated by bootmem for
      memory-hotremove.  Some structures of memory management are allocated by
      bootmem.  ex) memmap, etc.
      
      To remove memory physically, some of them must be freed according to
      circumstance.  This patch set makes basis to free those pages, and free
      memmaps.
      
      Basic my idea is using remain members of struct page to remember information
      of users of bootmem (section number or node id).  When the section is
      removing, kernel can confirm it.  By this information, some issues can be
      solved.
      
        1) When the memmap of removing section is allocated on other
           section by bootmem, it should/can be free.
        2) When the memmap of removing section is allocated on the
           same section, it shouldn't be freed. Because the section has to be
           logical memory offlined already and all pages must be isolated against
           page allocater. If it is freed, page allocator may use it which will
           be removed physically soon.
        3) When removing section has other section's memmap,
           kernel will be able to show easily which section should be removed
           before it for user. (Not implemented yet)
        4) When the above case 2), the page isolation will be able to check and skip
           memmap's page when logical memory offline (offline_pages()).
           Current page isolation code fails in this case because this page is
           just reserved page and it can't distinguish this pages can be
           removed or not. But, it will be able to do by this patch.
           (Not implemented yet.)
        5) The node information like pgdat has similar issues. But, this
           will be able to be solved too by this.
           (Not implemented yet, but, remembering node id in the pages.)
      
      Fortunately, current bootmem allocator just keeps PageReserved flags,
      and doesn't use any other members of page struct. The users of
      bootmem doesn't use them too.
      
      This patch:
      
      This is to register information which is node or section's id.  Kernel can
      distinguish which node/section uses the pages allcated by bootmem.  This is
      basis for hot-remove sections or nodes.
      Signed-off-by: NYasunori Goto <y-goto@jp.fujitsu.com>
      Cc: Badari Pulavarty <pbadari@us.ibm.com>
      Cc: Yinghai Lu <yhlu.kernel@gmail.com>
      Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      04753278
    • C
      mm: Get rid of __ZONE_COUNT · 97965478
      Christoph Lameter 提交于
      It was used to compensate because MAX_NR_ZONES was not available to the
      #ifdefs.  Export MAX_NR_ZONES via the new mechanism and get rid of
      __ZONE_COUNT.
      Signed-off-by: NChristoph Lameter <clameter@sgi.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      97965478
    • C
      pageflags: get rid of FLAGS_RESERVED · 9223b419
      Christoph Lameter 提交于
      NR_PAGEFLAGS specifies the number of page flags we are using.  From that we
      can calculate the number of bits leftover that can be used for zone, node (and
      maybe the sections id).  There is no need anymore for FLAGS_RESERVED if we use
      NR_PAGEFLAGS.
      
      Use the new methods to make NR_PAGEFLAGS available via the preprocessor.
      NR_PAGEFLAGS is used to calculate field boundaries in the page flags fields.
      These field widths have to be available to the preprocessor.
      Signed-off-by: NChristoph Lameter <clameter@sgi.com>
      Cc: David Miller <davem@davemloft.net>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Jeremy Fitzhardinge <jeremy@goop.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9223b419
    • A
      mm: make early_pfn_to_nid() a C function · b4544568
      Andrew Morton 提交于
      Fix this (sparc64)
      
      mm/sparse-vmemmap.c: In function `vmemmap_verify':
      mm/sparse-vmemmap.c:64: warning: unused variable `pfn'
      
      by switching to a C function which touches its arg.
      
      (reason 3,555 why macros are bad)
      
      Also, the `nid' arg was misnamed.
      Reviewed-by: NChristoph Lameter <clameter@sgi.com>
      Acked-by: NAndy Whitcroft <apw@shadowen.org>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Andi Kleen <ak@suse.de>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b4544568
    • M
      mm: filter based on a nodemask as well as a gfp_mask · 19770b32
      Mel Gorman 提交于
      The MPOL_BIND policy creates a zonelist that is used for allocations
      controlled by that mempolicy.  As the per-node zonelist is already being
      filtered based on a zone id, this patch adds a version of __alloc_pages() that
      takes a nodemask for further filtering.  This eliminates the need for
      MPOL_BIND to create a custom zonelist.
      
      A positive benefit of this is that allocations using MPOL_BIND now use the
      local node's distance-ordered zonelist instead of a custom node-id-ordered
      zonelist.  I.e., pages will be allocated from the closest allowed node with
      available memory.
      
      [Lee.Schermerhorn@hp.com: Mempolicy: update stale documentation and comments]
      [Lee.Schermerhorn@hp.com: Mempolicy: make dequeue_huge_page_vma() obey MPOL_BIND nodemask]
      [Lee.Schermerhorn@hp.com: Mempolicy: make dequeue_huge_page_vma() obey MPOL_BIND nodemask rework]
      Signed-off-by: NMel Gorman <mel@csn.ul.ie>
      Acked-by: NChristoph Lameter <clameter@sgi.com>
      Signed-off-by: NLee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      19770b32
    • M
      mm: have zonelist contains structs with both a zone pointer and zone_idx · dd1a239f
      Mel Gorman 提交于
      Filtering zonelists requires very frequent use of zone_idx().  This is costly
      as it involves a lookup of another structure and a substraction operation.  As
      the zone_idx is often required, it should be quickly accessible.  The node idx
      could also be stored here if it was found that accessing zone->node is
      significant which may be the case on workloads where nodemasks are heavily
      used.
      
      This patch introduces a struct zoneref to store a zone pointer and a zone
      index.  The zonelist then consists of an array of these struct zonerefs which
      are looked up as necessary.  Helpers are given for accessing the zone index as
      well as the node index.
      
      [kamezawa.hiroyu@jp.fujitsu.com: Suggested struct zoneref instead of embedding information in pointers]
      [hugh@veritas.com: mm-have-zonelist: fix memcg ooms]
      [hugh@veritas.com: just return do_try_to_free_pages]
      [hugh@veritas.com: do_try_to_free_pages gfp_mask redundant]
      Signed-off-by: NMel Gorman <mel@csn.ul.ie>
      Acked-by: NChristoph Lameter <clameter@sgi.com>
      Acked-by: NDavid Rientjes <rientjes@google.com>
      Signed-off-by: NLee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: NHugh Dickins <hugh@veritas.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      dd1a239f
    • M
      mm: use two zonelist that are filtered by GFP mask · 54a6eb5c
      Mel Gorman 提交于
      Currently a node has two sets of zonelists, one for each zone type in the
      system and a second set for GFP_THISNODE allocations.  Based on the zones
      allowed by a gfp mask, one of these zonelists is selected.  All of these
      zonelists consume memory and occupy cache lines.
      
      This patch replaces the multiple zonelists per-node with two zonelists.  The
      first contains all populated zones in the system, ordered by distance, for
      fallback allocations when the target/preferred node has no free pages.  The
      second contains all populated zones in the node suitable for GFP_THISNODE
      allocations.
      
      An iterator macro is introduced called for_each_zone_zonelist() that interates
      through each zone allowed by the GFP flags in the selected zonelist.
      Signed-off-by: NMel Gorman <mel@csn.ul.ie>
      Acked-by: NChristoph Lameter <clameter@sgi.com>
      Signed-off-by: NLee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      54a6eb5c
    • H
      remove sparse warning for mmzone.h · ddc81ed2
      Harvey Harrison 提交于
      include/linux/mmzone.h:640:22: warning: potentially expensive pointer subtraction
      
      Calculate the offset into the node_zones array rather than the index
      using casts to (char *) and comparing against the index * sizeof(struct zone).
      
      On X86_32 this saves a sar, but code size increases by one byte per
      is_highmem() use due to 32-bit cmps rather than 16 bit cmps.
      
      Before:
       207:   2b 80 8c 07 00 00       sub    0x78c(%eax),%eax
       20d:   c1 f8 0b                sar    $0xb,%eax
       210:   83 f8 02                cmp    $0x2,%eax
       213:   74 16                   je     22b <kmap_atomic_prot+0x144>
       215:   83 f8 03                cmp    $0x3,%eax
       218:   0f 85 8f 00 00 00       jne    2ad <kmap_atomic_prot+0x1c6>
       21e:   83 3d 00 00 00 00 02    cmpl   $0x2,0x0
       225:   0f 85 82 00 00 00       jne    2ad <kmap_atomic_prot+0x1c6>
       22b:   64 a1 00 00 00 00       mov    %fs:0x0,%eax
      
      After:
       207:   2b 80 8c 07 00 00       sub    0x78c(%eax),%eax
       20d:   3d 00 10 00 00          cmp    $0x1000,%eax
       212:   74 18                   je     22c <kmap_atomic_prot+0x145>
       214:   3d 00 18 00 00          cmp    $0x1800,%eax
       219:   0f 85 8f 00 00 00       jne    2ae <kmap_atomic_prot+0x1c7>
       21f:   83 3d 00 00 00 00 02    cmpl   $0x2,0x0
       226:   0f 85 82 00 00 00       jne    2ae <kmap_atomic_prot+0x1c7>
       22c:   64 a1 00 00 00 00       mov    %fs:0x0,%eax
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: NHarvey Harrison <harvey.harrison@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ddc81ed2
  13. 22 4月, 2008 1 次提交
  14. 06 2月, 2008 1 次提交
  15. 17 10月, 2007 10 次提交
    • D
      mm: test and set zone reclaim lock before starting reclaim · d773ed6b
      David Rientjes 提交于
      Introduces new zone flag interface for testing and setting flags:
      
      	int zone_test_and_set_flag(struct zone *zone, zone_flags_t flag)
      
      Instead of setting and clearing ZONE_RECLAIM_LOCKED each time shrink_zone() is
      called, this flag is test and set before starting zone reclaim.  Zone reclaim
      starts in __alloc_pages() when a zone's watermark fails and the system is in
      zone_reclaim_mode.  If it's already in reclaim, there's no need to start again
      so it is simply considered full for that allocation attempt.
      
      There is a change of behavior with regard to concurrent zone shrinking.  It is
      now possible for try_to_free_pages() or kswapd to already be shrinking a
      particular zone when __alloc_pages() starts zone reclaim.  In this case, it is
      possible for two concurrent threads to invoke shrink_zone() for a single zone.
      
      This change forbids a zone to be in zone reclaim twice, which was always the
      behavior, but allows for concurrent try_to_free_pages() or kswapd shrinking
      when starting zone reclaim.
      
      Cc: Andrea Arcangeli <andrea@suse.de>
      Cc: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d773ed6b
    • D
      oom: add per-zone locking · 098d7f12
      David Rientjes 提交于
      OOM killer synchronization should be done with zone granularity so that memory
      policy and cpuset allocations may have their corresponding zones locked and
      allow parallel kills for other OOM conditions that may exist elsewhere in the
      system.  DMA allocations can be targeted at the zone level, which would not be
      possible if locking was done in nodes or globally.
      
      Synchronization shall be done with a variation of "trylocks." The goal is to
      put the current task to sleep and restart the failed allocation attempt later
      if the trylock fails.  Otherwise, the OOM killer is invoked.
      
      Each zone in the zonelist that __alloc_pages() was called with is checked for
      the newly-introduced ZONE_OOM_LOCKED flag.  If any zone has this flag present,
      the "trylock" to serialize the OOM killer fails and returns zero.  Otherwise,
      all the zones have ZONE_OOM_LOCKED set and the try_set_zone_oom() function
      returns non-zero.
      
      Cc: Andrea Arcangeli <andrea@suse.de>
      Cc: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      098d7f12
    • D
      oom: change all_unreclaimable zone member to flags · e815af95
      David Rientjes 提交于
      Convert the int all_unreclaimable member of struct zone to unsigned long
      flags.  This can now be used to specify several different zone flags such as
      all_unreclaimable and reclaim_in_progress, which can now be removed and
      converted to a per-zone flag.
      
      Flags are set and cleared as follows:
      
      	zone_set_flag(struct zone *zone, zone_flags_t flag)
      	zone_clear_flag(struct zone *zone, zone_flags_t flag)
      
      Defines the first zone flags, ZONE_ALL_UNRECLAIMABLE and ZONE_RECLAIM_LOCKED,
      which have the same semantics as the old zone->all_unreclaimable and
      zone->reclaim_in_progress, respectively.  Also converts all current users that
      set or clear either flag to use the new interface.
      
      Helper functions are defined to test the flags:
      
      	int zone_is_all_unreclaimable(const struct zone *zone)
      	int zone_is_reclaim_locked(const struct zone *zone)
      
      All flag operators are of the atomic variety because there are currently
      readers that are implemented that do not take zone->lock.
      
      [akpm@linux-foundation.org: add needed include]
      Cc: Andrea Arcangeli <andrea@suse.de>
      Acked-by: NChristoph Lameter <clameter@sgi.com>
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e815af95
    • K
      memory unplug: page isolation · a5d76b54
      KAMEZAWA Hiroyuki 提交于
      Implement generic chunk-of-pages isolation method by using page grouping ops.
      
      This patch add MIGRATE_ISOLATE to MIGRATE_TYPES. By this
       - MIGRATE_TYPES increases.
       - bitmap for migratetype is enlarged.
      
      pages of MIGRATE_ISOLATE migratetype will not be allocated even if it is free.
      By this, you can isolated *freed* pages from users. How-to-free pages is not
      a purpose of this patch. You may use reclaim and migrate codes to free pages.
      
      If start_isolate_page_range(start,end) is called,
       - migratetype of the range turns to be MIGRATE_ISOLATE  if
         its type is MIGRATE_MOVABLE. (*) this check can be updated if other
         memory reclaiming works make progress.
       - MIGRATE_ISOLATE is not on migratetype fallback list.
       - All free pages and will-be-freed pages are isolated.
      To check all pages in the range are isolated or not,  use test_pages_isolated(),
      To cancel isolation, use undo_isolate_page_range().
      
      Changes V6 -> V7
       - removed unnecessary #ifdef
      
      There are HOLES_IN_ZONE handling codes...I'm glad if we can remove them..
      Signed-off-by: NYasunori Goto <y-goto@jp.fujitsu.com>
      Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a5d76b54
    • M
      Print out statistics in relation to fragmentation avoidance to /proc/pagetypeinfo · 467c996c
      Mel Gorman 提交于
      This patch provides fragmentation avoidance statistics via /proc/pagetypeinfo.
       The information is collected only on request so there is no runtime overhead.
       The statistics are in three parts:
      
      The first part prints information on the size of blocks that pages are
      being grouped on and looks like
      
      Page block order: 10
      Pages per block:  1024
      
      The second part is a more detailed version of /proc/buddyinfo and looks like
      
      Free pages count per migrate type at order       0      1      2      3      4      5      6      7      8      9     10
      Node    0, zone      DMA, type    Unmovable      0      0      0      0      0      0      0      0      0      0      0
      Node    0, zone      DMA, type  Reclaimable      1      0      0      0      0      0      0      0      0      0      0
      Node    0, zone      DMA, type      Movable      0      0      0      0      0      0      0      0      0      0      0
      Node    0, zone      DMA, type      Reserve      0      4      4      0      0      0      0      1      0      1      0
      Node    0, zone   Normal, type    Unmovable    111      8      4      4      2      3      1      0      0      0      0
      Node    0, zone   Normal, type  Reclaimable    293     89      8      0      0      0      0      0      0      0      0
      Node    0, zone   Normal, type      Movable      1      6     13      9      7      6      3      0      0      0      0
      Node    0, zone   Normal, type      Reserve      0      0      0      0      0      0      0      0      0      0      4
      
      The third part looks like
      
      Number of blocks type     Unmovable  Reclaimable      Movable      Reserve
      Node 0, zone      DMA            0            1            2            1
      Node 0, zone   Normal            3           17           94            4
      
      To walk the zones within a node with interrupts disabled, walk_zones_in_node()
      is introduced and shared between /proc/buddyinfo, /proc/zoneinfo and
      /proc/pagetypeinfo to reduce code duplication.  It seems specific to what
      vmstat.c requires but could be broken out as a general utility function in
      mmzone.c if there were other other potential users.
      Signed-off-by: NMel Gorman <mel@csn.ul.ie>
      Acked-by: NAndy Whitcroft <apw@shadowen.org>
      Acked-by: NChristoph Lameter <clameter@sgi.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      467c996c
    • M
      Do not depend on MAX_ORDER when grouping pages by mobility · d9c23400
      Mel Gorman 提交于
      Currently mobility grouping works at the MAX_ORDER_NR_PAGES level.  This makes
      sense for the majority of users where this is also the huge page size.
      However, on platforms like ia64 where the huge page size is runtime
      configurable it is desirable to group at a lower order.  On x86_64 and
      occasionally on x86, the hugepage size may not always be MAX_ORDER_NR_PAGES.
      
      This patch groups pages together based on the value of HUGETLB_PAGE_ORDER.  It
      uses a compile-time constant if possible and a variable where the huge page
      size is runtime configurable.
      
      It is assumed that grouping should be done at the lowest sensible order and
      that the user would not want to override this.  If this is not true,
      page_block order could be forced to a variable initialised via a boot-time
      kernel parameter.
      
      One potential issue with this patch is that IA64 now parses hugepagesz with
      early_param() instead of __setup().  __setup() is called after the memory
      allocator has been initialised and the pageblock bitmaps already setup.  In
      tests on one IA64 there did not seem to be any problem with using
      early_param() and in fact may be more correct as it guarantees the parameter
      is handled before the parsing of hugepages=.
      Signed-off-by: NMel Gorman <mel@csn.ul.ie>
      Acked-by: NAndy Whitcroft <apw@shadowen.org>
      Acked-by: NChristoph Lameter <clameter@sgi.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d9c23400
    • M
      don't group high order atomic allocations · 64c5e135
      Mel Gorman 提交于
      Grouping high-order atomic allocations together was intended to allow
      bursty users of atomic allocations to work such as e1000 in situations
      where their preallocated buffers were depleted.  This did not work in at
      least one case with a wireless network adapter needing order-1 allocations
      frequently.  To resolve that, the free pages used for min_free_kbytes were
      moved to separate contiguous blocks with the patch
      bias-the-location-of-pages-freed-for-min_free_kbytes-in-the-same-max_order_nr_pages-blocks.
      
      It is felt that keeping the free pages in the same contiguous blocks should
      be sufficient for bursty short-lived high-order atomic allocations to
      succeed, maybe even with the e1000.  Even if there is a failure, increasing
      the value of min_free_kbytes will free pages as contiguous bloks in
      contrast to the standard buddy allocator which makes no attempt to keep the
      minimum number of free pages contiguous.
      
      This patch backs out grouping high order atomic allocations together to
      determine if it is really needed or not.  If a new report comes in about
      high-order atomic allocations failing, the feature can be reintroduced to
      determine if it fixes the problem or not.  As a side-effect, this patch
      reduces by 1 the number of bits required to track the mobility type of
      pages within a MAX_ORDER_NR_PAGES block.
      Signed-off-by: NMel Gorman <mel@csn.ul.ie>
      Acked-by: NAndy Whitcroft <apw@shadowen.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      64c5e135
    • M
      remove PAGE_GROUP_BY_MOBILITY · ac0e5b7a
      Mel Gorman 提交于
      Grouping pages by mobility can be disabled at compile-time. This was
      considered undesirable by a number of people. However, in the current stack of
      patches, it is not a simple case of just dropping the configurable patch as it
      would cause merge conflicts.  This patch backs out the configuration option.
      Signed-off-by: NMel Gorman <mel@csn.ul.ie>
      Acked-by: NAndy Whitcroft <apw@shadowen.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ac0e5b7a
    • M
      Bias the location of pages freed for min_free_kbytes in the same MAX_ORDER_NR_PAGES blocks · 56fd56b8
      Mel Gorman 提交于
      The standard buddy allocator always favours the smallest block of pages.
      The effect of this is that the pages free to satisfy min_free_kbytes tends
      to be preserved since boot time at the same location of memory ffor a very
      long time and as a contiguous block.  When an administrator sets the
      reserve at 16384 at boot time, it tends to be the same MAX_ORDER blocks
      that remain free.  This allows the occasional high atomic allocation to
      succeed up until the point the blocks are split.  In practice, it is
      difficult to split these blocks but when they do split, the benefit of
      having min_free_kbytes for contiguous blocks disappears.  Additionally,
      increasing min_free_kbytes once the system has been running for some time
      has no guarantee of creating contiguous blocks.
      
      On the other hand, CONFIG_PAGE_GROUP_BY_MOBILITY favours splitting large
      blocks when there are no free pages of the appropriate type available.  A
      side-effect of this is that all blocks in memory tends to be used up and
      the contiguous free blocks from boot time are not preserved like in the
      vanilla allocator.  This can cause a problem if a new caller is unwilling
      to reclaim or does not reclaim for long enough.
      
      A failure scenario was found for a wireless network device allocating
      order-1 atomic allocations but the allocations were not intense or frequent
      enough for a whole block of pages to be preserved for MIGRATE_HIGHALLOC.
      This was reproduced on a desktop by booting with mem=256mb, forcing the
      driver to allocate at order-1, running a bittorrent client (downloading a
      debian ISO) and building a kernel with -j2.
      
      This patch addresses the problem on the desktop machine booted with
      mem=256mb.  It works by setting aside a reserve of MAX_ORDER_NR_PAGES
      blocks, the number of which depends on the value of min_free_kbytes.  These
      blocks are only fallen back to when there is no other free pages.  Then the
      smallest possible page is used just like the normal buddy allocator instead
      of the largest possible page to preserve contiguous pages The pages in free
      lists in the reserve blocks are never taken for another migrate type.  The
      results is that even if min_free_kbytes is set to a low value, contiguous
      blocks will be preserved in the MIGRATE_RESERVE blocks.
      
      This works better than the vanilla allocator because if min_free_kbytes is
      increased, a new reserve block will be chosen based on the location of
      reclaimable pages and the block will free up as contiguous pages.  In the
      vanilla allocator, no effort is made to target a block of pages to free as
      contiguous pages and min_free_kbytes pages are scattered randomly.
      
      This effect has been observed on the test machine.  min_free_kbytes was set
      initially low but it was kept as a contiguous free block within
      MIGRATE_RESERVE.  min_free_kbytes was then set to a higher value and over a
      period of time, the free blocks were within the reserve and coalescing.
      How long it takes to free up depends on how quickly LRU is rotating.
      Amusingly, this means that more activity will free the blocks faster.
      
      This mechanism potentially replaces MIGRATE_HIGHALLOC as it may be more
      effective than grouping contiguous free pages together.  It all depends on
      whether the number of active atomic high allocations exceeds
      min_free_kbytes or not.  If the number of active allocations exceeds
      min_free_kbytes, it's worth it but maybe in that situation, min_free_kbytes
      should be set higher.  Once there are no more reports of allocation
      failures, a patch will be submitted that backs out MIGRATE_HIGHALLOC and
      see if the reports stay missing.
      
      Credit to Mariusz Kozlowski for discovering the problem, describing the
      failure scenario and testing patches and scenarios.
      
      [akpm@linux-foundation.org: cleanups]
      Signed-off-by: NMel Gorman <mel@csn.ul.ie>
      Acked-by: NAndy Whitcroft <apw@shadowen.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      56fd56b8
    • M
      Fix corruption of memmap on IA64 SPARSEMEM when mem_section is not a power of 2 · 5c0e3066
      Mel Gorman 提交于
      There are problems in the use of SPARSEMEM and pageblock flags that causes
      problems on ia64.
      
      The first part of the problem is that units are incorrect in
      SECTION_BLOCKFLAGS_BITS computation.  This results in a map_section's
      section_mem_map being treated as part of a bitmap which isn't good.  This
      was evident with an invalid virtual address when mem_init attempted to free
      bootmem pages while relinquishing control from the bootmem allocator.
      
      The second part of the problem occurs because the pageblock flags bitmap is
      be located with the mem_section.  The SECTIONS_PER_ROOT computation using
      sizeof (mem_section) may not be a power of 2 depending on the size of the
      bitmap.  This renders masks and other such things not power of 2 base.
      This issue was seen with SPARSEMEM_EXTREME on ia64.  This patch moves the
      bitmap outside of mem_section and uses a pointer instead in the
      mem_section.  The bitmaps are allocated when the section is being
      initialised.
      
      Note that sparse_early_usemap_alloc() does not use alloc_remap() like
      sparse_early_mem_map_alloc().  The allocation required for the bitmap on
      x86, the only architecture that uses alloc_remap is typically smaller than
      a cache line.  alloc_remap() pads out allocations to the cache size which
      would be a needless waste.
      
      Credit to Bob Picco for identifying the original problem and effecting a
      fix for the SECTION_BLOCKFLAGS_BITS calculation.  Credit to Andy Whitcroft
      for devising the best way of allocating the bitmaps only when required for
      the section.
      
      [wli@holomorphy.com: warning fix]
      Signed-off-by: NBob Picco <bob.picco@hp.com>
      Signed-off-by: NAndy Whitcroft <apw@shadowen.org>
      Signed-off-by: NMel Gorman <mel@csn.ul.ie>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Signed-off-by: NWilliam Irwin <bill.irwin@oracle.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5c0e3066