1. 01 7月, 2006 2 次提交
    • C
      [PATCH] zoned vm counters: basic ZVC (zoned vm counter) implementation · 2244b95a
      Christoph Lameter 提交于
      Per zone counter infrastructure
      
      The counters that we currently have for the VM are split per processor.  The
      processor however has not much to do with the zone these pages belong to.  We
      cannot tell f.e.  how many ZONE_DMA pages are dirty.
      
      So we are blind to potentially inbalances in the usage of memory in various
      zones.  F.e.  in a NUMA system we cannot tell how many pages are dirty on a
      particular node.  If we knew then we could put measures into the VM to balance
      the use of memory between different zones and different nodes in a NUMA
      system.  For example it would be possible to limit the dirty pages per node so
      that fast local memory is kept available even if a process is dirtying huge
      amounts of pages.
      
      Another example is zone reclaim.  We do not know how many unmapped pages exist
      per zone.  So we just have to try to reclaim.  If it is not working then we
      pause and try again later.  It would be better if we knew when it makes sense
      to reclaim unmapped pages from a zone.  This patchset allows the determination
      of the number of unmapped pages per zone.  We can remove the zone reclaim
      interval with the counters introduced here.
      
      Futhermore the ability to have various usage statistics available will allow
      the development of new NUMA balancing algorithms that may be able to improve
      the decision making in the scheduler of when to move a process to another node
      and hopefully will also enable automatic page migration through a user space
      program that can analyse the memory load distribution and then rebalance
      memory use in order to increase performance.
      
      The counter framework here implements differential counters for each processor
      in struct zone.  The differential counters are consolidated when a threshold
      is exceeded (like done in the current implementation for nr_pageache), when
      slab reaping occurs or when a consolidation function is called.
      
      Consolidation uses atomic operations and accumulates counters per zone in the
      zone structure and also globally in the vm_stat array.  VM functions can
      access the counts by simply indexing a global or zone specific array.
      
      The arrangement of counters in an array also simplifies processing when output
      has to be generated for /proc/*.
      
      Counters can be updated by calling inc/dec_zone_page_state or
      _inc/dec_zone_page_state analogous to *_page_state.  The second group of
      functions can be called if it is known that interrupts are disabled.
      
      Special optimized increment and decrement functions are provided.  These can
      avoid certain checks and use increment or decrement instructions that an
      architecture may provide.
      
      We also add a new CONFIG_DMA_IS_NORMAL that signifies that an architecture can
      do DMA to all memory and therefore ZONE_NORMAL will not be populated.  This is
      only currently set for IA64 SGI SN2 and currently only affects
      node_page_state().  In the best case node_page_state can be reduced to
      retrieving a single counter for the one zone on the node.
      
      [akpm@osdl.org: cleanups]
      [akpm@osdl.org: export vm_stat[] for filesystems]
      Signed-off-by: NChristoph Lameter <clameter@sgi.com>
      Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      2244b95a
    • C
      [PATCH] zoned vm counters: create vmstat.c/.h from page_alloc.c/.h · f6ac2354
      Christoph Lameter 提交于
      NOTE: ZVC are *not* the lightweight event counters.  ZVCs are reliable whereas
      event counters do not need to be.
      
      Zone based VM statistics are necessary to be able to determine what the state
      of memory in one zone is.  In a NUMA system this can be helpful for local
      reclaim and other memory optimizations that may be able to shift VM load in
      order to get more balanced memory use.
      
      It is also useful to know how the computing load affects the memory
      allocations on various zones.  This patchset allows the retrieval of that data
      from userspace.
      
      The patchset introduces a framework for counters that is a cross between the
      existing page_stats --which are simply global counters split per cpu-- and the
      approach of deferred incremental updates implemented for nr_pagecache.
      
      Small per cpu 8 bit counters are added to struct zone.  If the counter exceeds
      certain thresholds then the counters are accumulated in an array of
      atomic_long in the zone and in a global array that sums up all zone values.
      The small 8 bit counters are next to the per cpu page pointers and so they
      will be in high in the cpu cache when pages are allocated and freed.
      
      Access to VM counter information for a zone and for the whole machine is then
      possible by simply indexing an array (Thanks to Nick Piggin for pointing out
      that approach).  The access to the total number of pages of various types does
      no longer require the summing up of all per cpu counters.
      
      Benefits of this patchset right now:
      
      - Ability for UP and SMP configuration to determine how memory
        is balanced between the DMA, NORMAL and HIGHMEM zones.
      
      - loops over all processors are avoided in writeback and
        reclaim paths. We can avoid caching the writeback information
        because the needed information is directly accessible.
      
      - Special handling for nr_pagecache removed.
      
      - zone_reclaim_interval vanishes since VM stats can now determine
        when it is worth to do local reclaim.
      
      - Fast inline per node page state determination.
      
      - Accurate counters in /sys/devices/system/node/node*/meminfo. Current
        counters are counting simply which processor allocated a page somewhere
        and guestimate based on that. So the counters were not useful to show
        the actual distribution of page use on a specific zone.
      
      - The swap_prefetch patch requires per node statistics in order to
        figure out when processors of a node can prefetch. This patch provides
        some of the needed numbers.
      
      - Detailed VM counters available in more /proc and /sys status files.
      
      References to earlier discussions:
      V1 http://marc.theaimsgroup.com/?l=linux-kernel&m=113511649910826&w=2
      V2 http://marc.theaimsgroup.com/?l=linux-kernel&m=114980851924230&w=2
      V3 http://marc.theaimsgroup.com/?l=linux-kernel&m=115014697910351&w=2
      V4 http://marc.theaimsgroup.com/?l=linux-kernel&m=115024767318740&w=2
      
      Performance tests with AIM7 did not show any regressions.  Seems to be a tad
      faster even.  Tested on ia64/NUMA.  Builds fine on i386, SMP / UP.  Includes
      fixes for s390/arm/uml arch code.
      
      This patch:
      
      Move counter code from page_alloc.c/page-flags.h to vmstat.c/h.
      
      Create vmstat.c/vmstat.h by separating the counter code and the proc
      functions.
      
      Move the vm_stat_text array before zoneinfo_show.
      
      [akpm@osdl.org: s390 build fix]
      [akpm@osdl.org: HOTPLUG_CPU build fix]
      Signed-off-by: NChristoph Lameter <clameter@sgi.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      f6ac2354
  2. 28 6月, 2006 3 次提交
  3. 27 6月, 2006 1 次提交
  4. 26 6月, 2006 1 次提交
  5. 23 6月, 2006 9 次提交
  6. 22 5月, 2006 2 次提交
    • B
      [PATCH] Align the node_mem_map endpoints to a MAX_ORDER boundary · e984bb43
      Bob Picco 提交于
      Andy added code to buddy allocator which does not require the zone's
      endpoints to be aligned to MAX_ORDER.  An issue is that the buddy allocator
      requires the node_mem_map's endpoints to be MAX_ORDER aligned.  Otherwise
      __page_find_buddy could compute a buddy not in node_mem_map for partial
      MAX_ORDER regions at zone's endpoints.  page_is_buddy will detect that
      these pages at endpoints are not PG_buddy (they were zeroed out by bootmem
      allocator and not part of zone).  Of course the negative here is we could
      waste a little memory but the positive is eliminating all the old checks
      for zone boundary conditions.
      
      SPARSEMEM won't encounter this issue because of MAX_ORDER size constraint
      when SPARSEMEM is configured.  ia64 VIRTUAL_MEM_MAP doesn't need the logic
      either because the holes and endpoints are handled differently.  This
      leaves checking alloc_remap and other arches which privately allocate for
      node_mem_map.
      Signed-off-by: NBob Picco <bob.picco@hp.com>
      Acked-by: NMel Gorman <mel@csn.ul.ie>
      Cc: Dave Hansen <haveblue@us.ibm.com>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      e984bb43
    • P
      [PATCH] Cpuset: might sleep checking zones allowed fix · bdd804f4
      Paul Jackson 提交于
      Fix a couple of infrequently encountered 'sleeping function called from
      invalid context' in the cpuset hooks in __alloc_pages.  Could sleep while
      interrupts disabled.
      
      The routine cpuset_zone_allowed() is called by code in mm/page_alloc.c
      __alloc_pages() to determine if a zone is allowed in the current tasks
      cpuset.  This routine can sleep, for certain GFP_KERNEL allocations, if the
      zone is on a memory node not allowed in the current cpuset, but might be
      allowed in a parent cpuset.
      
      But we can't sleep in __alloc_pages() if in interrupt, nor if called for a
      GFP_ATOMIC request (__GFP_WAIT not set in gfp_flags).
      
      The rule was intended to be:
        Don't call cpuset_zone_allowed() if you can't sleep, unless you
        pass in the __GFP_HARDWALL flag set in gfp_flag, which disables
        the code that might scan up ancestor cpusets and sleep.
      
      This rule was being violated in a couple of places, due to a bogus change
      made (by myself, pj) to __alloc_pages() as part of the November 2005 effort
      to cleanup its logic, and also due to a later fix to constrain which swap
      daemons were awoken.
      
      The bogus change can be seen at:
        http://linux.derkeiler.com/Mailing-Lists/Kernel/2005-11/4691.html
        [PATCH 01/05] mm fix __alloc_pages cpuset ALLOC_* flags
      
      This was first noticed on a tight memory system, in code that was disabling
      interrupts and doing allocation requests with __GFP_WAIT not set, which
      resulted in __might_sleep() writing complaints to the log "Debug: sleeping
      function called ...", when the code in cpuset_zone_allowed() tried to take
      the callback_sem cpuset semaphore.
      
      We haven't seen a system hang on this 'might_sleep' yet, but we are at
      decent risk of seeing it fairly soon, especially since the additional
      cpuset_zone_allowed() check was added, conditioning wakeup_kswapd(), in
      March 2006.
      
      Special thanks to Dave Chinner, for figuring this out, and a tip of the hat
      to Nick Piggin who warned me of this back in Nov 2005, before I was ready
      to listen.
      Signed-off-by: NPaul Jackson <pj@sgi.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      bdd804f4
  7. 16 5月, 2006 1 次提交
  8. 26 4月, 2006 1 次提交
  9. 20 4月, 2006 1 次提交
  10. 11 4月, 2006 2 次提交
    • H
      [PATCH] overcommit: add calculate_totalreserve_pages() · cb45b0e9
      Hideo AOKI 提交于
      These patches are an enhancement of OVERCOMMIT_GUESS algorithm in
      __vm_enough_memory().
      
      - why the kernel needed patching
      
        When the kernel can't allocate anonymous pages in practice, currnet
        OVERCOMMIT_GUESS could return success. This implementation might be
        the cause of oom kill in memory pressure situation.
      
        If the Linux runs with page reservation features like
        /proc/sys/vm/lowmem_reserve_ratio and without swap region, I think
        the oom kill occurs easily.
      
      - the overall design approach in the patch
      
        When the OVERCOMMET_GUESS algorithm calculates number of free pages,
        the reserved free pages are regarded as non-free pages.
      
        This change helps to avoid the pitfall that the number of free pages
        become less than the number which the kernel tries to keep free.
      
      - testing results
      
        I tested the patches using my test kernel module.
      
        If the patches aren't applied to the kernel, __vm_enough_memory()
        returns success in the situation but autual page allocation is
        failed.
      
        On the other hand, if the patches are applied to the kernel, memory
        allocation failure is avoided since __vm_enough_memory() returns
        failure in the situation.
      
        I checked that on i386 SMP 16GB memory machine. I haven't tested on
        nommu environment currently.
      
      This patch adds totalreserve_pages for __vm_enough_memory().
      
      Calculate_totalreserve_pages() checks maximum lowmem_reserve pages and
      pages_high in each zone. Finally, the function stores the sum of each
      zone to totalreserve_pages.
      
      The totalreserve_pages is calculated when the VM is initilized.
      And the variable is updated when /proc/sys/vm/lowmem_reserve_raito
      or /proc/sys/vm/min_free_kbytes are changed.
      Signed-off-by: NHideo Aoki <haoki@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      cb45b0e9
    • N
      [PATCH] Fix buddy list race that could lead to page lru list corruptions · 676165a8
      Nick Piggin 提交于
      Rohit found an obscure bug causing buddy list corruption.
      
      page_is_buddy is using a non-atomic test (PagePrivate && page_count == 0)
      to determine whether or not a free page's buddy is itself free and in the
      buddy lists.
      
      Each of the conjuncts may be true at different times due to unrelated
      conditions, so the non-atomic page_is_buddy test may find each conjunct to
      be true even if they were not both true at the same time (ie. the page was
      not on the buddy lists).
      Signed-off-by: NMartin Bligh <mbligh@google.com>
      Signed-off-by: NRohit Seth <rohitseth@google.com>
      Signed-off-by: NNick Piggin <npiggin@suse.de>
      Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      676165a8
  11. 28 3月, 2006 4 次提交
  12. 26 3月, 2006 2 次提交
  13. 24 3月, 2006 1 次提交
  14. 22 3月, 2006 9 次提交
  15. 10 3月, 2006 1 次提交
    • C
      [PATCH] slab: Node rotor for freeing alien caches and remote per cpu pages. · 8fce4d8e
      Christoph Lameter 提交于
      The cache reaper currently tries to free all alien caches and all remote
      per cpu pages in each pass of cache_reap.  For a machines with large number
      of nodes (such as Altix) this may lead to sporadic delays of around ~10ms.
      Interrupts are disabled while reclaiming creating unacceptable delays.
      
      This patch changes that behavior by adding a per cpu reap_node variable.
      Instead of attempting to free all caches, we free only one alien cache and
      the per cpu pages from one remote node.  That reduces the time spend in
      cache_reap.  However, doing so will lengthen the time it takes to
      completely drain all remote per cpu pagesets and all alien caches.  The
      time needed will grow with the number of nodes in the system.  All caches
      are drained when they overflow their respective capacity.  So the drawback
      here is only that a bit of memory may be wasted for awhile longer.
      
      Details:
      
      1. Rename drain_remote_pages to drain_node_pages to allow the specification
         of the node to drain of pcp pages.
      
      2. Add additional functions init_reap_node, next_reap_node for NUMA
         that manage a per cpu reap_node counter.
      
      3. Add a reap_alien function that reaps only from the current reap_node.
      
      For us this seems to be a critical issue.  Holdoffs of an average of ~7ms
      cause some HPC benchmarks to slow down significantly.  F.e.  NAS parallel
      slows down dramatically.  NAS parallel has a 12-16 seconds runtime w/o rotor
      compared to 5.8 secs with the rotor patches.  It gets down to 5.05 secs with
      the additional interrupt holdoff reductions.
      Signed-off-by: NChristoph Lameter <clameter@sgi.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      8fce4d8e