1. 19 1月, 2006 8 次提交
    • C
      [PATCH] mm: optimize numa policy handling in slab allocator · 86c562a9
      Christoph Lameter 提交于
      Move the interrupt check from slab_node into ___cache_alloc and adds an
      "unlikely()" to avoid pipeline stalls on some architectures.
      Signed-off-by: NChristoph Lameter <clameter@sgi.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      86c562a9
    • C
      [PATCH] NUMA policies in the slab allocator V2 · dc85da15
      Christoph Lameter 提交于
      This patch fixes a regression in 2.6.14 against 2.6.13 that causes an
      imbalance in memory allocation during bootup.
      
      The slab allocator in 2.6.13 is not numa aware and simply calls
      alloc_pages().  This means that memory policies may control the behavior of
      alloc_pages().  During bootup the memory policy is set to MPOL_INTERLEAVE
      resulting in the spreading out of allocations during bootup over all
      available nodes.  The slab allocator in 2.6.13 has only a single list of
      slab pages.  As a result the per cpu slab cache and the spinlock controlled
      page lists may contain slab entries from off node memory.  The slab
      allocator in 2.6.13 makes no effort to discern the locality of an entry on
      its lists.
      
      The NUMA aware slab allocator in 2.6.14 controls locality of the slab pages
      explicitly by calling alloc_pages_node().  The NUMA slab allocator manages
      slab entries by having lists of available slab pages for each node.  The
      per cpu slab cache can only contain slab entries associated with the node
      local to the processor.  This guarantees that the default allocation mode
      of the slab allocator always assigns local memory if available.
      
      Setting MPOL_INTERLEAVE as a default policy during bootup has no effect
      anymore.  In 2.6.14 all node unspecific slab allocations are performed on
      the boot processor.  This means that most of key data structures are
      allocated on one node.  Most processors will have to refer to these
      structures making the boot node a potential bottleneck.  This may reduce
      performance and cause unnecessary memory pressure on the boot node.
      
      This patch implements NUMA policies in the slab layer.  There is the need
      of explicit application of NUMA memory policies by the slab allcator itself
      since the NUMA slab allocator does no longer let the page_allocator control
      locality.
      
      The check for policies is made directly at the beginning of __cache_alloc
      using current->mempolicy.  The memory policy is already frequently checked
      by the page allocator (alloc_page_vma() and alloc_page_current()).  So it
      is highly likely that the cacheline is present.  For MPOL_INTERLEAVE
      kmalloc() will spread out each request to one node after another so that an
      equal distribution of allocations can be obtained during bootup.
      
      It is not possible to push the policy check to lower layers of the NUMA
      slab allocator since the per cpu caches are now only containing slab
      entries from the current node.  If the policy says that the local node is
      not to be preferred or forbidden then there is no point in checking the
      slab cache or local list of slab pages.  The allocation better be directed
      immediately to the lists containing slab entries for the allowed set of
      nodes.
      
      This way of applying policy also fixes another strange behavior in 2.6.13.
      alloc_pages() is controlled by the memory allocation policy of the current
      process.  It could therefore be that one process is running with
      MPOL_INTERLEAVE and would f.e.  obtain a new page following that policy
      since no slab entries are in the lists anymore.  A page can typically be
      used for multiple slab entries but lets say that the current process is
      only using one.  The other entries are then added to the slab lists.  These
      are now non local entries in the slab lists despite of the possible
      availability of local pages that would provide faster access and increase
      the performance of the application.
      
      Another process without MPOL_INTERLEAVE may now run and expect a local slab
      entry from kmalloc().  However, there are still these free slab entries
      from the off node page obtained from the other process via MPOL_INTERLEAVE
      in the cache.  The process will then get an off node slab entry although
      other slab entries may be available that are local to that process.  This
      means that the policy if one process may contaminate the locality of the
      slab caches for other processes.
      
      This patch in effect insures that a per process policy is followed for the
      allocation of slab entries and that there cannot be a memory policy
      influence from one process to another.  A process with default policy will
      always get a local slab entry if one is available.  And the process using
      memory policies will get its memory arranged as requested.  Off-node slab
      allocation will require the use of spinlocks and will make the use of per
      cpu caches not possible.  A process using memory policies to redirect
      allocations offnode will have to cope with additional lock overhead in
      addition to the latency added by the need to access a remote slab entry.
      
      Changes V1->V2
      - Remove #ifdef CONFIG_NUMA by moving forward declaration into
        prior #ifdef CONFIG_NUMA section.
      
      - Give the function determining the node number to use a saner
        name.
      Signed-off-by: NChristoph Lameter <clameter@sgi.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      dc85da15
    • I
      [PATCH] sem2mutex: mm/slab.c · fc0abb14
      Ingo Molnar 提交于
      Convert mm/swapfile.c's swapon_sem to swapon_mutex.
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      Cc: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      fc0abb14
    • C
      [PATCH] Zone reclaim: Reclaim logic · 9eeff239
      Christoph Lameter 提交于
      Some bits for zone reclaim exists in 2.6.15 but they are not usable.  This
      patch fixes them up, removes unused code and makes zone reclaim usable.
      
      Zone reclaim allows the reclaiming of pages from a zone if the number of
      free pages falls below the watermarks even if other zones still have enough
      pages available.  Zone reclaim is of particular importance for NUMA
      machines.  It can be more beneficial to reclaim a page than taking the
      performance penalties that come with allocating a page on a remote zone.
      
      Zone reclaim is enabled if the maximum distance to another node is higher
      than RECLAIM_DISTANCE, which may be defined by an arch.  By default
      RECLAIM_DISTANCE is 20.  20 is the distance to another node in the same
      component (enclosure or motherboard) on IA64.  The meaning of the NUMA
      distance information seems to vary by arch.
      
      If zone reclaim is not successful then no further reclaim attempts will
      occur for a certain time period (ZONE_RECLAIM_INTERVAL).
      
      This patch was discussed before. See
      
      http://marc.theaimsgroup.com/?l=linux-kernel&m=113519961504207&w=2
      http://marc.theaimsgroup.com/?l=linux-kernel&m=113408418232531&w=2
      http://marc.theaimsgroup.com/?l=linux-kernel&m=113389027420032&w=2
      http://marc.theaimsgroup.com/?l=linux-kernel&m=113380938612205&w=2Signed-off-by: NChristoph Lameter <clameter@sgi.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      9eeff239
    • C
      [PATCH] Zone reclaim: resurrect may_swap · f1fd1067
      Christoph Lameter 提交于
      Zone reclaim has a huge impact on NUMA performance (f.e.  our maximum
      throughput with XFS is raised from 4GB to 6GB/sec / page cache contamination
      of numa nodes destroys locality if one just does a large copy operation which
      results in performance dropping for good until reboot).
      
      This patch:
      
      Resurrect may_swap in struct scan_control
      Signed-off-by: NChristoph Lameter <clameter@sgi.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      f1fd1067
    • C
      [PATCH] Simplify migrate_page_add · fc301289
      Christoph Lameter 提交于
      Simplify migrate_page_add after feedback from Hugh.  This also allows us to
      drop one parameter from migrate_page_add.
      Signed-off-by: NChristoph Lameter <clameter@sgi.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      fc301289
    • N
      [PATCH] mm: migration page refcounting fix · 053837fc
      Nick Piggin 提交于
      Migration code currently does not take a reference to target page
      properly, so between unlocking the pte and trying to take a new
      reference to the page with isolate_lru_page, anything could happen to
      it.
      
      Fix this by holding the pte lock until we get a chance to elevate the
      refcount.
      
      Other small cleanups while we're here.
      Signed-off-by: NNick Piggin <npiggin@suse.de>
      Signed-off-by: NChristoph Lameter <clameter@sgi.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      053837fc
    • A
      [PATCH] mm: dirty_exceeded speedup · e236a166
      Andrew Morton 提交于
      Ravikiran reports that this variable is bouncing all around nodes on NUMA
      machines, causing measurable performance problems.  Fix that up by only
      writing to it when it actually changed.
      
      And put it in a new cacheline to prevent it sharing with other things (this
      happened).
      Signed-off-by: NRavikiran Thirumalai <kiran@scalex86.org>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      e236a166
  2. 17 1月, 2006 1 次提交
  3. 15 1月, 2006 2 次提交
    • P
      [PATCH] cpuset oom lock fix · 505970b9
      Paul Jackson 提交于
      The problem, reported in:
      
        http://bugzilla.kernel.org/show_bug.cgi?id=5859
      
      and by various other email messages and lkml posts is that the cpuset hook
      in the oom (out of memory) code can try to take a cpuset semaphore while
      holding the tasklist_lock (a spinlock).
      
      One must not sleep while holding a spinlock.
      
      The fix seems easy enough - move the cpuset semaphore region outside the
      tasklist_lock region.
      
      This required a few lines of mechanism to implement.  The oom code where
      the locking needs to be changed does not have access to the cpuset locks,
      which are internal to kernel/cpuset.c only.  So I provided a couple more
      cpuset interface routines, available to the rest of the kernel, which
      simple take and drop the lock needed here (cpusets callback_sem).
      Signed-off-by: NPaul Jackson <pj@sgi.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      505970b9
    • R
      [PATCH] Add tmpfs options for memory placement policies · 7339ff83
      Robin Holt 提交于
      Anything that writes into a tmpfs filesystem is liable to disproportionately
      decrease the available memory on a particular node.  Since there's no telling
      what sort of application (e.g.  dd/cp/cat) might be dropping large files
      there, this lets the admin choose the appropriate default behavior for their
      site's situation.
      
      Introduce a tmpfs mount option which allows specifying a memory policy and
      a second option to specify the nodelist for that policy.  With the default
      policy, tmpfs will behave as it does today.  This patch adds support for
      preferred, bind, and interleave policies.
      
      The default policy will cause pages to be added to tmpfs files on the node
      which is doing the writing.  Some jobs expect a single process to create
      and manage the tmpfs files.  This results in a node which has a
      significantly reduced number of free pages.
      
      With this patch, the administrator can specify the policy and nodes for
      that policy where they would prefer allocations.
      
      This patch was originally written by Brent Casavant and Hugh Dickins.  I
      added support for the bind and preferred policies and the mpol_nodelist
      mount option.
      Signed-off-by: NBrent Casavant <bcasavan@sgi.com>
      Signed-off-by: NHugh Dickins <hugh@veritas.com>
      Signed-off-by: NRobin Holt <holt@sgi.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      7339ff83
  4. 13 1月, 2006 3 次提交
  5. 12 1月, 2006 4 次提交
  6. 11 1月, 2006 3 次提交
  7. 10 1月, 2006 2 次提交
  8. 09 1月, 2006 17 次提交
    • V
      [PATCH] fadvise: return ESPIPE on FIFO/pipe · 87ba81db
      Valentine Barshak 提交于
      The patch makes posix_fadvise return ESPIPE on FIFO/pipe in order to be
      fully POSIX-compliant.
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      87ba81db
    • O
      [PATCH] Fix and add EXPORT_SYMBOL(filemap_write_and_wait) · 28fd1298
      OGAWA Hirofumi 提交于
      This patch add EXPORT_SYMBOL(filemap_write_and_wait) and use it.
      
      See mm/filemap.c:
      
      And changes the filemap_write_and_wait() and filemap_write_and_wait_range().
      
      Current filemap_write_and_wait() doesn't wait if filemap_fdatawrite()
      returns error.  However, even if filemap_fdatawrite() returned an
      error, it may have submitted the partially data pages to the device.
      (e.g. in the case of -ENOSPC)
      
      <quotation>
      Andrew Morton writes,
      
      If filemap_fdatawrite() returns an error, this might be due to some
      I/O problem: dead disk, unplugged cable, etc.  Given the generally
      crappy quality of the kernel's handling of such exceptions, there's a
      good chance that the filemap_fdatawait() will get stuck in D state
      forever.
      </quotation>
      
      So, this patch doesn't wait if filemap_fdatawrite() returns the -EIO.
      
      Trond, could you please review the nfs part?  Especially I'm not sure,
      nfs must use the "filemap_fdatawrite(inode->i_mapping) == 0", or not.
      Acked-by: NTrond Myklebust <trond.myklebust@fys.uio.no>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      28fd1298
    • O
      [PATCH] export/change sync_page_range/_nolock() · 268fc16e
      OGAWA Hirofumi 提交于
      This exports/changes the sync_page_range/_nolock().  The fatfs needs
      sync_page_range/_nolock() for expanding truncate, and changes "size_t count"
      to "loff_t count".
      Signed-off-by: NOGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      268fc16e
    • P
      [PATCH] cpuset: rebind vma mempolicies fix · 4225399a
      Paul Jackson 提交于
      Fix more of longstanding bug in cpuset/mempolicy interaction.
      
      NUMA mempolicies (mm/mempolicy.c) are constrained by the current tasks cpuset
      to just the Memory Nodes allowed by that cpuset.  The kernel maintains
      internal state for each mempolicy, tracking what nodes are used for the
      MPOL_INTERLEAVE, MPOL_BIND or MPOL_PREFERRED policies.
      
      When a tasks cpuset memory placement changes, whether because the cpuset
      changed, or because the task was attached to a different cpuset, then the
      tasks mempolicies have to be rebound to the new cpuset placement, so as to
      preserve the cpuset-relative numbering of the nodes in that policy.
      
      An earlier fix handled such mempolicy rebinding for mempolicies attached to a
      task.
      
      This fix rebinds mempolicies attached to vma's (address ranges in a tasks
      address space.) Due to the need to hold the task->mm->mmap_sem semaphore while
      updating vma's, the rebinding of vma mempolicies has to be done when the
      cpuset memory placement is changed, at which time mmap_sem can be safely
      acquired.  The tasks mempolicy is rebound later, when the task next attempts
      to allocate memory and notices that its task->cpuset_mems_generation is
      out-of-date with its cpusets mems_generation.
      
      Because walking the tasklist to find all tasks attached to a changing cpuset
      requires holding tasklist_lock, a spinlock, one cannot update the vma's of the
      affected tasks while doing the tasklist scan.  In general, one cannot acquire
      a semaphore (which can sleep) while already holding a spinlock (such as
      tasklist_lock).  So a list of mm references has to be built up during the
      tasklist scan, then the tasklist lock dropped, then for each mm, its mmap_sem
      acquired, and the vma's in that mm rebound.
      
      Once the tasklist lock is dropped, affected tasks may fork new tasks, before
      their mm's are rebound.  A kernel global 'cpuset_being_rebound' is set to
      point to the cpuset being rebound (there can only be one; cpuset modifications
      are done under a global 'manage_sem' semaphore), and the mpol_copy code that
      is used to copy a tasks mempolicies during fork catches such forking tasks,
      and ensures their children are also rebound.
      
      When a task is moved to a different cpuset, it is easier, as there is only one
      task involved.  It's mm->vma's are scanned, using the same
      mpol_rebind_policy() as used above.
      
      It may happen that both the mpol_copy hook and the update done via the
      tasklist scan update the same mm twice.  This is ok, as the mempolicies of
      each vma in an mm keep track of what mems_allowed they are relative to, and
      safely no-op a second request to rebind to the same nodes.
      Signed-off-by: NPaul Jackson <pj@sgi.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      4225399a
    • P
      [PATCH] cpuset: numa_policy_rebind cleanup · 74cb2155
      Paul Jackson 提交于
      Cleanup, reorganize and make more robust the mempolicy.c code to rebind
      mempolicies relative to the containing cpuset after a tasks memory placement
      changes.
      
      The real motivator for this cleanup patch is to lay more groundwork for the
      upcoming patch to correctly rebind NUMA mempolicies that are attached to vma's
      after the containing cpuset memory placement changes.
      
      NUMA mempolicies are constrained by the cpuset their task is a member of.
      When either (1) a task is moved to a different cpuset, or (2) the 'mems'
      mems_allowed of a cpuset is changed, then the NUMA mempolicies have embedded
      node numbers (for MPOL_BIND, MPOL_INTERLEAVE and MPOL_PREFERRED) that need to
      be recalculated, relative to their new cpuset placement.
      
      The old code used an unreliable method of determining what was the old
      mems_allowed constraining the mempolicy.  It just looked at the tasks
      mems_allowed value.  This sort of worked with the present code, that just
      rebinds the -task- mempolicy, and leaves any -vma- mempolicies broken,
      referring to the old nodes.  But in an upcoming patch, the vma mempolicies
      will be rebound as well.  Then the order in which the various task and vma
      mempolicies are updated will no longer be deterministic, and one can no longer
      count on the task->mems_allowed holding the old value for as long as needed.
      It's not even clear if the current code was guaranteed to work reliably for
      task mempolicies.
      
      So I added a mems_allowed field to each mempolicy, stating exactly what
      mems_allowed the policy is relative to, and updated synchronously and reliably
      anytime that the mempolicy is rebound.
      
      Also removed a useless wrapper routine, numa_policy_rebind(), and had its
      caller, cpuset_update_task_memory_state(), call directly to the rewritten
      policy_rebind() routine, and made that rebind routine extern instead of
      static, and added a "mpol_" prefix to its name, making it
      mpol_rebind_policy().
      Signed-off-by: NPaul Jackson <pj@sgi.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      74cb2155
    • P
      [PATCH] cpuset: implement cpuset_mems_allowed · 909d75a3
      Paul Jackson 提交于
      Provide a cpuset_mems_allowed() method, which the sys_migrate_pages() code
      needed, to obtain the mems_allowed vector of a cpuset, and replaced the
      workaround in sys_migrate_pages() to call this new method.
      Signed-off-by: NPaul Jackson <pj@sgi.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      909d75a3
    • P
      [PATCH] cpuset: combine refresh_mems and update_mems · cf2a473c
      Paul Jackson 提交于
      The important code paths through alloc_pages_current() and alloc_page_vma(),
      by which most kernel page allocations go, both called
      cpuset_update_current_mems_allowed(), which in turn called refresh_mems().
      -Both- of these latter two routines did a tasklock, got the tasks cpuset
      pointer, and checked for out of date cpuset->mems_generation.
      
      That was a silly duplication of code and waste of CPU cycles on an important
      code path.
      
      Consolidated those two routines into a single routine, called
      cpuset_update_task_memory_state(), since it updates more than just
      mems_allowed.
      
      Changed all callers of either routine to call the new consolidated routine.
      Signed-off-by: NPaul Jackson <pj@sgi.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      cf2a473c
    • P
      [PATCH] cpuset: memory pressure meter · 3e0d98b9
      Paul Jackson 提交于
      Provide a simple per-cpuset metric of memory pressure, tracking the -rate-
      that the tasks in a cpuset call try_to_free_pages(), the synchronous
      (direct) memory reclaim code.
      
      This enables batch managers monitoring jobs running in dedicated cpusets to
      efficiently detect what level of memory pressure that job is causing.
      
      This is useful both on tightly managed systems running a wide mix of
      submitted jobs, which may choose to terminate or reprioritize jobs that are
      trying to use more memory than allowed on the nodes assigned them, and with
      tightly coupled, long running, massively parallel scientific computing jobs
      that will dramatically fail to meet required performance goals if they
      start to use more memory than allowed to them.
      
      This patch just provides a very economical way for the batch manager to
      monitor a cpuset for signs of memory pressure.  It's up to the batch
      manager or other user code to decide what to do about it and take action.
      
      ==> Unless this feature is enabled by writing "1" to the special file
          /dev/cpuset/memory_pressure_enabled, the hook in the rebalance
          code of __alloc_pages() for this metric reduces to simply noticing
          that the cpuset_memory_pressure_enabled flag is zero.  So only
          systems that enable this feature will compute the metric.
      
      Why a per-cpuset, running average:
      
          Because this meter is per-cpuset, rather than per-task or mm, the
          system load imposed by a batch scheduler monitoring this metric is
          sharply reduced on large systems, because a scan of the tasklist can be
          avoided on each set of queries.
      
          Because this meter is a running average, instead of an accumulating
          counter, a batch scheduler can detect memory pressure with a single
          read, instead of having to read and accumulate results for a period of
          time.
      
          Because this meter is per-cpuset rather than per-task or mm, the
          batch scheduler can obtain the key information, memory pressure in a
          cpuset, with a single read, rather than having to query and accumulate
          results over all the (dynamically changing) set of tasks in the cpuset.
      
      A per-cpuset simple digital filter (requires a spinlock and 3 words of data
      per-cpuset) is kept, and updated by any task attached to that cpuset, if it
      enters the synchronous (direct) page reclaim code.
      
      A per-cpuset file provides an integer number representing the recent
      (half-life of 10 seconds) rate of direct page reclaims caused by the tasks
      in the cpuset, in units of reclaims attempted per second, times 1000.
      Signed-off-by: NPaul Jackson <pj@sgi.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      3e0d98b9
    • P
      [PATCH] cpuset: mempolicy one more nodemask conversion · 5966514d
      Paul Jackson 提交于
      Finish converting mm/mempolicy.c from bitmaps to nodemasks.  The previous
      conversion had left one routine using bitmaps, since it involved a
      corresponding change to kernel/cpuset.c
      
      Fix that interface by replacing with a simple macro that calls nodes_subset(),
      or if !CONFIG_CPUSET, returns (1).
      Signed-off-by: NPaul Jackson <pj@sgi.com>
      Cc: Christoph Lameter <christoph@lameter.com>
      Cc: Andi Kleen <ak@muc.de>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      5966514d
    • M
      [PATCH] slob: introduce the SLOB allocator · 10cef602
      Matt Mackall 提交于
      configurable replacement for slab allocator
      
      This adds a CONFIG_SLAB option under CONFIG_EMBEDDED.  When CONFIG_SLAB is
      disabled, the kernel falls back to using the 'SLOB' allocator.
      
      SLOB is a traditional K&R/UNIX allocator with a SLAB emulation layer,
      similar to the original Linux kmalloc allocator that SLAB replaced.  It's
      signicantly smaller code and is more memory efficient.  But like all
      similar allocators, it scales poorly and suffers from fragmentation more
      than SLAB, so it's only appropriate for small systems.
      
      It's been tested extensively in the Linux-tiny tree.  I've also
      stress-tested it with make -j 8 compiles on a 3G SMP+PREEMPT box (not
      recommended).
      
      Here's a comparison for otherwise identical builds, showing SLOB saving
      nearly half a megabyte of RAM:
      
      $ size vmlinux*
         text    data     bss     dec     hex filename
      3336372  529360  190812 4056544  3de5e0 vmlinux-slab
      3323208  527948  190684 4041840  3dac70 vmlinux-slob
      
      $ size mm/{slab,slob}.o
         text    data     bss     dec     hex filename
        13221     752      48   14021    36c5 mm/slab.o
         1896      52       8    1956     7a4 mm/slob.o
      
      /proc/meminfo:
                        SLAB          SLOB      delta
      MemTotal:        27964 kB      27980 kB     +16 kB
      MemFree:         24596 kB      25092 kB    +496 kB
      Buffers:            36 kB         36 kB       0 kB
      Cached:           1188 kB       1188 kB       0 kB
      SwapCached:          0 kB          0 kB       0 kB
      Active:            608 kB        600 kB      -8 kB
      Inactive:          808 kB        812 kB      +4 kB
      HighTotal:           0 kB          0 kB       0 kB
      HighFree:            0 kB          0 kB       0 kB
      LowTotal:        27964 kB      27980 kB     +16 kB
      LowFree:         24596 kB      25092 kB    +496 kB
      SwapTotal:           0 kB          0 kB       0 kB
      SwapFree:            0 kB          0 kB       0 kB
      Dirty:               4 kB         12 kB      +8 kB
      Writeback:           0 kB          0 kB       0 kB
      Mapped:            560 kB        556 kB      -4 kB
      Slab:             1756 kB          0 kB   -1756 kB
      CommitLimit:     13980 kB      13988 kB      +8 kB
      Committed_AS:     4208 kB       4208 kB       0 kB
      PageTables:         28 kB         28 kB       0 kB
      VmallocTotal:  1007312 kB    1007312 kB       0 kB
      VmallocUsed:        48 kB         48 kB       0 kB
      VmallocChunk:  1007264 kB    1007264 kB       0 kB
      
      (this work has been sponsored in part by CELF)
      
      From: Ingo Molnar <mingo@elte.hu>
      
         Fix 32-bitness bugs in mm/slob.c.
      Signed-off-by: NMatt Mackall <mpm@selenic.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      10cef602
    • M
      [PATCH] slob: introduce mm/util.c for shared functions · 30992c97
      Matt Mackall 提交于
      Add mm/util.c for functions common between SLAB and SLOB.
      Signed-off-by: NMatt Mackall <mpm@selenic.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      30992c97
    • R
      [PATCH] Change maxaligned_in_smp alignemnt macros to internodealigned_in_smp macros · 22fc6ecc
      Ravikiran G Thirumalai 提交于
      ____cacheline_maxaligned_in_smp is currently used to align critical structures
      and avoid false sharing.  It uses per-arch L1_CACHE_SHIFT_MAX and people find
      L1_CACHE_SHIFT_MAX useless.
      
      However, we have been using ____cacheline_maxaligned_in_smp to align
      structures on the internode cacheline size.  As per Andi's suggestion,
      following patch kills ____cacheline_maxaligned_in_smp and introduces
      INTERNODE_CACHE_SHIFT, which defaults to L1_CACHE_SHIFT for all arches.
      Arches needing L3/Internode cacheline alignment can define
      INTERNODE_CACHE_SHIFT in the arch asm/cache.h.  Patch replaces
      ____cacheline_maxaligned_in_smp with ____cacheline_internodealigned_in_smp
      
      With this patch, L1_CACHE_SHIFT_MAX can be killed
      Signed-off-by: NRavikiran Thirumalai <kiran@scalex86.org>
      Signed-off-by: NShai Fultheim <shai@scalex86.org>
      Signed-off-by: NAndi Kleen <ak@suse.de>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      22fc6ecc
    • K
      [PATCH] Optimise oom kill of current task · 2f659f46
      Kirill Korotaev 提交于
      When oom_killer kills current there's no need to call
      schedule_timeout_interruptible() since task must die ASAP.
      Signed-Off-By: NPavel Emelianov <xemul@sw.ru>
      Signed-Off-By: NKirill Korotaev <dev@openvz.org>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      2f659f46
    • C
      [PATCH] Move page migration related functions near do_migrate_pages() · 6ce3c4c0
      Christoph Lameter 提交于
      Group page migration functions in mempolicy.c
      
      Add a forward declaration for migrate_page_add (like gather_stats()) and use
      our new found mobility to group all page migration related function around
      do_migrate_pages().
      Signed-off-by: NChristoph Lameter <clameter@sgi.com>
      Cc: Andi Kleen <ak@muc.de>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      6ce3c4c0
    • C
      [PATCH] mempolicies: unexport get_vma_policy() · 48fce342
      Christoph Lameter 提交于
      Since the numa_maps functionality is now in mempolicy.c we no longer need to
      export get_vma_policy().
      Signed-off-by: NChristoph Lameter <clameter@sgi.com>
      Cc: Andi Kleen <ak@muc.de>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      48fce342
    • C
      [PATCH] Drop page table lock before calling migrate_page_add() · 132beacf
      Christoph Lameter 提交于
      migrate_page_add cannot be called with a spinlock held (calls
      isolate_lru_page which calles schedule_on_each_cpu).  Drop ptl lock in
      check_pte_range before calling migrate_page_add().
      Signed-off-by: NChristoph Lameter <clameter@sgi.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      132beacf
    • C
      [PATCH] Fold numa_maps into mempolicies.c · 1a75a6c8
      Christoph Lameter 提交于
      First discussed at http://marc.theaimsgroup.com/?t=113149255100001&r=1&w=2
      
      - Use the check_range() in mempolicy.c to gather statistics.
      
      - Improve the numa_maps code in general and fix some comments.
      Signed-off-by: NChristoph Lameter <clameter@sgi.com>
      Cc: Andi Kleen <ak@muc.de>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      1a75a6c8