1. 30 4月, 2013 1 次提交
    • A
      mm: limit growth of 3% hardcoded other user reserve · c9b1d098
      Andrew Shewmaker 提交于
      Add user_reserve_kbytes knob.
      
      Limit the growth of the memory reserved for other user processes to
      min(3% current process size, user_reserve_pages).  Only about 8MB is
      necessary to enable recovery in the default mode, and only a few hundred
      MB are required even when overcommit is disabled.
      
      user_reserve_pages defaults to min(3% free pages, 128MB)
      
      I arrived at 128MB by taking the max VSZ of sshd, login, bash, and top ...
      then adding the RSS of each.
      
      This only affects OVERCOMMIT_NEVER mode.
      
      Background
      
      1. user reserve
      
      __vm_enough_memory reserves a hardcoded 3% of the current process size for
      other applications when overcommit is disabled.  This was done so that a
      user could recover if they launched a memory hogging process.  Without the
      reserve, a user would easily run into a message such as:
      
      bash: fork: Cannot allocate memory
      
      2. admin reserve
      
      Additionally, a hardcoded 3% of free memory is reserved for root in both
      overcommit 'guess' and 'never' modes.  This was intended to prevent a
      scenario where root-cant-log-in and perform recovery operations.
      
      Note that this reserve shrinks, and doesn't guarantee a useful reserve.
      
      Motivation
      
      The two hardcoded memory reserves should be updated to account for current
      memory sizes.
      
      Also, the admin reserve would be more useful if it didn't shrink too much.
      
      When the current code was originally written, 1GB was considered
      "enterprise".  Now the 3% reserve can grow to multiple GB on large memory
      systems, and it only needs to be a few hundred MB at most to enable a user
      or admin to recover a system with an unwanted memory hogging process.
      
      I've found that reducing these reserves is especially beneficial for a
      specific type of application load:
      
       * single application system
       * one or few processes (e.g. one per core)
       * allocating all available memory
       * not initializing every page immediately
       * long running
      
      I've run scientific clusters with this sort of load.  A long running job
      sometimes failed many hours (weeks of CPU time) into a calculation.  They
      weren't initializing all of their memory immediately, and they weren't
      using calloc, so I put systems into overcommit 'never' mode.  These
      clusters run diskless and have no swap.
      
      However, with the current reserves, a user wishing to allocate as much
      memory as possible to one process may be prevented from using, for
      example, almost 2GB out of 32GB.
      
      The effect is less, but still significant when a user starts a job with
      one process per core.  I have repeatedly seen a set of processes
      requesting the same amount of memory fail because one of them could not
      allocate the amount of memory a user would expect to be able to allocate.
      For example, Message Passing Interfce (MPI) processes, one per core.  And
      it is similar for other parallel programming frameworks.
      
      Changing this reserve code will make the overcommit never mode more useful
      by allowing applications to allocate nearly all of the available memory.
      
      Also, the new admin_reserve_kbytes will be safer than the current behavior
      since the hardcoded 3% of available memory reserve can shrink to something
      useless in the case where applications have grabbed all available memory.
      
      Risks
      
      * "bash: fork: Cannot allocate memory"
      
        The downside of the first patch-- which creates a tunable user reserve
        that is only used in overcommit 'never' mode--is that an admin can set
        it so low that a user may not be able to kill their process, even if
        they already have a shell prompt.
      
        Of course, a user can get in the same predicament with the current 3%
        reserve--they just have to launch processes until 3% becomes negligible.
      
      * root-cant-log-in problem
      
        The second patch, adding the tunable rootuser_reserve_pages, allows
        the admin to shoot themselves in the foot by setting it too small.  They
        can easily get the system into a state where root-can't-log-in.
      
        However, the new admin_reserve_kbytes will be safer than the current
        behavior since the hardcoded 3% of available memory reserve can shrink
        to something useless in the case where applications have grabbed all
        available memory.
      
      Alternatives
      
       * Memory cgroups provide a more flexible way to limit application memory.
      
         Not everyone wants to set up cgroups or deal with their overhead.
      
       * We could create a fourth overcommit mode which provides smaller reserves.
      
         The size of useful reserves may be drastically different depending
         on the whether the system is embedded or enterprise.
      
       * Force users to initialize all of their memory or use calloc.
      
         Some users don't want/expect the system to overcommit when they malloc.
         Overcommit 'never' mode is for this scenario, and it should work well.
      
      The new user and admin reserve tunables are simple to use, with low
      overhead compared to cgroups.  The patches preserve current behavior where
      3% of memory is less than 128MB, except that the admin reserve doesn't
      shrink to an unusable size under pressure.  The code allows admins to tune
      for embedded and enterprise usage.
      
      FAQ
      
       * How is the root-cant-login problem addressed?
         What happens if admin_reserve_pages is set to 0?
      
         Root is free to shoot themselves in the foot by setting
         admin_reserve_kbytes too low.
      
         On x86_64, the minimum useful reserve is:
           8MB for overcommit 'guess'
         128MB for overcommit 'never'
      
         admin_reserve_pages defaults to min(3% free memory, 8MB)
      
         So, anyone switching to 'never' mode needs to adjust
         admin_reserve_pages.
      
       * How do you calculate a minimum useful reserve?
      
         A user or the admin needs enough memory to login and perform
         recovery operations, which includes, at a minimum:
      
         sshd or login + bash (or some other shell) + top (or ps, kill, etc.)
      
         For overcommit 'guess', we can sum resident set sizes (RSS)
         because we only need enough memory to handle what the recovery
         programs will typically use. On x86_64 this is about 8MB.
      
         For overcommit 'never', we can take the max of their virtual sizes (VSZ)
         and add the sum of their RSS. We use VSZ instead of RSS because mode
         forces us to ensure we can fulfill all of the requested memory allocations--
         even if the programs only use a fraction of what they ask for.
         On x86_64 this is about 128MB.
      
         When swap is enabled, reserves are useful even when they are as
         small as 10MB, regardless of overcommit mode.
      
         When both swap and overcommit are disabled, then the admin should
         tune the reserves higher to be absolutley safe. Over 230MB each
         was safest in my testing.
      
       * What happens if user_reserve_pages is set to 0?
      
         Note, this only affects overcomitt 'never' mode.
      
         Then a user will be able to allocate all available memory minus
         admin_reserve_kbytes.
      
         However, they will easily see a message such as:
      
         "bash: fork: Cannot allocate memory"
      
         And they won't be able to recover/kill their application.
         The admin should be able to recover the system if
         admin_reserve_kbytes is set appropriately.
      
       * What's the difference between overcommit 'guess' and 'never'?
      
         "Guess" allows an allocation if there are enough free + reclaimable
         pages. It has a hardcoded 3% of free pages reserved for root.
      
         "Never" allows an allocation if there is enough swap + a configurable
         percentage (default is 50) of physical RAM. It has a hardcoded 3% of
         free pages reserved for root, like "Guess" mode. It also has a
         hardcoded 3% of the current process size reserved for additional
         applications.
      
       * Why is overcommit 'guess' not suitable even when an app eventually
         writes to every page? It takes free pages, file pages, available
         swap pages, reclaimable slab pages into consideration. In other words,
         these are all pages available, then why isn't overcommit suitable?
      
         Because it only looks at the present state of the system. It
         does not take into account the memory that other applications have
         malloced, but haven't initialized yet. It overcommits the system.
      
      Test Summary
      
      There was little change in behavior in the default overcommit 'guess'
      mode with swap enabled before and after the patch. This was expected.
      
      Systems run most predictably (i.e. no oom kills) in overcommit 'never'
      mode with swap enabled. This also allowed the most memory to be allocated
      to a user application.
      
      Overcommit 'guess' mode without swap is a bad idea. It is easy to
      crash the system. None of the other tested combinations crashed.
      This matches my experience on the Roadrunner supercomputer.
      
      Without the tunable user reserve, a system in overcommit 'never' mode
      and without swap does not allow the admin to recover, although the
      admin can.
      
      With the new tunable reserves, a system in overcommit 'never' mode
      and without swap can be configured to:
      
      1. maximize user-allocatable memory, running close to the edge of
      recoverability
      
      2. maximize recoverability, sacrificing allocatable memory to
      ensure that a user cannot take down a system
      
      Test Description
      
      Fedora 18 VM - 4 x86_64 cores, 5725MB RAM, 4GB Swap
      
      System is booted into multiuser console mode, with unnecessary services
      turned off. Caches were dropped before each test.
      
      Hogs are user memtester processes that attempt to allocate all free memory
      as reported by /proc/meminfo
      
      In overcommit 'never' mode, memory_ratio=100
      
      Test Results
      
      3.9.0-rc1-mm1
      
      Overcommit | Swap | Hogs | MB Got/Wanted | OOMs | User Recovery | Admin Recovery
      ----------   ----   ----   -------------   ----   -------------   --------------
      guess        yes    1      5432/5432       no     yes             yes
      guess        yes    4      5444/5444       1      yes             yes
      guess        no     1      5302/5449       no     yes             yes
      guess        no     4      -               crash  no              no
      
      never        yes    1      5460/5460       1      yes             yes
      never        yes    4      5460/5460       1      yes             yes
      never        no     1      5218/5432       no     no              yes
      never        no     4      5203/5448       no     no              yes
      
      3.9.0-rc1-mm1-tunablereserves
      
      User and Admin Recovery show their respective reserves, if applicable.
      
      Overcommit | Swap | Hogs | MB Got/Wanted | OOMs | User Recovery | Admin Recovery
      ----------   ----   ----   -------------   ----   -------------   --------------
      guess        yes    1      5419/5419       no     - yes           8MB yes
      guess        yes    4      5436/5436       1      - yes           8MB yes
      guess        no     1      5440/5440       *      - yes           8MB yes
      guess        no     4      -               crash  - no            8MB no
      
      * process would successfully mlock, then the oom killer would pick it
      
      never        yes    1      5446/5446       no     10MB yes        20MB yes
      never        yes    4      5456/5456       no     10MB yes        20MB yes
      never        no     1      5387/5429       no     128MB no        8MB barely
      never        no     1      5323/5428       no     226MB barely    8MB barely
      never        no     1      5323/5428       no     226MB barely    8MB barely
      
      never        no     1      5359/5448       no     10MB no         10MB barely
      
      never        no     1      5323/5428       no     0MB no          10MB barely
      never        no     1      5332/5428       no     0MB no          50MB yes
      never        no     1      5293/5429       no     0MB no          90MB yes
      
      never        no     1      5001/5427       no     230MB yes       338MB yes
      never        no     4*     4998/5424       no     230MB yes       338MB yes
      
      * more memtesters were launched, able to allocate approximately another 100MB
      
      Future Work
      
       - Test larger memory systems.
      
       - Test an embedded image.
      
       - Test other architectures.
      
       - Time malloc microbenchmarks.
      
       - Would it be useful to be able to set overcommit policy for
         each memory cgroup?
      
       - Some lines are slightly above 80 chars.
         Perhaps define a macro to convert between pages and kb?
         Other places in the kernel do this.
      
      [akpm@linux-foundation.org: coding-style fixes]
      [akpm@linux-foundation.org: make init_user_reserve() static]
      Signed-off-by: NAndrew Shewmaker <agshew@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c9b1d098
  2. 04 8月, 2012 1 次提交
  3. 01 8月, 2012 3 次提交
  4. 06 4月, 2011 1 次提交
  5. 28 10月, 2010 1 次提交
  6. 10 8月, 2010 1 次提交
  7. 28 6月, 2010 1 次提交
  8. 25 5月, 2010 2 次提交
  9. 13 3月, 2010 1 次提交
    • K
      memcg: handle panic_on_oom=always case · daaf1e68
      KAMEZAWA Hiroyuki 提交于
      Presently, if panic_on_oom=2, the whole system panics even if the oom
      happend in some special situation (as cpuset, mempolicy....).  Then,
      panic_on_oom=2 means painc_on_oom_always.
      
      Now, memcg doesn't check panic_on_oom flag. This patch adds a check.
      
      BTW, how it's useful ?
      
      kdump+panic_on_oom=2 is the last tool to investigate what happens in
      oom-ed system.  When a task is killed, the sysytem recovers and there will
      be few hint to know what happnes.  In mission critical system, oom should
      never happen.  Then, panic_on_oom=2+kdump is useful to avoid next OOM by
      knowing precise information via snapshot.
      
      TODO:
       - For memcg, it's for isolate system's memory usage, oom-notiifer and
         freeze_at_oom (or rest_at_oom) should be implemented. Then, management
         daemon can do similar jobs (as kdump) or taking snapshot per cgroup.
      Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Reviewed-by: NDaisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      daaf1e68
  10. 04 12月, 2009 1 次提交
  11. 22 9月, 2009 1 次提交
  12. 16 9月, 2009 1 次提交
    • A
      HWPOISON: The high level memory error handler in the VM v7 · 6a46079c
      Andi Kleen 提交于
      Add the high level memory handler that poisons pages
      that got corrupted by hardware (typically by a two bit flip in a DIMM
      or a cache) on the Linux level. The goal is to prevent everyone
      from accessing these pages in the future.
      
      This done at the VM level by marking a page hwpoisoned
      and doing the appropriate action based on the type of page
      it is.
      
      The code that does this is portable and lives in mm/memory-failure.c
      
      To quote the overview comment:
      
      High level machine check handler. Handles pages reported by the
      hardware as being corrupted usually due to a 2bit ECC memory or cache
      failure.
      
      This focuses on pages detected as corrupted in the background.
      When the current CPU tries to consume corruption the currently
      running process can just be killed directly instead. This implies
      that if the error cannot be handled for some reason it's safe to
      just ignore it because no corruption has been consumed yet. Instead
      when that happens another machine check will happen.
      
      Handles page cache pages in various states. The tricky part
      here is that we can access any page asynchronous to other VM
      users, because memory failures could happen anytime and anywhere,
      possibly violating some of their assumptions. This is why this code
      has to be extremely careful. Generally it tries to use normal locking
      rules, as in get the standard locks, even if that means the
      error handling takes potentially a long time.
      
      Some of the operations here are somewhat inefficient and have non
      linear algorithmic complexity, because the data structures have not
      been optimized for this case. This is in particular the case
      for the mapping from a vma to a process. Since this case is expected
      to be rare we hope we can get away with this.
      
      There are in principle two strategies to kill processes on poison:
      - just unmap the data and wait for an actual reference before
      killing
      - kill as soon as corruption is detected.
      Both have advantages and disadvantages and should be used
      in different situations. Right now both are implemented and can
      be switched with a new sysctl vm.memory_failure_early_kill
      The default is early kill.
      
      The patch does some rmap data structure walking on its own to collect
      processes to kill. This is unusual because normally all rmap data structure
      knowledge is in rmap.c only. I put it here for now to keep
      everything together and rmap knowledge has been seeping out anyways
      
      Includes contributions from Johannes Weiner, Chris Mason, Fengguang Wu,
      Nick Piggin (who did a lot of great work) and others.
      
      Cc: npiggin@suse.de
      Cc: riel@redhat.com
      Signed-off-by: NAndi Kleen <ak@linux.intel.com>
      Acked-by: NRik van Riel <riel@redhat.com>
      Reviewed-by: NHidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
      6a46079c
  13. 17 6月, 2009 2 次提交
    • M
      vmscan: properly account for the number of page cache pages zone_reclaim() can reclaim · 90afa5de
      Mel Gorman 提交于
      A bug was brought to my attention against a distro kernel but it affects
      mainline and I believe problems like this have been reported in various
      guises on the mailing lists although I don't have specific examples at the
      moment.
      
      The reported problem was that malloc() stalled for a long time (minutes in
      some cases) if a large tmpfs mount was occupying a large percentage of
      memory overall.  The pages did not get cleaned or reclaimed by
      zone_reclaim() because the zone_reclaim_mode was unsuitable, but the lists
      are uselessly scanned frequencly making the CPU spin at near 100%.
      
      This patchset intends to address that bug and bring the behaviour of
      zone_reclaim() more in line with expectations which were noticed during
      investigation.  It is based on top of mmotm and takes advantage of
      Kosaki's work with respect to zone_reclaim().
      
      Patch 1 fixes the heuristics that zone_reclaim() uses to determine if the
      	scan should go ahead. The broken heuristic is what was causing the
      	malloc() stall as it uselessly scanned the LRU constantly. Currently,
      	zone_reclaim is assuming zone_reclaim_mode is 1 and historically it
      	could not deal with tmpfs pages at all. This fixes up the heuristic so
      	that an unnecessary scan is more likely to be correctly avoided.
      
      Patch 2 notes that zone_reclaim() returning a failure automatically means
      	the zone is marked full. This is not always true. It could have
      	failed because the GFP mask or zone_reclaim_mode were unsuitable.
      
      Patch 3 introduces a counter zreclaim_failed that will increment each
      	time the zone_reclaim scan-avoidance heuristics fail. If that
      	counter is rapidly increasing, then zone_reclaim_mode should be
      	set to 0 as a temporarily resolution and a bug reported because
      	the scan-avoidance heuristic is still broken.
      
      This patch:
      
      On NUMA machines, the administrator can configure zone_reclaim_mode that
      is a more targetted form of direct reclaim.  On machines with large NUMA
      distances for example, a zone_reclaim_mode defaults to 1 meaning that
      clean unmapped pages will be reclaimed if the zone watermarks are not
      being met.
      
      There is a heuristic that determines if the scan is worthwhile but the
      problem is that the heuristic is not being properly applied and is
      basically assuming zone_reclaim_mode is 1 if it is enabled.  The lack of
      proper detection can manfiest as high CPU usage as the LRU list is scanned
      uselessly.
      
      Historically, once enabled it was depending on NR_FILE_PAGES which may
      include swapcache pages that the reclaim_mode cannot deal with.  Patch
      vmscan-change-the-number-of-the-unmapped-files-in-zone-reclaim.patch by
      Kosaki Motohiro noted that zone_page_state(zone, NR_FILE_PAGES) included
      pages that were not file-backed such as swapcache and made a calculation
      based on the inactive, active and mapped files.  This is far superior when
      zone_reclaim==1 but if RECLAIM_SWAP is set, then NR_FILE_PAGES is a
      reasonable starting figure.
      
      This patch alters how zone_reclaim() works out how many pages it might be
      able to reclaim given the current reclaim_mode.  If RECLAIM_SWAP is set in
      the reclaim_mode it will either consider NR_FILE_PAGES as potential
      candidates or else use NR_{IN}ACTIVE}_PAGES-NR_FILE_MAPPED to discount
      swapcache and other non-file-backed pages.  If RECLAIM_WRITE is not set,
      then NR_FILE_DIRTY number of pages are not candidates.  If RECLAIM_SWAP is
      not set, then NR_FILE_MAPPED are not.
      
      [kosaki.motohiro@jp.fujitsu.com: Estimate unmapped pages minus tmpfs pages]
      [fengguang.wu@intel.com: Fix underflow problem in Kosaki's estimate]
      Signed-off-by: NMel Gorman <mel@csn.ul.ie>
      Reviewed-by: NRik van Riel <riel@redhat.com>
      Acked-by: NChristoph Lameter <cl@linux-foundation.org>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: <stable@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      90afa5de
    • M
      page allocator: use allocation flags as an index to the zone watermark · 41858966
      Mel Gorman 提交于
      ALLOC_WMARK_MIN, ALLOC_WMARK_LOW and ALLOC_WMARK_HIGH determin whether
      pages_min, pages_low or pages_high is used as the zone watermark when
      allocating the pages.  Two branches in the allocator hotpath determine
      which watermark to use.
      
      This patch uses the flags as an array index into a watermark array that is
      indexed with WMARK_* defines accessed via helpers.  All call sites that
      use zone->pages_* are updated to use the helpers for accessing the values
      and the array offsets for setting.
      Signed-off-by: NMel Gorman <mel@csn.ul.ie>
      Reviewed-by: NChristoph Lameter <cl@linux-foundation.org>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Dave Hansen <dave@linux.vnet.ibm.com>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      41858966
  14. 13 6月, 2009 1 次提交
  15. 15 5月, 2009 1 次提交
    • J
      Revert "mm: add /proc controls for pdflush threads" · cd17cbfd
      Jens Axboe 提交于
      This reverts commit fafd688e.
      
      Work is progressing to switch away from pdflush as the process backing
      for flushing out dirty data. So it seems pointless to add more knobs
      to control pdflush threads. The original author of the patch did not
      have any specific use cases for adding the knobs, so we can easily
      revert this before 2.6.30 to avoid having to maintain this API
      forever.
      Signed-off-by: NJens Axboe <jens.axboe@oracle.com>
      cd17cbfd
  16. 03 5月, 2009 1 次提交
    • A
      mm: prevent divide error for small values of vm_dirty_bytes · 9e4a5bda
      Andrea Righi 提交于
      Avoid setting less than two pages for vm_dirty_bytes: this is necessary to
      avoid potential division by 0 (like the following) in get_dirty_limits().
      
      [   49.951610] divide error: 0000 [#1] PREEMPT SMP
      [   49.952195] last sysfs file: /sys/devices/pci0000:00/0000:00:01.1/host0/target0:0:0/0:0:0:0/block/sda/uevent
      [   49.952195] CPU 1
      [   49.952195] Modules linked in: pcspkr
      [   49.952195] Pid: 3064, comm: dd Not tainted 2.6.30-rc3 #1
      [   49.952195] RIP: 0010:[<ffffffff802d39a9>]  [<ffffffff802d39a9>] get_dirty_limits+0xe9/0x2c0
      [   49.952195] RSP: 0018:ffff88001de03a98  EFLAGS: 00010202
      [   49.952195] RAX: 00000000000000c0 RBX: ffff88001de03b80 RCX: 28f5c28f5c28f5c3
      [   49.952195] RDX: 0000000000000000 RSI: 00000000000000c0 RDI: 0000000000000000
      [   49.952195] RBP: ffff88001de03ae8 R08: 0000000000000000 R09: 0000000000000000
      [   49.952195] R10: ffff88001ddda9a0 R11: 0000000000000001 R12: 0000000000000001
      [   49.952195] R13: ffff88001fbc8218 R14: ffff88001de03b70 R15: ffff88001de03b78
      [   49.952195] FS:  00007fe9a435b6f0(0000) GS:ffff8800025d9000(0000) knlGS:0000000000000000
      [   49.952195] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [   49.952195] CR2: 00007fe9a39ab000 CR3: 000000001de38000 CR4: 00000000000006e0
      [   49.952195] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [   49.952195] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      [   49.952195] Process dd (pid: 3064, threadinfo ffff88001de02000, task ffff88001ddda250)
      [   49.952195] Stack:
      [   49.952195]  ffff88001fa0de00 ffff88001f2dbd70 ffff88001f9fe800 000080b900000000
      [   49.952195]  00000000000000c0 ffff8800027a6100 0000000000000400 ffff88001fbc8218
      [   49.952195]  0000000000000000 0000000000000600 ffff88001de03bb8 ffffffff802d3ed7
      [   49.952195] Call Trace:
      [   49.952195]  [<ffffffff802d3ed7>] balance_dirty_pages_ratelimited_nr+0x1d7/0x3f0
      [   49.952195]  [<ffffffff80368f8e>] ? ext3_writeback_write_end+0x9e/0x120
      [   49.952195]  [<ffffffff802cc7df>] generic_file_buffered_write+0x12f/0x330
      [   49.952195]  [<ffffffff802cce8d>] __generic_file_aio_write_nolock+0x26d/0x460
      [   49.952195]  [<ffffffff802cda32>] ? generic_file_aio_write+0x52/0xd0
      [   49.952195]  [<ffffffff802cda49>] generic_file_aio_write+0x69/0xd0
      [   49.952195]  [<ffffffff80365fa6>] ext3_file_write+0x26/0xc0
      [   49.952195]  [<ffffffff803034d1>] do_sync_write+0xf1/0x140
      [   49.952195]  [<ffffffff80290d1a>] ? get_lock_stats+0x2a/0x60
      [   49.952195]  [<ffffffff80280730>] ? autoremove_wake_function+0x0/0x40
      [   49.952195]  [<ffffffff8030411b>] vfs_write+0xcb/0x190
      [   49.952195]  [<ffffffff803042d0>] sys_write+0x50/0x90
      [   49.952195]  [<ffffffff8022ff6b>] system_call_fastpath+0x16/0x1b
      [   49.952195] Code: 00 00 00 2b 05 09 1c 17 01 48 89 c6 49 0f af f4 48 c1 ee 02 48 89 f0 48 f7 e1 48 89 d6 31 d2 48 c1 ee 02 48 0f af 75 d0 48 89 f0 <48> f7 f7 41 8b 95 ac 01 00 00 48 89 c7 49 0f af d4 48 c1 ea 02
      [   49.952195] RIP  [<ffffffff802d39a9>] get_dirty_limits+0xe9/0x2c0
      [   49.952195]  RSP <ffff88001de03a98>
      [   50.096523] ---[ end trace 008d7aa02f244d7b ]---
      Signed-off-by: NAndrea Righi <righi.andrea@gmail.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9e4a5bda
  17. 07 4月, 2009 1 次提交
    • P
      mm: add /proc controls for pdflush threads · fafd688e
      Peter W Morreale 提交于
      Add /proc entries to give the admin the ability to control the minimum and
      maximum number of pdflush threads.  This allows finer control of pdflush
      on both large and small machines.
      
      The rationale is simply one size does not fit all.  Admins on large and/or
      small systems may want to tune the min/max pdflush thread count to best
      suit their needs.  Right now the min/max is hardcoded to 2/8.  While
      probably a fair estimate for smaller machines, large machines with large
      numbers of CPUs and large numbers of filesystems/block devices may benefit
      from larger numbers of threads working on different block devices.
      
      Even if the background flushing algorithm is radically changed, it is
      still likely that multiple threads will be involved and admins would still
      desire finer control on the min/max other than to have to recompile the
      kernel.
      
      The patch adds '/proc/sys/vm/nr_pdflush_threads_min' and
      '/proc/sys/vm/nr_pdflush_threads_max' with r/w permissions.
      
      The minimum value for nr_pdflush_threads_min is 1 and the maximum value is
      the current value of nr_pdflush_threads_max.  This minimum is required
      since additional thread creation is performed in a pdflush thread itself.
      
      The minimum value for nr_pdflush_threads_max is the current value of
      nr_pdflush_threads_min and the maximum value can be 1000.
      
      Documentation/sysctl/vm.txt is also updated.
      
      [akpm@linux-foundation.org: fix comment, fix whitespace, use __read_mostly]
      Signed-off-by: NPeter W Morreale <pmorreale@novell.com>
      Reviewed-by: NRik van Riel <riel@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      fafd688e
  18. 16 1月, 2009 1 次提交
  19. 08 1月, 2009 1 次提交
    • P
      NOMMU: Make mmap allocation page trimming behaviour configurable. · dd8632a1
      Paul Mundt 提交于
      NOMMU mmap allocates a piece of memory for an mmap that's rounded up in size to
      the nearest power-of-2 number of pages.  Currently it then discards the excess
      pages back to the page allocator, making that memory available for use by other
      things.  This can, however, cause greater amount of fragmentation.
      
      To counter this, a sysctl is added in order to fine-tune the trimming
      behaviour.  The default behaviour remains to trim pages aggressively, while
      this can either be disabled completely or set to a higher page-granular
      watermark in order to have finer-grained control.
      
      vm region vm_top bits taken from an earlier patch by David Howells.
      Signed-off-by: NPaul Mundt <lethal@linux-sh.org>
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      Tested-by: NMike Frysinger <vapier.adi@gmail.com>
      dd8632a1
  20. 07 1月, 2009 1 次提交
    • D
      mm: add dirty_background_bytes and dirty_bytes sysctls · 2da02997
      David Rientjes 提交于
      This change introduces two new sysctls to /proc/sys/vm:
      dirty_background_bytes and dirty_bytes.
      
      dirty_background_bytes is the counterpart to dirty_background_ratio and
      dirty_bytes is the counterpart to dirty_ratio.
      
      With growing memory capacities of individual machines, it's no longer
      sufficient to specify dirty thresholds as a percentage of the amount of
      dirtyable memory over the entire system.
      
      dirty_background_bytes and dirty_bytes specify quantities of memory, in
      bytes, that represent the dirty limits for the entire system.  If either
      of these values is set, its value represents the amount of dirty memory
      that is needed to commence either background or direct writeback.
      
      When a `bytes' or `ratio' file is written, its counterpart becomes a
      function of the written value.  For example, if dirty_bytes is written to
      be 8096, 8K of memory is required to commence direct writeback.
      dirty_ratio is then functionally equivalent to 8K / the amount of
      dirtyable memory:
      
      	dirtyable_memory = free pages + mapped pages + file cache
      
      	dirty_background_bytes = dirty_background_ratio * dirtyable_memory
      		-or-
      	dirty_background_ratio = dirty_background_bytes / dirtyable_memory
      
      		AND
      
      	dirty_bytes = dirty_ratio * dirtyable_memory
      		-or-
      	dirty_ratio = dirty_bytes / dirtyable_memory
      
      Only one of dirty_background_bytes and dirty_background_ratio may be
      specified at a time, and only one of dirty_bytes and dirty_ratio may be
      specified.  When one sysctl is written, the other appears as 0 when read.
      
      The `bytes' files operate on a page size granularity since dirty limits
      are compared with ZVC values, which are in page units.
      
      Prior to this change, the minimum dirty_ratio was 5 as implemented by
      get_dirty_limits() although /proc/sys/vm/dirty_ratio would show any user
      written value between 0 and 100.  This restriction is maintained, but
      dirty_bytes has a lower limit of only one page.
      
      Also prior to this change, the dirty_background_ratio could not equal or
      exceed dirty_ratio.  This restriction is maintained in addition to
      restricting dirty_background_bytes.  If either background threshold equals
      or exceeds that of the dirty threshold, it is implicitly set to half the
      dirty threshold.
      Acked-by: NPeter Zijlstra <peterz@infradead.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      Cc: Andrea Righi <righi.andrea@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2da02997
  21. 27 7月, 2008 1 次提交
  22. 08 2月, 2008 1 次提交
    • D
      oom: add sysctl to enable task memory dump · fef1bdd6
      David Rientjes 提交于
      Adds a new sysctl, 'oom_dump_tasks', that enables the kernel to produce a
      dump of all system tasks (excluding kernel threads) when performing an
      OOM-killing.  Information includes pid, uid, tgid, vm size, rss, cpu,
      oom_adj score, and name.
      
      This is helpful for determining why there was an OOM condition and which
      rogue task caused it.
      
      It is configurable so that large systems, such as those with several
      thousand tasks, do not incur a performance penalty associated with dumping
      data they may not desire.
      
      If an OOM was triggered as a result of a memory controller, the tasklist
      shall be filtered to exclude tasks that are not a member of the same
      cgroup.
      
      Cc: Andrea Arcangeli <andrea@suse.de>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      fef1bdd6
  23. 06 2月, 2008 1 次提交
    • B
      mm/page-writeback: highmem_is_dirtyable option · 195cf453
      Bron Gondwana 提交于
      Add vm.highmem_is_dirtyable toggle
      
      A 32 bit machine with HIGHMEM64 enabled running DCC has an MMAPed file of
      approximately 2Gb size which contains a hash format that is written
      randomly by the dbclean process.  On 2.6.16 this process took a few
      minutes.  With lowmem only accounting of dirty ratios, this takes about 12
      hours of 100% disk IO, all random writes.
      
      Include a toggle in /proc/sys/vm/highmem_is_dirtyable which can be set to 1 to
      add the highmem back to the total available memory count.
      
      [akpm@linux-foundation.org: Fix the CONFIG_DETECT_SOFTLOCKUP=y build]
      Signed-off-by: NBron Gondwana <brong@fastmail.fm>
      Cc: Ethan Solomita <solo@google.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: WU Fengguang <wfg@mail.ustc.edu.cn>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      195cf453
  24. 18 12月, 2007 1 次提交
  25. 17 10月, 2007 2 次提交
  26. 18 7月, 2007 1 次提交
    • M
      handle kernelcore=: generic · ed7ed365
      Mel Gorman 提交于
      This patch adds the kernelcore= parameter for x86.
      
      Once all patches are applied, a new command-line parameter exist and a new
      sysctl.  This patch adds the necessary documentation.
      
      From: Yasunori Goto <y-goto@jp.fujitsu.com>
      
        When "kernelcore" boot option is specified, kernel can't boot up on ia64
        because of an infinite loop.  In addition, the parsing code can be handled
        in an architecture-independent manner.
      
        This patch uses common code to handle the kernelcore= parameter.  It is
        only available to architectures that support arch-independent zone-sizing
        (i.e.  define CONFIG_ARCH_POPULATES_NODE_MAP).  Other architectures will
        ignore the boot parameter.
      
      [bunk@stusta.de: make cmdline_parse_kernelcore() static]
      Signed-off-by: NMel Gorman <mel@csn.ul.ie>
      Signed-off-by: NYasunori Goto <y-goto@jp.fujitsu.com>
      Acked-by: NAndy Whitcroft <apw@shadowen.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ed7ed365
  27. 17 7月, 2007 1 次提交
    • K
      change zonelist order: zonelist order selection logic · f0c0b2b8
      KAMEZAWA Hiroyuki 提交于
      Make zonelist creation policy selectable from sysctl/boot option v6.
      
      This patch makes NUMA's zonelist (of pgdat) order selectable.
      Available order are Default(automatic)/ Node-based / Zone-based.
      
      [Default Order]
      The kernel selects Node-based or Zone-based order automatically.
      
      [Node-based Order]
      This policy treats the locality of memory as the most important parameter.
      Zonelist order is created by each zone's locality. This means lower zones
      (ex. ZONE_DMA) can be used before higher zone (ex. ZONE_NORMAL) exhausion.
      IOW. ZONE_DMA will be in the middle of zonelist.
      current 2.6.21 kernel uses this.
      
      Pros.
       * A user can expect local memory as much as possible.
      Cons.
       * lower zone will be exhansted before higher zone. This may cause OOM_KILL.
      
      Maybe suitable if ZONE_DMA is relatively big and you never see OOM_KILL
      because of ZONE_DMA exhaution and you need the best locality.
      
      (example)
      assume 2 node NUMA. node(0) has ZONE_DMA/ZONE_NORMAL, node(1) has ZONE_NORMAL.
      
      *node(0)'s memory allocation order:
      
       node(0)'s NORMAL -> node(0)'s DMA -> node(1)'s NORMAL.
      
      *node(1)'s memory allocation order:
      
       node(1)'s NORMAL -> node(0)'s NORMAL -> node(0)'s DMA.
      
      [Zone-based order]
      This policy treats the zone type as the most important parameter.
      Zonelist order is created by zone-type order. This means lower zone
      never be used bofere higher zone exhaustion.
      IOW. ZONE_DMA will be always at the tail of zonelist.
      
      Pros.
       * OOM_KILL(bacause of lower zone) occurs only if the whole zones are exhausted.
      Cons.
       * memory locality may not be best.
      
      (example)
      assume 2 node NUMA. node(0) has ZONE_DMA/ZONE_NORMAL, node(1) has ZONE_NORMAL.
      
      *node(0)'s memory allocation order:
      
       node(0)'s NORMAL -> node(1)'s NORMAL -> node(0)'s DMA.
      
      *node(1)'s memory allocation order:
      
       node(1)'s NORMAL -> node(0)'s NORMAL -> node(0)'s DMA.
      
      bootoption "numa_zonelist_order=" and proc/sysctl is supporetd.
      
      command:
      %echo N > /proc/sys/vm/numa_zonelist_order
      
      Will rebuild zonelist in Node-based order.
      
      command:
      %echo Z > /proc/sys/vm/numa_zonelist_order
      
      Will rebuild zonelist in Zone-based order.
      
      Thanks to Lee Schermerhorn, he gives me much help and codes.
      
      [Lee.Schermerhorn@hp.com: add check_highest_zone to build_zonelists_in_zone_order]
      [akpm@linux-foundation.org: build fix]
      Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: Andi Kleen <ak@suse.de>
      Cc: "jesse.barnes@intel.com" <jesse.barnes@intel.com>
      Signed-off-by: NLee Schermerhorn <lee.schermerhorn@hp.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f0c0b2b8
  28. 12 7月, 2007 1 次提交
    • E
      security: Protection for exploiting null dereference using mmap · ed032189
      Eric Paris 提交于
      Add a new security check on mmap operations to see if the user is attempting
      to mmap to low area of the address space.  The amount of space protected is
      indicated by the new proc tunable /proc/sys/vm/mmap_min_addr and defaults to
      0, preserving existing behavior.
      
      This patch uses a new SELinux security class "memprotect."  Policy already
      contains a number of allow rules like a_t self:process * (unconfined_t being
      one of them) which mean that putting this check in the process class (its
      best current fit) would make it useless as all user processes, which we also
      want to protect against, would be allowed. By taking the memprotect name of
      the new class it will also make it possible for us to move some of the other
      memory protect permissions out of 'process' and into the new class next time
      we bump the policy version number (which I also think is a good future idea)
      Acked-by: NStephen Smalley <sds@tycho.nsa.gov>
      Acked-by: NChris Wright <chrisw@sous-sol.org>
      Signed-off-by: NEric Paris <eparis@redhat.com>
      Signed-off-by: NJames Morris <jmorris@namei.org>
      ed032189
  29. 08 5月, 2007 1 次提交
  30. 30 11月, 2006 1 次提交
  31. 26 9月, 2006 1 次提交
    • C
      [PATCH] zone_reclaim: dynamic slab reclaim · 0ff38490
      Christoph Lameter 提交于
      Currently one can enable slab reclaim by setting an explicit option in
      /proc/sys/vm/zone_reclaim_mode.  Slab reclaim is then used as a final
      option if the freeing of unmapped file backed pages is not enough to free
      enough pages to allow a local allocation.
      
      However, that means that the slab can grow excessively and that most memory
      of a node may be used by slabs.  We have had a case where a machine with
      46GB of memory was using 40-42GB for slab.  Zone reclaim was effective in
      dealing with pagecache pages.  However, slab reclaim was only done during
      global reclaim (which is a bit rare on NUMA systems).
      
      This patch implements slab reclaim during zone reclaim.  Zone reclaim
      occurs if there is a danger of an off node allocation.  At that point we
      
      1. Shrink the per node page cache if the number of pagecache
         pages is more than min_unmapped_ratio percent of pages in a zone.
      
      2. Shrink the slab cache if the number of the nodes reclaimable slab pages
         (patch depends on earlier one that implements that counter)
         are more than min_slab_ratio (a new /proc/sys/vm tunable).
      
      The shrinking of the slab cache is a bit problematic since it is not node
      specific.  So we simply calculate what point in the slab we want to reach
      (current per node slab use minus the number of pages that neeed to be
      allocated) and then repeately run the global reclaim until that is
      unsuccessful or we have reached the limit.  I hope we will have zone based
      slab reclaim at some point which will make that easier.
      
      The default for the min_slab_ratio is 5%
      
      Also remove the slab option from /proc/sys/vm/zone_reclaim_mode.
      
      [akpm@osdl.org: cleanups]
      Signed-off-by: NChristoph Lameter <clameter@sgi.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      0ff38490
  32. 04 7月, 2006 1 次提交
    • C
      [PATCH] ZVC/zone_reclaim: Leave 1% of unmapped pagecache pages for file I/O · 9614634f
      Christoph Lameter 提交于
      It turns out that it is advantageous to leave a small portion of unmapped file
      backed pages if all of a zone's pages (or almost all pages) are allocated and
      so the page allocator has to go off-node.
      
      This allows recently used file I/O buffers to stay on the node and
      reduces the times that zone reclaim is invoked if file I/O occurs
      when we run out of memory in a zone.
      
      The problem is that zone reclaim runs too frequently when the page cache is
      used for file I/O (read write and therefore unmapped pages!) alone and we have
      almost all pages of the zone allocated.  Zone reclaim may remove 32 unmapped
      pages.  File I/O will use these pages for the next read/write requests and the
      unmapped pages increase.  After the zone has filled up again zone reclaim will
      remove it again after only 32 pages.  This cycle is too inefficient and there
      are potentially too many zone reclaim cycles.
      
      With the 1% boundary we may still remove all unmapped pages for file I/O in
      zone reclaim pass.  However.  it will take a large number of read and writes
      to get back to 1% again where we trigger zone reclaim again.
      
      The zone reclaim 2.6.16/17 does not show this behavior because we have a 30
      second timeout.
      
      [akpm@osdl.org: rename the /proc file and the variable]
      Signed-off-by: NChristoph Lameter <clameter@sgi.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      9614634f
  33. 01 7月, 2006 1 次提交
  34. 23 6月, 2006 1 次提交
  35. 02 2月, 2006 1 次提交
    • C
      [PATCH] Reclaim slab during zone reclaim · 2a16e3f4
      Christoph Lameter 提交于
      If large amounts of zone memory are used by empty slabs then zone_reclaim
      becomes uneffective.  This patch shakes the slab a bit.
      
      The problem with this patch is that the slab reclaim is not containable to a
      zone.  Thus slab reclaim may affect the whole system and be extremely slow.
      This also means that we cannot determine how many pages were freed in this
      zone.  Thus we need to go off node for at least one allocation.
      
      The functionality is disabled by default.
      
      We could modify the shrinkers to take a zone parameter but that would be quite
      invasive.  Better ideas are welcome.
      Signed-off-by: NChristoph Lameter <clameter@sgi.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      2a16e3f4