1. 14 April 2009 (6 commits)
  2. 07 April 2009 (3 commits)
    • mm: add /proc controls for pdflush threads · fafd688e
      Peter W Morreale committed
      Add /proc entries to give the admin the ability to control the minimum and
      maximum number of pdflush threads.  This allows finer control of pdflush
      on both large and small machines.
      
      The rationale is simply one size does not fit all.  Admins on large and/or
      small systems may want to tune the min/max pdflush thread count to best
      suit their needs.  Right now the min/max is hardcoded to 2/8.  While
      probably a fair estimate for smaller machines, large machines with large
      numbers of CPUs and large numbers of filesystems/block devices may benefit
      from larger numbers of threads working on different block devices.
      
      Even if the background flushing algorithm is radically changed, it is
      still likely that multiple threads will be involved and admins would still
      desire finer control on the min/max other than to have to recompile the
      kernel.
      
      The patch adds '/proc/sys/vm/nr_pdflush_threads_min' and
      '/proc/sys/vm/nr_pdflush_threads_max' with r/w permissions.
      
      The minimum value for nr_pdflush_threads_min is 1 and the maximum value is
      the current value of nr_pdflush_threads_max.  This minimum is required
      since additional thread creation is performed in a pdflush thread itself.
      
      The minimum value for nr_pdflush_threads_max is the current value of
      nr_pdflush_threads_min and the maximum value can be 1000.
      
      Documentation/sysctl/vm.txt is also updated.
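
      As a rough illustration of how such knobs are typically wired up (this is
      not the literal patch; the variable and bound names are assumptions), the
      two files can be exposed as sysctl table entries using proc_dointvec_minmax:

        static int one = 1;     /* lower bound for nr_pdflush_threads_min */

        static struct ctl_table pdflush_table[] = {
            {
                .procname     = "nr_pdflush_threads_min",
                .data         = &nr_pdflush_threads_min,
                .maxlen       = sizeof(int),
                .mode         = 0644,
                .proc_handler = proc_dointvec_minmax,
                .extra1       = &one,                      /* minimum value is 1 */
                .extra2       = &nr_pdflush_threads_max,   /* capped by the current max */
            },
            { }
        };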
      
      [akpm@linux-foundation.org: fix comment, fix whitespace, use __read_mostly]
      Signed-off-by: Peter W Morreale <pmorreale@novell.com>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      fafd688e
    • mm: fix pdflush thread creation upper bound · a56ed663
      Peter W Morreale committed
      Fix a race on creating pdflush threads.  Without the patch, it is possible
      to create more than MAX_PDFLUSH_THREADS threads, and this has been
      observed in practice on IO loaded SMP machines.
      
      The fix involves moving the lock around to protect the check against the
      thread count and correctly dealing with thread creation failure.
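
      A minimal sketch of the idea, assuming the pdflush_lock/nr_pdflush_threads
      names used by mm/pdflush.c (simplified, not the literal patch):

        spin_lock(&pdflush_lock);
        if (nr_pdflush_threads < MAX_PDFLUSH_THREADS) {
            nr_pdflush_threads++;       /* claim the slot while holding the lock */
            spin_unlock(&pdflush_lock);
            if (IS_ERR(kthread_run(pdflush, NULL, "pdflush"))) {
                spin_lock(&pdflush_lock);
                nr_pdflush_threads--;   /* creation failed: give the slot back */
                spin_unlock(&pdflush_lock);
                /* not fatal; another attempt is made a second later */
            }
        } else {
            spin_unlock(&pdflush_lock);
        }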
      
      This fix also _mostly_ repairs a race condition on how quickly the threads
      are created.  The original intent was to create a pdflush thread (up to
      the max allowed) every second.  Without this patch it is possible to
      create NCPUS pdflush threads concurrently.  The 'mostly' caveat is because
      an assumption is made that thread creation will be successful.  If we fail
      to create the thread, the miss is not considered fatal.  (we will try
      again in 1 second)
      Signed-off-by: Peter W Morreale <pmorreale@novell.com>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a56ed663
    • percpu: __percpu_depopulate_mask can take a const mask · 5d6700ea
      Stephen Rothwell committed
      This eliminates a compiler warning:
      
        mm/allocpercpu.c: In function 'free_percpu':
        mm/allocpercpu.c:146: warning: passing argument 2 of '__percpu_depopulate_mask' discards qualifiers from pointer target type
      Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5d6700ea
  3. 06 April 2009 (1 commit)
  4. 04 April 2009 (1 commit)
  5. 03 April 2009 (22 commits)
    • CacheFiles: Permit the page lock state to be monitored · 385e1ca5
      David Howells committed
      Add a function to install a monitor on the page lock waitqueue for a particular
      page, thus allowing the page being unlocked to be detected.
      
      This is used by CacheFiles to detect read completion on a page in the backing
      filesystem so that it can then copy the data to the waiting netfs page.
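
      The monitor hooks into the same waitqueue that unlock_page() wakes.  A
      sketch of what such a helper presumably looks like (the name
      add_page_wait_queue is an assumption based on the description above):

        void add_page_wait_queue(struct page *page, wait_queue_t *waiter)
        {
            wait_queue_head_t *q = page_waitqueue(page);
            unsigned long flags;

            /* queue the caller's entry; its wake function is run when the
             * page is unlocked and the page's waitqueue is woken */
            spin_lock_irqsave(&q->lock, flags);
            __add_wait_queue(q, waiter);
            spin_unlock_irqrestore(&q->lock, flags);
        }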
      Signed-off-by: David Howells <dhowells@redhat.com>
      Acked-by: Steve Dickson <steved@redhat.com>
      Acked-by: Trond Myklebust <Trond.Myklebust@netapp.com>
      Acked-by: Rik van Riel <riel@redhat.com>
      Acked-by: Al Viro <viro@zeniv.linux.org.uk>
      Tested-by: Daire Byrne <Daire.Byrne@framestore.com>
      385e1ca5
    • FS-Cache: Recruit a page flags for cache management · 266cf658
      David Howells committed
      Recruit a page flag to aid in cache management.  The following extra flag is
      defined:
      
       (1) PG_fscache (PG_private_2)
      
           The marked page is backed by a local cache and is pinning resources in the
           cache driver.
      
      If PG_fscache is set, then things that checked for PG_private will now also
      check for that.  This includes things like truncation and page invalidation.
      The function page_has_private() has been added to check for both
      PG_private and PG_private_2 at the same time.
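
      A minimal sketch of what that combined test presumably looks like (the
      macro name is an assumption; details may differ):

        /* both flags mean the page carries private/cache state that must be
         * dealt with before the page can be released or invalidated */
        #define PAGE_FLAGS_PRIVATE  (1UL << PG_private | 1UL << PG_private_2)

        static inline int page_has_private(struct page *page)
        {
            return !!(page->flags & PAGE_FLAGS_PRIVATE);
        }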
      Signed-off-by: David Howells <dhowells@redhat.com>
      Acked-by: Steve Dickson <steved@redhat.com>
      Acked-by: Trond Myklebust <Trond.Myklebust@netapp.com>
      Acked-by: Rik van Riel <riel@redhat.com>
      Acked-by: Al Viro <viro@zeniv.linux.org.uk>
      Tested-by: Daire Byrne <Daire.Byrne@framestore.com>
      266cf658
    • FS-Cache: Release page->private after failed readahead · 03fb3d2a
      David Howells committed
      The attached patch causes read_cache_pages() to release page-private data on a
      page for which add_to_page_cache() fails.  If the filler function fails, then
      the problematic page is left attached to the pagecache (with appropriate flags
      set, one presumes) and the remaining to-be-attached pages are invalidated and
      discarded.  This permits pages with caching references associated with them to
      be cleaned up.
      
      The invalidatepage() address space op is called (indirectly) to do the honours.
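
      A sketch of the kind of cleanup this implies (an assumed shape, the helper
      name is illustrative): a page that could not be added to the page cache
      gets its private data invalidated before being dropped.

        static void read_cache_pages_invalidate_page(struct address_space *mapping,
                                                     struct page *page)
        {
            if (page_has_private(page)) {
                if (!trylock_page(page))
                    BUG();
                page->mapping = mapping;    /* ->invalidatepage() expects one */
                do_invalidatepage(page, 0);
                page->mapping = NULL;
                unlock_page(page);
            }
            page_cache_release(page);
        }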
      Signed-off-by: David Howells <dhowells@redhat.com>
      Acked-by: Steve Dickson <steved@redhat.com>
      Acked-by: Trond Myklebust <Trond.Myklebust@netapp.com>
      Acked-by: Rik van Riel <riel@redhat.com>
      Acked-by: Al Viro <viro@zeniv.linux.org.uk>
      Tested-by: Daire Byrne <Daire.Byrne@framestore.com>
      03fb3d2a
    • kmemtrace: trace kfree() calls with NULL or zero-length objects · 2121db74
      Pekka Enberg committed
      Impact: also output kfree(NULL) entries
      
      This patch moves the trace_kfree() calls before the ZERO_OR_NULL_PTR
      check so that we can trace call-sites that call kfree() with NULL many
      times which might be an indication of a bug.
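
      The resulting order in kfree() is presumably along these lines (sketch;
      the slab-specific free path is omitted):

        void kfree(const void *x)
        {
            trace_kfree(_RET_IP_, x);   /* now emitted even for kfree(NULL) */

            if (unlikely(ZERO_OR_NULL_PTR(x)))
                return;                 /* nothing to free */

            /* ... normal slab free path ... */
        }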
      Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: Eduard - Gabriel Munteanu <eduard.munteanu@linux360.ro>
      LKML-Reference: <1237971957.30175.18.camel@penberg-laptop>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      2121db74
    • kmemtrace: use tracepoints · ca2b84cb
      Eduard - Gabriel Munteanu committed
      kmemtrace now uses tracepoints instead of markers. We no longer need to
      use format specifiers to pass arguments.
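
      With tracepoints, each probe point is declared with a typed prototype
      rather than a marker format string; roughly (illustrative, the real
      declarations live in the kmemtrace headers):

        DECLARE_TRACE(kfree,
            TP_PROTO(unsigned long call_site, const void *ptr),
            TP_ARGS(call_site, ptr));

        /* at the call site, e.g. in kfree(): */
        trace_kfree(_RET_IP_, x);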
      Signed-off-by: Eduard - Gabriel Munteanu <eduard.munteanu@linux360.ro>
      [ folded: Use the new TP_PROTO and TP_ARGS to fix the build.     ]
      [ folded: fix build when CONFIG_KMEMTRACE is disabled.           ]
      [ folded: define tracepoints when CONFIG_TRACEPOINTS is enabled. ]
      Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
      LKML-Reference: <ae61c0f37156db8ec8dc0d5778018edde60a92e3.1237813499.git.eduard.munteanu@linux360.ro>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      ca2b84cb
    • kmemtrace, mm: fix slab.h dependency problem in mm/failslab.c · 255d11bc
      Pekka Enberg committed
      Impact: cleanup
      
      mm/failslab.c depends on slab.h without including it:
      
          CC      mm/failslab.o
        mm/failslab.c: In function ‘should_failslab’:
        mm/failslab.c:16: error: ‘__GFP_NOFAIL’ undeclared (first use in this function)
        mm/failslab.c:16: error: (Each undeclared identifier is reported only once
        mm/failslab.c:16: error: for each function it appears in.)
        mm/failslab.c:19: error: ‘__GFP_WAIT’ undeclared (first use in this function)
        make[1]: *** [mm/failslab.o] Error 1
        make: *** [mm] Error 2
      
      It gets included implicitly currently - but this will not be the
      case with upcoming kmemtrace changes.
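
      The fix is presumably just to make that dependency explicit at the top of
      mm/failslab.c, e.g.:

        #include <linux/slab.h>    /* pulls in gfp.h for __GFP_NOFAIL / __GFP_WAIT */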
      Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: Eduard - Gabriel Munteanu <eduard.munteanu@linux360.ro>
      LKML-Reference: <1237888761.25315.69.camel@penberg-laptop>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      255d11bc
    • memcg: cleanup cache_charge · 83aae4c7
      Daisuke Nishimura committed
      The current mem_cgroup_cache_charge() is a bit complicated, especially
      in the case of shmem's swap-in.
      
      This patch cleans it up by using try_charge_swapin and commit_charge_swapin.
      Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      83aae4c7
    • memcg: remove redundant message at swapon · 627991a2
      KAMEZAWA Hiroyuki committed
      It's pointed out that swap_cgroup's message at swapon() is nonsense.
      Because
      
        * It can be calculated very easily if all necessary information is
          written in Kconfig.
      
        * It's not necessary to annoy people at every swapon().

      In any case, memory usage per swp_entry is now reduced to 2 bytes from
      8 bytes (on 64-bit), and I think that is reasonably small.
      Reported-by: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      627991a2
    • cgroups: use css id in swap cgroup for saving memory v5 · a3b2d692
      KAMEZAWA Hiroyuki committed
      Try to use the CSS ID for records in swap_cgroup.  With this, on a 64-bit
      machine, the size of a swap_cgroup record goes down to 2 bytes from 8 bytes.

      This means that, when 2GB of swap is equipped (assuming a page size of 4096 bytes):
      
      	From size of swap_cgroup = 2G/4k * 8 = 4Mbytes.
      	To   size of swap_cgroup = 2G/4k * 2 = 1Mbytes.
      
      The reduction is large.  Of course, there are trade-offs: using the CSS ID
      adds overhead to swap-in/swap-out/swap-free.

      But in general,
        - swap is a resource which users tend to avoid using.
        - If swap is never used, the swap_cgroup area is not used.
        - Going by traditional manuals, the size of swap should be proportional
          to the size of memory, and machine memory sizes keep increasing.
      
      I think reducing size of swap_cgroup makes sense.
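
      The per-entry record presumably shrinks to little more than the ID itself
      (illustrative; field names may differ):

        /* one record per swp_entry: 2 bytes instead of a
         * struct mem_cgroup pointer (8 bytes on 64-bit) */
        struct swap_cgroup {
            unsigned short id;  /* CSS ID of the charged memcg, 0 if unused */
        };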
      
      Note:
        - ID->CSS lookup routine has no locks, it's under RCU-Read-Side.
        - memcg can be obsolete at rmdir() but not freed while refcnt from
          swap_cgroup is available.
      
      Changelog v4->v5:
       - reworked on to memcg-charge-swapcache-to-proper-memcg.patch
      Changelog ->v4:
       - fixed not configured case.
       - deleted unnecessary comments.
       - fixed NULL pointer bug.
       - fixed message in dmesg.
      
      [nishimura@mxp.nes.nec.co.jp: css_tryget can be called twice in !PageCgroupUsed case]
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Paul Menage <menage@google.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a3b2d692
    • memcg: charge swapcache to proper memcg · 3c776e64
      Daisuke Nishimura committed
      memcg_test.txt says at 4.1:
      
      	This swap-in is one of the most complicated work. In do_swap_page(),
      	following events occur when pte is unchanged.
      
      	(1) the page (SwapCache) is looked up.
      	(2) lock_page()
      	(3) try_charge_swapin()
      	(4) reuse_swap_page() (may call delete_swap_cache())
      	(5) commit_charge_swapin()
      	(6) swap_free().
      
      	Considering following situation for example.
      
      	(A) The page has not been charged before (2) and reuse_swap_page()
      	    doesn't call delete_from_swap_cache().
      	(B) The page has not been charged before (2) and reuse_swap_page()
      	    calls delete_from_swap_cache().
      	(C) The page has been charged before (2) and reuse_swap_page() doesn't
      	    call delete_from_swap_cache().
      	(D) The page has been charged before (2) and reuse_swap_page() calls
      	    delete_from_swap_cache().
      
      	    memory.usage/memsw.usage changes to this page/swp_entry will be
      	 Case          (A)      (B)       (C)     (D)
               Event
             Before (2)     0/ 1     0/ 1      1/ 1    1/ 1
                ===========================================
                (3)        +1/+1    +1/+1     +1/+1   +1/+1
                (4)          -       0/ 0       -     -1/ 0
                (5)         0/-1     0/ 0     -1/-1    0/ 0
                (6)          -       0/-1       -      0/-1
                ===========================================
             Result         1/ 1     1/ 1      1/ 1    1/ 1
      
             In any cases, charges to this page should be 1/ 1.
      
      In case of (D), mem_cgroup_try_get_from_swapcache() returns NULL
      (because lookup_swap_cgroup() returns NULL), so "+1/+1" at (3) means
      charges to the memcg("foo") to which the "current" belongs.
      OTOH, "-1/0" at (4) and "0/-1" at (6) mean uncharges from the memcg ("baa")
      to which the page had been charged.

      So, if "foo" and "baa" are different (for example because of a task move),
      this charge will be moved from "baa" to "foo".

      I think this is unexpected behavior.
      
      This patch fixes this by modifying mem_cgroup_try_get_from_swapcache()
      to return the memcg to which the swapcache has been charged if PCG_USED bit
      is set.
      IIUC, checking PCG_USED bit of swapcache is safe under page lock.
      Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3c776e64
    • memcg: remove mem_cgroup_calc_mapped_ratio() · c137b5ec
      KOSAKI Motohiro committed
      Currently, mem_cgroup_calc_mapped_ratio() is not used at all, so it can
      be removed; KAMEZAWA-san suggested this.
      Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c137b5ec
    • memcg: show memcg information during OOM · e222432b
      Balbir Singh committed
      Add RSS and swap to OOM output from memcg
      
      Display memcg values like failcnt, usage and limit when an OOM occurs due
      to memcg.
      
      Thanks to Johannes Weiner, Li Zefan, David Rientjes, Kamezawa Hiroyuki,
      Daisuke Nishimura and KOSAKI Motohiro for review.
      
      Sample output
      -------------
      
      Task in /a/x killed as a result of limit of /a
      memory: usage 1048576kB, limit 1048576kB, failcnt 4183
      memory+swap: usage 1400964kB, limit 9007199254740991kB, failcnt 0
      
      [akpm@linux-foundation.org: compilation fix]
      [akpm@linux-foundation.org: fix kerneldoc and whitespace]
      [akpm@linux-foundation.org: add printk facility level]
      Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Paul Menage <menage@google.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e222432b
    • memcg: fix OOM killer under memcg · 0b7f569e
      KAMEZAWA Hiroyuki committed
      This patch tries to fix OOM killer problems caused by hierarchy.
      Now, memcg itself has an OOM-kill function (in oom_kill.c) and tries to
      kill a task in the memcg.

      But when hierarchy is used, this is broken and the correct task cannot
      be killed. For example, in the following cgroup setup:
      
      	/groupA/	hierarchy=1, limit=1G,
      		01	nolimit
      		02	nolimit
      All tasks' memory usage under /groupA, /groupA/01 and /groupA/02 is limited
      to groupA's 1GB, but the OOM killer just kills tasks in groupA.

      This patch makes the bad process be selected from all tasks under the
      hierarchy. BTW, currently oom_jiffies is updated only against groupA in
      the above case; oom_jiffies of the whole tree should be updated.
      
      To see how oom_jiffies is used, please check mem_cgroup_oom_called()
      callers.
      
      [akpm@linux-foundation.org: build fix]
      [akpm@linux-foundation.org: const fix]
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Paul Menage <menage@google.com>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0b7f569e
    • memcg: fix shrinking memory to return -EBUSY by fixing retry algorithm · 81d39c20
      KAMEZAWA Hiroyuki committed
      As pointed out, shrinking memcg's limit should return -EBUSY after
      reasonable retries.  This patch tries to fix the current behavior of
      shrink_usage.
      
      Before looking into the "shrink should return -EBUSY" problem, we should
      fix the hierarchical reclaim code.  It compares current usage and current
      limit, but that only makes sense when the kernel reclaims memory because
      a limit has been hit.  This is also a problem.

      What this patch does is:
      
        1. add new argument "shrink" to hierarchical reclaim. If "shrink==true",
           hierarchical reclaim returns immediately and the caller checks the kernel
           should shrink more or not.
           (At shrinking memory, usage is always smaller than limit. So check for
            usage < limit is useless.)
      
        2. For adjusting to above change, 2 changes in "shrink"'s retry path.
           2-a. retry_count depends on # of children because the kernel visits
      	  the children under hierarchy one by one.
           2-b. rather than checking return value of hierarchical_reclaim's progress,
      	  compares usage-before-shrink and usage-after-shrink.
      	  If usage-before-shrink <= usage-after-shrink, retry_count is
      	  decremented.
      Reported-by: Li Zefan <lizf@cn.fujitsu.com>
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Paul Menage <menage@google.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      81d39c20
    • memcg: hierarchical stat · 14067bb3
      KAMEZAWA Hiroyuki committed
      Clean up memory.stat file routine and show "total" hierarchical stat.
      
      This patch does the following:
        - renames get_all_zonestat to get_local_zonestat.
        - removes the old mem_cgroup_stat_desc, which is only for per-cpu stats.
        - adds mcs_stat to cover both per-cpu and per-lru stats.
        - adds a "total" stat of the hierarchy (*)
        - adds a callback system to scan all memcgs under a root.
      == "total" is added.
      [kamezawa@localhost ~]$ cat /opt/cgroup/xxx/memory.stat
      cache 0
      rss 0
      pgpgin 0
      pgpgout 0
      inactive_anon 0
      active_anon 0
      inactive_file 0
      active_file 0
      unevictable 0
      hierarchical_memory_limit 50331648
      hierarchical_memsw_limit 9223372036854775807
      total_cache 65536
      total_rss 192512
      total_pgpgin 218
      total_pgpgout 155
      total_inactive_anon 0
      total_active_anon 135168
      total_inactive_file 61440
      total_active_file 4096
      total_unevictable 0
      ==
      (*) maybe the user could calculate the hierarchical stats with their own
         program in userland, but if it can be written in a clean way it is
         worth showing here, I think.
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Paul Menage <menage@google.com>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      14067bb3
    • memcg: use CSS ID · 04046e1a
      KAMEZAWA Hiroyuki committed
      Assign a CSS ID to each memcg and use css_get_next() for scanning the hierarchy.

      	Assume the following tree.
      
      	group_A (ID=3)
      		/01 (ID=4)
      		   /0A (ID=7)
      		/02 (ID=10)
      	group_B (ID=5)
      	and task in group_A/01/0A hits limit at group_A.
      
      	reclaim will be done in following order (round-robin).
      	group_A(3) -> group_A/01 (4) -> group_A/01/0A (7) -> group_A/02(10)
      	-> group_A -> .....
      
      	Round-robin by ID. The last visited cgroup is recorded and reclaim
      	restarts from it the next time reclaim runs.
      	(A smarter algorithm could be implemented.)
      
      	No cgroup_mutex or hierarchy_mutex is required.
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Paul Menage <menage@google.com>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      04046e1a
    • cgroup: fix frequent -EBUSY at rmdir · ec64f515
      KAMEZAWA Hiroyuki committed
      In the following situation, with the memory subsystem,
      
      	/groupA use_hierarchy==1
      		/01 some tasks
      		/02 some tasks
      		/03 some tasks
      		/04 empty
      
      When tasks under 01/02/03 hit the limit on /groupA, hierarchical reclaim
      is triggered and the kernel walks the tree under groupA. In this case,
      rmdir /groupA/04 frequently fails with -EBUSY because of a temporary
      refcount taken by the kernel.

      In general, a cgroup can be rmdir'd if it has no child groups and
      no tasks. Frequent failures of rmdir() are not useful to users
      (and in most cases the reason for the -EBUSY is unknown to them).

      This patch tries to modify the above behavior by
      	- retrying if the css refcount is held by someone.
      	- adding a "return value" to pre_destroy(), allowing a subsystem to
      	  say "we're really busy!"
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Paul Menage <menage@google.com>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ec64f515
    • workqueue: add to_delayed_work() helper function · bf6aede7
      Jean Delvare committed
      It is a fairly common operation to have a pointer to a work and to need a
      pointer to the delayed work it is contained in.  In particular, all
      delayed works which want to rearm themselves will have to do that.  So it
      would seem fair to offer a helper function for this operation.
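
      The helper is presumably a thin container_of() wrapper; a sketch of the
      definition and of the rearming pattern it serves (my_handler is an
      illustrative name):

        static inline struct delayed_work *to_delayed_work(struct work_struct *work)
        {
            return container_of(work, struct delayed_work, work);
        }

        /* a rearming handler receives a work_struct pointer but needs the
         * enclosing delayed_work in order to requeue itself */
        static void my_handler(struct work_struct *work)
        {
            struct delayed_work *dwork = to_delayed_work(work);

            /* ... do the periodic job ... */
            schedule_delayed_work(dwork, HZ);
        }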
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: Jean Delvare <khali@linux-fr.org>
      Acked-by: Ingo Molnar <mingo@elte.hu>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Herbert Xu <herbert@gondor.apana.org.au>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Greg KH <greg@kroah.com>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bf6aede7
    • mm: do_xip_mapping_read: fix length calculation · 58984ce2
      Martin Schwidefsky committed
      The calculation of the value nr in do_xip_mapping_read is incorrect.  If
      the copy required more than one iteration in the do-while loop, the copied
      variable will be non-zero.  The maximum length that may be passed to the
      call to copy_to_user(buf+copied, xip_mem+offset, nr) is len-copied, but the
      check only compares against (nr > len).
      
      This bug is the cause for the heap corruption Carsten has been chasing
      for so long:
      
      *** glibc detected *** /bin/bash: free(): invalid next size (normal): 0x00000000800e39f0 ***
      ======= Backtrace: =========
      /lib64/libc.so.6[0x200000b9b44]
      /lib64/libc.so.6(cfree+0x8e)[0x200000bdade]
      /bin/bash(free_buffered_stream+0x32)[0x80050e4e]
      /bin/bash(close_buffered_stream+0x1c)[0x80050ea4]
      /bin/bash(unset_bash_input+0x2a)[0x8001c366]
      /bin/bash(make_child+0x1d4)[0x8004115c]
      /bin/bash[0x8002fc3c]
      /bin/bash(execute_command_internal+0x656)[0x8003048e]
      /bin/bash(execute_command+0x5e)[0x80031e1e]
      /bin/bash(execute_command_internal+0x79a)[0x800305d2]
      /bin/bash(execute_command+0x5e)[0x80031e1e]
      /bin/bash(reader_loop+0x270)[0x8001efe0]
      /bin/bash(main+0x1328)[0x8001e960]
      /lib64/libc.so.6(__libc_start_main+0x100)[0x200000592a8]
      /bin/bash(clearerr+0x5e)[0x8001c092]
      
      With this bug fix the commit 0e4a9b59
      "ext2/xip: refuse to change xip flag during remount with busy inodes" can
      be removed again.
      
      Cc: Carsten Otte <cotte@de.ibm.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: Jared Hulbert <jaredeh@gmail.com>
      Cc: <stable@kernel.org>
      Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      58984ce2
    • mm: align vmstat_work's timer · 98f4ebb2
      Anton Blanchard committed
      Even though vmstat_work is marked deferrable, there are still benefits to
      aligning it.  For certain applications we want to keep OS jitter as low as
      possible and aligning timers and work so they occur together can reduce
      their overall impact.
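
      The alignment is typically done by rounding the initial expiry, roughly
      (a sketch of the idea rather than the exact patch):

        static void __cpuinit start_cpu_timer(int cpu)
        {
            struct delayed_work *work = &per_cpu(vmstat_work, cpu);

            INIT_DELAYED_WORK_DEFERRABLE(work, vmstat_update);
            /* round the expiry so the per-cpu timers tend to fire together */
            schedule_delayed_work_on(cpu, work,
                                     __round_jiffies_relative(HZ, cpu));
        }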
      Signed-off-by: Anton Blanchard <anton@samba.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      98f4ebb2
    • nommu: fix a number of issues with the per-MM VMA patch · 33e5d769
      David Howells committed
      Fix a number of issues with the per-MM VMA patch:
      
       (1) Make mmap_pages_allocated an atomic_long_t, just in case this is used on
           a NOMMU system with more than 2G pages.  Makes no difference on a 32-bit
           system.
      
       (2) Report vma->vm_pgoff * PAGE_SIZE as a 64-bit value, not a 32-bit value,
           lest it overflow.
      
       (3) Move the allocation of the vm_area_struct slab back for fork.c.
      
       (4) Use KMEM_CACHE() for both vm_area_struct and vm_region slabs.
      
       (5) Use BUG_ON() rather than if () BUG().
      
       (6) Make the default validate_nommu_regions() a static inline rather than a
           #define.
      
       (7) Make free_page_series()'s objection to pages with a refcount != 1 more
           informative.
      
       (8) Adjust the __put_nommu_region() banner comment to indicate that the
           semaphore must be held for writing.
      
       (9) Limit the number of warnings about munmaps of non-mmapped regions.
      Reported-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: David Howells <dhowells@redhat.com>
      Cc: Greg Ungerer <gerg@snapgear.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      33e5d769
    • generic debug pagealloc: build fix · ee3b4290
      Akinobu Mita committed
      This fixes a build failure with generic debug pagealloc:
      
        mm/debug-pagealloc.c: In function 'set_page_poison':
        mm/debug-pagealloc.c:8: error: 'struct page' has no member named 'debug_flags'
        mm/debug-pagealloc.c: In function 'clear_page_poison':
        mm/debug-pagealloc.c:13: error: 'struct page' has no member named 'debug_flags'
        mm/debug-pagealloc.c: In function 'page_poison':
        mm/debug-pagealloc.c:18: error: 'struct page' has no member named 'debug_flags'
        mm/debug-pagealloc.c: At top level:
        mm/debug-pagealloc.c:120: error: redefinition of 'kernel_map_pages'
        include/linux/mm.h:1278: error: previous definition of 'kernel_map_pages' was here
        mm/debug-pagealloc.c: In function 'kernel_map_pages':
        mm/debug-pagealloc.c:122: error: 'debug_pagealloc_enabled' undeclared (first use in this function)
      
      by fixing
      
       - debug_flags should be in struct page
       - define DEBUG_PAGEALLOC config option for all architectures
      Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
      Reported-by: Alexander Beregalov <a.beregalov@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ee3b4290
  6. 01 April 2009 (7 commits)
    • shmem: writepage directly to swap · 9fab5619
      Hugh Dickins committed
      Synopsis: if shmem_writepage calls swap_writepage directly, most shmem
      swap loads benefit, and a catastrophic interaction between SLUB and some
      flash storage is avoided.
      
      shmem_writepage() has always been peculiar in making no attempt to write:
      it has just transferred a shmem page from file cache to swap cache, then
      let that page make its way around the LRU again before being written and
      freed.
      
      The idea was that people use tmpfs because they want those pages to stay
      in RAM; so although we give it an overflow to swap, we should resist
      writing too soon, giving those pages a second chance before they can be
      reclaimed.
      
      That was always questionable, and I've toyed with this patch for years;
      but never had a clear justification to depart from the original design.
      
      It became more questionable in 2.6.28, when the split LRU patches classed
      shmem and tmpfs pages as SwapBacked rather than as file_cache: that in
      itself gives them more resistance to reclaim than normal file pages.  I
      prepared this patch for 2.6.29, but the merge window arrived before I'd
      completed gathering statistics to justify sending it in.
      
      Then while comparing SLQB against SLUB, running SLUB on a laptop I'd
      habitually used with SLAB, I found SLUB to run my tmpfs kbuild swapping
      tests five times slower than SLAB or SLQB - other machines slower too, but
      nowhere near so bad.  Simpler "cp -a" swapping tests showed the same.
      
      slub_max_order=0 brings sanity to all, but heavy swapping is too far from
      normal to justify such a tuning.  The crucial factor on that laptop turns
      out to be that I'm using an SD card for swap.  What happens is this:
      
      By default, SLUB uses order-2 pages for shmem_inode_cache (and many other
      fs inodes), so creating tmpfs files under memory pressure brings lumpy
      reclaim into play.  One subpage of the order is chosen from the bottom of
      the LRU as usual, then the other three picked out from their random
      positions on the LRUs.
      
      In a tmpfs load, many of these pages will be ones which already passed
      through shmem_writepage, so already have swap allocated.  And though their
      offsets on swap were probably allocated sequentially, now that the pages
      are picked off at random, their swap offsets are scattered.
      
      But the flash storage on the SD card is very sensitive to having its
      writes merged: once swap is written at scattered offsets, performance
      falls apart.  Rotating disk seeks increase too, but less disastrously.
      
      So: stop giving shmem/tmpfs pages a second pass around the LRU, write them
      out to swap as soon as their swap has been allocated.
      
      It's surely possible to devise an artificial load which runs faster the
      old way, one whose sizing is such that the tmpfs pages on their second
      pass are the ones that are wanted again, and other pages not.
      
      But I've not yet found such a load: on all machines, under the loads I've
      tried, immediate swap_writepage speeds up shmem swapping: especially when
      using the SLUB allocator (and more effectively than slub_max_order=0), but
      also with the others; and it also reduces the variance between runs.  How
      much faster varies widely: a factor of five is rare, 5% is common.
      
      One load which might have suffered: imagine a swapping shmem load in a
      limited mem_cgroup on a machine with plenty of memory.  Before 2.6.29 the
      swapcache was not charged, and such a load would have run quickest with
      the shmem swapcache never written to swap.  But now swapcache is charged,
      so even this load benefits from shmem_writepage directly to swap.
      
      Apologies for the #ifndef CONFIG_SWAP swap_writepage() stub in swap.h:
      it's silly because that will never get called; but refactoring shmem.c
      sensibly according to CONFIG_SWAP will be a separate task.
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
      Acked-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9fab5619
    • vmscan: fix it to take care of nodemask · 327c0e96
      KAMEZAWA Hiroyuki committed
      try_to_free_pages() is used for the direct reclaim of up to
      SWAP_CLUSTER_MAX pages when watermarks are low.  The caller to
      alloc_pages_nodemask() can specify a nodemask of nodes that are allowed to
      be used but this is not passed to try_to_free_pages().  This can lead to
      unnecessary reclaim of pages that are unusable by the caller and in the
      worst case lead to allocation failure, as progress is not being made where
      it is needed.
      
      This patch passes the nodemask used for alloc_pages_nodemask() to
      try_to_free_pages().
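
      The direct-reclaim entry point presumably grows a nodemask parameter,
      along these lines (sketch of the resulting prototype):

        /* before */
        unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
                                        gfp_t gfp_mask);

        /* after: the allocator's nodemask is threaded through to reclaim */
        unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
                                        gfp_t gfp_mask, nodemask_t *nodemask);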
      Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Acked-by: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      327c0e96
    • vmscan: print shrink_slab symbol name on negative shrinker objects · 88c3bd70
      David Rientjes committed
      When a shrinker has a negative number of objects to delete, the symbol
      name of the shrinker should be printed, not shrink_slab.  This also makes
      the error message slightly more informative.
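
      Presumably the message now prints the offending callback through the %pF
      format specifier, roughly (field names follow struct shrinker; the exact
      wording is illustrative):

        if (shrinker->nr < 0) {
            /* %pF resolves the function pointer to a symbol name */
            printk(KERN_ERR "shrink_slab: %pF negative objects to delete "
                   "nr=%ld\n", shrinker->shrink, shrinker->nr);
            shrinker->nr = max_pass;    /* clamp and carry on */
        }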
      
      Cc: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      88c3bd70
    • nommu: make CONFIG_UNEVICTABLE_LRU available when CONFIG_MMU=n · 71aa653c
      David Howells committed
      Make CONFIG_UNEVICTABLE_LRU available when CONFIG_MMU=n.  There's no logical
      reason it shouldn't be available, and it can be used for ramfs.
      Signed-off-by: David Howells <dhowells@redhat.com>
      Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Greg Ungerer <gerg@snapgear.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Enrik Berkhan <Enrik.Berkhan@ge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      71aa653c
    • nommu: there is no mlock() for NOMMU, so don't provide the bits · 33925b25
      David Howells committed
      The mlock() facility does not exist for NOMMU since all mappings are
      effectively locked anyway, so we don't make the bits available when
      they're not useful.
      Signed-off-by: David Howells <dhowells@redhat.com>
      Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Greg Ungerer <gerg@snapgear.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Enrik Berkhan <Enrik.Berkhan@ge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      33925b25
    • mm: introduce debug_kmap_atomic · f4112de6
      Akinobu Mita committed
      x86 has debug_kmap_atomic_prot(), which is an error-checking function for
      kmap_atomic.  It is useful for the other architectures, although it needs
      CONFIG_TRACE_IRQFLAGS_SUPPORT.
      
      This patch exposes it to the other architectures.
      Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: <linux-arch@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f4112de6
    • mm: page_mkwrite change prototype to match fault · c2ec175c
      Nick Piggin committed
      Change the page_mkwrite prototype to take a struct vm_fault, and return
      VM_FAULT_xxx flags.  There should be no functional change.
      
      This makes it possible to return much more detailed error information to
      the VM (and also can provide more information eg.  virtual_address to the
      driver, which might be important in some special cases).
      
      This is required for a subsequent fix.  And will also make it easier to
      merge page_mkwrite() with fault() in future.
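
      The member of vm_operations_struct presumably changes along these lines
      (sketch of the before/after prototypes):

        /* before: receives the bare page, returns 0 or -errno */
        int (*page_mkwrite)(struct vm_area_struct *vma, struct page *page);

        /* after: receives a struct vm_fault (page, virtual_address, flags, ...)
         * and returns VM_FAULT_xxx bits, just like ->fault() */
        int (*page_mkwrite)(struct vm_area_struct *vma, struct vm_fault *vmf);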
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Cc: Chris Mason <chris.mason@oracle.com>
      Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
      Cc: Miklos Szeredi <miklos@szeredi.hu>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Joel Becker <joel.becker@oracle.com>
      Cc: Artem Bityutskiy <dedekind@infradead.org>
      Cc: Felix Blyakher <felixb@sgi.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c2ec175c