1. 27 7月, 2008 7 次提交
    • N
      mm: spinlock tree_lock · 19fd6231
      Nick Piggin 提交于
      mapping->tree_lock has no read lockers.  convert the lock from an rwlock
      to a spinlock.
      Signed-off-by: NNick Piggin <npiggin@suse.de>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
      Reviewed-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      19fd6231
    • N
      mm: lockless pagecache · a60637c8
      Nick Piggin 提交于
      Combine page_cache_get_speculative with lockless radix tree lookups to
      introduce lockless page cache lookups (ie.  no mapping->tree_lock on the
      read-side).
      
      The only atomicity changes this introduces is that the gang pagecache
      lookup functions now behave as if they are implemented with multiple
      find_get_page calls, rather than operating on a snapshot of the pages.  In
      practice, this atomicity guarantee is not used anyway, and it is to
      replace individual lookups, so these semantics are natural.
      Signed-off-by: NNick Piggin <npiggin@suse.de>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
      Reviewed-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a60637c8
    • N
      mm: speculative page references · e286781d
      Nick Piggin 提交于
      If we can be sure that elevating the page_count on a pagecache page will
      pin it, we can speculatively run this operation, and subsequently check to
      see if we hit the right page rather than relying on holding a lock or
      otherwise pinning a reference to the page.
      
      This can be done if get_page/put_page behaves consistently throughout the
      whole tree (ie.  if we "get" the page after it has been used for something
      else, we must be able to free it with a put_page).
      
      Actually, there is a period where the count behaves differently: when the
      page is free or if it is a constituent page of a compound page.  We need
      an atomic_inc_not_zero operation to ensure we don't try to grab the page
      in either case.
      
      This patch introduces the core locking protocol to the pagecache (ie.
      adds page_cache_get_speculative, and tweaks some update-side code to make
      it work).
      
      Thanks to Hugh for pointing out an improvement to the algorithm setting
      page_count to zero when we have control of all references, in order to
      hold off speculative getters.
      
      [kamezawa.hiroyu@jp.fujitsu.com: fix migration_entry_wait()]
      [hugh@veritas.com: fix add_to_page_cache]
      [akpm@linux-foundation.org: repair a comment]
      Signed-off-by: NNick Piggin <npiggin@suse.de>
      Cc: Jeff Garzik <jeff@garzik.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
      Reviewed-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: NDaisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: NHugh Dickins <hugh@veritas.com>
      Acked-by: NNick Piggin <npiggin@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e286781d
    • N
      mm: readahead scan lockless · 30002ed2
      Nick Piggin 提交于
      radix_tree_next_hole() is implemented as a series of radix_tree_lookup()s.
      So it can be called locklessly, under rcu_read_lock().
      Signed-off-by: NNick Piggin <npiggin@suse.de>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
      Reviewed-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      30002ed2
    • N
      x86: lockless get_user_pages_fast() · 8174c430
      Nick Piggin 提交于
      Implement get_user_pages_fast without locking in the fastpath on x86.
      
      Do an optimistic lockless pagetable walk, without taking mmap_sem or any
      page table locks or even mmap_sem.  Page table existence is guaranteed by
      turning interrupts off (combined with the fact that we're always looking
      up the current mm, means we can do the lockless page table walk within the
      constraints of the TLB shootdown design).  Basically we can do this
      lockless pagetable walk in a similar manner to the way the CPU's pagetable
      walker does not have to take any locks to find present ptes.
      
      This patch (combined with the subsequent ones to convert direct IO to use
      it) was found to give about 10% performance improvement on a 2 socket 8
      core Intel Xeon system running an OLTP workload on DB2 v9.5
      
       "To test the effects of the patch, an OLTP workload was run on an IBM
        x3850 M2 server with 2 processors (quad-core Intel Xeon processors at
        2.93 GHz) using IBM DB2 v9.5 running Linux 2.6.24rc7 kernel.  Comparing
        runs with and without the patch resulted in an overall performance
        benefit of ~9.8%.  Correspondingly, oprofiles showed that samples from
        __up_read and __down_read routines that is seen during thread contention
        for system resources was reduced from 2.8% down to .05%.  Monitoring the
        /proc/vmstat output from the patched run showed that the counter for
        fast_gup contained a very high number while the fast_gup_slow value was
        zero."
      
      (fast_gup is the old name for get_user_pages_fast, fast_gup_slow is a
      counter we had for the number of times the slowpath was invoked).
      
      The main reason for the improvement is that DB2 has multiple threads each
      issuing direct-IO.  Direct-IO uses get_user_pages, and thus the threads
      contend the mmap_sem cacheline, and can also contend on page table locks.
      
      I would anticipate larger performance gains on larger systems, however I
      think DB2 uses an adaptive mix of threads and processes, so it could be
      that thread contention remains pretty constant as machine size increases.
      In which case, we stuck with "only" a 10% gain.
      
      The downside of using get_user_pages_fast is that if there is not a pte
      with the correct permissions for the access, we end up falling back to
      get_user_pages and so the get_user_pages_fast is a bit of extra work.
      However this should not be the common case in most performance critical
      code.
      
      [akpm@linux-foundation.org: coding-style fixes]
      [akpm@linux-foundation.org: build fix]
      [akpm@linux-foundation.org: Kconfig fix]
      [akpm@linux-foundation.org: Makefile fix/cleanup]
      [akpm@linux-foundation.org: warning fix]
      Signed-off-by: NNick Piggin <npiggin@suse.de>
      Cc: Dave Kleikamp <shaggy@austin.ibm.com>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Dave Kleikamp <shaggy@austin.ibm.com>
      Cc: Badari Pulavarty <pbadari@us.ibm.com>
      Cc: Zach Brown <zach.brown@oracle.com>
      Cc: Jens Axboe <jens.axboe@oracle.com>
      Reviewed-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8174c430
    • N
      hugetlb: fix CONFIG_SYSCTL=n build · 8a213460
      Nishanth Aravamudan 提交于
      Fixes a build failure reported by Alan Cox:
      
      mm/hugetlb.c: In function `hugetlb_acct_memory': mm/hugetlb.c:1507:
      error: implicit declaration of function `cpuset_mems_nr'
      
      Also reverts Ingo's
      
          commit e44d1b29
          Author: Ingo Molnar <mingo@elte.hu>
          Date:   Fri Jul 25 12:57:41 2008 +0200
      
              mm/hugetlb.c: fix build failure with !CONFIG_SYSCTL
      
      which fixed the build error but added some unused-static-function warnings.
      Signed-off-by: NNishanth Aravamudan <nacc@us.ibm.com>
      Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
      Cc: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8a213460
    • A
      uninline arch_pick_mmap_layout() · 16d69265
      Andrew Morton 提交于
      Fix this, on avr32:
      
        include/linux/utsname.h:35,
                         from init/main.c:20:
        include/linux/sched.h: In function 'arch_pick_mmap_layout':
        include/linux/sched.h:2149: error: implicit declaration of function 'PAGE_ALIGN'
      Reported-by: NAdrian Bunk <bunk@kernel.org>
      Cc: Haavard Skinnemoen <hskinnemoen@atmel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      16d69265
  2. 26 7月, 2008 14 次提交
    • I
      mm/hugetlb.c: fix build failure with !CONFIG_SYSCTL · e44d1b29
      Ingo Molnar 提交于
      on !CONFIG_SYSCTL on x86 with latest -git i get:
      
           mm/hugetlb.c: In function 'decrement_hugepage_resv_vma':
           mm/hugetlb.c:83: error: 'reserve' undeclared (first use in this function)
           mm/hugetlb.c:83: error: (Each undeclared identifier is reported only once
           mm/hugetlb.c:83: error: for each function it appears in.)
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e44d1b29
    • K
      per-task-delay-accounting: add memory reclaim delay · 873b4771
      Keika Kobayashi 提交于
      Sometimes, application responses become bad under heavy memory load.
      Applications take a bit time to reclaim memory.  The statistics, how long
      memory reclaim takes, will be useful to measure memory usage.
      
      This patch adds accounting memory reclaim to per-task-delay-accounting for
      accounting the time of do_try_to_free_pages().
      
      <i.e>
      
      - When System is under low memory load,
        memory reclaim may not occur.
      
      $ free
                   total       used       free     shared    buffers     cached
      Mem:       8197800    1577300    6620500          0       4808    1516724
      -/+ buffers/cache:      55768    8142032
      Swap:     16386292          0   16386292
      
      $ vmstat 1
      procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
       r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
       0  0      0 5069748  10612 3014060    0    0     0     0    3   26  0  0 100  0
       0  0      0 5069748  10612 3014060    0    0     0     0    4   22  0  0 100  0
       0  0      0 5069748  10612 3014060    0    0     0     0    3   18  0  0 100  0
      
      Measure the time of tar command.
      
      $ ls -s test.dat
      1501472 test.dat
      
      $ time tar cvf test.tar test.dat
      real    0m13.388s
      user    0m0.116s
      sys     0m5.304s
      
      $ ./delayget -d -p <pid>
      CPU             count     real total  virtual total    delay total
                        428     5528345500     5477116080       62749891
      IO              count    delay total
                        338     8078977189
      SWAP            count    delay total
                          0              0
      RECLAIM         count    delay total
                          0              0
      
      - When system is under heavy memory load
        memory reclaim may occur.
      
      $ vmstat 1
      procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
       r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
       0  0 7159032  49724   1812   3012    0    0     0     0    3   24  0  0 100  0
       0  0 7159032  49724   1812   3012    0    0     0     0    4   24  0  0 100  0
       0  0 7159032  49848   1812   3012    0    0     0     0    3   22  0  0 100  0
      
      In this case, one process uses more 8G memory
      by execution of malloc() and memset().
      
      $ time tar cvf test.tar test.dat
      real    1m38.563s        <-  increased by 85 sec
      user    0m0.140s
      sys     0m7.060s
      
      $ ./delayget -d -p <pid>
      CPU             count     real total  virtual total    delay total
                       9021     7140446250     7315277975      923201824
      IO              count    delay total
                       8965    90466349669
      SWAP            count    delay total
                          3       21036367
      RECLAIM         count    delay total
                        740    61011951153
      
      In the later case, the value of RECLAIM is increasing.
      So, taskstats can show how much memory reclaim influences TAT.
      Signed-off-by: NKeika Kobayashi <kobayashi.kk@ncos.nec.co.jp>
      Acked-by: NBalbir Singh <balbir@linux.vnet.ibm.com>
      Acked-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujistu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      873b4771
    • K
      memcg: limit change shrink usage · 628f4235
      KAMEZAWA Hiroyuki 提交于
      Shrinking memory usage at limit change.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Acked-by: NBalbir Singh <balbir@linux.vnet.ibm.com>
      Acked-by: NPavel Emelyanov <xemul@openvz.org>
      Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Paul Menage <menage@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      628f4235
    • L
      memcg: clean up checking of the disabled flag · cede86ac
      Li Zefan 提交于
      Those checks are unnecessary, because when the subsystem is disabled
      it can't be mounted, so those functions won't get called.
      
      The check is needed in functions which will be called in other places
      except cgroup.
      
      [hugh@veritas.com: further checking of disabled flag]
      Signed-off-by: NLi Zefan <lizf@cn.fujitsu.com>
      Acked-by: NBalbir Singh <balbir@linux.vnet.ibm.com>
      Acked-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: NHugh Dickins <hugh@veritas.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      cede86ac
    • K
      memcg: remove a redundant check · accf163e
      KAMEZAWA Hiroyuki 提交于
      Because of remove refcnt patch, it's very rare case to that
      mem_cgroup_charge_common() is called against a page which is accounted.
      
      mem_cgroup_charge_common() is called when.
       1. a page is added into file cache.
       2. an anon page is _newly_ mapped.
      
      A racy case is that a newly-swapped-in anonymous page is referred from
      prural threads in do_swap_page() at the same time.
      (a page is not Locked when mem_cgroup_charge() is called from do_swap_page.)
      
      Another case is shmem. It charges its page before calling add_to_page_cache().
      Then, mem_cgroup_charge_cache() is called twice. This case is handled in
      mem_cgroup_cache_charge(). But this check may be too hacky...
      
      Signed-off-by : KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp>
      Cc: Paul Menage <menage@google.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      accf163e
    • K
      memcg: add hints for branch · b76734e5
      KAMEZAWA Hiroyuki 提交于
      Showing brach direction for obvious conditions.
      Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp>
      Cc: Paul Menage <menage@google.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b76734e5
    • K
      memcg: helper function for relcaim from shmem. · c9b0ed51
      KAMEZAWA Hiroyuki 提交于
      A new call, mem_cgroup_shrink_usage() is added for shmem handling and
      relacing non-standard usage of mem_cgroup_charge/uncharge.
      
      Now, shmem calls mem_cgroup_charge() just for reclaim some pages from
      mem_cgroup.  In general, shmem is used by some process group and not for
      global resource (like file caches).  So, it's reasonable to reclaim pages
      from mem_cgroup where shmem is mainly used.
      
      [hugh@veritas.com: shmem_getpage release page sooner]
      [hugh@veritas.com: mem_cgroup_shrink_usage css_put]
      Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp>
      Cc: Paul Menage <menage@google.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: NHugh Dickins <hugh@veritas.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c9b0ed51
    • K
      memcg: remove refcnt from page_cgroup · 69029cd5
      KAMEZAWA Hiroyuki 提交于
      memcg: performance improvements
      
      Patch Description
       1/5 ... remove refcnt fron page_cgroup patch (shmem handling is fixed)
       2/5 ... swapcache handling patch
       3/5 ... add helper function for shmem's memory reclaim patch
       4/5 ... optimize by likely/unlikely ppatch
       5/5 ... remove redundunt check patch (shmem handling is fixed.)
      
      Unix bench result.
      
      == 2.6.26-rc2-mm1 + memory resource controller
      Execl Throughput                           2915.4 lps   (29.6 secs, 3 samples)
      C Compiler Throughput                      1019.3 lpm   (60.0 secs, 3 samples)
      Shell Scripts (1 concurrent)               5796.0 lpm   (60.0 secs, 3 samples)
      Shell Scripts (8 concurrent)               1097.7 lpm   (60.0 secs, 3 samples)
      Shell Scripts (16 concurrent)               565.3 lpm   (60.0 secs, 3 samples)
      File Read 1024 bufsize 2000 maxblocks    1022128.0 KBps  (30.0 secs, 3 samples)
      File Write 1024 bufsize 2000 maxblocks   544057.0 KBps  (30.0 secs, 3 samples)
      File Copy 1024 bufsize 2000 maxblocks    346481.0 KBps  (30.0 secs, 3 samples)
      File Read 256 bufsize 500 maxblocks      319325.0 KBps  (30.0 secs, 3 samples)
      File Write 256 bufsize 500 maxblocks     148788.0 KBps  (30.0 secs, 3 samples)
      File Copy 256 bufsize 500 maxblocks       99051.0 KBps  (30.0 secs, 3 samples)
      File Read 4096 bufsize 8000 maxblocks    2058917.0 KBps  (30.0 secs, 3 samples)
      File Write 4096 bufsize 8000 maxblocks   1606109.0 KBps  (30.0 secs, 3 samples)
      File Copy 4096 bufsize 8000 maxblocks    854789.0 KBps  (30.0 secs, 3 samples)
      Dc: sqrt(2) to 99 decimal places         126145.2 lpm   (30.0 secs, 3 samples)
      
                           INDEX VALUES
      TEST                                        BASELINE     RESULT      INDEX
      
      Execl Throughput                                43.0     2915.4      678.0
      File Copy 1024 bufsize 2000 maxblocks         3960.0   346481.0      875.0
      File Copy 256 bufsize 500 maxblocks           1655.0    99051.0      598.5
      File Copy 4096 bufsize 8000 maxblocks         5800.0   854789.0     1473.8
      Shell Scripts (8 concurrent)                     6.0     1097.7     1829.5
                                                                       =========
           FINAL SCORE                                                     991.3
      
      == 2.6.26-rc2-mm1 + this set ==
      Execl Throughput                           3012.9 lps   (29.9 secs, 3 samples)
      C Compiler Throughput                       981.0 lpm   (60.0 secs, 3 samples)
      Shell Scripts (1 concurrent)               5872.0 lpm   (60.0 secs, 3 samples)
      Shell Scripts (8 concurrent)               1120.3 lpm   (60.0 secs, 3 samples)
      Shell Scripts (16 concurrent)               578.0 lpm   (60.0 secs, 3 samples)
      File Read 1024 bufsize 2000 maxblocks    1003993.0 KBps  (30.0 secs, 3 samples)
      File Write 1024 bufsize 2000 maxblocks   550452.0 KBps  (30.0 secs, 3 samples)
      File Copy 1024 bufsize 2000 maxblocks    347159.0 KBps  (30.0 secs, 3 samples)
      File Read 256 bufsize 500 maxblocks      314644.0 KBps  (30.0 secs, 3 samples)
      File Write 256 bufsize 500 maxblocks     151852.0 KBps  (30.0 secs, 3 samples)
      File Copy 256 bufsize 500 maxblocks      101000.0 KBps  (30.0 secs, 3 samples)
      File Read 4096 bufsize 8000 maxblocks    2033256.0 KBps  (30.0 secs, 3 samples)
      File Write 4096 bufsize 8000 maxblocks   1611814.0 KBps  (30.0 secs, 3 samples)
      File Copy 4096 bufsize 8000 maxblocks    847979.0 KBps  (30.0 secs, 3 samples)
      Dc: sqrt(2) to 99 decimal places         128148.7 lpm   (30.0 secs, 3 samples)
      
                           INDEX VALUES
      TEST                                        BASELINE     RESULT      INDEX
      
      Execl Throughput                                43.0     3012.9      700.7
      File Copy 1024 bufsize 2000 maxblocks         3960.0   347159.0      876.7
      File Copy 256 bufsize 500 maxblocks           1655.0   101000.0      610.3
      File Copy 4096 bufsize 8000 maxblocks         5800.0   847979.0     1462.0
      Shell Scripts (8 concurrent)                     6.0     1120.3     1867.2
                                                                       =========
           FINAL SCORE                                                    1004.6
      
      This patch:
      
      Remove refcnt from page_cgroup().
      
      After this,
      
       * A page is charged only when !page_mapped() && no page_cgroup is assigned.
      	* Anon page is newly mapped.
      	* File page is added to mapping->tree.
      
       * A page is uncharged only when
      	* Anon page is fully unmapped.
      	* File page is removed from LRU.
      
      There is no change in behavior from user's view.
      
      This patch also removes unnecessary calls in rmap.c which was used only for
      refcnt mangement.
      
      [akpm@linux-foundation.org: fix warning]
      [hugh@veritas.com: fix shmem_unuse_inode charging]
      Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp>
      Cc: Paul Menage <menage@google.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: NHugh Dickins <hugh@veritas.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      69029cd5
    • K
      memcg: better migration handling · e8589cc1
      KAMEZAWA Hiroyuki 提交于
      This patch changes page migration under memory controller to use a
      different algorithm.  (thanks to Christoph for new idea.)
      
      Before:
       - page_cgroup is migrated from an old page to a new page.
      After:
       - a new page is accounted , no reuse of page_cgroup.
      
      Pros:
      
       - We can avoid compliated lock depndencies and races in migration.
      
      Cons:
      
       - new param to mem_cgroup_charge_common().
      
       - mem_cgroup_getref() is added for handling ref_cnt ping-pong.
      
      This version simplifies complicated lock dependency in page migraiton
      under memory resource controller.
      
        new refcnt sequence is following.
      
      a mapped page:
        prepage_migration() ..... +1 to NEW page
        try_to_unmap()      ..... all refs to OLD page is gone.
        move_pages()        ..... +1 to NEW page if page cache.
        remap...            ..... all refs from *map* is added to NEW one.
        end_migration()     ..... -1 to New page.
      
        page's mapcount + (page_is_cache) refs are added to NEW one.
      Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e8589cc1
    • K
      memcg: avoid unnecessary initialization · 508b7be0
      KAMEZAWA Hiroyuki 提交于
      * remove over-killing initialization (in fast path)
      * makeing the condition for PAGE_CGROUP_FLAG_ACTIVE be more obvious.
      Signed-off-by: NKAMEAZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reviewed-by: NLi Zefan <lizf@cn.fujitsu.com>
      Acked-by: NBalbir Singh <balbir@linux.vnet.ibm.com>
      Acked-by: NPavel Emelyanov <xemul@openvz.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      508b7be0
    • K
      memcg: make global var read_mostly · a181b0e8
      KAMEZAWA Hiroyuki 提交于
      mem_cgroup_subsys and page_cgroup_cache should be read_mostly and
      MEM_CGROUP_RECLAIM_RETRIES can be just a fixed number.
      Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: NBalbir Singh <balbir@linux.vnet.ibm.com>
      Acked-by: NPavel Emelyanov <xemul@openvz.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a181b0e8
    • P
      cgroup files: convert res_counter_write() to be a cgroups write_string() handler · 856c13aa
      Paul Menage 提交于
      Currently res_counter_write() is a raw file handler even though it's
      ultimately taking a number, since in some cases it wants to
      pre-process the string when converting it to a number.
      
      This patch converts res_counter_write() from a raw file handler to a
      write_string() handler; this allows some of the boilerplate
      copying/locking/checking to be removed, and simplies the cleanup path,
      since these functions are now performed by the cgroups framework.
      
      [lizf@cn.fujitsu.com: build fix]
      Signed-off-by: NPaul Menage <menage@google.com>
      Cc: Paul Jackson <pj@sgi.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Serge Hallyn <serue@us.ibm.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: NLi Zefan <lizf@cn.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      856c13aa
    • M
      jbd: fix race between free buffer and commit transaction · 3f31fddf
      Mingming Cao 提交于
      journal_try_to_free_buffers() could race with jbd commit transaction when
      the later is holding the buffer reference while waiting for the data
      buffer to flush to disk.  If the caller of journal_try_to_free_buffers()
      request tries hard to release the buffers, it will treat the failure as
      error and return back to the caller.  We have seen the directo IO failed
      due to this race.  Some of the caller of releasepage() also expecting the
      buffer to be dropped when passed with GFP_KERNEL mask to the
      releasepage()->journal_try_to_free_buffers().
      
      With this patch, if the caller is passing the __GFP_WAIT and __GFP_FS to
      indicating this call could wait, in case of try_to_free_buffers() failed,
      let's waiting for journal_commit_transaction() to finish commit the
      current committing transaction, then try to free those buffers again.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: NMingming Cao <cmm@us.ibm.com>
      Reviewed-by: NBadari Pulavarty <pbadari@us.ibm.com>
      Acked-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3f31fddf
    • O
  3. 25 7月, 2008 19 次提交