1. 23 Oct, 2008 2 commits
  2. 21 Oct, 2008 2 commits
  3. 20 Oct, 2008 36 commits
    • A
      make mm/rmap.c:anon_vma_cachep static · fdd2e5f8
      Adrian Bunk committed
      This patch makes the needlessly global anon_vma_cachep static.
      Signed-off-by: Adrian Bunk <bunk@kernel.org>
      Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Acked-by: Rik van Riel <riel@redhat.com>
      Acked-by: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      fdd2e5f8
    • K
      memcg: allocate all page_cgroup at boot · 52d4b9ac
      KAMEZAWA Hiroyuki committed
      Allocate all page_cgroup at boot and remove the page_cgroup pointer from
      struct page.  This patch adds the following interface:
      
       struct page_cgroup *lookup_page_cgroup(struct page*)
      
      FLATMEM, DISCONTIGMEM, SPARSEMEM and MEMORY_HOTPLUG are all supported.

      Removing the page_cgroup pointer reduces memory usage by
       - 4 bytes per PAGE_SIZE (with 4-byte pointers), or
       - 8 bytes per PAGE_SIZE (with 8-byte pointers)
      if the memory controller is disabled (even if it is configured).
      
      On a typical 8GB x86-32 server, this saves 8MB of NORMAL_ZONE memory.
      On my x86-64 server with 48GB of memory, this saves 96MB of memory.
      I think this reduction makes sense.
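The savings quoted above follow from dropping one pointer per page. A minimal sketch of that arithmetic, assuming 4 KiB pages and a pointer size of 4 bytes on x86-32 and 8 bytes on x86-64; the helper name is hypothetical and is not part of the patch:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: bytes saved by removing one pointer from every
 * struct page, given total memory, page size and pointer size in bytes.
 */
static uint64_t page_cgroup_pointer_savings(uint64_t mem_bytes,
                                            uint64_t page_size,
                                            uint64_t pointer_size)
{
	assert(page_size != 0);
	/* One struct page (and so one removed pointer) per page of memory. */
	return (mem_bytes / page_size) * pointer_size;
}
```

Called with 8 GiB and 4-byte pointers this gives 8 MiB, and with 48 GiB and 8-byte pointers it gives 96 MiB, matching the two figures in the message.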
      
      By pre-allocation, kmalloc/kfree in charge/uncharge are removed.
      This means
        - we no longer need to worry about kmalloc failure.
          (this can happen because of the gfp_mask type.)
        - we can avoid calling kmalloc/kfree.
        - we can avoid allocating tons of small objects which can be fragmented.
        - we can know what amount of memory will be used for this extra-lru handling.
      
      I added a printk message:

      	"allocated %ld bytes of page_cgroup"
      	"please try cgroup_disable=memory option if you don't want"

      which should be informative enough for users.
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reviewed-by: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      52d4b9ac
    • K
      memcg: atomic ops for page_cgroup->flags · c05555b5
      KAMEZAWA Hiroyuki committed
      This patch makes updates to page_cgroup->flags atomic and defines functions
      (and macros) to access it.

      These atomic operations on flags are necessary before we modify the memory
      resource controller further.  Most of the flags in this patch are for LRU and
      are modified under mz->lru_lock, but we'll soon add other flags which are not
      for LRU.  For example, we'll place a LOCK bit in the flags field.  We need
      atomic operations to modify the LRU bit without holding LOCK.
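The point about unrelated bits sharing one word can be illustrated in user space. This is only a sketch using C11 atomics as a stand-in for the kernel's atomic bit operations; the struct, the bit numbers and the function names are assumptions made for illustration:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Illustrative bit numbers; these names are assumptions, not the kernel's. */
enum { FLAG_LRU = 0, FLAG_LOCK = 1 };

struct pc_flags {
	atomic_ulong flags;
};

static unsigned long flag_mask(int bit)
{
	assert(bit == FLAG_LRU || bit == FLAG_LOCK);
	return 1UL << bit;
}

/* Atomic read-modify-write: a concurrent update of another bit is not lost. */
static void flag_set(struct pc_flags *pc, int bit)
{
	atomic_fetch_or(&pc->flags, flag_mask(bit));
}

static void flag_clear(struct pc_flags *pc, int bit)
{
	atomic_fetch_and(&pc->flags, ~flag_mask(bit));
}

static bool flag_test(struct pc_flags *pc, int bit)
{
	return (atomic_load(&pc->flags) & flag_mask(bit)) != 0;
}
```

With a plain `flags |= mask`, two CPUs updating different bits under different locks could overwrite each other's store, which is why the bits need atomic updates once they are no longer all covered by the same lock.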
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c05555b5
    • K
      memcg: optimize per-cpu statistics · addb9efe
      KAMEZAWA Hiroyuki committed
      Some obvious optimizations to memcg.

      I found that mem_cgroup_charge_statistics() is a little big (in object
      size) and does unnecessary address calculation.  This patch reduces the
      size of this function.

      Also, res_counter_charge() is 'likely' to succeed.
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      addb9efe
    • K
      memcg: avoid accounting special pages · 5b4e655e
      KAMEZAWA Hiroyuki committed
      There are not-on-LRU pages which can be mapped, and they are not worth
      accounting (because we can't shrink them and would need dirty code to
      handle the special case).  We'd like to use the usual objrmap/radix-tree
      protocol and don't want to account pages outside the VM's control.

      When special_mapping_fault() is called, page->mapping tends to be NULL
      and the page is charged as an anonymous page.  insert_page() also handles
      some special pages from drivers.

      This patch avoids accounting special pages.
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5b4e655e
    • K
      memcg: make page->mapping NULL before uncharge · b7abea96
      KAMEZAWA Hiroyuki committed
      This patch tries to make page->mapping NULL before
      mem_cgroup_uncharge_cache_page() is called.

      "page->mapping == NULL" is a good check for "whether the page is still
      in the radix-tree or not".  This patch also adds a BUG_ON() to
      mem_cgroup_uncharge_cache_page().
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reviewed-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b7abea96
    • K
      memcg: move charge swapin under lock · 073e587e
      KAMEZAWA Hiroyuki committed
      While page-cache's charge/uncharge is done under page_lock(), swap-cache's
      isn't.  (An anonymous page is charged when it's newly allocated.)

      This patch moves do_swap_page()'s charge() call under the lock.  I don't
      see any real problem *now*, but this fix will help avoid unnecessary racy
      states in the future.
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reviewed-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      073e587e
    • B
      mm: extract do_pages_move() out of sys_move_pages() · 5e9a0f02
      Brice Goglin committed
      To prepare for chunking, move the sys_move_pages() code that is used when
      nodes!=NULL into do_pages_move().  Also rename do_move_pages() to
      do_move_page_to_node_array().
      Signed-off-by: Brice Goglin <Brice.Goglin@inria.fr>
      Acked-by: Christoph Lameter <cl@linux-foundation.org>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5e9a0f02
    • B
      mm: don't vmalloc a huge page_to_node array for do_pages_stat() · 2f007e74
      Brice Goglin committed
      do_pages_stat() does not need any page_to_node entry for real.  Just pass
      the pointers to the user-space page address array and to the user-space
      status array, and have do_pages_stat() traverse the former and fill the
      latter directly.
      Signed-off-by: Brice Goglin <Brice.Goglin@inria.fr>
      Acked-by: Christoph Lameter <cl@linux-foundation.org>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2f007e74
    • B
      mm: stop returning -ENOENT from sys_move_pages() if nothing got migrated · e78bbfa8
      Brice Goglin committed
      A patchset reworking sys_move_pages().  It removes the possibly large
      vmalloc by using multiple chunks when migrating large buffers.  It also
      dramatically increases the throughput for large buffers since the lookup
      in new_page_node() is now limited to a single chunk, causing the quadratic
      complexity to have a much smaller impact.  There is no need to use any
      radix-tree-like structure to improve this lookup.

      sys_move_pages() duration on a 4-quadcore-opteron 2347HE (1.9GHz),
      migrating between nodes #2 and #3:
      
      	length		move_pages (us)		move_pages+patch (us)
      	4kB		126			98
      	40kB		198			168
      	400kB		963			937
      	4MB		12503			11930
      	40MB		246867			11848
      
      Patches #1 and #4 are the important ones:
      1) stop returning -ENOENT from sys_move_pages() if nothing got migrated
      2) don't vmalloc a huge page_to_node array for do_pages_stat()
      3) extract do_pages_move() out of sys_move_pages()
      4) rework do_pages_move() to work on page_sized chunks
      5) move_pages: no need to set pp->page to ZERO_PAGE(0) by default
      
      This patch:
      
      There is no point in returning -ENOENT from sys_move_pages() if all pages
      were already on the right node, while we return 0 if only 1 page was not.
      Most applications don't know where their pages are allocated, so it's not
      an error to try to migrate them anyway.
      
      Just return 0 and let the status array in user-space be checked if the
      application needs details.
      
      It will make the upcoming chunked-move_pages() support much easier.
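Since a return value of 0 no longer distinguishes "everything was already in place" from "something moved", a caller that cares has to inspect the status array. A minimal user-space sketch, assuming the move_pages(2) convention that each status entry is a node number when non-negative and a negated errno value otherwise; the helper name is hypothetical:

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: count the pages whose status entry is not the
 * requested node, i.e. pages that failed to move or ended up elsewhere.
 */
static size_t count_pages_not_on_node(const int *status, size_t count,
                                      int target_node)
{
	size_t misses = 0;

	assert(status != NULL || count == 0);
	for (size_t i = 0; i < count; i++) {
		/* Negative entries are errors; other nodes are also misses. */
		if (status[i] != target_node)
			misses++;
	}
	return misses;
}
```

An application would call this on the status array it passed to move_pages() after the call returned 0.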
      Signed-off-by: Brice Goglin <Brice.Goglin@inria.fr>
      Acked-by: Christoph Lameter <cl@linux-foundation.org>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e78bbfa8
    • N
      memory hotplug: release memory regions in PAGES_PER_SECTION chunks · de7f0cba
      Nathan Fontenot committed
      During hotplug memory remove, memory regions should be released in
      PAGES_PER_SECTION size chunks.  This mirrors the code in add_memory where
      resources are requested in PAGES_PER_SECTION size chunks.
      
      Attempting to release the entire memory region fails because there is not
      a single resource for the total number of pages being removed.  Instead
      the resources for the pages are split in PAGES_PER_SECTION size chunks as
      requested during memory add.
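The chunked release can be sketched as a simple loop. This is a user-space illustration only: the callback stands in for whatever resource-release call the kernel makes, and the function names and the section size used by a caller are assumptions:

```c
#include <assert.h>

/* Stand-in for the real per-chunk release operation (assumption). */
typedef int (*release_fn)(unsigned long start_pfn, unsigned long nr_pages,
                          void *ctx);

/*
 * Release [start_pfn, start_pfn + nr_pages) one section at a time.
 * Returns the number of chunks released, or the first negative error.
 */
static long release_in_sections(unsigned long start_pfn,
                                unsigned long nr_pages,
                                unsigned long pages_per_section,
                                release_fn release, void *ctx)
{
	long chunks = 0;

	assert(pages_per_section != 0);
	assert(nr_pages % pages_per_section == 0);

	for (unsigned long off = 0; off < nr_pages; off += pages_per_section) {
		int ret = release(start_pfn + off, pages_per_section, ctx);

		if (ret < 0)
			return ret;
		chunks++;
	}
	return chunks;
}

/* Example callback: just accumulates the number of pages released. */
static int count_release(unsigned long start_pfn, unsigned long nr_pages,
                         void *ctx)
{
	(void)start_pfn;
	*(unsigned long *)ctx += nr_pages;
	return 0;
}
```

Each call matches the size of one resource requested at add time, so no single release spans several resources.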
      Signed-off-by: Nathan Fontenot <nfont@austin.ibm.com>
      Signed-off-by: Badari Pulavarty <pbadari@us.ibm.com>
      Acked-by: Yasunori Goto <y-goto@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      de7f0cba
    • G
      setup_per_zone_pages_min(): take zone->lock instead of zone->lru_lock · 1125b4e3
      Gerald Schaefer committed
      This replaces zone->lru_lock in setup_per_zone_pages_min() with zone->lock.
      There seems to be no need for the lru_lock anymore, but there is a need for
      zone->lock instead, because that function may call move_freepages() via
      setup_zone_migrate_reserve().
      Signed-off-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
      Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Tested-by: Yasunori Goto <y-goto@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1125b4e3
    • K
      hugepage: support ZERO_PAGE() · 4b2e38ad
      KOSAKI Motohiro committed
      Presently hugepage doesn't use zero page at all because zero page is only
      used for coredumping and hugepage can't core dump.
      
      However we have now implemented hugepage coredumping.  Therefore we should
      implement the zero page of hugepage.
      
      Implementation note:
      
      o Why do we only check VM_SHARED for zero page?
        The normal page case is checked as:
      
      	static inline int use_zero_page(struct vm_area_struct *vma)
      	{
      	        if (vma->vm_flags & (VM_LOCKED | VM_SHARED))
      	                return 0;
      
      	        return !vma->vm_ops || !vma->vm_ops->fault;
      	}
      
      First, hugepages are never mlock()ed.  We aren't concerned with VM_LOCKED.
      
      Second, hugetlbfs is a pseudo filesystem, not a real filesystem and it
      doesn't have any file backing.  Thus ops->fault checking is meaningless.
      
      o Why don't we use zero page if !pte.
      
      !pte indicates that {pud, pmd} doesn't exist or that some error happened.
      So we shouldn't return the zero page if any error occurred.
      Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Adam Litke <agl@us.ibm.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: Kawai Hidehiro <hidehiro.kawai.ez@hitachi.com>
      Cc: Mel Gorman <mel@skynet.ie>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4b2e38ad
    • Y
      mm: print out meminit for memmap · d903ef9f
      Yinghai Lu committed
      Improve debuggability of memory setup problems.
      Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d903ef9f
    • H
      mm: hugetlb.c make functions static, use NULL rather than 0 · 2a4b3ded
      Harvey Harrison committed
      mm/hugetlb.c:265:17: warning: symbol 'resv_map_alloc' was not declared. Should it be static?
      mm/hugetlb.c:277:6: warning: symbol 'resv_map_release' was not declared. Should it be static?
      mm/hugetlb.c:292:9: warning: Using plain integer as NULL pointer
      mm/hugetlb.c:1750:5: warning: symbol 'unmap_ref_private' was not declared. Should it be static?
      Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
      Acked-by: Andy Whitcroft <apw@shadowen.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2a4b3ded
    • N
      mm: rewrite vmap layer · db64fe02
      Nick Piggin committed
      Rewrite the vmap allocator to use rbtrees and lazy tlb flushing, and
      provide a fast, scalable percpu frontend for small vmaps (requires a
      slightly different API, though).
      
      The biggest problem with vmap is actually vunmap.  Presently this requires
      a global kernel TLB flush, which on most architectures is a broadcast IPI
      to all CPUs to flush the cache.  This is all done under a global lock.  As
      the number of CPUs increases, so will the number of vunmaps a scaled
      workload will want to perform, and so will the cost of a global TLB flush.
       This gives terrible quadratic scalability characteristics.
      
      Another problem is that the entire vmap subsystem works under a single
      lock.  It is a rwlock, but it is actually taken for write in all the fast
      paths, and the read locking would likely never be run concurrently anyway,
      so it's just pointless.
      
      This is a rewrite of vmap subsystem to solve those problems.  The existing
      vmalloc API is implemented on top of the rewritten subsystem.
      
      The TLB flushing problem is solved by using lazy TLB unmapping.  vmap
      addresses do not have to be flushed immediately when they are vunmapped,
      because the kernel will not reuse them again (would be a use-after-free)
      until they are reallocated.  So the addresses aren't allocated again until
      a subsequent TLB flush.  A single TLB flush then can flush multiple
      vunmaps from each CPU.
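The batching effect of lazy unmapping can be shown with a toy model. This is a user-space sketch under stated assumptions, not the kernel's data structures: the struct, the threshold policy and the function name are all invented for illustration, and a "flush" is only counted, not performed:

```c
#include <assert.h>

/* Toy model of deferred flushing; every field here is an assumption. */
struct lazy_vmap {
	unsigned long pending;   /* unmapped ranges not yet flushed */
	unsigned long threshold; /* flush once this many are pending */
	unsigned long flushes;   /* simulated global TLB flushes so far */
};

/*
 * Record one vunmap. The range is not reusable until a flush happens,
 * so nothing needs flushing right away; one later flush covers all
 * ranges that accumulated in the meantime.
 */
static void lazy_vunmap(struct lazy_vmap *s)
{
	assert(s->threshold > 0);
	if (++s->pending >= s->threshold) {
		s->flushes++;
		s->pending = 0;
	}
}
```

With a threshold of 32, one hundred unmaps cost three simulated flushes in this model, where an eager scheme would have paid for one hundred.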
      
      XEN and PAT and such do not like deferred TLB flushing because they can't
      always handle multiple aliasing virtual addresses to a physical address.
      They now call vm_unmap_aliases() in order to flush any deferred mappings.
      That call is very expensive (well, actually not a lot more expensive than
      a single vunmap under the old scheme), however it should be OK if not
      called too often.
      
      The virtual memory extent information is stored in an rbtree rather than a
      linked list to improve the algorithmic scalability.
      
      There is a per-CPU allocator for small vmaps, which amortizes or avoids
      global locking.
      
      To use the per-CPU interface, the vm_map_ram / vm_unmap_ram interfaces
      must be used in place of vmap and vunmap.  Vmalloc does not use these
      interfaces at the moment, so it will not be quite so scalable (although it
      will use lazy TLB flushing).
      
      As a quick test of performance, I ran a test that loops in the kernel,
      linearly mapping then touching then unmapping 4 pages.  Different numbers
      of tests were run in parallel on a 4 core, 2 socket opteron.  Results are
      in nanoseconds per map+touch+unmap.
      
      threads           vanilla         vmap rewrite
      1                 14700           2900
      2                 33600           3000
      4                 49500           2800
      8                 70631           2900
      
      So with 8 cores, the rewritten version is already 25x faster.
      
      In a slightly more realistic test (although with an older and less
      scalable version of the patch), I ripped the not-very-good vunmap batching
      code out of XFS, and implemented the large buffer mapping with vm_map_ram
      and vm_unmap_ram...  along with a couple of other tricks, I was able to
      speed up a large directory workload by 20x on a 64 CPU system.  I believe
      vmap/vunmap is actually sped up a lot more than 20x on such a system, but
      I'm running into other locks now.  vmap is pretty well blown off the
      profiles.
      
      Before:
      1352059 total                                      0.1401
      798784 _write_lock                              8320.6667 <- vmlist_lock
      529313 default_idle                             1181.5022
       15242 smp_call_function                         15.8771  <- vmap tlb flushing
        2472 __get_vm_area_node                         1.9312  <- vmap
        1762 remove_vm_area                             4.5885  <- vunmap
         316 map_vm_area                                0.2297  <- vmap
         312 kfree                                      0.1950
         300 _spin_lock                                 3.1250
         252 sn_send_IPI_phys                           0.4375  <- tlb flushing
         238 vmap                                       0.8264  <- vmap
         216 find_lock_page                             0.5192
         196 find_next_bit                              0.3603
         136 sn2_send_IPI                               0.2024
         130 pio_phys_write_mmr                         2.0312
         118 unmap_kernel_range                         0.1229
      
      After:
       78406 total                                      0.0081
       40053 default_idle                              89.4040
       33576 ia64_spinlock_contention                 349.7500
        1650 _spin_lock                                17.1875
         319 __reg_op                                   0.5538
         281 _atomic_dec_and_lock                       1.0977
         153 mutex_unlock                               1.5938
         123 iget_locked                                0.1671
         117 xfs_dir_lookup                             0.1662
         117 dput                                       0.1406
         114 xfs_iget_core                              0.0268
          92 xfs_da_hashname                            0.1917
          75 d_alloc                                    0.0670
          68 vmap_page_range                            0.0462 <- vmap
          58 kmem_cache_alloc                           0.0604
          57 memset                                     0.0540
          52 rb_next                                    0.1625
          50 __copy_user                                0.0208
          49 bitmap_find_free_region                    0.2188 <- vmap
          46 ia64_sn_udelay                             0.1106
          45 find_inode_fast                            0.1406
          42 memcmp                                     0.2188
          42 finish_task_switch                         0.1094
          42 __d_lookup                                 0.0410
          40 radix_tree_lookup_slot                     0.1250
          37 _spin_unlock_irqrestore                    0.3854
          36 xfs_bmapi                                  0.0050
          36 kmem_cache_free                            0.0256
          35 xfs_vn_getattr                             0.0322
          34 radix_tree_lookup                          0.1062
          33 __link_path_walk                           0.0035
          31 xfs_da_do_buf                              0.0091
          30 _xfs_buf_find                              0.0204
          28 find_get_page                              0.0875
          27 xfs_iread                                  0.0241
          27 __strncpy_from_user                        0.2812
          26 _xfs_buf_initialize                        0.0406
          24 _xfs_buf_lookup_pages                      0.0179
          24 vunmap_page_range                          0.0250 <- vunmap
          23 find_lock_page                             0.0799
          22 vm_map_ram                                 0.0087 <- vmap
          20 kfree                                      0.0125
          19 put_page                                   0.0330
          18 __kmalloc                                  0.0176
          17 xfs_da_node_lookup_int                     0.0086
          17 _read_lock                                 0.0885
          17 page_waitqueue                             0.0664
      
      vmap has gone from being the top 5 on the profiles and flushing the crap
      out of all TLBs, to using less than 1% of kernel time.
      
      [akpm@linux-foundation.org: cleanups, section fix]
      [akpm@linux-foundation.org: fix build on alpha]
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Cc: Jeremy Fitzhardinge <jeremy@goop.org>
      Cc: Krzysztof Helt <krzysztof.h1@poczta.fm>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      db64fe02
    • D
      mmap.c: deinline a few functions · cb8f488c
      Denys Vlasenko committed
      __vma_link_file and expand_downwards functions are not small, yet they
      are marked inline.  They probably had one callsite sometime in the past,
      but now they have more.  In order to prevent a similar thing, I also
      deinlined expand_upwards, despite it having only one callsite.  Nowadays
      gcc auto-inlines such static functions anyway.  In find_extend_vma, I
      removed one extra level of indirection.
      
      Patch is deliberately generated with -U $BIGNUM to make
      it easier to see that functions are big.
      
      Result:
      
      # size */*/mmap.o */vmlinux
         text    data     bss     dec     hex filename
         9514     188      16    9718    25f6 0.org/mm/mmap.o
         9237     188      16    9441    24e1 deinline/mm/mmap.o
      6124402  858996  389480 7372878  70804e 0.org/vmlinux
      6124113  858996  389480 7372589  707f2d deinline/vmlinux
      Signed-off-by: Denys Vlasenko <vda.linux@googlemail.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      cb8f488c
    • N
      mm: page lock use lock bitops · 8413ac9d
      Nick Piggin committed
      trylock_page, unlock_page open and close a critical section. Hence,
      we can use the lock bitops to get the desired memory ordering.
      
      Also, mark trylock as likely to succeed (and remove the annotation from
      callers).
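The ordering idea can be sketched in user space with C11 atomics: taking the lock bit needs acquire semantics and dropping it needs release semantics, with no stronger barrier on either side. The bit number and function names below are assumptions; the kernel has its own lock bitops (such as test_and_set_bit_lock() and clear_bit_unlock()) rather than these helpers:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define LOCKED_MASK 1UL /* illustrative lock bit, not the kernel's value */

/* Try to take the bit lock; acquire ordering opens the critical section. */
static bool trylock_flag(atomic_ulong *flags)
{
	unsigned long old = atomic_fetch_or_explicit(flags, LOCKED_MASK,
						     memory_order_acquire);

	return (old & LOCKED_MASK) == 0;
}

/* Drop the bit lock; release ordering closes the critical section. */
static void unlock_flag(atomic_ulong *flags)
{
	assert(atomic_load_explicit(flags, memory_order_relaxed) & LOCKED_MASK);
	atomic_fetch_and_explicit(flags, ~LOCKED_MASK, memory_order_release);
}
```

Accesses inside the critical section cannot move before the acquire or after the release, which is all a lock needs.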
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8413ac9d
    • N
      mm: unlockless reclaim · a978d6f5
      Nick Piggin committed
      unlock_page is fairly expensive.  It can be avoided in the page reclaim
      success path.  By definition, if we had any other references to the page
      it would be a bug anyway.
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a978d6f5
    • N
      mm: pagecache insertion fewer atomics · f45840b5
      Nick Piggin committed
      Setting and clearing the page locked flag when inserting a page into
      swapcache / pagecache while it has no other references can use non-atomic
      page flags operations, because no other CPU may be operating on it at this
      time.

      This saves one atomic operation when inserting a page into pagecache.
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f45840b5
    • L
      mlock: make mlock error return Posixly Correct · 9978ad58
      Lee Schermerhorn committed
      Rework POSIX error return for mlock().

      POSIX requires error codes for mlock*() system calls for some conditions
      that differ from what kernel low-level functions, such as
      get_user_pages(), return for those conditions.  For more info, see:
      
      http://marc.info/?l=linux-kernel&m=121750892930775&w=2
      
      This patch provides the same translation of get_user_pages()
      error codes to posix specified error codes in the context
      of the mlock rework for unevictable lru.
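A minimal sketch of what such a translation might look like, assuming the two cases are EFAULT being reported as ENOMEM and ENOMEM being reported as EAGAIN. The function name is illustrative, and the exact mapping should be checked against the patch and the linked thread:

```c
#include <assert.h>
#include <errno.h>

/*
 * Illustrative translation of a low-level error (as a negative errno)
 * into the value an mlock()-style call would report. Assumed mapping:
 *   -EFAULT -> -ENOMEM  (part of the range is not mapped)
 *   -ENOMEM -> -EAGAIN  (memory could not be locked right now)
 * Any other value is passed through unchanged.
 */
static long mlock_posix_error(long retval)
{
	if (retval == -EFAULT)
		return -ENOMEM;
	if (retval == -ENOMEM)
		return -EAGAIN;
	return retval;
}
```

Keeping the translation in an mlock-specific helper leaves other callers of the low-level functions with the original error codes.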
      
      [akpm@linux-foundation.org: fix build]
      Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9978ad58
    • L
      mlock: revert mainline handling of mlock error return · c11d69d8
      Lee Schermerhorn committed
      This change is intended to make mlock() error returns correct.
      make_page_present() is a lower level function used by more than mlock().
      Subsequent patch[es] will add this error return fixup in an mlock specific
      path.

      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c11d69d8
    • J
      vmscan: don't accumulate scan pressure on unrelated lists · e0f79b8f
      Johannes Weiner committed
      During each reclaim scan we accumulate scan pressure on unrelated lists
      which will result in bogus scans and unwanted reclaims eventually.
      
      Scanning lists with few reclaim candidates results in a lot of rotation
      and therefore also disturbs the list balancing, putting even more
      pressure on the wrong lists.

      In a test-case with much streaming IO, and therefore a crowded inactive
      file page list, swapping started because
      
        a) anon pages were reclaimed after swap_cluster_max reclaim
        invocations -- nr_scan of this list has just accumulated
      
        b) active file pages were scanned because *their* nr_scan has also
        accumulated through the same logic.  And this in return created a
        lot of rotation for file pages and resulted in a decrease of file
        list priority, again increasing the pressure on anon pages.
      
      The result was an evicted working set of anon pages while there were
      tons of inactive file pages that should have been taken instead.
      Signed-off-by: Johannes Weiner <hannes@saeurebad.de>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e0f79b8f
    • L
      mlock: count attempts to free mlocked page · 985737cf
      Lee Schermerhorn committed
      Allow freeing of mlock()ed pages.  This shouldn't happen, but during
      development, it occasionally did.

      This patch allows us to survive that condition, while keeping the
      statistics and events correct for debug.
      Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Signed-off-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      985737cf
    • L
      vmscan: unevictable LRU scan sysctl · af936a16
      Lee Schermerhorn committed
      This patch adds a function to scan individual or all zones' unevictable
      lists and move any pages that have become evictable onto the respective
      zone's inactive list, where shrink_inactive_list() will deal with them.
      
      Adds a sysctl to scan all nodes, and per-node attributes to scan individual
      nodes' zones.

      Kosaki: If an evictable page is found on the unevictable LRU when writing to
      /proc/sys/vm/scan_unevictable_pages, print the filename and file offset of
      these pages.
      
      [akpm@linux-foundation.org: fix one CONFIG_MMU=n build error]
      [kosaki.motohiro@jp.fujitsu.com: adapt vmscan-unevictable-lru-scan-sysctl.patch to new sysfs API]
      Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Signed-off-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      af936a16
    • L
      swap: cull unevictable pages in fault path · 64d6519d
      Lee Schermerhorn committed
      In the fault paths that install new anonymous pages, check whether the
      page is evictable or not using lru_cache_add_active_or_unevictable().  If
      the page is evictable, just add it to the active lru list [via the pagevec
      cache], else add it to the unevictable list.
      
      This "proactive" culling in the fault path mimics the handling of mlocked
      pages in Nick Piggin's series to keep mlocked pages off the lru lists.
      
      Notes:
      
      1) This patch is optional--e.g., if one is concerned about the
         additional test in the fault path.  We can defer the moving of
         nonreclaimable pages until when vmscan [shrink_*_list()]
         encounters them.  Vmscan will only need to handle such pages
         once, but if there are a lot of them it could impact system
         performance.
      
      2) The 'vma' argument to page_evictable() is required to notice that
         we're faulting a page into an mlock()ed vma w/o having to scan the
         page's rmap in the fault path.   Culling mlock()ed anon pages is
         currently the only reason for this patch.
      
      3) We can't cull swap pages in read_swap_cache_async() because the
         vma argument doesn't necessarily correspond to the swap cache
         offset passed in by swapin_readahead().  This could [did!] result
         in mlocking pages in non-VM_LOCKED vmas if [when] we tried to
         cull in this path.
      
      4) Move set_pte_at() to after where we add page to lru to keep it
         hidden from other tasks that might walk the page table.
         We already do it in this order in do_anonymous_page().  And,
         these are COW'd anon pages.  Is this safe?
      
      [riel@redhat.com: undo an overzealous code cleanup]
      Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Signed-off-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      64d6519d
    • N
      vmstat: mlocked pages statistics · 5344b7e6
      Nick Piggin committed
      Add NR_MLOCK zone page state, which provides a (conservative) count of
      mlocked pages (actually, the number of mlocked pages moved off the LRU).
      
      Reworked by lts to fit in with the modified mlock page support in the
      Reclaim Scalability series.
      
      [kosaki.motohiro@jp.fujitsu.com: fix incorrect Mlocked field of /proc/meminfo]
      [lee.schermerhorn@hp.com: mlocked-pages: add event counting with statistics]
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Signed-off-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5344b7e6
    • R
      mmap: handle mlocked pages during map, remap, unmap · ba470de4
      Rik van Riel committed
      Originally by Nick Piggin <npiggin@suse.de>
      
      Remove mlocked pages from the LRU using "unevictable infrastructure"
      during mmap(), munmap(), mremap() and truncate().  Try to move back to
      normal LRU lists on munmap() when last mlocked mapping removed.  Remove
      PageMlocked() status when page truncated from file.
      
      [akpm@linux-foundation.org: cleanup]
      [kamezawa.hiroyu@jp.fujitsu.com: fix double unlock_page()]
      [kosaki.motohiro@jp.fujitsu.com: split LRU: munlock rework]
      [lee.schermerhorn@hp.com: mlock: fix __mlock_vma_pages_range comment block]
      [akpm@linux-foundation.org: remove bogus kerneldoc token]
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Signed-off-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: KAMEZAWA Hiroyuki <kamewzawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ba470de4
    • L
      mlock: downgrade mmap sem while populating mlocked regions · 8edb08ca
      Lee Schermerhorn committed
      We need to hold the mmap_sem for write to initiate mlock()/munlock()
      because we may need to merge/split vmas.  However, this can lead to very
      long lock hold times attempting to fault in a large memory region to mlock
      it into memory.  This can hold off other faults against the mm
      [multithreaded tasks] and other scans of the mm, such as via /proc.  To
      alleviate this, downgrade the mmap_sem to read mode during the population
      of the region for locking.  This is especially the case if we need to
      reclaim memory to lock down the region.  We [probably?] don't need to do
      this for unlocking as all of the pages should be resident--they're already
      mlocked.
      
      Now, the callers of the mlock functions [mlock_fixup() and
      mlock_vma_pages_range()] expect the mmap_sem to be returned in write mode.
       Changing all callers appears to be way too much effort at this point.
      So, restore write mode before returning.  Note that this opens a window
      where the mmap list could change in a multithreaded process.  So, at least
      for mlock_fixup(), where we could be called in a loop over multiple vmas,
      we check that a vma still exists at the start address and that vma still
      covers the page range [start,end).  If not, we return an error, -EAGAIN,
      and let the caller deal with it.
      
      Return -EAGAIN from mlock_vma_pages_range() function and mlock_fixup() if
      the vma at 'start' disappears or changes so that the page range
      [start,end) is no longer contained in the vma.  Again, let the caller deal
      with it.  Looks like only sys_remap_file_pages() [via mmap_region()]
      should actually care.
      
      With this patch, I no longer see processes like ps(1) blocked for seconds
      or minutes at a time waiting for a large [multiple gigabyte] region to be
      locked down.  However, I occasionally see delays while unlocking or
      unmapping a large mlocked region.  Should we also downgrade the mmap_sem
      for the unlock path?
      Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Signed-off-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8edb08ca
    • N
      mlock: mlocked pages are unevictable · b291f000
      Nick Piggin committed
      Make sure that mlocked pages also live on the unevictable LRU, so kswapd
      will not scan them over and over again.
      
      This is achieved through various strategies:
      
      1) add yet another page flag--PG_mlocked--to indicate that
         the page is locked for efficient testing in vmscan and,
         optionally, fault path.  This allows early culling of
         unevictable pages, preventing them from getting to
         page_referenced()/try_to_unmap().  Also allows separate
         accounting of mlock'd pages, as Nick's original patch
         did.
      
         Note:  Nick's original mlock patch used a PG_mlocked
         flag.  I had removed this in favor of the PG_unevictable
         flag + an mlock_count [new page struct member].  I
         restored the PG_mlocked flag to eliminate the new
         count field.
      
      2) add the mlock/unevictable infrastructure to mm/mlock.c,
         with internal APIs in mm/internal.h.  This is a rework
         of Nick's original patch to these files, taking into
         account that mlocked pages are now kept on unevictable
         LRU list.
      
      3) update vmscan.c:page_evictable() to check PageMlocked()
         and, if vma passed in, the vm_flags.  Note that the vma
         will only be passed in for new pages in the fault path;
         and then only if the "cull unevictable pages in fault
         path" patch is included.
      
      4) add try_to_unlock() to rmap.c to walk a page's rmap and
         ClearPageMlocked() if no other vmas have it mlocked.
         Reuses as much of try_to_unmap() as possible.  This
         effectively replaces the use of one of the lru list links
         as an mlock count.  If this mechanism lets pages in mlocked
         vmas leak through w/o PG_mlocked set [I don't know that it
         does], we should catch them later in try_to_unmap().  One
         hopes this will be rare, as it will be relatively expensive.
      
      Original mm/internal.h, mm/rmap.c and mm/mlock.c changes:
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      
      splitlru: introduce __get_user_pages():
      
        New munlock processing needs GUP_FLAGS_IGNORE_VMA_PERMISSIONS,
        because the current get_user_pages() can't grab PROT_NONE pages and
        therefore PROT_NONE pages can't be munlocked.
      
      [akpm@linux-foundation.org: fix this for pagemap-pass-mm-into-pagewalkers.patch]
      [akpm@linux-foundation.org: untangle patch interdependencies]
      [akpm@linux-foundation.org: fix things after out-of-order merging]
      [hugh@veritas.com: fix page-flags mess]
      [lee.schermerhorn@hp.com: fix munlock page table walk - now requires 'mm']
      [kosaki.motohiro@jp.fujitsu.com: build fix]
      [kosaki.motohiro@jp.fujitsu.com: fix truncate race and several comments]
      [kosaki.motohiro@jp.fujitsu.com: splitlru: introduce __get_user_pages()]
      Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: Dave Hansen <dave@linux.vnet.ibm.com>
      Cc: Matt Mackall <mpm@selenic.com>
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b291f000
    • L
      SHM_LOCKED pages are unevictable · 89e004ea
      Lee Schermerhorn committed
      Shmem segments locked into memory via shmctl(SHM_LOCK) should not be
      kept on the normal LRU, since scanning them is a waste of time and might
      throw off kswapd's balancing algorithms.  Place them on the unevictable
      LRU list instead.
      
      Use the AS_UNEVICTABLE flag to mark address_space of SHM_LOCKed shared
      memory regions as unevictable.  Then these pages will be culled off the
      normal LRU lists during vmscan.
      
      Add new wrapper function to clear the mapping's unevictable state when/if
      shared memory segment is munlocked.
      
      Add 'scan_mapping_unevictable_page()' to mm/vmscan.c to scan all pages in
      the shmem segment's mapping [struct address_space] for evictability now
      that they're no longer locked.  Move any pages that are now evictable to
      the appropriate zone lru list.
      
      Changes depend on [CONFIG_]UNEVICTABLE_LRU.
      
      [kosaki.motohiro@jp.fujitsu.com: revert shm change]
      Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Signed-off-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: Kosaki Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      89e004ea
    • L
      Ramfs and Ram Disk pages are unevictable · ba9ddf49
      Lee Schermerhorn committed
      Christoph Lameter pointed out that ram disk pages also clutter the LRU
      lists.  When vmscan finds them dirty and tries to clean them, the ram disk
      writeback function just redirties the page so that it goes back onto the
      active list.  Round and round she goes...
      
      With the ram disk driver [rd.c] replaced by the newer 'brd.c', this is no
      longer the case, as ram disk pages are no longer maintained on the lru.
      [This makes them unmigratable for defrag or memory hot remove, but that
      can be addressed by a separate patch series.] However, the ramfs pages
      behave like ram disk pages used to, so:
      
      Define new address_space flag [shares address_space flags member with
      mapping's gfp mask] to indicate that the address space contains all
      unevictable pages.  This will provide for efficient testing of ramfs pages
      in page_evictable().
      
      Also provide wrapper functions to set/test the unevictable state to
      minimize #ifdefs in ramfs driver and any other users of this facility.
      
      Set the unevictable state on address_space structures for new ramfs
      inodes.  Test the unevictable state in page_evictable() to cull
      unevictable pages.
      
      These changes depend on [CONFIG_]UNEVICTABLE_LRU.
      
      [riel@redhat.com: undo the brd.c part]
      Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Signed-off-by: Rik van Riel <riel@redhat.com>
      Debugged-by: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ba9ddf49
    • L
      Unevictable LRU Page Statistics · 7b854121
      Lee Schermerhorn committed
      Report unevictable pages per zone and system-wide.
      
      Kosaki Motohiro added support for memory controller unevictable
      statistics.
      
      [riel@redhat.com: fix printk in show_free_areas()]
      [akpm@linux-foundation.org: fix units in /proc/vmstats]
      Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Signed-off-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Debugged-by: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7b854121
    • L
      unevictable lru: add event counting with statistics · bbfd28ee
      Lee Schermerhorn committed
      Fix to unevictable-lru-page-statistics.patch
      
      Add unevictable lru infrastructure vm events to the statistics patch.
      Rename the "NORECL_" and "noreclaim_" symbols and text strings to
      "UNEVICTABLE_" and "unevictable_", respectively.
      
      Currently, both the infrastructure and the mlocked pages event are
      added by a single patch later in the series.  This makes it difficult
      to add or rework the incremental patches.  The events actually "belong"
      with the stats, so pull them up to here.
      
      Also, restore the event counting to putback_lru_page().  This was removed
      from previous patch in series where it was "misplaced".  The actual events
      weren't defined that early.
      Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Rik van Riel <riel@redhat.com>
      Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bbfd28ee
    • L
      Unevictable LRU Infrastructure · 894bc310
      Lee Schermerhorn committed
      When the system contains lots of mlocked or otherwise unevictable pages,
      the pageout code (kswapd) can spend lots of time scanning over these
      pages.  Worse still, the presence of lots of unevictable pages can confuse
      kswapd into thinking that more aggressive pageout modes are required,
      resulting in all kinds of bad behaviour.
      
      Infrastructure to manage pages excluded from reclaim--i.e., hidden from
      vmscan.  Based on a patch by Larry Woodman of Red Hat.  Reworked to
      maintain "unevictable" pages on a separate per-zone LRU list, to "hide"
      them from vmscan.
      
      Kosaki Motohiro added the support for the memory controller unevictable
      lru list.
      
      Pages on the unevictable list have both PG_unevictable and PG_lru set.
      Thus, PG_unevictable is analogous to and mutually exclusive with
      PG_active--it specifies which LRU list the page is on.
      
      The unevictable infrastructure is enabled by a new mm Kconfig option
      [CONFIG_]UNEVICTABLE_LRU.
      
      A new function 'page_evictable(page, vma)' in vmscan.c tests whether or
      not a page may be evictable.  Subsequent patches will add the various
      !evictable tests.  We'll want to keep these tests light-weight for use in
      shrink_active_list() and, possibly, the fault path.
      
      To avoid races between tasks putting pages [back] onto an LRU list and
      tasks that might be moving the page from non-evictable to evictable state,
      the new function 'putback_lru_page()' -- inverse to 'isolate_lru_page()'
      -- tests the "evictability" of a page after placing it on the LRU, before
      dropping the reference.  If the page has become unevictable,
      putback_lru_page() will redo the 'putback', thus moving the page to the
      unevictable list.  This way, we avoid "stranding" evictable pages on the
      unevictable list.
      
      [akpm@linux-foundation.org: fix fallout from out-of-order merge]
      [riel@redhat.com: fix UNEVICTABLE_LRU and !PROC_PAGE_MONITOR build]
      [nishimura@mxp.nes.nec.co.jp: remove redundant mapping check]
      [kosaki.motohiro@jp.fujitsu.com: unevictable-lru-infrastructure: putback_lru_page()/unevictable page handling rework]
      [kosaki.motohiro@jp.fujitsu.com: kill unnecessary lock_page() in vmscan.c]
      [kosaki.motohiro@jp.fujitsu.com: revert migration change of unevictable lru infrastructure]
      [kosaki.motohiro@jp.fujitsu.com: revert to unevictable-lru-infrastructure-kconfig-fix.patch]
      [kosaki.motohiro@jp.fujitsu.com: restore patch failure of vmstat-unevictable-and-mlocked-pages-vm-events.patch]
      Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Signed-off-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Debugged-by: Benjamin Kidwell <benjkidwell@yahoo.com>
      Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      894bc310
    • R
      more aggressively use lumpy reclaim · 33c120ed
      Rik van Riel committed
      During an AIM7 run on a 16GB system, fork started failing around 32000
      threads, despite the system having plenty of free swap and 15GB of
      pageable memory.  This was on x86-64, so 8k stacks.
      
      If a higher order allocation fails, we can either:
      - keep evicting pages off the end of the LRUs and hope that
        we eventually create a contiguous region; this is somewhat
        unlikely if the system is under enough stress by new
        allocations
      - after trying normal eviction for a bit, use lumpy reclaim
      
      This patch switches the system to lumpy reclaim if the VM is having
      trouble freeing enough pages, using the same threshold for detection as
      used by pageout congestion wait.
      Signed-off-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      33c120ed