1. 17 Jun, 2009 31 commits
    •
      vmscan: report vm_flags in page_referenced() · 6fe6b7e3
      Wu Fengguang authored
      Collect vma->vm_flags of the VMAs that actually referenced the page.
      
      This is preparing for more informed reclaim heuristics, e.g. to protect
      executable file pages more aggressively.  For now only the VM_EXEC bit
      will be used by the caller.
      
      Thanks to Johannes, Peter and Minchan for all the good tips.
      Acked-by: Peter Zijlstra <peterz@infradead.org>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
      Reviewed-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
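As a rough illustration of the idea in the commit above, the sketch below counts the mappings that referenced a page and ORs together the flags of only those mappings. The `vma_sketch` type, the flag values, and `page_referenced_sketch()` are hypothetical stand-ins, not the kernel's real definitions or signature.

```c
#include <stddef.h>

/* Hypothetical flag values for illustration; not the kernel's definitions. */
#define VM_READ  0x1UL
#define VM_WRITE 0x2UL
#define VM_EXEC  0x4UL

/* Hypothetical, heavily simplified stand-in for a VMA. */
struct vma_sketch {
	unsigned long vm_flags;	/* permission bits of this mapping */
	int referenced;		/* 1 if this mapping recently touched the page */
};

/*
 * Return how many mappings referenced the page, and report through
 * *vm_flags the union of the flags of only those mappings.
 */
static int page_referenced_sketch(const struct vma_sketch *vmas, size_t n,
				  unsigned long *vm_flags)
{
	int referenced = 0;
	size_t i;

	*vm_flags = 0;
	for (i = 0; i < n; i++) {
		if (vmas[i].referenced) {
			referenced++;
			*vm_flags |= vmas[i].vm_flags;
		}
	}
	return referenced;
}
```

A caller that only cares about executable mappings could then test `flags & VM_EXEC` on the reported value.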
    •
      mm cleanup: shmem_file_setup: 'char *' -> 'const char *' for name argument · 168f5ac6
      Sergei Trofimovich authored
      As the function shmem_file_setup() does not modify, allocate, free, or
      pass on the given filename, mark it as const.
      Signed-off-by: Sergei Trofimovich <slyfox@inbox.ru>
      Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    •
      mm: remove file argument from swap_readpage() · aca8bf32
      Minchan Kim authored
      The file argument is a leftover from address_space's readpage from a
      long time ago.

      We don't use it any more, so remove the unnecessary argument.
      Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
      Acked-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    •
      mm: remove __invalidate_mapping_pages variant · 28697355
      Mike Waychison authored
      Remove the atomic variant of __invalidate_mapping_pages now that its sole
      caller can sleep (fixed in eccb95ce ("vfs: fix lock inversion in
      drop_pagecache_sb()")).
      
      This fixes softlockups that can occur while in the drop_caches path.
      Signed-off-by: Mike Waychison <mikew@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Acked-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    •
      oom: move oom_adj value from task_struct to mm_struct · 2ff05b2b
      David Rientjes authored
      The per-task oom_adj value is a characteristic of its mm more than the
      task itself since it's not possible to oom kill any thread that shares the
      mm.  If a task were to be killed while attached to an mm that could not be
      freed because another thread was set to OOM_DISABLE, it would have
      needlessly been terminated since there is no potential for future memory
      freeing.
      
      This patch moves oomkilladj (now more appropriately named oom_adj) from
      struct task_struct to struct mm_struct.  This requires task_lock() on a
      task to check its oom_adj value to protect against exec, but it's already
      necessary to take the lock when dereferencing the mm to find the total VM
      size for the badness heuristic.
      
      This fixes a livelock if the oom killer chooses a task and another thread
      sharing the same memory has an oom_adj value of OOM_DISABLE.  This occurs
      because oom_kill_task() repeatedly returns 1 and refuses to kill the
      chosen task while select_bad_process() will repeatedly choose the same
      task during the next retry.
      
      Taking task_lock() in select_bad_process() to check for OOM_DISABLE and in
      oom_kill_task() to check for threads sharing the same memory will be
      removed in the next patch in this series where it will no longer be
      necessary.
      
      Writing to /proc/pid/oom_adj for a kthread will now return -EINVAL since
      these threads are immune from oom killing already.  They simply report an
      oom_adj value of OOM_DISABLE.
      
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    •
      mm: modify swap_map and add SWAP_HAS_CACHE flag · 355cfa73
      KAMEZAWA Hiroyuki authored
      This is a part of the patches for fixing memcg's swap accounting leak.
      But, IMHO, it is not a bad patch even without memcg.
      
      There are 2 kinds of references to swap.
       - reference from swap entry
       - reference from swap cache
      
      Then,
      
       - If there is swap cache && swap's refcnt is 1, there is only swap cache.
        (*) swapcount(entry) == 1 && find_get_page(swapper_space, entry) != NULL
      
      This counting logic has worked well for a long time.  But considering
      that we cannot tell from swap_map[] whether there is a _real_ reference
      or not, the current usage of the counter is not very good.

      This patch adds a flag SWAP_HAS_CACHE and records whether a swap entry
      has a cache or not.  This removes the -1 magic used in swapfile.c and
      helps to avoid unnecessary find_get_page() calls.
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Tested-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Dhaval Giani <dhaval@linux.vnet.ibm.com>
      Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
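One way to picture the SWAP_HAS_CACHE idea from the commit above is a per-entry value that keeps the swap cache's reference in a dedicated flag bit, separate from the count of real references. The bit layout, the 16-bit entry width, and the helper names below are assumptions made for this sketch; the commit message does not spell them out.

```c
/* Assumed layout for illustration: top bit is the flag, the rest is a count. */
#define SWAP_HAS_CACHE  0x8000u
#define SWAP_COUNT_MASK 0x7fffu

/* Number of real (non-cache) references recorded in the entry. */
static unsigned int swap_count_sketch(unsigned short ent)
{
	return ent & SWAP_COUNT_MASK;
}

/* Whether the swap cache holds a reference to this entry. */
static int swap_has_cache_sketch(unsigned short ent)
{
	return (ent & SWAP_HAS_CACHE) != 0;
}

/*
 * "Only the swap cache uses this entry" becomes a plain bit test,
 * with no page cache lookup needed to tell the two cases apart.
 */
static int swap_only_cache_sketch(unsigned short ent)
{
	return swap_has_cache_sketch(ent) && swap_count_sketch(ent) == 0;
}
```

With the flag kept apart from the count, a count of 1 always means one real reference, which is the ambiguity the commit message complains about in the old scheme.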
    •
      mm: add swap cache interface for swap reference · cb4b86ba
      KAMEZAWA Hiroyuki authored
      A following patch records the usage of swap cache in swap_map.
      This patch makes the interface changes needed for that.
      
      2 interfaces:
      
        - swapcache_prepare()
        - swapcache_free()
      
      are added for allocating/freeing the swap cache's refcnt on existing swap
      entries.  The implementation itself is not changed by this patch.  As
      part of adding swapcache_free(), memcg's hook code is moved into
      swapcache_free().  This is better than using scattered hooks.
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reviewed-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Acked-by: Balbir Singh <balbir@in.ibm.com>
      Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Dhaval Giani <dhaval@linux.vnet.ibm.com>
      Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    •
      mm: remove CONFIG_UNEVICTABLE_LRU config option · 68377659
      KOSAKI Motohiro authored
      Currently, nobody wants to turn UNEVICTABLE_LRU off.  Thus this
      configurability is unnecessary.
      Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Andi Kleen <andi@firstfloor.org>
      Acked-by: Minchan Kim <minchan.kim@gmail.com>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Cc: Matt Mackall <mpm@selenic.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    •
      page-allocator: add inactive ratio calculation function of each zone · 96cb4df5
      Minchan Kim authored
      Factor the per-zone arithmetic inside setup_per_zone_inactive_ratio()'s
      loop into a separate function, calculate_zone_inactive_ratio().  This
      function will be used in a later patch.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
      Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    •
      page-allocator: clean up functions related to pages_min · bc75d33f
      Minchan Kim authored
      Change the names of two functions.  This doesn't affect behavior.

      Presently, setup_per_zone_pages_min() changes the low and high watermarks
      of a zone as well as min, so a better name is setup_per_zone_wmarks().
      That's because Mel changed zone->pages_[high/low/min] to the
      zone->watermark array in "page allocator: replace the watermark-related
      union in struct zone with a watermark[] array".
      
       * setup_per_zone_pages_min => setup_per_zone_wmarks
      
      Of course, we have to change init_per_zone_pages_min, too.  There is no
      pages_min any more.
      
       * init_per_zone_pages_min => init_per_zone_wmark_min
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
      Acked-by: Mel Gorman <mel@csn.ul.ie>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    •
      page-allocator: use integer fields lookup for gfp_zone and check for errors in... · b70d94ee
      Christoph Lameter authored
      page-allocator: use integer fields lookup for gfp_zone and check for errors in flags passed to the page allocator
      
      This simplifies the code in gfp_zone() and also keeps the ability of the
      compiler to use constant folding to get rid of gfp_zone processing.
      
      The lookup of the zone is done using a bitfield stored in an integer.  So
      the code in gfp_zone is a simple extraction of bits from a constant
      bitfield.  The compiler is generating a load of a constant into a register
      and then performs a shift and mask operation to get the zone from a gfp_t.
       No cachelines are touched and no branches have to be predicted by the
      compiler.
      
      We are doing some macro tricks here to convince the compiler to always do
      the constant folding if possible.
      Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reviewed-by: Mel Gorman <mel@csn.ul.ie>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
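The "bitfield stored in an integer" lookup described in the commit above can be sketched as follows. The zone numbering, the three flag bits, and the names `ZONE_TABLE`, `ZONE_BAD`, and `gfp_zone_sketch()` are assumptions chosen for this example and do not reproduce the kernel's actual tables.

```c
/* Hypothetical zone numbering and flag bits, for illustration only. */
enum zone_sketch { ZONE_DMA = 0, ZONE_DMA32 = 1, ZONE_NORMAL = 2, ZONE_HIGHMEM = 3 };

#define GFP_DMA_BIT     0x1u
#define GFP_HIGHMEM_BIT 0x2u
#define GFP_DMA32_BIT   0x4u
#define GFP_ZONEMASK    0x7u
#define ZONES_SHIFT     2u	/* bits needed to store one zone number */

/* One ZONES_SHIFT-wide slot per flag combination, packed into a constant. */
#define ZONE_TABLE ( \
	((unsigned)ZONE_NORMAL  << (0u              * ZONES_SHIFT)) | \
	((unsigned)ZONE_DMA     << (GFP_DMA_BIT     * ZONES_SHIFT)) | \
	((unsigned)ZONE_HIGHMEM << (GFP_HIGHMEM_BIT * ZONES_SHIFT)) | \
	((unsigned)ZONE_DMA32   << (GFP_DMA32_BIT   * ZONES_SHIFT)))

/* One bit per flag combination that makes no sense (more than one zone bit). */
#define ZONE_BAD ((1u << 3) | (1u << 5) | (1u << 6) | (1u << 7))

/* Shift and mask on a constant: no memory lookup and no branch. */
static inline enum zone_sketch gfp_zone_sketch(unsigned flags)
{
	unsigned bits = flags & GFP_ZONEMASK;

	return (enum zone_sketch)((ZONE_TABLE >> (bits * ZONES_SHIFT)) &
				  ((1u << ZONES_SHIFT) - 1u));
}

/* Nonzero if the zone bits in flags form an invalid combination. */
static inline int gfp_zone_is_bad(unsigned flags)
{
	return (int)((ZONE_BAD >> (flags & GFP_ZONEMASK)) & 1u);
}
```

Because `ZONE_TABLE` is a compile-time constant, a call with constant flags can be folded away entirely, which is the property the commit message says it wants to keep.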
    •
      mm: check the argument of kunmap on architectures without highmem · 31c91132
      Matthew Wilcox authored
      If you're using a non-highmem architecture, passing an argument with the
      wrong type to kunmap() doesn't give you a warning because the ifdef
      doesn't check the type.
      
      Using a static inline function solves the problem nicely.
      Reported-by: David Woodhouse <dwmw2@infradead.org>
      Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
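The difference between a macro stub and a static inline stub can be illustrated like this. The `page_sketch` type and both function names are hypothetical; the macro form is only an assumed example of a definition that discards its argument without checking its type.

```c
/* Hypothetical stand-in for struct page. */
struct page_sketch {
	int dummy;
};

/*
 * Macro stub: the argument is evaluated and discarded, so any
 * expression compiles, whatever its type.
 */
#define kunmap_macro_sketch(page) do { (void)(page); } while (0)

/*
 * Static inline stub: still compiles to nothing, but the compiler now
 * checks that the argument really is a struct page_sketch pointer.
 */
static inline void kunmap_checked_sketch(struct page_sketch *page)
{
	(void)page;
}
```

Calling `kunmap_macro_sketch(42)` compiles silently, whereas `kunmap_checked_sketch(42)` would draw a diagnostic about converting an integer to a pointer.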
    •
      mm, PM/Freezer: Disable OOM killer when tasks are frozen · 7f33d49a
      Rafael J. Wysocki authored
      Currently, the following scenario appears to be possible in theory:
      
      * Tasks are frozen for hibernation or suspend.
      * Free pages are almost exhausted.
      * A certain piece of code in the suspend code path attempts to allocate
        some memory using GFP_KERNEL and allocation order less than or
        equal to PAGE_ALLOC_COSTLY_ORDER.
      * __alloc_pages_internal() cannot find a free page so it invokes the
        OOM killer.
      * The OOM killer attempts to kill a task, but the task is frozen, so
        it doesn't die immediately.
      * __alloc_pages_internal() jumps to 'restart', unsuccessfully tries
        to find a free page and invokes the OOM killer.
      * No progress can be made.
      
      Although it is now hard to trigger during hibernation due to the memory
      shrinking carried out by the hibernation code, it is theoretically
      possible to trigger during suspend after the memory shrinking has been
      removed from that code path.  Moreover, since memory allocations are
      going to be used for the hibernation memory shrinking, it will be even
      more likely to happen during hibernation.
      
      To prevent it from happening, introduce the oom_killer_disabled switch
      that will cause __alloc_pages_internal() to fail in the situations in
      which the OOM killer would have been called and make the freezer set
      this switch after tasks have been successfully frozen.
      
      [akpm@linux-foundation.org: be nicer to the namespace]
      Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
      Cc: Fengguang Wu <fengguang.wu@gmail.com>
      Cc: David Rientjes <rientjes@google.com>
      Acked-by: Pavel Machek <pavel@ucw.cz>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
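The switch described in the commit above can be modelled with a toy allocator. Everything here is a simplified assumption: the free-page counter, the `oom_kills_sketch` counter standing in for the OOM killer, and the function names are invented for illustration and do not mirror the real allocator code.

```c
/* Hypothetical global switch, set by the freezer once tasks are frozen. */
static int oom_killer_disabled_sketch;

/* Counts how often the stand-in OOM killer was invoked. */
static int oom_kills_sketch;

static void oom_killer_disable_sketch(void)
{
	oom_killer_disabled_sketch = 1;
}

static void oom_killer_enable_sketch(void)
{
	oom_killer_disabled_sketch = 0;
}

/*
 * Toy allocation: returns 1 on success, 0 on failure.
 * With the switch set, an allocation that would have invoked the OOM
 * killer fails right away instead of retrying forever.
 */
static int alloc_page_sketch(int *free_pages)
{
	if (*free_pages > 0) {
		(*free_pages)--;
		return 1;
	}
	if (oom_killer_disabled_sketch)
		return 0;

	/* Stand-in for the OOM killer: pretend the kill freed one page. */
	oom_kills_sketch++;
	return 1;
}
```

In the frozen case the victim cannot exit, so the real retry loop would never make progress; failing the allocation lets the caller handle the error instead.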
    •
      mm: introduce follow_pfn() · 3b6748e2
      Johannes Weiner authored
      Analogous to follow_phys(), add a helper that looks up the PFN at a
      user virtual address in an IO mapping or a raw PFN mapping.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Christoph Hellwig <hch@infradead.org>
      Acked-by: Magnus Damm <magnus.damm@gmail.com>
      Cc: Hans Verkuil <hverkuil@xs4all.nl>
      Cc: Paul Mundt <lethal@linux-sh.org>
      Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    •
      vmscan: cleanup the scan batching code · 6e08a369
      Wu Fengguang authored
      The vmscan batching logic is convoluted.  Move it into a standalone function
      nr_scan_try_batch() and document it.  No behavior change.
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
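The commit message above does not describe the batching logic itself, so the following is only a guess at the general shape of such a helper: small scan requests are accumulated until they add up to a full batch, and only then handed out. The function name suffix, the parameter names, and the exact behavior are assumptions.

```c
/*
 * Accumulate nr_to_scan into *nr_saved_scan.  Return the whole saved
 * amount once it reaches batch_size (and reset the saved counter);
 * otherwise return 0 so the caller scans nothing yet.
 */
static unsigned long nr_scan_try_batch_sketch(unsigned long nr_to_scan,
					      unsigned long *nr_saved_scan,
					      unsigned long batch_size)
{
	unsigned long nr;

	*nr_saved_scan += nr_to_scan;
	nr = *nr_saved_scan;

	if (nr >= batch_size)
		*nr_saved_scan = 0;	/* batch is full: hand it all out */
	else
		nr = 0;			/* keep saving, scan nothing yet */

	return nr;
}
```

With a batch size of 32, three requests of 10, 10 and 15 pages would yield 0, 0 and then 35 pages to scan.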
    •
      vmscan: evict use-once pages first · 56e49d21
      Rik van Riel authored
      When the file LRU lists are dominated by streaming IO pages, evict those
      pages first, before considering evicting other pages.
      
      This should be safe from deadlocks or performance problems
      because only three things can happen to an inactive file page:
      
      1) referenced twice and promoted to the active list
      2) evicted by the pageout code
      3) under IO, after which it will get evicted or promoted
      
      The pages freed in this way can either be reused for streaming IO, or
      allocated for something else.  If the pages are used for streaming IO,
      this pageout pattern continues.  Otherwise, we will fall back to the
      normal pageout pattern.
      Signed-off-by: Rik van Riel <riel@redhat.com>
      Reported-by: Elladan <elladan@eskimo.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    •
      mm: introduce PageHuge() for testing huge/gigantic pages · 20a0307c
      Wu Fengguang authored
      A series of patches to enhance the /proc/pagemap interface and to add a
      userspace executable which can be used to present the pagemap data.
      
      Export 10 more flags to end users (and more for kernel developers):
      
              11. KPF_MMAP            (pseudo flag) memory mapped page
              12. KPF_ANON            (pseudo flag) memory mapped page (anonymous)
              13. KPF_SWAPCACHE       page is in swap cache
              14. KPF_SWAPBACKED      page is swap/RAM backed
              15. KPF_COMPOUND_HEAD   (*)
              16. KPF_COMPOUND_TAIL   (*)
              17. KPF_HUGE            hugeTLB pages
              18. KPF_UNEVICTABLE     page is in the unevictable LRU list
              19. KPF_HWPOISON        hardware detected corruption
              20. KPF_NOPAGE          (pseudo flag) no page frame at the address
      
              (*) For compound pages, exporting _both_ head/tail info enables
                  users to tell where a compound page starts/ends, and its order.
      
      A simple demo of the page-types tool:
      
      # ./page-types -h
      page-types [options]
                  -r|--raw                  Raw mode, for kernel developers
                  -a|--addr    addr-spec    Walk a range of pages
                  -b|--bits    bits-spec    Walk pages with specified bits
                  -l|--list                 Show page details in ranges
                  -L|--list-each            Show page details one by one
                  -N|--no-summary           Don't show summay info
                  -h|--help                 Show this usage message
      addr-spec:
                  N                         one page at offset N (unit: pages)
                  N+M                       pages range from N to N+M-1
                  N,M                       pages range from N to M-1
                  N,                        pages range from N to end
                  ,M                        pages range from 0 to M
      bits-spec:
                  bit1,bit2                 (flags & (bit1|bit2)) != 0
                  bit1,bit2=bit1            (flags & (bit1|bit2)) == bit1
                  bit1,~bit2                (flags & (bit1|bit2)) == bit1
                  =bit1,bit2                flags == (bit1|bit2)
      bit-names:
                locked              error         referenced           uptodate
                 dirty                lru             active               slab
             writeback            reclaim              buddy               mmap
             anonymous          swapcache         swapbacked      compound_head
         compound_tail               huge        unevictable           hwpoison
                nopage           reserved(r)         mlocked(r)    mappedtodisk(r)
               private(r)       private_2(r)   owner_private(r)            arch(r)
              uncached(r)       readahead(o)       slob_free(o)     slub_frozen(o)
            slub_debug(o)
                                         (r) raw mode bits  (o) overloaded bits
      
      # ./page-types
                   flags      page-count       MB  symbolic-flags                     long-symbolic-flags
      0x0000000000000000          487369     1903  _________________________________
      0x0000000000000014               5        0  __R_D____________________________  referenced,dirty
      0x0000000000000020               1        0  _____l___________________________  lru
      0x0000000000000024              34        0  __R__l___________________________  referenced,lru
      0x0000000000000028            3838       14  ___U_l___________________________  uptodate,lru
      0x0001000000000028              48        0  ___U_l_______________________I___  uptodate,lru,readahead
      0x000000000000002c            6478       25  __RU_l___________________________  referenced,uptodate,lru
      0x000100000000002c              47        0  __RU_l_______________________I___  referenced,uptodate,lru,readahead
      0x0000000000000040            8344       32  ______A__________________________  active
      0x0000000000000060               1        0  _____lA__________________________  lru,active
      0x0000000000000068             348        1  ___U_lA__________________________  uptodate,lru,active
      0x0001000000000068              12        0  ___U_lA______________________I___  uptodate,lru,active,readahead
      0x000000000000006c             988        3  __RU_lA__________________________  referenced,uptodate,lru,active
      0x000100000000006c              48        0  __RU_lA______________________I___  referenced,uptodate,lru,active,readahead
      0x0000000000004078               1        0  ___UDlA_______b__________________  uptodate,dirty,lru,active,swapbacked
      0x000000000000407c              34        0  __RUDlA_______b__________________  referenced,uptodate,dirty,lru,active,swapbacked
      0x0000000000000400             503        1  __________B______________________  buddy
      0x0000000000000804               1        0  __R________M_____________________  referenced,mmap
      0x0000000000000828            1029        4  ___U_l_____M_____________________  uptodate,lru,mmap
      0x0001000000000828              43        0  ___U_l_____M_________________I___  uptodate,lru,mmap,readahead
      0x000000000000082c             382        1  __RU_l_____M_____________________  referenced,uptodate,lru,mmap
      0x000100000000082c              12        0  __RU_l_____M_________________I___  referenced,uptodate,lru,mmap,readahead
      0x0000000000000868             192        0  ___U_lA____M_____________________  uptodate,lru,active,mmap
      0x0001000000000868              12        0  ___U_lA____M_________________I___  uptodate,lru,active,mmap,readahead
      0x000000000000086c             800        3  __RU_lA____M_____________________  referenced,uptodate,lru,active,mmap
      0x000100000000086c              31        0  __RU_lA____M_________________I___  referenced,uptodate,lru,active,mmap,readahead
      0x0000000000004878               2        0  ___UDlA____M__b__________________  uptodate,dirty,lru,active,mmap,swapbacked
      0x0000000000001000             492        1  ____________a____________________  anonymous
      0x0000000000005808               4        0  ___U_______Ma_b__________________  uptodate,mmap,anonymous,swapbacked
      0x0000000000005868            2839       11  ___U_lA____Ma_b__________________  uptodate,lru,active,mmap,anonymous,swapbacked
      0x000000000000586c              30        0  __RU_lA____Ma_b__________________  referenced,uptodate,lru,active,mmap,anonymous,swapbacked
                   total          513968     2007
      
      # ./page-types -r
                   flags      page-count       MB  symbolic-flags                     long-symbolic-flags
      0x0000000000000000          468002     1828  _________________________________
      0x0000000100000000           19102       74  _____________________r___________  reserved
      0x0000000000008000              41        0  _______________H_________________  compound_head
      0x0000000000010000             188        0  ________________T________________  compound_tail
      0x0000000000008014               1        0  __R_D__________H_________________  referenced,dirty,compound_head
      0x0000000000010014               4        0  __R_D___________T________________  referenced,dirty,compound_tail
      0x0000000000000020               1        0  _____l___________________________  lru
      0x0000000800000024              34        0  __R__l__________________P________  referenced,lru,private
      0x0000000000000028            3794       14  ___U_l___________________________  uptodate,lru
      0x0001000000000028              46        0  ___U_l_______________________I___  uptodate,lru,readahead
      0x0000000400000028              44        0  ___U_l_________________d_________  uptodate,lru,mappedtodisk
      0x0001000400000028               2        0  ___U_l_________________d_____I___  uptodate,lru,mappedtodisk,readahead
      0x000000000000002c            6434       25  __RU_l___________________________  referenced,uptodate,lru
      0x000100000000002c              47        0  __RU_l_______________________I___  referenced,uptodate,lru,readahead
      0x000000040000002c              14        0  __RU_l_________________d_________  referenced,uptodate,lru,mappedtodisk
      0x000000080000002c              30        0  __RU_l__________________P________  referenced,uptodate,lru,private
      0x0000000800000040            8124       31  ______A_________________P________  active,private
      0x0000000000000040             219        0  ______A__________________________  active
      0x0000000800000060               1        0  _____lA_________________P________  lru,active,private
      0x0000000000000068             322        1  ___U_lA__________________________  uptodate,lru,active
      0x0001000000000068              12        0  ___U_lA______________________I___  uptodate,lru,active,readahead
      0x0000000400000068              13        0  ___U_lA________________d_________  uptodate,lru,active,mappedtodisk
      0x0000000800000068              12        0  ___U_lA_________________P________  uptodate,lru,active,private
      0x000000000000006c             977        3  __RU_lA__________________________  referenced,uptodate,lru,active
      0x000100000000006c              48        0  __RU_lA______________________I___  referenced,uptodate,lru,active,readahead
      0x000000040000006c               5        0  __RU_lA________________d_________  referenced,uptodate,lru,active,mappedtodisk
      0x000000080000006c               3        0  __RU_lA_________________P________  referenced,uptodate,lru,active,private
      0x0000000c0000006c               3        0  __RU_lA________________dP________  referenced,uptodate,lru,active,mappedtodisk,private
      0x0000000c00000068               1        0  ___U_lA________________dP________  uptodate,lru,active,mappedtodisk,private
      0x0000000000004078               1        0  ___UDlA_______b__________________  uptodate,dirty,lru,active,swapbacked
      0x000000000000407c              34        0  __RUDlA_______b__________________  referenced,uptodate,dirty,lru,active,swapbacked
      0x0000000000000400             538        2  __________B______________________  buddy
      0x0000000000000804               1        0  __R________M_____________________  referenced,mmap
      0x0000000000000828            1029        4  ___U_l_____M_____________________  uptodate,lru,mmap
      0x0001000000000828              43        0  ___U_l_____M_________________I___  uptodate,lru,mmap,readahead
      0x000000000000082c             382        1  __RU_l_____M_____________________  referenced,uptodate,lru,mmap
      0x000100000000082c              12        0  __RU_l_____M_________________I___  referenced,uptodate,lru,mmap,readahead
      0x0000000000000868             192        0  ___U_lA____M_____________________  uptodate,lru,active,mmap
      0x0001000000000868              12        0  ___U_lA____M_________________I___  uptodate,lru,active,mmap,readahead
      0x000000000000086c             800        3  __RU_lA____M_____________________  referenced,uptodate,lru,active,mmap
      0x000100000000086c              31        0  __RU_lA____M_________________I___  referenced,uptodate,lru,active,mmap,readahead
      0x0000000000004878               2        0  ___UDlA____M__b__________________  uptodate,dirty,lru,active,mmap,swapbacked
      0x0000000000001000             492        1  ____________a____________________  anonymous
      0x0000000000005008               2        0  ___U________a_b__________________  uptodate,anonymous,swapbacked
      0x0000000000005808               4        0  ___U_______Ma_b__________________  uptodate,mmap,anonymous,swapbacked
      0x000000000000580c               1        0  __RU_______Ma_b__________________  referenced,uptodate,mmap,anonymous,swapbacked
      0x0000000000005868            2839       11  ___U_lA____Ma_b__________________  uptodate,lru,active,mmap,anonymous,swapbacked
      0x000000000000586c              29        0  __RU_lA____Ma_b__________________  referenced,uptodate,lru,active,mmap,anonymous,swapbacked
                   total          513968     2007
      
      # ./page-types --raw --list --no-summary --bits reserved
      offset  count   flags
      0       15      _____________________r___________
      31      4       _____________________r___________
      159     97      _____________________r___________
      4096    2067    _____________________r___________
      6752    2390    _____________________r___________
      9355    3       _____________________r___________
      9728    14526   _____________________r___________
      
      This patch:
      
      Introduce PageHuge(), which identifies huge/gigantic pages by their
      dedicated compound destructor functions.
      
      Also move prep_compound_gigantic_page() to hugetlb.c and make
      __free_pages_ok() non-static.
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Matt Mackall <mpm@selenic.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
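The long-symbolic-flags column in the page-types demo above can be reproduced from the hex flags value, taking the bit order from the bit-names list (locked is bit 0, nopage is bit 20). This is a minimal sketch covering only the non-raw bits; `describe_flags_sketch()` is a hypothetical helper and not part of the page-types tool.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Bit names in bit order, as listed by "page-types -h" (non-raw bits only). */
static const char *const bit_names_sketch[] = {
	"locked", "error", "referenced", "uptodate",
	"dirty", "lru", "active", "slab",
	"writeback", "reclaim", "buddy", "mmap",
	"anonymous", "swapcache", "swapbacked", "compound_head",
	"compound_tail", "huge", "unevictable", "hwpoison",
	"nopage",
};

/* Write a comma-separated list of the names of all set bits into buf. */
static void describe_flags_sketch(uint64_t flags, char *buf, size_t len)
{
	size_t used = 0;
	size_t i;

	if (len == 0)
		return;
	buf[0] = '\0';

	for (i = 0; i < sizeof(bit_names_sketch) / sizeof(bit_names_sketch[0]); i++) {
		int n;

		if (!(flags & ((uint64_t)1 << i)))
			continue;
		n = snprintf(buf + used, len - used, "%s%s",
			     used ? "," : "", bit_names_sketch[i]);
		if (n < 0 || (size_t)n >= len - used)
			break;	/* output would not fit; stop here */
		used += (size_t)n;
	}
}
```

For example, 0x2c has bits 2, 3 and 5 set, which matches the "referenced,uptodate,lru" row in the demo output.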
    •
      page allocator: use a pre-calculated value instead of num_online_nodes() in fast paths · 62bc62a8
      Christoph Lameter authored
      num_online_nodes() is called in a number of places but most often by the
      page allocator when deciding whether the zonelist needs to be filtered
      based on cpusets or the zonelist cache.  This is actually a heavy function
      and touches a number of cache lines.
      
      This patch stores the number of online nodes at boot time and updates the
      value when nodes get onlined and offlined.  The value is then used in a
      number of important paths in place of num_online_nodes().
      
      [rientjes@google.com: do not override definition of node_set_online() with macro]
      Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
      Signed-off-by: Mel Gorman <mel@csn.ul.ie>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Dave Hansen <dave@linux.vnet.ibm.com>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      Signed-off-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
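The caching approach from the commit above might look roughly like this: the expensive count is recomputed only when a node goes online or offline, and hot paths read a plain integer. The 64-node bitmap and all names ending in `_sketch` are assumptions for the example, not the kernel's nodemask API.

```c
#define MAX_NODES_SKETCH 64

/* Hypothetical online-node bitmap and its cached population count. */
static unsigned long long node_online_map_sketch;
static int nr_online_nodes_sketch;

/* Slow path: walk the whole map and count the set bits. */
static int count_online_nodes_slow(void)
{
	int nid, count = 0;

	for (nid = 0; nid < MAX_NODES_SKETCH; nid++)
		if (node_online_map_sketch & (1ULL << nid))
			count++;
	return count;
}

/* Online/offline are rare events, so they pay for the recount. */
static void node_set_online_sketch(int nid)
{
	node_online_map_sketch |= 1ULL << nid;
	nr_online_nodes_sketch = count_online_nodes_slow();
}

static void node_set_offline_sketch(int nid)
{
	node_online_map_sketch &= ~(1ULL << nid);
	nr_online_nodes_sketch = count_online_nodes_slow();
}

/* Fast path: a single load instead of a walk over the map. */
static int has_multiple_nodes_sketch(void)
{
	return nr_online_nodes_sketch > 1;
}
```

The trade-off is the usual one for cached values: every place that changes the map must also refresh the counter.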
    • M
      page allocator: use allocation flags as an index to the zone watermark · 41858966
      Mel Gorman committed
      ALLOC_WMARK_MIN, ALLOC_WMARK_LOW and ALLOC_WMARK_HIGH determine whether
      pages_min, pages_low or pages_high is used as the zone watermark when
      allocating the pages.  Two branches in the allocator hotpath determine
      which watermark to use.
      
      This patch uses the flags as an array index into a watermark array that is
      indexed with WMARK_* defines accessed via helpers.  All call sites that
      use zone->pages_* are updated to use the helpers for accessing the values
      and the array offsets for setting.
      Signed-off-by: Mel Gorman <mel@csn.ul.ie>
      Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Dave Hansen <dave@linux.vnet.ibm.com>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      41858966
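      A rough user-space sketch of the flags-as-index technique described in
      41858966.  The WMARK_* and ALLOC_WMARK_* names come from the commit
      message; the concrete flag values, the mask, the struct layout and
      zone_mark() are assumptions made for illustration only.

```c
#include <assert.h>

enum zone_watermarks { WMARK_MIN, WMARK_LOW, WMARK_HIGH, NR_WMARK };

/* The low bits of the allocation flags double as the array index. */
#define ALLOC_WMARK_MIN		WMARK_MIN
#define ALLOC_WMARK_LOW		WMARK_LOW
#define ALLOC_WMARK_HIGH	WMARK_HIGH
#define ALLOC_WMARK_MASK	0x03	/* assumed: bits reserved for the index */
#define ALLOC_OTHER_FLAG	0x10	/* hypothetical unrelated flag */

struct zone {
	unsigned long watermark[NR_WMARK];
};

/* One indexed load replaces the if/else chain over three fields. */
static unsigned long zone_mark(const struct zone *z, int alloc_flags)
{
	int idx = alloc_flags & ALLOC_WMARK_MASK;

	assert(idx < NR_WMARK);	/* index 3 is not a valid watermark here */
	return z->watermark[idx];
}
```

      Unrelated flag bits are masked off before indexing, so callers can
      keep combining the watermark selector with other flags.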
    • M
      page allocator: move check for disabled anti-fragmentation out of fastpath · 49255c61
      Mel Gorman committed
      On low-memory systems, anti-fragmentation gets disabled as there is
      nothing it can do and it would just incur overhead shuffling pages between
      lists constantly.  Currently the check is made in the free page fast path
      for every page.  This patch moves it to a slow path.  On machines with low
      memory, there will be a small amount of additional overhead as pages get
      shuffled between lists but it should quickly settle.
      Signed-off-by: Mel Gorman <mel@csn.ul.ie>
      Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
      Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Dave Hansen <dave@linux.vnet.ibm.com>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      49255c61
    • M
      page allocator: do not check NUMA node ID when the caller knows the node is valid · 6484eb3e
      Mel Gorman committed
      Callers of alloc_pages_node() can optionally specify -1 as a node to mean
      "allocate from the current node".  However, a number of the callers in
      fast paths know for a fact their node is valid.  To avoid a comparison and
      branch, this patch adds alloc_pages_exact_node() that only checks the nid
      with VM_BUG_ON().  Callers that know their node is valid are then
      converted.
      Signed-off-by: Mel Gorman <mel@csn.ul.ie>
      Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
      Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi>
      Acked-by: Paul Mundt <lethal@linux-sh.org>	[for the SLOB NUMA bits]
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Dave Hansen <dave@linux.vnet.ibm.com>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6484eb3e
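      The difference between the two entry points in 6484eb3e can be
      sketched as follows.  This is a user-space illustration only:
      pick_node_checked(), pick_node_exact(), NR_NODES and current_node are
      hypothetical names, and assert() stands in for VM_BUG_ON().

```c
#include <assert.h>

#define NR_NODES 4		/* hypothetical node count */

static int current_node = 2;	/* stands in for "the current node" */

/* General entry point: every call pays for the -1 comparison and branch. */
static int pick_node_checked(int nid)
{
	if (nid < 0)
		nid = current_node;
	return nid;
}

/*
 * "Exact" entry point: the caller promises a valid node, so the check is
 * only a debug assertion and disappears when built with -DNDEBUG.
 */
static int pick_node_exact(int nid)
{
	assert(nid >= 0 && nid < NR_NODES);
	return nid;
}
```

      Callers that may pass -1 keep using the checked variant; only callers
      that already hold a known-valid node would switch to the exact one.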
    • M
      page allocator: do not sanity check order in the fast path · b3c466ce
      Mel Gorman committed
      No user of the allocator API should be passing in an order >= MAX_ORDER
      but we check for it on each and every allocation.  Delete this check and
      make it a VM_BUG_ON check further down the call path.
      
      [akpm@linux-foundation.org: s/VM_BUG_ON/WARN_ON_ONCE/]
      Signed-off-by: Mel Gorman <mel@csn.ul.ie>
      Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
      Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Dave Hansen <dave@linux.vnet.ibm.com>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b3c466ce
    • M
      page allocator: replace __alloc_pages_internal() with __alloc_pages_nodemask() · d239171e
      Mel Gorman committed
      The start of a large patch series to clean up and optimise the page
      allocator.
      
      The performance improvements are in a wide range depending on the exact
      machine but the results I've seen so far are approximately:
      
      kernbench:	0	to	 0.12% (elapsed time)
      		0.49%	to	 3.20% (sys time)
      aim9:		-4%	to	30% (for page_test and brk_test)
      tbench:		-1%	to	 4%
      hackbench:	-2.5%	to	 3.45% (mostly within the noise though)
      netperf-udp	-1.34%  to	 4.06% (varies between machines a bit)
      netperf-tcp	-0.44%  to	 5.22% (varies between machines a bit)
      
      I haven't sysbench figures at hand, but previously they were within the
      -0.5% to 2% range.
      
      On netperf, the client and server were bound to opposite number CPUs to
      maximise the problems with cache line bouncing of the struct pages so I
      expect different people to report different results for netperf depending
      on their exact machine and how they ran the test (different machines, same
      cpus client/server, shared cache but two threads client/server, different
      socket client/server etc).
      
      I also measured the vmlinux sizes for a single x86-based config with
      CONFIG_DEBUG_INFO enabled but not CONFIG_DEBUG_VM.  The core of the
      .config is based on the Debian Lenny kernel config so I expect it to be
      reasonably typical.
      
      This patch:
      
      __alloc_pages_internal is the core page allocator function but essentially
      it is an alias of __alloc_pages_nodemask.  Naming a publicly available and
      exported function "internal" is also a bit ugly.  This patch renames
      __alloc_pages_internal() to __alloc_pages_nodemask() and deletes the old
      nodemask function.
      
      Warning - This patch renames an exported symbol.  No in-kernel driver is
      affected, but external drivers calling __alloc_pages_internal() should
      change the call to __alloc_pages_nodemask() without any alteration of
      parameters.
      Signed-off-by: Mel Gorman <mel@csn.ul.ie>
      Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
      Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Dave Hansen <dave@linux.vnet.ibm.com>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d239171e
    • M
      cpuset,mm: update tasks' mems_allowed in time · 58568d2a
      Miao Xie committed
      Fix allocating page cache/slab object on the unallowed node when memory
      spread is set by updating tasks' mems_allowed after its cpuset's mems is
      changed.
      
      In order to update tasks' mems_allowed in time, we must modify the memory
      policy code, because the memory policy was originally applied only in the
      process's own context.  After applying this patch, one task directly
      manipulates another's mems_allowed, and we use alloc_lock in the
      task_struct to protect mems_allowed and memory policy of the task.
      
      But in the fast path, we don't use a lock to protect them, because adding a
      lock may lead to performance regression.  But if we don't add a lock, the
      task might see no nodes when changing cpuset's mems_allowed to some
      non-overlapping set.  In order to avoid it, we set all new allowed nodes,
      then clear newly disallowed ones.
      
      [lee.schermerhorn@hp.com:
        The rework of mpol_new() to extract the adjusting of the node mask to
        apply cpuset and mpol flags "context" breaks set_mempolicy() and mbind()
        with MPOL_PREFERRED and a NULL nodemask--i.e., explicit local
        allocation.  Fix this by adding the check for MPOL_PREFERRED and empty
        node mask to mpol_new_mempolicy().
      
        Remove the now unneeded 'nodes = NULL' from mpol_new().
      
        Note that mpol_new_mempolicy() is always called with a non-NULL
        'nodes' parameter now that it has been removed from mpol_new().
        Therefore, we don't need to test nodes for NULL before testing it for
        'empty'.  However, just to be extra paranoid, add a VM_BUG_ON() to
        verify this assumption.]
      [lee.schermerhorn@hp.com:
      
        I don't think the function name 'mpol_new_mempolicy' is descriptive
        enough to differentiate it from mpol_new().
      
        This function applies cpuset set context, usually constraining nodes
        to those allowed by the cpuset.  However, when the 'RELATIVE_NODES' flag
        is set, it also translates the nodes.  So I settled on
        'mpol_set_nodemask()', because the comment block for mpol_new() mentions
        that we need to call this function to "set nodes".
      
        Some additional minor line length, whitespace and typo cleanup.]
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Paul Menage <menage@google.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      58568d2a
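      The update order described in 58568d2a ("set all new allowed nodes,
      then clear newly disallowed ones") can be shown with a single-threaded
      sketch.  The nodemask_t typedef and both helpers are hypothetical; each
      helper returns the intermediate mask that a lockless reader could
      observe between the two steps.  It demonstrates only the ordering, not
      real concurrent access.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t nodemask_t;	/* hypothetical: one bit per node */

/* Risky order: drop disallowed nodes first, then add the new ones. */
static nodemask_t clear_then_set(nodemask_t *allowed, nodemask_t newmask)
{
	nodemask_t mid;

	assert(newmask != 0);
	*allowed &= newmask;	/* may leave no nodes at all */
	mid = *allowed;		/* what a lockless reader could see */
	*allowed |= newmask;
	return mid;
}

/* Order used by the commit: add the new nodes first, then drop the old. */
static nodemask_t set_then_clear(nodemask_t *allowed, nodemask_t newmask)
{
	nodemask_t mid;

	assert(newmask != 0);
	*allowed |= newmask;	/* union of old and new, never empty */
	mid = *allowed;		/* what a lockless reader could see */
	*allowed &= newmask;
	return mid;
}
```

      Moving from nodes {0,1} to the non-overlapping set {2,3}, the first
      order passes through an empty mask while the second passes through
      the union {0,1,2,3}.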
    • N
      mm: clean up get_user_pages_fast() documentation · d2bf6be8
      Nick Piggin committed
      Move more documentation for get_user_pages_fast into the new kerneldoc comment.
      Add some comments for get_user_pages as well.
      
      Also, move get_user_pages_fast declaration up to get_user_pages. It wasn't
      there initially because it was once a static inline function.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Cc: Andy Grover <andy.grover@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d2bf6be8
    • W
      radix-tree: add radix_tree_prev_hole() · dc566127
      Wu Fengguang committed
      The counterpart of radix_tree_next_hole(). To be used by context readahead.
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Vladislav Bolkhovitin <vst@vlnb.net>
      Cc: Jens Axboe <jens.axboe@oracle.com>
      Cc: Jeff Moyer <jmoyer@redhat.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Ying Han <yinghan@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      dc566127
    • W
      readahead: record mmap read-around states in file_ra_state · d30a1100
      Wu Fengguang committed
      Mmap read-around now shares the same code style and data structure with
      readahead code.
      
      This also removes do_page_cache_readahead().  Its last user, mmap
      read-around, has been changed to call ra_submit().
      
      The no-readahead-if-congested logic is dropped along the way.  Users will
      be pretty sensitive about the slow loading of executables, so it's
      unfavorable to disable mmap read-around on a congested queue.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Cc: Nick Piggin <npiggin@suse.de>
      Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
      Cc: Ying Han <yinghan@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d30a1100
    • W
      readahead: make mmap_miss an unsigned int · 1ebf26a9
      Wu Fengguang committed
      This makes the performance impact of a possible mmap_miss wrap-around
      temporary and tolerable: i.e.  MMAP_LOTSAMISS=100 extra readarounds.
      
      Otherwise, if mmap_miss ever wraps around to negative, it takes INT_MAX
      cache misses to bring it back to normal state.  During that time mmap
      readaround will be _enabled_ for whatever wild random workload.  That's
      an almost permanent performance impact.
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Ying Han <yinghan@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1ebf26a9
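      The arithmetic behind 1ebf26a9 can be checked with a small sketch.  It
      assumes that read-around stays enabled while the counter is at or below
      MMAP_LOTSAMISS and that a signed counter wraps to INT_MIN
      (two's-complement behaviour); the helper names are hypothetical.

```c
#include <assert.h>
#include <limits.h>

#define MMAP_LOTSAMISS 100	/* value quoted in the commit message */

/* Read-around stays enabled until the miss counter exceeds the threshold. */
static int readaround_enabled_signed(int mmap_miss)
{
	return mmap_miss <= MMAP_LOTSAMISS;
}

static int readaround_enabled_unsigned(unsigned int mmap_miss)
{
	return mmap_miss <= MMAP_LOTSAMISS;
}

/* Misses needed after a wrap before read-around is switched off again. */
static long long misses_to_recover_signed(void)
{
	/* Assumed: a wrapped signed counter restarts at INT_MIN. */
	return (long long)MMAP_LOTSAMISS + 1 - (long long)INT_MIN;
}

static long long misses_to_recover_unsigned(void)
{
	/* A wrapped unsigned counter restarts at 0. */
	return (long long)MMAP_LOTSAMISS + 1;
}

/* Small self-check tying the helpers to the commit's claim. */
static void mmap_miss_check(void)
{
	assert(readaround_enabled_signed(INT_MIN));
	assert(readaround_enabled_unsigned(0u));
	assert(!readaround_enabled_unsigned(UINT_MAX));
	assert(misses_to_recover_unsigned() == 101);
	assert(misses_to_recover_signed() > (long long)INT_MAX);
}
```

      With an unsigned counter the wrap costs about a hundred extra
      read-arounds; with a signed one the counter has to climb through the
      whole negative range first.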
    • A
      mm: consolidate init_mm definition · bb1f17b0
      Alexey Dobriyan committed
      * create mm/init-mm.c, move init_mm there
      * remove INIT_MM, initialize init_mm with C99 initializer
      * unexport init_mm on all arches:
      
        init_mm is already unexported on x86.
      
        One strange place is some OMAP driver (drivers/video/omap/) which
        won't build modular, but it already wants the get_vm_area() export.
        Somebody should look there.
      
      [akpm@linux-foundation.org: add missing #includes]
      Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Mike Frysinger <vapier.adi@gmail.com>
      Cc: Americo Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bb1f17b0
    • Y
      firmware_map: fix hang with x86/32bit · 3b0fde0f
      Yinghai Lu committed
      Addresses http://bugzilla.kernel.org/show_bug.cgi?id=13484
      
      Peer reported:
      | The bug is introduced from kernel 2.6.27, if E820 table reserve the memory
      | above 4G in 32bit OS(BIOS-e820: 00000000fff80000 - 0000000120000000
      | (reserved)), system will report Int 6 error and hang up. The bug is caused by
      | the following code in drivers/firmware/memmap.c, the resource_size_t is 32bit
      | variable in 32bit OS, the BUG_ON() will be invoked to result in the Int 6
      | error. I try the latest 32bit Ubuntu and Fedora distributions, all hit this
      | bug.
      |======
      |static int firmware_map_add_entry(resource_size_t start, resource_size_t end,
      |                  const char *type,
      |                  struct firmware_map_entry *entry)
      
      This only happens when CONFIG_PHYS_ADDR_T_64BIT is not set.
      
      It turns out we need to pass u64 instead of resource_size_t for that.
      
      [akpm@linux-foundation.org: add comment]
      Reported-and-tested-by: Peer Chen <pchen@nvidia.com>
      Signed-off-by: Yinghai Lu <yinghai@kernel.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Acked-by: H. Peter Anvin <hpa@zytor.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: <stable@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3b0fde0f
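      The truncation reported in 3b0fde0f can be reproduced in user space
      with the E820 range quoted in the report.  narrow_size_t is a
      hypothetical stand-in for a 32-bit resource_size_t, and the
      "start <= end" test is an assumption about what the BUG_ON() guards,
      since the commit message does not quote the condition.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t narrow_size_t;	/* hypothetical 32-bit resource_size_t */

/* Assumed sanity check: a range must not end before it starts. */
static int range_ok_narrow(narrow_size_t start, narrow_size_t end)
{
	return start <= end;
}

static int range_ok_u64(uint64_t start, uint64_t end)
{
	return start <= end;
}

/* Range from the report: 00000000fff80000 - 0000000120000000 (reserved). */
static void firmware_range_check(void)
{
	const uint64_t start = 0x00000000fff80000ULL;
	const uint64_t end   = 0x0000000120000000ULL;

	/* Kept as 64-bit values, the range is well formed. */
	assert(range_ok_u64(start, end));

	/* Narrowed to 32 bits, the end loses its top bit ... */
	assert((narrow_size_t)end == 0x20000000u);

	/* ... and now lies below the start, so the check fails. */
	assert(!range_ok_narrow((narrow_size_t)start, (narrow_size_t)end));
}
```

      Passing the addresses as u64 keeps the bits above 4G, which is why
      the range stays valid regardless of how wide resource_size_t is.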
    • A
      time: move PIT_TICK_RATE to linux/timex.h · 08604bd9
      Arnd Bergmann committed
      PIT_TICK_RATE is currently defined in four architectures, but in three
      different places.  While linux/timex.h is not the perfect place for it, it
      is still a reasonable replacement for those drivers that traditionally use
      asm/timex.h to get CLOCK_TICK_RATE and expect it to be the PIT frequency.
      
      Note that for Alpha, the actual value changed from 1193182UL to 1193180UL.
      This is unlikely to make a difference, and probably can only improve
      accuracy.  There was a discussion on the correct value of CLOCK_TICK_RATE
      a few years ago, after which every existing instance was getting changed
      to 1193182.  According to the specification, it should be
      1193181.818181...
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Len Brown <lenb@kernel.org>
      Cc: john stultz <johnstul@us.ibm.com>
      Cc: Dmitry Torokhov <dtor@mail.ru>
      Cc: Takashi Iwai <tiwai@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      08604bd9
  2. 15 Jun, 2009 6 commits
  3. 14 Jun, 2009 3 commits