1. 29 4月, 2008 16 次提交
    • B
      memcgroup: make the memory controller more desktop responsive · 4a56d02e
      Balbir Singh 提交于
      This patch makes the memory controller more responsive on my desktop.
      
      1. Set all cached pages as inactive.  We were by default marking all pages
         as active, thus forcing us to go through two passes for reclaiming pages
      
      2. Remove congestion_wait(), since we already have that logic in
         do_try_to_free_pages()
      Signed-off-by: NBalbir Singh <balbir@linux.vnet.ibm.com>
      Reviewed-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp>
      Cc: Paul Menage <menage@google.com>
      Cc: Pavel Emelianov <xemul@openvz.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4a56d02e
    • K
      memcg: remove redundant function calls · 3eae90c3
      KAMEZAWA Hiroyuki 提交于
      remove_list/add_list uses page_cgroup_zoneinfo() in it.
      
      So, it's called twice before and after lock.
      
      	mz = page_cgroup_zoneinfo();
      	lock();
      	mz = page_cgroup_zoneinfo();
      	....
      	unlock();
      
      And address of mz never changes.
      
      This is not good. This patch fixes this behavior.
      Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3eae90c3
    • P
      memcgroup: implement failcounter reset · 29f2a4da
      Pavel Emelyanov 提交于
      This is a very common requirement from people using the resource accounting
      facilities (not only memcgroup but also OpenVZ beancounters).  They want to
      put the cgroup in an initial state without re-creating it.
      
      For example after re-configuring a group people want to observe how this new
      configuration fits the group needs without saving the previous failcnt value.
      
      Merge two resets into one mem_cgroup_reset() function to demonstrate how
      multiplexing work.
      
      Besides, I have plans to move the files, that correspond to res_counter to the
      res_counter.c file and somehow "import" them into controller.  I don't know
      how to make it gracefully yet, but merging resets of max_usage and failcnt in
      one function will be there for sure.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: NPavel Emelyanov <xemul@openvz.org>
      Acked-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Paul Menage <menage@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      29f2a4da
    • P
      memcgroup: use triggers in force_empty and max_usage files · 85cc59db
      Pavel Emelyanov 提交于
      These two files are essentially event callbacks.  They do not care about the
      contents of the string, but only about the fact of the write itself.
      Signed-off-by: NPavel Emelyanov <xemul@openvz.org>
      Acked-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Paul Menage <menage@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      85cc59db
    • B
      memcgroup: move memory controller allocations to their own slabs · b6ac57d5
      Balbir Singh 提交于
      Move the memory controller data structure page_cgroup to its own slab cache.
      It saves space on the system, allocations are not necessarily pushed to order
      of 2 and should provide performance benefits.  Users who disable the memory
      controller can also double check that the memory controller is not allocating
      page_cgroup's.
      
      NOTE: Hugh Dickins brought up the issue of whether we want to mark page_cgroup
      as __GFP_MOVABLE or __GFP_RECLAIMABLE.  I don't think there is an easy answer
      at the moment.  page_cgroup's are associated with user pages, they can be
      reclaimed once the user page has been reclaimed, so it might make sense to
      mark them as __GFP_RECLAIMABLE.  For now, I am leaving the marking to default
      values that the slab allocator uses.
      Signed-off-by: NBalbir Singh <balbir@linux.vnet.ibm.com>
      Cc: Pavel Emelianov <xemul@openvz.org>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: Sudhir Kumar <skumar@linux.vnet.ibm.com>
      Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp>
      Cc: Paul Menage <menage@google.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b6ac57d5
    • P
      memcgroup: add the max_usage member on the res_counter · c84872e1
      Pavel Emelyanov 提交于
      This field is the maximal value of the usage one since the counter creation
      (or since the latest reset).
      
      To reset this to the usage value simply write anything to the appropriate
      cgroup file.
      Signed-off-by: NPavel Emelyanov <xemul@openvz.org>
      Acked-by: NBalbir Singh <balbir@linux.vnet.ibm.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c84872e1
    • B
      cgroups: add an owner to the mm_struct · cf475ad2
      Balbir Singh 提交于
      Remove the mem_cgroup member from mm_struct and instead adds an owner.
      
      This approach was suggested by Paul Menage.  The advantage of this approach
      is that, once the mm->owner is known, using the subsystem id, the cgroup
      can be determined.  It also allows several control groups that are
      virtually grouped by mm_struct, to exist independent of the memory
      controller i.e., without adding mem_cgroup's for each controller, to
      mm_struct.
      
      A new config option CONFIG_MM_OWNER is added and the memory resource
      controller selects this config option.
      
      This patch also adds cgroup callbacks to notify subsystems when mm->owner
      changes.  The mm_cgroup_changed callback is called with the task_lock() of
      the new task held and is called just prior to changing the mm->owner.
      
      I am indebted to Paul Menage for the several reviews of this patchset and
      helping me make it lighter and simpler.
      
      This patch was tested on a powerpc box, it was compiled with both the
      MM_OWNER config turned on and off.
      
      After the thread group leader exits, it's moved to init_css_state by
      cgroup_exit(), thus all future charges from runnings threads would be
      redirected to the init_css_set's subsystem.
      Signed-off-by: NBalbir Singh <balbir@linux.vnet.ibm.com>
      Cc: Pavel Emelianov <xemul@openvz.org>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: Sudhir Kumar <skumar@linux.vnet.ibm.com>
      Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp>
      Cc: Hirokazu Takahashi <taka@valinux.co.jp>
      Cc: David Rientjes <rientjes@google.com>,
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Acked-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: NPekka Enberg <penberg@cs.helsinki.fi>
      Reviewed-by: NPaul Menage <menage@google.com>
      Cc: Oleg Nesterov <oleg@tv-sign.ru>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      cf475ad2
    • P
      CGroup API files: drop mem_cgroup_force_empty() · c27e8818
      Paul Menage 提交于
      This function isn't needed - a NULL pointer in the cftype read function will
      result in the same EINVAL response to userspace.
      Signed-off-by: NPaul Menage <menage@google.com>
      Cc: "Li Zefan" <lizf@cn.fujitsu.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Paul Jackson <pj@sgi.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: "YAMAMOTO Takashi" <yamamoto@valinux.co.jp>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c27e8818
    • P
      CGroup API files: use cgroup map for memcontrol stats file · c64745cf
      Paul Menage 提交于
      Remove the seq_file boilerplate used to construct the memcontrol stats map,
      and instead use the new map representation for cgroup control files
      Signed-off-by: NPaul Menage <menage@google.com>
      Cc: "Li Zefan" <lizf@cn.fujitsu.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Paul Jackson <pj@sgi.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: "YAMAMOTO Takashi" <yamamoto@valinux.co.jp>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c64745cf
    • P
      CGroup API files: use read_u64 in memory controller · 2c3daa72
      Paul Menage 提交于
      Update the memory controller to use read_u64 for its limit/usage/failcnt
      control files, calling the new res_counter_read_u64() function.
      Signed-off-by: NPaul Menage <menage@google.com>
      Cc: "Li Zefan" <lizf@cn.fujitsu.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Paul Jackson <pj@sgi.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: "YAMAMOTO Takashi" <yamamoto@valinux.co.jp>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2c3daa72
    • N
      page allocator: explicitly retry hugepage allocations · 551883ae
      Nishanth Aravamudan 提交于
      Add __GFP_REPEAT to hugepage allocations.  Do so to not necessitate userspace
      putting pressure on the VM by repeated echo's into /proc/sys/vm/nr_hugepages
      to grow the pool.  With the previous patch to allow for large-order
      __GFP_REPEAT attempts to loop for a bit (as opposed to indefinitely), this
      increases the likelihood of getting hugepages when the system experiences (or
      recently experienced) load.
      
      Mel tested the patchset on an x86_32 laptop.  With the patches, it was easier
      to use the proc interface to grow the hugepage pool.  The following is the
      output of a script that grows the pool as much as possible running on
      2.6.25-rc9.
      
      Allocating hugepages test
      -------------------------
      Disabling OOM Killer for current test process
      Starting page count: 0
      Attempt 1: 57 pages Progress made with 57 pages
      Attempt 2: 73 pages Progress made with 16 pages
      Attempt 3: 74 pages Progress made with 1 pages
      Attempt 4: 75 pages Progress made with 1 pages
      Attempt 5: 77 pages Progress made with 2 pages
      
      77 pages was the most it allocated but it took 5 attempts from userspace
      to get it. With the 3 patches in this series applied,
      
      Allocating hugepages test
      -------------------------
      Disabling OOM Killer for current test process
      Starting page count: 0
      Attempt 1: 75 pages Progress made with 75 pages
      Attempt 2: 76 pages Progress made with 1 pages
      Attempt 3: 79 pages Progress made with 3 pages
      
      And 79 pages was the most it got. Your patches were able to allocate the
      bulk of possible pages on the first attempt.
      Signed-off-by: NNishanth Aravamudan <nacc@us.ibm.com>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Tested-by: NMel Gorman <mel@csn.ul.ie>
      Cc: Dave Hansen <haveblue@us.ibm.com>
      Cc: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      551883ae
    • N
      page allocator: smarter retry of costly-order allocations · a41f24ea
      Nishanth Aravamudan 提交于
      Because of page order checks in __alloc_pages(), hugepage (and similarly
      large order) allocations will not retry unless explicitly marked
      __GFP_REPEAT. However, the current retry logic is nearly an infinite
      loop (or until reclaim does no progress whatsoever). For these costly
      allocations, that seems like overkill and could potentially never
      terminate. Mel observed that allowing current __GFP_REPEAT semantics for
      hugepage allocations essentially killed the system. I believe this is
      because we may continue to reclaim small orders of pages all over, but
      never have enough to satisfy the hugepage allocation request. This is
      clearly only a problem for large order allocations, of which hugepages
      are the most obvious (to me).
      
      Modify try_to_free_pages() to indicate how many pages were reclaimed.
      Use that information in __alloc_pages() to eventually fail a large
      __GFP_REPEAT allocation when we've reclaimed an order of pages equal to
      or greater than the allocation's order. This relies on lumpy reclaim
      functioning as advertised. Due to fragmentation, lumpy reclaim may not
      be able to free up the order needed in one invocation, so multiple
      iterations may be requred. In other words, the more fragmented memory
      is, the more retry attempts __GFP_REPEAT will make (particularly for
      higher order allocations).
      
      This changes the semantics of __GFP_REPEAT subtly, but *only* for
      allocations > PAGE_ALLOC_COSTLY_ORDER. With this patch, for those size
      allocations, we will try up to some point (at least 1<<order reclaimed
      pages), rather than forever (which is the case for allocations <=
      PAGE_ALLOC_COSTLY_ORDER).
      
      This change improves the /proc/sys/vm/nr_hugepages interface with a
      follow-on patch that makes pool allocations use __GFP_REPEAT. Rather
      than administrators repeatedly echo'ing a particular value into the
      sysctl, and forcing reclaim into action manually, this change allows for
      the sysctl to attempt a reasonable effort itself. Similarly, dynamic
      pool growth should be more successful under load, as lumpy reclaim can
      try to free up pages, rather than failing right away.
      
      Choosing to reclaim only up to the order of the requested allocation
      strikes a balance between not failing hugepage allocations and returning
      to the caller when it's unlikely to every succeed. Because of lumpy
      reclaim, if we have freed the order requested, hopefully it has been in
      big chunks and those chunks will allow our allocation to succeed. If
      that isn't the case after freeing up the current order, I don't think it
      is likely to succeed in the future, although it is possible given a
      particular fragmentation pattern.
      Signed-off-by: NNishanth Aravamudan <nacc@us.ibm.com>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Tested-by: NMel Gorman <mel@csn.ul.ie>
      Cc: Dave Hansen <haveblue@us.ibm.com>
      Cc: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a41f24ea
    • N
      mm: fix misleading __GFP_REPEAT related comments · ab857d09
      Nishanth Aravamudan 提交于
      The definition and use of __GFP_REPEAT, __GFP_NOFAIL and __GFP_NORETRY in the
      core VM have somewhat differing comments as to their actual semantics.
      Annoyingly, the flags definition has inline and header comments, which might
      be interpreted as not being equivalent.  Just add references to the header
      comments in the inline ones so they don't go out of sync in the future.  In
      their use in __alloc_pages() clarify that the current implementation treats
      low-order allocations and __GFP_REPEAT allocations as distinct cases.
      
      To clarify, the flags' semantics are:
      
      __GFP_NORETRY means try no harder than one run through __alloc_pages
      
      __GFP_REPEAT means __GFP_NOFAIL
      
      __GFP_NOFAIL means repeat forever
      
      order <= PAGE_ALLOC_COSTLY_ORDER means __GFP_NOFAIL
      Signed-off-by: NNishanth Aravamudan <nacc@us.ibm.com>
      Acked-by: NMel Gorman <mel@csn.ul.ie>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ab857d09
    • K
      mm: fix usemap initialization · 86051ca5
      KAMEZAWA Hiroyuki 提交于
      usemap must be initialized only when pfn is within zone.  If not, it corrupts
      memory.
      
      And this patch also reduces the number of calls to set_pageblock_migratetype()
      from
      	(pfn & (pageblock_nr_pages -1)
      to
      	!(pfn & (pageblock_nr_pages-1)
      it should be called once per pageblock.
      Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: NMel Gorman <mel@csn.ul.ie>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: Shi Weihua <shiwh@cn.fujitsu.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Cc: <stable@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      86051ca5
    • H
      mm: fix integer as NULL pointer warnings · 7b8ee84d
      Harvey Harrison 提交于
      mm/hugetlb.c:207:11: warning: Using plain integer as NULL pointer
      Signed-off-by: NHarvey Harrison <harvey.harrison@gmail.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7b8ee84d
    • A
      mm/memory_hotplug.c must #include "internal.h" · 1e5ad9a3
      Adrian Bunk 提交于
      This patch fixes the following compile error caused by commit
      04753278 ("memory hotplug: register
      section/node id to free"):
      
          CC      mm/memory_hotplug.o
        /home/bunk/linux/kernel-2.6/git/linux-2.6/mm/memory_hotplug.c: In function ‘put_page_bootmem’:
        /home/bunk/linux/kernel-2.6/git/linux-2.6/mm/memory_hotplug.c:82: error: implicit declaration of function ‘__free_pages_bootmem’
        /home/bunk/linux/kernel-2.6/git/linux-2.6/mm/memory_hotplug.c: At top level:
        /home/bunk/linux/kernel-2.6/git/linux-2.6/mm/memory_hotplug.c:87: warning: no previous prototype for ‘register_page_bootmem_info_section’
        make[2]: *** [mm/memory_hotplug.o] Error 1
      
      [ Andrew: "Argh.  The -mm-only memory-hotplug-add-removable-to-sysfs-
        to-show-memblock-removability.patch debugging patch adds that include
        so nobody hit this before. ]
      Signed-off-by: NAdrian Bunk <bunk@kernel.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1e5ad9a3
  2. 28 4月, 2008 24 次提交
    • M
      mm/nommu.c: return 0 from kobjsize with invalid objects · 4016a139
      Michael Hennerich 提交于
      Don't perform kobjsize operations on objects the kernel doesn't manage.
      
      On Blackfin, drivers can get dma coherent memory by calling a function
      dma_alloc_coherent(). We do this in nommu by configuring a chunk of uncached
      memory at the top of memory.
      
      Since we don't want the kernel to use the uncached memory, we lie to the
      kernel, and tell it that it's max memory is between 0, and the start of the
      uncached dma coherent section.
      
      this all works well, until this memory gets exposed into userspace (with a
      frame buffer), when you look at the process's maps, it shows the framebuf:
      
      root:/proc> cat maps
      [snip]
      03f0ef00-03f34700 rw-p 00000000 1f:00 192        /dev/fb0
      root:/proc>
      
      This is outside the "normal" range for the kernel. When the kernel tries to
      find the size of this object (when you run ps), it dies in nommu.c in
      kobjsize.
      
      BUG_ON(page->index >= MAX_ORDER);
      
      since the page we are referring to is outside what the kernel thinks is it's
      max valid memory.
      
      root:~> while [ 1 ]; ps > /dev/null; done
      kernel BUG at mm/nommu.c:119!
      Kernel panic - not syncing: BUG!
      
      We fixed this by adding a check to reject out of range object pointers as it
      already does that for NULL pointers.
      Signed-off-by: NMichael Hennerich <Michael.Hennerich@analog.com>
      Signed-off-by: NRobin Getz <rgetz@blackfin.uclinux.org>
      Acked-by: NDavid Howells <dhowells@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4016a139
    • D
      vmstats: add cond_resched() to refresh_cpu_vm_stats() · 468fd62e
      Dimitri Sivanich 提交于
      We've found that it can take quite a bit of time (100's of usec) to get
      through the zone loop in refresh_cpu_vm_stats().
      
      Adding a cond_resched() to allow other threads to run in the non-preemptive
      case.
      Signed-off-by: NDimitri Sivanich <sivanich@sgi.com>
      Acked-by: NChristoph Lameter <clameter@sgi.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      468fd62e
    • P
      mm/page_alloc.c: remove hand-coded get_order() · 2309f9e6
      Pavel Machek 提交于
      Remove hand-coded get_order() from page_alloc.c.
      Signed-off-by: NPavel Machek <pavel@suse.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2309f9e6
    • L
      oom_kill: remove unused parameter in badness() · 97d87c97
      Li Zefan 提交于
      In commit 4c4a2214, we moved the
      memcontroller-related code from badness() to select_bad_process(), so the
      parameter 'mem' in badness() is unused now.
      Signed-off-by: NLi Zefan <lizf@cn.fujitsu.com>
      Acked-by: NBalbir Singh <balbir@linux.vnet.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      97d87c97
    • Y
      memory hotplug: free memmaps allocated by bootmem · 0c0a4a51
      Yasunori Goto 提交于
      This patch is to free memmaps which is allocated by bootmem.
      
      Freeing usemap is not necessary.  The pages of usemap may be necessary for
      other sections.
      
      If removing section is last section on the node, its section is the final user
      of usemap page.  (usemaps are allocated on its section by previous patch.) But
      it shouldn't be freed too, because the section must be logical offline state
      which all pages are isolated against page allocater.  If it is freed, page
      alloctor may use it which will be removed physically soon.  It will be
      disaster.  So, this patch keeps it as it is.
      Signed-off-by: NYasunori Goto <y-goto@jp.fujitsu.com>
      Cc: Badari Pulavarty <pbadari@us.ibm.com>
      Cc: Yinghai Lu <yhlu.kernel@gmail.com>
      Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0c0a4a51
    • Y
      memory hotplug: allocate usemap on the section with pgdat · 86f6dae1
      Yasunori Goto 提交于
      Usemaps are allocated on the section which has pgdat by this.
      
      Because usemap size is very small, many other sections usemaps are allocated
      on only one page.  If a section has usemap, it can't be removed until removing
      other sections.  This dependency is not desirable for memory removing.
      
      Pgdat has similar feature.  When a section has pgdat area, it must be the last
      section for removing on the node.  So, if section A has pgdat and section B
      has usemap for section A, Both sections can't be removed due to dependency
      each other.
      
      To solve this issue, this patch collects usemap on same section with pgdat.
      If other sections doesn't have any dependency, this section will be able to be
      removed finally.
      Signed-off-by: NYasunori Goto <y-goto@jp.fujitsu.com>
      Cc: Badari Pulavarty <pbadari@us.ibm.com>
      Cc: Yinghai Lu <yhlu.kernel@gmail.com>
      Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      86f6dae1
    • Y
      memory hotplug: make alloc_bootmem_section() · e70260aa
      Yasunori Goto 提交于
      alloc_bootmem_section() can allocate specified section's area.  This is used
      for usemap to keep same section with pgdat by later patch.
      Signed-off-by: NYasunori Goto <y-goto@jp.fujitsu.com>
      Cc: Badari Pulavarty <pbadari@us.ibm.com>
      Cc: Yinghai Lu <yhlu.kernel@gmail.com>
      Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e70260aa
    • Y
      memory hotplug: align memmap to page size · 9d99217a
      Yasunori Goto 提交于
      To free memmap easier, this patch aligns it to page size.  Bootmem allocater
      may mix some objects in one pages.  It's not good for freeing memmap of memory
      hot-remove.
      Signed-off-by: NYasunori Goto <y-goto@jp.fujitsu.com>
      Cc: Badari Pulavarty <pbadari@us.ibm.com>
      Cc: Yinghai Lu <yhlu.kernel@gmail.com>
      Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9d99217a
    • Y
      memory hotplug: register section/node id to free · 04753278
      Yasunori Goto 提交于
      This patch set is to free pages which is allocated by bootmem for
      memory-hotremove.  Some structures of memory management are allocated by
      bootmem.  ex) memmap, etc.
      
      To remove memory physically, some of them must be freed according to
      circumstance.  This patch set makes basis to free those pages, and free
      memmaps.
      
      Basic my idea is using remain members of struct page to remember information
      of users of bootmem (section number or node id).  When the section is
      removing, kernel can confirm it.  By this information, some issues can be
      solved.
      
        1) When the memmap of removing section is allocated on other
           section by bootmem, it should/can be free.
        2) When the memmap of removing section is allocated on the
           same section, it shouldn't be freed. Because the section has to be
           logical memory offlined already and all pages must be isolated against
           page allocater. If it is freed, page allocator may use it which will
           be removed physically soon.
        3) When removing section has other section's memmap,
           kernel will be able to show easily which section should be removed
           before it for user. (Not implemented yet)
        4) When the above case 2), the page isolation will be able to check and skip
           memmap's page when logical memory offline (offline_pages()).
           Current page isolation code fails in this case because this page is
           just reserved page and it can't distinguish this pages can be
           removed or not. But, it will be able to do by this patch.
           (Not implemented yet.)
        5) The node information like pgdat has similar issues. But, this
           will be able to be solved too by this.
           (Not implemented yet, but, remembering node id in the pages.)
      
      Fortunately, current bootmem allocator just keeps PageReserved flags,
      and doesn't use any other members of page struct. The users of
      bootmem doesn't use them too.
      
      This patch:
      
      This is to register information which is node or section's id.  Kernel can
      distinguish which node/section uses the pages allcated by bootmem.  This is
      basis for hot-remove sections or nodes.
      Signed-off-by: NYasunori Goto <y-goto@jp.fujitsu.com>
      Cc: Badari Pulavarty <pbadari@us.ibm.com>
      Cc: Yinghai Lu <yhlu.kernel@gmail.com>
      Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      04753278
    • G
      hugetlbfs: common code update for s390 · 7f2e9525
      Gerald Schaefer 提交于
      Huge ptes have a special type on s390 and cannot be handled with the standard
      pte functions in certain cases, e.g.  because of a different location of the
      invalid bit.  This patch adds some new architecture- specific functions to
      hugetlb common code, as a prerequisite for the s390 large page support.
      
      This won't affect other architectures in functionality, but I need to add some
      new dummy inline functions to the headers.
      Acked-by: NMartin Schwidefsky <schwidefsky@de.ibm.com>
      Signed-off-by: NGerald Schaefer <gerald.schaefer@de.ibm.com>
      Cc: Paul Mundt <lethal@linux-sh.org>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "David S. Miller" <davem@davemloft.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7f2e9525
    • G
      hugetlbfs: add missing TLB flush to hugetlb_cow() · 8fe627ec
      Gerald Schaefer 提交于
      A cow break on a hugetlbfs page with page_count > 1 will set a new pte with
      set_huge_pte_at(), w/o any tlb flush operation.  The old pte will remain in
      the tlb and subsequent write access to the page will result in a page fault
      loop, for as long as it may take until the tlb is flushed from somewhere else.
       This patch introduces an architecture-specific huge_ptep_clear_flush()
      function, which is called before the the set_huge_pte_at() in hugetlb_cow().
      
      ATTENTION: This is just a nop on all architectures for now, the s390
      implementation will come with our large page patch later.  Other architectures
      should define their own huge_ptep_clear_flush() if needed.
      Acked-by: NMartin Schwidefsky <schwidefsky@de.ibm.com>
      Signed-off-by: NGerald Schaefer <gerald.schaefer@de.ibm.com>
      Cc: Paul Mundt <lethal@linux-sh.org>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "David S. Miller" <davem@davemloft.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8fe627ec
    • L
      mempolicy: use struct mempolicy pointer in shmem_sb_info · 71fe804b
      Lee Schermerhorn 提交于
      This patch replaces the mempolicy mode, mode_flags, and nodemask in the
      shmem_sb_info struct with a struct mempolicy pointer, initialized to NULL.
      This removes dependency on the details of mempolicy from shmem.c and hugetlbfs
      inode.c and simplifies the interfaces.
      
      mpol_parse_str() in mempolicy.c is changed to return, via a pointer to a
      pointer arg, a struct mempolicy pointer on success.  For MPOL_DEFAULT, the
      returned pointer is NULL.  Further, mpol_parse_str() now takes a 'no_context'
      argument that causes the input nodemask to be stored in the w.user_nodemask of
      the created mempolicy for use when the mempolicy is installed in a tmpfs inode
      shared policy tree.  At that time, any cpuset contextualization is applied to
      the original input nodemask.  This preserves the previous behavior where the
      input nodemask was stored in the superblock.  We can think of the returned
      mempolicy as "context free".
      
      Because mpol_parse_str() is now calling mpol_new(), we can remove from
      mpol_to_str() the semantic checks that mpol_new() already performs.
      
      Add 'no_context' parameter to mpol_to_str() to specify that it should format
      the nodemask in w.user_nodemask for 'bind' and 'interleave' policies.
      
      Change mpol_shared_policy_init() to take a pointer to a "context free" struct
      mempolicy and to create a new, "contextualized" mempolicy using the mode,
      mode_flags and user_nodemask from the input mempolicy.
      
        Note: we know that the mempolicy passed to mpol_to_str() or
        mpol_shared_policy_init() from a tmpfs superblock is "context free".  This
        is currently the only instance thereof.  However, if we found more uses for
        this concept, and introduced any ambiguity as to whether a mempolicy was
        context free or not, we could add another internal mode flag to identify
        context free mempolicies.  Then, we could remove the 'no_context' argument
        from mpol_to_str().
      
      Added shmem_get_sbmpol() to return a reference counted superblock mempolicy,
      if one exists, to pass to mpol_shared_policy_init().  We must add the
      reference under the sb stat_lock to prevent races with replacement of the mpol
      by remount.  This reference is removed in mpol_shared_policy_init().
      
      [akpm@linux-foundation.org: build fix]
      [akpm@linux-foundation.org: another build fix]
      [akpm@linux-foundation.org: yet another build fix]
      Signed-off-by: NLee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Andi Kleen <ak@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      71fe804b
    • L
      mempolicy: support mpol=local tmpfs mount option · 3f226aa1
      Lee Schermerhorn 提交于
      For tmpfs/shmem shared policies, MPOL_DEFAULT is not necessarily equivalent to
      "local allocation".  Because shared policies are at the same "scope" level
      [see Documentation/vm/numa_memory_policy.txt], as vma policies MPOL_DEFAULT
      means "fall back to current task policy".
      
      This patch extends the memory policy string parsing function to display
      "local" for MPOL_PREFERRED + MPOL_F_LOCAL.  This allows one to specify local
      allocation as the default policy for shared memory areas via the tmpfs mpol
      mount option, regardless of the current task's policy.
      
      Also, "local" is now displayed for this policy.  This patch allows us to
      accept the same input format as the display.
      Signed-off-by: NLee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Andi Kleen <ak@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3f226aa1
    • L
      mempolicy: rework shmem mpol parsing and display · 095f1fc4
      Lee Schermerhorn 提交于
      mm/shmem.c currently contains functions to parse and display memory policy
      strings for the tmpfs 'mpol' mount option.  Move this to mm/mempolicy.c with
      the rest of the mempolicy support.  With subsequent patches, we'll be able to
      remove knowledge of the details [mode, flags, policy, ...] completely from
      shmem.c
      
      1) replace shmem_parse_mpol() in mm/shmem.c with mpol_parse_str() in
         mm/mempolicy.c.  Rework to use the policy_types[] array [used by
         mpol_to_str()] to look up mode by name.
      
      2) use mpol_to_str() to format policy for shmem_show_mpol().  mpol_to_str()
         expects a pointer to a struct mempolicy, so temporarily construct one.
         This will be replaced with a reference to a struct mempolicy in the tmpfs
         superblock in a subsequent patch.
      
         NOTE 1: I changed mpol_to_str() to use a colon ':' rather than an equal
         sign '=' as the nodemask delimiter to match mpol_parse_str() and the
         tmpfs/shmem mpol mount option formatting that now uses mpol_to_str().  This
         is a user visible change to numa_maps, but then the addition of the mode
         flags already changed the display.  It makes sense to me to have the mounts
         and numa_maps display the policy in the same format.  However, if anyone
         objects strongly, I can pass the desired nodemask delimeter as an arg to
         mpol_to_str().
      
         Note 2: Like show_numa_map(), I don't check the return code from
         mpol_to_str().  I do use a longer buffer than the one provided by
         show_numa_map(), which seems to have sufficed so far.
      Signed-off-by: NLee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Andi Kleen <ak@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      095f1fc4
    • L
      mempolicy: clean-up mpol-to-str() mempolicy formatting · 2291990a
      Lee Schermerhorn 提交于
      mpol-to-str() formats memory policies into printable strings.  Currently this
      is only used to display "numa_maps".  A subsequent patch will use
      mpol_to_str() for formatting tmpfs [shmem] mpol mount options, allowing us to
      remove essentially duplicate code in mm/shmem.c.  This patch cleans up
      mpol_to_str() generally and in preparation for that patch.
      
      1) show_numa_maps() is not checking the return code from mpol_to_str().
         There's not a lot we can do in this context if mpol_to_str() did return the
         error [insufficient space in buffer].  Proposed "solution": just check,
         under DEBUG_VM, that callers are providing sufficient buffer space for the
         policy, flags, and a few nodes.  This way, we'll get some display.
         show_numa_maps() is providing a 50-byte buffer, so it won't trip this
         check.  50-bytes should be sufficient unless one has a large number of
         nodes in a very sparse nodemask.
      
      2) The display of the new mode flags ["static" & "relative"] was set up to
         display multiple flags, separated by a "bar" '|'.  However, this support is
         incomplete--e.g., need_bar was never incremented; and currently, these two
         flags are mutually exclusive.  So remove the "bar" support, for now, and
         only display one flag.
      
      3) Use snprint() to format flags, so as not to overflow the buffer.  Not
         that it's ever happed, AFAIK.
      Signed-off-by: NLee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Andi Kleen <ak@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2291990a
    • L
      mempolicy: use MPOL_F_LOCAL to Indicate Preferred Local Policy · fc36b8d3
      Lee Schermerhorn 提交于
      Now that we're using "preferred local" policy for system default, we need to
      make this as fast as possible.  Because of the variable size of the mempolicy
      structure [based on size of nodemasks], the preferred_node may be in a
      different cacheline from the mode.  This can result in accessing an extra
      cacheline in the normal case of system default policy.  Suspect this is the
      cause of an observed 2-3% slowdown in page fault testing relative to kernel
      without this patch series.
      
      To alleviate this, use an internal mode flag, MPOL_F_LOCAL in the mempolicy
      flags member which is guaranteed [?] to be in the same cacheline as the mode
      itself.
      
      Verified that reworked mempolicy now performs slightly better on 25-rc8-mm1
      for both anon and shmem segments with system default and vma [preferred local]
      policy.
      Signed-off-by: NLee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Andi Kleen <ak@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      fc36b8d3
    • L
      mempolicy: mPOL_PREFERRED cleanups for "local allocation" · 53f2556b
      Lee Schermerhorn 提交于
      Here are a couple of "cleanups" for MPOL_PREFERRED behavior when
      v.preferred_node < 0 -- i.e., "local allocation":
      
      1)  [do_]get_mempolicy() calls the now renamed get_policy_nodemask()
          to fetch the nodemask associated with a policy.  Currently,
          get_policy_nodemask() returns the set of nodes with memory, when
          the policy 'mode' is 'PREFERRED, and the preferred_node is < 0.
          Change to return an empty nodemask, as this is what was specified
          to achieve "local allocation".
      
      2)  When a task is moved into a [new] cpuset, mpol_rebind_policy() is
          called to adjust any task and vma policy nodes to be valid in the
          new cpuset.  However, when the policy is MPOL_PREFERRED, and the
          preferred_node is <0, no rebind is necessary.  The "local allocation"
          indication is valid in any cpuset.  Existing code will "do the right
          thing" because node_remap() will just return the argument node when
          it is outside of the valid range of node ids.  However, I think it is
          clearer and cleaner to skip the remap explicitly in this case.
      
      3)  mpol_to_str() produces a printable, "human readable" string from a
          struct mempolicy.  For MPOL_PREFERRED with preferred_node <0,  show
          "local", as this indicates local allocation, as the task migrates
          among nodes.  Note that this matches the usage of "local allocation"
          in libnuma() and numactl.  Without this change, I believe that node_set()
          [via set_bit()] will set bit 31, resulting in a misleading display.
      Signed-off-by: NLee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Andi Kleen <ak@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      53f2556b
    • L
      mempolicy: use MPOL_PREFERRED for system-wide default policy · bea904d5
      Lee Schermerhorn 提交于
      Currently, when one specifies MPOL_DEFAULT via a NUMA memory policy API
      [set_mempolicy(), mbind() and internal versions], the kernel simply installs a
      NULL struct mempolicy pointer in the appropriate context: task policy, vma
      policy, or shared policy.  This causes any use of that policy to "fall back"
      to the next most specific policy scope.
      
      The only use of MPOL_DEFAULT to mean "local allocation" is in the system
      default policy.  This requires extra checks/cases for MPOL_DEFAULT in many
      mempolicy.c functions.
      
      There is another, "preferred" way to specify local allocation via the APIs.
      That is using the MPOL_PREFERRED policy mode with an empty nodemask.
      Internally, the empty nodemask gets converted to a preferred_node id of '-1'.
      All internal usage of MPOL_PREFERRED will convert the '-1' to the id of the
      node local to the cpu where the allocation occurs.
      
      System default policy, except during boot, is hard-coded to "local
      allocation".  By using the MPOL_PREFERRED mode with a negative value of
      preferred node for system default policy, MPOL_DEFAULT will never occur in the
      'policy' member of a struct mempolicy.  Thus, we can remove all checks for
      MPOL_DEFAULT when converting policy to a node id/zonelist in the allocation
      paths.
      
      In slab_node() return local node id when policy pointer is NULL.  No need to
      set a pol value to take the switch default.  Replace switch default with
      BUG()--i.e., shouldn't happen.
      
      With this patch MPOL_DEFAULT is only used in the APIs, including internal
      calls to do_set_mempolicy() and in the display of policy in
      /proc/<pid>/numa_maps.  It always means "fall back" to the the next most
      specific policy scope.  This simplifies the description of memory policies
      quite a bit, with no visible change in behavior.
      
      get_mempolicy() continues to return MPOL_DEFAULT and an empty nodemask when
      the requested policy [task or vma/shared] is NULL.  These are the values one
      would supply via set_mempolicy() or mbind() to achieve that condition--default
      behavior.
      
      This patch updates Documentation to reflect this change.
      Signed-off-by: NLee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Andi Kleen <ak@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      bea904d5
    • L
      mempolicy: rework mempolicy Reference Counting [yet again] · 52cd3b07
      Lee Schermerhorn 提交于
      After further discussion with Christoph Lameter, it has become clear that my
      earlier attempts to clean up the mempolicy reference counting were a bit of
      overkill in some areas, resulting in superflous ref/unref in what are usually
      fast paths.  In other areas, further inspection reveals that I botched the
      unref for interleave policies.
      
      A separate patch, suitable for upstream/stable trees, fixes up the known
      errors in the previous attempt to fix reference counting.
      
      This patch reworks the memory policy referencing counting and, one hopes,
      simplifies the code.  Maybe I'll get it right this time.
      
      See the update to the numa_memory_policy.txt document for a discussion of
      memory policy reference counting that motivates this patch.
      
      Summary:
      
      Lookup of mempolicy, based on (vma, address) need only add a reference for
      shared policy, and we need only unref the policy when finished for shared
      policies.  So, this patch backs out all of the unneeded extra reference
      counting added by my previous attempt.  It then unrefs only shared policies
      when we're finished with them, using the mpol_cond_put() [conditional put]
      helper function introduced by this patch.
      
      Note that shmem_swapin() calls read_swap_cache_async() with a dummy vma
      containing just the policy.  read_swap_cache_async() can call alloc_page_vma()
      multiple times, so we can't let alloc_page_vma() unref the shared policy in
      this case.  To avoid this, we make a copy of any non-null shared policy and
      remove the MPOL_F_SHARED flag from the copy.  This copy occurs before reading
      a page [or multiple pages] from swap, so the overhead should not be an issue
      here.
      
      I introduced a new static inline function "mpol_cond_copy()" to copy the
      shared policy to an on-stack policy and remove the flags that would require a
      conditional free.  The current implementation of mpol_cond_copy() assumes that
      the struct mempolicy contains no pointers to dynamically allocated structures
      that must be duplicated or reference counted during copy.
      Signed-off-by: NLee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Andi Kleen <ak@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      52cd3b07
    • L
      mempolicy: mark shared policies for unref · aab0b102
      Lee Schermerhorn 提交于
      As part of yet another rework of mempolicy reference counting, we want to be
      able to identify shared policies efficiently, because they have an extra ref
      taken on lookup that needs to be removed when we're finished using the policy.
      
        Note:  the extra ref is required because the policies are
        shared between tasks/processes and can be changed/freed
        by one task while another task is using them--e.g., for
        page allocation.
      
      Building on David Rientjes mempolicy "mode flags" enhancement, this patch
      indicates a "shared" policy by setting a new MPOL_F_SHARED flag in the flags
      member of the struct mempolicy added by David.  MPOL_F_SHARED, and any future
      "internal mode flags" are reserved from bit zero up, as they will never be
      passed in the upper bits of the mode argument of a mempolicy API.
      
      I set the MPOL_F_SHARED flag when the policy is installed in the shared policy
      rb-tree.  Don't need/want to clear the flag when removing from the tree as the
      mempolicy is freed [unref'd] internally to the sp_delete() function.  However,
      a task could hold another reference on this mempolicy from a prior lookup.  We
      need the MPOL_F_SHARED flag to stay put so that any tasks holding a ref will
      unref, eventually freeing, the mempolicy.
      
      A later patch in this series will introduce a function to conditionally unref
      [mpol_free] a policy.  The MPOL_F_SHARED flag is one reason [currently the
      only reason] to unref/free a policy via the conditional free.
      Signed-off-by: NLee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Andi Kleen <ak@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      aab0b102
    • L
      mempolicy: rename struct mempolicy 'policy' member to 'mode' · 45c4745a
      Lee Schermerhorn 提交于
      The terms 'policy' and 'mode' are both used in various places to describe the
      semantics of the value stored in the 'policy' member of struct mempolicy.
      Furthermore, the term 'policy' is used to refer to that member, to the entire
      struct mempolicy and to the more abstract concept of the tuple consisting of a
      "mode" and an optional node or set of nodes.  Recently, we have added "mode
      flags" that are passed in the upper bits of the 'mode' [or sometimes,
      'policy'] member of the numa APIs.
      
      I'd like to resolve this confusion, which perhaps only exists in my mind, by
      renaming the 'policy' member to 'mode' throughout, and fixing up the
      Documentation.  Man pages will be updated separately.
      Signed-off-by: NLee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Andi Kleen <ak@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      45c4745a
    • L
      mempolicy: fixup Fallback for Default Shmem Policy · ae4d8c16
      Lee Schermerhorn 提交于
      get_vma_policy() is not handling fallback to task policy correctly when the
      get_policy() vm_op returns NULL.  The NULL overwrites the 'pol' variable that
      was holding the fallback task mempolicy.  So, it was falling back directly to
      system default policy.
      
      Fix get_vma_policy() to use only non-NULL policy returned from the vma
      get_policy op.
      
      shm_get_policy() was falling back to current task's mempolicy if the "backing
      file system" [tmpfs vs hugetlbfs] does not support the get_policy vm_op and
      the vma policy is null.  This is incorrect for show_numa_maps() which is
      likely querying the numa_maps of some task other than current.  Remove this
      fallback.
      Signed-off-by: NLee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Andi Kleen <ak@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ae4d8c16
    • L
      mempolicy: write lock mmap_sem while changing task mempolicy · f4e53d91
      Lee Schermerhorn 提交于
      A read of /proc/<pid>/numa_maps holds the target task's mmap_sem for read
      while examining each vma's mempolicy.  A vma's mempolicy can fall back to the
      task's policy.  However, the task could be changing it's task policy and free
      the one that the show_numa_maps() is examining.
      
      To prevent this, grab the mmap_sem for write when updating task mempolicy.
      Pointed out to me by Christoph Lameter and extracted and reworked from
      Christoph's alternative mempol reference counting patch.
      
      This is analogous to the way that do_mbind() and do_get_mempolicy() prevent
      races between task's sharing an mm_struct [a.k.a.  threads] setting and
      querying a mempolicy for a particular address.
      
      Note: this is necessary, but not sufficient, to allow us to stop taking an
      extra reference on "other task's mempolicy" in get_vma_policy.  Subsequent
      patches will complete this update, allowing us to simplify the tests for
      whether we need to unref a mempolicy at various points in the code.
      Signed-off-by: NLee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Andi Kleen <ak@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f4e53d91
    • L
      mempolicy: rename mpol_copy to mpol_dup · 846a16bf
      Lee Schermerhorn 提交于
      This patch renames mpol_copy() to mpol_dup() because, well, that's what it
      does.  Like, e.g., strdup() for strings, mpol_dup() takes a pointer to an
      existing mempolicy, allocates a new one and copies the contents.
      
      In a later patch, I want to use the name mpol_copy() to copy the contents from
      one mempolicy to another like, e.g., strcpy() does for strings.
      Signed-off-by: NLee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Andi Kleen <ak@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      846a16bf