1. 11 December 2012, 4 commits
    • mm: mempolicy: Implement change_prot_numa() in terms of change_protection() · 4b10e7d5
      Committed by Mel Gorman
      This patch converts change_prot_numa() to use change_protection(). As
      pte_numa and friends check the PTE bits directly, it is necessary for
      change_protection() to use pmd_mknuma(). Hence the required
      modifications to change_protection() are a little clumsy, but the end
      result is that most of the NUMA page table helpers are just one or
      two instructions.
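
      A minimal sketch of the resulting shape, assuming change_protection()
      gained a prot_numa-style flag as described (illustrative, not the
      patch's verbatim code):

        unsigned long change_prot_numa(struct vm_area_struct *vma,
                                       unsigned long addr, unsigned long end)
        {
                int nr_updated;

                /* The final argument asks change_protection() to apply
                 * pmd_mknuma()/pte_mknuma() instead of a new protection. */
                nr_updated = change_protection(vma, addr, end, vma->vm_page_prot,
                                               0 /* dirty_accountable */,
                                               1 /* prot_numa */);
                return nr_updated;
        }
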
      Signed-off-by: Mel Gorman <mgorman@suse.de>
    • mm: mempolicy: Add MPOL_MF_LAZY · b24f53a0
      Committed by Lee Schermerhorn
      NOTE: Once again there is a lot of patch stealing and the end result
      	is sufficiently different that I had to drop the signed-offs.
      	Will re-add if the original authors are ok with that.
      
      This patch adds another mbind() flag to request "lazy migration".  The
      flag, MPOL_MF_LAZY, modifies MPOL_MF_MOVE* such that the selected
      pages are marked PROT_NONE. The pages will be migrated in the fault
      path on "first touch", if the policy dictates at that time.
      
      "Lazy Migration" will allow testing of migrate-on-fault via mbind().
      Also allows applications to specify that only subsequently touched
      pages be migrated to obey new policy, instead of all pages in range.
      This can be useful for multi-threaded applications working on a
      large shared data area that is initialized by an initial thread
      resulting in all pages on one [or a few, if overflowed] nodes.
      After PROT_NONE, the pages in regions assigned to the worker threads
      will be automatically migrated local to the threads on 1st touch.
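
      A hedged userspace sketch of requesting lazy migration; MPOL_MF_LAZY
      may be missing from installed libnuma headers, so the fallback
      definition below mirrors the patch and is purely illustrative:

        #include <numaif.h>             /* mbind(); link with -lnuma */
        #include <stdlib.h>

        #ifndef MPOL_MF_LAZY
        #define MPOL_MF_LAZY (1 << 3)   /* illustrative fallback */
        #endif

        int main(void)
        {
                size_t len = 64UL << 20;                /* shared data area */
                void *buf = aligned_alloc(4096, len);
                unsigned long nodemask = 1UL << 1;      /* target node 1 */

                /* Bind to node 1, but migrate lazily: pages are marked
                 * PROT_NONE now and moved in the fault path on first touch. */
                mbind(buf, len, MPOL_BIND, &nodemask, 8 * sizeof(nodemask),
                      MPOL_MF_MOVE | MPOL_MF_LAZY);
                return 0;
        }
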
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Rik van Riel <riel@redhat.com>
    • mm: numa: Support NUMA hinting page faults from gup/gup_fast · 0b9d7052
      Committed by Andrea Arcangeli
      Introduce FOLL_NUMA to tell follow_page to check
      pte/pmd_numa. get_user_pages must use FOLL_NUMA, and it's safe to do
      so because it always invokes handle_mm_fault and retries the
      follow_page later.
      
      KVM secondary MMU page faults will trigger the NUMA hinting page
      faults through gup_fast -> get_user_pages -> follow_page ->
      handle_mm_fault.
      
      Other follow_page callers like KSM should not use FOLL_NUMA, or they
      would fail to get the pages if they use follow_page instead of
      get_user_pages.
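
      A hedged sketch of the core check in follow_page()'s PTE path
      (simplified and illustrative, not the exact diff):

        /* Sketch: decide whether follow_page() should refuse to return a
         * page so the caller takes a NUMA hinting fault instead.  Only
         * callers that set FOLL_NUMA (i.e. get_user_pages(), which retries
         * via handle_mm_fault()) see the _PAGE_NUMA pte as "no page". */
        static bool follow_page_should_numa_fault(pte_t pte, unsigned int flags)
        {
                return (flags & FOLL_NUMA) && pte_numa(pte);
        }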
      
      [ This patch was picked up from the AutoNUMA tree. ]
      Originally-by: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      [ ported to this tree. ]
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Reviewed-by: Rik van Riel <riel@redhat.com>
    • mm: Count the number of pages affected in change_protection() · 7da4d641
      Committed by Peter Zijlstra
      This will be used for three purposes:
      
       - to optimize mprotect()
      
       - to speed up working set scanning for working set areas that
         have not been touched
      
       - to more accurately scan per real working set
      
      No change in functionality from this patch.
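
      A hedged sketch of the interface change (the body threads a page count
      up from the PTE-level loops; prototype shown for illustration):

        /* Returns the number of pages whose protection was actually changed,
         * so callers such as mprotect() and a working-set scanner can tell
         * how much real work was done over [start, end). */
        unsigned long change_protection(struct vm_area_struct *vma,
                                        unsigned long start, unsigned long end,
                                        pgprot_t newprot, int dirty_accountable);
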
      Suggested-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  2. 17 November 2012, 1 commit
    • revert "mm: fix-up zone present pages" · 5576646f
      Committed by Andrew Morton
      Revert commit 7f1290f2 ("mm: fix-up zone present pages")
      
      That patch tried to fix an issue with calculating zone->present_pages,
      but it caused a regression on 32-bit systems with HIGHMEM.  With that
      change, reset_zone_present_pages() resets all zone->present_pages to
      zero, and fixup_zone_present_pages() is called to recalculate
      zone->present_pages when the boot allocator frees core memory pages into
      the buddy allocator.  Because highmem pages are not freed by the bootmem
      allocator, all highmem zones' present_pages become zero.
      
      Various options for improving the situation are being discussed but for
      now, let's return to the 3.6 code.
      
      Cc: Jianguo Wu <wujianguo@huawei.com>
      Cc: Jiang Liu <jiang.liu@huawei.com>
      Cc: Petr Tesarik <ptesarik@suse.cz>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: David Rientjes <rientjes@google.com>
      Tested-by: Chris Clayton <chris2553@googlemail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  3. 09 October 2012, 17 commits
    • mm: fix-up zone present pages · 7f1290f2
      Committed by Jianguo Wu
      I think zone->present_pages indicates the pages that the buddy system
      can manage; it should be:

        zone->present_pages = spanned pages - absent pages - bootmem pages,

      but it is currently:
        zone->present_pages = spanned pages - absent pages - memmap pages.

      spanned pages: total size, including holes.
      absent pages: holes.
      bootmem pages: pages used during system boot, managed by the bootmem allocator.
      memmap pages: pages used by page structs.

      This may make zone->present_pages smaller than it should be.  For example,
      NUMA node 1 has ZONE_NORMAL and ZONE_MOVABLE; its memmap and other bootmem
      allocations come from ZONE_MOVABLE, so ZONE_NORMAL's present_pages should
      be spanned pages - absent pages, but it currently also subtracts memmap
      pages (free_area_init_core), which are actually allocated from
      ZONE_MOVABLE.  When all memory of a zone is offlined, this makes
      zone->present_pages go below zero; because present_pages is an unsigned
      long, it actually becomes a very large integer, which indirectly makes
      zone->watermark[WMARK_MIN] a large integer (setup_per_zone_wmarks()),
      then makes totalreserve_pages a large integer
      (calculate_totalreserve_pages()), and finally causes memory allocation
      failures when forking a process (__vm_enough_memory()).
      
      [root@localhost ~]# dmesg
      -bash: fork: Cannot allocate memory
      
      I think the bug described in
      
        http://marc.info/?l=linux-mm&m=134502182714186&w=2
      
      is also caused by wrong zone present pages.
      
      This patch fixes up zone->present_pages when memory is freed to the
      buddy system on the x86_64 and IA64 platforms.
      Signed-off-by: Jianguo Wu <wujianguo@huawei.com>
      Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
      Reported-by: Petr Tesarik <ptesarik@suse.cz>
      Tested-by: Petr Tesarik <ptesarik@suse.cz>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • readahead: fault retry breaks mmap file read random detection · 45cac65b
      Committed by Shaohua Li
      .fault can now retry, and the retry can break .fault's state machine.  In
      filemap_fault, if the page is missing, ra->mmap_miss is increased.  On
      the second try, since the page is now in the page cache, ra->mmap_miss is
      decreased.  Both happen within a single fault, so we can't detect random
      mmap file access.

      Add a new flag to indicate that .fault has already been tried once.  On
      the second try, skip the ra->mmap_miss decrement.  The filemap_fault
      state machine is fine with this.

      I only tested x86, not other architectures, but the change for the other
      architectures looks obvious - but who knows :)
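
      A hedged sketch of the bookkeeping this enables (flag name per the
      patch; the helper below is illustrative, not the kernel's code):

        static void account_mmap_readahead(struct file_ra_state *ra,
                                           unsigned int fault_flags,
                                           bool page_was_cached)
        {
                if (fault_flags & FAULT_FLAG_TRIED)
                        return;         /* retry: leave the state machine alone */

                if (!page_was_cached)
                        ra->mmap_miss++;                /* genuine miss */
                else if (ra->mmap_miss > 0)
                        ra->mmap_miss--;                /* genuine hit */
        }
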
      Signed-off-by: Shaohua Li <shaohua.li@fusionio.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: remain migratetype in freed page · 95e34412
      Committed by Minchan Kim
      The page allocator caches the pageblock information in page->private
      while a page is on the PCP freelists, but this is overwritten with the
      order of the page when it is freed to the buddy allocator.  This patch
      stores the migratetype of the page in the page->index field instead, so
      that it is available at all times while the page remains on a free_list.

      This patch adds a new call site in __free_pages_ok(), which adds a
      little overhead, but only on the high-order allocation path, so I
      believe the impact is negligible.
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reviewed-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: Wen Congyang <wency@cn.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: page_alloc: use get_freepage_migratetype() instead of page_private() · b12c4ad1
      Committed by Minchan Kim
      The page allocator uses set_page_private() and page_private() for
      handling the migratetype when it frees pages.  Let's replace them with
      [set|get]_freepage_migratetype() to make the intent clearer.
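
      The new helpers amount to something like this sketch, with the
      migratetype riding in page->index while the page sits on a free list:

        static inline void set_freepage_migratetype(struct page *page,
                                                    int migratetype)
        {
                page->index = migratetype;
        }

        static inline int get_freepage_migratetype(struct page *page)
        {
                return page->index;
        }
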
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reviewed-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: Wen Congyang <wency@cn.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: avoid taking rmap locks in move_ptes() · 38a76013
      Committed by Michel Lespinasse
      During mremap(), the destination VMA is generally placed after the
      original vma in rmap traversal order: in move_vma(), we always have
      new_pgoff >= vma->vm_pgoff, and as a result new_vma->vm_pgoff >=
      vma->vm_pgoff unless vma_merge() merged the new vma with an adjacent one.
      
      When the destination VMA is placed after the original in rmap traversal
      order, we can avoid taking the rmap locks in move_ptes().
      
      Essentially, this reintroduces the optimization that had been disabled in
      "mm anon rmap: remove anon_vma_moveto_tail".  The difference is that we
      don't try to impose the rmap traversal order; instead we just rely on
      things being in the desired order in the common case and fall back to
      taking locks in the uncommon case.  Also we skip the i_mmap_mutex in
      addition to the anon_vma lock: in both cases, the vmas are traversed in
      increasing vm_pgoff order with ties resolved in tree insertion order.
      Signed-off-by: Michel Lespinasse <walken@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Daniel Santos <daniel.santos@pobox.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: add CONFIG_DEBUG_VM_RB build option · ed8ea815
      Committed by Michel Lespinasse
      Add a CONFIG_DEBUG_VM_RB build option for the previously existing
      DEBUG_MM_RB code.  Now that Andi Kleen modified it to avoid using
      recursive algorithms, we can expose it a bit more.
      
      Also extend this code to validate_mm() after stack expansion, and to check
      that the vma's start and last pgoffs have not changed since the nodes were
      inserted on the anon vma interval tree (as it is important that the nodes
      be reindexed after each such update).
      Signed-off-by: Michel Lespinasse <walken@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Daniel Santos <daniel.santos@pobox.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm anon rmap: replace same_anon_vma linked list with an interval tree. · bf181b9f
      Committed by Michel Lespinasse
      When a large VMA (anon or private file mapping) is first touched, which
      will populate its anon_vma field, and then split into many regions through
      the use of mprotect(), the original anon_vma ends up linking all of the
      vmas on a linked list.  This can cause rmap to become inefficient, as we
      have to walk potentially thousands of irrelevant vmas before finding the
      one a given anon page might fall into.
      
      By replacing the same_anon_vma linked list with an interval tree (where
      each avc's interval is determined by its vma's start and last pgoffs), we
      can make rmap efficient for this use case again.
      
      While the change is large, all of its pieces are fairly simple.
      
      Most places that were walking the same_anon_vma list were looking for a
      known pgoff, so they can just use the anon_vma_interval_tree_foreach()
      interval tree iterator instead.  The exception here is ksm, where the
      page's index is not known.  It would probably be possible to rework ksm so
      that the index would be known, but for now I have decided to keep things
      simple and just walk the entirety of the interval tree there.
      
      When updating vma's that already have an anon_vma assigned, we must take
      care to re-index the corresponding avc's on their interval tree.  This is
      done through the use of anon_vma_interval_tree_pre_update_vma() and
      anon_vma_interval_tree_post_update_vma(), which remove the avc's from
      their interval tree before the update and re-insert them after the update.
       The anon_vma stays locked during the update, so there is no chance that
      rmap would miss the vmas that are being updated.
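
      A hedged sketch of the update protocol described above (helper names
      from this patch, the surrounding function is illustrative):

        static void adjust_vma_start(struct vm_area_struct *vma,
                                     unsigned long new_start, pgoff_t new_pgoff)
        {
                /* The caller holds the anon_vma lock, so rmap never sees a
                 * half-updated interval. */
                anon_vma_interval_tree_pre_update_vma(vma);
                vma->vm_start = new_start;
                vma->vm_pgoff = new_pgoff;
                anon_vma_interval_tree_post_update_vma(vma);
        }
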
      Signed-off-by: Michel Lespinasse <walken@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Daniel Santos <daniel.santos@pobox.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: interval tree updates · 9826a516
      Committed by Michel Lespinasse
      Update the generic interval tree code that was introduced in "mm: replace
      vma prio_tree with an interval tree".
      
      Changes:
      
      - fixed 'endpoing' typo noticed by Andrew Morton
      
      - replaced include/linux/interval_tree_tmpl.h, which was used as a
        template (including it automatically defined the interval tree
        functions) with include/linux/interval_tree_generic.h, which only
        defines a preprocessor macro INTERVAL_TREE_DEFINE(), which itself
        defines the interval tree functions when invoked. Now that is a very
        long macro which is unfortunate, but it does make the usage sites
        (lib/interval_tree.c and mm/interval_tree.c) a bit nicer than previously.
      
      - make use of RB_DECLARE_CALLBACKS() in the INTERVAL_TREE_DEFINE() macro,
        instead of duplicating that code in the interval tree template.
      
      - replaced vma_interval_tree_add(), which was actually handling the
        nonlinear and interval tree cases, with vma_interval_tree_insert_after()
        which handles only the interval tree case and has an API that is more
        consistent with the other interval tree handling functions.
        The nonlinear case is now handled explicitly in kernel/fork.c dup_mmap().
      Signed-off-by: Michel Lespinasse <walken@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Daniel Santos <daniel.santos@pobox.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: replace vma prio_tree with an interval tree · 6b2dbba8
      Committed by Michel Lespinasse
      Implement an interval tree as a replacement for the VMA prio_tree.  The
      algorithms are similar to lib/interval_tree.c; however that code can't be
      directly reused as the interval endpoints are not explicitly stored in the
      VMA.  So instead, the common algorithm is moved into a template and the
      details (node type, how to get interval endpoints from the node, etc) are
      filled in using the C preprocessor.
      
      Once the interval tree functions are available, using them as a
      replacement to the VMA prio tree is a relatively simple, mechanical job.
      Signed-off-by: Michel Lespinasse <walken@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Hillf Danton <dhillf@gmail.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: compaction: capture a suitable high-order page immediately when it is made available · 1fb3f8ca
      Committed by Mel Gorman
      While compaction is migrating pages to free up large contiguous blocks
      for allocation it races with other allocation requests that may steal
      these blocks or break them up.  This patch alters direct compaction to
      capture a suitable free page as soon as it becomes available to reduce
      this race.  It uses similar logic to split_free_page() to ensure that
      watermarks are still obeyed.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Reviewed-by: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: kill vma flag VM_RESERVED and mm->reserved_vm counter · 314e51b9
      Committed by Konstantin Khlebnikov
      A long time ago, in v2.4, VM_RESERVED kept the swapout process off a VMA;
      it has since lost its original meaning but still has some effects:
      
       | effect                 | alternative flags
      -+------------------------+---------------------------------------------
      1| account as reserved_vm | VM_IO
      2| skip in core dump      | VM_IO, VM_DONTDUMP
      3| do not merge or expand | VM_IO, VM_DONTEXPAND, VM_HUGETLB, VM_PFNMAP
      4| do not mlock           | VM_IO, VM_DONTEXPAND, VM_HUGETLB, VM_PFNMAP
      
      This patch removes the reserved_vm counter from mm_struct.  Nobody seems
      to care about it: it is not exported to userspace directly, it only
      reduces the total_vm shown in /proc.
      
      Thus VM_RESERVED can be replaced with VM_IO or with the pair VM_DONTEXPAND | VM_DONTDUMP.

      remap_pfn_range() and io_remap_pfn_range() set VM_IO | VM_DONTEXPAND | VM_DONTDUMP.
      remap_vmalloc_range() sets VM_DONTEXPAND | VM_DONTDUMP.
      
      [akpm@linux-foundation.org: drivers/vfio/pci/vfio_pci.c fixup]
      Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Carsten Otte <cotte@de.ibm.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Eric Paris <eparis@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: James Morris <james.l.morris@oracle.com>
      Cc: Jason Baron <jbaron@redhat.com>
      Cc: Kentaro Takeda <takedakn@nttdata.co.jp>
      Cc: Matt Helsley <matthltc@us.ibm.com>
      Cc: Nick Piggin <npiggin@kernel.dk>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Robert Richter <robert.richter@amd.com>
      Cc: Suresh Siddha <suresh.b.siddha@intel.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Venkatesh Pallipadi <venki@google.com>
      Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: prepare VM_DONTDUMP for using in drivers · 0103bd16
      Committed by Konstantin Khlebnikov
      Rename VM_NODUMP into VM_DONTDUMP: this name matches the other negative
      flags, VM_DONTEXPAND and VM_DONTCOPY.  Currently this flag is used only
      for sys_madvise.  The next patch will use it to replace the outdated
      flag VM_RESERVED.

      Also forbid madvise(MADV_DODUMP) for the special kernel mappings
      VM_SPECIAL (VM_IO | VM_DONTEXPAND | VM_RESERVED | VM_PFNMAP).
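
      A hedged sketch of the madvise() flag handling described above
      (simplified helper, not the verbatim kernel code):

        static int madvise_dump_flags(unsigned long *new_flags, int behavior)
        {
                if (behavior == MADV_DONTDUMP) {
                        *new_flags |= VM_DONTDUMP;
                } else if (behavior == MADV_DODUMP) {
                        /* Refuse for special kernel mappings (VM_SPECIAL). */
                        if (*new_flags & VM_SPECIAL)
                                return -EINVAL;
                        *new_flags &= ~VM_DONTDUMP;
                }
                return 0;
        }
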
      Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Carsten Otte <cotte@de.ibm.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Eric Paris <eparis@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: James Morris <james.l.morris@oracle.com>
      Cc: Jason Baron <jbaron@redhat.com>
      Cc: Kentaro Takeda <takedakn@nttdata.co.jp>
      Cc: Matt Helsley <matthltc@us.ibm.com>
      Cc: Nick Piggin <npiggin@kernel.dk>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Robert Richter <robert.richter@amd.com>
      Cc: Suresh Siddha <suresh.b.siddha@intel.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Venkatesh Pallipadi <venki@google.com>
      Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: kill vma flag VM_EXECUTABLE and mm->num_exe_file_vmas · e9714acf
      Committed by Konstantin Khlebnikov
      Currently the kernel sets mm->exe_file during sys_execve() and then
      tracks the number of vmas with the VM_EXECUTABLE flag in
      mm->num_exe_file_vmas; as soon as this counter drops to zero, the kernel
      resets mm->exe_file to NULL.  It also resets mm->exe_file at the last
      mmput(), when mm->mm_users drops to zero.

      VMAs with the VM_EXECUTABLE flag appear after mapping a file with the
      MAP_EXECUTABLE flag; such vmas can appear only at sys_execve() or after
      vma splitting, because sys_mmap ignores this flag.  Usually a binfmt
      module sets mm->exe_file and mmaps executable vmas with this file; they
      hold mm->exe_file while the task is running.
      
      comment from v2.6.25-6245-g925d1c40 ("procfs task exe symlink"),
      where all this stuff was introduced:
      
      > The kernel implements readlink of /proc/pid/exe by getting the file from
      > the first executable VMA.  Then the path to the file is reconstructed and
      > reported as the result.
      >
      > Because of the VMA walk the code is slightly different on nommu systems.
      > This patch avoids separate /proc/pid/exe code on nommu systems.  Instead of
      > walking the VMAs to find the first executable file-backed VMA we store a
      > reference to the exec'd file in the mm_struct.
      >
      > That reference would prevent the filesystem holding the executable file
      > from being unmounted even after unmapping the VMAs.  So we track the number
      > of VM_EXECUTABLE VMAs and drop the new reference when the last one is
      > unmapped.  This avoids pinning the mounted filesystem.
      
      exe_file's vma accounting is hooked into every file mmap/munmap and vma
      split/merge just to prevent a hypothetical case of a filesystem being
      pinned against unmounting by an mm that has already unmapped all of its
      executable files but is still alive.

      Nobody seems to depend on this behaviour currently.  We can try to
      remove this logic and keep mm->exe_file until the final mmput().

      mm->exe_file is still protected with mm->mmap_sem, because we want to
      change it via the new sys_prctl(PR_SET_MM_EXE_FILE).  Via this syscall a
      task can also change its mm->exe_file and unpin the mountpoint explicitly.
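
      A hedged userspace sketch of the prctl() mentioned above (constants are
      guarded in case they are missing from older headers; the operation
      requires CAP_SYS_RESOURCE):

        #include <sys/prctl.h>
        #include <fcntl.h>
        #include <unistd.h>

        #ifndef PR_SET_MM
        #define PR_SET_MM 35
        #endif
        #ifndef PR_SET_MM_EXE_FILE
        #define PR_SET_MM_EXE_FILE 13
        #endif

        /* Re-point /proc/self/exe at another file, which also drops the
         * reference that pins the old executable's mount. */
        static int repoint_exe_file(const char *path)
        {
                int fd = open(path, O_RDONLY);
                int ret;

                if (fd < 0)
                        return -1;
                ret = prctl(PR_SET_MM, PR_SET_MM_EXE_FILE,
                            (unsigned long)fd, 0, 0);
                close(fd);
                return ret;
        }
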
      Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Carsten Otte <cotte@de.ibm.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Eric Paris <eparis@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: James Morris <james.l.morris@oracle.com>
      Cc: Jason Baron <jbaron@redhat.com>
      Cc: Kentaro Takeda <takedakn@nttdata.co.jp>
      Cc: Matt Helsley <matthltc@us.ibm.com>
      Cc: Nick Piggin <npiggin@kernel.dk>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Robert Richter <robert.richter@amd.com>
      Cc: Suresh Siddha <suresh.b.siddha@intel.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Venkatesh Pallipadi <venki@google.com>
      Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: kill vma flag VM_CAN_NONLINEAR · 0b173bc4
      Committed by Konstantin Khlebnikov
      Move the actual pte filling for non-linear file mappings into the new
      special vma operation ->remap_pages().

      Filesystems must implement this method to get non-linear mapping
      support; a filesystem that uses filemap_fault() can use
      generic_file_remap_pages().

      Device drivers can now also implement this method to obtain nonlinear
      vma support.
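
      For a filemap_fault()-based filesystem the resulting vm_ops look roughly
      like this (the xyz_ prefix is a placeholder, not a real filesystem):

        static const struct vm_operations_struct xyz_file_vm_ops = {
                .fault          = filemap_fault,
                .remap_pages    = generic_file_remap_pages,
        };
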
      Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Carsten Otte <cotte@de.ibm.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>	#arch/tile
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Eric Paris <eparis@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: James Morris <james.l.morris@oracle.com>
      Cc: Jason Baron <jbaron@redhat.com>
      Cc: Kentaro Takeda <takedakn@nttdata.co.jp>
      Cc: Matt Helsley <matthltc@us.ibm.com>
      Cc: Nick Piggin <npiggin@kernel.dk>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Robert Richter <robert.richter@amd.com>
      Cc: Suresh Siddha <suresh.b.siddha@intel.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Venkatesh Pallipadi <venki@google.com>
      Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: kill vma flag VM_INSERTPAGE · 4b6e1e37
      Committed by Konstantin Khlebnikov
      Merge VM_INSERTPAGE into VM_MIXEDMAP.  A VM_MIXEDMAP VMA can mix pure-pfn
      ptes, special ptes and normal ptes.

      Now copy_page_range() always copies a VM_MIXEDMAP VMA on fork, like
      VM_PFNMAP.  If a driver populates the whole VMA at mmap() time, it
      probably does not expect page faults.

      This patch removes the special check from vma_wants_writenotify() which
      disables page write tracking for VMAs populated via vm_insert_page().
      The BDI below the mapped file should not use dirty accounting; moreover,
      do_wp_page() can handle this case.

      vm_insert_page() still marks the vma after first usage.  Usually it is
      called from an f_op->mmap() handler under the mm->mmap_sem write-lock,
      so it is able to change vma->vm_flags.  The caller must set VM_MIXEDMAP
      at mmap time if it wants to call this function from other places, for
      example from a page-fault handler.
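
      A hedged sketch of that rule: a driver that wants to call
      vm_insert_page() outside ->mmap() (e.g. from its fault handler) sets
      VM_MIXEDMAP up front (the xyz_ names are placeholders):

        static int xyz_drv_mmap(struct file *file, struct vm_area_struct *vma)
        {
                vma->vm_flags |= VM_MIXEDMAP;
                vma->vm_ops = &xyz_drv_vm_ops;  /* its fault handler may call
                                                   vm_insert_page() later */
                return 0;
        }
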
      Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Carsten Otte <cotte@de.ibm.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Eric Paris <eparis@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: James Morris <james.l.morris@oracle.com>
      Cc: Jason Baron <jbaron@redhat.com>
      Cc: Kentaro Takeda <takedakn@nttdata.co.jp>
      Cc: Matt Helsley <matthltc@us.ibm.com>
      Cc: Nick Piggin <npiggin@kernel.dk>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Robert Richter <robert.richter@amd.com>
      Cc: Suresh Siddha <suresh.b.siddha@intel.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Venkatesh Pallipadi <venki@google.com>
      Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: introduce arch-specific vma flag VM_ARCH_1 · cc2383ec
      Committed by Konstantin Khlebnikov
      Combine several arch-specific vma flags into one.
      
      before patch:
      
              0x00000200      0x01000000      0x20000000      0x40000000
      x86     VM_NOHUGEPAGE   VM_HUGEPAGE     -               VM_PAT
      powerpc -               -               VM_SAO          -
      parisc  VM_GROWSUP      -               -               -
      ia64    VM_GROWSUP      -               -               -
      nommu   -               VM_MAPPED_COPY  -               -
      others  -               -               -               -
      
      after patch:
      
              0x00000200      0x01000000      0x20000000      0x40000000
      x86     -               VM_PAT          VM_HUGEPAGE     VM_NOHUGEPAGE
      powerpc -               VM_SAO          -               -
      parisc  -               VM_GROWSUP      -               -
      ia64    -               VM_GROWSUP      -               -
      nommu   -               VM_MAPPED_COPY  -               -
      others  -               VM_ARCH_1       -               -
      
      And voila! One completely free bit.
      Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Carsten Otte <cotte@de.ibm.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Eric Paris <eparis@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: James Morris <james.l.morris@oracle.com>
      Cc: Jason Baron <jbaron@redhat.com>
      Cc: Kentaro Takeda <takedakn@nttdata.co.jp>
      Cc: Matt Helsley <matthltc@us.ibm.com>
      Cc: Nick Piggin <npiggin@kernel.dk>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Robert Richter <robert.richter@amd.com>
      Cc: Suresh Siddha <suresh.b.siddha@intel.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Venkatesh Pallipadi <venki@google.com>
      Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, x86, pat: rework linear pfn-mmap tracking · b3b9c293
      Committed by Konstantin Khlebnikov
      Replace the generic vma-flag VM_PFN_AT_MMAP with x86-only VM_PAT.
      
      We can toss mapping address from remap_pfn_range() into
      track_pfn_vma_new(), and collect all PAT-related logic together in
      arch/x86/.
      
      This patch also restores the original frustration-free is_cow_mapping()
      check in remap_pfn_range(), as it was before commit v2.6.28-rc8-88-g3c8bb73a
      ("x86: PAT: store vm_pgoff for all linear_over_vma_region mappings - v3").

      The is_linear_pfn_mapping() checks can be removed from mm/huge_memory.c,
      because that case is already handled by VM_PFNMAP in the VM_NO_THP bit-mask.
      
      [suresh.b.siddha@intel.com: Reset the VM_PAT flag as part of untrack_pfn_vma()]
      Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
      Cc: Venkatesh Pallipadi <venki@google.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Nick Piggin <npiggin@kernel.dk>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Carsten Otte <cotte@de.ibm.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Eric Paris <eparis@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: James Morris <james.l.morris@oracle.com>
      Cc: Jason Baron <jbaron@redhat.com>
      Cc: Kentaro Takeda <takedakn@nttdata.co.jp>
      Cc: Matt Helsley <matthltc@us.ibm.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Robert Richter <robert.richter@amd.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Venkatesh Pallipadi <venki@google.com>
      Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  4. 01 August 2012, 4 commits
    • mm: add get_kernel_page[s] for pinning of kernel addresses for I/O · 18022c5d
      Committed by Mel Gorman
      This patch adds two new APIs get_kernel_pages() and get_kernel_page() that
      may be used to pin a vector of kernel addresses for IO.  The initial user
      is expected to be NFS for allowing pages to be written to swap using
      aops->direct_IO().  Strictly speaking, swap-over-NFS only needs to pin one
      page for IO but it makes sense to express the API in terms of a vector and
      add a helper for pinning single pages.
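
      A hedged sketch of the new API as described (one kvec segment covering a
      page-sized kernel buffer; signatures as introduced by this patch):

        static int pin_kernel_buffer(void *kernel_buf, struct page **page)
        {
                struct kvec kiov = {
                        .iov_base = kernel_buf,
                        .iov_len  = PAGE_SIZE,
                };

                /* Returns the number of pages pinned (here 0 or 1); callers
                 * drop the reference with put_page() when the I/O completes. */
                return get_kernel_pages(&kiov, 1, 0 /* write */, page);
        }
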
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Eric B Munson <emunson@mgebm.net>
      Cc: Eric Paris <eparis@redhat.com>
      Cc: James Morris <jmorris@namei.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Mike Christie <michaelc@cs.wisc.edu>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
      Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
      Cc: Xiaotian Feng <dfeng@redhat.com>
      Cc: Mark Salter <msalter@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: methods for teaching filesystems about PG_swapcache pages · f981c595
      Committed by Mel Gorman
      In order to teach filesystems to handle swap cache pages, three new page
      functions are introduced:
      
        pgoff_t page_file_index(struct page *);
        loff_t page_file_offset(struct page *);
        struct address_space *page_file_mapping(struct page *);
      
      page_file_index() - gives the offset of this page in the file in
      PAGE_CACHE_SIZE blocks.  Like page->index is for mapped pages, this
      function also gives the correct index for PG_swapcache pages.
      
      page_file_offset() - uses page_file_index(), so that it will give the
      expected result, even for PG_swapcache pages.
      
      page_file_mapping() - gives the mapping backing the actual page; that is
      for swap cache pages it will give swap_file->f_mapping.
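
      A hedged sketch of how a filesystem's write-out path would use the new
      helpers so that swap cache pages are handled too (xyz_ is a placeholder):

        static void xyz_prepare_writeback(struct page *page)
        {
                struct address_space *mapping = page_file_mapping(page);
                pgoff_t index = page_file_index(page);
                loff_t pos = page_file_offset(page);

                /* For ordinary pages these match page->mapping and
                 * page->index; for PG_swapcache pages they resolve to the
                 * swap file's mapping and the page's offset within it. */
                (void)mapping;
                (void)index;
                (void)pos;
        }
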
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Eric B Munson <emunson@mgebm.net>
      Cc: Eric Paris <eparis@redhat.com>
      Cc: James Morris <jmorris@namei.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Mike Christie <michaelc@cs.wisc.edu>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
      Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
      Cc: Xiaotian Feng <dfeng@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/hotplug: free zone->pageset when a zone becomes empty · 340175b7
      Committed by Jiang Liu
      When a zone becomes empty after memory offlining, free zone->pageset.
      Otherwise it will cause a memory leak when memory is added to the empty
      zone again, because build_all_zonelists() will allocate a new
      zone->pageset for the empty zone.
      Signed-off-by: Jiang Liu <liuj97@gmail.com>
      Signed-off-by: Wei Wang <Bessel.Wang@huawei.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Keping Chen <chenkeping@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: account the total_vm in the vm_stat_account() · 44de9d0c
      Committed by Huang Shijie
      vm_stat_account() currently accounts shared_vm, stack_vm and reserved_vm.
      We can also account total_vm in vm_stat_account(), which makes the code
      tidier.

      Even for mprotect_fixup(), we get the right result in the end.
      Signed-off-by: Huang Shijie <shijie8@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  5. 31 July 2012, 1 commit
  6. 12 July 2012, 1 commit
    • x86/mce: Fix siginfo_t->si_addr value for non-recoverable memory faults · 6751ed65
      Committed by Tony Luck
      In commit dad1743e ("x86/mce: Only restart instruction after machine
      check recovery if it is safe") we fixed mce_notify_process() to force a
      signal to the current process if it was not restartable (RIPV bit not
      set in MCG_STATUS). But doing it here means that the process doesn't
      get told the virtual address of the fault via siginfo_t->si_addr. This
      would prevent application level recovery from the fault.
      
      Make a new MF_MUST_KILL flag bit for memory_failure() et al. to use so
      that we will provide the right information with the signal.
      Signed-off-by: Tony Luck <tony.luck@intel.com>
      Acked-by: Borislav Petkov <borislav.petkov@amd.com>
      Cc: stable@kernel.org    # 3.4+
  7. 01 June 2012, 1 commit
  8. 30 May 2012, 2 commits
  9. 07 May 2012, 2 commits
  10. 21 April 2012, 4 commits
  11. 29 March 2012, 2 commits
    • mm for fs: add truncate_pagecache_range() · 623e3db9
      Committed by Hugh Dickins
      Holepunching filesystems ext4 and xfs are using truncate_inode_pages_range
      but forgetting to unmap pages first (ocfs2 remembers).  This is not really
      a bug, since races already require truncate_inode_page() to handle that
      case once the page is locked; but it can be very inefficient if the file
      being punched happens to be mapped into many vmas.
      
      Provide a drop-in replacement truncate_pagecache_range() which does the
      unmapping pass first, handling the awkward mismatch between arguments to
      truncate_inode_pages_range() and arguments to unmap_mapping_range().
      
      Note that holepunching does not unmap privately COWed pages in the range:
      POSIX requires that we do so when truncating, but it's hard to justify,
      difficult to implement without an i_size cutoff, and no filesystem is
      attempting to implement it.
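
      A hedged sketch of a hole-punch path using the drop-in replacement
      (xyz_ is a placeholder; the range [lstart, lend] is in bytes):

        static void xyz_punch_hole(struct inode *inode, loff_t lstart, loff_t lend)
        {
                /* Unmaps the range from all vmas first, then truncates the
                 * page cache, avoiding the inefficiency described above. */
                truncate_pagecache_range(inode, lstart, lend);

                /* ... then release the underlying filesystem blocks ... */
        }
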
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Ben Myers <bpm@sgi.com>
      Cc: Alex Elder <elder@kernel.org>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Move all declarations of free_initmem() to linux/mm.h · 49a7f04a
      Committed by David Howells
      Move all declarations of free_initmem() to linux/mm.h so that there's only one
      and it's used by everything.
      Signed-off-by: David Howells <dhowells@redhat.com>
      cc: linux-c6x-dev@linux-c6x.org
      cc: microblaze-uclinux@itee.uq.edu.au
      cc: linux-sh@vger.kernel.org
      cc: sparclinux@vger.kernel.org
      cc: x86@kernel.org
      cc: linux-mm@kvack.org
  12. 24 March 2012, 1 commit
    • coredump: add VM_NODUMP, MADV_NODUMP, MADV_CLEAR_NODUMP · accb61fe
      Committed by Jason Baron
      Since we no longer need the VM_ALWAYSDUMP flag, let's use the freed bit
      for a 'VM_NODUMP' flag.  The idea is to add a new madvise() flag,
      MADV_DONTDUMP, which applications can set to specifically request that a
      memory region not be included in core dumps.

      The specific application I have in mind is qemu: we can add a flag there
      so that guest memory is not dumped when qemu dumps core.  This flag might
      also be useful for security-sensitive apps that want to make absolutely
      sure that parts of memory are not dumped.  To clear the flag, use
      MADV_DODUMP.
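
      A hedged userspace sketch of the resulting madvise() usage:

        #include <sys/mman.h>
        #include <stddef.h>

        /* Exclude a region (e.g. guest RAM in qemu) from core dumps, or
         * re-include it later. */
        static int set_region_dumpable(void *addr, size_t len, int dumpable)
        {
                return madvise(addr, len,
                               dumpable ? MADV_DODUMP : MADV_DONTDUMP);
        }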
      
      [akpm@linux-foundation.org: s/MADV_NODUMP/MADV_DONTDUMP/, s/MADV_CLEAR_NODUMP/MADV_DODUMP/, per Roland]
      [akpm@linux-foundation.org: fix up the architectures which broke]
      Signed-off-by: Jason Baron <jbaron@redhat.com>
      Acked-by: Roland McGrath <roland@hack.frob.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Avi Kivity <avi@redhat.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
      Cc: Helge Deller <deller@gmx.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>