1. 12 12月, 2012 2 次提交
    • M
      mm: augment vma rbtree with rb_subtree_gap · d3737187
      Michel Lespinasse 提交于
      Define vma->rb_subtree_gap as the largest gap between any vma in the
      subtree rooted at that vma, and their predecessor.  Or, for a recursive
      definition, vma->rb_subtree_gap is the max of:
      
       - vma->vm_start - vma->vm_prev->vm_end
       - rb_subtree_gap fields of the vmas pointed by vma->rb.rb_left and
         vma->rb.rb_right
      
      This will allow get_unmapped_area_* to find a free area of the right
      size in O(log(N)) time, instead of potentially having to do a linear
      walk across all the VMAs.
      
      Also define mm->highest_vm_end as the vm_end field of the highest vma,
      so that we can easily check if the following gap is suitable.
      
      This does have the potential to make unmapping VMAs more expensive,
      especially for processes with very large numbers of VMAs, where the VMA
      rbtree can grow quite deep.
      Signed-off-by: NMichel Lespinasse <walken@google.com>
      Reviewed-by: NRik van Riel <riel@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Paul Mundt <lethal@linux-sh.org>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d3737187
    • A
      mm: support more pagesizes for MAP_HUGETLB/SHM_HUGETLB · 42d7395f
      Andi Kleen 提交于
      There was some desire in large applications using MAP_HUGETLB or
      SHM_HUGETLB to use 1GB huge pages on some mappings, and stay with 2MB on
      others.  This is useful together with NUMA policy: use 2MB interleaving
      on some mappings, but 1GB on local mappings.
      
      This patch extends the IPC/SHM syscall interfaces slightly to allow
      specifying the page size.
      
      It borrows some upper bits in the existing flag arguments and allows
      encoding the log of the desired page size in addition to the *_HUGETLB
      flag.  When 0 is specified the default size is used, this makes the
      change fully compatible.
      
      Extending the internal hugetlb code to handle this is straight forward.
      Instead of a single mount it just keeps an array of them and selects the
      right mount based on the specified page size.  When no page size is
      specified it uses the mount of the default page size.
      
      The change is not visible in /proc/mounts because internal mounts don't
      appear there.  It also has very little overhead: the additional mounts
      just consume a super block, but not more memory when not used.
      
      I also exported the new flags to the user headers (they were previously
      under __KERNEL__).  Right now only symbols for x86 and some other
      architecture for 1GB and 2MB are defined.  The interface should already
      work for all other architectures though.  Only architectures that define
      multiple hugetlb sizes actually need it (that is currently x86, tile,
      powerpc).  However tile and powerpc have user configurable hugetlb
      sizes, so it's not easy to add defines.  A program on those
      architectures would need to query sysfs and use the appropiate log2.
      
      [akpm@linux-foundation.org: cleanups]
      [rientjes@google.com: fix build]
      [akpm@linux-foundation.org: checkpatch fixes]
      Signed-off-by: NAndi Kleen <ak@linux.intel.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Acked-by: NRik van Riel <riel@redhat.com>
      Acked-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Hillf Danton <dhillf@gmail.com>
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      42d7395f
  2. 17 11月, 2012 1 次提交
    • M
      mm: add anon_vma_lock to validate_mm() · 63c3b902
      Michel Lespinasse 提交于
      Iterating over the vma->anon_vma_chain without anon_vma_lock may cause
      NULL ptr deref in anon_vma_interval_tree_verify(), because the node in the
      chain might have been removed.
      
        BUG: unable to handle kernel paging request at fffffffffffffff0
        IP: [<ffffffff8122c29c>] anon_vma_interval_tree_verify+0xc/0xa0
        PGD 4e28067 PUD 4e29067 PMD 0
        Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
        CPU 0
        Pid: 9050, comm: trinity-child64 Tainted: G        W    3.7.0-rc2-next-20121025-sasha-00001-g673f98e-dirty #77
        RIP: 0010: anon_vma_interval_tree_verify+0xc/0xa0
        Process trinity-child64 (pid: 9050, threadinfo ffff880045f80000, task ffff880048eb0000)
        Call Trace:
          validate_mm+0x58/0x1e0
          vma_adjust+0x635/0x6b0
          __split_vma.isra.22+0x161/0x220
          split_vma+0x24/0x30
          sys_madvise+0x5da/0x7b0
          tracesys+0xe1/0xe6
        RIP  anon_vma_interval_tree_verify+0xc/0xa0
        CR2: fffffffffffffff0
      
      Figured out by Bob Liu.
      Reported-by: NSasha Levin <sasha.levin@oracle.com>
      Cc: Bob Liu <lliubbo@gmail.com>
      Signed-off-by: NMichel Lespinasse <walken@google.com>
      Reviewed-by: NRik van Riel <riel@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      63c3b902
  3. 09 10月, 2012 12 次提交
    • M
      mm: avoid taking rmap locks in move_ptes() · 38a76013
      Michel Lespinasse 提交于
      During mremap(), the destination VMA is generally placed after the
      original vma in rmap traversal order: in move_vma(), we always have
      new_pgoff >= vma->vm_pgoff, and as a result new_vma->vm_pgoff >=
      vma->vm_pgoff unless vma_merge() merged the new vma with an adjacent one.
      
      When the destination VMA is placed after the original in rmap traversal
      order, we can avoid taking the rmap locks in move_ptes().
      
      Essentially, this reintroduces the optimization that had been disabled in
      "mm anon rmap: remove anon_vma_moveto_tail".  The difference is that we
      don't try to impose the rmap traversal order; instead we just rely on
      things being in the desired order in the common case and fall back to
      taking locks in the uncommon case.  Also we skip the i_mmap_mutex in
      addition to the anon_vma lock: in both cases, the vmas are traversed in
      increasing vm_pgoff order with ties resolved in tree insertion order.
      Signed-off-by: NMichel Lespinasse <walken@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Daniel Santos <daniel.santos@pobox.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      38a76013
    • M
      mm anon rmap: in mremap, set the new vma's position before anon_vma_clone() · 523d4e20
      Michel Lespinasse 提交于
      anon_vma_clone() expects new_vma->vm_{start,end,pgoff} to be correctly set
      so that the new vma can be indexed on the anon interval tree.
      
      copy_vma() was failing to do that, which broke mremap().
      Signed-off-by: NMichel Lespinasse <walken@google.com>
      Cc: Jiri Slaby <jslaby@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Tested-by: NSasha Levin <levinsasha928@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      523d4e20
    • M
      mm: add CONFIG_DEBUG_VM_RB build option · ed8ea815
      Michel Lespinasse 提交于
      Add a CONFIG_DEBUG_VM_RB build option for the previously existing
      DEBUG_MM_RB code.  Now that Andi Kleen modified it to avoid using
      recursive algorithms, we can expose it a bit more.
      
      Also extend this code to validate_mm() after stack expansion, and to check
      that the vma's start and last pgoffs have not changed since the nodes were
      inserted on the anon vma interval tree (as it is important that the nodes
      be reindexed after each such update).
      Signed-off-by: NMichel Lespinasse <walken@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Daniel Santos <daniel.santos@pobox.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ed8ea815
    • M
      mm anon rmap: replace same_anon_vma linked list with an interval tree. · bf181b9f
      Michel Lespinasse 提交于
      When a large VMA (anon or private file mapping) is first touched, which
      will populate its anon_vma field, and then split into many regions through
      the use of mprotect(), the original anon_vma ends up linking all of the
      vmas on a linked list.  This can cause rmap to become inefficient, as we
      have to walk potentially thousands of irrelevent vmas before finding the
      one a given anon page might fall into.
      
      By replacing the same_anon_vma linked list with an interval tree (where
      each avc's interval is determined by its vma's start and last pgoffs), we
      can make rmap efficient for this use case again.
      
      While the change is large, all of its pieces are fairly simple.
      
      Most places that were walking the same_anon_vma list were looking for a
      known pgoff, so they can just use the anon_vma_interval_tree_foreach()
      interval tree iterator instead.  The exception here is ksm, where the
      page's index is not known.  It would probably be possible to rework ksm so
      that the index would be known, but for now I have decided to keep things
      simple and just walk the entirety of the interval tree there.
      
      When updating vma's that already have an anon_vma assigned, we must take
      care to re-index the corresponding avc's on their interval tree.  This is
      done through the use of anon_vma_interval_tree_pre_update_vma() and
      anon_vma_interval_tree_post_update_vma(), which remove the avc's from
      their interval tree before the update and re-insert them after the update.
       The anon_vma stays locked during the update, so there is no chance that
      rmap would miss the vmas that are being updated.
      Signed-off-by: NMichel Lespinasse <walken@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Daniel Santos <daniel.santos@pobox.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      bf181b9f
    • M
      mm anon rmap: remove anon_vma_moveto_tail · 108d6642
      Michel Lespinasse 提交于
      mremap() had a clever optimization where move_ptes() did not take the
      anon_vma lock to avoid a race with anon rmap users such as page migration.
       Instead, the avc's were ordered in such a way that the origin vma was
      always visited by rmap before the destination.  This ordering and the use
      of page table locks rmap usage safe.  However, we want to replace the use
      of linked lists in anon rmap with an interval tree, and this will make it
      harder to impose such ordering as the interval tree will always be sorted
      by the avc->vma->vm_pgoff value.  For now, let's replace the
      anon_vma_moveto_tail() ordering function with proper anon_vma locking in
      move_ptes().  Once we have the anon interval tree in place, we will
      re-introduce an optimization to avoid taking these locks in the most
      common cases.
      Signed-off-by: NMichel Lespinasse <walken@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Daniel Santos <daniel.santos@pobox.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      108d6642
    • M
      mm: replace vma prio_tree with an interval tree · 6b2dbba8
      Michel Lespinasse 提交于
      Implement an interval tree as a replacement for the VMA prio_tree.  The
      algorithms are similar to lib/interval_tree.c; however that code can't be
      directly reused as the interval endpoints are not explicitly stored in the
      VMA.  So instead, the common algorithm is moved into a template and the
      details (node type, how to get interval endpoints from the node, etc) are
      filled in using the C preprocessor.
      
      Once the interval tree functions are available, using them as a
      replacement to the VMA prio tree is a relatively simple, mechanical job.
      Signed-off-by: NMichel Lespinasse <walken@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Hillf Danton <dhillf@gmail.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6b2dbba8
    • M
      mm: fix potential anon_vma locking issue in mprotect() · ca42b26a
      Michel Lespinasse 提交于
      Fix an anon_vma locking issue in the following situation:
      
      - vma has no anon_vma
      - next has an anon_vma
      - vma is being shrunk / next is being expanded, due to an mprotect call
      
      We need to take next's anon_vma lock to avoid races with rmap users (such
      as page migration) while next is being expanded.
      Signed-off-by: NMichel Lespinasse <walken@google.com>
      Reviewed-by: NAndrea Arcangeli <aarcange@redhat.com>
      Acked-by: NHugh Dickins <hughd@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ca42b26a
    • H
      mm/mmap.c: replace find_vma_prepare() with clearer find_vma_links() · 6597d783
      Hugh Dickins 提交于
      People get confused by find_vma_prepare(), because it doesn't care about
      what it returns in its output args, when its callers won't be interested.
      
      Clarify by passing in end-of-range address too, and returning failure if
      any existing vma overlaps the new range: instead of returning an ambiguous
      vma which most callers then must check.  find_vma_links() is a clearer
      name.
      
      This does revert 2.6.27's dfe195fb ("mm: fix uninitialized variables
      for find_vma_prepare callers"), but it looks like gcc 4.3.0 was one of
      those releases too eager to shout about uninitialized variables: only
      copy_vma() warns with 4.5.1 and 4.7.1, which a BUG on error silences.
      
      [hughd@google.com: fix warning, remove BUG()]
      Signed-off-by: NHugh Dickins <hughd@google.com>
      Cc: Benny Halevy <bhalevy@tonian.com>
      Acked-by: NHillf Danton <dhillf@gmail.com>
      Signed-off-by: NHugh Dickins <hughd@google.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6597d783
    • K
      mm: kill vma flag VM_RESERVED and mm->reserved_vm counter · 314e51b9
      Konstantin Khlebnikov 提交于
      A long time ago, in v2.4, VM_RESERVED kept swapout process off VMA,
      currently it lost original meaning but still has some effects:
      
       | effect                 | alternative flags
      -+------------------------+---------------------------------------------
      1| account as reserved_vm | VM_IO
      2| skip in core dump      | VM_IO, VM_DONTDUMP
      3| do not merge or expand | VM_IO, VM_DONTEXPAND, VM_HUGETLB, VM_PFNMAP
      4| do not mlock           | VM_IO, VM_DONTEXPAND, VM_HUGETLB, VM_PFNMAP
      
      This patch removes reserved_vm counter from mm_struct.  Seems like nobody
      cares about it, it does not exported into userspace directly, it only
      reduces total_vm showed in proc.
      
      Thus VM_RESERVED can be replaced with VM_IO or pair VM_DONTEXPAND | VM_DONTDUMP.
      
      remap_pfn_range() and io_remap_pfn_range() set VM_IO|VM_DONTEXPAND|VM_DONTDUMP.
      remap_vmalloc_range() set VM_DONTEXPAND | VM_DONTDUMP.
      
      [akpm@linux-foundation.org: drivers/vfio/pci/vfio_pci.c fixup]
      Signed-off-by: NKonstantin Khlebnikov <khlebnikov@openvz.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Carsten Otte <cotte@de.ibm.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Eric Paris <eparis@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: James Morris <james.l.morris@oracle.com>
      Cc: Jason Baron <jbaron@redhat.com>
      Cc: Kentaro Takeda <takedakn@nttdata.co.jp>
      Cc: Matt Helsley <matthltc@us.ibm.com>
      Cc: Nick Piggin <npiggin@kernel.dk>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Robert Richter <robert.richter@amd.com>
      Cc: Suresh Siddha <suresh.b.siddha@intel.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Venkatesh Pallipadi <venki@google.com>
      Acked-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      314e51b9
    • K
      mm: kill vma flag VM_EXECUTABLE and mm->num_exe_file_vmas · e9714acf
      Konstantin Khlebnikov 提交于
      Currently the kernel sets mm->exe_file during sys_execve() and then tracks
      number of vmas with VM_EXECUTABLE flag in mm->num_exe_file_vmas, as soon
      as this counter drops to zero kernel resets mm->exe_file to NULL.  Plus it
      resets mm->exe_file at last mmput() when mm->mm_users drops to zero.
      
      VMA with VM_EXECUTABLE flag appears after mapping file with flag
      MAP_EXECUTABLE, such vmas can appears only at sys_execve() or after vma
      splitting, because sys_mmap ignores this flag.  Usually binfmt module sets
      mm->exe_file and mmaps executable vmas with this file, they hold
      mm->exe_file while task is running.
      
      comment from v2.6.25-6245-g925d1c40 ("procfs task exe symlink"),
      where all this stuff was introduced:
      
      > The kernel implements readlink of /proc/pid/exe by getting the file from
      > the first executable VMA.  Then the path to the file is reconstructed and
      > reported as the result.
      >
      > Because of the VMA walk the code is slightly different on nommu systems.
      > This patch avoids separate /proc/pid/exe code on nommu systems.  Instead of
      > walking the VMAs to find the first executable file-backed VMA we store a
      > reference to the exec'd file in the mm_struct.
      >
      > That reference would prevent the filesystem holding the executable file
      > from being unmounted even after unmapping the VMAs.  So we track the number
      > of VM_EXECUTABLE VMAs and drop the new reference when the last one is
      > unmapped.  This avoids pinning the mounted filesystem.
      
      exe_file's vma accounting is hooked into every file mmap/unmmap and vma
      split/merge just to fix some hypothetical pinning fs from umounting by mm,
      which already unmapped all its executable files, but still alive.
      
      Seems like currently nobody depends on this behaviour.  We can try to
      remove this logic and keep mm->exe_file until final mmput().
      
      mm->exe_file is still protected with mm->mmap_sem, because we want to
      change it via new sys_prctl(PR_SET_MM_EXE_FILE).  Also via this syscall
      task can change its mm->exe_file and unpin mountpoint explicitly.
      Signed-off-by: NKonstantin Khlebnikov <khlebnikov@openvz.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Carsten Otte <cotte@de.ibm.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Eric Paris <eparis@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: James Morris <james.l.morris@oracle.com>
      Cc: Jason Baron <jbaron@redhat.com>
      Cc: Kentaro Takeda <takedakn@nttdata.co.jp>
      Cc: Matt Helsley <matthltc@us.ibm.com>
      Cc: Nick Piggin <npiggin@kernel.dk>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Robert Richter <robert.richter@amd.com>
      Cc: Suresh Siddha <suresh.b.siddha@intel.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Venkatesh Pallipadi <venki@google.com>
      Acked-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e9714acf
    • K
      mm: kill vma flag VM_CAN_NONLINEAR · 0b173bc4
      Konstantin Khlebnikov 提交于
      Move actual pte filling for non-linear file mappings into the new special
      vma operation: ->remap_pages().
      
      Filesystems must implement this method to get non-linear mapping support,
      if it uses filemap_fault() then generic_file_remap_pages() can be used.
      
      Now device drivers can implement this method and obtain nonlinear vma support.
      Signed-off-by: NKonstantin Khlebnikov <khlebnikov@openvz.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Carsten Otte <cotte@de.ibm.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>	#arch/tile
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Eric Paris <eparis@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: James Morris <james.l.morris@oracle.com>
      Cc: Jason Baron <jbaron@redhat.com>
      Cc: Kentaro Takeda <takedakn@nttdata.co.jp>
      Cc: Matt Helsley <matthltc@us.ibm.com>
      Cc: Nick Piggin <npiggin@kernel.dk>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Robert Richter <robert.richter@amd.com>
      Cc: Suresh Siddha <suresh.b.siddha@intel.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Venkatesh Pallipadi <venki@google.com>
      Acked-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0b173bc4
    • K
      mm: kill vma flag VM_INSERTPAGE · 4b6e1e37
      Konstantin Khlebnikov 提交于
      Merge VM_INSERTPAGE into VM_MIXEDMAP.  VM_MIXEDMAP VMA can mix pure-pfn
      ptes, special ptes and normal ptes.
      
      Now copy_page_range() always copies VM_MIXEDMAP VMA on fork like
      VM_PFNMAP.  If driver populates whole VMA at mmap() it probably not
      expects page-faults.
      
      This patch removes special check from vma_wants_writenotify() which
      disables pages write tracking for VMA populated via vm_instert_page().
      BDI below mapped file should not use dirty-accounting, moreover
      do_wp_page() can handle this.
      
      vm_insert_page() still marks vma after first usage.  Usually it is called
      from f_op->mmap() handler under mm->mmap_sem write-lock, so it able to
      change vma->vm_flags.  Caller must set VM_MIXEDMAP at mmap time if it
      wants to call this function from other places, for example from page-fault
      handler.
      Signed-off-by: NKonstantin Khlebnikov <khlebnikov@openvz.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Carsten Otte <cotte@de.ibm.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Eric Paris <eparis@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: James Morris <james.l.morris@oracle.com>
      Cc: Jason Baron <jbaron@redhat.com>
      Cc: Kentaro Takeda <takedakn@nttdata.co.jp>
      Cc: Matt Helsley <matthltc@us.ibm.com>
      Cc: Nick Piggin <npiggin@kernel.dk>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Robert Richter <robert.richter@amd.com>
      Cc: Suresh Siddha <suresh.b.siddha@intel.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Venkatesh Pallipadi <venki@google.com>
      Acked-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4b6e1e37
  4. 27 9月, 2012 1 次提交
  5. 22 8月, 2012 1 次提交
    • H
      mm: change nr_ptes BUG_ON to WARN_ON · f9aed62a
      Hugh Dickins 提交于
      Occasionally an isolated BUG_ON(mm->nr_ptes) gets reported, indicating
      that not all the page tables allocated could be found and freed when
      exit_mmap() tore down the user address space.
      
      There's usually nothing we can say about it, beyond that it's probably a
      sign of some bad memory or memory corruption; though it might still
      indicate a bug in vma or page table management (and did recently reveal a
      race in THP, fixed a few months ago).
      
      But one overdue change we can make is from BUG_ON to WARN_ON.
      
      It's fairly likely that the system will crash shortly afterwards in some
      other way (for example, the BUG_ON(page_mapped(page)) in
      __delete_from_page_cache(), once an inode mapped into the lost page tables
      gets evicted); but might tell us more before that.
      
      Change the BUG_ON(page_mapped) to WARN_ON too?  Later perhaps: I'm less
      eager, since that one has several times led to fixes.
      Signed-off-by: NHugh Dickins <hughd@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f9aed62a
  6. 21 8月, 2012 1 次提交
  7. 01 8月, 2012 1 次提交
  8. 30 7月, 2012 2 次提交
    • O
      uprobes: Remove insert_vm_struct()->uprobe_mmap() · 89133786
      Oleg Nesterov 提交于
      Remove insert_vm_struct()->uprobe_mmap(). It is not needed, nobody
      except arch/ia64/kernel/perfmon.c uses insert_vm_struct(vma)
      with vma->vm_file != NULL.
      
      And it is wrong. Again, get_user_pages() can not succeed before
      vma_link(vma) makes is visible to find_vma(). And even if this
      worked, we must not insert the new bp before this mapping is
      visible to vma_prio_tree_foreach() for uprobe_unregister().
      Signed-off-by: NOleg Nesterov <oleg@redhat.com>
      Acked-by: Srikar Dronamraju <srikar.vnet.ibm.com>
      Cc: Anton Arapov <anton@redhat.com>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Link: http://lkml.kernel.org/r/20120729182238.GA20349@redhat.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      89133786
    • O
      uprobes: Remove copy_vma()->uprobe_mmap() · 6dab3cc0
      Oleg Nesterov 提交于
      Remove copy_vma()->uprobe_mmap(new_vma), it is absolutely wrong.
      
      This new_vma was just initialized to represent the new unmapped
      area, [vm_start, vm_end) was returned by get_unmapped_area() in
      the caller.
      
      This means that uprobe_mmap()->get_user_pages() will fail for
      sure, simply because find_vma() can never succeed. And I
      verified that sys_mremap()->mremap_to() indeed always fails with
      the wrong ENOMEM code if [addr, addr+old_len] is probed.
      
      And why this uprobe_mmap() was added? I believe the intent was
      wrong. Note that the caller is going to do move_page_tables(),
      all registered uprobes are already faulted in, we only change
      the virtual addresses.
      
      NOTE: However, somehow we need to close the race with
      uprobe_register() which relies on map_info->vaddr. This needs
      another fix I'll try to do later. Probably we need uprobe_mmap()
      in move_vma() but we can not do this right now, this can confuse
      uprobes_state.counter (which I still hope we are going to kill).
      Signed-off-by: NOleg Nesterov <oleg@redhat.com>
      Acked-by: Srikar Dronamraju <srikar.vnet.ibm.com>
      Cc: Anton Arapov <anton@redhat.com>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Link: http://lkml.kernel.org/r/20120729182236.GA20342@redhat.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      6dab3cc0
  9. 01 6月, 2012 7 次提交
  10. 31 5月, 2012 1 次提交
  11. 30 5月, 2012 1 次提交
  12. 07 5月, 2012 2 次提交
  13. 21 4月, 2012 4 次提交
  14. 14 4月, 2012 1 次提交
    • S
      uprobes/core: Decrement uprobe count before the pages are unmapped · cbc91f71
      Srikar Dronamraju 提交于
      Uprobes has a callback (uprobe_munmap()) in the unmap path to
      maintain the uprobes count.
      
      In the exit path this callback gets called in unlink_file_vma().
      However by the time unlink_file_vma() is called, the pages would
      have been unmapped (in unmap_vmas()) and the task->rss_stat counts
      accounted (in zap_pte_range()).
      
      If the exiting process has probepoints, uprobe_munmap() checks if
      the breakpoint instruction was around before decrementing the probe
      count.
      
      This results in a file backed page being reread by uprobe_munmap()
      and hence it does not find the breakpoint.
      
      This patch fixes this problem by moving the callback to
      unmap_single_vma(). Since unmap_single_vma() may not unmap the
      complete vma, add start and end parameters to uprobe_munmap().
      
      This bug became apparent courtesy of commit c3f0327f
      ("mm: add rss counters consistency check").
      Signed-off-by: NSrikar Dronamraju <srikar@linux.vnet.ibm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
      Cc: Jim Keniston <jkenisto@linux.vnet.ibm.com>
      Cc: Linux-mm <linux-mm@kvack.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
      Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
      Cc: Anton Arapov <anton@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/20120411103527.23245.9835.sendpatchset@srdronam.in.ibm.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      cbc91f71
  15. 31 3月, 2012 1 次提交
    • S
      uprobes/core: Optimize probe hits with the help of a counter · 682968e0
      Srikar Dronamraju 提交于
      Maintain a per-mm counter: number of uprobes that are inserted
      on this process address space.
      
      This counter can be used at probe hit time to determine if we
      need a lookup in the uprobes rbtree. Everytime a probe gets
      inserted successfully, the probe count is incremented and
      everytime a probe gets removed, the probe count is decremented.
      
      The new uprobe_munmap hook ensures the count is correct on a
      unmap or remap of a region. We expect that once a
      uprobe_munmap() is called, the vma goes away.  So
      uprobe_unregister() finding a probe to unregister would either
      mean unmap event hasnt occurred yet or a mmap event on the same
      executable file occured after a unmap event.
      
      Additionally, uprobe_mmap hook now also gets called:
      
       a. on every executable vma that is COWed at fork.
       b. a vma of interest is newly mapped; breakpoint insertion also
          happens at the required address.
      
      On process creation, make sure the probes count in the child is
      set correctly.
      
      Special cases that are taken care include:
      
       a. mremap
       b. VM_DONTCOPY vmas on fork()
       c. insertion/removal races in the parent during fork().
      Signed-off-by: NSrikar Dronamraju <srikar@linux.vnet.ibm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
      Cc: Jim Keniston <jkenisto@linux.vnet.ibm.com>
      Cc: Linux-mm <linux-mm@kvack.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
      Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
      Cc: Anton Arapov <anton@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/20120330182646.10018.85805.sendpatchset@srdronam.in.ibm.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      682968e0
  16. 22 3月, 2012 2 次提交