1. 15 2月, 2016 1 次提交
    • A
      powerpc/mm: Fix Multi hit ERAT cause by recent THP update · c777e2a8
      Aneesh Kumar K.V 提交于
      With ppc64 we use the deposited pgtable_t to store the hash pte slot
      information. We should not withdraw the deposited pgtable_t without
      marking the pmd none. This ensure that low level hash fault handling
      will skip this huge pte and we will handle them at upper levels.
      
      Recent change to pmd splitting changed the above in order to handle the
      race between pmd split and exit_mmap. The race is explained below.
      
      Consider following race:
      
      		CPU0				CPU1
      shrink_page_list()
        add_to_swap()
          split_huge_page_to_list()
            __split_huge_pmd_locked()
              pmdp_huge_clear_flush_notify()
      	// pmd_none() == true
      					exit_mmap()
      					  unmap_vmas()
      					    zap_pmd_range()
      					      // no action on pmd since pmd_none() == true
      	pmd_populate()
      
      As result the THP will not be freed. The leak is detected by check_mm():
      
      	BUG: Bad rss-counter state mm:ffff880058d2e580 idx:1 val:512
      
      The above required us to not mark pmd none during a pmd split.
      
      The fix for ppc is to clear the huge pte of _PAGE_USER, so that low
      level fault handling code skip this pte. At higher level we do take ptl
      lock. That should serialze us against the pmd split. Once the lock is
      acquired we do check the pmd again using pmd_same. That should always
      return false for us and hence we should retry the access. We do the
      pmd_same check in all case after taking plt with
      THP (do_huge_pmd_wp_page, do_huge_pmd_numa_page and
      huge_pmd_set_accessed)
      
      Also make sure we wait for irq disable section in other cpus to finish
      before flipping a huge pte entry with a regular pmd entry. Code paths
      like find_linux_pte_or_hugepte depend on irq disable to get
      a stable pte_t pointer. A parallel thp split need to make sure we
      don't convert a pmd pte to a regular pmd entry without waiting for the
      irq disable section to finish.
      
      Fixes: eef1b3ba ("thp: implement split_huge_pmd()")
      Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Signed-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      c777e2a8
  2. 16 1月, 2016 2 次提交
  3. 05 1月, 2016 1 次提交
    • T
      x86/mm/pat: Add untrack_pfn_moved for mremap · d9fe4fab
      Toshi Kani 提交于
      mremap() with MREMAP_FIXED on a VM_PFNMAP range causes the following
      WARN_ON_ONCE() message in untrack_pfn().
      
        WARNING: CPU: 1 PID: 3493 at arch/x86/mm/pat.c:985 untrack_pfn+0xbd/0xd0()
        Call Trace:
        [<ffffffff817729ea>] dump_stack+0x45/0x57
        [<ffffffff8109e4b6>] warn_slowpath_common+0x86/0xc0
        [<ffffffff8109e5ea>] warn_slowpath_null+0x1a/0x20
        [<ffffffff8106a88d>] untrack_pfn+0xbd/0xd0
        [<ffffffff811d2d5e>] unmap_single_vma+0x80e/0x860
        [<ffffffff811d3725>] unmap_vmas+0x55/0xb0
        [<ffffffff811d916c>] unmap_region+0xac/0x120
        [<ffffffff811db86a>] do_munmap+0x28a/0x460
        [<ffffffff811dec33>] move_vma+0x1b3/0x2e0
        [<ffffffff811df113>] SyS_mremap+0x3b3/0x510
        [<ffffffff817793ee>] entry_SYSCALL_64_fastpath+0x12/0x71
      
      MREMAP_FIXED moves a pfnmap from old vma to new vma.  untrack_pfn() is
      called with the old vma after its pfnmap page table has been removed,
      which causes follow_phys() to fail.  The new vma has a new pfnmap to
      the same pfn & cache type with VM_PAT set.  Therefore, we only need to
      clear VM_PAT from the old vma in this case.
      
      Add untrack_pfn_moved(), which clears VM_PAT from a given old vma.
      move_vma() is changed to call this function with the old vma when
      VM_PFNMAP is set.  move_vma() then calls do_munmap(), and untrack_pfn()
      is a no-op since VM_PAT is cleared.
      Reported-by: NStas Sergeev <stsp@list.ru>
      Signed-off-by: NToshi Kani <toshi.kani@hpe.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: linux-mm@kvack.org
      Link: http://lkml.kernel.org/r/1450832064-10093-2-git-send-email-toshi.kani@hpe.comSigned-off-by: NThomas Gleixner <tglx@linutronix.de>
      d9fe4fab
  4. 17 10月, 2015 1 次提交
  5. 14 10月, 2015 1 次提交
  6. 25 6月, 2015 3 次提交
  7. 07 6月, 2015 1 次提交
  8. 15 4月, 2015 3 次提交
    • T
      mm: change vunmap to tear down huge KVA mappings · b9820d8f
      Toshi Kani 提交于
      Change vunmap_pmd_range() and vunmap_pud_range() to tear down huge KVA
      mappings when they are set.  pud_clear_huge() and pmd_clear_huge() return
      zero when no-operation is performed, i.e.  huge page mapping was not used.
      
      These changes are only enabled when CONFIG_HAVE_ARCH_HUGE_VMAP is defined
      on the architecture.
      
      [akpm@linux-foundation.org: use consistent code layout]
      Signed-off-by: NToshi Kani <toshi.kani@hp.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Robert Elliott <Elliott@hp.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b9820d8f
    • T
      mm: change ioremap to set up huge I/O mappings · e61ce6ad
      Toshi Kani 提交于
      ioremap_pud_range() and ioremap_pmd_range() are changed to create huge I/O
      mappings when their capability is enabled, and a request meets required
      conditions -- both virtual & physical addresses are aligned by their huge
      page size, and a requested range fufills their huge page size.  When
      pud_set_huge() or pmd_set_huge() returns zero, i.e.  no-operation is
      performed, the code simply falls back to the next level.
      
      The changes are only enabled when CONFIG_HAVE_ARCH_HUGE_VMAP is defined on
      the architecture.
      Signed-off-by: NToshi Kani <toshi.kani@hp.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Robert Elliott <Elliott@hp.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e61ce6ad
    • K
      mm: define default PGTABLE_LEVELS to two · 235a8f02
      Kirill A. Shutemov 提交于
      By this time all architectures which support more than two page table
      levels should be covered.  This patch add default definiton of
      PGTABLE_LEVELS equal 2.
      
      We also add assert to detect inconsistence between CONFIG_PGTABLE_LEVELS
      and __PAGETABLE_PMD_FOLDED/__PAGETABLE_PUD_FOLDED.
      Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Tested-by: NGuenter Roeck <linux@roeck-us.net>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Chris Metcalf <cmetcalf@ezchip.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Helge Deller <deller@gmx.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jeff Dike <jdike@addtoit.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Koichi Yasutake <yasutake.koichi@jp.panasonic.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      235a8f02
  9. 13 2月, 2015 2 次提交
    • M
      mm: remove remaining references to NUMA hinting bits and helpers · 21d9ee3e
      Mel Gorman 提交于
      This patch removes the NUMA PTE bits and associated helpers.  As a
      side-effect it increases the maximum possible swap space on x86-64.
      
      One potential source of problems is races between the marking of PTEs
      PROT_NONE, NUMA hinting faults and migration.  It must be guaranteed that
      a PTE being protected is not faulted in parallel, seen as a pte_none and
      corrupting memory.  The base case is safe but transhuge has problems in
      the past due to an different migration mechanism and a dependance on page
      lock to serialise migrations and warrants a closer look.
      
      task_work hinting update			parallel fault
      ------------------------			--------------
      change_pmd_range
        change_huge_pmd
          __pmd_trans_huge_lock
            pmdp_get_and_clear
      						__handle_mm_fault
      						pmd_none
      						  do_huge_pmd_anonymous_page
      						  read? pmd_lock blocks until hinting complete, fail !pmd_none test
      						  write? __do_huge_pmd_anonymous_page acquires pmd_lock, checks pmd_none
            pmd_modify
            set_pmd_at
      
      task_work hinting update			parallel migration
      ------------------------			------------------
      change_pmd_range
        change_huge_pmd
          __pmd_trans_huge_lock
            pmdp_get_and_clear
      						__handle_mm_fault
      						  do_huge_pmd_numa_page
      						    migrate_misplaced_transhuge_page
      						    pmd_lock waits for updates to complete, recheck pmd_same
            pmd_modify
            set_pmd_at
      
      Both of those are safe and the case where a transhuge page is inserted
      during a protection update is unchanged.  The case where two processes try
      migrating at the same time is unchanged by this series so should still be
      ok.  I could not find a case where we are accidentally depending on the
      PTE not being cleared and flushed.  If one is missed, it'll manifest as
      corruption problems that start triggering shortly after this series is
      merged and only happen when NUMA balancing is enabled.
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Tested-by: NSasha Levin <sasha.levin@oracle.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Dave Jones <davej@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Kirill Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mark Brown <broonie@kernel.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      21d9ee3e
    • M
      mm: add p[te|md] protnone helpers for use by NUMA balancing · e7bb4b6d
      Mel Gorman 提交于
      This is a preparatory patch that introduces protnone helpers for automatic
      NUMA balancing.
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Acked-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Acked-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Tested-by: NSasha Levin <sasha.levin@oracle.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Dave Jones <davej@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Kirill Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e7bb4b6d
  10. 11 2月, 2015 1 次提交
  11. 27 10月, 2014 1 次提交
  12. 14 10月, 2014 1 次提交
    • P
      mm: softdirty: enable write notifications on VMAs after VM_SOFTDIRTY cleared · 64e45507
      Peter Feiner 提交于
      For VMAs that don't want write notifications, PTEs created for read faults
      have their write bit set.  If the read fault happens after VM_SOFTDIRTY is
      cleared, then the PTE's softdirty bit will remain clear after subsequent
      writes.
      
      Here's a simple code snippet to demonstrate the bug:
      
        char* m = mmap(NULL, getpagesize(), PROT_READ | PROT_WRITE,
                       MAP_ANONYMOUS | MAP_SHARED, -1, 0);
        system("echo 4 > /proc/$PPID/clear_refs"); /* clear VM_SOFTDIRTY */
        assert(*m == '\0');     /* new PTE allows write access */
        assert(!soft_dirty(x));
        *m = 'x';               /* should dirty the page */
        assert(soft_dirty(x));  /* fails */
      
      With this patch, write notifications are enabled when VM_SOFTDIRTY is
      cleared.  Furthermore, to avoid unnecessary faults, write notifications
      are disabled when VM_SOFTDIRTY is set.
      
      As a side effect of enabling and disabling write notifications with
      care, this patch fixes a bug in mprotect where vm_page_prot bits set by
      drivers were zapped on mprotect.  An analogous bug was fixed in mmap by
      commit c9d0bf24 ("mm: uncached vma support with writenotify").
      Signed-off-by: NPeter Feiner <pfeiner@google.com>
      Reported-by: NPeter Feiner <pfeiner@google.com>
      Suggested-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Jamie Liu <jamieliu@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Bjorn Helgaas <bhelgaas@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      64e45507
  13. 10 10月, 2014 1 次提交
    • M
      mm: remove misleading ARCH_USES_NUMA_PROT_NONE · 6a33979d
      Mel Gorman 提交于
      ARCH_USES_NUMA_PROT_NONE was defined for architectures that implemented
      _PAGE_NUMA using _PROT_NONE.  This saved using an additional PTE bit and
      relied on the fact that PROT_NONE vmas were skipped by the NUMA hinting
      fault scanner.  This was found to be conceptually confusing with a lot of
      implicit assumptions and it was asked that an alternative be found.
      
      Commit c46a7c81 "x86: define _PAGE_NUMA by reusing software bits on the
      PMD and PTE levels" redefined _PAGE_NUMA on x86 to be one of the swap PTE
      bits and shrunk the maximum possible swap size but it did not go far
      enough.  There are no architectures that reuse _PROT_NONE as _PROT_NUMA
      but the relics still exist.
      
      This patch removes ARCH_USES_NUMA_PROT_NONE and removes some unnecessary
      duplication in powerpc vs the generic implementation by defining the types
      the core NUMA helpers expected to exist from x86 with their ppc64
      equivalent.  This necessitated that a PTE bit mask be created that
      identified the bits that distinguish present from NUMA pte entries but it
      is expected this will only differ between arches based on _PAGE_PROTNONE.
      The naming for the generic helpers was taken from x86 originally but ppc64
      has types that are equivalent for the purposes of the helper so they are
      mapped instead of duplicating code.
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Cyrill Gorcunov <gorcunov@gmail.com>
      Reviewed-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6a33979d
  14. 01 10月, 2014 1 次提交
  15. 05 6月, 2014 1 次提交
    • M
      x86: define _PAGE_NUMA by reusing software bits on the PMD and PTE levels · c46a7c81
      Mel Gorman 提交于
      _PAGE_NUMA is currently an alias of _PROT_PROTNONE to trap NUMA hinting
      faults on x86.  Care is taken such that _PAGE_NUMA is used only in
      situations where the VMA flags distinguish between NUMA hinting faults
      and prot_none faults.  This decision was x86-specific and conceptually
      it is difficult requiring special casing to distinguish between PROTNONE
      and NUMA ptes based on context.
      
      Fundamentally, we only need the _PAGE_NUMA bit to tell the difference
      between an entry that is really unmapped and a page that is protected
      for NUMA hinting faults as if the PTE is not present then a fault will
      be trapped.
      
      Swap PTEs on x86-64 use the bits after _PAGE_GLOBAL for the offset.
      This patch shrinks the maximum possible swap size and uses the bit to
      uniquely distinguish between NUMA hinting ptes and swap ptes.
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Cc: David Vrabel <david.vrabel@citrix.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Peter Anvin <hpa@zytor.com>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Steven Noonan <steven@uplinklabs.net>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Cc: Cyrill Gorcunov <gorcunov@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c46a7c81
  16. 19 4月, 2014 1 次提交
    • M
      mm: use paravirt friendly ops for NUMA hinting ptes · 29c77870
      Mel Gorman 提交于
      David Vrabel identified a regression when using automatic NUMA balancing
      under Xen whereby page table entries were getting corrupted due to the
      use of native PTE operations.  Quoting him
      
      	Xen PV guest page tables require that their entries use machine
      	addresses if the preset bit (_PAGE_PRESENT) is set, and (for
      	successful migration) non-present PTEs must use pseudo-physical
      	addresses.  This is because on migration MFNs in present PTEs are
      	translated to PFNs (canonicalised) so they may be translated back
      	to the new MFN in the destination domain (uncanonicalised).
      
      	pte_mknonnuma(), pmd_mknonnuma(), pte_mknuma() and pmd_mknuma()
      	set and clear the _PAGE_PRESENT bit using pte_set_flags(),
      	pte_clear_flags(), etc.
      
      	In a Xen PV guest, these functions must translate MFNs to PFNs
      	when clearing _PAGE_PRESENT and translate PFNs to MFNs when setting
      	_PAGE_PRESENT.
      
      His suggested fix converted p[te|md]_[set|clear]_flags to using
      paravirt-friendly ops but this is overkill.  He suggested an alternative
      of using p[te|md]_modify in the NUMA page table operations but this is
      does more work than necessary and would require looking up a VMA for
      protections.
      
      This patch modifies the NUMA page table operations to use paravirt
      friendly operations to set/clear the flags of interest.  Unfortunately
      this will take a performance hit when updating the PTEs on
      CONFIG_PARAVIRT but I do not see a way around it that does not break
      Xen.
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Acked-by: NDavid Vrabel <david.vrabel@citrix.com>
      Tested-by: NDavid Vrabel <david.vrabel@citrix.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Peter Anvin <hpa@zytor.com>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Steven Noonan <steven@uplinklabs.net>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Cc: Cyrill Gorcunov <gorcunov@gmail.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      29c77870
  17. 21 2月, 2014 1 次提交
    • K
      mm: add support for discard of unused ptes · 45961722
      Konstantin Weitz 提交于
      In a virtualized environment and given an appropriate interface the guest
      can mark pages as unused while they are free (for the s390 implementation
      see git commit 45e576b1 "guest page hinting light"). For the host
      the unused state is a property of the pte.
      
      This patch adds the primitive 'pte_unused' and code to the host swap out
      handler so that pages marked as unused by all mappers are not swapped out
      but discarded instead, thus saving one IO for swap out and potentially
      another one for swap in.
      
      [ Martin Schwidefsky: patch reordering and simplification ]
      Signed-off-by: NKonstantin Weitz <konstantin.weitz@gmail.com>
      Signed-off-by: NMartin Schwidefsky <schwidefsky@de.ibm.com>
      45961722
  18. 17 2月, 2014 1 次提交
    • A
      mm: Use ptep/pmdp_set_numa() for updating _PAGE_NUMA bit · 56eecdb9
      Aneesh Kumar K.V 提交于
      Archs like ppc64 doesn't do tlb flush in set_pte/pmd functions when using
      a hash table MMU for various reasons (the flush is handled as part of
      the PTE modification when necessary).
      
      ppc64 thus doesn't implement flush_tlb_range for hash based MMUs.
      
      Additionally ppc64 require the tlb flushing to be batched within ptl locks.
      
      The reason to do that is to ensure that the hash page table is in sync with
      linux page table.
      
      We track the hpte index in linux pte and if we clear them without flushing
      hash and drop the ptl lock, we can have another cpu update the pte and can
      end up with duplicate entry in the hash table, which is fatal.
      
      We also want to keep set_pte_at simpler by not requiring them to do hash
      flush for performance reason. We do that by assuming that set_pte_at() is
      never *ever* called on a PTE that is already valid.
      
      This was the case until the NUMA code went in which broke that assumption.
      
      Fix that by introducing a new pair of helpers to set _PAGE_NUMA in a
      way similar to ptep/pmdp_set_wrprotect(), with a generic implementation
      using set_pte_at() and a powerpc specific one using the appropriate
      mechanism needed to keep the hash table in sync.
      Acked-by: NMel Gorman <mgorman@suse.de>
      Reviewed-by: NRik van Riel <riel@redhat.com>
      Signed-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      56eecdb9
  19. 15 1月, 2014 1 次提交
    • A
      powerpc/thp: Fix crash on mremap · b3084f4d
      Aneesh Kumar K.V 提交于
      This patch fix the below crash
      
      NIP [c00000000004cee4] .__hash_page_thp+0x2a4/0x440
      LR [c0000000000439ac] .hash_page+0x18c/0x5e0
      ...
      Call Trace:
      [c000000736103c40] [00001ffffb000000] 0x1ffffb000000(unreliable)
      [437908.479693] [c000000736103d50] [c0000000000439ac] .hash_page+0x18c/0x5e0
      [437908.479699] [c000000736103e30] [c00000000000924c] .do_hash_page+0x4c/0x58
      
      On ppc64 we use the pgtable for storing the hpte slot information and
      store address to the pgtable at a constant offset (PTRS_PER_PMD) from
      pmd. On mremap, when we switch the pmd, we need to withdraw and deposit
      the pgtable again, so that we find the pgtable at PTRS_PER_PMD offset
      from new pmd.
      
      We also want to move the withdraw and deposit before the set_pmd so
      that, when page fault find the pmd as trans huge we can be sure that
      pgtable can be located at the offset.
      Signed-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      b3084f4d
  20. 21 12月, 2013 1 次提交
    • K
      mm: Fix NULL pointer dereference in madvise(MADV_WILLNEED) support · ee53664b
      Kirill A. Shutemov 提交于
      Sasha Levin found a NULL pointer dereference that is due to a missing
      page table lock, which in turn is due to the pmd entry in question being
      a transparent huge-table entry.
      
      The code - introduced in commit 1998cc04 ("mm: make
      madvise(MADV_WILLNEED) support swap file prefetch") - correctly checks
      for this situation using pmd_none_or_trans_huge_or_clear_bad(), but it
      turns out that that function doesn't work correctly.
      
      pmd_none_or_trans_huge_or_clear_bad() expected that pmd_bad() would
      trigger if the transparent hugepage bit was set, but it doesn't do that
      if pmd_numa() is also set. Note that the NUMA bit only gets set on real
      NUMA machines, so people trying to reproduce this on most normal
      development systems would never actually trigger this.
      
      Fix it by removing the very subtle (and subtly incorrect) expectation,
      and instead just checking pmd_trans_huge() explicitly.
      Reported-by: NSasha Levin <sasha.levin@oracle.com>
      Acked-by: NAndrea Arcangeli <aarcange@redhat.com>
      [ Additionally remove the now stale test for pmd_trans_huge() inside the
        pmd_bad() case - Linus ]
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ee53664b
  21. 19 12月, 2013 1 次提交
    • R
      mm: fix TLB flush race between migration, and change_protection_range · 20841405
      Rik van Riel 提交于
      There are a few subtle races, between change_protection_range (used by
      mprotect and change_prot_numa) on one side, and NUMA page migration and
      compaction on the other side.
      
      The basic race is that there is a time window between when the PTE gets
      made non-present (PROT_NONE or NUMA), and the TLB is flushed.
      
      During that time, a CPU may continue writing to the page.
      
      This is fine most of the time, however compaction or the NUMA migration
      code may come in, and migrate the page away.
      
      When that happens, the CPU may continue writing, through the cached
      translation, to what is no longer the current memory location of the
      process.
      
      This only affects x86, which has a somewhat optimistic pte_accessible.
      All other architectures appear to be safe, and will either always flush,
      or flush whenever there is a valid mapping, even with no permissions
      (SPARC).
      
      The basic race looks like this:
      
      CPU A			CPU B			CPU C
      
      						load TLB entry
      make entry PTE/PMD_NUMA
      			fault on entry
      						read/write old page
      			start migrating page
      			change PTE/PMD to new page
      						read/write old page [*]
      flush TLB
      						reload TLB from new entry
      						read/write new page
      						lose data
      
      [*] the old page may belong to a new user at this point!
      
      The obvious fix is to flush remote TLB entries, by making sure that
      pte_accessible aware of the fact that PROT_NONE and PROT_NUMA memory may
      still be accessible if there is a TLB flush pending for the mm.
      
      This should fix both NUMA migration and compaction.
      
      [mgorman@suse.de: fix build]
      Signed-off-by: NRik van Riel <riel@redhat.com>
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Cc: Alex Thorlton <athorlton@sgi.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      20841405
  22. 29 8月, 2013 1 次提交
    • M
      s390/mm: implement software referenced bits · 0944fe3f
      Martin Schwidefsky 提交于
      The last remaining use for the storage key of the s390 architecture
      is reference counting. The alternative is to make page table entries
      invalid while they are old. On access the fault handler marks the
      pte/pmd as young which makes the pte/pmd valid if the access rights
      allow read access. The pte/pmd invalidations required for software
      managed reference bits cost a bit of performance, on the other hand
      the RRBE/RRBM instructions to read and reset the referenced bits are
      quite expensive as well.
      Reviewed-by: NGerald Schaefer <gerald.schaefer@de.ibm.com>
      Signed-off-by: NMartin Schwidefsky <schwidefsky@de.ibm.com>
      0944fe3f
  23. 14 8月, 2013 2 次提交
  24. 04 7月, 2013 1 次提交
    • P
      mm: soft-dirty bits for user memory changes tracking · 0f8975ec
      Pavel Emelyanov 提交于
      The soft-dirty is a bit on a PTE which helps to track which pages a task
      writes to.  In order to do this tracking one should
      
        1. Clear soft-dirty bits from PTEs ("echo 4 > /proc/PID/clear_refs)
        2. Wait some time.
        3. Read soft-dirty bits (55'th in /proc/PID/pagemap2 entries)
      
      To do this tracking, the writable bit is cleared from PTEs when the
      soft-dirty bit is.  Thus, after this, when the task tries to modify a
      page at some virtual address the #PF occurs and the kernel sets the
      soft-dirty bit on the respective PTE.
      
      Note, that although all the task's address space is marked as r/o after
      the soft-dirty bits clear, the #PF-s that occur after that are processed
      fast.  This is so, since the pages are still mapped to physical memory,
      and thus all the kernel does is finds this fact out and puts back
      writable, dirty and soft-dirty bits on the PTE.
      
      Another thing to note, is that when mremap moves PTEs they are marked
      with soft-dirty as well, since from the user perspective mremap modifies
      the virtual memory at mremap's new address.
      Signed-off-by: NPavel Emelyanov <xemul@parallels.com>
      Cc: Matt Mackall <mpm@selenic.com>
      Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
      Cc: Glauber Costa <glommer@parallels.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0f8975ec
  25. 29 6月, 2013 1 次提交
  26. 20 6月, 2013 1 次提交
  27. 30 4月, 2013 1 次提交
  28. 25 4月, 2013 1 次提交
  29. 14 2月, 2013 1 次提交
    • M
      s390/mm: implement software dirty bits · abf09bed
      Martin Schwidefsky 提交于
      The s390 architecture is unique in respect to dirty page detection,
      it uses the change bit in the per-page storage key to track page
      modifications. All other architectures track dirty bits by means
      of page table entries. This property of s390 has caused numerous
      problems in the past, e.g. see git commit ef5d437f
      "mm: fix XFS oops due to dirty pages without buffers on s390".
      
      To avoid future issues in regard to per-page dirty bits convert
      s390 to a fault based software dirty bit detection mechanism. All
      user page table entries which are marked as clean will be hardware
      read-only, even if the pte is supposed to be writable. A write by
      the user process will trigger a protection fault which will cause
      the user pte to be marked as dirty and the hardware read-only bit
      is removed.
      
      With this change the dirty bit in the storage key is irrelevant
      for Linux as a host, but the storage key is still required for
      KVM guests. The effect is that page_test_and_clear_dirty and the
      related code can be removed. The referenced bit in the storage
      key is still used by the page_test_and_clear_young primitive to
      provide page age information.
      
      For page cache pages of mappings with mapping_cap_account_dirty
      there will not be any change in behavior as the dirty bit tracking
      already uses read-only ptes to control the amount of dirty pages.
      Only for swap cache pages and pages of mappings without
      mapping_cap_account_dirty there can be additional protection faults.
      To avoid an excessive number of additional faults the mk_pte
      primitive checks for PageDirty if the pgprot value allows for writes
      and pre-dirties the pte. That avoids all additional faults for
      tmpfs and shmem pages until these pages are added to the swap cache.
      Signed-off-by: NMartin Schwidefsky <schwidefsky@de.ibm.com>
      abf09bed
  30. 19 1月, 2013 1 次提交
  31. 13 12月, 2012 1 次提交
  32. 11 12月, 2012 2 次提交