1. 17 Feb 2014, 1 commit
    • A
      mm: Use ptep/pmdp_set_numa() for updating _PAGE_NUMA bit · 56eecdb9
      Committed by Aneesh Kumar K.V
      Archs like ppc64 don't do a TLB flush in the set_pte/pmd functions when using
      a hash table MMU, for various reasons (the flush is handled as part of
      the PTE modification when necessary).
      
      ppc64 thus doesn't implement flush_tlb_range for hash based MMUs.
      
      Additionally, ppc64 requires the TLB flushing to be batched within ptl locks.
      
      The reason for that is to ensure that the hash page table stays in sync with
      the Linux page table.
      
      We track the hpte index in the Linux pte, and if we clear it without flushing
      the hash and then drop the ptl lock, another cpu can update the pte and we can
      end up with a duplicate entry in the hash table, which is fatal.
      
      We also want to keep set_pte_at() simple by not requiring it to do a hash
      flush, for performance reasons. We do that by assuming that set_pte_at() is
      never *ever* called on a PTE that is already valid.
      
      This was the case until the NUMA code went in which broke that assumption.
      
      Fix that by introducing a new pair of helpers to set _PAGE_NUMA in a
      way similar to ptep/pmdp_set_wrprotect(), with a generic implementation
      using set_pte_at() and a powerpc specific one using the appropriate
      mechanism needed to keep the hash table in sync.
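
      For reference, the generic flavour of these helpers is essentially a thin
      wrapper around set_pte_at()/set_pmd_at(); a minimal sketch from memory, with
      the exact #ifdef guards omitted (the powerpc variant instead updates the PTE
      in place using the hash-flush machinery described above):

      static inline void ptep_set_numa(struct mm_struct *mm, unsigned long addr,
      				 pte_t *ptep)
      {
      	pte_t ptent = *ptep;

      	ptent = pte_mknuma(ptent);
      	set_pte_at(mm, addr, ptep, ptent);
      }

      static inline void pmdp_set_numa(struct mm_struct *mm, unsigned long addr,
      				 pmd_t *pmdp)
      {
      	pmd_t pmd = *pmdp;

      	pmd = pmd_mknuma(pmd);
      	set_pmd_at(mm, addr, pmdp, pmd);
      }
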
      Acked-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      56eecdb9
  2. 15 Jan 2014, 1 commit
    • A
      powerpc/thp: Fix crash on mremap · b3084f4d
      Committed by Aneesh Kumar K.V
      This patch fixes the crash below:
      
      NIP [c00000000004cee4] .__hash_page_thp+0x2a4/0x440
      LR [c0000000000439ac] .hash_page+0x18c/0x5e0
      ...
      Call Trace:
      [c000000736103c40] [00001ffffb000000] 0x1ffffb000000(unreliable)
      [437908.479693] [c000000736103d50] [c0000000000439ac] .hash_page+0x18c/0x5e0
      [437908.479699] [c000000736103e30] [c00000000000924c] .do_hash_page+0x4c/0x58
      
      On ppc64 we use the pgtable for storing the hpte slot information and
      store the address of the pgtable at a constant offset (PTRS_PER_PMD) from
      the pmd. On mremap, when we switch the pmd, we need to withdraw and deposit
      the pgtable again, so that we find the pgtable at the PTRS_PER_PMD offset
      from the new pmd.
      
      We also want to move the withdraw and deposit before the set_pmd, so
      that when a page fault finds the pmd as trans huge we can be sure the
      pgtable can be located at that offset.
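
      A rough sketch of the ordering this implies in move_huge_pmd() (not the
      literal patch; the helper names below are the generic deposit/withdraw
      primitives, and the condition for when a withdraw is actually required is
      omitted):

      	/* sketch: re-deposit the page table *before* the new pmd is set,
      	 * so a racing fault that sees a trans-huge pmd can find it */
      	pmd = pmdp_get_and_clear(mm, old_addr, old_pmd);
      	pgtable = pgtable_trans_huge_withdraw(mm, old_pmd);
      	pgtable_trans_huge_deposit(mm, new_pmd, pgtable);
      	set_pmd_at(mm, new_addr, new_pmd, pmd);
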
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      b3084f4d
  3. 21 Dec 2013, 1 commit
    • K
      mm: Fix NULL pointer dereference in madvise(MADV_WILLNEED) support · ee53664b
      Committed by Kirill A. Shutemov
      Sasha Levin found a NULL pointer dereference that is due to a missing
      page table lock, which in turn is due to the pmd entry in question being
      a transparent huge-table entry.
      
      The code - introduced in commit 1998cc04 ("mm: make
      madvise(MADV_WILLNEED) support swap file prefetch") - correctly checks
      for this situation using pmd_none_or_trans_huge_or_clear_bad(), but it
      turns out that that function doesn't work correctly.
      
      pmd_none_or_trans_huge_or_clear_bad() expected that pmd_bad() would
      trigger if the transparent hugepage bit was set, but it doesn't do that
      if pmd_numa() is also set. Note that the NUMA bit only gets set on real
      NUMA machines, so people trying to reproduce this on most normal
      development systems would never actually trigger this.
      
      Fix it by removing the very subtle (and subtly incorrect) expectation,
      and instead just checking pmd_trans_huge() explicitly.
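
      After the fix, the check reads roughly as follows (a sketch of the resulting
      shape, reconstructed from the description above rather than copied from the
      tree):

      static inline int pmd_none_or_trans_huge_or_clear_bad(pmd_t *pmd)
      {
      	pmd_t pmdval = pmd_read_atomic(pmd);
      #ifdef CONFIG_TRANSPARENT_HUGEPAGE
      	barrier();
      #endif
      	/* check the huge case explicitly instead of expecting
      	 * pmd_bad() to trip over it */
      	if (pmd_none(pmdval) || pmd_trans_huge(pmdval))
      		return 1;
      	if (unlikely(pmd_bad(pmdval))) {
      		pmd_clear_bad(pmd);
      		return 1;
      	}
      	return 0;
      }
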
      Reported-by: Sasha Levin <sasha.levin@oracle.com>
      Acked-by: Andrea Arcangeli <aarcange@redhat.com>
      [ Additionally remove the now stale test for pmd_trans_huge() inside the
        pmd_bad() case - Linus ]
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ee53664b
  4. 19 Dec 2013, 1 commit
    • R
      mm: fix TLB flush race between migration, and change_protection_range · 20841405
      Committed by Rik van Riel
      There are a few subtle races between change_protection_range (used by
      mprotect and change_prot_numa) on one side, and NUMA page migration and
      compaction on the other side.
      
      The basic race is that there is a time window between when the PTE gets
      made non-present (PROT_NONE or NUMA), and the TLB is flushed.
      
      During that time, a CPU may continue writing to the page.
      
      This is fine most of the time, however compaction or the NUMA migration
      code may come in, and migrate the page away.
      
      When that happens, the CPU may continue writing, through the cached
      translation, to what is no longer the current memory location of the
      process.
      
      This only affects x86, which has a somewhat optimistic pte_accessible.
      All other architectures appear to be safe, and will either always flush,
      or flush whenever there is a valid mapping, even with no permissions
      (SPARC).
      
      The basic race looks like this:
      
      CPU A			CPU B			CPU C
      
      						load TLB entry
      make entry PTE/PMD_NUMA
      			fault on entry
      						read/write old page
      			start migrating page
      			change PTE/PMD to new page
      						read/write old page [*]
      flush TLB
      						reload TLB from new entry
      						read/write new page
      						lose data
      
      [*] the old page may belong to a new user at this point!
      
      The obvious fix is to flush remote TLB entries, by making sure that
      pte_accessible is aware of the fact that PROT_NONE and PROT_NUMA memory may
      still be accessible if there is a TLB flush pending for the mm.
      
      This should fix both NUMA migration and compaction.
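
      Concretely, the x86 side of the fix makes pte_accessible() consult a per-mm
      "TLB flush pending" flag; a sketch of that shape (the mm_tlb_flush_pending()
      name is from memory and may not match the tree exactly):

      static inline bool pte_accessible(struct mm_struct *mm, pte_t a)
      {
      	if (pte_flags(a) & _PAGE_PRESENT)
      		return true;

      	/* non-present PROT_NONE/NUMA ptes may still be cached in a
      	 * remote TLB until the pending flush completes */
      	if ((pte_flags(a) & (_PAGE_PROTNONE | _PAGE_NUMA)) &&
      			mm_tlb_flush_pending(mm))
      		return true;

      	return false;
      }
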
      
      [mgorman@suse.de: fix build]
      Signed-off-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Cc: Alex Thorlton <athorlton@sgi.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      20841405
  5. 29 Aug 2013, 1 commit
    • M
      s390/mm: implement software referenced bits · 0944fe3f
      Committed by Martin Schwidefsky
      The last remaining use for the storage key of the s390 architecture
      is reference counting. The alternative is to make page table entries
      invalid while they are old. On access the fault handler marks the
      pte/pmd as young which makes the pte/pmd valid if the access rights
      allow read access. The pte/pmd invalidations required for software
      managed reference bits cost a bit of performance; on the other hand,
      the RRBE/RRBM instructions to read and reset the referenced bits are
      quite expensive as well.
      Reviewed-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
      Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
      0944fe3f
  6. 14 Aug 2013, 2 commits
  7. 04 Jul 2013, 1 commit
    • P
      mm: soft-dirty bits for user memory changes tracking · 0f8975ec
      Committed by Pavel Emelyanov
      The soft-dirty is a bit on a PTE which helps to track which pages a task
      writes to.  In order to do this tracking one should
      
        1. Clear the soft-dirty bits from the PTEs ("echo 4 > /proc/PID/clear_refs")
        2. Wait some time.
        3. Read the soft-dirty bits (the 55th bit in /proc/PID/pagemap2 entries)
      
      To do this tracking, the writable bit is cleared from PTEs when the
      soft-dirty bit is cleared.  Thus, after this, when the task tries to modify
      a page at some virtual address the #PF occurs and the kernel sets the
      soft-dirty bit on the respective PTE.
      
      Note that although all of the task's address space is marked r/o after
      the soft-dirty bits are cleared, the #PFs that occur after that are processed
      quickly.  This is because the pages are still mapped to physical memory,
      so all the kernel does is find this out and put the writable, dirty and
      soft-dirty bits back on the PTE.
      
      Another thing to note is that when mremap moves PTEs they are marked
      soft-dirty as well, since from the user's perspective mremap modifies
      the virtual memory at mremap's new address.
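
      A stand-alone user-space illustration of that workflow (a sketch only; it
      reads the soft-dirty bit from /proc/self/pagemap, where the bit eventually
      landed, rather than from the pagemap2 file mentioned above):

      #include <fcntl.h>
      #include <stdint.h>
      #include <stdio.h>
      #include <stdlib.h>
      #include <unistd.h>

      /* return the 64-bit pagemap entry covering addr */
      static uint64_t pagemap_entry(void *addr)
      {
      	uint64_t ent = 0;
      	int fd = open("/proc/self/pagemap", O_RDONLY);
      	off_t off = ((uintptr_t)addr / sysconf(_SC_PAGESIZE)) * 8;

      	if (fd < 0 || pread(fd, &ent, sizeof(ent), off) != (ssize_t)sizeof(ent))
      		perror("pagemap");
      	if (fd >= 0)
      		close(fd);
      	return ent;
      }

      int main(void)
      {
      	char *buf = malloc(4096);
      	int fd = open("/proc/self/clear_refs", O_WRONLY);

      	buf[0] = 1;			/* fault the page in */
      	if (write(fd, "4", 1) != 1)	/* step 1: clear the soft-dirty bits */
      		perror("clear_refs");
      	printf("after clear: soft-dirty=%d\n",
      	       (int)((pagemap_entry(buf) >> 55) & 1));
      	buf[0] = 2;			/* step 3: write, making it soft-dirty */
      	printf("after write: soft-dirty=%d\n",
      	       (int)((pagemap_entry(buf) >> 55) & 1));
      	close(fd);
      	free(buf);
      	return 0;
      }
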
      Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
      Cc: Matt Mackall <mpm@selenic.com>
      Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
      Cc: Glauber Costa <glommer@parallels.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0f8975ec
  8. 29 Jun 2013, 1 commit
  9. 20 Jun 2013, 1 commit
  10. 30 Apr 2013, 1 commit
  11. 25 Apr 2013, 1 commit
  12. 14 Feb 2013, 1 commit
    • M
      s390/mm: implement software dirty bits · abf09bed
      Committed by Martin Schwidefsky
      The s390 architecture is unique in respect to dirty page detection,
      it uses the change bit in the per-page storage key to track page
      modifications. All other architectures track dirty bits by means
      of page table entries. This property of s390 has caused numerous
      problems in the past, e.g. see git commit ef5d437f
      "mm: fix XFS oops due to dirty pages without buffers on s390".
      
      To avoid future issues in regard to per-page dirty bits convert
      s390 to a fault based software dirty bit detection mechanism. All
      user page table entries which are marked as clean will be hardware
      read-only, even if the pte is supposed to be writable. A write by
      the user process will trigger a protection fault which will cause
      the user pte to be marked as dirty and the hardware read-only bit
      to be removed.
      
      With this change the dirty bit in the storage key is irrelevant
      for Linux as a host, but the storage key is still required for
      KVM guests. The effect is that page_test_and_clear_dirty and the
      related code can be removed. The referenced bit in the storage
      key is still used by the page_test_and_clear_young primitive to
      provide page age information.
      
      For page cache pages of mappings with mapping_cap_account_dirty
      there will not be any change in behavior as the dirty bit tracking
      already uses read-only ptes to control the amount of dirty pages.
      Only for swap cache pages and pages of mappings without
      mapping_cap_account_dirty there can be additional protection faults.
      To avoid an excessive number of additional faults the mk_pte
      primitive checks for PageDirty if the pgprot value allows for writes
      and pre-dirties the pte. That avoids all additional faults for
      tmpfs and shmem pages until these pages are added to the swap cache.
      Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
      abf09bed
  13. 19 Jan 2013, 1 commit
  14. 13 Dec 2012, 1 commit
  15. 11 Dec 2012, 2 commits
  16. 09 Oct 2012, 5 commits
    • C
      mm: thp: fix the pmd_clear() arguments in pmdp_get_and_clear() · 2d28a227
      Committed by Catalin Marinas
      The CONFIG_TRANSPARENT_HUGEPAGE implementation of pmdp_get_and_clear()
      calls pmd_clear() with 3 arguments instead of 1.
      
      This only matters for !__HAVE_ARCH_PMDP_GET_AND_CLEAR, which doesn't seem
      to be hit in practice because x86 defines its own version and uses pmd_update.
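
      The generic helper in question is tiny; the fix is just to call pmd_clear()
      with the single pmd pointer argument it takes (sketch):

      #ifndef __HAVE_ARCH_PMDP_GET_AND_CLEAR
      static inline pmd_t pmdp_get_and_clear(struct mm_struct *mm,
      				       unsigned long address,
      				       pmd_t *pmdp)
      {
      	pmd_t pmd = *pmdp;
      	pmd_clear(pmdp);	/* was: pmd_clear(mm, address, pmdp) */
      	return pmd;
      }
      #endif
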
      
      [mhocko@suse.cz: changelog addition]
      Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
      Signed-off-by: Steve Capper <steve.capper@arm.com>
      Signed-off-by: Will Deacon <will.deacon@arm.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Reviewed-by: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
      Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2d28a227
    • G
      thp: introduce pmdp_invalidate() · 46dcde73
      Committed by Gerald Schaefer
      On s390, a valid page table entry must not be changed while it is attached
      to any CPU.  So instead of pmd_mknotpresent() and set_pmd_at(), an IDTE
      operation would be necessary there.  This patch introduces the
      pmdp_invalidate() function, to allow architecture-specific
      implementations.
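
      The generic fallback can be expressed with the existing primitives; a sketch
      (s390 supplies its own IDTE-based implementation instead):

      void pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
      		     pmd_t *pmdp)
      {
      	/* make the pmd non-present, then flush the huge-page range */
      	set_pmd_at(vma->vm_mm, address, pmdp, pmd_mknotpresent(*pmdp));
      	flush_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
      }
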
      Signed-off-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Hillf Danton <dhillf@gmail.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      46dcde73
    • G
      thp: remove assumptions on pgtable_t type · e3ebcf64
      Committed by Gerald Schaefer
      The thp page table pre-allocation code currently assumes that pgtable_t is
      of type "struct page *".  This may not be true for all architectures, so
      this patch removes that assumption by replacing the functions
      prepare_pmd_huge_pte() and get_pmd_huge_pte() with two new functions that
      can be given architecture-specific definitions.
      
      It also removes two VM_BUG_ON checks for page_count() and page_mapcount()
      operating on a pgtable_t.  Apart from the VM_BUG_ON removal, there will be
      no functional change introduced by this patch.
      Signed-off-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Hillf Danton <dhillf@gmail.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e3ebcf64
    • K
      mm, x86, pat: rework linear pfn-mmap tracking · b3b9c293
      Committed by Konstantin Khlebnikov
      Replace the generic vma-flag VM_PFN_AT_MMAP with x86-only VM_PAT.
      
      We can toss mapping address from remap_pfn_range() into
      track_pfn_vma_new(), and collect all PAT-related logic together in
      arch/x86/.
      
      This patch also restores the original frustration-free is_cow_mapping() check
      in remap_pfn_range(), as it was before commit v2.6.28-rc8-88-g3c8bb73a
      ("x86: PAT: store vm_pgoff for all linear_over_vma_region mappings - v3")
      
      The is_linear_pfn_mapping() checks can be removed from mm/huge_memory.c,
      because that case is already handled by VM_PFNMAP in the VM_NO_THP bit-mask.
      
      [suresh.b.siddha@intel.com: Reset the VM_PAT flag as part of untrack_pfn_vma()]
      Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
      Cc: Venkatesh Pallipadi <venki@google.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Nick Piggin <npiggin@kernel.dk>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Carsten Otte <cotte@de.ibm.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Eric Paris <eparis@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: James Morris <james.l.morris@oracle.com>
      Cc: Jason Baron <jbaron@redhat.com>
      Cc: Kentaro Takeda <takedakn@nttdata.co.jp>
      Cc: Matt Helsley <matthltc@us.ibm.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Robert Richter <robert.richter@amd.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Venkatesh Pallipadi <venki@google.com>
      Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b3b9c293
    • S
      x86, pat: separate the pfn attribute tracking for remap_pfn_range and vm_insert_pfn · 5180da41
      Committed by Suresh Siddha
      With PAT enabled, vm_insert_pfn() looks up the existing pfn memory
      attribute and uses it.  Expectation is that the driver reserves the
      memory attributes for the pfn before calling vm_insert_pfn().
      
      remap_pfn_range() (when called for the whole vma) will setup a new
      attribute (based on the prot argument) for the specified pfn range.
      This addresses the legacy usage which typically calls remap_pfn_range()
      with a desired memory attribute.  For ranges smaller than the vma size
      (which is typically not the case), remap_pfn_range() will use the
      existing memory attribute for the pfn range.
      
      Expose two different APIs for these different behaviors:
      track_pfn_insert() for tracking the pfn attribute set by vm_insert_pfn(),
      and track_pfn_remap() for remap_pfn_range().
      
      This cleanup also prepares the ground for the track/untrack pfn vma
      routines to take over the ownership of setting PAT specific vm_flag in
      the 'vma'.
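
      The two entry points end up with roughly these shapes (prototypes
      reconstructed from the description above; treat them as a sketch rather than
      the authoritative declarations):

      /* remap_pfn_range(): may set up a new memory type for [addr, addr + size) */
      int track_pfn_remap(struct vm_area_struct *vma, pgprot_t *prot,
      		    unsigned long pfn, unsigned long addr,
      		    unsigned long size);

      /* vm_insert_pfn(): look up and reuse the pfn's existing memory type */
      int track_pfn_insert(struct vm_area_struct *vma, pgprot_t *prot,
      		     unsigned long pfn);
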
      
      [khlebnikov@openvz.org: Clear checks in track_pfn_remap()]
      [akpm@linux-foundation.org: tweak a few comments]
      Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
      Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Cc: Venkatesh Pallipadi <venki@google.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Nick Piggin <npiggin@kernel.dk>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Carsten Otte <cotte@de.ibm.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Eric Paris <eparis@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: James Morris <james.l.morris@oracle.com>
      Cc: Jason Baron <jbaron@redhat.com>
      Cc: Kentaro Takeda <takedakn@nttdata.co.jp>
      Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Cc: Matt Helsley <matthltc@us.ibm.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Robert Richter <robert.richter@amd.com>
      Cc: Suresh Siddha <suresh.b.siddha@intel.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5180da41
  17. 21 Jun 2012, 1 commit
    • A
      thp: avoid atomic64_read in pmd_read_atomic for 32bit PAE · e4eed03f
      Committed by Andrea Arcangeli
      In the x86 32bit PAE CONFIG_TRANSPARENT_HUGEPAGE=y case while holding the
      mmap_sem for reading, cmpxchg8b cannot be used to read pmd contents under
      Xen.
      
      So instead of dealing only with "consistent" pmdvals in
      pmd_none_or_trans_huge_or_clear_bad() (which would be conceptually
      simpler) we let pmd_none_or_trans_huge_or_clear_bad() deal with pmdvals
      where the low 32bit and high 32bit could be inconsistent (to avoid having
      to use cmpxchg8b).
      
      The only guarantee we get from pmd_read_atomic is that if the low part of
      the pmd was found null, the high part will be null too (so the pmd will be
      considered unstable).  And if the low part of the pmd is found "stable"
      later, then it means the whole pmd was read atomically (because after a
      pmd is stable, neither MADV_DONTNEED nor page faults can alter it anymore,
      and we read the high part after the low part).
      
      In the 32bit PAE x86 case, it is enough to read the low part of the pmdval
      atomically to declare the pmd as "stable", and that's true with and without
      THP; furthermore, in the THP case we also have a barrier() that will
      prevent any inconsistent pmdvals from being cached by a later re-read of
      *pmd.
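
      The PAE read described above boils down to reading the low word first and
      only reading the high word once the low word is known to be non-null; a
      sketch (reconstructed from the description, not copied from arch/x86):

      static inline pmd_t pmd_read_atomic(pmd_t *pmdp)
      {
      	pmdval_t ret;
      	u32 *tmp = (u32 *)pmdp;

      	ret = (pmdval_t)*tmp;		/* low word first */
      	if (ret) {
      		/* a non-null low word means the pmd is stable, so the
      		 * high word read after this barrier is meaningful */
      		smp_rmb();
      		ret |= ((pmdval_t)*(tmp + 1)) << 32;
      	}
      	return (pmd_t) { ret };
      }
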
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Jonathan Nieder <jrnieder@gmail.com>
      Cc: Ulrich Obergfell <uobergfe@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Larry Woodman <lwoodman@redhat.com>
      Cc: Petr Matousek <pmatouse@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Jan Beulich <jbeulich@suse.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
      Tested-by: Andrew Jones <drjones@redhat.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e4eed03f
  18. 30 May 2012, 1 commit
    • A
      mm: pmd_read_atomic: fix 32bit PAE pmd walk vs pmd_populate SMP race condition · 26c19178
      Committed by Andrea Arcangeli
      When holding the mmap_sem for reading, pmd_offset_map_lock should only
      run on a pmd_t that has been read atomically from the pmdp pointer,
      otherwise we may read only half of it leading to this crash.
      
      PID: 11679  TASK: f06e8000  CPU: 3   COMMAND: "do_race_2_panic"
       #0 [f06a9dd8] crash_kexec at c049b5ec
       #1 [f06a9e2c] oops_end at c083d1c2
       #2 [f06a9e40] no_context at c0433ded
       #3 [f06a9e64] bad_area_nosemaphore at c043401a
       #4 [f06a9e6c] __do_page_fault at c0434493
       #5 [f06a9eec] do_page_fault at c083eb45
       #6 [f06a9f04] error_code (via page_fault) at c083c5d5
          EAX: 01fb470c EBX: fff35000 ECX: 00000003 EDX: 00000100 EBP:
          00000000
          DS:  007b     ESI: 9e201000 ES:  007b     EDI: 01fb4700 GS:  00e0
          CS:  0060     EIP: c083bc14 ERR: ffffffff EFLAGS: 00010246
       #7 [f06a9f38] _spin_lock at c083bc14
       #8 [f06a9f44] sys_mincore at c0507b7d
       #9 [f06a9fb0] system_call at c083becd
                               start           len
          EAX: ffffffda  EBX: 9e200000  ECX: 00001000  EDX: 6228537f
          DS:  007b      ESI: 00000000  ES:  007b      EDI: 003d0f00
          SS:  007b      ESP: 62285354  EBP: 62285388  GS:  0033
          CS:  0073      EIP: 00291416  ERR: 000000da  EFLAGS: 00000286
      
      This should be a longstanding bug affecting x86 32bit PAE without THP.
      Only archs with 64bit large pmd_t and 32bit unsigned long should be
      affected.
      
      With THP enabled, the barrier() in pmd_none_or_trans_huge_or_clear_bad()
      would partly hide the bug when the pmd transitions from none to stable,
      by forcing a re-read of *pmd in pmd_offset_map_lock, but with THP enabled
      a new set of problems arises from the fact that the pmd can then transition
      freely between the none, pmd_trans_huge and pmd_trans_stable states.
      So making the barrier in pmd_none_or_trans_huge_or_clear_bad()
      unconditional isn't a good idea, and it would be a flakey solution.
      
      This should be fully fixed by introducing a pmd_read_atomic that reads
      the pmd in order with THP disabled, or by reading the pmd atomically
      with cmpxchg8b with THP enabled.
      
      Luckily this new race condition only triggers in the places that must
      already be covered by pmd_none_or_trans_huge_or_clear_bad() so the fix
      is localized there but this bug is not related to THP.
      
      NOTE: this can trigger on x86 32bit systems with PAE enabled with more
      than 4G of ram, otherwise the high part of the pmd never risks being
      truncated because it would be zero at all times, which in turn hides the
      SMP race.
      
      This bug was discovered and fully debugged by Ulrich, quote:
      
      ----
      [..]
      pmd_none_or_trans_huge_or_clear_bad() loads the content of edx and
      eax.
      
          496 static inline int pmd_none_or_trans_huge_or_clear_bad(pmd_t
          *pmd)
          497 {
          498         /* depend on compiler for an atomic pmd read */
          499         pmd_t pmdval = *pmd;
      
                                      // edi = pmd pointer
      0xc0507a74 <sys_mincore+548>:   mov    0x8(%esp),%edi
      ...
                                      // edx = PTE page table high address
      0xc0507a84 <sys_mincore+564>:   mov    0x4(%edi),%edx
      ...
                                      // eax = PTE page table low address
      0xc0507a8e <sys_mincore+574>:   mov    (%edi),%eax
      
      [..]
      
      Please note that the PMD is not read atomically. These are two "mov"
      instructions where the high order bits of the PMD entry are fetched
      first. Hence, the above machine code is prone to the following race.
      
      -  The PMD entry {high|low} is 0x0000000000000000.
         The "mov" at 0xc0507a84 loads 0x00000000 into edx.
      
      -  A page fault (on another CPU) sneaks in between the two "mov"
         instructions and instantiates the PMD.
      
      -  The PMD entry {high|low} is now 0x00000003fda38067.
         The "mov" at 0xc0507a8e loads 0xfda38067 into eax.
      ----
      Reported-by: Ulrich Obergfell <uobergfe@redhat.com>
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Larry Woodman <lwoodman@redhat.com>
      Cc: Petr Matousek <pmatouse@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      26c19178
  19. 26 May 2012, 1 commit
    • C
      arch/tile: allow building Linux with transparent huge pages enabled · 73636b1a
      Committed by Chris Metcalf
      The change adds some infrastructure for managing tile pmd's more generally,
      using pte_pmd() and pmd_pte() methods to translate pmd values to and
      from ptes, since on TILEPro a pmd is really just a nested structure
      holding a pgd (aka pte).  Several existing pmd methods are moved into
      this framework, and a whole raft of additional pmd accessors are defined
      that are used by the transparent hugepage framework.
      
      The tile PTE now has a "client2" bit.  The bit is used to indicate a
      transparent huge page is in the process of being split into subpages.
      
      This change also fixes a generic bug where the return value of the
      generic pmdp_splitting_flush() was incorrect.
      Signed-off-by: Chris Metcalf <cmetcalf@tilera.com>
      73636b1a
  20. 22 Mar 2012, 1 commit
    • A
      mm: thp: fix pmd_bad() triggering in code paths holding mmap_sem read mode · 1a5a9906
      Committed by Andrea Arcangeli
      In some cases it may happen that pmd_none_or_clear_bad() is called with
      the mmap_sem hold in read mode.  In those cases the huge page faults can
      allocate hugepmds under pmd_none_or_clear_bad() and that can trigger a
      false positive from pmd_bad() that will not like to see a pmd
      materializing as trans huge.
      
      It's not khugepaged causing the problem, khugepaged holds the mmap_sem
      in write mode (and all those sites must hold the mmap_sem in read mode
      to prevent pagetables from going away from under them; during code review it
      seems vm86 mode on 32bit kernels requires that too unless it's
      restricted to 1 thread per process or UP builds).  The race is only with
      the huge pagefaults that can convert a pmd_none() into a
      pmd_trans_huge().
      
      Effectively all these pmd_none_or_clear_bad() sites running with
      mmap_sem in read mode are somewhat speculative with the page faults, and
      the result is always undefined when they run simultaneously.  This is
      probably why it wasn't common to run into this.  For example if the
      madvise(MADV_DONTNEED) runs zap_page_range() shortly before the page
      fault, the hugepage will not be zapped, if the page fault runs first it
      will be zapped.
      
      Altering pmd_bad() not to error out if it finds hugepmds won't be enough
      to fix this, because zap_pmd_range would then proceed to call
      zap_pte_range (which would be incorrect if the pmd became a
      pmd_trans_huge()).
      
      The simplest way to fix this is to read the pmd in the local stack
      (regardless of what we read, no need of actual CPU barriers, only
      compiler barrier needed), and be sure it is not changing under the code
      that computes its value.  Even if the real pmd is changing under the
      value we hold on the stack, we don't care.  If we actually end up in
      zap_pte_range it means the pmd was not none already and it was not huge,
      and it can't become huge from under us (khugepaged locking explained
      above).
      
      All we need is to enforce that there is no way anymore that in a code
      path like below, pmd_trans_huge can be false, but pmd_none_or_clear_bad
      can run into a hugepmd.  The overhead of a barrier() is just a compiler
      tweak and should not be measurable (I only added it for THP builds).  I
      don't exclude that different compiler versions may have prevented the race
      too by caching the value of *pmd on the stack (that hasn't been
      verified, but it wouldn't be impossible considering
      pmd_none_or_clear_bad, pmd_bad, pmd_trans_huge, pmd_none are all inlines
      and there's no external function called in between pmd_trans_huge and
      pmd_none_or_clear_bad).
      
      		if (pmd_trans_huge(*pmd)) {
      			if (next-addr != HPAGE_PMD_SIZE) {
      				VM_BUG_ON(!rwsem_is_locked(&tlb->mm->mmap_sem));
      				split_huge_page_pmd(vma->vm_mm, pmd);
      			} else if (zap_huge_pmd(tlb, vma, pmd, addr))
      				continue;
      			/* fall through */
      		}
      		if (pmd_none_or_clear_bad(pmd))
      
      Because this race condition could be exercised without special
      privileges this was reported in CVE-2012-1179.
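
      For reference, the hardened helper introduced here looks roughly like this
      (a sketch reconstructed from the description; the huge-pmd handling was
      adjusted again by the madvise(MADV_WILLNEED) fix that appears earlier in
      this log):

      static inline int pmd_none_or_trans_huge_or_clear_bad(pmd_t *pmd)
      {
      	/* depend on the compiler for an atomic pmd read into a local */
      	pmd_t pmdval = *pmd;
      #ifdef CONFIG_TRANSPARENT_HUGEPAGE
      	barrier();	/* prevent later re-reads of *pmd */
      #endif
      	if (pmd_none(pmdval))
      		return 1;
      	if (unlikely(pmd_bad(pmdval))) {
      		if (!pmd_trans_huge(pmdval))
      			pmd_clear_bad(pmd);
      		return 1;
      	}
      	return 0;
      }
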
      
      The race was identified and fully explained by Ulrich who debugged it.
      I'm quoting his accurate explanation below, for reference.
      
      ====== start quote =======
            mapcount 0 page_mapcount 1
            kernel BUG at mm/huge_memory.c:1384!
      
          At some point prior to the panic, a "bad pmd ..." message similar to the
          following is logged on the console:
      
            mm/memory.c:145: bad pmd ffff8800376e1f98(80000000314000e7).
      
          The "bad pmd ..." message is logged by pmd_clear_bad() before it clears
          the page's PMD table entry.
      
              143 void pmd_clear_bad(pmd_t *pmd)
              144 {
          ->  145         pmd_ERROR(*pmd);
              146         pmd_clear(pmd);
              147 }
      
          After the PMD table entry has been cleared, there is an inconsistency
          between the actual number of PMD table entries that are mapping the page
          and the page's map count (_mapcount field in struct page). When the page
          is subsequently reclaimed, __split_huge_page() detects this inconsistency.
      
             1381         if (mapcount != page_mapcount(page))
             1382                 printk(KERN_ERR "mapcount %d page_mapcount %d\n",
             1383                        mapcount, page_mapcount(page));
          -> 1384         BUG_ON(mapcount != page_mapcount(page));
      
          The root cause of the problem is a race of two threads in a multithreaded
          process. Thread B incurs a page fault on a virtual address that has never
          been accessed (PMD entry is zero) while Thread A is executing an madvise()
          system call on a virtual address within the same 2 MB (huge page) range.
      
                     virtual address space
                    .---------------------.
                    |                     |
                    |                     |
                  .-|---------------------|
                  | |                     |
                  | |                     |<-- B(fault)
                  | |                     |
            2 MB  | |/////////////////////|-.
            huge <  |/////////////////////|  > A(range)
            page  | |/////////////////////|-'
                  | |                     |
                  | |                     |
                  '-|---------------------|
                    |                     |
                    |                     |
                    '---------------------'
      
          - Thread A is executing an madvise(..., MADV_DONTNEED) system call
            on the virtual address range "A(range)" shown in the picture.
      
          sys_madvise
            // Acquire the semaphore in shared mode.
            down_read(&current->mm->mmap_sem)
            ...
            madvise_vma
              switch (behavior)
              case MADV_DONTNEED:
                   madvise_dontneed
                     zap_page_range
                       unmap_vmas
                         unmap_page_range
                           zap_pud_range
                             zap_pmd_range
                               //
                               // Assume that this huge page has never been accessed.
                               // I.e. content of the PMD entry is zero (not mapped).
                               //
                               if (pmd_trans_huge(*pmd)) {
                                   // We don't get here due to the above assumption.
                               }
                               //
                               // Assume that Thread B incurred a page fault and
                   .---------> // sneaks in here as shown below.
                   |           //
                   |           if (pmd_none_or_clear_bad(pmd))
                   |               {
                   |                 if (unlikely(pmd_bad(*pmd)))
                   |                     pmd_clear_bad
                   |                     {
                   |                       pmd_ERROR
                   |                         // Log "bad pmd ..." message here.
                   |                       pmd_clear
                   |                         // Clear the page's PMD entry.
                   |                         // Thread B incremented the map count
                   |                         // in page_add_new_anon_rmap(), but
                   |                         // now the page is no longer mapped
                   |                         // by a PMD entry (-> inconsistency).
                   |                     }
                   |               }
                   |
                   v
          - Thread B is handling a page fault on virtual address "B(fault)" shown
            in the picture.
      
          ...
          do_page_fault
            __do_page_fault
              // Acquire the semaphore in shared mode.
              down_read_trylock(&mm->mmap_sem)
              ...
              handle_mm_fault
                if (pmd_none(*pmd) && transparent_hugepage_enabled(vma))
                    // We get here due to the above assumption (PMD entry is zero).
                    do_huge_pmd_anonymous_page
                      alloc_hugepage_vma
                        // Allocate a new transparent huge page here.
                      ...
                      __do_huge_pmd_anonymous_page
                        ...
                        spin_lock(&mm->page_table_lock)
                        ...
                        page_add_new_anon_rmap
                          // Here we increment the page's map count (starts at -1).
                          atomic_set(&page->_mapcount, 0)
                        set_pmd_at
                          // Here we set the page's PMD entry which will be cleared
                          // when Thread A calls pmd_clear_bad().
                        ...
                        spin_unlock(&mm->page_table_lock)
      
          The mmap_sem does not prevent the race because both threads are acquiring
          it in shared mode (down_read).  Thread B holds the page_table_lock while
          the page's map count and PMD table entry are updated.  However, Thread A
          does not synchronize on that lock.
      
      ====== end quote =======
      
      [akpm@linux-foundation.org: checkpatch fixes]
      Reported-by: Ulrich Obergfell <uobergfe@redhat.com>
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Dave Jones <davej@redhat.com>
      Acked-by: Larry Woodman <lwoodman@redhat.com>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: <stable@vger.kernel.org>		[2.6.38+]
      Cc: Mark Salter <msalter@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1a5a9906
  21. 05 Mar 2012, 1 commit
    • P
      BUG: headers with BUG/BUG_ON etc. need linux/bug.h · 187f1882
      Committed by Paul Gortmaker
      If a header file is making use of BUG, BUG_ON, BUILD_BUG_ON, or any
      other BUG variant in a static inline (i.e. not in a #define) then
      that header really should be including <linux/bug.h> and not just
      expecting it to be implicitly present.
      
      We can make this change risk-free, since if the files using these
      headers didn't have exposure to linux/bug.h already, they would have
      been causing compile failures/warnings.
      Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
      187f1882
  22. 16 Jun 2011, 1 commit
  23. 23 May 2011, 1 commit
    • M
      [S390] merge page_test_dirty and page_clear_dirty · 2d42552d
      Committed by Martin Schwidefsky
      The page_clear_dirty primitive always sets the default storage key
      which resets the access control bits and the fetch protection bit.
      That will surprise a KVM guest that sets non-zero access control
      bits or the fetch protection bit. Merge page_test_dirty and
      page_clear_dirty back to a single function and only clear the
      dirty bit from the storage key.
      
      In addition move the function page_test_and_clear_dirty and
      page_test_and_clear_young to page.h where they belong. This
      requires changing the parameter from a struct page * to a page
      frame number.
      Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
      2d42552d
  24. 01 Mar 2011, 1 commit
    • B
      mm: <asm-generic/pgtable.h> must include <linux/mm_types.h> · fbd71844
      Committed by Ben Hutchings
      Commit e2cda322 ("thp: add pmd mangling generic functions") replaced
      some macros in <asm-generic/pgtable.h> with inline functions.
      
      If the functions are to be defined (not all architectures need them)
      then struct vm_area_struct must be defined first.  So include
      <linux/mm_types.h>.
      
      Fixes a build failure seen in Debian:
      
          CC [M]  drivers/media/dvb/mantis/mantis_pci.o
        In file included from arch/arm/include/asm/pgtable.h:460,
                         from drivers/media/dvb/mantis/mantis_pci.c:25:
        include/asm-generic/pgtable.h: In function 'ptep_test_and_clear_young':
        include/asm-generic/pgtable.h:29: error: dereferencing pointer to incomplete type
      Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      fbd71844
  25. 17 Jan 2011, 1 commit
    • A
      fix non-x86 build failure in pmdp_get_and_clear · b3697c02
      Committed by Andrea Arcangeli
      pmdp_get_and_clear/pmdp_clear_flush/pmdp_splitting_flush were trapped as
      BUG() and they were defined only to diminish the risk of build issues on
      non-x86 archs and to be consistent with the generic pte methods previously
      defined in include/asm-generic/pgtable.h.
      
      But they are causing more trouble than they were supposed to solve, so
      it's simpler not to define them when THP is off.
      
      This is also correcting the export of pmdp_splitting_flush which is
      currently unused (x86 isn't using the generic implementation in
      mm/pgtable-generic.c and no other arch needs that [yet]).
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Sam Ravnborg <sam@ravnborg.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b3697c02
  26. 14 Jan 2011, 2 commits
  27. 25 Oct 2010, 1 commit
  28. 24 Aug 2010, 1 commit
    • S
      x86, mm: Avoid unnecessary TLB flush · 61c77326
      Committed by Shaohua Li
      On x86, the access and dirty bits are set automatically by the CPU when it
      accesses memory. By the time we reach the code path below
      flush_tlb_fix_spurious_fault(), we have already set the dirty bit for the pte
      and don't need to flush the TLB. This might mean the TLB entry on some CPUs
      doesn't have the dirty bit set, but that doesn't matter: when those CPUs do a
      page write, they will check the bit automatically, with no software involved.
      
      On the other hand, flushing the TLB at this point is harmful. A test creates
      one thread per CPU, each thread writes to the same but random address in the
      same vma range, and we measure the total time. On a 4-socket system the
      original time is 1.96s, while with the patch it is 0.8s. On a 2-socket system
      there is a 20% time cut too. perf shows a lot of time is spent sending and
      handling IPIs for the TLB flush.
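
      On x86 the change amounts to making flush_tlb_fix_spurious_fault() a no-op;
      a sketch (the generic fallback keeps flushing via flush_tlb_page()):

      /* x86: the pte already carries the dirty bit; a CPU still holding the
       * stale, read-only TLB entry will notice on its next write without any
       * software involvement, so no IPI/flush is needed here */
      #define flush_tlb_fix_spurious_fault(vma, address) do { } while (0)
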
      Signed-off-by: Shaohua Li <shaohua.li@intel.com>
      LKML-Reference: <20100816011655.GA362@sli10-desk.sh.intel.com>
      Acked-by: Suresh Siddha <suresh.b.siddha@intel.com>
      Cc: Andrea Archangeli <aarcange@redhat.com>
      Signed-off-by: H. Peter Anvin <hpa@zytor.com>
      61c77326
  29. 23 Jun 2009, 1 commit
    • P
      asm-generic: add dummy pgprot_noncached() · 0634a632
      Committed by Paul Mundt
      Most architectures now provide a pgprot_noncached(), the
      remaining ones can simply use a dummy default implementation,
      except for cris and xtensa, which should override the
      default appropriately.
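
      The dummy default is just an identity macro that architectures without a
      real implementation pick up (sketch):

      #ifndef pgprot_noncached
      #define pgprot_noncached(prot)	(prot)
      #endif
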
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Cc: Jesper Nilsson <jesper.nilsson@axis.com>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: Magnus Damm <magnus.damm@gmail.com>
      0634a632
  30. 30 Mar 2009, 2 commits
  31. 14 Jan 2009, 1 commit
  32. 20 Dec 2008, 1 commit