1. 30 8月, 2014 1 次提交
    • H
      x86,mm: fix pte_special versus pte_numa · b38af472
      Hugh Dickins 提交于
      Sasha Levin has shown oopses on ffffea0003480048 and ffffea0003480008 at
      mm/memory.c:1132, running Trinity on different 3.16-rc-next kernels:
      where zap_pte_range() checks page->mapping to see if PageAnon(page).
      
      Those addresses fit struct pages for pfns d2001 and d2000, and in each
      dump a register or a stack slot showed d2001730 or d2000730: pte flags
      0x730 are PCD ACCESSED PROTNONE SPECIAL IOMAP; and Sasha's e820 map has
      a hole between cfffffff and 100000000, which would need special access.
      
      Commit c46a7c81 ("x86: define _PAGE_NUMA by reusing software bits on
      the PMD and PTE levels") has broken vm_normal_page(): a PROTNONE SPECIAL
      pte no longer passes the pte_special() test, so zap_pte_range() goes on
      to try to access a non-existent struct page.
      
      Fix this by refining pte_special() (SPECIAL with PRESENT or PROTNONE) to
      complement pte_numa() (SPECIAL with neither PRESENT nor PROTNONE).  A
      hint that this was a problem was that c46a7c81 added pte_numa() test
      to vm_normal_page(), and moved its is_zero_pfn() test from slow to fast
      path: This was papering over a pte_special() snag when the zero page was
      encountered during zap.  This patch reverts vm_normal_page() to how it
      was before, relying on pte_special().
      
      It still appears that this patch may be incomplete: aren't there other
      places which need to be handling PROTNONE along with PRESENT?  For
      example, pte_mknuma() clears _PAGE_PRESENT and sets _PAGE_NUMA, but on a
      PROT_NONE area, that would make it pte_special().  This is side-stepped
      by the fact that NUMA hinting faults skipped PROT_NONE VMAs and there
      are no grounds where a NUMA hinting fault on a PROT_NONE VMA would be
      interesting.
      
      Fixes: c46a7c81 ("x86: define _PAGE_NUMA by reusing software bits on the PMD and PTE levels")
      Reported-by: NSasha Levin <sasha.levin@oracle.com>
      Tested-by: NSasha Levin <sasha.levin@oracle.com>
      Signed-off-by: NHugh Dickins <hughd@google.com>
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Cyrill Gorcunov <gorcunov@gmail.com>
      Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
      Cc: <stable@vger.kernel.org>	[3.16]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b38af472
  2. 05 6月, 2014 2 次提交
    • C
      mm: x86 pgtable: require X86_64 for soft-dirty tracker · 2bf01f9f
      Cyrill Gorcunov 提交于
      Tracking dirty status on 2 level pages requires very ugly macros and
      taking into account how old the machines who can operate without PAE
      mode only are, lets drop soft dirty tracker from them for code
      simplicity (note I can't drop all the macros from 2 level pages by now
      since _PAGE_BIT_PROTNONE and _PAGE_BIT_FILE are still used even without
      tracker).
      
      Linus proposed to completely rip off softdirty support on x86-32 (even
      with PAE) and since for CRIU we're not planning to support native x86-32
      mode, lets do that.
      
      (Softdirty tracker is relatively new feature which is mostly used by
      CRIU so I don't expect if such API change would cause problems for
      userspace).
      Signed-off-by: NCyrill Gorcunov <gorcunov@openvz.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Peter Anvin <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Steven Noonan <steven@uplinklabs.net>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: David Vrabel <david.vrabel@citrix.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2bf01f9f
    • M
      x86: define _PAGE_NUMA by reusing software bits on the PMD and PTE levels · c46a7c81
      Mel Gorman 提交于
      _PAGE_NUMA is currently an alias of _PROT_PROTNONE to trap NUMA hinting
      faults on x86.  Care is taken such that _PAGE_NUMA is used only in
      situations where the VMA flags distinguish between NUMA hinting faults
      and prot_none faults.  This decision was x86-specific and conceptually
      it is difficult requiring special casing to distinguish between PROTNONE
      and NUMA ptes based on context.
      
      Fundamentally, we only need the _PAGE_NUMA bit to tell the difference
      between an entry that is really unmapped and a page that is protected
      for NUMA hinting faults as if the PTE is not present then a fault will
      be trapped.
      
      Swap PTEs on x86-64 use the bits after _PAGE_GLOBAL for the offset.
      This patch shrinks the maximum possible swap size and uses the bit to
      uniquely distinguish between NUMA hinting ptes and swap ptes.
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Cc: David Vrabel <david.vrabel@citrix.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Peter Anvin <hpa@zytor.com>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Steven Noonan <steven@uplinklabs.net>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Cc: Cyrill Gorcunov <gorcunov@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c46a7c81
  3. 25 3月, 2014 1 次提交
    • D
      Revert "xen: properly account for _PAGE_NUMA during xen pte translations" · 5926f87f
      David Vrabel 提交于
      This reverts commit a9c8e4be.
      
      PTEs in Xen PV guests must contain machine addresses if _PAGE_PRESENT
      is set and pseudo-physical addresses is _PAGE_PRESENT is clear.
      
      This is because during a domain save/restore (migration) the page
      table entries are "canonicalised" and uncanonicalised". i.e., MFNs are
      converted to PFNs during domain save so that on a restore the page
      table entries may be rewritten with the new MFNs on the destination.
      This canonicalisation is only done for PTEs that are present.
      
      This change resulted in writing PTEs with MFNs if _PAGE_PROTNONE (or
      _PAGE_NUMA) was set but _PAGE_PRESENT was clear.  These PTEs would be
      migrated as-is which would result in unexpected behaviour in the
      destination domain.  Either a) the MFN would be translated to the
      wrong PFN/page; b) setting the _PAGE_PRESENT bit would clear the PTE
      because the MFN is no longer owned by the domain; or c) the present
      bit would not get set.
      
      Symptoms include "Bad page" reports when munmapping after migrating a
      domain.
      Signed-off-by: NDavid Vrabel <david.vrabel@citrix.com>
      Acked-by: NKonrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: <stable@vger.kernel.org>        [3.12+]
      5926f87f
  4. 05 3月, 2014 1 次提交
  5. 11 2月, 2014 1 次提交
    • M
      xen: properly account for _PAGE_NUMA during xen pte translations · a9c8e4be
      Mel Gorman 提交于
      Steven Noonan forwarded a users report where they had a problem starting
      vsftpd on a Xen paravirtualized guest, with this in dmesg:
      
        BUG: Bad page map in process vsftpd  pte:8000000493b88165 pmd:e9cc01067
        page:ffffea00124ee200 count:0 mapcount:-1 mapping:     (null) index:0x0
        page flags: 0x2ffc0000000014(referenced|dirty)
        addr:00007f97eea74000 vm_flags:00100071 anon_vma:ffff880e98f80380 mapping:          (null) index:7f97eea74
        CPU: 4 PID: 587 Comm: vsftpd Not tainted 3.12.7-1-ec2 #1
        Call Trace:
          dump_stack+0x45/0x56
          print_bad_pte+0x22e/0x250
          unmap_single_vma+0x583/0x890
          unmap_vmas+0x65/0x90
          exit_mmap+0xc5/0x170
          mmput+0x65/0x100
          do_exit+0x393/0x9e0
          do_group_exit+0xcc/0x140
          SyS_exit_group+0x14/0x20
          system_call_fastpath+0x1a/0x1f
        Disabling lock debugging due to kernel taint
        BUG: Bad rss-counter state mm:ffff880e9ca60580 idx:0 val:-1
        BUG: Bad rss-counter state mm:ffff880e9ca60580 idx:1 val:1
      
      The issue could not be reproduced under an HVM instance with the same
      kernel, so it appears to be exclusive to paravirtual Xen guests.  He
      bisected the problem to commit 1667918b ("mm: numa: clear numa
      hinting information on mprotect") that was also included in 3.12-stable.
      
      The problem was related to how xen translates ptes because it was not
      accounting for the _PAGE_NUMA bit.  This patch splits pte_present to add
      a pteval_present helper for use by xen so both bare metal and xen use
      the same code when checking if a PTE is present.
      
      [mgorman@suse.de: wrote changelog, proposed minor modifications]
      [akpm@linux-foundation.org: fix typo in comment]
      Reported-by: NSteven Noonan <steven@uplinklabs.net>
      Tested-by: NSteven Noonan <steven@uplinklabs.net>
      Signed-off-by: NElena Ufimtseva <ufimtseva@gmail.com>
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Reviewed-by: NDavid Vrabel <david.vrabel@citrix.com>
      Acked-by: NKonrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: <stable@vger.kernel.org>	[3.12+]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a9c8e4be
  6. 19 12月, 2013 1 次提交
    • R
      mm: fix TLB flush race between migration, and change_protection_range · 20841405
      Rik van Riel 提交于
      There are a few subtle races, between change_protection_range (used by
      mprotect and change_prot_numa) on one side, and NUMA page migration and
      compaction on the other side.
      
      The basic race is that there is a time window between when the PTE gets
      made non-present (PROT_NONE or NUMA), and the TLB is flushed.
      
      During that time, a CPU may continue writing to the page.
      
      This is fine most of the time, however compaction or the NUMA migration
      code may come in, and migrate the page away.
      
      When that happens, the CPU may continue writing, through the cached
      translation, to what is no longer the current memory location of the
      process.
      
      This only affects x86, which has a somewhat optimistic pte_accessible.
      All other architectures appear to be safe, and will either always flush,
      or flush whenever there is a valid mapping, even with no permissions
      (SPARC).
      
      The basic race looks like this:
      
      CPU A			CPU B			CPU C
      
      						load TLB entry
      make entry PTE/PMD_NUMA
      			fault on entry
      						read/write old page
      			start migrating page
      			change PTE/PMD to new page
      						read/write old page [*]
      flush TLB
      						reload TLB from new entry
      						read/write new page
      						lose data
      
      [*] the old page may belong to a new user at this point!
      
      The obvious fix is to flush remote TLB entries, by making sure that
      pte_accessible aware of the fact that PROT_NONE and PROT_NUMA memory may
      still be accessible if there is a TLB flush pending for the mm.
      
      This should fix both NUMA migration and compaction.
      
      [mgorman@suse.de: fix build]
      Signed-off-by: NRik van Riel <riel@redhat.com>
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Cc: Alex Thorlton <athorlton@sgi.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      20841405
  7. 12 9月, 2013 1 次提交
  8. 14 8月, 2013 2 次提交
  9. 07 8月, 2013 1 次提交
  10. 04 7月, 2013 1 次提交
    • P
      mm: soft-dirty bits for user memory changes tracking · 0f8975ec
      Pavel Emelyanov 提交于
      The soft-dirty is a bit on a PTE which helps to track which pages a task
      writes to.  In order to do this tracking one should
      
        1. Clear soft-dirty bits from PTEs ("echo 4 > /proc/PID/clear_refs)
        2. Wait some time.
        3. Read soft-dirty bits (55'th in /proc/PID/pagemap2 entries)
      
      To do this tracking, the writable bit is cleared from PTEs when the
      soft-dirty bit is.  Thus, after this, when the task tries to modify a
      page at some virtual address the #PF occurs and the kernel sets the
      soft-dirty bit on the respective PTE.
      
      Note, that although all the task's address space is marked as r/o after
      the soft-dirty bits clear, the #PF-s that occur after that are processed
      fast.  This is so, since the pages are still mapped to physical memory,
      and thus all the kernel does is finds this fact out and puts back
      writable, dirty and soft-dirty bits on the PTE.
      
      Another thing to note, is that when mremap moves PTEs they are marked
      with soft-dirty as well, since from the user perspective mremap modifies
      the virtual memory at mremap's new address.
      Signed-off-by: NPavel Emelyanov <xemul@parallels.com>
      Cc: Matt Mackall <mpm@selenic.com>
      Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
      Cc: Glauber Costa <glommer@parallels.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0f8975ec
  11. 29 6月, 2013 1 次提交
  12. 13 2月, 2013 1 次提交
    • M
      x86/mm: Check if PUD is large when validating a kernel address · 0ee364eb
      Mel Gorman 提交于
      A user reported the following oops when a backup process reads
      /proc/kcore:
      
       BUG: unable to handle kernel paging request at ffffbb00ff33b000
       IP: [<ffffffff8103157e>] kern_addr_valid+0xbe/0x110
       [...]
      
       Call Trace:
        [<ffffffff811b8aaa>] read_kcore+0x17a/0x370
        [<ffffffff811ad847>] proc_reg_read+0x77/0xc0
        [<ffffffff81151687>] vfs_read+0xc7/0x130
        [<ffffffff811517f3>] sys_read+0x53/0xa0
        [<ffffffff81449692>] system_call_fastpath+0x16/0x1b
      
      Investigation determined that the bug triggered when reading
      system RAM at the 4G mark. On this system, that was the first
      address using 1G pages for the virt->phys direct mapping so the
      PUD is pointing to a physical address, not a PMD page.
      
      The problem is that the page table walker in kern_addr_valid() is
      not checking pud_large() and treats the physical address as if
      it was a PMD.  If it happens to look like pmd_none then it'll
      silently fail, probably returning zeros instead of real data. If
      the data happens to look like a present PMD though, it will be
      walked resulting in the oops above.
      
      This patch adds the necessary pud_large() check.
      
      Unfortunately the problem was not readily reproducible and now
      they are running the backup program without accessing
      /proc/kcore so the patch has not been validated but I think it
      makes sense.
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Reviewed-by: NRik van Riel <riel@redhat.coM>
      Reviewed-by: NMichal Hocko <mhocko@suse.cz>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: stable@vger.kernel.org
      Cc: linux-mm@kvack.org
      Link: http://lkml.kernel.org/r/20130211145236.GX21389@suse.deSigned-off-by: NIngo Molnar <mingo@kernel.org>
      0ee364eb
  13. 26 1月, 2013 1 次提交
  14. 24 1月, 2013 1 次提交
  15. 11 12月, 2012 2 次提交
  16. 18 11月, 2012 3 次提交
  17. 09 10月, 2012 1 次提交
    • A
      mm: thp: fix pmd_present for split_huge_page and PROT_NONE with THP · 027ef6c8
      Andrea Arcangeli 提交于
      In many places !pmd_present has been converted to pmd_none.  For pmds
      that's equivalent and pmd_none is quicker so using pmd_none is better.
      
      However (unless we delete pmd_present) we should provide an accurate
      pmd_present too.  This will avoid the risk of code thinking the pmd is non
      present because it's under __split_huge_page_map, see the pmd_mknotpresent
      there and the comment above it.
      
      If the page has been mprotected as PROT_NONE, it would also lead to a
      pmd_present false negative in the same way as the race with
      split_huge_page.
      
      Because the PSE bit stays on at all times (both during split_huge_page and
      when the _PAGE_PROTNONE bit get set), we could only check for the PSE bit,
      but checking the PROTNONE bit too is still good to remember pmd_present
      must always keep PROT_NONE into account.
      
      This explains a not reproducible BUG_ON that was seldom reported on the
      lists.
      
      The same issue is in pmd_large, it would go wrong with both PROT_NONE and
      if it races with split_huge_page.
      Signed-off-by: NAndrea Arcangeli <aarcange@redhat.com>
      Acked-by: NRik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <jweiner@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      027ef6c8
  18. 03 10月, 2012 1 次提交
  19. 18 12月, 2011 1 次提交
  20. 14 1月, 2011 6 次提交
  21. 20 10月, 2010 1 次提交
  22. 24 8月, 2010 1 次提交
    • S
      x86, mm: Avoid unnecessary TLB flush · 61c77326
      Shaohua Li 提交于
      In x86, access and dirty bits are set automatically by CPU when CPU accesses
      memory. When we go into the code path of below flush_tlb_fix_spurious_fault(),
      we already set dirty bit for pte and don't need flush tlb. This might mean
      tlb entry in some CPUs hasn't dirty bit set, but this doesn't matter. When
      the CPUs do page write, they will automatically check the bit and no software
      involved.
      
      On the other hand, flush tlb in below position is harmful. Test creates CPU
      number of threads, each thread writes to a same but random address in same vma
      range and we measure the total time. Under a 4 socket system, original time is
      1.96s, while with the patch, the time is 0.8s. Under a 2 socket system, there is
      20% time cut too. perf shows a lot of time are taking to send ipi/handle ipi for
      tlb flush.
      Signed-off-by: NShaohua Li <shaohua.li@intel.com>
      LKML-Reference: <20100816011655.GA362@sli10-desk.sh.intel.com>
      Acked-by: NSuresh Siddha <suresh.b.siddha@intel.com>
      Cc: Andrea Archangeli <aarcange@redhat.com>
      Signed-off-by: NH. Peter Anvin <hpa@zytor.com>
      61c77326
  23. 24 11月, 2009 2 次提交
  24. 31 8月, 2009 1 次提交
  25. 18 8月, 2009 1 次提交
    • S
      x86, pat: Allow ISA memory range uncacheable mapping requests · 1adcaafe
      Suresh Siddha 提交于
      Max Vozeler reported:
      >  Bug 13877 -  bogl-term broken with CONFIG_X86_PAT=y, works with =n
      >
      >  strace of bogl-term:
      >  814   mmap2(NULL, 65536, PROT_READ|PROT_WRITE, MAP_SHARED, 4, 0)
      >				 = -1 EAGAIN (Resource temporarily unavailable)
      >  814   write(2, "bogl: mmaping /dev/fb0: Resource temporarily unavailable\n",
      >	       57) = 57
      
      PAT code maps the ISA memory range as WB in the PAT attribute, so that
      fixed range MTRR registers define the actual memory type (UC/WC/WT etc).
      
      But the upper level is_new_memtype_allowed() API checks are failing,
      as the request here is for UC and the return tracked type is WB (Tracked type is
      WB as MTRR type for this legacy range potentially will be different for each
      4k page).
      
      Fix is_new_memtype_allowed() by always succeeding the ISA address range
      checks, as the null PAT (WB) and def MTRR fixed range register settings
      satisfy the memory type needs of the applications that map the ISA address
      range.
      Reported-and-Tested-by: NMax Vozeler <xam@debian.org>
      Signed-off-by: NSuresh Siddha <suresh.b.siddha@intel.com>
      Signed-off-by: NVenkatesh Pallipadi <venkatesh.pallipadi@intel.com>
      Signed-off-by: NH. Peter Anvin <hpa@zytor.com>
      1adcaafe
  26. 29 6月, 2009 2 次提交
  27. 15 6月, 2009 1 次提交
  28. 13 6月, 2009 1 次提交