1. 26 9月, 2016 1 次提交
    • L
      mm: check VMA flags to avoid invalid PROT_NONE NUMA balancing · 38e08854
      Lorenzo Stoakes 提交于
      The NUMA balancing logic uses an arch-specific PROT_NONE page table flag
      defined by pte_protnone() or pmd_protnone() to mark PTEs or huge page
      PMDs respectively as requiring balancing upon a subsequent page fault.
      User-defined PROT_NONE memory regions which also have this flag set will
      not normally invoke the NUMA balancing code as do_page_fault() will send
      a segfault to the process before handle_mm_fault() is even called.
      
      However if access_remote_vm() is invoked to access a PROT_NONE region of
      memory, handle_mm_fault() is called via faultin_page() and
      __get_user_pages() without any access checks being performed, meaning
      the NUMA balancing logic is incorrectly invoked on a non-NUMA memory
      region.
      
      A simple means of triggering this problem is to access PROT_NONE mmap'd
      memory using /proc/self/mem which reliably results in the NUMA handling
      functions being invoked when CONFIG_NUMA_BALANCING is set.
      
      This issue was reported in bugzilla (issue 99101) which includes some
      simple repro code.
      
      There are BUG_ON() checks in do_numa_page() and do_huge_pmd_numa_page()
      added at commit c0e7cad9 to avoid accidentally provoking strange
      behaviour by attempting to apply NUMA balancing to pages that are in
      fact PROT_NONE.  The BUG_ON()'s are consistently triggered by the repro.
      
      This patch moves the PROT_NONE check into mm/memory.c rather than
      invoking BUG_ON() as faulting in these pages via faultin_page() is a
      valid reason for reaching the NUMA check with the PROT_NONE page table
      flag set and is therefore not always a bug.
      
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=99101Reported-by: NTrevor Saunders <tbsaunde@tbsaunde.org>
      Signed-off-by: NLorenzo Stoakes <lstoakes@gmail.com>
      Acked-by: NRik van Riel <riel@redhat.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      38e08854
  2. 10 9月, 2016 1 次提交
    • D
      mm: fix show_smap() for zone_device-pmd ranges · ca120cf6
      Dan Williams 提交于
      Attempting to dump /proc/<pid>/smaps for a process with pmd dax mappings
      currently results in the following VM_BUG_ONs:
      
       kernel BUG at mm/huge_memory.c:1105!
       task: ffff88045f16b140 task.stack: ffff88045be14000
       RIP: 0010:[<ffffffff81268f9b>]  [<ffffffff81268f9b>] follow_trans_huge_pmd+0x2cb/0x340
       [..]
       Call Trace:
        [<ffffffff81306030>] smaps_pte_range+0xa0/0x4b0
        [<ffffffff814c2755>] ? vsnprintf+0x255/0x4c0
        [<ffffffff8123c46e>] __walk_page_range+0x1fe/0x4d0
        [<ffffffff8123c8a2>] walk_page_vma+0x62/0x80
        [<ffffffff81307656>] show_smap+0xa6/0x2b0
      
       kernel BUG at fs/proc/task_mmu.c:585!
       RIP: 0010:[<ffffffff81306469>]  [<ffffffff81306469>] smaps_pte_range+0x499/0x4b0
       Call Trace:
        [<ffffffff814c2795>] ? vsnprintf+0x255/0x4c0
        [<ffffffff8123c46e>] __walk_page_range+0x1fe/0x4d0
        [<ffffffff8123c8a2>] walk_page_vma+0x62/0x80
        [<ffffffff81307696>] show_smap+0xa6/0x2b0
      
      These locations are sanity checking page flags that must be set for an
      anonymous transparent huge page, but are not set for the zone_device
      pages associated with dax mappings.
      
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      ca120cf6
  3. 27 8月, 2016 1 次提交
    • A
      soft_dirty: fix soft_dirty during THP split · 804dd150
      Andrea Arcangeli 提交于
      While adding proper userfaultfd_wp support with bits in pagetable and
      swap entry to avoid false positives WP userfaults through swap/fork/
      KSM/etc, I've been adding a framework that mostly mirrors soft dirty.
      
      So I noticed in one place I had to add uffd_wp support to the pagetables
      that wasn't covered by soft_dirty and I think it should have.
      
      Example: in the THP migration code migrate_misplaced_transhuge_page()
      pmd_mkdirty is called unconditionally after mk_huge_pmd.
      
      	entry = mk_huge_pmd(new_page, vma->vm_page_prot);
      	entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
      
      That sets soft dirty too (it's a false positive for soft dirty, the soft
      dirty bit could be more finegrained and transfer the bit like uffd_wp
      will do..  pmd/pte_uffd_wp() enforces the invariant that when it's set
      pmd/pte_write is not set).
      
      However in the THP split there's no unconditional pmd_mkdirty after
      mk_huge_pmd and pte_swp_mksoft_dirty isn't called after the migration
      entry is created.  The code sets the dirty bit in the struct page
      instead of setting it in the pagetable (which is fully equivalent as far
      as the real dirty bit is concerned, as the whole point of pagetable bits
      is to be eventually flushed out of to the page, but that is not
      equivalent for the soft-dirty bit that gets lost in translation).
      
      This was found by code review only and totally untested as I'm working
      to actually replace soft dirty and I don't have time to test potential
      soft dirty bugfixes as well :).
      
      Transfer the soft_dirty from pmd to pte during THP splits.
      
      This fix avoids losing the soft_dirty bit and avoids userland memory
      corruption in the checkpoint.
      
      Fixes: eef1b3ba ("thp: implement split_huge_pmd()")
      Link: http://lkml.kernel.org/r/1471610515-30229-2-git-send-email-aarcange@redhat.comSigned-off-by: NAndrea Arcangeli <aarcange@redhat.com>
      Acked-by: NPavel Emelyanov <xemul@virtuozzo.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      804dd150
  4. 29 7月, 2016 5 次提交
  5. 27 7月, 2016 22 次提交
  6. 15 7月, 2016 2 次提交
    • N
      mm: thp: move pmd check inside ptl for freeze_page() · 33f4751e
      Naoya Horiguchi 提交于
      I found a race condition triggering VM_BUG_ON() in freeze_page(), when
      running a testcase with 3 processes:
        - process 1: keep writing thp,
        - process 2: keep clearing soft-dirty bits from virtual address of process 1
        - process 3: call migratepages for process 1,
      
      The kernel message is like this:
      
        kernel BUG at /src/linux-dev/mm/huge_memory.c:3096!
        invalid opcode: 0000 [#1] SMP
        Modules linked in: cfg80211 rfkill crc32c_intel ppdev serio_raw pcspkr virtio_balloon virtio_console parport_pc parport pvpanic acpi_cpufreq tpm_tis tpm i2c_piix4 virtio_blk virtio_net ata_generic pata_acpi floppy virtio_pci virtio_ring virtio
        CPU: 0 PID: 28863 Comm: migratepages Not tainted 4.6.0-v4.6-160602-0827-+ #2
        Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
        task: ffff880037320000 ti: ffff88007cdd0000 task.ti: ffff88007cdd0000
        RIP: 0010:[<ffffffff811f8e06>]  [<ffffffff811f8e06>] split_huge_page_to_list+0x496/0x590
        RSP: 0018:ffff88007cdd3b70  EFLAGS: 00010202
        RAX: 0000000000000001 RBX: ffff88007c7b88c0 RCX: 0000000000000000
        RDX: 0000000000000000 RSI: 0000000700000200 RDI: ffffea0003188000
        RBP: ffff88007cdd3bb8 R08: 0000000000000001 R09: 00003ffffffff000
        R10: ffff880000000000 R11: ffffc000001fffff R12: ffffea0003188000
        R13: ffffea0003188000 R14: 0000000000000000 R15: 0400000000000080
        FS:  00007f8ec241d740(0000) GS:ffff88007dc00000(0000) knlGS:0000000000000000             CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 00007f8ec1f3ed20 CR3: 000000003707b000 CR4: 00000000000006f0
        Call Trace:
          ? list_del+0xd/0x30
          queue_pages_pte_range+0x4d1/0x590
          __walk_page_range+0x204/0x4e0
          walk_page_range+0x71/0xf0
          queue_pages_range+0x75/0x90
          ? queue_pages_hugetlb+0x190/0x190
          ? new_node_page+0xc0/0xc0
          ? change_prot_numa+0x40/0x40
          migrate_to_node+0x71/0xd0
          do_migrate_pages+0x1c3/0x210
          SyS_migrate_pages+0x261/0x290
          entry_SYSCALL_64_fastpath+0x1a/0xa4
        Code: e8 b0 87 fb ff 0f 0b 48 c7 c6 30 32 9f 81 e8 a2 87 fb ff 0f 0b 48 c7 c6 b8 46 9f 81 e8 94 87 fb ff 0f 0b 85 c0 0f 84 3e fd ff ff <0f> 0b 85 c0 0f 85 a6 00 00 00 48 8b 75 c0 4c 89 f7 41 be f0 ff
        RIP   split_huge_page_to_list+0x496/0x590
      
      I'm not sure of the full scenario of the reproduction, but my debug
      showed that split_huge_pmd_address(freeze=true) returned without running
      main code of pmd splitting because pmd_present(*pmd) in precheck somehow
      returned 0.  If this happens, the subsequent try_to_unmap() fails and
      returns non-zero (because page_mapcount() still > 0), and finally
      VM_BUG_ON() fires.  This patch tries to fix it by prechecking pmd state
      inside ptl.
      
      Link: http://lkml.kernel.org/r/1466990929-7452-1-git-send-email-n-horiguchi@ah.jp.nec.comSigned-off-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      33f4751e
    • H
      madvise_free, thp: fix madvise_free_huge_pmd return value after splitting · 9818b8cd
      Huang Ying 提交于
      madvise_free_huge_pmd should return 0 if the fallback PTE operations are
      required.  In madvise_free_huge_pmd, if part pages of THP are discarded,
      the THP will be split and fallback PTE operations should be used if
      splitting succeeds.  But the original code will make fallback PTE
      operations skipped, after splitting succeeds.  Fix that via make
      madvise_free_huge_pmd return 0 after splitting successfully, so that the
      fallback PTE operations will be done.
      
      Link: http://lkml.kernel.org/r/1467135452-16688-1-git-send-email-ying.huang@intel.comSigned-off-by: N"Huang, Ying" <ying.huang@intel.com>
      Acked-by: NMinchan Kim <minchan@kernel.org>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Ebru Akagunduz <ebru.akagunduz@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9818b8cd
  7. 21 5月, 2016 4 次提交
  8. 20 5月, 2016 2 次提交
    • H
      huge mm: move_huge_pmd does not need new_vma · bf8616d5
      Hugh Dickins 提交于
      Remove move_huge_pmd()'s redundant new_vma arg: all it was used for was
      a VM_NOHUGEPAGE check on new_vma flags, but the new_vma is cloned from
      the old vma, so a trans_huge_pmd in the new_vma will be as acceptable as
      it was in the old vma, alignment and size permitting.
      Signed-off-by: NHugh Dickins <hughd@google.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andres Lagar-Cavilla <andreslc@google.com>
      Cc: Yang Shi <yang.shi@linaro.org>
      Cc: Ning Qu <quning@gmail.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Andres Lagar-Cavilla <andreslc@google.com>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      bf8616d5
    • J
      mm: rename _count, field of the struct page, to _refcount · 0139aa7b
      Joonsoo Kim 提交于
      Many developers already know that field for reference count of the
      struct page is _count and atomic type.  They would try to handle it
      directly and this could break the purpose of page reference count
      tracepoint.  To prevent direct _count modification, this patch rename it
      to _refcount and add warning message on the code.  After that, developer
      who need to handle reference count will find that field should not be
      accessed directly.
      
      [akpm@linux-foundation.org: fix comments, per Vlastimil]
      [akpm@linux-foundation.org: Documentation/vm/transhuge.txt too]
      [sfr@canb.auug.org.au: sync ethernet driver changes]
      Signed-off-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: NStephen Rothwell <sfr@canb.auug.org.au>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Berg <johannes@sipsolutions.net>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Sunil Goutham <sgoutham@cavium.com>
      Cc: Chris Metcalf <cmetcalf@mellanox.com>
      Cc: Manish Chopra <manish.chopra@qlogic.com>
      Cc: Yuval Mintz <yuval.mintz@qlogic.com>
      Cc: Tariq Toukan <tariqt@mellanox.com>
      Cc: Saeed Mahameed <saeedm@mellanox.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0139aa7b
  9. 13 5月, 2016 1 次提交
    • A
      mm: thp: calculate the mapcount correctly for THP pages during WP faults · 6d0a07ed
      Andrea Arcangeli 提交于
      This will provide fully accuracy to the mapcount calculation in the
      write protect faults, so page pinning will not get broken by false
      positive copy-on-writes.
      
      total_mapcount() isn't the right calculation needed in
      reuse_swap_page(), so this introduces a page_trans_huge_mapcount()
      that is effectively the full accurate return value for page_mapcount()
      if dealing with Transparent Hugepages, however we only use the
      page_trans_huge_mapcount() during COW faults where it strictly needed,
      due to its higher runtime cost.
      
      This also provide at practical zero cost the total_mapcount
      information which is needed to know if we can still relocate the page
      anon_vma to the local vma. If page_trans_huge_mapcount() returns 1 we
      can reuse the page no matter if it's a pte or a pmd_trans_huge
      triggering the fault, but we can only relocate the page anon_vma to
      the local vma->anon_vma if we're sure it's only this "vma" mapping the
      whole THP physical range.
      
      Kirill A. Shutemov discovered the problem with moving the page
      anon_vma to the local vma->anon_vma in a previous version of this
      patch and another problem in the way page_move_anon_rmap() was called.
      
      Andrew Morton discovered that CONFIG_SWAP=n wouldn't build in a
      previous version, because reuse_swap_page must be a macro to call
      page_trans_huge_mapcount from swap.h, so this uses a macro again
      instead of an inline function. With this change at least it's a less
      dangerous usage than it was before, because "page" is used only once
      now, while with the previous code reuse_swap_page(page++) would have
      called page_mapcount on page+1 and it would have increased page twice
      instead of just once.
      
      Dean Luick noticed an uninitialized variable that could result in a
      rmap inefficiency for the non-THP case in a previous version.
      
      Mike Marciniszyn said:
      
      : Our RDMA tests are seeing an issue with memory locking that bisects to
      : commit 61f5d698 ("mm: re-enable THP")
      :
      : The test program registers two rather large MRs (512M) and RDMA
      : writes data to a passive peer using the first and RDMA reads it back
      : into the second MR and compares that data.  The sizes are chosen randomly
      : between 0 and 1024 bytes.
      :
      : The test will get through a few (<= 4 iterations) and then gets a
      : compare error.
      :
      : Tracing indicates the kernel logical addresses associated with the individual
      : pages at registration ARE correct , the data in the "RDMA read response only"
      : packets ARE correct.
      :
      : The "corruption" occurs when the packet crosse two pages that are not physically
      : contiguous.   The second page reads back as zero in the program.
      :
      : It looks like the user VA at the point of the compare error no longer points to
      : the same physical address as was registered.
      :
      : This patch totally resolves the issue!
      
      Link: http://lkml.kernel.org/r/1462547040-1737-2-git-send-email-aarcange@redhat.comSigned-off-by: NAndrea Arcangeli <aarcange@redhat.com>
      Reviewed-by: N"Kirill A. Shutemov" <kirill@shutemov.name>
      Reviewed-by: NDean Luick <dean.luick@intel.com>
      Tested-by: NAlex Williamson <alex.williamson@redhat.com>
      Tested-by: NMike Marciniszyn <mike.marciniszyn@intel.com>
      Tested-by: NJosh Collier <josh.d.collier@intel.com>
      Cc: Marc Haber <mh+linux-kernel@zugschlus.de>
      Cc: <stable@vger.kernel.org>	[4.5]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6d0a07ed
  10. 06 5月, 2016 1 次提交