1. 29 7月, 2016 3 次提交
  2. 27 7月, 2016 22 次提交
  3. 15 7月, 2016 2 次提交
    • N
      mm: thp: move pmd check inside ptl for freeze_page() · 33f4751e
      Naoya Horiguchi 提交于
      I found a race condition triggering VM_BUG_ON() in freeze_page(), when
      running a testcase with 3 processes:
        - process 1: keep writing thp,
        - process 2: keep clearing soft-dirty bits from virtual address of process 1
        - process 3: call migratepages for process 1,
      
      The kernel message is like this:
      
        kernel BUG at /src/linux-dev/mm/huge_memory.c:3096!
        invalid opcode: 0000 [#1] SMP
        Modules linked in: cfg80211 rfkill crc32c_intel ppdev serio_raw pcspkr virtio_balloon virtio_console parport_pc parport pvpanic acpi_cpufreq tpm_tis tpm i2c_piix4 virtio_blk virtio_net ata_generic pata_acpi floppy virtio_pci virtio_ring virtio
        CPU: 0 PID: 28863 Comm: migratepages Not tainted 4.6.0-v4.6-160602-0827-+ #2
        Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
        task: ffff880037320000 ti: ffff88007cdd0000 task.ti: ffff88007cdd0000
        RIP: 0010:[<ffffffff811f8e06>]  [<ffffffff811f8e06>] split_huge_page_to_list+0x496/0x590
        RSP: 0018:ffff88007cdd3b70  EFLAGS: 00010202
        RAX: 0000000000000001 RBX: ffff88007c7b88c0 RCX: 0000000000000000
        RDX: 0000000000000000 RSI: 0000000700000200 RDI: ffffea0003188000
        RBP: ffff88007cdd3bb8 R08: 0000000000000001 R09: 00003ffffffff000
        R10: ffff880000000000 R11: ffffc000001fffff R12: ffffea0003188000
        R13: ffffea0003188000 R14: 0000000000000000 R15: 0400000000000080
        FS:  00007f8ec241d740(0000) GS:ffff88007dc00000(0000) knlGS:0000000000000000             CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 00007f8ec1f3ed20 CR3: 000000003707b000 CR4: 00000000000006f0
        Call Trace:
          ? list_del+0xd/0x30
          queue_pages_pte_range+0x4d1/0x590
          __walk_page_range+0x204/0x4e0
          walk_page_range+0x71/0xf0
          queue_pages_range+0x75/0x90
          ? queue_pages_hugetlb+0x190/0x190
          ? new_node_page+0xc0/0xc0
          ? change_prot_numa+0x40/0x40
          migrate_to_node+0x71/0xd0
          do_migrate_pages+0x1c3/0x210
          SyS_migrate_pages+0x261/0x290
          entry_SYSCALL_64_fastpath+0x1a/0xa4
        Code: e8 b0 87 fb ff 0f 0b 48 c7 c6 30 32 9f 81 e8 a2 87 fb ff 0f 0b 48 c7 c6 b8 46 9f 81 e8 94 87 fb ff 0f 0b 85 c0 0f 84 3e fd ff ff <0f> 0b 85 c0 0f 85 a6 00 00 00 48 8b 75 c0 4c 89 f7 41 be f0 ff
        RIP   split_huge_page_to_list+0x496/0x590
      
      I'm not sure of the full scenario of the reproduction, but my debug
      showed that split_huge_pmd_address(freeze=true) returned without running
      main code of pmd splitting because pmd_present(*pmd) in precheck somehow
      returned 0.  If this happens, the subsequent try_to_unmap() fails and
      returns non-zero (because page_mapcount() still > 0), and finally
      VM_BUG_ON() fires.  This patch tries to fix it by prechecking pmd state
      inside ptl.
      
      Link: http://lkml.kernel.org/r/1466990929-7452-1-git-send-email-n-horiguchi@ah.jp.nec.comSigned-off-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      33f4751e
    • H
      madvise_free, thp: fix madvise_free_huge_pmd return value after splitting · 9818b8cd
      Huang Ying 提交于
      madvise_free_huge_pmd should return 0 if the fallback PTE operations are
      required.  In madvise_free_huge_pmd, if part pages of THP are discarded,
      the THP will be split and fallback PTE operations should be used if
      splitting succeeds.  But the original code will make fallback PTE
      operations skipped, after splitting succeeds.  Fix that via make
      madvise_free_huge_pmd return 0 after splitting successfully, so that the
      fallback PTE operations will be done.
      
      Link: http://lkml.kernel.org/r/1467135452-16688-1-git-send-email-ying.huang@intel.comSigned-off-by: N"Huang, Ying" <ying.huang@intel.com>
      Acked-by: NMinchan Kim <minchan@kernel.org>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Ebru Akagunduz <ebru.akagunduz@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9818b8cd
  4. 21 5月, 2016 4 次提交
  5. 20 5月, 2016 2 次提交
    • H
      huge mm: move_huge_pmd does not need new_vma · bf8616d5
      Hugh Dickins 提交于
      Remove move_huge_pmd()'s redundant new_vma arg: all it was used for was
      a VM_NOHUGEPAGE check on new_vma flags, but the new_vma is cloned from
      the old vma, so a trans_huge_pmd in the new_vma will be as acceptable as
      it was in the old vma, alignment and size permitting.
      Signed-off-by: NHugh Dickins <hughd@google.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andres Lagar-Cavilla <andreslc@google.com>
      Cc: Yang Shi <yang.shi@linaro.org>
      Cc: Ning Qu <quning@gmail.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Andres Lagar-Cavilla <andreslc@google.com>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      bf8616d5
    • J
      mm: rename _count, field of the struct page, to _refcount · 0139aa7b
      Joonsoo Kim 提交于
      Many developers already know that field for reference count of the
      struct page is _count and atomic type.  They would try to handle it
      directly and this could break the purpose of page reference count
      tracepoint.  To prevent direct _count modification, this patch rename it
      to _refcount and add warning message on the code.  After that, developer
      who need to handle reference count will find that field should not be
      accessed directly.
      
      [akpm@linux-foundation.org: fix comments, per Vlastimil]
      [akpm@linux-foundation.org: Documentation/vm/transhuge.txt too]
      [sfr@canb.auug.org.au: sync ethernet driver changes]
      Signed-off-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: NStephen Rothwell <sfr@canb.auug.org.au>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Berg <johannes@sipsolutions.net>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Sunil Goutham <sgoutham@cavium.com>
      Cc: Chris Metcalf <cmetcalf@mellanox.com>
      Cc: Manish Chopra <manish.chopra@qlogic.com>
      Cc: Yuval Mintz <yuval.mintz@qlogic.com>
      Cc: Tariq Toukan <tariqt@mellanox.com>
      Cc: Saeed Mahameed <saeedm@mellanox.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0139aa7b
  6. 13 5月, 2016 1 次提交
    • A
      mm: thp: calculate the mapcount correctly for THP pages during WP faults · 6d0a07ed
      Andrea Arcangeli 提交于
      This will provide fully accuracy to the mapcount calculation in the
      write protect faults, so page pinning will not get broken by false
      positive copy-on-writes.
      
      total_mapcount() isn't the right calculation needed in
      reuse_swap_page(), so this introduces a page_trans_huge_mapcount()
      that is effectively the full accurate return value for page_mapcount()
      if dealing with Transparent Hugepages, however we only use the
      page_trans_huge_mapcount() during COW faults where it strictly needed,
      due to its higher runtime cost.
      
      This also provide at practical zero cost the total_mapcount
      information which is needed to know if we can still relocate the page
      anon_vma to the local vma. If page_trans_huge_mapcount() returns 1 we
      can reuse the page no matter if it's a pte or a pmd_trans_huge
      triggering the fault, but we can only relocate the page anon_vma to
      the local vma->anon_vma if we're sure it's only this "vma" mapping the
      whole THP physical range.
      
      Kirill A. Shutemov discovered the problem with moving the page
      anon_vma to the local vma->anon_vma in a previous version of this
      patch and another problem in the way page_move_anon_rmap() was called.
      
      Andrew Morton discovered that CONFIG_SWAP=n wouldn't build in a
      previous version, because reuse_swap_page must be a macro to call
      page_trans_huge_mapcount from swap.h, so this uses a macro again
      instead of an inline function. With this change at least it's a less
      dangerous usage than it was before, because "page" is used only once
      now, while with the previous code reuse_swap_page(page++) would have
      called page_mapcount on page+1 and it would have increased page twice
      instead of just once.
      
      Dean Luick noticed an uninitialized variable that could result in a
      rmap inefficiency for the non-THP case in a previous version.
      
      Mike Marciniszyn said:
      
      : Our RDMA tests are seeing an issue with memory locking that bisects to
      : commit 61f5d698 ("mm: re-enable THP")
      :
      : The test program registers two rather large MRs (512M) and RDMA
      : writes data to a passive peer using the first and RDMA reads it back
      : into the second MR and compares that data.  The sizes are chosen randomly
      : between 0 and 1024 bytes.
      :
      : The test will get through a few (<= 4 iterations) and then gets a
      : compare error.
      :
      : Tracing indicates the kernel logical addresses associated with the individual
      : pages at registration ARE correct , the data in the "RDMA read response only"
      : packets ARE correct.
      :
      : The "corruption" occurs when the packet crosse two pages that are not physically
      : contiguous.   The second page reads back as zero in the program.
      :
      : It looks like the user VA at the point of the compare error no longer points to
      : the same physical address as was registered.
      :
      : This patch totally resolves the issue!
      
      Link: http://lkml.kernel.org/r/1462547040-1737-2-git-send-email-aarcange@redhat.comSigned-off-by: NAndrea Arcangeli <aarcange@redhat.com>
      Reviewed-by: N"Kirill A. Shutemov" <kirill@shutemov.name>
      Reviewed-by: NDean Luick <dean.luick@intel.com>
      Tested-by: NAlex Williamson <alex.williamson@redhat.com>
      Tested-by: NMike Marciniszyn <mike.marciniszyn@intel.com>
      Tested-by: NJosh Collier <josh.d.collier@intel.com>
      Cc: Marc Haber <mh+linux-kernel@zugschlus.de>
      Cc: <stable@vger.kernel.org>	[4.5]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6d0a07ed
  7. 06 5月, 2016 1 次提交
  8. 29 4月, 2016 2 次提交
  9. 26 3月, 2016 1 次提交
  10. 18 3月, 2016 2 次提交
    • K
      thp: fix deadlock in split_huge_pmd() · 5f737714
      Kirill A. Shutemov 提交于
      split_huge_pmd() tries to munlock page with munlock_vma_page().  That
      requires the page to locked.
      
      If the is locked by caller, we would get a deadlock:
      
      	Unable to find swap-space signature
      	INFO: task trinity-c85:1907 blocked for more than 120 seconds.
      	      Not tainted 4.4.0-00032-gf19d0bdced41-dirty #1606
      	"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      	trinity-c85     D ffff88084d997608     0  1907    309 0x00000000
      	Call Trace:
      	  schedule+0x9f/0x1c0
      	  schedule_timeout+0x48e/0x600
      	  io_schedule_timeout+0x1c3/0x390
      	  bit_wait_io+0x29/0xd0
      	  __wait_on_bit_lock+0x94/0x140
      	  __lock_page+0x1d4/0x280
      	  __split_huge_pmd+0x5a8/0x10f0
      	  split_huge_pmd_address+0x1d9/0x230
      	  try_to_unmap_one+0x540/0xc70
      	  rmap_walk_anon+0x284/0x810
      	  rmap_walk_locked+0x11e/0x190
      	  try_to_unmap+0x1b1/0x4b0
      	  split_huge_page_to_list+0x49d/0x18a0
      	  follow_page_mask+0xa36/0xea0
      	  SyS_move_pages+0xaf3/0x1570
      	  entry_SYSCALL_64_fastpath+0x12/0x6b
      	2 locks held by trinity-c85/1907:
      	 #0:  (&mm->mmap_sem){++++++}, at:  SyS_move_pages+0x933/0x1570
      	 #1:  (&anon_vma->rwsem){++++..}, at:  split_huge_page_to_list+0x402/0x18a0
      
      I don't think the deadlock is triggerable without split_huge_page()
      simplifilcation patchset.
      
      But munlock_vma_page() here is wrong: we want to munlock the page
      unconditionally, no need in rmap lookup, that munlock_vma_page() does.
      
      Let's use clear_page_mlock() instead.  It can be called under ptl.
      
      Fixes: e90309c9 ("thp: allow mlocked THP again")
      Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5f737714
    • K
      thp: rewrite freeze_page()/unfreeze_page() with generic rmap walkers · fec89c10
      Kirill A. Shutemov 提交于
      freeze_page() and unfreeze_page() helpers evolved in rather complex
      beasts.  It would be nice to cut complexity of this code.
      
      This patch rewrites freeze_page() using standard try_to_unmap().
      unfreeze_page() is rewritten with remove_migration_ptes().
      
      The result is much simpler.
      
      But the new variant is somewhat slower for PTE-mapped THPs.  Current
      helpers iterates over VMAs the compound page is mapped to, and then over
      ptes within this VMA.  New helpers iterates over small page, then over
      VMA the small page mapped to, and only then find relevant pte.
      
      We have short cut for PMD-mapped THP: we directly install migration
      entries on PMD split.
      
      I don't think the slowdown is critical, considering how much simpler
      result is and that split_huge_page() is quite rare nowadays.  It only
      happens due memory pressure or migration.
      Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      fec89c10