1. 11 Apr, 2020 2 commits
  2. 08 Apr, 2020 7 commits
    • C
      mm: fix ambiguous comments for better code readability · 552657b7
      chenqiwu committed
      The parameter of remap_pfn_range() @pfn passed from the caller is actually
      a page-frame number converted by corresponding physical address of kernel
      memory, the original comment is ambiguous that may mislead the users.
      
      Meanwhile, there is an ambiguous typo "VMM" in the comment of
      vm_area_struct.  So fixing them will make the code more readable.
      Signed-off-by: chenqiwu <chenqiwu@xiaomi.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Link: http://lkml.kernel.org/r/1583026921-15279-1-git-send-email-qiwuchen55@gmail.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      552657b7
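The relationship between a physical address and the @pfn argument described in 552657b7 can be illustrated with a minimal userspace sketch. The names `MODEL_PAGE_SHIFT`, `model_phys_to_pfn()` and `model_pfn_to_phys()` are hypothetical, and the 4 KiB page size is an assumption; the kernel uses its own per-architecture PAGE_SHIFT.

```c
#include <stdint.h>

/* Assumption for this sketch: 4 KiB pages, i.e. a page shift of 12. */
#define MODEL_PAGE_SHIFT 12

/* Convert a physical address into the page-frame number a caller would pass as @pfn. */
static uint64_t model_phys_to_pfn(uint64_t phys_addr)
{
    return phys_addr >> MODEL_PAGE_SHIFT;
}

/* Inverse direction: physical address of the first byte of the frame. */
static uint64_t model_pfn_to_phys(uint64_t pfn)
{
    return pfn << MODEL_PAGE_SHIFT;
}
```

The offset within the page is dropped by the shift, which is why a pfn identifies a whole frame rather than a byte address.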
    • P
      userfaultfd: wp: support swap and page migration · f45ec5ff
      Peter Xu committed
      For either swap and page migration, we all use the bit 2 of the entry to
      identify whether this entry is uffd write-protected.  It plays a similar
      role as the existing soft dirty bit in swap entries but only for keeping
      the uffd-wp tracking for a specific PTE/PMD.
      
      Something special here is that when we want to recover the uffd-wp bit
      from a swap/migration entry to the PTE bit we'll also need to take care of
      the _PAGE_RW bit and make sure it's cleared, otherwise even with the
      _PAGE_UFFD_WP bit we can't trap it at all.
      
      In change_pte_range() we do nothing for uffd if the PTE is a swap entry.
      That can lead to data mismatch if the page that we are going to write
      protect is swapped out when sending the UFFDIO_WRITEPROTECT.  This patch
      also applies/removes the uffd-wp bit even for the swap entries.
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Bobby Powers <bobbypowers@gmail.com>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Denis Plotnikov <dplotnikov@virtuozzo.com>
      Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Martin Cracauer <cracauer@cons.org>
      Cc: Marty McFadden <mcfadden8@llnl.gov>
      Cc: Maya Gokhale <gokhale2@llnl.gov>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Shaohua Li <shli@fb.com>
      Link: http://lkml.kernel.org/r/20200220163112.11409-11-peterx@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f45ec5ff
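The recovery rule described in f45ec5ff (carry the uffd-wp marker from the swap/migration entry back to the PTE and make sure the write bit is cleared) can be sketched in userspace as follows. This is only a model: the `MODEL_*` bit values and `model_restore_pte_bits()` are hypothetical stand-ins, not the kernel's definitions.

```c
#include <stdint.h>

/* Illustrative bit values for this model only. */
#define MODEL_SWP_UFFD_WP (1u << 2) /* "bit 2 of the entry", per the commit message */
#define MODEL_PTE_RW      (1u << 0) /* stand-in for _PAGE_RW */
#define MODEL_PTE_UFFD_WP (1u << 1) /* stand-in for _PAGE_UFFD_WP */

/*
 * Rebuild PTE bits when a swap/migration entry is turned back into a
 * present PTE. If the entry carried the uffd-wp marker, set the PTE
 * uffd-wp bit and clear the write bit so the next write still traps.
 */
static uint32_t model_restore_pte_bits(uint32_t swp_entry, uint32_t pte_bits)
{
    if (swp_entry & MODEL_SWP_UFFD_WP) {
        pte_bits |= MODEL_PTE_UFFD_WP;
        pte_bits &= ~MODEL_PTE_RW;
    }
    return pte_bits;
}
```

Without clearing the write bit, the uffd-wp bit alone would never cause a fault, which is the "can't trap it at all" case the message warns about.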
    • P
      userfaultfd: wp: drop _PAGE_UFFD_WP properly when fork · b569a176
      Peter Xu committed
      UFFD_EVENT_FORK support for uffd-wp should be already there, except that
      we should clean the uffd-wp bit if uffd fork event is not enabled.  Detect
      that to avoid _PAGE_UFFD_WP being set even if the VMA is not being tracked
      by VM_UFFD_WP.  Do this for both small PTEs and huge PMDs.
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Jerome Glisse <jglisse@redhat.com>
      Reviewed-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Bobby Powers <bobbypowers@gmail.com>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Denis Plotnikov <dplotnikov@virtuozzo.com>
      Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Martin Cracauer <cracauer@cons.org>
      Cc: Marty McFadden <mcfadden8@llnl.gov>
      Cc: Maya Gokhale <gokhale2@llnl.gov>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Shaohua Li <shli@fb.com>
      Link: http://lkml.kernel.org/r/20200220163112.11409-9-peterx@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b569a176
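The fork-time rule from b569a176 can be modelled in a few lines: when the child's VMA is not tracked by VM_UFFD_WP, the uffd-wp bit must not survive the copy. The flag values and the helper `model_copy_pte_on_fork()` below are hypothetical and exist only for this sketch.

```c
#include <stdint.h>

/* Illustrative values for this model only. */
#define MODEL_VM_UFFD_WP  (1u << 0) /* stand-in for the VM_UFFD_WP VMA flag */
#define MODEL_PTE_UFFD_WP (1u << 1) /* stand-in for _PAGE_UFFD_WP */

/*
 * Compute the PTE bits installed in the child. If the child VMA is not
 * tracked by uffd-wp (for example because the fork event is not enabled),
 * drop the uffd-wp bit so it cannot outlive the tracking.
 */
static uint32_t model_copy_pte_on_fork(uint32_t parent_pte, uint32_t child_vm_flags)
{
    if (!(child_vm_flags & MODEL_VM_UFFD_WP))
        parent_pte &= ~MODEL_PTE_UFFD_WP;
    return parent_pte;
}
```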
    • P
      userfaultfd: wp: apply _PAGE_UFFD_WP bit · 292924b2
      Peter Xu committed
      Firstly, introduce two new flags MM_CP_UFFD_WP[_RESOLVE] for
      change_protection() when used with uffd-wp and make sure the two new flags
      are exclusively used.  Then,
      
        - For MM_CP_UFFD_WP: apply the _PAGE_UFFD_WP bit and remove _PAGE_RW
          when a range of memory is write protected by uffd
      
        - For MM_CP_UFFD_WP_RESOLVE: remove the _PAGE_UFFD_WP bit and recover
          _PAGE_RW when write protection is resolved from userspace
      
      And use this new interface in mwriteprotect_range() to replace the old
      MM_CP_DIRTY_ACCT.
      
      Do this change for both PTEs and huge PMDs.  Then we can start to identify
      which PTE/PMD is write protected by general (e.g., COW or soft dirty
      tracking), and which is for userfaultfd-wp.
      
      Since we should keep the _PAGE_UFFD_WP when doing pte_modify(), add it
      into _PAGE_CHG_MASK as well.  Meanwhile, since we have this new bit, we
      can be even more strict when detecting uffd-wp page faults in either
      do_wp_page() or wp_huge_pmd().
      
      After we're with _PAGE_UFFD_WP, a special case is when a page is both
      protected by the general COW logic and also userfault-wp.  Here the
      userfault-wp will have higher priority and will be handled first.  Only
      after the uffd-wp bit is cleared on the PTE/PMD will we continue to handle
      the general COW.  These are the steps on what will happen with such a
      page:
      
        1. CPU accesses write protected shared page (so both protected by
           general COW and uffd-wp), blocked by uffd-wp first because in
           do_wp_page we'll handle uffd-wp first, so it has higher priority
           than general COW.
      
        2. Uffd service thread receives the request, do UFFDIO_WRITEPROTECT
           to remove the uffd-wp bit upon the PTE/PMD.  However here we
           still keep the write bit cleared.  Notify the blocked CPU.
      
        3. The blocked CPU resumes the page fault process with a fault
           retry, during retry it'll notice it was not with the uffd-wp bit
           this time but it is still write protected by general COW, then
           it'll go through the COW path in the fault handler, copy the page,
           apply write bit where necessary, and retry again.
      
        4. The CPU will be able to access this page with write bit set.
      Suggested-by: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Martin Cracauer <cracauer@cons.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Bobby Powers <bobbypowers@gmail.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Maya Gokhale <gokhale2@llnl.gov>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Marty McFadden <mcfadden8@llnl.gov>
      Cc: Denis Plotnikov <dplotnikov@virtuozzo.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Shaohua Li <shli@fb.com>
      Link: http://lkml.kernel.org/r/20200220163112.11409-8-peterx@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      292924b2
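The four-step sequence in 292924b2 (uffd-wp handled before general COW) can be traced with a small userspace state machine. Everything here is a model under stated assumptions: the `MODEL_*` names, `struct model_pte` and both functions are hypothetical, and the choice to restore the write bit on resolve only when the page is not COW-shared is a simplification made for this sketch to mirror step 2.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative bit values for this model only. */
#define MODEL_PAGE_RW      (1u << 0) /* stand-in for _PAGE_RW */
#define MODEL_PAGE_UFFD_WP (1u << 1) /* stand-in for _PAGE_UFFD_WP */

enum model_cp_flag { MODEL_CP_UFFD_WP, MODEL_CP_UFFD_WP_RESOLVE };
enum model_fault { MODEL_FAULT_NONE, MODEL_FAULT_UFFD_WP, MODEL_FAULT_COW };

struct model_pte {
    uint32_t bits;
    bool cow_shared; /* page is still shared and needs COW before writing */
};

/* Model of the two exclusive change_protection() modes. */
static void model_change_protection(struct model_pte *pte, enum model_cp_flag flag)
{
    if (flag == MODEL_CP_UFFD_WP) {
        pte->bits |= MODEL_PAGE_UFFD_WP;
        pte->bits &= ~MODEL_PAGE_RW;
    } else {
        pte->bits &= ~MODEL_PAGE_UFFD_WP;
        /* Assumption: a COW-shared page keeps its write bit cleared. */
        if (!pte->cow_shared)
            pte->bits |= MODEL_PAGE_RW;
    }
}

/* Model of a write access: uffd-wp is checked before general COW. */
static enum model_fault model_write_fault(struct model_pte *pte)
{
    if (pte->bits & MODEL_PAGE_RW)
        return MODEL_FAULT_NONE;      /* step 4: write proceeds */
    if (pte->bits & MODEL_PAGE_UFFD_WP)
        return MODEL_FAULT_UFFD_WP;   /* step 1: userspace is notified first */
    pte->cow_shared = false;          /* step 3: copy the page ... */
    pte->bits |= MODEL_PAGE_RW;       /* ... and apply the write bit */
    return MODEL_FAULT_COW;
}
```

Calling `model_write_fault()` three times around a single `MODEL_CP_UFFD_WP_RESOLVE` reproduces the order uffd-wp fault, COW fault, then a successful write.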
    • A
      userfaultfd: wp: hook userfault handler to write protection fault · 529b930b
      Andrea Arcangeli committed
      There are several cases write protection fault happens.  It could be a
      write to zero page, swapped page or userfault write protected page.  When
      the fault happens, there is no way to know if userfault write protect the
      page before.  Here we just blindly issue a userfault notification for vma
      with VM_UFFD_WP regardless if app write protects it yet.  Application
      should be ready to handle such wp fault.
      
      In the swapin case, always swapin as readonly.  This will cause false
      positive userfaults.  We need to decide later if to eliminate them with a
      flag like soft-dirty in the swap entry (see _PAGE_SWP_SOFT_DIRTY).
      
      hugetlbfs wouldn't need to worry about swapouts, and tmpfs would be
      handled by a swap entry bit like anonymous memory.
      
      The main problem with no easy solution to eliminate the false positives,
      will be if/when userfaultfd is extended to real filesystem pagecache.
      When the pagecache is freed by reclaim we can't leave the radix tree
      pinned if the inode and in turn the radix tree is reclaimed as well.
      
      The estimation is that full accuracy and lack of false positives could be
      easily provided only to anonymous memory (as long as there's no fork or as
      long as MADV_DONTFORK is used on the userfaultfd anonymous range) tmpfs
      and hugetlbfs, it's most certainly worth to achieve it but in a later
      incremental patch.
      
      [peterx@redhat.com: don't conditionally drop FAULT_FLAG_WRITE in do_swap_page]
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Reviewed-by: Jerome Glisse <jglisse@redhat.com>
      Cc: Shaohua Li <shli@fb.com>
      Cc: Bobby Powers <bobbypowers@gmail.com>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Denis Plotnikov <dplotnikov@virtuozzo.com>
      Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Martin Cracauer <cracauer@cons.org>
      Cc: Marty McFadden <mcfadden8@llnl.gov>
      Cc: Maya Gokhale <gokhale2@llnl.gov>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Cc: Rik van Riel <riel@redhat.com>
      Link: http://lkml.kernel.org/r/20200220163112.11409-3-peterx@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      529b930b
    • M
      mm: remove CONFIG_TRANSPARENT_HUGE_PAGECACHE · 396bcc52
      Matthew Wilcox (Oracle) committed
      Commit e496cf3d ("thp: introduce CONFIG_TRANSPARENT_HUGE_PAGECACHE")
      notes that it should be reverted when the PowerPC problem was fixed.  The
      commit fixing the PowerPC problem (953c66c2) did not revert the
      commit; instead setting CONFIG_TRANSPARENT_HUGE_PAGECACHE to the same as
      CONFIG_TRANSPARENT_HUGEPAGE.  Checking with Kirill and Aneesh, this was an
      oversight, so remove the Kconfig symbol and undo the work of commit
      e496cf3d.
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Link: http://lkml.kernel.org/r/20200318140253.6141-6-willy@infradead.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      396bcc52
    • A
      mm/vma: make vma_is_accessible() available for general use · 3122e80e
      Anshuman Khandual committed
      Lets move vma_is_accessible() helper to include/linux/mm.h which makes it
      available for general use.  While here, this replaces all remaining open
      encodings for VMA access check with vma_is_accessible().
      Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
      Acked-by: Guo Ren <guoren@kernel.org>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Guo Ren <guoren@kernel.org>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Paul Burton <paulburton@kernel.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Nick Piggin <npiggin@gmail.com>
      Cc: Paul Mackerras <paulus@ozlabs.org>
      Cc: Will Deacon <will@kernel.org>
      Link: http://lkml.kernel.org/r/1582520593-30704-3-git-send-email-anshuman.khandual@arm.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3122e80e
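The kind of open-coded access check that 3122e80e replaces with a helper can be sketched in userspace. This is a model, not the kernel header: `struct model_vma`, the `MODEL_VM_*` values and `model_vma_is_accessible()` are hypothetical names, and the assumption is that "accessible" means at least one of the read, write or execute permissions is set.

```c
#include <stdbool.h>

/* Illustrative flag values for this model only. */
#define MODEL_VM_READ  0x1ul
#define MODEL_VM_WRITE 0x2ul
#define MODEL_VM_EXEC  0x4ul

/* Minimal stand-in for the part of a VMA that the check needs. */
struct model_vma {
    unsigned long vm_flags;
};

/*
 * A VMA is treated as accessible when any access permission is set.
 * A mapping with no permissions (PROT_NONE style) is not accessible.
 */
static bool model_vma_is_accessible(const struct model_vma *vma)
{
    return (vma->vm_flags & (MODEL_VM_READ | MODEL_VM_WRITE | MODEL_VM_EXEC)) != 0;
}
```

Centralising the test in one helper means callers no longer repeat the flag mask, which is the motivation the commit message gives for the move.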
  3. 03 Apr, 2020 2 commits
  4. 25 Mar, 2020 1 commit
    • T
      mm: Split huge pages on write-notify or COW · 327e9fd4
      Thomas Hellstrom (VMware) committed
      The functions wp_huge_pmd() and wp_huge_pud() currently relies on the
      huge_fault() callback to split huge page table entries if needed.
      However for module users that requires export of the split_huge_xxx()
      functionality which may be undesired. Instead split pre-existing huge
      page-table entries on VM_FAULT_FALLBACK return.
      
      We currently only do COW and write-notify on the PTE level, so if the
      huge_fault() handler returns VM_FAULT_FALLBACK on wp faults,
      split the huge pages and page-table entries. Also do this for huge PUDs
      if there is no huge_fault() handler and the vma is not anonymous, similar
      to how it's done for PMDs.
      
      Note that fs/dax.c still does the splitting in the huge_fault() handler,
      but as huge_fault() A follow-up patch can remove the dax.c split_huge_pmd()
      if needed.
      
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: "Jérôme Glisse" <jglisse@redhat.com>
      Cc: "Christian König" <christian.koenig@amd.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: Thomas Hellstrom (VMware) <thomas_os@shipmail.org>
      Acked-by: Christian König <christian.koenig@amd.com>
      Acked-by: Andrew Morton <akpm@linux-foundation.org>
      327e9fd4
  5. 06 Mar, 2020 1 commit
  6. 16 Jan, 2020 2 commits
    • T
      mm, drm/ttm: Fix vm page protection handling · 5379e4dd
      Thomas Hellstrom committed
      TTM graphics buffer objects may, transparently to user-space,  move
      between IO and system memory. When that happens, all PTEs pointing to the
      old location are zapped before the move and then faulted in again if
      needed. When that happens, the page protection caching mode- and
      encryption bits may change and be different from those of
      struct vm_area_struct::vm_page_prot.
      
      We were using an ugly hack to set the page protection correctly.
      Fix that and instead export and use vmf_insert_mixed_prot() or use
      vmf_insert_pfn_prot().
      Also get the default page protection from
      struct vm_area_struct::vm_page_prot rather than using vm_get_page_prot().
      This way we catch modifications done by the vm system for drivers that
      want write-notification.
      
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: "Jérôme Glisse" <jglisse@redhat.com>
      Cc: "Christian König" <christian.koenig@amd.com>
      Signed-off-by: Thomas Hellstrom <thellstrom@vmware.com>
      Reviewed-by: Christian König <christian.koenig@amd.com>
      Acked-by: Andrew Morton <akpm@linux-foundation.org>
      5379e4dd
    • T
      mm: Add a vmf_insert_mixed_prot() function · 574c5b3d
      Thomas Hellstrom committed
      The TTM module today uses a hack to be able to set a different page
      protection than struct vm_area_struct::vm_page_prot. To be able to do
      this properly, add the needed vm functionality as vmf_insert_mixed_prot().
      
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: "Jérôme Glisse" <jglisse@redhat.com>
      Cc: "Christian König" <christian.koenig@amd.com>
      Signed-off-by: Thomas Hellstrom <thellstrom@vmware.com>
      Acked-by: Christian König <christian.koenig@amd.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Andrew Morton <akpm@linux-foundation.org>
      574c5b3d
  7. 18 Dec, 2019 1 commit
    • D
      mm/memory.c: add apply_to_existing_page_range() helper · be1db475
      Daniel Axtens committed
      apply_to_page_range() takes an address range, and if any parts of it are
      not covered by the existing page table hierarchy, it allocates memory to
      fill them in.
      
      In some use cases, this is not what we want - we want to be able to
      operate exclusively on PTEs that are already in the tables.
      
      Add apply_to_existing_page_range() for this.  Adjust the walker
      functions for apply_to_page_range to take 'create', which switches them
      between the old and new modes.
      
      This will be used in KASAN vmalloc.
      
      [akpm@linux-foundation.org: reduce code duplication]
      [akpm@linux-foundation.org: s/apply_to_existing_pages/apply_to_existing_page_range/]
      [akpm@linux-foundation.org: initialize __apply_to_page_range::err]
      Link: http://lkml.kernel.org/r/20191205140407.1874-1-dja@axtens.net
      Signed-off-by: Daniel Axtens <dja@axtens.net>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Daniel Axtens <dja@axtens.net>
      Cc: Qian Cai <cai@lca.pw>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      be1db475
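The 'create' switch described in be1db475 can be illustrated with a toy two-level table in userspace. This is a sketch under assumptions, not the kernel walker: the table layout, the `model_*` names and the callback type are all hypothetical. With `create` set, missing leaf tables are allocated; without it, ranges that have no leaf table are skipped.

```c
#include <stdbool.h>
#include <stdlib.h>

#define MODEL_TOP_ENTRIES  4
#define MODEL_LEAF_ENTRIES 4

/* Toy two-level table: a NULL leaf pointer means "not populated". */
struct model_table {
    int *leaf[MODEL_TOP_ENTRIES];
};

typedef int (*model_pte_fn)(int *pte, unsigned int index, void *data);

/* Shared walker; 'create' selects between allocating and skipping. */
static int model_apply_range(struct model_table *t, unsigned int start,
                             unsigned int end, model_pte_fn fn, void *data,
                             bool create)
{
    int err = 0;

    for (unsigned int i = start; i < end && !err; i++) {
        unsigned int top = i / MODEL_LEAF_ENTRIES;

        if (top >= MODEL_TOP_ENTRIES)
            return -1;
        if (!t->leaf[top]) {
            if (!create)
                continue; /* existing-only mode: nothing to visit here */
            t->leaf[top] = calloc(MODEL_LEAF_ENTRIES, sizeof(int));
            if (!t->leaf[top])
                return -1;
        }
        err = fn(&t->leaf[top][i % MODEL_LEAF_ENTRIES], i, data);
    }
    return err;
}

static int model_apply_to_page_range(struct model_table *t, unsigned int start,
                                     unsigned int end, model_pte_fn fn, void *data)
{
    return model_apply_range(t, start, end, fn, data, true);
}

static int model_apply_to_existing_page_range(struct model_table *t,
                                              unsigned int start, unsigned int end,
                                              model_pte_fn fn, void *data)
{
    return model_apply_range(t, start, end, fn, data, false);
}

/* Example callback: counts how many entries were visited. */
static int model_count_cb(int *pte, unsigned int index, void *data)
{
    (void)pte;
    (void)index;
    (*(int *)data)++;
    return 0;
}
```

Keeping one walker with a boolean avoids duplicating the traversal, in the spirit of the "reduce code duplication" follow-up noted in the message.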
  8. 08 Dec, 2019 1 commit
  9. 05 Dec, 2019 2 commits
    • M
      mm: remove __ARCH_HAS_4LEVEL_HACK and include/asm-generic/4level-fixup.h · f949286c
      Mike Rapoport committed
      There are no architectures that use include/asm-generic/4level-fixup.h
      therefore it can be removed along with __ARCH_HAS_4LEVEL_HACK define.
      
      Link: http://lkml.kernel.org/r/1572938135-31886-14-git-send-email-rppt@kernel.org
      Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Anatoly Pugachev <matorola@gmail.com>
      Cc: Anton Ivanov <anton.ivanov@cambridgegreys.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Greentime Hu <green.hu@gmail.com>
      Cc: Greg Ungerer <gerg@linux-m68k.org>
      Cc: Helge Deller <deller@gmx.de>
      Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
      Cc: Jeff Dike <jdike@addtoit.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Mark Salter <msalter@redhat.com>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: Peter Rosin <peda@axentia.se>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Rolf Eike Beer <eike-kernel@sf-tec.de>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Russell King <rmk+kernel@armlinux.org.uk>
      Cc: Sam Creasey <sammy@sammy.net>
      Cc: Vincent Chen <deanbo422@gmail.com>
      Cc: Vineet Gupta <Vineet.Gupta1@synopsys.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f949286c
    • Y
      mm/memory.c: replace is_zero_pfn with is_huge_zero_pmd for thp · 3cde287b
      Yu Zhao committed
      For hugely mapped thp, we use is_huge_zero_pmd() to check if it's zero
      page or not.
      
      We do fill ptes with my_zero_pfn() when we split zero thp pmd, but this
      is not what we have in vm_normal_page_pmd() -- pmd_trans_huge_lock()
      makes sure of it.
      
      This is a trivial fix for /proc/pid/numa_maps, and AFAIK nobody
      complains about it.
      
      Gerald Schaefer asked:
      : Maybe the description could also mention the symptom of this bug?
      : I would assume that it affects anon/dirty accounting in gather_pte_stats(),
      : for huge mappings, if zero page mappings are not correctly recognized.
      
      I came across this while I was looking at the code, so I'm not aware of
      any symptom.
      
      Link: http://lkml.kernel.org/r/20191108192629.201556-1-yuzhao@google.com
      Signed-off-by: Yu Zhao <yuzhao@google.com>
      Acked-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.ibm.com>
      Cc: Dave Airlie <airlied@redhat.com>
      Cc: Thomas Hellstrom <thellstrom@vmware.com>
      Cc: Souptick Joarder <jrdr.linux@gmail.com>
      Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3cde287b
  10. 02 Dec, 2019 1 commit
  11. 01 Dec, 2019 4 commits
    • T
      mm/memory.c: fix a huge pud insertion race during faulting · 625110b5
      Thomas Hellstrom committed
      A huge pud page can theoretically be faulted in racing with pmd_alloc()
      in __handle_mm_fault().  That will lead to pmd_alloc() returning an
      invalid pmd pointer.
      
      Fix this by adding a pud_trans_unstable() function similar to
      pmd_trans_unstable() and check whether the pud is really stable before
      using the pmd pointer.
      
      Race:
        Thread 1:             Thread 2:                 Comment
        create_huge_pud()                               Fallback - not taken.
                              create_huge_pud()         Taken.
        pmd_alloc()                                     Returns an invalid pointer.
      
      This will result in user-visible huge page data corruption.
      
      Note that this was caught during a code audit rather than a real
      experienced problem.  It looks to me like the only implementation that
      currently creates huge pud pagetable entries is dev_dax_huge_fault()
      which doesn't appear to care much about private (COW) mappings or
      write-tracking which is, I believe, a prerequisite for create_huge_pud()
      falling back on thread 1, but not in thread 2.
      
      Link: http://lkml.kernel.org/r/20191115115808.21181-2-thomas_os@shipmail.org
      Fixes: a00cc7d9 ("mm, x86: add support for PUD-sized transparent hugepages")
      Signed-off-by: Thomas Hellstrom <thellstrom@vmware.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      625110b5
    • J
      rss_stat: add support to detect RSS updates of external mm · e4dcad20
      Joel Fernandes (Google) committed
      When a process updates the RSS of a different process, the rss_stat
      tracepoint appears in the context of the process doing the update.  This
      can confuse userspace that the RSS of process doing the update is
      updated, while in reality a different process's RSS was updated.
      
      This issue happens in reclaim paths such as with direct reclaim or
      background reclaim.
      
      This patch adds more information to the tracepoint about whether the mm
      being updated belongs to the current process's context (curr field).  We
      also include a hash of the mm pointer so that the process who the mm
      belongs to can be uniquely identified (mm_id field).
      
      Also vsprintf.c is refactored a bit to allow reuse of hashing code.
      
      [akpm@linux-foundation.org: remove unused local `str']
      [joelaf@google.com: inline call to ptr_to_hashval]
        Link: http://lore.kernel.org/r/20191113153816.14b95acd@gandalf.local.home
        Link: http://lkml.kernel.org/r/20191114164622.GC233237@google.com
      Link: http://lkml.kernel.org/r/20191106024452.81923-1-joel@joelfernandes.org
      Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
      Reported-by: Ioannis Ilkos <ilkos@google.com>
      Acked-by: Petr Mladek <pmladek@suse.com>	[lib/vsprintf.c]
      Cc: Tim Murray <timmurray@google.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Carmen Jackson <carmenjackson@google.com>
      Cc: Mayank Gupta <mayankgupta@google.com>
      Cc: Daniel Colascione <dancol@google.com>
      Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e4dcad20
    • J
      mm: emit tracepoint when RSS changes · b3d1411b
      Joel Fernandes (Google) committed
      Useful to track how RSS is changing per TGID to detect spikes in RSS and
      memory hogs.  Several Android teams have been using this patch in
      various kernel trees for half a year now.  Many reported to me it is
      really useful so I'm posting it upstream.
      
      Initial patch developed by Tim Murray.  Changes I made from original
      patch:

        o Prevent any additional space consumed by mm_struct.
      
      Regarding the fact that the RSS may change too often thus flooding the
      traces - note that, there is some "hysteresis" with this already.  That
      is - We update the counter only if we receive 64 page faults due to
      SPLIT_RSS_ACCOUNTING.  However, during zapping or copying of pte range,
      the RSS is updated immediately which can become noisy/flooding.  In a
      previous discussion, we agreed that BPF or ftrace can be used to rate
      limit the signal if this becomes an issue.
      
      Also note that I added wrappers to trace_rss_stat to prevent compiler
      errors where linux/mm.h is included from tracing code, causing errors
      such as:
      
          CC      kernel/trace/power-traces.o
        In file included from ./include/trace/define_trace.h:102,
                         from ./include/trace/events/kmem.h:342,
                         from ./include/linux/mm.h:31,
                         from ./include/linux/ring_buffer.h:5,
                         from ./include/linux/trace_events.h:6,
                         from ./include/trace/events/power.h:12,
                         from kernel/trace/power-traces.c:15:
        ./include/trace/trace_events.h:113:22: error: field `ent' has incomplete type
           struct trace_entry ent;    \
      
      Link: http://lore.kernel.org/r/20190903200905.198642-1-joel@joelfernandes.org
      Link: http://lkml.kernel.org/r/20191001172817.234886-1-joel@joelfernandes.org
      Co-developed-by: Tim Murray <timmurray@google.com>
      Signed-off-by: Tim Murray <timmurray@google.com>
      Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Carmen Jackson <carmenjackson@google.com>
      Cc: Mayank Gupta <mayankgupta@google.com>
      Cc: Daniel Colascione <dancol@google.com>
      Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b3d1411b
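The hysteresis mentioned in b3d1411b (page-fault updates are batched, while zap and copy paths update immediately) can be modelled in userspace as below. This is a simplified sketch: the structures, the `model_*` names and the exact "flush on the 64th event" comparison are assumptions made for illustration and do not reproduce the kernel's accounting code.

```c
/* Assumption: batch size of 64 events, following the commit message. */
#define MODEL_RSS_EVENTS_THRESH 64

struct model_mm {
    long rss_pages;          /* shared counter that tracing would report */
    int tracepoints_emitted; /* how many times the tracepoint would fire */
};

struct model_task {
    struct model_mm *mm;
    long cached_pages; /* per-task delta not yet folded into the mm */
    int events;        /* faults seen since the last sync */
};

/* Fold the per-task delta into the mm; this is where a trace event fires. */
static void model_sync_mm_rss(struct model_task *task)
{
    if (task->cached_pages != 0) {
        task->mm->rss_pages += task->cached_pages;
        task->mm->tracepoints_emitted++;
        task->cached_pages = 0;
    }
    task->events = 0;
}

/* Page-fault path: batched, so at most one trace event per 64 faults. */
static void model_fault_add_page(struct model_task *task)
{
    task->cached_pages++;
    if (++task->events >= MODEL_RSS_EVENTS_THRESH)
        model_sync_mm_rss(task);
}

/* Zap path: updates the shared counter immediately, hence the noise. */
static void model_zap_pages(struct model_mm *mm, long nr_pages)
{
    mm->rss_pages -= nr_pages;
    mm->tracepoints_emitted++;
}
```

In this model 64 faults produce a single trace event, while every zap call produces one, which matches the message's remark that zapping can flood the trace.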
    • J
      mm: drop mmap_sem before calling balance_dirty_pages() in write fault · 89b15332
      Johannes Weiner committed
      One of our services is observing hanging ps/top/etc under heavy write
      IO, and the task states show this is an mmap_sem priority inversion:
      
      A write fault is holding the mmap_sem in read-mode and waiting for
      (heavily cgroup-limited) IO in balance_dirty_pages():
      
          balance_dirty_pages+0x724/0x905
          balance_dirty_pages_ratelimited+0x254/0x390
          fault_dirty_shared_page.isra.96+0x4a/0x90
          do_wp_page+0x33e/0x400
          __handle_mm_fault+0x6f0/0xfa0
          handle_mm_fault+0xe4/0x200
          __do_page_fault+0x22b/0x4a0
          page_fault+0x45/0x50
      
      Somebody tries to change the address space, contending for the mmap_sem in
      write-mode:
      
          call_rwsem_down_write_failed_killable+0x13/0x20
          do_mprotect_pkey+0xa8/0x330
          SyS_mprotect+0xf/0x20
          do_syscall_64+0x5b/0x100
          entry_SYSCALL_64_after_hwframe+0x3d/0xa2
      
      The waiting writer locks out all subsequent readers to avoid lock
      starvation, and several threads can be seen hanging like this:
      
          call_rwsem_down_read_failed+0x14/0x30
          proc_pid_cmdline_read+0xa0/0x480
          __vfs_read+0x23/0x140
          vfs_read+0x87/0x130
          SyS_read+0x42/0x90
          do_syscall_64+0x5b/0x100
          entry_SYSCALL_64_after_hwframe+0x3d/0xa2
      
      To fix this, do what we do for cache read faults already: drop the
      mmap_sem before calling into anything IO bound, in this case the
      balance_dirty_pages() function, and return VM_FAULT_RETRY.
      
      Link: http://lkml.kernel.org/r/20190924194238.GA29030@cmpxchg.org
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Josef Bacik <josef@toxicpanda.com>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      89b15332
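The fix in 89b15332 (release the lock before the IO-bound throttling call and ask the caller to retry) can be shown with a single-threaded userspace model. No real locking is involved: `struct model_fault_ctx`, the `MODEL_FAULT_*` codes and both functions are hypothetical, and a boolean stands in for holding mmap_sem in read mode.

```c
#include <stdbool.h>

enum { MODEL_FAULT_DONE = 0, MODEL_FAULT_RETRY = 1 };

struct model_fault_ctx {
    bool mmap_sem_held;        /* stand-in for holding mmap_sem for read */
    bool throttled_under_lock; /* set if throttling ran with the lock held */
    int throttle_calls;
};

/* Stand-in for the IO-bound throttling step (balance_dirty_pages()). */
static void model_balance_dirty_pages(struct model_fault_ctx *ctx)
{
    ctx->throttle_calls++;
    if (ctx->mmap_sem_held)
        ctx->throttled_under_lock = true;
}

/*
 * Write-fault tail: when the fault may be retried, drop the lock first,
 * throttle, and report RETRY so the caller restarts the fault.
 * When retry is not allowed, throttling still happens with the lock held.
 */
static int model_fault_dirty_shared_page(struct model_fault_ctx *ctx, bool allow_retry)
{
    if (allow_retry) {
        ctx->mmap_sem_held = false;
        model_balance_dirty_pages(ctx);
        return MODEL_FAULT_RETRY;
    }
    model_balance_dirty_pages(ctx);
    return MODEL_FAULT_DONE;
}
```

Because the lock is already released while the task waits on IO, a writer queued on the lock is not blocked behind the throttled fault, which is what breaks the priority inversion described in the message.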
  12. 18 Oct, 2019 1 commit
    • J
      mm: fix double page fault on arm64 if PTE_AF is cleared · 83d116c5
      Jia He committed
      When we tested pmdk unit test [1] vmmalloc_fork TEST3 on arm64 guest, there
      will be a double page fault in __copy_from_user_inatomic of cow_user_page.
      
      To reproduce the bug, the cmd is as follows after you deployed everything:
      make -C src/test/vmmalloc_fork/ TEST_TIME=60m check
      
      Below call trace is from arm64 do_page_fault for debugging purpose:
      [  110.016195] Call trace:
      [  110.016826]  do_page_fault+0x5a4/0x690
      [  110.017812]  do_mem_abort+0x50/0xb0
      [  110.018726]  el1_da+0x20/0xc4
      [  110.019492]  __arch_copy_from_user+0x180/0x280
      [  110.020646]  do_wp_page+0xb0/0x860
      [  110.021517]  __handle_mm_fault+0x994/0x1338
      [  110.022606]  handle_mm_fault+0xe8/0x180
      [  110.023584]  do_page_fault+0x240/0x690
      [  110.024535]  do_mem_abort+0x50/0xb0
      [  110.025423]  el0_da+0x20/0x24
      
      The pte info before __copy_from_user_inatomic is (PTE_AF is cleared):
      [ffff9b007000] pgd=000000023d4f8003, pud=000000023da9b003,
                     pmd=000000023d4b3003, pte=360000298607bd3
      
      As told by Catalin: "On arm64 without hardware Access Flag, copying from
      user will fail because the pte is old and cannot be marked young. So we
      always end up with zeroed page after fork() + CoW for pfn mappings. we
      don't always have a hardware-managed access flag on arm64."
      
      This patch fixes it by calling pte_mkyoung. Also, the parameter is
      changed because vmf should be passed to cow_user_page()
      
      Add a WARN_ON_ONCE when __copy_from_user_inatomic() returns error
      in case there can be some obscure use-case (by Kirill).
      
      [1] https://github.com/pmem/pmdk/tree/master/src/test/vmmalloc_fork

      Signed-off-by: Jia He <justin.he@arm.com>
      Reported-by: Yibo Cai <Yibo.Cai@arm.com>
      Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
      83d116c5
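The failure mode and fix in 83d116c5 can be modelled in userspace: a non-faulting copy fails when the PTE is old and the CPU cannot set the access flag in hardware, so the COW path marks the PTE young before copying. This is only a sketch; `struct model_pte`, `model_copy_inatomic()` and `model_cow_user_page()` are hypothetical names, and the boolean `hw_access_flag` stands in for hardware-managed access flag support.

```c
#include <stdbool.h>
#include <string.h>

struct model_pte {
    bool young; /* stand-in for the access flag (PTE_AF) */
};

/*
 * Model of a non-faulting copy. Returns the number of bytes NOT copied:
 * it fails completely when the PTE is old and hardware cannot mark it young.
 */
static size_t model_copy_inatomic(char *dst, const char *src, size_t n,
                                  const struct model_pte *pte, bool hw_access_flag)
{
    if (!pte->young && !hw_access_flag)
        return n;
    memcpy(dst, src, n);
    return 0;
}

/*
 * Model of the COW copy with the fix applied: mark the PTE young first
 * when hardware will not do it, then copy. Returns true when the data was
 * copied; on failure the destination is zeroed instead.
 */
static bool model_cow_user_page(char *dst, const char *src, size_t n,
                                struct model_pte *pte, bool hw_access_flag)
{
    if (!hw_access_flag && !pte->young)
        pte->young = true; /* stands in for pte_mkyoung() */

    if (model_copy_inatomic(dst, src, n, pte, hw_access_flag)) {
        memset(dst, 0, n);
        return false;
    }
    return true;
}
```

Without the young-marking step, the copy in this model always fails for an old PTE on hardware without access flag management, which corresponds to the "zeroed page after fork() + CoW" symptom quoted in the message.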
  13. 25 Sep, 2019 3 commits
  14. 20 Aug, 2019 1 commit
  15. 19 Jul, 2019 1 commit
  16. 16 Jul, 2019 2 commits
  17. 15 Jul, 2019 1 commit
  18. 13 Jul, 2019 5 commits
  19. 03 Jul, 2019 2 commits