1. 30 June 2021, 1 commit
    • mm/page_alloc: fix memory map initialization for descending nodes · 122e093c
      Authored by Mike Rapoport
      On systems with memory nodes sorted in descending order, for instance the
      Dell Precision WorkStation T5500, the struct pages for higher PFNs (and
      hence lower-numbered nodes) could be overwritten by the initialization of
      struct pages corresponding to the holes in the memory sections.
      
      For example for the below memory layout
      
      [    0.245624] Early memory node ranges
      [    0.248496]   node   1: [mem 0x0000000000001000-0x0000000000090fff]
      [    0.251376]   node   1: [mem 0x0000000000100000-0x00000000dbdf8fff]
      [    0.254256]   node   1: [mem 0x0000000100000000-0x0000001423ffffff]
      [    0.257144]   node   0: [mem 0x0000001424000000-0x0000002023ffffff]
      
      the range 0x1424000000 - 0x1428000000 in the beginning of node 0 starts in
      the middle of a section and will be considered as a hole during the
      initialization of the last section in node 1.
      
      The wrong initialization of the memory map causes panic on boot when
      CONFIG_DEBUG_VM is enabled.
      
      Reorder the memory map initialization loops so that the outer loop always
      iterates over populated memory regions in ascending order and the inner
      loop selects the zone corresponding to the PFN range.
      
      This way, the struct pages for the memory holes are only ever initialized
      for ranges that are actually not populated.
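      
      The reordered code ends up with roughly the following shape (a hedged
      sketch, assuming the memmap_init_zone_range() helper split out by this
      patch; variable declarations omitted and details may differ):
      
        for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn, &nid) {
                struct pglist_data *node = NODE_DATA(nid);
        
                for (j = 0; j < MAX_NR_ZONES; j++) {
                        struct zone *zone = node->node_zones + j;
        
                        if (!populated_zone(zone))
                                continue;
        
                        /* init the intersection of [start_pfn, end_pfn)
                         * with this zone, tracking the trailing hole */
                        memmap_init_zone_range(zone, start_pfn, end_pfn,
                                               &hole_pfn);
                }
        }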
      
      [akpm@linux-foundation.org: coding style fixes]
      
      Link: https://lkml.kernel.org/r/YNXlMqBbL+tBG7yq@kernel.org
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=213073
      Link: https://lkml.kernel.org/r/20210624062305.10940-1-rppt@kernel.org
      Fixes: 0740a50b ("mm/page_alloc.c: refactor initialization of struct page for holes in memory layout")
      Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Boris Petkov <bp@alien8.de>
      Cc: Robert Shteynfeld <robert.shteynfeld@gmail.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  2. 17 June 2021, 1 commit
    • mm/thp: unmap_mapping_page() to fix THP truncate_cleanup_page() · 22061a1f
      Authored by Hugh Dickins
      There is a race between THP unmapping and truncation, when truncate sees
      pmd_none() and skips the entry, after munmap's zap_huge_pmd() cleared
      it, but before its page_remove_rmap() gets to decrement
      compound_mapcount: generating false "BUG: Bad page cache" reports that
      the page is still mapped when deleted.  This commit fixes that, but not
      in the way I hoped.
      
      The first attempt used try_to_unmap(page, TTU_SYNC|TTU_IGNORE_MLOCK)
      instead of unmap_mapping_range() in truncate_cleanup_page(): it has
      often been an annoyance that we usually call unmap_mapping_range() with
      no pages locked, but there apply it to a single locked page.
      try_to_unmap() looks more suitable for a single locked page.
      
      However, try_to_unmap_one() contains a VM_BUG_ON_PAGE(!pvmw.pte,page):
      it is used to insert THP migration entries, but not used to unmap THPs.
      Copy zap_huge_pmd() and add THP handling now? Perhaps, but their TLB
      needs are different, I'm too ignorant of the DAX cases, and couldn't
      decide how far to go for anon+swap.  Set that aside.
      
      The second attempt took a different tack: make no change in truncate.c,
      but modify zap_huge_pmd() to insert an invalidated huge pmd instead of
      clearing it initially, then pmd_clear() between page_remove_rmap() and
      unlocking at the end.  Nice.  But powerpc blows that approach out of the
      water, with its serialize_against_pte_lookup(), and interesting pgtable
      usage.  It would need serious help to get working on powerpc (with a
      minor optimization issue on s390 too).  Set that aside.
      
      Just add an "if (page_mapped(page)) synchronize_rcu();" or other such
      delay, after unmapping in truncate_cleanup_page()? Perhaps, but though
      that's likely to reduce or eliminate the number of incidents, it would
      give less assurance of whether we had identified the problem correctly.
      
      This successful iteration introduces "unmap_mapping_page(page)" instead
      of try_to_unmap(), and goes the usual unmap_mapping_range_tree() route,
      with an addition to details.  Then zap_pmd_range() watches for this
      case, and does spin_unlock(pmd_lock) if so - just like
      page_vma_mapped_walk() now does in the PVMW_SYNC case.  Not pretty, but
      safe.
      
      Note that unmap_mapping_page() is doing a VM_BUG_ON(!PageLocked) to
      assert its interface; but currently that's only used to make sure that
      page->mapping is stable, and zap_pmd_range() doesn't care if the page is
      locked or not.  Along these lines, in invalidate_inode_pages2_range()
      move the initial unmap_mapping_range() out from under page lock, before
      then calling unmap_mapping_page() under page lock if still mapped.
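      
      A hedged sketch of the truncate-side call after this change (the page
      must be locked here, which is what keeps page->mapping stable):
      
        static void truncate_cleanup_page(struct page *page)
        {
                if (page_mapped(page))
                        unmap_mapping_page(page);       /* unmap just this page */
        
                /* ... rest of the cleanup unchanged ... */
        }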
      
      Link: https://lkml.kernel.org/r/a2a4a148-cdd8-942c-4ef8-51b77f643dbe@google.com
      Fixes: fc127da0 ("truncate: handle file thp")
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reviewed-by: Yang Shi <shy828301@gmail.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jue Wang <juew@google.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Wang Yugui <wangyugui@e16-tech.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  3. 15 May 2021, 1 commit
  4. 07 May 2021, 1 commit
  5. 06 May 2021, 3 commits
    • mm/gup: do not migrate zero page · 9afaf30f
      Authored by Pavel Tatashin
      On some platforms ZERO_PAGE(0) might end-up in a movable zone.  Do not
      migrate zero page in gup during longterm pinning as migration of zero page
      is not allowed.
      
      For example, in x86 QEMU with 16G of memory and kernelcore=5G parameter, I
      see the following:
      
      Boot#1: zero_pfn  0x48a8d zero_pfn zone: ZONE_DMA32
      Boot#2: zero_pfn 0x20168d zero_pfn zone: ZONE_MOVABLE
      
      On x86, empty_zero_page is declared in .bss and depending on the loader
      may end up in different physical locations during boots.
      
      Also, move the is_zero_pfn() and my_zero_pfn() functions under CONFIG_MMU,
      because the zero_pfn they use is declared in memory.c, which is only
      compiled with CONFIG_MMU.
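      
      The gup-side check amounts to something like the following (a hedged
      sketch of the test inside the long-term pinning migration loop; exact
      placement differs in the real patch):
      
        /*
         * The shared zero page may sit in a movable zone, but it can
         * never be migrated: skip it instead of trying and failing.
         */
        if (is_zero_pfn(page_to_pfn(head)))
                continue;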
      
      Link: https://lkml.kernel.org/r/20210215161349.246722-9-pasha.tatashin@soleen.com
      Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: James Morris <jmorris@namei.org>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Sasha Levin <sashal@kernel.org>
      Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Cc: Tyler Hicks <tyhicks@linux.microsoft.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: honor PF_MEMALLOC_PIN for all movable pages · 8e3560d9
      Authored by Pavel Tatashin
      PF_MEMALLOC_PIN is currently honored only for CMA pages.  Extend the flag
      to cover any allocation from ZONE_MOVABLE by removing __GFP_MOVABLE from
      the gfp_mask whenever the flag is set in the current context.
      
      Add is_pinnable_page(), which returns true if a page is pinnable.  A
      pinnable page is neither in ZONE_MOVABLE nor of MIGRATE_CMA type.
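      
      A minimal sketch of the two pieces (the exact helper form and the
      placement of the flag handling are assumptions; the real definitions
      live in the mm headers):
      
        /* pinnable: neither ZONE_MOVABLE nor a CMA pageblock */
        static inline bool is_pinnable_page(struct page *page)
        {
                return !(zone_idx(page_zone(page)) == ZONE_MOVABLE ||
                         is_migrate_cma_page(page));
        }
      
        /* in current_gfp_context(): pinning forbids movable placement */
        if (pflags & PF_MEMALLOC_PIN)
                flags &= ~__GFP_MOVABLE;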
      
      Link: https://lkml.kernel.org/r/20210215161349.246722-8-pasha.tatashin@soleen.com
      Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: James Morris <jmorris@namei.org>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Sasha Levin <sashal@kernel.org>
      Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Cc: Tyler Hicks <tyhicks@linux.microsoft.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • userfaultfd: add minor fault registration mode · 7677f7fd
      Authored by Axel Rasmussen
      Patch series "userfaultfd: add minor fault handling", v9.
      
      Overview
      ========
      
      This series adds a new userfaultfd feature, UFFD_FEATURE_MINOR_HUGETLBFS.
      When enabled (via the UFFDIO_API ioctl), this feature means that any
      hugetlbfs VMAs registered with UFFDIO_REGISTER_MODE_MISSING will *also*
      get events for "minor" faults.  By "minor" fault, I mean the following
      situation:
      
      Let there exist two mappings (i.e., VMAs) to the same page(s) (shared
      memory).  One of the mappings is registered with userfaultfd (in minor
      mode), and the other is not.  Via the non-UFFD mapping, the underlying
      pages have already been allocated & filled with some contents.  The UFFD
      mapping has not yet been faulted in; when it is touched for the first
      time, this results in what I'm calling a "minor" fault.  As a concrete
      example, when working with hugetlbfs, we have huge_pte_none(), but
      find_lock_page() finds an existing page.
      
      We also add a new ioctl to resolve such faults: UFFDIO_CONTINUE.  The idea
      is, userspace resolves the fault by either a) doing nothing if the
      contents are already correct, or b) updating the underlying contents using
      the second, non-UFFD mapping (via memcpy/memset or similar, or something
      fancier like RDMA, or etc...).  In either case, userspace issues
      UFFDIO_CONTINUE to tell the kernel "I have ensured the page contents are
      correct, carry on setting up the mapping".
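      
      In userspace, resolving such a fault looks roughly like this (a hedged
      sketch; uffd, fault_addr and page_size are assumed to come from the
      surrounding fault-handler loop):
      
        struct uffdio_continue cont = {
                .range = {
                        .start = fault_addr & ~(page_size - 1),
                        .len   = page_size,
                },
        };
      
        /* contents already verified/updated via the non-UFFD mapping */
        if (ioctl(uffd, UFFDIO_CONTINUE, &cont) == -1)
                err(1, "UFFDIO_CONTINUE");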
      
      Use Case
      ========
      
      Consider the use case of VM live migration (e.g. under QEMU/KVM):
      
      1. While a VM is still running, we copy the contents of its memory to a
         target machine. The pages are populated on the target by writing to the
         non-UFFD mapping, using the setup described above. The VM is still running
         (and therefore its memory is likely changing), so this may be repeated
         several times, until we decide the target is "up to date enough".
      
      2. We pause the VM on the source, and start executing on the target machine.
         During this gap, the VM's user(s) will *see* a pause, so it is desirable to
         minimize this window.
      
      3. Between the last time any page was copied from the source to the target, and
         when the VM was paused, the contents of that page may have changed - and
         therefore the copy we have on the target machine is out of date. Although we
         can keep track of which pages are out of date, for VMs with large amounts of
         memory, it is "slow" to transfer this information to the target machine. We
         want to resume execution before such a transfer would complete.
      
      4. So, the guest begins executing on the target machine. The first time it
         touches its memory (via the UFFD-registered mapping), userspace wants to
         intercept this fault. Userspace checks whether or not the page is up to date,
         and if not, copies the updated page from the source machine, via the non-UFFD
         mapping. Finally, whether a copy was performed or not, userspace issues a
         UFFDIO_CONTINUE ioctl to tell the kernel "I have ensured the page contents
         are correct, carry on setting up the mapping".
      
      We don't have to do all of the final updates on-demand. The userfaultfd manager
      can, in the background, also copy over updated pages once it receives the map of
      which pages are up-to-date or not.
      
      Interaction with Existing APIs
      ==============================
      
      Because this is a feature, a registered VMA could potentially receive both
      missing and minor faults.  I spent some time thinking through how the
      existing API interacts with the new feature:
      
      UFFDIO_CONTINUE cannot be used to resolve non-minor faults, as it does not
      allocate a new page.  If UFFDIO_CONTINUE is used on a non-minor fault:
      
      - For non-shared memory or shmem, -EINVAL is returned.
      - For hugetlb, -EFAULT is returned.
      
      UFFDIO_COPY and UFFDIO_ZEROPAGE cannot be used to resolve minor faults.
      Without modifications, the existing codepath assumes a new page needs to
      be allocated.  This is okay, since userspace must have a second
      non-UFFD-registered mapping anyway, thus there isn't much reason to want
      to use these in any case (just memcpy or memset or similar).
      
      - If UFFDIO_COPY is used on a minor fault, -EEXIST is returned.
      - If UFFDIO_ZEROPAGE is used on a minor fault, -EEXIST is returned (or -EINVAL
        in the case of hugetlb, as UFFDIO_ZEROPAGE is unsupported in any case).
      - UFFDIO_WRITEPROTECT simply doesn't work with shared memory, and returns
        -ENOENT in that case (regardless of the kind of fault).
      
      Future Work
      ===========
      
      This series only supports hugetlbfs.  I have a second series in flight to
      support shmem as well, extending the functionality.  This series is more
      mature than the shmem support at this point, and the functionality works
      fully on hugetlbfs, so this series can be merged first and then shmem
      support will follow.
      
      This patch (of 6):
      
      This feature allows userspace to intercept "minor" faults.  By "minor"
      faults, I mean the following situation:
      
      Let there exist two mappings (i.e., VMAs) to the same page(s).  One of the
      mappings is registered with userfaultfd (in minor mode), and the other is
      not.  Via the non-UFFD mapping, the underlying pages have already been
      allocated & filled with some contents.  The UFFD mapping has not yet been
      faulted in; when it is touched for the first time, this results in what
      I'm calling a "minor" fault.  As a concrete example, when working with
      hugetlbfs, we have huge_pte_none(), but find_lock_page() finds an existing
      page.
      
      This commit adds the new registration mode, and sets the relevant flag on
      the VMAs being registered.  In the hugetlb fault path, if we find that we
      have huge_pte_none(), but find_lock_page() does indeed find an existing
      page, then we have a "minor" fault, and if the VMA has the userfaultfd
      registration flag, we call into userfaultfd to handle it.
      
      This is implemented as a new registration mode, instead of an API feature.
      This is because the alternative implementation has significant drawbacks
      [1].
      
      However, doing it this way requires allocating a VM_* flag for the new
      registration mode.  On 32-bit systems, there are no unused bits, so this
      feature is only supported on architectures with
      CONFIG_ARCH_USES_HIGH_VMA_FLAGS.  When attempting to register a VMA in
      MINOR mode on 32-bit architectures, we return -EINVAL.
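      
      Registration itself is the usual UFFDIO_REGISTER call with the new mode
      bit set (a hedged sketch; addr and len must describe a hugetlbfs
      mapping):
      
        struct uffdio_register reg = {
                .range = { .start = (unsigned long)addr, .len = len },
                .mode  = UFFDIO_REGISTER_MODE_MISSING |
                         UFFDIO_REGISTER_MODE_MINOR,
        };
      
        if (ioctl(uffd, UFFDIO_REGISTER, &reg) == -1)
                err(1, "UFFDIO_REGISTER");  /* fails on 32-bit, see above */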
      
      [1] https://lore.kernel.org/patchwork/patch/1380226/
      
      [peterx@redhat.com: fix minor fault page leak]
        Link: https://lkml.kernel.org/r/20210322175132.36659-1-peterx@redhat.com
      
      Link: https://lkml.kernel.org/r/20210301222728.176417-1-axelrasmussen@google.com
      Link: https://lkml.kernel.org/r/20210301222728.176417-2-axelrasmussen@google.com
      Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
      Reviewed-by: Peter Xu <peterx@redhat.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Chinwen Chang <chinwen.chang@mediatek.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Lokesh Gidra <lokeshgidra@google.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: "Michal Koutn" <mkoutny@suse.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Shaohua Li <shli@fb.com>
      Cc: Shawn Anastasio <shawn@anastas.io>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Steven Price <steven.price@arm.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Adam Ruprecht <ruprecht@google.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Cannon Matthews <cannonmatthews@google.com>
      Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Oliver Upton <oupton@google.com>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  6. 01 May 2021, 9 commits
  7. 09 April 2021, 1 commit
  8. 08 April 2021, 1 commit
  9. 26 March 2021, 1 commit
    • kasan: fix per-page tags for non-page_alloc pages · cf10bd4c
      Authored by Andrey Konovalov
      To allow performing tag checks on page_alloc addresses obtained via
      page_address(), tag-based KASAN modes store tags for page_alloc
      allocations in page->flags.
      
      Currently, the default tag value stored in page->flags is 0x00.
      Therefore, page_address() returns a 0x00ffff...  address for pages that
      were not allocated via page_alloc.
      
      This might cause problems.  A particular case we encountered is a
      conflict with KFENCE.  If a KFENCE-allocated slab object is being freed
      via kfree(page_address(page) + offset), the address passed to kfree()
      will get tagged with 0x00 (as slab pages keep the default per-page
      tags).  This leads to is_kfence_address() check failing, and a KFENCE
      object ending up in normal slab freelist, which causes memory
      corruptions.
      
      This patch changes the way KASAN stores tags in page->flags: they are now
      stored XORed with 0xff.  This way, KASAN doesn't need to initialize the
      per-page tag for every created page, which would be slow.
      
      With this change, page_address() returns natively-tagged (with 0xff)
      pointers for pages that didn't have tags set explicitly.
      
      This patch fixes the encountered conflict with KFENCE and prevents more
      similar issues that can occur in the future.
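      
      The accessors then invert the stored value in both directions, roughly
      as follows (a hedged sketch of the page->flags tag helpers; config
      guards omitted):
      
        static inline u8 page_kasan_tag(const struct page *page)
        {
                /* an untouched field (0x00) reads back as the native 0xff */
                return ((page->flags >> KASAN_TAG_PGSHIFT) & KASAN_TAG_MASK) ^ 0xff;
        }
      
        static inline void page_kasan_tag_set(struct page *page, u8 tag)
        {
                tag ^= 0xff;
                page->flags &= ~(KASAN_TAG_MASK << KASAN_TAG_PGSHIFT);
                page->flags |= (tag & KASAN_TAG_MASK) << KASAN_TAG_PGSHIFT;
        }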
      
      Link: https://lkml.kernel.org/r/1a41abb11c51b264511d9e71c303bb16d5cb367b.1615475452.git.andreyknvl@google.com
      Fixes: 2813b9c0 ("kasan, mm, arm64: tag non slab memory allocated via pagealloc")
      Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
      Reviewed-by: Marco Elver <elver@google.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Peter Collingbourne <pcc@google.com>
      Cc: Evgenii Stepanov <eugenis@google.com>
      Cc: Branislav Rankov <Branislav.Rankov@arm.com>
      Cc: Kevin Brodsky <kevin.brodsky@arm.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  10. 14 March 2021, 1 commit
    • mm: introduce page_needs_cow_for_dma() for deciding whether cow · 97a7e473
      Authored by Peter Xu
      We've got quite a few places (pte, pmd, pud) that explicitly check
      whether we should break the cow right now during fork().  It's easier to
      provide a helper, especially before we do the same thing for hugetlbfs.
      
      Since we'll reference is_cow_mapping() from mm.h, move it there too.
      It actually suits mm.h better, since internal.h is private to mm/ while
      mm.h is exported to the whole kernel.  A follow-up patch can then convert
      the rest of the kernel to is_cow_mapping(): the check is used in quite a
      few places, but always as open-coded tests against the VM_* flags.
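      
      A hedged sketch of the moved predicate and the new helper:
      
        static inline bool is_cow_mapping(vm_flags_t flags)
        {
                return (flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;
        }
      
        /* break COW at fork() only when the page may be DMA-pinned */
        static inline bool page_needs_cow_for_dma(struct vm_area_struct *vma,
                                                  struct page *page)
        {
                if (!is_cow_mapping(vma->vm_flags))
                        return false;
                if (!atomic_read(&vma->vm_mm->has_pinned))
                        return false;
                return page_maybe_dma_pinned(page);
        }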
      
      Link: https://lkml.kernel.org/r/20210217233547.93892-4-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Reviewed-by: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Daniel Vetter <daniel@ffwll.ch>
      Cc: David Airlie <airlied@linux.ie>
      Cc: David Gibson <david@gibson.dropbear.id.au>
      Cc: Gal Pressman <galpress@amazon.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jann Horn <jannh@google.com>
      Cc: Kirill Shutemov <kirill@shutemov.name>
      Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Roland Scheidegger <sroland@vmware.com>
      Cc: VMware Graphics <linux-graphics-maintainer@vmware.com>
      Cc: Wei Zhang <wzam@amazon.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  11. 09 March 2021, 1 commit
    • mm: Don't build mm_dump_obj() on CONFIG_PRINTK=n kernels · 5bb1bb35
      Authored by Paul E. McKenney
      The mem_dump_obj() functionality adds a few hundred bytes, which is a
      small price to pay, except on kernels built with CONFIG_PRINTK=n, where
      mem_dump_obj()'s messages would be suppressed anyway.  This commit
      therefore makes mem_dump_obj() a static inline empty function on kernels
      built with CONFIG_PRINTK=n and excludes all of its support functions as
      well.  This avoids kernel bloat on systems that cannot use mem_dump_obj().
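      
      The shape of the change is simply (a hedged sketch):
      
        #ifdef CONFIG_PRINTK
        void mem_dump_obj(void *object);
        #else
        static inline void mem_dump_obj(void *object) {}
        #endif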
      
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: <linux-mm@kvack.org>
      Suggested-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
  12. 25 February 2021, 5 commits
  13. 09 February 2021, 1 commit
    • mm: provide a saner PTE walking API for modules · 9fd6dad1
      Authored by Paolo Bonzini
      Currently, the follow_pfn function is exported for modules but
      follow_pte is not.  However, follow_pfn is very easy to misuse,
      because it does not provide protections (so most of its callers
      assume the page is writable!) and because it returns after having
      already unlocked the page table lock.
      
      Provide instead a simplified version of follow_pte that does
      not have the pmdpp and range arguments.  The older version
      survives as follow_invalidate_pte() for use by fs/dax.c.
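      
      Callers use it like this (a hedged sketch; the pte is only valid while
      the returned lock is held):
      
        pte_t *ptep;
        spinlock_t *ptl;
      
        if (follow_pte(vma->vm_mm, addr, &ptep, &ptl))
                return -EINVAL;
        pfn = pte_pfn(*ptep);           /* read what we need under the lock */
        pte_unmap_unlock(ptep, ptl);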
      
      Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  14. 05 February 2021, 1 commit
  15. 23 January 2021, 1 commit
    • mm: Add mem_dump_obj() to print source of memory block · 8e7f37f2
      Authored by Paul E. McKenney
      There are kernel facilities such as per-CPU reference counts that give
      error messages in generic handlers or callbacks, whose messages are
      unenlightening.  In the case of per-CPU reference-count underflow, this
      is not a problem when creating a new use of this facility because in that
      case the bug is almost certainly in the code implementing that new use.
      However, trouble arises when deploying across many systems, which might
      exercise corner cases that were not seen during development and testing.
      Here, it would be really nice to get some kind of hint as to which of
      several uses the underflow was caused by.
      
      This commit therefore exposes a mem_dump_obj() function that takes
      a pointer to memory (which must still be allocated if it has been
      dynamically allocated) and prints available information on where that
      memory came from.  This pointer can reference the middle of the block as
      well as the beginning of the block, as needed by things like RCU callback
      functions and timer handlers that might not know where the beginning of
      the memory block is.  These functions and handlers can use mem_dump_obj()
      to print out better hints as to where the problem might lie.
      
      The information printed can depend on kernel configuration.  For example,
      the allocation return address can be printed only for slab and slub,
      and even then only when the necessary debug has been enabled.  For slab,
      build with CONFIG_DEBUG_SLAB=y, and either use sizes with ample space
      to the next power of two or use the SLAB_STORE_USER when creating the
      kmem_cache structure.  For slub, build with CONFIG_SLUB_DEBUG=y and
      boot with slub_debug=U, or pass SLAB_STORE_USER to kmem_cache_create()
      if more focused use is desired.  Also for slub, use CONFIG_STACKTRACE
      to enable printing of the allocation-time stack trace.
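      
      A hedged usage sketch, from a handler that only has an opaque pointer;
      valid_callback() is a hypothetical sanity check, and the printed
      provenance depends on the debug options above:
      
        static void sanity_check_cb(struct rcu_head *rhp)
        {
                if (WARN_ON_ONCE(!valid_callback(rhp)))
                        mem_dump_obj(rhp);  /* slab/vmalloc origin, if known */
        }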
      
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: <linux-mm@kvack.org>
      Reported-by: Andrii Nakryiko <andrii@kernel.org>
      [ paulmck: Convert to printing and change names per Joonsoo Kim. ]
      [ paulmck: Move slab definition per Stephen Rothwell and kbuild test robot. ]
      [ paulmck: Handle CONFIG_MMU=n case where vmalloc() is kmalloc(). ]
      [ paulmck: Apply Vlastimil Babka feedback on slab.c kmem_provenance(). ]
      [ paulmck: Extract more info from !SLUB_DEBUG per Joonsoo Kim. ]
      [ paulmck: Explicitly check for small pointers per Naresh Kamboju. ]
      Acked-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Tested-by: Naresh Kamboju <naresh.kamboju@linaro.org>
      Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
  16. 21 January 2021, 3 commits
  17. 20 January 2021, 2 commits
    • mm: Allow architectures to request 'old' entries when prefaulting · 46bdb427
      Authored by Will Deacon
      Commit 5c0a85fa ("mm: make faultaround produce old ptes") changed
      the "faultaround" behaviour to initialise prefaulted PTEs as 'old',
      since this avoids vmscan wrongly assuming that they are hot, despite
      having never been explicitly accessed by userspace. The change has been
      shown to benefit numerous arm64 micro-architectures (with hardware
      access flag) running Android, where both application launch latency and
      direct reclaim time are significantly reduced (by 10%+ and ~80%
      respectively).
      
      Unfortunately, commit 315d09bf ("Revert "mm: make faultaround
      produce old ptes"") reverted the change due to it being identified as
      the cause of a ~6% regression in unixbench on x86. Experiments on a
      variety of recent arm64 micro-architectures indicate that unixbench is
      not affected by the original commit, which appears to yield a 0-1%
      performance improvement.
      
      Since one size does not fit all for the initial state of prefaulted
      PTEs, introduce arch_wants_old_prefaulted_pte(), which allows an
      architecture to opt-in to 'old' prefaulted PTEs at runtime based on
      whatever criteria it may have.
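      
      The default keeps today's behaviour; an architecture overrides it to opt
      in (a hedged sketch of the generic fallback):
      
        #ifndef arch_wants_old_prefaulted_pte
        static inline bool arch_wants_old_prefaulted_pte(void)
        {
                /* default: prefaulted PTEs stay young */
                return false;
        }
        #endif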
      
      Cc: Jan Kara <jack@suse.cz>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Reported-by: Vinayak Menon <vinmenon@codeaurora.org>
      Signed-off-by: Will Deacon <will@kernel.org>
    • mm: Cleanup faultaround and finish_fault() codepaths · f9ce0be7
      Authored by Kirill A. Shutemov
      alloc_set_pte() has two users with different requirements: the
      faultaround code calls it from atomic context and requires the PTE page
      table to be preallocated, while finish_fault() can sleep and allocate the
      page table as needed.
      
      PTL locking rules are also strange, hard to follow and overkill for
      finish_fault().
      
      Let's untangle the mess. alloc_set_pte() has gone now. All locking is
      explicit.
      
      The price is some code duplication to handle huge pages in the
      faultaround path, but that should be fine given the overall improvement
      in readability.
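      
      The replacement interface is a plain helper called with the PTL already
      held (a hedged sketch of its prototype):
      
        /* set up a PTE for @page at @addr; caller holds the pte lock */
        void do_set_pte(struct vm_fault *vmf, struct page *page,
                        unsigned long addr);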
      
      Link: https://lore.kernel.org/r/20201229132819.najtavneutnf7ajp@box
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      [will: s/from from/from/ in comment; spotted by willy]
      Signed-off-by: Will Deacon <will@kernel.org>
  18. 12 January 2021, 3 commits
    • mm: Close race in generic_access_phys · 96667f8a
      Authored by Daniel Vetter
      Way back it was a reasonable assumption that iomem mappings never
      change the pfn range they point at. But this has changed:
      
      - gpu drivers dynamically manage their memory nowadays, invalidating
        ptes with unmap_mapping_range when buffers get moved
      
      - contiguous dma allocations have moved from dedicated carveouts to
        cma regions. This means if we miss the unmap the pfn might contain
        pagecache or anon memory (well, anything allocated with __GFP_MOVABLE)
      
      - even /dev/mem now invalidates mappings when the kernel requests that
        iomem region when CONFIG_IO_STRICT_DEVMEM is set, see 3234ac66
        ("/dev/mem: Revoke mappings when a driver claims the region")
      
      Accessing pfns obtained from ptes without holding all the locks is
      therefore no longer a good idea. Fix this.
      
      Since ioremap might need to manipulate pagetables too we need to drop
      the pt lock and have a retry loop if we raced.
      
      While at it, also add kerneldoc and improve the comment for the
      vma_ops->access function. It's for accessing, not for moving the
      memory from iomem to system memory, as the old comment seemed to
      suggest.
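      
      The fixed access path re-checks the pte after mapping the memory, and
      retries if it raced (a hedged, condensed sketch; declarations and error
      details omitted):
      
        retry:
                if (follow_pte(vma->vm_mm, addr, &ptep, &ptl))
                        return -EINVAL;
                pte = *ptep;
                pte_unmap_unlock(ptep, ptl);
      
                maddr = ioremap_prot(PFN_PHYS(pte_pfn(pte)), PAGE_SIZE,
                                     pgprot_val(pte_pgprot(pte)));
                if (!maddr)
                        return -ENOMEM;
      
                /* did the pfn change while the pt lock was dropped? */
                if (follow_pte(vma->vm_mm, addr, &ptep, &ptl)) {
                        iounmap(maddr);
                        return -EINVAL;
                }
                if (!pte_same(pte, *ptep)) {
                        pte_unmap_unlock(ptep, ptl);
                        iounmap(maddr);
                        goto retry;
                }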
      
      References: 28b2ee20 ("access_process_vm device memory infrastructure")
      Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Dave Airlie <airlied@linux.ie>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Jérôme Glisse <jglisse@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: linux-mm@kvack.org
      Cc: linux-arm-kernel@lists.infradead.org
      Cc: linux-samsung-soc@vger.kernel.org
      Cc: linux-media@vger.kernel.org
      Cc: Chris Wilson <chris@chris-wilson.co.uk>
      Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
      Link: https://patchwork.freedesktop.org/patch/msgid/20201127164131.2244124-8-daniel.vetter@ffwll.ch
    • media: videobuf2: Move frame_vector into media subsystem · eb83b8e3
      Authored by Daniel Vetter
      It's the only user. This also garbage collects the CONFIG_FRAME_VECTOR
      symbol from all over the tree (well, just one place: somehow the omap
      media driver still had it in its Kconfig, despite not using it).
      
      Reviewed-by: John Hubbard <jhubbard@nvidia.com>
      Acked-by: Hans Verkuil <hverkuil-cisco@xs4all.nl>
      Acked-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
      Acked-by: Tomasz Figa <tfiga@chromium.org>
      Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Pawel Osciak <pawel@osciak.com>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Cc: Kyungmin Park <kyungmin.park@samsung.com>
      Cc: Tomasz Figa <tfiga@chromium.org>
      Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Jérôme Glisse <jglisse@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: linux-mm@kvack.org
      Cc: linux-arm-kernel@lists.infradead.org
      Cc: linux-samsung-soc@vger.kernel.org
      Cc: linux-media@vger.kernel.org
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
      Link: https://patchwork.freedesktop.org/patch/msgid/20201127164131.2244124-7-daniel.vetter@ffwll.ch
    • mm/frame-vector: Use FOLL_LONGTERM · 04769cb1
      Authored by Daniel Vetter
      This is used by media/videobuf2 for persistent dma mappings, not just
      for a single dma operation and then freed again, so it needs
      FOLL_LONGTERM.
      
      Unfortunately current pup_locked doesn't support FOLL_LONGTERM due to
      locking issues. Rework the code to pull the pup path out from the
      mmap_sem critical section as suggested by Jason.
      
      By relying entirely on the vma checks in pin_user_pages and follow_pfn
      (for vm_flags and vma_is_fsdax) we can also streamline the code a lot.
      
      Note that pin_user_pages_fast is a safe replacement despite the
      seeming lack of checking for vma->vm_flags & (VM_IO | VM_PFNMAP). Such
      ptes are marked with pte_mkspecial (which pup_fast rejects in the
      fastpath), and only architectures supporting that support the
      pin_user_pages_fast fastpath.
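      
      The pinning call itself reduces to (a hedged sketch; start, nr_pages and
      pages are assumed to come from the frame vector):
      
        /* pin for the lifetime of the buffer, outside mmap_sem */
        ret = pin_user_pages_fast(start, nr_pages,
                                  FOLL_WRITE | FOLL_LONGTERM, pages);
      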
      Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Pawel Osciak <pawel@osciak.com>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Cc: Kyungmin Park <kyungmin.park@samsung.com>
      Cc: Tomasz Figa <tfiga@chromium.org>
      Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Jérôme Glisse <jglisse@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: linux-mm@kvack.org
      Cc: linux-arm-kernel@lists.infradead.org
      Cc: linux-samsung-soc@vger.kernel.org
      Cc: linux-media@vger.kernel.org
      Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
      Link: https://patchwork.freedesktop.org/patch/msgid/20201127164131.2244124-6-daniel.vetter@ffwll.ch
  19. 30 December 2020, 2 commits
    • mm: memmap defer init doesn't work as expected · dc2da7b4
      Authored by Baoquan He
      VMware observed a performance regression during memmap init on their
      platform, and bisected to commit 73a6e474 ("mm: memmap_init:
      iterate over memblock regions rather that check each PFN") causing it.
      
      Before the commit:
      
        [0.033176] Normal zone: 1445888 pages used for memmap
        [0.033176] Normal zone: 89391104 pages, LIFO batch:63
        [0.035851] ACPI: PM-Timer IO Port: 0x448
      
      With the commit:
      
        [0.026874] Normal zone: 1445888 pages used for memmap
        [0.026875] Normal zone: 89391104 pages, LIFO batch:63
        [2.028450] ACPI: PM-Timer IO Port: 0x448
      
      The root cause is that the current memmap defer init doesn't work as
      expected.
      
      Before, memmap_init_zone() was used to do the memmap init of one whole
      zone: all low zones of a NUMA node were initialized fully, while memmap
      init of the last zone in that node was deferred.  However, since commit
      73a6e474, memmap_init() is adapted to iterate over the memblock regions
      inside one zone, then call memmap_init_zone() to do the memmap init of
      each region.
      
      E.g., on VMware's system the memory layout is as below; there are two
      memory regions in node 2.  The current code mistakenly initializes the
      whole 1st region [mem 0xab00000000-0xfcffffffff], and only then defers,
      initializing just one memory section of the 2nd region [mem
      0x10000000000-0x1033fffffff].  In fact, we expect only one memory
      section's memmap to be initialized up front.  That's why the extra time
      is spent here.
      
      [    0.008842] ACPI: SRAT: Node 0 PXM 0 [mem 0x00000000-0x0009ffff]
      [    0.008842] ACPI: SRAT: Node 0 PXM 0 [mem 0x00100000-0xbfffffff]
      [    0.008843] ACPI: SRAT: Node 0 PXM 0 [mem 0x100000000-0x55ffffffff]
      [    0.008844] ACPI: SRAT: Node 1 PXM 1 [mem 0x5600000000-0xaaffffffff]
      [    0.008844] ACPI: SRAT: Node 2 PXM 2 [mem 0xab00000000-0xfcffffffff]
      [    0.008845] ACPI: SRAT: Node 2 PXM 2 [mem 0x10000000000-0x1033fffffff]
      
      Now, let's add a parameter 'zone_end_pfn' to memmap_init_zone() to pass
      down the real zone end pfn, so that defer_init() can use it to judge
      zone-wide whether deferring should kick in.
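      
      With the extra parameter, defer_init() can reset its per-zone counter
      whenever the zone end changes (a hedged sketch; the enable checks of the
      real function are omitted):
      
        static inline bool defer_init(int nid, unsigned long pfn,
                                      unsigned long zone_end_pfn)
        {
                static unsigned long prev_end_pfn, nr_initialised;
      
                /* new zone: restart counting initialised pages */
                if (prev_end_pfn != zone_end_pfn) {
                        prev_end_pfn = zone_end_pfn;
                        nr_initialised = 0;
                }
      
                /* always fully populate low zones */
                if (zone_end_pfn < pgdat_end_pfn(NODE_DATA(nid)))
                        return false;
      
                nr_initialised++;
                if (nr_initialised > PAGES_PER_SECTION &&
                    (pfn & (PAGES_PER_SECTION - 1)) == 0) {
                        NODE_DATA(nid)->first_deferred_pfn = pfn;
                        return true;    /* rest is done by deferred init */
                }
                return false;
        }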
      
      Link: https://lkml.kernel.org/r/20201223080811.16211-1-bhe@redhat.com
      Link: https://lkml.kernel.org/r/20201223080811.16211-2-bhe@redhat.com
      Fixes: commit 73a6e474 ("mm: memmap_init: iterate over memblock regions rather that check each PFN")
      Signed-off-by: Baoquan He <bhe@redhat.com>
      Reported-by: Rahul Gopakumar <gopakumarr@vmware.com>
      Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
    • mm: add prototype for __add_to_page_cache_locked() · 6d87d0ec
      Authored by Souptick Joarder
      Otherwise it causes a gcc warning:
      
        mm/filemap.c:830:14: warning: no previous prototype for `__add_to_page_cache_locked' [-Wmissing-prototypes]
      
      A previous attempt to make this function static led to compilation
      errors when CONFIG_DEBUG_INFO_BTF is enabled because
      __add_to_page_cache_locked() is referred to by BPF code.
      
      Adding a prototype will silence the warning.
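      
      The added declaration amounts to (a hedged sketch of the prototype as it
      appears in mm.h):
      
        int __add_to_page_cache_locked(struct page *page,
                                       struct address_space *mapping,
                                       pgoff_t index, gfp_t gfp,
                                       void **shadowp);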
      
      Link: https://lkml.kernel.org/r/1608693702-4665-1-git-send-email-jrdr.linux@gmail.com
      Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com>
      Cc: Alex Shi <alex.shi@linux.alibaba.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  20. 23 December 2020, 1 commit