1. 21 Jan 2021, 2 commits
  2. 20 Jan 2021, 2 commits
    • mm: Allow architectures to request 'old' entries when prefaulting · 46bdb427
      Committed by Will Deacon
      Commit 5c0a85fa ("mm: make faultaround produce old ptes") changed
      the "faultaround" behaviour to initialise prefaulted PTEs as 'old',
      since this avoids vmscan wrongly assuming that they are hot, despite
      having never been explicitly accessed by userspace. The change has been
      shown to benefit numerous arm64 micro-architectures (with hardware
      access flag) running Android, where both application launch latency and
      direct reclaim time are significantly reduced (by 10%+ and ~80%
      respectively).
      
      Unfortunately, commit 315d09bf ("Revert "mm: make faultaround
      produce old ptes"") reverted the change due to it being identified as
      the cause of a ~6% regression in unixbench on x86. Experiments on a
      variety of recent arm64 micro-architectures indicate that unixbench is
      not affected by the original commit, which appears to yield a 0-1%
      performance improvement.
      
      Since one size does not fit all for the initial state of prefaulted
      PTEs, introduce arch_wants_old_prefaulted_pte(), which allows an
      architecture to opt-in to 'old' prefaulted PTEs at runtime based on
      whatever criteria it may have.
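
      As a rough sketch, the generic default mirrors the opt-in described above;
      the arm64-style override and the cpu_has_hw_af() helper are shown only for
      illustration and are assumptions, not part of this changelog:

        /* In a generic header: fallback, prefaulted PTEs stay 'young'. */
        #ifndef arch_wants_old_prefaulted_pte
        static inline bool arch_wants_old_prefaulted_pte(void)
        {
            return false;
        }
        #endif

        /*
         * In an arch header (illustrative, arm64-style): only ask for 'old'
         * prefaulted PTEs when the CPU can set the access flag in hardware.
         */
        #define arch_wants_old_prefaulted_pte   cpu_has_hw_af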
      
      Cc: Jan Kara <jack@suse.cz>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Reported-by: Vinayak Menon <vinmenon@codeaurora.org>
      Signed-off-by: Will Deacon <will@kernel.org>
    • mm: Cleanup faultaround and finish_fault() codepaths · f9ce0be7
      Committed by Kirill A. Shutemov
      alloc_set_pte() has two users with different requirements: in the
      faultaround code, it is called from an atomic context and the PTE page
      table has to be preallocated. finish_fault() can sleep and allocate the
      page table as needed.
      
      PTL locking rules are also strange, hard to follow and overkill for
      finish_fault().
      
      Let's untangle the mess. alloc_set_pte() has gone now. All locking is
      explicit.
      
      The price is some code duplication to handle huge pages in the faultaround
      path, but it should be fine given the overall improvement in readability.
      
      Link: https://lore.kernel.org/r/20201229132819.najtavneutnf7ajp@box
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      [will: s/from from/from/ in comment; spotted by willy]
      Signed-off-by: Will Deacon <will@kernel.org>
  3. 30 Dec 2020, 2 commits
    • mm: memmap defer init doesn't work as expected · dc2da7b4
      Committed by Baoquan He
      VMware observed a performance regression during memmap init on their
      platform, and bisected to commit 73a6e474 ("mm: memmap_init:
      iterate over memblock regions rather that check each PFN") causing it.
      
      Before the commit:
      
        [0.033176] Normal zone: 1445888 pages used for memmap
        [0.033176] Normal zone: 89391104 pages, LIFO batch:63
        [0.035851] ACPI: PM-Timer IO Port: 0x448
      
      With the commit:
      
        [0.026874] Normal zone: 1445888 pages used for memmap
        [0.026875] Normal zone: 89391104 pages, LIFO batch:63
        [2.028450] ACPI: PM-Timer IO Port: 0x448
      
      The root cause is the current memmap defer init doesn't work as expected.
      
      Before, memmap_init_zone() was used to do memmap init of one whole zone:
      all low zones of a NUMA node were initialized, while memmap init of the
      last zone in that node was deferred.  However, since commit 73a6e474,
      memmap_init() iterates over the memblock regions inside one zone and calls
      memmap_init_zone() to do memmap init for each region.
      
      E.g., on VMware's system, the memory layout is as below; there are two
      memory regions in node 2.  The current code mistakenly initializes the
      whole 1st region [mem 0xab00000000-0xfcffffffff], and only applies the
      memmap deferral to the 2nd region [mem 0x10000000000-0x1033fffffff], where
      just one memory section is initialized.  In fact, we expect only one
      memory section's memmap to be initialized eagerly for the whole zone.
      That's why so much more time is spent here.
      
      [    0.008842] ACPI: SRAT: Node 0 PXM 0 [mem 0x00000000-0x0009ffff]
      [    0.008842] ACPI: SRAT: Node 0 PXM 0 [mem 0x00100000-0xbfffffff]
      [    0.008843] ACPI: SRAT: Node 0 PXM 0 [mem 0x100000000-0x55ffffffff]
      [    0.008844] ACPI: SRAT: Node 1 PXM 1 [mem 0x5600000000-0xaaffffffff]
      [    0.008844] ACPI: SRAT: Node 2 PXM 2 [mem 0xab00000000-0xfcffffffff]
      [    0.008845] ACPI: SRAT: Node 2 PXM 2 [mem 0x10000000000-0x1033fffffff]
      
      Now, let's add a parameter 'zone_end_pfn' to memmap_init_zone() to pass
      down the real zone end pfn so that defer_init() can use it to judge
      whether deferred init should be applied zone-wide.
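
      A simplified sketch of the resulting flow (the signature and the
      defer_init() helper follow the description above; the real loop has more
      detail than shown here):

        /* Simplified sketch, not the verbatim kernel code. */
        void __meminit memmap_init_zone(unsigned long size, int nid,
                                        unsigned long zone,
                                        unsigned long start_pfn,
                                        unsigned long zone_end_pfn,
                                        enum meminit_context context,
                                        struct vmem_altmap *altmap,
                                        int migratetype)
        {
            unsigned long pfn, end_pfn = start_pfn + size;

            for (pfn = start_pfn; pfn < end_pfn; pfn++) {
                /* Judge deferral against the whole zone, not this region. */
                if (context == MEMINIT_EARLY &&
                    defer_init(nid, pfn, zone_end_pfn))
                    break;
                __init_single_page(pfn_to_page(pfn), pfn, zone, nid);
            }
        }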
      
      Link: https://lkml.kernel.org/r/20201223080811.16211-1-bhe@redhat.com
      Link: https://lkml.kernel.org/r/20201223080811.16211-2-bhe@redhat.com
      Fixes: 73a6e474 ("mm: memmap_init: iterate over memblock regions rather that check each PFN")
      Signed-off-by: Baoquan He <bhe@redhat.com>
      Reported-by: Rahul Gopakumar <gopakumarr@vmware.com>
      Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: add prototype for __add_to_page_cache_locked() · 6d87d0ec
      Committed by Souptick Joarder
      Otherwise it causes a gcc warning:
      
        mm/filemap.c:830:14: warning: no previous prototype for `__add_to_page_cache_locked' [-Wmissing-prototypes]
      
      A previous attempt to make this function static led to compilation
      errors when CONFIG_DEBUG_INFO_BTF is enabled because
      __add_to_page_cache_locked() is referred to by BPF code.
      
      Adding a prototype will silence the warning.
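
      For reference, a prototype along these lines is enough to silence the
      warning; the parameter list shown here is assumed from the definition:

        /* Declared in a header seen by mm/filemap.c; parameters assumed. */
        int __add_to_page_cache_locked(struct page *page,
                                       struct address_space *mapping,
                                       pgoff_t index, gfp_t gfp,
                                       void **shadowp);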
      
      Link: https://lkml.kernel.org/r/1608693702-4665-1-git-send-email-jrdr.linux@gmail.com
      Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com>
      Cc: Alex Shi <alex.shi@linux.alibaba.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  4. 23 Dec 2020, 2 commits
  5. 16 Dec 2020, 13 commits
    • mm: simplify follow_pte{,pmd} · ff5c19ed
      Committed by Christoph Hellwig
      Merge __follow_pte_pmd, follow_pte_pmd and follow_pte into a single
      follow_pte function and just pass two additional NULL arguments for the
      two previous follow_pte callers.
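
      A sketch of what a former follow_pte() call site might look like
      afterwards; the example_lookup() wrapper is purely illustrative, and the
      merged function's extra range/pmd arguments are simply passed as NULL:

        /* Illustrative only; not a call site from this commit. */
        static int example_lookup(struct vm_area_struct *vma, unsigned long addr)
        {
            pte_t *ptep;
            spinlock_t *ptl;
            int ret;

            /* Former follow_pte() users pass NULL for the range and pmd. */
            ret = follow_pte(vma->vm_mm, addr, NULL, &ptep, NULL, &ptl);
            if (!ret)
                pte_unmap_unlock(ptep, ptl);
            return ret;
        }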
      
      [sfr@canb.auug.org.au: merge fix for "s390/pci: remove races against pte updates"]
        Link: https://lkml.kernel.org/r/20201111221254.7f6a3658@canb.auug.org.au
      
      Link: https://lkml.kernel.org/r/20201029101432.47011-3-hch@lst.de
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Daniel Vetter <daniel@ffwll.ch>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Nick Desaulniers <ndesaulniers@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • kernel/power: allow hibernation with page_poison sanity checking · 03b6c9a3
      Committed by Vlastimil Babka
      Page poisoning used to be incompatible with hibernation, as the state of
      poisoned pages was lost after resume; thus enabling CONFIG_HIBERNATION
      forces CONFIG_PAGE_POISONING_NO_SANITY.  For the same reason, the
      poisoning-with-zeroes variant CONFIG_PAGE_POISONING_ZERO used to disable
      hibernation.  The latter restriction was removed by commit 1ad1410f
      ("PM / Hibernate: allow hibernation with PAGE_POISONING_ZERO"), and
      similarly for init_on_free by commit 18451f9f ("PM: hibernate: fix
      crashes with init_on_free=1"), by making sure free pages are cleared after
      resume.
      
      We can use the same mechanism to instead poison free pages with
      PAGE_POISON after resume.  This covers both zero and 0xAA patterns.  Thus
      we can remove the Kconfig restriction that disables page poison sanity
      checking when hibernation is enabled.
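
      A sketch of the resume-side handling this implies; the helper name and
      exact calls are assumptions based on the description:

        /* Sketch: called for each free page after the image is restored. */
        static inline void clear_or_poison_free_page(struct page *page)
        {
            if (page_poisoning_enabled_static())
                __kernel_poison_pages(page, 1);   /* re-apply PAGE_POISON */
            else if (want_init_on_free())
                clear_highpage(page);             /* keep init_on_free semantics */
        }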
      
      Link: https://lkml.kernel.org/r/20201113104033.22907-4-vbabka@suse.cz
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> [hibernation]
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Laura Abbott <labbott@kernel.org>
      Cc: Mateusz Nosek <mateusznosek0@gmail.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, page_poison: use static key more efficiently · 8db26a3d
      Committed by Vlastimil Babka
      Commit 11c9c7ed ("mm/page_poison.c: replace bool variable with static
      key") changed page_poisoning_enabled() to a static key check.  However,
      the function is not inlined, so each check still involves a function call
      with overhead not eliminated when page poisoning is disabled.
      
      Analogously to how debug_pagealloc is handled, this patch converts
      page_poisoning_enabled() back to a boolean check, and introduces
      page_poisoning_enabled_static() for fast paths.  Both functions are
      inlined.
      
      The function kernel_poison_pages() is also called unconditionally and does
      the static key check inside.  Remove the check from there and put it in the
      callers.  Also split it into two functions, kernel_poison_pages() and
      kernel_unpoison_pages(), instead of the confusing bool parameter.
      
      Also optimize the check that enables page poisoning instead of
      debug_pagealloc for architectures without proper debug_pagealloc support.
      Move the check to init_mem_debugging_and_hardening() to enable a single
      static key instead of having two static branches in
      page_poisoning_enabled_static().
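
      A sketch of the resulting pair of helpers (names follow the description;
      exact details may differ):

        DECLARE_STATIC_KEY_FALSE(_page_poisoning_enabled);
        extern bool _page_poisoning_enabled_early;

        /* Slow-path / setup-time check: a plain boolean read. */
        static inline bool page_poisoning_enabled(void)
        {
            return _page_poisoning_enabled_early;
        }

        /* Fast-path check: compiles down to a patched branch. */
        static inline bool page_poisoning_enabled_static(void)
        {
            return static_branch_unlikely(&_page_poisoning_enabled);
        }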
      
      Link: https://lkml.kernel.org/r/20201113104033.22907-3-vbabka@suse.cz
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Laura Abbott <labbott@kernel.org>
      Cc: Mateusz Nosek <mateusznosek0@gmail.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, page_alloc: do not rely on the order of page_poison and init_on_alloc/free parameters · 04013513
      Committed by Vlastimil Babka
      Patch series "cleanup page poisoning", v3.
      
      I have identified a number of issues and opportunities for cleanup with
      CONFIG_PAGE_POISONING and friends:
      
       - interaction with init_on_alloc and init_on_free parameters depends on
         the order of parameters (Patch 1)
      
       - the boot-time enabling uses a static key, but inefficiently (Patch 2)
      
       - sanity checking is incompatible with hibernation (Patch 3)
      
       - CONFIG_PAGE_POISONING_NO_SANITY can be removed now that we have
         init_on_free (Patch 4)
      
       - CONFIG_PAGE_POISONING_ZERO can be most likely removed now that we
         have init_on_free (Patch 5)
      
      This patch (of 5):
      
      Enabling page_poison=1 together with init_on_alloc=1 or init_on_free=1
      produces a warning in dmesg that page_poison takes precedence.  However,
      as these warnings are printed in early_param handlers for
      init_on_alloc/free, they are not printed if page_poison is enabled later
      on the command line (handlers are called in the order of their
      parameters), or when init_on_alloc/free is always enabled by the
      respective config option - before the page_poison early param handler is
      called, it is not considered to be enabled.  This is inconsistent.
      
      We can remove the dependency on order by making the init_on_* parameters
      only set a boolean variable, and postponing the evaluation after all early
      params have been processed.  Introduce a new
      init_mem_debugging_and_hardening() function for that, and move the related
      debug_pagealloc processing there as well.
      
      As a result init_mem_debugging_and_hardening() knows always accurately if
      init_on_* and/or page_poison options were enabled.  Thus we can also
      optimize want_init_on_alloc() and want_init_on_free().  We don't need to
      check page_poisoning_enabled() there, we can instead not enable the
      init_on_* static keys at all if page poisoning is enabled.  This results
      in simpler and more efficient code.
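
      Sketch of the idea (variable and message details are illustrative): the
      early_param handlers only record booleans, and the single decision point
      runs after all of them have been processed.

        /* Set by the init_on_alloc=/init_on_free= early_param handlers. */
        static bool _init_on_alloc_enabled_early __initdata;
        static bool _init_on_free_enabled_early __initdata;

        void __init init_mem_debugging_and_hardening(void)
        {
            if (_init_on_alloc_enabled_early) {
                if (page_poisoning_enabled())
                    pr_info("mem auto-init: CONFIG_PAGE_POISONING is on, will take precedence over init_on_alloc\n");
                else
                    static_branch_enable(&init_on_alloc);
            }
            if (_init_on_free_enabled_early) {
                if (page_poisoning_enabled())
                    pr_info("mem auto-init: CONFIG_PAGE_POISONING is on, will take precedence over init_on_free\n");
                else
                    static_branch_enable(&init_on_free);
            }
            /* debug_pagealloc and page poisoning keys are resolved here too. */
        }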
      
      Link: https://lkml.kernel.org/r/20201113104033.22907-1-vbabka@suse.cz
      Link: https://lkml.kernel.org/r/20201113104033.22907-2-vbabka@suse.cz
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mateusz Nosek <mateusznosek0@gmail.com>
      Cc: Laura Abbott <labbott@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • arch, mm: make kernel_page_present() always available · 32a0de88
      Committed by Mike Rapoport
      For architectures that enable ARCH_HAS_SET_MEMORY having the ability to
      verify that a page is mapped in the kernel direct map can be useful
      regardless of hibernation.
      
      Add a RISC-V implementation of kernel_page_present(), update its forward
      declarations and stubs to be part of the set_memory API and remove the ugly
      ifdefery in include/linux/mm.h around the current declarations of
      kernel_page_present().
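
      Conceptually, the stub side ends up along these lines (a sketch; the exact
      placement and guards are assumptions):

        /* In <linux/set_memory.h>, roughly: */
        #ifdef CONFIG_ARCH_HAS_SET_MEMORY
        #include <asm/set_memory.h>   /* arch provides kernel_page_present() */
        #else
        static inline bool kernel_page_present(struct page *page)
        {
            /* Without set_memory support the direct map is never punched. */
            return true;
        }
        #endif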
      
      Link: https://lkml.kernel.org/r/20201109192128.960-5-rppt@kernel.org
      Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Albert Ou <aou@eecs.berkeley.edu>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: "Edgecombe, Rick P" <rick.p.edgecombe@intel.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Len Brown <len.brown@intel.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Palmer Dabbelt <palmer@dabbelt.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Pavel Machek <pavel@ucw.cz>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • arch, mm: restore dependency of __kernel_map_pages() on DEBUG_PAGEALLOC · 5d6ad668
      Committed by Mike Rapoport
      The design of DEBUG_PAGEALLOC presumes that __kernel_map_pages() must
      never fail.  With this assumption it wouldn't be safe to allow general
      usage of this function.
      
      Moreover, some architectures that implement __kernel_map_pages() have this
      function guarded by #ifdef DEBUG_PAGEALLOC and some refuse to map/unmap
      pages when page allocation debugging is disabled at runtime.
      
      As all the users of __kernel_map_pages() were converted to use
      debug_pagealloc_map_pages() it is safe to make it available only when
      DEBUG_PAGEALLOC is set.
      
      Link: https://lkml.kernel.org/r/20201109192128.960-4-rppt@kernel.org
      Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Albert Ou <aou@eecs.berkeley.edu>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: "Edgecombe, Rick P" <rick.p.edgecombe@intel.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Len Brown <len.brown@intel.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Palmer Dabbelt <palmer@dabbelt.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Pavel Machek <pavel@ucw.cz>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • PM: hibernate: make direct map manipulations more explicit · 2abf962a
      Committed by Mike Rapoport
      When DEBUG_PAGEALLOC or ARCH_HAS_SET_DIRECT_MAP is enabled, a page may
      not be present in the direct map and has to be explicitly mapped before it
      can be copied.
      
      Introduce hibernate_map_page() and hibernate_unmap_page() that will
      explicitly use set_direct_map_{default,invalid}_noflush() for the
      ARCH_HAS_SET_DIRECT_MAP case and debug_pagealloc_{map,unmap}_pages() for
      the DEBUG_PAGEALLOC case.
      
      The remapping of the pages in safe_copy_page() presumes that it only
      changes protection bits in an existing PTE and so it is safe to ignore
      return value of set_direct_map_{default,invalid}_noflush().
      
      Still, add a pr_warn() so that future changes in set_memory APIs will not
      silently break hibernation.
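
      A sketch of hibernate_map_page() as described; the unmap side mirrors it
      with set_direct_map_invalid_noflush()/debug_pagealloc_unmap_pages():

        static inline void hibernate_map_page(struct page *page)
        {
            if (IS_ENABLED(CONFIG_ARCH_HAS_SET_DIRECT_MAP)) {
                /*
                 * Only protection bits of an existing PTE change here, so
                 * a failure is unexpected; warn instead of failing.
                 */
                int ret = set_direct_map_default_noflush(page);

                if (ret)
                    pr_warn_once("Failed to remap page\n");
            } else {
                debug_pagealloc_map_pages(page, 1);
            }
        }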
      
      Link: https://lkml.kernel.org/r/20201109192128.960-3-rppt@kernel.org
      Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
      Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Albert Ou <aou@eecs.berkeley.edu>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: "Edgecombe, Rick P" <rick.p.edgecombe@intel.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Len Brown <len.brown@intel.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Palmer Dabbelt <palmer@dabbelt.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Pavel Machek <pavel@ucw.cz>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: introduce debug_pagealloc_{map,unmap}_pages() helpers · 77bc7fd6
      Committed by Mike Rapoport
      Patch series "arch, mm: improve robustness of direct map manipulation", v7.
      
      During recent discussion about KVM protected memory, David raised a
      concern about usage of __kernel_map_pages() outside of DEBUG_PAGEALLOC
      scope [1].
      
      Indeed, for architectures that define CONFIG_ARCH_HAS_SET_DIRECT_MAP it is
      possible that __kernel_map_pages() would fail, but since this function is
      void, the failure will go unnoticed.
      
      Moreover, there's a lack of consistency in __kernel_map_pages() semantics
      across architectures, as some guard this function with #ifdef
      DEBUG_PAGEALLOC, some refuse to update the direct map if page allocation
      debugging is disabled at run time and some allow modifying the direct map
      regardless of DEBUG_PAGEALLOC settings.
      
      This set straightens this out by restoring dependency of
      __kernel_map_pages() on DEBUG_PAGEALLOC and updating the call sites
      accordingly.
      
      Since currently the only user of __kernel_map_pages() outside
      DEBUG_PAGEALLOC is hibernation, it is updated to make direct map accesses
      there more explicit.
      
      [1] https://lore.kernel.org/lkml/2759b4bf-e1e3-d006-7d86-78a40348269d@redhat.com
      
      This patch (of 4):
      
      When CONFIG_DEBUG_PAGEALLOC is enabled, it unmaps pages from the kernel
      direct mapping after free_pages().  The pages then need to be mapped back
      before they can be used.  These mapping operations use
      __kernel_map_pages() guarded with debug_pagealloc_enabled().
      
      The only place that calls __kernel_map_pages() without checking whether
      DEBUG_PAGEALLOC is enabled is the hibernation code that presumes
      availability of this function when ARCH_HAS_SET_DIRECT_MAP is set.  Still,
      on arm64, __kernel_map_pages() will bail out when DEBUG_PAGEALLOC is not
      enabled but set_direct_map_invalid_noflush() may render some pages not
      present in the direct map and hibernation code won't be able to save such
      pages.
      
      To make page allocation debugging and hibernation interaction more robust,
      the dependency on DEBUG_PAGEALLOC or ARCH_HAS_SET_DIRECT_MAP has to be
      made more explicit.
      
      Start with combining the guard condition and the call to
      __kernel_map_pages() into debug_pagealloc_map_pages() and
      debug_pagealloc_unmap_pages() functions to emphasize that
      __kernel_map_pages() should not be called without DEBUG_PAGEALLOC and use
      these new functions to map/unmap pages when page allocation debugging is
      enabled.
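
      The helpers are essentially a guard plus the call, along these lines
      (sketch):

        static inline void debug_pagealloc_map_pages(struct page *page, int numpages)
        {
            if (debug_pagealloc_enabled_static())
                __kernel_map_pages(page, numpages, 1);
        }

        static inline void debug_pagealloc_unmap_pages(struct page *page, int numpages)
        {
            if (debug_pagealloc_enabled_static())
                __kernel_map_pages(page, numpages, 0);
        }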
      
      Link: https://lkml.kernel.org/r/20201109192128.960-1-rppt@kernel.org
      Link: https://lkml.kernel.org/r/20201109192128.960-2-rppt@kernel.org
      Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Albert Ou <aou@eecs.berkeley.edu>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: "Edgecombe, Rick P" <rick.p.edgecombe@intel.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Len Brown <len.brown@intel.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Palmer Dabbelt <palmer@dabbelt.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Pavel Machek <pavel@ucw.cz>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ia64: remove custom __early_pfn_to_nid() · 03e92a5e
      Committed by Mike Rapoport
      The ia64 implementation of __early_pfn_to_nid() essentially relies on the
      same data as the generic implementation.
      
      The correspondence between memory ranges and nodes is set in memblock
      during early memory initialization in register_active_ranges() function.
      
      The initialization of sparsemem that requires early_pfn_to_nid() happens
      later and it can use the memblock information like the other architectures.
      
      Link: https://lkml.kernel.org/r/20201101170454.9567-3-rppt@kernel.org
      Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Greg Ungerer <gerg@linux-m68k.org>
      Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Meelis Roos <mroos@linux.ee>
      Cc: Michael Schmitz <schmitzmic@gmail.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • vm_ops: rename .split() callback to .may_split() · dd3b614f
      Committed by Dmitry Safonov
      Rename the callback to reflect that it's not called *on* or *after* split,
      but rather some time before the splitting to check if it's possible.
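
      For reference, the hook now reads roughly as follows (signature assumed
      from the description):

        struct vm_operations_struct {
            /* ... */
            /*
             * Called before the VMA is split; return 0 to allow the split or
             * a negative errno to refuse it. Nothing is split yet when this
             * runs.
             */
            int (*may_split)(struct vm_area_struct *area, unsigned long addr);
            /* ... */
        };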
      
      Link: https://lkml.kernel.org/r/20201013013416.390574-5-dima@arista.com
      Signed-off-by: Dmitry Safonov <dima@arista.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vishal Verma <vishal.l.verma@intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mremap: don't allow MREMAP_DONTUNMAP on special_mappings and aio · cd544fd1
      Committed by Dmitry Safonov
      As the kernel expects to see only one such mapping, any further operations
      on the VMA copy may be unexpected by the kernel.  Maybe it's erring on the
      safe side, but there doesn't seem to be any expected use-case for this, so
      restrict it now.
      
      Link: https://lkml.kernel.org/r/20201013013416.390574-4-dima@arista.com
      Fixes: e346b381 ("mm/mremap: add MREMAP_DONTUNMAP to mremap()")
      Signed-off-by: Dmitry Safonov <dima@arista.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vishal Verma <vishal.l.verma@intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: cleanup: remove unused tsk arg from __access_remote_vm · d3f5ffca
      Committed by John Hubbard
      Despite a comment that said that page fault accounting would be charged to
      whatever task_struct* was passed into __access_remote_vm(), the tsk
      argument was actually unused.
      
      Making page fault accounting actually use this task struct is quite a
      project, so there is no point in keeping the tsk argument.
      
      Delete both the comment, and the argument.
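
      The resulting declaration is assumed to look like this, with only the
      task_struct parameter gone:

        int __access_remote_vm(struct mm_struct *mm, unsigned long addr,
                               void *buf, int len, unsigned int gup_flags);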
      
      [rppt@linux.ibm.com: changelog addition]
      
      Link: https://lkml.kernel.org/r/20201026074137.4147787-1-jhubbard@nvidia.com
      Signed-off-by: John Hubbard <jhubbard@nvidia.com>
      Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: memcontrol: account pagetables per node · f0c0c115
      Committed by Shakeel Butt
      For many workloads, pagetable consumption is significant and it makes
      sense to expose it in the memory.stat for the memory cgroups.  However at
      the moment, the pagetables are accounted per-zone.  Converting them to
      per-node and using the right interface will correctly account for the
      memory cgroups as well.
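
      Sketch of what the conversion means at an accounting site; the ctor shown
      is illustrative of the per-node (lruvec) interface:

        static inline bool pgtable_pte_page_ctor(struct page *page)
        {
            if (!ptlock_init(page))
                return false;
            __SetPageTable(page);
            /* Per-node (and memcg-aware) instead of the old per-zone counter. */
            inc_lruvec_page_state(page, NR_PAGETABLE);
            return true;
        }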
      
      [akpm@linux-foundation.org: export __mod_lruvec_page_state to modules for arch/mips/kvm/]
      
      Link: https://lkml.kernel.org/r/20201130212541.2781790-3-shakeelb@google.com
      Signed-off-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Roman Gushchin <guro@fb.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  6. 03 Dec 2020, 1 commit
    • mm: memcontrol: Use helpers to read page's memcg data · bcfe06bf
      Committed by Roman Gushchin
      Patch series "mm: allow mapping accounted kernel pages to userspace", v6.
      
      Currently a non-slab kernel page which has been charged to a memory cgroup
      can't be mapped to userspace.  The underlying reason is simple: PageKmemcg
      flag is defined as a page type (like buddy, offline, etc), so it takes a
      bit from a page->mapped counter.  Pages with a type set can't be mapped to
      userspace.
      
      But in general the kmemcg flag has nothing to do with mapping to
      userspace.  It only means that the page has been accounted by the page
      allocator, so it has to be properly uncharged on release.
      
      Some bpf maps are mapping the vmalloc-based memory to userspace, and their
      memory can't be accounted because of this implementation detail.
      
      This patchset removes this limitation by moving the PageKmemcg flag into
      one of the free bits of the page->mem_cgroup pointer.  Also it formalizes
      accesses to the page->mem_cgroup and page->obj_cgroups using new helpers,
      adds several checks and removes a couple of obsolete functions.  As a
      result, the code becomes more robust, with fewer open-coded bit tricks.
      
      This patch (of 4):
      
      Currently there are many open-coded reads of the page->mem_cgroup pointer,
      as well as a couple of read helpers, which are barely used.
      
      It creates an obstacle to reusing some bits of the pointer for storing
      additional information.  In fact, we already do this for
      slab pages, where the last bit indicates that a pointer has an attached
      vector of objcg pointers instead of a regular memcg pointer.
      
      This commit uses 2 existing helpers and introduces a new helper to
      convert all read sides to calls of these helpers:
        struct mem_cgroup *page_memcg(struct page *page);
        struct mem_cgroup *page_memcg_rcu(struct page *page);
        struct mem_cgroup *page_memcg_check(struct page *page);
      
      page_memcg_check() is intended to be used in cases when the page can be a
      slab page and have a memcg pointer pointing at objcg vector.  It does
      check the lowest bit, and if set, returns NULL.  page_memcg() contains a
      VM_BUG_ON_PAGE() check for the page not being a slab page.
      
      To make sure nobody uses a direct access, struct page's
      mem_cgroup/obj_cgroups is converted to unsigned long memcg_data.
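
      A simplified sketch of two of the helpers, following the description
      above:

        static inline struct mem_cgroup *page_memcg(struct page *page)
        {
            VM_BUG_ON_PAGE(PageSlab(page), page);
            return (struct mem_cgroup *)page->memcg_data;
        }

        static inline struct mem_cgroup *page_memcg_check(struct page *page)
        {
            unsigned long memcg_data = READ_ONCE(page->memcg_data);

            /* Lowest bit set: an objcg vector, not a memcg pointer. */
            if (memcg_data & 0x1UL)
                return NULL;
            return (struct mem_cgroup *)memcg_data;
        }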
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Link: https://lkml.kernel.org/r/20201027001657.3398190-1-guro@fb.com
      Link: https://lkml.kernel.org/r/20201027001657.3398190-2-guro@fb.com
      Link: https://lore.kernel.org/bpf/20201201215900.3569844-2-guro@fb.com
  7. 19 Nov 2020, 1 commit
  8. 17 Nov 2020, 1 commit
    • mm: Add 'mprotect' hook to struct vm_operations_struct · 95bb7c42
      Committed by Sean Christopherson
      Background
      ==========
      
      1. SGX enclave pages are populated with data by copying from normal memory
         via ioctl() (SGX_IOC_ENCLAVE_ADD_PAGES), which will be added later in
         this series.
      2. It is desirable to be able to restrict those normal memory data sources.
         For instance, to ensure that the source data is executable before
         copying data to an executable enclave page.
      3. Enclave page permissions are dynamic (just like normal permissions) and
         can be adjusted at runtime with mprotect().
      
      This creates a problem because the original data source may have long since
      vanished at the time when enclave page permissions are established (mmap()
      or mprotect()).
      
      The solution (elsewhere in this series) is to force enclave creators to
      declare their paging permission *intent* up front to the ioctl().  This
      intent can be immediately compared to the source data’s mapping and
      rejected if necessary.
      
      The “intent” is also stashed off for later comparison with enclave
      PTEs. This ensures that any future mmap()/mprotect() operations
      performed by the enclave creator or done on behalf of the enclave
      can be compared with the earlier declared permissions.
      
      Problem
      =======
      
      There is an existing mmap() hook which allows SGX to perform this
      permission comparison at mmap() time.  However, there is no corresponding
      ->mprotect() hook.
      
      Solution
      ========
      
      Add a vm_ops->mprotect() hook so that mprotect() operations which are
      inconsistent with any page's stashed intent can be rejected by the driver.
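
      The hook is assumed to take roughly this shape:

        struct vm_operations_struct {
            /* ... */
            /*
             * Called by mprotect() before the protection change is applied;
             * returning an error rejects the change.
             */
            int (*mprotect)(struct vm_area_struct *vma, unsigned long start,
                            unsigned long end, unsigned long newflags);
            /* ... */
        };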
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Co-developed-by: Jarkko Sakkinen <jarkko@kernel.org>
      Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org>
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Acked-by: Jethro Beekman <jethro@fortanix.com>
      Acked-by: Dave Hansen <dave.hansen@intel.com>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Hillf Danton <hdanton@sina.com>
      Cc: linux-mm@kvack.org
      Link: https://lkml.kernel.org/r/20201112220135.165028-11-jarkko@kernel.org
  9. 06 Nov 2020, 1 commit
  10. 05 Nov 2020, 1 commit
  11. 03 Nov 2020, 1 commit
    • mm: always have io_remap_pfn_range() set pgprot_decrypted() · f8f6ae5d
      Committed by Jason Gunthorpe
      The purpose of io_remap_pfn_range() is to map IO memory, such as a
      memory mapped IO exposed through a PCI BAR.  IO devices do not
      understand encryption, so this memory must always be decrypted.
      Automatically call pgprot_decrypted() as part of the generic
      implementation.
      
      This fixes a bug where enabling AMD SME causes subsystems, such as RDMA,
      using io_remap_pfn_range() to expose BAR pages to user space to fail.
      The CPU will encrypt access to those BAR pages instead of passing
      unencrypted IO directly to the device.
      
      Places not mapping IO should use remap_pfn_range().
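
      The generic definition then reduces to something like this (sketch):

        #ifndef io_remap_pfn_range
        static inline int io_remap_pfn_range(struct vm_area_struct *vma,
                                             unsigned long addr, unsigned long pfn,
                                             unsigned long size, pgprot_t prot)
        {
            /* IO memory must never be mapped encrypted. */
            return remap_pfn_range(vma, addr, pfn, size,
                                   pgprot_decrypted(prot));
        }
        #endif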
      
      Fixes: aca20d54 ("x86/mm: Add support to make use of Secure Memory Encryption")
      Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brijesh Singh <brijesh.singh@amd.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: "Dave Young" <dyoung@redhat.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Larry Woodman <lwoodman@redhat.com>
      Cc: Matt Fleming <matt@codeblueprint.co.uk>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Toshimitsu Kani <toshi.kani@hpe.com>
      Cc: <stable@vger.kernel.org>
      Link: https://lkml.kernel.org/r/0-v1-025d64bdf6c4+e-amd_sme_fix_jgg@nvidia.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  12. 19 Oct 2020, 1 commit
    • mm/madvise: pass mm to do_madvise · 0726b01e
      Committed by Minchan Kim
      Patch series "introduce memory hinting API for external process", v9.
      
      Now we have MADV_PAGEOUT and MADV_COLD as madvise hinting APIs.  With
      those, an application can give the kernel hints about which memory ranges
      it prefers to be reclaimed.  However, on some platforms (e.g., Android),
      the information required to make the hinting decision is not known to the
      app.  Instead, it is known to a centralized userspace daemon (e.g.,
      ActivityManagerService), and that daemon must be able to initiate reclaim
      on its own without any app involvement.
      
      To solve the concern, this patch series introduces a new syscall,
      process_madvise(2).  Basically, it's the same as the madvise(2) syscall
      but it has some differences.
      
      1. It needs pidfd of target process to provide the hint
      
      2. It supports only MADV_{COLD|PAGEOUT|MERGEABLE|UNMERGEABLE} at this
         moment.  Other madvise hints will be opened up when there are explicit
         requests from the community, to prevent unexpected bugs we couldn't
         support.

      3. Only privileged processes can operate on another process's address
         space.
      
      For more detail of the new API, please see "mm: introduce external memory
      hinting API" description in this patchset.
      
      This patch (of 3):
      
      In upcoming patches, do_madvise will be called from external process
      context, so we shouldn't assume "current" is always the hinted process's
      task_struct.
      
      Furthermore, we must not access mm_struct via task->mm, but obtain it via
      access_mm() once (in the following patch) and only use that pointer [1],
      so pass it to do_madvise() as well.  Note the vma->vm_mm pointers are
      safe, so we can use them further down the call stack.
      
      And let's pass current->mm as the argument of do_madvise so it doesn't
      change existing behavior but prepares for the next patch, making review
      easy.
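
      So the signature becomes the following; example_madvise() is purely
      illustrative of how an existing caller keeps today's behavior:

        int do_madvise(struct mm_struct *mm, unsigned long start,
                       size_t len_in, int behavior);

        /* Existing call sites simply pass current->mm. */
        static long example_madvise(unsigned long start, size_t len_in, int behavior)
        {
            return do_madvise(current->mm, start, len_in, behavior);
        }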
      
      [vbabka@suse.cz: changelog tweak]
      [minchan@kernel.org: use current->mm for io_uring]
        Link: http://lkml.kernel.org/r/20200423145215.72666-1-minchan@kernel.org
      [akpm@linux-foundation.org: fix it for upstream changes]
      [akpm@linux-foundation.org: whoops]
      [rdunlap@infradead.org: add missing includes]
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Suren Baghdasaryan <surenb@google.com>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jann Horn <jannh@google.com>
      Cc: Tim Murray <timmurray@google.com>
      Cc: Daniel Colascione <dancol@google.com>
      Cc: Sandeep Patil <sspatil@google.com>
      Cc: Sonny Rao <sonnyrao@google.com>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: John Dias <joaodias@google.com>
      Cc: Joel Fernandes <joel@joelfernandes.org>
      Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Cc: SeongJae Park <sj38.park@gmail.com>
      Cc: Christian Brauner <christian@brauner.io>
      Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
      Cc: Oleksandr Natalenko <oleksandr@redhat.com>
      Cc: SeongJae Park <sjpark@amazon.de>
      Cc: Christian Brauner <christian.brauner@ubuntu.com>
      Cc: Florian Weimer <fw@deneb.enyo.de>
      Cc: <linux-man@vger.kernel.org>
      Link: https://lkml.kernel.org/r/20200901000633.1920247-1-minchan@kernel.org
      Link: http://lkml.kernel.org/r/20200622192900.22757-1-minchan@kernel.org
      Link: http://lkml.kernel.org/r/20200302193630.68771-2-minchan@kernel.org
      Link: http://lkml.kernel.org/r/20200622192900.22757-2-minchan@kernel.org
      Link: https://lkml.kernel.org/r/20200901000633.1920247-2-minchan@kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  13. 17 Oct 2020, 4 commits
  14. 14 Oct 2020, 3 commits
  15. 28 Sep 2020, 1 commit
  16. 27 Sep 2020, 1 commit
    • mm: replace memmap_context by meminit_context · c1d0da83
      Committed by Laurent Dufour
      Patch series "mm: fix memory to node bad links in sysfs", v3.
      
      Sometimes, firmware may expose interleaved memory layout like this:
      
       Early memory node ranges
         node   1: [mem 0x0000000000000000-0x000000011fffffff]
         node   2: [mem 0x0000000120000000-0x000000014fffffff]
         node   1: [mem 0x0000000150000000-0x00000001ffffffff]
         node   0: [mem 0x0000000200000000-0x000000048fffffff]
         node   2: [mem 0x0000000490000000-0x00000007ffffffff]
      
      In that case, we can see memory blocks assigned to multiple nodes in
      sysfs:
      
        $ ls -l /sys/devices/system/memory/memory21
        total 0
        lrwxrwxrwx 1 root root     0 Aug 24 05:27 node1 -> ../../node/node1
        lrwxrwxrwx 1 root root     0 Aug 24 05:27 node2 -> ../../node/node2
        -rw-r--r-- 1 root root 65536 Aug 24 05:27 online
        -r--r--r-- 1 root root 65536 Aug 24 05:27 phys_device
        -r--r--r-- 1 root root 65536 Aug 24 05:27 phys_index
        drwxr-xr-x 2 root root     0 Aug 24 05:27 power
        -r--r--r-- 1 root root 65536 Aug 24 05:27 removable
        -rw-r--r-- 1 root root 65536 Aug 24 05:27 state
        lrwxrwxrwx 1 root root     0 Aug 24 05:25 subsystem -> ../../../../bus/memory
        -rw-r--r-- 1 root root 65536 Aug 24 05:25 uevent
        -r--r--r-- 1 root root 65536 Aug 24 05:27 valid_zones
      
      The same applies in the node's directory with a memory21 link in both
      the node1 and node2's directory.
      
      This is wrong but doesn't prevent the system from running.  However, when
      one of these memory blocks is later hot-unplugged and then hot-plugged,
      the system detects an inconsistency in the sysfs layout and a
      BUG_ON() is raised:
      
        kernel BUG at /Users/laurent/src/linux-ppc/mm/memory_hotplug.c:1084!
        LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
        Modules linked in: rpadlpar_io rpaphp pseries_rng rng_core vmx_crypto gf128mul binfmt_misc ip_tables x_tables xfs libcrc32c crc32c_vpmsum autofs4
        CPU: 8 PID: 10256 Comm: drmgr Not tainted 5.9.0-rc1+ #25
        Call Trace:
          add_memory_resource+0x23c/0x340 (unreliable)
          __add_memory+0x5c/0xf0
          dlpar_add_lmb+0x1b4/0x500
          dlpar_memory+0x1f8/0xb80
          handle_dlpar_errorlog+0xc0/0x190
          dlpar_store+0x198/0x4a0
          kobj_attr_store+0x30/0x50
          sysfs_kf_write+0x64/0x90
          kernfs_fop_write+0x1b0/0x290
          vfs_write+0xe8/0x290
          ksys_write+0xdc/0x130
          system_call_exception+0x160/0x270
          system_call_common+0xf0/0x27c
      
      This has been seen on PowerPC LPAR.
      
      The root cause of this issue is that when a node's memory is registered,
      the range used can overlap another node's range, and thus the memory block
      is registered to multiple nodes in sysfs.
      
      There are two issues here:
      
       (a) The sysfs memory and node's layouts are broken due to these
           multiple links
      
       (b) The link errors in link_mem_sections() should not lead to a system
           panic.
      
      To address (a), register_mem_sect_under_node() should not rely on the
      system state to detect whether the link operation is triggered by a
      hot-plug operation or not.  This is addressed by patches 1 and 2 of this
      series.
      
      Issue (b) will be addressed separately.
      
      This patch (of 2):
      
      The memmap_context enum is used to detect whether a memory operation is
      due to a hot-add operation or happening at boot time.
      
      Make it more generic for hotplug operations and rename it to
      meminit_context.

      There is no functional change introduced by this patch.
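
      The renamed enum is assumed to read:

        enum meminit_context {
            MEMINIT_EARLY,      /* boot-time memmap initialization */
            MEMINIT_HOTPLUG,    /* memory hotplug */
        };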
      Suggested-by: David Hildenbrand <david@redhat.com>
      Signed-off-by: Laurent Dufour <ldufour@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: "Rafael J . Wysocki" <rafael@kernel.org>
      Cc: Nathan Lynch <nathanl@linux.ibm.com>
      Cc: Scott Cheloha <cheloha@linux.ibm.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: <stable@vger.kernel.org>
      Link: https://lkml.kernel.org/r/20200915094143.79181-1-ldufour@linux.ibm.com
      Link: https://lkml.kernel.org/r/20200915132624.9723-1-ldufour@linux.ibm.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  17. 18 Sep 2020, 1 commit
    • mm: allow a controlled amount of unfairness in the page lock · 5ef64cc8
      Committed by Linus Torvalds
      Commit 2a9127fc ("mm: rewrite wait_on_page_bit_common() logic") made
      the page locking entirely fair, in that if a waiter came in while the
      lock was held, the lock would be transferred to the lockers strictly in
      order.
      
      That was intended to finally get rid of the long-reported watchdog
      failures that involved the page lock under extreme load, where a process
      could end up waiting essentially forever, as other page lockers stole
      the lock from under it.
      
      It also improved some benchmarks, but it ended up causing huge
      performance regressions on others, simply because fair lock behavior
      doesn't end up giving out the lock as aggressively, causing better
      worst-case latency, but potentially much worse average latencies and
      throughput.
      
      Instead of reverting that change entirely, this introduces a controlled
      amount of unfairness, with a sysctl knob to tune it if somebody needs
      to.  But the default value should hopefully be good for any normal load,
      allowing a few rounds of lock stealing, but enforcing the strict
      ordering before the lock has been stolen too many times.
      
      There is also a hint from Matthieu Baerts that the fair page coloring
      may end up exposing an ABBA deadlock that is hidden by the usual
      optimistic lock stealing, and while the unfairness doesn't fix the
      fundamental issue (and I'm still looking at that), it avoids it in
      practice.
      
      The amount of unfairness can be modified by writing a new value to the
      'sysctl_page_lock_unfairness' variable (default value of 5, exposed
      through /proc/sys/vm/page_lock_unfairness), but that is hopefully
      something we'd use mainly for debugging rather than being necessary for
      any deep system tuning.
      
      This whole issue has exposed just how critical the page lock can be, and
      how contended it gets under certain loads.  And the main contention
      doesn't really seem to be anything related to IO (which was the origin
      of this lock), but for things like just verifying that the page file
      mapping is stable while faulting in the page into a page table.
      
      Link: https://lore.kernel.org/linux-fsdevel/ed8442fd-6f54-dd84-cd4a-941e8b7ee603@MichaelLarabel.com/
      Link: https://www.phoronix.com/scan.php?page=article&item=linux-50-59&num=1
      Link: https://lore.kernel.org/linux-fsdevel/c560a38d-8313-51fb-b1ec-e904bd8836bc@tessares.net/
      Reported-and-tested-by: Michael Larabel <Michael@michaellarabel.com>
      Tested-by: Matthieu Baerts <matthieu.baerts@tessares.net>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Chris Mason <clm@fb.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Amir Goldstein <amir73il@gmail.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  18. 04 Sep 2020, 1 commit
    • arm64: mte: Add PROT_MTE support to mmap() and mprotect() · 9f341931
      Committed by Catalin Marinas
      To enable tagging on a memory range, the user must explicitly opt in via
      a new PROT_MTE flag passed to mmap() or mprotect(). Since this is a new
      memory type in the AttrIndx field of a pte, simplify the or'ing of these
      bits over the protection_map[] attributes by making MT_NORMAL index 0.
      
      There are two conditions for arch_vm_get_page_prot() to return the
      MT_NORMAL_TAGGED memory type: (1) the user requested it via PROT_MTE,
      registered as VM_MTE in the vm_flags, and (2) the vma supports MTE,
      decided during the mmap() call (only) and registered as VM_MTE_ALLOWED.
      
      arch_calc_vm_prot_bits() is responsible for registering the user request
      as VM_MTE. The newly introduced arch_calc_vm_flag_bits() sets
      VM_MTE_ALLOWED if the mapping is MAP_ANONYMOUS. An MTE-capable
      filesystem (RAM-based) may be able to set VM_MTE_ALLOWED during its
      mmap() file ops call.
      
      In addition, update VM_DATA_DEFAULT_FLAGS to allow mprotect(PROT_MTE) on
      stack or brk area.
      
      The Linux mmap() syscall currently ignores unknown PROT_* flags. In the
      presence of MTE, an mmap(PROT_MTE) on a file which does not support MTE
      will not report an error and the memory will not be mapped as Normal
      Tagged. For consistency, mprotect(PROT_MTE) will not report an error
      either if the memory range does not support MTE. Two subsequent patches
      in the series will propose tightening of this behaviour.
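
      An illustrative arm64-style sketch of the two translation helpers; the
      guards and helper names are assumptions, not quoted from the patch:

        static inline unsigned long arch_calc_vm_prot_bits(unsigned long prot,
                                                           unsigned long pkey)
        {
            /* Record the user's PROT_MTE request as VM_MTE. */
            if (system_supports_mte() && (prot & PROT_MTE))
                return VM_MTE;
            return 0;
        }
        #define arch_calc_vm_prot_bits(prot, pkey) arch_calc_vm_prot_bits(prot, pkey)

        static inline unsigned long arch_calc_vm_flag_bits(unsigned long flags)
        {
            /* Only anonymous mappings are MTE-capable by default. */
            if (system_supports_mte() && (flags & MAP_ANONYMOUS))
                return VM_MTE_ALLOWED;
            return 0;
        }
        #define arch_calc_vm_flag_bits(flags) arch_calc_vm_flag_bits(flags)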
      Co-developed-by: Vincenzo Frascino <vincenzo.frascino@arm.com>
      Signed-off-by: Vincenzo Frascino <vincenzo.frascino@arm.com>
      Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will@kernel.org>
  19. 24 Aug 2020, 1 commit