1. 17 10月, 2020 14 次提交
    • L
      mm: don't panic when links can't be created in sysfs · 90c7eaeb
      Laurent Dufour 提交于
      At boot time, or when doing memory hot-add operations, if the links in
      sysfs can't be created, the system is still able to run, so just report
      the error in the kernel log rather than BUG_ON and potentially make system
      unusable because the callpath can be called with locks held.
      
      Since the number of memory blocks managed could be high, the messages are
      rate limited.
      
      As a consequence, link_mem_sections() has no status to report anymore.
      Signed-off-by: NLaurent Dufour <ldufour@linux.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: NOscar Salvador <osalvador@suse.de>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NDavid Hildenbrand <david@redhat.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Nathan Lynch <nathanl@linux.ibm.com>
      Cc: "Rafael J . Wysocki" <rafael@kernel.org>
      Cc: Scott Cheloha <cheloha@linux.ibm.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Link: https://lkml.kernel.org/r/20200915094143.79181-4-ldufour@linux.ibm.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      90c7eaeb
    • D
      kernel/resource: make iomem_resource implicit in release_mem_region_adjustable() · cb8e3c8b
      David Hildenbrand 提交于
      "mem" in the name already indicates the root, similar to
      release_mem_region() and devm_request_mem_region().  Make it implicit.
      The only single caller always passes iomem_resource, other parents are not
      applicable.
      Suggested-by: NWei Yang <richard.weiyang@linux.alibaba.com>
      Signed-off-by: NDavid Hildenbrand <david@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: NWei Yang <richard.weiyang@linux.alibaba.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Ard Biesheuvel <ardb@kernel.org>
      Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Cc: Baoquan He <bhe@redhat.com>
      Link: https://lkml.kernel.org/r/20200916073041.10355-1-david@redhat.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      cb8e3c8b
    • D
      mm/memory_hotplug: MEMHP_MERGE_RESOURCE to specify merging of System RAM resources · 9ca6551e
      David Hildenbrand 提交于
      Some add_memory*() users add memory in small, contiguous memory blocks.
      Examples include virtio-mem, hyper-v balloon, and the XEN balloon.
      
      This can quickly result in a lot of memory resources, whereby the actual
      resource boundaries are not of interest (e.g., it might be relevant for
      DIMMs, exposed via /proc/iomem to user space).  We really want to merge
      added resources in this scenario where possible.
      
      Let's provide a flag (MEMHP_MERGE_RESOURCE) to specify that a resource
      either created within add_memory*() or passed via add_memory_resource()
      shall be marked mergeable and merged with applicable siblings.
      
      To implement that, we need a kernel/resource interface to mark selected
      System RAM resources mergeable (IORESOURCE_SYSRAM_MERGEABLE) and trigger
      merging.
      
      Note: We really want to merge after the whole operation succeeded, not
      directly when adding a resource to the resource tree (it would break
      add_memory_resource() and require splitting resources again when the
      operation failed - e.g., due to -ENOMEM).
      Signed-off-by: NDavid Hildenbrand <david@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: NPankaj Gupta <pankaj.gupta.linux@gmail.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Ard Biesheuvel <ardb@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "K. Y. Srinivasan" <kys@microsoft.com>
      Cc: Haiyang Zhang <haiyangz@microsoft.com>
      Cc: Stephen Hemminger <sthemmin@microsoft.com>
      Cc: Wei Liu <wei.liu@kernel.org>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Stefano Stabellini <sstabellini@kernel.org>
      Cc: Roger Pau Monné <roger.pau@citrix.com>
      Cc: Julien Grall <julien@xen.org>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Wei Yang <richardw.yang@linux.intel.com>
      Cc: Anton Blanchard <anton@ozlabs.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: Eric Biederman <ebiederm@xmission.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Jason Wang <jasowang@redhat.com>
      Cc: Len Brown <lenb@kernel.org>
      Cc: Leonardo Bras <leobras.c@gmail.com>
      Cc: Libor Pechacek <lpechacek@suse.cz>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: Nathan Lynch <nathanl@linux.ibm.com>
      Cc: "Oliver O'Halloran" <oohall@gmail.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Pingfan Liu <kernelfans@gmail.com>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Vishal Verma <vishal.l.verma@intel.com>
      Link: https://lkml.kernel.org/r/20200911103459.10306-6-david@redhat.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9ca6551e
    • D
      mm/memory_hotplug: prepare passing flags to add_memory() and friends · b6117199
      David Hildenbrand 提交于
      We soon want to pass flags, e.g., to mark added System RAM resources.
      mergeable.  Prepare for that.
      
      This patch is based on a similar patch by Oscar Salvador:
      
      https://lkml.kernel.org/r/20190625075227.15193-3-osalvador@suse.deSigned-off-by: NDavid Hildenbrand <david@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Juergen Gross <jgross@suse.com> # Xen related part
      Reviewed-by: NPankaj Gupta <pankaj.gupta.linux@gmail.com>
      Acked-by: NWei Liu <wei.liu@kernel.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Cc: Len Brown <lenb@kernel.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Vishal Verma <vishal.l.verma@intel.com>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: "K. Y. Srinivasan" <kys@microsoft.com>
      Cc: Haiyang Zhang <haiyangz@microsoft.com>
      Cc: Stephen Hemminger <sthemmin@microsoft.com>
      Cc: Wei Liu <wei.liu@kernel.org>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: Jason Wang <jasowang@redhat.com>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Stefano Stabellini <sstabellini@kernel.org>
      Cc: "Oliver O'Halloran" <oohall@gmail.com>
      Cc: Pingfan Liu <kernelfans@gmail.com>
      Cc: Nathan Lynch <nathanl@linux.ibm.com>
      Cc: Libor Pechacek <lpechacek@suse.cz>
      Cc: Anton Blanchard <anton@ozlabs.org>
      Cc: Leonardo Bras <leobras.c@gmail.com>
      Cc: Ard Biesheuvel <ardb@kernel.org>
      Cc: Eric Biederman <ebiederm@xmission.com>
      Cc: Julien Grall <julien@xen.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Roger Pau Monné <roger.pau@citrix.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Wei Yang <richardw.yang@linux.intel.com>
      Link: https://lkml.kernel.org/r/20200911103459.10306-5-david@redhat.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b6117199
    • D
      kernel/resource: move and rename IORESOURCE_MEM_DRIVER_MANAGED · 7cf603d1
      David Hildenbrand 提交于
      IORESOURCE_MEM_DRIVER_MANAGED currently uses an unused PnP bit, which is
      always set to 0 by hardware.  This is far from beautiful (and confusing),
      and the bit only applies to SYSRAM.  So let's move it out of the
      bus-specific (PnP) defined bits.
      
      We'll add another SYSRAM specific bit soon.  If we ever need more bits for
      other purposes, we can steal some from "desc", or reshuffle/regroup what
      we have.
      Signed-off-by: NDavid Hildenbrand <david@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Ard Biesheuvel <ardb@kernel.org>
      Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Wei Yang <richardw.yang@linux.intel.com>
      Cc: Eric Biederman <ebiederm@xmission.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Anton Blanchard <anton@ozlabs.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: Haiyang Zhang <haiyangz@microsoft.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Jason Wang <jasowang@redhat.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Julien Grall <julien@xen.org>
      Cc: "K. Y. Srinivasan" <kys@microsoft.com>
      Cc: Len Brown <lenb@kernel.org>
      Cc: Leonardo Bras <leobras.c@gmail.com>
      Cc: Libor Pechacek <lpechacek@suse.cz>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: Nathan Lynch <nathanl@linux.ibm.com>
      Cc: "Oliver O'Halloran" <oohall@gmail.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Pingfan Liu <kernelfans@gmail.com>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Cc: Roger Pau Monné <roger.pau@citrix.com>
      Cc: Stefano Stabellini <sstabellini@kernel.org>
      Cc: Stephen Hemminger <sthemmin@microsoft.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Vishal Verma <vishal.l.verma@intel.com>
      Cc: Wei Liu <wei.liu@kernel.org>
      Link: https://lkml.kernel.org/r/20200911103459.10306-3-david@redhat.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7cf603d1
    • D
      kernel/resource: make release_mem_region_adjustable() never fail · ec62d04e
      David Hildenbrand 提交于
      Patch series "selective merging of system ram resources", v4.
      
      Some add_memory*() users add memory in small, contiguous memory blocks.
      Examples include virtio-mem, hyper-v balloon, and the XEN balloon.
      
      This can quickly result in a lot of memory resources, whereby the actual
      resource boundaries are not of interest (e.g., it might be relevant for
      DIMMs, exposed via /proc/iomem to user space).  We really want to merge
      added resources in this scenario where possible.
      
      Resources are effectively stored in a list-based tree.  Having a lot of
      resources not only wastes memory, it also makes traversing that tree more
      expensive, and makes /proc/iomem explode in size (e.g., requiring
      kexec-tools to manually merge resources when creating a kdump header.  The
      current kexec-tools resource count limit does not allow for more than
      ~100GB of memory with a memory block size of 128MB on x86-64).
      
      Let's allow to selectively merge system ram resources by specifying a new
      flag for add_memory*().  Patch #5 contains a /proc/iomem example.  Only
      tested with virtio-mem.
      
      This patch (of 8):
      
      Let's make sure splitting a resource on memory hotunplug will never fail.
      This will become more relevant once we merge selected System RAM resources
      - then, we'll trigger that case more often on memory hotunplug.
      
      In general, this function is already unlikely to fail.  When we remove
      memory, we free up quite a lot of metadata (memmap, page tables, memory
      block device, etc.).  The only reason it could really fail would be when
      injecting allocation errors.
      
      All other error cases inside release_mem_region_adjustable() seem to be
      sanity checks if the function would be abused in different context - let's
      add WARN_ON_ONCE() in these cases so we can catch them.
      
      [natechancellor@gmail.com: fix use of ternary condition in release_mem_region_adjustable]
        Link: https://lkml.kernel.org/r/20200922060748.2452056-1-natechancellor@gmail.com
        Link: https://github.com/ClangBuiltLinux/linux/issues/1159Signed-off-by: NDavid Hildenbrand <david@redhat.com>
      Signed-off-by: NNathan Chancellor <natechancellor@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Ard Biesheuvel <ardb@kernel.org>
      Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Wei Yang <richardw.yang@linux.intel.com>
      Cc: Anton Blanchard <anton@ozlabs.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: Eric Biederman <ebiederm@xmission.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Haiyang Zhang <haiyangz@microsoft.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Jason Wang <jasowang@redhat.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Julien Grall <julien@xen.org>
      Cc: "K. Y. Srinivasan" <kys@microsoft.com>
      Cc: Len Brown <lenb@kernel.org>
      Cc: Leonardo Bras <leobras.c@gmail.com>
      Cc: Libor Pechacek <lpechacek@suse.cz>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: Nathan Lynch <nathanl@linux.ibm.com>
      Cc: "Oliver O'Halloran" <oohall@gmail.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Pingfan Liu <kernelfans@gmail.com>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Cc: Roger Pau Monn <roger.pau@citrix.com>
      Cc: Stefano Stabellini <sstabellini@kernel.org>
      Cc: Stephen Hemminger <sthemmin@microsoft.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Vishal Verma <vishal.l.verma@intel.com>
      Cc: Wei Liu <wei.liu@kernel.org>
      Link: https://lkml.kernel.org/r/20200911103459.10306-2-david@redhat.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ec62d04e
    • D
      mm/memory_hotplug: mark pageblocks MIGRATE_ISOLATE while onlining memory · b30c5927
      David Hildenbrand 提交于
      Currently, it can happen that pages are allocated (and freed) via the
      buddy before we finished basic memory onlining.
      
      For example, pages are exposed to the buddy and can be allocated before we
      actually mark the sections online.  Allocated pages could suddenly fail
      pfn_to_online_page() checks.  We had similar issues with pcp handling,
      when pages are allocated+freed before we reach zone_pcp_update() in
      online_pages() [1].
      
      Instead, mark all pageblocks MIGRATE_ISOLATE, such that allocations are
      impossible.  Once done with the heavy lifting, use
      undo_isolate_page_range() to move the pages to the MIGRATE_MOVABLE
      freelist, marking them ready for allocation.  Similar to offline_pages(),
      we have to manually adjust zone->nr_isolate_pageblock.
      
      [1] https://lkml.kernel.org/r/1597150703-19003-1-git-send-email-charante@codeaurora.orgSigned-off-by: NDavid Hildenbrand <david@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: NOscar Salvador <osalvador@suse.de>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Cc: Charan Teja Reddy <charante@codeaurora.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Logan Gunthorpe <logang@deltatee.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Tony Luck <tony.luck@intel.com>
      Link: https://lkml.kernel.org/r/20200819175957.28465-11-david@redhat.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b30c5927
    • D
      mm: pass migratetype into memmap_init_zone() and move_pfn_range_to_zone() · d882c006
      David Hildenbrand 提交于
      On the memory onlining path, we want to start with MIGRATE_ISOLATE, to
      un-isolate the pages after memory onlining is complete.  Let's allow
      passing in the migratetype.
      Signed-off-by: NDavid Hildenbrand <david@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: NOscar Salvador <osalvador@suse.de>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Logan Gunthorpe <logang@deltatee.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Charan Teja Reddy <charante@codeaurora.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Link: https://lkml.kernel.org/r/20200819175957.28465-10-david@redhat.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d882c006
    • D
      mm/memory_hotplug: simplify page onlining · aac65321
      David Hildenbrand 提交于
      We don't allow to offline memory with holes, all boot memory is online,
      and all hotplugged memory cannot have holes.
      
      We can now simplify onlining of pages.  As we only allow to online/offline
      full sections and sections always span full MAX_ORDER_NR_PAGES, we can
      just process MAX_ORDER - 1 pages without further special handling.
      
      The number of onlined pages simply corresponds to the number of pages we
      were requested to online.
      
      While at it, refine the comment regarding the callback not exposing all
      pages to the buddy.
      Signed-off-by: NDavid Hildenbrand <david@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: NOscar Salvador <osalvador@suse.de>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Cc: Charan Teja Reddy <charante@codeaurora.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Logan Gunthorpe <logang@deltatee.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Tony Luck <tony.luck@intel.com>
      Link: https://lkml.kernel.org/r/20200819175957.28465-8-david@redhat.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      aac65321
    • D
      mm/page_isolation: simplify return value of start_isolate_page_range() · 3fa0c7c7
      David Hildenbrand 提交于
      Callers no longer need the number of isolated pageblocks.  Let's simplify.
      Signed-off-by: NDavid Hildenbrand <david@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: NOscar Salvador <osalvador@suse.de>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Cc: Charan Teja Reddy <charante@codeaurora.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Logan Gunthorpe <logang@deltatee.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Tony Luck <tony.luck@intel.com>
      Link: https://lkml.kernel.org/r/20200819175957.28465-7-david@redhat.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3fa0c7c7
    • D
      mm/memory_hotplug: drop nr_isolate_pageblock in offline_pages() · ea15153c
      David Hildenbrand 提交于
      We make sure that we cannot have any memory holes right at the beginning
      of offline_pages() and we only support to online/offline full sections.
      Both, sections and pageblocks are a power of two in size, and sections
      always span full pageblocks.
      
      We can directly calculate the number of isolated pageblocks from nr_pages.
      Signed-off-by: NDavid Hildenbrand <david@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: NOscar Salvador <osalvador@suse.de>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Cc: Charan Teja Reddy <charante@codeaurora.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Logan Gunthorpe <logang@deltatee.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Tony Luck <tony.luck@intel.com>
      Link: https://lkml.kernel.org/r/20200819175957.28465-6-david@redhat.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ea15153c
    • D
      mm/memory_hotplug: simplify page offlining · 0a1a9a00
      David Hildenbrand 提交于
      We make sure that we cannot have any memory holes right at the beginning
      of offline_pages().  We no longer need walk_system_ram_range() and can
      call test_pages_isolated() and __offline_isolated_pages() directly.
      
      offlined_pages always corresponds to nr_pages, so we can simplify that.
      
      [akpm@linux-foundation.org: patch conflict resolution]
      Signed-off-by: NDavid Hildenbrand <david@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: NOscar Salvador <osalvador@suse.de>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Cc: Charan Teja Reddy <charante@codeaurora.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Logan Gunthorpe <logang@deltatee.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Tony Luck <tony.luck@intel.com>
      Link: https://lkml.kernel.org/r/20200819175957.28465-4-david@redhat.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0a1a9a00
    • D
      mm/memory_hotplug: enforce section granularity when onlining/offlining · 4986fac1
      David Hildenbrand 提交于
      Already two people (including me) tried to offline subsections, because
      the function looks like it can deal with it.  But we really can only
      online/offline full sections that are properly aligned (e.g., we can only
      mark full sections online/offline via SECTION_IS_ONLINE).
      
      Add a simple safety net to document the restriction now.  Current users
      (core and powernv/memtrace) respect these restrictions.
      Signed-off-by: NDavid Hildenbrand <david@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: NOscar Salvador <osalvador@suse.de>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Cc: Charan Teja Reddy <charante@codeaurora.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Logan Gunthorpe <logang@deltatee.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Tony Luck <tony.luck@intel.com>
      Link: https://lkml.kernel.org/r/20200819175957.28465-3-david@redhat.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4986fac1
    • D
      mm/memory_hotplug: inline __offline_pages() into offline_pages() · 73a11c96
      David Hildenbrand 提交于
      Patch series "mm/memory_hotplug: online_pages()/offline_pages() cleanups", v2.
      
      These are a bunch of cleanups for online_pages()/offline_pages() and
      related code, mostly getting rid of memory hole handling that is no longer
      necessary.  There is only a single walk_system_ram_range() call left in
      offline_pages(), to make sure we don't have any memory holes.  I had some
      of these patches lying around for a longer time but didn't have time to
      polish them.
      
      In addition, the last patch marks all pageblocks of memory to get onlined
      MIGRATE_ISOLATE, so pages that have just been exposed to the buddy cannot
      get allocated before onlining is complete.  Once heavy lifting is done,
      the pageblocks are set to MIGRATE_MOVABLE, such that allocations are
      possible.
      
      I played with DIMMs and virtio-mem on x86-64 and didn't spot any
      surprises.  I verified that the numer of isolated pageblocks is correctly
      handled when onlining/offlining.
      
      This patch (of 10):
      
      There is only a single user, offline_pages(). Let's inline, to make
      it look more similar to online_pages().
      Signed-off-by: NDavid Hildenbrand <david@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: NOscar Salvador <osalvador@suse.de>
      Reviewed-by: NPankaj Gupta <pankaj.gupta.linux@gmail.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Cc: Charan Teja Reddy <charante@codeaurora.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Logan Gunthorpe <logang@deltatee.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Link: https://lkml.kernel.org/r/20200819175957.28465-1-david@redhat.com
      Link: https://lkml.kernel.org/r/20200819175957.28465-2-david@redhat.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      73a11c96
  2. 14 10月, 2020 1 次提交
    • D
      mm/memory_hotplug: introduce default phys_to_target_node() implementation · a035b6bf
      Dan Williams 提交于
      In preparation to set a fallback value for dev_dax->target_node, introduce
      generic fallback helpers for phys_to_target_node()
      
      A generic implementation based on node-data or memblock was proposed, but
      as noted by Mike:
      
          "Here again, I would prefer to add a weak default for
           phys_to_target_node() because the "generic" implementation is not really
           generic.
      
           The fallback to reserved ranges is x86 specfic because on x86 most of
           the reserved areas is not in memblock.memory. AFAIK, no other
           architecture does this."
      
      The info message in the generic memory_add_physaddr_to_nid()
      implementation is fixed up to properly reflect that
      memory_add_physaddr_to_nid() communicates "online" node info and
      phys_to_target_node() indicates "target / to-be-onlined" node info.
      
      [akpm@linux-foundation.org: fix CONFIG_MEMORY_HOTPLUG=n build]
        Link: https://lkml.kernel.org/r/202008252130.7YrHIyMI%25lkp@intel.comSigned-off-by: NDan Williams <dan.j.williams@intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Jia He <justin.he@arm.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Ard Biesheuvel <ardb@kernel.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Ben Skeggs <bskeggs@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brice Goglin <Brice.Goglin@inria.fr>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Daniel Vetter <daniel@ffwll.ch>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: David Airlie <airlied@linux.ie>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: Jason Gunthorpe <jgg@mellanox.com>
      Cc: Jeff Moyer <jmoyer@redhat.com>
      Cc: Joao Martins <joao.m.martins@oracle.com>
      Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Paul Mackerras <paulus@ozlabs.org>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Cc: Vishal Verma <vishal.l.verma@intel.com>
      Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Cc: Bjorn Helgaas <bhelgaas@google.com>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Hulk Robot <hulkci@huawei.com>
      Cc: Jason Yan <yanaijie@huawei.com>
      Cc: "Jérôme Glisse" <jglisse@redhat.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: kernel test robot <lkp@intel.com>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Stefano Stabellini <sstabellini@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Link: https://lkml.kernel.org/r/159643097768.4062302.3135192588966888630.stgit@dwillia2-desk3.amr.corp.intel.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a035b6bf
  3. 27 9月, 2020 2 次提交
    • L
      mm: don't rely on system state to detect hot-plug operations · f85086f9
      Laurent Dufour 提交于
      In register_mem_sect_under_node() the system_state's value is checked to
      detect whether the call is made during boot time or during an hot-plug
      operation.  Unfortunately, that check against SYSTEM_BOOTING is wrong
      because regular memory is registered at SYSTEM_SCHEDULING state.  In
      addition, memory hot-plug operation can be triggered at this system
      state by the ACPI [1].  So checking against the system state is not
      enough.
      
      The consequence is that on system with interleaved node's ranges like this:
      
       Early memory node ranges
         node   1: [mem 0x0000000000000000-0x000000011fffffff]
         node   2: [mem 0x0000000120000000-0x000000014fffffff]
         node   1: [mem 0x0000000150000000-0x00000001ffffffff]
         node   0: [mem 0x0000000200000000-0x000000048fffffff]
         node   2: [mem 0x0000000490000000-0x00000007ffffffff]
      
      This can be seen on PowerPC LPAR after multiple memory hot-plug and
      hot-unplug operations are done.  At the next reboot the node's memory
      ranges can be interleaved and since the call to link_mem_sections() is
      made in topology_init() while the system is in the SYSTEM_SCHEDULING
      state, the node's id is not checked, and the sections registered to
      multiple nodes:
      
        $ ls -l /sys/devices/system/memory/memory21/node*
        total 0
        lrwxrwxrwx 1 root root     0 Aug 24 05:27 node1 -> ../../node/node1
        lrwxrwxrwx 1 root root     0 Aug 24 05:27 node2 -> ../../node/node2
      
      In that case, the system is able to boot but if later one of theses
      memory blocks is hot-unplugged and then hot-plugged, the sysfs
      inconsistency is detected and this is triggering a BUG_ON():
      
        kernel BUG at /Users/laurent/src/linux-ppc/mm/memory_hotplug.c:1084!
        Oops: Exception in kernel mode, sig: 5 [#1]
        LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
        Modules linked in: rpadlpar_io rpaphp pseries_rng rng_core vmx_crypto gf128mul binfmt_misc ip_tables x_tables xfs libcrc32c crc32c_vpmsum autofs4
        CPU: 8 PID: 10256 Comm: drmgr Not tainted 5.9.0-rc1+ #25
        Call Trace:
          add_memory_resource+0x23c/0x340 (unreliable)
          __add_memory+0x5c/0xf0
          dlpar_add_lmb+0x1b4/0x500
          dlpar_memory+0x1f8/0xb80
          handle_dlpar_errorlog+0xc0/0x190
          dlpar_store+0x198/0x4a0
          kobj_attr_store+0x30/0x50
          sysfs_kf_write+0x64/0x90
          kernfs_fop_write+0x1b0/0x290
          vfs_write+0xe8/0x290
          ksys_write+0xdc/0x130
          system_call_exception+0x160/0x270
          system_call_common+0xf0/0x27c
      
      This patch addresses the root cause by not relying on the system_state
      value to detect whether the call is due to a hot-plug operation.  An
      extra parameter is added to link_mem_sections() detailing whether the
      operation is due to a hot-plug operation.
      
      [1] According to Oscar Salvador, using this qemu command line, ACPI
      memory hotplug operations are raised at SYSTEM_SCHEDULING state:
      
        $QEMU -enable-kvm -machine pc -smp 4,sockets=4,cores=1,threads=1 -cpu host -monitor pty \
              -m size=$MEM,slots=255,maxmem=4294967296k  \
              -numa node,nodeid=0,cpus=0-3,mem=512 -numa node,nodeid=1,mem=512 \
              -object memory-backend-ram,id=memdimm0,size=134217728 -device pc-dimm,node=0,memdev=memdimm0,id=dimm0,slot=0 \
              -object memory-backend-ram,id=memdimm1,size=134217728 -device pc-dimm,node=0,memdev=memdimm1,id=dimm1,slot=1 \
              -object memory-backend-ram,id=memdimm2,size=134217728 -device pc-dimm,node=0,memdev=memdimm2,id=dimm2,slot=2 \
              -object memory-backend-ram,id=memdimm3,size=134217728 -device pc-dimm,node=0,memdev=memdimm3,id=dimm3,slot=3 \
              -object memory-backend-ram,id=memdimm4,size=134217728 -device pc-dimm,node=1,memdev=memdimm4,id=dimm4,slot=4 \
              -object memory-backend-ram,id=memdimm5,size=134217728 -device pc-dimm,node=1,memdev=memdimm5,id=dimm5,slot=5 \
              -object memory-backend-ram,id=memdimm6,size=134217728 -device pc-dimm,node=1,memdev=memdimm6,id=dimm6,slot=6 \
      
      Fixes: 4fbce633 ("mm/memory_hotplug.c: make register_mem_sect_under_node() a callback of walk_memory_range()")
      Signed-off-by: NLaurent Dufour <ldufour@linux.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: NDavid Hildenbrand <david@redhat.com>
      Reviewed-by: NOscar Salvador <osalvador@suse.de>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: "Rafael J. Wysocki" <rafael@kernel.org>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Nathan Lynch <nathanl@linux.ibm.com>
      Cc: Scott Cheloha <cheloha@linux.ibm.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: <stable@vger.kernel.org>
      Link: https://lkml.kernel.org/r/20200915094143.79181-3-ldufour@linux.ibm.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f85086f9
    • L
      mm: replace memmap_context by meminit_context · c1d0da83
      Laurent Dufour 提交于
      Patch series "mm: fix memory to node bad links in sysfs", v3.
      
      Sometimes, firmware may expose interleaved memory layout like this:
      
       Early memory node ranges
         node   1: [mem 0x0000000000000000-0x000000011fffffff]
         node   2: [mem 0x0000000120000000-0x000000014fffffff]
         node   1: [mem 0x0000000150000000-0x00000001ffffffff]
         node   0: [mem 0x0000000200000000-0x000000048fffffff]
         node   2: [mem 0x0000000490000000-0x00000007ffffffff]
      
      In that case, we can see memory blocks assigned to multiple nodes in
      sysfs:
      
        $ ls -l /sys/devices/system/memory/memory21
        total 0
        lrwxrwxrwx 1 root root     0 Aug 24 05:27 node1 -> ../../node/node1
        lrwxrwxrwx 1 root root     0 Aug 24 05:27 node2 -> ../../node/node2
        -rw-r--r-- 1 root root 65536 Aug 24 05:27 online
        -r--r--r-- 1 root root 65536 Aug 24 05:27 phys_device
        -r--r--r-- 1 root root 65536 Aug 24 05:27 phys_index
        drwxr-xr-x 2 root root     0 Aug 24 05:27 power
        -r--r--r-- 1 root root 65536 Aug 24 05:27 removable
        -rw-r--r-- 1 root root 65536 Aug 24 05:27 state
        lrwxrwxrwx 1 root root     0 Aug 24 05:25 subsystem -> ../../../../bus/memory
        -rw-r--r-- 1 root root 65536 Aug 24 05:25 uevent
        -r--r--r-- 1 root root 65536 Aug 24 05:27 valid_zones
      
      The same applies in the node's directory with a memory21 link in both
      the node1 and node2's directory.
      
      This is wrong but doesn't prevent the system to run.  However when
      later, one of these memory blocks is hot-unplugged and then hot-plugged,
      the system is detecting an inconsistency in the sysfs layout and a
      BUG_ON() is raised:
      
        kernel BUG at /Users/laurent/src/linux-ppc/mm/memory_hotplug.c:1084!
        LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
        Modules linked in: rpadlpar_io rpaphp pseries_rng rng_core vmx_crypto gf128mul binfmt_misc ip_tables x_tables xfs libcrc32c crc32c_vpmsum autofs4
        CPU: 8 PID: 10256 Comm: drmgr Not tainted 5.9.0-rc1+ #25
        Call Trace:
          add_memory_resource+0x23c/0x340 (unreliable)
          __add_memory+0x5c/0xf0
          dlpar_add_lmb+0x1b4/0x500
          dlpar_memory+0x1f8/0xb80
          handle_dlpar_errorlog+0xc0/0x190
          dlpar_store+0x198/0x4a0
          kobj_attr_store+0x30/0x50
          sysfs_kf_write+0x64/0x90
          kernfs_fop_write+0x1b0/0x290
          vfs_write+0xe8/0x290
          ksys_write+0xdc/0x130
          system_call_exception+0x160/0x270
          system_call_common+0xf0/0x27c
      
      This has been seen on PowerPC LPAR.
      
      The root cause of this issue is that when node's memory is registered,
      the range used can overlap another node's range, thus the memory block
      is registered to multiple nodes in sysfs.
      
      There are two issues here:
      
       (a) The sysfs memory and node's layouts are broken due to these
           multiple links
      
       (b) The link errors in link_mem_sections() should not lead to a system
           panic.
      
      To address (a) register_mem_sect_under_node should not rely on the
      system state to detect whether the link operation is triggered by a hot
      plug operation or not.  This is addressed by the patches 1 and 2 of this
      series.
      
      Issue (b) will be addressed separately.
      
      This patch (of 2):
      
      The memmap_context enum is used to detect whether a memory operation is
      due to a hot-add operation or happening at boot time.
      
      Make it general to the hotplug operation and rename it as
      meminit_context.
      
      There is no functional change introduced by this patch
      Suggested-by: NDavid Hildenbrand <david@redhat.com>
      Signed-off-by: NLaurent Dufour <ldufour@linux.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: NDavid Hildenbrand <david@redhat.com>
      Reviewed-by: NOscar Salvador <osalvador@suse.de>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: "Rafael J . Wysocki" <rafael@kernel.org>
      Cc: Nathan Lynch <nathanl@linux.ibm.com>
      Cc: Scott Cheloha <cheloha@linux.ibm.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: <stable@vger.kernel.org>
      Link: https://lkml.kernel.org/r/20200915094143.79181-1-ldufour@linux.ibm.com
      Link: https://lkml.kernel.org/r/20200915132624.9723-1-ldufour@linux.ibm.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c1d0da83
  4. 20 9月, 2020 1 次提交
  5. 15 8月, 2020 1 次提交
  6. 13 8月, 2020 4 次提交
    • J
      mm/migrate: introduce a standard migration target allocation function · 19fc7bed
      Joonsoo Kim 提交于
      There are some similar functions for migration target allocation.  Since
      there is no fundamental difference, it's better to keep just one rather
      than keeping all variants.  This patch implements base migration target
      allocation function.  In the following patches, variants will be converted
      to use this function.
      
      Changes should be mechanical, but, unfortunately, there are some
      differences.  First, some callers' nodemask is assgined to NULL since NULL
      nodemask will be considered as all available nodes, that is,
      &node_states[N_MEMORY].  Second, for hugetlb page allocation, gfp_mask is
      redefined as regular hugetlb allocation gfp_mask plus __GFP_THISNODE if
      user provided gfp_mask has it.  This is because future caller of this
      function requires to set this node constaint.  Lastly, if provided nodeid
      is NUMA_NO_NODE, nodeid is set up to the node where migration source
      lives.  It helps to remove simple wrappers for setting up the nodeid.
      
      Note that PageHighmem() call in previous function is changed to open-code
      "is_highmem_idx()" since it provides more readability.
      
      [akpm@linux-foundation.org: tweak patch title, per Vlastimil]
      [akpm@linux-foundation.org: fix typo in comment]
      Signed-off-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Roman Gushchin <guro@fb.com>
      Link: http://lkml.kernel.org/r/1594622517-20681-6-git-send-email-iamjoonsoo.kim@lge.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      19fc7bed
    • C
      mm, memory_hotplug: update pcp lists everytime onlining a memory block · de1193f0
      Charan Teja Reddy 提交于
      When onlining a first memory block in a zone, pcp lists are not updated
      thus pcp struct will have the default setting of ->high = 0,->batch = 1.
      
      This means till the second memory block in a zone(if it have) is onlined
      the pcp lists of this zone will not contain any pages because pcp's
      ->count is always greater than ->high thus free_pcppages_bulk() is called
      to free batch size(=1) pages every time system wants to add a page to the
      pcp list through free_unref_page().
      
      To put this in a word, system is not using benefits offered by the pcp
      lists when there is a single onlineable memory block in a zone.  Correct
      this by always updating the pcp lists when memory block is onlined.
      
      Fixes: 1f522509 ("mem-hotplug: avoid multiple zones sharing same boot strapping boot_pageset")
      Signed-off-by: NCharan Teja Reddy <charante@codeaurora.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: NDavid Hildenbrand <david@redhat.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Vinayak Menon <vinmenon@codeaurora.org>
      Link: http://lkml.kernel.org/r/1596372896-15336-1-git-send-email-charante@codeaurora.orgSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      de1193f0
    • J
      mm/memory_hotplug: fix unpaired mem_hotplug_begin/done · b4223a51
      Jia He 提交于
      When check_memblock_offlined_cb() returns failed rc(e.g. the memblock is
      online at that time), mem_hotplug_begin/done is unpaired in such case.
      
      Therefore a warning:
       Call Trace:
        percpu_up_write+0x33/0x40
        try_remove_memory+0x66/0x120
        ? _cond_resched+0x19/0x30
        remove_memory+0x2b/0x40
        dev_dax_kmem_remove+0x36/0x72 [kmem]
        device_release_driver_internal+0xf0/0x1c0
        device_release_driver+0x12/0x20
        bus_remove_device+0xe1/0x150
        device_del+0x17b/0x3e0
        unregister_dev_dax+0x29/0x60
        devm_action_release+0x15/0x20
        release_nodes+0x19a/0x1e0
        devres_release_all+0x3f/0x50
        device_release_driver_internal+0x100/0x1c0
        driver_detach+0x4c/0x8f
        bus_remove_driver+0x5c/0xd0
        driver_unregister+0x31/0x50
        dax_pmem_exit+0x10/0xfe0 [dax_pmem]
      
      Fixes: f1037ec0 ("mm/memory_hotplug: fix remove_memory() lockdep splat")
      Signed-off-by: NJia He <justin.he@arm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: NDavid Hildenbrand <david@redhat.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NDan Williams <dan.j.williams@intel.com>
      Cc: <stable@vger.kernel.org>	[5.6+]
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Chuhong Yuan <hslester96@gmail.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jonathan Cameron <Jonathan.Cameron@Huawei.com>
      Cc: Kaly Xin <Kaly.Xin@arm.com>
      Cc: Logan Gunthorpe <logang@deltatee.com>
      Cc: Masahiro Yamada <masahiroy@kernel.org>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vishal Verma <vishal.l.verma@intel.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Link: http://lkml.kernel.org/r/20200710031619.18762-3-justin.he@arm.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b4223a51
    • J
      mm/memory_hotplug: introduce default dummy memory_add_physaddr_to_nid() · d622ecec
      Jia He 提交于
      This is to introduce a general dummy helper.  memory_add_physaddr_to_nid()
      is a fallback option to get the nid in case NUMA_NO_NID is detected.
      
      After this patch, arm64/sh/s390 can simply use the general dummy version.
      PowerPC/x86/ia64 will still use their specific version.
      
      This is the preparation to set a fallback value for dev_dax->target_node.
      Signed-off-by: NJia He <justin.he@arm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: NDavid Hildenbrand <david@redhat.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Vishal Verma <vishal.l.verma@intel.com>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Chuhong Yuan <hslester96@gmail.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Logan Gunthorpe <logang@deltatee.com>
      Cc: Masahiro Yamada <masahiroy@kernel.org>
      Cc: Jonathan Cameron <Jonathan.Cameron@Huawei.com>
      Cc: Kaly Xin <Kaly.Xin@arm.com>
      Link: http://lkml.kernel.org/r/20200710031619.18762-2-justin.he@arm.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d622ecec
  7. 08 8月, 2020 2 次提交
  8. 26 6月, 2020 1 次提交
    • B
      mm/memory_hotplug.c: fix false softlockup during pfn range removal · b7e3debd
      Ben Widawsky 提交于
      When working with very large nodes, poisoning the struct pages (for which
      there will be very many) can take a very long time.  If the system is
      using voluntary preemptions, the software watchdog will not be able to
      detect forward progress.  This patch addresses this issue by offering to
      give up time like __remove_pages() does.  This behavior was introduced in
      v5.6 with: commit d33695b1 ("mm/memory_hotplug: poison memmap in
      remove_pfn_range_from_zone()")
      
      Alternately, init_page_poison could do this cond_resched(), but it seems
      to me that the caller of init_page_poison() is what actually knows whether
      or not it should relax its own priority.
      
      Based on Dan's notes, I think this is perfectly safe: commit f931ab47
      ("mm: fix devm_memremap_pages crash, use mem_hotplug_{begin, done}")
      
      Aside from fixing the lockup, it is also a friendlier thing to do on lower
      core systems that might wipe out large chunks of hotplug memory (probably
      not a very common case).
      
      Fixes this kind of splat:
      
        watchdog: BUG: soft lockup - CPU#46 stuck for 22s! [daxctl:9922]
        irq event stamp: 138450
        hardirqs last  enabled at (138449): [<ffffffffa1001f26>] trace_hardirqs_on_thunk+0x1a/0x1c
        hardirqs last disabled at (138450): [<ffffffffa1001f42>] trace_hardirqs_off_thunk+0x1a/0x1c
        softirqs last  enabled at (138448): [<ffffffffa1e00347>] __do_softirq+0x347/0x456
        softirqs last disabled at (138443): [<ffffffffa10c416d>] irq_exit+0x7d/0xb0
        CPU: 46 PID: 9922 Comm: daxctl Not tainted 5.7.0-BEN-14238-g373c6049b336 #30
        Hardware name: Intel Corporation PURLEY/PURLEY, BIOS PLYXCRB1.86B.0578.D07.1902280810 02/28/2019
        RIP: 0010:memset_erms+0x9/0x10
        Code: c1 e9 03 40 0f b6 f6 48 b8 01 01 01 01 01 01 01 01 48 0f af c6 f3 48 ab 89 d1 f3 aa 4c 89 c8 c3 90 49 89 f9 40 88 f0 48 89 d1 <f3> aa 4c 89 c8 c3 90 49 89 fa 40 0f b6 ce 48 b8 01 01 01 01 01 01
        Call Trace:
         remove_pfn_range_from_zone+0x3a/0x380
         memunmap_pages+0x17f/0x280
         release_nodes+0x22a/0x260
         __device_release_driver+0x172/0x220
         device_driver_detach+0x3e/0xa0
         unbind_store+0x113/0x130
         kernfs_fop_write+0xdc/0x1c0
         vfs_write+0xde/0x1d0
         ksys_write+0x58/0xd0
         do_syscall_64+0x5a/0x120
         entry_SYSCALL_64_after_hwframe+0x49/0xb3
        Built 2 zonelists, mobility grouping on.  Total pages: 49050381
        Policy zone: Normal
        Built 3 zonelists, mobility grouping on.  Total pages: 49312525
        Policy zone: Normal
      
      David said: "It really only is an issue for devmem.  Ordinary
      hotplugged system memory is not affected (onlined/offlined in memory
      block granularity)."
      
      Link: http://lkml.kernel.org/r/20200619231213.1160351-1-ben.widawsky@intel.com
      Fixes: commit d33695b1 ("mm/memory_hotplug: poison memmap in remove_pfn_range_from_zone()")
      Signed-off-by: NBen Widawsky <ben.widawsky@intel.com>
      Reported-by: N"Scargall, Steve" <steve.scargall@intel.com>
      Reported-by: NBen Widawsky <ben.widawsky@intel.com>
      Acked-by: NDavid Hildenbrand <david@redhat.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Vishal Verma <vishal.l.verma@intel.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b7e3debd
  9. 05 6月, 2020 8 次提交
    • E
    • D
      mm/memory_hotplug: introduce add_memory_driver_managed() · 7b7b2721
      David Hildenbrand 提交于
      Patch series "mm/memory_hotplug: Interface to add driver-managed system
      ram", v4.
      
      kexec (via kexec_load()) can currently not properly handle memory added
      via dax/kmem, and will have similar issues with virtio-mem.  kexec-tools
      will currently add all memory to the fixed-up initial firmware memmap.  In
      case of dax/kmem, this means that - in contrast to a proper reboot - how
      that persistent memory will be used can no longer be configured by the
      kexec'd kernel.  In case of virtio-mem it will be harmful, because that
      memory might contain inaccessible pieces that require coordination with
      hypervisor first.
      
      In both cases, we want to let the driver in the kexec'd kernel handle
      detecting and adding the memory, like during an ordinary reboot.
      Introduce add_memory_driver_managed().  More on the samentics are in patch
      #1.
      
      In the future, we might want to make this behavior configurable for
      dax/kmem- either by configuring it in the kernel (which would then also
      allow to configure kexec_file_load()) or in kexec-tools by also adding
      "System RAM (kmem)" memory from /proc/iomem to the fixed-up initial
      firmware memmap.
      
      More on the motivation can be found in [1] and [2].
      
      [1] https://lkml.kernel.org/r/20200429160803.109056-1-david@redhat.com
      [2] https://lkml.kernel.org/r/20200430102908.10107-1-david@redhat.com
      
      This patch (of 3):
      
      Some device drivers rely on memory they managed to not get added to the
      initial (firmware) memmap as system RAM - so it's not used as initial
      system RAM by the kernel and the driver is under control.  While this is
      the case during cold boot and after a reboot, kexec is not aware of that
      and might add such memory to the initial (firmware) memmap of the kexec
      kernel.  We need ways to teach kernel and userspace that this system ram
      is different.
      
      For example, dax/kmem allows to decide at runtime if persistent memory is
      to be used as system ram.  Another future user is virtio-mem, which has to
      coordinate with its hypervisor to deal with inaccessible parts within
      memory resources.
      
      We want to let users in the kernel (esp. kexec) but also user space
      (esp. kexec-tools) know that this memory has different semantics and
      needs to be handled differently:
      1. Don't create entries in /sys/firmware/memmap/
      2. Name the memory resource "System RAM ($DRIVER)" (exposed via
         /proc/iomem) ($DRIVER might be "kmem", "virtio_mem").
      3. Flag the memory resource IORESOURCE_MEM_DRIVER_MANAGED
      
      /sys/firmware/memmap/ [1] represents the "raw firmware-provided memory
      map" because "on most architectures that firmware-provided memory map is
      modified afterwards by the kernel itself".  The primary user is kexec on
      x86-64.  Since commit d96ae530 ("memory-hotplug: create
      /sys/firmware/memmap entry for new memory"), we add all hotplugged memory
      to that firmware memmap - which makes perfect sense for traditional memory
      hotplug on x86-64, where real HW will also add hotplugged DIMMs to the
      firmware memmap.  We replicate what the "raw firmware-provided memory map"
      looks like after hot(un)plug.
      
      To keep things simple, let the user provide the full resource name instead
      of only the driver name - this way, we don't have to manually
      allocate/craft strings for memory resources.  Also use the resource name
      to make decisions, to avoid passing additional flags.  In case the name
      isn't "System RAM", it's special.
      
      We don't have to worry about firmware_map_remove() on the removal path.
      If there is no entry, it will simply return with -EINVAL.
      
      We'll adapt dax/kmem in a follow-up patch.
      
      [1] https://www.kernel.org/doc/Documentation/ABI/testing/sysfs-firmware-memmapSigned-off-by: NDavid Hildenbrand <david@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Acked-by: NPankaj Gupta <pankaj.gupta.linux@gmail.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Eric Biederman <ebiederm@xmission.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Link: http://lkml.kernel.org/r/20200508084217.9160-1-david@redhat.com
      Link: http://lkml.kernel.org/r/20200508084217.9160-3-david@redhat.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7b7b2721
    • D
      mm/memory_hotplug: handle memblocks only with CONFIG_ARCH_KEEP_MEMBLOCK · 52219aea
      David Hildenbrand 提交于
      The comment in add_memory_resource() is stale: hotadd_new_pgdat() will no
      longer call get_pfn_range_for_nid(), as a hotadded pgdat will simply span
      no pages at all, until memory is moved to the zone/node via
      move_pfn_range_to_zone() - e.g., when onlining memory blocks.
      
      The only archs that care about memblocks for hotplugged memory (either for
      iterating over all system RAM or testing for memory validity) are arm64,
      s390x, and powerpc - due to CONFIG_ARCH_KEEP_MEMBLOCK.  Without
      CONFIG_ARCH_KEEP_MEMBLOCK, we can simply stop messing with memblocks.
      Signed-off-by: NDavid Hildenbrand <david@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Acked-by: NMike Rapoport <rppt@linux.ibm.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Link: http://lkml.kernel.org/r/20200422155353.25381-3-david@redhat.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      52219aea
    • D
      mm/memory_hotplug: set node_start_pfn of hotadded pgdat to 0 · c68ab18c
      David Hildenbrand 提交于
      Patch series "mm/memory_hotplug: handle memblocks only with
      CONFIG_ARCH_KEEP_MEMBLOCK", v1.
      
      A hotadded node/pgdat will span no pages at all, until memory is moved to
      the zone/node via move_pfn_range_to_zone() -> resize_pgdat_range - e.g.,
      when onlining memory blocks.  We don't have to initialize the
      node_start_pfn to the memory we are adding.
      
      This patch (of 2):
      
      Especially, there is an inconsistency:
       - Hotplugging memory to a memory-less node with cpus: node_start_pf ==  0
       - Offlining and removing last memory from a node: node_start_pfn == 0
       - Hotplugging memory to a memory-less node without cpus: node_start_pfn != 0
      
      As soon as memory is onlined, node_start_pfn is overwritten with the
      actual start.  E.g., when adding two DIMMs but only onlining one of both,
      only that DIMM (with online memory blocks) is spanned by the node.
      
      Currently, the validity of node_start_pfn really is linked to
      node_spanned_pages != 0.  With node_spanned_pages == 0 (e.g., before
      onlining memory), it has no meaning.
      
      So let's stop setting node_start_pfn, just to be overwritten via
      move_pfn_range_to_zone().  This avoids confusion when looking at the code,
      wondering which magic will be performed with the node_start_pfn in this
      function, when hotadding a pgdat.
      Signed-off-by: NDavid Hildenbrand <david@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Acked-by: NPankaj Gupta <pankaj.gupta.linux@gmail.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Link: http://lkml.kernel.org/r/20200422155353.25381-1-david@redhat.com
      Link: http://lkml.kernel.org/r/20200422155353.25381-2-david@redhat.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c68ab18c
    • D
      mm/memory_hotplug: remove is_mem_section_removable() · 04f3465c
      David Hildenbrand 提交于
      Fortunately, all users of is_mem_section_removable() are gone.  Get rid of
      it, including some now unnecessary functions.
      Signed-off-by: NDavid Hildenbrand <david@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: NWei Yang <richard.weiyang@gmail.com>
      Reviewed-by: NBaoquan He <bhe@redhat.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Oscar Salvador <osalvador@suse.de>
      Link: http://lkml.kernel.org/r/20200407135416.24093-3-david@redhat.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      04f3465c
    • V
      mm/memory_hotplug: refrain from adding memory into an impossible node · fa6d9ec7
      Vishal Verma 提交于
      A misbehaving qemu created a situation where the ACPI SRAT table
      advertised one fewer proximity domains than intended.  The NFIT table did
      describe all the expected proximity domains.  This caused the device dax
      driver to assign an impossible target_node to the device, and when
      hotplugged as system memory, this would fail with the following signature:
      
         BUG: kernel NULL pointer dereference, address: 0000000000000088
         #PF: supervisor read access in kernel mode
         #PF: error_code(0x0000) - not-present page
         PGD 80000001767d4067 P4D 80000001767d4067 PUD 10e0c4067 PMD 0
         Oops: 0000 [#1] SMP PTI
         CPU: 4 PID: 22737 Comm: kswapd3 Tainted: G           O      5.6.0-rc5 #9
         Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
         RIP: 0010:prepare_kswapd_sleep+0x7c/0xc0
         Code: 89 df e8 87 fd ff ff 89 c2 31 c0 84 d2 74 e6 0f 1f 44 00 00 48 8b 05 fb af 7a 01 48 63 93 88 1d 01 00 48 8b 84 d0 20 0f 00 00 <48> 3b 98 88 00 00 00 75 28 f0 80 a0 80 00 00 00 fe f0 80 a3 38 20
         RSP: 0018:ffffc900017a3e78 EFLAGS: 00010202
         RAX: 0000000000000000 RBX: ffff8881209e0000 RCX: 0000000000000000
         RDX: 0000000000000003 RSI: 0000000000000000 RDI: ffff8881209e0e80
         RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000008000
         R10: 0000000000000000 R11: 0000000000000003 R12: 0000000000000003
         R13: 0000000000000003 R14: 0000000000000000 R15: ffffc900017a3ec8
         FS:  0000000000000000(0000) GS:ffff888318c00000(0000) knlGS:0000000000000000
         CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
         CR2: 0000000000000088 CR3: 0000000120b50002 CR4: 00000000001606e0
         Call Trace:
          kswapd+0x103/0x520
          kthread+0x120/0x140
          ret_from_fork+0x3a/0x50
      
      Add a check in the add_memory path to fail if the node to which we are
      adding memory is in the node_possible_map
      Signed-off-by: NVishal Verma <vishal.l.verma@intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Acked-by: NDavid Hildenbrand <david@redhat.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Link: http://lkml.kernel.org/r/20200416225438.15208-1-vishal.l.verma@intel.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      fa6d9ec7
    • D
      mm/memory_hotplug: Introduce offline_and_remove_memory() · 08b3acd7
      David Hildenbrand 提交于
      virtio-mem wants to offline and remove a memory block once it unplugged
      all subblocks (e.g., using alloc_contig_range()). Let's provide
      an interface to do that from a driver. virtio-mem already supports to
      offline partially unplugged memory blocks. Offlining a fully unplugged
      memory block will not require to migrate any pages. All unplugged
      subblocks are PageOffline() and have a reference count of 0 - so
      offlining code will simply skip them.
      
      All we need is an interface to offline and remove the memory from kernel
      module context, where we don't have access to the memory block devices
      (esp. find_memory_block() and device_offline()) and the device hotplug
      lock.
      
      To keep things simple, allow to only work on a single memory block.
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Tested-by: NPankaj Gupta <pankaj.gupta.linux@gmail.com>
      Acked-by: NAndrew Morton <akpm@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Oscar Salvador <osalvador@suse.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Qian Cai <cai@lca.pw>
      Signed-off-by: NDavid Hildenbrand <david@redhat.com>
      Link: https://lore.kernel.org/r/20200507140139.17083-9-david@redhat.comSigned-off-by: NMichael S. Tsirkin <mst@redhat.com>
      08b3acd7
    • D
      mm: Allow to offline unmovable PageOffline() pages via MEM_GOING_OFFLINE · aa218795
      David Hildenbrand 提交于
      virtio-mem wants to allow to offline memory blocks of which some parts
      were unplugged (allocated via alloc_contig_range()), especially, to later
      offline and remove completely unplugged memory blocks. The important part
      is that PageOffline() has to remain set until the section is offline, so
      these pages will never get accessed (e.g., when dumping). The pages should
      not be handed back to the buddy (which would require clearing PageOffline()
      and result in issues if offlining fails and the pages are suddenly in the
      buddy).
      
      Let's allow to do that by allowing to isolate any PageOffline() page
      when offlining. This way, we can reach the memory hotplug notifier
      MEM_GOING_OFFLINE, where the driver can signal that he is fine with
      offlining this page by dropping its reference count. PageOffline() pages
      with a reference count of 0 can then be skipped when offlining the
      pages (like if they were free, however they are not in the buddy).
      
      Anybody who uses PageOffline() pages and does not agree to offline them
      (e.g., Hyper-V balloon, XEN balloon, VMWare balloon for 2MB pages) will not
      decrement the reference count and make offlining fail when trying to
      migrate such an unmovable page. So there should be no observable change.
      Same applies to balloon compaction users (movable PageOffline() pages), the
      pages will simply be migrated.
      
      Note 1: If offlining fails, a driver has to increment the reference
      	count again in MEM_CANCEL_OFFLINE.
      
      Note 2: A driver that makes use of this has to be aware that re-onlining
      	the memory block has to be handled by hooking into onlining code
      	(online_page_callback_t), resetting the page PageOffline() and
      	not giving them to the buddy.
      Reviewed-by: NAlexander Duyck <alexander.h.duyck@linux.intel.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Tested-by: NPankaj Gupta <pankaj.gupta.linux@gmail.com>
      Acked-by: NAndrew Morton <akpm@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Pavel Tatashin <pavel.tatashin@microsoft.com>
      Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Anthony Yznaga <anthony.yznaga@oracle.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Qian Cai <cai@lca.pw>
      Cc: Pingfan Liu <kernelfans@gmail.com>
      Signed-off-by: NDavid Hildenbrand <david@redhat.com>
      Link: https://lore.kernel.org/r/20200507140139.17083-7-david@redhat.comSigned-off-by: NMichael S. Tsirkin <mst@redhat.com>
      aa218795
  10. 04 6月, 2020 2 次提交
    • J
      mm/page_alloc: integrate classzone_idx and high_zoneidx · 97a225e6
      Joonsoo Kim 提交于
      classzone_idx is just different name for high_zoneidx now.  So, integrate
      them and add some comment to struct alloc_context in order to reduce
      future confusion about the meaning of this variable.
      
      The accessor, ac_classzone_idx() is also removed since it isn't needed
      after integration.
      
      In addition to integration, this patch also renames high_zoneidx to
      highest_zoneidx since it represents more precise meaning.
      Signed-off-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: NBaoquan He <bhe@redhat.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Acked-by: NDavid Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Ye Xiaolong <xiaolong.ye@intel.com>
      Link: http://lkml.kernel.org/r/1587095923-7515-3-git-send-email-iamjoonsoo.kim@lge.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      97a225e6
    • M
      mm: remove CONFIG_HAVE_MEMBLOCK_NODE_MAP option · 3f08a302
      Mike Rapoport 提交于
      CONFIG_HAVE_MEMBLOCK_NODE_MAP is used to differentiate initialization of
      nodes and zones structures between the systems that have region to node
      mapping in memblock and those that don't.
      
      Currently all the NUMA architectures enable this option and for the
      non-NUMA systems we can presume that all the memory belongs to node 0 and
      therefore the compile time configuration option is not required.
      
      The remaining few architectures that use DISCONTIGMEM without NUMA are
      easily updated to use memblock_add_node() instead of memblock_add() and
      thus have proper correspondence of memblock regions to NUMA nodes.
      
      Still, free_area_init_node() must have a backward compatible version
      because its semantics with and without CONFIG_HAVE_MEMBLOCK_NODE_MAP is
      different.  Once all the architectures will use the new semantics, the
      entire compatibility layer can be dropped.
      
      To avoid addition of extra run time memory to store node id for
      architectures that keep memblock but have only a single node, the node id
      field of the memblock_region is guarded by CONFIG_NEED_MULTIPLE_NODES and
      the corresponding accessors presume that in those cases it is always 0.
      Signed-off-by: NMike Rapoport <rppt@linux.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Tested-by: Hoan Tran <hoan@os.amperecomputing.com>	[arm64]
      Acked-by: Catalin Marinas <catalin.marinas@arm.com>	[arm64]
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Brian Cain <bcain@codeaurora.org>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Greentime Hu <green.hu@gmail.com>
      Cc: Greg Ungerer <gerg@linux-m68k.org>
      Cc: Guan Xuetao <gxt@pku.edu.cn>
      Cc: Guo Ren <guoren@kernel.org>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Helge Deller <deller@gmx.de>
      Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Ley Foon Tan <ley.foon.tan@intel.com>
      Cc: Mark Salter <msalter@redhat.com>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: Nick Hu <nickhu@andestech.com>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Stafford Horne <shorne@gmail.com>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Link: http://lkml.kernel.org/r/20200412194859.12663-4-rppt@kernel.orgSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3f08a302
  11. 11 4月, 2020 2 次提交
    • L
      mm/memory_hotplug: add pgprot_t to mhp_params · bfeb022f
      Logan Gunthorpe 提交于
      devm_memremap_pages() is currently used by the PCI P2PDMA code to create
      struct page mappings for IO memory.  At present, these mappings are
      created with PAGE_KERNEL which implies setting the PAT bits to be WB.
      However, on x86, an mtrr register will typically override this and force
      the cache type to be UC-.  In the case firmware doesn't set this
      register it is effectively WB and will typically result in a machine
      check exception when it's accessed.
      
      Other arches are not currently likely to function correctly seeing they
      don't have any MTRR registers to fall back on.
      
      To solve this, provide a way to specify the pgprot value explicitly to
      arch_add_memory().
      
      Of the arches that support MEMORY_HOTPLUG: x86_64, and arm64 need a
      simple change to pass the pgprot_t down to their respective functions
      which set up the page tables.  For x86_32, set the page tables
      explicitly using _set_memory_prot() (seeing they are already mapped).
      
      For ia64, s390 and sh, reject anything but PAGE_KERNEL settings -- this
      should be fine, for now, seeing these architectures don't support
      ZONE_DEVICE.
      
      A check in __add_pages() is also added to ensure the pgprot parameter
      was set for all arches.
      Signed-off-by: NLogan Gunthorpe <logang@deltatee.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Acked-by: NDavid Hildenbrand <david@redhat.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NDan Williams <dan.j.williams@intel.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Eric Badger <ebadger@gigaio.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will@kernel.org>
      Link: http://lkml.kernel.org/r/20200306170846.9333-7-logang@deltatee.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      bfeb022f
    • L
      mm/memory_hotplug: rename mhp_restrictions to mhp_params · f5637d3b
      Logan Gunthorpe 提交于
      The mhp_restrictions struct really doesn't specify anything resembling a
      restriction anymore so rename it to be mhp_params as it is a list of
      extended parameters.
      Signed-off-by: NLogan Gunthorpe <logang@deltatee.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: NDavid Hildenbrand <david@redhat.com>
      Reviewed-by: NDan Williams <dan.j.williams@intel.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Eric Badger <ebadger@gigaio.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will@kernel.org>
      Link: http://lkml.kernel.org/r/20200306170846.9333-3-logang@deltatee.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f5637d3b
  12. 08 4月, 2020 2 次提交