1. 04 6月, 2020 15 次提交
    • W
      mm/page_alloc.c: rename free_pages_check() to check_free_page() · 534fe5e3
      Wei Yang 提交于
      free_pages_check() is the counterpart of check_new_page().  Rename it to
      use the same naming convention.
      Signed-off-by: NWei Yang <richard.weiyang@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Link: http://lkml.kernel.org/r/20200411220357.9636-5-richard.weiyang@gmail.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      534fe5e3
    • W
      mm/page_alloc.c: rename free_pages_check_bad() to check_free_page_bad() · 0d0c48a2
      Wei Yang 提交于
      free_pages_check_bad() is the counterpart of check_new_page_bad().  Rename
      it to use the same naming convention.
      Signed-off-by: NWei Yang <richard.weiyang@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Link: http://lkml.kernel.org/r/20200411220357.9636-4-richard.weiyang@gmail.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0d0c48a2
    • W
      mm/page_alloc.c: bad_flags is not necessary for bad_page() · 82a3241a
      Wei Yang 提交于
      After commit 5b57b8f2 ("mm/debug.c: always print flags in
      dump_page()"), page->flags is always printed for a bad page.  It is not
      necessary to have bad_flags any more.
      Suggested-by: NAnshuman Khandual <anshuman.khandual@arm.com>
      Signed-off-by: NWei Yang <richard.weiyang@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Link: http://lkml.kernel.org/r/20200411220357.9636-3-richard.weiyang@gmail.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      82a3241a
    • W
      mm/page_alloc.c: bad_[reason|flags] is not necessary when PageHWPoison · 833d8a42
      Wei Yang 提交于
      Patch series "mm/page_alloc.c: cleanup on check page", v3.
      
      This patchset does some cleanup related to check page.
      
      1. Remove unnecessary bad_reason assignment
      2. Remove bad_flags to bad_page()
      3. Rename function for naming convention
      4. Extract common part to check page
      
      Thanks for suggestions from David Rientjes and Anshuman Khandual.
      
      This patch (of 5):
      
      Since function returns directly, bad_[reason|flags] is not used any where.
      And move this to the first.
      
      This is a following cleanup for commit e570f56c ("mm:
      check_new_page_bad() directly returns in __PG_HWPOISON case")
      Signed-off-by: NWei Yang <richard.weiyang@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: NDavid Hildenbrand <david@redhat.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: David Rientjes <rientjes@google.com>
      Link: http://lkml.kernel.org/r/20200411220357.9636-2-richard.weiyang@gmail.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      833d8a42
    • M
      mm: simplify find_min_pfn_with_active_regions() · 8a1b25fe
      Mike Rapoport 提交于
      find_min_pfn_with_active_regions() calls find_min_pfn_for_node() with nid
      parameter set to MAX_NUMNODES.  This makes the find_min_pfn_for_node()
      traverse all memblock memory regions although the first PFN in the system
      can be easily found with memblock_start_of_DRAM().
      
      Use memblock_start_of_DRAM() in find_min_pfn_with_active_regions() and drop
      now unused find_min_pfn_for_node().
      Signed-off-by: NMike Rapoport <rppt@linux.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Tested-by: Hoan Tran <hoan@os.amperecomputing.com>	[arm64]
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Brian Cain <bcain@codeaurora.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Greentime Hu <green.hu@gmail.com>
      Cc: Greg Ungerer <gerg@linux-m68k.org>
      Cc: Guan Xuetao <gxt@pku.edu.cn>
      Cc: Guo Ren <guoren@kernel.org>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Helge Deller <deller@gmx.de>
      Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Ley Foon Tan <ley.foon.tan@intel.com>
      Cc: Mark Salter <msalter@redhat.com>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: Nick Hu <nickhu@andestech.com>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Stafford Horne <shorne@gmail.com>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Link: http://lkml.kernel.org/r/20200412194859.12663-21-rppt@kernel.orgSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8a1b25fe
    • M
      mm: clean up free_area_init_node() and its helpers · 854e8848
      Mike Rapoport 提交于
      free_area_init_node() now always uses memblock info and the zone PFN
      limits so it does not need the backwards compatibility functions to
      calculate the zone spanned and absent pages.  The removal of the compat_
      versions of zone_{abscent,spanned}_pages_in_node() in turn, makes
      zone_size and zhole_size parameters unused.
      
      The node_start_pfn is determined by get_pfn_range_for_nid(), so there is
      no need to pass it to free_area_init_node().
      
      As a result, the only required parameter to free_area_init_node() is the
      node ID, all the rest are removed along with no longer used
      compat_zone_{abscent,spanned}_pages_in_node() helpers.
      Signed-off-by: NMike Rapoport <rppt@linux.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Tested-by: Hoan Tran <hoan@os.amperecomputing.com>	[arm64]
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Brian Cain <bcain@codeaurora.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Greentime Hu <green.hu@gmail.com>
      Cc: Greg Ungerer <gerg@linux-m68k.org>
      Cc: Guan Xuetao <gxt@pku.edu.cn>
      Cc: Guo Ren <guoren@kernel.org>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Helge Deller <deller@gmx.de>
      Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Ley Foon Tan <ley.foon.tan@intel.com>
      Cc: Mark Salter <msalter@redhat.com>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: Nick Hu <nickhu@andestech.com>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Stafford Horne <shorne@gmail.com>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Link: http://lkml.kernel.org/r/20200412194859.12663-20-rppt@kernel.orgSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      854e8848
    • M
      mm: rename free_area_init_node() to free_area_init_memoryless_node() · bc9331a1
      Mike Rapoport 提交于
      free_area_init_node() is only used by x86 to initialize a memory-less
      nodes.  Make its name reflect this and drop all the function parameters
      except node ID as they are anyway zero.
      Signed-off-by: NMike Rapoport <rppt@linux.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Tested-by: Hoan Tran <hoan@os.amperecomputing.com>	[arm64]
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Brian Cain <bcain@codeaurora.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Greentime Hu <green.hu@gmail.com>
      Cc: Greg Ungerer <gerg@linux-m68k.org>
      Cc: Guan Xuetao <gxt@pku.edu.cn>
      Cc: Guo Ren <guoren@kernel.org>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Helge Deller <deller@gmx.de>
      Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Ley Foon Tan <ley.foon.tan@intel.com>
      Cc: Mark Salter <msalter@redhat.com>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: Nick Hu <nickhu@andestech.com>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Stafford Horne <shorne@gmail.com>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Link: http://lkml.kernel.org/r/20200412194859.12663-19-rppt@kernel.orgSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      bc9331a1
    • M
      mm: free_area_init: allow defining max_zone_pfn in descending order · 51930df5
      Mike Rapoport 提交于
      Some architectures (e.g.  ARC) have the ZONE_HIGHMEM zone below the
      ZONE_NORMAL.  Allowing free_area_init() parse max_zone_pfn array even it
      is sorted in descending order allows using free_area_init() on such
      architectures.
      
      Add top -> down traversal of max_zone_pfn array in free_area_init() and
      use the latter in ARC node/zone initialization.
      
      [rppt@kernel.org: ARC fix]
        Link: http://lkml.kernel.org/r/20200504153901.GM14260@kernel.org
      [rppt@linux.ibm.com: arc: free_area_init(): take into account PAE40 mode]
        Link: http://lkml.kernel.org/r/20200507205900.GH683243@linux.ibm.com
      [akpm@linux-foundation.org: declare arch_has_descending_max_zone_pfns()]
      Signed-off-by: NMike Rapoport <rppt@linux.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Tested-by: Hoan Tran <hoan@os.amperecomputing.com>	[arm64]
      Reviewed-by: NBaoquan He <bhe@redhat.com>
      Cc: Brian Cain <bcain@codeaurora.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Greentime Hu <green.hu@gmail.com>
      Cc: Greg Ungerer <gerg@linux-m68k.org>
      Cc: Guan Xuetao <gxt@pku.edu.cn>
      Cc: Guo Ren <guoren@kernel.org>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Helge Deller <deller@gmx.de>
      Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Ley Foon Tan <ley.foon.tan@intel.com>
      Cc: Mark Salter <msalter@redhat.com>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: Nick Hu <nickhu@andestech.com>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Stafford Horne <shorne@gmail.com>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Cc: Guenter Roeck <linux@roeck-us.net>
      Link: http://lkml.kernel.org/r/20200412194859.12663-18-rppt@kernel.orgSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      51930df5
    • M
      mm: remove early_pfn_in_nid() and CONFIG_NODES_SPAN_OTHER_NODES · acd3f5c4
      Mike Rapoport 提交于
      The memmap_init() function was made to iterate over memblock regions and
      as the result the early_pfn_in_nid() function became obsolete.  Since
      CONFIG_NODES_SPAN_OTHER_NODES is only used to pick a stub or a real
      implementation of early_pfn_in_nid(), it is also not needed anymore.
      
      Remove both early_pfn_in_nid() and the CONFIG_NODES_SPAN_OTHER_NODES.
      Co-developed-by: NHoan Tran <Hoan@os.amperecomputing.com>
      Signed-off-by: NHoan Tran <Hoan@os.amperecomputing.com>
      Signed-off-by: NMike Rapoport <rppt@linux.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Tested-by: Hoan Tran <hoan@os.amperecomputing.com>	[arm64]
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Brian Cain <bcain@codeaurora.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Greentime Hu <green.hu@gmail.com>
      Cc: Greg Ungerer <gerg@linux-m68k.org>
      Cc: Guan Xuetao <gxt@pku.edu.cn>
      Cc: Guo Ren <guoren@kernel.org>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Helge Deller <deller@gmx.de>
      Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Ley Foon Tan <ley.foon.tan@intel.com>
      Cc: Mark Salter <msalter@redhat.com>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: Nick Hu <nickhu@andestech.com>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Stafford Horne <shorne@gmail.com>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Link: http://lkml.kernel.org/r/20200412194859.12663-17-rppt@kernel.orgSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      acd3f5c4
    • B
      mm: memmap_init: iterate over memblock regions rather that check each PFN · 73a6e474
      Baoquan He 提交于
      When called during boot the memmap_init_zone() function checks if each PFN
      is valid and actually belongs to the node being initialized using
      early_pfn_valid() and early_pfn_in_nid().
      
      Each such check may cost up to O(log(n)) where n is the number of memory
      banks, so for large amount of memory overall time spent in early_pfn*()
      becomes substantial.
      
      Since the information is anyway present in memblock, we can iterate over
      memblock memory regions in memmap_init() and only call memmap_init_zone()
      for PFN ranges that are know to be valid and in the appropriate node.
      
      [cai@lca.pw: fix a compilation warning from Clang]
        Link: http://lkml.kernel.org/r/CF6E407F-17DC-427C-8203-21979FB882EF@lca.pw
      [bhe@redhat.com: fix the incorrect hole in fast_isolate_freepages()]
        Link: http://lkml.kernel.org/r/8C537EB7-85EE-4DCF-943E-3CC0ED0DF56D@lca.pw
        Link: http://lkml.kernel.org/r/20200521014407.29690-1-bhe@redhat.comSigned-off-by: NBaoquan He <bhe@redhat.com>
      Signed-off-by: NMike Rapoport <rppt@linux.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Tested-by: Hoan Tran <hoan@os.amperecomputing.com>	[arm64]
      Cc: Brian Cain <bcain@codeaurora.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Greentime Hu <green.hu@gmail.com>
      Cc: Greg Ungerer <gerg@linux-m68k.org>
      Cc: Guan Xuetao <gxt@pku.edu.cn>
      Cc: Guo Ren <guoren@kernel.org>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Helge Deller <deller@gmx.de>
      Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Ley Foon Tan <ley.foon.tan@intel.com>
      Cc: Mark Salter <msalter@redhat.com>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: Nick Hu <nickhu@andestech.com>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Stafford Horne <shorne@gmail.com>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Cc: Qian Cai <cai@lca.pw>
      Link: http://lkml.kernel.org/r/20200412194859.12663-16-rppt@kernel.orgSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      73a6e474
    • M
      mm: use free_area_init() instead of free_area_init_nodes() · 9691a071
      Mike Rapoport 提交于
      free_area_init() has effectively became a wrapper for
      free_area_init_nodes() and there is no point of keeping it.  Still
      free_area_init() name is shorter and more general as it does not imply
      necessity to initialize multiple nodes.
      
      Rename free_area_init_nodes() to free_area_init(), update the callers and
      drop old version of free_area_init().
      Signed-off-by: NMike Rapoport <rppt@linux.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Tested-by: Hoan Tran <hoan@os.amperecomputing.com>	[arm64]
      Reviewed-by: NBaoquan He <bhe@redhat.com>
      Acked-by: NCatalin Marinas <catalin.marinas@arm.com>
      Cc: Brian Cain <bcain@codeaurora.org>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Greentime Hu <green.hu@gmail.com>
      Cc: Greg Ungerer <gerg@linux-m68k.org>
      Cc: Guan Xuetao <gxt@pku.edu.cn>
      Cc: Guo Ren <guoren@kernel.org>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Helge Deller <deller@gmx.de>
      Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Ley Foon Tan <ley.foon.tan@intel.com>
      Cc: Mark Salter <msalter@redhat.com>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: Nick Hu <nickhu@andestech.com>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Stafford Horne <shorne@gmail.com>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Link: http://lkml.kernel.org/r/20200412194859.12663-6-rppt@kernel.orgSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9691a071
    • M
      mm: free_area_init: use maximal zone PFNs rather than zone sizes · fa3354e4
      Mike Rapoport 提交于
      Currently, architectures that use free_area_init() to initialize memory
      map and node and zone structures need to calculate zone and hole sizes.
      We can use free_area_init_nodes() instead and let it detect the zone
      boundaries while the architectures will only have to supply the possible
      limits for the zones.
      Signed-off-by: NMike Rapoport <rppt@linux.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Tested-by: Hoan Tran <hoan@os.amperecomputing.com>	[arm64]
      Reviewed-by: NBaoquan He <bhe@redhat.com>
      Cc: Brian Cain <bcain@codeaurora.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Greentime Hu <green.hu@gmail.com>
      Cc: Greg Ungerer <gerg@linux-m68k.org>
      Cc: Guan Xuetao <gxt@pku.edu.cn>
      Cc: Guo Ren <guoren@kernel.org>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Helge Deller <deller@gmx.de>
      Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Ley Foon Tan <ley.foon.tan@intel.com>
      Cc: Mark Salter <msalter@redhat.com>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: Nick Hu <nickhu@andestech.com>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Stafford Horne <shorne@gmail.com>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Link: http://lkml.kernel.org/r/20200412194859.12663-5-rppt@kernel.orgSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      fa3354e4
    • M
      mm: remove CONFIG_HAVE_MEMBLOCK_NODE_MAP option · 3f08a302
      Mike Rapoport 提交于
      CONFIG_HAVE_MEMBLOCK_NODE_MAP is used to differentiate initialization of
      nodes and zones structures between the systems that have region to node
      mapping in memblock and those that don't.
      
      Currently all the NUMA architectures enable this option and for the
      non-NUMA systems we can presume that all the memory belongs to node 0 and
      therefore the compile time configuration option is not required.
      
      The remaining few architectures that use DISCONTIGMEM without NUMA are
      easily updated to use memblock_add_node() instead of memblock_add() and
      thus have proper correspondence of memblock regions to NUMA nodes.
      
      Still, free_area_init_node() must have a backward compatible version
      because its semantics with and without CONFIG_HAVE_MEMBLOCK_NODE_MAP is
      different.  Once all the architectures will use the new semantics, the
      entire compatibility layer can be dropped.
      
      To avoid addition of extra run time memory to store node id for
      architectures that keep memblock but have only a single node, the node id
      field of the memblock_region is guarded by CONFIG_NEED_MULTIPLE_NODES and
      the corresponding accessors presume that in those cases it is always 0.
      Signed-off-by: NMike Rapoport <rppt@linux.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Tested-by: Hoan Tran <hoan@os.amperecomputing.com>	[arm64]
      Acked-by: Catalin Marinas <catalin.marinas@arm.com>	[arm64]
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Brian Cain <bcain@codeaurora.org>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Greentime Hu <green.hu@gmail.com>
      Cc: Greg Ungerer <gerg@linux-m68k.org>
      Cc: Guan Xuetao <gxt@pku.edu.cn>
      Cc: Guo Ren <guoren@kernel.org>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Helge Deller <deller@gmx.de>
      Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Ley Foon Tan <ley.foon.tan@intel.com>
      Cc: Mark Salter <msalter@redhat.com>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: Nick Hu <nickhu@andestech.com>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Stafford Horne <shorne@gmail.com>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Link: http://lkml.kernel.org/r/20200412194859.12663-4-rppt@kernel.orgSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3f08a302
    • M
      mm: make early_pfn_to_nid() and related defintions close to each other · 6f24fbd3
      Mike Rapoport 提交于
      early_pfn_to_nid() and its helper __early_pfn_to_nid() are spread around
      include/linux/mm.h, include/linux/mmzone.h and mm/page_alloc.c.
      
      Drop unused stub for __early_pfn_to_nid() and move its actual generic
      implementation close to its users.
      Signed-off-by: NMike Rapoport <rppt@linux.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Tested-by: Hoan Tran <hoan@os.amperecomputing.com>	[arm64]
      Reviewed-by: NBaoquan He <bhe@redhat.com>
      Cc: Brian Cain <bcain@codeaurora.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Greentime Hu <green.hu@gmail.com>
      Cc: Greg Ungerer <gerg@linux-m68k.org>
      Cc: Guan Xuetao <gxt@pku.edu.cn>
      Cc: Guo Ren <guoren@kernel.org>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Helge Deller <deller@gmx.de>
      Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Ley Foon Tan <ley.foon.tan@intel.com>
      Cc: Mark Salter <msalter@redhat.com>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: Nick Hu <nickhu@andestech.com>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Stafford Horne <shorne@gmail.com>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Link: http://lkml.kernel.org/r/20200412194859.12663-3-rppt@kernel.orgSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6f24fbd3
    • M
      mm: memblock: replace dereferences of memblock_region.nid with API calls · d622abf7
      Mike Rapoport 提交于
      Patch series "mm: rework free_area_init*() funcitons".
      
      After the discussion [1] about removal of CONFIG_NODES_SPAN_OTHER_NODES
      and CONFIG_HAVE_MEMBLOCK_NODE_MAP options, I took it a bit further and
      updated the node/zone initialization.
      
      Since all architectures have memblock, it is possible to use only the
      newer version of free_area_init_node() that calculates the zone and node
      boundaries based on memblock node mapping and architectural limits on
      possible zone PFNs.
      
      The architectures that still determined zone and hole sizes can be
      switched to the generic code and the old code that took those zone and
      hole sizes can be simply removed.
      
      And, since it all started from the removal of
      CONFIG_NODES_SPAN_OTHER_NODES, the memmap_init() is now updated to iterate
      over memblocks and so it does not need to perform early_pfn_to_nid() query
      for every PFN.
      
      [1] https://lore.kernel.org/lkml/1585420282-25630-1-git-send-email-Hoan@os.amperecomputing.com
      
      This patch (of 21):
      
      There are several places in the code that directly dereference
      memblock_region.nid despite this field being defined only when
      CONFIG_HAVE_MEMBLOCK_NODE_MAP=y.
      
      Replace these with calls to memblock_get_region_nid() to improve code
      robustness and to avoid possible breakage when
      CONFIG_HAVE_MEMBLOCK_NODE_MAP will be removed.
      Signed-off-by: NMike Rapoport <rppt@linux.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Tested-by: Hoan Tran <hoan@os.amperecomputing.com>	[arm64]
      Reviewed-by: NBaoquan He <bhe@redhat.com>
      Cc: Brian Cain <bcain@codeaurora.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Greentime Hu <green.hu@gmail.com>
      Cc: Greg Ungerer <gerg@linux-m68k.org>
      Cc: Guan Xuetao <gxt@pku.edu.cn>
      Cc: Guo Ren <guoren@kernel.org>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Helge Deller <deller@gmx.de>
      Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Ley Foon Tan <ley.foon.tan@intel.com>
      Cc: Mark Salter <msalter@redhat.com>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Nick Hu <nickhu@andestech.com>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Stafford Horne <shorne@gmail.com>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Link: http://lkml.kernel.org/r/20200412194859.12663-1-rppt@kernel.org
      Link: http://lkml.kernel.org/r/20200412194859.12663-2-rppt@kernel.orgSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d622abf7
  2. 03 6月, 2020 2 次提交
    • C
      mm: remove the pgprot argument to __vmalloc · 88dca4ca
      Christoph Hellwig 提交于
      The pgprot argument to __vmalloc is always PAGE_KERNEL now, so remove it.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Michael Kelley <mikelley@microsoft.com> [hyperv]
      Acked-by: Gao Xiang <xiang@kernel.org> [erofs]
      Acked-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: NWei Liu <wei.liu@kernel.org>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Christophe Leroy <christophe.leroy@c-s.fr>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: David Airlie <airlied@linux.ie>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Haiyang Zhang <haiyangz@microsoft.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "K. Y. Srinivasan" <kys@microsoft.com>
      Cc: Laura Abbott <labbott@redhat.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Robin Murphy <robin.murphy@arm.com>
      Cc: Sakari Ailus <sakari.ailus@linux.intel.com>
      Cc: Stephen Hemminger <sthemmin@microsoft.com>
      Cc: Sumit Semwal <sumit.semwal@linaro.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Paul Mackerras <paulus@ozlabs.org>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Will Deacon <will@kernel.org>
      Link: http://lkml.kernel.org/r/20200414131348.444715-22-hch@lst.deSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      88dca4ca
    • N
      mm/writeback: discard NR_UNSTABLE_NFS, use NR_WRITEBACK instead · 8d92890b
      NeilBrown 提交于
      After an NFS page has been written it is considered "unstable" until a
      COMMIT request succeeds.  If the COMMIT fails, the page will be
      re-written.
      
      These "unstable" pages are currently accounted as "reclaimable", either
      in WB_RECLAIMABLE, or in NR_UNSTABLE_NFS which is included in a
      'reclaimable' count.  This might have made sense when sending the COMMIT
      required a separate action by the VFS/MM (e.g.  releasepage() used to
      send a COMMIT).  However now that all writes generated by ->writepages()
      will automatically be followed by a COMMIT (since commit 919e3bd9
      ("NFS: Ensure we commit after writeback is complete")) it makes more
      sense to treat them as writeback pages.
      
      So this patch removes NR_UNSTABLE_NFS and accounts unstable pages in
      NR_WRITEBACK and WB_WRITEBACK.
      
      A particular effect of this change is that when
      wb_check_background_flush() calls wb_over_bg_threshold(), the latter
      will report 'true' a lot less often as the 'unstable' pages are no
      longer considered 'dirty' (as there is nothing that writeback can do
      about them anyway).
      
      Currently wb_check_background_flush() will trigger writeback to NFS even
      when there are relatively few dirty pages (if there are lots of unstable
      pages), this can result in small writes going to the server (10s of
      Kilobytes rather than a Megabyte) which hurts throughput.  With this
      patch, there are fewer writes which are each larger on average.
      
      Where the NR_UNSTABLE_NFS count was included in statistics
      virtual-files, the entry is retained, but the value is hard-coded as
      zero.  static trace points and warning printks which mentioned this
      counter no longer report it.
      
      [akpm@linux-foundation.org: re-layout comment]
      [akpm@linux-foundation.org: fix printk warning]
      Signed-off-by: NNeilBrown <neilb@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Acked-by: NTrond Myklebust <trond.myklebust@hammerspace.com>
      Acked-by: Michal Hocko <mhocko@suse.com>	[mm]
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Chuck Lever <chuck.lever@oracle.com>
      Link: http://lkml.kernel.org/r/87d06j7gqa.fsf@notabene.neil.brown.nameSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8d92890b
  3. 15 5月, 2020 1 次提交
  4. 08 5月, 2020 2 次提交
    • H
      mm: limit boost_watermark on small zones · 14f69140
      Henry Willard 提交于
      Commit 1c30844d ("mm: reclaim small amounts of memory when an
      external fragmentation event occurs") adds a boost_watermark() function
      which increases the min watermark in a zone by at least
      pageblock_nr_pages or the number of pages in a page block.
      
      On Arm64, with 64K pages and 512M huge pages, this is 8192 pages or
      512M.  It does this regardless of the number of managed pages managed in
      the zone or the likelihood of success.
      
      This can put the zone immediately under water in terms of allocating
      pages from the zone, and can cause a small machine to fail immediately
      due to OoM.  Unlike set_recommended_min_free_kbytes(), which
      substantially increases min_free_kbytes and is tied to THP,
      boost_watermark() can be called even if THP is not active.
      
      The problem is most likely to appear on architectures such as Arm64
      where pageblock_nr_pages is very large.
      
      It is desirable to run the kdump capture kernel in as small a space as
      possible to avoid wasting memory.  In some architectures, such as Arm64,
      there are restrictions on where the capture kernel can run, and
      therefore, the space available.  A capture kernel running in 768M can
      fail due to OoM immediately after boost_watermark() sets the min in zone
      DMA32, where most of the memory is, to 512M.  It fails even though there
      is over 500M of free memory.  With boost_watermark() suppressed, the
      capture kernel can run successfully in 448M.
      
      This patch limits boost_watermark() to boosting a zone's min watermark
      only when there are enough pages that the boost will produce positive
      results.  In this case that is estimated to be four times as many pages
      as pageblock_nr_pages.
      
      Mel said:
      
      : There is no harm in marking it stable.  Clearly it does not happen very
      : often but it's not impossible.  32-bit x86 is a lot less common now
      : which would previously have been vulnerable to triggering this easily.
      : ppc64 has a larger base page size but typically only has one zone.
      : arm64 is likely the most vulnerable, particularly when CMA is
      : configured with a small movable zone.
      
      Fixes: 1c30844d ("mm: reclaim small amounts of memory when an external fragmentation event occurs")
      Signed-off-by: NHenry Willard <henry.willard@oracle.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: NDavid Hildenbrand <david@redhat.com>
      Acked-by: NMel Gorman <mgorman@techsingularity.net>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: <stable@vger.kernel.org>
      Link: http://lkml.kernel.org/r/1588294148-6586-1-git-send-email-henry.willard@oracle.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      14f69140
    • D
      mm/page_alloc: fix watchdog soft lockups during set_zone_contiguous() · e84fe99b
      David Hildenbrand 提交于
      Without CONFIG_PREEMPT, it can happen that we get soft lockups detected,
      e.g., while booting up.
      
        watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [swapper/0:1]
        CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.6.0-next-20200331+ #4
        Hardware name: Red Hat KVM, BIOS 1.11.1-4.module+el8.1.0+4066+0f1aadab 04/01/2014
        RIP: __pageblock_pfn_to_page+0x134/0x1c0
        Call Trace:
         set_zone_contiguous+0x56/0x70
         page_alloc_init_late+0x166/0x176
         kernel_init_freeable+0xfa/0x255
         kernel_init+0xa/0x106
         ret_from_fork+0x35/0x40
      
      The issue becomes visible when having a lot of memory (e.g., 4TB)
      assigned to a single NUMA node - a system that can easily be created
      using QEMU.  Inside VMs on a hypervisor with quite some memory
      overcommit, this is fairly easy to trigger.
      Signed-off-by: NDavid Hildenbrand <david@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: NPavel Tatashin <pasha.tatashin@soleen.com>
      Reviewed-by: NPankaj Gupta <pankaj.gupta.linux@gmail.com>
      Reviewed-by: NBaoquan He <bhe@redhat.com>
      Reviewed-by: NShile Zhang <shile.zhang@linux.alibaba.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
      Cc: Shile Zhang <shile.zhang@linux.alibaba.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Alexander Duyck <alexander.duyck@gmail.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: <stable@vger.kernel.org>
      Link: http://lkml.kernel.org/r/20200416073417.5003-1-david@redhat.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e84fe99b
  5. 11 4月, 2020 2 次提交
  6. 08 4月, 2020 4 次提交
    • A
      mm: introduce Reported pages · 36e66c55
      Alexander Duyck 提交于
      In order to pave the way for free page reporting in virtualized
      environments we will need a way to get pages out of the free lists and
      identify those pages after they have been returned.  To accomplish this,
      this patch adds the concept of a Reported Buddy, which is essentially
      meant to just be the Uptodate flag used in conjunction with the Buddy page
      type.
      
      To prevent the reported pages from leaking outside of the buddy lists I
      added a check to clear the PageReported bit in the del_page_from_free_list
      function.  As a result any reported page that is split, merged, or
      allocated will have the flag cleared prior to the PageBuddy value being
      cleared.
      
      The process for reporting pages is fairly simple.  Once we free a page
      that meets the minimum order for page reporting we will schedule a worker
      thread to start 2s or more in the future.  That worker thread will begin
      working from the lowest supported page reporting order up to MAX_ORDER - 1
      pulling unreported pages from the free list and storing them in the
      scatterlist.
      
      When processing each individual free list it is necessary for the worker
      thread to release the zone lock when it needs to stop and report the full
      scatterlist of pages.  To reduce the work of the next iteration the worker
      thread will rotate the free list so that the first unreported page in the
      free list becomes the first entry in the list.
      
      It will then call a reporting function providing information on how many
      entries are in the scatterlist.  Once the function completes it will
      return the pages to the free area from which they were allocated and start
      over pulling more pages from the free areas until there are no longer
      enough pages to report on to keep the worker busy, or we have processed as
      many pages as were contained in the free area when we started processing
      the list.
      
      The worker thread will work in a round-robin fashion making its way though
      each zone requesting reporting, and through each reportable free list
      within that zone.  Once all free areas within the zone have been processed
      it will check to see if there have been any requests for reporting while
      it was processing.  If so it will reschedule the worker thread to start up
      again in roughly 2s and exit.
      Signed-off-by: NAlexander Duyck <alexander.h.duyck@linux.intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Acked-by: NMel Gorman <mgorman@techsingularity.net>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Luiz Capitulino <lcapitulino@redhat.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michael S. Tsirkin <mst@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Nitesh Narayan Lal <nitesh@redhat.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Pankaj Gupta <pagupta@redhat.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Wei Wang <wei.w.wang@intel.com>
      Cc: Yang Zhang <yang.zhang.wz@gmail.com>
      Cc: wei qi <weiqi4@huawei.com>
      Link: http://lkml.kernel.org/r/20200211224635.29318.19750.stgit@localhost.localdomainSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      36e66c55
    • A
      mm: add function __putback_isolated_page · 624f58d8
      Alexander Duyck 提交于
      There are cases where we would benefit from avoiding having to go through
      the allocation and free cycle to return an isolated page.
      
      Examples for this might include page poisoning in which we isolate a page
      and then put it back in the free list without ever having actually
      allocated it.
      
      This will enable us to also avoid notifiers for the future free page
      reporting which will need to avoid retriggering page reporting when
      returning pages that have been reported on.
      Signed-off-by: NAlexander Duyck <alexander.h.duyck@linux.intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Acked-by: NDavid Hildenbrand <david@redhat.com>
      Acked-by: NMel Gorman <mgorman@techsingularity.net>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Luiz Capitulino <lcapitulino@redhat.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michael S. Tsirkin <mst@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Nitesh Narayan Lal <nitesh@redhat.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Pankaj Gupta <pagupta@redhat.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Wei Wang <wei.w.wang@intel.com>
      Cc: Yang Zhang <yang.zhang.wz@gmail.com>
      Cc: wei qi <weiqi4@huawei.com>
      Link: http://lkml.kernel.org/r/20200211224624.29318.89287.stgit@localhost.localdomainSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      624f58d8
    • A
      mm: use zone and order instead of free area in free_list manipulators · 6ab01363
      Alexander Duyck 提交于
      In order to enable the use of the zone from the list manipulator functions
      I will need access to the zone pointer.  As it turns out most of the
      accessors were always just being directly passed &zone->free_area[order]
      anyway so it would make sense to just fold that into the function itself
      and pass the zone and order as arguments instead of the free area.
      
      In order to be able to reference the zone we need to move the declaration
      of the functions down so that we have the zone defined before we define
      the list manipulation functions.  Since the functions are only used in the
      file mm/page_alloc.c we can just move them there to reduce noise in the
      header.
      Signed-off-by: NAlexander Duyck <alexander.h.duyck@linux.intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: NDan Williams <dan.j.williams@intel.com>
      Reviewed-by: NDavid Hildenbrand <david@redhat.com>
      Reviewed-by: NPankaj Gupta <pagupta@redhat.com>
      Acked-by: NMel Gorman <mgorman@techsingularity.net>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Luiz Capitulino <lcapitulino@redhat.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michael S. Tsirkin <mst@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Nitesh Narayan Lal <nitesh@redhat.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Wei Wang <wei.w.wang@intel.com>
      Cc: Yang Zhang <yang.zhang.wz@gmail.com>
      Cc: wei qi <weiqi4@huawei.com>
      Link: http://lkml.kernel.org/r/20200211224613.29318.43080.stgit@localhost.localdomainSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6ab01363
    • A
      mm: adjust shuffle code to allow for future coalescing · a2129f24
      Alexander Duyck 提交于
      Patch series "mm / virtio: Provide support for free page reporting", v17.
      
      This series provides an asynchronous means of reporting free guest pages
      to a hypervisor so that the memory associated with those pages can be
      dropped and reused by other processes and/or guests on the host.  Using
      this it is possible to avoid unnecessary I/O to disk and greatly improve
      performance in the case of memory overcommit on the host.
      
      When enabled we will be performing a scan of free memory every 2 seconds
      while pages of sufficiently high order are being freed.  In each pass at
      least one sixteenth of each free list will be reported.  By doing this we
      avoid racing against other threads that may be causing a high amount of
      memory churn.
      
      The lowest page order currently scanned when reporting pages is
      pageblock_order so that this feature will not interfere with the use of
      Transparent Huge Pages in the case of virtualization.
      
      Currently this is only in use by virtio-balloon however there is the hope
      that at some point in the future other hypervisors might be able to make
      use of it.  In the virtio-balloon/QEMU implementation the hypervisor is
      currently using MADV_DONTNEED to indicate to the host kernel that the page
      is currently free.  It will be zeroed and faulted back into the guest the
      next time the page is accessed.
      
      To track if a page is reported or not the Uptodate flag was repurposed and
      used as a Reported flag for Buddy pages.  We walk though the free list
      isolating pages and adding them to the scatterlist until we either
      encounter the end of the list or have processed at least one sixteenth of
      the pages that were listed in nr_free prior to us starting.  If we fill
      the scatterlist before we reach the end of the list we rotate the list so
      that the first unreported page we encounter is moved to the head of the
      list as that is where we will resume after we have freed the reported
      pages back into the tail of the list.
      
      Below are the results from various benchmarks.  I primarily focused on two
      tests.  The first is the will-it-scale/page_fault2 test, and the other is
      a modified version of will-it-scale/page_fault1 that was enabled to use
      THP.  I did this as it allows for better visibility into different parts
      of the memory subsystem.  The guest is running with 32G for RAM on one
      node of a E5-2630 v3.  The host has had some features such as CPU turbo
      disabled in the BIOS.
      
      Test                   page_fault1 (THP)    page_fault2
      Name            tasks  Process Iter  STDEV  Process Iter  STDEV
      Baseline            1    1012402.50  0.14%     361855.25  0.81%
                         16    8827457.25  0.09%    3282347.00  0.34%
      
      Patches Applied     1    1007897.00  0.23%     361887.00  0.26%
                         16    8784741.75  0.39%    3240669.25  0.48%
      
      Patches Enabled     1    1010227.50  0.39%     359749.25  0.56%
                         16    8756219.00  0.24%    3226608.75  0.97%
      
      Patches Enabled     1    1050982.00  4.26%     357966.25  0.14%
       page shuffle      16    8672601.25  0.49%    3223177.75  0.40%
      
      Patches enabled     1    1003238.00  0.22%     360211.00  0.22%
       shuffle w/ RFC    16    8767010.50  0.32%    3199874.00  0.71%
      
      The results above are for a baseline with a linux-next-20191219 kernel,
      that kernel with this patch set applied but page reporting disabled in
      virtio-balloon, the patches applied and page reporting fully enabled, the
      patches enabled with page shuffling enabled, and the patches applied with
      page shuffling enabled and an RFC patch that makes used of MADV_FREE in
      QEMU.  These results include the deviation seen between the average value
      reported here versus the high and/or low value.  I observed that during
      the test memory usage for the first three tests never dropped whereas with
      the patches fully enabled the VM would drop to using only a few GB of the
      host's memory when switching from memhog to page fault tests.
      
      Any of the overhead visible with this patch set enabled seems due to page
      faults caused by accessing the reported pages and the host zeroing the
      page before giving it back to the guest.  This overhead is much more
      visible when using THP than with standard 4K pages.  In addition page
      shuffling seemed to increase the amount of faults generated due to an
      increase in memory churn.  The overehad is reduced when using MADV_FREE as
      we can avoid the extra zeroing of the pages when they are reintroduced to
      the host, as can be seen when the RFC is applied with shuffling enabled.
      
      The overall guest size is kept fairly small to only a few GB while the
      test is running.  If the host memory were oversubscribed this patch set
      should result in a performance improvement as swapping memory in the host
      can be avoided.
      
      A brief history on the background of free page reporting can be found at:
      https://lore.kernel.org/lkml/29f43d5796feed0dec8e8bb98b187d9dac03b900.camel@linux.intel.com/
      
      This patch (of 9):
      
      Move the head/tail adding logic out of the shuffle code and into the
      __free_one_page function since ultimately that is where it is really
      needed anyway.  By doing this we should be able to reduce the overhead and
      can consolidate all of the list addition bits in one spot.
      Signed-off-by: NAlexander Duyck <alexander.h.duyck@linux.intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: NDan Williams <dan.j.williams@intel.com>
      Acked-by: NMel Gorman <mgorman@techsingularity.net>
      Acked-by: NDavid Hildenbrand <david@redhat.com>
      Cc: Yang Zhang <yang.zhang.wz@gmail.com>
      Cc: Pankaj Gupta <pagupta@redhat.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Nitesh Narayan Lal <nitesh@redhat.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Luiz Capitulino <lcapitulino@redhat.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Wei Wang <wei.w.wang@intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Michael S. Tsirkin <mst@redhat.com>
      Cc: wei qi <weiqi4@huawei.com>
      Link: http://lkml.kernel.org/r/20200211224602.29318.84523.stgit@localhost.localdomainSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a2129f24
  7. 03 4月, 2020 10 次提交
  8. 04 2月, 2020 4 次提交
    • A
      mm/memmap_init: update variable name in memmap_init_zone · 1f8d75c1
      Aneesh Kumar K.V 提交于
      Patch series "mm/memory_hotplug: Shrink zones before removing memory", v6.
      
      This series fixes the access of uninitialized memmaps when shrinking
      zones/nodes and when removing memory.  Also, it contains all fixes for
      crashes that can be triggered when removing certain namespace using
      memunmap_pages() - ZONE_DEVICE, reported by Aneesh.
      
      We stop trying to shrink ZONE_DEVICE, as it's buggy, fixing it would be
      more involved (we don't have SECTION_IS_ONLINE as an indicator), and
      shrinking is only of limited use (set_zone_contiguous() cannot detect the
      ZONE_DEVICE as contiguous).
      
      We continue shrinking !ZONE_DEVICE zones, however, I reduced the amount of
      code to a minimum.  Shrinking is especially necessary to keep
      zone->contiguous set where possible, especially, on memory unplug of DIMMs
      at zone boundaries.
      
      --------------------------------------------------------------------------
      
      Zones are now properly shrunk when offlining memory blocks or when
      onlining failed.  This allows to properly shrink zones on memory unplug
      even if the separate memory blocks of a DIMM were onlined to different
      zones or re-onlined to a different zone after offlining.
      
      Example:
      
      :/# cat /proc/zoneinfo
      Node 1, zone  Movable
              spanned  0
              present  0
              managed  0
      :/# echo "online_movable" > /sys/devices/system/memory/memory41/state
      :/# echo "online_movable" > /sys/devices/system/memory/memory43/state
      :/# cat /proc/zoneinfo
      Node 1, zone  Movable
              spanned  98304
              present  65536
              managed  65536
      :/# echo 0 > /sys/devices/system/memory/memory43/online
      :/# cat /proc/zoneinfo
      Node 1, zone  Movable
              spanned  32768
              present  32768
              managed  32768
      :/# echo 0 > /sys/devices/system/memory/memory41/online
      :/# cat /proc/zoneinfo
      Node 1, zone  Movable
              spanned  0
              present  0
              managed  0
      
      This patch (of 6):
      
      The third argument is actually number of pages.  Change the variable name
      from size to nr_pages to indicate this better.
      
      No functional change in this patch.
      
      Link: http://lkml.kernel.org/r/20191006085646.5768-3-david@redhat.comSigned-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Signed-off-by: NDavid Hildenbrand <david@redhat.com>
      Reviewed-by: NPankaj Gupta <pagupta@redhat.com>
      Reviewed-by: NDavid Hildenbrand <david@redhat.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Logan Gunthorpe <logang@deltatee.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1f8d75c1
    • D
      mm: factor out next_present_section_nr() · 4c605881
      David Hildenbrand 提交于
      Let's move it to the header and use the shorter variant from
      mm/page_alloc.c (the original one will also check
      "__highest_present_section_nr + 1", which is not necessary).  While at
      it, make the section_nr in next_pfn() const.
      
      In next_pfn(), we now return section_nr_to_pfn(-1) instead of -1 once we
      exceed __highest_present_section_nr, which doesn't make a difference in
      the caller as it is big enough (>= all sane end_pfn).
      
      Link: http://lkml.kernel.org/r/20200113144035.10848-3-david@redhat.comSigned-off-by: NDavid Hildenbrand <david@redhat.com>
      Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: "Jin, Zhi" <zhi.jin@intel.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4c605881
    • D
      mm/page_alloc: fix and rework pfn handling in memmap_init_zone() · 948c436e
      David Hildenbrand 提交于
      Let's update the pfn manually whenever we continue the loop.  This makes
      the code easier to read but also less error prone (and we can directly fix
      one issue).
      
      When overlap_memmap_init() returns true, pfn is updated to
      "memblock_region_memory_end_pfn(r)".  So it already points at the *next*
      pfn to process.  Incrementing the pfn another time is wrong, we might
      leave one uninitialized.  I spotted this by inspecting the code, so I have
      no idea if this is relevant in practise (with kernelcore=mirror).
      
      Link: http://lkml.kernel.org/r/20200113144035.10848-2-david@redhat.com
      Fixes: a9a9e77f ("mm: move mirrored memory specific code outside of memmap_init_zone")
      Signed-off-by: NDavid Hildenbrand <david@redhat.com>
      Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reviewed-by: NAlexander Duyck <alexander.h.duyck@linux.intel.com>
      Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: "Jin, Zhi" <zhi.jin@intel.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      948c436e
    • D
      mm/page_alloc.c: initialize memmap of unavailable memory directly · 4b094b78
      David Hildenbrand 提交于
      Let's make sure that all memory holes are actually marked PageReserved(),
      that page_to_pfn() produces reliable results, and that these pages are not
      detected as "mmap" pages due to the mapcount.
      
      E.g., booting a x86-64 QEMU guest with 4160 MB:
      
      [    0.010585] Early memory node ranges
      [    0.010586]   node   0: [mem 0x0000000000001000-0x000000000009efff]
      [    0.010588]   node   0: [mem 0x0000000000100000-0x00000000bffdefff]
      [    0.010589]   node   0: [mem 0x0000000100000000-0x0000000143ffffff]
      
      max_pfn is 0x144000.
      
      Before this change:
      
      [root@localhost ~]# ./page-types -r -a 0x144000,
                   flags      page-count       MB  symbolic-flags                     long-symbolic-flags
      0x0000000000000800           16384       64  ___________M_______________________________        mmap
                   total           16384       64
      
      After this change:
      
      [root@localhost ~]# ./page-types -r -a 0x144000,
                   flags      page-count       MB  symbolic-flags                     long-symbolic-flags
      0x0000000100000000           16384       64  ___________________________r_______________        reserved
                   total           16384       64
      
      IOW, especially the unavailable physical memory ("memory hole") in the
      last section would not get properly marked PageReserved() and is indicated
      to be "mmap" memory.
      
      Drop the trace of that function from include/linux/mm.h - nobody else
      needs it, and rename it accordingly.
      
      Note: The fake zone/node might not be covered by the zone/node span.  This
      is not an urgent issue (for now, we had the same node/zone due to the
      zeroing).  We'll need a clean way to mark memory holes (e.g., using a page
      type PageHole() if possible or a fake ZONE_INVALID) and eventually stop
      marking these memory holes PageReserved().
      
      Link: http://lkml.kernel.org/r/20191211163201.17179-4-david@redhat.comSigned-off-by: NDavid Hildenbrand <david@redhat.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Bob Picco <bob.picco@oracle.com>
      Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Steven Sistare <steven.sistare@oracle.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4b094b78