1. 08 Aug, 2020 3 commits
  2. 17 Jul, 2020 1 commit
  3. 04 Jul, 2020 1 commit
  4. 09 Jun, 2020 1 commit
  5. 05 Jun, 2020 2 commits
    • D
      mm: Allow to offline unmovable PageOffline() pages via MEM_GOING_OFFLINE · aa218795
      David Hildenbrand committed
      virtio-mem wants to allow offlining memory blocks of which some parts
      were unplugged (allocated via alloc_contig_range()), in particular to
      later offline and remove completely unplugged memory blocks. The important part
      is that PageOffline() has to remain set until the section is offline, so
      these pages will never get accessed (e.g., when dumping). The pages should
      not be handed back to the buddy (which would require clearing PageOffline()
      and result in issues if offlining fails and the pages are suddenly in the
      buddy).
      
      Allow this by permitting isolation of any PageOffline() page when
      offlining. This way, we can reach the memory hotplug notifier
      MEM_GOING_OFFLINE, where the driver can signal that it is fine with
      offlining a page by dropping its reference count. PageOffline() pages
      with a reference count of 0 can then be skipped when offlining the
      pages (as if they were free, although they are not in the buddy).
      
      Anybody who uses PageOffline() pages and does not agree to offline them
      (e.g., Hyper-V balloon, XEN balloon, VMWare balloon for 2MB pages) will not
      decrement the reference count and make offlining fail when trying to
      migrate such an unmovable page. So there should be no observable change.
      Same applies to balloon compaction users (movable PageOffline() pages), the
      pages will simply be migrated.
      
      Note 1: If offlining fails, a driver has to increment the reference
      	count again in MEM_CANCEL_OFFLINE.
      
      Note 2: A driver that makes use of this has to be aware that re-onlining
      	the memory block has to be handled by hooking into onlining code
      	(online_page_callback_t), setting PageOffline() on the pages again
      	and not giving them to the buddy.
      Reviewed-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Tested-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Acked-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Pavel Tatashin <pavel.tatashin@microsoft.com>
      Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Anthony Yznaga <anthony.yznaga@oracle.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Qian Cai <cai@lca.pw>
      Cc: Pingfan Liu <kernelfans@gmail.com>
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Link: https://lore.kernel.org/r/20200507140139.17083-7-david@redhat.com
      Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
      aa218795
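The notifier handshake described in this commit can be modelled in user space roughly as follows. This is a minimal sketch, not kernel code: `struct fake_page`, `driver_going_offline()`, `driver_cancel_offline()` and `can_skip_when_offlining()` are hypothetical names invented for illustration. Only the rule itself (a PageOffline() page with a reference count of 0 may be skipped, and the driver has to take its reference again on MEM_CANCEL_OFFLINE) comes from the commit message.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-space stand-in for the two pieces of page state the
 * commit message talks about. */
struct fake_page {
	bool offline;  /* models PageOffline() */
	int refcount;  /* models the page reference count */
};

/* Models a driver's MEM_GOING_OFFLINE handler: the driver agrees to
 * offlining by dropping the reference it holds on its PageOffline() pages. */
void driver_going_offline(struct fake_page *pages, size_t n)
{
	for (size_t i = 0; i < n; i++)
		if (pages[i].offline)
			pages[i].refcount--;
}

/* Models MEM_CANCEL_OFFLINE: offlining failed, so the driver takes its
 * reference again (Note 1 in the commit message). */
void driver_cancel_offline(struct fake_page *pages, size_t n)
{
	for (size_t i = 0; i < n; i++)
		if (pages[i].offline)
			pages[i].refcount++;
}

/* A PageOffline() page with a reference count of 0 can be skipped when
 * offlining; any other unmovable PageOffline() page makes offlining fail. */
bool can_skip_when_offlining(const struct fake_page *p)
{
	return p->offline && p->refcount == 0;
}
```

A driver that never drops its reference (the balloon drivers named above) keeps `can_skip_when_offlining()` false in this model, which matches the "no observable change" claim.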
    • D
      virtio-mem: Paravirtualized memory hotunplug part 2 · 255f5985
      David Hildenbrand committed
      We also want to unplug online memory (contained in online memory blocks
      and, therefore, managed by the buddy), and eventually replug it later.
      
      When requested to unplug memory, we use alloc_contig_range() to allocate
      subblocks in online memory blocks (so we are the owner) and send them to
      our hypervisor. When requested to plug memory, we can replug such memory
      using free_contig_range() after asking our hypervisor.
      
      We also want to mark all allocated pages PG_offline, so nobody will
      touch them. To differentiate pages that were never onlined when
      onlining the memory block from pages allocated via alloc_contig_range(), we
      use PageDirty(). Based on this flag, virtio_mem_fake_online() can either
      online the pages for the first time or use free_contig_range().
      
      It is worth noting that there are no guarantees on how much memory can
      actually get unplugged again. All device memory might completely be
      fragmented with unmovable data, such that no subblock can get unplugged.
      
      We are not touching the ZONE_MOVABLE. If memory is onlined to the
      ZONE_MOVABLE, it can only get unplugged after that memory was offlined
      manually by user space. In normal operation, virtio-mem memory is
      suggested to be onlined to ZONE_NORMAL. In the future, we will try to
      make unplug more likely to succeed.
      
      Add a module parameter to control if online memory shall be touched.
      
      As we want to access alloc_contig_range()/free_contig_range() from
      kernel module context, export the symbols.
      
      Note: Whenever virtio-mem uses alloc_contig_range(), all affected pages
      are on the same node, in the same zone, and contain no holes.
      
      Acked-by: Michal Hocko <mhocko@suse.com> # to export contig range allocator API
      Tested-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: Jason Wang <jasowang@redhat.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Igor Mammedov <imammedo@redhat.com>
      Cc: Dave Young <dyoung@redhat.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Stefan Hajnoczi <stefanha@redhat.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Cc: Alexander Potapenko <glider@google.com>
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Link: https://lore.kernel.org/r/20200507140139.17083-6-david@redhat.com
      Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
      255f5985
  6. 04 Jun, 2020 32 commits
    • M
      mm/vmscan.c: change prototype for shrink_page_list · 730ec8c0
      Maninder Singh committed
      Commit 3c710c1a ("mm, vmscan extract shrink_page_list reclaim counters
      into a struct") changed the data type for the function, so change the
      return type for the function and its caller.
      Signed-off-by: Vaneet Narang <v.narang@samsung.com>
      Signed-off-by: Maninder Singh <maninder1.s@samsung.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Amit Sahrawat <a.sahrawat@samsung.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Link: http://lkml.kernel.org/r/1588168259-25604-1-git-send-email-maninder1.s@samsung.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      730ec8c0
    • D
      mm: make deferred init's max threads arch-specific · ecd09650
      Daniel Jordan committed
      Using padata during deferred init has only been tested on x86, so for now
      limit it to this architecture.
      
      If another arch wants this, it can find the max thread limit that's best
      for it and override deferred_page_init_max_threads().
      Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Tested-by: Josh Triplett <josh@joshtriplett.org>
      Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Cc: Alex Williamson <alex.williamson@redhat.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Herbert Xu <herbert@gondor.apana.org.au>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Pavel Machek <pavel@ucw.cz>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Robert Elliott <elliott@hpe.com>
      Cc: Shile Zhang <shile.zhang@linux.alibaba.com>
      Cc: Steffen Klassert <steffen.klassert@secunet.com>
      Cc: Steven Sistare <steven.sistare@oracle.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Zi Yan <ziy@nvidia.com>
      Link: http://lkml.kernel.org/r/20200527173608.2885243-8-daniel.m.jordan@oracle.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ecd09650
    • D
      mm: parallelize deferred_init_memmap() · e4443149
      Daniel Jordan committed
      Deferred struct page init is a significant bottleneck in kernel boot.
      Optimizing it maximizes availability for large-memory systems and allows
      spinning up short-lived VMs as needed without having to leave them
      running.  It also benefits bare metal machines hosting VMs that are
      sensitive to downtime.  In projects such as VMM Fast Restart[1], where
      guest state is preserved across kexec reboot, it helps prevent application
      and network timeouts in the guests.
      
      Multithread to take full advantage of system memory bandwidth.
      
      The maximum number of threads is capped at the number of CPUs on the node
      because speedups always improve with additional threads on every system
      tested, and at this phase of boot, the system is otherwise idle and
      waiting on page init to finish.
      
      Helper threads operate on section-aligned ranges to both avoid false
      sharing when setting the pageblock's migrate type and to avoid accessing
      uninitialized buddy pages, though max order alignment is enough for the
      latter.
      
      The minimum chunk size is also a section.  There was benefit to using
      multiple threads even on relatively small memory (1G) systems, and this is
      the smallest size that the alignment allows.
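The section-aligned chunking described above can be illustrated with a small user-space sketch. It does not reproduce how the kernel actually divides the work; `split_section_aligned()`, `struct pfn_range` and the helper names are hypothetical, and the `PAGES_PER_SECTION` value is an assumption (128 MiB sections with 4 KiB pages). The sketch only shows the two properties the text names: interior chunk boundaries fall on section boundaries, and no more chunks are produced than whole sections allow.

```c
#include <stddef.h>

/* Assumed value for illustration: 128 MiB sections with 4 KiB pages. */
#define PAGES_PER_SECTION 32768UL

struct pfn_range {
	unsigned long start;  /* inclusive */
	unsigned long end;    /* exclusive */
};

unsigned long section_align_down(unsigned long pfn)
{
	return pfn / PAGES_PER_SECTION * PAGES_PER_SECTION;
}

unsigned long section_align_up(unsigned long pfn)
{
	return section_align_down(pfn + PAGES_PER_SECTION - 1);
}

/* Split [start, end) into at most max_chunks ranges whose interior
 * boundaries are section-aligned. The per-chunk size is rounded up to a
 * whole number of sections, so a chunk never covers less than the part of
 * a section that lies inside the range. Returns the number of ranges
 * written to out, which must have room for max_chunks entries. */
size_t split_section_aligned(unsigned long start, unsigned long end,
			     size_t max_chunks, struct pfn_range *out)
{
	size_t n = 0;
	unsigned long per, cur = start;

	if (start >= end || max_chunks == 0)
		return 0;

	per = section_align_up((end - start + max_chunks - 1) / max_chunks);

	while (cur < end) {
		unsigned long next = section_align_down(cur) + per;

		/* The last permitted chunk absorbs whatever remains. */
		if (next > end || n == max_chunks - 1)
			next = end;
		out[n].start = cur;
		out[n].end = next;
		n++;
		cur = next;
	}
	return n;
}
```

With three sections of memory and eight available threads, this sketch yields three chunks, which reflects the point that a section is the smallest unit of work.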
      
      The time (in milliseconds) is that of the slowest node to initialize, since
      boot blocks until all nodes finish.  intel_pstate is loaded in active mode
      without hwp and with turbo enabled, and intel_idle is active as well.
      
          Intel(R) Xeon(R) Platinum 8167M CPU @ 2.00GHz (Skylake, bare metal)
            2 nodes * 26 cores * 2 threads = 104 CPUs
            384G/node = 768G memory
      
                         kernel boot                 deferred init
                         ------------------------    ------------------------
          node% (thr)    speedup  time_ms (stdev)    speedup  time_ms (stdev)
                (  0)         --   4089.7 (  8.1)         --   1785.7 (  7.6)
             2% (  1)       1.7%   4019.3 (  1.5)       3.8%   1717.7 ( 11.8)
            12% (  6)      34.9%   2662.7 (  2.9)      79.9%    359.3 (  0.6)
            25% ( 13)      39.9%   2459.0 (  3.6)      91.2%    157.0 (  0.0)
            37% ( 19)      39.2%   2485.0 ( 29.7)      90.4%    172.0 ( 28.6)
            50% ( 26)      39.3%   2482.7 ( 25.7)      90.3%    173.7 ( 30.0)
            75% ( 39)      39.0%   2495.7 (  5.5)      89.4%    190.0 (  1.0)
           100% ( 52)      40.2%   2443.7 (  3.8)      92.3%    138.0 (  1.0)
      
          Intel(R) Xeon(R) CPU E5-2699C v4 @ 2.20GHz (Broadwell, kvm guest)
            1 node * 16 cores * 2 threads = 32 CPUs
            192G/node = 192G memory
      
                         kernel boot                 deferred init
                         ------------------------    ------------------------
          node% (thr)    speedup  time_ms (stdev)    speedup  time_ms (stdev)
                (  0)         --   1988.7 (  9.6)         --   1096.0 ( 11.5)
             3% (  1)       1.1%   1967.0 ( 17.6)       0.3%   1092.7 ( 11.0)
            12% (  4)      41.1%   1170.3 ( 14.2)      73.8%    287.0 (  3.6)
            25% (  8)      47.1%   1052.7 ( 21.9)      83.9%    177.0 ( 13.5)
            38% ( 12)      48.9%   1016.3 ( 12.1)      86.8%    144.7 (  1.5)
            50% ( 16)      48.9%   1015.7 (  8.1)      87.8%    134.0 (  4.4)
            75% ( 24)      49.1%   1012.3 (  3.1)      88.1%    130.3 (  2.3)
           100% ( 32)      49.5%   1004.0 (  5.3)      88.5%    125.7 (  2.1)
      
          Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz (Haswell, bare metal)
            2 nodes * 18 cores * 2 threads = 72 CPUs
            128G/node = 256G memory
      
                         kernel boot                 deferred init
                         ------------------------    ------------------------
          node% (thr)    speedup  time_ms (stdev)    speedup  time_ms (stdev)
                (  0)         --   1680.0 (  4.6)         --    627.0 (  4.0)
             3% (  1)       0.3%   1675.7 (  4.5)      -0.2%    628.0 (  3.6)
            11% (  4)      25.6%   1250.7 (  2.1)      67.9%    201.0 (  0.0)
            25% (  9)      30.7%   1164.0 ( 17.3)      81.8%    114.3 ( 17.7)
            36% ( 13)      31.4%   1152.7 ( 10.8)      84.0%    100.3 ( 17.9)
            50% ( 18)      31.5%   1150.7 (  9.3)      83.9%    101.0 ( 14.1)
            75% ( 27)      31.7%   1148.0 (  5.6)      84.5%     97.3 (  6.4)
           100% ( 36)      32.0%   1142.3 (  4.0)      85.6%     90.0 (  1.0)
      
          AMD EPYC 7551 32-Core Processor (Zen, kvm guest)
            1 node * 8 cores * 2 threads = 16 CPUs
            64G/node = 64G memory
      
                         kernel boot                 deferred init
                         ------------------------    ------------------------
          node% (thr)    speedup  time_ms (stdev)    speedup  time_ms (stdev)
                (  0)         --   1029.3 ( 25.1)         --    240.7 (  1.5)
             6% (  1)      -0.6%   1036.0 (  7.8)      -2.2%    246.0 (  0.0)
            12% (  2)      11.8%    907.7 (  8.6)      44.7%    133.0 (  1.0)
            25% (  4)      13.9%    886.0 ( 10.6)      62.6%     90.0 (  6.0)
            38% (  6)      17.8%    845.7 ( 14.2)      69.1%     74.3 (  3.8)
            50% (  8)      16.8%    856.0 ( 22.1)      72.9%     65.3 (  5.7)
            75% ( 12)      15.4%    871.0 ( 29.2)      79.8%     48.7 (  7.4)
           100% ( 16)      21.0%    813.7 ( 21.0)      80.5%     47.0 (  5.2)
      
      Server-oriented distros that enable deferred page init sometimes run in
      small VMs, and they still benefit even though the fraction of boot time
      saved is smaller:
      
          AMD EPYC 7551 32-Core Processor (Zen, kvm guest)
            1 node * 2 cores * 2 threads = 4 CPUs
            16G/node = 16G memory
      
                         kernel boot                 deferred init
                         ------------------------    ------------------------
          node% (thr)    speedup  time_ms (stdev)    speedup  time_ms (stdev)
                (  0)         --    716.0 ( 14.0)         --     49.7 (  0.6)
            25% (  1)       1.8%    703.0 (  5.3)      -4.0%     51.7 (  0.6)
            50% (  2)       1.6%    704.7 (  1.2)      43.0%     28.3 (  0.6)
            75% (  3)       2.7%    696.7 ( 13.1)      49.7%     25.0 (  0.0)
           100% (  4)       4.1%    687.0 ( 10.4)      55.7%     22.0 (  0.0)
      
          Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz (Haswell, kvm guest)
            1 node * 2 cores * 2 threads = 4 CPUs
            14G/node = 14G memory
      
                         kernel boot                 deferred init
                         ------------------------    ------------------------
          node% (thr)    speedup  time_ms (stdev)    speedup  time_ms (stdev)
                (  0)         --    787.7 (  6.4)         --    122.3 (  0.6)
            25% (  1)       0.2%    786.3 ( 10.8)      -2.5%    125.3 (  2.1)
            50% (  2)       5.9%    741.0 ( 13.9)      37.6%     76.3 ( 19.7)
            75% (  3)       8.3%    722.0 ( 19.0)      49.9%     61.3 (  3.2)
           100% (  4)       9.3%    714.7 (  9.5)      56.4%     53.3 (  1.5)
      
      On Josh's 96-CPU and 192G memory system:
      
          Without this patch series:
          [    0.487132] node 0 initialised, 23398907 pages in 292ms
          [    0.499132] node 1 initialised, 24189223 pages in 304ms
          ...
          [    0.629376] Run /sbin/init as init process
      
          With this patch series:
          [    0.231435] node 1 initialised, 24189223 pages in 32ms
          [    0.236718] node 0 initialised, 23398907 pages in 36ms
      
      [1] https://static.sched.com/hosted_files/kvmforum2019/66/VMM-fast-restart_kvmforum2019.pdf
      Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Tested-by: Josh Triplett <josh@joshtriplett.org>
      Reviewed-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Cc: Alex Williamson <alex.williamson@redhat.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Herbert Xu <herbert@gondor.apana.org.au>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Pavel Machek <pavel@ucw.cz>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Robert Elliott <elliott@hpe.com>
      Cc: Shile Zhang <shile.zhang@linux.alibaba.com>
      Cc: Steffen Klassert <steffen.klassert@secunet.com>
      Cc: Steven Sistare <steven.sistare@oracle.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Zi Yan <ziy@nvidia.com>
      Link: http://lkml.kernel.org/r/20200527173608.2885243-7-daniel.m.jordan@oracle.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e4443149
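The "speedup" columns in the tables above are consistent with the fraction of baseline time saved. That reading is inferred from the numbers, not stated in the commit message. For example, the Skylake deferred init row at 100% gives (1785.7 - 138.0) / 1785.7, or about 92.3%. A small sketch of that arithmetic, with `speedup_pct()` as a hypothetical helper name:

```c
/* Percentage of time saved relative to the baseline run (the row with
 * 0 helper threads). Negative values mean the run was slower. */
double speedup_pct(double baseline_ms, double time_ms)
{
	return (baseline_ms - time_ms) / baseline_ms * 100.0;
}
```

Under this reading, a 92.3% "speedup" means the phase takes roughly one thirteenth of the baseline time, not that it runs 1.9 times faster.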
    • D
      mm: don't track number of pages during deferred initialization · 89c7c402
      Daniel Jordan committed
      Deferred page init used to report the number of pages initialized:
      
        node 0 initialised, 32439114 pages in 97ms
      
      Tracking this makes the code more complicated when using multiple threads.
      Given that the statistic probably has limited value, especially since a
      zone grows on demand so that the page count can vary, just remove it.
      
      The boot message now looks like
      
        node 0 deferred pages initialised in 97ms
      Suggested-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Cc: Alex Williamson <alex.williamson@redhat.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Herbert Xu <herbert@gondor.apana.org.au>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Josh Triplett <josh@joshtriplett.org>
      Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Pavel Machek <pavel@ucw.cz>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Robert Elliott <elliott@hpe.com>
      Cc: Shile Zhang <shile.zhang@linux.alibaba.com>
      Cc: Steffen Klassert <steffen.klassert@secunet.com>
      Cc: Steven Sistare <steven.sistare@oracle.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Zi Yan <ziy@nvidia.com>
      Link: http://lkml.kernel.org/r/20200527173608.2885243-6-daniel.m.jordan@oracle.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      89c7c402
    • P
      mm: call cond_resched() from deferred_init_memmap() · da97f2d5
      Pavel Tatashin committed
      Now that deferred pages are initialized with interrupts enabled we can
      replace touch_nmi_watchdog() with cond_resched(), as it was before
      3a2d7fa8.
      
      For now, we cannot do the same in deferred_grow_zone() as it still
      initializes pages with interrupts disabled.
      
      This change fixes the RCU problem described in
      https://lkml.kernel.org/r/20200401104156.11564-2-david@redhat.com
      
      [   60.474005] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
      [   60.475000] rcu:  1-...0: (0 ticks this GP) idle=02a/1/0x4000000000000000 softirq=1/1 fqs=15000
      [   60.475000] rcu:  (detected by 0, t=60002 jiffies, g=-1199, q=1)
      [   60.475000] Sending NMI from CPU 0 to CPUs 1:
      [    1.760091] NMI backtrace for cpu 1
      [    1.760091] CPU: 1 PID: 20 Comm: pgdatinit0 Not tainted 4.18.0-147.9.1.el8_1.x86_64 #1
      [    1.760091] Hardware name: Red Hat KVM, BIOS 1.13.0-1.module+el8.2.0+5520+4e5817f3 04/01/2014
      [    1.760091] RIP: 0010:__init_single_page.isra.65+0x10/0x4f
      [    1.760091] Code: 48 83 cf 63 48 89 f8 0f 1f 40 00 48 89 c6 48 89 d7 e8 6b 18 80 ff 66 90 5b c3 31 c0 b9 10 00 00 00 49 89 f8 48 c1 e6 33 f3 ab <b8> 07 00 00 00 48 c1 e2 36 41 c7 40 34 01 00 00 00 48 c1 e0 33 41
      [    1.760091] RSP: 0000:ffffba783123be40 EFLAGS: 00000006
      [    1.760091] RAX: 0000000000000000 RBX: fffffad34405e300 RCX: 0000000000000000
      [    1.760091] RDX: 0000000000000000 RSI: 0010000000000000 RDI: fffffad34405e340
      [    1.760091] RBP: 0000000033f3177e R08: fffffad34405e300 R09: 0000000000000002
      [    1.760091] R10: 000000000000002b R11: ffff98afb691a500 R12: 0000000000000002
      [    1.760091] R13: 0000000000000000 R14: 000000003f03ea00 R15: 000000003e10178c
      [    1.760091] FS:  0000000000000000(0000) GS:ffff9c9ebeb00000(0000) knlGS:0000000000000000
      [    1.760091] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [    1.760091] CR2: 00000000ffffffff CR3: 000000a1cf20a001 CR4: 00000000003606e0
      [    1.760091] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [    1.760091] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [    1.760091] Call Trace:
      [    1.760091]  deferred_init_pages+0x8f/0xbf
      [    1.760091]  deferred_init_memmap+0x184/0x29d
      [    1.760091]  ? deferred_free_pages.isra.97+0xba/0xba
      [    1.760091]  kthread+0x112/0x130
      [    1.760091]  ? kthread_flush_work_fn+0x10/0x10
      [    1.760091]  ret_from_fork+0x35/0x40
      [   89.123011] node 0 initialised, 1055935372 pages in 88650ms
      
      Fixes: 3a2d7fa8 ("mm: disable interrupts while initializing deferred pages")
      Reported-by: Yiqian Wei <yiwei@redhat.com>
      Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Tested-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: James Morris <jmorris@namei.org>
      Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
      Cc: Sasha Levin <sashal@kernel.org>
      Cc: Shile Zhang <shile.zhang@linux.alibaba.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: <stable@vger.kernel.org>	[4.17+]
      Link: http://lkml.kernel.org/r/20200403140952.17177-4-pasha.tatashin@soleen.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      da97f2d5
    • P
      mm: initialize deferred pages with interrupts enabled · 3d060856
      Pavel Tatashin committed
      Initializing struct pages is a long task and keeping interrupts disabled
      for the duration of this operation introduces a number of problems.
      
      1. jiffies are not updated for a long period of time, and thus incorrect
         time is reported. See proposed solution and discussion here:
         lkml/20200311123848.118638-1-shile.zhang@linux.alibaba.com
      2. It prevents further improving deferred page initialization by allowing
         intra-node multi-threading.
      
      We are keeping interrupts disabled to solve a rather theoretical problem
      that was never observed in the real world (see 3a2d7fa8).
      
      Let's keep interrupts enabled. In case we ever encounter a scenario where
      an interrupt thread wants to allocate a large amount of memory this early
      in boot, we can deal with that by growing the zone (see
      deferred_grow_zone()) by the needed amount before starting
      deferred_init_memmap() threads.
      
      Before:
      [    1.232459] node 0 initialised, 12058412 pages in 1ms
      
      After:
      [    1.632580] node 0 initialised, 12051227 pages in 436ms
      
      Fixes: 3a2d7fa8 ("mm: disable interrupts while initializing deferred pages")
      Reported-by: Shile Zhang <shile.zhang@linux.alibaba.com>
      Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: James Morris <jmorris@namei.org>
      Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
      Cc: Sasha Levin <sashal@kernel.org>
      Cc: Yiqian Wei <yiwei@redhat.com>
      Cc: <stable@vger.kernel.org>	[4.17+]
      Link: http://lkml.kernel.org/r/20200403140952.17177-3-pasha.tatashin@soleen.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3d060856
    • D
      mm/pagealloc.c: call touch_nmi_watchdog() on max order boundaries in deferred init · 117003c3
      Daniel Jordan committed
      Patch series "initialize deferred pages with interrupts enabled", v4.
      
      Keep interrupts enabled during deferred page initialization in order to
      make code more modular and allow jiffies to update.
      
      Original approach, and discussion can be found here:
       http://lkml.kernel.org/r/20200311123848.118638-1-shile.zhang@linux.alibaba.com
      
      This patch (of 3):
      
      deferred_init_memmap() disables interrupts the entire time, so it calls
      touch_nmi_watchdog() periodically to avoid soft lockup splats.  Soon it
      will run with interrupts enabled, at which point cond_resched() should be
      used instead.
      
      deferred_grow_zone() makes the same watchdog calls through code shared
      with deferred init but will continue to run with interrupts disabled, so
      it can't call cond_resched().
      
      Pull the watchdog calls up to these two places to allow the first to be
      changed later, independently of the second.  The frequency reduces from
      twice per pageblock (init and free) to once per max order block.
      
      Fixes: 3a2d7fa8 ("mm: disable interrupts while initializing deferred pages")
      Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
      Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Shile Zhang <shile.zhang@linux.alibaba.com>
      Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
      Cc: James Morris <jmorris@namei.org>
      Cc: Sasha Levin <sashal@kernel.org>
      Cc: Yiqian Wei <yiwei@redhat.com>
      Cc: <stable@vger.kernel.org>	[4.17+]
      Link: http://lkml.kernel.org/r/20200403140952.17177-2-pasha.tatashin@soleen.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      117003c3
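The idea of making one watchdog (or, later, scheduling) call per max order block can be sketched in user space as below. This is an illustration under stated assumptions, not the kernel implementation: `init_range()`, `yield_fn` and `count_yield()` are hypothetical names, and `MAX_ORDER_NR_PAGES` is assumed to be 1024 pages. The callback stands in for touch_nmi_watchdog() or cond_resched().

```c
/* Assumed value for illustration only. */
#define MAX_ORDER_NR_PAGES 1024UL

/* Stand-in for touch_nmi_watchdog() or cond_resched(). */
typedef void (*yield_fn)(void *ctx);

/* Walk [start_pfn, end_pfn) one max order block at a time and call
 * yield exactly once after each block. Returns the number of pages
 * covered; the per-page init and free work is elided. */
unsigned long init_range(unsigned long start_pfn, unsigned long end_pfn,
			 yield_fn yield, void *ctx)
{
	unsigned long pfn = start_pfn, nr_pages = 0;

	while (pfn < end_pfn) {
		/* End of the max order block that contains pfn. */
		unsigned long block_end =
			(pfn / MAX_ORDER_NR_PAGES + 1) * MAX_ORDER_NR_PAGES;

		if (block_end > end_pfn)
			block_end = end_pfn;
		nr_pages += block_end - pfn;
		pfn = block_end;
		yield(ctx);
	}
	return nr_pages;
}

/* Example callback that counts how often it was invoked. */
void count_yield(void *ctx)
{
	(*(unsigned long *)ctx)++;
}
```

Passing the callback in as a parameter mirrors the motivation in the message: the two callers can later use different calls without touching the shared loop.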
    • A
      mm/page_alloc: restrict and formalize compound_page_dtors[] · ae70eddd
      Anshuman Khandual committed
      Restrict elements in compound_page_dtors[] array per NR_COMPOUND_DTORS and
      explicitly position them according to enum compound_dtor_id.  This
      improves protection against possible misalignment between
      compound_page_dtors[] and enum compound_dtor_id later on.
      Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Link: http://lkml.kernel.org/r/1589795958-19317-1-git-send-email-anshuman.khandual@arm.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ae70eddd
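The technique named in this commit, sizing an array by the enum's terminator and placing each element with a designated initializer, looks roughly like the following in plain C. All identifiers here (`enum dtor_id`, `dtor_fn`, `plain_dtor`, `huge_dtor`, `dtors`) are hypothetical stand-ins, not the kernel's definitions.

```c
#include <stddef.h>

/* Hypothetical ids; NR_DTORS is the terminator that sizes the array. */
enum dtor_id {
	NULL_DTOR,
	PLAIN_DTOR,
	HUGE_DTOR,
	NR_DTORS,
};

typedef int dtor_fn(int);

int plain_dtor(int x) { return x + 1; }
int huge_dtor(int x)  { return x + 2; }

/* The explicit size makes an initializer for an index outside the enum a
 * compile-time error, and the designators tie each slot to its id
 * regardless of the order in which the entries are written. */
dtor_fn * const dtors[NR_DTORS] = {
	[NULL_DTOR]  = NULL,
	[PLAIN_DTOR] = plain_dtor,
	[HUGE_DTOR]  = huge_dtor,
};
```

Reordering the enum in this sketch cannot silently shift a function into the wrong slot, which is the misalignment the commit message wants to guard against.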
    • C
      mm, page_alloc: reset the zone->watermark_boost early · aa092591
      Charan Teja Reddy committed
      Updating the zone watermarks by any means, like min_free_kbytes,
      watermark_scale_factor etc., when ->watermark_boost is set will result in
      higher low and high watermarks than the user asked for.
      
      Below are the steps to reproduce the problem on system setup of Android
      kernel running on Snapdragon hardware.
      
      1) Default settings of the system are as below:
      
         #cat /proc/sys/vm/min_free_kbytes = 5162
         #cat /proc/zoneinfo | grep -e boost -e low -e "high " -e min -e Node
      	Node 0, zone   Normal
      		min      797
      		low      8340
      		high     8539
      
      2) Monitor the zone->watermark_boost(by adding a debug print in the
         kernel) and whenever it is greater than zero value, write the same
         value of min_free_kbytes obtained from step 1.
      
         #echo 5162 > /proc/sys/vm/min_free_kbytes
      
      3) Then read the zone watermarks in the system while the
         ->watermark_boost is zero.  This should show the same watermark
         values as in step 1, but higher values than requested are shown.
      
         #cat /proc/zoneinfo | grep -e boost -e low -e "high " -e min -e Node
      	Node 0, zone   Normal
      		min      797
      		low      21148
      		high     21347
      
      These higher values are because of updating the zone watermarks using the
      macro min_wmark_pages(zone) which also adds the zone->watermark_boost.
      
      	#define min_wmark_pages(z) (z->_watermark[WMARK_MIN] + z->watermark_boost)
      
      So the steps that lead to the issue are:
      
      1) On the extfrag event, watermarks are boosted by storing the required
         value in ->watermark_boost.
      
      2) User tries to update the zone watermarks level in the system through
         min_free_kbytes or watermark_scale_factor.
      
      3) Later, when kswapd wakes up, it resets the zone->watermark_boost to
         zero.
      
      In step 2), we use the min_wmark_pages() macro to store the watermarks
      in the zone structure, thus the values are always offset by the
      ->watermark_boost value. This can be avoided by resetting
      ->watermark_boost to zero before it is used.
      Signed-off-by: Charan Teja Reddy <charante@codeaurora.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Baoquan He <bhe@redhat.com>
      Cc: Vinayak Menon <vinmenon@codeaurora.org>
      Link: http://lkml.kernel.org/r/1589457511-4255-1-git-send-email-charante@codeaurora.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      aa092591
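The effect described in this commit can be reproduced with a small user-space model. It is a deliberately simplified sketch: `struct zone_model` and `setup_wmarks()` are hypothetical, and the real watermark computation is more involved. The gaps 7543 and 7742 and the boost value 12808 are back-computed from the figures quoted above (8340 - 797, 8539 - 797 and 21148 - 8340); they do not appear in the commit message itself.

```c
enum { WMARK_MIN, WMARK_LOW, WMARK_HIGH, NR_WMARK };

/* Hypothetical stand-in for the watermark fields of a zone. */
struct zone_model {
	unsigned long _watermark[NR_WMARK];
	unsigned long watermark_boost;
};

/* Same shape as the macro quoted in the commit message. */
#define min_wmark_pages(z) ((z)->_watermark[WMARK_MIN] + (z)->watermark_boost)

/* Simplified watermark update: low and high are derived from
 * min_wmark_pages() plus fixed gaps. With reset_boost_first set, the
 * boost is cleared before the macro is used, which models the fix. */
void setup_wmarks(struct zone_model *z, unsigned long min,
		  unsigned long low_gap, unsigned long high_gap,
		  int reset_boost_first)
{
	z->_watermark[WMARK_MIN] = min;
	if (reset_boost_first)
		z->watermark_boost = 0;
	z->_watermark[WMARK_LOW] = min_wmark_pages(z) + low_gap;
	z->_watermark[WMARK_HIGH] = min_wmark_pages(z) + high_gap;
}
```

With a pending boost of 12808 and no reset, the model stores low = 21148 and high = 21347, the inflated values from step 3; with the reset it stores 8340 and 8539, the values from step 1.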
    • S
      mm/page_alloc.c: reset numa stats for boot pagesets · b418a0f9
      Sandipan Das committed
      Initially, the per-cpu pagesets of each zone are set to the boot pagesets.
      The real pagesets are allocated later but before that happens, page
      allocations do occur and the numa stats for the boot pagesets get
      incremented since they are common to all zones at that point.
      
      The real pagesets, however, are allocated for the populated zones only.
      Unpopulated zones, like those associated with memory-less nodes, continue
      using the boot pageset and end up skewing the numa stats of the
      corresponding node.
      
      E.g.
      
        $ numactl -H
        available: 2 nodes (0-1)
        node 0 cpus: 0 1 2 3
        node 0 size: 0 MB
        node 0 free: 0 MB
        node 1 cpus: 4 5 6 7
        node 1 size: 8131 MB
        node 1 free: 6980 MB
        node distances:
        node   0   1
          0:  10  40
          1:  40  10
      
        $ numastat
                                   node0           node1
        numa_hit                     108           56495
        numa_miss                      0               0
        numa_foreign                   0               0
        interleave_hit                 0            4537
        local_node                   108           31547
        other_node                     0           24948
      
      Hence, the boot pageset stats need to be cleared after the real pagesets
      are allocated.
      
      After this point, the stats of the boot pagesets do not change as page
      allocations requested for a memory-less node will either fail (if
      __GFP_THISNODE is used) or get fulfilled by a preferred zone of a
      different node based on the fallback zonelist.
      
      [sandipan@linux.ibm.com: v3]
        Link: http://lkml.kernel.org/r/20200511170356.162531-1-sandipan@linux.ibm.com
      Signed-off-by: Sandipan Das <sandipan@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
      Link: http://lkml.kernel.org/r/9c9c2d1b15e37f6e6bf32f99e3100035e90c4ac9.1588868430.git.sandipan@linux.ibm.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b418a0f9
    • W
      mm: rename gfpflags_to_migratetype to gfp_migratetype for same convention · 01c0bfe0
      Wei Yang committed
      Pageblock migrate type is encoded in GFP flags, just as zone_type and
      zonelist.
      
      Currently we use gfp_zone() and gfp_zonelist() to extract the related
      information, so it would be proper to use the same naming convention for
      the migrate type.
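      
      As an illustration of what "encoded in GFP flags" means, here is a
      minimal userspace sketch of extracting a migrate type with a mask and
      a shift.  The bit values, the shift and all *_SK names are assumptions
      made for this example and are not taken from this log.

```c
/* Userspace sketch. The bit positions below are assumed for the sake of
 * the example; the kernel defines its own values in its gfp headers. */
#define GFP_MOVABLE_BIT_SK	0x08u
#define GFP_RECLAIMABLE_BIT_SK	0x10u
#define GFP_MOVABLE_MASK_SK	(GFP_MOVABLE_BIT_SK | GFP_RECLAIMABLE_BIT_SK)
#define GFP_MOVABLE_SHIFT_SK	3

enum migratetype_sk {
	MT_UNMOVABLE_SK = 0,
	MT_MOVABLE_SK = 1,
	MT_RECLAIMABLE_SK = 2,
};

/* Mask out the mobility bits and shift them down so they index the
 * migrate types directly; all other flag bits are ignored. */
static int gfp_migratetype_sketch(unsigned int gfp_flags)
{
	return (int)((gfp_flags & GFP_MOVABLE_MASK_SK) >> GFP_MOVABLE_SHIFT_SK);
}
```

      With these assumed values, a flag word carrying only the movable bit
      maps to MT_MOVABLE_SK regardless of unrelated bits set alongside it.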
      
      Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Link: http://lkml.kernel.org/r/20200329080823.7735-1-richard.weiyang@gmail.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      01c0bfe0
    • W
      mm/page_alloc.c: use NODE_MASK_NONE in build_zonelists() · d0ddf49b
      Wei Yang committed
      Slightly simplify the code by initializing user_mask with NODE_MASK_NONE,
      instead of later calling nodes_clear().  This saves a line of code.
      
      Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: John Hubbard <jhubbard@nvidia.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Link: http://lkml.kernel.org/r/20200330220840.21228-1-richard.weiyang@gmail.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d0ddf49b
    • J
      mm/page_alloc: integrate classzone_idx and high_zoneidx · 97a225e6
      Joonsoo Kim committed
      classzone_idx is now just a different name for high_zoneidx.  So,
      integrate them and add a comment to struct alloc_context in order to
      reduce future confusion about the meaning of this variable.
      
      The accessor, ac_classzone_idx(), is also removed since it isn't needed
      after integration.
      
      In addition to integration, this patch also renames high_zoneidx to
      highest_zoneidx since that name conveys the meaning more precisely.
      
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Baoquan He <bhe@redhat.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Ye Xiaolong <xiaolong.ye@intel.com>
      Link: http://lkml.kernel.org/r/1587095923-7515-3-git-send-email-iamjoonsoo.kim@lge.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      97a225e6
    • B
      mm/page_alloc.c: clear out zone->lowmem_reserve[] if the zone is empty · f6366156
      Baoquan He committed
      When a memory allocation request from a specific zone cannot be
      satisfied, the allocator falls back to a lower zone.  In this case, the
      lower zone's ->lowmem_reserve[] helps protect its own memory.  The
      higher the relevant ->lowmem_reserve[] is, the harder it is for an
      upper-zone allocation to get memory from this lower zone.
      
      However, this protection mechanism should apply only to populated zones,
      not to empty ones.  So filling ->lowmem_reserve[] for an empty zone is
      not necessary, and may mislead people into thinking it is valid data for
      that zone.
      
      Node 2, zone      DMA
        pages free     0
              min      0
              low      0
              high     0
              spanned  0
              present  0
              managed  0
              protection: (0, 0, 1024, 1024)
      Node 2, zone    DMA32
        pages free     0
              min      0
              low      0
              high     0
              spanned  0
              present  0
              managed  0
              protection: (0, 0, 1024, 1024)
      Node 2, zone   Normal
        per-node stats
            nr_inactive_anon 0
            nr_active_anon 143
            nr_inactive_file 0
            nr_active_file 0
            nr_unevictable 0
            nr_slab_reclaimable 45
            nr_slab_unreclaimable 254
      
      So clear out zone->lowmem_reserve[] if the zone is empty.
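      
      A minimal userspace sketch of the idea, assuming a simplified
      three-zone node: the reserve a lower zone holds against a higher zone
      index is the managed pages above it divided by its ratio, and it is
      forced to zero when the lower zone itself has no managed pages.
      struct zone_sk and setup_lowmem_reserve_sketch() are hypothetical
      names, not the kernel's.

```c
#define NR_ZONES_SK 3

/* Simplified zone: only what the reserve calculation needs. */
struct zone_sk {
	unsigned long managed_pages;
	unsigned long lowmem_reserve[NR_ZONES_SK];
};

/* For each lower zone i and each higher zone index j, reserve
 * (pages managed by zones i+1..j) / ratio[i], but leave the reserve
 * at zero when zone i is empty or its ratio is disabled (< 1). */
static void setup_lowmem_reserve_sketch(struct zone_sk *zones,
					const int *ratio)
{
	for (int i = 0; i < NR_ZONES_SK; i++) {
		unsigned long upper_pages = 0;

		for (int j = 0; j < NR_ZONES_SK; j++)
			zones[i].lowmem_reserve[j] = 0;

		for (int j = i + 1; j < NR_ZONES_SK; j++) {
			upper_pages += zones[j].managed_pages;
			if (ratio[i] < 1 || zones[i].managed_pages == 0)
				continue;	/* empty zone: keep 0 */
			zones[i].lowmem_reserve[j] =
				upper_pages / (unsigned long)ratio[i];
		}
	}
}
```

      With an empty middle zone, its protection array stays all zeros while
      the populated lowest zone still gets a non-zero reserve.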
      
      Signed-off-by: Baoquan He <bhe@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Link: http://lkml.kernel.org/r/20200402140113.3696-3-bhe@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f6366156
    • B
      mm/page_alloc.c: only tune sysctl_lowmem_reserve_ratio value once when changing it · 86aaf255
      Baoquan He committed
      Patch series "improvements about lowmem_reserve and /proc/zoneinfo", v2.
      
      This patch (of 3):
      
      When people write to /proc/sys/vm/lowmem_reserve_ratio to change
      sysctl_lowmem_reserve_ratio[], setup_per_zone_lowmem_reserve() is called
      to recalculate all ->lowmem_reserve[] for each zone of all nodes as below:
      
      static void setup_per_zone_lowmem_reserve(void)
      {
      ...
      	for_each_online_pgdat(pgdat) {
      		for (j = 0; j < MAX_NR_ZONES; j++) {
      			...
      			while (idx) {
      				...
      				if (sysctl_lowmem_reserve_ratio[idx] < 1) {
      					sysctl_lowmem_reserve_ratio[idx] = 0;
      					lower_zone->lowmem_reserve[j] = 0;
      				} else {
      				...
      			}
      		}
      	}
      }
      
      Here, sysctl_lowmem_reserve_ratio[idx] is tuned if its value is smaller
      than '1'.  However, sysctl_lowmem_reserve_ratio[] is set per zone type,
      regardless of which node a zone belongs to.  That means the tuning is
      repeated for every node, even though it was already done for the first
      node.
      
      The tuning is also done when init_per_zone_wmark_min() calls
      setup_per_zone_lowmem_reserve(), even though nobody is trying to change
      sysctl_lowmem_reserve_ratio[] at that point.
      
      So move the tuning into lowmem_reserve_ratio_sysctl_handler(), to make
      the code logic more reasonable.
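      
      The resulting split of responsibilities can be sketched in userspace
      as follows.  All *_sk names are hypothetical, and the recalculation is
      reduced to a call counter; the point is only that the ratio array is
      sanitized once, in the write path, rather than inside the per-node
      loop.

```c
#define NR_ZONES_SK 4

/* Stand-in for the global ratio table written through the sysctl. */
static int lowmem_reserve_ratio_sk[NR_ZONES_SK] = {256, 256, 32, 0};
static int recalc_calls_sk;

/* Stand-in for the recalculation; it no longer touches the ratios. */
static void setup_per_zone_lowmem_reserve_sk(void)
{
	recalc_calls_sk++;
}

/* Write path: clamp every ratio below 1 to 0 exactly once, then
 * trigger a single recalculation. */
static void lowmem_reserve_ratio_handler_sk(const int *new_ratio)
{
	for (int i = 0; i < NR_ZONES_SK; i++)
		lowmem_reserve_ratio_sk[i] =
			new_ratio[i] < 1 ? 0 : new_ratio[i];

	setup_per_zone_lowmem_reserve_sk();
}
```

      Other callers of the recalculation, such as the boot-time path
      mentioned above, then run it without re-tuning values that nobody
      changed.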
      
      Signed-off-by: Baoquan He <bhe@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: David Rientjes <rientjes@google.com>
      Link: http://lkml.kernel.org/r/20200402140113.3696-1-bhe@redhat.com
      Link: http://lkml.kernel.org/r/20200402140113.3696-2-bhe@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      86aaf255
    • B
      mm/page_alloc.c: remove unused free_bootmem_with_active_regions · 4ca7be24
      Baoquan He committed
      Since commit 397dc00e ("mips: sgi-ip27: switch from DISCONTIGMEM
      to SPARSEMEM"), the last caller of free_bootmem_with_active_regions() is
      gone, so nothing calls it any more.
      
      Let's remove it.
      
      Signed-off-by: Baoquan He <bhe@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Link: http://lkml.kernel.org/r/20200402143455.5145-1-bhe@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4ca7be24
    • R
      mm,page_alloc,cma: conditionally prefer cma pageblocks for movable allocations · 16867664
      Roman Gushchin committed
      Currently a cma area is barely used by the page allocator because it's
      used only as a fallback from movable; however, kswapd tries hard to make
      sure that the fallback path isn't used.
      
      This results in a system evicting memory and pushing data into swap, while
      lots of CMA memory is still available.  This happens despite the fact that
      alloc_contig_range is perfectly capable of moving any movable allocations
      out of the way of an allocation.
      
      To effectively use the cma area let's alter the rules: if the zone has
      more free cma pages than half of the total free pages in the zone, use
      cma pageblocks first and fall back to movable blocks in case of failure.
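      
      The rule itself is a single comparison.  A minimal userspace sketch,
      where prefer_cma_first() and its parameters are hypothetical names for
      the two per-zone counters described above:

```c
#include <stdbool.h>

/* Returns true when more than half of the zone's free pages are free
 * CMA pages, i.e. when a movable allocation should try CMA pageblocks
 * before regular movable pageblocks. free_pages is the zone's total
 * free page count and is assumed to include the free CMA pages. */
static bool prefer_cma_first(unsigned long free_cma_pages,
			     unsigned long free_pages)
{
	return free_cma_pages > free_pages / 2;
}
```

      Because the comparison is strict, a zone where exactly half of the
      free pages are CMA pages keeps the old order.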
      
      [guro@fb.com: ifdef the cma-specific code]
        Link: http://lkml.kernel.org/r/20200311225832.GA178154@carbon.DHCP.thefacebook.com
      Co-developed-by: Rik van Riel <riel@surriel.com>
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Signed-off-by: Rik van Riel <riel@surriel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Cc: Qian Cai <cai@lca.pw>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Link: http://lkml.kernel.org/r/20200306150102.3e77354b@imladris.surriel.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      16867664
    • W
      mm/page_alloc.c: extract check_[new|free]_page_bad() common part to page_bad_reason() · 58b7f119
      Wei Yang committed
      We share similar code in check_[new|free]_page_bad() to get the page's bad
      reason.
      
      Let's extract it and reduce code duplication.
      
      Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Link: http://lkml.kernel.org/r/20200411220357.9636-6-richard.weiyang@gmail.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      58b7f119
    • W
      mm/page_alloc.c: rename free_pages_check() to check_free_page() · 534fe5e3
      Wei Yang committed
      free_pages_check() is the counterpart of check_new_page().  Rename it to
      use the same naming convention.
      
      Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Link: http://lkml.kernel.org/r/20200411220357.9636-5-richard.weiyang@gmail.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      534fe5e3
    • W
      mm/page_alloc.c: rename free_pages_check_bad() to check_free_page_bad() · 0d0c48a2
      Wei Yang committed
      free_pages_check_bad() is the counterpart of check_new_page_bad().  Rename
      it to use the same naming convention.
      
      Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Link: http://lkml.kernel.org/r/20200411220357.9636-4-richard.weiyang@gmail.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0d0c48a2
    • W
      mm/page_alloc.c: bad_flags is not necessary for bad_page() · 82a3241a
      Wei Yang committed
      After commit 5b57b8f2 ("mm/debug.c: always print flags in
      dump_page()"), page->flags is always printed for a bad page.  It is not
      necessary to have bad_flags any more.
      
      Suggested-by: Anshuman Khandual <anshuman.khandual@arm.com>
      Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Link: http://lkml.kernel.org/r/20200411220357.9636-3-richard.weiyang@gmail.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      82a3241a
    • W
      mm/page_alloc.c: bad_[reason|flags] is not necessary when PageHWPoison · 833d8a42
      Wei Yang committed
      Patch series "mm/page_alloc.c: cleanup on check page", v3.
      
      This patchset does some cleanup related to check page.
      
      1. Remove unnecessary bad_reason assignment
      2. Remove bad_flags to bad_page()
      3. Rename function for naming convention
      4. Extract common part to check page
      
      Thanks for suggestions from David Rientjes and Anshuman Khandual.
      
      This patch (of 5):
      
      Since the function returns directly in the PageHWPoison case,
      bad_[reason|flags] is not used anywhere.  Also move this check to the
      top of the function.
      
      This is a follow-up cleanup to commit e570f56c ("mm:
      check_new_page_bad() directly returns in __PG_HWPOISON case").
      
      Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: David Rientjes <rientjes@google.com>
      Link: http://lkml.kernel.org/r/20200411220357.9636-2-richard.weiyang@gmail.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      833d8a42
    • M
      mm: simplify find_min_pfn_with_active_regions() · 8a1b25fe
      Mike Rapoport committed
      find_min_pfn_with_active_regions() calls find_min_pfn_for_node() with the
      nid parameter set to MAX_NUMNODES.  This makes find_min_pfn_for_node()
      traverse all memblock memory regions, even though the first PFN in the
      system can be easily found with memblock_start_of_DRAM().
      
      Use memblock_start_of_DRAM() in find_min_pfn_with_active_regions() and drop
      now unused find_min_pfn_for_node().
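      
      A userspace sketch of why no traversal is needed, under the assumption
      that the region array is kept sorted by base address, so the first
      entry marks the start of DRAM.  struct mem_region_sk, the 4 KiB page
      size and both helper names are hypothetical.

```c
#include <stdint.h>

#define PAGE_SHIFT_SK 12	/* assumed 4 KiB pages */

struct mem_region_sk {
	uint64_t base;	/* physical start address */
	uint64_t size;	/* length in bytes */
};

/* With regions sorted by base, the lowest address is simply the
 * base of the first region; no loop over all regions is required. */
static uint64_t start_of_dram_sk(const struct mem_region_sk *regions, int n)
{
	return n > 0 ? regions[0].base : 0;
}

/* Convert that physical address to a page frame number. */
static unsigned long min_pfn_sk(const struct mem_region_sk *regions, int n)
{
	return (unsigned long)(start_of_dram_sk(regions, n) >> PAGE_SHIFT_SK);
}
```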
      
      Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Tested-by: Hoan Tran <hoan@os.amperecomputing.com>	[arm64]
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Brian Cain <bcain@codeaurora.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Greentime Hu <green.hu@gmail.com>
      Cc: Greg Ungerer <gerg@linux-m68k.org>
      Cc: Guan Xuetao <gxt@pku.edu.cn>
      Cc: Guo Ren <guoren@kernel.org>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Helge Deller <deller@gmx.de>
      Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Ley Foon Tan <ley.foon.tan@intel.com>
      Cc: Mark Salter <msalter@redhat.com>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: Nick Hu <nickhu@andestech.com>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Stafford Horne <shorne@gmail.com>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Link: http://lkml.kernel.org/r/20200412194859.12663-21-rppt@kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8a1b25fe
    • M
      mm: clean up free_area_init_node() and its helpers · 854e8848
      Mike Rapoport committed
      free_area_init_node() now always uses memblock info and the zone PFN
      limits, so it does not need the backwards compatibility functions to
      calculate the zone spanned and absent pages.  The removal of the compat_
      versions of zone_{absent,spanned}_pages_in_node(), in turn, makes the
      zone_size and zhole_size parameters unused.
      
      The node_start_pfn is determined by get_pfn_range_for_nid(), so there is
      no need to pass it to free_area_init_node().
      
      As a result, the only required parameter to free_area_init_node() is the
      node ID; all the rest are removed along with the no longer used
      compat_zone_{absent,spanned}_pages_in_node() helpers.
      
      Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Tested-by: Hoan Tran <hoan@os.amperecomputing.com>	[arm64]
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Brian Cain <bcain@codeaurora.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Greentime Hu <green.hu@gmail.com>
      Cc: Greg Ungerer <gerg@linux-m68k.org>
      Cc: Guan Xuetao <gxt@pku.edu.cn>
      Cc: Guo Ren <guoren@kernel.org>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Helge Deller <deller@gmx.de>
      Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Ley Foon Tan <ley.foon.tan@intel.com>
      Cc: Mark Salter <msalter@redhat.com>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: Nick Hu <nickhu@andestech.com>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Stafford Horne <shorne@gmail.com>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Link: http://lkml.kernel.org/r/20200412194859.12663-20-rppt@kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      854e8848
    • M
      mm: rename free_area_init_node() to free_area_init_memoryless_node() · bc9331a1
      Mike Rapoport committed
      free_area_init_node() is only used by x86 to initialize memory-less
      nodes.  Make its name reflect this and drop all the function parameters
      except the node ID, as they are zero anyway.
      
      Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Tested-by: Hoan Tran <hoan@os.amperecomputing.com>	[arm64]
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Brian Cain <bcain@codeaurora.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Greentime Hu <green.hu@gmail.com>
      Cc: Greg Ungerer <gerg@linux-m68k.org>
      Cc: Guan Xuetao <gxt@pku.edu.cn>
      Cc: Guo Ren <guoren@kernel.org>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Helge Deller <deller@gmx.de>
      Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Ley Foon Tan <ley.foon.tan@intel.com>
      Cc: Mark Salter <msalter@redhat.com>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: Nick Hu <nickhu@andestech.com>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Stafford Horne <shorne@gmail.com>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Link: http://lkml.kernel.org/r/20200412194859.12663-19-rppt@kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bc9331a1
    • M
      mm: free_area_init: allow defining max_zone_pfn in descending order · 51930df5
      Mike Rapoport committed
      Some architectures (e.g.  ARC) have the ZONE_HIGHMEM zone below
      ZONE_NORMAL.  Letting free_area_init() parse the max_zone_pfn array even
      if it is sorted in descending order makes it possible to use
      free_area_init() on such architectures.
      
      Add top -> down traversal of max_zone_pfn array in free_area_init() and
      use the latter in ARC node/zone initialization.
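      
      A minimal userspace sketch of the two traversal orders, assuming a
      three-zone layout.  zone_pfn_ranges_sk() and its parameters are
      hypothetical; the sketch also ignores any zones that would need
      special handling and only shows how each zone's range starts where
      the previously visited zone ended.

```c
#include <stdbool.h>

#define NR_ZONES_SK 3

/* Walk max_zone_pfn[] bottom-up or top-down and derive, for every
 * zone, the lowest and highest possible PFN. A zone whose limit is
 * not above the running start PFN ends up with an empty range. */
static void zone_pfn_ranges_sk(const unsigned long *max_zone_pfn,
			       unsigned long start_pfn, bool descending,
			       unsigned long *lowest, unsigned long *highest)
{
	for (int i = 0; i < NR_ZONES_SK; i++) {
		int zone = descending ? NR_ZONES_SK - i - 1 : i;
		unsigned long end_pfn = max_zone_pfn[zone] > start_pfn ?
					max_zone_pfn[zone] : start_pfn;

		lowest[zone] = start_pfn;
		highest[zone] = end_pfn;
		start_pfn = end_pfn;	/* next zone starts here */
	}
}
```

      With descending set, the zone with the highest index is placed lowest
      in the PFN space, which matches the layout described above where
      ZONE_HIGHMEM sits below ZONE_NORMAL.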
      
      [rppt@kernel.org: ARC fix]
        Link: http://lkml.kernel.org/r/20200504153901.GM14260@kernel.org
      [rppt@linux.ibm.com: arc: free_area_init(): take into account PAE40 mode]
        Link: http://lkml.kernel.org/r/20200507205900.GH683243@linux.ibm.com
      [akpm@linux-foundation.org: declare arch_has_descending_max_zone_pfns()]
      Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Tested-by: Hoan Tran <hoan@os.amperecomputing.com>	[arm64]
      Reviewed-by: Baoquan He <bhe@redhat.com>
      Cc: Brian Cain <bcain@codeaurora.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Greentime Hu <green.hu@gmail.com>
      Cc: Greg Ungerer <gerg@linux-m68k.org>
      Cc: Guan Xuetao <gxt@pku.edu.cn>
      Cc: Guo Ren <guoren@kernel.org>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Helge Deller <deller@gmx.de>
      Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Ley Foon Tan <ley.foon.tan@intel.com>
      Cc: Mark Salter <msalter@redhat.com>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: Nick Hu <nickhu@andestech.com>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Stafford Horne <shorne@gmail.com>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Cc: Guenter Roeck <linux@roeck-us.net>
      Link: http://lkml.kernel.org/r/20200412194859.12663-18-rppt@kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      51930df5
    • M
      mm: remove early_pfn_in_nid() and CONFIG_NODES_SPAN_OTHER_NODES · acd3f5c4
      Mike Rapoport committed
      The memmap_init() function was made to iterate over memblock regions and
      as a result the early_pfn_in_nid() function became obsolete.  Since
      CONFIG_NODES_SPAN_OTHER_NODES is only used to pick a stub or a real
      implementation of early_pfn_in_nid(), it is also not needed anymore.
      
      Remove both early_pfn_in_nid() and the CONFIG_NODES_SPAN_OTHER_NODES.
      
      Co-developed-by: Hoan Tran <Hoan@os.amperecomputing.com>
      Signed-off-by: Hoan Tran <Hoan@os.amperecomputing.com>
      Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Tested-by: Hoan Tran <hoan@os.amperecomputing.com>	[arm64]
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Brian Cain <bcain@codeaurora.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Greentime Hu <green.hu@gmail.com>
      Cc: Greg Ungerer <gerg@linux-m68k.org>
      Cc: Guan Xuetao <gxt@pku.edu.cn>
      Cc: Guo Ren <guoren@kernel.org>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Helge Deller <deller@gmx.de>
      Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Ley Foon Tan <ley.foon.tan@intel.com>
      Cc: Mark Salter <msalter@redhat.com>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: Nick Hu <nickhu@andestech.com>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Stafford Horne <shorne@gmail.com>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Link: http://lkml.kernel.org/r/20200412194859.12663-17-rppt@kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      acd3f5c4
    • B
      mm: memmap_init: iterate over memblock regions rather that check each PFN · 73a6e474
      Baoquan He committed
      When called during boot the memmap_init_zone() function checks if each PFN
      is valid and actually belongs to the node being initialized using
      early_pfn_valid() and early_pfn_in_nid().
      
      Each such check may cost up to O(log(n)) where n is the number of memory
      banks, so for large amounts of memory the overall time spent in
      early_pfn*() becomes substantial.
      
      Since the information is present in memblock anyway, we can iterate over
      memblock memory regions in memmap_init() and only call memmap_init_zone()
      for PFN ranges that are known to be valid and in the appropriate node.
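      
      The approach can be sketched in userspace as a loop over region
      descriptors that clamps each region to the zone and handles whole
      ranges instead of testing PFNs one by one.  struct region_sk and
      memmap_init_sketch() are hypothetical, and the per-range
      initialization is reduced to counting pages.

```c
/* One memory region: PFN range [start_pfn, end_pfn) on node nid. */
struct region_sk {
	unsigned long start_pfn;
	unsigned long end_pfn;
	int nid;
};

/* Visit only regions that belong to nid, clamp each to the zone's
 * [zone_start, zone_end) range, and "initialize" the overlap.
 * Returns the number of pages that would be initialized. */
static unsigned long memmap_init_sketch(const struct region_sk *regions,
					int n, int nid,
					unsigned long zone_start,
					unsigned long zone_end)
{
	unsigned long initialized = 0;

	for (int i = 0; i < n; i++) {
		unsigned long start, end;

		if (regions[i].nid != nid)
			continue;

		start = regions[i].start_pfn > zone_start ?
			regions[i].start_pfn : zone_start;
		end = regions[i].end_pfn < zone_end ?
		      regions[i].end_pfn : zone_end;

		/* Stand-in for initializing the range [start, end). */
		if (start < end)
			initialized += end - start;
	}
	return initialized;
}
```

      The work is proportional to the number of regions plus the pages
      actually initialized; holes between regions and pages on other nodes
      are never visited.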
      
      [cai@lca.pw: fix a compilation warning from Clang]
        Link: http://lkml.kernel.org/r/CF6E407F-17DC-427C-8203-21979FB882EF@lca.pw
      [bhe@redhat.com: fix the incorrect hole in fast_isolate_freepages()]
        Link: http://lkml.kernel.org/r/8C537EB7-85EE-4DCF-943E-3CC0ED0DF56D@lca.pw
        Link: http://lkml.kernel.org/r/20200521014407.29690-1-bhe@redhat.com
      Signed-off-by: Baoquan He <bhe@redhat.com>
      Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Tested-by: Hoan Tran <hoan@os.amperecomputing.com>	[arm64]
      Cc: Brian Cain <bcain@codeaurora.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Greentime Hu <green.hu@gmail.com>
      Cc: Greg Ungerer <gerg@linux-m68k.org>
      Cc: Guan Xuetao <gxt@pku.edu.cn>
      Cc: Guo Ren <guoren@kernel.org>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Helge Deller <deller@gmx.de>
      Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Ley Foon Tan <ley.foon.tan@intel.com>
      Cc: Mark Salter <msalter@redhat.com>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: Nick Hu <nickhu@andestech.com>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Stafford Horne <shorne@gmail.com>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Cc: Qian Cai <cai@lca.pw>
      Link: http://lkml.kernel.org/r/20200412194859.12663-16-rppt@kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      73a6e474
    • M
      mm: use free_area_init() instead of free_area_init_nodes() · 9691a071
      Mike Rapoport committed
      free_area_init() has effectively become a wrapper for
      free_area_init_nodes() and there is no point in keeping it.  Still, the
      free_area_init() name is shorter and more general, as it does not imply
      the need to initialize multiple nodes.
      
      Rename free_area_init_nodes() to free_area_init(), update the callers and
      drop the old version of free_area_init().
      
      Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Tested-by: Hoan Tran <hoan@os.amperecomputing.com>	[arm64]
      Reviewed-by: Baoquan He <bhe@redhat.com>
      Acked-by: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Brian Cain <bcain@codeaurora.org>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Greentime Hu <green.hu@gmail.com>
      Cc: Greg Ungerer <gerg@linux-m68k.org>
      Cc: Guan Xuetao <gxt@pku.edu.cn>
      Cc: Guo Ren <guoren@kernel.org>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Helge Deller <deller@gmx.de>
      Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Ley Foon Tan <ley.foon.tan@intel.com>
      Cc: Mark Salter <msalter@redhat.com>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: Nick Hu <nickhu@andestech.com>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Stafford Horne <shorne@gmail.com>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Link: http://lkml.kernel.org/r/20200412194859.12663-6-rppt@kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9691a071
    • M
      mm: free_area_init: use maximal zone PFNs rather than zone sizes · fa3354e4
      Committed by Mike Rapoport
      Currently, architectures that use free_area_init() to initialize the
      memory map and the node and zone structures need to calculate zone and
      hole sizes.  We can use free_area_init_nodes() instead and let it detect
      the zone boundaries, so that the architectures only have to supply the
      possible limits for the zones.
      Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Tested-by: Hoan Tran <hoan@os.amperecomputing.com>	[arm64]
      Reviewed-by: Baoquan He <bhe@redhat.com>
      Cc: Brian Cain <bcain@codeaurora.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Greentime Hu <green.hu@gmail.com>
      Cc: Greg Ungerer <gerg@linux-m68k.org>
      Cc: Guan Xuetao <gxt@pku.edu.cn>
      Cc: Guo Ren <guoren@kernel.org>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Helge Deller <deller@gmx.de>
      Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Ley Foon Tan <ley.foon.tan@intel.com>
      Cc: Mark Salter <msalter@redhat.com>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: Nick Hu <nickhu@andestech.com>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Stafford Horne <shorne@gmail.com>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Link: http://lkml.kernel.org/r/20200412194859.12663-5-rppt@kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      fa3354e4
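      The idea in the commit above (the architecture supplies only the
      upper PFN limit of each zone, and the core code works out where each
      zone actually starts, ends and has holes) can be illustrated with a
      small user-space sketch. This is not the kernel implementation: the
      types mem_range and zone_span, the two-zone enum and the function
      zones_from_max_pfn() are hypothetical names made up for this
      illustration, and the memory ranges are assumed not to overlap.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins; not the kernel's definitions. */
enum { ZONE_DMA, ZONE_NORMAL, MAX_NR_ZONES };

struct mem_range {              /* one block of present memory */
    unsigned long start_pfn;    /* inclusive */
    unsigned long end_pfn;      /* exclusive */
};

struct zone_span {
    unsigned long start_pfn;    /* first present PFN in the zone */
    unsigned long end_pfn;      /* one past the last present PFN */
    unsigned long present;      /* pages actually backed by memory */
};

static unsigned long clamp_pfn(unsigned long v, unsigned long lo,
                               unsigned long hi)
{
    return v < lo ? lo : (v > hi ? hi : v);
}

/*
 * Derive each zone's extent from the per-zone upper limits alone.
 * Zone z may only hold PFNs in [previous limit, max_zone_pfn[z]).
 * Assumes the ranges in mem[] do not overlap each other.
 */
static void zones_from_max_pfn(const unsigned long max_zone_pfn[MAX_NR_ZONES],
                               const struct mem_range *mem, size_t nr,
                               struct zone_span out[MAX_NR_ZONES])
{
    unsigned long lo = 0;

    for (int z = 0; z < MAX_NR_ZONES; z++) {
        /* A limit below the previous one yields an empty zone. */
        unsigned long hi = max_zone_pfn[z] > lo ? max_zone_pfn[z] : lo;
        struct zone_span *zs = &out[z];

        zs->start_pfn = zs->end_pfn = zs->present = 0;
        for (size_t i = 0; i < nr; i++) {
            assert(mem[i].start_pfn <= mem[i].end_pfn);
            /* Intersect the memory range with the zone window. */
            unsigned long s = clamp_pfn(mem[i].start_pfn, lo, hi);
            unsigned long e = clamp_pfn(mem[i].end_pfn, lo, hi);

            if (s >= e)
                continue;
            if (zs->present == 0 || s < zs->start_pfn)
                zs->start_pfn = s;
            if (e > zs->end_pfn)
                zs->end_pfn = e;
            zs->present += e - s;
        }
        lo = hi;    /* the next zone starts where this one ends */
    }
}
```

      In this sketch the hole size of a zone is simply
      (end_pfn - start_pfn) - present, so the caller never has to compute
      zone or hole sizes by hand.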
    • M
      mm: remove CONFIG_HAVE_MEMBLOCK_NODE_MAP option · 3f08a302
      Committed by Mike Rapoport
      CONFIG_HAVE_MEMBLOCK_NODE_MAP is used to differentiate initialization of
      node and zone structures between systems that have a region-to-node
      mapping in memblock and those that don't.
      
      Currently all the NUMA architectures enable this option and for the
      non-NUMA systems we can presume that all the memory belongs to node 0 and
      therefore the compile time configuration option is not required.
      
      The remaining few architectures that use DISCONTIGMEM without NUMA are
      easily updated to use memblock_add_node() instead of memblock_add() and
      thus have proper correspondence of memblock regions to NUMA nodes.
      
      Still, free_area_init_node() must have a backward compatible version
      because its semantics with and without CONFIG_HAVE_MEMBLOCK_NODE_MAP are
      different.  Once all the architectures use the new semantics, the
      entire compatibility layer can be dropped.
      
      To avoid addition of extra run time memory to store node id for
      architectures that keep memblock but have only a single node, the node id
      field of the memblock_region is guarded by CONFIG_NEED_MULTIPLE_NODES and
      the corresponding accessors presume that in those cases it is always 0.
      Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Tested-by: Hoan Tran <hoan@os.amperecomputing.com>	[arm64]
      Acked-by: Catalin Marinas <catalin.marinas@arm.com>	[arm64]
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Brian Cain <bcain@codeaurora.org>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Greentime Hu <green.hu@gmail.com>
      Cc: Greg Ungerer <gerg@linux-m68k.org>
      Cc: Guan Xuetao <gxt@pku.edu.cn>
      Cc: Guo Ren <guoren@kernel.org>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Helge Deller <deller@gmx.de>
      Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Ley Foon Tan <ley.foon.tan@intel.com>
      Cc: Mark Salter <msalter@redhat.com>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: Nick Hu <nickhu@andestech.com>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Stafford Horne <shorne@gmail.com>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Link: http://lkml.kernel.org/r/20200412194859.12663-4-rppt@kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3f08a302
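      The last paragraph of the commit message above describes a common
      pattern: a struct field that exists only when a build option is set,
      hidden behind accessors that return a constant otherwise. A minimal
      sketch of that pattern follows. The macro NEED_MULTIPLE_NODES, the
      struct region and the region_set_nid()/region_get_nid() helpers are
      hypothetical stand-ins chosen for this illustration, not the
      kernel's actual definitions.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical stand-in for CONFIG_NEED_MULTIPLE_NODES. Leave it
 * undefined to model a single-node build.
 */
/* #define NEED_MULTIPLE_NODES 1 */

struct region {
    unsigned long base;
    unsigned long size;
#ifdef NEED_MULTIPLE_NODES
    int nid;                /* stored only when several nodes can exist */
#endif
};

/* Record the node id; a no-op when only node 0 can exist. */
static inline void region_set_nid(struct region *r, int nid)
{
    assert(r != NULL);
#ifdef NEED_MULTIPLE_NODES
    r->nid = nid;
#else
    (void)r;
    (void)nid;
#endif
}

/* Report the node id; always 0 when the field is compiled out. */
static inline int region_get_nid(const struct region *r)
{
    assert(r != NULL);
#ifdef NEED_MULTIPLE_NODES
    return r->nid;
#else
    (void)r;
    return 0;
#endif
}
```

      Callers use the accessors unconditionally, so single-node builds pay
      no per-region storage for the node id and need no #ifdef at the call
      sites.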