1. 14 Oct 2020 (2 commits)
  2. 12 Oct 2020 (1 commit)
  3. 04 Oct 2020 (1 commit)
  4. 27 Sep 2020 (1 commit)
    • mm: replace memmap_context by meminit_context · c1d0da83
      Committed by Laurent Dufour
      Patch series "mm: fix memory to node bad links in sysfs", v3.
      
      Sometimes, firmware may expose interleaved memory layout like this:
      
       Early memory node ranges
         node   1: [mem 0x0000000000000000-0x000000011fffffff]
         node   2: [mem 0x0000000120000000-0x000000014fffffff]
         node   1: [mem 0x0000000150000000-0x00000001ffffffff]
         node   0: [mem 0x0000000200000000-0x000000048fffffff]
         node   2: [mem 0x0000000490000000-0x00000007ffffffff]
      
      In that case, we can see memory blocks assigned to multiple nodes in
      sysfs:
      
        $ ls -l /sys/devices/system/memory/memory21
        total 0
        lrwxrwxrwx 1 root root     0 Aug 24 05:27 node1 -> ../../node/node1
        lrwxrwxrwx 1 root root     0 Aug 24 05:27 node2 -> ../../node/node2
        -rw-r--r-- 1 root root 65536 Aug 24 05:27 online
        -r--r--r-- 1 root root 65536 Aug 24 05:27 phys_device
        -r--r--r-- 1 root root 65536 Aug 24 05:27 phys_index
        drwxr-xr-x 2 root root     0 Aug 24 05:27 power
        -r--r--r-- 1 root root 65536 Aug 24 05:27 removable
        -rw-r--r-- 1 root root 65536 Aug 24 05:27 state
        lrwxrwxrwx 1 root root     0 Aug 24 05:25 subsystem -> ../../../../bus/memory
        -rw-r--r-- 1 root root 65536 Aug 24 05:25 uevent
        -r--r--r-- 1 root root 65536 Aug 24 05:27 valid_zones
      
      The same applies to the node directories: a memory21 link appears in
      both the node1 and node2 directories.
      
      This is wrong but doesn't prevent the system from running.  However,
      when one of these memory blocks is later hot-unplugged and then
      hot-plugged, the system detects an inconsistency in the sysfs layout
      and a BUG_ON() is raised:
      
        kernel BUG at /Users/laurent/src/linux-ppc/mm/memory_hotplug.c:1084!
        LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
        Modules linked in: rpadlpar_io rpaphp pseries_rng rng_core vmx_crypto gf128mul binfmt_misc ip_tables x_tables xfs libcrc32c crc32c_vpmsum autofs4
        CPU: 8 PID: 10256 Comm: drmgr Not tainted 5.9.0-rc1+ #25
        Call Trace:
          add_memory_resource+0x23c/0x340 (unreliable)
          __add_memory+0x5c/0xf0
          dlpar_add_lmb+0x1b4/0x500
          dlpar_memory+0x1f8/0xb80
          handle_dlpar_errorlog+0xc0/0x190
          dlpar_store+0x198/0x4a0
          kobj_attr_store+0x30/0x50
          sysfs_kf_write+0x64/0x90
          kernfs_fop_write+0x1b0/0x290
          vfs_write+0xe8/0x290
          ksys_write+0xdc/0x130
          system_call_exception+0x160/0x270
          system_call_common+0xf0/0x27c
      
      This has been seen on PowerPC LPAR.
      
      The root cause of this issue is that when a node's memory is
      registered, the range used can overlap another node's range, thus the
      memory block gets registered to multiple nodes in sysfs.
      
      There are two issues here:
      
       (a) The sysfs memory and node's layouts are broken due to these
           multiple links
      
       (b) The link errors in link_mem_sections() should not lead to a system
           panic.
      
      To address (a), register_mem_sect_under_node() should not rely on the
      system state to detect whether the link operation is triggered by a
      hotplug operation or not.  This is addressed by patches 1 and 2 of
      this series.
      
      Issue (b) will be addressed separately.
      
      This patch (of 2):
      
      The memmap_context enum is used to detect whether a memory operation is
      due to a hot-add operation or happening at boot time.
      
      Make it general to the hotplug operation and rename it as
      meminit_context.
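
      For orientation, the rename described here is essentially the following
      (a sketch quoted from memory, not the patch itself):

        /* before: named after the memmap */
        enum memmap_context {
                MEMMAP_EARLY,
                MEMMAP_HOTPLUG,
        };

        /* after: general to any memory init path */
        enum meminit_context {
                MEMINIT_EARLY,
                MEMINIT_HOTPLUG,
        };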
      
      There is no functional change introduced by this patch.
      Suggested-by: David Hildenbrand <david@redhat.com>
      Signed-off-by: Laurent Dufour <ldufour@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: "Rafael J. Wysocki" <rafael@kernel.org>
      Cc: Nathan Lynch <nathanl@linux.ibm.com>
      Cc: Scott Cheloha <cheloha@linux.ibm.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: <stable@vger.kernel.org>
      Link: https://lkml.kernel.org/r/20200915094143.79181-1-ldufour@linux.ibm.com
      Link: https://lkml.kernel.org/r/20200915132624.9723-1-ldufour@linux.ibm.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c1d0da83
  5. 22 Aug 2020 (2 commits)
    • mm, page_alloc: fix core hung in free_pcppages_bulk() · 88e8ac11
      Committed by Charan Teja Reddy
      The following race is observed with repeated online, offline, and a
      delay between two successive onlines of memory blocks in the movable
      zone.
      
      P1						P2
      
      Online the first memory block in
      the movable zone. The pcp struct
      values are initialized to default
      values,i.e., pcp->high = 0 &
      pcp->batch = 1.
      
      					Allocate the pages from the
      					movable zone.
      
      Try to Online the second memory
      block in the movable zone thus it
      entered the online_pages() but yet
      to call zone_pcp_update().
      					This process is entered into
      					the exit path thus it tries
      					to release the order-0 pages
      					to pcp lists through
      					free_unref_page_commit().
      					As pcp->high = 0, pcp->count = 1
      					proceed to call the function
      					free_pcppages_bulk().
      Update the pcp values thus the
      new pcp values are like, say,
      pcp->high = 378, pcp->batch = 63.
      					Read the pcp's batch value using
      					READ_ONCE() and pass the same to
      					free_pcppages_bulk(), pcp values
      					passed here are, batch = 63,
      					count = 1.
      
      					Since num of pages in the pcp
      					lists are less than ->batch,
      					then it will stuck in
      					while(list_empty(list)) loop
      					with interrupts disabled thus
      					a core hung.
      
      Avoid this by ensuring free_pcppages_bulk() is called with a proper
      count of pcp list pages.
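
      Conceptually, the fix clamps the requested count to what is actually on
      the pcp lists before the drain loop starts; a minimal sketch of the idea
      (the exact placement at the top of free_pcppages_bulk() is an
      assumption):

        /*
         * Never ask free_pcppages_bulk() to drain more pages than the pcp
         * lists actually hold, otherwise the while (list_empty(list)) scan
         * can spin forever with interrupts disabled.
         */
        count = min(pcp->count, count);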
      
      The mentioned race is somewhat easily reproducible without [1] because
      pcp's are not updated for the first memory block online, and thus there
      is enough of a race window for P2 between alloc+free and the pcp struct
      values update through onlining of the second memory block.
      
      With [1], the race still exists but it is very narrow as we update the pcp
      struct values for the first memory block online itself.
      
      This is not limited to the movable zone; it could also happen in cases
      with the normal zone (e.g., hotplug to a node that only has DMA memory,
      or no other memory yet).
      
      [1]: https://patchwork.kernel.org/patch/11696389/
      
      Fixes: 5f8dcc21 ("page-allocator: split per-cpu list into one-list-per-migrate-type")
      Signed-off-by: Charan Teja Reddy <charante@codeaurora.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: David Hildenbrand <david@redhat.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Vinayak Menon <vinmenon@codeaurora.org>
      Cc: <stable@vger.kernel.org> [2.6+]
      Link: http://lkml.kernel.org/r/1597150703-19003-1-git-send-email-charante@codeaurora.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      88e8ac11
    • mm: include CMA pages in lowmem_reserve at boot · e08d3fdf
      Committed by Doug Berger
      The lowmem_reserve arrays provide a means of applying pressure against
      allocations from lower zones that were targeted at higher zones.  Its
      values are a function of the number of pages managed by higher zones and
      are assigned by a call to the setup_per_zone_lowmem_reserve() function.
      
      The function is initially called at boot time by the function
      init_per_zone_wmark_min() and may be called later by accesses of the
      /proc/sys/vm/lowmem_reserve_ratio sysctl file.
      
      The function init_per_zone_wmark_min() was moved up from a module_init to
      a core_initcall to resolve a sequencing issue with khugepaged.
      Unfortunately this created a sequencing issue with CMA page accounting.
      
      The CMA pages are added to the managed page count of a zone when
      cma_init_reserved_areas() is called at boot also as a core_initcall.  This
      makes it uncertain whether the CMA pages will be added to the managed page
      counts of their zones before or after the call to
      init_per_zone_wmark_min() as it becomes dependent on link order.  With the
      current link order the pages are added to the managed count after the
      lowmem_reserve arrays are initialized at boot.
      
      This means the lowmem_reserve values at boot may be lower than the values
      used later if /proc/sys/vm/lowmem_reserve_ratio is accessed even if the
      ratio values are unchanged.
      
      In many cases the difference is not significant, but for example
      an ARM platform with 1GB of memory and the following memory layout
      
        cma: Reserved 256 MiB at 0x0000000030000000
        Zone ranges:
          DMA      [mem 0x0000000000000000-0x000000002fffffff]
          Normal   empty
          HighMem  [mem 0x0000000030000000-0x000000003fffffff]
      
      would result in 0 lowmem_reserve for the DMA zone.  This would allow
      userspace to deplete the DMA zone easily.
      
      Funnily enough
      
        $ cat /proc/sys/vm/lowmem_reserve_ratio
      
      would fix up the situation because, as a side effect, it forces a call
      to setup_per_zone_lowmem_reserve().
      
      This commit breaks the link order dependency by invoking
      init_per_zone_wmark_min() as a postcore_initcall so that the CMA pages
      have a chance to be properly accounted in their zone(s), allowing the
      lowmem_reserve arrays to receive consistent values.
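
      In code terms this is a one-line change at the bottom of
      mm/page_alloc.c, moving the initcall one level later so it runs after
      the core_initcall that accounts the CMA pages (sketch):

        -core_initcall(init_per_zone_wmark_min)
        +postcore_initcall(init_per_zone_wmark_min)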
      
      Fixes: bc22af74 ("mm: update min_free_kbytes from khugepaged after core initialization")
      Signed-off-by: Doug Berger <opendmb@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Jason Baron <jbaron@akamai.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: <stable@vger.kernel.org>
      Link: http://lkml.kernel.org/r/1597423766-27849-1-git-send-email-opendmb@gmail.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e08d3fdf
  6. 15 Aug 2020 (1 commit)
  7. 13 Aug 2020 (2 commits)
  8. 08 Aug 2020 (15 commits)
  9. 17 Jul 2020 (1 commit)
  10. 04 Jul 2020 (1 commit)
  11. 09 Jun 2020 (1 commit)
  12. 05 Jun 2020 (2 commits)
    • mm: Allow to offline unmovable PageOffline() pages via MEM_GOING_OFFLINE · aa218795
      Committed by David Hildenbrand
      virtio-mem wants to allow offlining memory blocks of which some parts
      were unplugged (allocated via alloc_contig_range()), especially, to
      later offline and remove completely unplugged memory blocks. The
      important part is that PageOffline() has to remain set until the
      section is offline, so these pages will never get accessed (e.g., when
      dumping). The pages should not be handed back to the buddy (which would
      require clearing PageOffline() and would result in issues if offlining
      fails and the pages are suddenly back in the buddy).
      
      Let's allow this by permitting any PageOffline() page to be isolated
      when offlining. This way, we can reach the memory hotplug notifier
      MEM_GOING_OFFLINE, where the driver can signal that it is fine with
      offlining this page by dropping its reference count. PageOffline()
      pages with a reference count of 0 can then be skipped when offlining
      the pages (as if they were free, even though they are not in the
      buddy).
      
      Anybody who uses PageOffline() pages and does not agree to offline
      them (e.g., Hyper-V balloon, XEN balloon, VMWare balloon for 2MB pages)
      will not decrement the reference count, which makes offlining fail when
      trying to migrate such an unmovable page. So there should be no
      observable change. The same applies to balloon compaction users
      (movable PageOffline() pages): the pages will simply be migrated.
      
      Note 1: If offlining fails, a driver has to increment the reference
      	count again in MEM_CANCEL_OFFLINE.
      
      Note 2: A driver that makes use of this has to be aware that re-onlining
      	the memory block has to be handled by hooking into onlining code
      	(online_page_callback_t), resetting the page PageOffline() and
      	not giving them to the buddy.
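
      A driver opting in to this contract would hook the memory notifier
      roughly as sketched below (the helper names are hypothetical; only the
      notifier actions follow from the description above):

        static int my_memory_notifier(struct notifier_block *nb,
                                      unsigned long action, void *arg)
        {
                switch (action) {
                case MEM_GOING_OFFLINE:
                        /* agree to offlining: drop our reference on the
                         * unplugged PageOffline() pages so they are skipped */
                        my_drop_unplugged_page_refs(arg);
                        break;
                case MEM_CANCEL_OFFLINE:
                        /* offlining failed: take the references back */
                        my_retake_unplugged_page_refs(arg);
                        break;
                }
                return NOTIFY_OK;
        }
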
      Reviewed-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Tested-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Acked-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Pavel Tatashin <pavel.tatashin@microsoft.com>
      Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Anthony Yznaga <anthony.yznaga@oracle.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Qian Cai <cai@lca.pw>
      Cc: Pingfan Liu <kernelfans@gmail.com>
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Link: https://lore.kernel.org/r/20200507140139.17083-7-david@redhat.com
      Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
      aa218795
    • virtio-mem: Paravirtualized memory hotunplug part 2 · 255f5985
      Committed by David Hildenbrand
      We also want to unplug online memory (contained in online memory blocks
      and, therefore, managed by the buddy), and eventually replug it later.
      
      When requested to unplug memory, we use alloc_contig_range() to allocate
      subblocks in online memory blocks (so we are the owner) and send them to
      our hypervisor. When requested to plug memory, we can replug such memory
      using free_contig_range() after asking our hypervisor.
      
      We also want to mark all allocated pages PG_offline, so nobody will
      touch them. To differentiate pages that were never onlined when
      onlining the memory block from pages allocated via alloc_contig_range(), we
      use PageDirty(). Based on this flag, virtio_mem_fake_online() can either
      online the pages for the first time or use free_contig_range().
      
      It is worth noting that there are no guarantees on how much memory can
      actually get unplugged again. All device memory might completely be
      fragmented with unmovable data, such that no subblock can get unplugged.
      
      We are not touching the ZONE_MOVABLE. If memory is onlined to the
      ZONE_MOVABLE, it can only get unplugged after that memory was offlined
      manually by user space. In normal operation, virtio-mem memory is
      suggested to be onlined to ZONE_NORMAL. In the future, we will try to
      make unplug more likely to succeed.
      
      Add a module parameter to control if online memory shall be touched.
      
      As we want to access alloc_contig_range()/free_contig_range() from
      kernel module context, export the symbols.
      
      Note: Whenever virtio-mem uses alloc_contig_range(), all affected pages
      are on the same node, in the same zone, and contain no holes.
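
      For orientation, the unplug/replug cycle described above boils down to
      roughly this pattern (a sketch; error handling and the actual hypervisor
      requests are omitted):

        /* unplug: take the subblock [pfn, pfn + nr_pages) out of the buddy */
        rc = alloc_contig_range(pfn, pfn + nr_pages,
                                MIGRATE_MOVABLE, GFP_KERNEL);
        if (rc)
                return rc;
        /* mark the pages PG_offline and report them unplugged */

        /* later, on replug: hand the subblock back to the buddy */
        free_contig_range(pfn, nr_pages);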
      
      Acked-by: Michal Hocko <mhocko@suse.com> # to export contig range allocator API
      Tested-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: Jason Wang <jasowang@redhat.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Igor Mammedov <imammedo@redhat.com>
      Cc: Dave Young <dyoung@redhat.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Stefan Hajnoczi <stefanha@redhat.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Cc: Alexander Potapenko <glider@google.com>
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Link: https://lore.kernel.org/r/20200507140139.17083-6-david@redhat.com
      Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
      255f5985
  13. 04 Jun 2020 (10 commits)
    • mm/vmscan.c: change prototype for shrink_page_list · 730ec8c0
      Committed by Maninder Singh
      Commit 3c710c1a ("mm, vmscan extract shrink_page_list reclaim counters
      into a struct") changed the data type used by the function, so change
      the return type of the function and its caller to match.
      Signed-off-by: Vaneet Narang <v.narang@samsung.com>
      Signed-off-by: Maninder Singh <maninder1.s@samsung.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Amit Sahrawat <a.sahrawat@samsung.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Link: http://lkml.kernel.org/r/1588168259-25604-1-git-send-email-maninder1.s@samsung.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      730ec8c0
    • mm: make deferred init's max threads arch-specific · ecd09650
      Committed by Daniel Jordan
      Using padata during deferred init has only been tested on x86, so for now
      limit it to this architecture.
      
      If another arch wants this, it can find the max thread limit that's best
      for it and override deferred_page_init_max_threads().
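
      The arch hook is a weak function; the sketch below captures the idea of
      a conservative generic default with x86 overriding it to use every CPU
      on the node (quoted from memory, so treat the details as approximate):

        /* generic default: deferred init stays single-threaded */
        __weak int __init
        deferred_page_init_max_threads(const struct cpumask *node_cpumask)
        {
                return 1;
        }

        /* arch/x86 override: one thread per CPU local to the node */
        int __init
        deferred_page_init_max_threads(const struct cpumask *node_cpumask)
        {
                return max_t(int, cpumask_weight(node_cpumask), 1);
        }
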
      Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Tested-by: Josh Triplett <josh@joshtriplett.org>
      Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Cc: Alex Williamson <alex.williamson@redhat.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Herbert Xu <herbert@gondor.apana.org.au>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Pavel Machek <pavel@ucw.cz>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Robert Elliott <elliott@hpe.com>
      Cc: Shile Zhang <shile.zhang@linux.alibaba.com>
      Cc: Steffen Klassert <steffen.klassert@secunet.com>
      Cc: Steven Sistare <steven.sistare@oracle.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Zi Yan <ziy@nvidia.com>
      Link: http://lkml.kernel.org/r/20200527173608.2885243-8-daniel.m.jordan@oracle.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ecd09650
    • mm: parallelize deferred_init_memmap() · e4443149
      Committed by Daniel Jordan
      Deferred struct page init is a significant bottleneck in kernel boot.
      Optimizing it maximizes availability for large-memory systems and allows
      spinning up short-lived VMs as needed without having to leave them
      running.  It also benefits bare metal machines hosting VMs that are
      sensitive to downtime.  In projects such as VMM Fast Restart[1], where
      guest state is preserved across kexec reboot, it helps prevent application
      and network timeouts in the guests.
      
      Multithread to take full advantage of system memory bandwidth.
      
      The maximum number of threads is capped at the number of CPUs on the node
      because speedups always improve with additional threads on every system
      tested, and at this phase of boot, the system is otherwise idle and
      waiting on page init to finish.
      
      Helper threads operate on section-aligned ranges to both avoid false
      sharing when setting the pageblock's migrate type and to avoid accessing
      uninitialized buddy pages, though max order alignment is enough for the
      latter.
      
      The minimum chunk size is also a section.  There was benefit to using
      multiple threads even on relatively small memory (1G) systems, and this is
      the smallest size that the alignment allows.
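
      The threading itself is delegated to padata; schematically, the
      per-node work is described by a padata_mt_job and handed off roughly
      like this (a sketch, with the surrounding loop and variable setup
      omitted):

        struct padata_mt_job job = {
                .thread_fn   = deferred_init_memmap_chunk,
                .fn_arg      = zone,
                .start       = spfn,
                .size        = epfn - spfn,
                .align       = PAGES_PER_SECTION, /* section-aligned ranges */
                .min_chunk   = PAGES_PER_SECTION, /* minimum chunk: a section */
                .max_threads = max_threads,       /* capped at CPUs on the node */
        };

        padata_do_multithreaded(&job);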
      
      The time (milliseconds) is the slowest node to initialize since boot
      blocks until all nodes finish.  intel_pstate is loaded in active mode
      without hwp and with turbo enabled, and intel_idle is active as well.
      
          Intel(R) Xeon(R) Platinum 8167M CPU @ 2.00GHz (Skylake, bare metal)
            2 nodes * 26 cores * 2 threads = 104 CPUs
            384G/node = 768G memory
      
                         kernel boot                 deferred init
                         ------------------------    ------------------------
          node% (thr)    speedup  time_ms (stdev)    speedup  time_ms (stdev)
                (  0)         --   4089.7 (  8.1)         --   1785.7 (  7.6)
             2% (  1)       1.7%   4019.3 (  1.5)       3.8%   1717.7 ( 11.8)
            12% (  6)      34.9%   2662.7 (  2.9)      79.9%    359.3 (  0.6)
            25% ( 13)      39.9%   2459.0 (  3.6)      91.2%    157.0 (  0.0)
            37% ( 19)      39.2%   2485.0 ( 29.7)      90.4%    172.0 ( 28.6)
            50% ( 26)      39.3%   2482.7 ( 25.7)      90.3%    173.7 ( 30.0)
            75% ( 39)      39.0%   2495.7 (  5.5)      89.4%    190.0 (  1.0)
           100% ( 52)      40.2%   2443.7 (  3.8)      92.3%    138.0 (  1.0)
      
          Intel(R) Xeon(R) CPU E5-2699C v4 @ 2.20GHz (Broadwell, kvm guest)
            1 node * 16 cores * 2 threads = 32 CPUs
            192G/node = 192G memory
      
                         kernel boot                 deferred init
                         ------------------------    ------------------------
          node% (thr)    speedup  time_ms (stdev)    speedup  time_ms (stdev)
                (  0)         --   1988.7 (  9.6)         --   1096.0 ( 11.5)
             3% (  1)       1.1%   1967.0 ( 17.6)       0.3%   1092.7 ( 11.0)
            12% (  4)      41.1%   1170.3 ( 14.2)      73.8%    287.0 (  3.6)
            25% (  8)      47.1%   1052.7 ( 21.9)      83.9%    177.0 ( 13.5)
            38% ( 12)      48.9%   1016.3 ( 12.1)      86.8%    144.7 (  1.5)
            50% ( 16)      48.9%   1015.7 (  8.1)      87.8%    134.0 (  4.4)
            75% ( 24)      49.1%   1012.3 (  3.1)      88.1%    130.3 (  2.3)
           100% ( 32)      49.5%   1004.0 (  5.3)      88.5%    125.7 (  2.1)
      
          Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz (Haswell, bare metal)
            2 nodes * 18 cores * 2 threads = 72 CPUs
            128G/node = 256G memory
      
                         kernel boot                 deferred init
                         ------------------------    ------------------------
          node% (thr)    speedup  time_ms (stdev)    speedup  time_ms (stdev)
                (  0)         --   1680.0 (  4.6)         --    627.0 (  4.0)
             3% (  1)       0.3%   1675.7 (  4.5)      -0.2%    628.0 (  3.6)
            11% (  4)      25.6%   1250.7 (  2.1)      67.9%    201.0 (  0.0)
            25% (  9)      30.7%   1164.0 ( 17.3)      81.8%    114.3 ( 17.7)
            36% ( 13)      31.4%   1152.7 ( 10.8)      84.0%    100.3 ( 17.9)
            50% ( 18)      31.5%   1150.7 (  9.3)      83.9%    101.0 ( 14.1)
            75% ( 27)      31.7%   1148.0 (  5.6)      84.5%     97.3 (  6.4)
           100% ( 36)      32.0%   1142.3 (  4.0)      85.6%     90.0 (  1.0)
      
          AMD EPYC 7551 32-Core Processor (Zen, kvm guest)
            1 node * 8 cores * 2 threads = 16 CPUs
            64G/node = 64G memory
      
                         kernel boot                 deferred init
                         ------------------------    ------------------------
          node% (thr)    speedup  time_ms (stdev)    speedup  time_ms (stdev)
                (  0)         --   1029.3 ( 25.1)         --    240.7 (  1.5)
             6% (  1)      -0.6%   1036.0 (  7.8)      -2.2%    246.0 (  0.0)
            12% (  2)      11.8%    907.7 (  8.6)      44.7%    133.0 (  1.0)
            25% (  4)      13.9%    886.0 ( 10.6)      62.6%     90.0 (  6.0)
            38% (  6)      17.8%    845.7 ( 14.2)      69.1%     74.3 (  3.8)
            50% (  8)      16.8%    856.0 ( 22.1)      72.9%     65.3 (  5.7)
            75% ( 12)      15.4%    871.0 ( 29.2)      79.8%     48.7 (  7.4)
           100% ( 16)      21.0%    813.7 ( 21.0)      80.5%     47.0 (  5.2)
      
      Server-oriented distros that enable deferred page init sometimes run in
      small VMs, and they still benefit even though the fraction of boot time
      saved is smaller:
      
          AMD EPYC 7551 32-Core Processor (Zen, kvm guest)
            1 node * 2 cores * 2 threads = 4 CPUs
            16G/node = 16G memory
      
                         kernel boot                 deferred init
                         ------------------------    ------------------------
          node% (thr)    speedup  time_ms (stdev)    speedup  time_ms (stdev)
                (  0)         --    716.0 ( 14.0)         --     49.7 (  0.6)
            25% (  1)       1.8%    703.0 (  5.3)      -4.0%     51.7 (  0.6)
            50% (  2)       1.6%    704.7 (  1.2)      43.0%     28.3 (  0.6)
            75% (  3)       2.7%    696.7 ( 13.1)      49.7%     25.0 (  0.0)
           100% (  4)       4.1%    687.0 ( 10.4)      55.7%     22.0 (  0.0)
      
          Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz (Haswell, kvm guest)
            1 node * 2 cores * 2 threads = 4 CPUs
            14G/node = 14G memory
      
                         kernel boot                 deferred init
                         ------------------------    ------------------------
          node% (thr)    speedup  time_ms (stdev)    speedup  time_ms (stdev)
                (  0)         --    787.7 (  6.4)         --    122.3 (  0.6)
            25% (  1)       0.2%    786.3 ( 10.8)      -2.5%    125.3 (  2.1)
            50% (  2)       5.9%    741.0 ( 13.9)      37.6%     76.3 ( 19.7)
            75% (  3)       8.3%    722.0 ( 19.0)      49.9%     61.3 (  3.2)
           100% (  4)       9.3%    714.7 (  9.5)      56.4%     53.3 (  1.5)
      
      On Josh's 96-CPU and 192G memory system:
      
          Without this patch series:
          [    0.487132] node 0 initialised, 23398907 pages in 292ms
          [    0.499132] node 1 initialised, 24189223 pages in 304ms
          ...
          [    0.629376] Run /sbin/init as init process
      
          With this patch series:
          [    0.231435] node 1 initialised, 24189223 pages in 32ms
          [    0.236718] node 0 initialised, 23398907 pages in 36ms
      
      [1] https://static.sched.com/hosted_files/kvmforum2019/66/VMM-fast-restart_kvmforum2019.pdf
      Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Tested-by: Josh Triplett <josh@joshtriplett.org>
      Reviewed-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Cc: Alex Williamson <alex.williamson@redhat.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Herbert Xu <herbert@gondor.apana.org.au>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Pavel Machek <pavel@ucw.cz>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Robert Elliott <elliott@hpe.com>
      Cc: Shile Zhang <shile.zhang@linux.alibaba.com>
      Cc: Steffen Klassert <steffen.klassert@secunet.com>
      Cc: Steven Sistare <steven.sistare@oracle.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Zi Yan <ziy@nvidia.com>
      Link: http://lkml.kernel.org/r/20200527173608.2885243-7-daniel.m.jordan@oracle.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e4443149
    • mm: don't track number of pages during deferred initialization · 89c7c402
      Committed by Daniel Jordan
      Deferred page init used to report the number of pages initialized:
      
        node 0 initialised, 32439114 pages in 97ms
      
      Tracking this makes the code more complicated when using multiple threads.
      Given that the statistic probably has limited value, especially since a
      zone grows on demand so that the page count can vary, just remove it.
      
      The boot message now looks like
      
        node 0 deferred pages initialised in 97ms
      Suggested-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Cc: Alex Williamson <alex.williamson@redhat.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Herbert Xu <herbert@gondor.apana.org.au>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Josh Triplett <josh@joshtriplett.org>
      Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Pavel Machek <pavel@ucw.cz>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Robert Elliott <elliott@hpe.com>
      Cc: Shile Zhang <shile.zhang@linux.alibaba.com>
      Cc: Steffen Klassert <steffen.klassert@secunet.com>
      Cc: Steven Sistare <steven.sistare@oracle.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Zi Yan <ziy@nvidia.com>
      Link: http://lkml.kernel.org/r/20200527173608.2885243-6-daniel.m.jordan@oracle.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      89c7c402
    • mm: call cond_resched() from deferred_init_memmap() · da97f2d5
      Committed by Pavel Tatashin
      Now that deferred pages are initialized with interrupts enabled we can
      replace touch_nmi_watchdog() with cond_resched(), as it was before
      3a2d7fa8.
      
      For now, we cannot do the same in deferred_grow_zone() as it still
      initializes pages with interrupts disabled.
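
      In the deferred_init_memmap() init loop the change is simply (sketch):

        while (spfn < epfn) {
                nr_pages += deferred_init_maxorder(&i, zone, &spfn, &epfn);
                cond_resched();         /* was: touch_nmi_watchdog() */
        }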
      
      This change fixes the RCU problem described in
      https://lkml.kernel.org/r/20200401104156.11564-2-david@redhat.com
      
      [   60.474005] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
      [   60.475000] rcu:  1-...0: (0 ticks this GP) idle=02a/1/0x4000000000000000 softirq=1/1 fqs=15000
      [   60.475000] rcu:  (detected by 0, t=60002 jiffies, g=-1199, q=1)
      [   60.475000] Sending NMI from CPU 0 to CPUs 1:
      [    1.760091] NMI backtrace for cpu 1
      [    1.760091] CPU: 1 PID: 20 Comm: pgdatinit0 Not tainted 4.18.0-147.9.1.el8_1.x86_64 #1
      [    1.760091] Hardware name: Red Hat KVM, BIOS 1.13.0-1.module+el8.2.0+5520+4e5817f3 04/01/2014
      [    1.760091] RIP: 0010:__init_single_page.isra.65+0x10/0x4f
      [    1.760091] Code: 48 83 cf 63 48 89 f8 0f 1f 40 00 48 89 c6 48 89 d7 e8 6b 18 80 ff 66 90 5b c3 31 c0 b9 10 00 00 00 49 89 f8 48 c1 e6 33 f3 ab <b8> 07 00 00 00 48 c1 e2 36 41 c7 40 34 01 00 00 00 48 c1 e0 33 41
      [    1.760091] RSP: 0000:ffffba783123be40 EFLAGS: 00000006
      [    1.760091] RAX: 0000000000000000 RBX: fffffad34405e300 RCX: 0000000000000000
      [    1.760091] RDX: 0000000000000000 RSI: 0010000000000000 RDI: fffffad34405e340
      [    1.760091] RBP: 0000000033f3177e R08: fffffad34405e300 R09: 0000000000000002
      [    1.760091] R10: 000000000000002b R11: ffff98afb691a500 R12: 0000000000000002
      [    1.760091] R13: 0000000000000000 R14: 000000003f03ea00 R15: 000000003e10178c
      [    1.760091] FS:  0000000000000000(0000) GS:ffff9c9ebeb00000(0000) knlGS:0000000000000000
      [    1.760091] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [    1.760091] CR2: 00000000ffffffff CR3: 000000a1cf20a001 CR4: 00000000003606e0
      [    1.760091] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [    1.760091] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [    1.760091] Call Trace:
      [    1.760091]  deferred_init_pages+0x8f/0xbf
      [    1.760091]  deferred_init_memmap+0x184/0x29d
      [    1.760091]  ? deferred_free_pages.isra.97+0xba/0xba
      [    1.760091]  kthread+0x112/0x130
      [    1.760091]  ? kthread_flush_work_fn+0x10/0x10
      [    1.760091]  ret_from_fork+0x35/0x40
      [   89.123011] node 0 initialised, 1055935372 pages in 88650ms
      
      Fixes: 3a2d7fa8 ("mm: disable interrupts while initializing deferred pages")
      Reported-by: Yiqian Wei <yiwei@redhat.com>
      Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Tested-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: James Morris <jmorris@namei.org>
      Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
      Cc: Sasha Levin <sashal@kernel.org>
      Cc: Shile Zhang <shile.zhang@linux.alibaba.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: <stable@vger.kernel.org>	[4.17+]
      Link: http://lkml.kernel.org/r/20200403140952.17177-4-pasha.tatashin@soleen.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      da97f2d5
    • mm: initialize deferred pages with interrupts enabled · 3d060856
      Committed by Pavel Tatashin
      Initializing struct pages is a long task and keeping interrupts disabled
      for the duration of this operation introduces a number of problems.
      
      1. jiffies are not updated for a long period of time, and thus
         incorrect time is reported. See proposed solution and discussion
         here: lkml/20200311123848.118638-1-shile.zhang@linux.alibaba.com
      2. It prevents further improving deferred page initialization by
         allowing intra-node multi-threading.
      
      We are keeping interrupts disabled to solve a rather theoretical
      problem that was never observed in the real world (see 3a2d7fa8).
      
      Let's keep interrupts enabled. In case we ever encounter a scenario
      where an interrupt thread wants to allocate a large amount of memory
      this early in boot, we can deal with that by growing the zone (see
      deferred_grow_zone()) by the needed amount before starting the
      deferred_init_memmap() threads.
      
      Before:
      [    1.232459] node 0 initialised, 12058412 pages in 1ms
      
      After:
      [    1.632580] node 0 initialised, 12051227 pages in 436ms
      
      Fixes: 3a2d7fa8 ("mm: disable interrupts while initializing deferred pages")
      Reported-by: Shile Zhang <shile.zhang@linux.alibaba.com>
      Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: James Morris <jmorris@namei.org>
      Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
      Cc: Sasha Levin <sashal@kernel.org>
      Cc: Yiqian Wei <yiwei@redhat.com>
      Cc: <stable@vger.kernel.org>	[4.17+]
      Link: http://lkml.kernel.org/r/20200403140952.17177-3-pasha.tatashin@soleen.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3d060856
    • mm/pagealloc.c: call touch_nmi_watchdog() on max order boundaries in deferred init · 117003c3
      Committed by Daniel Jordan
      Patch series "initialize deferred pages with interrupts enabled", v4.
      
      Keep interrupts enabled during deferred page initialization in order to
      make code more modular and allow jiffies to update.
      
      Original approach, and discussion can be found here:
       http://lkml.kernel.org/r/20200311123848.118638-1-shile.zhang@linux.alibaba.com
      
      This patch (of 3):
      
      deferred_init_memmap() disables interrupts the entire time, so it calls
      touch_nmi_watchdog() periodically to avoid soft lockup splats.  Soon it
      will run with interrupts enabled, at which point cond_resched() should be
      used instead.
      
      deferred_grow_zone() makes the same watchdog calls through code shared
      with deferred init but will continue to run with interrupts disabled, so
      it can't call cond_resched().
      
      Pull the watchdog calls up to these two places to allow the first to be
      changed later, independently of the second.  The frequency reduces from
      twice per pageblock (init and free) to once per max order block.
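
      Schematically, the watchdog call moves out of the shared init/free
      helpers and up into the two max-order loops named above (a sketch from
      memory):

        /* deferred_init_memmap(): will later run with interrupts enabled */
        while (spfn < epfn) {
                nr_pages += deferred_init_maxorder(&i, zone, &spfn, &epfn);
                touch_nmi_watchdog();
        }

        /* deferred_grow_zone(): keeps running with interrupts disabled */
        while (spfn < epfn && nr_pages < nr_pages_needed) {
                nr_pages += deferred_init_maxorder(&i, zone, &spfn, &epfn);
                touch_nmi_watchdog();
        }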
      
      Fixes: 3a2d7fa8 ("mm: disable interrupts while initializing deferred pages")
      Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
      Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Shile Zhang <shile.zhang@linux.alibaba.com>
      Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
      Cc: James Morris <jmorris@namei.org>
      Cc: Sasha Levin <sashal@kernel.org>
      Cc: Yiqian Wei <yiwei@redhat.com>
      Cc: <stable@vger.kernel.org>	[4.17+]
      Link: http://lkml.kernel.org/r/20200403140952.17177-2-pasha.tatashin@soleen.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      117003c3
    • mm/page_alloc: restrict and formalize compound_page_dtors[] · ae70eddd
      Committed by Anshuman Khandual
      Restrict elements in compound_page_dtors[] array per NR_COMPOUND_DTORS and
      explicitly position them according to enum compound_dtor_id.  This
      improves protection against possible misalignment between
      compound_page_dtors[] and enum compound_dtor_id later on.
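
      The resulting array is sized by NR_COMPOUND_DTORS and uses designated
      initializers keyed by enum compound_dtor_id, roughly as below (a sketch
      quoted from memory):

        compound_page_dtor * const compound_page_dtors[NR_COMPOUND_DTORS] = {
                [NULL_COMPOUND_DTOR] = NULL,
                [COMPOUND_PAGE_DTOR] = free_compound_page,
        #ifdef CONFIG_HUGETLB_PAGE
                [HUGETLB_PAGE_DTOR] = free_huge_page,
        #endif
        #ifdef CONFIG_TRANSPARENT_HUGEPAGE
                [TRANSPARENT_HUGEPAGE_DTOR] = free_transhuge_page,
        #endif
        };
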
      Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Link: http://lkml.kernel.org/r/1589795958-19317-1-git-send-email-anshuman.khandual@arm.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ae70eddd
    • mm, page_alloc: reset the zone->watermark_boost early · aa092591
      Committed by Charan Teja Reddy
      Updating the zone watermarks by any means, like min_free_kbytes,
      watermark_scale_factor, etc., when ->watermark_boost is set will result
      in higher low and high watermarks than the user asked for.
      
      Below are the steps to reproduce the problem on system setup of Android
      kernel running on Snapdragon hardware.
      
      1) Default settings of the system are as below:
      
         #cat /proc/sys/vm/min_free_kbytes = 5162
         #cat /proc/zoneinfo | grep -e boost -e low -e "high " -e min -e Node
      	Node 0, zone   Normal
      		min      797
      		low      8340
      		high     8539
      
      2) Monitor zone->watermark_boost (by adding a debug print in the
         kernel) and, whenever it is greater than zero, write the same value
         of min_free_kbytes obtained in step 1.
      
         #echo 5162 > /proc/sys/vm/min_free_kbytes
      
      3) Then read the zone watermarks in the system while
         ->watermark_boost is zero.  This should show the same watermark
         values as step 1, but higher values than asked are shown.
      
         #cat /proc/zoneinfo | grep -e boost -e low -e "high " -e min -e Node
      	Node 0, zone   Normal
      		min      797
      		low      21148
      		high     21347
      
      These higher values are due to updating the zone watermarks using the
      macro min_wmark_pages(zone), which also adds zone->watermark_boost.
      
      	#define min_wmark_pages(z) (z->_watermark[WMARK_MIN] +
      					z->watermark_boost)
      
      So the steps that lead to the issue are:
      
      1) On the extfrag event, watermarks are boosted by storing the required
         value in ->watermark_boost.
      
      2) User tries to update the zone watermarks level in the system through
         min_free_kbytes or watermark_scale_factor.
      
      3) Later, when kswapd woke up, it resets the zone->watermark_boost to
         zero.
      
      In step 2), we use the min_wmark_pages() macro to store the watermarks
      in the zone structure, so the values are always offset by the
      ->watermark_boost value. This can be avoided by resetting the
      ->watermark_boost to zero before it is used.
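
      That is, __setup_per_zone_wmarks() clears any leftover boost before the
      low and high marks are derived from min_wmark_pages(), roughly (a
      sketch from memory):

        zone->watermark_boost = 0;
        zone->_watermark[WMARK_LOW]  = min_wmark_pages(zone) + tmp;
        zone->_watermark[WMARK_HIGH] = min_wmark_pages(zone) + tmp * 2;
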
      Signed-off-by: Charan Teja Reddy <charante@codeaurora.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Baoquan He <bhe@redhat.com>
      Cc: Vinayak Menon <vinmenon@codeaurora.org>
      Link: http://lkml.kernel.org/r/1589457511-4255-1-git-send-email-charante@codeaurora.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      aa092591