1. 30 June 2021 (12 commits)
    • M
      mm: replace CONFIG_FLAT_NODE_MEM_MAP with CONFIG_FLATMEM · 43b02ba9
Committed by Mike Rapoport
      After removal of the DISCONTIGMEM memory model the FLAT_NODE_MEM_MAP
      configuration option is equivalent to FLATMEM.
      
      Drop CONFIG_FLAT_NODE_MEM_MAP and use CONFIG_FLATMEM instead.
      
Link: https://lkml.kernel.org/r/20210608091316.3622-10-rppt@kernel.org
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Vineet Gupta <vgupta@synopsys.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      43b02ba9
    • M
      mm: replace CONFIG_NEED_MULTIPLE_NODES with CONFIG_NUMA · a9ee6cf5
Committed by Mike Rapoport
After removal of DISCONTIGMEM the NEED_MULTIPLE_NODES and NUMA
      configuration options are equivalent.
      
      Drop CONFIG_NEED_MULTIPLE_NODES and use CONFIG_NUMA instead.
      
      Done with
      
      	$ sed -i 's/CONFIG_NEED_MULTIPLE_NODES/CONFIG_NUMA/' \
      		$(git grep -wl CONFIG_NEED_MULTIPLE_NODES)
      	$ sed -i 's/NEED_MULTIPLE_NODES/NUMA/' \
      		$(git grep -wl NEED_MULTIPLE_NODES)
      
      with manual tweaks afterwards.
      
      [rppt@linux.ibm.com: fix arm boot crash]
        Link: https://lkml.kernel.org/r/YMj9vHhHOiCVN4BF@linux.ibm.com
      
Link: https://lkml.kernel.org/r/20210608091316.3622-9-rppt@kernel.org
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Vineet Gupta <vgupta@synopsys.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a9ee6cf5
    • M
      mm: remove CONFIG_DISCONTIGMEM · bb1c50d3
Committed by Mike Rapoport
      There are no architectures that support DISCONTIGMEM left.
      
      Remove the configuration option and the dead code it was guarding in the
      generic memory management code.
      
Link: https://lkml.kernel.org/r/20210608091316.3622-6-rppt@kernel.org
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Vineet Gupta <vgupta@synopsys.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bb1c50d3
    • D
      mm: drop SECTION_SHIFT in code comments · 777c00f5
Committed by Dong Aisheng
Actually it is SECTIONS_SHIFT that is used in the kernel code, so the code
comment is strictly incorrect.  And since commit bbeae5b0 ("mm: move page
flags layout to separate header"), the SECTIONS_SHIFT definition has been
moved to include/linux/page-flags-layout.h.  Since the code itself is quite
straightforward, instead of moving the code comment to the new place as
well, we simply remove it.

This also fixes a checkpatch complaint derived from the original code:
      WARNING: please, no space before tabs
      + * SECTIONS_SHIFT    ^I^I#bits space required to store a section #$
      
Link: https://lkml.kernel.org/r/20210531091908.1738465-2-aisheng.dong@nxp.com
Signed-off-by: Dong Aisheng <aisheng.dong@nxp.com>
Suggested-by: Yu Zhao <yuzhao@google.com>
Reviewed-by: Yu Zhao <yuzhao@google.com>
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Kees Cook <keescook@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      777c00f5
    • M
      mm/page_alloc: introduce vm.percpu_pagelist_high_fraction · 74f44822
Committed by Mel Gorman
      This introduces a new sysctl vm.percpu_pagelist_high_fraction.  It is
      similar to the old vm.percpu_pagelist_fraction.  The old sysctl increased
      both pcp->batch and pcp->high with the higher pcp->high potentially
      reducing zone->lock contention.  However, the higher pcp->batch value also
      potentially increased allocation latency while the PCP was refilled.  This
      sysctl only adjusts pcp->high so that zone->lock contention is potentially
      reduced but allocation latency during a PCP refill remains the same.
      
        # grep -E "high:|batch" /proc/zoneinfo | tail -2
                    high:  649
                    batch: 63
      
        # sysctl vm.percpu_pagelist_high_fraction=8
        # grep -E "high:|batch" /proc/zoneinfo | tail -2
                    high:  35071
                    batch: 63
      
        # sysctl vm.percpu_pagelist_high_fraction=64
                    high:  4383
                    batch: 63
      
        # sysctl vm.percpu_pagelist_high_fraction=0
                    high:  649
                    batch: 63
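
The exact formula lives in the patch itself, but the output above is
consistent with pcp->high being derived from the zone's managed pages
divided by the fraction and split across the CPUs local to the zone.  A
minimal userspace sketch of that relationship follows; the zone size and
CPU count are guesses chosen to reproduce the numbers above, not values
taken from the patch.

  #include <stdio.h>

  int main(void)
  {
          /* Illustrative inputs only: a roughly 16 GB zone split over
           * 15 local CPUs happens to reproduce the high values above. */
          unsigned long zone_managed_pages = 4208520;
          int nr_local_cpus = 15;
          int fractions[] = { 8, 64 };

          for (int i = 0; i < 2; i++) {
                  unsigned long high = zone_managed_pages / fractions[i]
                                       / nr_local_cpus;
                  printf("percpu_pagelist_high_fraction=%-2d -> high: %lu\n",
                         fractions[i], high);
          }
          return 0;
  }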
      
      [mgorman@techsingularity.net: fix documentation]
        Link: https://lkml.kernel.org/r/20210528151010.GQ30378@techsingularity.net
      
Link: https://lkml.kernel.org/r/20210525080119.5455-7-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      74f44822
    • M
      mm/page_alloc: limit the number of pages on PCP lists when reclaim is active · c49c2c47
Committed by Mel Gorman
When kswapd is active, direct reclaim is potentially active as well.  In
either case, it is possible that a zone would be balanced if pages were
not trapped on PCP lists.  Instead of draining remote pages, simply limit
the size of the PCP lists while kswapd is active.
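
A hedged sketch of the idea in plain C: while reclaim is active on the
zone, the effective high limit of the PCP list is capped at a small
multiple of the batch size so freed pages reach the buddy lists quickly.
The struct, the field names and the batch << 2 cap are illustrative
assumptions, not the exact mainline code.

  struct pcp_sketch {
          int high;       /* normal limit on pages held per CPU */
          int batch;      /* pages moved to/from the buddy lists at a time */
  };

  /* Limit actually applied when deciding whether to drain the list. */
  static int effective_pcp_high(const struct pcp_sketch *pcp, int reclaim_active)
  {
          int capped = pcp->batch << 2;

          if (!reclaim_active)
                  return pcp->high;
          /* Reclaim is running: keep only a few batches on the list. */
          return capped < pcp->high ? capped : pcp->high;
  }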
      
Link: https://lkml.kernel.org/r/20210525080119.5455-6-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c49c2c47
    • M
      mm/page_alloc: scale the number of pages that are batch freed · 3b12e7e9
Committed by Mel Gorman
When a task is freeing a large number of order-0 pages, it may acquire
the zone->lock multiple times, freeing pages in batches.  This may
unnecessarily contend on the zone lock when freeing a very large number
of pages.  This patch adapts the size of the batch based on the recent
pattern, scaling the batch size for subsequent frees.

As the machines I used were not large enough to illustrate the problem,
a debugging patch shows patterns like the following (slightly edited for
clarity)
      
      Baseline vanilla kernel
        time-unmap-14426   [...] free_pcppages_bulk: free   63 count  378 high  378
        time-unmap-14426   [...] free_pcppages_bulk: free   63 count  378 high  378
        time-unmap-14426   [...] free_pcppages_bulk: free   63 count  378 high  378
        time-unmap-14426   [...] free_pcppages_bulk: free   63 count  378 high  378
        time-unmap-14426   [...] free_pcppages_bulk: free   63 count  378 high  378
      
      With patches
        time-unmap-7724    [...] free_pcppages_bulk: free  126 count  814 high  814
        time-unmap-7724    [...] free_pcppages_bulk: free  252 count  814 high  814
        time-unmap-7724    [...] free_pcppages_bulk: free  504 count  814 high  814
        time-unmap-7724    [...] free_pcppages_bulk: free  751 count  814 high  814
        time-unmap-7724    [...] free_pcppages_bulk: free  751 count  814 high  814
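
The trace suggests that the per-drain free count starts at one batch and
doubles on each consecutive drain, clamped so that at least one batch
worth of pages stays behind (814 - 63 = 751).  The short standalone
sketch below reproduces that progression; free_factor and the clamp are
inferences from the trace, not the patch's exact code.

  #include <stdio.h>

  int main(void)
  {
          int batch = 63, high = 814;
          int free_factor = 1;    /* assumed to grow on back-to-back frees */

          for (int drain = 0; drain < 5; drain++) {
                  int nr = batch << free_factor;
                  int max_free = high - batch;    /* leave one batch behind */

                  if (nr > max_free)
                          nr = max_free;
                  printf("drain %d: free %4d count %d high %d\n",
                         drain, nr, high, high);
                  free_factor++;
          }
          return 0;
  }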
      
Link: https://lkml.kernel.org/r/20210525080119.5455-5-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3b12e7e9
    • M
      mm/page_alloc: delete vm.percpu_pagelist_fraction · bbbecb35
Committed by Mel Gorman
      Patch series "Calculate pcp->high based on zone sizes and active CPUs", v2.
      
The per-cpu page allocator (PCP) is meant to reduce contention on the zone
lock, but the sizing of batch and high is archaic and takes neither the
zone size nor the number of CPUs local to a zone into account.  With
larger zones and more CPUs per node, the contention is getting worse.
Furthermore, the fact that vm.percpu_pagelist_fraction adjusts both the
batch and high values means that the sysctl can reduce zone lock
contention but also increase allocation latencies.
      
      This series disassociates pcp->high from pcp->batch and then scales
      pcp->high based on the size of the local zone with limited impact to
      reclaim and accounting for active CPUs but leaves pcp->batch static.  It
      also adapts the number of pages that can be on the pcp list based on
      recent freeing patterns.
      
The motivation is partially to adjust to larger memory sizes, but it is
also driven by the fact that large batches of page freeing via
release_pages() often show zone lock contention as a major part of the
problem.  Another motivation is a bug report based on an older kernel
where a multi-terabyte process can take several minutes to exit.  A
workaround was to use vm.percpu_pagelist_fraction to increase the
pcp->high value, but testing indicated that a production workload could
not use the same values because of an increase in allocation latencies.
Unfortunately, I cannot reproduce this test case myself as the
multi-terabyte machines are in active use, but the series should
alleviate the problem.

The series aims to address both and partially acts as a prerequisite.
The PCP only works with order-0 pages, which is useless for SLUB (when
using high orders) and THP (unconditionally).  To store high-order pages
on the PCP, the pcp->high values need to be increased first.
      
      This patch (of 6):
      
      The vm.percpu_pagelist_fraction is used to increase the batch and high
      limits for the per-cpu page allocator (PCP).  The intent behind the sysctl
      is to reduce zone lock acquisition when allocating/freeing pages but it
      has a problem.  While it can decrease contention, it can also increase
      latency on the allocation side due to unreasonably large batch sizes.
      This leads to games where an administrator adjusts
      percpu_pagelist_fraction on the fly to work around contention and
      allocation latency problems.
      
      This series aims to alleviate the problems with zone lock contention while
      avoiding the allocation-side latency problems.  For the purposes of
      review, it's easier to remove this sysctl now and reintroduce a similar
      sysctl later in the series that deals only with pcp->high.
      
      Link: https://lkml.kernel.org/r/20210525080119.5455-1-mgorman@techsingularity.net
Link: https://lkml.kernel.org/r/20210525080119.5455-2-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bbbecb35
    • M
      mm/vmstat: convert NUMA statistics to basic NUMA counters · f19298b9
Committed by Mel Gorman
NUMA statistics are maintained at the zone level for hits, misses,
foreign allocations etc., but nothing relies on them being perfectly
accurate for functional correctness.  The counters are used by userspace
to get a general overview of a workload's NUMA behaviour, but the page
allocator incurs a high cost to maintain perfect accuracy similar to what
is required for a vmstat counter like NR_FREE_PAGES.  There is even a
sysctl, vm.numa_stat, that allows userspace to turn off the collection of
NUMA statistics like NUMA_HIT.
      
      This patch converts NUMA_HIT and friends to be NUMA events with similar
      accuracy to VM events.  There is a possibility that slight errors will be
      introduced but the overall trend as seen by userspace will be similar.
      The counters are no longer updated from vmstat_refresh context as it is
      unnecessary overhead for counters that may never be read by userspace.
      Note that counters could be maintained at the node level to save space but
      it would have a user-visible impact due to /proc/zoneinfo.
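
A small userspace sketch of the difference between an exact counter and
an event-style counter as described above: events are plain per-CPU
increments that are only summed when read, so the fast path needs no IRQ
disabling or atomics, at the cost of readers occasionally missing
in-flight updates.  The array size and names are illustrative only.

  #define NR_CPUS_SKETCH 4

  static unsigned long numa_hit_event[NR_CPUS_SKETCH];

  /* Fast path: a bare per-CPU increment, no lock and no irq-save. */
  static inline void count_numa_hit(int cpu)
  {
          numa_hit_event[cpu]++;
  }

  /* Slow path: readers fold the per-CPU values; concurrent increments
   * may be missed, which is fine for overview statistics. */
  static unsigned long read_numa_hit(void)
  {
          unsigned long sum = 0;

          for (int cpu = 0; cpu < NR_CPUS_SKETCH; cpu++)
                  sum += numa_hit_event[cpu];
          return sum;
  }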
      
      [lkp@intel.com: Fix misplaced closing brace for !CONFIG_NUMA]
      
Link: https://lkml.kernel.org/r/20210512095458.30632-4-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f19298b9
    • M
      mm/page_alloc: convert per-cpu list protection to local_lock · dbbee9d5
Committed by Mel Gorman
There is a lack of clarity about what exactly
local_irq_save/local_irq_restore protects in page_alloc.c.  It conflates
the protection of per-cpu page allocation structures with per-cpu vmstat
deltas.
      
      This patch protects the PCP structure using local_lock which for most
      configurations is identical to IRQ enabling/disabling.  The scope of the
      lock is still wider than it should be but this is decreased later.
      
      It is possible for the local_lock to be embedded safely within struct
      per_cpu_pages but it adds complexity to free_unref_page_list.
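
A schematic before/after of the change described above; it assumes
kernel context (<linux/local_lock.h>), and the struct name, field name
and pcp_remove_page() helper are placeholders rather than the patch's
exact code.

  struct pagesets {
          local_lock_t lock;
  };
  static DEFINE_PER_CPU(struct pagesets, pagesets) = {
          .lock = INIT_LOCAL_LOCK(lock),
  };

  static struct page *rmqueue_pcplist_sketch(struct per_cpu_pages *pcp,
                                             struct list_head *list)
  {
          struct page *page;
          unsigned long flags;

          /* Previously: local_irq_save(flags), which also implicitly
           * covered the vmstat deltas.  The local_lock names what is
           * protected and maps to a per-CPU spinlock on PREEMPT_RT
           * instead of disabling interrupts. */
          local_lock_irqsave(&pagesets.lock, flags);
          page = pcp_remove_page(pcp, list);      /* placeholder helper */
          local_unlock_irqrestore(&pagesets.lock, flags);

          return page;
  }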
      
      [akpm@linux-foundation.org: coding style fixes]
      [mgorman@techsingularity.net: work around a pahole limitation with zero-sized struct pagesets]
        Link: https://lkml.kernel.org/r/20210526080741.GW30378@techsingularity.net
      [lkp@intel.com: Make pagesets static]
      
Link: https://lkml.kernel.org/r/20210512095458.30632-3-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      dbbee9d5
    • M
      mm/page_alloc: split per cpu page lists and zone stats · 28f836b6
Committed by Mel Gorman
      The PCP (per-cpu page allocator in page_alloc.c) shares locking
      requirements with vmstat and the zone lock which is inconvenient and
      causes some issues.  For example, the PCP list and vmstat share the same
      per-cpu space meaning that it's possible that vmstat updates dirty cache
      lines holding per-cpu lists across CPUs unless padding is used.  Second,
      PREEMPT_RT does not want to disable IRQs for too long in the page
      allocator.
      
This series splits the locking requirements and uses lock types more
suitable for PREEMPT_RT, reduces the time when special locking is required
for stats, and reduces the time when IRQs need to be disabled on
!PREEMPT_RT kernels.
      
      Why local_lock?  PREEMPT_RT considers the following sequence to be unsafe
      as documented in Documentation/locking/locktypes.rst
      
         local_irq_disable();
         spin_lock(&lock);
      
      The pcp allocator has this sequence for rmqueue_pcplist (local_irq_save)
      -> __rmqueue_pcplist -> rmqueue_bulk (spin_lock).  While it's possible to
      separate this out, it generally means there are points where we enable
      IRQs and reenable them again immediately.  To prevent a migration and the
per-cpu pointer going stale, migrate_disable is also needed.  That is
effectively a custom lock that is similar to, but worse than, local_lock.
Furthermore, on
      PREEMPT_RT, it's undesirable to leave IRQs disabled for too long.  By
      converting to local_lock which disables migration on PREEMPT_RT, the
      locking requirements can be separated and start moving the protections for
      PCP, stats and the zone lock to PREEMPT_RT-safe equivalent locking.  As a
      bonus, local_lock also means that PROVE_LOCKING does something useful.
      
      After that, it's obvious that zone_statistics incurs too much overhead and
      leaves IRQs disabled for longer than necessary on !PREEMPT_RT kernels.
      zone_statistics uses perfectly accurate counters requiring IRQs be
      disabled for parallel RMW sequences when inaccurate ones like vm_events
      would do.  The series makes the NUMA statistics (NUMA_HIT and friends)
      inaccurate counters that then require no special protection on
      !PREEMPT_RT.
      
      The bulk page allocator can then do stat updates in bulk with IRQs enabled
      which should improve the efficiency.  Technically, this could have been
      done without the local_lock and vmstat conversion work and the order
      simply reflects the timing of when different series were implemented.
      
      Finally, there are places where we conflate IRQs being disabled for the
      PCP with the IRQ-safe zone spinlock.  The remainder of the series reduces
      the scope of what is protected by disabled IRQs on !PREEMPT_RT kernels.
      By the end of the series, page_alloc.c does not call local_irq_save so the
      locking scope is a bit clearer.  The one exception is that modifying
      NR_FREE_PAGES still happens in places where it's known the IRQs are
      disabled as it's harmless for PREEMPT_RT and would be expensive to split
      the locking there.
      
      No performance data is included because despite the overhead of the stats,
      it's within the noise for most workloads on !PREEMPT_RT.  However, Jesper
      Dangaard Brouer ran a page allocation microbenchmark on a E5-1650 v4 @
      3.60GHz CPU on the first version of this series.  Focusing on the array
      variant of the bulk page allocator reveals the following.
      
      (CPU: Intel(R) Xeon(R) CPU E5-1650 v4 @ 3.60GHz)
      ARRAY variant: time_bulk_page_alloc_free_array: step=bulk size
      
               Baseline        Patched
       1       56.383          54.225 (+3.83%)
       2       40.047          35.492 (+11.38%)
       3       37.339          32.643 (+12.58%)
       4       35.578          30.992 (+12.89%)
       8       33.592          29.606 (+11.87%)
       16      32.362          28.532 (+11.85%)
       32      31.476          27.728 (+11.91%)
       64      30.633          27.252 (+11.04%)
       128     30.596          27.090 (+11.46%)
      
      While this is a positive outcome, the series is more likely to be
      interesting to the RT people in terms of getting parts of the PREEMPT_RT
      tree into mainline.
      
      This patch (of 9):
      
      The per-cpu page allocator lists and the per-cpu vmstat deltas are stored
      in the same struct per_cpu_pages even though vmstats have no direct impact
      on the per-cpu page lists.  This is inconsistent because the vmstats for a
node are stored in a dedicated structure.  The bigger issue is that the
per_cpu_pages structure is not cache-aligned, and stat updates either
cache-conflict with adjacent per-cpu lists, incurring a runtime cost, or
require padding, incurring a memory cost.
      
      This patch splits the per-cpu pagelists and the vmstat deltas into
      separate structures.  It's mostly a mechanical conversion but some
      variable renaming is done to clearly distinguish the per-cpu pages
      structure (pcp) from the vmstats (pzstats).
      
      Superficially, this appears to increase the size of the per_cpu_pages
      structure but the movement of expire fills a structure hole so there is no
      impact overall.
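
Schematically, the split looks like the following; it assumes kernel
headers for struct list_head, the field lists are heavily simplified,
and the _sketch suffix marks the structures as illustrations rather than
the patch's definitions.

  /* The free-page lists stay in their own per-CPU structure ("pcp"). */
  struct per_cpu_pages_sketch {
          int count;                  /* pages currently on the lists */
          int high;                   /* drain to buddy above this */
          int batch;                  /* chunk size for buddy transfers */
          struct list_head lists[3];  /* one list per migratetype (simplified) */
  };

  /* The zone vmstat deltas move to a separate per-CPU structure
   * ("pzstats"), so stat updates no longer dirty the cache lines that
   * hold the page lists. */
  struct per_cpu_zonestat_sketch {
          signed char vm_stat_diff[8];  /* per-zone counter deltas (simplified) */
  };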
      
      [mgorman@techsingularity.net: make it W=1 cleaner]
        Link: https://lkml.kernel.org/r/20210514144622.GA3735@techsingularity.net
      [mgorman@techsingularity.net: make it W=1 even cleaner]
        Link: https://lkml.kernel.org/r/20210516140705.GB3735@techsingularity.net
      [lkp@intel.com: check struct per_cpu_zonestat has a non-zero size]
      [vbabka@suse.cz: Init zone->per_cpu_zonestats properly]
      
      Link: https://lkml.kernel.org/r/20210512095458.30632-1-mgorman@techsingularity.net
Link: https://lkml.kernel.org/r/20210512095458.30632-2-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      28f836b6
    • M
      mm/mmzone.h: simplify is_highmem_idx() · b19bd1c9
Committed by Mike Rapoport
      There is a lot of historical ifdefery in is_highmem_idx() and its helper
      zone_movable_is_highmem() that was required because of two different paths
      for nodes and zones initialization that were selected at compile time.
      
      Until commit 3f08a302 ("mm: remove CONFIG_HAVE_MEMBLOCK_NODE_MAP
      option") the movable_zone variable was only available for configurations
      that had CONFIG_HAVE_MEMBLOCK_NODE_MAP enabled so the test in
      zone_movable_is_highmem() used that variable only for such configurations.
For other configurations the test checked whether the index of ZONE_MOVABLE
was greater by 1 than the index of ZONE_HIGHMEM, in which case the movable
zone was considered a highmem zone.  Needless to say, ZONE_MOVABLE - 1 equals
      ZONE_HIGHMEM by definition when CONFIG_HIGHMEM=y.
      
      Commit 3f08a302 ("mm: remove CONFIG_HAVE_MEMBLOCK_NODE_MAP option")
      made movable_zone variable always available.  Since this variable is set
      to ZONE_HIGHMEM if CONFIG_HIGHMEM is enabled and highmem zone is
      populated, it is enough to check whether
      
	zone_idx == ZONE_MOVABLE && movable_zone == ZONE_HIGHMEM
      
      to test if zone index points to a highmem zone.
      
Remove zone_movable_is_highmem(), which is not used anywhere except in
is_highmem_idx(), and use the test above in is_highmem_idx() instead.
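
The resulting helper is then a sketch along these lines (the mainline
version may differ in minor details):

  static inline bool is_highmem_idx(enum zone_type idx)
  {
  #ifdef CONFIG_HIGHMEM
          return (idx == ZONE_HIGHMEM ||
                  (idx == ZONE_MOVABLE && movable_zone == ZONE_HIGHMEM));
  #else
          return false;
  #endif
  }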
      
Link: https://lkml.kernel.org/r/20210426141927.1314326-3-rppt@kernel.org
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b19bd1c9
2. 07 May 2021 (1 commit)
3. 06 May 2021 (3 commits)
    • O
      mm,memory_hotplug: allocate memmap from the added memory range · a08a2ae3
Committed by Oscar Salvador
      Physical memory hotadd has to allocate a memmap (struct page array) for
      the newly added memory section.  Currently, alloc_pages_node() is used
      for those allocations.
      
      This has some disadvantages:
a) existing memory is consumed for that purpose
          (eg: ~2MB per 128MB memory section on x86_64)
          This can even lead to extreme cases where system goes OOM because
          the physically hotplugged memory depletes the available memory before
          it is onlined.
       b) if the whole node is movable then we have off-node struct pages
          which has performance drawbacks.
c) It might be that there are no PMD_ALIGNED chunks, so the memmap array
   gets populated with base pages.
      
      This can be improved when CONFIG_SPARSEMEM_VMEMMAP is enabled.
      
Vmemmap page tables can map arbitrary memory.  That means that we can
      reserve a part of the physically hotadded memory to back vmemmap page
      tables.  This implementation uses the beginning of the hotplugged memory
      for that purpose.
      
There are some non-obvious things to consider though.
      
      Vmemmap pages are allocated/freed during the memory hotplug events
      (add_memory_resource(), try_remove_memory()) when the memory is
      added/removed.  This means that the reserved physical range is not
      online although it is used.  The most obvious side effect is that
      pfn_to_online_page() returns NULL for those pfns.  The current design
expects that this should be OK as the hotplugged memory is considered
garbage until it is onlined.  For example, hibernation wouldn't save the
content of those vmemmaps into the image, so it wouldn't be restored on
resume, but this should be OK as there is no real content to recover
anyway while the metadata is reachable from other data structures
(e.g. vmemmap page tables).
      
      The reserved space is therefore (de)initialized during the {on,off}line
      events (mhp_{de}init_memmap_on_memory).  That is done by extracting page
      allocator independent initialization from the regular onlining path.
      The primary reason to handle the reserved space outside of
      {on,off}line_pages is to make each initialization specific to the
      purpose rather than special case them in a single function.
      
      As per above, the functions that are introduced are:
      
       - mhp_init_memmap_on_memory:
         Initializes vmemmap pages by calling move_pfn_range_to_zone(), calls
         kasan_add_zero_shadow(), and onlines as many sections as vmemmap pages
         fully span.
      
       - mhp_deinit_memmap_on_memory:
         Offlines as many sections as vmemmap pages fully span, removes the
range from the zone by remove_pfn_range_from_zone(), and calls
         kasan_remove_zero_shadow() for the range.
      
      The new function memory_block_online() calls mhp_init_memmap_on_memory()
      before doing the actual online_pages().  Should online_pages() fail, we
      clean up by calling mhp_deinit_memmap_on_memory().  Adjusting of
      present_pages is done at the end once we know that online_pages()
succeeded.
      
      On offline, memory_block_offline() needs to unaccount vmemmap pages from
      present_pages() before calling offline_pages().  This is necessary because
offline_pages() tears down some structures based on whether the
node or the zone becomes empty.  If offline_pages() fails, we account back
      vmemmap pages.  If it succeeds, we call mhp_deinit_memmap_on_memory().
      
      Hot-remove:
      
       We need to be careful when removing memory, as adding and
       removing memory needs to be done with the same granularity.
       To check that this assumption is not violated, we check the
       memory range we want to remove and if a) any memory block has
       vmemmap pages and b) the range spans more than a single memory
       block, we scream out loud and refuse to proceed.
      
       If all is good and the range was using memmap on memory (aka vmemmap pages),
       we construct an altmap structure so free_hugepage_table does the right
       thing and calls vmem_altmap_free instead of free_pagetable.
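
Put together, the online path sketched from the description above looks
roughly like this; the signatures are simplified and
adjust_present_page_count_sketch() is a placeholder for the
present_pages accounting, so treat it as an illustration rather than the
patch itself.

  static int memory_block_online_sketch(unsigned long start_pfn,
                                        unsigned long nr_pages,
                                        unsigned long nr_vmemmap_pages)
  {
          int ret = 0;

          /* Initialize the self-hosted vmemmap pages first. */
          if (nr_vmemmap_pages)
                  ret = mhp_init_memmap_on_memory(start_pfn, nr_vmemmap_pages);
          if (ret)
                  return ret;

          /* Online the rest of the block. */
          ret = online_pages(start_pfn + nr_vmemmap_pages,
                             nr_pages - nr_vmemmap_pages);
          if (ret) {
                  /* Roll back the vmemmap initialization on failure. */
                  if (nr_vmemmap_pages)
                          mhp_deinit_memmap_on_memory(start_pfn,
                                                      nr_vmemmap_pages);
                  return ret;
          }

          /* Adjust present_pages only once online_pages() has succeeded. */
          adjust_present_page_count_sketch(nr_pages);
          return 0;
  }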
      
Link: https://lkml.kernel.org/r/20210421102701.25051-5-osalvador@suse.de
Signed-off-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: David Hildenbrand <david@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a08a2ae3
    • P
      mm/gup: migrate pinned pages out of movable zone · d1e153fe
Committed by Pavel Tatashin
We should not pin pages in ZONE_MOVABLE.  Currently, only movable CMA
pages are exempted from pinning.  Generalize the function that migrates
CMA pages so that it migrates all movable pages, and use
is_pinnable_page() to check which pages need to be migrated.
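
A hedged sketch of the generalized check: rather than singling out CMA
pages, anything that is not pinnable (movable or CMA) is collected for
migration before the long-term pin is taken.  is_pinnable_page() is
named in the text; isolate_for_migration_sketch() is a placeholder for
the isolation step.

  static long collect_unpinnable_pages(struct page **pages, long nr_pages,
                                       struct list_head *movable_list)
  {
          long i, nr_to_migrate = 0;

          for (i = 0; i < nr_pages; i++) {
                  /* Pages outside ZONE_MOVABLE/CMA may be pinned in place. */
                  if (is_pinnable_page(pages[i]))
                          continue;
                  /* Queue movable/CMA pages to be migrated out before
                   * the longterm pin is taken. */
                  if (isolate_for_migration_sketch(pages[i], movable_list) == 0)
                          nr_to_migrate++;
          }
          return nr_to_migrate;
  }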
      
Link: https://lkml.kernel.org/r/20210215161349.246722-10-pasha.tatashin@soleen.com
Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: James Morris <jmorris@namei.org>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Sasha Levin <sashal@kernel.org>
      Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Cc: Tyler Hicks <tyhicks@linux.microsoft.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d1e153fe
    • P
      mm/gup: do not migrate zero page · 9afaf30f
Committed by Pavel Tatashin
On some platforms ZERO_PAGE(0) might end up in a movable zone.  Do not
migrate the zero page in gup during longterm pinning, as migration of the
zero page is not allowed.
      
      For example, in x86 QEMU with 16G of memory and kernelcore=5G parameter, I
      see the following:
      
      Boot#1: zero_pfn  0x48a8d zero_pfn zone: ZONE_DMA32
      Boot#2: zero_pfn 0x20168d zero_pfn zone: ZONE_MOVABLE
      
      On x86, empty_zero_page is declared in .bss and depending on the loader
      may end up in different physical locations during boots.
      
Also, move the is_zero_pfn() and my_zero_pfn() functions under CONFIG_MMU,
because the zero_pfn they use is declared in memory.c, which is compiled
with CONFIG_MMU.
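
A minimal sketch of the resulting rule during long-term pinning; the
helper name and its placement in the GUP path are made up for
illustration.

  /* Sketch only: decide whether a page must be migrated away before a
   * longterm pin; the shared zero page is left alone even if it sits in
   * a movable zone, because it cannot be migrated. */
  static bool needs_migration_before_pin(struct page *page)
  {
          if (is_zero_pfn(page_to_pfn(page)))
                  return false;
          return !is_pinnable_page(page);
  }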
      
Link: https://lkml.kernel.org/r/20210215161349.246722-9-pasha.tatashin@soleen.com
Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: James Morris <jmorris@namei.org>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Sasha Levin <sashal@kernel.org>
      Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Cc: Tyler Hicks <tyhicks@linux.microsoft.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9afaf30f
4. 01 May 2021 (1 commit)
5. 27 February 2021 (2 commits)
6. 25 February 2021 (7 commits)
    • Y
      mm/vmscan.c: make lruvec_lru_size() static · 2091339d
Committed by Yu Zhao
      All other references to the function were removed after
      commit b910718a ("mm: vmscan: detect file thrashing at the reclaim
      root").
      
      Link: https://lore.kernel.org/linux-mm/20201207220949.830352-11-yuzhao@google.com/
Link: https://lkml.kernel.org/r/20210122220600.906146-11-yuzhao@google.com
Signed-off-by: Yu Zhao <yuzhao@google.com>
Reviewed-by: Alex Shi <alex.shi@linux.alibaba.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2091339d
    • S
      mm: memcg: add swapcache stat for memcg v2 · b6038942
Committed by Shakeel Butt
This patch adds a swapcache stat for cgroup v2.  The swapcache
represents the memory that is accounted against both the memory and the
swap limit of the cgroup.  The main motivation behind exposing the
swapcache stat is to enable users to gracefully migrate from cgroup
v1's memsw counter to cgroup v2's memory and swap counters.
      
      Cgroup v1's memsw limit allows users to limit the memory+swap usage of a
      workload but without control on the exact proportion of memory and swap.
      Cgroup v2 provides separate limits for memory and swap which enables more
      control on the exact usage of memory and swap individually for the
      workload.
      
With some small subtleties, v1's memsw limit can be replaced with the sum
of v2's memory and swap limits.  However, an alternative for memsw usage
is not yet available in cgroup v2.  Exposing a per-cgroup swapcache stat
enables that alternative: adding the memory usage and the swap usage and
subtracting the swapcache approximates the memsw usage.  This will help
in the transparent migration of workloads that depend on memsw usage and
limits to v2's memory and swap counters.
      
The reasons these applications are still interested in this approximate
memsw usage are: (1) these applications are not really interested in two
separate memory and swap usage metrics; a single usage metric is simpler
for them to use and reason about.
      
      (2) The memsw usage metric hides the underlying system's swap setup from
      the applications.  Applications with multiple instances running in a
      datacenter with heterogeneous systems (some have swap and some don't) will
      keep seeing a consistent view of their usage.
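
A small userspace sketch of the approximation described above, computing
a v1-style memsw figure from v2 counters; the cgroup path is an example,
and "swapcache" is the memory.stat key this patch exposes.

  #include <stdio.h>
  #include <string.h>

  static unsigned long long read_counter(const char *path)
  {
          unsigned long long v = 0;
          FILE *f = fopen(path, "r");

          if (f) {
                  if (fscanf(f, "%llu", &v) != 1)
                          v = 0;
                  fclose(f);
          }
          return v;
  }

  int main(void)
  {
          /* Example cgroup path; adjust to the cgroup of interest. */
          const char *cg = "/sys/fs/cgroup/example";
          char path[256], key[64];
          unsigned long long mem, swap, val, swapcache = 0;
          FILE *f;

          snprintf(path, sizeof(path), "%s/memory.current", cg);
          mem = read_counter(path);
          snprintf(path, sizeof(path), "%s/memory.swap.current", cg);
          swap = read_counter(path);

          snprintf(path, sizeof(path), "%s/memory.stat", cg);
          f = fopen(path, "r");
          while (f && fscanf(f, "%63s %llu", key, &val) == 2) {
                  if (!strcmp(key, "swapcache"))
                          swapcache = val;
          }
          if (f)
                  fclose(f);

          /* Approximate the old memsw usage as described above. */
          printf("approx memsw: %llu bytes\n", mem + swap - swapcache);
          return 0;
  }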
      
      [akpm@linux-foundation.org: fix CONFIG_SWAP=n build]
      
Link: https://lkml.kernel.org/r/20210108155813.2914586-3-shakeelb@google.com
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Roman Gushchin <guro@fb.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Yang Shi <shy828301@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b6038942
    • M
      mm: memcontrol: convert NR_FILE_PMDMAPPED account to pages · 380780e7
Committed by Muchun Song
Currently we use struct per_cpu_nodestat to cache the vmstat counters,
which leads to inaccurate statistics, especially for the THP vmstat
counters.  On systems with hundreds of processors the error can amount
to GBs of memory.  For example, on a 96-CPU system the threshold is
capped at the maximum of 125, and the per-cpu counters can then cache
23.4375 GB in total.

A THP is already a form of batched addition (it adds 512 pages' worth of
memory in one go), so skipping the batching seems sensible.  Although
every THP stat update then overflows the per-cpu counter and resorts to
an atomic global update, it makes the statistics more accurate for the
THP vmstat counters.

So we convert the NR_FILE_PMDMAPPED account to pages.  This patch is
consistent with 8f182270 ("mm/swap.c: flush lru pvecs on compound page
arrival").  Doing this also makes the units of the vmstat counters more
uniform.  In the end, the vmstat counters are in pages, kB, or bytes: the
B/KB suffix tells us that the unit is bytes or kB, and the rest, without
a suffix, are in pages.
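
Schematically, the conversion means the counter is bumped by the number
of base pages in the THP instead of by one THP, so readers no longer
multiply by HPAGE_PMD_NR.  The function below is an illustration that
assumes kernel context; the exact helper and call sites differ in the
patch.

  static void account_file_pmdmapped_sketch(struct page *head, bool mapped)
  {
          /* 512 base pages per PMD-mapped THP on x86-64. */
          int nr_pages = thp_nr_pages(head);

          /* Previously this added/removed 1 per THP; the counter is now
           * kept directly in base pages. */
          __mod_lruvec_page_state(head, NR_FILE_PMDMAPPED,
                                  mapped ? nr_pages : -nr_pages);
  }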
      
Link: https://lkml.kernel.org/r/20201228164110.2838-7-songmuchun@bytedance.com
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Feng Tang <feng.tang@intel.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: NeilBrown <neilb@suse.de>
      Cc: Pankaj Gupta <pankaj.gupta@cloud.ionos.com>
      Cc: Rafael. J. Wysocki <rafael@kernel.org>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Sami Tolvanen <samitolvanen@google.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      380780e7
    • M
      mm: memcontrol: convert NR_SHMEM_PMDMAPPED account to pages · a1528e21
Committed by Muchun Song
Currently we use struct per_cpu_nodestat to cache the vmstat counters,
which leads to inaccurate statistics, especially for the THP vmstat
counters.  On systems with hundreds of processors the error can amount
to GBs of memory.  For example, on a 96-CPU system the threshold is
capped at the maximum of 125, and the per-cpu counters can then cache
23.4375 GB in total.

A THP is already a form of batched addition (it adds 512 pages' worth of
memory in one go), so skipping the batching seems sensible.  Although
every THP stat update then overflows the per-cpu counter and resorts to
an atomic global update, it makes the statistics more accurate for the
THP vmstat counters.

So we convert the NR_SHMEM_PMDMAPPED account to pages.  This patch is
consistent with 8f182270 ("mm/swap.c: flush lru pvecs on compound page
arrival").  Doing this also makes the units of the vmstat counters more
uniform.  In the end, the vmstat counters are in pages, kB, or bytes: the
B/KB suffix tells us that the unit is bytes or kB, and the rest, without
a suffix, are in pages.
      
Link: https://lkml.kernel.org/r/20201228164110.2838-6-songmuchun@bytedance.com
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Feng Tang <feng.tang@intel.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: NeilBrown <neilb@suse.de>
      Cc: Pankaj Gupta <pankaj.gupta@cloud.ionos.com>
      Cc: Rafael. J. Wysocki <rafael@kernel.org>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Sami Tolvanen <samitolvanen@google.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a1528e21
    • M
      mm: memcontrol: convert NR_SHMEM_THPS account to pages · 57b2847d
Committed by Muchun Song
Currently we use struct per_cpu_nodestat to cache the vmstat counters,
which leads to inaccurate statistics, especially for the THP vmstat
counters.  On systems with hundreds of processors the error can amount
to GBs of memory.  For example, on a 96-CPU system the threshold is
capped at the maximum of 125, and the per-cpu counters can then cache
23.4375 GB in total.

A THP is already a form of batched addition (it adds 512 pages' worth of
memory in one go), so skipping the batching seems sensible.  Although
every THP stat update then overflows the per-cpu counter and resorts to
an atomic global update, it makes the statistics more accurate for the
THP vmstat counters.

So we convert the NR_SHMEM_THPS account to pages.  This patch is
consistent with 8f182270 ("mm/swap.c: flush lru pvecs on compound page
arrival").  Doing this also makes the units of the vmstat counters more
uniform.  In the end, the vmstat counters are in pages, kB, or bytes: the
B/KB suffix tells us that the unit is bytes or kB, and the rest, without
a suffix, are in pages.
      
Link: https://lkml.kernel.org/r/20201228164110.2838-5-songmuchun@bytedance.com
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Feng Tang <feng.tang@intel.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: NeilBrown <neilb@suse.de>
      Cc: Pankaj Gupta <pankaj.gupta@cloud.ionos.com>
      Cc: Rafael. J. Wysocki <rafael@kernel.org>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Sami Tolvanen <samitolvanen@google.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      57b2847d
    • M
      mm: memcontrol: convert NR_FILE_THPS account to pages · bf9ecead
Committed by Muchun Song
Currently we use struct per_cpu_nodestat to cache the vmstat counters,
which leads to inaccurate statistics, especially for the THP vmstat
counters.  On systems with hundreds of processors the error can amount
to GBs of memory.  For example, on a 96-CPU system the threshold is
capped at the maximum of 125, and the per-cpu counters can then cache
23.4375 GB in total.

A THP is already a form of batched addition (it adds 512 pages' worth of
memory in one go), so skipping the batching seems sensible.  Although
every THP stat update then overflows the per-cpu counter and resorts to
an atomic global update, it makes the statistics more accurate for the
THP vmstat counters.

So we convert the NR_FILE_THPS account to pages.  This patch is
consistent with 8f182270 ("mm/swap.c: flush lru pvecs on compound page
arrival").  Doing this also makes the units of the vmstat counters more
uniform.  In the end, the vmstat counters are in pages, kB, or bytes: the
B/KB suffix tells us that the unit is bytes or kB, and the rest, without
a suffix, are in pages.
      
Link: https://lkml.kernel.org/r/20201228164110.2838-4-songmuchun@bytedance.com
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Feng Tang <feng.tang@intel.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: NeilBrown <neilb@suse.de>
      Cc: Pankaj Gupta <pankaj.gupta@cloud.ionos.com>
      Cc: Rafael. J. Wysocki <rafael@kernel.org>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Sami Tolvanen <samitolvanen@google.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bf9ecead
    • M
      mm: memcontrol: convert NR_ANON_THPS account to pages · 69473e5d
Committed by Muchun Song
Currently we use struct per_cpu_nodestat to cache the vmstat counters,
which leads to inaccurate statistics, especially for the THP vmstat
counters.  On systems with hundreds of processors the error can amount
to GBs of memory.  For example, on a 96-CPU system the threshold is
capped at the maximum of 125, and the per-cpu counters can then cache
23.4375 GB in total.

A THP is already a form of batched addition (it adds 512 pages' worth of
memory in one go), so skipping the batching seems sensible.  Although
every THP stat update then overflows the per-cpu counter and resorts to
an atomic global update, it makes the statistics more accurate for the
THP vmstat counters.

So we convert the NR_ANON_THPS account to pages.  This patch is
consistent with 8f182270 ("mm/swap.c: flush lru pvecs on compound page
arrival").  Doing this also makes the units of the vmstat counters more
uniform.  In the end, the vmstat counters are in pages, kB, or bytes: the
B/KB suffix tells us that the unit is bytes or kB, and the rest, without
a suffix, are in pages.
      
Link: https://lkml.kernel.org/r/20201228164110.2838-3-songmuchun@bytedance.com
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Rafael. J. Wysocki <rafael@kernel.org>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Sami Tolvanen <samitolvanen@google.com>
      Cc: Feng Tang <feng.tang@intel.com>
      Cc: NeilBrown <neilb@suse.de>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Pankaj Gupta <pankaj.gupta@cloud.ionos.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      69473e5d
7. 16 December 2020 (6 commits)
8. 20 November 2020 (1 commit)
9. 17 October 2020 (2 commits)
10. 14 October 2020 (2 commits)
11. 27 September 2020 (1 commit)
    • L
      mm: replace memmap_context by meminit_context · c1d0da83
Committed by Laurent Dufour
      Patch series "mm: fix memory to node bad links in sysfs", v3.
      
      Sometimes, firmware may expose interleaved memory layout like this:
      
       Early memory node ranges
         node   1: [mem 0x0000000000000000-0x000000011fffffff]
         node   2: [mem 0x0000000120000000-0x000000014fffffff]
         node   1: [mem 0x0000000150000000-0x00000001ffffffff]
         node   0: [mem 0x0000000200000000-0x000000048fffffff]
         node   2: [mem 0x0000000490000000-0x00000007ffffffff]
      
      In that case, we can see memory blocks assigned to multiple nodes in
      sysfs:
      
        $ ls -l /sys/devices/system/memory/memory21
        total 0
        lrwxrwxrwx 1 root root     0 Aug 24 05:27 node1 -> ../../node/node1
        lrwxrwxrwx 1 root root     0 Aug 24 05:27 node2 -> ../../node/node2
        -rw-r--r-- 1 root root 65536 Aug 24 05:27 online
        -r--r--r-- 1 root root 65536 Aug 24 05:27 phys_device
        -r--r--r-- 1 root root 65536 Aug 24 05:27 phys_index
        drwxr-xr-x 2 root root     0 Aug 24 05:27 power
        -r--r--r-- 1 root root 65536 Aug 24 05:27 removable
        -rw-r--r-- 1 root root 65536 Aug 24 05:27 state
        lrwxrwxrwx 1 root root     0 Aug 24 05:25 subsystem -> ../../../../bus/memory
        -rw-r--r-- 1 root root 65536 Aug 24 05:25 uevent
        -r--r--r-- 1 root root 65536 Aug 24 05:27 valid_zones
      
      The same applies in the node's directory with a memory21 link in both
      the node1 and node2's directory.
      
This is wrong but doesn't prevent the system from running.  However, when
one of these memory blocks is later hot-unplugged and then hot-plugged
again, the system detects an inconsistency in the sysfs layout and a
BUG_ON() is raised:
      
        kernel BUG at /Users/laurent/src/linux-ppc/mm/memory_hotplug.c:1084!
        LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
        Modules linked in: rpadlpar_io rpaphp pseries_rng rng_core vmx_crypto gf128mul binfmt_misc ip_tables x_tables xfs libcrc32c crc32c_vpmsum autofs4
        CPU: 8 PID: 10256 Comm: drmgr Not tainted 5.9.0-rc1+ #25
        Call Trace:
          add_memory_resource+0x23c/0x340 (unreliable)
          __add_memory+0x5c/0xf0
          dlpar_add_lmb+0x1b4/0x500
          dlpar_memory+0x1f8/0xb80
          handle_dlpar_errorlog+0xc0/0x190
          dlpar_store+0x198/0x4a0
          kobj_attr_store+0x30/0x50
          sysfs_kf_write+0x64/0x90
          kernfs_fop_write+0x1b0/0x290
          vfs_write+0xe8/0x290
          ksys_write+0xdc/0x130
          system_call_exception+0x160/0x270
          system_call_common+0xf0/0x27c
      
      This has been seen on PowerPC LPAR.
      
The root cause of this issue is that when a node's memory is registered,
the range used can overlap another node's range, and thus the memory
block is registered to multiple nodes in sysfs.
      
      There are two issues here:
      
       (a) The sysfs memory and node's layouts are broken due to these
           multiple links
      
       (b) The link errors in link_mem_sections() should not lead to a system
           panic.
      
To address (a), register_mem_sect_under_node() should not rely on the
system state to detect whether the link operation is triggered by a
hotplug operation or not.  This is addressed by patches 1 and 2 of this
series.
      
      Issue (b) will be addressed separately.
      
      This patch (of 2):
      
The memmap_context enum is used to detect whether a memory operation is
due to a hot-add operation or is happening at boot time.

Make it generic to the hotplug operation and rename it meminit_context.

There is no functional change introduced by this patch.
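
A sketch of the rename; the enumerator names below are assumptions based
on the description (boot-time versus hotplug initialization), not
necessarily the exact identifiers used.

  enum meminit_context {
          MEMINIT_EARLY,          /* boot-time memmap initialization */
          MEMINIT_HOTPLUG,        /* initialization during memory hot-add */
  };
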
Suggested-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Laurent Dufour <ldufour@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: "Rafael J . Wysocki" <rafael@kernel.org>
      Cc: Nathan Lynch <nathanl@linux.ibm.com>
      Cc: Scott Cheloha <cheloha@linux.ibm.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: <stable@vger.kernel.org>
      Link: https://lkml.kernel.org/r/20200915094143.79181-1-ldufour@linux.ibm.com
Link: https://lkml.kernel.org/r/20200915132624.9723-1-ldufour@linux.ibm.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c1d0da83
12. 13 August 2020 (2 commits)