1. 29 Apr 2022: 1 commit
  2. 23 Mar 2022: 3 commits
    • drivers/base/memory: determine and store zone for single-zone memory blocks · 395f6081
      Committed by David Hildenbrand
      test_pages_in_a_zone() is just another nasty PFN walker that can easily
      stumble over ZONE_DEVICE memory ranges falling into the same memory block
      as ordinary system RAM: the memmap of parts of these ranges may be
      uninitialized.  In fact, we observed (on an older kernel) with UBSAN:
      
        UBSAN: Undefined behaviour in ./include/linux/mm.h:1133:50
        index 7 is out of range for type 'zone [5]'
        CPU: 121 PID: 35603 Comm: read_all Kdump: loaded Tainted: [...]
        Hardware name: Dell Inc. PowerEdge R7425/08V001, BIOS 1.12.2 11/15/2019
        Call Trace:
         dump_stack+0x9a/0xf0
         ubsan_epilogue+0x9/0x7a
         __ubsan_handle_out_of_bounds+0x13a/0x181
         test_pages_in_a_zone+0x3c4/0x500
         show_valid_zones+0x1fa/0x380
         dev_attr_show+0x43/0xb0
         sysfs_kf_seq_show+0x1c5/0x440
         seq_read+0x49d/0x1190
         vfs_read+0xff/0x300
         ksys_read+0xb8/0x170
         do_syscall_64+0xa5/0x4b0
         entry_SYSCALL_64_after_hwframe+0x6a/0xdf
        RIP: 0033:0x7f01f4439b52
      
      We seem to stumble over a memmap that contains a garbage zone id.  While
      we could try inserting pfn_to_online_page() calls, doing so would just
      make memory offlining slower, because we use test_pages_in_a_zone() to
      make sure we're offlining pages that all belong to the same zone.
      
      Let's just get rid of this PFN walker and determine the single zone of a
      memory block -- if any -- for early memory blocks during boot.  For memory
      onlining, we know the single zone already.  Let's avoid any additional
      memmap scanning and just rely on the zone information available during
      boot.
      
      For memory hot(un)plug, we only really care about memory blocks that:
      * span a single zone (and, thereby, a single node)
      * are completely System RAM (IOW, no holes, no ZONE_DEVICE)
      If one of these conditions is not met, we reject memory offlining.
      Hotplugged memory blocks (which start out offline) always meet both
      conditions.
      
      There are three scenarios to handle:
      
      (1) Memory hot(un)plug
      
      A memory block with zone == NULL cannot be offlined, corresponding to
      our previous test_pages_in_a_zone() check.
      
      After successful memory onlining/offlining, we simply set the zone
      accordingly.
      * Memory onlining: set the zone we just used for onlining
      * Memory offlining: set zone = NULL
      
      So a hotplugged memory block starts with zone = NULL. Once memory
      onlining is done, we set the proper zone.
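
      A hedged sketch of that bookkeeping (assume mem->zone is the stored
      zone; the function names and surrounding plumbing here are illustrative,
      not the exact upstream code):

        /* Onlining picked exactly one zone, so remember it. */
        static void memory_block_onlined(struct memory_block *mem,
                                         struct zone *zone)
        {
                mem->zone = zone;
        }

        /* Offlining: the block no longer belongs to any zone. */
        static void memory_block_offlined(struct memory_block *mem)
        {
                mem->zone = NULL;
        }

        /* Offlining is only attempted when a single zone is known. */
        static bool memory_block_offlinable(struct memory_block *mem)
        {
                return mem->zone != NULL;
        }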
      
      (2) Boot memory with !CONFIG_NUMA
      
      We know that there is just a single pgdat, so we simply scan all zones
      of that pgdat for an intersection with our memory block PFN range when
      adding the memory block. If more than one zone intersects (e.g., DMA and
      DMA32 on x86 for the first memory block) we set zone = NULL and
      consequently mimic what test_pages_in_a_zone() used to do.
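
      A hedged sketch of that scan (the helper name and exact checks are
      illustrative; zone_intersects() and populated_zone() are existing kernel
      helpers):

        static struct zone *single_zone_for_block(int nid,
                                                  unsigned long start_pfn,
                                                  unsigned long nr_pages)
        {
                pg_data_t *pgdat = NODE_DATA(nid);
                struct zone *zone, *matching_zone = NULL;
                int i;

                for (i = 0; i < MAX_NR_ZONES; i++) {
                        zone = pgdat->node_zones + i;
                        if (!populated_zone(zone))
                                continue;
                        if (!zone_intersects(zone, start_pfn, nr_pages))
                                continue;
                        /* Two intersecting zones: no single zone. */
                        if (matching_zone)
                                return NULL;
                        matching_zone = zone;
                }
                return matching_zone;
        }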
      
      (3) Boot memory with CONFIG_NUMA
      
      At the point in time we create the memory block devices during boot, we
      don't know yet which nodes *actually* span a memory block. While we could
      scan all zones of all nodes for intersections, overlapping nodes complicate
      the situation and scanning all nodes is possibly expensive. But that
      problem has already been solved by the code that sets the node of a
      memory block and creates the link in sysfs --
      do_register_memory_block_under_node().
      
      So, we hook into the code that sets the node id for a memory block. If
      we already have a different node id set for the memory block, we know
      that multiple nodes *actually* have PFNs falling into our memory block:
      we set zone = NULL and consequently mimic what test_pages_in_a_zone() used
      to do. If there is no node id set, we do the same as (2) for the given
      node.
      
      Note that the call order in driver_init() is:
      -> memory_dev_init(): create memory block devices
      -> node_dev_init(): link memory block devices to the node and set the
      		    node id
      
      So, in summary: we detect whether a single zone is responsible for a
      memory block and, if so, store that zone in the memory block, updating
      it during memory onlining/offlining.
      
      Link: https://lkml.kernel.org/r/20220210184359.235565-3-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Reported-by: Rafael Parra <rparrazo@redhat.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Cc: "Rafael J. Wysocki" <rafael@kernel.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Rafael Parra <rparrazo@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • drivers/base/node: rename link_mem_sections() to register_memory_block_under_node() · cc651559
      Committed by David Hildenbrand
      Patch series "drivers/base/memory: determine and store zone for single-zone memory blocks", v2.
      
      I remember talking to Michal in the past about removing
      test_pages_in_a_zone(), which we use for:
      * verifying that a memory block we intend to offline is really only managed
        by a single zone. We don't support offlining of memory blocks that are
        managed by multiple zones (e.g., multiple nodes, DMA and DMA32)
      * exposing that zone to user space via
        /sys/devices/system/memory/memory*/valid_zones
      
      Now that I've identified some more cases where test_pages_in_a_zone()
      might go wrong, and we received an UBSAN report (see patch #3), let's
      get rid of this PFN walker.
      
      So instead of detecting the zone at runtime with test_pages_in_a_zone() by
      scanning the memmap, let's determine and remember for each memory block if
      it's managed by a single zone.  The stored zone can then be used for the
      above two cases, avoiding a manual lookup using test_pages_in_a_zone().
      
      This avoids eventually stumbling over uninitialized memmaps in corner
      cases, especially when ZONE_DEVICE ranges partly fall into memory blocks
      that are responsible for managing System RAM.
      
      Handling memory onlining is easy, because we online to exactly one zone.
      Handling boot memory is more tricky, because we want to avoid scanning all
      zones of all nodes to detect possible zones that overlap with the physical
      memory region of interest.  Fortunately, we already have code that
      determines the applicable nodes for a memory block, to create sysfs links
      -- we'll hook into that.
      
      Patch #1 is a simple cleanup I had laying around for a longer time.
      Patch #2 contains the main logic to remove test_pages_in_a_zone() and
      further details.
      
      [1] https://lkml.kernel.org/r/20220128144540.153902-1-david@redhat.com
      [2] https://lkml.kernel.org/r/20220203105212.30385-1-david@redhat.com
      
      This patch (of 2):
      
      Let's adjust the stale terminology, making it match
      unregister_memory_block_under_nodes() and
      do_register_memory_block_under_node().  We're dealing with memory block
      devices, which span 1..X memory sections.
      
      Link: https://lkml.kernel.org/r/20220210184359.235565-1-david@redhat.com
      Link: https://lkml.kernel.org/r/20220210184359.235565-2-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Acked-by: Oscar Salvador <osalvador@suse.de>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: "Rafael J. Wysocki" <rafael@kernel.org>
      Cc: Rafael Parra <rparrazo@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • drivers/base/node: consolidate node device subsystem initialization in node_dev_init() · 2848a28b
      Committed by David Hildenbrand
      ...  and call node_dev_init() after memory_dev_init() from driver_init(),
      so before any of the existing arch/subsys calls.  All online nodes should
      be known at that point: early during boot, arch code determines node and
      zone ranges and sets the relevant nodes online; usually this happens in
      setup_arch().
      
      This is in line with memory_dev_init(), which initializes the memory
      device subsystem and creates all memory block devices.
      
      Similar to memory_dev_init(), panic() if anything goes wrong; we don't
      want to continue in the face of such basic initialization errors.
      
      The important part is that node_dev_init() gets called after
      memory_dev_init() and after cpu_dev_init(), but before any of the relevant
      archs call register_cpu() to register the new cpu device under the node
      device.  The latter should be the case for the current users of
      topology_init().
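
      A hedged sketch of the resulting init routine (subsys and helper names
      follow drivers/base/node.c; treat the details as illustrative):

        void __init node_dev_init(void)
        {
                int ret, nid;

                ret = subsys_system_register(&node_subsys, NULL);
                if (ret)
                        panic("%s() failed to register subsystem: %d\n",
                              __func__, ret);

                /*
                 * Create all node devices; all online nodes are known by
                 * the time driver_init() runs.
                 */
                for_each_online_node(nid) {
                        ret = register_one_node(nid);
                        if (ret)
                                panic("%s() failed to add node: %d\n",
                                      __func__, ret);
                }
        }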
      
      Link: https://lkml.kernel.org/r/20220203105212.30385-1-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Tested-by: Anatoly Pugachev <matorola@gmail.com> (sparc64)
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Palmer Dabbelt <palmer@dabbelt.com>
      Cc: Albert Ou <aou@eecs.berkeley.edu>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Cc: Rich Felker <dalias@libc.org>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: "Rafael J. Wysocki" <rafael@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  3. 09 Dec 2021: 1 commit
    • x86/sgx: Add an attribute for the amount of SGX memory in a NUMA node · 50468e43
      Committed by Jarkko Sakkinen
      == Problem ==
      
      The amount of SGX memory on a system is determined by the BIOS and it
      varies wildly between systems.  It can be as small as dozens of MBs and
      as large as many GBs on servers.  Just as applications need to know how
      much regular RAM is available, enclave builders need to know how much
      SGX memory an enclave can consume.
      
      == Solution ==
      
      Introduce a new sysfs file:
      
      	/sys/devices/system/node/nodeX/x86/sgx_total_bytes
      
      to enumerate the amount of SGX memory available in each NUMA node.
      This serves the same function for SGX as /proc/meminfo or
      /sys/devices/system/node/nodeX/meminfo does for normal RAM.
      
      'sgx_total_bytes' is needed today to help drive the SGX selftests.
      SGX-specific swap code is exercised by creating overcommitted enclaves
      which are larger than the physical SGX memory on the system.  They
      currently use a CPUID-based approach which can diverge from the actual
      amount of SGX memory available.  'sgx_total_bytes' ensures that the
      selftests can work efficiently and do not attempt stupid things like
      creating a 100,000 MB enclave on a system with 128 MB of SGX memory.
      
      == Implementation Details ==
      
      Introduce a CONFIG_HAVE_ARCH_NODE_DEV_GROUP opt-in flag to expose an
      arch-specific attribute group, and add an attribute for the amount of
      SGX memory in bytes to each NUMA node:
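
      A hedged sketch of what that looks like (sgx_numa_nodes and the size
      bookkeeping are illustrative; sysfs_emit() and DEVICE_ATTR_RO() are the
      usual driver-core helpers):

        static ssize_t sgx_total_bytes_show(struct device *dev,
                                            struct device_attribute *attr,
                                            char *buf)
        {
                /* Node devices carry the node id in dev->id. */
                return sysfs_emit(buf, "%lu\n",
                                  sgx_numa_nodes[dev->id].size);
        }
        static DEVICE_ATTR_RO(sgx_total_bytes);

        static struct attribute *arch_node_dev_attrs[] = {
                &dev_attr_sgx_total_bytes.attr,
                NULL,
        };

        /*
         * Picked up by the node driver when
         * CONFIG_HAVE_ARCH_NODE_DEV_GROUP is selected.
         */
        const struct attribute_group arch_node_dev_group = {
                .name  = "x86",
                .attrs = arch_node_dev_attrs,
        };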
      
      == ABI Design Discussion ==
      
      As opposed to the per-node ABI, a single, global ABI was considered.
      However, this would prevent enclaves from being able to size
      themselves so that they fit on a single NUMA node.  Essentially, a
      single value would rule out NUMA optimizations for enclaves.
      
      Create a new "x86/" directory inside each "nodeX/" sysfs directory.
      'sgx_total_bytes' is expected to be the first of at least a few
      sgx-specific files to be placed in the new directory.  Just scanning
      /proc/meminfo, these are the no-brainers that we have for RAM, but we
      need for SGX:
      
      	MemTotal:       xxxx kB // sgx_total_bytes (implemented here)
      	MemFree:        yyyy kB // sgx_free_bytes
      	SwapTotal:      zzzz kB // sgx_swapped_bytes
      
      So, at *least* three.  I think we will eventually end up needing
      something more along the lines of a dozen.  A new directory (as opposed
      to placing the files in the nodeX/ "root" directory) avoids cluttering
      the root with several "sgx_*" files.
      
      Place the new file in a new "nodeX/x86/" directory because SGX is
      highly x86-specific.  It is very unlikely that any other architecture
      (or even non-Intel x86 vendor) will ever implement SGX.  Using "sgx/"
      as opposed to "x86/" was also considered.  But, there is a real chance
      this can get used for other arch-specific purposes.
      
      [ dhansen: rewrite changelog ]
      Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org>
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Acked-by: Borislav Petkov <bp@suse.de>
      Link: https://lkml.kernel.org/r/20211116162116.93081-2-jarkko@kernel.org
  4. 07 Nov 2021: 1 commit
  5. 09 Sep 2021: 1 commit
  6. 13 Aug 2021: 1 commit
  7. 21 Jul 2021: 1 commit
  8. 30 Jun 2021: 1 commit
    • mm/vmstat: convert NUMA statistics to basic NUMA counters · f19298b9
      Committed by Mel Gorman
      NUMA statistics are maintained at the zone level for hits, misses,
      foreign allocations etc., but nothing relies on them being perfectly
      accurate for functional correctness.  The counters are used by userspace
      to get a general overview of a workload's NUMA behaviour, but the page
      allocator incurs a high cost to maintain perfect accuracy similar to
      what is required for a vmstat counter like NR_FREE_PAGES.  There is even
      a sysctl, vm.numa_stat, to allow userspace to turn off the collection of
      NUMA statistics like NUMA_HIT.
      
      This patch converts NUMA_HIT and friends to be NUMA events with similar
      accuracy to VM events.  There is a possibility that slight errors will be
      introduced but the overall trend as seen by userspace will be similar.
      The counters are no longer updated from vmstat_refresh context as it is
      unnecessary overhead for counters that may never be read by userspace.
      Note that counters could be maintained at the node level to save space but
      it would have a user-visible impact due to /proc/zoneinfo.
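
      A hedged sketch of the difference (field and helper names are
      illustrative): an event is a bare per-cpu increment with no threshold
      or fold step, and totals are only summed when somebody reads them:

        static inline void count_numa_event(struct zone *zone,
                                            enum numa_stat_item item)
        {
                struct per_cpu_zonestat __percpu *pzstats =
                                                zone->per_cpu_zonestats;

                raw_cpu_inc(pzstats->vm_numa_event[item]);
        }

        static unsigned long sum_numa_event(struct zone *zone,
                                            enum numa_stat_item item)
        {
                unsigned long total = 0;
                int cpu;

                for_each_possible_cpu(cpu)
                        total += per_cpu_ptr(zone->per_cpu_zonestats,
                                             cpu)->vm_numa_event[item];

                /* Approximate: CPUs may still be updating concurrently. */
                return total;
        }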
      
      [lkp@intel.com: Fix misplaced closing brace for !CONFIG_NUMA]
      
      Link: https://lkml.kernel.org/r/20210512095458.30632-4-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Chuck Lever <chuck.lever@oracle.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  9. 04 Jun 2021: 1 commit
  10. 22 May 2021: 1 commit
  11. 10 Apr 2021: 1 commit
  12. 25 Feb 2021: 6 commits
    • mm: memcg: add swapcache stat for memcg v2 · b6038942
      Committed by Shakeel Butt
      This patch adds a swapcache stat for cgroup v2.  The swapcache
      represents memory that is accounted against both the memory and the
      swap limit of the cgroup.  The main motivation behind exposing the
      swapcache stat is to enable users to gracefully migrate from cgroup
      v1's memsw counter to cgroup v2's memory and swap counters.
      
      Cgroup v1's memsw limit allows users to limit the memory+swap usage of a
      workload but without control on the exact proportion of memory and swap.
      Cgroup v2 provides separate limits for memory and swap which enables more
      control on the exact usage of memory and swap individually for the
      workload.
      
      With some subtleties, v1's memsw limit can be replaced with the sum of
      the v2 memory and swap limits.  However, an alternative for memsw usage
      is not yet available in cgroup v2.  Exposing a per-cgroup swapcache stat
      enables that alternative: adding the memory usage and swap usage and
      subtracting the swapcache approximates the memsw usage.  This will help
      in the transparent migration of workloads that depend on the memsw usage
      and limit to v2's memory and swap counters.
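
      Assuming the new stat shows up in memory.stat in bytes like the other
      v2 stats, the approximation reads:

        memsw_usage ~= memory.current + memory.swap.current
                       - swapcached (from memory.stat)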
      
      The reasons these applications are still interested in this approximate
      memsw usage are: (1) these applications are not really interested in two
      separate memory and swap usage metrics.  A single usage metric is
      simpler to use and reason about for them.
      
      (2) The memsw usage metric hides the underlying system's swap setup from
      the applications.  Applications with multiple instances running in a
      datacenter with heterogeneous systems (some have swap and some don't) will
      keep seeing a consistent view of their usage.
      
      [akpm@linux-foundation.org: fix CONFIG_SWAP=n build]
      
      Link: https://lkml.kernel.org/r/20210108155813.2914586-3-shakeelb@google.com
      Signed-off-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Roman Gushchin <guro@fb.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: memcontrol: convert NR_FILE_PMDMAPPED account to pages · 380780e7
      Committed by Muchun Song
      Currently we use struct per_cpu_nodestat to cache the vmstat counters,
      which leads to inaccurate statistics, especially for the THP vmstat
      counters.  On systems with hundreds of processors the cached deltas can
      amount to GBs of memory.  For example, on a 96-CPU system the threshold
      reaches its maximum of 125, and the per-cpu counters can cache 23.4375
      GB in total.
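
      For reference, that figure appears to follow from the worst-case
      per-cpu drift of one such counter:

        125 (threshold) * 96 (CPUs) * 512 (pages per THP) * 4 KiB (page size)
                = 25,165,824,000 bytes = 23.4375 GiB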
      
      A THP page is already a form of batched addition (it adds 512 pages'
      worth of memory in one go), so skipping the batching seems sensible.
      Every THP stats update now overflows the per-cpu counter and resorts to
      an atomic global update, but this makes the statistics more accurate for
      the THP vmstat counters.
      
      So we convert the NR_FILE_PMDMAPPED account to pages.  This patch is
      consistent with 8f182270 ("mm/swap.c: flush lru pvecs on compound page
      arrival").  Doing this also makes the units of the vmstat counters more
      unified.  In the end, the vmstat counters use pages, kB, or bytes: a
      B/KB suffix tells us the unit is bytes or kB, and counters without a
      suffix are in pages.
      
      Link: https://lkml.kernel.org/r/20201228164110.2838-7-songmuchun@bytedance.com
      Signed-off-by: Muchun Song <songmuchun@bytedance.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Feng Tang <feng.tang@intel.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: NeilBrown <neilb@suse.de>
      Cc: Pankaj Gupta <pankaj.gupta@cloud.ionos.com>
      Cc: "Rafael J. Wysocki" <rafael@kernel.org>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Sami Tolvanen <samitolvanen@google.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: memcontrol: convert NR_SHMEM_PMDMAPPED account to pages · a1528e21
      Committed by Muchun Song
      Currently we use struct per_cpu_nodestat to cache the vmstat counters,
      which leads to inaccurate statistics, especially for the THP vmstat
      counters.  On systems with hundreds of processors the cached deltas can
      amount to GBs of memory.  For example, on a 96-CPU system the threshold
      reaches its maximum of 125, and the per-cpu counters can cache 23.4375
      GB in total.

      A THP page is already a form of batched addition (it adds 512 pages'
      worth of memory in one go), so skipping the batching seems sensible.
      Every THP stats update now overflows the per-cpu counter and resorts to
      an atomic global update, but this makes the statistics more accurate for
      the THP vmstat counters.

      So we convert the NR_SHMEM_PMDMAPPED account to pages.  This patch is
      consistent with 8f182270 ("mm/swap.c: flush lru pvecs on compound page
      arrival").  Doing this also makes the units of the vmstat counters more
      unified.  In the end, the vmstat counters use pages, kB, or bytes: a
      B/KB suffix tells us the unit is bytes or kB, and counters without a
      suffix are in pages.
      
      Link: https://lkml.kernel.org/r/20201228164110.2838-6-songmuchun@bytedance.com
      Signed-off-by: Muchun Song <songmuchun@bytedance.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Feng Tang <feng.tang@intel.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: NeilBrown <neilb@suse.de>
      Cc: Pankaj Gupta <pankaj.gupta@cloud.ionos.com>
      Cc: "Rafael J. Wysocki" <rafael@kernel.org>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Sami Tolvanen <samitolvanen@google.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: memcontrol: convert NR_SHMEM_THPS account to pages · 57b2847d
      Committed by Muchun Song
      Currently we use struct per_cpu_nodestat to cache the vmstat counters,
      which leads to inaccurate statistics, especially for the THP vmstat
      counters.  On systems with hundreds of processors the cached deltas can
      amount to GBs of memory.  For example, on a 96-CPU system the threshold
      reaches its maximum of 125, and the per-cpu counters can cache 23.4375
      GB in total.

      A THP page is already a form of batched addition (it adds 512 pages'
      worth of memory in one go), so skipping the batching seems sensible.
      Every THP stats update now overflows the per-cpu counter and resorts to
      an atomic global update, but this makes the statistics more accurate for
      the THP vmstat counters.

      So we convert the NR_SHMEM_THPS account to pages.  This patch is
      consistent with 8f182270 ("mm/swap.c: flush lru pvecs on compound page
      arrival").  Doing this also makes the units of the vmstat counters more
      unified.  In the end, the vmstat counters use pages, kB, or bytes: a
      B/KB suffix tells us the unit is bytes or kB, and counters without a
      suffix are in pages.
      
      Link: https://lkml.kernel.org/r/20201228164110.2838-5-songmuchun@bytedance.com
      Signed-off-by: Muchun Song <songmuchun@bytedance.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Feng Tang <feng.tang@intel.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: NeilBrown <neilb@suse.de>
      Cc: Pankaj Gupta <pankaj.gupta@cloud.ionos.com>
      Cc: "Rafael J. Wysocki" <rafael@kernel.org>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Sami Tolvanen <samitolvanen@google.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: memcontrol: convert NR_FILE_THPS account to pages · bf9ecead
      Committed by Muchun Song
      Currently we use struct per_cpu_nodestat to cache the vmstat counters,
      which leads to inaccurate statistics, especially for the THP vmstat
      counters.  On systems with hundreds of processors the cached deltas can
      amount to GBs of memory.  For example, on a 96-CPU system the threshold
      reaches its maximum of 125, and the per-cpu counters can cache 23.4375
      GB in total.

      A THP page is already a form of batched addition (it adds 512 pages'
      worth of memory in one go), so skipping the batching seems sensible.
      Every THP stats update now overflows the per-cpu counter and resorts to
      an atomic global update, but this makes the statistics more accurate for
      the THP vmstat counters.

      So we convert the NR_FILE_THPS account to pages.  This patch is
      consistent with 8f182270 ("mm/swap.c: flush lru pvecs on compound page
      arrival").  Doing this also makes the units of the vmstat counters more
      unified.  In the end, the vmstat counters use pages, kB, or bytes: a
      B/KB suffix tells us the unit is bytes or kB, and counters without a
      suffix are in pages.
      
      Link: https://lkml.kernel.org/r/20201228164110.2838-4-songmuchun@bytedance.com
      Signed-off-by: Muchun Song <songmuchun@bytedance.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Feng Tang <feng.tang@intel.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: NeilBrown <neilb@suse.de>
      Cc: Pankaj Gupta <pankaj.gupta@cloud.ionos.com>
      Cc: "Rafael J. Wysocki" <rafael@kernel.org>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Sami Tolvanen <samitolvanen@google.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: memcontrol: convert NR_ANON_THPS account to pages · 69473e5d
      Committed by Muchun Song
      Currently we use struct per_cpu_nodestat to cache the vmstat counters,
      which leads to inaccurate statistics, especially for the THP vmstat
      counters.  On systems with hundreds of processors the cached deltas can
      amount to GBs of memory.  For example, on a 96-CPU system the threshold
      reaches its maximum of 125, and the per-cpu counters can cache 23.4375
      GB in total.

      A THP page is already a form of batched addition (it adds 512 pages'
      worth of memory in one go), so skipping the batching seems sensible.
      Every THP stats update now overflows the per-cpu counter and resorts to
      an atomic global update, but this makes the statistics more accurate for
      the THP vmstat counters.

      So we convert the NR_ANON_THPS account to pages.  This patch is
      consistent with 8f182270 ("mm/swap.c: flush lru pvecs on compound page
      arrival").  Doing this also makes the units of the vmstat counters more
      unified.  In the end, the vmstat counters use pages, kB, or bytes: a
      B/KB suffix tells us the unit is bytes or kB, and counters without a
      suffix are in pages.
      
      Link: https://lkml.kernel.org/r/20201228164110.2838-3-songmuchun@bytedance.com
      Signed-off-by: Muchun Song <songmuchun@bytedance.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: "Rafael J. Wysocki" <rafael@kernel.org>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Sami Tolvanen <samitolvanen@google.com>
      Cc: Feng Tang <feng.tang@intel.com>
      Cc: NeilBrown <neilb@suse.de>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Pankaj Gupta <pankaj.gupta@cloud.ionos.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  13. 16 Dec 2020: 1 commit
  14. 17 Oct 2020: 1 commit
  15. 03 Oct 2020: 1 commit
  16. 02 Oct 2020: 5 commits
  17. 27 Sep 2020: 1 commit
    • mm: don't rely on system state to detect hot-plug operations · f85086f9
      Committed by Laurent Dufour
      In register_mem_sect_under_node() the system_state value is checked to
      detect whether the call is made during boot time or during a hot-plug
      operation.  Unfortunately, that check against SYSTEM_BOOTING is wrong
      because regular memory is registered at the SYSTEM_SCHEDULING state.  In
      addition, memory hot-plug operations can be triggered at this system
      state by ACPI [1].  So checking against the system state is not enough.
      
      The consequence is that on a system with interleaved node ranges like
      this:
      
       Early memory node ranges
         node   1: [mem 0x0000000000000000-0x000000011fffffff]
         node   2: [mem 0x0000000120000000-0x000000014fffffff]
         node   1: [mem 0x0000000150000000-0x00000001ffffffff]
         node   0: [mem 0x0000000200000000-0x000000048fffffff]
         node   2: [mem 0x0000000490000000-0x00000007ffffffff]
      
      This can be seen on a PowerPC LPAR after multiple memory hot-plug and
      hot-unplug operations are done.  At the next reboot the node's memory
      ranges can be interleaved, and since the call to link_mem_sections() is
      made in topology_init() while the system is in the SYSTEM_SCHEDULING
      state, the node id is not checked, and the sections are registered to
      multiple nodes:
      
        $ ls -l /sys/devices/system/memory/memory21/node*
        total 0
        lrwxrwxrwx 1 root root     0 Aug 24 05:27 node1 -> ../../node/node1
        lrwxrwxrwx 1 root root     0 Aug 24 05:27 node2 -> ../../node/node2
      
      In that case, the system is able to boot, but if one of these memory
      blocks is later hot-unplugged and then hot-plugged again, the sysfs
      inconsistency is detected and triggers a BUG_ON():
      
        kernel BUG at /Users/laurent/src/linux-ppc/mm/memory_hotplug.c:1084!
        Oops: Exception in kernel mode, sig: 5 [#1]
        LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
        Modules linked in: rpadlpar_io rpaphp pseries_rng rng_core vmx_crypto gf128mul binfmt_misc ip_tables x_tables xfs libcrc32c crc32c_vpmsum autofs4
        CPU: 8 PID: 10256 Comm: drmgr Not tainted 5.9.0-rc1+ #25
        Call Trace:
          add_memory_resource+0x23c/0x340 (unreliable)
          __add_memory+0x5c/0xf0
          dlpar_add_lmb+0x1b4/0x500
          dlpar_memory+0x1f8/0xb80
          handle_dlpar_errorlog+0xc0/0x190
          dlpar_store+0x198/0x4a0
          kobj_attr_store+0x30/0x50
          sysfs_kf_write+0x64/0x90
          kernfs_fop_write+0x1b0/0x290
          vfs_write+0xe8/0x290
          ksys_write+0xdc/0x130
          system_call_exception+0x160/0x270
          system_call_common+0xf0/0x27c
      
      This patch addresses the root cause by not relying on the system_state
      value to detect whether the call is due to a hot-plug operation.
      Instead, an extra parameter is added to link_mem_sections() stating
      explicitly whether the caller is performing a hot-plug operation.
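
      A hedged sketch of the resulting interface (the enum mirrors the
      boot-vs-hotplug distinction the patch describes; exact names and the
      callback split are illustrative):

        enum meminit_context {
                MEMINIT_EARLY,          /* boot-time registration */
                MEMINIT_HOTPLUG,        /* hot-plug registration */
        };

        int link_mem_sections(int nid, unsigned long start_pfn,
                              unsigned long end_pfn,
                              enum meminit_context context)
        {
                walk_memory_blocks_func_t func;

                /*
                 * The caller states the context explicitly instead of
                 * link_mem_sections() guessing from system_state.
                 */
                if (context == MEMINIT_HOTPLUG)
                        func = register_mem_block_under_node_hotplug;
                else
                        func = register_mem_block_under_node_early;

                return walk_memory_blocks(PFN_PHYS(start_pfn),
                                          PFN_PHYS(end_pfn - start_pfn),
                                          (void *)&nid, func);
        }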
      
      [1] According to Oscar Salvador, with this qemu command line, ACPI
      memory hotplug operations are raised while still in the
      SYSTEM_SCHEDULING state:
      
        $QEMU -enable-kvm -machine pc -smp 4,sockets=4,cores=1,threads=1 -cpu host -monitor pty \
              -m size=$MEM,slots=255,maxmem=4294967296k  \
              -numa node,nodeid=0,cpus=0-3,mem=512 -numa node,nodeid=1,mem=512 \
              -object memory-backend-ram,id=memdimm0,size=134217728 -device pc-dimm,node=0,memdev=memdimm0,id=dimm0,slot=0 \
              -object memory-backend-ram,id=memdimm1,size=134217728 -device pc-dimm,node=0,memdev=memdimm1,id=dimm1,slot=1 \
              -object memory-backend-ram,id=memdimm2,size=134217728 -device pc-dimm,node=0,memdev=memdimm2,id=dimm2,slot=2 \
              -object memory-backend-ram,id=memdimm3,size=134217728 -device pc-dimm,node=0,memdev=memdimm3,id=dimm3,slot=3 \
              -object memory-backend-ram,id=memdimm4,size=134217728 -device pc-dimm,node=1,memdev=memdimm4,id=dimm4,slot=4 \
              -object memory-backend-ram,id=memdimm5,size=134217728 -device pc-dimm,node=1,memdev=memdimm5,id=dimm5,slot=5 \
              -object memory-backend-ram,id=memdimm6,size=134217728 -device pc-dimm,node=1,memdev=memdimm6,id=dimm6,slot=6 \
      
      Fixes: 4fbce633 ("mm/memory_hotplug.c: make register_mem_sect_under_node() a callback of walk_memory_range()")
      Signed-off-by: Laurent Dufour <ldufour@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: "Rafael J. Wysocki" <rafael@kernel.org>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Nathan Lynch <nathanl@linux.ibm.com>
      Cc: Scott Cheloha <cheloha@linux.ibm.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: <stable@vger.kernel.org>
      Link: https://lkml.kernel.org/r/20200915094143.79181-3-ldufour@linux.ibm.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  18. 08 Aug 2020: 3 commits
  19. 03 Jun 2020: 1 commit
    • mm/writeback: discard NR_UNSTABLE_NFS, use NR_WRITEBACK instead · 8d92890b
      Committed by NeilBrown
      After an NFS page has been written it is considered "unstable" until a
      COMMIT request succeeds.  If the COMMIT fails, the page will be
      re-written.
      
      These "unstable" pages are currently accounted as "reclaimable", either
      in WB_RECLAIMABLE, or in NR_UNSTABLE_NFS which is included in a
      'reclaimable' count.  This might have made sense when sending the COMMIT
      required a separate action by the VFS/MM (e.g.  releasepage() used to
      send a COMMIT).  However now that all writes generated by ->writepages()
      will automatically be followed by a COMMIT (since commit 919e3bd9
      ("NFS: Ensure we commit after writeback is complete")) it makes more
      sense to treat them as writeback pages.
      
      So this patch removes NR_UNSTABLE_NFS and accounts unstable pages in
      NR_WRITEBACK and WB_WRITEBACK.
      
      A particular effect of this change is that when
      wb_check_background_flush() calls wb_over_bg_threshold(), the latter
      will report 'true' a lot less often as the 'unstable' pages are no
      longer considered 'dirty' (as there is nothing that writeback can do
      about them anyway).
      
      Currently wb_check_background_flush() will trigger writeback to NFS even
      when there are relatively few dirty pages (if there are lots of unstable
      pages); this can result in small writes going to the server (tens of
      kilobytes rather than a megabyte), which hurts throughput.  With this
      patch there are fewer writes, and each is larger on average.
      
      Where the NR_UNSTABLE_NFS count was included in statistics virtual
      files, the entry is retained, but the value is hard-coded as zero.
      Static trace points and warning printks which mentioned this counter no
      longer report it.
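
      For /proc/meminfo that boils down to a hard-coded line; a hedged sketch
      in the style of fs/proc/meminfo.c:

        /* NR_UNSTABLE_NFS is gone; keep the line for ABI compatibility. */
        show_val_kb(m, "NFS_Unstable:   ", 0);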
      
      [akpm@linux-foundation.org: re-layout comment]
      [akpm@linux-foundation.org: fix printk warning]
      Signed-off-by: NeilBrown <neilb@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Acked-by: Trond Myklebust <trond.myklebust@hammerspace.com>
      Acked-by: Michal Hocko <mhocko@suse.com>	[mm]
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Chuck Lever <chuck.lever@oracle.com>
      Link: http://lkml.kernel.org/r/87d06j7gqa.fsf@notabene.neil.brown.name
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  20. 15 May 2020: 1 commit
  21. 03 Apr 2020: 1 commit
  22. 05 Dec 2019: 1 commit
  23. 25 Sep 2019: 3 commits
  24. 19 Jul 2019: 2 commits