1. 10 Apr 2021: 1 commit
  2. 25 Feb 2021: 6 commits
    • mm: memcg: add swapcache stat for memcg v2 · b6038942
      Authored by Shakeel Butt
      This patch adds a swapcache stat for cgroup v2.  The swapcache
      represents memory that is accounted against both the memory and the
      swap limit of the cgroup.  The main motivation behind exposing the
      swapcache stat is to enable users to gracefully migrate from cgroup
      v1's memsw counter to cgroup v2's memory and swap counters.
      
      Cgroup v1's memsw limit allows users to limit the memory+swap usage of a
      workload, but without control over the exact proportion of memory and
      swap.  Cgroup v2 provides separate limits for memory and swap, which
      enables more control over the exact usage of memory and swap
      individually for the workload.
      
      Modulo some subtleties, v1's memsw limit can be replaced by the sum of
      the v2 memory and swap limits.  However, an alternative for memsw
      usage is not yet available in cgroup v2.  Exposing a per-cgroup
      swapcache stat enables that alternative: adding the memory usage and
      swap usage and subtracting the swapcache approximates the memsw usage.
      This will help in the transparent migration of workloads that depend on
      the memsw usage and limit to v2's memory and swap counters.
      
      The reasons these applications are still interested in this approximate
      memsw usage are: (1) these applications are not really interested in
      two separate memory and swap usage metrics; a single usage metric is
      simpler to use and reason about for them.
      
      (2) The memsw usage metric hides the underlying system's swap setup from
      the applications.  Applications with multiple instances running in a
      datacenter with heterogeneous systems (some have swap and some don't) will
      keep seeing a consistent view of their usage.
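      The approximation described above can be sketched as a small helper.
      The cgroup v2 interfaces involved are memory.current,
      memory.swap.current, and the swapcache entry in memory.stat; the
      helper function itself is a hypothetical illustration, and the exact
      memory.stat key name should be checked against the kernel
      documentation:

      ```python
      def approximate_memsw(memory_current: int, swap_current: int,
                            swapcache: int) -> int:
          """Approximate cgroup v1's memsw usage from cgroup v2 counters.

          Swapcache pages are charged against both memory and swap, so a
          plain sum would double-count them; subtract them once.
          """
          return memory_current + swap_current - swapcache

      # Hypothetical example values, in bytes:
      #   memory.current = 512 MiB, memory.swap.current = 128 MiB,
      #   swapcache (from memory.stat) = 32 MiB
      memsw = approximate_memsw(512 << 20, 128 << 20, 32 << 20)
      print(memsw >> 20)  # 608 (MiB)
      ```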
      
      [akpm@linux-foundation.org: fix CONFIG_SWAP=n build]
      
      Link: https://lkml.kernel.org/r/20210108155813.2914586-3-shakeelb@google.com
      Signed-off-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Roman Gushchin <guro@fb.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: memcontrol: convert NR_FILE_PMDMAPPED account to pages · 380780e7
      Authored by Muchun Song
      Currently we use struct per_cpu_nodestat to cache the vmstat counters,
      which leads to inaccurate statistics, especially for the THP vmstat
      counters.  On systems with hundreds of processors the cached error can
      amount to GBs of memory: for example, on a 96-CPU system, where the
      per-counter threshold maxes out at 125, the per-CPU counters can hide
      up to 23.4375 GB in total.
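      The 23.4375 GB figure follows from the worst-case drift of the per-CPU
      counters when each counter unit is one THP.  A quick back-of-the-envelope
      check, assuming 4 KiB base pages and a 2 MiB THP:

      ```python
      cpus = 96
      threshold = 125          # maximum per-CPU counter deviation
      thp_bytes = 512 * 4096   # one THP = 512 base pages of 4 KiB = 2 MiB

      # Worst case, every per-CPU counter sits just below the threshold,
      # so the global counter can be off by this many bytes:
      max_drift = cpus * threshold * thp_bytes
      print(max_drift / 2**30)  # 23.4375 (GiB)
      ```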
      
      A THP page is already a form of batched addition (it adds 512 pages'
      worth of memory in one go), so skipping the per-CPU batching seems
      sensible.  Every THP stats update then overflows the per-CPU counter
      and resorts to an atomic global update, but this makes the statistics
      more accurate for the THP vmstat counters.
      
      So we convert the NR_FILE_PMDMAPPED account to pages.  This patch is
      consistent with 8f182270 ("mm/swap.c: flush lru pvecs on compound page
      arrival").  Doing this also makes the units of the vmstat counters more
      unified.  Afterwards, the units of the vmstat counters are pages, kB,
      and bytes: the B/KB suffix tells us that the unit is bytes or kB, and
      the rest, without a suffix, are in pages.
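      The suffix convention described above can be expressed as a tiny
      classifier; the counter names below are illustrative:

      ```python
      def vmstat_unit(name: str) -> str:
          """Infer the unit of a vmstat counter from its name suffix."""
          if name.endswith("_B"):
              return "bytes"
          if name.endswith("_KB"):
              return "kB"
          return "pages"

      print(vmstat_unit("NR_SLAB_RECLAIMABLE_B"))  # bytes
      print(vmstat_unit("NR_KERNEL_STACK_KB"))     # kB
      print(vmstat_unit("NR_FILE_PMDMAPPED"))      # pages
      ```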
      
      Link: https://lkml.kernel.org/r/20201228164110.2838-7-songmuchun@bytedance.com
      Signed-off-by: Muchun Song <songmuchun@bytedance.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Feng Tang <feng.tang@intel.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: NeilBrown <neilb@suse.de>
      Cc: Pankaj Gupta <pankaj.gupta@cloud.ionos.com>
      Cc: Rafael. J. Wysocki <rafael@kernel.org>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Sami Tolvanen <samitolvanen@google.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: memcontrol: convert NR_SHMEM_PMDMAPPED account to pages · a1528e21
      Authored by Muchun Song
      Currently we use struct per_cpu_nodestat to cache the vmstat counters,
      which leads to inaccurate statistics, especially for the THP vmstat
      counters.  On systems with hundreds of processors the cached error can
      amount to GBs of memory: for example, on a 96-CPU system, where the
      per-counter threshold maxes out at 125, the per-CPU counters can hide
      up to 23.4375 GB in total.
      
      A THP page is already a form of batched addition (it adds 512 pages'
      worth of memory in one go), so skipping the per-CPU batching seems
      sensible.  Every THP stats update then overflows the per-CPU counter
      and resorts to an atomic global update, but this makes the statistics
      more accurate for the THP vmstat counters.
      
      So we convert the NR_SHMEM_PMDMAPPED account to pages.  This patch is
      consistent with 8f182270 ("mm/swap.c: flush lru pvecs on compound page
      arrival").  Doing this also makes the units of the vmstat counters more
      unified.  Afterwards, the units of the vmstat counters are pages, kB,
      and bytes: the B/KB suffix tells us that the unit is bytes or kB, and
      the rest, without a suffix, are in pages.
      
      Link: https://lkml.kernel.org/r/20201228164110.2838-6-songmuchun@bytedance.com
      Signed-off-by: Muchun Song <songmuchun@bytedance.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Feng Tang <feng.tang@intel.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: NeilBrown <neilb@suse.de>
      Cc: Pankaj Gupta <pankaj.gupta@cloud.ionos.com>
      Cc: Rafael. J. Wysocki <rafael@kernel.org>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Sami Tolvanen <samitolvanen@google.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: memcontrol: convert NR_SHMEM_THPS account to pages · 57b2847d
      Authored by Muchun Song
      Currently we use struct per_cpu_nodestat to cache the vmstat counters,
      which leads to inaccurate statistics, especially for the THP vmstat
      counters.  On systems with hundreds of processors the cached error can
      amount to GBs of memory: for example, on a 96-CPU system, where the
      per-counter threshold maxes out at 125, the per-CPU counters can hide
      up to 23.4375 GB in total.
      
      A THP page is already a form of batched addition (it adds 512 pages'
      worth of memory in one go), so skipping the per-CPU batching seems
      sensible.  Every THP stats update then overflows the per-CPU counter
      and resorts to an atomic global update, but this makes the statistics
      more accurate for the THP vmstat counters.
      
      So we convert the NR_SHMEM_THPS account to pages.  This patch is
      consistent with 8f182270 ("mm/swap.c: flush lru pvecs on compound page
      arrival").  Doing this also makes the units of the vmstat counters more
      unified.  Afterwards, the units of the vmstat counters are pages, kB,
      and bytes: the B/KB suffix tells us that the unit is bytes or kB, and
      the rest, without a suffix, are in pages.
      
      Link: https://lkml.kernel.org/r/20201228164110.2838-5-songmuchun@bytedance.com
      Signed-off-by: Muchun Song <songmuchun@bytedance.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Feng Tang <feng.tang@intel.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: NeilBrown <neilb@suse.de>
      Cc: Pankaj Gupta <pankaj.gupta@cloud.ionos.com>
      Cc: Rafael. J. Wysocki <rafael@kernel.org>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Sami Tolvanen <samitolvanen@google.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: memcontrol: convert NR_FILE_THPS account to pages · bf9ecead
      Authored by Muchun Song
      Currently we use struct per_cpu_nodestat to cache the vmstat counters,
      which leads to inaccurate statistics, especially for the THP vmstat
      counters.  On systems with hundreds of processors the cached error can
      amount to GBs of memory: for example, on a 96-CPU system, where the
      per-counter threshold maxes out at 125, the per-CPU counters can hide
      up to 23.4375 GB in total.
      
      A THP page is already a form of batched addition (it adds 512 pages'
      worth of memory in one go), so skipping the per-CPU batching seems
      sensible.  Every THP stats update then overflows the per-CPU counter
      and resorts to an atomic global update, but this makes the statistics
      more accurate for the THP vmstat counters.
      
      So we convert the NR_FILE_THPS account to pages.  This patch is
      consistent with 8f182270 ("mm/swap.c: flush lru pvecs on compound page
      arrival").  Doing this also makes the units of the vmstat counters more
      unified.  Afterwards, the units of the vmstat counters are pages, kB,
      and bytes: the B/KB suffix tells us that the unit is bytes or kB, and
      the rest, without a suffix, are in pages.
      
      Link: https://lkml.kernel.org/r/20201228164110.2838-4-songmuchun@bytedance.com
      Signed-off-by: Muchun Song <songmuchun@bytedance.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Feng Tang <feng.tang@intel.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: NeilBrown <neilb@suse.de>
      Cc: Pankaj Gupta <pankaj.gupta@cloud.ionos.com>
      Cc: Rafael. J. Wysocki <rafael@kernel.org>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Sami Tolvanen <samitolvanen@google.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: memcontrol: convert NR_ANON_THPS account to pages · 69473e5d
      Authored by Muchun Song
      Currently we use struct per_cpu_nodestat to cache the vmstat counters,
      which leads to inaccurate statistics, especially for the THP vmstat
      counters.  On systems with hundreds of processors the cached error can
      amount to GBs of memory: for example, on a 96-CPU system, where the
      per-counter threshold maxes out at 125, the per-CPU counters can hide
      up to 23.4375 GB in total.
      
      A THP page is already a form of batched addition (it adds 512 pages'
      worth of memory in one go), so skipping the per-CPU batching seems
      sensible.  Every THP stats update then overflows the per-CPU counter
      and resorts to an atomic global update, but this makes the statistics
      more accurate for the THP vmstat counters.
      
      So we convert the NR_ANON_THPS account to pages.  This patch is
      consistent with 8f182270 ("mm/swap.c: flush lru pvecs on compound page
      arrival").  Doing this also makes the units of the vmstat counters more
      unified.  Afterwards, the units of the vmstat counters are pages, kB,
      and bytes: the B/KB suffix tells us that the unit is bytes or kB, and
      the rest, without a suffix, are in pages.
      
      Link: https://lkml.kernel.org/r/20201228164110.2838-3-songmuchun@bytedance.com
      Signed-off-by: Muchun Song <songmuchun@bytedance.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Rafael. J. Wysocki <rafael@kernel.org>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Sami Tolvanen <samitolvanen@google.com>
      Cc: Feng Tang <feng.tang@intel.com>
      Cc: NeilBrown <neilb@suse.de>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Pankaj Gupta <pankaj.gupta@cloud.ionos.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  3. 16 Dec 2020: 1 commit
  4. 17 Oct 2020: 1 commit
  5. 03 Oct 2020: 1 commit
  6. 02 Oct 2020: 5 commits
  7. 27 Sep 2020: 1 commit
    • mm: don't rely on system state to detect hot-plug operations · f85086f9
      Authored by Laurent Dufour
      In register_mem_sect_under_node() the system_state value is checked to
      detect whether the call is made during boot or during a hot-plug
      operation.  Unfortunately, that check against SYSTEM_BOOTING is wrong
      because regular memory is registered in the SYSTEM_SCHEDULING state.
      In addition, memory hot-plug operations can be triggered in this
      system state by ACPI [1].  So checking the system state is not enough.
      
      The consequence is that on systems with interleaved node ranges like this:
      
       Early memory node ranges
         node   1: [mem 0x0000000000000000-0x000000011fffffff]
         node   2: [mem 0x0000000120000000-0x000000014fffffff]
         node   1: [mem 0x0000000150000000-0x00000001ffffffff]
         node   0: [mem 0x0000000200000000-0x000000048fffffff]
         node   2: [mem 0x0000000490000000-0x00000007ffffffff]
      
      This can be seen on a PowerPC LPAR after multiple memory hot-plug and
      hot-unplug operations.  At the next reboot the node's memory ranges
      can be interleaved, and since the call to link_mem_sections() is made
      in topology_init() while the system is in the SYSTEM_SCHEDULING state,
      the node id is not checked, and the sections are registered to
      multiple nodes:
      
        $ ls -l /sys/devices/system/memory/memory21/node*
        total 0
        lrwxrwxrwx 1 root root     0 Aug 24 05:27 node1 -> ../../node/node1
        lrwxrwxrwx 1 root root     0 Aug 24 05:27 node2 -> ../../node/node2
      
      In that case the system is able to boot, but if one of these memory
      blocks is later hot-unplugged and then hot-plugged, the sysfs
      inconsistency is detected and triggers a BUG_ON():
      
        kernel BUG at /Users/laurent/src/linux-ppc/mm/memory_hotplug.c:1084!
        Oops: Exception in kernel mode, sig: 5 [#1]
        LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
        Modules linked in: rpadlpar_io rpaphp pseries_rng rng_core vmx_crypto gf128mul binfmt_misc ip_tables x_tables xfs libcrc32c crc32c_vpmsum autofs4
        CPU: 8 PID: 10256 Comm: drmgr Not tainted 5.9.0-rc1+ #25
        Call Trace:
          add_memory_resource+0x23c/0x340 (unreliable)
          __add_memory+0x5c/0xf0
          dlpar_add_lmb+0x1b4/0x500
          dlpar_memory+0x1f8/0xb80
          handle_dlpar_errorlog+0xc0/0x190
          dlpar_store+0x198/0x4a0
          kobj_attr_store+0x30/0x50
          sysfs_kf_write+0x64/0x90
          kernfs_fop_write+0x1b0/0x290
          vfs_write+0xe8/0x290
          ksys_write+0xdc/0x130
          system_call_exception+0x160/0x270
          system_call_common+0xf0/0x27c
      
      This patch addresses the root cause by not relying on the system_state
      value to detect whether the call is due to a hot-plug operation.
      Instead, an extra parameter is added to link_mem_sections() stating
      whether the operation is a hot-plug operation.
      
      [1] According to Oscar Salvador, using this qemu command line, ACPI
      memory hotplug operations are raised at SYSTEM_SCHEDULING state:
      
        $QEMU -enable-kvm -machine pc -smp 4,sockets=4,cores=1,threads=1 -cpu host -monitor pty \
              -m size=$MEM,slots=255,maxmem=4294967296k  \
              -numa node,nodeid=0,cpus=0-3,mem=512 -numa node,nodeid=1,mem=512 \
              -object memory-backend-ram,id=memdimm0,size=134217728 -device pc-dimm,node=0,memdev=memdimm0,id=dimm0,slot=0 \
              -object memory-backend-ram,id=memdimm1,size=134217728 -device pc-dimm,node=0,memdev=memdimm1,id=dimm1,slot=1 \
              -object memory-backend-ram,id=memdimm2,size=134217728 -device pc-dimm,node=0,memdev=memdimm2,id=dimm2,slot=2 \
              -object memory-backend-ram,id=memdimm3,size=134217728 -device pc-dimm,node=0,memdev=memdimm3,id=dimm3,slot=3 \
              -object memory-backend-ram,id=memdimm4,size=134217728 -device pc-dimm,node=1,memdev=memdimm4,id=dimm4,slot=4 \
              -object memory-backend-ram,id=memdimm5,size=134217728 -device pc-dimm,node=1,memdev=memdimm5,id=dimm5,slot=5 \
              -object memory-backend-ram,id=memdimm6,size=134217728 -device pc-dimm,node=1,memdev=memdimm6,id=dimm6,slot=6 \
      
      Fixes: 4fbce633 ("mm/memory_hotplug.c: make register_mem_sect_under_node() a callback of walk_memory_range()")
      Signed-off-by: Laurent Dufour <ldufour@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: "Rafael J. Wysocki" <rafael@kernel.org>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Nathan Lynch <nathanl@linux.ibm.com>
      Cc: Scott Cheloha <cheloha@linux.ibm.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: <stable@vger.kernel.org>
      Link: https://lkml.kernel.org/r/20200915094143.79181-3-ldufour@linux.ibm.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  8. 08 Aug 2020: 3 commits
  9. 03 Jun 2020: 1 commit
    • mm/writeback: discard NR_UNSTABLE_NFS, use NR_WRITEBACK instead · 8d92890b
      Authored by NeilBrown
      After an NFS page has been written it is considered "unstable" until a
      COMMIT request succeeds.  If the COMMIT fails, the page will be
      re-written.
      
      These "unstable" pages are currently accounted as "reclaimable", either
      in WB_RECLAIMABLE, or in NR_UNSTABLE_NFS which is included in a
      'reclaimable' count.  This might have made sense when sending the COMMIT
      required a separate action by the VFS/MM (e.g.  releasepage() used to
      send a COMMIT).  However now that all writes generated by ->writepages()
      will automatically be followed by a COMMIT (since commit 919e3bd9
      ("NFS: Ensure we commit after writeback is complete")) it makes more
      sense to treat them as writeback pages.
      
      So this patch removes NR_UNSTABLE_NFS and accounts unstable pages in
      NR_WRITEBACK and WB_WRITEBACK.
      
      A particular effect of this change is that when
      wb_check_background_flush() calls wb_over_bg_threshold(), the latter
      will report 'true' a lot less often as the 'unstable' pages are no
      longer considered 'dirty' (as there is nothing that writeback can do
      about them anyway).
      
      Currently wb_check_background_flush() will trigger writeback to NFS even
      when there are relatively few dirty pages (if there are lots of unstable
      pages), this can result in small writes going to the server (10s of
      Kilobytes rather than a Megabyte) which hurts throughput.  With this
      patch, there are fewer writes which are each larger on average.
      
      Where the NR_UNSTABLE_NFS count was included in statistics virtual
      files, the entry is retained, but the value is hard-coded as zero.
      Static trace points and warning printks which mentioned this counter
      no longer report it.
      
      [akpm@linux-foundation.org: re-layout comment]
      [akpm@linux-foundation.org: fix printk warning]
      Signed-off-by: NeilBrown <neilb@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Acked-by: Trond Myklebust <trond.myklebust@hammerspace.com>
      Acked-by: Michal Hocko <mhocko@suse.com>	[mm]
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Chuck Lever <chuck.lever@oracle.com>
      Link: http://lkml.kernel.org/r/87d06j7gqa.fsf@notabene.neil.brown.name
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  10. 15 May 2020: 1 commit
  11. 03 Apr 2020: 1 commit
  12. 05 Dec 2019: 1 commit
  13. 25 Sep 2019: 3 commits
  14. 19 Jul 2019: 4 commits
    • mm/memory_hotplug: rename walk_memory_range() and pass start+size instead of pfns · fbcf73ce
      Authored by David Hildenbrand
      walk_memory_range() was once used to iterate over sections.  Now it
      iterates over memory blocks.  Rename the function and fix up the
      documentation.
      
      Also, pass start+size instead of PFNs, which is what most callers
      already have at hand.  (we'll rework link_mem_sections() most probably
      soon)
      
      Follow-up patches will rework, simplify, and move walk_memory_blocks()
      to drivers/base/memory.c.
      
      Note: walk_memory_blocks() only works correctly right now if the
      start_pfn is aligned to a section start.  This is the case right now,
      but we'll generalize the function in a follow up patch so the semantics
      match the documentation.
      
      [akpm@linux-foundation.org: remove unused variable]
      Link: http://lkml.kernel.org/r/20190614100114.311-5-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Cc: Len Brown <lenb@kernel.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Rashmica Gupta <rashmica.g@gmail.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Michael Neuling <mikey@neuling.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Qian Cai <cai@lca.pw>
      Cc: Arun KS <arunks@codeaurora.org>
      Cc: Nick Desaulniers <ndesaulniers@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: make register_mem_sect_under_node() static · 8d595c4c
      Authored by David Hildenbrand
      It is only used internally.
      
      Link: http://lkml.kernel.org/r/20190614100114.311-4-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: "Rafael J. Wysocki" <rafael@kernel.org>
      Cc: Keith Busch <keith.busch@intel.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/memory_hotplug: make unregister_memory_block_under_nodes() never fail · a31b264c
      Authored by David Hildenbrand
      We really don't want anything during memory hot-unplug to fail.  We
      always pass a valid memory block device, so that check can go.  Avoid
      allocating memory, and thereby eventually failing.  As we are always
      called under lock, we can use a static piece of memory.  This avoids
      having to put the structure onto the stack and having to guess about
      the stack sizes of callers.
      
      Patch inspired by a patch from Oscar Salvador.
      
      In the future, there might be no need to iterate over nodes at all;
      mem->nid should tell us exactly what to remove.  Memory block devices
      with mixed nodes (added during boot) should be properly fenced off and
      never removed.
      
      Link: http://lkml.kernel.org/r/20190527111152.16324-11-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Wei Yang <richardw.yang@linux.intel.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: "Rafael J. Wysocki" <rafael@kernel.org>
      Cc: Alex Deucher <alexander.deucher@amd.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Mark Brown <broonie@kernel.org>
      Cc: Chris Wilson <chris@chris-wilson.co.uk>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
      Cc: Andrew Banman <andrew.banman@hpe.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Cc: Arun KS <arunks@codeaurora.org>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Chintan Pandya <cpandya@codeaurora.org>
      Cc: Christophe Leroy <christophe.leroy@c-s.fr>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Jun Yao <yaojun8558363@gmail.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Logan Gunthorpe <logang@deltatee.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
      Cc: Mathieu Malaterre <malat@debian.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: "mike.travis@hpe.com" <mike.travis@hpe.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Qian Cai <cai@lca.pw>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Rob Herring <robh@kernel.org>
      Cc: Robin Murphy <robin.murphy@arm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Cc: Yu Zhao <yuzhao@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/memory_hotplug: remove memory block devices before arch_remove_memory() · 4c4b7f9b
      Authored by David Hildenbrand
      Let's factor out the removal of memory block devices, which is only
      necessary for memory added via add_memory() and friends that created
      memory block devices.  Remove the devices before calling
      arch_remove_memory().
      
      This finishes factoring out memory block device handling from
      arch_add_memory() and arch_remove_memory().
      
      Link: http://lkml.kernel.org/r/20190527111152.16324-10-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Dan Williams <dan.j.williams@intel.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: "Rafael J. Wysocki" <rafael@kernel.org>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: "mike.travis@hpe.com" <mike.travis@hpe.com>
      Cc: Andrew Banman <andrew.banman@hpe.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Alex Deucher <alexander.deucher@amd.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Mark Brown <broonie@kernel.org>
      Cc: Chris Wilson <chris@chris-wilson.co.uk>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
      Cc: Arun KS <arunks@codeaurora.org>
      Cc: Mathieu Malaterre <malat@debian.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Chintan Pandya <cpandya@codeaurora.org>
      Cc: Christophe Leroy <christophe.leroy@c-s.fr>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Jun Yao <yaojun8558363@gmail.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Logan Gunthorpe <logang@deltatee.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Oscar Salvador <osalvador@suse.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Qian Cai <cai@lca.pw>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Rob Herring <robh@kernel.org>
      Cc: Robin Murphy <robin.murphy@arm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Cc: Yu Zhao <yuzhao@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  15. 21 Jun 2019: 1 commit
    • drivers: base/node.c: fix kernel-doc markups · 58cb346c
      Authored by Mauro Carvalho Chehab
      There were typos in the names of the variables inside the kernel-doc
      comments, causing these warnings:
      
      	./drivers/base/node.c:690: warning: Function parameter or member 'mem_nid' not described in 'register_memory_node_under_compute_node'
      	./drivers/base/node.c:690: warning: Function parameter or member 'cpu_nid' not described in 'register_memory_node_under_compute_node'
      	./drivers/base/node.c:690: warning: Excess function parameter 'mem_node' description in 'register_memory_node_under_compute_node'
      	./drivers/base/node.c:690: warning: Excess function parameter 'cpu_node' description in 'register_memory_node_under_compute_node'
      
      There's also a description missing here:
      	./drivers/base/node.c:78: warning: Function parameter or member 'hmem_attrs' not described in 'node_access_nodes'
      
      Copy an existing description from another function call.
      Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      58cb346c
  16. 05 Apr, 2019 3 commits
    • K
      node: Add memory-side caching attributes · acc02a10
      By Keith Busch
      System memory may have caches to help improve access speed to frequently
      requested address ranges. While the system provided cache is transparent
      to the software accessing these memory ranges, applications can optimize
      their own access based on cache attributes.
      
      Provide a new API for the kernel to register these memory-side caches
      under the memory node that provides it.
      
      The new sysfs representation is modeled from the existing cpu cacheinfo
      attributes, as seen from /sys/devices/system/cpu/<cpu>/cache/.  Unlike CPU
      cacheinfo though, the node cache level is reported from the view of the
      memory. A higher level number is nearer to the CPU, while lower levels
      are closer to the last level memory.
      
      The exported attributes are the cache size, the line size, associativity
      indexing, and write-back policy. The attributes for the system memory
      caches are added to the sysfs stable documentation.
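      As a user-space illustration only (not part of the kernel change), the
      per-level attributes could be collected roughly as below. This is a
      minimal sketch assuming the memory_side_cache/indexN layout documented
      with this commit; `read_node_caches` is a hypothetical helper name.

```python
import os

# Attribute files exposed for each memory-side cache level
# (assumed layout: <node_dir>/memory_side_cache/indexN/<attr>).
CACHE_ATTRS = ("size", "line_size", "indexing", "write_policy")

def read_node_caches(node_dir):
    """Return {level: {attr: value}} for each memory_side_cache/indexN
    directory under node_dir; values are kept as raw strings."""
    caches = {}
    base = os.path.join(node_dir, "memory_side_cache")
    if not os.path.isdir(base):
        return caches
    for entry in sorted(os.listdir(base)):
        if not entry.startswith("index"):
            continue
        attrs = {}
        for attr in CACHE_ATTRS:
            path = os.path.join(base, entry, attr)
            if os.path.exists(path):
                with open(path) as f:
                    attrs[attr] = f.read().strip()
        caches[int(entry[len("index"):])] = attrs
    return caches
```

      On real hardware this would be pointed at a directory such as
      /sys/devices/system/node/node0; nodes without a memory-side cache simply
      yield an empty result.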
      Signed-off-by: Keith Busch <keith.busch@intel.com>
      Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Reviewed-by: Brice Goglin <Brice.Goglin@inria.fr>
      Tested-by: Brice Goglin <Brice.Goglin@inria.fr>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      acc02a10
    • K
      node: Add heterogenous memory access attributes · e1cf33aa
      By Keith Busch
      Heterogeneous memory systems provide memory nodes with different latency
      and bandwidth performance attributes. Provide a new kernel interface
      for subsystems to register the attributes under the memory target
      node's initiator access class. If the system provides this information,
      applications may query these attributes when deciding which node to
      request memory.
      
      The following example shows the new sysfs hierarchy for a node exporting
      performance attributes:
      
        # tree -P "read*|write*" /sys/devices/system/node/nodeY/accessZ/initiators/
        /sys/devices/system/node/nodeY/accessZ/initiators/
        |-- read_bandwidth
        |-- read_latency
        |-- write_bandwidth
        `-- write_latency
      
      The bandwidth is exported as MB/s and latency is reported in
      nanoseconds. The values are taken from the platform as reported by the
      manufacturer.
      
      Memory accesses from an initiator node that is not one of the memory's
      access "Z" initiator nodes linked in the same directory may observe
      different performance than reported here. When a subsystem makes use
      of this interface, initiators of a different access number may not have
      the same performance relative to initiators in other access numbers, or
      may be omitted from any access class's initiators altogether.
      
      Descriptions for memory access initiator performance access attributes
      are added to sysfs stable documentation.
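      For illustration, the four attributes above could be read from user
      space roughly as follows. This is a sketch only; `read_access_perf` is a
      hypothetical helper, while the file names and units come from the commit
      text (bandwidth in MB/s, latency in nanoseconds).

```python
import os

# Files under nodeY/accessZ/initiators/: bandwidth in MB/s, latency in ns.
PERF_ATTRS = ("read_bandwidth", "read_latency",
              "write_bandwidth", "write_latency")

def read_access_perf(node_dir, access_class=0):
    """Return {attr: int} for one node's access class, skipping any
    attribute the platform did not report."""
    base = os.path.join(node_dir, "access%d" % access_class, "initiators")
    perf = {}
    for attr in PERF_ATTRS:
        path = os.path.join(base, attr)
        if os.path.exists(path):
            with open(path) as f:
                perf[attr] = int(f.read().strip())
    return perf
```

      An application deciding where to request memory could compare the
      returned dictionaries across candidate nodes; missing attributes just
      mean the platform did not provide that value.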
      Acked-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
      Tested-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
      Signed-off-by: Keith Busch <keith.busch@intel.com>
      Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Tested-by: Brice Goglin <Brice.Goglin@inria.fr>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      e1cf33aa
    • K
      node: Link memory nodes to their compute nodes · 08d9dbe7
      By Keith Busch
      Systems may be constructed with various specialized nodes. Some nodes
      may provide memory, some provide compute devices that access and use
      that memory, and others may provide both. Nodes that provide memory are
      referred to as memory targets, and nodes that can initiate memory access
      are referred to as memory initiators.
      
      Memory targets will often have varying access characteristics from
      different initiators, and platforms may have ways to express those
      relationships. In preparation for these systems, provide interfaces for
      the kernel to export the memory relationship among different nodes memory
      targets and their initiators with symlinks to each other.
      
      If a system provides access locality for each initiator-target pair, nodes
      may be grouped into ranked access classes relative to other nodes. The
      new interface allows a subsystem to register relationships of varying
      classes if available and desired to be exported.
      
      A memory initiator may have multiple memory targets in the same access
      class. The target memory's initiators in a given class indicate that the
      nodes' access characteristics share the same performance relative to
      other linked initiator nodes. Each target within an initiator's access
      class, though, does not necessarily perform the same as the others.
      
      A memory target node may have multiple memory initiators. All linked
      initiators in a target's class have the same access characteristics to
      that target.
      
      The following example shows the nodes' new sysfs hierarchy for a memory
      target node 'Y' with access class 0 from initiator node 'X':
      
        # symlinks -v /sys/devices/system/node/nodeX/access0/
        relative: /sys/devices/system/node/nodeX/access0/targets/nodeY -> ../../nodeY
      
        # symlinks -v /sys/devices/system/node/nodeY/access0/
        relative: /sys/devices/system/node/nodeY/access0/initiators/nodeX -> ../../nodeX
      
      The new attributes are added to the sysfs stable documentation.
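      A minimal user-space sketch of enumerating these links (illustrative
      only; `linked_nodes` is a hypothetical helper, and the
      access<class>/initiators and access<class>/targets directory names are
      those shown in the example above):

```python
import os
import re

def linked_nodes(node_dir, access_class, kind):
    """List node names (e.g. 'node3') linked under
    <node_dir>/access<class>/<kind>/, where kind is 'initiators'
    or 'targets'."""
    base = os.path.join(node_dir, "access%d" % access_class, kind)
    if not os.path.isdir(base):
        return []
    return sorted(e for e in os.listdir(base)
                  if re.fullmatch(r"node\d+", e))
```

      Because the kernel exposes the relationships as symlinks named after the
      peer node, listing the directory entries is enough; there is no need to
      resolve the links to recover the node numbers.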
      Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
      Signed-off-by: Keith Busch <keith.busch@intel.com>
      Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Tested-by: Brice Goglin <Brice.Goglin@inria.fr>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      08d9dbe7
  17. 27 Oct, 2018 1 commit
    • V
      mm, proc: add KReclaimable to /proc/meminfo · 61f94e18
      By Vlastimil Babka
      The vmstat NR_KERNEL_MISC_RECLAIMABLE counter is for kernel non-slab
      allocations that can be reclaimed via shrinker.  In /proc/meminfo, we can
      show the sum of all reclaimable kernel allocations (including slab) as
      "KReclaimable".  Add the same counter also to per-node meminfo under /sys.
      
      With this counter, users will have more complete information about kernel
      memory usage.  Non-slab reclaimable pages (currently just the ION
      allocator) will not be missing from /proc/meminfo, making users wonder
      where part of their memory went.  More precisely, they already appear in
      MemAvailable, but without the new counter, it's not obvious why the value
      in MemAvailable doesn't fully correspond with the sum of other counters
      participating in it.
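      For illustration, a user-space consumer could pick up the new counter
      like this. A minimal sketch: `parse_meminfo` is a hypothetical helper,
      and /proc/meminfo values are reported in kB.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style 'Name:   12345 kB' lines into
    {name: value_in_kb}; lines without a numeric field are skipped."""
    info = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        fields = rest.split()
        if sep and fields and fields[0].isdigit():
            info[name.strip()] = int(fields[0])
    return info

# Against the real file this would be:
#   with open("/proc/meminfo") as f:
#       kreclaimable = parse_meminfo(f.read()).get("KReclaimable")
```

      With the counter present, MemAvailable can be cross-checked against the
      sum of the individual reclaimable counters instead of leaving an
      unexplained gap.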
      
      Link: http://lkml.kernel.org/r/20180731090649.16028-6-vbabka@suse.cz
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Roman Gushchin <guro@fb.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Laura Abbott <labbott@redhat.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Sumit Semwal <sumit.semwal@linaro.org>
      Cc: Vijayanand Jitta <vjitta@codeaurora.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      61f94e18
  18. 18 Aug, 2018 2 commits
  19. 26 May, 2018 1 commit
    • J
      mm/memory_hotplug: fix leftover use of struct page during hotplug · a2155861
      By Jonathan Cameron
      The case of a new NUMA node got missed in avoiding use of the node info
      from struct page during hotplug.  In this path we have a call to
      register_mem_sect_under_node() (which allows us to specify that this is
      hotplug, so don't change the node), via link_mem_sections(), which
      unfortunately does not allow that to be specified.
      
      Fix is to pass check_nid through link_mem_sections as well and disable
      it in the new numa node path.
      
      Note the bug only 'sometimes' manifests depending on what happens to be
      in the struct page structures - there are lots of them and it only needs
      to match one of them.
      
      The result of the bug is that (with a new memory only node) we never
      successfully call register_mem_sect_under_node so don't get the memory
      associated with the node in sysfs and meminfo for the node doesn't
      report it.
      
      It came up whilst testing some arm64 hotplug patches, but appears to be
      universal.  Whilst I'm triggering it by removing then reinserting memory
      to a node with no other elements (thus making the node disappear then
      appear again), it appears it would happen on hotplugging memory where
      there was none before and it doesn't seem to be related the arm64
      patches.
      
      These patches call __add_pages (where most of the issue was fixed by
      Pavel's patch).  If there is a node at the time of the __add_pages call
      then all is well as it calls register_mem_sect_under_node from there
      with check_nid set to false.  Without a node that function returns
      having not done the sysfs related stuff as there is no node to use.
      This is expected but it is the resulting path that fails...
      
      Exact path to the problem is as follows:
      
       mm/memory_hotplug.c: add_memory_resource()
      
         The node is not online so we enter the 'if (new_node)' twice, on the
         second such block there is a call to link_mem_sections which calls
         into
      
        drivers/node.c: link_mem_sections() which calls
      
        drivers/node.c: register_mem_sect_under_node() which calls
           get_nid_for_pfn and keeps trying until the output of that matches
           the expected node (passed all the way down from
           add_memory_resource)
      
      It is effectively the same fix as the one referred to in the fixes tag,
      just in the code path for a new node, where the comments point out we
      have to rerun the link creation because it will have failed in
      register_new_memory (as there was no node at the time).  (Actually that
      comment is now out of date: we no longer have register_new_memory; it
      was renamed to hotplug_memory_register in Pavel's patch.)
      
      Link: http://lkml.kernel.org/r/20180504085311.1240-1-Jonathan.Cameron@huawei.com
      Fixes: fc44f7f9 ("mm/memory_hotplug: don't read nid from struct page during hotplug")
      Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
      Reviewed-by: Pavel Tatashin <pasha.tatashin@oracle.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a2155861
  20. 06 Apr, 2018 2 commits