1. 17 10月, 2020 3 次提交
  2. 14 10月, 2020 1 次提交
    • D
      mm/memory_hotplug: introduce default phys_to_target_node() implementation · a035b6bf
      Dan Williams 提交于
      In preparation to set a fallback value for dev_dax->target_node, introduce
      generic fallback helpers for phys_to_target_node()
      
      A generic implementation based on node-data or memblock was proposed, but
      as noted by Mike:
      
          "Here again, I would prefer to add a weak default for
           phys_to_target_node() because the "generic" implementation is not really
           generic.
      
           The fallback to reserved ranges is x86 specfic because on x86 most of
           the reserved areas is not in memblock.memory. AFAIK, no other
           architecture does this."
      
      The info message in the generic memory_add_physaddr_to_nid()
      implementation is fixed up to properly reflect that
      memory_add_physaddr_to_nid() communicates "online" node info and
      phys_to_target_node() indicates "target / to-be-onlined" node info.
      
      [akpm@linux-foundation.org: fix CONFIG_MEMORY_HOTPLUG=n build]
        Link: https://lkml.kernel.org/r/202008252130.7YrHIyMI%25lkp@intel.comSigned-off-by: NDan Williams <dan.j.williams@intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Jia He <justin.he@arm.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Ard Biesheuvel <ardb@kernel.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Ben Skeggs <bskeggs@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brice Goglin <Brice.Goglin@inria.fr>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Daniel Vetter <daniel@ffwll.ch>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: David Airlie <airlied@linux.ie>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: Jason Gunthorpe <jgg@mellanox.com>
      Cc: Jeff Moyer <jmoyer@redhat.com>
      Cc: Joao Martins <joao.m.martins@oracle.com>
      Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Paul Mackerras <paulus@ozlabs.org>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Cc: Vishal Verma <vishal.l.verma@intel.com>
      Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Cc: Bjorn Helgaas <bhelgaas@google.com>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Hulk Robot <hulkci@huawei.com>
      Cc: Jason Yan <yanaijie@huawei.com>
      Cc: "Jérôme Glisse" <jglisse@redhat.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: kernel test robot <lkp@intel.com>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Stefano Stabellini <sstabellini@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Link: https://lkml.kernel.org/r/159643097768.4062302.3135192588966888630.stgit@dwillia2-desk3.amr.corp.intel.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a035b6bf
  3. 27 9月, 2020 2 次提交
    • L
      mm: don't rely on system state to detect hot-plug operations · f85086f9
      Laurent Dufour 提交于
      In register_mem_sect_under_node() the system_state's value is checked to
      detect whether the call is made during boot time or during an hot-plug
      operation.  Unfortunately, that check against SYSTEM_BOOTING is wrong
      because regular memory is registered at SYSTEM_SCHEDULING state.  In
      addition, memory hot-plug operation can be triggered at this system
      state by the ACPI [1].  So checking against the system state is not
      enough.
      
      The consequence is that on system with interleaved node's ranges like this:
      
       Early memory node ranges
         node   1: [mem 0x0000000000000000-0x000000011fffffff]
         node   2: [mem 0x0000000120000000-0x000000014fffffff]
         node   1: [mem 0x0000000150000000-0x00000001ffffffff]
         node   0: [mem 0x0000000200000000-0x000000048fffffff]
         node   2: [mem 0x0000000490000000-0x00000007ffffffff]
      
      This can be seen on PowerPC LPAR after multiple memory hot-plug and
      hot-unplug operations are done.  At the next reboot the node's memory
      ranges can be interleaved and since the call to link_mem_sections() is
      made in topology_init() while the system is in the SYSTEM_SCHEDULING
      state, the node's id is not checked, and the sections registered to
      multiple nodes:
      
        $ ls -l /sys/devices/system/memory/memory21/node*
        total 0
        lrwxrwxrwx 1 root root     0 Aug 24 05:27 node1 -> ../../node/node1
        lrwxrwxrwx 1 root root     0 Aug 24 05:27 node2 -> ../../node/node2
      
      In that case, the system is able to boot but if later one of theses
      memory blocks is hot-unplugged and then hot-plugged, the sysfs
      inconsistency is detected and this is triggering a BUG_ON():
      
        kernel BUG at /Users/laurent/src/linux-ppc/mm/memory_hotplug.c:1084!
        Oops: Exception in kernel mode, sig: 5 [#1]
        LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
        Modules linked in: rpadlpar_io rpaphp pseries_rng rng_core vmx_crypto gf128mul binfmt_misc ip_tables x_tables xfs libcrc32c crc32c_vpmsum autofs4
        CPU: 8 PID: 10256 Comm: drmgr Not tainted 5.9.0-rc1+ #25
        Call Trace:
          add_memory_resource+0x23c/0x340 (unreliable)
          __add_memory+0x5c/0xf0
          dlpar_add_lmb+0x1b4/0x500
          dlpar_memory+0x1f8/0xb80
          handle_dlpar_errorlog+0xc0/0x190
          dlpar_store+0x198/0x4a0
          kobj_attr_store+0x30/0x50
          sysfs_kf_write+0x64/0x90
          kernfs_fop_write+0x1b0/0x290
          vfs_write+0xe8/0x290
          ksys_write+0xdc/0x130
          system_call_exception+0x160/0x270
          system_call_common+0xf0/0x27c
      
      This patch addresses the root cause by not relying on the system_state
      value to detect whether the call is due to a hot-plug operation.  An
      extra parameter is added to link_mem_sections() detailing whether the
      operation is due to a hot-plug operation.
      
      [1] According to Oscar Salvador, using this qemu command line, ACPI
      memory hotplug operations are raised at SYSTEM_SCHEDULING state:
      
        $QEMU -enable-kvm -machine pc -smp 4,sockets=4,cores=1,threads=1 -cpu host -monitor pty \
              -m size=$MEM,slots=255,maxmem=4294967296k  \
              -numa node,nodeid=0,cpus=0-3,mem=512 -numa node,nodeid=1,mem=512 \
              -object memory-backend-ram,id=memdimm0,size=134217728 -device pc-dimm,node=0,memdev=memdimm0,id=dimm0,slot=0 \
              -object memory-backend-ram,id=memdimm1,size=134217728 -device pc-dimm,node=0,memdev=memdimm1,id=dimm1,slot=1 \
              -object memory-backend-ram,id=memdimm2,size=134217728 -device pc-dimm,node=0,memdev=memdimm2,id=dimm2,slot=2 \
              -object memory-backend-ram,id=memdimm3,size=134217728 -device pc-dimm,node=0,memdev=memdimm3,id=dimm3,slot=3 \
              -object memory-backend-ram,id=memdimm4,size=134217728 -device pc-dimm,node=1,memdev=memdimm4,id=dimm4,slot=4 \
              -object memory-backend-ram,id=memdimm5,size=134217728 -device pc-dimm,node=1,memdev=memdimm5,id=dimm5,slot=5 \
              -object memory-backend-ram,id=memdimm6,size=134217728 -device pc-dimm,node=1,memdev=memdimm6,id=dimm6,slot=6 \
      
      Fixes: 4fbce633 ("mm/memory_hotplug.c: make register_mem_sect_under_node() a callback of walk_memory_range()")
      Signed-off-by: NLaurent Dufour <ldufour@linux.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: NDavid Hildenbrand <david@redhat.com>
      Reviewed-by: NOscar Salvador <osalvador@suse.de>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: "Rafael J. Wysocki" <rafael@kernel.org>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Nathan Lynch <nathanl@linux.ibm.com>
      Cc: Scott Cheloha <cheloha@linux.ibm.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: <stable@vger.kernel.org>
      Link: https://lkml.kernel.org/r/20200915094143.79181-3-ldufour@linux.ibm.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f85086f9
    • L
      mm: replace memmap_context by meminit_context · c1d0da83
      Laurent Dufour 提交于
      Patch series "mm: fix memory to node bad links in sysfs", v3.
      
      Sometimes, firmware may expose interleaved memory layout like this:
      
       Early memory node ranges
         node   1: [mem 0x0000000000000000-0x000000011fffffff]
         node   2: [mem 0x0000000120000000-0x000000014fffffff]
         node   1: [mem 0x0000000150000000-0x00000001ffffffff]
         node   0: [mem 0x0000000200000000-0x000000048fffffff]
         node   2: [mem 0x0000000490000000-0x00000007ffffffff]
      
      In that case, we can see memory blocks assigned to multiple nodes in
      sysfs:
      
        $ ls -l /sys/devices/system/memory/memory21
        total 0
        lrwxrwxrwx 1 root root     0 Aug 24 05:27 node1 -> ../../node/node1
        lrwxrwxrwx 1 root root     0 Aug 24 05:27 node2 -> ../../node/node2
        -rw-r--r-- 1 root root 65536 Aug 24 05:27 online
        -r--r--r-- 1 root root 65536 Aug 24 05:27 phys_device
        -r--r--r-- 1 root root 65536 Aug 24 05:27 phys_index
        drwxr-xr-x 2 root root     0 Aug 24 05:27 power
        -r--r--r-- 1 root root 65536 Aug 24 05:27 removable
        -rw-r--r-- 1 root root 65536 Aug 24 05:27 state
        lrwxrwxrwx 1 root root     0 Aug 24 05:25 subsystem -> ../../../../bus/memory
        -rw-r--r-- 1 root root 65536 Aug 24 05:25 uevent
        -r--r--r-- 1 root root 65536 Aug 24 05:27 valid_zones
      
      The same applies in the node's directory with a memory21 link in both
      the node1 and node2's directory.
      
      This is wrong but doesn't prevent the system to run.  However when
      later, one of these memory blocks is hot-unplugged and then hot-plugged,
      the system is detecting an inconsistency in the sysfs layout and a
      BUG_ON() is raised:
      
        kernel BUG at /Users/laurent/src/linux-ppc/mm/memory_hotplug.c:1084!
        LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
        Modules linked in: rpadlpar_io rpaphp pseries_rng rng_core vmx_crypto gf128mul binfmt_misc ip_tables x_tables xfs libcrc32c crc32c_vpmsum autofs4
        CPU: 8 PID: 10256 Comm: drmgr Not tainted 5.9.0-rc1+ #25
        Call Trace:
          add_memory_resource+0x23c/0x340 (unreliable)
          __add_memory+0x5c/0xf0
          dlpar_add_lmb+0x1b4/0x500
          dlpar_memory+0x1f8/0xb80
          handle_dlpar_errorlog+0xc0/0x190
          dlpar_store+0x198/0x4a0
          kobj_attr_store+0x30/0x50
          sysfs_kf_write+0x64/0x90
          kernfs_fop_write+0x1b0/0x290
          vfs_write+0xe8/0x290
          ksys_write+0xdc/0x130
          system_call_exception+0x160/0x270
          system_call_common+0xf0/0x27c
      
      This has been seen on PowerPC LPAR.
      
      The root cause of this issue is that when node's memory is registered,
      the range used can overlap another node's range, thus the memory block
      is registered to multiple nodes in sysfs.
      
      There are two issues here:
      
       (a) The sysfs memory and node's layouts are broken due to these
           multiple links
      
       (b) The link errors in link_mem_sections() should not lead to a system
           panic.
      
      To address (a) register_mem_sect_under_node should not rely on the
      system state to detect whether the link operation is triggered by a hot
      plug operation or not.  This is addressed by the patches 1 and 2 of this
      series.
      
      Issue (b) will be addressed separately.
      
      This patch (of 2):
      
      The memmap_context enum is used to detect whether a memory operation is
      due to a hot-add operation or happening at boot time.
      
      Make it general to the hotplug operation and rename it as
      meminit_context.
      
      There is no functional change introduced by this patch
      Suggested-by: NDavid Hildenbrand <david@redhat.com>
      Signed-off-by: NLaurent Dufour <ldufour@linux.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: NDavid Hildenbrand <david@redhat.com>
      Reviewed-by: NOscar Salvador <osalvador@suse.de>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: "Rafael J . Wysocki" <rafael@kernel.org>
      Cc: Nathan Lynch <nathanl@linux.ibm.com>
      Cc: Scott Cheloha <cheloha@linux.ibm.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: <stable@vger.kernel.org>
      Link: https://lkml.kernel.org/r/20200915094143.79181-1-ldufour@linux.ibm.com
      Link: https://lkml.kernel.org/r/20200915132624.9723-1-ldufour@linux.ibm.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c1d0da83
  4. 20 9月, 2020 1 次提交
  5. 15 8月, 2020 1 次提交
  6. 13 8月, 2020 4 次提交
    • J
      mm/migrate: introduce a standard migration target allocation function · 19fc7bed
      Joonsoo Kim 提交于
      There are some similar functions for migration target allocation.  Since
      there is no fundamental difference, it's better to keep just one rather
      than keeping all variants.  This patch implements base migration target
      allocation function.  In the following patches, variants will be converted
      to use this function.
      
      Changes should be mechanical, but, unfortunately, there are some
      differences.  First, some callers' nodemask is assgined to NULL since NULL
      nodemask will be considered as all available nodes, that is,
      &node_states[N_MEMORY].  Second, for hugetlb page allocation, gfp_mask is
      redefined as regular hugetlb allocation gfp_mask plus __GFP_THISNODE if
      user provided gfp_mask has it.  This is because future caller of this
      function requires to set this node constaint.  Lastly, if provided nodeid
      is NUMA_NO_NODE, nodeid is set up to the node where migration source
      lives.  It helps to remove simple wrappers for setting up the nodeid.
      
      Note that PageHighmem() call in previous function is changed to open-code
      "is_highmem_idx()" since it provides more readability.
      
      [akpm@linux-foundation.org: tweak patch title, per Vlastimil]
      [akpm@linux-foundation.org: fix typo in comment]
      Signed-off-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Roman Gushchin <guro@fb.com>
      Link: http://lkml.kernel.org/r/1594622517-20681-6-git-send-email-iamjoonsoo.kim@lge.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      19fc7bed
    • C
      mm, memory_hotplug: update pcp lists everytime onlining a memory block · de1193f0
      Charan Teja Reddy 提交于
      When onlining a first memory block in a zone, pcp lists are not updated
      thus pcp struct will have the default setting of ->high = 0,->batch = 1.
      
      This means till the second memory block in a zone(if it have) is onlined
      the pcp lists of this zone will not contain any pages because pcp's
      ->count is always greater than ->high thus free_pcppages_bulk() is called
      to free batch size(=1) pages every time system wants to add a page to the
      pcp list through free_unref_page().
      
      To put this in a word, system is not using benefits offered by the pcp
      lists when there is a single onlineable memory block in a zone.  Correct
      this by always updating the pcp lists when memory block is onlined.
      
      Fixes: 1f522509 ("mem-hotplug: avoid multiple zones sharing same boot strapping boot_pageset")
      Signed-off-by: NCharan Teja Reddy <charante@codeaurora.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: NDavid Hildenbrand <david@redhat.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Vinayak Menon <vinmenon@codeaurora.org>
      Link: http://lkml.kernel.org/r/1596372896-15336-1-git-send-email-charante@codeaurora.orgSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      de1193f0
    • J
      mm/memory_hotplug: fix unpaired mem_hotplug_begin/done · b4223a51
      Jia He 提交于
      When check_memblock_offlined_cb() returns failed rc(e.g. the memblock is
      online at that time), mem_hotplug_begin/done is unpaired in such case.
      
      Therefore a warning:
       Call Trace:
        percpu_up_write+0x33/0x40
        try_remove_memory+0x66/0x120
        ? _cond_resched+0x19/0x30
        remove_memory+0x2b/0x40
        dev_dax_kmem_remove+0x36/0x72 [kmem]
        device_release_driver_internal+0xf0/0x1c0
        device_release_driver+0x12/0x20
        bus_remove_device+0xe1/0x150
        device_del+0x17b/0x3e0
        unregister_dev_dax+0x29/0x60
        devm_action_release+0x15/0x20
        release_nodes+0x19a/0x1e0
        devres_release_all+0x3f/0x50
        device_release_driver_internal+0x100/0x1c0
        driver_detach+0x4c/0x8f
        bus_remove_driver+0x5c/0xd0
        driver_unregister+0x31/0x50
        dax_pmem_exit+0x10/0xfe0 [dax_pmem]
      
      Fixes: f1037ec0 ("mm/memory_hotplug: fix remove_memory() lockdep splat")
      Signed-off-by: NJia He <justin.he@arm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: NDavid Hildenbrand <david@redhat.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NDan Williams <dan.j.williams@intel.com>
      Cc: <stable@vger.kernel.org>	[5.6+]
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Chuhong Yuan <hslester96@gmail.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jonathan Cameron <Jonathan.Cameron@Huawei.com>
      Cc: Kaly Xin <Kaly.Xin@arm.com>
      Cc: Logan Gunthorpe <logang@deltatee.com>
      Cc: Masahiro Yamada <masahiroy@kernel.org>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vishal Verma <vishal.l.verma@intel.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Link: http://lkml.kernel.org/r/20200710031619.18762-3-justin.he@arm.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b4223a51
    • J
      mm/memory_hotplug: introduce default dummy memory_add_physaddr_to_nid() · d622ecec
      Jia He 提交于
      This is to introduce a general dummy helper.  memory_add_physaddr_to_nid()
      is a fallback option to get the nid in case NUMA_NO_NID is detected.
      
      After this patch, arm64/sh/s390 can simply use the general dummy version.
      PowerPC/x86/ia64 will still use their specific version.
      
      This is the preparation to set a fallback value for dev_dax->target_node.
      Signed-off-by: NJia He <justin.he@arm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: NDavid Hildenbrand <david@redhat.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Vishal Verma <vishal.l.verma@intel.com>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Chuhong Yuan <hslester96@gmail.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Logan Gunthorpe <logang@deltatee.com>
      Cc: Masahiro Yamada <masahiroy@kernel.org>
      Cc: Jonathan Cameron <Jonathan.Cameron@Huawei.com>
      Cc: Kaly Xin <Kaly.Xin@arm.com>
      Link: http://lkml.kernel.org/r/20200710031619.18762-2-justin.he@arm.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d622ecec
  7. 08 8月, 2020 2 次提交
  8. 26 6月, 2020 1 次提交
    • B
      mm/memory_hotplug.c: fix false softlockup during pfn range removal · b7e3debd
      Ben Widawsky 提交于
      When working with very large nodes, poisoning the struct pages (for which
      there will be very many) can take a very long time.  If the system is
      using voluntary preemptions, the software watchdog will not be able to
      detect forward progress.  This patch addresses this issue by offering to
      give up time like __remove_pages() does.  This behavior was introduced in
      v5.6 with: commit d33695b1 ("mm/memory_hotplug: poison memmap in
      remove_pfn_range_from_zone()")
      
      Alternately, init_page_poison could do this cond_resched(), but it seems
      to me that the caller of init_page_poison() is what actually knows whether
      or not it should relax its own priority.
      
      Based on Dan's notes, I think this is perfectly safe: commit f931ab47
      ("mm: fix devm_memremap_pages crash, use mem_hotplug_{begin, done}")
      
      Aside from fixing the lockup, it is also a friendlier thing to do on lower
      core systems that might wipe out large chunks of hotplug memory (probably
      not a very common case).
      
      Fixes this kind of splat:
      
        watchdog: BUG: soft lockup - CPU#46 stuck for 22s! [daxctl:9922]
        irq event stamp: 138450
        hardirqs last  enabled at (138449): [<ffffffffa1001f26>] trace_hardirqs_on_thunk+0x1a/0x1c
        hardirqs last disabled at (138450): [<ffffffffa1001f42>] trace_hardirqs_off_thunk+0x1a/0x1c
        softirqs last  enabled at (138448): [<ffffffffa1e00347>] __do_softirq+0x347/0x456
        softirqs last disabled at (138443): [<ffffffffa10c416d>] irq_exit+0x7d/0xb0
        CPU: 46 PID: 9922 Comm: daxctl Not tainted 5.7.0-BEN-14238-g373c6049b336 #30
        Hardware name: Intel Corporation PURLEY/PURLEY, BIOS PLYXCRB1.86B.0578.D07.1902280810 02/28/2019
        RIP: 0010:memset_erms+0x9/0x10
        Code: c1 e9 03 40 0f b6 f6 48 b8 01 01 01 01 01 01 01 01 48 0f af c6 f3 48 ab 89 d1 f3 aa 4c 89 c8 c3 90 49 89 f9 40 88 f0 48 89 d1 <f3> aa 4c 89 c8 c3 90 49 89 fa 40 0f b6 ce 48 b8 01 01 01 01 01 01
        Call Trace:
         remove_pfn_range_from_zone+0x3a/0x380
         memunmap_pages+0x17f/0x280
         release_nodes+0x22a/0x260
         __device_release_driver+0x172/0x220
         device_driver_detach+0x3e/0xa0
         unbind_store+0x113/0x130
         kernfs_fop_write+0xdc/0x1c0
         vfs_write+0xde/0x1d0
         ksys_write+0x58/0xd0
         do_syscall_64+0x5a/0x120
         entry_SYSCALL_64_after_hwframe+0x49/0xb3
        Built 2 zonelists, mobility grouping on.  Total pages: 49050381
        Policy zone: Normal
        Built 3 zonelists, mobility grouping on.  Total pages: 49312525
        Policy zone: Normal
      
      David said: "It really only is an issue for devmem.  Ordinary
      hotplugged system memory is not affected (onlined/offlined in memory
      block granularity)."
      
      Link: http://lkml.kernel.org/r/20200619231213.1160351-1-ben.widawsky@intel.com
      Fixes: commit d33695b1 ("mm/memory_hotplug: poison memmap in remove_pfn_range_from_zone()")
      Signed-off-by: NBen Widawsky <ben.widawsky@intel.com>
      Reported-by: N"Scargall, Steve" <steve.scargall@intel.com>
      Reported-by: NBen Widawsky <ben.widawsky@intel.com>
      Acked-by: NDavid Hildenbrand <david@redhat.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Vishal Verma <vishal.l.verma@intel.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b7e3debd
  9. 05 6月, 2020 8 次提交
    • E
    • D
      mm/memory_hotplug: introduce add_memory_driver_managed() · 7b7b2721
      David Hildenbrand 提交于
      Patch series "mm/memory_hotplug: Interface to add driver-managed system
      ram", v4.
      
      kexec (via kexec_load()) can currently not properly handle memory added
      via dax/kmem, and will have similar issues with virtio-mem.  kexec-tools
      will currently add all memory to the fixed-up initial firmware memmap.  In
      case of dax/kmem, this means that - in contrast to a proper reboot - how
      that persistent memory will be used can no longer be configured by the
      kexec'd kernel.  In case of virtio-mem it will be harmful, because that
      memory might contain inaccessible pieces that require coordination with
      hypervisor first.
      
      In both cases, we want to let the driver in the kexec'd kernel handle
      detecting and adding the memory, like during an ordinary reboot.
      Introduce add_memory_driver_managed().  More on the samentics are in patch
      #1.
      
      In the future, we might want to make this behavior configurable for
      dax/kmem- either by configuring it in the kernel (which would then also
      allow to configure kexec_file_load()) or in kexec-tools by also adding
      "System RAM (kmem)" memory from /proc/iomem to the fixed-up initial
      firmware memmap.
      
      More on the motivation can be found in [1] and [2].
      
      [1] https://lkml.kernel.org/r/20200429160803.109056-1-david@redhat.com
      [2] https://lkml.kernel.org/r/20200430102908.10107-1-david@redhat.com
      
      This patch (of 3):
      
      Some device drivers rely on memory they managed to not get added to the
      initial (firmware) memmap as system RAM - so it's not used as initial
      system RAM by the kernel and the driver is under control.  While this is
      the case during cold boot and after a reboot, kexec is not aware of that
      and might add such memory to the initial (firmware) memmap of the kexec
      kernel.  We need ways to teach kernel and userspace that this system ram
      is different.
      
      For example, dax/kmem allows to decide at runtime if persistent memory is
      to be used as system ram.  Another future user is virtio-mem, which has to
      coordinate with its hypervisor to deal with inaccessible parts within
      memory resources.
      
      We want to let users in the kernel (esp. kexec) but also user space
      (esp. kexec-tools) know that this memory has different semantics and
      needs to be handled differently:
      1. Don't create entries in /sys/firmware/memmap/
      2. Name the memory resource "System RAM ($DRIVER)" (exposed via
         /proc/iomem) ($DRIVER might be "kmem", "virtio_mem").
      3. Flag the memory resource IORESOURCE_MEM_DRIVER_MANAGED
      
      /sys/firmware/memmap/ [1] represents the "raw firmware-provided memory
      map" because "on most architectures that firmware-provided memory map is
      modified afterwards by the kernel itself".  The primary user is kexec on
      x86-64.  Since commit d96ae530 ("memory-hotplug: create
      /sys/firmware/memmap entry for new memory"), we add all hotplugged memory
      to that firmware memmap - which makes perfect sense for traditional memory
      hotplug on x86-64, where real HW will also add hotplugged DIMMs to the
      firmware memmap.  We replicate what the "raw firmware-provided memory map"
      looks like after hot(un)plug.
      
      To keep things simple, let the user provide the full resource name instead
      of only the driver name - this way, we don't have to manually
      allocate/craft strings for memory resources.  Also use the resource name
      to make decisions, to avoid passing additional flags.  In case the name
      isn't "System RAM", it's special.
      
      We don't have to worry about firmware_map_remove() on the removal path.
      If there is no entry, it will simply return with -EINVAL.
      
      We'll adapt dax/kmem in a follow-up patch.
      
      [1] https://www.kernel.org/doc/Documentation/ABI/testing/sysfs-firmware-memmapSigned-off-by: NDavid Hildenbrand <david@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Acked-by: NPankaj Gupta <pankaj.gupta.linux@gmail.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Eric Biederman <ebiederm@xmission.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Link: http://lkml.kernel.org/r/20200508084217.9160-1-david@redhat.com
      Link: http://lkml.kernel.org/r/20200508084217.9160-3-david@redhat.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7b7b2721
    • D
      mm/memory_hotplug: handle memblocks only with CONFIG_ARCH_KEEP_MEMBLOCK · 52219aea
      David Hildenbrand 提交于
      The comment in add_memory_resource() is stale: hotadd_new_pgdat() will no
      longer call get_pfn_range_for_nid(), as a hotadded pgdat will simply span
      no pages at all, until memory is moved to the zone/node via
      move_pfn_range_to_zone() - e.g., when onlining memory blocks.
      
      The only archs that care about memblocks for hotplugged memory (either for
      iterating over all system RAM or testing for memory validity) are arm64,
      s390x, and powerpc - due to CONFIG_ARCH_KEEP_MEMBLOCK.  Without
      CONFIG_ARCH_KEEP_MEMBLOCK, we can simply stop messing with memblocks.
      Signed-off-by: NDavid Hildenbrand <david@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Acked-by: NMike Rapoport <rppt@linux.ibm.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Link: http://lkml.kernel.org/r/20200422155353.25381-3-david@redhat.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      52219aea
    • D
      mm/memory_hotplug: set node_start_pfn of hotadded pgdat to 0 · c68ab18c
      David Hildenbrand 提交于
      Patch series "mm/memory_hotplug: handle memblocks only with
      CONFIG_ARCH_KEEP_MEMBLOCK", v1.
      
      A hotadded node/pgdat will span no pages at all, until memory is moved to
      the zone/node via move_pfn_range_to_zone() -> resize_pgdat_range - e.g.,
      when onlining memory blocks.  We don't have to initialize the
      node_start_pfn to the memory we are adding.
      
      This patch (of 2):
      
      Especially, there is an inconsistency:
       - Hotplugging memory to a memory-less node with cpus: node_start_pf ==  0
       - Offlining and removing last memory from a node: node_start_pfn == 0
       - Hotplugging memory to a memory-less node without cpus: node_start_pfn != 0
      
      As soon as memory is onlined, node_start_pfn is overwritten with the
      actual start.  E.g., when adding two DIMMs but only onlining one of both,
      only that DIMM (with online memory blocks) is spanned by the node.
      
      Currently, the validity of node_start_pfn really is linked to
      node_spanned_pages != 0.  With node_spanned_pages == 0 (e.g., before
      onlining memory), it has no meaning.
      
      So let's stop setting node_start_pfn, just to be overwritten via
      move_pfn_range_to_zone().  This avoids confusion when looking at the code,
      wondering which magic will be performed with the node_start_pfn in this
      function, when hotadding a pgdat.
      Signed-off-by: NDavid Hildenbrand <david@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Acked-by: NPankaj Gupta <pankaj.gupta.linux@gmail.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Link: http://lkml.kernel.org/r/20200422155353.25381-1-david@redhat.com
      Link: http://lkml.kernel.org/r/20200422155353.25381-2-david@redhat.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c68ab18c
    • D
      mm/memory_hotplug: remove is_mem_section_removable() · 04f3465c
      David Hildenbrand 提交于
      Fortunately, all users of is_mem_section_removable() are gone.  Get rid of
      it, including some now unnecessary functions.
      Signed-off-by: NDavid Hildenbrand <david@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: NWei Yang <richard.weiyang@gmail.com>
      Reviewed-by: NBaoquan He <bhe@redhat.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Oscar Salvador <osalvador@suse.de>
      Link: http://lkml.kernel.org/r/20200407135416.24093-3-david@redhat.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      04f3465c
    • V
      mm/memory_hotplug: refrain from adding memory into an impossible node · fa6d9ec7
      Vishal Verma 提交于
      A misbehaving qemu created a situation where the ACPI SRAT table
      advertised one fewer proximity domains than intended.  The NFIT table did
      describe all the expected proximity domains.  This caused the device dax
      driver to assign an impossible target_node to the device, and when
      hotplugged as system memory, this would fail with the following signature:
      
         BUG: kernel NULL pointer dereference, address: 0000000000000088
         #PF: supervisor read access in kernel mode
         #PF: error_code(0x0000) - not-present page
         PGD 80000001767d4067 P4D 80000001767d4067 PUD 10e0c4067 PMD 0
         Oops: 0000 [#1] SMP PTI
         CPU: 4 PID: 22737 Comm: kswapd3 Tainted: G           O      5.6.0-rc5 #9
         Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
         RIP: 0010:prepare_kswapd_sleep+0x7c/0xc0
         Code: 89 df e8 87 fd ff ff 89 c2 31 c0 84 d2 74 e6 0f 1f 44 00 00 48 8b 05 fb af 7a 01 48 63 93 88 1d 01 00 48 8b 84 d0 20 0f 00 00 <48> 3b 98 88 00 00 00 75 28 f0 80 a0 80 00 00 00 fe f0 80 a3 38 20
         RSP: 0018:ffffc900017a3e78 EFLAGS: 00010202
         RAX: 0000000000000000 RBX: ffff8881209e0000 RCX: 0000000000000000
         RDX: 0000000000000003 RSI: 0000000000000000 RDI: ffff8881209e0e80
         RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000008000
         R10: 0000000000000000 R11: 0000000000000003 R12: 0000000000000003
         R13: 0000000000000003 R14: 0000000000000000 R15: ffffc900017a3ec8
         FS:  0000000000000000(0000) GS:ffff888318c00000(0000) knlGS:0000000000000000
         CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
         CR2: 0000000000000088 CR3: 0000000120b50002 CR4: 00000000001606e0
         Call Trace:
          kswapd+0x103/0x520
          kthread+0x120/0x140
          ret_from_fork+0x3a/0x50
      
      Add a check in the add_memory path to fail if the node to which we are
      adding memory is in the node_possible_map
      Signed-off-by: NVishal Verma <vishal.l.verma@intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Acked-by: NDavid Hildenbrand <david@redhat.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Link: http://lkml.kernel.org/r/20200416225438.15208-1-vishal.l.verma@intel.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      fa6d9ec7
    • D
      mm/memory_hotplug: Introduce offline_and_remove_memory() · 08b3acd7
      David Hildenbrand 提交于
      virtio-mem wants to offline and remove a memory block once it unplugged
      all subblocks (e.g., using alloc_contig_range()). Let's provide
      an interface to do that from a driver. virtio-mem already supports to
      offline partially unplugged memory blocks. Offlining a fully unplugged
      memory block will not require to migrate any pages. All unplugged
      subblocks are PageOffline() and have a reference count of 0 - so
      offlining code will simply skip them.
      
      All we need is an interface to offline and remove the memory from kernel
      module context, where we don't have access to the memory block devices
      (esp. find_memory_block() and device_offline()) and the device hotplug
      lock.
      
      To keep things simple, allow to only work on a single memory block.
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Tested-by: NPankaj Gupta <pankaj.gupta.linux@gmail.com>
      Acked-by: NAndrew Morton <akpm@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Oscar Salvador <osalvador@suse.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Qian Cai <cai@lca.pw>
      Signed-off-by: NDavid Hildenbrand <david@redhat.com>
      Link: https://lore.kernel.org/r/20200507140139.17083-9-david@redhat.comSigned-off-by: NMichael S. Tsirkin <mst@redhat.com>
      08b3acd7
    • D
      mm: Allow to offline unmovable PageOffline() pages via MEM_GOING_OFFLINE · aa218795
      David Hildenbrand 提交于
      virtio-mem wants to allow to offline memory blocks of which some parts
      were unplugged (allocated via alloc_contig_range()), especially, to later
      offline and remove completely unplugged memory blocks. The important part
      is that PageOffline() has to remain set until the section is offline, so
      these pages will never get accessed (e.g., when dumping). The pages should
      not be handed back to the buddy (which would require clearing PageOffline()
      and result in issues if offlining fails and the pages are suddenly in the
      buddy).
      
      Let's allow to do that by allowing to isolate any PageOffline() page
      when offlining. This way, we can reach the memory hotplug notifier
      MEM_GOING_OFFLINE, where the driver can signal that he is fine with
      offlining this page by dropping its reference count. PageOffline() pages
      with a reference count of 0 can then be skipped when offlining the
      pages (like if they were free, however they are not in the buddy).
      
      Anybody who uses PageOffline() pages and does not agree to offline them
      (e.g., Hyper-V balloon, XEN balloon, VMWare balloon for 2MB pages) will not
      decrement the reference count and make offlining fail when trying to
      migrate such an unmovable page. So there should be no observable change.
      Same applies to balloon compaction users (movable PageOffline() pages), the
      pages will simply be migrated.
      
      Note 1: If offlining fails, a driver has to increment the reference
      	count again in MEM_CANCEL_OFFLINE.
      
      Note 2: A driver that makes use of this has to be aware that re-onlining
      	the memory block has to be handled by hooking into onlining code
      	(online_page_callback_t), resetting the page PageOffline() and
      	not giving them to the buddy.
      Reviewed-by: NAlexander Duyck <alexander.h.duyck@linux.intel.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Tested-by: NPankaj Gupta <pankaj.gupta.linux@gmail.com>
      Acked-by: NAndrew Morton <akpm@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Pavel Tatashin <pavel.tatashin@microsoft.com>
      Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Anthony Yznaga <anthony.yznaga@oracle.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Qian Cai <cai@lca.pw>
      Cc: Pingfan Liu <kernelfans@gmail.com>
      Signed-off-by: NDavid Hildenbrand <david@redhat.com>
      Link: https://lore.kernel.org/r/20200507140139.17083-7-david@redhat.comSigned-off-by: NMichael S. Tsirkin <mst@redhat.com>
      aa218795
  10. 04 6月, 2020 2 次提交
    • J
      mm/page_alloc: integrate classzone_idx and high_zoneidx · 97a225e6
      Joonsoo Kim 提交于
      classzone_idx is just different name for high_zoneidx now.  So, integrate
      them and add some comment to struct alloc_context in order to reduce
      future confusion about the meaning of this variable.
      
      The accessor, ac_classzone_idx() is also removed since it isn't needed
      after integration.
      
      In addition to integration, this patch also renames high_zoneidx to
      highest_zoneidx since it represents more precise meaning.
      Signed-off-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: NBaoquan He <bhe@redhat.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Acked-by: NDavid Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Ye Xiaolong <xiaolong.ye@intel.com>
      Link: http://lkml.kernel.org/r/1587095923-7515-3-git-send-email-iamjoonsoo.kim@lge.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      97a225e6
    • M
      mm: remove CONFIG_HAVE_MEMBLOCK_NODE_MAP option · 3f08a302
      Mike Rapoport 提交于
      CONFIG_HAVE_MEMBLOCK_NODE_MAP is used to differentiate initialization of
      nodes and zones structures between the systems that have region to node
      mapping in memblock and those that don't.
      
      Currently all the NUMA architectures enable this option and for the
      non-NUMA systems we can presume that all the memory belongs to node 0 and
      therefore the compile time configuration option is not required.
      
      The remaining few architectures that use DISCONTIGMEM without NUMA are
      easily updated to use memblock_add_node() instead of memblock_add() and
      thus have proper correspondence of memblock regions to NUMA nodes.
      
      Still, free_area_init_node() must have a backward compatible version
      because its semantics with and without CONFIG_HAVE_MEMBLOCK_NODE_MAP is
      different.  Once all the architectures will use the new semantics, the
      entire compatibility layer can be dropped.
      
      To avoid addition of extra run time memory to store node id for
      architectures that keep memblock but have only a single node, the node id
      field of the memblock_region is guarded by CONFIG_NEED_MULTIPLE_NODES and
      the corresponding accessors presume that in those cases it is always 0.
      Signed-off-by: NMike Rapoport <rppt@linux.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Tested-by: Hoan Tran <hoan@os.amperecomputing.com>	[arm64]
      Acked-by: Catalin Marinas <catalin.marinas@arm.com>	[arm64]
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Brian Cain <bcain@codeaurora.org>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Greentime Hu <green.hu@gmail.com>
      Cc: Greg Ungerer <gerg@linux-m68k.org>
      Cc: Guan Xuetao <gxt@pku.edu.cn>
      Cc: Guo Ren <guoren@kernel.org>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Helge Deller <deller@gmx.de>
      Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Ley Foon Tan <ley.foon.tan@intel.com>
      Cc: Mark Salter <msalter@redhat.com>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: Nick Hu <nickhu@andestech.com>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Stafford Horne <shorne@gmail.com>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Link: http://lkml.kernel.org/r/20200412194859.12663-4-rppt@kernel.orgSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3f08a302
  11. 11 4月, 2020 2 次提交
    • L
      mm/memory_hotplug: add pgprot_t to mhp_params · bfeb022f
      Logan Gunthorpe 提交于
      devm_memremap_pages() is currently used by the PCI P2PDMA code to create
      struct page mappings for IO memory.  At present, these mappings are
      created with PAGE_KERNEL which implies setting the PAT bits to be WB.
      However, on x86, an mtrr register will typically override this and force
      the cache type to be UC-.  In the case firmware doesn't set this
      register it is effectively WB and will typically result in a machine
      check exception when it's accessed.
      
      Other arches are not currently likely to function correctly seeing they
      don't have any MTRR registers to fall back on.
      
      To solve this, provide a way to specify the pgprot value explicitly to
      arch_add_memory().
      
      Of the arches that support MEMORY_HOTPLUG: x86_64, and arm64 need a
      simple change to pass the pgprot_t down to their respective functions
      which set up the page tables.  For x86_32, set the page tables
      explicitly using _set_memory_prot() (seeing they are already mapped).
      
      For ia64, s390 and sh, reject anything but PAGE_KERNEL settings -- this
      should be fine, for now, seeing these architectures don't support
      ZONE_DEVICE.
      
      A check in __add_pages() is also added to ensure the pgprot parameter
      was set for all arches.
      Signed-off-by: NLogan Gunthorpe <logang@deltatee.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Acked-by: NDavid Hildenbrand <david@redhat.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NDan Williams <dan.j.williams@intel.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Eric Badger <ebadger@gigaio.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will@kernel.org>
      Link: http://lkml.kernel.org/r/20200306170846.9333-7-logang@deltatee.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      bfeb022f
    • L
      mm/memory_hotplug: rename mhp_restrictions to mhp_params · f5637d3b
      Logan Gunthorpe 提交于
      The mhp_restrictions struct really doesn't specify anything resembling a
      restriction anymore so rename it to be mhp_params as it is a list of
      extended parameters.
      Signed-off-by: NLogan Gunthorpe <logang@deltatee.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: NDavid Hildenbrand <david@redhat.com>
      Reviewed-by: NDan Williams <dan.j.williams@intel.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Eric Badger <ebadger@gigaio.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will@kernel.org>
      Link: http://lkml.kernel.org/r/20200306170846.9333-3-logang@deltatee.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f5637d3b
  12. 08 4月, 2020 8 次提交
  13. 06 3月, 2020 1 次提交
    • V
      mm, hotplug: fix page online with DEBUG_PAGEALLOC compiled but not enabled · c87cbc1f
      Vlastimil Babka 提交于
      Commit cd02cf1a ("mm/hotplug: fix an imbalance with DEBUG_PAGEALLOC")
      fixed memory hotplug with debug_pagealloc enabled, where onlining a page
      goes through page freeing, which removes the direct mapping.  Some arches
      don't like when the page is not mapped in the first place, so
      generic_online_page() maps it first.  This is somewhat wasteful, but
      better than special casing page freeing fast paths.
      
      The commit however missed that DEBUG_PAGEALLOC configured doesn't mean
      it's actually enabled.  One has to test debug_pagealloc_enabled() since
      031bc574 ("mm/debug-pagealloc: make debug-pagealloc boottime
      configurable"), or alternatively debug_pagealloc_enabled_static() since
      8e57f8ac ("mm, debug_pagealloc: don't rely on static keys too early"),
      but this is not done.
      
      As a result, a s390 kernel with DEBUG_PAGEALLOC configured but not enabled
      will crash:
      
      Unable to handle kernel pointer dereference in virtual kernel address space
      Failing address: 0000000000000000 TEID: 0000000000000483
      Fault in home space mode while using kernel ASCE.
      AS:0000001ece13400b R2:000003fff7fd000b R3:000003fff7fcc007 S:000003fff7fd7000 P:000000000000013d
      Oops: 0004 ilc:2 [#1] SMP
      CPU: 1 PID: 26015 Comm: chmem Kdump: loaded Tainted: GX 5.3.18-5-default #1 SLE15-SP2 (unreleased)
      Krnl PSW : 0704e00180000000 0000001ecd281b9e (__kernel_map_pages+0x166/0x188)
      R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:2 PM:0 RI:0 EA:3
      Krnl GPRS: 0000000000000000 0000000000000800 0000400b00000000 0000000000000100
      0000000000000001 0000000000000000 0000000000000002 0000000000000100
      0000001ece139230 0000001ecdd98d40 0000400b00000100 0000000000000000
      000003ffa17e4000 001fffe0114f7d08 0000001ecd4d93ea 001fffe0114f7b20
      Krnl Code: 0000001ecd281b8e: ec17ffff00d8 ahik %r1,%r7,-1
      0000001ecd281b94: ec111dbc0355 risbg %r1,%r1,29,188,3
      >0000001ecd281b9e: 94fb5006 ni 6(%r5),251
      0000001ecd281ba2: 41505008 la %r5,8(%r5)
      0000001ecd281ba6: ec51fffc6064 cgrj %r5,%r1,6,1ecd281b9e
      0000001ecd281bac: 1a07 ar %r0,%r7
      0000001ecd281bae: ec03ff584076 crj %r0,%r3,4,1ecd281a5e
      Call Trace:
      [<0000001ecd281b9e>] __kernel_map_pages+0x166/0x188
      [<0000001ecd4d9516>] online_pages_range+0xf6/0x128
      [<0000001ecd2a8186>] walk_system_ram_range+0x7e/0xd8
      [<0000001ecda28aae>] online_pages+0x2fe/0x3f0
      [<0000001ecd7d02a6>] memory_subsys_online+0x8e/0xc0
      [<0000001ecd7add42>] device_online+0x5a/0xc8
      [<0000001ecd7d0430>] state_store+0x88/0x118
      [<0000001ecd5b9f62>] kernfs_fop_write+0xc2/0x200
      [<0000001ecd5064b6>] vfs_write+0x176/0x1e0
      [<0000001ecd50676a>] ksys_write+0xa2/0x100
      [<0000001ecda315d4>] system_call+0xd8/0x2c8
      
      Fix this by checking debug_pagealloc_enabled_static() before calling
      kernel_map_pages(). Backports for kernel before 5.5 should use
      debug_pagealloc_enabled() instead. Also add comments.
      
      Fixes: cd02cf1a ("mm/hotplug: fix an imbalance with DEBUG_PAGEALLOC")
      Reported-by: NGerald Schaefer <gerald.schaefer@de.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NVlastimil Babka <vbabka@suse.cz>
      Reviewed-by: NDavid Hildenbrand <david@redhat.com>
      Cc: <stable@vger.kernel.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Qian Cai <cai@lca.pw>
      Link: http://lkml.kernel.org/r/20200224094651.18257-1-vbabka@suse.czSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c87cbc1f
  14. 04 2月, 2020 4 次提交