1. 28 Nov, 2019 4 commits
  2. 21 Nov, 2019 1 commit
  3. 20 Nov, 2019 35 commits
    • Y
      mm: thp: handle page cache THP correctly in PageTransCompoundMap · c9f8166a
      Committed by Yang Shi
      commit 169226f7e0d275c1879551f37484ef6683579a5c upstream
      
      We have a use case that uses tmpfs as the QEMU memory backend, and we
      would like to take advantage of THP as well.  But our test shows that
      the EPT is not PMD mapped even though the underlying THPs are PMD
      mapped on the host.  The number shown by
      /sys/kernel/debug/kvm/largepages is much smaller than the number of
      PMD mapped shmem pages, as shown below:
      
        7f2778200000-7f2878200000 rw-s 00000000 00:14 262232 /dev/shm/qemu_back_mem.mem.Hz2hSf (deleted)
        Size:            4194304 kB
        [snip]
        AnonHugePages:         0 kB
        ShmemPmdMapped:   579584 kB
        [snip]
        Locked:                0 kB
      
        cat /sys/kernel/debug/kvm/largepages
        12
      
      And some benchmarks do worse than with anonymous THPs.
      
      By digging into the code we figured out that commit 127393fb ("mm:
      thp: kvm: fix memory corruption in KVM with THP enabled") checks if
      there is a single PTE mapping on the page for anonymous THP when setting
      up EPT map.  But the _mapcount < 0 check doesn't work for page cache THP
      since every subpage of page cache THP would get _mapcount inc'ed once it
      is PMD mapped, so PageTransCompoundMap() always returns false for page
      cache THP.  This would prevent KVM from setting up PMD mapped EPT entry.
      
      So we need to handle page cache THP correctly.  However, when a page
      cache THP's PMD gets split, the kernel just removes the mapping instead
      of setting up PTE mappings as it does for anonymous THP.  Before KVM
      calls get_user_pages(), the subpages may get PTE mapped even though the
      page is still a THP, since the page cache THP may be mapped by other
      processes in the meantime.
      
      So check both its _mapcount and whether the THP is PTE mapped.
      Although this may report some false negative cases (PTE mapped by other
      processes), it looks not trivial to make this accurate.
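
The check described in this message can be sketched as a small user-space model. This is not the kernel implementation: `struct model_page`, its fields, and `model_trans_compound_map()` are hypothetical stand-ins, and the model assumes both counters start at -1 and that each PMD mapping of a page cache THP increments both of them.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the few page fields the check needs. */
struct model_page {
    bool is_thp;            /* page belongs to a transparent huge page */
    bool is_anon;           /* anonymous (true) or page cache (false) */
    int mapcount;           /* per-subpage count, -1 means "not mapped" */
    int compound_mapcount;  /* PMD mappings of the whole THP, -1 means none */
};

/* Returns true when the page may be backed by a huge EPT entry. */
static bool model_trans_compound_map(const struct model_page *page)
{
    assert(page != NULL);

    if (!page->is_thp)
        return false;

    /* Anonymous THP: any PTE mapping raises the subpage count above -1. */
    if (page->is_anon)
        return page->mapcount < 0;

    /*
     * Page cache THP: every PMD mapping also bumps the subpage count, so
     * the two counters are equal only when no PTE mapping exists.
     */
    return page->mapcount == page->compound_mapcount;
}
```

With these assumptions, a shmem THP that is only PMD mapped passes the check, while one that some process has also PTE mapped is rejected, which is the false negative case the message mentions.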
      
      With this fix, /sys/kernel/debug/kvm/largepages shows that a reasonable
      number of pages are PMD mapped by EPT, as shown below:
      
        7fbeaee00000-7fbfaee00000 rw-s 00000000 00:14 275464 /dev/shm/qemu_back_mem.mem.SKUvat (deleted)
        Size:            4194304 kB
        [snip]
        AnonHugePages:         0 kB
        ShmemPmdMapped:   557056 kB
        [snip]
        Locked:                0 kB
      
        cat /sys/kernel/debug/kvm/largepages
        271
      
      And the benchmarks perform the same as with anonymous THPs.
      
      [yang.shi@linux.alibaba.com: v4]
        Link: http://lkml.kernel.org/r/1571865575-42913-1-git-send-email-yang.shi@linux.alibaba.com
      Link: http://lkml.kernel.org/r/1571769577-89735-1-git-send-email-yang.shi@linux.alibaba.com
      Fixes: dd78fedd ("rmap: support file thp")
      Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
      Reported-by: Gang Deng <gavin.dg@linux.alibaba.com>
      Tested-by: Gang Deng <gavin.dg@linux.alibaba.com>
      Suggested-by: Hugh Dickins <hughd@google.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: <stable@vger.kernel.org>    [4.8+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
    • Y
      ICX: perf/x86/intel: Fix invalid Bit 13 for Icelake MSR_OFFCORE_RSP_x register · c154e184
      Committed by Yunying Sun
      commit 3b238a64c3009fed36eaea1af629d9377759d87d upstream.
      
      The Intel SDM states that bit 13 of Icelake's MSR_OFFCORE_RSP_x
      register is valid, and used for counting hardware generated prefetches
      of L3 cache. Update the bitmask to allow bit 13.
      
      Before:
      $ perf stat -e cpu/event=0xb7,umask=0x1,config1=0x1bfff/u sleep 3
       Performance counter stats for 'sleep 3':
         <not supported>      cpu/event=0xb7,umask=0x1,config1=0x1bfff/u
      
      After:
      $ perf stat -e cpu/event=0xb7,umask=0x1,config1=0x1bfff/u sleep 3
       Performance counter stats for 'sleep 3':
                   9,293      cpu/event=0xb7,umask=0x1,config1=0x1bfff/u
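
The effect of the bitmask update can be sketched as a validity check on `config1`. The two mask constants below are assumptions chosen for illustration (they differ only in bit 13) and are not quoted from this commit; `offcore_rsp_config_ok()` is a hypothetical helper, not a perf function.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed mask values, chosen only to show the effect of bit 13. */
#define RSP_VALID_MASK_OLD 0x3fffff9fffULL  /* bit 13 treated as invalid */
#define RSP_VALID_MASK_NEW 0x3fffffbfffULL  /* bit 13 treated as valid */

/* A config1 value is accepted only if it sets no bit outside the mask. */
static bool offcore_rsp_config_ok(uint64_t config1, uint64_t valid_mask)
{
    return (config1 & ~valid_mask) == 0;
}

/* Replays the config1 value from the perf command in this message. */
static void offcore_rsp_example(void)
{
    const uint64_t config1 = 0x1bfff;  /* includes bit 13 (0x2000) */

    assert(!offcore_rsp_config_ok(config1, RSP_VALID_MASK_OLD));
    assert(offcore_rsp_config_ok(config1, RSP_VALID_MASK_NEW));
}
```

Under the old mask the event is rejected, which corresponds to the `<not supported>` output; once bit 13 is allowed the same `config1` passes.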
      
      Signed-off-by: Yunying Sun <yunying.sun@intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Kan Liang <kan.liang@linux.intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: acme@kernel.org
      Cc: alexander.shishkin@linux.intel.com
      Cc: bp@alien8.de
      Cc: hpa@zytor.com
      Cc: jolsa@redhat.com
      Cc: namhyung@kernel.org
      Link: https://lkml.kernel.org/r/20190724082932.12833-1-yunying.sun@intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Lin Wang <lin.x.wang@intel.com>
      Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
    • K
      ICX: perf/x86/intel: Add more Icelake CPUIDs · e4ed6f52
      Committed by Kan Liang
      commit faaeff98666c24376cebd0b106504d05a36881d1 upstream.
      
      Add new model number for Icelake desktop and server to perf.
      
      The data source encoding for Icelake server is the same as Skylake
      server.
      
      Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: bp@alien8.de
      Cc: qiuxu.zhuo@intel.com
      Cc: rui.zhang@intel.com
      Cc: tony.luck@intel.com
      Link: https://lkml.kernel.org/r/20190603134122.13853-2-kan.liang@linux.intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Lin Wang <lin.x.wang@intel.com>
      Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
    • B
      resource/docs: Complete kernel-doc style function documentation · 1de9c7c3
      Committed by Borislav Petkov
      commit f26621e60b35369bca9228bc936dc723b3e421af upstream.
      
      Add the missing kernel-doc style function parameters documentation.
      
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: akpm@linux-foundation.org
      Cc: linux-tip-commits@vger.kernel.org
      Cc: rdunlap@infradead.org
      Fixes: b69c2e20f6e4 ("resource: Clean it up a bit")
      Link: http://lkml.kernel.org/r/20181105093307.GA12445@zn.tnic
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      [joseph: fix find_next_iomem_res() documentation]
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Caspar Zhang <caspar@linux.alibaba.com>
    • R
      resource/docs: Fix new kernel-doc warnings · 39cecf2f
      Committed by Randy Dunlap
      commit f75d651587f719a813ebbbfeee570e6570731d55 upstream.
      
      The first group of warnings is caused by a "/**" kernel-doc notation
      marker but the function comments are not in kernel-doc format.
      Also add another error return value here.
      
        ../kernel/resource.c:337: warning: Function parameter or member 'start' not described in 'find_next_iomem_res'
        ../kernel/resource.c:337: warning: Function parameter or member 'end' not described in 'find_next_iomem_res'
        ../kernel/resource.c:337: warning: Function parameter or member 'flags' not described in 'find_next_iomem_res'
        ../kernel/resource.c:337: warning: Function parameter or member 'desc' not described in 'find_next_iomem_res'
        ../kernel/resource.c:337: warning: Function parameter or member 'first_lvl' not described in 'find_next_iomem_res'
        ../kernel/resource.c:337: warning: Function parameter or member 'res' not described in 'find_next_iomem_res'
      
      Add the missing function parameter documentation for the other warnings:
      
        ../kernel/resource.c:409: warning: Function parameter or member 'arg' not described in 'walk_iomem_res_desc'
        ../kernel/resource.c:409: warning: Function parameter or member 'func' not described in 'walk_iomem_res_desc'
      
      Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Fixes: b69c2e20f6e4 ("resource: Clean it up a bit")
      Link: http://lkml.kernel.org/r/dda2e4d8-bedd-3167-20fe-8c7d2d35b354@infradead.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      [joseph: fix find_next_iomem_res() documentation]
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Caspar Zhang <caspar@linux.alibaba.com>
    • Q
      acpi/hmat: fix an uninitialized memory_target · ffd11878
      Committed by Qian Cai
      commit ab3a9f2ccc080d27873f76869c9a780be45e581e upstream.
      
      The commit 665ac7e92757 ("acpi/hmat: Register processor domain to its
      memory") introduced an uninitialized "struct memory_target" that could
      cause an incorrect branching.
      
      drivers/acpi/hmat/hmat.c:385:6: warning: variable 'target' is used
      uninitialized whenever 'if' condition is false
      [-Wsometimes-uninitialized]
              if (p->flags & ACPI_HMAT_MEMORY_PD_VALID) {
                  ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      drivers/acpi/hmat/hmat.c:392:6: note: uninitialized use occurs here
              if (target && p->flags & ACPI_HMAT_PROCESSOR_PD_VALID) {
                  ^~~~~~
      drivers/acpi/hmat/hmat.c:385:2: note: remove the 'if' if its condition
      is always true
              if (p->flags & ACPI_HMAT_MEMORY_PD_VALID) {
              ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      drivers/acpi/hmat/hmat.c:369:30: note: initialize the variable 'target'
      to silence this warning
              struct memory_target *target;
                                          ^
                                           = NULL
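
The pattern behind this warning, and the fix the compiler suggests, can be sketched in isolation. Everything here is a hypothetical model: the `MODEL_*` flag values, `struct model_target`, and `model_parse_proximity_domain()` only imitate the shape of the code quoted in the diagnostic.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_MEMORY_PD_VALID    (1u << 0)  /* hypothetical flag values */
#define MODEL_PROCESSOR_PD_VALID (1u << 1)

struct model_target {
    int processor_pxm;
};

static struct model_target the_target = { -1 };

static struct model_target *model_find_target(void)
{
    return &the_target;
}

/* Returns 1 if the processor domain was recorded, 0 otherwise. */
static int model_parse_proximity_domain(unsigned int flags, int processor_pd)
{
    struct model_target *target = NULL;  /* the fix: a defined start value */

    assert(processor_pd >= 0);

    if (flags & MODEL_MEMORY_PD_VALID)
        target = model_find_target();

    /* Without the initializer, this test reads an indeterminate pointer. */
    if (target && (flags & MODEL_PROCESSOR_PD_VALID)) {
        target->processor_pxm = processor_pd;
        return 1;
    }
    return 0;
}
```

When only the processor flag is set, the initialized pointer stays NULL and the second branch is skipped instead of depending on whatever happened to be on the stack.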
      
      Signed-off-by: Qian Cai <cai@lca.pw>
      Reviewed-by: Mukesh Ojha <mojha@codeaurora.org>
      Fixes: 665ac7e92757 ("acpi/hmat: Register processor domain to its memory")
      Reviewed-by: Nathan Chancellor <natechancellor@gmail.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Caspar Zhang <caspar@linux.alibaba.com>
    • T
      ICX: EDAC, i10nm: Fix randconfig builds · ea396c30
      Committed by Tony Luck
      commit d6a9f7336d925364daca00557afa59a68e78b422 upstream.
      
      I10NM_EDAC depends on CONFIG_ACPI so make that dependency explicit.
      
      Reported-by: Borislav Petkov <bp@suse.de>
      Signed-off-by: Tony Luck <tony.luck@intel.com>
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Cc: Aristeu Rozanski <aris@redhat.com>
      Cc: James Morse <james.morse@arm.com>
      Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
      Cc: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
      Cc: linux-edac <linux-edac@vger.kernel.org>
      Link: https://lkml.kernel.org/r/20190205180200.26865-1-tony.luck@intel.com
      Signed-off-by: Caspar Zhang <caspar@linux.alibaba.com>
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
    • A
      tools x86 uapi asm: Sync the pt_regs.h copy with the kernel sources · fae5be70
      Committed by Arnaldo Carvalho de Melo
      commit 0ceb5499a8001e5ddac2c8bd7b45eb4c643469ad upstream.
      
      To get the changes in:
      
        878068ea270e ("perf/x86: Support outputting XMM registers")
      
      That will be used in a followup patch to allow users to ask for some or
      all of those registers to be collected in certain contexts.
      
      This silences the following perf build warning:
      
        Warning: Kernel ABI header at 'tools/arch/x86/include/uapi/asm/perf_regs.h' differs from latest version at 'arch/x86/include/uapi/asm/perf_regs.h'
        diff -u tools/arch/x86/include/uapi/asm/perf_regs.h arch/x86/include/uapi/asm/perf_regs.h
      
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Jiri Olsa <jolsa@kernel.org>
      Cc: Kan Liang <kan.liang@linux.intel.com>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Link: https://lkml.kernel.org/n/tip-6pjnnrzqt3x3n2cd6br3wk7k@git.kernel.org
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
    • P
      device-dax: fix memory and resource leak if hotplug fails · d0219d42
      Committed by Pavel Tatashin
      commit 31e4ca92a7dd4cdebd7fe1456b3b0b6ace9a816f upstream
      
      Patch series ""Hotremove" persistent memory", v6.
      
      Support for using persistent memory like regular RAM was recently
      added to Linux.  This work extends that functionality to also allow
      hot-removing persistent memory.
      
      We (Microsoft) have an important use case for this functionality.
      
      The requirement is for physical machines with a small amount of RAM
      (~8G) to be able to reboot in a very short period of time (<1s).  Yet,
      there is a userland state that is expensive to recreate (~2G).
      
      The solution is to boot machines with 2G preserved for persistent
      memory.
      
      Copy the state, and hotadd the persistent memory so machine still has
      all 8G available for runtime.  Before reboot, offline and hotremove
      device-dax 2G, copy the memory that is needed to be preserved to pmem0
      device, and reboot.
      
      The series of operations looks like this:
      
      1. After boot restore /dev/pmem0 to ramdisk to be consumed by apps.
         and free ramdisk.
      2. Convert raw pmem0 to devdax
         ndctl create-namespace --mode devdax --map mem -e namespace0.0 -f
      3. Hotadd to System RAM
         echo dax0.0 > /sys/bus/dax/drivers/device_dax/unbind
         echo dax0.0 > /sys/bus/dax/drivers/kmem/new_id
         echo online_movable > /sys/devices/system/memory/memoryXXX/state
      4. Before reboot hotremove device-dax memory from System RAM
         echo offline > /sys/devices/system/memory/memoryXXX/state
         echo dax0.0 > /sys/bus/dax/drivers/kmem/unbind
      5. Create raw pmem0 device
         ndctl create-namespace --mode raw  -e namespace0.0 -f
      6. Copy the state that was stored by apps to ramdisk to pmem device
      7. Do a kexec reboot, or reboot through firmware if the firmware does
         not zero memory in the pmem0 region (these machines have only
         regular volatile memory).  To have a pmem0 device, either the memmap
         kernel parameter is used or device nodes are specified in the dtb.
      
      This patch (of 3):
      
      When add_memory() fails, the resource and the memory should be freed.
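
The error path this patch describes can be sketched with a user-space model. All names are hypothetical: `model_request_region()`, `model_add_memory()`, and `model_kmem_probe()` merely stand in for reserving a resource, hot-adding it, and the probe routine that ties the two together, and `live_resources` exists only so the leak is observable.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>

struct model_resource {
    unsigned long start, size;
};

static int live_resources;  /* counts reservations not yet released */

static struct model_resource *model_request_region(unsigned long start,
                                                   unsigned long size)
{
    struct model_resource *res = malloc(sizeof(*res));

    if (!res)
        return NULL;
    res->start = start;
    res->size = size;
    live_resources++;
    return res;
}

static void model_release_resource(struct model_resource *res)
{
    free(res);
    live_resources--;
}

/* Stand-in for add_memory(); fails on request so the error path can run. */
static int model_add_memory(const struct model_resource *res, int should_fail)
{
    (void)res;
    return should_fail ? -ENOMEM : 0;
}

static int model_kmem_probe(unsigned long start, unsigned long size,
                            int should_fail, struct model_resource **out)
{
    struct model_resource *res;
    int rc;

    assert(out != NULL);

    res = model_request_region(start, size);
    if (!res)
        return -EBUSY;

    rc = model_add_memory(res, should_fail);
    if (rc) {
        /* The fix: undo the reservation instead of leaking it. */
        model_release_resource(res);
        return rc;
    }

    *out = res;
    return 0;
}
```

After a failed probe the reservation count returns to zero, so a later attempt on the same range is not blocked by a stale resource.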
      
      Link: http://lkml.kernel.org/r/20190517215438.6487-2-pasha.tatashin@soleen.com
      Fixes: c221c0b0308f ("device-dax: "Hotplug" persistent memory for use like normal RAM")
      Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com>
      Reviewed-by: Dave Hansen <dave.hansen@intel.com>
      Cc: Bjorn Helgaas <bhelgaas@google.com>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: James Morris <jmorris@namei.org>
      Cc: Jérôme Glisse <jglisse@redhat.com>
      Cc: Keith Busch <keith.busch@intel.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Ross Zwisler <zwisler@kernel.org>
      Cc: Sasha Levin <sashal@kernel.org>
      Cc: Takashi Iwai <tiwai@suse.de>
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Cc: Vishal Verma <vishal.l.verma@intel.com>
      Cc: Yaowei Bai <baiyaowei@cmss.chinamobile.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
      Reviewed-by: Gavin Shan <shan.gavin@linux.alibaba.com>
    • V
      device-dax: Add a 'resource' attribute · 651aadb6
      Committed by Vishal Verma
      commit 40cdc60ac16a42eb4e013f84d0e7aa1d6ee060d3 upstream
      
      device-dax based devices were missing a 'resource' attribute to indicate
      the physical address range contributed by the device in question. This
      information is desirable to userspace tooling that may want to use the
      dax device as system-ram, and wants to selectively hotplug and online
      the memory blocks associated with a given device.
      
      Without this, the tooling would have to parse /proc/iomem for the memory
      ranges contributed by dax devices, which can be a workaround, but it is
      far easier to provide this information in the sysfs hierarchy.
      
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
      Reviewed-by: Gavin Shan <shan.gavin@linux.alibaba.com>
    • A
      drivers/dax: Allow to include DEV_DAX_PMEM as builtin · c83b6a3a
      Committed by Aneesh Kumar K.V
      commit 67476656febd7ec5f1fe1aeec3c441fcf53b1e45 upstream
      
      This moves the dependency to DEV_DAX_PMEM_COMPAT, so that compat
      support is allowed only when DEV_DAX_PMEM is built as a module.
      
      This makes it easy to test the new code in an emulation setup, where
      we often build things without module support.
      
      Cc: <stable@vger.kernel.org>
      Fixes: 730926c3b099 ("device-dax: Add /sys/class/dax backwards compatibility")
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
      Reviewed-by: Gavin Shan <shan.gavin@linux.alibaba.com>
    • D
      device-dax: "Hotplug" persistent memory for use like normal RAM · 370de25f
      Committed by Dave Hansen
      commit c221c0b0308fd01d9fb33a16f64d2fd95f8830a4 upstream
      
      This is intended for use with NVDIMMs that are physically persistent
      (physically like flash) so that they can be used as a cost-effective
      RAM replacement.  Intel Optane DC persistent memory is one
      implementation of this kind of NVDIMM.
      
      Currently, a persistent memory region is "owned" by a device driver,
      either the "Direct DAX" or "Filesystem DAX" drivers.  These drivers
      allow applications to explicitly use persistent memory, generally
      by being modified to use special, new libraries. (DIMM-based
      persistent memory hardware/software is described in great detail
      here: Documentation/nvdimm/nvdimm.txt).
      
      However, this limits persistent memory use to applications which
      *have* been modified.  To make it more broadly usable, this driver
      "hotplugs" memory into the kernel, to be managed and used just like
      normal RAM would be.
      
      To make this work, management software must remove the device from
      being controlled by the "Device DAX" infrastructure:
      
      	echo dax0.0 > /sys/bus/dax/drivers/device_dax/unbind
      
      and then tell the new driver that it can bind to the device:
      
      	echo dax0.0 > /sys/bus/dax/drivers/kmem/new_id
      
      After this, there will be a number of new memory sections visible
      in sysfs that can be onlined, or that may get onlined by existing
      udev-initiated memory hotplug rules.
      
      This rebinding procedure is currently a one-way trip.  Once memory
      is bound to "kmem", it's there permanently and can not be
      unbound and assigned back to device_dax.
      
      The kmem driver will never bind to a dax device unless the device
      is *explicitly* bound to the driver.  There are two reasons for
      this: One, since it is a one-way trip, it can not be undone if
      bound incorrectly.  Two, the kmem driver destroys data on the
      device.  Consider what would happen if you had good data on a pmem
      device: it would be catastrophic if you compiled in "kmem" but left
      out the "device_dax" driver.  kmem would take over the device and
      write volatile data all over your good data.
      
      This inherits any existing NUMA information for the newly-added
      memory from the persistent memory device that came from the
      firmware.  On Intel platforms, the firmware has guarantees that
      require each socket's persistent memory to be in a separate
      memory-only NUMA node.  That means that this patch is not expected
      to create NUMA nodes, but will simply hotplug memory into existing
      nodes.
      
      Because the memory lands in existing NUMA nodes, the existing NUMA
      APIs and tools are sufficient to create policies for applications or
      memory areas to have affinity for or an aversion to using this memory.
      
      There is currently some metadata at the beginning of pmem regions.
      The section-size memory hotplug restrictions, plus this small
      reserved area can cause the "loss" of a section or two of capacity.
      This should be fixable in follow-on patches.  But, as a first step,
      losing 256MB of memory (worst case) out of hundreds of gigabytes
      is a good tradeoff vs. the required code to fix this up precisely.
      This calculation is also the reason we export
      memory_block_size_bytes().
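
The capacity loss mentioned above comes from trimming the device range to whole memory blocks. A minimal sketch of that arithmetic follows, assuming a power-of-two block size; `struct model_range` and `model_align_to_blocks()` are hypothetical names, and the sizes in the note after the code are made-up inputs, not measurements.

```c
#include <assert.h>
#include <stdint.h>

struct model_range {
    uint64_t start, size;
};

/* Trim [start, start + size) to whole blocks of block_size bytes. */
static struct model_range model_align_to_blocks(uint64_t start, uint64_t size,
                                                uint64_t block_size)
{
    struct model_range out;
    uint64_t aligned_start, skipped;

    /* The masking below only works for a power-of-two block size. */
    assert(block_size != 0 && (block_size & (block_size - 1)) == 0);

    aligned_start = (start + block_size - 1) & ~(block_size - 1);
    skipped = aligned_start - start;

    out.start = aligned_start;
    /* Drop the unaligned head, then round the remainder down. */
    out.size = size > skipped ? (size - skipped) & ~(block_size - 1) : 0;
    return out;
}
```

For example, with a 128 MiB block size, a range that starts 2 MiB past a block boundary at 0x100200000 and spans 0x3FFE00000 bytes is trimmed to start at 0x108000000 with 0x3F8000000 usable bytes, giving up 126 MiB at the head.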
      
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Reviewed-by: Dan Williams <dan.j.williams@intel.com>
      Reviewed-by: Keith Busch <keith.busch@intel.com>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: Ross Zwisler <zwisler@kernel.org>
      Cc: Vishal Verma <vishal.l.verma@intel.com>
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: linux-nvdimm@lists.01.org
      Cc: linux-kernel@vger.kernel.org
      Cc: linux-mm@kvack.org
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Bjorn Helgaas <bhelgaas@google.com>
      Cc: Yaowei Bai <baiyaowei@cmss.chinamobile.com>
      Cc: Takashi Iwai <tiwai@suse.de>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Reviewed-by: Vishal Verma <vishal.l.verma@intel.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
      Reviewed-by: Gavin Shan <shan.gavin@linux.alibaba.com>
    • D
      mm/resource: Let walk_system_ram_range() search child resources · 3ed62604
      Committed by Dave Hansen
      commit 2b539aefe9e48e3908cff02699aa63a8b9bd268e upstream
      
      In the process of onlining memory, we use walk_system_ram_range()
      to find the actual RAM areas inside of the area being onlined.
      
      However, it currently only finds memory resources which are
      "top-level" iomem_resources.  Children are not currently
      searched which causes it to skip System RAM in areas like this
      (in the format of /proc/iomem):
      
      a0000000-bfffffff : Persistent Memory (legacy)
        a0000000-afffffff : System RAM
      
      Changing the true->false here allows children to be searched
      as well.  We need this because we add a new "System RAM"
      resource underneath the "persistent memory" resource when
      we use persistent memory in a volatile mode.
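
The difference between a top-level-only walk and one that descends into children can be sketched on a toy resource tree. `struct model_res` and `model_count_sysram()` are hypothetical; the `first_lvl` parameter name is borrowed from the kernel-doc warnings quoted elsewhere in this log, but its handling here is only an assumed approximation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct model_res {
    uint64_t start, end;        /* inclusive range */
    bool is_sysram;
    struct model_res *child;    /* first child */
    struct model_res *sibling;  /* next resource on the same level */
};

/* Count System RAM resources overlapping [start, end]. */
static int model_count_sysram(const struct model_res *res, uint64_t start,
                              uint64_t end, bool first_lvl)
{
    int found = 0;

    assert(start <= end);

    for (; res != NULL; res = res->sibling) {
        if (res->end < start || res->start > end)
            continue;
        if (res->is_sysram)
            found++;
        /* Descend only when asked to look below the top level. */
        if (!first_lvl)
            found += model_count_sysram(res->child, start, end, false);
    }
    return found;
}
```

With the layout from the /proc/iomem excerpt in this message, a top-level-only walk sees just the "Persistent Memory (legacy)" entry and finds no System RAM, while the descending walk finds the child.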
      
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Keith Busch <keith.busch@intel.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: Ross Zwisler <zwisler@kernel.org>
      Cc: Vishal Verma <vishal.l.verma@intel.com>
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: linux-nvdimm@lists.01.org
      Cc: linux-kernel@vger.kernel.org
      Cc: linux-mm@kvack.org
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Bjorn Helgaas <bhelgaas@google.com>
      Cc: Yaowei Bai <baiyaowei@cmss.chinamobile.com>
      Cc: Takashi Iwai <tiwai@suse.de>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
      Reviewed-by: Gavin Shan <shan.gavin@linux.alibaba.com>
    • D
      mm/memory-hotplug: Allow memory resources to be children · 71009d66
      Committed by Dave Hansen
      commit 2794129e902d8eb69413d884dc6404b8716ed9ed upstream
      
      The kernel/resource.c code is used to manage the physical address
      space.  The current resource configuration can be viewed in
      /proc/iomem.  An example of this is at the bottom of this
      description.
      
      The nvdimm subsystem "owns" the physical address resources which
      map to persistent memory and has resources inserted for them as
      "Persistent Memory".  The best way to repurpose this for volatile
      use is to leave the existing resource in place, but add a "System
      RAM" resource underneath it. This clearly communicates the
      ownership relationship of this memory.
      
      The request_resource_conflict() API only deals with the
      top-level resources.  Replace it with __request_region() which
      will search for !IORESOURCE_BUSY areas lower in the resource
      tree than the top level.
      
      We *could* also simply truncate the existing top-level
      "Persistent Memory" resource and take over the released address
      space.  But, this means that if we ever decide to hot-unplug the
      "RAM" and give it back, we need to recreate the original setup,
      which may mean going back to the BIOS tables.
      
      This should have no real effect on the existing collision
      detection because the areas that truly conflict should be marked
      IORESOURCE_BUSY.
      
      00000000-00000fff : Reserved
      00001000-0009fbff : System RAM
      0009fc00-0009ffff : Reserved
      000a0000-000bffff : PCI Bus 0000:00
      000c0000-000c97ff : Video ROM
      000c9800-000ca5ff : Adapter ROM
      000f0000-000fffff : Reserved
        000f0000-000fffff : System ROM
      00100000-9fffffff : System RAM
        01000000-01e071d0 : Kernel code
        01e071d1-027dfdff : Kernel data
        02dc6000-0305dfff : Kernel bss
      a0000000-afffffff : Persistent Memory (legacy)
        a0000000-a7ffffff : System RAM
      b0000000-bffdffff : System RAM
      bffe0000-bfffffff : Reserved
      c0000000-febfffff : PCI Bus 0000:00
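
The idea of placing a new resource underneath a non-busy parent, rather than colliding with it at the top level, can be sketched as follows. This is a simplified model under stated assumptions, not the kernel's `__request_region()`: `struct model_region`, `model_find_conflict()`, and `model_request_region()` are hypothetical, and `busy` stands in for IORESOURCE_BUSY.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct model_region {
    uint64_t start, end;           /* inclusive range */
    bool busy;                     /* stand-in for IORESOURCE_BUSY */
    struct model_region *child;    /* first child */
    struct model_region *sibling;  /* next region under the same parent */
};

/* Return the child of parent that overlaps res, or NULL if none does. */
static struct model_region *model_find_conflict(struct model_region *parent,
                                                const struct model_region *res)
{
    struct model_region *c;

    for (c = parent->child; c != NULL; c = c->sibling)
        if (c->start <= res->end && c->end >= res->start)
            return c;
    return NULL;
}

/*
 * Insert res below root, descending into non-busy regions that contain it.
 * Returns true on success, false when a busy region is in the way.
 */
static bool model_request_region(struct model_region *root,
                                 struct model_region *res)
{
    struct model_region *parent = root;
    struct model_region *conflict;

    assert(res->start <= res->end);

    while ((conflict = model_find_conflict(parent, res)) != NULL) {
        if (conflict->busy)
            return false;
        /* Only descend when the non-busy region fully contains res. */
        if (res->start < conflict->start || res->end > conflict->end)
            return false;
        parent = conflict;
    }

    res->sibling = parent->child;  /* unsorted insert keeps the sketch short */
    parent->child = res;
    return true;
}
```

In this model a busy "System RAM" region can be requested inside a non-busy "Persistent Memory" region, and a second request overlapping that RAM is refused because the conflict is now busy.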
      
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Reviewed-by: Dan Williams <dan.j.williams@intel.com>
      Reviewed-by: Vishal Verma <vishal.l.verma@intel.com>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: Ross Zwisler <zwisler@kernel.org>
      Cc: Vishal Verma <vishal.l.verma@intel.com>
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: linux-nvdimm@lists.01.org
      Cc: linux-kernel@vger.kernel.org
      Cc: linux-mm@kvack.org
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Bjorn Helgaas <bhelgaas@google.com>
      Cc: Yaowei Bai <baiyaowei@cmss.chinamobile.com>
      Cc: Takashi Iwai <tiwai@suse.de>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Keith Busch <keith.busch@intel.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
      Reviewed-by: Gavin Shan <shan.gavin@linux.alibaba.com>
    • D
      mm/resource: Move HMM pr_debug() deeper into resource code · cdb8d31f
      Committed by Dave Hansen
      commit b926b7f3baecb2a855db629e6822e1a85212e91c upstream
      
      HMM consumes physical address space for its own use, even
      though nothing is mapped or accessible there.  It uses a
      special resource description (IORES_DESC_DEVICE_PRIVATE_MEMORY)
      to uniquely identify these areas.
      
      When HMM consumes address space, it makes a best guess about
      what to consume.  However, it is possible that a future memory
      or device hotplug can collide with the reserved area.  In the
      case of these conflicts, there is an error message in
      register_memory_resource().
      
      Later patches in this series move register_memory_resource()
      from using request_resource_conflict() to __request_region().
      Unfortunately, __request_region() does not return the conflict
      like the previous function did, which makes it impossible to
      check for IORES_DESC_DEVICE_PRIVATE_MEMORY in a conflicting
      resource.
      
      Instead of warning in register_memory_resource(), move the
      check into the core resource code itself (__request_region())
      where the conflicting resource _is_ available.  This has the
      added bonus of producing a warning in case of HMM conflicts
      with devices *or* RAM address space, as opposed to the RAM-
      only warnings that were there previously.
      
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Reviewed-by: Jerome Glisse <jglisse@redhat.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: Ross Zwisler <zwisler@kernel.org>
      Cc: Vishal Verma <vishal.l.verma@intel.com>
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: linux-nvdimm@lists.01.org
      Cc: linux-kernel@vger.kernel.org
      Cc: linux-mm@kvack.org
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
      Cc: Keith Busch <keith.busch@intel.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
      Reviewed-by: Gavin Shan <shan.gavin@linux.alibaba.com>
    • D
      mm/resource: Return real error codes from walk failures · 2c33206f
      Committed by Dave Hansen
      commit 5cd401ace914dc68556c6d2fcae0c349444d5f86 upstream
      
      walk_system_ram_range() can return an error code either because
      *it* failed, or because the 'func' that it calls returned an
      error.  The memory hotplug code does the following:
      
      	ret = walk_system_ram_range(..., func);
      	if (ret)
      		return ret;
      
      and 'ret' makes it out to userspace, eventually.  The problem
      is, walk_system_ram_range() failures that result from *it* failing
      (as opposed to 'func') return -1.  That leads to a very odd
      -EPERM (-1) return code out to userspace.
      
      Make walk_system_ram_range() return -EINVAL for internal
      failures to keep userspace less confused.
      
      This return code is compatible with all the callers that I
      audited.
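      As a rough illustration of the problem (a user-space model, not the
      kernel code), the sketch below mimics a walker whose default return
      value is used when it finds no range to visit. The names walk_model(),
      walk_before_fix(), walk_after_fix() and busy_cb() are hypothetical;
      only the -1 versus -EINVAL distinction comes from the description above.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for the callback type handed to the walker. */
typedef int (*walk_fn)(unsigned long start_pfn, unsigned long nr_pages,
		       void *arg);

/*
 * Simplified model: 'default_ret' is what the walker returns when it
 * fails on its own, i.e. when no System RAM range is found and 'func'
 * is never called.
 */
static int walk_model(int have_ram, int default_ret, walk_fn func, void *arg)
{
	int ret = default_ret;

	if (have_ram)
		ret = func(0x100, 0x10, arg);	/* callback result is passed through */
	return ret;
}

/* Example callback that always reports a busy range. */
static int busy_cb(unsigned long start_pfn, unsigned long nr_pages, void *arg)
{
	(void)start_pfn;
	(void)nr_pages;
	(void)arg;
	return -EBUSY;
}

/* Behavior before the change: internal failure is a bare -1. */
static int walk_before_fix(int have_ram, walk_fn func, void *arg)
{
	return walk_model(have_ram, -1, func, arg);
}

/* Behavior after the change: internal failure is -EINVAL. */
static int walk_after_fix(int have_ram, walk_fn func, void *arg)
{
	return walk_model(have_ram, -EINVAL, func, arg);
}
```

      With no matching range, walk_before_fix() yields -1, which userspace
      decodes as EPERM on Linux, while walk_after_fix() yields -EINVAL. An
      error returned by the callback passes through unchanged in both cases.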
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Reviewed-by: Bjorn Helgaas <bhelgaas@google.com>
      Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: Ross Zwisler <zwisler@kernel.org>
      Cc: Vishal Verma <vishal.l.verma@intel.com>
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: linux-nvdimm@lists.01.org
      Cc: linux-kernel@vger.kernel.org
      Cc: linux-mm@kvack.org
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Yaowei Bai <baiyaowei@cmss.chinamobile.com>
      Cc: Takashi Iwai <tiwai@suse.de>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: linuxppc-dev@lists.ozlabs.org
      Cc: Keith Busch <keith.busch@intel.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
      Reviewed-by: Gavin Shan <shan.gavin@linux.alibaba.com>
      2c33206f
    • O
      kernel, resource: check for IORESOURCE_SYSRAM in release_mem_region_adjustable · d0bc6e68
      Oscar Salvador committed
      commit 65c78784135f847e49eb98e6b976e453e71100c3 upstream
      
      This is a preparation for the next patch.
      
      Currently, we only call release_mem_region_adjustable() in __remove_pages
      if the zone is not ZONE_DEVICE, because resources that belong to HMM/devm
      are being released by themselves with devm_release_mem_region.
      
      Since we do not want to touch any zone/page stuff during the removing of
      the memory (but during the offlining), we do not want to check for the
      zone here.  So we need another way to tell release_mem_region_adjustable()
      to not release the resource in case it belongs to HMM/devm.
      
      HMM/devm acquires/releases a resource through
      devm_request_mem_region/devm_release_mem_region.
      
      These resources have the flag IORESOURCE_MEM, while resources acquired by
      hot-add memory path (register_memory_resource()) contain
      IORESOURCE_SYSTEM_RAM.
      
      So, we can check for this flag in release_mem_region_adjustable, and if
      the resource does not contain such flag, we know that we are dealing with
      a HMM/devm resource, so we can back off.
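      The check described above can be sketched in plain C as follows. This
      is only a model under stated assumptions: struct resource_model and
      should_release() are hypothetical names, and the flag values are
      written as I recall them from include/linux/ioport.h, so they should be
      verified against the tree being patched.

```c
#include <stdbool.h>

/* Assumed flag values; check include/linux/ioport.h before relying on them. */
#define IORESOURCE_MEM		0x00000200UL
#define IORESOURCE_SYSRAM	0x01000000UL
#define IORESOURCE_SYSTEM_RAM	(IORESOURCE_MEM | IORESOURCE_SYSRAM)

/* Hypothetical, trimmed-down stand-in for struct resource. */
struct resource_model {
	unsigned long start;
	unsigned long end;
	unsigned long flags;
};

/*
 * Returns true when the resource came from the hot-add memory path and
 * may therefore be released; false means it looks like an HMM/devm
 * resource and the caller should back off.
 */
static bool should_release(const struct resource_model *res)
{
	return (res->flags & IORESOURCE_SYSRAM) != 0;
}
```

      A resource registered with IORESOURCE_SYSTEM_RAM passes the check,
      while one carrying only IORESOURCE_MEM does not, which is the back-off
      case for HMM/devm.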
      
      Link: http://lkml.kernel.org/r/20181127162005.15833-3-osalvador@suse.de
      Signed-off-by: Oscar Salvador <osalvador@suse.de>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Oscar Salvador <osalvador@suse.com>
      Cc: "Rafael J. Wysocki" <rafael@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
      Reviewed-by: Gavin Shan <shan.gavin@linux.alibaba.com>
      d0bc6e68
    • B
      resource: Clean it up a bit · 02cc5207
      Borislav Petkov committed
      commit b69c2e20f6e4046da84ce5b33ba1ef89cb087b40 upstream
      
      - Drop BUG_ON()s and do normal error handling instead, in
        find_next_iomem_res().
      
      - Align function arguments on opening braces.
      
      - Get rid of local var sibling_only in find_next_iomem_res().
      
      - Shorten unnecessarily long first_level_children_only arg name.
      Signed-off-by: Borislav Petkov <bp@suse.de>
      CC: Andrew Morton <akpm@linux-foundation.org>
      CC: Bjorn Helgaas <bhelgaas@google.com>
      CC: Brijesh Singh <brijesh.singh@amd.com>
      CC: Dan Williams <dan.j.williams@intel.com>
      CC: H. Peter Anvin <hpa@zytor.com>
      CC: Lianbo Jiang <lijiang@redhat.com>
      CC: Takashi Iwai <tiwai@suse.de>
      CC: Thomas Gleixner <tglx@linutronix.de>
      CC: Tom Lendacky <thomas.lendacky@amd.com>
      CC: Vivek Goyal <vgoyal@redhat.com>
      CC: Yaowei Bai <baiyaowei@cmss.chinamobile.com>
      CC: bhe@redhat.com
      CC: dan.j.williams@intel.com
      CC: dyoung@redhat.com
      CC: kexec@lists.infradead.org
      CC: mingo@redhat.com
      Link: <new submission>
      Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
      Reviewed-by: Gavin Shan <shan.gavin@linux.alibaba.com>
      02cc5207
    • V
      device-dax: Add a 'modalias' attribute to DAX 'bus' devices · 3405f4ea
      Vishal Verma committed
      commit c347bd71dcdb2d0ac8b3a771486584dca8c8dd80 upstream
      
      Add a 'modalias' attribute to devices under the DAX bus so that userspace
      is able to dynamically load modules as needed.
      
      Normally, udev can get the modalias from 'uevent', and that is correctly
      set up by the DAX bus. However other tooling such as 'libndctl' for
      interacting with drivers/nvdimm/, and 'libdaxctl' for drivers/dax/ can
      also use the modalias to dynamically load modules via libkmod lookups.
      
      The 'nd' bus set up by the libnvdimm subsystem exports a modalias
      attribute. Imitate this to export the same for the 'dax' bus.
      
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
      Reviewed-by: Gavin Shan <shan.gavin@linux.alibaba.com>
      3405f4ea
    • D
      device-dax: Add a 'target_node' attribute · 539796f4
      Dan Williams committed
      commit 21c75763a3ae18679e5c4e2260aa9379b073566b upstream
      
      The target-node attribute is the Linux numa-node that a device-dax
      instance may create when it is online. Prior to being online the
      device's 'numa_node' property reflects the closest online cpu node which
      is the typical expectation of a device 'numa_node'. Once it is online it
      becomes its own distinct numa node, i.e. 'target_node'.
      
      Export the 'target_node' property to give userspace tooling the ability
      to predict the effective numa-node from a device-dax instance configured
      to provide 'System RAM' capacity.
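      For userspace tooling, reading the attribute could look roughly like
      the sketch below. It assumes the attribute appears as a text file named
      target_node in the device's directory under the dax bus (for example
      /sys/bus/dax/devices/dax0.0/, a path not spelled out above);
      read_attr() and dax_target_node() are hypothetical helpers, not part of
      libdaxctl.

```c
#include <stdio.h>
#include <string.h>

/*
 * Read the first line of a sysfs-style attribute file into 'buf' and
 * strip the trailing newline. Returns 0 on success, -1 on failure.
 */
static int read_attr(const char *path, char *buf, size_t len)
{
	FILE *f = fopen(path, "r");

	if (!f)
		return -1;
	if (!fgets(buf, (int)len, f)) {
		fclose(f);
		return -1;
	}
	fclose(f);
	buf[strcspn(buf, "\n")] = '\0';
	return 0;
}

/*
 * Build "<bus_dir>/<dev>/target_node" and parse its content as an
 * integer node id. Returns 0 on success, -1 on any failure.
 */
static int dax_target_node(const char *bus_dir, const char *dev, int *node)
{
	char path[256];
	char val[32];
	int n = snprintf(path, sizeof(path), "%s/%s/target_node", bus_dir, dev);

	if (n < 0 || n >= (int)sizeof(path))
		return -1;	/* path did not fit */
	if (read_attr(path, val, sizeof(val)) != 0)
		return -1;	/* attribute missing or unreadable */
	return sscanf(val, "%d", node) == 1 ? 0 : -1;
}
```

      A caller would pass the bus device directory and a device name, and
      treat a -1 return as "attribute not available", for instance on a
      kernel without this patch.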
      
      Cc: Vishal Verma <vishal.l.verma@intel.com>
      Reported-by: Dave Hansen <dave.hansen@linux.intel.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
      Reviewed-by: Gavin Shan <shan.gavin@linux.alibaba.com>
      539796f4
    • D
      device-dax: Auto-bind device after successful new_id · e663208d
      Dan Williams committed
      commit 664525b2d84abca1074c9546654ae9689de8a818 upstream
      
      The typical 'new_id' attribute behavior is to immediately attach a
      device to its driver after a new device-id is added. Implement this
      behavior for the dax bus.
      Reported-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Reported-by: Brice Goglin <Brice.Goglin@inria.fr>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
      Reviewed-by: Gavin Shan <shan.gavin@linux.alibaba.com>
      e663208d
    • D
      acpi/nfit, device-dax: Identify differentiated memory with a unique numa-node · 159d387d
      Dan Williams committed
      commit 8fc5c73554db0ac18c0c6ac5b2099ab917f83bdf upstream
      
      Persistent memory, as described by the ACPI NFIT (NVDIMM Firmware
      Interface Table), is the first known instance of a memory range
      described by a unique "target" proximity domain. "Initiator" and
      "target" proximity domains are the approach that the ACPI HMAT
      (Heterogeneous Memory Attributes Table) uses to describe the unique
      performance properties of a memory range relative to a given initiator
      (e.g. CPU or DMA device).
      
      Currently the numa-node for a /dev/pmemX block-device or /dev/daxX.Y
      char-device follows the traditional notion of 'numa-node' where the
      attribute conveys the closest online numa-node. That numa-node attribute
      is useful for cpu-binding and memory-binding processes *near* the
      device. However, when the memory range backing a 'pmem', or 'dax' device
      is onlined (memory hot-add) the memory-only-numa-node representing that
      address needs to be differentiated from the set of online nodes. In
      other words, the numa-node association of the device depends on whether
      you can bind processes *near* the cpu-numa-node in the offline
      device-case, or bind processes *on* the memory-range directly after the
      backing address range is onlined.
      
      Allow for the case that platform firmware describes persistent memory
      with a unique proximity domain, i.e. when it is distinct from the
      proximity of DRAM and CPUs that are on the same socket. Plumb the Linux
      numa-node translation of that proximity through the libnvdimm region
      device to namespaces that are in device-dax mode. With this in place the
      proposed kmem driver [1] can optionally discover a unique numa-node
      number for the address range as it transitions the memory from an
      offline state managed by a device-driver to an online memory range
      managed by the core-mm.
      
      [1]: https://lore.kernel.org/lkml/20181022201317.8558C1D8@viggo.jf.intel.com

      Reported-by: Fan Du <fan.du@intel.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: "Oliver O'Halloran" <oohall@gmail.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Jérôme Glisse <jglisse@redhat.com>
      Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      [yshi: Removed PowerPC stuff which is not applicable to 4.19]
      Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
      Reviewed-by: Gavin Shan <shan.gavin@linux.alibaba.com>
      159d387d
    • D
      device-dax: Add /sys/class/dax backwards compatibility · 796f6939
      Dan Williams committed
      commit 730926c3b0998943654019f00296cf8e3b02277e upstream
      
      On the expectation that some environments may not upgrade libdaxctl
      (userspace component that depends on the /sys/class/dax hierarchy),
      provide a default / legacy dax_pmem_compat driver. The dax_pmem_compat
      driver implements the original /sys/class/dax sysfs layout rather than
      /sys/bus/dax. When userspace is upgraded it can blacklist this module
      and switch to the dax_pmem driver going forward.
      
      CONFIG_DEV_DAX_PMEM_COMPAT and supporting code will be deleted according
      to the dax_pmem entry in Documentation/ABI/obsolete/.
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
      Reviewed-by: Gavin Shan <shan.gavin@linux.alibaba.com>
      796f6939
    • D
      device-dax: Add support for a dax override driver · e1da20e7
      Dan Williams committed
      commit d200781ef237a354d918ceff5cee350d88a93d42 upstream
      
      Introduce the 'new_id' concept for enabling a custom device-driver attach
      policy for dax-bus drivers. The intended use is to have a mechanism for
      hot-plugging device-dax ranges into the page allocator on-demand. With
      this in place the default policy of using device-dax for performance
      differentiated memory can be overridden by user-space policy that can
      arrange for the memory range to be managed as 'System RAM' with
      user-defined NUMA and other performance attributes.
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
      Reviewed-by: Gavin Shan <shan.gavin@linux.alibaba.com>
      e1da20e7
    • D
      device-dax: Move resource pinning+mapping into the common driver · ced768b9
      Dan Williams committed
      commit 89ec9f2cfa36cc5fca2fb445ed221bb9add7b536 upstream
      
      Move the responsibility of calling devm_request_resource() and
      devm_memremap_pages() into the common device-dax driver. This is another
      preparatory step to allowing an alternate personality driver for a
      device-dax range.
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
      Reviewed-by: Gavin Shan <shan.gavin@linux.alibaba.com>
      ced768b9
    • D
      device-dax: Introduce bus + driver model · 1ea15e85
      Dan Williams committed
      commit 9567da0b408a2553d32ca83cba4f1fc5a8aad459 upstream
      
      In support of multiple device-dax instances per device-dax-region and
      allowing the 'kmem' driver to attach to dax-instances instead of the
      current device-node access, convert the dax sub-system from a class to a
      bus. Recall that the kmem driver takes reserved / special purpose
      memories and assigns them to be managed by the core-mm.
      
      Aside from the fact the device-dax instances are registered and probed
      on a bus, two other lifetime-management changes are made:
      
      1/ Delay attaching a cdev until driver probe time
      
      2/ A new run_dax() helper is introduced to allow restoring dax-operation
         after a kill_dax() event. So, at driver ->probe() time we run_dax()
         and at ->remove() time we kill_dax() and invalidate all mappings.
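      The probe/remove pairing in item 2 can be pictured with the toy model
      below. It is a sketch only: struct dax_model and every *_model()
      function are hypothetical, and a single boolean stands in for whatever
      liveness state the real run_dax() and kill_dax() manage.

```c
#include <stdbool.h>

/* Hypothetical model of a dax device: a liveness flag and a mapping count. */
struct dax_model {
	bool alive;
	int mappings;
};

/* Model of run_dax(): (re-)enable dax operation. */
static void run_dax_model(struct dax_model *d)
{
	d->alive = true;
}

/* Model of kill_dax(): refuse new operations from now on. */
static void kill_dax_model(struct dax_model *d)
{
	d->alive = false;
}

/* Driver ->probe(): bring the device back to life. */
static int probe_model(struct dax_model *d)
{
	run_dax_model(d);
	return 0;
}

/* Driver ->remove(): kill the device and invalidate all mappings. */
static void remove_model(struct dax_model *d)
{
	kill_dax_model(d);
	d->mappings = 0;
}

/* A new mapping is only allowed while the device is alive. */
static bool map_model(struct dax_model *d)
{
	if (!d->alive)
		return false;
	d->mappings++;
	return true;
}
```

      The point of the model is that a device can go through remove and then
      probe again: mapping attempts fail while it is killed and succeed once
      it has been restarted.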
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
      Reviewed-by: Gavin Shan <shan.gavin@linux.alibaba.com>
      1ea15e85
    • D
      device-dax: Start defining a dax bus model · d19c14e6
      Dan Williams committed
      commit 51cf784c42d07fbd62cb604836a9270cf3361509 upstream
      
      Towards eliminating the dax_class, move the dax-device-attribute
      enabling to a new bus.c file in the core. The amount of code
      thrash of subsequent patches is reduced as no logic changes are made,
      just pure code movement.
      
      A temporary export of unregister_dev_dax() and dax_attribute_groups is
      needed to preserve compilation, but those symbols become static again in
      a follow-on patch.
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
      Reviewed-by: Gavin Shan <shan.gavin@linux.alibaba.com>
      d19c14e6
    • D
      device-dax: Remove multi-resource infrastructure · c1b93b62
      Dan Williams committed
      commit 753a0850e707e9a8c5861356222f9b9e4eba7945 upstream
      
      The multi-resource implementation anticipated discontiguous sub-division
      support. That has not yet materialized, so delete the infrastructure and
      related code.
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
      Reviewed-by: Gavin Shan <shan.gavin@linux.alibaba.com>
      c1b93b62
    • D
      device-dax: Kill dax_region base · aff72d83
      Dan Williams committed
      commit 93694f9630b0ed29cda61df58e480dcb34ef52fd upstream
      
      Nothing consumes this attribute of a region and devres otherwise
      remembers the value for de-allocation purposes.
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
      Reviewed-by: Gavin Shan <shan.gavin@linux.alibaba.com>
      aff72d83
    • D
      device-dax: Kill dax_region ida · e7d85980
      Dan Williams committed
      commit 21b9e979501fdb5f6797193d70428a2b00bd5247 upstream
      
      Commit bbb3be17 "device-dax: fix sysfs duplicate warnings" arranged
      for passing a dax instance-id to devm_create_dax_dev(), rather than
      generating one internally. Remove the dax_region ida and related code.
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
      Reviewed-by: Gavin Shan <shan.gavin@linux.alibaba.com>
      e7d85980
    • S
      ICX: tools/power/x86: A tool to validate Intel Speed Select commands · 363d6c2f
      Srinivas Pandruvada committed
      commit 3fb4f7cd472c7f5905c91508e988f6b28372210d upstream.
      
      Intel(R) Speed Select Technology contains four features.
      
      Performance profile: A non-architectural mechanism that allows multiple
      optimized performance profiles per system via static and/or dynamic
      adjustment of core count, workload, Tjmax, TDP, etc. Referred to as ISS
      in the documentation.
      
      Base frequency: Enables users to increase the guaranteed base frequency
      on certain cores (high-priority cores) in exchange for a lower base
      frequency on the remaining cores (low-priority cores). Referred to as
      PBF in the documentation.
      
      Turbo frequency: Enables setting different turbo ratio limits for cores
      based on priority. Referred to as FACT in the documentation.
      
      Core power: An interface that allows the user to define per-core/tile
      priority.
      
      There is multi-level help for commands and options. It can be used to
      check the required arguments for each feature and the commands
      available for that feature.
      
      To start navigating the features, start with:
      
      $ sudo intel-speed-select --help
      
      For help on a specific feature, for example:
      $ sudo intel-speed-select perf-profile --help
      
      To get help for a command of a feature, for example:
      $ sudo intel-speed-select perf-profile get-lock-status --help
      Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
      Acked-by: Len Brown <len.brown@intel.com>
      Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
      Signed-off-by: Youquan Song <youquan.song@intel.com>
      Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
      363d6c2f
    • S
      ICX: platform/x86: ISST: Restore state on resume · 6fb83d87
      Srinivas Pandruvada committed
      commit f607874f35cbd276a837d7147d4e1ec752dfef44 upstream.
      
      Store the commands that cause PUNIT writes and restore them on system
      resume. The driver keeps all such requests in a hash table, retaining
      the latest mailbox request parameters. On resume, these mailbox
      commands are executed again. Only 5 mailbox commands trigger this
      processing, so storing them and executing them on resume adds very
      little overhead. There is also no ordering requirement among these
      write/set mailbox commands. There is one MSR request for changing turbo
      ratio limits; it is also stored and restored on resume and CPU online.
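      The store-and-replay idea can be sketched as follows. This is a model
      under assumptions, not the driver code: it swaps the hash table for a
      small fixed array with a linear search to stay short, and store_cmd(),
      replay_cmds(), sum_params() and the (cpu, cmd) key are hypothetical.

```c
#include <stddef.h>

#define MAX_STORED 16	/* arbitrary capacity for this sketch */

/* One remembered write command, keyed by (cpu, cmd). */
struct stored_cmd {
	int in_use;
	unsigned int cpu;
	unsigned int cmd;
	unsigned int param;
};

static struct stored_cmd stored[MAX_STORED];

/* Remember a command; a repeated (cpu, cmd) keeps only the latest param. */
static int store_cmd(unsigned int cpu, unsigned int cmd, unsigned int param)
{
	struct stored_cmd *slot = NULL;
	size_t i;

	for (i = 0; i < MAX_STORED; i++) {
		if (stored[i].in_use && stored[i].cpu == cpu &&
		    stored[i].cmd == cmd) {
			stored[i].param = param;	/* overwrite older request */
			return 0;
		}
		if (!stored[i].in_use && !slot)
			slot = &stored[i];		/* first free slot */
	}
	if (!slot)
		return -1;				/* table full */
	slot->in_use = 1;
	slot->cpu = cpu;
	slot->cmd = cmd;
	slot->param = param;
	return 0;
}

typedef void (*send_fn)(unsigned int cpu, unsigned int cmd,
			unsigned int param, void *ctx);

/* On resume: send every remembered command again, in no particular order. */
static size_t replay_cmds(send_fn send, void *ctx)
{
	size_t i, n = 0;

	for (i = 0; i < MAX_STORED; i++) {
		if (stored[i].in_use) {
			send(stored[i].cpu, stored[i].cmd, stored[i].param, ctx);
			n++;
		}
	}
	return n;
}

/* Example sender that just accumulates the replayed parameters. */
static void sum_params(unsigned int cpu, unsigned int cmd,
		       unsigned int param, void *ctx)
{
	(void)cpu;
	(void)cmd;
	*(unsigned int *)ctx += param;
}
```

      Because only the latest parameters per key are kept and ordering does
      not matter, the replay is a single pass over the stored entries.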
      Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
      Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
      Signed-off-by: Youquan Song <youquan.song@intel.com>
      Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
      6fb83d87
    • S
      ICX: platform/x86: ISST: Add Intel Speed Select PUNIT MSR interface · 485b58d0
      Srinivas Pandruvada committed
      commit e765f37b9b8b4fa65682e9a78a2ca2b11d3d9096 upstream.
      
      Even with the new non-architectural features that use the PUNIT mailbox
      and MMIO read/write interface, there is still a need to use MSRs to
      control the PUNIT. User space could have used the user-space MSR
      interface for this, but it cannot when user-space MSR access is
      disabled. Only a limited number of MSRs are allowed through this new
      interface.
      Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
      Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
      Signed-off-by: Youquan Song <youquan.song@intel.com>
      Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
      485b58d0
    • S
      ICX: platform/x86: ISST: Add Intel Speed Select mailbox interface via MSRs · 6b5bb234
      Srinivas Pandruvada committed
      commit 71b21bd7f68a6ee59003f63d2e4f84fd9b0a8d07 upstream.
      
      Add an IOCTL to send mailbox commands to PUNIT using the PUNIT mailbox
      MSRs. Some CPU models don't have a PCI device, so MSRs must be used.
      A limited set of mailbox commands can be sent to PUNIT.
      
      This interface is used by the intel-speed-select tool under
      tools/power/x86 to enumerate and control Intel Speed Select features.
      The MBOX command IDs and the semantics of the messages can be checked
      in the source code of the tool.
      Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
      Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
      Signed-off-by: Youquan Song <youquan.song@intel.com>
      Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
      6b5bb234
    • S
      ICX: platform/x86: ISST: Add Intel Speed Select mailbox interface via PCI · c195f5b3
      Srinivas Pandruvada committed
      commit 31a166fe9c269af17977e650846ee4ea50361c07 upstream.
      
      Add an IOCTL to send mailbox commands to PUNIT using PUNIT PCI device.
      A limited set of mailbox commands can be sent to PUNIT.
      
      This MMIO interface is used by the intel-speed-select tool under
      tools/power/x86 to enumerate and control Intel Speed Select features.
      The MBOX command IDs and the semantics of the messages can be checked
      in the source code of the tool.
      Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
      Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
      Signed-off-by: Youquan Song <youquan.song@intel.com>
      Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
      c195f5b3