1. 15 Jan, 2020 40 commits
    • X
      alinux: block: add counter to track io request's d2c time · ba2896ac
      Xiaoguang Wang committed
      iostat's await is a coarse metric: it cannot show a request's latency
      on the device driver's side.

      Here we add a new counter to track io requests' d2c time. With this
      patch, iostat can easily be extended to show this value.

      Note:
      I checked how iostat is implemented: it only reads the fields it needs,
      so iostat won't be affected by this change, and neither will tsar.
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
      ba2896ac
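As a rough illustration of how a monitoring tool could consume such a counter, the sketch below derives the average d2c latency over a sampling interval from two snapshots of cumulative counters. The `Snapshot` layout and its field names are hypothetical; the commit message does not show where the new counter sits in the stat file.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    ios: int     # cumulative number of completed requests (hypothetical field)
    d2c_ms: int  # cumulative d2c time in milliseconds (hypothetical field)


def avg_d2c_ms(prev: Snapshot, curr: Snapshot) -> float:
    """Average d2c time per request between two snapshots."""
    done = curr.ios - prev.ios
    if done <= 0:
        # No request completed in the interval, so there is nothing to average.
        return 0.0
    return (curr.d2c_ms - prev.d2c_ms) / done
```

This mirrors the way interval averages are usually derived from cumulative counters: the difference in accumulated time divided by the difference in completed requests.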
    • M
      alinux: fuse: add sysfs api to flush processing queue requests · fc0a9b55
      Ma Jie Yue committed
      During the crash and recovery of a fuse userspace daemon, failover
      reuses the existing fuse conn without unmounting it. But when the crash
      happens, some requests may already be in processing in the daemon, with
      no reply sent yet. The application will then hang, since it will never
      get the reply after the failover.

      We add a sysfs api to flush these requests after the daemon crashes and
      before recovery. The issue is easy to reproduce in the fuse userspace
      daemon: just exit after receiving a request and before sending the reply
      back. The application will hang in some read/write operation until
      echo 1 > /sys/fs/fuse/connection/xxx/flush is run. The flush operation
      makes the io fail and returns the error to the application.
      Signed-off-by: Ma Jie Yue <majieyue@linux.alibaba.com>
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
      fc0a9b55
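A recovery script could trigger the flush between the daemon crash and its restart. The sketch below is only an illustration: the helper name is made up, and the connection directory (the `xxx` in the path quoted above) must be supplied by the caller.

```python
import os


def flush_processing_queue(conn_dir: str) -> str:
    """Write "1" to <conn_dir>/flush, like `echo 1 > <conn_dir>/flush`.

    conn_dir is the per-connection directory; how it is discovered is
    outside the scope of this sketch.
    """
    path = os.path.join(conn_dir, "flush")
    with open(path, "w") as f:
        f.write("1\n")
    return path
```

A caller would run this after detecting that the daemon has died and before starting the replacement daemon, so that blocked applications see an error instead of hanging.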
    • X
      alinux: jbd2: add proc entry to control whether doing buffer copy-out · 1ced8a5c
      Xiaoguang Wang committed
      When jbd2 tries to get write access to a buffer that is under
      writeback with the BH_Shadow flag set, jbd2 waits until the buffer
      has been written to disk. Sometimes this wait can be very long,
      especially when the disk is almost full.

      Here we add a proc entry "force_copy". If its value is not zero, jbd2
      always does the meta buffer copy-out, which eliminates the unnecessary
      waiting time here and reduces long tail latency for buffered writes.

      I constructed the test case below:
      
      $cat offline.fio
      ; fio-rand-RW.job for fiotest
      
      [global]
      name=fio-rand-RW
      filename=fio-rand-RW
      rw=randrw
      rwmixread=60
      rwmixwrite=40
      bs=4K
      direct=0
      numjobs=4
      time_based=1
      runtime=900
      
      [file1]
      size=60G
      ioengine=sync
      iodepth=16
      
      $cat online.fio
      ; fio-seq-write.job for fiotest
      
      [global]
      name=fio-seq-write
      filename=fio-seq-write
      rw=write
      bs=256K
      direct=0
      numjobs=1
      time_based=1
      runtime=60
      
      [file1]
      rate=50m
      size=10G
      ioengine=sync
      iodepth=16
      
      With this patch:
      $cat /proc/fs/jbd2/sda5-8/force_copy
      0
      
      online fio almost always gets long tail latency like this:
      
      Jobs: 1 (f=1), 0B/s-0B/s: [W(1)][100.0%][w=50.0MiB/s][w=200 IOPS][eta 00m:00s]
      file1: (groupid=0, jobs=1): err= 0: pid=17855: Thu Nov 15 09:45:57 2018
        write: IOPS=200, BW=50.0MiB/s (52.4MB/s)(3000MiB/60001msec)
          clat (usec): min=135, max=4086.6k, avg=867.21, stdev=50338.22
           lat (usec): min=139, max=4086.6k, avg=871.16, stdev=50338.22
          clat percentiles (usec):
           |  1.00th=[    141],  5.00th=[    143], 10.00th=[    145],
           | 20.00th=[    147], 30.00th=[    147], 40.00th=[    149],
           | 50.00th=[    149], 60.00th=[    151], 70.00th=[    153],
           | 80.00th=[    155], 90.00th=[    159], 95.00th=[    163],
           | 99.00th=[    255], 99.50th=[    273], 99.90th=[    429],
           | 99.95th=[    441], 99.99th=[3640656]
      
      $cat /proc/fs/jbd2/sda5-8/force_copy
      1
      
      online fio latency is much better.
      
      Jobs: 1 (f=1), 0B/s-0B/s: [W(1)][100.0%][w=50.0MiB/s][w=200 IOPS][eta 00m:00s]
      file1: (groupid=0, jobs=1): err= 0: pid=8084: Thu Nov 15 09:31:15 2018
        write: IOPS=200, BW=50.0MiB/s (52.4MB/s)(3000MiB/60001msec)
          clat (usec): min=137, max=545, avg=151.35, stdev=16.22
           lat (usec): min=140, max=548, avg=155.31, stdev=16.65
          clat percentiles (usec):
           |  1.00th=[  143],  5.00th=[  145], 10.00th=[  145], 20.00th=[  147],
           | 30.00th=[  147], 40.00th=[  147], 50.00th=[  149], 60.00th=[  149],
           | 70.00th=[  151], 80.00th=[  155], 90.00th=[  157], 95.00th=[  161],
           | 99.00th=[  239], 99.50th=[  269], 99.90th=[  420], 99.95th=[  429],
           | 99.99th=[  537]
      
      As for the cost: because we always need to copy the meta buffer, this
      consumes a little cpu time and some memory (at most 32MB for a 128MB
      journal).
      Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
      1ced8a5c
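To compare the two fio runs programmatically rather than by eye, the "clat percentiles" lines can be parsed into a mapping. This is a minimal sketch that assumes the text layout shown in the output above; the function name is made up for illustration.

```python
import re

# Matches entries such as "99.99th=[3640656]" or "1.00th=[    141]".
PERCENTILE_RE = re.compile(r"(\d+\.\d+)th=\[\s*(\d+)\]")


def parse_clat_percentiles(text: str) -> dict:
    """Map each percentile (float) to its latency in usec (int)."""
    return {float(p): int(v) for p, v in PERCENTILE_RE.findall(text)}
```

Feeding it the 99.99th lines from the two runs gives 3640656 usec with force_copy set to 0 and 537 usec with it set to 1, which is the tail-latency difference the commit message describes.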
    • X
      alinux: ext4: don't submit unwritten extent while holding active jbd2 handle · c7c8cb0e
      Xiaoguang Wang committed
      In ext4_writepages(), for every iteration, mpage_prepare_extent_to_map()
      will try to find 2048 pages to map and normally one bio can contain 256
      pages at most. If we really found 2048 pages to map, there will be 4 bios
      and 4 ext4_io_submit() calls which are called both in ext4_writepages()
      and mpage_map_and_submit_extent().
      
      But note that in mpage_map_and_submit_extent() we hold a valid jbd2
      handle. When dioread_nolock is enabled and the extent is unwritten, the
      jbd2 commit thread will wait for this handle to finish, and so will wait
      until the unwritten extent is written to disk. This introduces
      unnecessary stall time, which is even longer when the writeback
      operation is io throttled.

      For this case, we accumulate bios in ext4_io_submit's io_bio, and
      only submit these bios after dropping the jbd2 handle.
      Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
      c7c8cb0e
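The ordering change can be pictured with a toy model that records when each bio is submitted relative to the lifetime of the handle. Nothing here is ext4 code; the function and event names are invented for illustration.

```python
def writeback(bios, defer_submit):
    """Return the ordered events of one toy map-and-submit pass."""
    events = ["handle_start"]
    pending = []
    for bio in bios:
        if defer_submit:
            # Accumulate the bio while the handle is held.
            pending.append(bio)
        else:
            # Old behaviour: submit while still holding the handle.
            events.append(f"submit {bio}")
    events.append("handle_stop")
    for bio in pending:
        # New behaviour: submit only after the handle has been dropped.
        events.append(f"submit {bio}")
    return events
```

With `defer_submit=True`, every submit event lands after `handle_stop`, so a throttled submission can no longer extend the time the handle stays open.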
    • Z
      alinux: fs,ext4: remove projid limit when create hard link · 08e6d768
      zhangliguang committed
      This is a temporary workaround to avoid the limitation when
      creating a hard link across two projids.
      Signed-off-by: zhangliguang <zhangliguang@linux.alibaba.com>
      Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
      08e6d768
    • X
      alinux: jbd2: add new "stats" proc file · 7e2e7b9a
      Xiaoguang Wang committed
      /proc/fs/jbd2/${device}/info only shows statistics averaged over
      jbd2's whole life cycle. It cannot show jbd2 info for a specified
      time interval, a capability that is sometimes very useful for
      troubleshooting. For example, we cannot see how rs_locked and
      rs_flushing grow over a specified time interval, yet these two
      metrics can help explain an app's behaviour.

      Here we add a new "stats" proc file like /proc/diskstats. With it we
      can implement a simple tool, jbd2_stat, which displays detailed jbd2
      info for a specified time interval, as below (time interval 5s):
      
      [lege@localhost ~]$ cat /proc/fs/jbd2/vdb1-8/stats
      51 30 8192 0 1 241616 0 0 22 0 47158 891 942 1000 1000
      
      [lege@localhost ~]$ gcc -o jbd2_stat jbd2_stat.c ; ./jbd2_stat
      
      Device              tid     trans   handles    locked  flushing   logging
      vdb1-8             1861       158       359     13.00      0.00      2.00

      Device              tid     trans   handles    locked  flushing   logging
      vdb1-8             1974       113       389     26.00      0.00      5.00

      Device              tid     trans   handles    locked  flushing   logging
      vdb1-8             2188       214       308     10.00      0.00      7.00

      Device              tid     trans   handles    locked  flushing   logging
      vdb1-8             2344       156       332     19.00      0.00      4.00
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com>
      Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
      7e2e7b9a
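An interval tool in the style of iostat would sample the "stats" file twice and work on the differences. The sketch below shows only that delta step. The commit message does not say what each of the fields means or which ones are cumulative, so the helper treats the line as an anonymous list of integers; the function names are made up.

```python
def parse_stats(line: str) -> list:
    """Split one line of the stats file into integers."""
    return [int(tok) for tok in line.split()]


def interval_delta(prev_line: str, curr_line: str) -> list:
    """Per-field difference between two samples of the stats line."""
    prev, curr = parse_stats(prev_line), parse_stats(curr_line)
    if len(prev) != len(curr):
        raise ValueError("field count changed between samples")
    return [c - p for p, c in zip(prev, curr)]
```

A real tool would then pick the fields it knows to be cumulative and divide time-like deltas by the number of transactions in the interval.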
    • J
      alinux: jbd2: create jbd2-ckpt thread for journal checkpoint · 3999cdd9
      Joseph Qi committed
      This does jbd2 checkpointing in a dedicated kernel thread, so that
      checkpointing is not subject to io throttle control.
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
      Signed-off-by: zhangliguang <zhangliguang@linux.alibaba.com>
      Reviewed-by: Baoyou Xie <baoyou.xie@linux.alibaba.com>
      Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com>
      Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
      3999cdd9
    • Y
      ICX: perf/x86/intel: Fix invalid Bit 13 for Icelake MSR_OFFCORE_RSP_x register · 582d52a7
      Yunying Sun committed
      commit 3b238a64c3009fed36eaea1af629d9377759d87d upstream.
      
      The Intel SDM states that bit 13 of Icelake's MSR_OFFCORE_RSP_x
      register is valid, and used for counting hardware generated prefetches
      of L3 cache. Update the bitmask to allow bit 13.
      
      Before:
      $ perf stat -e cpu/event=0xb7,umask=0x1,config1=0x1bfff/u sleep 3
       Performance counter stats for 'sleep 3':
         <not supported>      cpu/event=0xb7,umask=0x1,config1=0x1bfff/u
      
      After:
      $ perf stat -e cpu/event=0xb7,umask=0x1,config1=0x1bfff/u sleep 3
       Performance counter stats for 'sleep 3':
                   9,293      cpu/event=0xb7,umask=0x1,config1=0x1bfff/u
      Signed-off-by: Yunying Sun <yunying.sun@intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Kan Liang <kan.liang@linux.intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: acme@kernel.org
      Cc: alexander.shishkin@linux.intel.com
      Cc: bp@alien8.de
      Cc: hpa@zytor.com
      Cc: jolsa@redhat.com
      Cc: namhyung@kernel.org
      Link: https://lkml.kernel.org/r/20190724082932.12833-1-yunying.sun@intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Lin Wang <lin.x.wang@intel.com>
      Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      582d52a7
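The effect of widening the bitmask can be illustrated with a small check that rejects any `config1` value using bits outside an allowed mask. The two mask constants below are illustrative values chosen so that they differ only in bit 13; they are not quoted from the patch.

```python
BIT13 = 1 << 13

# Illustrative masks only: identical except for bit 13.
MASK_WITH_BIT13 = 0x3FFFFFBFFF
MASK_WITHOUT_BIT13 = MASK_WITH_BIT13 & ~BIT13


def config_supported(config1: int, valid_mask: int) -> bool:
    """A config is accepted only if it sets no bit outside valid_mask."""
    return (config1 & ~valid_mask) == 0
```

The value `0x1bfff` used in the perf commands above has bit 13 set, so it is rejected under the narrower mask and accepted once bit 13 is allowed, matching the `<not supported>` versus counted output.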
    • K
      ICX: perf/x86/intel: Add more Icelake CPUIDs · 31c70c6a
      Kan Liang committed
      commit faaeff98666c24376cebd0b106504d05a36881d1 upstream.
      
      Add new model number for Icelake desktop and server to perf.
      
      The data source encoding for Icelake server is the same as Skylake
      server.
      Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: bp@alien8.de
      Cc: qiuxu.zhuo@intel.com
      Cc: rui.zhang@intel.com
      Cc: tony.luck@intel.com
      Link: https://lkml.kernel.org/r/20190603134122.13853-2-kan.liang@linux.intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Lin Wang <lin.x.wang@intel.com>
      Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      31c70c6a
    • B
      resource/docs: Complete kernel-doc style function documentation · f1f8f4a4
      Borislav Petkov committed
      commit f26621e60b35369bca9228bc936dc723b3e421af upstream.
      
      Add the missing kernel-doc style function parameters documentation.
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: akpm@linux-foundation.org
      Cc: linux-tip-commits@vger.kernel.org
      Cc: rdunlap@infradead.org
      Fixes: b69c2e20f6e4 ("resource: Clean it up a bit")
      Link: http://lkml.kernel.org/r/20181105093307.GA12445@zn.tnic
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      [joseph: fix find_next_iomem_res() documentation]
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Caspar Zhang <caspar@linux.alibaba.com>
      f1f8f4a4
    • R
      resource/docs: Fix new kernel-doc warnings · 275561cd
      Randy Dunlap committed
      commit f75d651587f719a813ebbbfeee570e6570731d55 upstream.
      
      The first group of warnings is caused by a "/**" kernel-doc notation
      marker but the function comments are not in kernel-doc format.
      Also add another error return value here.
      
        ../kernel/resource.c:337: warning: Function parameter or member 'start' not described in 'find_next_iomem_res'
        ../kernel/resource.c:337: warning: Function parameter or member 'end' not described in 'find_next_iomem_res'
        ../kernel/resource.c:337: warning: Function parameter or member 'flags' not described in 'find_next_iomem_res'
        ../kernel/resource.c:337: warning: Function parameter or member 'desc' not described in 'find_next_iomem_res'
        ../kernel/resource.c:337: warning: Function parameter or member 'first_lvl' not described in 'find_next_iomem_res'
        ../kernel/resource.c:337: warning: Function parameter or member 'res' not described in 'find_next_iomem_res'
      
      Add the missing function parameter documentation for the other warnings:
      
        ../kernel/resource.c:409: warning: Function parameter or member 'arg' not described in 'walk_iomem_res_desc'
        ../kernel/resource.c:409: warning: Function parameter or member 'func' not described in 'walk_iomem_res_desc'
      Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Fixes: b69c2e20f6e4 ("resource: Clean it up a bit")
      Link: http://lkml.kernel.org/r/dda2e4d8-bedd-3167-20fe-8c7d2d35b354@infradead.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      [joseph: fix find_next_iomem_res() documentation]
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Caspar Zhang <caspar@linux.alibaba.com>
      275561cd
    • Q
      acpi/hmat: fix an uninitialized memory_target · 480342a1
      Qian Cai committed
      commit ab3a9f2ccc080d27873f76869c9a780be45e581e upstream.
      
      The commit 665ac7e92757 ("acpi/hmat: Register processor domain to its
      memory") introduced an uninitialized "struct memory_target" that could
      cause an incorrect branching.
      
      drivers/acpi/hmat/hmat.c:385:6: warning: variable 'target' is used
      uninitialized whenever 'if' condition is false
      [-Wsometimes-uninitialized]
              if (p->flags & ACPI_HMAT_MEMORY_PD_VALID) {
                  ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      drivers/acpi/hmat/hmat.c:392:6: note: uninitialized use occurs here
              if (target && p->flags & ACPI_HMAT_PROCESSOR_PD_VALID) {
                  ^~~~~~
      drivers/acpi/hmat/hmat.c:385:2: note: remove the 'if' if its condition
      is always true
              if (p->flags & ACPI_HMAT_MEMORY_PD_VALID) {
              ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      drivers/acpi/hmat/hmat.c:369:30: note: initialize the variable 'target'
      to silence this warning
              struct memory_target *target;
                                          ^
                                           = NULL
      Signed-off-by: Qian Cai <cai@lca.pw>
      Reviewed-by: Mukesh Ojha <mojha@codeaurora.org>
      Fixes: 665ac7e92757 ("acpi/hmat: Register processor domain to its memory")
      Reviewed-by: Nathan Chancellor <natechancellor@gmail.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Caspar Zhang <caspar@linux.alibaba.com>
      480342a1
    • T
      ICX: EDAC, i10nm: Fix randconfig builds · 30725711
      Tony Luck committed
      commit d6a9f7336d925364daca00557afa59a68e78b422 upstream.
      
      I10NM_EDAC depends on CONFIG_ACPI so make that dependency explicit.
      Reported-by: Borislav Petkov <bp@suse.de>
      Signed-off-by: Tony Luck <tony.luck@intel.com>
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Cc: Aristeu Rozanski <aris@redhat.com>
      Cc: James Morse <james.morse@arm.com>
      Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
      Cc: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
      Cc: linux-edac <linux-edac@vger.kernel.org>
      Link: https://lkml.kernel.org/r/20190205180200.26865-1-tony.luck@intel.com
      Signed-off-by: Caspar Zhang <caspar@linux.alibaba.com>
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      30725711
    • A
      tools x86 uapi asm: Sync the pt_regs.h copy with the kernel sources · 3f7e9bbe
      Arnaldo Carvalho de Melo committed
      commit 0ceb5499a8001e5ddac2c8bd7b45eb4c643469ad upstream.
      
      To get the changes in:
      
        878068ea270e ("perf/x86: Support outputting XMM registers")
      
      That will be used in a followup patch to allow users to ask for some or
      all of those registers to be collected in certain contexts.
      
      This silences the following perf build warning:
      
        Warning: Kernel ABI header at 'tools/arch/x86/include/uapi/asm/perf_regs.h' differs from latest version at 'arch/x86/include/uapi/asm/perf_regs.h'
        diff -u tools/arch/x86/include/uapi/asm/perf_regs.h arch/x86/include/uapi/asm/perf_regs.h
      
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Jiri Olsa <jolsa@kernel.org>
      Cc: Kan Liang <kan.liang@linux.intel.com>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Link: https://lkml.kernel.org/n/tip-6pjnnrzqt3x3n2cd6br3wk7k@git.kernel.org
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
      3f7e9bbe
    • P
      device-dax: fix memory and resource leak if hotplug fails · 4447e303
      Pavel Tatashin committed
      commit 31e4ca92a7dd4cdebd7fe1456b3b0b6ace9a816f upstream
      
      Patch series ""Hotremove" persistent memory", v6.
      
      Recently, support for using persistent memory like regular RAM was
      added to Linux.  This work extends that functionality to also allow
      hot-removing persistent memory.
      
      We (Microsoft) have an important use case for this functionality.
      
      The requirement is for physical machines with small amount of RAM (~8G)
      to be able to reboot in a very short period of time (<1s).  Yet, there
      is a userland state that is expensive to recreate (~2G).
      
      The solution is to boot machines with 2G preserved for persistent
      memory.
      
      Copy the state, and hotadd the persistent memory so machine still has
      all 8G available for runtime.  Before reboot, offline and hotremove
      device-dax 2G, copy the memory that is needed to be preserved to pmem0
      device, and reboot.
      
      The series of operations looks like this:
      
      1. After boot restore /dev/pmem0 to ramdisk to be consumed by apps.
         and free ramdisk.
      2. Convert raw pmem0 to devdax
         ndctl create-namespace --mode devdax --map mem -e namespace0.0 -f
      3. Hotadd to System RAM
         echo dax0.0 > /sys/bus/dax/drivers/device_dax/unbind
         echo dax0.0 > /sys/bus/dax/drivers/kmem/new_id
         echo online_movable > /sys/devices/system/memory/memoryXXX/state
      4. Before reboot hotremove device-dax memory from System RAM
         echo offline > /sys/devices/system/memory/memoryXXX/state
         echo dax0.0 > /sys/bus/dax/drivers/kmem/unbind
      5. Create raw pmem0 device
         ndctl create-namespace --mode raw  -e namespace0.0 -f
      6. Copy the state that was stored by apps to ramdisk to pmem device
      7. Do kexec reboot or reboot through firmware if firmware does not
         zero memory in pmem0 region (These machines have only regular
         volatile memory). So to have pmem0 device either memmap kernel
         parameter is used, or devices nodes in dtb are specified.
      
      This patch (of 3):
      
      When add_memory() fails, the resource and the memory should be freed.
      
      Link: http://lkml.kernel.org/r/20190517215438.6487-2-pasha.tatashin@soleen.com
      Fixes: c221c0b0308f ("device-dax: "Hotplug" persistent memory for use like normal RAM")
      Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com>
      Reviewed-by: Dave Hansen <dave.hansen@intel.com>
      Cc: Bjorn Helgaas <bhelgaas@google.com>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: James Morris <jmorris@namei.org>
      Cc: Jérôme Glisse <jglisse@redhat.com>
      Cc: Keith Busch <keith.busch@intel.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Ross Zwisler <zwisler@kernel.org>
      Cc: Sasha Levin <sashal@kernel.org>
      Cc: Takashi Iwai <tiwai@suse.de>
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Cc: Vishal Verma <vishal.l.verma@intel.com>
      Cc: Yaowei Bai <baiyaowei@cmss.chinamobile.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
      Reviewed-by: Gavin Shan <shan.gavin@linux.alibaba.com>
      4447e303
    • V
      device-dax: Add a 'resource' attribute · 16980fca
      Vishal Verma committed
      commit 40cdc60ac16a42eb4e013f84d0e7aa1d6ee060d3 upstream
      
      device-dax based devices were missing a 'resource' attribute to indicate
      the physical address range contributed by the device in question. This
      information is desirable to userspace tooling that may want to use the
      dax device as system-ram, and wants to selectively hotplug and online
      the memory blocks associated with a given device.
      
      Without this, the tooling would have to parse /proc/iomem for the memory
      ranges contributed by dax devices, which can be a workaround, but it is
      far easier to provide this information in the sysfs hierarchy.
      
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
      Reviewed-by: Gavin Shan <shan.gavin@linux.alibaba.com>
      16980fca
    • A
      drivers/dax: Allow to include DEV_DAX_PMEM as builtin · 6686a859
      Aneesh Kumar K.V committed
      commit 67476656febd7ec5f1fe1aeec3c441fcf53b1e45 upstream
      
      This moves the dependency to DEV_DAX_PMEM_COMPAT such that the compat
      support is only allowed if DEV_DAX_PMEM is built as a module.

      This allows testing the new code easily in an emulation setup where we
      often build things without module support.
      
      Cc: <stable@vger.kernel.org>
      Fixes: 730926c3b099 ("device-dax: Add /sys/class/dax backwards compatibility")
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
      Reviewed-by: Gavin Shan <shan.gavin@linux.alibaba.com>
      6686a859
    • D
      device-dax: "Hotplug" persistent memory for use like normal RAM · 55a0741b
      Dave Hansen committed
      commit c221c0b0308fd01d9fb33a16f64d2fd95f8830a4 upstream
      
      This is intended for use with NVDIMMs that are physically persistent
      (physically like flash) so that they can be used as a cost-effective
      RAM replacement.  Intel Optane DC persistent memory is one
      implementation of this kind of NVDIMM.
      
      Currently, a persistent memory region is "owned" by a device driver,
      either the "Direct DAX" or "Filesystem DAX" drivers.  These drivers
      allow applications to explicitly use persistent memory, generally
      by being modified to use special, new libraries. (DIMM-based
      persistent memory hardware/software is described in great detail
      here: Documentation/nvdimm/nvdimm.txt).
      
      However, this limits persistent memory use to applications which
      *have* been modified.  To make it more broadly usable, this driver
      "hotplugs" memory into the kernel, to be managed and used just like
      normal RAM would be.
      
      To make this work, management software must remove the device from
      being controlled by the "Device DAX" infrastructure:
      
      	echo dax0.0 > /sys/bus/dax/drivers/device_dax/unbind
      
      and then tell the new driver that it can bind to the device:
      
      	echo dax0.0 > /sys/bus/dax/drivers/kmem/new_id
      
      After this, there will be a number of new memory sections visible
      in sysfs that can be onlined, or that may get onlined by existing
      udev-initiated memory hotplug rules.
      
      This rebinding procedure is currently a one-way trip.  Once memory
      is bound to "kmem", it's there permanently and can not be
      unbound and assigned back to device_dax.
      
      The kmem driver will never bind to a dax device unless the device
      is *explicitly* bound to the driver.  There are two reasons for
      this: One, since it is a one-way trip, it can not be undone if
      bound incorrectly.  Two, the kmem driver destroys data on the
      device.  Think of if you had good data on a pmem device.  It
      would be catastrophic if you compile-in "kmem", but leave out
      the "device_dax" driver.  kmem would take over the device and
      write volatile data all over your good data.
      
      This inherits any existing NUMA information for the newly-added
      memory from the persistent memory device that came from the
      firmware.  On Intel platforms, the firmware has guarantees that
      require each socket's persistent memory to be in a separate
      memory-only NUMA node.  That means that this patch is not expected
      to create NUMA nodes, but will simply hotplug memory into existing
      nodes.
      
      Because NUMA nodes are created, the existing NUMA APIs and tools
      are sufficient to create policies for applications or memory areas
      to have affinity for or an aversion to using this memory.
      
      There is currently some metadata at the beginning of pmem regions.
      The section-size memory hotplug restrictions, plus this small
      reserved area can cause the "loss" of a section or two of capacity.
      This should be fixable in follow-on patches.  But, as a first step,
      losing 256MB of memory (worst case) out of hundreds of gigabytes
      is a good tradeoff vs. the required code to fix this up precisely.
      This calculation is also the reason we export
      memory_block_size_bytes().
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Reviewed-by: Dan Williams <dan.j.williams@intel.com>
      Reviewed-by: Keith Busch <keith.busch@intel.com>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: Ross Zwisler <zwisler@kernel.org>
      Cc: Vishal Verma <vishal.l.verma@intel.com>
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: linux-nvdimm@lists.01.org
      Cc: linux-kernel@vger.kernel.org
      Cc: linux-mm@kvack.org
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Bjorn Helgaas <bhelgaas@google.com>
      Cc: Yaowei Bai <baiyaowei@cmss.chinamobile.com>
      Cc: Takashi Iwai <tiwai@suse.de>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Reviewed-by: Vishal Verma <vishal.l.verma@intel.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
      Reviewed-by: Gavin Shan <shan.gavin@linux.alibaba.com>
      55a0741b
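The capacity "loss" from the section-size restriction mentioned in the message above can be worked through numerically. The sketch assumes a 128 MiB memory block size (on a real system the value comes from memory_block_size_bytes() and varies by platform) and a made-up pmem range whose start is offset by a small metadata area.

```python
MiB = 1 << 20


def hotpluggable_range(start: int, size: int, block_size: int):
    """Trim [start, start + size) to whole memory blocks.

    Returns (aligned_start, usable_bytes, lost_bytes).
    """
    end = start + size
    aligned_start = -(-start // block_size) * block_size  # round start up
    aligned_end = (end // block_size) * block_size        # round end down
    if aligned_end <= aligned_start:
        return aligned_start, 0, size
    usable = aligned_end - aligned_start
    return aligned_start, usable, size - usable


# Hypothetical region: starts 2 MiB past a block boundary, ends on one.
start, usable, lost = hotpluggable_range(4098 * MiB, 1022 * MiB, 128 * MiB)
print(start // MiB, usable // MiB, lost // MiB)
```

Under this assumed block size, trimming both ends loses less than two blocks in total, which is consistent with the 256MB worst-case figure quoted in the commit message.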
    • D
      mm/resource: Let walk_system_ram_range() search child resources · a9b17a5e
      Dave Hansen committed
      commit 2b539aefe9e48e3908cff02699aa63a8b9bd268e upstream
      
      In the process of onlining memory, we use walk_system_ram_range()
      to find the actual RAM areas inside of the area being onlined.
      
      However, it currently only finds memory resources which are
      "top-level" iomem_resources.  Children are not currently
      searched which causes it to skip System RAM in areas like this
      (in the format of /proc/iomem):
      
      a0000000-bfffffff : Persistent Memory (legacy)
        a0000000-afffffff : System RAM
      
      Changing the true->false here allows children to be searched
      as well.  We need this because we add a new "System RAM"
      resource underneath the "persistent memory" resource when
      we use persistent memory in a volatile mode.
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Keith Busch <keith.busch@intel.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: Ross Zwisler <zwisler@kernel.org>
      Cc: Vishal Verma <vishal.l.verma@intel.com>
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: linux-nvdimm@lists.01.org
      Cc: linux-kernel@vger.kernel.org
      Cc: linux-mm@kvack.org
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Bjorn Helgaas <bhelgaas@google.com>
      Cc: Yaowei Bai <baiyaowei@cmss.chinamobile.com>
      Cc: Takashi Iwai <tiwai@suse.de>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
      Reviewed-by: Gavin Shan <shan.gavin@linux.alibaba.com>
      a9b17a5e
    • D
      mm/memory-hotplug: Allow memory resources to be children · 567bed57
      Dave Hansen 提交于
      commit 2794129e902d8eb69413d884dc6404b8716ed9ed upstream
      
      The mm/resource.c code is used to manage the physical address
      space.  The current resource configuration can be viewed in
      /proc/iomem.  An example of this is at the bottom of this
      description.
      
      The nvdimm subsystem "owns" the physical address resources which
      map to persistent memory and has resources inserted for them as
      "Persistent Memory".  The best way to repurpose this for volatile
      use is to leave the existing resource in place, but add a "System
      RAM" resource underneath it. This clearly communicates the
      ownership relationship of this memory.
      
      The request_resource_conflict() API only deals with the
      top-level resources.  Replace it with __request_region() which
      will search for !IORESOURCE_BUSY areas lower in the resource
      tree than the top level.
      
      We *could* also simply truncate the existing top-level
      "Persistent Memory" resource and take over the released address
      space.  But, this means that if we ever decide to hot-unplug the
      "RAM" and give it back, we need to recreate the original setup,
      which may mean going back to the BIOS tables.
      
      This should have no real effect on the existing collision
      detection because the areas that truly conflict should be marked
      IORESOURCE_BUSY.
      
      00000000-00000fff : Reserved
      00001000-0009fbff : System RAM
      0009fc00-0009ffff : Reserved
      000a0000-000bffff : PCI Bus 0000:00
      000c0000-000c97ff : Video ROM
      000c9800-000ca5ff : Adapter ROM
      000f0000-000fffff : Reserved
        000f0000-000fffff : System ROM
      00100000-9fffffff : System RAM
        01000000-01e071d0 : Kernel code
        01e071d1-027dfdff : Kernel data
        02dc6000-0305dfff : Kernel bss
      a0000000-afffffff : Persistent Memory (legacy)
        a0000000-a7ffffff : System RAM
      b0000000-bffdffff : System RAM
      bffe0000-bfffffff : Reserved
      c0000000-febfffff : PCI Bus 0000:00
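The descent into non-busy resources can be modelled roughly as follows. This is a simplified userspace sketch under stated assumptions, not the kernel's __request_region(): the `Resource` class, `find_conflict`, and `request_region` are hypothetical names, and locking, waiting, and resource descriptors are left out.

```python
from dataclasses import dataclass, field


@dataclass
class Resource:
    name: str
    start: int
    end: int
    busy: bool = False
    children: list = field(default_factory=list)


def find_conflict(parent, start, end):
    """Return the first child of parent that overlaps [start, end], else None."""
    for child in parent.children:
        if child.start <= end and start <= child.end:
            return child
    return None


def request_region(root, name, start, end):
    """Insert a busy region, descending into overlapping non-busy resources."""
    parent = root
    while True:
        if start < parent.start or end > parent.end:
            return None  # the request does not fit inside this parent
        conflict = find_conflict(parent, start, end)
        if conflict is None:
            res = Resource(name, start, end, busy=True)
            parent.children.append(res)
            return res
        if conflict.busy:
            return None  # a true conflict: the area is already claimed
        parent = conflict  # not busy: look for room underneath it


root = Resource("iomem", 0x0, 0xFFFFFFFF)
pmem = Resource("Persistent Memory (legacy)", 0xA0000000, 0xAFFFFFFF)
root.children.append(pmem)

ram = request_region(root, "System RAM", 0xA0000000, 0xA7FFFFFF)
again = request_region(root, "System RAM", 0xA0000000, 0xA7FFFFFF)
print(ram is not None, again is None)
```

The first request lands underneath the non-busy "Persistent Memory" resource; the second fails because the new child is marked busy.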
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Reviewed-by: Dan Williams <dan.j.williams@intel.com>
      Reviewed-by: Vishal Verma <vishal.l.verma@intel.com>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: Ross Zwisler <zwisler@kernel.org>
      Cc: Vishal Verma <vishal.l.verma@intel.com>
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: linux-nvdimm@lists.01.org
      Cc: linux-kernel@vger.kernel.org
      Cc: linux-mm@kvack.org
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Bjorn Helgaas <bhelgaas@google.com>
      Cc: Yaowei Bai <baiyaowei@cmss.chinamobile.com>
      Cc: Takashi Iwai <tiwai@suse.de>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Keith Busch <keith.busch@intel.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
      Reviewed-by: Gavin Shan <shan.gavin@linux.alibaba.com>
      567bed57
    • mm/resource: Move HMM pr_debug() deeper into resource code · 0297fb96
      Committed by Dave Hansen
      commit b926b7f3baecb2a855db629e6822e1a85212e91c upstream
      
      HMM consumes physical address space for its own use, even
      though nothing is mapped or accessible there.  It uses a
      special resource description (IORES_DESC_DEVICE_PRIVATE_MEMORY)
      to uniquely identify these areas.
      
      When HMM consumes address space, it makes a best guess about
      what to consume.  However, it is possible that a future memory
      or device hotplug can collide with the reserved area.  In the
      case of these conflicts, there is an error message in
      register_memory_resource().
      
      Later patches in this series move register_memory_resource()
      from using request_resource_conflict() to __request_region().
      Unfortunately, __request_region() does not return the conflict
      like the previous function did, which makes it impossible to
      check for IORES_DESC_DEVICE_PRIVATE_MEMORY in a conflicting
      resource.
      
      Instead of warning in register_memory_resource(), move the
      check into the core resource code itself (__request_region())
      where the conflicting resource _is_ available.  This has the
      added bonus of producing a warning in case of HMM conflicts
      with devices *or* RAM address space, as opposed to the RAM-
      only warnings that were there previously.
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Reviewed-by: Jerome Glisse <jglisse@redhat.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: Ross Zwisler <zwisler@kernel.org>
      Cc: Vishal Verma <vishal.l.verma@intel.com>
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: linux-nvdimm@lists.01.org
      Cc: linux-kernel@vger.kernel.org
      Cc: linux-mm@kvack.org
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
      Cc: Keith Busch <keith.busch@intel.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
      Reviewed-by: Gavin Shan <shan.gavin@linux.alibaba.com>
      0297fb96
    • mm/resource: Return real error codes from walk failures · 88e75600
      Committed by Dave Hansen
      commit 5cd401ace914dc68556c6d2fcae0c349444d5f86 upstream
      
      walk_system_ram_range() can return an error code either because
      *it* failed, or because the 'func' that it calls returned an
      error.  The memory hotplug code does the following:
      
      	ret = walk_system_ram_range(..., func);
      	if (ret)
      		return ret;
      
      and 'ret' makes it out to userspace, eventually.  The problem
      is, walk_system_ram_range() failures that result from *it* failing
      (as opposed to 'func') return -1.  That leads to a very odd
      -EPERM (-1) return code out to userspace.
      
      Make walk_system_ram_range() return -EINVAL for internal
      failures to keep userspace less confused.
      
      This return code is compatible with all the callers that I
      audited.
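A small userspace illustration of why a bare -1 is confusing: errno value 1 is EPERM, so a caller that decodes the return value as a negative errno reports a permission error. The functions `walk_old`, `walk_new`, and `describe` are hypothetical stand-ins for the old and new internal-failure paths, not kernel code.

```python
import errno


def walk_old(found_ram):
    """Old behaviour: an internal failure returns a bare -1."""
    return 0 if found_ram else -1


def walk_new(found_ram):
    """New behaviour: an internal failure returns -EINVAL."""
    return 0 if found_ram else -errno.EINVAL


def describe(ret):
    """Decode a kernel-style return value (0 or a negative errno)."""
    return "ok" if ret == 0 else errno.errorcode[-ret]


old_failure = describe(walk_old(False))
new_failure = describe(walk_new(False))
print(old_failure, new_failure)
```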
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Reviewed-by: Bjorn Helgaas <bhelgaas@google.com>
      Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: Ross Zwisler <zwisler@kernel.org>
      Cc: Vishal Verma <vishal.l.verma@intel.com>
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: linux-nvdimm@lists.01.org
      Cc: linux-kernel@vger.kernel.org
      Cc: linux-mm@kvack.org
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Yaowei Bai <baiyaowei@cmss.chinamobile.com>
      Cc: Takashi Iwai <tiwai@suse.de>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: linuxppc-dev@lists.ozlabs.org
      Cc: Keith Busch <keith.busch@intel.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
      Reviewed-by: Gavin Shan <shan.gavin@linux.alibaba.com>
      88e75600
    • kernel, resource: check for IORESOURCE_SYSRAM in release_mem_region_adjustable · be4a8d62
      Committed by Oscar Salvador
      commit 65c78784135f847e49eb98e6b976e453e71100c3 upstream
      
      This is a preparation for the next patch.
      
      Currently, we only call release_mem_region_adjustable() in __remove_pages
      if the zone is not ZONE_DEVICE, because resources that belong to HMM/devm
      are being released by themselves with devm_release_mem_region.
      
      Since we do not want to touch any zone/page stuff while removing
      the memory (only while offlining it), we do not want to check for the
      zone here.  So we need another way to tell release_mem_region_adjustable()
      not to release the resource in case it belongs to HMM/devm.
      
      HMM/devm acquires/releases a resource through
      devm_request_mem_region/devm_release_mem_region.
      
      These resources have the flag IORESOURCE_MEM, while resources acquired by
      hot-add memory path (register_memory_resource()) contain
      IORESOURCE_SYSTEM_RAM.
      
      So, we can check for this flag in release_mem_region_adjustable, and if
      the resource does not contain such flag, we know that we are dealing with
      a HMM/devm resource, so we can back off.
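The flag test described above can be sketched in userspace like this. The numeric values are recalled from include/linux/ioport.h and should be checked against the tree in use; `should_release` is a hypothetical helper, not the kernel function.

```python
# Assumed values, as recalled from include/linux/ioport.h.
IORESOURCE_MEM = 0x00000200
IORESOURCE_SYSRAM = 0x01000000
IORESOURCE_SYSTEM_RAM = IORESOURCE_MEM | IORESOURCE_SYSRAM


def should_release(flags):
    """Release only resources that carry the SYSRAM bit (hot-added memory)."""
    return bool(flags & IORESOURCE_SYSRAM)


hotplug_flags = IORESOURCE_SYSTEM_RAM  # register_memory_resource() path
devm_flags = IORESOURCE_MEM            # devm_request_mem_region() path
print(should_release(hotplug_flags), should_release(devm_flags))
```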
      
      Link: http://lkml.kernel.org/r/20181127162005.15833-3-osalvador@suse.de
      Signed-off-by: Oscar Salvador <osalvador@suse.de>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Oscar Salvador <osalvador@suse.com>
      Cc: "Rafael J. Wysocki" <rafael@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
      Reviewed-by: Gavin Shan <shan.gavin@linux.alibaba.com>
      be4a8d62
    • resource: Clean it up a bit · 87e09e0a
      Committed by Borislav Petkov
      commit b69c2e20f6e4046da84ce5b33ba1ef89cb087b40 upstream
      
      - Drop BUG_ON()s and do normal error handling instead, in
        find_next_iomem_res().
      
      - Align function arguments on opening braces.
      
      - Get rid of local var sibling_only in find_next_iomem_res().
      
      - Shorten unnecessarily long first_level_children_only arg name.
      Signed-off-by: Borislav Petkov <bp@suse.de>
      CC: Andrew Morton <akpm@linux-foundation.org>
      CC: Bjorn Helgaas <bhelgaas@google.com>
      CC: Brijesh Singh <brijesh.singh@amd.com>
      CC: Dan Williams <dan.j.williams@intel.com>
      CC: H. Peter Anvin <hpa@zytor.com>
      CC: Lianbo Jiang <lijiang@redhat.com>
      CC: Takashi Iwai <tiwai@suse.de>
      CC: Thomas Gleixner <tglx@linutronix.de>
      CC: Tom Lendacky <thomas.lendacky@amd.com>
      CC: Vivek Goyal <vgoyal@redhat.com>
      CC: Yaowei Bai <baiyaowei@cmss.chinamobile.com>
      CC: bhe@redhat.com
      CC: dan.j.williams@intel.com
      CC: dyoung@redhat.com
      CC: kexec@lists.infradead.org
      CC: mingo@redhat.com
      Link: <new submission>
      Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
      Reviewed-by: Gavin Shan <shan.gavin@linux.alibaba.com>
      87e09e0a
    • device-dax: Add a 'modalias' attribute to DAX 'bus' devices · 536559fe
      Committed by Vishal Verma
      commit c347bd71dcdb2d0ac8b3a771486584dca8c8dd80 upstream
      
      Add a 'modalias' attribute to devices under the DAX bus so that userspace
      is able to dynamically load modules as needed.
      
      Normally, udev can get the modalias from 'uevent', and that is correctly
      set up by the DAX bus. However other tooling such as 'libndctl' for
      interacting with drivers/nvdimm/, and 'libdaxctl' for drivers/dax/ can
      also use the modalias to dynamically load modules via libkmod lookups.
      
      The 'nd' bus set up by the libnvdimm subsystem exports a modalias
      attribute. Imitate this to export the same for the 'dax' bus.
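A hedged sketch of how userspace tooling might collect the attribute: it walks a directory laid out like a bus 'devices' directory and reads each device's 'modalias' file. The function name `read_modaliases`, the temporary directory, and the placeholder attribute value are assumptions for illustration; on a real system the directory would presumably be /sys/bus/dax/devices.

```python
from pathlib import Path
import tempfile


def read_modaliases(bus_devices_dir):
    """Map device name -> modalias string for every device that exports one."""
    result = {}
    for dev in sorted(Path(bus_devices_dir).iterdir()):
        attr = dev / "modalias"
        if attr.is_file():
            result[dev.name] = attr.read_text().strip()
    return result


# Stand-in for a sysfs tree so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    dev = Path(tmp) / "dax0.0"
    dev.mkdir()
    (dev / "modalias").write_text("example-modalias\n")  # placeholder value
    aliases = read_modaliases(tmp)

print(aliases)
```

The returned string is what a module-lookup library would then resolve to a module name.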
      
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
      Reviewed-by: Gavin Shan <shan.gavin@linux.alibaba.com>
      536559fe
    • device-dax: Add a 'target_node' attribute · e444f72f
      Committed by Dan Williams
      commit 21c75763a3ae18679e5c4e2260aa9379b073566b upstream
      
      The target-node attribute is the Linux numa-node that a device-dax
      instance may create when it is online. Prior to being online the
      device's 'numa_node' property reflects the closest online cpu node which
      is the typical expectation of a device 'numa_node'. Once it is online it
      becomes its own distinct numa node, i.e. 'target_node'.
      
      Export the 'target_node' property to give userspace tooling the ability
      to predict the effective numa-node from a device-dax instance configured
      to provide 'System RAM' capacity.
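As a rough illustration of the prediction userspace can make, the sketch below picks 'target_node' when the device is going to be onlined as System RAM and 'numa_node' otherwise. The helper names and the temporary directory standing in for a device's sysfs directory are assumptions, and the node numbers are made up.

```python
from pathlib import Path
import tempfile


def read_int_attr(device_dir, name):
    """Read a sysfs-style attribute file that holds a single integer."""
    return int((Path(device_dir) / name).read_text().strip())


def predicted_node(device_dir, as_system_ram):
    """Pick the node userspace should expect for this device."""
    attr = "target_node" if as_system_ram else "numa_node"
    return read_int_attr(device_dir, attr)


# Stand-in for a device directory with made-up node numbers.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "numa_node").write_text("0\n")
    (Path(tmp) / "target_node").write_text("2\n")
    before_online = predicted_node(tmp, as_system_ram=False)
    after_online = predicted_node(tmp, as_system_ram=True)

print(before_online, after_online)
```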
      
      Cc: Vishal Verma <vishal.l.verma@intel.com>
      Reported-by: Dave Hansen <dave.hansen@linux.intel.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
      Reviewed-by: Gavin Shan <shan.gavin@linux.alibaba.com>
      e444f72f
    • device-dax: Auto-bind device after successful new_id · 1e16becc
      Committed by Dan Williams
      commit 664525b2d84abca1074c9546654ae9689de8a818 upstream
      
      The typical 'new_id' attribute behavior is to immediately attach a
      device to its driver after a new device-id is added. Implement this
      behavior for the dax bus.
      Reported-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Reported-by: Brice Goglin <Brice.Goglin@inria.fr>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
      Reviewed-by: Gavin Shan <shan.gavin@linux.alibaba.com>
      1e16becc
    • acpi/nfit, device-dax: Identify differentiated memory with a unique numa-node · a0a4e71f
      Committed by Dan Williams
      commit 8fc5c73554db0ac18c0c6ac5b2099ab917f83bdf upstream
      
      Persistent memory, as described by the ACPI NFIT (NVDIMM Firmware
      Interface Table), is the first known instance of a memory range
      described by a unique "target" proximity domain. "Initiator" and
      "target" proximity domains are the approach that the ACPI HMAT
      (Heterogeneous Memory Attributes Table) uses to describe the unique
      performance properties of a memory range relative to a given initiator
      (e.g. CPU or DMA device).
      
      Currently the numa-node for a /dev/pmemX block-device or /dev/daxX.Y
      char-device follows the traditional notion of 'numa-node' where the
      attribute conveys the closest online numa-node. That numa-node attribute
      is useful for cpu-binding and memory-binding processes *near* the
      device. However, when the memory range backing a 'pmem', or 'dax' device
      is onlined (memory hot-add) the memory-only-numa-node representing that
      address needs to be differentiated from the set of online nodes. In
      other words, the numa-node association of the device depends on whether
      you can bind processes *near* the cpu-numa-node in the offline
      device-case, or bind processes *on* the memory-range directly after the
      backing address range is onlined.
      
      Allow for the case that platform firmware describes persistent memory
      with a unique proximity domain, i.e. when it is distinct from the
      proximity of DRAM and CPUs that are on the same socket. Plumb the Linux
      numa-node translation of that proximity through the libnvdimm region
      device to namespaces that are in device-dax mode. With this in place the
      proposed kmem driver [1] can optionally discover a unique numa-node
      number for the address range as it transitions the memory from an
      offline state managed by a device-driver to an online memory range
      managed by the core-mm.
      
      [1]: https://lore.kernel.org/lkml/20181022201317.8558C1D8@viggo.jf.intel.com
      Reported-by: Fan Du <fan.du@intel.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: "Oliver O'Halloran" <oohall@gmail.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Jérôme Glisse <jglisse@redhat.com>
      Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      [yshi: Removed PowerPC stuff which is not applicable to 4.19]
      Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
      Reviewed-by: Gavin Shan <shan.gavin@linux.alibaba.com>
      a0a4e71f
    • device-dax: Add /sys/class/dax backwards compatibility · c827e296
      Committed by Dan Williams
      commit 730926c3b0998943654019f00296cf8e3b02277e upstream
      
      On the expectation that some environments may not upgrade libdaxctl
      (userspace component that depends on the /sys/class/dax hierarchy),
      provide a default / legacy dax_pmem_compat driver. The dax_pmem_compat
      driver implements the original /sys/class/dax sysfs layout rather than
      /sys/bus/dax. When userspace is upgraded it can blacklist this module
      and switch to the dax_pmem driver going forward.
      
      CONFIG_DEV_DAX_PMEM_COMPAT and supporting code will be deleted according
      to the dax_pmem entry in Documentation/ABI/obsolete/.
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
      Reviewed-by: Gavin Shan <shan.gavin@linux.alibaba.com>
      c827e296
    • device-dax: Add support for a dax override driver · 3f8deff3
      Committed by Dan Williams
      commit d200781ef237a354d918ceff5cee350d88a93d42 upstream
      
      Introduce the 'new_id' concept for enabling a custom device-driver attach
      policy for dax-bus drivers. The intended use is to have a mechanism for
      hot-plugging device-dax ranges into the page allocator on-demand. With
      this in place the default policy of using device-dax for performance
      differentiated memory can be overridden by user-space policy that can
      arrange for the memory range to be managed as 'System RAM' with
      user-defined NUMA and other performance attributes.
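One way user-space policy could drive such an override is sketched below: detach the device from its current driver, then write its name to the new driver's 'new_id'. The 'kmem' driver name comes from this series; the 'device_dax' driver name, the 'unbind' step, and the directory layout are assumptions recalled from the upstream kmem driver description and should be verified. The sketch runs against a temporary directory rather than real sysfs.

```python
from pathlib import Path
import tempfile


def override_driver(dax_bus_dir, device, old_driver, new_driver):
    """Write the device name to the old driver's unbind and the new driver's new_id."""
    drivers = Path(dax_bus_dir) / "drivers"
    steps = [
        (drivers / old_driver / "unbind", device),
        (drivers / new_driver / "new_id", device),
    ]
    for path, value in steps:
        path.write_text(value + "\n")
    # Report which attribute files were written, relative to the bus directory.
    return [p.relative_to(dax_bus_dir).as_posix() for p, _ in steps]


# Stand-in for the dax bus directory so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    for drv in ("device_dax", "kmem"):
        (Path(tmp) / "drivers" / drv).mkdir(parents=True)
    written = override_driver(tmp, "dax0.0", "device_dax", "kmem")
    new_id_content = (Path(tmp) / "drivers" / "kmem" / "new_id").read_text()

print(written)
```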
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
      Reviewed-by: Gavin Shan <shan.gavin@linux.alibaba.com>
      3f8deff3
    • device-dax: Move resource pinning+mapping into the common driver · c2df7a3a
      Committed by Dan Williams
      commit 89ec9f2cfa36cc5fca2fb445ed221bb9add7b536 upstream
      
      Move the responsibility of calling devm_request_resource() and
      devm_memremap_pages() into the common device-dax driver. This is another
      preparatory step to allowing an alternate personality driver for a
      device-dax range.
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
      Reviewed-by: Gavin Shan <shan.gavin@linux.alibaba.com>
      c2df7a3a
    • device-dax: Introduce bus + driver model · d491ea9e
      Committed by Dan Williams
      commit 9567da0b408a2553d32ca83cba4f1fc5a8aad459 upstream
      
      In support of multiple device-dax instances per device-dax-region and
      allowing the 'kmem' driver to attach to dax-instances instead of the
      current device-node access, convert the dax sub-system from a class to a
      bus. Recall that the kmem driver takes reserved / special purpose
      memories and assigns them to be managed by the core-mm.
      
      Aside from the fact the device-dax instances are registered and probed
      on a bus, two other lifetime-management changes are made:
      
      1/ Delay attaching a cdev until driver probe time
      
      2/ A new run_dax() helper is introduced to allow restoring dax-operation
         after a kill_dax() event. So, at driver ->probe() time we run_dax()
         and at ->remove() time we kill_dax() and invalidate all mappings.
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
      Reviewed-by: Gavin Shan <shan.gavin@linux.alibaba.com>
      d491ea9e
    • device-dax: Start defining a dax bus model · 265e1089
      Committed by Dan Williams
      commit 51cf784c42d07fbd62cb604836a9270cf3361509 upstream
      
      Towards eliminating the dax_class, move the dax-device-attribute
      enabling to a new bus.c file in the core. This reduces the code
      thrash of subsequent patches, as no logic changes are made here,
      just pure code movement.
      
      A temporary export of unregister_dev_dax() and dax_attribute_groups is
      needed to preserve compilation, but those symbols become static again in
      a follow-on patch.
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
      Reviewed-by: Gavin Shan <shan.gavin@linux.alibaba.com>
      265e1089
    • device-dax: Remove multi-resource infrastructure · 9910b7e1
      Committed by Dan Williams
      commit 753a0850e707e9a8c5861356222f9b9e4eba7945 upstream
      
      The multi-resource implementation anticipated discontiguous sub-division
      support. That has not yet materialized, so delete the infrastructure and
      related code.
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
      Reviewed-by: Gavin Shan <shan.gavin@linux.alibaba.com>
      9910b7e1
    • device-dax: Kill dax_region base · 4f3e3b40
      Committed by Dan Williams
      commit 93694f9630b0ed29cda61df58e480dcb34ef52fd upstream
      
      Nothing consumes this attribute of a region and devres otherwise
      remembers the value for de-allocation purposes.
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
      Reviewed-by: Gavin Shan <shan.gavin@linux.alibaba.com>
      4f3e3b40
    • device-dax: Kill dax_region ida · dcd3a988
      Committed by Dan Williams
      commit 21b9e979501fdb5f6797193d70428a2b00bd5247 upstream
      
      Commit bbb3be17 "device-dax: fix sysfs duplicate warnings" arranged
      for passing a dax instance-id to devm_create_dax_dev(), rather than
      generating one internally. Remove the dax_region ida and related code.
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
      Reviewed-by: Gavin Shan <shan.gavin@linux.alibaba.com>
      dcd3a988
    • ICX: tools/power/x86: A tool to validate Intel Speed Select commands · 956e24a5
      Committed by Srinivas Pandruvada
      commit 3fb4f7cd472c7f5905c91508e988f6b28372210d upstream.
      
      The Intel(R) Speed Select technology contains four features.
      
      Performance profile: A non-architectural mechanism that allows multiple
      optimized performance profiles per system via static and/or dynamic
      adjustment of core count, workload, Tjmax, TDP, etc. Referred to as
      ISS in the documentation.
      
      Base frequency: Enables users to increase the guaranteed base frequency
      on certain cores (high priority cores) in exchange for a lower base
      frequency on the remaining cores (low priority cores). Referred to as
      PBF in the documentation.
      
      Turbo frequency: Enables setting different turbo ratio limits for
      cores based on priority. Referred to as FACT in the documentation.
      
      Core power: An interface that allows the user to define per core/tile
      priority.
      
      There is multi-level help for commands and options. It can be used
      to check the required arguments for each feature and for each
      command of a feature.
      
      To start navigating the features, start with:
      
      $ sudo intel-speed-select --help
      
      For help on a specific feature, for example:
      
      $ sudo intel-speed-select perf-profile --help
      
      For help on a command of a feature, for example:
      
      $ sudo intel-speed-select perf-profile get-lock-status --help
      Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
      Acked-by: Len Brown <len.brown@intel.com>
      Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
      Signed-off-by: Youquan Song <youquan.song@intel.com>
      Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
      956e24a5
    • ICX: platform/x86: ISST: Restore state on resume · 762ac32f
      Committed by Srinivas Pandruvada
      commit f607874f35cbd276a837d7147d4e1ec752dfef44 upstream.
      
      Store the commands that cause PUNIT writes and restore them on system
      resume. The driver keeps all such requests in a hash table, holding
      the latest mailbox request parameters for each. On resume these mailbox
      commands are executed again. Only 5 mailbox commands trigger this
      processing, so storing them and replaying them on resume has very low
      overhead. There is also no ordering requirement among these write/set
      mailbox commands. There is one MSR request, for changing turbo ratio
      limits; it is also stored and restored on resume and on CPU online.
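The store-and-replay idea can be sketched in userspace as a table that keeps only the latest parameter per command and re-sends everything on resume. This is an illustrative model, not the driver's code: the class name `MailboxStateStore`, the key of (cpu, command, sub-command), and the numeric values are all hypothetical.

```python
class MailboxStateStore:
    """Keep the most recent parameter for each write/set mailbox command."""

    def __init__(self):
        self._latest = {}

    def record(self, cpu, command, subcommand, parameter):
        # A later request with the same key overwrites the earlier one.
        self._latest[(cpu, command, subcommand)] = parameter

    def replay(self, send):
        # No ordering requirement is assumed between the stored commands.
        for (cpu, command, subcommand), parameter in self._latest.items():
            send(cpu, command, subcommand, parameter)


store = MailboxStateStore()
store.record(0, 0x7F, 0x01, 0xAA)
store.record(1, 0x7F, 0x01, 0xCC)
store.record(0, 0x7F, 0x01, 0xBB)  # replaces the first request for cpu 0

sent = []
store.replay(lambda *request: sent.append(request))
print(sorted(sent))
```

Because only the latest parameters are kept, the amount of work on resume is bounded by the number of distinct commands rather than by how often they were issued.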
      Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
      Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
      Signed-off-by: Youquan Song <youquan.song@intel.com>
      Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
      762ac32f
    • ICX: platform/x86: ISST: Add Intel Speed Select PUNIT MSR interface · 30ac231e
      Committed by Srinivas Pandruvada
      commit e765f37b9b8b4fa65682e9a78a2ca2b11d3d9096 upstream.
      
      Even with the new non-architectural features that use the PUNIT mailbox
      and MMIO read/write interfaces, there is still a need to use MSRs to
      control PUNIT. User space could have used the user-space MSR interface
      for this, but that is not possible when user-space MSR access is
      disabled. Only a limited number of MSRs are allowed through this new
      interface.
      Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
      Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
      Signed-off-by: Youquan Song <youquan.song@intel.com>
      Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
      30ac231e
    • ICX: platform/x86: ISST: Add Intel Speed Select mailbox interface via MSRs · eaa6e8e2
      Committed by Srinivas Pandruvada
      commit 71b21bd7f68a6ee59003f63d2e4f84fd9b0a8d07 upstream.
      
      Add an IOCTL to send mailbox commands to PUNIT using the PUNIT MSRs for
      the mailbox. Some CPU models don't have the PCI device, so they need to
      use MSRs. Only a limited set of mailbox commands can be sent to PUNIT.
      
      This interface is used by the intel-speed-select tool under
      tools/power/x86 to enumerate and control Intel Speed Select features.
      The mailbox command IDs and the semantics of the messages can be checked
      in the source code of the tool.
      Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
      Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
      Signed-off-by: Youquan Song <youquan.song@intel.com>
      Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
      eaa6e8e2