1. 08 6月, 2020 6 次提交
    • E
      KVM: x86: Fix APIC page invalidation race · e649b3f0
      Eiichi Tsukata 提交于
      Commit b1394e74 ("KVM: x86: fix APIC page invalidation") tried
      to fix inappropriate APIC page invalidation by re-introducing arch
      specific kvm_arch_mmu_notifier_invalidate_range() and calling it from
      kvm_mmu_notifier_invalidate_range_start. However, the patch left a
      possible race where the VMCS APIC address cache is updated *before*
      it is unmapped:
      
        (Invalidator) kvm_mmu_notifier_invalidate_range_start()
        (Invalidator) kvm_make_all_cpus_request(kvm, KVM_REQ_APIC_PAGE_RELOAD)
        (KVM VCPU) vcpu_enter_guest()
        (KVM VCPU) kvm_vcpu_reload_apic_access_page()
        (Invalidator) actually unmap page
      
      Because of the above race, there can be a mismatch between the
      host physical address stored in the APIC_ACCESS_PAGE VMCS field and
      the host physical address stored in the EPT entry for the APIC GPA
      (0xfee0000).  When this happens, the processor will not trap APIC
      accesses, and will instead show the raw contents of the APIC-access page.
      Because Windows OS periodically checks for unexpected modifications to
      the LAPIC register, this will show up as a BSOD crash with BugCheck
      CRITICAL_STRUCTURE_CORRUPTION (109) we are currently seeing in
      https://bugzilla.redhat.com/show_bug.cgi?id=1751017.
      
      The root cause of the issue is that kvm_arch_mmu_notifier_invalidate_range()
      cannot guarantee that no additional references are taken to the pages in
      the range before kvm_mmu_notifier_invalidate_range_end().  Fortunately,
      this case is supported by the MMU notifier API, as documented in
      include/linux/mmu_notifier.h:
      
      	 * If the subsystem
               * can't guarantee that no additional references are taken to
               * the pages in the range, it has to implement the
               * invalidate_range() notifier to remove any references taken
               * after invalidate_range_start().
      
      The fix therefore is to reload the APIC-access page field in the VMCS
      from kvm_mmu_notifier_invalidate_range() instead of ..._range_start().
      
      Cc: stable@vger.kernel.org
      Fixes: b1394e74 ("KVM: x86: fix APIC page invalidation")
      Fixes: https://bugzilla.kernel.org/show_bug.cgi?id=197951Signed-off-by: NEiichi Tsukata <eiichi.tsukata@nutanix.com>
      Message-Id: <20200606042627.61070-1-eiichi.tsukata@nutanix.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      e649b3f0
    • P
      KVM: SVM: fix calls to is_intercept · fb7333df
      Paolo Bonzini 提交于
      is_intercept takes an INTERCEPT_* constant, not SVM_EXIT_*; because
      of this, the compiler was removing the body of the conditionals,
      as if is_intercept returned 0.
      
      This unveils a latent bug: when clearing the VINTR intercept,
      int_ctl must also be changed in the L1 VMCB (svm->nested.hsave),
      just like the intercept itself is also changed in the L1 VMCB.
      Otherwise V_IRQ remains set and, due to the VINTR intercept being clear,
      we get a spurious injection of a vector 0 interrupt on the next
      L2->L1 vmexit.
      Reported-by: NQian Cai <cai@lca.pw>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      fb7333df
    • V
      KVM: selftests: fix vmx_preemption_timer_test build with GCC10 · 75ad6e80
      Vitaly Kuznetsov 提交于
      GCC10 fails to build vmx_preemption_timer_test:
      
      gcc -Wall -Wstrict-prototypes -Wuninitialized -O2 -g -std=gnu99
      -fno-stack-protector -fno-PIE -I../../../../tools/include
       -I../../../../tools/arch/x86/include -I../../../../usr/include/
       -Iinclude -Ix86_64 -Iinclude/x86_64 -I..  -pthread  -no-pie
       x86_64/evmcs_test.c ./linux/tools/testing/selftests/kselftest_harness.h
       ./linux/tools/testing/selftests/kselftest.h
       ./linux/tools/testing/selftests/kvm/libkvm.a
       -o ./linux/tools/testing/selftests/kvm/x86_64/evmcs_test
      /usr/bin/ld: ./linux/tools/testing/selftests/kvm/libkvm.a(vmx.o):
       ./linux/tools/testing/selftests/kvm/include/x86_64/vmx.h:603:
       multiple definition of `ctrl_exit_rev'; /tmp/ccMQpvNt.o:
       ./linux/tools/testing/selftests/kvm/include/x86_64/vmx.h:603:
       first defined here
      /usr/bin/ld: ./linux/tools/testing/selftests/kvm/libkvm.a(vmx.o):
       ./linux/tools/testing/selftests/kvm/include/x86_64/vmx.h:602:
       multiple definition of `ctrl_pin_rev'; /tmp/ccMQpvNt.o:
       ./linux/tools/testing/selftests/kvm/include/x86_64/vmx.h:602:
       first defined here
       ...
      
      ctrl_exit_rev/ctrl_pin_rev/basic variables are only used in
      vmx_preemption_timer_test.c, just move them there.
      
      Fixes: 8d7fbf01 ("KVM: selftests: VMX preemption timer migration test")
      Reported-by: NMarcelo Bandeira Condotta <mcondotta@redhat.com>
      Signed-off-by: NVitaly Kuznetsov <vkuznets@redhat.com>
      Message-Id: <20200608112346.593513-2-vkuznets@redhat.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      75ad6e80
    • V
      KVM: selftests: Add x86_64/debug_regs to .gitignore · 5ae1452f
      Vitaly Kuznetsov 提交于
      Add x86_64/debug_regs to .gitignore.
      Reported-by: NMarcelo Bandeira Condotta <mcondotta@redhat.com>
      Fixes: 449aa906 ("KVM: selftests: Add KVM_SET_GUEST_DEBUG test")
      Signed-off-by: NVitaly Kuznetsov <vkuznets@redhat.com>
      Message-Id: <20200608112346.593513-1-vkuznets@redhat.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      5ae1452f
    • V
      Revert "KVM: x86: work around leak of uninitialized stack contents" · 25597f64
      Vitaly Kuznetsov 提交于
      handle_vmptrst()/handle_vmread() stopped injecting #PF unconditionally
      and switched to nested_vmx_handle_memory_failure() which just kills the
      guest with KVM_EXIT_INTERNAL_ERROR in case of MMIO access, zeroing
      'exception' in kvm_write_guest_virt_system() is not needed anymore.
      
      This reverts commit 541ab2ae.
      Signed-off-by: NVitaly Kuznetsov <vkuznets@redhat.com>
      Message-Id: <20200605115906.532682-2-vkuznets@redhat.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      25597f64
    • V
      KVM: VMX: Properly handle kvm_read/write_guest_virt*() result · 7a35e515
      Vitaly Kuznetsov 提交于
      Syzbot reports the following issue:
      
      WARNING: CPU: 0 PID: 6819 at arch/x86/kvm/x86.c:618
       kvm_inject_emulated_page_fault+0x210/0x290 arch/x86/kvm/x86.c:618
      ...
      Call Trace:
      ...
      RIP: 0010:kvm_inject_emulated_page_fault+0x210/0x290 arch/x86/kvm/x86.c:618
      ...
       nested_vmx_get_vmptr+0x1f9/0x2a0 arch/x86/kvm/vmx/nested.c:4638
       handle_vmon arch/x86/kvm/vmx/nested.c:4767 [inline]
       handle_vmon+0x168/0x3a0 arch/x86/kvm/vmx/nested.c:4728
       vmx_handle_exit+0x29c/0x1260 arch/x86/kvm/vmx/vmx.c:6067
      
      'exception' we're trying to inject with kvm_inject_emulated_page_fault()
      comes from:
      
        nested_vmx_get_vmptr()
         kvm_read_guest_virt()
           kvm_read_guest_virt_helper()
             vcpu->arch.walk_mmu->gva_to_gpa()
      
      but it is only set when GVA to GPA conversion fails. In case it doesn't but
      we still fail kvm_vcpu_read_guest_page(), X86EMUL_IO_NEEDED is returned and
      nested_vmx_get_vmptr() calls kvm_inject_emulated_page_fault() with zeroed
      'exception'. This happen when the argument is MMIO.
      
      Paolo also noticed that nested_vmx_get_vmptr() is not the only place in
      KVM code where kvm_read/write_guest_virt*() return result is mishandled.
      VMX instructions along with INVPCID have the same issue. This was already
      noticed before, e.g. see commit 541ab2ae ("KVM: x86: work around
      leak of uninitialized stack contents") but was never fully fixed.
      
      KVM could've handled the request correctly by going to userspace and
      performing I/O but there doesn't seem to be a good need for such requests
      in the first place.
      
      Introduce vmx_handle_memory_failure() as an interim solution.
      
      Note, nested_vmx_get_vmptr() now has three possible outcomes: OK, PF,
      KVM_EXIT_INTERNAL_ERROR and callers need to know if userspace exit is
      needed (for KVM_EXIT_INTERNAL_ERROR) in case of failure. We don't seem
      to have a good enum describing this tristate, just add "int *ret" to
      nested_vmx_get_vmptr() interface to pass the information.
      
      Reported-by: syzbot+2a7156e11dc199bdbd8a@syzkaller.appspotmail.com
      Suggested-by: NSean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: NVitaly Kuznetsov <vkuznets@redhat.com>
      Message-Id: <20200605115906.532682-1-vkuznets@redhat.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      7a35e515
  2. 05 6月, 2020 25 次提交
  3. 04 6月, 2020 9 次提交
    • P
      KVM: let kvm_destroy_vm_debugfs clean up vCPU debugfs directories · d56f5136
      Paolo Bonzini 提交于
      After commit 63d04348 ("KVM: x86: move kvm_create_vcpu_debugfs after
      last failure point") we are creating the pre-vCPU debugfs files
      after the creation of the vCPU file descriptor.  This makes it
      possible for userspace to reach kvm_vcpu_release before
      kvm_create_vcpu_debugfs has finished.  The vcpu->debugfs_dentry
      then does not have any associated inode anymore, and this causes
      a NULL-pointer dereference in debugfs_create_file.
      
      The solution is simply to avoid removing the files; they are
      cleaned up when the VM file descriptor is closed (and that must be
      after KVM_CREATE_VCPU returns).  We can stop storing the dentry
      in struct kvm_vcpu too, because it is not needed anywhere after
      kvm_create_vcpu_debugfs returns.
      
      Reported-by: syzbot+705f4401d5a93a59b87d@syzkaller.appspotmail.com
      Fixes: 63d04348 ("KVM: x86: move kvm_create_vcpu_debugfs after last failure point")
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      d56f5136
    • L
      atomisp: avoid warning about unused function · 6929f71e
      Linus Torvalds 提交于
      The atomisp_mrfld_power() function isn't actually ever called, because
      the two call-sites have commented out the use because it breaks on some
      platforms.  That results in:
      
        drivers/staging/media/atomisp/pci/atomisp_v4l2.c:764:12: warning: ‘atomisp_mrfld_power’ defined but not used [-Wunused-function]
          764 | static int atomisp_mrfld_power(struct atomisp_device *isp, bool enable)
              |            ^~~~~~~~~~~~~~~~~~~
      
      during the build.
      
      Rather than commenting out the use entirely, just disable it
      semantically instead (using a "0 &&" construct), leaving the call in
      place from a syntax standpoint, and avoiding the warning.
      
      I really don't want my builds to have any warnings that can then hide
      real issues.
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6929f71e
    • L
      Merge tag 'media/v5.8-1' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media · a98f670e
      Linus Torvalds 提交于
      Pull media updates from Mauro Carvalho Chehab:
      
       - Media documentation is now split into admin-guide, driver-api and
         userspace-api books (a longstanding request from Jon);
      
       - The media Kconfig was reorganized, in order to make easier to select
         drivers and their dependencies;
      
       - The testing drivers now has a separate directory;
      
       - added a new driver for Rockchip Video Decoder IP;
      
       - The atomisp staging driver was resurrected. It is meant to work with
         4 generations of cameras on Atom-based laptops, tablets and cell
         phones. So, it seems worth investing time to cleanup this driver and
         making it in good shape.
      
       - Added some V4L2 core ancillary routines to help with h264 codecs;
      
       - Added an ov2740 image sensor driver;
      
       - The si2157 gained support for Analog TV, which, in turn, added
         support for some cx231xx and cx23885 boards to also support analog
         standards;
      
       - Added some V4L2 controls (V4L2_CID_CAMERA_ORIENTATION and
         V4L2_CID_CAMERA_SENSOR_ROTATION) to help identifying where the camera
         is located at the device;
      
       - VIDIOC_ENUM_FMT was extended to support MC-centric devices;
      
       - Lots of drivers improvements and cleanups.
      
      * tag 'media/v5.8-1' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media: (503 commits)
        media: Documentation: media: Refer to mbus format documentation from CSI-2 docs
        media: s5k5baf: Replace zero-length array with flexible-array
        media: i2c: imx219: Drop <linux/clk-provider.h> and <linux/clkdev.h>
        media: i2c: Add ov2740 image sensor driver
        media: ov8856: Implement sensor module revision identification
        media: ov8856: Add devicetree support
        media: dt-bindings: ov8856: Document YAML bindings
        media: dvb-usb: Add Cinergy S2 PCIe Dual Port support
        media: dvbdev: Fix tuner->demod media controller link
        media: dt-bindings: phy: phy-rockchip-dphy-rx0: move rockchip dphy rx0 bindings out of staging
        media: staging: dt-bindings: phy-rockchip-dphy-rx0: remove non-used reg property
        media: atomisp: unify the version for isp2401 a0 and b0 versions
        media: atomisp: update TODO with the current data
        media: atomisp: adjust some code at sh_css that could be broken
        media: atomisp: don't produce errs for ignored IRQs
        media: atomisp: print IRQ when debugging
        media: atomisp: isp_mmu: don't use kmem_cache
        media: atomisp: add a notice about possible leak resources
        media: atomisp: disable the dynamic and reserved pools
        media: atomisp: turn on camera before setting it
        ...
      a98f670e
    • L
      Merge branch 'akpm' (patches from Andrew) · ee01c4d7
      Linus Torvalds 提交于
      Merge more updates from Andrew Morton:
       "More mm/ work, plenty more to come
      
        Subsystems affected by this patch series: slub, memcg, gup, kasan,
        pagealloc, hugetlb, vmscan, tools, mempolicy, memblock, hugetlbfs,
        thp, mmap, kconfig"
      
      * akpm: (131 commits)
        arm64: mm: use ARCH_HAS_DEBUG_WX instead of arch defined
        x86: mm: use ARCH_HAS_DEBUG_WX instead of arch defined
        riscv: support DEBUG_WX
        mm: add DEBUG_WX support
        drivers/base/memory.c: cache memory blocks in xarray to accelerate lookup
        mm/thp: rename pmd_mknotpresent() as pmd_mkinvalid()
        powerpc/mm: drop platform defined pmd_mknotpresent()
        mm: thp: don't need to drain lru cache when splitting and mlocking THP
        hugetlbfs: get unmapped area below TASK_UNMAPPED_BASE for hugetlbfs
        sparc32: register memory occupied by kernel as memblock.memory
        include/linux/memblock.h: fix minor typo and unclear comment
        mm, mempolicy: fix up gup usage in lookup_node
        tools/vm/page_owner_sort.c: filter out unneeded line
        mm: swap: memcg: fix memcg stats for huge pages
        mm: swap: fix vmstats for huge pages
        mm: vmscan: limit the range of LRU type balancing
        mm: vmscan: reclaim writepage is IO cost
        mm: vmscan: determine anon/file pressure balance at the reclaim root
        mm: balance LRU lists based on relative thrashing
        mm: only count actual rotations as LRU reclaim cost
        ...
      ee01c4d7
    • Z
      arm64: mm: use ARCH_HAS_DEBUG_WX instead of arch defined · 09587a09
      Zong Li 提交于
      Extract DEBUG_WX to mm/Kconfig.debug for shared use.  Change to use
      ARCH_HAS_DEBUG_WX instead of DEBUG_WX defined by arch port.
      Signed-off-by: NZong Li <zong.li@sifive.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Palmer Dabbelt <palmer@dabbelt.com>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will@kernel.org>
      Link: http://lkml.kernel.org/r/e19709e7576f65e303245fe520cad5f7bae72763.1587455584.git.zong.li@sifive.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      09587a09
    • Z
      x86: mm: use ARCH_HAS_DEBUG_WX instead of arch defined · 7e01ccb4
      Zong Li 提交于
      Extract DEBUG_WX to mm/Kconfig.debug for shared use.  Change to use
      ARCH_HAS_DEBUG_WX instead of DEBUG_WX defined by arch port.
      Signed-off-by: NZong Li <zong.li@sifive.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Palmer Dabbelt <palmer@dabbelt.com>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will@kernel.org>
      Link: http://lkml.kernel.org/r/430736828d149df3f5b462d291e845ec690e0141.1587455584.git.zong.li@sifive.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7e01ccb4
    • Z
      riscv: support DEBUG_WX · b422d28b
      Zong Li 提交于
      Support DEBUG_WX to check whether there are mapping with write and execute
      permission at the same time.
      
      [akpm@linux-foundation.org: replace macros with C]
      Signed-off-by: NZong Li <zong.li@sifive.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Palmer Dabbelt <palmer@dabbelt.com>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will@kernel.org>
      Link: http://lkml.kernel.org/r/282e266311bced080bc6f7c255b92f87c1eb65d6.1587455584.git.zong.li@sifive.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b422d28b
    • Z
      mm: add DEBUG_WX support · 375d315c
      Zong Li 提交于
      Patch series "Extract DEBUG_WX to shared use".
      
      Some architectures support DEBUG_WX function, it's verbatim from each
      others, so extract to mm/Kconfig.debug for shared use.
      
      PPC and ARM ports don't support generic page dumper yet, so we only
      refine x86 and arm64 port in this patch series.
      
      For RISC-V port, the DEBUG_WX support depends on other patches which
      be merged already:
        - RISC-V page table dumper
        - Support strict kernel memory permissions for security
      
      This patch (of 4):
      
      Some architectures support DEBUG_WX function, it's verbatim from each
      others.  Extract to mm/Kconfig.debug for shared use.
      
      [akpm@linux-foundation.org: reword text, per Will Deacon & Zong Li]
        Link: http://lkml.kernel.org/r/20200427194245.oxRJKj3fn%25akpm@linux-foundation.org
      [zong.li@sifive.com: remove the specific name of arm64]
        Link: http://lkml.kernel.org/r/3a6a92ecedc54e1d0fc941398e63d504c2cd5611.1589178399.git.zong.li@sifive.com
      [zong.li@sifive.com: add MMU dependency for DEBUG_WX]
        Link: http://lkml.kernel.org/r/4a674ac7863ff39ca91847b10e51209771f99416.1589178399.git.zong.li@sifive.comSuggested-by: NPalmer Dabbelt <palmer@dabbelt.com>
      Signed-off-by: NZong Li <zong.li@sifive.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will@kernel.org>
      Link: http://lkml.kernel.org/r/cover.1587455584.git.zong.li@sifive.com
      Link: http://lkml.kernel.org/r/23980cd0f0e5d79e24a92169116407c75bcc650d.1587455584.git.zong.li@sifive.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      375d315c
    • S
      drivers/base/memory.c: cache memory blocks in xarray to accelerate lookup · 4fb6eabf
      Scott Cheloha 提交于
      Searching for a particular memory block by id is an O(n) operation because
      each memory block's underlying device is kept in an unsorted linked list
      on the subsystem bus.
      
      We can cut the lookup cost to O(log n) if we cache each memory block
      in an xarray.  This time complexity improvement is significant on
      systems with many memory blocks.  For example:
      
      1. A 128GB POWER9 VM with 256MB memblocks has 512 blocks.  With this
         change  memory_dev_init() completes ~12ms faster and walk_memory_blocks()
         completes ~12ms faster.
      
      Before:
      [    0.005042] memory_dev_init: adding memory blocks
      [    0.021591] memory_dev_init: added memory blocks
      [    0.022699] walk_memory_blocks: walking memory blocks
      [    0.038730] walk_memory_blocks: walked memory blocks 0-511
      
      After:
      [    0.005057] memory_dev_init: adding memory blocks
      [    0.009415] memory_dev_init: added memory blocks
      [    0.010519] walk_memory_blocks: walking memory blocks
      [    0.014135] walk_memory_blocks: walked memory blocks 0-511
      
      2. A 256GB POWER9 LPAR with 256MB memblocks has 1024 blocks.  With
         this change memory_dev_init() completes ~88ms faster and
         walk_memory_blocks() completes ~87ms faster.
      
      Before:
      [    0.252246] memory_dev_init: adding memory blocks
      [    0.395469] memory_dev_init: added memory blocks
      [    0.409413] walk_memory_blocks: walking memory blocks
      [    0.433028] walk_memory_blocks: walked memory blocks 0-511
      [    0.433094] walk_memory_blocks: walking memory blocks
      [    0.500244] walk_memory_blocks: walked memory blocks 131072-131583
      
      After:
      [    0.245063] memory_dev_init: adding memory blocks
      [    0.299539] memory_dev_init: added memory blocks
      [    0.313609] walk_memory_blocks: walking memory blocks
      [    0.315287] walk_memory_blocks: walked memory blocks 0-511
      [    0.315349] walk_memory_blocks: walking memory blocks
      [    0.316988] walk_memory_blocks: walked memory blocks 131072-131583
      
      3. A 32TB POWER9 LPAR with 256MB memblocks has 131072 blocks.  With
         this change we complete memory_dev_init() ~37 minutes faster and
         walk_memory_blocks() at least ~30 minutes faster.  The exact timing
         for walk_memory_blocks() is  missing, though I observed that the
         soft lockups in walk_memory_blocks() disappeared with the change,
         suggesting that lower bound.
      
      Before:
      [   13.703907] memory_dev_init: adding blocks
      [ 2287.406099] memory_dev_init: added all blocks
      [ 2347.494986] [c000000014c5bb60] [c000000000869af4] walk_memory_blocks+0x94/0x160
      [ 2527.625378] [c000000014c5bb60] [c000000000869af4] walk_memory_blocks+0x94/0x160
      [ 2707.761977] [c000000014c5bb60] [c000000000869af4] walk_memory_blocks+0x94/0x160
      [ 2887.899975] [c000000014c5bb60] [c000000000869af4] walk_memory_blocks+0x94/0x160
      [ 3068.028318] [c000000014c5bb60] [c000000000869af4] walk_memory_blocks+0x94/0x160
      [ 3248.158764] [c000000014c5bb60] [c000000000869af4] walk_memory_blocks+0x94/0x160
      [ 3428.287296] [c000000014c5bb60] [c000000000869af4] walk_memory_blocks+0x94/0x160
      [ 3608.425357] [c000000014c5bb60] [c000000000869af4] walk_memory_blocks+0x94/0x160
      [ 3788.554572] [c000000014c5bb60] [c000000000869af4] walk_memory_blocks+0x94/0x160
      [ 3968.695071] [c000000014c5bb60] [c000000000869af4] walk_memory_blocks+0x94/0x160
      [ 4148.823970] [c000000014c5bb60] [c000000000869af4] walk_memory_blocks+0x94/0x160
      
      After:
      [   13.696898] memory_dev_init: adding blocks
      [   15.660035] memory_dev_init: added all blocks
      (the walk_memory_blocks traces disappear)
      
      There should be no significant negative impact for machines with few
      memory blocks.  A sparse xarray has a small footprint and an O(log n)
      lookup is negligibly slower than an O(n) lookup for only the smallest
      number of memory blocks.
      
      1. A 16GB x86 machine with 128MB memblocks has 132 blocks.  With this
         change memory_dev_init() completes ~300us faster and walk_memory_blocks()
         completes no faster or slower.  The improvement is pretty close to noise.
      
      Before:
      [    0.224752] memory_dev_init: adding memory blocks
      [    0.227116] memory_dev_init: added memory blocks
      [    0.227183] walk_memory_blocks: walking memory blocks
      [    0.227183] walk_memory_blocks: walked memory blocks 0-131
      
      After:
      [    0.224911] memory_dev_init: adding memory blocks
      [    0.226935] memory_dev_init: added memory blocks
      [    0.227089] walk_memory_blocks: walking memory blocks
      [    0.227089] walk_memory_blocks: walked memory blocks 0-131
      
      [david@redhat.com: document the locking]
        Link: http://lkml.kernel.org/r/bc21eec6-7251-4c91-2f57-9a0671f8d414@redhat.comSigned-off-by: NScott Cheloha <cheloha@linux.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Acked-by: NDavid Hildenbrand <david@redhat.com>
      Acked-by: NNathan Lynch <nathanl@linux.ibm.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Rafael J. Wysocki <rafael@kernel.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Rick Lindsley <ricklind@linux.vnet.ibm.com>
      Cc: Scott Cheloha <cheloha@linux.ibm.com>
      Link: http://lkml.kernel.org/r/20200121231028.13699-1-cheloha@linux.ibm.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4fb6eabf