1. 21 4月, 2016 8 次提交
  2. 01 3月, 2016 2 次提交
  3. 16 1月, 2016 1 次提交
    • D
      kvm: rename pfn_t to kvm_pfn_t · ba049e93
      Dan Williams 提交于
      To date, we have implemented two I/O usage models for persistent memory,
      PMEM (a persistent "ram disk") and DAX (mmap persistent memory into
      userspace).  This series adds a third, DAX-GUP, that allows DAX mappings
      to be the target of direct-i/o.  It allows userspace to coordinate
      DMA/RDMA from/to persistent memory.
      
      The implementation leverages the ZONE_DEVICE mm-zone that went into
      4.3-rc1 (also discussed at kernel summit) to flag pages that are owned
      and dynamically mapped by a device driver.  The pmem driver, after
      mapping a persistent memory range into the system memmap via
      devm_memremap_pages(), arranges for DAX to distinguish pfn-only versus
      page-backed pmem-pfns via flags in the new pfn_t type.
      
      The DAX code, upon seeing a PFN_DEV+PFN_MAP flagged pfn, flags the
      resulting pte(s) inserted into the process page tables with a new
      _PAGE_DEVMAP flag.  Later, when get_user_pages() is walking ptes it keys
      off _PAGE_DEVMAP to pin the device hosting the page range active.
      Finally, get_page() and put_page() are modified to take references
      against the device driver established page mapping.
      
      Finally, this need for "struct page" for persistent memory requires
      memory capacity to store the memmap array.  Given the memmap array for a
      large pool of persistent may exhaust available DRAM introduce a
      mechanism to allocate the memmap from persistent memory.  The new
      "struct vmem_altmap *" parameter to devm_memremap_pages() enables
      arch_add_memory() to use reserved pmem capacity rather than the page
      allocator.
      
      This patch (of 18):
      
      The core has developed a need for a "pfn_t" type [1].  Move the existing
      pfn_t in KVM to kvm_pfn_t [2].
      
      [1]: https://lists.01.org/pipermail/linux-nvdimm/2015-September/002199.html
      [2]: https://lists.01.org/pipermail/linux-nvdimm/2015-September/002218.htmlSigned-off-by: NDan Williams <dan.j.williams@intel.com>
      Acked-by: NChristoffer Dall <christoffer.dall@linaro.org>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ba049e93
  4. 18 12月, 2015 1 次提交
  5. 05 12月, 2015 1 次提交
  6. 25 11月, 2015 1 次提交
    • A
      ARM/arm64: KVM: test properly for a PTE's uncachedness · e6fab544
      Ard Biesheuvel 提交于
      The open coded tests for checking whether a PTE maps a page as
      uncached use a flawed '(pte_val(xxx) & CONST) != CONST' pattern,
      which is not guaranteed to work since the type of a mapping is
      not a set of mutually exclusive bits
      
      For HYP mappings, the type is an index into the MAIR table (i.e, the
      index itself does not contain any information whatsoever about the
      type of the mapping), and for stage-2 mappings it is a bit field where
      normal memory and device types are defined as follows:
      
          #define MT_S2_NORMAL            0xf
          #define MT_S2_DEVICE_nGnRE      0x1
      
      I.e., masking *and* comparing with the latter matches on the former,
      and we have been getting lucky merely because the S2 device mappings
      also have the PTE_UXN bit set, or we would misidentify memory mappings
      as device mappings.
      
      Since the unmap_range() code path (which contains one instance of the
      flawed test) is used both for HYP mappings and stage-2 mappings, and
      considering the difference between the two, it is non-trivial to fix
      this by rewriting the tests in place, as it would involve passing
      down the type of mapping through all the functions.
      
      However, since HYP mappings and stage-2 mappings both deal with host
      physical addresses, we can simply check whether the mapping is backed
      by memory that is managed by the host kernel, and only perform the
      D-cache maintenance if this is the case.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: NArd Biesheuvel <ard.biesheuvel@linaro.org>
      Tested-by: NPavel Fedin <p.fedin@samsung.com>
      Reviewed-by: NChristoffer Dall <christoffer.dall@linaro.org>
      Signed-off-by: NChristoffer Dall <christoffer.dall@linaro.org>
      e6fab544
  7. 16 9月, 2015 1 次提交
    • M
      arm: KVM: Fix incorrect device to IPA mapping · ca09f02f
      Marek Majtyka 提交于
      A critical bug has been found in device memory stage1 translation for
      VMs with more then 4GB of address space. Once vm_pgoff size is smaller
      then pa (which is true for LPAE case, u32 and u64 respectively) some
      more significant bits of pa may be lost as a shift operation is performed
      on u32 and later cast onto u64.
      
      Example: vm_pgoff(u32)=0x00210030, PAGE_SHIFT=12
              expected pa(u64):   0x0000002010030000
              produced pa(u64):   0x0000000010030000
      
      The fix is to change the order of operations (casting first onto phys_addr_t
      and then shifting).
      Reviewed-by: NMarc Zyngier <marc.zyngier@arm.com>
      [maz: fixed changelog and patch formatting]
      Cc: stable@vger.kernel.org
      Signed-off-by: NMarek Majtyka <marek.majtyka@tieto.com>
      Signed-off-by: NMarc Zyngier <marc.zyngier@arm.com>
      ca09f02f
  8. 10 6月, 2015 1 次提交
  9. 28 5月, 2015 1 次提交
  10. 26 5月, 2015 3 次提交
  11. 23 3月, 2015 1 次提交
  12. 20 3月, 2015 1 次提交
    • A
      ARM, arm64: kvm: get rid of the bounce page · 06f75a1f
      Ard Biesheuvel 提交于
      The HYP init bounce page is a runtime construct that ensures that the
      HYP init code does not cross a page boundary. However, this is something
      we can do perfectly well at build time, by aligning the code appropriately.
      
      For arm64, we just align to 4 KB, and enforce that the code size is less
      than 4 KB, regardless of the chosen page size.
      
      For ARM, the whole code is less than 256 bytes, so we tweak the linker
      script to align at a power of 2 upper bound of the code size
      
      Note that this also fixes a benign off-by-one error in the original bounce
      page code, where a bounce page would be allocated unnecessarily if the code
      was exactly 1 page in size.
      
      On ARM, it also fixes an issue with very large kernels reported by Arnd
      Bergmann, where stub sections with linker emitted veneers could erroneously
      trigger the size/alignment ASSERT() in the linker script.
      Tested-by: NMarc Zyngier <marc.zyngier@arm.com>
      Reviewed-by: NMarc Zyngier <marc.zyngier@arm.com>
      Signed-off-by: NArd Biesheuvel <ard.biesheuvel@linaro.org>
      Signed-off-by: NWill Deacon <will.deacon@arm.com>
      06f75a1f
  13. 13 3月, 2015 3 次提交
  14. 11 3月, 2015 2 次提交
    • M
      arm64: KVM: Do not use pgd_index to index stage-2 pgd · 04b8dc85
      Marc Zyngier 提交于
      The kernel's pgd_index macro is designed to index a normal, page
      sized array. KVM is a bit diffferent, as we can use concatenated
      pages to have a bigger address space (for example 40bit IPA with
      4kB pages gives us an 8kB PGD.
      
      In the above case, the use of pgd_index will always return an index
      inside the first 4kB, which makes a guest that has memory above
      0x8000000000 rather unhappy, as it spins forever in a page fault,
      whist the host happilly corrupts the lower pgd.
      
      The obvious fix is to get our own kvm_pgd_index that does the right
      thing(tm).
      
      Tested on X-Gene with a hacked kvmtool that put memory at a stupidly
      high address.
      Reviewed-by: NChristoffer Dall <christoffer.dall@linaro.org>
      Signed-off-by: NMarc Zyngier <marc.zyngier@arm.com>
      Signed-off-by: NChristoffer Dall <christoffer.dall@linaro.org>
      04b8dc85
    • M
      arm64: KVM: Fix stage-2 PGD allocation to have per-page refcounting · a987370f
      Marc Zyngier 提交于
      We're using __get_free_pages with to allocate the guest's stage-2
      PGD. The standard behaviour of this function is to return a set of
      pages where only the head page has a valid refcount.
      
      This behaviour gets us into trouble when we're trying to increment
      the refcount on a non-head page:
      
      page:ffff7c00cfb693c0 count:0 mapcount:0 mapping:          (null) index:0x0
      flags: 0x4000000000000000()
      page dumped because: VM_BUG_ON_PAGE((*({ __attribute__((unused)) typeof((&page->_count)->counter) __var = ( typeof((&page->_count)->counter)) 0; (volatile typeof((&page->_count)->counter) *)&((&page->_count)->counter); })) <= 0)
      BUG: failure at include/linux/mm.h:548/get_page()!
      Kernel panic - not syncing: BUG!
      CPU: 1 PID: 1695 Comm: kvm-vcpu-0 Not tainted 4.0.0-rc1+ #3825
      Hardware name: APM X-Gene Mustang board (DT)
      Call trace:
      [<ffff80000008a09c>] dump_backtrace+0x0/0x13c
      [<ffff80000008a1e8>] show_stack+0x10/0x1c
      [<ffff800000691da8>] dump_stack+0x74/0x94
      [<ffff800000690d78>] panic+0x100/0x240
      [<ffff8000000a0bc4>] stage2_get_pmd+0x17c/0x2bc
      [<ffff8000000a1dc4>] kvm_handle_guest_abort+0x4b4/0x6b0
      [<ffff8000000a420c>] handle_exit+0x58/0x180
      [<ffff80000009e7a4>] kvm_arch_vcpu_ioctl_run+0x114/0x45c
      [<ffff800000099df4>] kvm_vcpu_ioctl+0x2e0/0x754
      [<ffff8000001c0a18>] do_vfs_ioctl+0x424/0x5c8
      [<ffff8000001c0bfc>] SyS_ioctl+0x40/0x78
      CPU0: stopping
      
      A possible approach for this is to split the compound page using
      split_page() at allocation time, and change the teardown path to
      free one page at a time.  It turns out that alloc_pages_exact() and
      free_pages_exact() does exactly that.
      
      While we're at it, the PGD allocation code is reworked to reduce
      duplication.
      
      This has been tested on an X-Gene platform with a 4kB/48bit-VA host
      kernel, and kvmtool hacked to place memory in the second page of
      the hardware PGD (PUD for the host kernel). Also regression-tested
      on a Cubietruck (Cortex-A7).
      
       [ Reworked to use alloc_pages_exact() and free_pages_exact() and to
         return pointers directly instead of by reference as arguments
          - Christoffer ]
      Reported-by: NMark Rutland <mark.rutland@arm.com>
      Signed-off-by: NMarc Zyngier <marc.zyngier@arm.com>
      Signed-off-by: NChristoffer Dall <christoffer.dall@linaro.org>
      a987370f
  15. 30 1月, 2015 3 次提交
    • M
      arm/arm64: KVM: Use kernel mapping to perform invalidation on page fault · 0d3e4d4f
      Marc Zyngier 提交于
      When handling a fault in stage-2, we need to resync I$ and D$, just
      to be sure we don't leave any old cache line behind.
      
      That's very good, except that we do so using the *user* address.
      Under heavy load (swapping like crazy), we may end up in a situation
      where the page gets mapped in stage-2 while being unmapped from
      userspace by another CPU.
      
      At that point, the DC/IC instructions can generate a fault, which
      we handle with kvm->mmu_lock held. The box quickly deadlocks, user
      is unhappy.
      
      Instead, perform this invalidation through the kernel mapping,
      which is guaranteed to be present. The box is much happier, and so
      am I.
      Signed-off-by: NMarc Zyngier <marc.zyngier@arm.com>
      Signed-off-by: NChristoffer Dall <christoffer.dall@linaro.org>
      0d3e4d4f
    • M
      arm/arm64: KVM: Invalidate data cache on unmap · 363ef89f
      Marc Zyngier 提交于
      Let's assume a guest has created an uncached mapping, and written
      to that page. Let's also assume that the host uses a cache-coherent
      IO subsystem. Let's finally assume that the host is under memory
      pressure and starts to swap things out.
      
      Before this "uncached" page is evicted, we need to make sure
      we invalidate potential speculated, clean cache lines that are
      sitting there, or the IO subsystem is going to swap out the
      cached view, loosing the data that has been written directly
      into memory.
      Signed-off-by: NMarc Zyngier <marc.zyngier@arm.com>
      Signed-off-by: NChristoffer Dall <christoffer.dall@linaro.org>
      363ef89f
    • M
      arm/arm64: KVM: Use set/way op trapping to track the state of the caches · 3c1e7165
      Marc Zyngier 提交于
      Trying to emulate the behaviour of set/way cache ops is fairly
      pointless, as there are too many ways we can end-up missing stuff.
      Also, there is some system caches out there that simply ignore
      set/way operations.
      
      So instead of trying to implement them, let's convert it to VA ops,
      and use them as a way to re-enable the trapping of VM ops. That way,
      we can detect the point when the MMU/caches are turned off, and do
      a full VM flush (which is what the guest was trying to do anyway).
      
      This allows a 32bit zImage to boot on the APM thingy, and will
      probably help bootloaders in general.
      Signed-off-by: NMarc Zyngier <marc.zyngier@arm.com>
      Signed-off-by: NChristoffer Dall <christoffer.dall@linaro.org>
      3c1e7165
  16. 29 1月, 2015 1 次提交
  17. 23 1月, 2015 1 次提交
  18. 16 1月, 2015 4 次提交
  19. 13 12月, 2014 1 次提交
    • C
      arm/arm64: KVM: Introduce stage2_unmap_vm · 957db105
      Christoffer Dall 提交于
      Introduce a new function to unmap user RAM regions in the stage2 page
      tables.  This is needed on reboot (or when the guest turns off the MMU)
      to ensure we fault in pages again and make the dcache, RAM, and icache
      coherent.
      
      Using unmap_stage2_range for the whole guest physical range does not
      work, because that unmaps IO regions (such as the GIC) which will not be
      recreated or in the best case faulted in on a page-by-page basis.
      
      Call this function on secondary and subsequent calls to the
      KVM_ARM_VCPU_INIT ioctl so that a reset VCPU will detect the guest
      Stage-1 MMU is off when faulting in pages and make the caches coherent.
      Acked-by: NMarc Zyngier <marc.zyngier@arm.com>
      Signed-off-by: NChristoffer Dall <christoffer.dall@linaro.org>
      957db105
  20. 26 11月, 2014 2 次提交
    • A
      arm/arm64: kvm: drop inappropriate use of kvm_is_mmio_pfn() · bb55e9b1
      Ard Biesheuvel 提交于
      Instead of using kvm_is_mmio_pfn() to decide whether a host region
      should be stage 2 mapped with device attributes, add a new static
      function kvm_is_device_pfn() that disregards RAM pages with the
      reserved bit set, as those should usually not be mapped as device
      memory.
      Signed-off-by: NArd Biesheuvel <ard.biesheuvel@linaro.org>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      bb55e9b1
    • M
      arm64: KVM: fix unmapping with 48-bit VAs · 7cbb87d6
      Mark Rutland 提交于
      Currently if using a 48-bit VA, tearing down the hyp page tables (which
      can happen in the absence of a GICH or GICV resource) results in the
      rather nasty splat below, evidently becasue we access a table that
      doesn't actually exist.
      
      Commit 38f791a4 (arm64: KVM: Implement 48 VA support for KVM EL2
      and Stage-2) added a pgd_none check to __create_hyp_mappings to account
      for the additional level of tables, but didn't add a corresponding check
      to unmap_range, and this seems to be the source of the problem.
      
      This patch adds the missing pgd_none check, ensuring we don't try to
      access tables that don't exist.
      
      Original splat below:
      
      kvm [1]: Using HYP init bounce page @83fe94a000
      kvm [1]: Cannot obtain GICH resource
      Unable to handle kernel paging request at virtual address ffff7f7fff000000
      pgd = ffff800000770000
      [ffff7f7fff000000] *pgd=0000000000000000
      Internal error: Oops: 96000004 [#1] PREEMPT SMP
      Modules linked in:
      CPU: 1 PID: 1 Comm: swapper/0 Not tainted 3.18.0-rc2+ #89
      task: ffff8003eb500000 ti: ffff8003eb45c000 task.ti: ffff8003eb45c000
      PC is at unmap_range+0x120/0x580
      LR is at free_hyp_pgds+0xac/0xe4
      pc : [<ffff80000009b768>] lr : [<ffff80000009cad8>] pstate: 80000045
      sp : ffff8003eb45fbf0
      x29: ffff8003eb45fbf0 x28: ffff800000736000
      x27: ffff800000735000 x26: ffff7f7fff000000
      x25: 0000000040000000 x24: ffff8000006f5000
      x23: 0000000000000000 x22: 0000007fffffffff
      x21: 0000800000000000 x20: 0000008000000000
      x19: 0000000000000000 x18: ffff800000648000
      x17: ffff800000537228 x16: 0000000000000000
      x15: 000000000000001f x14: 0000000000000000
      x13: 0000000000000001 x12: 0000000000000020
      x11: 0000000000000062 x10: 0000000000000006
      x9 : 0000000000000000 x8 : 0000000000000063
      x7 : 0000000000000018 x6 : 00000003ff000000
      x5 : ffff800000744188 x4 : 0000000000000001
      x3 : 0000000040000000 x2 : ffff800000000000
      x1 : 0000007fffffffff x0 : 000000003fffffff
      
      Process swapper/0 (pid: 1, stack limit = 0xffff8003eb45c058)
      Stack: (0xffff8003eb45fbf0 to 0xffff8003eb460000)
      fbe0:                                     eb45fcb0 ffff8003 0009cad8 ffff8000
      fc00: 00000000 00000080 00736140 ffff8000 00736000 ffff8000 00000000 00007c80
      fc20: 00000000 00000080 006f5000 ffff8000 00000000 00000080 00743000 ffff8000
      fc40: 00735000 ffff8000 006d3030 ffff8000 006fe7b8 ffff8000 00000000 00000080
      fc60: ffffffff 0000007f fdac1000 ffff8003 fd94b000 ffff8003 fda47000 ffff8003
      fc80: 00502b40 ffff8000 ff000000 ffff7f7f fdec6000 00008003 fdac1630 ffff8003
      fca0: eb45fcb0 ffff8003 ffffffff 0000007f eb45fd00 ffff8003 0009b378 ffff8000
      fcc0: ffffffea 00000000 006fe000 ffff8000 00736728 ffff8000 00736120 ffff8000
      fce0: 00000040 00000000 00743000 ffff8000 006fe7b8 ffff8000 0050cd48 00000000
      fd00: eb45fd60 ffff8003 00096070 ffff8000 006f06e0 ffff8000 006f06e0 ffff8000
      fd20: fd948b40 ffff8003 0009a320 ffff8000 00000000 00000000 00000000 00000000
      fd40: 00000ae0 00000000 006aa25c ffff8000 eb45fd60 ffff8003 0017ca44 00000002
      fd60: eb45fdc0 ffff8003 0009a33c ffff8000 006f06e0 ffff8000 006f06e0 ffff8000
      fd80: fd948b40 ffff8003 0009a320 ffff8000 00000000 00000000 00735000 ffff8000
      fda0: 006d3090 ffff8000 006aa25c ffff8000 00735000 ffff8000 006d3030 ffff8000
      fdc0: eb45fdd0 ffff8003 000814c0 ffff8000 eb45fe50 ffff8003 006aaac4 ffff8000
      fde0: 006ddd90 ffff8000 00000006 00000000 006d3000 ffff8000 00000095 00000000
      fe00: 006a1e90 ffff8000 00735000 ffff8000 006d3000 ffff8000 006aa25c ffff8000
      fe20: 00735000 ffff8000 006d3030 ffff8000 eb45fe50 ffff8003 006fac68 ffff8000
      fe40: 00000006 00000006 fe293ee6 ffff8003 eb45feb0 ffff8003 004f8ee8 ffff8000
      fe60: 004f8ed4 ffff8000 00735000 ffff8000 00000000 00000000 00000000 00000000
      fe80: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
      fea0: 00000000 00000000 00000000 00000000 00000000 00000000 000843d0 ffff8000
      fec0: 004f8ed4 ffff8000 00000000 00000000 00000000 00000000 00000000 00000000
      fee0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
      ff00: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
      ff20: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
      ff40: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
      ff60: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
      ff80: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
      ffa0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
      ffc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000005 00000000
      ffe0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
      Call trace:
      [<ffff80000009b768>] unmap_range+0x120/0x580
      [<ffff80000009cad4>] free_hyp_pgds+0xa8/0xe4
      [<ffff80000009b374>] kvm_arch_init+0x268/0x44c
      [<ffff80000009606c>] kvm_init+0x24/0x260
      [<ffff80000009a338>] arm_init+0x18/0x24
      [<ffff8000000814bc>] do_one_initcall+0x88/0x1a0
      [<ffff8000006aaac0>] kernel_init_freeable+0x148/0x1e8
      [<ffff8000004f8ee4>] kernel_init+0x10/0xd4
      Code: 8b000263 92628479 d1000720 eb01001f (f9400340)
      ---[ end trace 3bc230562e926fa4 ]---
      Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b
      Signed-off-by: NMark Rutland <mark.rutland@arm.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Jungseok Lee <jungseoklee85@gmail.com>
      Acked-by: NMarc Zyngier <marc.zyngier@arm.com>
      Acked-by: NChristoffer Dall <christoffer.dall@linaro.org>
      Signed-off-by: NMarc Zyngier <marc.zyngier@arm.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      7cbb87d6
  21. 25 11月, 2014 1 次提交