1. 20 Mar, 2019 4 commits
    • S
      KVM: arm/arm64: Enforce PTE mappings at stage2 when needed · a80868f3
      Suzuki K Poulose committed
      commit 6794ad54 ("KVM: arm/arm64: Fix unintended stage 2 PMD mappings")
      made the checks to skip huge mappings stricter. However, it introduced
      a bug where we still use huge mappings, ignoring the flag to
      use PTE mappings, by not resetting the vma_pagesize to PAGE_SIZE.
      
      Also, the checks do not cover the PUD huge pages, which were
      under review during the same period. This patch fixes both
      issues.
      
      Fixes: 6794ad54 ("KVM: arm/arm64: Fix unintended stage 2 PMD mappings")
      Reported-by: Zenghui Yu <yuzenghui@huawei.com>
      Cc: Zenghui Yu <yuzenghui@huawei.com>
      Cc: Christoffer Dall <christoffer.dall@arm.com>
      Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
      Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
      a80868f3
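The missing fallback described in a80868f3 can be sketched in plain C. This is only an illustration under assumptions: `stage2_map_size()` and the `EX_*` size macros are hypothetical names for a 4 KiB-granule configuration, not the kernel's own helpers.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed sizes for a 4 KiB-granule configuration (illustrative only). */
#define EX_PAGE_SIZE (1UL << 12)
#define EX_PMD_SIZE  (1UL << 21)
#define EX_PUD_SIZE  (1UL << 30)

/*
 * Hypothetical helper, not the kernel's function: pick the stage 2
 * mapping size for a fault backed by a VMA of size vma_pagesize.
 */
static unsigned long stage2_map_size(unsigned long vma_pagesize, bool force_pte)
{
	/* Only these three sizes are modelled in this sketch. */
	assert(vma_pagesize == EX_PAGE_SIZE || vma_pagesize == EX_PMD_SIZE ||
	       vma_pagesize == EX_PUD_SIZE);

	/*
	 * The step the commit message says was missing: when PTE mappings
	 * are required, fall back to the page size, whether the backing
	 * page is PMD-sized or PUD-sized.
	 */
	if (force_pte)
		vma_pagesize = EX_PAGE_SIZE;

	return vma_pagesize;
}
```

With `force_pte` set, both `EX_PMD_SIZE` and `EX_PUD_SIZE` inputs come back as `EX_PAGE_SIZE`; without it the input size is returned unchanged.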
    • M
      KVM: arm/arm64: vgic-its: Take the srcu lock when parsing the memslots · 7494cec6
      Marc Zyngier committed
      Calling kvm_is_visible_gfn() implies that we're parsing the memslots,
      and doing this without the srcu lock is frowned upon:
      
      [12704.164532] =============================
      [12704.164544] WARNING: suspicious RCU usage
      [12704.164560] 5.1.0-rc1-00008-g600025238f51-dirty #16 Tainted: G        W
      [12704.164573] -----------------------------
      [12704.164589] ./include/linux/kvm_host.h:605 suspicious rcu_dereference_check() usage!
      [12704.164602] other info that might help us debug this:
      [12704.164616] rcu_scheduler_active = 2, debug_locks = 1
      [12704.164631] 6 locks held by qemu-system-aar/13968:
      [12704.164644]  #0: 000000007ebdae4f (&kvm->lock){+.+.}, at: vgic_its_set_attr+0x244/0x3a0
      [12704.164691]  #1: 000000007d751022 (&its->its_lock){+.+.}, at: vgic_its_set_attr+0x250/0x3a0
      [12704.164726]  #2: 00000000219d2706 (&vcpu->mutex){+.+.}, at: lock_all_vcpus+0x64/0xd0
      [12704.164761]  #3: 00000000a760aecd (&vcpu->mutex){+.+.}, at: lock_all_vcpus+0x64/0xd0
      [12704.164794]  #4: 000000000ef8e31d (&vcpu->mutex){+.+.}, at: lock_all_vcpus+0x64/0xd0
      [12704.164827]  #5: 000000007a872093 (&vcpu->mutex){+.+.}, at: lock_all_vcpus+0x64/0xd0
      [12704.164861] stack backtrace:
      [12704.164878] CPU: 2 PID: 13968 Comm: qemu-system-aar Tainted: G        W         5.1.0-rc1-00008-g600025238f51-dirty #16
      [12704.164887] Hardware name: rockchip evb_rk3399/evb_rk3399, BIOS 2019.04-rc3-00124-g2feec69fb1 03/15/2019
      [12704.164896] Call trace:
      [12704.164910]  dump_backtrace+0x0/0x138
      [12704.164920]  show_stack+0x24/0x30
      [12704.164934]  dump_stack+0xbc/0x104
      [12704.164946]  lockdep_rcu_suspicious+0xcc/0x110
      [12704.164958]  gfn_to_memslot+0x174/0x190
      [12704.164969]  kvm_is_visible_gfn+0x28/0x70
      [12704.164980]  vgic_its_check_id.isra.0+0xec/0x1e8
      [12704.164991]  vgic_its_save_tables_v0+0x1ac/0x330
      [12704.165001]  vgic_its_set_attr+0x298/0x3a0
      [12704.165012]  kvm_device_ioctl_attr+0x9c/0xd8
      [12704.165022]  kvm_device_ioctl+0x8c/0xf8
      [12704.165035]  do_vfs_ioctl+0xc8/0x960
      [12704.165045]  ksys_ioctl+0x8c/0xa0
      [12704.165055]  __arm64_sys_ioctl+0x28/0x38
      [12704.165067]  el0_svc_common+0xd8/0x138
      [12704.165078]  el0_svc_handler+0x38/0x78
      [12704.165089]  el0_svc+0x8/0xc
      
      Make sure the lock is taken when doing this.
      
      Fixes: bf308242 ("KVM: arm/arm64: VGIC/ITS: protect kvm_read_guest() calls with SRCU lock")
      Reviewed-by: Eric Auger <eric.auger@redhat.com>
      Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
      7494cec6
    • M
      KVM: arm/arm64: vgic-its: Take the srcu lock when writing to guest memory · a6ecfb11
      Marc Zyngier committed
      When halting a guest, QEMU flushes the virtual ITS caches, which
      amounts to writing to the various tables that the guest has allocated.
      
      When doing this, we fail to take the srcu lock, and the kernel
      shouts loudly if running a lockdep kernel:
      
      [   69.680416] =============================
      [   69.680819] WARNING: suspicious RCU usage
      [   69.681526] 5.1.0-rc1-00008-g600025238f51-dirty #18 Not tainted
      [   69.682096] -----------------------------
      [   69.682501] ./include/linux/kvm_host.h:605 suspicious rcu_dereference_check() usage!
      [   69.683225]
      [   69.683225] other info that might help us debug this:
      [   69.683225]
      [   69.683975]
      [   69.683975] rcu_scheduler_active = 2, debug_locks = 1
      [   69.684598] 6 locks held by qemu-system-aar/4097:
      [   69.685059]  #0: 0000000034196013 (&kvm->lock){+.+.}, at: vgic_its_set_attr+0x244/0x3a0
      [   69.686087]  #1: 00000000f2ed935e (&its->its_lock){+.+.}, at: vgic_its_set_attr+0x250/0x3a0
      [   69.686919]  #2: 000000005e71ea54 (&vcpu->mutex){+.+.}, at: lock_all_vcpus+0x64/0xd0
      [   69.687698]  #3: 00000000c17e548d (&vcpu->mutex){+.+.}, at: lock_all_vcpus+0x64/0xd0
      [   69.688475]  #4: 00000000ba386017 (&vcpu->mutex){+.+.}, at: lock_all_vcpus+0x64/0xd0
      [   69.689978]  #5: 00000000c2c3c335 (&vcpu->mutex){+.+.}, at: lock_all_vcpus+0x64/0xd0
      [   69.690729]
      [   69.690729] stack backtrace:
      [   69.691151] CPU: 2 PID: 4097 Comm: qemu-system-aar Not tainted 5.1.0-rc1-00008-g600025238f51-dirty #18
      [   69.691984] Hardware name: rockchip evb_rk3399/evb_rk3399, BIOS 2019.04-rc3-00124-g2feec69fb1 03/15/2019
      [   69.692831] Call trace:
      [   69.694072]  lockdep_rcu_suspicious+0xcc/0x110
      [   69.694490]  gfn_to_memslot+0x174/0x190
      [   69.694853]  kvm_write_guest+0x50/0xb0
      [   69.695209]  vgic_its_save_tables_v0+0x248/0x330
      [   69.695639]  vgic_its_set_attr+0x298/0x3a0
      [   69.696024]  kvm_device_ioctl_attr+0x9c/0xd8
      [   69.696424]  kvm_device_ioctl+0x8c/0xf8
      [   69.696788]  do_vfs_ioctl+0xc8/0x960
      [   69.697128]  ksys_ioctl+0x8c/0xa0
      [   69.697445]  __arm64_sys_ioctl+0x28/0x38
      [   69.697817]  el0_svc_common+0xd8/0x138
      [   69.698173]  el0_svc_handler+0x38/0x78
      [   69.698528]  el0_svc+0x8/0xc
      
      The obvious fix is to take the srcu lock, just like we do on the
      read side of things since bf308242. One wonders why this wasn't
      fixed at the same time, but hey...
      
      Fixes: bf308242 ("KVM: arm/arm64: VGIC/ITS: protect kvm_read_guest() calls with SRCU lock")
      Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
      a6ecfb11
    • M
      arm64: KVM: Always set ICH_HCR_EL2.EN if GICv4 is enabled · ca71228b
      Marc Zyngier committed
      The normal interrupt flow is not to enable the vgic when no virtual
      interrupt is to be injected (i.e. the LRs are empty). But when a guest
      is likely to use GICv4 for LPIs, we absolutely need to switch it on
      at all times. Otherwise, VLPIs only get delivered when there is something
      in the LRs, which doesn't happen very often.
      Reported-by: Nianyao Tang <tangnianyao@huawei.com>
      Tested-by: Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>
      Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
      ca71228b
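The condition described in ca71228b can be sketched as a small decision function. This is a rough model under assumptions: `ich_hcr_on_entry()` and its parameters are hypothetical names, and the only architectural detail relied on is that the enable bit of ICH_HCR_EL2 is bit 0.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Enable bit of ICH_HCR_EL2 (bit 0). */
#define ICH_HCR_EN (UINT64_C(1) << 0)

/*
 * Hypothetical helper: compute the value to program on guest entry.
 * shadow_hcr   - the software copy of the register
 * used_lrs     - number of list registers holding virtual interrupts
 * gicv4_vlpis  - true when the guest may receive VLPIs through GICv4
 */
static uint64_t ich_hcr_on_entry(uint64_t shadow_hcr, unsigned int used_lrs,
				 bool gicv4_vlpis)
{
	/* Enabled if there is something to inject, or always with GICv4. */
	if (used_lrs > 0 || gicv4_vlpis)
		return shadow_hcr | ICH_HCR_EN;

	/* Otherwise leave the virtual interface disabled. */
	return shadow_hcr & ~ICH_HCR_EN;
}
```

The second operand of the `||` is the part the commit adds: without it, an empty set of list registers would leave the interface off and VLPIs undelivered.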
  2. 22 Feb, 2019 1 commit
  3. 21 Feb, 2019 1 commit
    • S
      KVM: Call kvm_arch_memslots_updated() before updating memslots · 15248258
      Sean Christopherson committed
      kvm_arch_memslots_updated() is at this point in time an x86-specific
      hook for handling MMIO generation wraparound.  x86 stashes 19 bits of
      the memslots generation number in its MMIO sptes in order to avoid
      full page fault walks for repeat faults on emulated MMIO addresses.
      Because only 19 bits are used, wrapping the MMIO generation number is
      possible, if unlikely.  kvm_arch_memslots_updated() alerts x86 that
      the generation has changed so that it can invalidate all MMIO sptes in
      case the effective MMIO generation has wrapped so as to avoid using a
      stale spte, e.g. a (very) old spte that was created with generation==0.
      
      Given that the purpose of kvm_arch_memslots_updated() is to prevent
      consuming stale entries, it needs to be called before the new generation
      is propagated to memslots.  Invalidating the MMIO sptes after updating
      memslots means that there is a window where a vCPU could dereference
      the new memslots generation, e.g. 0, and incorrectly reuse an old MMIO
      spte that was created with (pre-wrap) generation==0.
      
      Fixes: e59dbe09 ("KVM: Introduce kvm_arch_memslots_updated()")
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      15248258
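The wraparound hazard described in 15248258 comes down to truncating the generation to 19 bits. The sketch below is a simplified model: `mmio_gen()` and `mmio_spte_looks_fresh()` are hypothetical names, and the way the bits are actually stored inside an spte is not modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Number of generation bits kept in an MMIO spte, per the commit message. */
#define MMIO_GEN_BITS 19
#define MMIO_GEN_MASK ((UINT64_C(1) << MMIO_GEN_BITS) - 1)

/* Hypothetical: the part of the memslots generation an spte can remember. */
static uint64_t mmio_gen(uint64_t memslots_gen)
{
	return memslots_gen & MMIO_GEN_MASK;
}

/*
 * Hypothetical: an spte is treated as current when its stored bits match
 * the truncated current generation.
 */
static bool mmio_spte_looks_fresh(uint64_t spte_gen, uint64_t memslots_gen)
{
	return spte_gen == mmio_gen(memslots_gen);
}
```

An spte created at generation 0 still "looks fresh" once the generation reaches `1 << 19`, because both truncate to 0. That is why the stale sptes have to be invalidated before the new generation becomes visible to vCPUs.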
  4. 20 Feb, 2019 14 commits
  5. 08 Feb, 2019 1 commit
  6. 07 Feb, 2019 3 commits
  7. 24 Jan, 2019 3 commits
  8. 05 Jan, 2019 1 commit
    • J
      mm: treewide: remove unused address argument from pte_alloc functions · 4cf58924
      Joel Fernandes (Google) committed
      Patch series "Add support for fast mremap".
      
      This series speeds up the mremap(2) syscall by copying page tables at
      the PMD level even for non-THP systems.  There is concern that the extra
      'address' argument that mremap passes to pte_alloc may do something
      subtle architecture related in the future that may make the scheme not
      work.  Also we find that there is no point in passing the 'address' to
      pte_alloc since it is unused.  This patch therefore removes this argument
      tree-wide, resulting in a nice negative diff as well, while also ensuring
      along the way that the enabled architectures do not do anything funky
      with the 'address' argument that goes unnoticed by the optimization.
      
      Build and boot tested on x86-64.  Build tested on arm64.  The config
      enablement patch for arm64 will be posted in the future after more
      testing.
      
      The changes were obtained by applying the following Coccinelle script
      (thanks Julia for answering all Coccinelle questions!).
      The following fixups were done manually:
      * Removal of address argument from  pte_fragment_alloc
      * Removal of pte_alloc_one_fast definitions from m68k and microblaze.
      
      // Options: --include-headers --no-includes
      // Note: I split the 'identifier fn' line, so if you are manually
      // running it, please unsplit it so it runs for you.
      
      virtual patch
      
      @pte_alloc_func_def depends on patch exists@
      identifier E2;
      identifier fn =~
      "^(__pte_alloc|pte_alloc_one|pte_alloc|__pte_alloc_kernel|pte_alloc_one_kernel)$";
      type T2;
      @@
      
       fn(...
      - , T2 E2
       )
       { ... }
      
      @pte_alloc_func_proto_noarg depends on patch exists@
      type T1, T2, T3, T4;
      identifier fn =~ "^(__pte_alloc|pte_alloc_one|pte_alloc|__pte_alloc_kernel|pte_alloc_one_kernel)$";
      @@
      
      (
      - T3 fn(T1, T2);
      + T3 fn(T1);
      |
      - T3 fn(T1, T2, T4);
      + T3 fn(T1, T2);
      )
      
      @pte_alloc_func_proto depends on patch exists@
      identifier E1, E2, E4;
      type T1, T2, T3, T4;
      identifier fn =~
      "^(__pte_alloc|pte_alloc_one|pte_alloc|__pte_alloc_kernel|pte_alloc_one_kernel)$";
      @@
      
      (
      - T3 fn(T1 E1, T2 E2);
      + T3 fn(T1 E1);
      |
      - T3 fn(T1 E1, T2 E2, T4 E4);
      + T3 fn(T1 E1, T2 E2);
      )
      
      @pte_alloc_func_call depends on patch exists@
      expression E2;
      identifier fn =~
      "^(__pte_alloc|pte_alloc_one|pte_alloc|__pte_alloc_kernel|pte_alloc_one_kernel)$";
      @@
      
       fn(...
      -,  E2
       )
      
      @pte_alloc_macro depends on patch exists@
      identifier fn =~
      "^(__pte_alloc|pte_alloc_one|pte_alloc|__pte_alloc_kernel|pte_alloc_one_kernel)$";
      identifier a, b, c;
      expression e;
      position p;
      @@
      
      (
      - #define fn(a, b, c) e
      + #define fn(a, b) e
      |
      - #define fn(a, b) e
      + #define fn(a) e
      )
      
      Link: http://lkml.kernel.org/r/20181108181201.88826-2-joelaf@google.com
      Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
      Suggested-by: Kirill A. Shutemov <kirill@shutemov.name>
      Acked-by: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Julia Lawall <Julia.Lawall@lip6.fr>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: William Kucharski <william.kucharski@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4cf58924
  9. 21 Dec, 2018 1 commit
  10. 20 Dec, 2018 8 commits
    • M
      arm/arm64: KVM: Add ARM_EXCEPTION_IS_TRAP macro · 58466766
      Marc Zyngier committed
      32 and 64bit use different symbols to identify the traps.
      32bit has a fine grained approach (prefetch abort, data abort and HVC),
      while 64bit is pretty happy with just "trap".
      
      This has been fine so far, except that we now need to decode some
      of that in tracepoints that are common to both architectures.
      
      Introduce ARM_EXCEPTION_IS_TRAP which abstracts the trap symbols
      and make the tracepoint use it.
      Acked-by: Christoffer Dall <christoffer.dall@arm.com>
      Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
      58466766
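The abstraction described in 58466766 can be sketched as two predicates that answer the same question for each architecture. Everything here is hypothetical: the enum names and values are illustrative and are not the kernel's exit codes, and the real change is a macro rather than a pair of functions.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical 32-bit exit codes: traps are split into three kinds. */
enum arm32_exit { A32_IRQ, A32_PREF_ABORT, A32_DATA_ABORT, A32_HVC };

/* Hypothetical 64-bit exit codes: a single "trap" kind. */
enum arm64_exit { A64_IRQ, A64_TRAP };

/* 32-bit view: any of the fine-grained trap kinds counts as a trap. */
static bool arm32_exit_is_trap(enum arm32_exit code)
{
	return code == A32_PREF_ABORT || code == A32_DATA_ABORT ||
	       code == A32_HVC;
}

/* 64-bit view: only the single trap code counts. */
static bool arm64_exit_is_trap(enum arm64_exit code)
{
	return code == A64_TRAP;
}
```

Shared code such as a tracepoint can then ask "is this a trap?" without knowing which set of symbols the architecture uses.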
    • C
      KVM: arm/arm64: Fix unintended stage 2 PMD mappings · 6794ad54
      Christoffer Dall committed
      There are two things we need to take care of when we create block
      mappings in the stage 2 page tables:
      
        (1) The alignment within a PMD between the host address range and the
        guest IPA range must be the same, since otherwise we end up mapping
        pages with the wrong offset.
      
        (2) The head and tail of a memory slot may not cover a full block
        size, and we have to take care to not map those with block
        descriptors, since we could expose memory to the guest that the host
        did not intend to expose.
      
      So far, we have been taking care of (1), but not (2), and our commentary
      describing (1) was somewhat confusing.
      
      This commit attempts to factor out the checks of both into a common
      function, and if we don't pass the check, we won't attempt any PMD
      mappings for either hugetlbfs or THP.
      
      Note that we used to only check the alignment for THP, not for
      hugetlbfs, but as far as I can tell the check needs to be applied to
      both scenarios.
      
      Cc: Ralph Palutke <ralph.palutke@fau.de>
      Cc: Lukas Braun <koomi@moshbit.net>
      Reported-by: Lukas Braun <koomi@moshbit.net>
      Signed-off-by: Christoffer Dall <christoffer.dall@arm.com>
      Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
      6794ad54
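The two conditions listed in 6794ad54 can be expressed as one check. This is a minimal sketch under assumptions: `struct slot`, `slot_allows_block_mapping()` and the 2 MiB `EX_BLOCK_SIZE` are hypothetical stand-ins, not the kernel's types or the function the commit introduces.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed block size: 2 MiB, as for a PMD with 4 KiB pages. */
#define EX_BLOCK_SIZE UINT64_C(0x200000)
#define EX_BLOCK_MASK (~(EX_BLOCK_SIZE - 1))

/* Hypothetical, simplified view of a memory slot. */
struct slot {
	uint64_t gpa_start;  /* guest IPA of the first byte */
	uint64_t hva_start;  /* host virtual address of the first byte */
	uint64_t size;       /* length in bytes */
};

static bool slot_allows_block_mapping(const struct slot *s, uint64_t hva)
{
	uint64_t hva_end = s->hva_start + s->size;
	uint64_t block = hva & EX_BLOCK_MASK;  /* start of the block around hva */

	/* (1) IPA and host address must share the same offset within a block. */
	if ((s->gpa_start & ~EX_BLOCK_MASK) != (s->hva_start & ~EX_BLOCK_MASK))
		return false;

	/* (2) The whole block must lie inside the slot (no head/tail spill). */
	return block >= s->hva_start && block + EX_BLOCK_SIZE <= hva_end;
}
```

A fault in the first or last partial block of a slot fails check (2) and would be mapped with page-sized entries instead, so memory outside the slot is never exposed through a block descriptor.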
    • M
      arm/arm64: KVM: vgic: Force VM halt when changing the active state of GICv3 PPIs/SGIs · 107352a2
      Marc Zyngier committed
      We currently only halt the guest when a vCPU messes with the active
      state of an SPI. This is perfectly fine for GICv2, but isn't enough
      for GICv3, where all vCPUs can access the state of any other vCPU.
      
      Let's broaden the condition to include any GICv3 interrupt that
      has an active state (i.e. all but LPIs).
      
      Cc: stable@vger.kernel.org
      Reviewed-by: Christoffer Dall <christoffer.dall@arm.com>
      Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
      107352a2
    • C
      KVM: arm/arm64: arch_timer: Simplify kvm_timer_vcpu_terminate · 6e14ef1d
      Christoffer Dall committed
      kvm_timer_vcpu_terminate can only be called in two scenarios:
      
       1. As part of cleanup during a failed VCPU create
       2. As part of freeing the whole VM (struct kvm refcount == 0)
      
      In the first case, we cannot have programmed any timers or mapped any
      IRQs, and therefore we do not have to cancel anything or unmap anything.
      
      In the second case, the VCPU will have gone through kvm_timer_vcpu_put,
      which will have canceled the emulated physical timer's hrtimer, and we
      do not need to do that here as well.  We also do not care if the irq is
      recorded as mapped or not in the VGIC data structure, because the whole
      VM is going away.  That leaves us only with having to ensure that we
      cancel the bg_timer if we were blocking the last time we called
      kvm_timer_vcpu_put().
      Signed-off-by: Christoffer Dall <christoffer.dall@arm.com>
      Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
      6e14ef1d
    • C
      KVM: arm/arm64: Remove arch timer workqueue · 8a411b06
      Christoffer Dall committed
      The use of a work queue in the hrtimer expire function for the bg_timer
      is a leftover from the time when we would inject interrupts when the
      bg_timer expired.
      
      Since we are no longer doing that, we can instead call
      kvm_vcpu_wake_up() directly from the hrtimer function and remove all
      workqueue functionality from the arch timer code.
      Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
      Signed-off-by: Christoffer Dall <christoffer.dall@arm.com>
      Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
      8a411b06
    • C
      KVM: arm/arm64: Fixup the kvm_exit tracepoint · 71a7e47f
      Christoffer Dall committed
      The kvm_exit tracepoint strangely always reported exits as being IRQs.
      This seems to be because either the __print_symbolic or the tracepoint
      macros use a variable named idx.
      
      Take this chance to update the fields in the tracepoint to reflect the
      concepts in the arm64 architecture that we pass to the tracepoint and
      move the exception type table to the same location and header files as
      the exits code.
      
      We also clear out the exception code to 0 for IRQ exits (which
      translates to UNKNOWN in text) to make it slightly less confusing to
      parse the trace output.
      Signed-off-by: Christoffer Dall <christoffer.dall@arm.com>
      Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
      71a7e47f
    • C
      KVM: arm/arm64: vgic: Consider priority and active state for pending irq · 9009782a
      Christoffer Dall committed
      When checking if there are any pending IRQs for the VM, consider the
      active state and priority of the IRQs as well.
      
      Otherwise we could be continuously scheduling a guest hypervisor without
      it seeing an IRQ.
      Signed-off-by: Christoffer Dall <christoffer.dall@arm.com>
      Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
      9009782a
    • G
      KVM: arm/arm64: vgic: Fix off-by-one bug in vgic_get_irq() · c23b2e6f
      Gustavo A. R. Silva committed
      When using the nospec API, it should be taken into account that:
      
      "...if the CPU speculates past the bounds check then
       * array_index_nospec() will clamp the index within the range of [0,
       * size)."
      
      The above is part of the header for macro array_index_nospec() in
      linux/nospec.h
      
      Now, in this particular case, if intid evaluates to exactly VGIC_MAX_SPI
      or to exactly VGIC_MAX_PRIVATE, the array_index_nospec() macro ends up
      returning VGIC_MAX_SPI - 1 or VGIC_MAX_PRIVATE - 1 respectively, instead
      of VGIC_MAX_SPI or VGIC_MAX_PRIVATE, which, based on the original logic:
      
      	/* SGIs and PPIs */
      	if (intid <= VGIC_MAX_PRIVATE)
       		return &vcpu->arch.vgic_cpu.private_irqs[intid];
      
       	/* SPIs */
      	if (intid <= VGIC_MAX_SPI)
       		return &kvm->arch.vgic.spis[intid - VGIC_NR_PRIVATE_IRQS];
      
      are valid values for intid.
      
      Fix this by calling array_index_nospec() macro with VGIC_MAX_PRIVATE + 1
      and VGIC_MAX_SPI + 1 as arguments for its parameter size.
      
      Fixes: 41b87599 ("KVM: arm/arm64: vgic: fix possible spectre-v1 in vgic_get_irq()")
      Cc: stable@vger.kernel.org
      Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
      [dropped the SPI part which was fixed separately]
      Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
      c23b2e6f
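The off-by-one in c23b2e6f can be reproduced with a plain model of the clamp. Note the assumptions: `model_index_nospec()` is a hypothetical stand-in that is not speculation-safe, it forces an out-of-range index to 0 (my reading of the generic mask-based implementation, which differs from the "- 1" wording in the message), and the value 32 for VGIC_NR_PRIVATE_IRQS is assumed here. Whichever in-range value results, the point is the same: a valid intid equal to the maximum no longer selects its own entry.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed values for illustration. */
#define VGIC_NR_PRIVATE_IRQS 32
#define VGIC_MAX_PRIVATE     (VGIC_NR_PRIVATE_IRQS - 1)

/*
 * Plain model of the clamp: indices in [0, size) pass through,
 * anything else is forced to 0. Not speculation-safe.
 */
static size_t model_index_nospec(size_t index, size_t size)
{
	return index < size ? index : 0;
}

/* Buggy form: size is the maximum index, so the last valid intid is lost. */
static size_t private_irq_slot_buggy(size_t intid)
{
	return model_index_nospec(intid, VGIC_MAX_PRIVATE);
}

/* Fixed form: size is the number of valid indices, VGIC_MAX_PRIVATE + 1. */
static size_t private_irq_slot_fixed(size_t intid)
{
	return model_index_nospec(intid, VGIC_MAX_PRIVATE + 1);
}
```

Only the boundary value behaves differently between the two forms, which is why the bug is easy to miss in testing.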
  11. 18 Dec, 2018 3 commits