1. 25 Jun, 2021 (12 commits)
    • KVM: x86: Alert userspace that KVM_SET_CPUID{,2} after KVM_RUN is broken · 63f5a190
      Sean Christopherson committed
      Warn userspace that KVM_SET_CPUID{,2} after KVM_RUN "may" cause guest
      instability.  Initialize last_vmentry_cpu to -1 and use it to detect if
      the vCPU has been run at least once when its CPUID model is changed.
      
      KVM does not correctly handle changes to paging related settings in the
      guest's vCPU model after KVM_RUN, e.g. MAXPHYADDR, GBPAGES, etc...  KVM
      could theoretically zap all shadow pages, but actually making that happen
      is a mess due to lock inversion (vcpu->mutex is held).  And even then,
      updating paging settings on the fly would only work if all vCPUs are
      stopped, updated in concert with identical settings, then restarted.
      
      To support running vCPUs with different vCPU models (that affect paging),
      KVM would need to track all relevant information in kvm_mmu_page_role.
      Note, that's the _page_ role, not the full mmu_role.  Updating mmu_role
      isn't sufficient as a vCPU can reuse a shadow page translation that was
      created by a vCPU with different settings and thus completely skip the
      reserved bit checks (that are tied to CPUID).
      
      Tracking CPUID state in kvm_mmu_page_role is _extremely_ undesirable as
      it would require doubling gfn_track from a u16 to a u32, i.e. would
      increase KVM's memory footprint by 2 bytes for every 4kb of guest memory.
      E.g. MAXPHYADDR (6 bits), GBPAGES, AMD vs. INTEL = 1 bit, and SEV C-BIT
      would all need to be tracked.
      
      In practice, there is no remotely sane use case for changing any paging
      related CPUID entries on the fly, so just sweep it under the rug (after
      yelling at userspace).
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210622175739.3610207-8-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
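The two ideas in this commit message can be modelled roughly as follows. This is a Python sketch of the reasoning only, not KVM's code: the function and constant names are hypothetical, and the only facts taken from the message are the initial `last_vmentry_cpu` value of -1 and the quoted cost of 2 extra bytes per 4 KiB of guest memory.

```python
PAGE_SIZE = 4096          # 4 KiB guest pages, as quoted in the commit message
EXTRA_BYTES_PER_PAGE = 2  # widening a u16 to a u32 adds 2 bytes per page
NEVER_RUN = -1            # initial last_vmentry_cpu value per the commit message


def cpuid_update_after_run(last_vmentry_cpu):
    """Return True if the vCPU has entered the guest at least once.

    Hypothetical helper: the commit only says that -1 means "never run".
    """
    return last_vmentry_cpu != NEVER_RUN


def extra_tracking_bytes(guest_memory_bytes):
    """Extra memory needed if the per-page counter grew from u16 to u32."""
    return (guest_memory_bytes // PAGE_SIZE) * EXTRA_BYTES_PER_PAGE
```

Under these assumptions, a 1 GiB guest has 262144 pages of 4 KiB, so the wider counter would cost 524288 bytes (512 KiB) of additional memory.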
    • KVM: x86: Force all MMUs to reinitialize if guest CPUID is modified · 49c6f875
      Sean Christopherson committed
      Invalidate all MMUs' roles after a CPUID update to force reinitialization
      of the MMU context/helpers.  Despite the efforts of commit de3ccd26
      ("KVM: MMU: record maximum physical address width in kvm_mmu_extended_role"),
      there are still a handful of CPUID-based properties that affect MMU
      behavior but are not incorporated into mmu_role.  E.g. 1gb hugepage
      support, AMD vs. Intel handling of bit 8, and SEV's C-Bit location all
      factor into the guest's reserved PTE bits.
      
      The obvious alternative would be to add all such properties to mmu_role,
      but doing so provides no benefit over simply forcing a reinitialization
      on every CPUID update, as setting guest CPUID is a rare operation.
      
      Note, reinitializing all MMUs after a CPUID update does not fix all of
      KVM's woes.  Specifically, kvm_mmu_page_role doesn't track the CPUID
      properties, which means that a vCPU can reuse shadow pages that should
      not exist for the new vCPU model, e.g. that map GPAs that are now illegal
      (due to MAXPHYADDR changes) or that set bits that are now reserved
      (PAGE_SIZE for 1gb pages), etc...
      
      Tracking the relevant CPUID properties in kvm_mmu_page_role would address
      the majority of problems, but fully tracking that much state in the
      shadow page role comes with an unpalatable cost as it would require a
      non-trivial increase in KVM's memory footprint.  The GBPAGES case is even
      worse, as neither Intel nor AMD provides a way to disable 1gb hugepage
      support in the hardware page walker, i.e. it's a virtualization hole that
      can't be closed when using TDP.
      
      In other words, resetting the MMU after a CPUID update is largely a
      superficial fix.  But, it will allow reverting the tracking of MAXPHYADDR
      in the mmu_role, and that case in particular needs to mostly work because
      KVM's shadow_root_level depends on guest MAXPHYADDR when 5-level paging
      is supported.  For cases where KVM botches guest behavior, the damage is
      limited to that guest.  But for the shadow_root_level, a misconfigured
      MMU can cause KVM to incorrectly access memory, e.g. due to walking off
      the end of its shadow page tables.
      
      Fixes: 7dcd5755 ("x86/kvm/mmu: check if tdp/shadow MMU reconfiguration is needed")
      Cc: Yu Zhang <yu.c.zhang@linux.intel.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210622175739.3610207-7-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
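The effect of invalidating a cached role can be shown with a toy model. The sketch below assumes that an MMU skips reinitialization when the newly computed role equals the cached one; the class, the `None` sentinel, and the function names are all hypothetical stand-ins, not KVM's data structures.

```python
from dataclasses import dataclass

INVALID_ROLE = None  # hypothetical sentinel standing in for an invalidated role


@dataclass
class ToyMmu:
    role: object = INVALID_ROLE
    init_count: int = 0

    def init_context(self, new_role):
        """Reinitialize only if the computed role differs from the cached one."""
        if self.role == new_role:
            return False  # cached role matches, so initialization is skipped
        self.role = new_role
        self.init_count += 1
        return True


def after_cpuid_update(mmus):
    """Invalidate every cached role so the next init_context() cannot be skipped."""
    for mmu in mmus:
        mmu.role = INVALID_ROLE
```

In this model a CPUID property that is not part of the role (for example 1gb hugepage support) would never trigger a reinitialization on its own, which is why the role is invalidated unconditionally after a CPUID update.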
    • Revert "KVM: x86/mmu: Drop kvm_mmu_extended_role.cr4_la57 hack" · f71a53d1
      Sean Christopherson committed
      Restore CR4.LA57 to the mmu_role to fix an amusing edge case with nested
      virtualization.  When KVM (L0) is using TDP, CR4.LA57 is not reflected in
      mmu_role.base.level because that tracks the shadow root level, i.e. TDP
      level.  Normally, this is not an issue because LA57 can't be toggled
      while long mode is active, i.e. the guest has to first disable paging,
      then toggle LA57, then re-enable paging, thus ensuring an MMU
      reinitialization.
      
      But if L1 is crafty, it can load a new CR4 on VM-Exit and toggle LA57
      without having to bounce through an unpaged section.  L1 can also load a
      new CR3 on exit, i.e. it doesn't even need to play crazy paging games, a
      single entry PML5 is sufficient.  Such shenanigans are only problematic
      if L0 and L1 use TDP, otherwise L1 and L2 share an MMU that gets
      reinitialized on nested VM-Enter/VM-Exit due to mmu_role.base.guest_mode.
      
      Note, in the L2 case with nested TDP, even though L1 can switch between
      L2s with different LA57 settings, thus bypassing the paging requirement,
      in that case KVM's nested_mmu will track LA57 in base.level.
      
      This reverts commit 8053f924.
      
      Fixes: 8053f924 ("KVM: x86/mmu: Drop kvm_mmu_extended_role.cr4_la57 hack")
      Cc: stable@vger.kernel.org
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210622175739.3610207-6-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
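A small model makes the edge case concrete. The sketch assumes, as the message states, that with TDP the base level in the role is the TDP level rather than the guest's paging level; the dictionary layout and the `track_la57` switch are illustrative only.

```python
def tdp_role(tdp_level, cr4_la57, track_la57):
    """Toy role for a TDP MMU: 'level' is the TDP level, not the guest's level."""
    role = {"level": tdp_level}
    if track_la57:
        # With the revert applied, CR4.LA57 is part of the role again.
        role["cr4_la57"] = bool(cr4_la57)
    return role


def la57_toggle_changes_role(tdp_level, track_la57):
    """True if flipping CR4.LA57 yields a different role, forcing a reinit."""
    before = tdp_role(tdp_level, False, track_la57)
    after = tdp_role(tdp_level, True, track_la57)
    return before != after
```

Without the tracked bit, toggling LA57 leaves the role unchanged in this model, so nothing prompts the MMU to reinitialize.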
    • KVM: x86/mmu: Use MMU's role to detect CR4.SMEP value in nested NPT walk · ef318b9e
      Sean Christopherson committed
      Use the MMU's role to get its effective SMEP value when injecting a fault
      into the guest.  When walking L1's (nested) NPT while L2 is active, vCPU
      state will reflect L2, whereas NPT uses the host's (L1 in this case) CR0,
      CR4, EFER, etc...  If L1 and L2 have different settings for SMEP and
      L1 does not have EFER.NX=1, this can result in an incorrect PFEC.FETCH
      when injecting #NPF.
      
      Fixes: e57d4a35 ("KVM: Add instruction fetch checking when walking guest page table")
      Cc: stable@vger.kernel.org
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210622175739.3610207-5-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
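The following sketch shows why the source of the SMEP value matters. It assumes the rule implied by the message, that the FETCH bit is reported for an instruction fetch only when NX or SMEP is enabled for the MMU being walked; the function name and parameters are hypothetical.

```python
PFERR_FETCH_MASK = 1 << 4  # instruction-fetch bit of the x86 page-fault error code


def fault_error_code(base_code, fetch_fault, nx, smep):
    """Set FETCH only if the walked MMU has NX or SMEP enabled."""
    if fetch_fault and (nx or smep):
        return base_code | PFERR_FETCH_MASK
    return base_code
```

For example, if L1 runs with NX=0 and SMEP=0 while L2 has SMEP=1, taking SMEP from the current vCPU state (L2) reports FETCH, whereas taking it from the role of the MMU that walks L1's NPT does not.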
    • KVM: x86: Properly reset MMU context at vCPU RESET/INIT · 0aa18375
      Sean Christopherson committed
      Reset the MMU context at vCPU INIT (and RESET for good measure) if CR0.PG
      was set prior to INIT.  Simply re-initializing the current MMU is not
      sufficient as the current root HPA may not be usable in the new context.
      E.g. if TDP is disabled and INIT arrives while the vCPU is in long mode,
      KVM will fail to switch to the 32-bit pae_root and bomb on the next
      VM-Enter due to running with a 64-bit CR3 in 32-bit mode.
      
      This bug was papered over in both VMX and SVM, but still managed to rear
      its head in the MMU role on VMX.  Because EFER.LMA=1 requires CR0.PG=1,
      kvm_calc_shadow_mmu_root_page_role() checks for EFER.LMA without first
      checking CR0.PG.  VMX's RESET/INIT flow writes CR0 before EFER, and so
      an INIT with the vCPU in 64-bit mode will cause the hack-a-fix to
      generate the wrong MMU role.
      
      In VMX, the INIT issue is specific to running without unrestricted guest
      since unrestricted guest is available if and only if EPT is enabled.
      Commit 8668a3c4 ("KVM: VMX: Reset mmu context when entering real
      mode") resolved the issue by forcing a reset when entering emulated real
      mode.
      
      In SVM, commit ebae871a ("kvm: svm: reset mmu on VCPU reset") forced
      an MMU reset on every INIT to work around the flaw in common x86.  Note, at
      the time the bug was fixed, the SVM problem was exacerbated by a complete
      lack of a CR4 update.
      
      The vendor resets will be reverted in future patches, primarily to aid
      bisection in case there are non-INIT flows that rely on the existing VMX
      logic.
      
      Because CR0.PG is unconditionally cleared on INIT, and because CR0.WP and
      all CR4/EFER paging bits are ignored if CR0.PG=0, simply checking that
      CR0.PG was '1' prior to INIT/RESET is sufficient to detect a required MMU
      context reset.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210622175739.3610207-4-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
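The detection rule from the last paragraph of this message reduces to a single bit test. The sketch below is a Python model with a hypothetical function name; the only architectural fact it relies on is that CR0.PG is bit 31 of CR0.

```python
X86_CR0_PG = 1 << 31  # CR0.PG, the paging-enable bit


def needs_mmu_reset_on_init(old_cr0):
    """A reset is required only if paging was enabled before INIT/RESET.

    INIT always clears CR0.PG, and the other paging bits are ignored when
    CR0.PG=0, so the old CR0.PG value alone decides the outcome.
    """
    return bool(old_cr0 & X86_CR0_PG)
```

A vCPU that was still in real mode when INIT arrived therefore skips the reset in this model, while one that had paging enabled gets a full MMU context reset.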
    • KVM: x86/mmu: Treat NX as used (not reserved) for all !TDP shadow MMUs · 112022bd
      Sean Christopherson committed
      Mark NX as being used for all non-nested shadow MMUs, as KVM will set the
      NX bit for huge SPTEs if the iTLB multi-hit mitigation is enabled.
      Checking the mitigation itself is not sufficient as it can be toggled on
      at any time and KVM doesn't reset MMU contexts when that happens.  KVM
      could reset the contexts, but that would require purging all SPTEs in all
      MMUs, for no real benefit.  And, KVM already forces EFER.NX=1 when TDP is
      disabled (for WP=0, SMEP=1, NX=0), so technically NX is never reserved
      for shadow MMUs.
      
      Fixes: b8e8c830 ("kvm: mmu: ITLB_MULTIHIT mitigation")
      Cc: stable@vger.kernel.org
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210622175739.3610207-3-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Remove broken WARN that fires on 32-bit KVM w/ nested EPT · f0d43790
      Sean Christopherson committed
      Remove a misguided WARN that attempts to detect the scenario where using
      a special A/D tracking flag will set reserved bits on a non-MMIO spte.
      The WARN triggers false positives when using EPT with 32-bit KVM because
      of the !64-bit clause, which is just flat out wrong.  The whole A/D
      tracking goo is specific to EPT, and one of the big selling points of EPT
      is that EPT is decoupled from the host's native paging mode.
      
      Drop the WARN instead of trying to salvage the check.  Keeping a check
      specific to A/D tracking bits would essentially regurgitate the same code
      that led to KVM needing the tracking bits in the first place.
      
      A better approach would be to add a generic WARN on reserved bits being
      set, which would naturally cover the A/D tracking bits, work for all
      flavors of paging, and be self-documenting to some extent.
      
      Fixes: 8a406c89 ("KVM: x86/mmu: Rename and document A/D scheme for TDP SPTEs")
      Cc: stable@vger.kernel.org
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210622175739.3610207-2-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: debugfs: Reuse binary stats descriptors · bc9e9e67
      Jing Zhang committed
      To remove code duplication, use the binary stats descriptors in the
      implementation of the debugfs interface for statistics. This unifies
      the definition of statistics for the binary and debugfs interfaces.
      Signed-off-by: Jing Zhang <jingzhangos@google.com>
      Message-Id: <20210618222709.1858088-8-jingzhangos@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: selftests: Add selftest for KVM statistics data binary interface · 0b45d587
      Jing Zhang committed
      Add selftest to check KVM stats descriptors validity.
      Reviewed-by: David Matlack <dmatlack@google.com>
      Reviewed-by: Ricardo Koller <ricarkol@google.com>
      Reviewed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
      Tested-by: Fuad Tabba <tabba@google.com> #arm64
      Signed-off-by: Jing Zhang <jingzhangos@google.com>
      Message-Id: <20210618222709.1858088-7-jingzhangos@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: stats: Add documentation for binary statistics interface · fdc09ddd
      Jing Zhang committed
      This new API provides a file descriptor for every VM and VCPU to read
      KVM statistics data in binary format.
      It is meant to provide a lightweight, flexible, scalable and efficient
      lock-free solution for user space telemetry applications to pull the
      statistics data periodically for large scale systems. The pulling
      frequency could be as high as a few times per second.
      The statistics descriptors are defined by KVM in kernel and can be
      used by userspace to discover VM/VCPU statistics during the one-time
      setup stage.
      The statistics data itself could be read out by userspace telemetry
      periodically without any extra parsing or setup effort.
      A few interface protocols and definitions already exist, but none of
      them fulfills all of the requirements this interface implements, which
      are listed below:
      1. During high frequency periodic stats reading, there should be no
         extra work beyond reading the stats data itself.
      2. Support stats annotation, like type (cumulative, instantaneous,
         peak, histogram, etc) and unit (counter, time, size, cycles, etc).
      3. The stats data reading should be free of locks and synchronization.
         Consistency between all the stats data is not a goal, since the
         stats cannot all be read at exactly the same time; what matters is
         the change or trend of the stats data. The lock-free solution is
         not just for efficiency and scalability, but also for the accuracy
         and usability of the stats data. For example, if all stats data
         reads were protected by a global lock and one VCPU died with that
         lock held, then all stats data reading would be blocked, and the
         stats data could not reveal which VCPU had died.
      4. The stats data reading workload can be handed over to another
         unprivileged process.
      Reviewed-by: David Matlack <dmatlack@google.com>
      Reviewed-by: Ricardo Koller <ricarkol@google.com>
      Reviewed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
      Reviewed-by: Fuad Tabba <tabba@google.com>
      Signed-off-by: Jing Zhang <jingzhangos@google.com>
      Message-Id: <20210618222709.1858088-6-jingzhangos@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
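The split between one-time setup and cheap periodic reads (requirement 1 above) can be sketched in Python. The binary layouts used here are assumptions and do not come from the commit text: a header of six 32-bit fields (flags, name_size, num_desc, id_offset, desc_offset, data_offset) and a descriptor of flags, exponent, size, offset and an unused field followed by a fixed-size name. Check them against the `kvm_stats_header` and `kvm_stats_desc` definitions in your kernel headers. All helper names are hypothetical, and an in-memory blob stands in for the real statistics file descriptor.

```python
import io
import struct

# Assumed layouts; verify against the kernel's uapi definitions before use.
HEADER_FMT = "<6I"   # flags, name_size, num_desc, id_offset, desc_offset, data_offset
DESC_FMT = "<IhHII"  # flags, exponent, size, offset, unused (name follows)


def read_layout(f):
    """One-time setup: parse the header and descriptors once."""
    f.seek(0)
    header = f.read(struct.calcsize(HEADER_FMT))
    _flags, name_size, num_desc, _id_off, desc_off, data_off = struct.unpack(
        HEADER_FMT, header)
    fixed = struct.calcsize(DESC_FMT)
    fields = []
    f.seek(desc_off)
    for _ in range(num_desc):
        raw = f.read(fixed + name_size)
        _dflags, _exp, size, offset, _unused = struct.unpack_from(DESC_FMT, raw)
        name = raw[fixed:].split(b"\0", 1)[0].decode()
        fields.append((name, offset, size))
    return data_off, fields


def read_stats(f, data_off, fields):
    """Periodic path: read only the data block, with no re-parsing."""
    out = {}
    for name, offset, size in fields:
        f.seek(data_off + offset)
        values = struct.unpack(f"<{size}Q", f.read(8 * size))
        out[name] = values[0] if size == 1 else list(values)
    return out


def make_fake_blob(stats, name_size=16):
    """Build a synthetic blob for demonstration; stats maps name -> u64 value."""
    id_off = struct.calcsize(HEADER_FMT)
    desc_off = id_off + name_size
    desc_size = struct.calcsize(DESC_FMT) + name_size
    data_off = desc_off + desc_size * len(stats)
    blob = struct.pack(HEADER_FMT, 0, name_size, len(stats),
                       id_off, desc_off, data_off)
    blob += b"fake-vm".ljust(name_size, b"\0")
    for i, name in enumerate(stats):
        blob += struct.pack(DESC_FMT, 0, 0, 1, 8 * i, 0)
        blob += name.encode().ljust(name_size, b"\0")
    for value in stats.values():
        blob += struct.pack("<Q", value)
    return blob
```

A telemetry loop would call `read_layout()` once and then call `read_stats()` at each sampling interval. Against a real statistics file descriptor, the seek-and-read pairs would more naturally be positional reads at the same offsets.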
    • KVM: stats: Support binary stats retrieval for a VCPU · ce55c049
      Jing Zhang committed
      Add a VCPU ioctl that returns a statistics file descriptor, through
      which userspace can read the VCPU stats header, descriptors and data.
      Define VCPU statistics descriptors and header for all architectures.
      Reviewed-by: David Matlack <dmatlack@google.com>
      Reviewed-by: Ricardo Koller <ricarkol@google.com>
      Reviewed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
      Reviewed-by: Fuad Tabba <tabba@google.com>
      Tested-by: Fuad Tabba <tabba@google.com> #arm64
      Signed-off-by: Jing Zhang <jingzhangos@google.com>
      Message-Id: <20210618222709.1858088-5-jingzhangos@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: stats: Support binary stats retrieval for a VM · fcfe1bae
      Jing Zhang committed
      Add a VM ioctl that returns a statistics file descriptor, through
      which userspace can read the VM stats header, descriptors and data.
      Define VM statistics descriptors and header for all architectures.
      Reviewed-by: David Matlack <dmatlack@google.com>
      Reviewed-by: Ricardo Koller <ricarkol@google.com>
      Reviewed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
      Reviewed-by: Fuad Tabba <tabba@google.com>
      Tested-by: Fuad Tabba <tabba@google.com> #arm64
      Signed-off-by: Jing Zhang <jingzhangos@google.com>
      Message-Id: <20210618222709.1858088-4-jingzhangos@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2. 24 Jun, 2021 (26 commits)
3. 23 Jun, 2021 (1 commit)
4. 22 Jun, 2021 (1 commit)