1. 31 May, 2018 2 commits
    • KVM: PPC: Book3S PR: Move kvmppc_save_tm/kvmppc_restore_tm to separate file · 009c872a
      Simon Guo committed
      This is a simple patch that just moves the kvmppc_save_tm/kvmppc_restore_tm()
      functions to tm.S. There is no logic change. These APIs will be
      restructured in later patches to improve readability.
      
      This prepares for reusing those APIs in both HV and PR PPC KVM.
      
      The move includes some slight changes:
      - surround some HV KVM specific code with CONFIG_KVM_BOOK3S_HV_POSSIBLE
      for compilation.
      - use _GLOBAL() to define kvmppc_save_tm/kvmppc_restore_tm()
      
      [paulus@ozlabs.org - rebased on top of 7b0e827c ("KVM: PPC: Book3S HV:
       Factor fake-suspend handling out of kvmppc_save/restore_tm", 2018-05-30)]
      Signed-off-by: Simon Guo <wei.guo.simon@gmail.com>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
    • KVM: PPC: Book3S HV: Factor fake-suspend handling out of kvmppc_save/restore_tm · 7b0e827c
      Paul Mackerras committed
      This splits out the handling of "fake suspend" mode, part of the
      hypervisor TM assist code for POWER9, and puts almost all of it in
      new kvmppc_save_tm_hv and kvmppc_restore_tm_hv functions.  The new
      functions branch to kvmppc_save/restore_tm if the CPU does not
      require hypervisor TM assistance.
      
      With this, it will be more straightforward to move kvmppc_save_tm and
      kvmppc_restore_tm to another file and use them for transactional
      memory support in PR KVM.  It also makes the code a
      bit clearer and reduces the number of feature sections.
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
  2. 18 May, 2018 2 commits
  3. 17 May, 2018 2 commits
    • KVM: PPC: Book 3S HV: Do ptesync in radix guest exit path · df158189
      Paul Mackerras committed
      A radix guest can execute tlbie instructions to invalidate TLB entries.
      After a tlbie or a group of tlbies, it must then do the architected
      sequence eieio; tlbsync; ptesync to ensure that the TLB invalidation
      has been processed by all CPUs in the system before it can rely on
      no CPU using any translation that it just invalidated.
      
      In fact it is the ptesync which does the actual synchronization in
      this sequence, and hardware has a requirement that the ptesync must
      be executed on the same CPU thread as the tlbies which it is expected
      to order.  Thus, if a vCPU gets moved from one physical CPU to
      another after it has done some tlbies but before it can get to do the
      ptesync, the ptesync will not have the desired effect when it is
      executed on the second physical CPU.
      
      To fix this, we do a ptesync in the exit path for radix guests.  If
      there are any pending tlbies, this will wait for them to complete.
      If there aren't, then ptesync will just do the same as sync.
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
    • KVM: PPC: Book3S HV: Snapshot timebase offset on guest entry · 57b8daa7
      Paul Mackerras committed
      Currently, the HV KVM guest entry/exit code adds the timebase offset
      from the vcore struct to the timebase on guest entry, and subtracts
      it on guest exit.  That is fine, except that it is possible for
      userspace to change the offset using the SET_ONE_REG interface while
      the vcore is running, as there is only one timebase offset per vcore
      but potentially multiple VCPUs in the vcore.  If that were to happen,
      KVM would subtract a different offset on guest exit from that which
      it had added on guest entry, leading to the timebase being out of sync
      between cores in the host, which then leads to bad things happening
      such as hangs and spurious watchdog timeouts.
      
      To fix this, we add a new field 'tb_offset_applied' to the vcore struct
      which stores the offset that is currently applied to the timebase.
      This value is set from the vcore tb_offset field on guest entry, and
      is what is subtracted from the timebase on guest exit.  Since it is
      zero when the timebase offset is not applied, we can simplify the
      logic in kvmhv_start_timing and kvmhv_accumulate_time.
      
      In addition, we had secondary threads reading the timebase while
      running concurrently with code on the primary thread which would
      eventually add or subtract the timebase offset from the timebase.
      This occurred while saving or restoring the DEC register value on
      the secondary threads.  Although no specific incorrect behaviour has
      been observed, this is a race which should be fixed.  To fix it, we
      move the DEC saving code to just before we call kvmhv_commence_exit,
      and the DEC restoring code to after the point where we have waited
      for the primary thread to switch the MMU context and add the timebase
      offset.  That way we are sure that the timebase contains the guest
      timebase value in both cases.
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
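The snapshot idea in the timebase commit above can be illustrated with a minimal C sketch. This is a simplified model, not the kernel code: the struct and function names are hypothetical, and only the tb_offset and tb_offset_applied field names come from the commit message.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for the vcore struct. */
struct vcore_model {
    int64_t tb_offset;          /* may be changed by userspace at any time */
    int64_t tb_offset_applied;  /* offset currently added to the timebase; 0 if none */
};

/* Guest entry: snapshot the offset, then apply the snapshot. */
static uint64_t model_guest_entry(struct vcore_model *vc, uint64_t timebase)
{
    assert(vc->tb_offset_applied == 0);  /* nothing is applied before entry */
    vc->tb_offset_applied = vc->tb_offset;
    return timebase + (uint64_t)vc->tb_offset_applied;
}

/* Guest exit: subtract what was actually applied, not the current tb_offset. */
static uint64_t model_guest_exit(struct vcore_model *vc, uint64_t timebase)
{
    timebase -= (uint64_t)vc->tb_offset_applied;
    vc->tb_offset_applied = 0;
    return timebase;
}
```

Because the exit path reads only the snapshot, a change to tb_offset while the vcore is running cannot unbalance the add and subtract in this model.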
  4. 31 March, 2018 1 commit
    • powerpc/kvm: Fix guest boot failure on Power9 since DAWR changes · ca9a16c3
      Aneesh Kumar K.V committed
      SLOF checks for 'sc 1' (hypercall) support by issuing a hcall with
      H_SET_DABR. Since the recent commit e8ebedbf ("KVM: PPC: Book3S
      HV: Return error from h_set_dabr() on POWER9") changed H_SET_DABR to
      return H_UNSUPPORTED on Power9, we see guest boot failures. The
      symptom is that the boot seems to just stop in SLOF, e.g.:
      
        SLOF ***************************************************************
        QEMU Starting
         Build Date = Sep 24 2017 12:23:07
         FW Version = buildd@ release 20170724
        <no further output>
      
      SLOF can cope if H_SET_DABR returns H_HARDWARE. So switch the return
      value to H_HARDWARE instead of H_UNSUPPORTED so that we don't break
      the guest boot.
      
      That does mean we return a different error to PowerVM in this case,
      but that's probably not a big concern.
      
      Fixes: e8ebedbf ("KVM: PPC: Book3S HV: Return error from h_set_dabr() on POWER9")
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
  5. 30 March, 2018 1 commit
  6. 27 March, 2018 2 commits
  7. 23 March, 2018 4 commits
    • KVM: PPC: Book3S HV: Work around TEXASR bug in fake suspend state · 681c617b
      Paul Mackerras committed
      This works around a hardware bug in "Nimbus" POWER9 DD2.2 processors,
      where the contents of the TEXASR can get corrupted while a thread is
      in fake suspend state.  The workaround is for the instruction emulation
      code to use the value saved at the most recent guest exit in real
      suspend mode.  We achieve this by simply not saving the TEXASR into
      the vcpu struct on an exit in fake suspend state.  We also have to
      take care to set the orig_texasr field only on guest exit in real
      suspend state.
      
      This also means that on guest entry in fake suspend state, TEXASR
      will be restored to the value it had on the last exit in real suspend
      state, effectively counteracting any hardware-caused corruption.  This
      works because TEXASR may not be written in suspend state.
      
      With this, the guest might see the wrong values in TEXASR if it reads
      it while in suspend state, but will see the correct value in
      non-transactional state (e.g. after a treclaim), and treclaim will
      work correctly.
      
      With this workaround, the code will actually run slightly faster, and
      will operate correctly on systems without the TEXASR bug (since TEXASR
      may not be written in suspend state, and is only changed by failure
      recording, which will have already been done before we get into fake
      suspend state).  Therefore these changes are not made subject to a CPU
      feature bit.
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • KVM: PPC: Book3S HV: Work around XER[SO] bug in fake suspend mode · 87a11bb6
      Suraj Jitindar Singh committed
      This works around a hardware bug in "Nimbus" POWER9 DD2.2 processors,
      where a treclaim performed in fake suspend mode can cause subsequent
      reads from the XER register to return inconsistent values for the SO
      (summary overflow) bit.  The inconsistent SO bit state can potentially
      be observed on any thread in the core.  We have to do the treclaim
      because that is the only way to get the thread out of suspend state
      (fake or real) and into non-transactional state.
      
      The workaround for the bug is to force the core into SMT4 mode before
      doing the treclaim.  This patch adds the code to do that, conditional
      on the CPU_FTR_P9_TM_XER_SO_BUG feature bit.
      Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • KVM: PPC: Book3S HV: Work around transactional memory bugs in POWER9 · 4bb3c7a0
      Paul Mackerras committed
      POWER9 has hardware bugs relating to transactional memory and thread
      reconfiguration (changes to hardware SMT mode).  Specifically, the core
      does not have enough storage to store a complete checkpoint of all the
      architected state for all four threads.  The DD2.2 version of POWER9
      includes hardware modifications designed to allow hypervisor software
      to implement workarounds for these problems.  This patch implements
      those workarounds in KVM code so that KVM guests see a full, working
      transactional memory implementation.
      
      The problems center around the use of TM suspended state, where the
      CPU has a checkpointed state but execution is not transactional.  The
      workaround is to implement a "fake suspend" state, which looks to the
      guest like suspended state but the CPU does not store a checkpoint.
      In this state, any instruction that would cause a transition to
      transactional state (rfid, rfebb, mtmsrd, tresume) or would use the
      checkpointed state (treclaim) causes a "soft patch" interrupt (vector
      0x1500) to the hypervisor so that it can be emulated.  The trechkpt
      instruction also causes a soft patch interrupt.
      
      On POWER9 DD2.2, we avoid returning to the guest in any state which
      would require a checkpoint to be present.  The trechkpt in the guest
      entry path which would normally create that checkpoint is replaced by
      either a transition to fake suspend state, if the guest is in suspend
      state, or a rollback to the pre-transactional state if the guest is in
      transactional state.  Fake suspend state is indicated by a flag in the
      PACA plus a new bit in the PSSCR.  The new PSSCR bit is write-only and
      reads back as 0.
      
      On exit from the guest, if the guest is in fake suspend state, we still
      do the treclaim instruction as we would in real suspend state, in order
      to get into non-transactional state, but we do not save the resulting
      register state since there was no checkpoint.
      
      Emulation of the instructions that cause a softpatch interrupt is
      handled in two paths.  If the guest is in real suspend mode, we call
      kvmhv_p9_tm_emulation_early() to handle the cases where the guest is
      transitioning to transactional state.  This is called before we do the
      treclaim in the guest exit path; because we haven't done treclaim, we
      can get back to the guest with the transaction still active.  If the
      instruction is a case that kvmhv_p9_tm_emulation_early() doesn't
      handle, or if the guest is in fake suspend state, then we proceed to
      do the complete guest exit path and subsequently call
      kvmhv_p9_tm_emulation() in host context with the MMU on.  This handles
      all the cases including the cases that generate program interrupts
      (illegal instruction or TM Bad Thing) and facility unavailable
      interrupts.
      
      The emulation is reasonably straightforward and is mostly concerned
      with checking for exception conditions and updating the state of
      registers such as MSR and CR0.  The treclaim emulation takes care to
      ensure that the TEXASR register gets updated as if it were the guest
      treclaim instruction that had done failure recording, not the treclaim
      done in hypervisor state in the guest exit path.
      
      With this, the KVM_CAP_PPC_HTM capability returns true (1) even if
      transactional memory is not available to host userspace.
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • KVM: PPC: Book3S HV: Fix duplication of host SLB entries · cda4a147
      Paul Mackerras committed
      Since commit 6964e6a4 ("KVM: PPC: Book3S HV: Do SLB load/unload
      with guest LPCR value loaded", 2018-01-11), we have been seeing
      occasional machine check interrupts on POWER8 systems when running
      KVM guests, due to SLB multihit errors.
      
      This turns out to be due to the guest exit code reloading the host
      SLB entries from the SLB shadow buffer when the SLB was not previously
      cleared in the guest entry path.  This can happen because the path
      which skips from the guest entry code to the guest exit code without
      entering the guest now does the skip before the SLB is cleared and
      loaded with guest values, but the host values are loaded after the
      point in the guest exit path that we skip to.
      
      To fix this, we move the code that reloads the host SLB values up
      so that it occurs just before the point in the guest exit code (the
      label guest_bypass:) where we skip to from the guest entry path.
      Reported-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Fixes: 6964e6a4 ("KVM: PPC: Book3S HV: Do SLB load/unload with guest LPCR value loaded")
      Tested-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
  8. 14 March, 2018 1 commit
    • KVM: PPC: Book3S HV: Fix trap number return from __kvmppc_vcore_entry · a8b48a4d
      Paul Mackerras committed
      This fixes a bug where the trap number that is returned by
      __kvmppc_vcore_entry gets corrupted.  The effect of the corruption
      is that IPIs get ignored on POWER9 systems when the IPI is sent via
      a doorbell interrupt to a CPU which is executing in a KVM guest.
      The effect of the IPI being ignored is often that another CPU locks
      up inside smp_call_function_many() (and if that CPU is holding a
      spinlock, other CPUs then lock up inside raw_spin_lock()).
      
      The trap number is currently held in register r12 for most of the
      assembly-language part of the guest exit path.  In that path, we
      call kvmppc_subcore_exit_guest(), which is a C function, without
      restoring r12 afterwards.  Depending on the kernel config and the
      compiler, it may modify r12 or it may not, so some config/compiler
      combinations see the bug and others don't.
      
      To fix this, we arrange for the trap number to be stored on the
      stack from the 'guest_bypass:' label until the end of the function,
      then the trap number is loaded and returned in r12 as before.
      
      Cc: stable@vger.kernel.org # v4.8+
      Fixes: fd7bacbc ("KVM: PPC: Book3S HV: Fix TB corruption in guest exit path on HMI interrupt")
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
  9. 09 February, 2018 1 commit
    • KVM: PPC: Book3S HV: Branch inside feature section · d20fe50a
      Alexander Graf committed
      We ended up with code that did a conditional branch inside a feature
      section to code outside of the feature section. Depending on how the
      object file gets organized, that might mean we exceed the 14-bit
      relocation limit for conditional branches:
      
        arch/powerpc/kvm/built-in.o:arch/powerpc/kvm/book3s_hv_rmhandlers.S:416:(__ftr_alt_97+0x8): relocation truncated to fit: R_PPC64_REL14 against `.text'+1ca4
      
      So instead of doing a conditional branch to code outside of the feature
      section, let's just branch to the end of the same section, making the
      branch very short.
      Signed-off-by: Alexander Graf <agraf@suse.de>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
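To make the relocation limit in the commit above concrete, here is a small C sketch of the range check implied by a 14-bit conditional-branch displacement. It assumes the usual encoding in which the 14-bit field is a signed displacement counted in 4-byte words; the helper and macro names are hypothetical and are not part of the kernel or the linker.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Byte range reachable by a signed 14-bit displacement counted in words. */
#define REL14_MIN (-32768)
#define REL14_MAX 32764

/* Hypothetical helper: would a branch of byte_offset bytes fit in REL14? */
static bool fits_rel14(int64_t byte_offset)
{
    if (byte_offset % 4 != 0)
        return false;               /* instructions are 4-byte aligned */
    return byte_offset >= REL14_MIN && byte_offset <= REL14_MAX;
}

/* Illustrative checks only. */
static void rel14_self_check(void)
{
    assert(fits_rel14(8));          /* short hop to the end of a section */
    assert(!fits_rel14(0x10000));   /* a far target does not fit */
}
```

A branch to the end of the same feature section stays within a few instructions, so it fits regardless of where the linker places the alternative section.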
  10. 19 January, 2018 5 commits
  11. 18 January, 2018 1 commit
    • KVM: PPC: Book3S HV: Improve handling of debug-trigger HMIs on POWER9 · d075745d
      Paul Mackerras committed
      Hypervisor maintenance interrupts (HMIs) are generated by various
      causes, signalled by bits in the hypervisor maintenance exception
      register (HMER).  In most cases calling OPAL to handle the interrupt
      is the correct thing to do, but the "debug trigger" HMIs signalled by
      PPC bit 17 (bit 46) of HMER are used to invoke software workarounds
      for hardware bugs, and OPAL does not have any code to handle this
      cause.  The debug trigger HMI is used in POWER9 DD2.0 and DD2.1 chips
      to work around a hardware bug in executing vector load instructions to
      cache inhibited memory.  In POWER9 DD2.2 chips, it is generated when
      conditions are detected relating to threads being in TM (transactional
      memory) suspended mode when the core SMT configuration needs to be
      reconfigured.
      
      The kernel currently has code to detect the vector CI load condition,
      but only when the HMI occurs in the host, not when it occurs in a
      guest.  If a HMI occurs in the guest, it is always passed to OPAL, and
      then we always re-sync the timebase, because the HMI cause might have
      been a timebase error, for which OPAL would re-sync the timebase, thus
      removing the timebase offset which KVM applied for the guest.  Since
      we don't know what OPAL did, we don't know whether to subtract the
      timebase offset from the timebase, so instead we re-sync the timebase.
      
      This adds code to determine explicitly what the cause of a debug
      trigger HMI will be.  This is based on a new device-tree property
      under the CPU nodes called ibm,hmi-special-triggers, if it is
      present, or otherwise based on the PVR (processor version register).
      The handling of debug trigger HMIs is pulled out into a separate
      function which can be called from the KVM guest exit code.  If this
      function handles and clears the HMI, and no other HMI causes remain,
      then we skip calling OPAL and we proceed to subtract the guest
      timebase offset from the timebase.
      
      The overall handling for HMIs that occur in the host (i.e. not in a
      KVM guest) is largely unchanged, except that we now don't set the flag
      for the vector CI load workaround on DD2.2 processors.
      
      This also removes a BUG_ON in the KVM code.  BUG_ON is generally not
      useful in KVM guest entry/exit code since it is difficult to handle
      the resulting trap gracefully.
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
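The commit above refers to "PPC bit 17 (bit 46)" of HMER. The two numbers name the same bit under different conventions, which the following C sketch shows. The function names are hypothetical, and treating every remaining set bit as a pending cause is a simplifying assumption of this model, not a statement about the kernel's check.

```c
#include <assert.h>
#include <stdint.h>

/* Convert an IBM-style bit number (0 = most significant) of a 64-bit
 * register to the conventional shift count (0 = least significant). */
static unsigned ppc_bit_to_shift(unsigned ppc_bit)
{
    assert(ppc_bit < 64);
    return 63u - ppc_bit;
}

/* Hypothetical mask for the debug-trigger cause, PPC bit 17. */
static uint64_t model_hmer_debug_trig_mask(void)
{
    return UINT64_C(1) << ppc_bit_to_shift(17);
}

/* Ignore the debug-trigger cause and report whether any other bit remains.
 * In this simplified model every set bit counts as a pending cause. */
static int model_other_causes_remain(uint64_t hmer)
{
    return (hmer & ~model_hmer_debug_trig_mask()) != 0;
}
```

In the model, OPAL would be skipped only when model_other_causes_remain() returns 0 after the debug trigger has been handled.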
  12. 17 January, 2018 2 commits
    • KVM: PPC: Book3S HV: Do SLB load/unload with guest LPCR value loaded · 6964e6a4
      Paul Mackerras committed
      This moves the code that loads and unloads the guest SLB values so that
      it is done while the guest LPCR value is loaded in the LPCR register.
      The reason for doing this is that on POWER9, the behaviour of the
      slbmte instruction depends on the LPCR[UPRT] bit.  If UPRT is 1, as
      it is for a radix host (or guest), the SLB index is truncated to
      2 bits.  This means that for a HPT guest on a radix host, the SLB
      was not being loaded correctly, causing the guest to crash.
      
      The SLB is now loaded much later in the guest entry path, after the
      LPCR is loaded, which for a secondary thread is after it sees that
      the primary thread has switched the MMU to the guest.  The loop that
      waits for the primary thread has a branch out to the exit code that
      is taken if it sees that other threads have commenced exiting the
      guest.  Since we have now not loaded the SLB at this point, we make
      this path branch to a new label 'guest_bypass' and we move the SLB
      unload code to before this label.
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
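The effect of the index truncation described in the commit above can be sketched in C. This is only a model of the behaviour the message describes (the index is truncated to 2 bits when UPRT is 1); the function name and the SLB size used for the bounds check are assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_SLB_ENTRIES 32  /* illustrative SLB size, an assumption */

/* Slot actually written for a requested index, given the modelled UPRT bit. */
static unsigned model_slb_slot(unsigned index, int uprt)
{
    assert(index < MODEL_SLB_ENTRIES);
    return uprt ? (index & 0x3u) : index;   /* UPRT=1 keeps only 2 index bits */
}
```

With UPRT set, indices 0, 4, 8 and so on all land in slot 0 in this model, so later entries overwrite earlier ones. That is why the SLB load has to happen while the guest LPCR value is in place.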
    • KVM: PPC: Book3S HV: Make sure we don't re-enter guest without XIVE loaded · 43ff3f65
      Paul Mackerras committed
      This fixes a bug where it is possible to enter a guest on a POWER9
      system without having the XIVE (interrupt controller) context loaded.
      This can happen because we unload the XIVE context from the CPU
      before doing the real-mode handling for machine checks.  After the
      real-mode handler runs, it is possible that we re-enter the guest
      via a fast path which does not load the XIVE context.
      
      To fix this, we move the unloading of the XIVE context to come after
      the real-mode machine check handler is called.
      
      Fixes: 5af50993 ("KVM: PPC: Book3S HV: Native usage of the XIVE interrupt controller")
      Cc: stable@vger.kernel.org # v4.11+
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
  13. 11 January, 2018 1 commit
  14. 10 January, 2018 1 commit
  15. 01 November, 2017 2 commits
    • KVM: PPC: Book3S HV: Run HPT guests on POWER9 radix hosts · c0101509
      Paul Mackerras committed
      This patch removes the restriction that a radix host can only run
      radix guests, allowing us to run HPT (hashed page table) guests as
      well.  This is useful because it provides a way to run old guest
      kernels that know about POWER8 but not POWER9.
      
      Unfortunately, POWER9 currently has a restriction that all threads
      in a given core must either all be in HPT mode, or all in radix mode.
      This means that when entering a HPT guest, we have to obtain control
      of all 4 threads in the core and get them to switch their LPIDR and
      LPCR registers, even if they are not going to run a guest.  On guest
      exit we also have to get all threads to switch LPIDR and LPCR back
      to host values.
      
      To make this feasible, we require that KVM not be in the "independent
      threads" mode, and that the CPU cores be in single-threaded mode from
      the host kernel's perspective (only thread 0 online; threads 1, 2 and
      3 offline).  That allows us to use the same code as on POWER8 for
      obtaining control of the secondary threads.
      
      To manage the LPCR/LPIDR changes required, we extend the kvm_split_info
      struct to contain the information needed by the secondary threads.
      All threads perform a barrier synchronization (where all threads wait
      for every other thread to reach the synchronization point) on guest
      entry, both before and after loading LPCR and LPIDR.  On guest exit,
      they all once again perform a barrier synchronization both before
      and after loading host values into LPCR and LPIDR.
      
      Finally, it is also currently necessary to flush the entire TLB every
      time we enter a HPT guest on a radix host.  We do this on thread 0
      with a loop of tlbiel instructions.
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
    • KVM: PPC: Book3S HV: Allow for running POWER9 host in single-threaded mode · 516f7898
      Paul Mackerras committed
      This patch allows for a mode on POWER9 hosts where we control all the
      threads of a core, much as we do on POWER8.  The mode is controlled by
      a module parameter on the kvm_hv module, called "indep_threads_mode".
      The normal mode on POWER9 is the "independent threads" mode, with
      indep_threads_mode=Y, where the host is in SMT4 mode (or in fact any
      desired SMT mode) and each thread independently enters and exits from
      KVM guests without reference to what other threads in the core are
      doing.
      
      If indep_threads_mode is set to N at the point when a VM is started,
      KVM will expect every core that the guest runs on to be in single
      threaded mode (that is, threads 1, 2 and 3 offline), and will set the
      flag that prevents secondary threads from coming online.  We can still
      use all four threads; the code that implements dynamic micro-threading
      on POWER8 will become active in over-commit situations and will allow
      up to three other VCPUs to be run on the secondary threads of the core
      whenever a VCPU is run.
      
      The reason for wanting this mode is that this will allow us to run HPT
      guests on a radix host on a POWER9 machine that does not support
      "mixed mode", that is, having some threads in a core be in HPT mode
      while other threads are in radix mode.  It will also make it possible
      to implement a "strict threads" mode in future, if desired.
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
  16. 19 October, 2017 1 commit
  17. 16 October, 2017 1 commit
    • KVM: PPC: Book3S HV: Add more barriers in XIVE load/unload code · ad98dd1a
      Benjamin Herrenschmidt committed
      On POWER9 systems, we push the VCPU context onto the XIVE (eXternal
      Interrupt Virtualization Engine) hardware when entering a guest,
      and pull the context off the XIVE when exiting the guest.  The push
      is done with cache-inhibited stores, and the pull with cache-inhibited
      loads.
      
      Testing has revealed that it is possible (though very rare) for
      the stores to get reordered with the loads so that we end up with the
      guest VCPU context still loaded on the XIVE after we have exited the
      guest.  When that happens, it is possible for the same VCPU context
      to then get loaded on another CPU, which causes the machine to
      checkstop.
      
      To fix this, we add I/O barrier instructions (eieio) before and
      after the push and pull operations.  As partial compensation for the
      potential slowdown caused by the extra barriers, we remove the eieio
      instructions between the two stores in the push operation, and between
      the two loads in the pull operation.  (The architecture requires
      loads to cache-inhibited, guarded storage to be kept in order, and
      requires stores to cache-inhibited, guarded storage likewise to be
      kept in order, but allows such loads and stores to be reordered with
      respect to each other.)
      Reported-by: Carol L Soto <clsoto@us.ibm.com>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
  18. 14 October, 2017 2 commits
    • KVM: PPC: Book3S HV: Handle unexpected interrupts better · 857b99e1
      Paul Mackerras committed
      At present, if an interrupt (i.e. an exception or trap) occurs in the
      code where KVM is switching the MMU to or from guest context, we jump
      to kvmppc_bad_host_intr, where we simply spin with interrupts disabled.
      In this situation, it is hard to debug what happened because we get no
      indication as to which interrupt occurred or where.  Typically we get
      a cascade of stall and soft lockup warnings from other CPUs.
      
      In order to get more information for debugging, this adds code to
      create a stack frame on the emergency stack and save register values
      to it.  We start half-way down the emergency stack in order to give
      ourselves some chance of being able to do a stack trace on secondary
      threads that are already on the emergency stack.
      
      On POWER7 or POWER8, we then just spin, as before, because we don't
      know what state the MMU context is in or what other threads are doing,
      and we can't switch back to host context without coordinating with
      other threads.  On POWER9 we can do better; there we load up the host
      MMU context and jump to C code, which prints an oops message to the
      console and panics.
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
    • KVM: PPC: Book3S HV: POWER9 more doorbell fixes · 2cde3716
      Nicholas Piggin committed
      - Add another case where msgsync is required.
      - Required barrier sequence for global doorbells is msgsync ; lwsync
      
      When msgsnd is used for IPIs to other cores, msgsync must be executed by
      the target to order stores performed on the source before its msgsnd
      (provided the source executes the appropriate sync).
      
      Fixes: 1704a81c ("KVM: PPC: Book3S HV: Use msgsnd for IPIs to other cores on POWER9")
      Cc: stable@vger.kernel.org # v4.10+
      Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
  19. 22 September, 2017 1 commit
    • KVM: PPC: Book3S HV: Check for updated HDSISR on P9 HDSI exception · e001fa78
      Michael Neuling committed
      On POWER9 DD2.1 and below, sometimes on a Hypervisor Data Storage
      Interrupt (HDSI) the HDSISR is not updated at all.
      
      To work around this we put a canary value into the HDSISR before
      returning to a guest and then check for this canary when we take a
      HDSI. If we find the canary on a HDSI, we know the hardware didn't
      update the HDSISR. In this case we return to the guest to retake the
      HDSI, which should correctly update the HDSISR on the second HDSI
      entry.
      
      After talking to Paulus we've applied this workaround to all POWER9
      CPUs. The workaround of returning to the guest shouldn't ever be
      triggered on a well-behaved CPU. The extra instructions should have
      negligible performance impact.
      Signed-off-by: Michael Neuling <mikey@neuling.org>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
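The canary technique in the HDSISR commit above can be modelled in a few lines of C. This is a sketch under stated assumptions: the struct, the function names and the canary constant are hypothetical, and the constant is assumed to be a value that hardware never stores in HDSISR on its own.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical canary; assumed to be a value hardware never writes itself. */
#define MODEL_HDSISR_CANARY UINT32_C(0x7fff)

/* Hypothetical per-CPU state holding only the register of interest. */
struct model_cpu {
    uint32_t hdsisr;
};

/* Before returning to the guest: plant the canary. */
static void model_prepare_guest_entry(struct model_cpu *cpu)
{
    assert(cpu);
    cpu->hdsisr = MODEL_HDSISR_CANARY;
}

/* On an HDSI: true means hardware left HDSISR stale, so the guest should
 * be re-entered to retake the interrupt. */
static bool model_hdsi_must_retry(const struct model_cpu *cpu)
{
    assert(cpu);
    return cpu->hdsisr == MODEL_HDSISR_CANARY;
}
```

On a CPU that always updates HDSISR, model_hdsi_must_retry() never returns true, which matches the commit's point that the retry path should not trigger on well-behaved hardware.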
  20. 12 September, 2017 1 commit
    • KVM: PPC: Book3S HV: Fix bug causing host SLB to be restored incorrectly · 67f8a8c1
      Paul Mackerras committed
      Aneesh Kumar reported seeing host crashes when running recent kernels
      on POWER8.  The symptom was an oops like this:
      
      Unable to handle kernel paging request for data at address 0xf00000000786c620
      Faulting instruction address: 0xc00000000030e1e4
      Oops: Kernel access of bad area, sig: 11 [#1]
      LE SMP NR_CPUS=2048 NUMA PowerNV
      Modules linked in: powernv_op_panel
      CPU: 24 PID: 6663 Comm: qemu-system-ppc Tainted: G        W 4.13.0-rc7-43932-gfc36c59 #2
      task: c000000fdeadfe80 task.stack: c000000fdeb68000
      NIP:  c00000000030e1e4 LR: c00000000030de6c CTR: c000000000103620
      REGS: c000000fdeb6b450 TRAP: 0300   Tainted: G        W        (4.13.0-rc7-43932-gfc36c59)
      MSR:  9000000000009033 <SF,HV,EE,ME,IR,DR,RI,LE>  CR: 24044428  XER: 20000000
      CFAR: c00000000030e134 DAR: f00000000786c620 DSISR: 40000000 SOFTE: 0
      GPR00: 0000000000000000 c000000fdeb6b6d0 c0000000010bd000 000000000000e1b0
      GPR04: c00000000115e168 c000001fffa6e4b0 c00000000115d000 c000001e1b180386
      GPR08: f000000000000000 c000000f9a8913e0 f00000000786c600 00007fff587d0000
      GPR12: c000000fdeb68000 c00000000fb0f000 0000000000000001 00007fff587cffff
      GPR16: 0000000000000000 c000000000000000 00000000003fffff c000000fdebfe1f8
      GPR20: 0000000000000004 c000000fdeb6b8a8 0000000000000001 0008000000000040
      GPR24: 07000000000000c0 00007fff587cffff c000000fdec20bf8 00007fff587d0000
      GPR28: c000000fdeca9ac0 00007fff587d0000 00007fff587c0000 00007fff587d0000
      NIP [c00000000030e1e4] __get_user_pages_fast+0x434/0x1070
      LR [c00000000030de6c] __get_user_pages_fast+0xbc/0x1070
      Call Trace:
      [c000000fdeb6b6d0] [c00000000139dab8] lock_classes+0x0/0x35fe50 (unreliable)
      [c000000fdeb6b7e0] [c00000000030ef38] get_user_pages_fast+0xf8/0x120
      [c000000fdeb6b830] [c000000000112318] kvmppc_book3s_hv_page_fault+0x308/0xf30
      [c000000fdeb6b960] [c00000000010e10c] kvmppc_vcpu_run_hv+0xfdc/0x1f00
      [c000000fdeb6bb20] [c0000000000e915c] kvmppc_vcpu_run+0x2c/0x40
      [c000000fdeb6bb40] [c0000000000e5650] kvm_arch_vcpu_ioctl_run+0x110/0x300
      [c000000fdeb6bbe0] [c0000000000d6468] kvm_vcpu_ioctl+0x528/0x900
      [c000000fdeb6bd40] [c0000000003bc04c] do_vfs_ioctl+0xcc/0x950
      [c000000fdeb6bde0] [c0000000003bc930] SyS_ioctl+0x60/0x100
      [c000000fdeb6be30] [c00000000000b96c] system_call+0x58/0x6c
      Instruction dump:
      7ca81a14 2fa50000 41de0010 7cc8182a 68c60002 78c6ffe2 0b060000 3cc2000a
      794a3664 390610d8 e9080000 7d485214 <e90a0020> 7d435378 790507e1 408202f0
      ---[ end trace fad4a342d0414aa2 ]---
      
      It turns out that what has happened is that the SLB entry for the
      vmemmap region hasn't been reloaded on exit from a guest, and it has
      the wrong page size.  Then, when the host next accesses the vmemmap
      region, it gets a page fault.
      
      Commit a25bd72b ("powerpc/mm/radix: Workaround prefetch issue with
      KVM", 2017-07-24) modified the guest exit code so that it now only clears
      out the SLB for a hash guest.  The code tests the radix flag and puts the
      result in a non-volatile CR field, CR2, and later branches based on CR2.
      
      Unfortunately, the kvmppc_save_tm function, which gets called between
      those two points, modifies all the user-visible registers in the case
      where the guest was in transactional or suspended state, except for a
      few which it restores (namely r1, r2, r9 and r13).  Thus the hash/radix
      indication in CR2 gets corrupted.
      
      This fixes the problem by re-doing the comparison just before the
      result is needed.  For good measure, this also adds comments next to
      the call sites of kvmppc_save_tm and kvmppc_restore_tm pointing out
      that non-volatile register state will be lost.
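      
      The failure mode and the fix can be illustrated with a small C model.
      This is only a sketch, not the kernel assembly: struct exit_model,
      model_save_tm, exit_path_buggy and exit_path_fixed are hypothetical
      names, and the cr2 field merely stands in for the CR field that
      caches the radix test.

```c
#include <stdbool.h>

/* Hypothetical model, not kernel code. */
struct exit_model {
    bool guest_is_radix;  /* authoritative flag, kept in memory */
    bool guest_in_tm;     /* guest was in transactional or suspended state */
    bool cr2;             /* cached copy of the radix test */
};

/* Models kvmppc_save_tm: in the TM case it clobbers register state,
 * represented here by overwriting the cached flag. */
static void model_save_tm(struct exit_model *m)
{
    if (m->guest_in_tm)
        m->cr2 = true;    /* arbitrary leftover value */
}

/* Buggy ordering: test early, call save_tm, then trust the cached result.
 * Returns true if the hash-guest SLB handling was performed. */
static bool exit_path_buggy(struct exit_model *m)
{
    m->cr2 = m->guest_is_radix;
    model_save_tm(m);
    return !m->cr2;
}

/* Fixed ordering: redo the comparison just before the result is needed. */
static bool exit_path_fixed(struct exit_model *m)
{
    m->cr2 = m->guest_is_radix;
    model_save_tm(m);
    m->cr2 = m->guest_is_radix;  /* re-test after the clobbering call */
    return !m->cr2;
}
```

      In this model, a hash guest that was in a transaction skips the SLB
      handling on the buggy path and gets it on the fixed path.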
      
      Cc: stable@vger.kernel.org # v4.13
      Fixes: a25bd72b ("powerpc/mm/radix: Workaround prefetch issue with KVM")
      Tested-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      67f8a8c1
  21. 31 Aug, 2017 2 commits
  22. 29 Aug, 2017 1 commit
  23. 24 Aug, 2017 1 commit
  24. 26 Jul, 2017 1 commit
    • B
      powerpc/mm/radix: Workaround prefetch issue with KVM · a25bd72b
      Benjamin Herrenschmidt committed
      There's a somewhat architectural issue with Radix MMU and KVM.
      
      When coming out of a guest with AIL (Alternate Interrupt Location, i.e.,
      MMU enabled), we start executing hypervisor code with the PID register
      still containing whatever the guest has been using.
      
      The problem is that the CPU can (and will) then start prefetching or
      speculatively loading from whatever host context has that same PID (if
      any), thus bringing translations for that context into the TLB, which
      Linux doesn't know about.
      
      This can cause stale translations and subsequent crashes.
      
      Fixing this in a way that is neither racy nor a huge performance
      impact is difficult. We could just make the host invalidations always
      use broadcast forms but that would hurt single threaded programs for
      example.
      
      We chose to fix it instead by partitioning the PID space between guest
      and host. This is possible because today Linux only uses 19 out of the
      20 bits of PID space, so existing guests will work if we make the host
      use the top half of the 20-bit space.
      
      We additionally add support for a property to indicate to Linux the
      size of the PID register which will be useful if we eventually have
      processors with a larger PID space available.
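      
      The partitioning arithmetic can be sketched as follows. This is an
      illustration under stated assumptions, not the patch itself:
      MODEL_PID_BITS, host_pid_first, pid_is_host and pid_is_guest are
      hypothetical names, and the 20-bit width is taken from the text
      above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed from the commit message: a 20-bit PID register. */
#define MODEL_PID_BITS 20u

/* First PID of the host half: the top bit of the PID space is set. */
static uint32_t host_pid_first(unsigned pid_bits)
{
    return UINT32_C(1) << (pid_bits - 1);
}

/* Host PIDs occupy the top half of the PID space. */
static bool pid_is_host(uint32_t pid, unsigned pid_bits)
{
    return pid >= host_pid_first(pid_bits) &&
           pid < (UINT32_C(1) << pid_bits);
}

/* Guest PIDs occupy the bottom half (19 bits when pid_bits is 20). */
static bool pid_is_guest(uint32_t pid, unsigned pid_bits)
{
    return pid < host_pid_first(pid_bits);
}
```

      Passing the width as a parameter mirrors the idea that the size of
      the PID register may differ on future processors.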
      
      There is still an issue with malicious guests purposefully setting the
      PID register to a value in the host's PID range. Hopefully future HW
      can prevent that, but in the meantime, we handle it with a pair of
      kludges:
      
       - On the way out of a guest, before we clear the current VCPU in the
         PACA, we check the PID and if it's outside of the permitted range
         we flush the TLB for that PID.
      
       - When context switching, if the mm is "new" on that CPU (the
         corresponding bit was set for the first time in the mm cpumask), we
         check if any sibling thread is in KVM (has a non-NULL VCPU pointer
         in the PACA). If that is the case, we also flush the PID for that
         CPU (core).
      
      This second part is needed to handle the case where a process is
      migrated (or starts a new pthread) on a sibling thread of the CPU
      coming out of KVM, as there's a window where stale translations can
      exist before we detect it and flush them out.
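      
      The two checks can be sketched as plain C predicates. All names
      here (struct thread_model, flush_on_guest_exit,
      flush_on_context_switch, guest_pid_limit) are hypothetical and only
      model the decisions described above; the actual flush is not shown.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-thread state; 'vcpu' models the VCPU pointer in the PACA. */
struct thread_model {
    const void *vcpu;  /* non-NULL while the thread is in KVM */
};

/* Check 1: on guest exit, flush if the guest left a PID outside its range.
 * guest_pid_limit is the first PID that does not belong to guests. */
static bool flush_on_guest_exit(uint32_t guest_pid, uint32_t guest_pid_limit)
{
    return guest_pid >= guest_pid_limit;
}

/* Check 2: on context switch, flush if the mm is new on this CPU and
 * any sibling thread of the core is currently in KVM. */
static bool flush_on_context_switch(bool mm_new_on_cpu,
                                    const struct thread_model *siblings,
                                    size_t nr_siblings)
{
    if (!mm_new_on_cpu)
        return false;
    for (size_t i = 0; i < nr_siblings; i++) {
        if (siblings[i].vcpu != NULL)
            return true;
    }
    return false;
}
```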
      
      A future optimization could be added by keeping track of whether the
      PID has ever been used and avoid doing that for completely fresh PIDs.
      We could similarly mark PIDs that have been the subject of a global
      invalidation as "fresh". But for now this will do.
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      [mpe: Rework the asm to build with CONFIG_PPC_RADIX_MMU=n, drop
            unneeded include of kvm_book3s_asm.h]
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      a25bd72b
  25. 01 Jul, 2017 1 commit
    • P
      KVM: PPC: Book3S HV: Close race with testing for signals on guest entry · 8b24e69f
      Paul Mackerras committed
      At present, interrupts are hard-disabled fairly late in the guest
      entry path, in the assembly code.  Since we check for pending signals
      for the vCPU(s) task(s) earlier in the guest entry path, it is
      possible for a signal to be delivered before we enter the guest but
      not be noticed until after we exit the guest for some other reason.
      
      Similarly, it is possible for the scheduler to request a reschedule
      while we are in the guest entry path, and we won't notice until after
      we have run the guest, potentially for a whole timeslice.
      
      Furthermore, with a radix guest on POWER9, we can take the interrupt
      with the MMU on.  In this case we end up leaving interrupts
      hard-disabled after the guest exit, and they are likely to stay
      hard-disabled until we exit to userspace or context-switch to
      another process.  This was masking the fact that we were also not
      setting the RI (recoverable interrupt) bit in the MSR, meaning
      that if we had taken an interrupt, it would have crashed the host
      kernel with an unrecoverable interrupt message.
      
      To close these races, we need to check for signals and reschedule
      requests after hard-disabling interrupts, and then keep interrupts
      hard-disabled until we enter the guest.  If there is a signal or a
      reschedule request from another CPU, it will send an IPI, which will
      cause a guest exit.
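      
      The ordering that closes the race can be modelled in a few lines of
      C. This is a sketch with hypothetical names (struct entry_model,
      guest_entry); re-enabling interrupts on the back-out path is an
      assumption of the model, not something stated above.

```c
#include <stdbool.h>

/* Hypothetical model of the state examined on guest entry. */
struct entry_model {
    bool hard_irqs_disabled;
    bool signal_pending;
    bool need_resched;
};

enum entry_result { ENTER_GUEST, BACK_OUT };

static enum entry_result guest_entry(struct entry_model *m)
{
    /* Step 1: hard-disable interrupts first, so nothing can be
     * delivered unnoticed between the check and guest entry. */
    m->hard_irqs_disabled = true;

    /* Step 2: only now check for signals and reschedule requests. */
    if (m->signal_pending || m->need_resched) {
        m->hard_irqs_disabled = false;  /* assumed back-out behaviour */
        return BACK_OUT;
    }

    /* Step 3: enter the guest with interrupts still hard-disabled. */
    return ENTER_GUEST;
}
```

      Anything that arrives after step 2 is not lost in this scheme,
      because the resulting IPI causes a guest exit.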
      
      This puts the interrupt disabling before we call kvmppc_start_thread()
      for all the secondary threads of this core that are going to run vCPUs.
      The reason for that is that once we have started the secondary threads
      there is no easy way to back out without going through at least part
      of the guest entry path.  However, kvmppc_start_thread() includes some
      code for radix guests which needs to call smp_call_function(), which
      must be called with interrupts enabled.  To solve this problem, this
      patch moves that code into a separate function that is called earlier.
      
      When the guest exit is caused by an external interrupt, a hypervisor
      doorbell or a hypervisor maintenance interrupt, we now handle these
      using the replay facility.  __kvmppc_vcore_entry() now returns the
      trap number that caused the exit on this thread, and instead of the
      assembly code jumping to the handler entry, we return to C code with
      interrupts still hard-disabled and set the irq_happened flag in the
      PACA, so that when we do local_irq_enable() the appropriate handler
      gets called.
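      
      The replay idea can be sketched as follows. Everything here is a
      simplified model with hypothetical names (enum model_trap, struct
      paca_model, note_guest_exit, model_local_irq_enable); the bit
      values are illustrative and are not the kernel's constants.

```c
#include <stdbool.h>

enum model_trap {
    TRAP_NONE, TRAP_EXTERNAL, TRAP_H_DOORBELL, TRAP_HMI, TRAP_OTHER
};

/* Hypothetical stand-in for the per-CPU PACA fields involved. */
struct paca_model {
    unsigned irq_happened;  /* bitmask of interrupts waiting to be replayed */
    unsigned handled;       /* bitmask of handlers that have run */
    bool hard_disabled;
};

/* Map a trap to an illustrative pending bit; other traps are not replayed. */
static unsigned trap_to_bit(enum model_trap t)
{
    switch (t) {
    case TRAP_EXTERNAL:   return 1u << 0;
    case TRAP_H_DOORBELL: return 1u << 1;
    case TRAP_HMI:        return 1u << 2;
    default:              return 0;
    }
}

/* Back in C code after guest exit: record the trap instead of
 * jumping to its handler. Interrupts stay hard-disabled. */
static void note_guest_exit(struct paca_model *p, enum model_trap trap)
{
    p->irq_happened |= trap_to_bit(trap);
}

/* Enabling interrupts replays whatever was recorded. */
static void model_local_irq_enable(struct paca_model *p)
{
    p->handled |= p->irq_happened;
    p->irq_happened = 0;
    p->hard_disabled = false;
}
```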
      
      With all this, we now have the interrupt soft-enable flag clear while
      we are in the guest.  This is useful because code in the real-mode
      hypercall handlers that checks whether interrupts are enabled will
      now see that they are disabled, which is correct, since interrupts
      are hard-disabled in the real-mode code.
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      8b24e69f