1. 01 11月, 2017 1 次提交
    • P
      KVM: PPC: Book3S HV: Allow for running POWER9 host in single-threaded mode · 516f7898
      Paul Mackerras 提交于
      This patch allows for a mode on POWER9 hosts where we control all the
      threads of a core, much as we do on POWER8.  The mode is controlled by
      a module parameter on the kvm_hv module, called "indep_threads_mode".
      The normal mode on POWER9 is the "independent threads" mode, with
      indep_threads_mode=Y, where the host is in SMT4 mode (or in fact any
      desired SMT mode) and each thread independently enters and exits from
      KVM guests without reference to what other threads in the core are
      doing.
      
      If indep_threads_mode is set to N at the point when a VM is started,
      KVM will expect every core that the guest runs on to be in single
      threaded mode (that is, threads 1, 2 and 3 offline), and will set the
      flag that prevents secondary threads from coming online.  We can still
      use all four threads; the code that implements dynamic micro-threading
      on POWER8 will become active in over-commit situations and will allow
      up to three other VCPUs to be run on the secondary threads of the core
      whenever a VCPU is run.
      
      The reason for wanting this mode is that this will allow us to run HPT
      guests on a radix host on a POWER9 machine that does not support
      "mixed mode", that is, having some threads in a core be in HPT mode
      while other threads are in radix mode.  It will also make it possible
      to implement a "strict threads" mode in future, if desired.
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      516f7898
  2. 19 10月, 2017 1 次提交
  3. 14 10月, 2017 1 次提交
    • P
      KVM: PPC: Book3S HV: Handle unexpected interrupts better · 857b99e1
      Paul Mackerras 提交于
      At present, if an interrupt (i.e. an exception or trap) occurs in the
      code where KVM is switching the MMU to or from guest context, we jump
      to kvmppc_bad_host_intr, where we simply spin with interrupts disabled.
      In this situation, it is hard to debug what happened because we get no
      indication as to which interrupt occurred or where.  Typically we get
      a cascade of stall and soft lockup warnings from other CPUs.
      
      In order to get more information for debugging, this adds code to
      create a stack frame on the emergency stack and save register values
      to it.  We start half-way down the emergency stack in order to give
      ourselves some chance of being able to do a stack trace on secondary
      threads that are already on the emergency stack.
      
      On POWER7 or POWER8, we then just spin, as before, because we don't
      know what state the MMU context is in or what other threads are doing,
      and we can't switch back to host context without coordinating with
      other threads.  On POWER9 we can do better; there we load up the host
      MMU context and jump to C code, which prints an oops message to the
      console and panics.
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      857b99e1
  4. 22 9月, 2017 1 次提交
    • M
      KVM: PPC: Book3S HV: Check for updated HDSISR on P9 HDSI exception · e001fa78
      Michael Neuling 提交于
      On POWER9 DD2.1 and below, sometimes on a Hypervisor Data Storage
      Interrupt (HDSI) the HDSISR is not be updated at all.
      
      To work around this we put a canary value into the HDSISR before
      returning to a guest and then check for this canary when we take a
      HDSI. If we find the canary on a HDSI, we know the hardware didn't
      update the HDSISR. In this case we return to the guest to retake the
      HDSI which should correctly update the HDSISR the second time HDSI
      entry.
      
      After talking to Paulus we've applied this workaround to all POWER9
      CPUs. The workaround of returning to the guest shouldn't ever be
      triggered on well behaving CPU. The extra instructions should have
      negligible performance impact.
      Signed-off-by: NMichael Neuling <mikey@neuling.org>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      e001fa78
  5. 12 9月, 2017 1 次提交
    • P
      KVM: PPC: Book3S HV: Fix bug causing host SLB to be restored incorrectly · 67f8a8c1
      Paul Mackerras 提交于
      Aneesh Kumar reported seeing host crashes when running recent kernels
      on POWER8.  The symptom was an oops like this:
      
      Unable to handle kernel paging request for data at address 0xf00000000786c620
      Faulting instruction address: 0xc00000000030e1e4
      Oops: Kernel access of bad area, sig: 11 [#1]
      LE SMP NR_CPUS=2048 NUMA PowerNV
      Modules linked in: powernv_op_panel
      CPU: 24 PID: 6663 Comm: qemu-system-ppc Tainted: G        W 4.13.0-rc7-43932-gfc36c59 #2
      task: c000000fdeadfe80 task.stack: c000000fdeb68000
      NIP:  c00000000030e1e4 LR: c00000000030de6c CTR: c000000000103620
      REGS: c000000fdeb6b450 TRAP: 0300   Tainted: G        W        (4.13.0-rc7-43932-gfc36c59)
      MSR:  9000000000009033 <SF,HV,EE,ME,IR,DR,RI,LE>  CR: 24044428  XER: 20000000
      CFAR: c00000000030e134 DAR: f00000000786c620 DSISR: 40000000 SOFTE: 0
      GPR00: 0000000000000000 c000000fdeb6b6d0 c0000000010bd000 000000000000e1b0
      GPR04: c00000000115e168 c000001fffa6e4b0 c00000000115d000 c000001e1b180386
      GPR08: f000000000000000 c000000f9a8913e0 f00000000786c600 00007fff587d0000
      GPR12: c000000fdeb68000 c00000000fb0f000 0000000000000001 00007fff587cffff
      GPR16: 0000000000000000 c000000000000000 00000000003fffff c000000fdebfe1f8
      GPR20: 0000000000000004 c000000fdeb6b8a8 0000000000000001 0008000000000040
      GPR24: 07000000000000c0 00007fff587cffff c000000fdec20bf8 00007fff587d0000
      GPR28: c000000fdeca9ac0 00007fff587d0000 00007fff587c0000 00007fff587d0000
      NIP [c00000000030e1e4] __get_user_pages_fast+0x434/0x1070
      LR [c00000000030de6c] __get_user_pages_fast+0xbc/0x1070
      Call Trace:
      [c000000fdeb6b6d0] [c00000000139dab8] lock_classes+0x0/0x35fe50 (unreliable)
      [c000000fdeb6b7e0] [c00000000030ef38] get_user_pages_fast+0xf8/0x120
      [c000000fdeb6b830] [c000000000112318] kvmppc_book3s_hv_page_fault+0x308/0xf30
      [c000000fdeb6b960] [c00000000010e10c] kvmppc_vcpu_run_hv+0xfdc/0x1f00
      [c000000fdeb6bb20] [c0000000000e915c] kvmppc_vcpu_run+0x2c/0x40
      [c000000fdeb6bb40] [c0000000000e5650] kvm_arch_vcpu_ioctl_run+0x110/0x300
      [c000000fdeb6bbe0] [c0000000000d6468] kvm_vcpu_ioctl+0x528/0x900
      [c000000fdeb6bd40] [c0000000003bc04c] do_vfs_ioctl+0xcc/0x950
      [c000000fdeb6bde0] [c0000000003bc930] SyS_ioctl+0x60/0x100
      [c000000fdeb6be30] [c00000000000b96c] system_call+0x58/0x6c
      Instruction dump:
      7ca81a14 2fa50000 41de0010 7cc8182a 68c60002 78c6ffe2 0b060000 3cc2000a
      794a3664 390610d8 e9080000 7d485214 <e90a0020> 7d435378 790507e1 408202f0
      ---[ end trace fad4a342d0414aa2 ]---
      
      It turns out that what has happened is that the SLB entry for the
      vmmemap region hasn't been reloaded on exit from a guest, and it has
      the wrong page size.  Then, when the host next accesses the vmemmap
      region, it gets a page fault.
      
      Commit a25bd72b ("powerpc/mm/radix: Workaround prefetch issue with
      KVM", 2017-07-24) modified the guest exit code so that it now only clears
      out the SLB for hash guest.  The code tests the radix flag and puts the
      result in a non-volatile CR field, CR2, and later branches based on CR2.
      
      Unfortunately, the kvmppc_save_tm function, which gets called between
      those two points, modifies all the user-visible registers in the case
      where the guest was in transactional or suspended state, except for a
      few which it restores (namely r1, r2, r9 and r13).  Thus the hash/radix indication in CR2 gets corrupted.
      
      This fixes the problem by re-doing the comparison just before the
      result is needed.  For good measure, this also adds comments next to
      the call sites of kvmppc_save_tm and kvmppc_restore_tm pointing out
      that non-volatile register state will be lost.
      
      Cc: stable@vger.kernel.org # v4.13
      Fixes: a25bd72b ("powerpc/mm/radix: Workaround prefetch issue with KVM")
      Tested-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      67f8a8c1
  6. 31 8月, 2017 2 次提交
  7. 29 8月, 2017 1 次提交
  8. 24 8月, 2017 1 次提交
  9. 26 7月, 2017 1 次提交
    • B
      powerpc/mm/radix: Workaround prefetch issue with KVM · a25bd72b
      Benjamin Herrenschmidt 提交于
      There's a somewhat architectural issue with Radix MMU and KVM.
      
      When coming out of a guest with AIL (Alternate Interrupt Location, ie,
      MMU enabled), we start executing hypervisor code with the PID register
      still containing whatever the guest has been using.
      
      The problem is that the CPU can (and will) then start prefetching or
      speculatively load from whatever host context has that same PID (if
      any), thus bringing translations for that context into the TLB, which
      Linux doesn't know about.
      
      This can cause stale translations and subsequent crashes.
      
      Fixing this in a way that is neither racy nor a huge performance
      impact is difficult. We could just make the host invalidations always
      use broadcast forms but that would hurt single threaded programs for
      example.
      
      We chose to fix it instead by partitioning the PID space between guest
      and host. This is possible because today Linux only use 19 out of the
      20 bits of PID space, so existing guests will work if we make the host
      use the top half of the 20 bits space.
      
      We additionally add support for a property to indicate to Linux the
      size of the PID register which will be useful if we eventually have
      processors with a larger PID space available.
      
      There is still an issue with malicious guests purposefully setting the
      PID register to a value in the hosts PID range. Hopefully future HW
      can prevent that, but in the meantime, we handle it with a pair of
      kludges:
      
       - On the way out of a guest, before we clear the current VCPU in the
         PACA, we check the PID and if it's outside of the permitted range
         we flush the TLB for that PID.
      
       - When context switching, if the mm is "new" on that CPU (the
         corresponding bit was set for the first time in the mm cpumask), we
         check if any sibling thread is in KVM (has a non-NULL VCPU pointer
         in the PACA). If that is the case, we also flush the PID for that
         CPU (core).
      
      This second part is needed to handle the case where a process is
      migrated (or starts a new pthread) on a sibling thread of the CPU
      coming out of KVM, as there's a window where stale translations can
      exist before we detect it and flush them out.
      
      A future optimization could be added by keeping track of whether the
      PID has ever been used and avoid doing that for completely fresh PIDs.
      We could similarily mark PIDs that have been the subject of a global
      invalidation as "fresh". But for now this will do.
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      [mpe: Rework the asm to build with CONFIG_PPC_RADIX_MMU=n, drop
            unneeded include of kvm_book3s_asm.h]
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      a25bd72b
  10. 01 7月, 2017 1 次提交
    • P
      KVM: PPC: Book3S HV: Close race with testing for signals on guest entry · 8b24e69f
      Paul Mackerras 提交于
      At present, interrupts are hard-disabled fairly late in the guest
      entry path, in the assembly code.  Since we check for pending signals
      for the vCPU(s) task(s) earlier in the guest entry path, it is
      possible for a signal to be delivered before we enter the guest but
      not be noticed until after we exit the guest for some other reason.
      
      Similarly, it is possible for the scheduler to request a reschedule
      while we are in the guest entry path, and we won't notice until after
      we have run the guest, potentially for a whole timeslice.
      
      Furthermore, with a radix guest on POWER9, we can take the interrupt
      with the MMU on.  In this case we end up leaving interrupts
      hard-disabled after the guest exit, and they are likely to stay
      hard-disabled until we exit to userspace or context-switch to
      another process.  This was masking the fact that we were also not
      setting the RI (recoverable interrupt) bit in the MSR, meaning
      that if we had taken an interrupt, it would have crashed the host
      kernel with an unrecoverable interrupt message.
      
      To close these races, we need to check for signals and reschedule
      requests after hard-disabling interrupts, and then keep interrupts
      hard-disabled until we enter the guest.  If there is a signal or a
      reschedule request from another CPU, it will send an IPI, which will
      cause a guest exit.
      
      This puts the interrupt disabling before we call kvmppc_start_thread()
      for all the secondary threads of this core that are going to run vCPUs.
      The reason for that is that once we have started the secondary threads
      there is no easy way to back out without going through at least part
      of the guest entry path.  However, kvmppc_start_thread() includes some
      code for radix guests which needs to call smp_call_function(), which
      must be called with interrupts enabled.  To solve this problem, this
      patch moves that code into a separate function that is called earlier.
      
      When the guest exit is caused by an external interrupt, a hypervisor
      doorbell or a hypervisor maintenance interrupt, we now handle these
      using the replay facility.  __kvmppc_vcore_entry() now returns the
      trap number that caused the exit on this thread, and instead of the
      assembly code jumping to the handler entry, we return to C code with
      interrupts still hard-disabled and set the irq_happened flag in the
      PACA, so that when we do local_irq_enable() the appropriate handler
      gets called.
      
      With all this, we now have the interrupt soft-enable flag clear while
      we are in the guest.  This is useful because code in the real-mode
      hypercall handlers that checks whether interrupts are enabled will
      now see that they are disabled, which is correct, since interrupts
      are hard-disabled in the real-mode code.
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      8b24e69f
  11. 22 6月, 2017 1 次提交
    • A
      KVM: PPC: Book3S HV: Exit guest upon MCE when FWNMI capability is enabled · e20bbd3d
      Aravinda Prasad 提交于
      Enhance KVM to cause a guest exit with KVM_EXIT_NMI
      exit reason upon a machine check exception (MCE) in
      the guest address space if the KVM_CAP_PPC_FWNMI
      capability is enabled (instead of delivering a 0x200
      interrupt to guest). This enables QEMU to build error
      log and deliver machine check exception to guest via
      guest registered machine check handler.
      
      This approach simplifies the delivery of machine
      check exception to guest OS compared to the earlier
      approach of KVM directly invoking 0x200 guest interrupt
      vector.
      
      This design/approach is based on the feedback for the
      QEMU patches to handle machine check exception. Details
      of earlier approach of handling machine check exception
      in QEMU and related discussions can be found at:
      
      https://lists.nongnu.org/archive/html/qemu-devel/2014-11/msg00813.html
      
      Note:
      
      This patch now directly invokes machine_check_print_event_info()
      from kvmppc_handle_exit_hv() to print the event to host console
      at the time of guest exit before the exception is passed on to the
      guest. Hence, the host-side handling which was performed earlier
      via machine_check_fwnmi is removed.
      
      The reasons for this approach is (i) it is not possible
      to distinguish whether the exception occurred in the
      guest or the host from the pt_regs passed on the
      machine_check_exception(). Hence machine_check_exception()
      calls panic, instead of passing on the exception to
      the guest, if the machine check exception is not
      recoverable. (ii) the approach introduced in this
      patch gives opportunity to the host kernel to perform
      actions in virtual mode before passing on the exception
      to the guest. This approach does not require complex
      tweaks to machine_check_fwnmi and friends.
      Signed-off-by: NAravinda Prasad <aravinda@linux.vnet.ibm.com>
      Reviewed-by: NDavid Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: NMahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      e20bbd3d
  12. 19 6月, 2017 4 次提交
    • N
      powerpc/64s/idle: Avoid SRR usage in idle sleep/wake paths · 9d292501
      Nicholas Piggin 提交于
      Idle code now always runs at the 0xc... effective address whether
      in real or virtual mode. This means rfid can be ditched, along
      with a lot of SRR manipulations.
      
      In the wakeup path, carry SRR1 around in r12. Use mtmsrd to change
      MSR states as required.
      
      This also balances the return prediction for the idle call, by
      doing blr rather than rfid to return to the idle caller.
      
      On POWER9, 2-process context switch on different cores, with snooze
      disabled, increases performance by 2%.
      Signed-off-by: NNicholas Piggin <npiggin@gmail.com>
      [mpe: Incorporate v2 fixes from Nick]
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      9d292501
    • P
      KVM: PPC: Book3S HV: Virtualize doorbell facility on POWER9 · 57900694
      Paul Mackerras 提交于
      On POWER9, we no longer have the restriction that we had on POWER8
      where all threads in a core have to be in the same partition, so
      the CPU threads are now independent.  However, we still want to be
      able to run guests with a virtual SMT topology, if only to allow
      migration of guests from POWER8 systems to POWER9.
      
      A guest that has a virtual SMT mode greater than 1 will expect to
      be able to use the doorbell facility; it will expect the msgsndp
      and msgclrp instructions to work appropriately and to be able to read
      sensible values from the TIR (thread identification register) and
      DPDES (directed privileged doorbell exception status) special-purpose
      registers.  However, since each CPU thread is a separate sub-processor
      in POWER9, these instructions and registers can only be used within
      a single CPU thread.
      
      In order for these instructions to appear to act correctly according
      to the guest's virtual SMT mode, we have to trap and emulate them.
      We cause them to trap by clearing the HFSCR_MSGP bit in the HFSCR
      register.  The emulation is triggered by the hypervisor facility
      unavailable interrupt that occurs when the guest uses them.
      
      To cause a doorbell interrupt to occur within the guest, we set the
      DPDES register to 1.  If the guest has interrupts enabled, the CPU
      will generate a doorbell interrupt and clear the DPDES register in
      hardware.  The DPDES hardware register for the guest is saved in the
      vcpu->arch.vcore->dpdes field.  Since this gets written by the guest
      exit code, other VCPUs wishing to cause a doorbell interrupt don't
      write that field directly, but instead set a vcpu->arch.doorbell_request
      flag.  This is consumed and set to 0 by the guest entry code, which
      then sets DPDES to 1.
      
      Emulating reads of the DPDES register is somewhat involved, because
      it requires reading the doorbell pending interrupt status of all of the
      VCPU threads in the virtual core, and if any of those VCPUs are
      running, their doorbell status is only up-to-date in the hardware
      DPDES registers of the CPUs where they are running.  In order to get
      a reasonable approximation of the current doorbell status, we send
      those CPUs an IPI, causing an exit from the guest which will update
      the vcpu->arch.vcore->dpdes field.  We then use that value in
      constructing the emulated DPDES register value.
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      57900694
    • P
      KVM: PPC: Book3S HV: Context-switch HFSCR between host and guest on POWER9 · 769377f7
      Paul Mackerras 提交于
      This adds code to allow us to use a different value for the HFSCR
      (Hypervisor Facilities Status and Control Register) when running the
      guest from that which applies in the host.  The reason for doing this
      is to allow us to trap the msgsndp instruction and related operations
      in future so that they can be virtualized.  We also save the value of
      HFSCR when a hypervisor facility unavailable interrupt occurs, because
      the high byte of HFSCR indicates which facility the guest attempted to
      access.
      
      We save and restore the host value on guest entry/exit because some
      bits of it affect host userspace execution.
      
      We only do all this on POWER9, not on POWER8, because we are not
      intending to virtualize any of the facilities controlled by HFSCR on
      POWER8.  In particular, the HFSCR bit that controls execution of
      msgsndp and related operations does not exist on POWER8.  The HFSCR
      doesn't exist at all on POWER7.
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      769377f7
    • P
      KVM: PPC: Book3S HV: Enable guests to use large decrementer mode on POWER9 · 1bc3fe81
      Paul Mackerras 提交于
      This allows userspace (e.g. QEMU) to enable large decrementer mode for
      the guest when running on a POWER9 host, by setting the LPCR_LD bit in
      the guest LPCR value.  With this, the guest exit code saves 64 bits of
      the guest DEC value on exit.  Other places that use the guest DEC
      value check the LPCR_LD bit in the guest LPCR value, and if it is set,
      omit the 32-bit sign extension that would otherwise be done.
      
      This doesn't change the DEC emulation used by PR KVM because PR KVM
      is not supported on POWER9 yet.
      
      This is partly based on an earlier patch by Oliver O'Halloran.
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      1bc3fe81
  13. 16 6月, 2017 1 次提交
    • P
      KVM: PPC: Book3S HV: Save/restore host values of debug registers · 7ceaa6dc
      Paul Mackerras 提交于
      At present, HV KVM on POWER8 and POWER9 machines loses any instruction
      or data breakpoint set in the host whenever a guest is run.
      Instruction breakpoints are currently only used by xmon, but ptrace
      and the perf_event subsystem can set data breakpoints as well as xmon.
      
      To fix this, we save the host values of the debug registers (CIABR,
      DAWR and DAWRX) before entering the guest and restore them on exit.
      To provide space to save them in the stack frame, we expand the stack
      frame allocated by kvmppc_hv_entry() from 112 to 144 bytes.
      
      Fixes: b005255e ("KVM: PPC: Book3S HV: Context-switch new POWER8 SPRs", 2014-01-08)
      Cc: stable@vger.kernel.org # v3.14+
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      7ceaa6dc
  14. 15 6月, 2017 1 次提交
    • P
      KVM: PPC: Book3S HV: Restore critical SPRs to host values on guest exit · 4c3bb4cc
      Paul Mackerras 提交于
      This restores several special-purpose registers (SPRs) to sane values
      on guest exit that were missed before.
      
      TAR and VRSAVE are readable and writable by userspace, and we need to
      save and restore them to prevent the guest from potentially affecting
      userspace execution (not that TAR or VRSAVE are used by any known
      program that run uses the KVM_RUN ioctl).  We save/restore these
      in kvmppc_vcpu_run_hv() rather than on every guest entry/exit.
      
      FSCR affects userspace execution in that it can prohibit access to
      certain facilities by userspace.  We restore it to the normal value
      for the task on exit from the KVM_RUN ioctl.
      
      IAMR is normally 0, and is restored to 0 on guest exit.  However,
      with a radix host on POWER9, it is set to a value that prevents the
      kernel from executing user-accessible memory.  On POWER9, we save
      IAMR on guest entry and restore it on guest exit to the saved value
      rather than 0.  On POWER8 we continue to set it to 0 on guest exit.
      
      PSPB is normally 0.  We restore it to 0 on guest exit to prevent
      userspace taking advantage of the guest having set it non-zero
      (which would allow userspace to set its SMT priority to high).
      
      UAMOR is normally 0.  We restore it to 0 on guest exit to prevent
      the AMR from being used as a covert channel between userspace
      processes, since the AMR is not context-switched at present.
      
      Fixes: b005255e ("KVM: PPC: Book3S HV: Context-switch new POWER8 SPRs", 2014-01-08)
      Cc: stable@vger.kernel.org # v3.14+
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      4c3bb4cc
  15. 29 5月, 2017 1 次提交
    • P
      KVM: PPC: Book3S HV: Cope with host using large decrementer mode · 2f272463
      Paul Mackerras 提交于
      POWER9 introduces a new mode for the decrementer register, called
      large decrementer mode, in which the decrementer counter is 56 bits
      wide rather than 32, and reads are sign-extended rather than
      zero-extended.  For the decrementer, this new mode is optional and
      controlled by a bit in the LPCR.  The hypervisor decrementer (HDEC)
      is 56 bits wide on POWER9 and has no mode control.
      
      Since KVM code reads and writes the decrementer and hypervisor
      decrementer registers in a few places, it needs to be aware of the
      need to treat the decrementer value as a 64-bit quantity, and only do
      a 32-bit sign extension when large decrementer mode is not in effect.
      Similarly, the HDEC should always be treated as a 64-bit quantity on
      POWER9.  We define a new EXTEND_HDEC macro to encapsulate the feature
      test for POWER9 and the sign extension.
      
      To enable the sign extension to be removed in large decrementer mode,
      we test the LPCR_LD bit in the host LPCR image stored in the struct
      kvm for the guest.  If is set then large decrementer mode is enabled
      and the sign extension should be skipped.
      
      This is partly based on an earlier patch by Oliver O'Halloran.
      
      Cc: stable@vger.kernel.org # v4.10+
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      2f272463
  16. 27 4月, 2017 1 次提交
  17. 01 3月, 2017 1 次提交
    • P
      KVM: PPC: Book3S HV: Don't use ASDR for real-mode HPT faults on POWER9 · 4e5acdc2
      Paul Mackerras 提交于
      In HPT mode on POWER9, the ASDR register is supposed to record
      segment information for hypervisor page faults.  It turns out that
      POWER9 DD1 does not record the page size information in the ASDR
      for faults in guest real mode.  We have the necessary information
      in memory already, so by moving the checks for real mode that already
      existed, we can use the in-memory copy.  Since a load is likely to
      be faster than reading an SPR, we do this unconditionally (not just
      for POWER9 DD1).
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      4e5acdc2
  18. 31 1月, 2017 6 次提交
    • P
      KVM: PPC: Book3S HV: Invalidate ERAT on guest entry/exit for POWER9 DD1 · f11f6f79
      Paul Mackerras 提交于
      On POWER9 DD1, we need to invalidate the ERAT (effective to real
      address translation cache) when changing the PIDR register, which
      we do as part of guest entry and exit.
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      f11f6f79
    • P
      KVM: PPC: Book3S HV: Allow guest exit path to have MMU on · 53af3ba2
      Paul Mackerras 提交于
      If we allow LPCR[AIL] to be set for radix guests, then interrupts from
      the guest to the host can be delivered by the hardware with relocation
      on, and thus the code path starting at kvmppc_interrupt_hv can be
      executed in virtual mode (MMU on) for radix guests (previously it was
      only ever executed in real mode).
      
      Most of the code is indifferent to whether the MMU is on or off, but
      the calls to OPAL that use the real-mode OPAL entry code need to
      be switched to use the virtual-mode code instead.  The affected
      calls are the calls to the OPAL XICS emulation functions in
      kvmppc_read_one_intr() and related functions.  We test the MSR[IR]
      bit to detect whether we are in real or virtual mode, and call the
      opal_rm_* or opal_* function as appropriate.
      
      The other place that depends on the MMU being off is the optimization
      where the guest exit code jumps to the external interrupt vector or
      hypervisor doorbell interrupt vector, or returns to its caller (which
      is __kvmppc_vcore_entry).  If the MMU is on and we are returning to
      the caller, then we don't need to use an rfid instruction since the
      MMU is already on; a simple blr suffices.  If there is an external
      or hypervisor doorbell interrupt to handle, we branch to the
      relocation-on version of the interrupt vector.
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      53af3ba2
    • P
      KVM: PPC: Book3S HV: Invalidate TLB on radix guest vcpu movement · a29ebeaf
      Paul Mackerras 提交于
      With radix, the guest can do TLB invalidations itself using the tlbie
      (global) and tlbiel (local) TLB invalidation instructions.  Linux guests
      use local TLB invalidations for translations that have only ever been
      accessed on one vcpu.  However, that doesn't mean that the translations
      have only been accessed on one physical cpu (pcpu) since vcpus can move
      around from one pcpu to another.  Thus a tlbiel might leave behind stale
      TLB entries on a pcpu where the vcpu previously ran, and if that task
      then moves back to that previous pcpu, it could see those stale TLB
      entries and thus access memory incorrectly.  The usual symptom of this
      is random segfaults in userspace programs in the guest.
      
      To cope with this, we detect when a vcpu is about to start executing on
      a thread in a core that is a different core from the last time it
      executed.  If that is the case, then we mark the core as needing a
      TLB flush and then send an interrupt to any thread in the core that is
      currently running a vcpu from the same guest.  This will get those vcpus
      out of the guest, and the first one to re-enter the guest will do the
      TLB flush.  The reason for interrupting the vcpus executing on the old
      core is to cope with the following scenario:
      
      	CPU 0			CPU 1			CPU 4
      	(core 0)			(core 0)			(core 1)
      
      	VCPU 0 runs task X      VCPU 1 runs
      	core 0 TLB gets
      	entries from task X
      	VCPU 0 moves to CPU 4
      							VCPU 0 runs task X
      							Unmap pages of task X
      							tlbiel
      
      				(still VCPU 1)			task X moves to VCPU 1
      				task X runs
      				task X sees stale TLB
      				entries
      
      That is, as soon as the VCPU starts executing on the new core, it
      could unmap and tlbiel some page table entries, and then the task
      could migrate to one of the VCPUs running on the old core and
      potentially see stale TLB entries.
      
      Since the TLB is shared between all the threads in a core, we only
      use the bit of kvm->arch.need_tlb_flush corresponding to the first
      thread in the core.  To ensure that we don't have a window where we
      can miss a flush, this moves the clearing of the bit from before the
      actual flush to after it.  This way, two threads might both do the
      flush, but we prevent the situation where one thread can enter the
      guest before the flush is finished.
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      a29ebeaf
    • P
      KVM: PPC: Book3S HV: Modify guest entry/exit paths to handle radix guests · f4c51f84
      Paul Mackerras 提交于
      This adds code to  branch around the parts that radix guests don't
      need - clearing and loading the SLB with the guest SLB contents,
      saving the guest SLB contents on exit, and restoring the host SLB
      contents.
      
      Since the host is now using radix, we need to save and restore the
      host value for the PID register.
      
      On hypervisor data/instruction storage interrupts, we don't do the
      guest HPT lookup on radix, but just save the guest physical address
      for the fault (from the ASDR register) in the vcpu struct.
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      f4c51f84
    • P
      KVM: PPC: Book3S HV: Use ASDR for HPT guests on POWER9 · ef8c640c
      Paul Mackerras 提交于
      POWER9 adds a register called ASDR (Access Segment Descriptor
      Register), which is set by hypervisor data/instruction storage
      interrupts to contain the segment descriptor for the address
      being accessed, assuming the guest is using HPT translation.
      (For radix guests, it contains the guest real address of the
      access.)
      
      Thus, for HPT guests on POWER9, we can use this register rather
      than looking up the SLB with the slbfee. instruction.
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      ef8c640c
    • N
      KVM: PPC: Book3S: 64-bit CONFIG_RELOCATABLE support for interrupts · a97a65d5
      Nicholas Piggin 提交于
      64-bit Book3S exception handlers must find the dynamic kernel base
      to add to the target address when branching beyond __end_interrupts,
      in order to support kernel running at non-0 physical address.
      
      Support this in KVM by branching with CTR, similarly to regular
      interrupt handlers. The guest CTR saved in HSTATE_SCRATCH1 and
      restored after the branch.
      
      Without this, the host kernel hangs and crashes randomly when it is
      running at a non-0 address and a KVM guest is started.
      Signed-off-by: NNicholas Piggin <npiggin@gmail.com>
      Acked-by: NPaul Mackerras <paulus@ozlabs.org>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      a97a65d5
  19. 27 1月, 2017 1 次提交
  20. 24 11月, 2016 5 次提交
    • P
      KVM: PPC: Book3S HV: Use stop instruction rather than nap on POWER9 · bf53c88e
      Paul Mackerras 提交于
      POWER9 replaces the various power-saving mode instructions on POWER8
      (doze, nap, sleep and rvwinkle) with a single "stop" instruction, plus
      a register, PSSCR, which controls the depth of the power-saving mode.
      This replaces the use of the nap instruction when threads are idle
      during guest execution with the stop instruction, and adds code to
      set PSSCR to a value which will allow an SMT mode switch while the
      thread is idle (given that the core as a whole won't be idle in these
      cases).
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      bf53c88e
    • P
      KVM: PPC: Book3S HV: Adapt TLB invalidations to work on POWER9 · 7c5b06ca
      Paul Mackerras 提交于
      POWER9 adds new capabilities to the tlbie (TLB invalidate entry)
      and tlbiel (local tlbie) instructions.  Both instructions get a
      set of new parameters (RIC, PRS and R) which appear as bits in the
      instruction word.  The tlbiel instruction now has a second register
      operand, which contains a PID and/or LPID value if needed, and
      should otherwise contain 0.
      
      This adapts KVM-HV's usage of tlbie and tlbiel to work on POWER9
      as well as older processors.  Since we only handle HPT guests so
      far, we need RIC=0 PRS=0 R=0, which ends up with the same instruction
      word as on previous processors, so we don't need to conditionally
      execute different instructions depending on the processor.
      
      The local flush on first entry to a guest in book3s_hv_rmhandlers.S
      is a loop which depends on the number of TLB sets.  Rather than
      using feature sections to set the number of iterations based on
      which CPU we're on, we now work out this number at VM creation time
      and store it in the kvm_arch struct.  That will make it possible to
      get the number from the device tree in future, which will help with
      compatibility with future processors.
      
      Since mmu_partition_table_set_entry() does a global flush of the
      whole LPID, we don't need to do the TLB flush on first entry to the
      guest on each processor.  Therefore we don't set all bits in the
      tlb_need_flush bitmap on VM startup on POWER9.
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      7c5b06ca
    • P
      KVM: PPC: Book3S HV: Add new POWER9 guest-accessible SPRs · e9cf1e08
      Paul Mackerras 提交于
      This adds code to handle two new guest-accessible special-purpose
      registers on POWER9: TIDR (thread ID register) and PSSCR (processor
      stop status and control register).  They are context-switched
      between host and guest, and the guest values can be read and set
      via the one_reg interface.
      
      The PSSCR contains some fields which are guest-accessible and some
      which are only accessible in hypervisor mode.  We only allow the
      guest-accessible fields to be read or set by userspace.
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      e9cf1e08
    • P
      KVM: PPC: Book3S HV: Adjust host/guest context switch for POWER9 · 83677f55
      Paul Mackerras 提交于
      Some special-purpose registers that were present and accessible
      by guests on POWER8 no longer exist on POWER9, so this adds
      feature sections to ensure that we don't try to context-switch
      them when going into or out of a guest on POWER9.  These are
      all relatively obscure, rarely-used registers, but we had to
      context-switch them on POWER8 to avoid creating a covert channel.
      They are: SPMC1, SPMC2, MMCRS, CSIGR, TACR, TCSCR, and ACOP.
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      83677f55
    • P
      KVM: PPC: Book3S HV: Set partition table rather than SDR1 on POWER9 · 7a84084c
      Paul Mackerras 提交于
      On POWER9, the SDR1 register (hashed page table base address) is no
      longer used, and instead the hardware reads the HPT base address
      and size from the partition table.  The partition table entry also
      contains the bits that specify the page size for the VRMA mapping,
      which were previously in the LPCR.  The VPM0 bit of the LPCR is
      now reserved; the processor now always uses the VRMA (virtual
      real-mode area) mechanism for guest real-mode accesses in HPT mode,
      and the RMO (real-mode offset) mechanism has been dropped.
      
      When entering or exiting the guest, we now only have to set the
      LPIDR (logical partition ID register), not the SDR1 register.
      There is also no requirement now to transition via a reserved
      LPID value.
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      7a84084c
  21. 21 11月, 2016 1 次提交
    • P
      KVM: PPC: Book3S HV: Save/restore XER in checkpointed register state · 0d808df0
      Paul Mackerras 提交于
      When switching from/to a guest that has a transaction in progress,
      we need to save/restore the checkpointed register state.  Although
      XER is part of the CPU state that gets checkpointed, the code that
      does this saving and restoring doesn't save/restore XER.
      
      This fixes it by saving and restoring the XER.  To allow userspace
      to read/write the checkpointed XER value, we also add a new ONE_REG
      specifier.
      
      The visible effect of this bug is that the guest may see its XER
      value being corrupted when it uses transactions.
      
      Fixes: e4e38121 ("KVM: PPC: Book3S HV: Add transactional memory support")
      Fixes: 0a8eccef ("KVM: PPC: Book3S HV: Add missing code for transaction reclaim on guest exit")
      Cc: stable@vger.kernel.org # v3.15+
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      Reviewed-by: NThomas Huth <thuth@redhat.com>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      0d808df0
  22. 27 9月, 2016 1 次提交
    • P
      KVM: PPC: Book3S: Treat VTB as a per-subcore register, not per-thread · 88b02cf9
      Paul Mackerras 提交于
      POWER8 has one virtual timebase (VTB) register per subcore, not one
      per CPU thread.  The HV KVM code currently treats VTB as a per-thread
      register, which can lead to spurious soft lockup messages from guests
      which use the VTB as the time source for the soft lockup detector.
      (CPUs before POWER8 did not have the VTB register.)
      
      For HV KVM, this fixes the problem by making only the primary thread
      in each virtual core save and restore the VTB value.  With this,
      the VTB state becomes part of the kvmppc_vcore structure.  This
      also means that "piggybacking" of multiple virtual cores onto one
      subcore is not possible on POWER8, because then the virtual cores
      would share a single VTB register.
      
      PR KVM emulates a VTB register, which is per-vcpu because PR KVM
      has no notion of CPU threads or SMT.  For PR KVM we move the VTB
      state into the kvmppc_vcpu_book3s struct.
      
      Cc: stable@vger.kernel.org # v3.14+
      Reported-by: NThomas Huth <thuth@redhat.com>
      Tested-by: NThomas Huth <thuth@redhat.com>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      88b02cf9
  23. 12 9月, 2016 2 次提交
    • S
      KVM: PPC: Book3S HV: Complete passthrough interrupt in host · f7af5209
      Suresh Warrier 提交于
      In existing real mode ICP code, when updating the virtual ICP
      state, if there is a required action that cannot be completely
      handled in real mode, as for instance, a VCPU needs to be woken
      up, flags are set in the ICP to indicate the required action.
      This is checked when returning from hypercalls to decide whether
      the call needs switch back to the host where the action can be
      performed in virtual mode. Note that if h_ipi_redirect is enabled,
      real mode code will first try to message a free host CPU to
      complete this job instead of returning the host to do it ourselves.
      
      Currently, the real mode PCI passthrough interrupt handling code
      checks if any of these flags are set and simply returns to the host.
      This is not good enough as the trap value (0x500) is treated as an
      external interrupt by the host code. It is only when the trap value
      is a hypercall that the host code searches for and acts on unfinished
      work by calling kvmppc_xics_rm_complete.
      
      This patch introduces a special trap BOOK3S_INTERRUPT_HV_RM_HARD
      which is returned by KVM if there is unfinished business to be
      completed in host virtual mode after handling a PCI passthrough
      interrupt. The host checks for this special interrupt condition
      and calls into the kvmppc_xics_rm_complete, which is made an
      exported function for this reason.
      
      [paulus@ozlabs.org - moved logic to set r12 to BOOK3S_INTERRUPT_HV_RM_HARD
       in book3s_hv_rmhandlers.S into the end of kvmppc_check_wake_reason.]
      Signed-off-by: NSuresh Warrier <warrier@linux.vnet.ibm.com>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      f7af5209
    • S
      KVM: PPC: Book3S HV: Handle passthrough interrupts in guest · e3c13e56
      Suresh Warrier 提交于
      Currently, KVM switches back to the host to handle any external
      interrupt (when the interrupt is received while running in the
      guest). This patch updates real-mode KVM to check if an interrupt
      is generated by a passthrough adapter that is owned by this guest.
      If so, the real mode KVM will directly inject the corresponding
      virtual interrupt to the guest VCPU's ICS and also EOI the interrupt
      in hardware. In short, the interrupt is handled entirely in real
      mode in the guest context without switching back to the host.
      
      In some rare cases, the interrupt cannot be completely handled in
      real mode, for instance, a VCPU that is sleeping needs to be woken
      up. In this case, KVM simply switches back to the host with trap
      reason set to 0x500. This works, but it is clearly not very efficient.
      A following patch will distinguish this case and handle it
      correctly in the host. Note that we can use the existing
      check_too_hard() routine even though we are not in a hypercall to
      determine if there is unfinished business that needs to be
      completed in host virtual mode.
      
      The patch assumes that the mapping between hardware interrupt IRQ
      and virtual IRQ to be injected to the guest already exists for the
      PCI passthrough interrupts that need to be handled in real mode.
      If the mapping does not exist, KVM falls back to the default
      existing behavior.
      
      The KVM real mode code reads mappings from the mapped array in the
      passthrough IRQ map without taking any lock.  We carefully order the
      loads and stores of the fields in the kvmppc_irq_map data structure
      using memory barriers to avoid an inconsistent mapping being seen by
      the reader. Thus, although it is possible to miss a map entry, it is
      not possible to read a stale value.
      
      [paulus@ozlabs.org - get irq_chip from irq_map rather than pimap,
       pulled out powernv eoi change into a separate patch, made
       kvmppc_read_intr get the vcpu from the paca rather than being
       passed in, rewrote the logic at the end of kvmppc_read_intr to
       avoid deep indentation, simplified logic in book3s_hv_rmhandlers.S
       since we were always restoring SRR0/1 anyway, get rid of the cached
       array (just use the mapped array), removed the kick_all_cpus_sync()
       call, clear saved_xirr PACA field when we handle the interrupt in
       real mode, fix compilation with CONFIG_KVM_XICS=n.]
      Signed-off-by: NSuresh Warrier <warrier@linux.vnet.ibm.com>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      e3c13e56
  24. 09 9月, 2016 1 次提交
    • S
      KVM: PPC: Book3S HV: Convert kvmppc_read_intr to a C function · 37f55d30
      Suresh Warrier 提交于
      Modify kvmppc_read_intr to make it a C function.  Because it is called
      from kvmppc_check_wake_reason, any of the assembler code that calls
      either kvmppc_read_intr or kvmppc_check_wake_reason now has to assume
      that the volatile registers might have been modified.
      
      This also adds in the optimization of clearing saved_xirr in the case
      where we completely handle and EOI an IPI.  Without this, the next
      device interrupt will require two trips through the host interrupt
      handling code.
      
      [paulus@ozlabs.org - made kvmppc_check_wake_reason create a stack frame
       when it is calling kvmppc_read_intr, which means we can set r12 to
       the trap number (0x500) after the call to kvmppc_read_intr, instead
       of using r31.  Also moved the deliver_guest_interrupt label so as to
       restore XER and CTR, plus other minor tweaks.]
      Signed-off-by: NSuresh Warrier <warrier@linux.vnet.ibm.com>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      37f55d30
  25. 28 7月, 2016 2 次提交
    • P
      KVM: PPC: Book3S HV: Save/restore TM state in H_CEDE · 93d17397
      Paul Mackerras 提交于
      It turns out that if the guest does a H_CEDE while the CPU is in
      a transactional state, and the H_CEDE does a nap, and the nap
      loses the architected state of the CPU (which is is allowed to do),
      then we lose the checkpointed state of the virtual CPU.  In addition,
      the transactional-memory state recorded in the MSR gets reset back
      to non-transactional, and when we try to return to the guest, we take
      a TM bad thing type of program interrupt because we are trying to
      transition from non-transactional to transactional with a hrfid
      instruction, which is not permitted.
      
      The result of the program interrupt occurring at that point is that
      the host CPU will hang in an infinite loop with interrupts disabled.
      Thus this is a denial of service vulnerability in the host which can
      be triggered by any guest (and depending on the guest kernel, it can
      potentially triggered by unprivileged userspace in the guest).
      
      This vulnerability has been assigned the ID CVE-2016-5412.
      
      To fix this, we save the TM state before napping and restore it
      on exit from the nap, when handling a H_CEDE in real mode.  The
      case where H_CEDE exits to host virtual mode is already OK (as are
      other hcalls which exit to host virtual mode) because the exit
      path saves the TM state.
      
      Cc: stable@vger.kernel.org # v3.15+
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      93d17397
    • P
      KVM: PPC: Book3S HV: Pull out TM state save/restore into separate procedures · f024ee09
      Paul Mackerras 提交于
      This moves the transactional memory state save and restore sequences
      out of the guest entry/exit paths into separate procedures.  This is
      so that these sequences can be used in going into and out of nap
      in a subsequent patch.
      
      The only code changes here are (a) saving and restore LR on the
      stack, since these new procedures get called with a bl instruction,
      (b) explicitly saving r1 into the PACA instead of assuming that
      HSTATE_HOST_R1(r13) is already set, and (c) removing an unnecessary
      and redundant setting of MSR[TM] that should have been removed by
      commit 9d4d0bdd9e0a ("KVM: PPC: Book3S HV: Add transactional memory
      support", 2013-09-24) but wasn't.
      
      Cc: stable@vger.kernel.org # v3.15+
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      f024ee09