1. 24 3月, 2014 1 次提交
  2. 30 12月, 2013 1 次提交
    • M
      powerpc: Fix bad stack check in exception entry · 90ff5d68
      Michael Neuling 提交于
      In EXCEPTION_PROLOG_COMMON() we check to see if the stack pointer (r1)
      is valid when coming from the kernel.  If it's not valid, we die but
      with a nice oops message.
      
      Currently we allocate a stack frame (subtract INT_FRAME_SIZE) before we
      check to see if the stack pointer is negative.  Unfortunately, this
      won't detect a bad stack where r1 is less than INT_FRAME_SIZE.
      
      This patch fixes the check to compare the modified r1 with
      -INT_FRAME_SIZE.  With this, bad kernel stack pointers (including NULL
      pointers) are correctly detected again.
      
      Kudos to Paulus for finding this.
      Signed-off-by: NMichael Neuling <mikey@neuling.org>
      cc: stable@vger.kernel.org
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      90ff5d68
  3. 05 12月, 2013 1 次提交
  4. 17 10月, 2013 3 次提交
  5. 14 8月, 2013 2 次提交
    • M
      powerpc: Avoid link stack corruption for MMU on exceptions · bc2e6c6a
      Michael Neuling 提交于
      When we have MMU on exceptions (POWER8) and a relocatable kernel, we
      need to branch from the initial exception vectors at 0x0 to up high
      where the kernel might be located.  Currently we do this using the link
      register.
      
      Unfortunately this corrupts the link stack and instead we should use the
      count register.  We did this for the syscall entry path in:
        6a404806 powerpc: Avoid link stack corruption in MMU on syscall entry path
      but I stupidly forgot to do the same for other exceptions.
      
      This patch changes the initial exception vectors to use the count
      register instead of the link register when we need to branch up to the
      relocated kernel.
      
      I have a dodgy userspace test which loops calling a function that reads
      the PVR (mfpvr in userspace will be emulated by the kernel via the
      program check exception).  On POWER8 and with CONFIG_RELOCATABLE=y, I
      get a ~10% performance improvement with my userspace test with this
      patch.
      Signed-off-by: NMichael Neuling <mikey@neuling.org>
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      bc2e6c6a
    • T
      powerpc/ppc64: Rename SOFT_DISABLE_INTS with RECONCILE_IRQ_STATE · de021bb7
      Tiejun Chen 提交于
      The SOFT_DISABLE_INTS seems an odd name for something that updates the
      software state to be consistent with interrupts being hard disabled, so
      rename SOFT_DISABLE_INTS with RECONCILE_IRQ_STATE to avoid this confusion.
      Signed-off-by: NTiejun Chen <tiejun.chen@windriver.com>
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      de021bb7
  6. 01 7月, 2013 1 次提交
    • M
      powerpc: Remove KVMTEST from RELON exception handlers · c9f69518
      Michael Ellerman 提交于
      KVMTEST is a macro which checks whether we are taking an exception from
      guest context, if so we branch out of line and eventually call into the
      KVM code to handle the switch.
      
      When running real guests on bare metal (HV KVM) the hardware ensures
      that we never take a relocation on exception when transitioning from
      guest to host. For PR KVM we disable relocation on exceptions ourself in
      kvmppc_core_init_vm(), as of commit a413f474 "Disable relocation on
      exceptions whenever PR KVM is active".
      
      So convert all the RELON macros to use NOTEST, and drop the remaining
      KVM_HANDLER() definitions we have for 0xe40 and 0xe80.
      Signed-off-by: NMichael Ellerman <michael@ellerman.id.au>
      CC: <stable@vger.kernel.org> [v3.9+]
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      c9f69518
  7. 15 6月, 2013 1 次提交
    • M
      powerpc: Fix stack overflow crash in resume_kernel when ftracing · 0e37739b
      Michael Ellerman 提交于
      It's possible for us to crash when running with ftrace enabled, eg:
      
        Bad kernel stack pointer bffffd12 at c00000000000a454
        cpu 0x3: Vector: 300 (Data Access) at [c00000000ffe3d40]
            pc: c00000000000a454: resume_kernel+0x34/0x60
            lr: c00000000000335c: performance_monitor_common+0x15c/0x180
            sp: bffffd12
           msr: 8000000000001032
           dar: bffffd12
         dsisr: 42000000
      
      If we look at current's stack (paca->__current->stack) we see it is
      equal to c0000002ecab0000. Our stack is 16K, and comparing to
      paca->kstack (c0000002ecab3e30) we can see that we have overflowed our
      kernel stack. This leads to us writing over our struct thread_info, and
      in this case we have corrupted thread_info->flags and set
      _TIF_EMULATE_STACK_STORE.
      
      Dumping the stack we see:
      
        3:mon> t c0000002ecab0000
        [c0000002ecab0000] c00000000002131c .performance_monitor_exception+0x5c/0x70
        [c0000002ecab0080] c00000000000335c performance_monitor_common+0x15c/0x180
        --- Exception: f01 (Performance Monitor) at c0000000000fb2ec .trace_hardirqs_off+0x1c/0x30
        [c0000002ecab0370] c00000000016fdb0 .trace_graph_entry+0xb0/0x280 (unreliable)
        [c0000002ecab0410] c00000000003d038 .prepare_ftrace_return+0x98/0x130
        [c0000002ecab04b0] c00000000000a920 .ftrace_graph_caller+0x14/0x28
        [c0000002ecab0520] c0000000000d6b58 .idle_cpu+0x18/0x90
        [c0000002ecab05a0] c00000000000a934 .return_to_handler+0x0/0x34
        [c0000002ecab0620] c00000000001e660 .timer_interrupt+0x160/0x300
        [c0000002ecab06d0] c0000000000025dc decrementer_common+0x15c/0x180
        --- Exception: 901 (Decrementer) at c0000000000104d4 .arch_local_irq_restore+0x74/0xa0
        [c0000002ecab09c0] c0000000000fe044 .trace_hardirqs_on+0x14/0x30 (unreliable)
        [c0000002ecab0fb0] c00000000016fe3c .trace_graph_entry+0x13c/0x280
        [c0000002ecab1050] c00000000003d038 .prepare_ftrace_return+0x98/0x130
        [c0000002ecab10f0] c00000000000a920 .ftrace_graph_caller+0x14/0x28
        [c0000002ecab1160] c0000000000161f0 .__ppc64_runlatch_on+0x10/0x40
        [c0000002ecab11d0] c00000000000a934 .return_to_handler+0x0/0x34
        --- Exception: 901 (Decrementer) at c0000000000104d4 .arch_local_irq_restore+0x74/0xa0
      
        ... and so on
      
      __ppc64_runlatch_on() is called from RUNLATCH_ON in the exception entry
      path. At that point the irq state is not consistent, ie. interrupts are
      hard disabled (by the exception entry), but the paca soft-enabled flag
      may be out of sync.
      
      This leads to the local_irq_restore() in trace_graph_entry() actually
      enabling interrupts, which we do not want. Because we have not yet
      reprogrammed the decrementer we immediately take another decrementer
      exception, and recurse.
      
      The fix is twofold. Firstly make sure we call DISABLE_INTS before
      calling RUNLATCH_ON. The badly named DISABLE_INTS actually reconciles
      the irq state in the paca with the hardware, making it safe again to
      call local_irq_save/restore().
      
      Although that should be sufficient to fix the bug, we also mark the
      runlatch routines as notrace. They are called very early in the
      exception entry and we are asking for trouble tracing them. They are
      also fairly uninteresting and tracing them just adds unnecessary
      overhead.
      
      [ This regression was introduced by fe1952fc
        "powerpc: Rework runlatch code" by myself --BenH
      ]
      
      CC: <stable@vger.kernel.org> [v3.4+]
      Signed-off-by: NMichael Ellerman <michael@ellerman.id.au>
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      0e37739b
  8. 26 4月, 2013 1 次提交
    • P
      powerpc: Fix "attempt to move .org backwards" error · a485c709
      Paul Mackerras 提交于
      Building a 64-bit powerpc kernel with PR KVM enabled currently gives
      this error:
      
        AS      arch/powerpc/kernel/head_64.o
      arch/powerpc/kernel/exceptions-64s.S: Assembler messages:
      arch/powerpc/kernel/exceptions-64s.S:258: Error: attempt to move .org backwards
      make[2]: *** [arch/powerpc/kernel/head_64.o] Error 1
      
      This happens because the MASKABLE_EXCEPTION_PSERIES macro turns into
      33 instructions, but we only have space for 32 at the decrementer
      interrupt vector (from 0x900 to 0x980).
      
      In the code generated by the MASKABLE_EXCEPTION_PSERIES macro, we
      currently have two instances of the HMT_MEDIUM macro, which has the
      effect of setting the SMT thread priority to medium.  One is the
      first instruction, and is overwritten by a no-op on processors where
      we save the PPR (processor priority register), that is, POWER7 or
      later.  The other is after we have saved the PPR.
      
      In order to reduce the code at 0x900 by one instruction, we omit the
      first HMT_MEDIUM.  On processors without SMT this will have no effect
      since HMT_MEDIUM is a no-op there.  On POWER5 and RS64 machines this
      will mean that the first few instructions take a little longer in the
      case where a decrementer interrupt occurs when the hardware thread is
      running at low SMT priority.  On POWER6 and later machines, the
      hardware automatically boosts the thread priority when a decrementer
      interrupt is taken if the thread priority was below medium, so this
      change won't make any difference.
      
      The alternative would be to branch out of line after saving the CFAR.
      However, that would incur an extra overhead on all processors, whereas
      the approach adopted here only adds overhead on older threaded processors.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      a485c709
  9. 15 2月, 2013 2 次提交
    • P
      powerpc/kvm/book3s_hv: Preserve guest CFAR register value · 0acb9111
      Paul Mackerras 提交于
      The CFAR (Come-From Address Register) is a useful debugging aid that
      exists on POWER7 processors.  Currently HV KVM doesn't save or restore
      the CFAR register for guest vcpus, making the CFAR of limited use in
      guests.
      
      This adds the necessary code to capture the CFAR value saved in the
      early exception entry code (it has to be saved before any branch is
      executed), save it in the vcpu.arch struct, and restore it on entry
      to the guest.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      0acb9111
    • P
      powerpc: Save CFAR before branching in interrupt entry paths · 1707dd16
      Paul Mackerras 提交于
      Some of the interrupt vectors on 64-bit POWER server processors are
      only 32 bytes long, which is not enough for the full first-level
      interrupt handler.  For these we currently just have a branch to an
      out-of-line handler.  However, this means that we corrupt the CFAR
      (come-from address register) on POWER7 and later processors.
      
      To fix this, we split the EXCEPTION_PROLOG_1 macro into two pieces:
      EXCEPTION_PROLOG_0 contains the part up to the point where the CFAR
      is saved in the PACA, and EXCEPTION_PROLOG_1 contains the rest.  We
      then put EXCEPTION_PROLOG_0 in the short interrupt vectors before
      we branch to the out-of-line handler, which contains the rest of the
      first-level interrupt handler.  To facilitate this, we define new
      _OOL (out of line) variants of STD_EXCEPTION_PSERIES, etc.
      
      In order to get EXCEPTION_PROLOG_0 to be short enough, i.e., no more
      than 6 instructions, it was necessary to move the stores that move
      the PPR and CFAR values into the PACA into __EXCEPTION_PROLOG_1 and
      to get rid of one of the two HMT_MEDIUM instructions.  Previously
      there was a HMT_MEDIUM_PPR_DISCARD before the prolog, which was
      nop'd out on processors with the PPR (POWER7 and later), and then
      another HMT_MEDIUM inside the HMT_MEDIUM_PPR_SAVE macro call inside
      __EXCEPTION_PROLOG_1, which was nop'd out on processors without PPR.
      Now the HMT_MEDIUM inside EXCEPTION_PROLOG_0 is there unconditionally
      and the HMT_MEDIUM_PPR_DISCARD is not strictly necessary, although
      this leaves it in for the interrupt vectors where there is room for
      it.
      
      Previously we had a handler for hypervisor maintenance interrupts at
      0xe50, which doesn't leave enough room for the vector for hypervisor
      emulation assist interrupts at 0xe40, since we need 8 instructions.
      The 0xe50 vector was only used on POWER6, as the HMI vector was moved
      to 0xe60 on POWER7.  Since we don't support running in hypervisor mode
      on POWER6, we just remove the handler at 0xe50.
      
      This also changes denorm_exception_hv to use EXCEPTION_PROLOG_0
      instead of open-coding it, and removes the HMT_MEDIUM_PPR_DISCARD
      from the relocation-on vectors (since any CPU that supports
      relocation-on interrupts also has the PPR).
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      1707dd16
  10. 10 1月, 2013 6 次提交
  11. 15 11月, 2012 3 次提交
    • M
      powerpc: Add relocation on exception vector handlers · c1fb6816
      Michael Neuling 提交于
      POWER8/v2.07 allows exceptions to be taken with the MMU still on.
      
      A new set of exception vectors is added at 0xc000_0000_0000_4xxx.  When the HW
      takes us here, MSR IR/DR will be set already and we no longer need a costly
      RFID to turn the MMU back on again.
      
      The original 0x0 based exception vectors remain for when the HW can't leave the
      MMU on.  Examples of this are when we can't trust the current MMU mappings,
      like when we are changing from guest to hypervisor (HV 0 -> 1) or when the MMU
      was off already.  In these cases the HW will take us to the original 0x0 based
      exception vectors with the MMU off as before.
      
      This uses the new macros added previously too implement these new execption
      vectors at 0xc000_0000_0000_4xxx.  We exit these exception vectors using
      mflr/blr (rather than mtspr SSR0/RFID), since we don't need the costly MMU
      switch anymore.
      
      This moves the __end_interrupts marker down past these new 0x4000 vectors since
      they will need to be copied down to 0x0 when the kernel is not at 0x0.
      Signed-off-by: NMatt Evans <matt@ozlabs.org>
      Signed-off-by: NMichael Neuling <mikey@neuling.org>
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      c1fb6816
    • M
      powerpc: Add new macros needed for relocation on exceptions · 4700dfaf
      Michael Neuling 提交于
      POWER8/v2.07 allows exceptions to be taken with the MMU still on.
      
      A new set of exception vectors is added at 0xc000_0000_0000_4xxx.  When the HW
      takes us here, MSR IR/DR will be set already and we no longer need a costly
      RFID to turn the MMU back on again.
      
      The original 0x0 based exception vectors remain for when the HW can't leave the
      MMU on.  Examples of this are when we can't trust the current the MMU mappings,
      like when we are changing from guest to hypervisor (HV 0 -> 1) or when the MMU
      was off already.  In these cases the HW will take us to the original 0x0 based
      exception vectors with the MMU off as before.
      
      The below macros are copies of the macros used at the 0x0 offset but modified
      to handle the MMU being on.  In these macros we use the link register to jump
      to the secondary handlers rather than using RFID (RFID was also use to turn on
      the MMU).
      Signed-off-by: NMatt Evans <matt@ozlabs.org>
      Signed-off-by: NMichael Neuling <mikey@neuling.org>
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      4700dfaf
    • M
      powerpc: Make load_hander handle upto 64k offset · 61e2390e
      Michael Neuling 提交于
      If we change load_hander() to use an ori instead of addi, we can load handlers
      upto 64k away provided we are still 64k aligned.
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      61e2390e
  12. 11 7月, 2012 1 次提交
  13. 09 5月, 2012 1 次提交
  14. 09 3月, 2012 6 次提交
    • B
      powerpc: Rework lazy-interrupt handling · 7230c564
      Benjamin Herrenschmidt 提交于
      The current implementation of lazy interrupts handling has some
      issues that this tries to address.
      
      We don't do the various workarounds we need to do when re-enabling
      interrupts in some cases such as when returning from an interrupt
      and thus we may still lose or get delayed decrementer or doorbell
      interrupts.
      
      The current scheme also makes it much harder to handle the external
      "edge" interrupts provided by some BookE processors when using the
      EPR facility (External Proxy) and the Freescale Hypervisor.
      
      Additionally, we tend to keep interrupts hard disabled in a number
      of cases, such as decrementer interrupts, external interrupts, or
      when a masked decrementer interrupt is pending. This is sub-optimal.
      
      This is an attempt at fixing it all in one go by reworking the way
      we do the lazy interrupt disabling from the ground up.
      
      The base idea is to replace the "hard_enabled" field with a
      "irq_happened" field in which we store a bit mask of what interrupt
      occurred while soft-disabled.
      
      When re-enabling, either via arch_local_irq_restore() or when returning
      from an interrupt, we can now decide what to do by testing bits in that
      field.
      
      We then implement replaying of the missed interrupts either by
      re-using the existing exception frame (in exception exit case) or via
      the creation of a new one from an assembly trampoline (in the
      arch_local_irq_enable case).
      
      This removes the need to play with the decrementer to try to create
      fake interrupts, among others.
      
      In addition, this adds a few refinements:
      
       - We no longer  hard disable decrementer interrupts that occur
      while soft-disabled. We now simply bump the decrementer back to max
      (on BookS) or leave it stopped (on BookE) and continue with hard interrupts
      enabled, which means that we'll potentially get better sample quality from
      performance monitor interrupts.
      
       - Timer, decrementer and doorbell interrupts now hard-enable
      shortly after removing the source of the interrupt, which means
      they no longer run entirely hard disabled. Again, this will improve
      perf sample quality.
      
       - On Book3E 64-bit, we now make the performance monitor interrupt
      act as an NMI like Book3S (the necessary C code for that to work
      appear to already be present in the FSL perf code, notably calling
      nmi_enter instead of irq_enter). (This also fixes a bug where BookE
      perfmon interrupts could clobber r14 ... oops)
      
       - We could make "masked" decrementer interrupts act as NMIs when doing
      timer-based perf sampling to improve the sample quality.
      
      Signed-off-by-yet: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      ---
      
      v2:
      
      - Add hard-enable to decrementer, timer and doorbells
      - Fix CR clobber in masked irq handling on BookE
      - Make embedded perf interrupt act as an NMI
      - Add a PACA_HAPPENED_EE_EDGE for use by FSL if they want
        to retrigger an interrupt without preventing hard-enable
      
      v3:
      
       - Fix or vs. ori bug on Book3E
       - Fix enabling of interrupts for some exceptions on Book3E
      
      v4:
      
       - Fix resend of doorbells on return from interrupt on Book3E
      
      v5:
      
       - Rebased on top of my latest series, which involves some significant
      rework of some aspects of the patch.
      
      v6:
       - 32-bit compile fix
       - more compile fixes with various .config combos
       - factor out the asm code to soft-disable interrupts
       - remove the C wrapper around preempt_schedule_irq
      
      v7:
       - Fix a bug with hard irq state tracking on native power7
      7230c564
    • B
      powerpc: Replace mfmsr instructions with load from PACA kernel_msr field · d9ada91a
      Benjamin Herrenschmidt 提交于
      On 64-bit, the mfmsr instruction can be quite slow, slower
      than loading a field from the cache-hot PACA, which happens
      to already contain the value we want in most cases.
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      d9ada91a
    • B
      powerpc: Improve behaviour of irq tracing on 64-bit exception entry · 1b701179
      Benjamin Herrenschmidt 提交于
      Some exceptions would unconditionally disable interrupts on entry,
      which is fine, but calling lockdep every time not only adds more
      overhead than strictly needed, but also means we get quite a few
      "redudant" disable logged, which makes it hard to spot the really
      bad ones.
      
      So instead, split the macro used by the exception code into a
      normal one and a separate one used when CONFIG_TRACE_IRQFLAGS is
      enabled, and make the later skip th tracing if interrupts were
      already disabled.
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      1b701179
    • B
      powerpc: Rework runlatch code · fe1952fc
      Benjamin Herrenschmidt 提交于
      This moves the inlines into system.h and changes the runlatch
      code to use the thread local flags (non-atomic) rather than
      the TIF flags (atomic) to keep track of the latch state.
      
      The code to turn it back on in an asynchronous interrupt is
      now simplified and partially inlined.
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      fe1952fc
    • B
      powerpc: Use the same interrupt prolog for perfmon as other interrupts · 7450f6f0
      Benjamin Herrenschmidt 提交于
      The perfmon interrupt is the sole user of a special variant of the
      interrupt prolog which differs from the one used by external and timer
      interrupts in that it saves the non-volatile GPRs and doesn't turn the
      runlatch on.
      
      The former is unnecessary and the later is arguably incorrect, so
      let's clean that up by using the same prolog. While at it we rename
      that prolog to use the _ASYNC prefix.
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      7450f6f0
    • B
      powerpc: Remove legacy iSeries bits from assembly files · 4f8cf36f
      Benjamin Herrenschmidt 提交于
      This removes the various bits of assembly in the kernel entry,
      exception handling and SLB management code that were specific
      to running under the legacy iSeries hypervisor which is no
      longer supported.
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      4f8cf36f
  15. 12 7月, 2011 4 次提交
    • P
      KVM: PPC: book3s_hv: Add support for PPC970-family processors · 9e368f29
      Paul Mackerras 提交于
      This adds support for running KVM guests in supervisor mode on those
      PPC970 processors that have a usable hypervisor mode.  Unfortunately,
      Apple G5 machines have supervisor mode disabled (MSR[HV] is forced to
      1), but the YDL PowerStation does have a usable hypervisor mode.
      
      There are several differences between the PPC970 and POWER7 in how
      guests are managed.  These differences are accommodated using the
      CPU_FTR_ARCH_201 (PPC970) and CPU_FTR_ARCH_206 (POWER7) CPU feature
      bits.  Notably, on PPC970:
      
      * The LPCR, LPID or RMOR registers don't exist, and the functions of
        those registers are provided by bits in HID4 and one bit in HID0.
      
      * External interrupts can be directed to the hypervisor, but unlike
        POWER7 they are masked by MSR[EE] in non-hypervisor modes and use
        SRR0/1 not HSRR0/1.
      
      * There is no virtual RMA (VRMA) mode; the guest must use an RMO
        (real mode offset) area.
      
      * The TLB entries are not tagged with the LPID, so it is necessary to
        flush the whole TLB on partition switch.  Furthermore, when switching
        partitions we have to ensure that no other CPU is executing the tlbie
        or tlbsync instructions in either the old or the new partition,
        otherwise undefined behaviour can occur.
      
      * The PMU has 8 counters (PMC registers) rather than 6.
      
      * The DSCR, PURR, SPURR, AMR, AMOR, UAMOR registers don't exist.
      
      * The SLB has 64 entries rather than 32.
      
      * There is no mediated external interrupt facility, so if we switch to
        a guest that has a virtual external interrupt pending but the guest
        has MSR[EE] = 0, we have to arrange to have an interrupt pending for
        it so that we can get control back once it re-enables interrupts.  We
        do that by sending ourselves an IPI with smp_send_reschedule after
        hard-disabling interrupts.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      9e368f29
    • P
      KVM: PPC: Add support for Book3S processors in hypervisor mode · de56a948
      Paul Mackerras 提交于
      This adds support for KVM running on 64-bit Book 3S processors,
      specifically POWER7, in hypervisor mode.  Using hypervisor mode means
      that the guest can use the processor's supervisor mode.  That means
      that the guest can execute privileged instructions and access privileged
      registers itself without trapping to the host.  This gives excellent
      performance, but does mean that KVM cannot emulate a processor
      architecture other than the one that the hardware implements.
      
      This code assumes that the guest is running paravirtualized using the
      PAPR (Power Architecture Platform Requirements) interface, which is the
      interface that IBM's PowerVM hypervisor uses.  That means that existing
      Linux distributions that run on IBM pSeries machines will also run
      under KVM without modification.  In order to communicate the PAPR
      hypercalls to qemu, this adds a new KVM_EXIT_PAPR_HCALL exit code
      to include/linux/kvm.h.
      
      Currently the choice between book3s_hv support and book3s_pr support
      (i.e. the existing code, which runs the guest in user mode) has to be
      made at kernel configuration time, so a given kernel binary can only
      do one or the other.
      
      This new book3s_hv code doesn't support MMIO emulation at present.
      Since we are running paravirtualized guests, this isn't a serious
      restriction.
      
      With the guest running in supervisor mode, most exceptions go straight
      to the guest.  We will never get data or instruction storage or segment
      interrupts, alignment interrupts, decrementer interrupts, program
      interrupts, single-step interrupts, etc., coming to the hypervisor from
      the guest.  Therefore this introduces a new KVMTEST_NONHV macro for the
      exception entry path so that we don't have to do the KVM test on entry
      to those exception handlers.
      
      We do however get hypervisor decrementer, hypervisor data storage,
      hypervisor instruction storage, and hypervisor emulation assist
      interrupts, so we have to handle those.
      
      In hypervisor mode, real-mode accesses can access all of RAM, not just
      a limited amount.  Therefore we put all the guest state in the vcpu.arch
      and use the shadow_vcpu in the PACA only for temporary scratch space.
      We allocate the vcpu with kzalloc rather than vzalloc, and we don't use
      anything in the kvmppc_vcpu_book3s struct, so we don't allocate it.
      We don't have a shared page with the guest, but we still need a
      kvm_vcpu_arch_shared struct to store the values of various registers,
      so we include one in the vcpu_arch struct.
      
      The POWER7 processor has a restriction that all threads in a core have
      to be in the same partition.  MMU-on kernel code counts as a partition
      (partition 0), so we have to do a partition switch on every entry to and
      exit from the guest.  At present we require the host and guest to run
      in single-thread mode because of this hardware restriction.
      
      This code allocates a hashed page table for the guest and initializes
      it with HPTEs for the guest's Virtual Real Memory Area (VRMA).  We
      require that the guest memory is allocated using 16MB huge pages, in
      order to simplify the low-level memory management.  This also means that
      we can get away without tracking paging activity in the host for now,
      since huge pages can't be paged or swapped.
      
      This also adds a few new exports needed by the book3s_hv code.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      de56a948
    • P
      KVM: PPC: Split host-state fields out of kvmppc_book3s_shadow_vcpu · 3c42bf8a
      Paul Mackerras 提交于
      There are several fields in struct kvmppc_book3s_shadow_vcpu that
      temporarily store bits of host state while a guest is running,
      rather than anything relating to the particular guest or vcpu.
      This splits them out into a new kvmppc_host_state structure and
      modifies the definitions in asm-offsets.c to suit.
      
      On 32-bit, we have a kvmppc_host_state structure inside the
      kvmppc_book3s_shadow_vcpu since the assembly code needs to be able
      to get to them both with one pointer.  On 64-bit they are separate
      fields in the PACA.  This means that on 64-bit we don't need to
      copy the kvmppc_host_state in and out on vcpu load/unload, and
      in future will mean that the book3s_hv code doesn't need a
      shadow_vcpu struct in the PACA at all.  That does mean that we
      have to be careful not to rely on any values persisting in the
      hstate field of the paca across any point where we could block
      or get preempted.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      3c42bf8a
    • P
      powerpc, KVM: Rework KVM checks in first-level interrupt handlers · b01c8b54
      Paul Mackerras 提交于
      Instead of branching out-of-line with the DO_KVM macro to check if we
      are in a KVM guest at the time of an interrupt, this moves the KVM
      check inline in the first-level interrupt handlers.  This speeds up
      the non-KVM case and makes sure that none of the interrupt handlers
      are missing the check.
      
      Because the first-level interrupt handlers are now larger, some things
      had to be move out of line in exceptions-64s.S.
      
      This all necessitated some minor changes to the interrupt entry code
      in KVM.  This also streamlines the book3s_32 KVM test.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      b01c8b54
  16. 04 5月, 2011 2 次提交
    • P
      powerpc: Save Come-From Address Register (CFAR) in exception frame · 48404f2e
      Paul Mackerras 提交于
      Recent 64-bit server processors (POWER6 and POWER7) have a "Come-From
      Address Register" (CFAR), that records the address of the most recent
      branch or rfid (return from interrupt) instruction for debugging purposes.
      
      This saves the value of the CFAR in the exception entry code and stores
      it in the exception frame.  We also make xmon print the CFAR value in
      its register dump code.
      
      Rather than extend the pt_regs struct at this time, we steal the orig_gpr3
      field, which is only used for system calls, and use it for the CFAR value
      for all exceptions/interrupts other than system calls.  This means we
      don't save the CFAR on system calls, which is not a great problem since
      system calls tend not to happen unexpectedly, and also avoids adding the
      overhead of reading the CFAR to the system call entry path.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      48404f2e
    • P
      powerpc: Save register r9-r13 values accurately on interrupt with bad stack · 1977b502
      Paul Mackerras 提交于
      When we take an interrupt or exception from kernel mode and the stack
      pointer is obviously not a kernel address (i.e. the top bit is 0), we
      switch to an emergency stack, save register values and panic.  However,
      on 64-bit server machines, we don't actually save the values of r9 - r13
      at the time of the interrupt, but rather values corrupted by the
      exception entry code for r12-r13, and nothing at all for r9-r11.
      
      This fixes it by passing a pointer to the register save area in the paca
      through to the bad_stack code in r3.  The register values are saved in
      one of the paca register save areas (depending on which exception this
      is).  Using the pointer in r3, the bad_stack code now retrieves the
      saved values of r9 - r13 and stores them in the exception frame on the
      emergency stack.  This also stores the normal exception frame marker
      ("regshere") in the exception frame.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      1977b502
  17. 20 4月, 2011 4 次提交