1. 26 9月, 2011 5 次提交
    • A
      KVM: PPC: Add sanity checking to vcpu_run · af8f38b3
      Alexander Graf 提交于
      There are multiple features in PowerPC KVM that can now be enabled
      depending on the user's wishes. Some of the combinations don't make
      sense or don't work though.
      
      So this patch adds a way to check if the executing environment would
      actually be able to run the guest properly. It also adds sanity
      checks if PVR is set (should always be true given the current code
      flow), if PAPR is only used with book3s_64 where it works and that
      HV KVM is only used in PAPR mode.
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      af8f38b3
    • A
      KVM: PPC: Add PAPR hypercall code for PR mode · 0254f074
      Alexander Graf 提交于
      When running a PAPR guest, we need to handle a few hypercalls in kernel space,
      most prominently the page table invalidation (to sync the shadows).
      
      So this patch adds handling for a few PAPR hypercalls to PR mode KVM. I tried
      to share the code with HV mode, but it ended up being a lot easier this way
      around, as the two differ too much in those details.
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      
      ---
      
      v1 -> v2:
      
        - whitespace fix
      0254f074
    • A
      KVM: PPC: Add support for explicit HIOR setting · a15bd354
      Alexander Graf 提交于
      Until now, we always set HIOR based on the PVR, but this is just wrong.
      Instead, we should be setting HIOR explicitly, so user space can decide
      what the initial HIOR value is - just like on real hardware.
      
      We keep the old PVR based way around for backwards compatibility, but
      once user space uses the SREGS based method, we drop the PVR logic.
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      a15bd354
    • A
      KVM: PPC: Add papr_enabled flag · 9432ba60
      Alexander Graf 提交于
      When running a PAPR guest, some things change. The privilege level drops
      from hypervisor to supervisor, SDR1 gets treated differently and we interpret
      hypercalls. For bisectability sake, add the flag now, but only enable it when
      all the support code is there.
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      9432ba60
    • A
      KVM: PPC: move compute_tlbie_rb to book3s common header · db507c30
      Alexander Graf 提交于
      We need the compute_tlbie_rb in _pr and _hv implementations for papr
      soon, so let's move it over to a common header file that both
      implementations can leverage.
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      db507c30
  2. 30 8月, 2011 1 次提交
  3. 05 8月, 2011 4 次提交
    • A
      powerpc: Move kdump default base address to half RMO size on 64bit · 8aa6d359
      Anton Blanchard 提交于
      We are seeing boot failures on some very large boxes even with
      commit b5416ca9 (powerpc: Move kdump default base address to
      64MB on 64bit).
      
      This patch halves the RMO so both kernels get about the same
      amount of RMO memory. On large machines this region will be
      at least 256MB, so each kernel will get 128MB.
      
      We cap it at 256MB (small SLB size) since some early allocations need
      to be in the bolted SLB region. We could relax this on machines with
      1TB SLBs in a future patch.
      Signed-off-by: NAnton Blanchard <anton@samba.org>
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      8aa6d359
    • P
      ppc: Remove duplicate definition of PV_POWER7 · 501d2386
      Peter Zijlstra 提交于
      One definition of PV_POWER7 seems enough to me.
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      501d2386
    • A
      powerpc: Jump label misalignment causes oops at boot · c113a3ae
      Anton Blanchard 提交于
      I hit an oops at boot on the first instruction of timer_cpu_notify:
      
      NIP [c000000000722f88] .timer_cpu_notify+0x0/0x388
      
      The code should look like:
      
      c000000000722f78:       eb e9 00 30     ld      r31,48(r9)
      c000000000722f7c:       2f bf 00 00     cmpdi   cr7,r31,0
      c000000000722f80:       40 9e ff 44     bne+    cr7,c000000000722ec4
      c000000000722f84:       4b ff ff 74     b       c000000000722ef8
      
      c000000000722f88 <.timer_cpu_notify>:
      c000000000722f88:       7c 08 02 a6     mflr    r0
      c000000000722f8c:       2f a4 00 07     cmpdi   cr7,r4,7
      c000000000722f90:       fb c1 ff f0     std     r30,-16(r1)
      c000000000722f94:       fb 61 ff d8     std     r27,-40(r1)
      
      But the oops output shows:
      
      eb61ffd8 eb81ffe0 eba1ffe8 ebc1fff0 7c0803a6 ebe1fff8 4e800020
      00000000 ebe90030 c0000000 00ad0a28 00000000 2fa40007 fbc1fff0 fb61ffd8
      
      So we scribbled over our instructions with c000000000ad0a28, which
      is an address inside the jump_table ELF section.
      
      It turns out the jump_table section is only aligned to 8 bytes but
      we are aligning our entries within the section to 16 bytes. This
      means our entries are offset from the table:
      
      c000000000acd4a8 <__start___jump_table>:
              ...
      c000000000ad0a10:       c0 00 00 00     lfs     f0,0(0)
      c000000000ad0a14:       00 70 cd 5c     .long 0x70cd5c
      c000000000ad0a18:       c0 00 00 00     lfs     f0,0(0)
      c000000000ad0a1c:       00 70 cd 90     .long 0x70cd90
      c000000000ad0a20:       c0 00 00 00     lfs     f0,0(0)
      c000000000ad0a24:       00 ac a4 20     .long 0xaca420
      
      And the jump table sort code gets very confused and writes into the
      wrong spot. Remove the alignment, and also remove the padding since
      we it saves some space and we shouldn't need it.
      Signed-off-by: NAnton Blanchard <anton@samba.org>
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      c113a3ae
    • S
      powerpc: mtspr/mtmsr should take an unsigned long · 326ed6a9
      Scott Wood 提交于
      Add a cast in case the caller passes in a different type, as it would
      if mtspr/mtmsr were functions.
      
      Previously, if a 64-bit type was passed in on 32-bit, GCC would bind the
      constraint to a pair of registers, and would substitute the first register
      in the pair in the asm code.  This corresponds to the upper half of the
      64-bit register, which is generally not the desired behavior.
      Signed-off-by: NScott Wood <scottwood@freescale.com>
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      326ed6a9
  4. 27 7月, 2011 5 次提交
  5. 22 7月, 2011 1 次提交
    • M
      net: filter: BPF 'JIT' compiler for PPC64 · 0ca87f05
      Matt Evans 提交于
      An implementation of a code generator for BPF programs to speed up packet
      filtering on PPC64, inspired by Eric Dumazet's x86-64 version.
      
      Filter code is generated as an ABI-compliant function in module_alloc()'d mem
      with stackframe & prologue/epilogue generated if required (simple filters don't
      need anything more than an li/blr).  The filter's local variables, M[], live in
      registers.  Supports all BPF opcodes, although "complicated" loads from negative
      packet offsets (e.g. SKF_LL_OFF) are not yet supported.
      
      There are a couple of further optimisations left for future work; many-pass
      assembly with branch-reach reduction and a register allocator to push M[]
      variables into volatile registers would improve the code quality further.
      
      This currently supports big-endian 64-bit PowerPC only (but is fairly simple
      to port to PPC32 or LE!).
      
      Enabled in the same way as x86-64:
      
      	echo 1 > /proc/sys/net/core/bpf_jit_enable
      
      Or, enabled with extra debug output:
      
      	echo 2 > /proc/sys/net/core/bpf_jit_enable
      Signed-off-by: NMatt Evans <matt@ozlabs.org>
      Acked-by: NEric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      0ca87f05
  6. 21 7月, 2011 1 次提交
    • P
      treewide: fix potentially dangerous trailing ';' in #defined values/expressions · 497888cf
      Phil Carmody 提交于
      All these are instances of
        #define NAME value;
      or
        #define NAME(params_opt) value;
      
      These of course fail to build when used in contexts like
        if(foo $OP NAME)
        while(bar $OP NAME)
      and may silently generate the wrong code in contexts such as
        foo = NAME + 1;    /* foo = value; + 1; */
        bar = NAME - 1;    /* bar = value; - 1; */
        baz = NAME & quux; /* baz = value; & quux; */
      
      Reported on comp.lang.c,
      Message-ID: <ab0d55fe-25e5-482b-811e-c475aa6065c3@c29g2000yqd.googlegroups.com>
      Initial analysis of the dangers provided by Keith Thompson in that thread.
      
      There are many more instances of more complicated macros having unnecessary
      trailing semicolons, but this pile seems to be all of the cases of simple
      values suffering from the problem. (Thus things that are likely to be found
      in one of the contexts above, more complicated ones aren't.)
      Signed-off-by: NPhil Carmody <ext-phil.2.carmody@nokia.com>
      Signed-off-by: NJiri Kosina <jkosina@suse.cz>
      497888cf
  7. 13 7月, 2011 1 次提交
  8. 12 7月, 2011 22 次提交
    • R
      powerpc: rename ppc_pci_*_flags to pci_*_flags · 0e47ff1c
      Rob Herring 提交于
      This renames pci flags functions and enums in preparation for creating
      generic version in asm-generic/pci-bridge.h. The following search and
      replace is done:
      
      s/ppc_pci_/pci_/
      s/PPC_PCI_/PCI_/
      
      Direct accesses to ppc_pci_flag variable are replaced with helper
      functions.
      Signed-off-by: NRob Herring <rob.herring@calxeda.com>
      Acked-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      0e47ff1c
    • D
      powerpc/44x: don't use tlbivax on AMP systems · 91b191c7
      Dave Kleikamp 提交于
      Since other OS's may be running on the other cores don't use tlbivax
      Signed-off-by: NDave Kleikamp <shaggy@linux.vnet.ibm.com>
      Signed-off-by: NTony Breeds <tony@bakeyournoodle.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Josh Boyer <jwboyer@linux.vnet.ibm.com>
      Cc: linuxppc-dev@lists.ozlabs.org
      Signed-off-by: NJosh Boyer <jwboyer@linux.vnet.ibm.com>
      91b191c7
    • A
      KVM: PPC: Remove prog_flags · 29d03158
      Alexander Graf 提交于
      Commit c8f729d408 (KVM: PPC: Deliver program interrupts right away instead
      of queueing them) made away with all users of prog_flags, so we can just
      remove it from the headers.
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      29d03158
    • P
      KVM: PPC: book3s_hv: Add support for PPC970-family processors · 9e368f29
      Paul Mackerras 提交于
      This adds support for running KVM guests in supervisor mode on those
      PPC970 processors that have a usable hypervisor mode.  Unfortunately,
      Apple G5 machines have supervisor mode disabled (MSR[HV] is forced to
      1), but the YDL PowerStation does have a usable hypervisor mode.
      
      There are several differences between the PPC970 and POWER7 in how
      guests are managed.  These differences are accommodated using the
      CPU_FTR_ARCH_201 (PPC970) and CPU_FTR_ARCH_206 (POWER7) CPU feature
      bits.  Notably, on PPC970:
      
      * The LPCR, LPID or RMOR registers don't exist, and the functions of
        those registers are provided by bits in HID4 and one bit in HID0.
      
      * External interrupts can be directed to the hypervisor, but unlike
        POWER7 they are masked by MSR[EE] in non-hypervisor modes and use
        SRR0/1 not HSRR0/1.
      
      * There is no virtual RMA (VRMA) mode; the guest must use an RMO
        (real mode offset) area.
      
      * The TLB entries are not tagged with the LPID, so it is necessary to
        flush the whole TLB on partition switch.  Furthermore, when switching
        partitions we have to ensure that no other CPU is executing the tlbie
        or tlbsync instructions in either the old or the new partition,
        otherwise undefined behaviour can occur.
      
      * The PMU has 8 counters (PMC registers) rather than 6.
      
      * The DSCR, PURR, SPURR, AMR, AMOR, UAMOR registers don't exist.
      
      * The SLB has 64 entries rather than 32.
      
      * There is no mediated external interrupt facility, so if we switch to
        a guest that has a virtual external interrupt pending but the guest
        has MSR[EE] = 0, we have to arrange to have an interrupt pending for
        it so that we can get control back once it re-enables interrupts.  We
        do that by sending ourselves an IPI with smp_send_reschedule after
        hard-disabling interrupts.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      9e368f29
    • P
      powerpc, KVM: Split HVMODE_206 cpu feature bit into separate HV and architecture bits · 969391c5
      Paul Mackerras 提交于
      This replaces the single CPU_FTR_HVMODE_206 bit with two bits, one to
      indicate that we have a usable hypervisor mode, and another to indicate
      that the processor conforms to PowerISA version 2.06.  We also add
      another bit to indicate that the processor conforms to ISA version 2.01
      and set that for PPC970 and derivatives.
      
      Some PPC970 chips (specifically those in Apple machines) have a
      hypervisor mode in that MSR[HV] is always 1, but the hypervisor mode
      is not useful in the sense that there is no way to run any code in
      supervisor mode (HV=0 PR=0).  On these processors, the LPES0 and LPES1
      bits in HID4 are always 0, and we use that as a way of detecting that
      hypervisor mode is not useful.
      
      Where we have a feature section in assembly code around code that
      only applies on POWER7 in hypervisor mode, we use a construct like
      
      END_FTR_SECTION_IFSET(CPU_FTR_HVMODE | CPU_FTR_ARCH_206)
      
      The definition of END_FTR_SECTION_IFSET is such that the code will
      be enabled (not overwritten with nops) only if all bits in the
      provided mask are set.
      
      Note that the CPU feature check in __tlbie() only needs to check the
      ARCH_206 bit, not the HVMODE bit, because __tlbie() can only get called
      if we are running bare-metal, i.e. in hypervisor mode.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      969391c5
    • P
      KVM: PPC: Allocate RMAs (Real Mode Areas) at boot for use by guests · aa04b4cc
      Paul Mackerras 提交于
      This adds infrastructure which will be needed to allow book3s_hv KVM to
      run on older POWER processors, including PPC970, which don't support
      the Virtual Real Mode Area (VRMA) facility, but only the Real Mode
      Offset (RMO) facility.  These processors require a physically
      contiguous, aligned area of memory for each guest.  When the guest does
      an access in real mode (MMU off), the address is compared against a
      limit value, and if it is lower, the address is ORed with an offset
      value (from the Real Mode Offset Register (RMOR)) and the result becomes
      the real address for the access.  The size of the RMA has to be one of
      a set of supported values, which usually includes 64MB, 128MB, 256MB
      and some larger powers of 2.
      
      Since we are unlikely to be able to allocate 64MB or more of physically
      contiguous memory after the kernel has been running for a while, we
      allocate a pool of RMAs at boot time using the bootmem allocator.  The
      size and number of the RMAs can be set using the kvm_rma_size=xx and
      kvm_rma_count=xx kernel command line options.
      
      KVM exports a new capability, KVM_CAP_PPC_RMA, to signal the availability
      of the pool of preallocated RMAs.  The capability value is 1 if the
      processor can use an RMA but doesn't require one (because it supports
      the VRMA facility), or 2 if the processor requires an RMA for each guest.
      
      This adds a new ioctl, KVM_ALLOCATE_RMA, which allocates an RMA from the
      pool and returns a file descriptor which can be used to map the RMA.  It
      also returns the size of the RMA in the argument structure.
      
      Having an RMA means we will get multiple KMV_SET_USER_MEMORY_REGION
      ioctl calls from userspace.  To cope with this, we now preallocate the
      kvm->arch.ram_pginfo array when the VM is created with a size sufficient
      for up to 64GB of guest memory.  Subsequently we will get rid of this
      array and use memory associated with each memslot instead.
      
      This moves most of the code that translates the user addresses into
      host pfns (page frame numbers) out of kvmppc_prepare_vrma up one level
      to kvmppc_core_prepare_memory_region.  Also, instead of having to look
      up the VMA for each page in order to check the page size, we now check
      that the pages we get are compound pages of 16MB.  However, if we are
      adding memory that is mapped to an RMA, we don't bother with calling
      get_user_pages_fast and instead just offset from the base pfn for the
      RMA.
      
      Typically the RMA gets added after vcpus are created, which makes it
      inconvenient to have the LPCR (logical partition control register) value
      in the vcpu->arch struct, since the LPCR controls whether the processor
      uses RMA or VRMA for the guest.  This moves the LPCR value into the
      kvm->arch struct and arranges for the MER (mediated external request)
      bit, which is the only bit that varies between vcpus, to be set in
      assembly code when going into the guest if there is a pending external
      interrupt request.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      aa04b4cc
    • P
      KVM: PPC: Allow book3s_hv guests to use SMT processor modes · 371fefd6
      Paul Mackerras 提交于
      This lifts the restriction that book3s_hv guests can only run one
      hardware thread per core, and allows them to use up to 4 threads
      per core on POWER7.  The host still has to run single-threaded.
      
      This capability is advertised to qemu through a new KVM_CAP_PPC_SMT
      capability.  The return value of the ioctl querying this capability
      is the number of vcpus per virtual CPU core (vcore), currently 4.
      
      To use this, the host kernel should be booted with all threads
      active, and then all the secondary threads should be offlined.
      This will put the secondary threads into nap mode.  KVM will then
      wake them from nap mode and use them for running guest code (while
      they are still offline).  To wake the secondary threads, we send
      them an IPI using a new xics_wake_cpu() function, implemented in
      arch/powerpc/sysdev/xics/icp-native.c.  In other words, at this stage
      we assume that the platform has a XICS interrupt controller and
      we are using icp-native.c to drive it.  Since the woken thread will
      need to acknowledge and clear the IPI, we also export the base
      physical address of the XICS registers using kvmppc_set_xics_phys()
      for use in the low-level KVM book3s code.
      
      When a vcpu is created, it is assigned to a virtual CPU core.
      The vcore number is obtained by dividing the vcpu number by the
      number of threads per core in the host.  This number is exported
      to userspace via the KVM_CAP_PPC_SMT capability.  If qemu wishes
      to run the guest in single-threaded mode, it should make all vcpu
      numbers be multiples of the number of threads per core.
      
      We distinguish three states of a vcpu: runnable (i.e., ready to execute
      the guest), blocked (that is, idle), and busy in host.  We currently
      implement a policy that the vcore can run only when all its threads
      are runnable or blocked.  This way, if a vcpu needs to execute elsewhere
      in the kernel or in qemu, it can do so without being starved of CPU
      by the other vcpus.
      
      When a vcore starts to run, it executes in the context of one of the
      vcpu threads.  The other vcpu threads all go to sleep and stay asleep
      until something happens requiring the vcpu thread to return to qemu,
      or to wake up to run the vcore (this can happen when another vcpu
      thread goes from busy in host state to blocked).
      
      It can happen that a vcpu goes from blocked to runnable state (e.g.
      because of an interrupt), and the vcore it belongs to is already
      running.  In that case it can start to run immediately as long as
      the none of the vcpus in the vcore have started to exit the guest.
      We send the next free thread in the vcore an IPI to get it to start
      to execute the guest.  It synchronizes with the other threads via
      the vcore->entry_exit_count field to make sure that it doesn't go
      into the guest if the other vcpus are exiting by the time that it
      is ready to actually enter the guest.
      
      Note that there is no fixed relationship between the hardware thread
      number and the vcpu number.  Hardware threads are assigned to vcpus
      as they become runnable, so we will always use the lower-numbered
      hardware threads in preference to higher-numbered threads if not all
      the vcpus in the vcore are runnable, regardless of which vcpus are
      runnable.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      371fefd6
    • D
      KVM: PPC: Accelerate H_PUT_TCE by implementing it in real mode · 54738c09
      David Gibson 提交于
      This improves I/O performance for guests using the PAPR
      paravirtualization interface by making the H_PUT_TCE hcall faster, by
      implementing it in real mode.  H_PUT_TCE is used for updating virtual
      IOMMU tables, and is used both for virtual I/O and for real I/O in the
      PAPR interface.
      
      Since this moves the IOMMU tables into the kernel, we define a new
      KVM_CREATE_SPAPR_TCE ioctl to allow qemu to create the tables.  The
      ioctl returns a file descriptor which can be used to mmap the newly
      created table.  The qemu driver models use them in the same way as
      userspace managed tables, but they can be updated directly by the
      guest with a real-mode H_PUT_TCE implementation, reducing the number
      of host/guest context switches during guest IO.
      
      There are certain circumstances where it is useful for userland qemu
      to write to the TCE table even if the kernel H_PUT_TCE path is used
      most of the time.  Specifically, allowing this will avoid awkwardness
      when we need to reset the table.  More importantly, we will in the
      future need to write the table in order to restore its state after a
      checkpoint resume or migration.
      Signed-off-by: NDavid Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      54738c09
    • P
      KVM: PPC: Handle some PAPR hcalls in the kernel · a8606e20
      Paul Mackerras 提交于
      This adds the infrastructure for handling PAPR hcalls in the kernel,
      either early in the guest exit path while we are still in real mode,
      or later once the MMU has been turned back on and we are in the full
      kernel context.  The advantage of handling hcalls in real mode if
      possible is that we avoid two partition switches -- and this will
      become more important when we support SMT4 guests, since a partition
      switch means we have to pull all of the threads in the core out of
      the guest.  The disadvantage is that we can only access the kernel
      linear mapping, not anything vmalloced or ioremapped, since the MMU
      is off.
      
      This also adds code to handle the following hcalls in real mode:
      
      H_ENTER       Add an HPTE to the hashed page table
      H_REMOVE      Remove an HPTE from the hashed page table
      H_READ        Read HPTEs from the hashed page table
      H_PROTECT     Change the protection bits in an HPTE
      H_BULK_REMOVE Remove up to 4 HPTEs from the hashed page table
      H_SET_DABR    Set the data address breakpoint register
      
      Plus code to handle the following hcalls in the kernel:
      
      H_CEDE        Idle the vcpu until an interrupt or H_PROD hcall arrives
      H_PROD        Wake up a ceded vcpu
      H_REGISTER_VPA Register a virtual processor area (VPA)
      
      The code that runs in real mode has to be in the base kernel, not in
      the module, if KVM is compiled as a module.  The real-mode code can
      only access the kernel linear mapping, not vmalloc or ioremap space.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      a8606e20
    • P
      KVM: PPC: Add support for Book3S processors in hypervisor mode · de56a948
      Paul Mackerras 提交于
      This adds support for KVM running on 64-bit Book 3S processors,
      specifically POWER7, in hypervisor mode.  Using hypervisor mode means
      that the guest can use the processor's supervisor mode.  That means
      that the guest can execute privileged instructions and access privileged
      registers itself without trapping to the host.  This gives excellent
      performance, but does mean that KVM cannot emulate a processor
      architecture other than the one that the hardware implements.
      
      This code assumes that the guest is running paravirtualized using the
      PAPR (Power Architecture Platform Requirements) interface, which is the
      interface that IBM's PowerVM hypervisor uses.  That means that existing
      Linux distributions that run on IBM pSeries machines will also run
      under KVM without modification.  In order to communicate the PAPR
      hypercalls to qemu, this adds a new KVM_EXIT_PAPR_HCALL exit code
      to include/linux/kvm.h.
      
      Currently the choice between book3s_hv support and book3s_pr support
      (i.e. the existing code, which runs the guest in user mode) has to be
      made at kernel configuration time, so a given kernel binary can only
      do one or the other.
      
      This new book3s_hv code doesn't support MMIO emulation at present.
      Since we are running paravirtualized guests, this isn't a serious
      restriction.
      
      With the guest running in supervisor mode, most exceptions go straight
      to the guest.  We will never get data or instruction storage or segment
      interrupts, alignment interrupts, decrementer interrupts, program
      interrupts, single-step interrupts, etc., coming to the hypervisor from
      the guest.  Therefore this introduces a new KVMTEST_NONHV macro for the
      exception entry path so that we don't have to do the KVM test on entry
      to those exception handlers.
      
      We do however get hypervisor decrementer, hypervisor data storage,
      hypervisor instruction storage, and hypervisor emulation assist
      interrupts, so we have to handle those.
      
      In hypervisor mode, real-mode accesses can access all of RAM, not just
      a limited amount.  Therefore we put all the guest state in the vcpu.arch
      and use the shadow_vcpu in the PACA only for temporary scratch space.
      We allocate the vcpu with kzalloc rather than vzalloc, and we don't use
      anything in the kvmppc_vcpu_book3s struct, so we don't allocate it.
      We don't have a shared page with the guest, but we still need a
      kvm_vcpu_arch_shared struct to store the values of various registers,
      so we include one in the vcpu_arch struct.
      
      The POWER7 processor has a restriction that all threads in a core have
      to be in the same partition.  MMU-on kernel code counts as a partition
      (partition 0), so we have to do a partition switch on every entry to and
      exit from the guest.  At present we require the host and guest to run
      in single-thread mode because of this hardware restriction.
      
      This code allocates a hashed page table for the guest and initializes
      it with HPTEs for the guest's Virtual Real Memory Area (VRMA).  We
      require that the guest memory is allocated using 16MB huge pages, in
      order to simplify the low-level memory management.  This also means that
      we can get away without tracking paging activity in the host for now,
      since huge pages can't be paged or swapped.
      
      This also adds a few new exports needed by the book3s_hv code.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      de56a948
    • P
      KVM: PPC: Split host-state fields out of kvmppc_book3s_shadow_vcpu · 3c42bf8a
      Paul Mackerras 提交于
      There are several fields in struct kvmppc_book3s_shadow_vcpu that
      temporarily store bits of host state while a guest is running,
      rather than anything relating to the particular guest or vcpu.
      This splits them out into a new kvmppc_host_state structure and
      modifies the definitions in asm-offsets.c to suit.
      
      On 32-bit, we have a kvmppc_host_state structure inside the
      kvmppc_book3s_shadow_vcpu since the assembly code needs to be able
      to get to them both with one pointer.  On 64-bit they are separate
      fields in the PACA.  This means that on 64-bit we don't need to
      copy the kvmppc_host_state in and out on vcpu load/unload, and
      in future will mean that the book3s_hv code doesn't need a
      shadow_vcpu struct in the PACA at all.  That does mean that we
      have to be careful not to rely on any values persisting in the
      hstate field of the paca across any point where we could block
      or get preempted.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      3c42bf8a
    • P
      powerpc: Set up LPCR for running guest partitions · 923c53ca
      Paul Mackerras 提交于
      In hypervisor mode, the LPCR controls several aspects of guest
      partitions, including virtual partition memory mode, and also controls
      whether the hypervisor decrementer interrupts are enabled.  This sets
      up LPCR at boot time so that guest partitions will use a virtual real
      memory area (VRMA) composed of 16MB large pages, and hypervisor
      decrementer interrupts are disabled.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      923c53ca
    • P
      KVM: PPC: Move guest enter/exit down into subarch-specific code · df6909e5
      Paul Mackerras 提交于
      Instead of doing the kvm_guest_enter/exit() and local_irq_dis/enable()
      calls in powerpc.c, this moves them down into the subarch-specific
      book3s_pr.c and booke.c.  This eliminates an extra local_irq_enable()
      call in book3s_pr.c, and will be needed for when we do SMT4 guest
      support in the book3s hypervisor mode code.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      df6909e5
    • P
      KVM: PPC: Pass init/destroy vm and prepare/commit memory region ops down · f9e0554d
      Paul Mackerras 提交于
      This arranges for the top-level arch/powerpc/kvm/powerpc.c file to
      pass down some of the calls it gets to the lower-level subarchitecture
      specific code.  The lower-level implementations (in booke.c and book3s.c)
      are no-ops.  The coming book3s_hv.c will need this.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      f9e0554d
    • P
      powerpc, KVM: Rework KVM checks in first-level interrupt handlers · b01c8b54
      Paul Mackerras 提交于
      Instead of branching out-of-line with the DO_KVM macro to check if we
      are in a KVM guest at the time of an interrupt, this moves the KVM
      check inline in the first-level interrupt handlers.  This speeds up
      the non-KVM case and makes sure that none of the interrupt handlers
      are missing the check.
      
      Because the first-level interrupt handlers are now larger, some things
      had to be move out of line in exceptions-64s.S.
      
      This all necessitated some minor changes to the interrupt entry code
      in KVM.  This also streamlines the book3s_32 KVM test.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      b01c8b54
    • P
      KVM: PPC: Split out code from book3s.c into book3s_pr.c · f05ed4d5
      Paul Mackerras 提交于
      In preparation for adding code to enable KVM to use hypervisor mode
      on 64-bit Book 3S processors, this splits book3s.c into two files,
      book3s.c and book3s_pr.c, where book3s_pr.c contains the code that is
      specific to running the guest in problem state (user mode) and book3s.c
      contains code which should apply to all Book 3S processors.
      
      In doing this, we abstract some details, namely the interrupt offset,
      updating the interrupt pending flag, and detecting if the guest is
      in a critical section.  These are all things that will be different
      when we use hypervisor mode.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      f05ed4d5
    • P
      KVM: PPC: Move fields between struct kvm_vcpu_arch and kvmppc_vcpu_book3s · c4befc58
      Paul Mackerras 提交于
      This moves the slb field, which represents the state of the emulated
      SLB, from the kvmppc_vcpu_book3s struct to the kvm_vcpu_arch, and the
      hpte_hash_[v]pte[_long] fields from kvm_vcpu_arch to kvmppc_vcpu_book3s.
      This is in accord with the principle that the kvm_vcpu_arch struct
      represents the state of the emulated CPU, and the kvmppc_vcpu_book3s
      struct holds the auxiliary data structures used in the emulation.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      c4befc58
    • L
      KVM: PPC: e500: Add shadow PID support · dd9ebf1f
      Liu Yu 提交于
      Dynamically assign host PIDs to guest PIDs, splitting each guest PID into
      multiple host (shadow) PIDs based on kernel/user and MSR[IS/DS].  Use
      both PID0 and PID1 so that the shadow PIDs for the right mode can be
      selected, that correspond both to guest TID = zero and guest TID = guest
      PID.
      
      This allows us to significantly reduce the frequency of needing to
      invalidate the entire TLB.  When the guest mode or PID changes, we just
      update the host PID0/PID1.  And since the allocation of shadow PIDs is
      global, multiple guests can share the TLB without conflict.
      
      Note that KVM does not yet support the guest setting PID1 or PID2 to
      a value other than zero.  This will need to be fixed for nested KVM
      to work.  Until then, we enforce the requirement for guest PID1/PID2
      to stay zero by failing the emulation if the guest tries to set them
      to something else.
      Signed-off-by: NLiu Yu <yu.liu@freescale.com>
      Signed-off-by: NScott Wood <scottwood@freescale.com>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      dd9ebf1f
    • L
      KVM: PPC: e500: Stop keeping shadow TLB · 08b7fa92
      Liu Yu 提交于
      Instead of a fully separate set of TLB entries, keep just the
      pfn and dirty status.
      Signed-off-by: NLiu Yu <yu.liu@freescale.com>
      Signed-off-by: NScott Wood <scottwood@freescale.com>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      08b7fa92
    • S
      KVM: PPC: e500: enable magic page · a4cd8b23
      Scott Wood 提交于
      This is a shared page used for paravirtualization.  It is always present
      in the guest kernel's effective address space at the address indicated
      by the hypercall that enables it.
      
      The physical address specified by the hypercall is not used, as
      e500 does not have real mode.
      Signed-off-by: NScott Wood <scottwood@freescale.com>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      a4cd8b23
    • S
      KVM: PPC: e500: Eliminate shadow_pages[], and use pfns instead. · 59c1f4e3
      Scott Wood 提交于
      This is in line with what other architectures do, and will allow us to
      map things other than ordinary, unreserved kernel pages -- such as
      dedicated devices, or large contiguous reserved regions.
      Signed-off-by: NScott Wood <scottwood@freescale.com>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      59c1f4e3
    • S
      KVM: PPC: e500: Save/restore SPE state · 4cd35f67
      Scott Wood 提交于
      This is done lazily.  The SPE save will be done only if the guest has
      used SPE since the last preemption or heavyweight exit.  Restore will be
      done only on demand, when enabling MSR_SPE in the shadow MSR, in response
      to an SPE fault or mtmsr emulation.
      
      For SPEFSCR, Linux already switches it on context switch (non-lazily), so
      the only remaining bit is to save it between qemu and the guest.
      Signed-off-by: NLiu Yu <yu.liu@freescale.com>
      Signed-off-by: NScott Wood <scottwood@freescale.com>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      4cd35f67