1. 09 Oct, 2018 9 commits
    • KVM: PPC: Book3S HV: Nested guest entry via hypercall · 360cae31
      Paul Mackerras committed
      This adds a new hypercall, H_ENTER_NESTED, which is used by a nested
      hypervisor to enter one of its nested guests.  The hypercall supplies
      register values in two structs.  Those values are copied by the level 0
      (L0) hypervisor (the one which is running in hypervisor mode) into the
      vcpu struct of the L1 guest, and then the guest is run until an
      interrupt or error occurs which needs to be reported to L1 via the
      hypercall return value.
      
      Currently this assumes that the L0 and L1 hypervisors are the same
      endianness, and the structs passed as arguments are in native
      endianness.  If they are of different endianness, the version number
      check will fail and the hcall will be rejected.
      
      Nested hypervisors do not support indep_threads_mode=N, so this adds
      code to print a warning message if the administrator has set
      indep_threads_mode=N, and treat it as Y.
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
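The endianness remark in the H_ENTER_NESTED commit can be illustrated with a small user-space sketch. It shows why a native-endian version check also rejects a caller of the opposite byte order: a small version number read byte-swapped becomes a very large, non-matching value. The struct layout, the `example_` names and the version value 1 are assumptions made for illustration; the commit message does not give them.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical version constant; the real value is not stated in the commit message. */
#define EXAMPLE_HV_REGS_VERSION 1ULL

/* Hypothetical stand-in for a register struct passed by L1; only 'version' matters here. */
struct example_hv_regs {
	uint64_t version;
	uint64_t payload;
};

/* Reverse the byte order of a 64-bit value, i.e. what an opposite-endian reader would see. */
static uint64_t example_swab64(uint64_t v)
{
	uint64_t r = 0;

	for (int i = 0; i < 8; i++) {
		r = (r << 8) | (v & 0xff);
		v >>= 8;
	}
	return r;
}

/* Accept the struct only if the version field matches when read in native byte order. */
static bool example_version_ok(const struct example_hv_regs *regs)
{
	return regs->version == EXAMPLE_HV_REGS_VERSION;
}
```

With these definitions, a struct written by a same-endian caller passes `example_version_ok()`, while one whose version field holds `example_swab64(EXAMPLE_HV_REGS_VERSION)` is rejected.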
    • KVM: PPC: Book3S HV: Framework and hcall stubs for nested virtualization · 8e3f5fc1
      Paul Mackerras committed
      This starts the process of adding the code to support nested HV-style
      virtualization.  It defines a new H_SET_PARTITION_TABLE hypercall which
      a nested hypervisor can use to set the base address and size of a
      partition table in its memory (analogous to the PTCR register).
      On the host (level 0 hypervisor) side, the H_SET_PARTITION_TABLE
      hypercall from the guest is handled by code that saves the virtual
      PTCR value for the guest.
      
      This also adds code for creating and destroying nested guests and for
      reading the partition table entry for a nested guest from L1 memory.
      Each nested guest has its own shadow LPID value, different in general
      from the LPID value used by the nested hypervisor to refer to it.  The
      shadow LPID value is allocated at nested guest creation time.
      
      Nested hypervisor functionality is only available for a radix guest,
      which therefore means a radix host on a POWER9 (or later) processor.
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • KVM: PPC: Book3S HV: Clear partition table entry on vm teardown · 89329c0b
      Suraj Jitindar Singh committed
      When destroying a VM we return the LPID to the pool, however we never
      zero the partition table entry. This is instead done when we reallocate
      the LPID.
      
      Zero the partition table entry on VM teardown before returning the LPID
      to the pool. This means if we were running as a nested hypervisor the
      real hypervisor could use this to determine when it can free resources.
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • KVM: PPC: Use ccr field in pt_regs struct embedded in vcpu struct · fd0944ba
      Paul Mackerras committed
      When the 'regs' field was added to struct kvm_vcpu_arch, the code
      was changed to use several of the fields inside regs (e.g., gpr, lr,
      etc.) but not the ccr field, because the ccr field in struct pt_regs
      is 64 bits on 64-bit platforms, but the cr field in kvm_vcpu_arch is
      only 32 bits.  This changes the code to use the regs.ccr field
      instead of cr, and changes the assembly code on 64-bit platforms to
      use 64-bit loads and stores instead of 32-bit ones.
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • KVM: PPC: Book3S HV: Add a debugfs file to dump radix mappings · 9a94d3ee
      Paul Mackerras committed
      This adds a file called 'radix' in the debugfs directory for the
      guest, which when read gives all of the valid leaf PTEs in the
      partition-scoped radix tree for a radix guest, in human-readable
      format.  It is analogous to the existing 'htab' file which dumps
      the HPT entries for a HPT guest.
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • KVM: PPC: Book3S HV: Handle hypervisor instruction faults better · 32eb150a
      Paul Mackerras committed
      Currently the code for handling hypervisor instruction page faults
      passes 0 for the flags indicating the type of fault, which is OK in
      the usual case that the page is not mapped in the partition-scoped
      page tables.  However, there are other causes for hypervisor
      instruction page faults, such as not being able to update a reference
      (R) or change (C) bit.  The cause is indicated in bits in HSRR1,
      including a bit which indicates that the fault is due to not being
      able to write to a page (for example to update an R or C bit).
      Not handling these other kinds of faults correctly can lead to a
      loop of continual faults without forward progress in the guest.
      
      In order to handle these faults better, this patch constructs a
      "DSISR-like" value from the bits which DSISR and SRR1 (for a HISI)
      have in common, and passes it to kvmppc_book3s_hv_page_fault() so
      that it knows what caused the fault.
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
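The construction of a "DSISR-like" value described in the commit above might look roughly like the following user-space sketch: keep the fault-cause bits that HSRR1 shares with DSISR, then set the store bit when HSRR1 reports a failed write. All `EX_` names and the specific bit values are assumptions chosen for illustration and should be checked against the kernel headers before being relied on.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative bit positions only; assumptions for this sketch, not taken from the commit. */
#define EX_DSISR_NOHPTE      0x40000000u /* no translation found */
#define EX_DSISR_NOEXEC_OR_G 0x10000000u /* no-execute or guarded page */
#define EX_DSISR_PROTFAULT   0x08000000u /* protection fault */
#define EX_DSISR_ISSTORE     0x02000000u /* access was a store */
#define EX_HSRR1_HISI_WRITE  0x00010000u /* HISI caused by a failed write, e.g. R/C update */

/* Bits assumed to have the same meaning in DSISR and in SRR1/HSRR1 for a HISI. */
#define EX_SRR1_MATCH (EX_DSISR_NOHPTE | EX_DSISR_NOEXEC_OR_G | EX_DSISR_PROTFAULT)

/* Build a DSISR-like fault descriptor from the HSRR1 value of an instruction fault. */
static uint32_t ex_hisi_to_dsisr(uint64_t hsrr1)
{
	uint32_t dsisr = (uint32_t)hsrr1 & EX_SRR1_MATCH;

	if (hsrr1 & EX_HSRR1_HISI_WRITE)
		dsisr |= EX_DSISR_ISSTORE;
	return dsisr;
}
```

A page-fault handler that receives this value instead of 0 can tell a missing mapping apart from a failed R/C update, which is the distinction the commit says was being lost.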
    • KVM: PPC: Book3S HV: Streamlined guest entry/exit path on P9 for radix guests · 95a6432c
      Paul Mackerras committed
      This creates an alternative guest entry/exit path which is used for
      radix guests on POWER9 systems when we have indep_threads_mode=Y.  In
      these circumstances there is exactly one vcpu per vcore and there is
      no coordination required between vcpus or vcores; the vcpu can enter
      the guest without needing to synchronize with anything else.
      
      The new fast path is implemented almost entirely in C in book3s_hv.c
      and runs with the MMU on until the guest is entered.  On guest exit
      we use the existing path until the point where we are committed to
      exiting the guest (as distinct from handling an interrupt in the
      low-level code and returning to the guest) and we have pulled the
      guest context from the XIVE.  At that point we check a flag in the
      stack frame to see whether we came in via the old path or the new
      path; if we came in via the new path then we go back to C code to do
      the rest of the process of saving the guest context and restoring the
      host context.
      
      The C code is split into separate functions for handling the
      OS-accessible state and the hypervisor state, with the idea that the
      latter can be replaced by a hypercall when we implement nested
      virtualization.
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      [mpe: Fix CONFIG_ALTIVEC=n build]
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • KVM: PPC: Book3S HV: Call kvmppc_handle_exit_hv() with vcore unlocked · 53655ddd
      Paul Mackerras committed
      Currently kvmppc_handle_exit_hv() is called with the vcore lock held
      because it is called within a for_each_runnable_thread loop.
      However, we already unlock the vcore within kvmppc_handle_exit_hv()
      under certain circumstances, and this is safe because (a) any vcpus
      that become runnable and are added to the runnable set by
      kvmppc_run_vcpu() have their vcpu->arch.trap == 0 and can't actually
      run in the guest (because the vcore state is VCORE_EXITING), and
      (b) for_each_runnable_thread is safe against addition or removal
      of vcpus from the runnable set.
      
      Therefore, in order to simplify things for following patches, let's
      drop the vcore lock in the for_each_runnable_thread loop, so
      kvmppc_handle_exit_hv() gets called without the vcore lock held.
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • KVM: PPC: Book3S HV: Move interrupt delivery on guest entry to C code · f7035ce9
      Paul Mackerras committed
      This is based on a patch by Suraj Jitindar Singh.
      
      This moves the code in book3s_hv_rmhandlers.S that generates an
      external, decrementer or privileged doorbell interrupt just before
      entering the guest to C code in book3s_hv_builtin.c.  This is to
      make future maintenance and modification easier.  The algorithm
      expressed in the C code is almost identical to the previous
      algorithm.
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
  2. 21 Aug, 2018 1 commit
  3. 30 Jul, 2018 1 commit
  4. 26 Jul, 2018 2 commits
    • KVM: PPC: Book3S HV: Read kvm->arch.emul_smt_mode under kvm->lock · b5c6f760
      Paul Mackerras committed
      Commit 1e175d2e ("KVM: PPC: Book3S HV: Pack VCORE IDs to access full
      VCPU ID space", 2018-07-25) added code that uses kvm->arch.emul_smt_mode
      before any VCPUs are created.  However, userspace can change
      kvm->arch.emul_smt_mode at any time up until the first VCPU is created.
      Hence it is (theoretically) possible for the check in
      kvmppc_core_vcpu_create_hv() to race with another userspace thread
      changing kvm->arch.emul_smt_mode.
      
      This fixes it by moving the test that uses kvm->arch.emul_smt_mode into
      the block where kvm->lock is held.
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
    • KVM: PPC: Book3S HV: Pack VCORE IDs to access full VCPU ID space · 1e175d2e
      Sam Bobroff committed
      It is not currently possible to create the full number of possible
      VCPUs (KVM_MAX_VCPUS) on Power9 with KVM-HV when the guest uses fewer
      threads per core than its core stride (or "VSMT mode"). This is
      because the VCORE ID and XIVE offsets grow beyond KVM_MAX_VCPUS
      even though the VCPU ID is less than KVM_MAX_VCPU_ID.
      
      To address this, "pack" the VCORE ID and XIVE offsets by using
      knowledge of the way the VCPU IDs will be used when there are fewer
      guest threads per core than the core stride. The primary thread of
      each core will always be used first. Then, if the guest uses more than
      one thread per core, these secondary threads will sequentially follow
      the primary in each core.
      
      So, the only way an ID above KVM_MAX_VCPUS can be seen, is if the
      VCPUs are being spaced apart, so at least half of each core is empty,
      and IDs between KVM_MAX_VCPUS and (KVM_MAX_VCPUS * 2) can be mapped
      into the second half of each core (4..7, in an 8-thread core).
      
      Similarly, if IDs above KVM_MAX_VCPUS * 2 are seen, at least 3/4 of
      each core is being left empty, and we can map down into the second and
      third quarters of each core (2, 3 and 5, 6 in an 8-thread core).
      
      Lastly, if IDs above KVM_MAX_VCPUS * 4 are seen, only the primary
      threads are being used and 7/8 of the core is empty, allowing use of
      the 1, 5, 3 and 7 thread slots.
      
      (Strides less than 8 are handled similarly.)
      
      This allows the VCORE ID or offset to be calculated quickly from the
      VCPU ID or XIVE server numbers, without access to the VCPU structure.
      
      [paulus@ozlabs.org - tidied up comment a little, changed some WARN_ONCE
       to pr_devel, wrapped line, fixed id check.]
      Signed-off-by: Sam Bobroff <sam.bobroff@au1.ibm.com>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
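The packing scheme described in the commit above can be sketched in user-space C as follows. Each block of `EX_MAX_VCPUS` consecutive IDs is folded into an unused thread slot of the cores, in the slot order the message gives (4, then 2 and 6, then 1, 5, 3 and 7). The `EX_` constants, the function name and the deliberately tiny limit of 16 VCPUs are assumptions for illustration; the offset table is one reading of the description, not a copy of the kernel code.

```c
#include <assert.h>
#include <stdint.h>

#define EX_MAX_SMT_THREADS 8  /* slots per core, as in the 8-thread example above */
#define EX_MAX_VCPUS 16       /* deliberately tiny, hypothetical limit for the sketch */

/*
 * Map a sparse VCPU ID to a packed ID below EX_MAX_VCPUS.
 * 'stride' is the core stride (VSMT mode). Returns -1 if the ID cannot be packed.
 */
static int ex_pack_vcpu_id(uint32_t id, int stride)
{
	/* Slot used by each successive block of EX_MAX_VCPUS IDs. */
	static const int block_offsets[EX_MAX_SMT_THREADS] = {0, 4, 2, 6, 1, 5, 3, 7};
	int block;
	uint32_t packed;

	assert(stride > 0 && stride <= EX_MAX_SMT_THREADS);

	/* Smaller strides leave fewer free slots, so each block skips more table entries. */
	block = (int)(id / EX_MAX_VCPUS) * (EX_MAX_SMT_THREADS / stride);
	if (block >= EX_MAX_SMT_THREADS)
		return -1;

	packed = (id % EX_MAX_VCPUS) + (uint32_t)block_offsets[block];
	if (packed >= EX_MAX_VCPUS)
		return -1;
	return (int)packed;
}
```

With a stride of 8 and one guest thread per core, the sixteen IDs 0, 8, 16, ..., 120 map to sixteen distinct packed IDs in the range 0..15; for example, ID 16 lands in slot 4 and ID 32 in slot 2.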
  5. 18 Jul, 2018 2 commits
  6. 16 Jul, 2018 1 commit
  7. 20 Jun, 2018 1 commit
  8. 13 Jun, 2018 1 commit
    • treewide: Use array_size() in vzalloc() · fad953ce
      Kees Cook committed
      The vzalloc() function has no 2-factor argument form, so multiplication
      factors need to be wrapped in array_size(). This patch replaces cases of:
      
              vzalloc(a * b)
      
      with:
              vzalloc(array_size(a, b))
      
      as well as handling cases of:
      
              vzalloc(a * b * c)
      
      with:
      
              vzalloc(array3_size(a, b, c))
      
      This does, however, attempt to ignore constant size factors like:
      
              vzalloc(4 * 1024)
      
      though any constants defined via macros get caught up in the conversion.
      
      Any factors with a sizeof() of "unsigned char", "char", and "u8" were
      dropped, since they're redundant.
      
      The Coccinelle script used for this was:
      
      // Fix redundant parens around sizeof().
      @@
      type TYPE;
      expression THING, E;
      @@
      
      (
        vzalloc(
      -	(sizeof(TYPE)) * E
      +	sizeof(TYPE) * E
        , ...)
      |
        vzalloc(
      -	(sizeof(THING)) * E
      +	sizeof(THING) * E
        , ...)
      )
      
      // Drop single-byte sizes and redundant parens.
      @@
      expression COUNT;
      typedef u8;
      typedef __u8;
      @@
      
      (
        vzalloc(
      -	sizeof(u8) * (COUNT)
      +	COUNT
        , ...)
      |
        vzalloc(
      -	sizeof(__u8) * (COUNT)
      +	COUNT
        , ...)
      |
        vzalloc(
      -	sizeof(char) * (COUNT)
      +	COUNT
        , ...)
      |
        vzalloc(
      -	sizeof(unsigned char) * (COUNT)
      +	COUNT
        , ...)
      |
        vzalloc(
      -	sizeof(u8) * COUNT
      +	COUNT
        , ...)
      |
        vzalloc(
      -	sizeof(__u8) * COUNT
      +	COUNT
        , ...)
      |
        vzalloc(
      -	sizeof(char) * COUNT
      +	COUNT
        , ...)
      |
        vzalloc(
      -	sizeof(unsigned char) * COUNT
      +	COUNT
        , ...)
      )
      
      // 2-factor product with sizeof(type/expression) and identifier or constant.
      @@
      type TYPE;
      expression THING;
      identifier COUNT_ID;
      constant COUNT_CONST;
      @@
      
      (
        vzalloc(
      -	sizeof(TYPE) * (COUNT_ID)
      +	array_size(COUNT_ID, sizeof(TYPE))
        , ...)
      |
        vzalloc(
      -	sizeof(TYPE) * COUNT_ID
      +	array_size(COUNT_ID, sizeof(TYPE))
        , ...)
      |
        vzalloc(
      -	sizeof(TYPE) * (COUNT_CONST)
      +	array_size(COUNT_CONST, sizeof(TYPE))
        , ...)
      |
        vzalloc(
      -	sizeof(TYPE) * COUNT_CONST
      +	array_size(COUNT_CONST, sizeof(TYPE))
        , ...)
      |
        vzalloc(
      -	sizeof(THING) * (COUNT_ID)
      +	array_size(COUNT_ID, sizeof(THING))
        , ...)
      |
        vzalloc(
      -	sizeof(THING) * COUNT_ID
      +	array_size(COUNT_ID, sizeof(THING))
        , ...)
      |
        vzalloc(
      -	sizeof(THING) * (COUNT_CONST)
      +	array_size(COUNT_CONST, sizeof(THING))
        , ...)
      |
        vzalloc(
      -	sizeof(THING) * COUNT_CONST
      +	array_size(COUNT_CONST, sizeof(THING))
        , ...)
      )
      
      // 2-factor product, only identifiers.
      @@
      identifier SIZE, COUNT;
      @@
      
        vzalloc(
      -	SIZE * COUNT
      +	array_size(COUNT, SIZE)
        , ...)
      
      // 3-factor product with 1 sizeof(type) or sizeof(expression), with
      // redundant parens removed.
      @@
      expression THING;
      identifier STRIDE, COUNT;
      type TYPE;
      @@
      
      (
        vzalloc(
      -	sizeof(TYPE) * (COUNT) * (STRIDE)
      +	array3_size(COUNT, STRIDE, sizeof(TYPE))
        , ...)
      |
        vzalloc(
      -	sizeof(TYPE) * (COUNT) * STRIDE
      +	array3_size(COUNT, STRIDE, sizeof(TYPE))
        , ...)
      |
        vzalloc(
      -	sizeof(TYPE) * COUNT * (STRIDE)
      +	array3_size(COUNT, STRIDE, sizeof(TYPE))
        , ...)
      |
        vzalloc(
      -	sizeof(TYPE) * COUNT * STRIDE
      +	array3_size(COUNT, STRIDE, sizeof(TYPE))
        , ...)
      |
        vzalloc(
      -	sizeof(THING) * (COUNT) * (STRIDE)
      +	array3_size(COUNT, STRIDE, sizeof(THING))
        , ...)
      |
        vzalloc(
      -	sizeof(THING) * (COUNT) * STRIDE
      +	array3_size(COUNT, STRIDE, sizeof(THING))
        , ...)
      |
        vzalloc(
      -	sizeof(THING) * COUNT * (STRIDE)
      +	array3_size(COUNT, STRIDE, sizeof(THING))
        , ...)
      |
        vzalloc(
      -	sizeof(THING) * COUNT * STRIDE
      +	array3_size(COUNT, STRIDE, sizeof(THING))
        , ...)
      )
      
      // 3-factor product with 2 sizeof(variable), with redundant parens removed.
      @@
      expression THING1, THING2;
      identifier COUNT;
      type TYPE1, TYPE2;
      @@
      
      (
        vzalloc(
      -	sizeof(TYPE1) * sizeof(TYPE2) * COUNT
      +	array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
        , ...)
      |
        vzalloc(
      -	sizeof(TYPE1) * sizeof(THING2) * (COUNT)
      +	array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
        , ...)
      |
        vzalloc(
      -	sizeof(THING1) * sizeof(THING2) * COUNT
      +	array3_size(COUNT, sizeof(THING1), sizeof(THING2))
        , ...)
      |
        vzalloc(
      -	sizeof(THING1) * sizeof(THING2) * (COUNT)
      +	array3_size(COUNT, sizeof(THING1), sizeof(THING2))
        , ...)
      |
        vzalloc(
      -	sizeof(TYPE1) * sizeof(THING2) * COUNT
      +	array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
        , ...)
      |
        vzalloc(
      -	sizeof(TYPE1) * sizeof(THING2) * (COUNT)
      +	array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
        , ...)
      )
      
      // 3-factor product, only identifiers, with redundant parens removed.
      @@
      identifier STRIDE, SIZE, COUNT;
      @@
      
      (
        vzalloc(
      -	(COUNT) * STRIDE * SIZE
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        vzalloc(
      -	COUNT * (STRIDE) * SIZE
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        vzalloc(
      -	COUNT * STRIDE * (SIZE)
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        vzalloc(
      -	(COUNT) * (STRIDE) * SIZE
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        vzalloc(
      -	COUNT * (STRIDE) * (SIZE)
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        vzalloc(
      -	(COUNT) * STRIDE * (SIZE)
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        vzalloc(
      -	(COUNT) * (STRIDE) * (SIZE)
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        vzalloc(
      -	COUNT * STRIDE * SIZE
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      )
      
      // Any remaining multi-factor products, first at least 3-factor products
      // when they're not all constants...
      @@
      expression E1, E2, E3;
      constant C1, C2, C3;
      @@
      
      (
        vzalloc(C1 * C2 * C3, ...)
      |
        vzalloc(
      -	E1 * E2 * E3
      +	array3_size(E1, E2, E3)
        , ...)
      )
      
      // And then all remaining 2 factors products when they're not all constants.
      @@
      expression E1, E2;
      constant C1, C2;
      @@
      
      (
        vzalloc(C1 * C2, ...)
      |
        vzalloc(
      -	E1 * E2
      +	array_size(E1, E2)
        , ...)
      )
      Signed-off-by: Kees Cook <keescook@chromium.org>
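To show what the conversion above buys, here is a minimal user-space sketch of overflow-checked size helpers. It assumes the helpers saturate to `SIZE_MAX` on overflow so that a subsequent allocation fails instead of silently allocating a too-small buffer. The `ex_` functions are stand-ins written for this example, not the kernel's `array_size()` and `array3_size()` implementations.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Multiply a by b into *out; return true (leaving *out untouched) if the product overflows. */
static bool ex_mul_overflow(size_t a, size_t b, size_t *out)
{
	if (a != 0 && b > SIZE_MAX / a)
		return true;
	*out = a * b;
	return false;
}

/* Two-factor size: the product, or SIZE_MAX if it does not fit in size_t. */
static size_t ex_array_size(size_t a, size_t b)
{
	size_t bytes;

	return ex_mul_overflow(a, b, &bytes) ? SIZE_MAX : bytes;
}

/* Three-factor size: SIZE_MAX if either intermediate multiplication overflows. */
static size_t ex_array3_size(size_t a, size_t b, size_t c)
{
	size_t bytes;

	if (ex_mul_overflow(a, b, &bytes) || ex_mul_overflow(bytes, c, &bytes))
		return SIZE_MAX;
	return bytes;
}
```

A plain `a * b` would wrap around for large operands and yield a small, plausible-looking size. The saturated result is a request no allocator can satisfy, so the caller sees an allocation failure instead.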
  9. 02 Jun, 2018 1 commit
    • kvm: no need to check return value of debugfs_create functions · 929f45e3
      Greg Kroah-Hartman committed
      When calling debugfs functions, there is no need to ever check the
      return value.  The function can work or not, but the code logic should
      never do something different based on this.
      
      This cleans up the error handling a lot, as this code will never get
      hit.
      
      Cc: Paul Mackerras <paulus@ozlabs.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Christoffer Dall <christoffer.dall@arm.com>
      Cc: Marc Zyngier <marc.zyngier@arm.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: "Radim Krčmář" <rkrcmar@redhat.com>
      Cc: Arvind Yadav <arvind.yadav.cs@gmail.com>
      Cc: Eric Auger <eric.auger@redhat.com>
      Cc: Andre Przywara <andre.przywara@arm.com>
      Cc: kvm-ppc@vger.kernel.org
      Cc: linuxppc-dev@lists.ozlabs.org
      Cc: linux-kernel@vger.kernel.org
      Cc: linux-arm-kernel@lists.infradead.org
      Cc: kvmarm@lists.cs.columbia.edu
      Cc: kvm@vger.kernel.org
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  10. 18 May, 2018 2 commits
  11. 17 May, 2018 3 commits
    • KVM: PPC: Book3S HV: Set RWMR on POWER8 so PURR/SPURR count correctly · 7aa15842
      Paul Mackerras committed
      Although Linux doesn't use PURR and SPURR ((Scaled) Processor
      Utilization of Resources Register), other OSes depend on them.
      On POWER8 they count at a rate depending on whether the VCPU is
      idle or running, the activity of the VCPU, and the value in the
      RWMR (Region-Weighting Mode Register).  Hardware expects the
      hypervisor to update the RWMR when a core is dispatched to reflect
      the number of online VCPUs in the vcore.
      
      This adds code to maintain a count in the vcore struct indicating
      how many VCPUs are online.  In kvmppc_run_core we use that count
      to set the RWMR register on POWER8.  If the core is split because
      of a static or dynamic micro-threading mode, we use the value for
      8 threads.  The RWMR value is not relevant when the host is
      executing because Linux does not use the PURR or SPURR register,
      so we don't bother saving and restoring the host value.
      
      For the sake of old userspace which does not set the KVM_REG_PPC_ONLINE
      register, we set online to 1 if it was 0 at the time of a KVM_RUN
      ioctl.
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
    • KVM: PPC: Book3S HV: Add 'online' register to ONE_REG interface · a1f15826
      Paul Mackerras committed
      This adds a new KVM_REG_PPC_ONLINE register which userspace can set
      to 0 or 1 via the GET/SET_ONE_REG interface to indicate whether it
      considers the VCPU to be offline (0), that is, not currently running,
      or online (1).  This will be used in a later patch to configure the
      register which controls PURR and SPURR accumulation.
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
    • KVM: PPC: Book3S HV: Snapshot timebase offset on guest entry · 57b8daa7
      Paul Mackerras committed
      Currently, the HV KVM guest entry/exit code adds the timebase offset
      from the vcore struct to the timebase on guest entry, and subtracts
      it on guest exit.  Which is fine, except that it is possible for
      userspace to change the offset using the SET_ONE_REG interface while
      the vcore is running, as there is only one timebase offset per vcore
      but potentially multiple VCPUs in the vcore.  If that were to happen,
      KVM would subtract a different offset on guest exit from that which
      it had added on guest entry, leading to the timebase being out of sync
      between cores in the host, which then leads to bad things happening
      such as hangs and spurious watchdog timeouts.
      
      To fix this, we add a new field 'tb_offset_applied' to the vcore struct
      which stores the offset that is currently applied to the timebase.
      This value is set from the vcore tb_offset field on guest entry, and
      is what is subtracted from the timebase on guest exit.  Since it is
      zero when the timebase offset is not applied, we can simplify the
      logic in kvmhv_start_timing and kvmhv_accumulate_time.
      
      In addition, we had secondary threads reading the timebase while
      running concurrently with code on the primary thread which would
      eventually add or subtract the timebase offset from the timebase.
      This occurred while saving or restoring the DEC register value on
      the secondary threads.  Although no specific incorrect behaviour has
      been observed, this is a race which should be fixed.  To fix it, we
      move the DEC saving code to just before we call kvmhv_commence_exit,
      and the DEC restoring code to after the point where we have waited
      for the primary thread to switch the MMU context and add the timebase
      offset.  That way we are sure that the timebase contains the guest
      timebase value in both cases.
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
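The snapshot idea from the timebase commit above can be modelled in a few lines of user-space C. The offset is copied into `tb_offset_applied` on entry, and exactly that copy is subtracted on exit, so a concurrent change to `tb_offset` cannot unbalance the timebase. The struct, the `ex_` function names and the global `ex_timebase` variable are hypothetical stand-ins for the vcore struct and the hardware register; only the two field names come from the commit message.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical cut-down vcore: only the two fields discussed in the commit message. */
struct ex_vcore {
	uint64_t tb_offset;         /* may be changed by userspace while the vcore runs */
	uint64_t tb_offset_applied; /* offset currently added to the timebase; 0 if none */
};

/* Stand-in for the hardware timebase register. */
static uint64_t ex_timebase;

/* Guest entry: snapshot the offset, then apply the snapshot. */
static void ex_guest_entry(struct ex_vcore *vc)
{
	vc->tb_offset_applied = vc->tb_offset;
	ex_timebase += vc->tb_offset_applied;
}

/* Guest exit: undo exactly what was applied, regardless of the current tb_offset. */
static void ex_guest_exit(struct ex_vcore *vc)
{
	ex_timebase -= vc->tb_offset_applied;
	vc->tb_offset_applied = 0;
}
```

If `tb_offset` is changed between `ex_guest_entry()` and `ex_guest_exit()`, the timebase still returns to its pre-entry value, because the exit path never reads `tb_offset`.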
  12. 03 May, 2018 1 commit
    • powerpc64/ftrace: Disable ftrace during kvm entry/exit · a4bc64d3
      Naveen N. Rao committed
      During guest entry/exit, we switch over to/from the guest MMU context
      and we cannot take exceptions in the hypervisor code.
      
      Since ftrace may be enabled and since it can result in us taking a trap,
      disable ftrace by setting paca->ftrace_enabled to zero. There are two
      paths through which we enter/exit a guest:
      1. If we are the vcore runner, then we enter the guest via
      __kvmppc_vcore_entry() and we disable ftrace around this. This is always
      the case for Power9, and for the primary thread on Power8.
      2. If we are a secondary thread in Power8, then we would be in nap due
      to SMT being disabled. We are woken up by an IPI to enter the guest. In
      this scenario, we enter the guest through kvm_start_guest(). We disable
      ftrace at this point. In this scenario, ftrace would only get re-enabled
      on the secondary thread when SMT is re-enabled (via start_secondary()).
      Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
  13. 03 Apr, 2018 1 commit
  14. 30 Mar, 2018 2 commits
  15. 27 Mar, 2018 1 commit
  16. 23 Mar, 2018 1 commit
    • KVM: PPC: Book3S HV: Work around transactional memory bugs in POWER9 · 4bb3c7a0
      Paul Mackerras committed
      POWER9 has hardware bugs relating to transactional memory and thread
      reconfiguration (changes to hardware SMT mode).  Specifically, the core
      does not have enough storage to store a complete checkpoint of all the
      architected state for all four threads.  The DD2.2 version of POWER9
      includes hardware modifications designed to allow hypervisor software
      to implement workarounds for these problems.  This patch implements
      those workarounds in KVM code so that KVM guests see a full, working
      transactional memory implementation.
      
      The problems center around the use of TM suspended state, where the
      CPU has a checkpointed state but execution is not transactional.  The
      workaround is to implement a "fake suspend" state, which looks to the
      guest like suspended state but the CPU does not store a checkpoint.
      In this state, any instruction that would cause a transition to
      transactional state (rfid, rfebb, mtmsrd, tresume) or would use the
      checkpointed state (treclaim) causes a "soft patch" interrupt (vector
      0x1500) to the hypervisor so that it can be emulated.  The trechkpt
      instruction also causes a soft patch interrupt.
      
      On POWER9 DD2.2, we avoid returning to the guest in any state which
      would require a checkpoint to be present.  The trechkpt in the guest
      entry path which would normally create that checkpoint is replaced by
      either a transition to fake suspend state, if the guest is in suspend
      state, or a rollback to the pre-transactional state if the guest is in
      transactional state.  Fake suspend state is indicated by a flag in the
      PACA plus a new bit in the PSSCR.  The new PSSCR bit is write-only and
      reads back as 0.
      
      On exit from the guest, if the guest is in fake suspend state, we still
      do the treclaim instruction as we would in real suspend state, in order
      to get into non-transactional state, but we do not save the resulting
      register state since there was no checkpoint.
      
      Emulation of the instructions that cause a softpatch interrupt is
      handled in two paths.  If the guest is in real suspend mode, we call
      kvmhv_p9_tm_emulation_early() to handle the cases where the guest is
      transitioning to transactional state.  This is called before we do the
      treclaim in the guest exit path; because we haven't done treclaim, we
      can get back to the guest with the transaction still active.  If the
      instruction is a case that kvmhv_p9_tm_emulation_early() doesn't
      handle, or if the guest is in fake suspend state, then we proceed to
      do the complete guest exit path and subsequently call
      kvmhv_p9_tm_emulation() in host context with the MMU on.  This handles
      all the cases including the cases that generate program interrupts
      (illegal instruction or TM Bad Thing) and facility unavailable
      interrupts.
      
      The emulation is reasonably straightforward and is mostly concerned
      with checking for exception conditions and updating the state of
      registers such as MSR and CR0.  The treclaim emulation takes care to
      ensure that the TEXASR register gets updated as if it were the guest
      treclaim instruction that had done failure recording, not the treclaim
      done in hypervisor state in the guest exit path.
      
      With this, the KVM_CAP_PPC_HTM capability returns true (1) even if
      transactional memory is not available to host userspace.
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
  17. 19 Mar, 2018 1 commit
  18. 03 Mar, 2018 1 commit
  19. 02 Mar, 2018 1 commit
    • KVM: PPC: Book3S HV: Fix VRMA initialization with 2MB or 1GB memory backing · debd574f
      Paul Mackerras committed
      The current code for initializing the VRMA (virtual real memory area)
      for HPT guests requires the page size of the backing memory to be one
      of 4kB, 64kB or 16MB.  With a radix host we have the possibility that
      the backing memory page size can be 2MB or 1GB.  In these cases, if the
      guest switches to HPT mode, KVM will not initialize the VRMA and the
      guest will fail to run.
      
      In fact it is not necessary that the VRMA page size is the same as the
      backing memory page size; any VRMA page size less than or equal to the
      backing memory page size is acceptable.  Therefore we now choose the
      largest page size out of the set {4k, 64k, 16M} which is not larger
      than the backing memory page size.
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
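The selection rule in the VRMA commit above (the largest of 4k, 64k and 16M that does not exceed the backing page size) is simple enough to sketch directly. The function name is hypothetical, and returning 0 for a backing size smaller than 4k is a choice made for this sketch; the commit message does not say how that case is handled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Pick the VRMA page size for a given backing page size (both in bytes):
 * the largest candidate from {16M, 64k, 4k} that is not larger than the backing size.
 * Returns 0 if even 4k does not fit (a convention of this sketch only).
 */
static uint64_t ex_vrma_page_size(uint64_t backing)
{
	/* Candidates in descending order, so the first match is the largest. */
	static const uint64_t candidates[] = { 16ULL << 20, 64ULL << 10, 4ULL << 10 };

	for (size_t i = 0; i < sizeof(candidates) / sizeof(candidates[0]); i++) {
		if (candidates[i] <= backing)
			return candidates[i];
	}
	return 0;
}
```

Under this rule a 2MB backing page yields a 64k VRMA page size and a 1GB backing page yields 16M, which covers the two radix-host cases the commit says previously left the VRMA uninitialized.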
  20. 01 Feb, 2018 1 commit
    • KVM: PPC: Book3S HV: Drop locks before reading guest memory · 36ee41d1
      Paul Mackerras committed
      Running with CONFIG_DEBUG_ATOMIC_SLEEP reveals that HV KVM tries to
      read guest memory, in order to emulate guest instructions, while
      preempt is disabled and a vcore lock is held.  This occurs in
      kvmppc_handle_exit_hv(), called from post_guest_process(), when
      emulating guest doorbell instructions on POWER9 systems, and also
      when checking whether we have hit a hypervisor breakpoint.
      Reading guest memory can cause a page fault and thus cause the
      task to sleep, so we need to avoid reading guest memory while
      holding a spinlock or when preempt is disabled.
      
      To fix this, we move the preempt_enable() in kvmppc_run_core() to
      before the loop that calls post_guest_process() for each vcore that
      has just run, and we drop and re-take the vcore lock around the calls
      to kvmppc_emulate_debug_inst() and kvmppc_emulate_doorbell_instr().
      
      Dropping the lock is safe with respect to the iteration over the
      runnable vcpus in post_guest_process(); for_each_runnable_thread
      is actually safe to use locklessly.  It is possible for a vcpu
      to become runnable and add itself to the runnable_threads array
      (code near the beginning of kvmppc_run_vcpu()) and then get included
      in the iteration in post_guest_process despite the fact that it
      has not just run.  This is benign because vcpu->arch.trap and
      vcpu->arch.ceded will be zero.
      
      Cc: stable@vger.kernel.org # v4.13+
      Fixes: 57900694 ("KVM: PPC: Book3S HV: Virtualize doorbell facility on POWER9")
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      36ee41d1
  21. 22 Jan, 2018 1 commit
    • R
      powerpc: Use octal numbers for file permissions · 57ad583f
      Russell Currey committed
      Symbolic macros are unintuitive and hard to read, whereas octal constants
      are much easier to interpret.  Replace macros for the basic permission
      flags (user/group/other read/write/execute) with numeric constants
      instead, across the whole powerpc tree.
      
      Introducing a significant number of changes across the tree for no runtime
      benefit isn't exactly desirable, but so long as these macros are still
      used in the tree people will keep sending patches that add them.  Not only
      are they hard to parse at a glance, there are multiple ways of coming to
      the same value (as you can see with 0444 and 0644 in this patch) which
      hurts readability.
      Signed-off-by: Russell Currey <ruscur@russell.cc>
      Reviewed-by: Cyril Bur <cyrilbur@gmail.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      57ad583f
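To make the readability argument in 57ad583f concrete, the following userspace sketch spells out two common permission values symbolically and checks them against their octal forms. It uses the POSIX `<sys/stat.h>` macros rather than the kernel-internal combined macros, and the `RO_SYMBOLIC` and `RW_SYMBOLIC` names are made up for the example.

```c
#include <sys/stat.h>

/* Read permission for user, group and other: the same bits as octal 0444. */
#define RO_SYMBOLIC (S_IRUSR | S_IRGRP | S_IROTH)

/* As above plus write permission for the owner: the same bits as octal 0644. */
#define RW_SYMBOLIC (S_IRUSR | S_IWUSR | S_IRGRP | S_IROTH)

/* Return 1 if the symbolic spellings match the octal constants, else 0. */
int symbolic_matches_octal(void)
{
	return RO_SYMBOLIC == 0444 && RW_SYMBOLIC == 0644;
}
```

The second value can also be reached as `RO_SYMBOLIC | S_IWUSR`, which illustrates the commit's point that symbolic macros allow several spellings of one mode.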
  22. 19 Jan, 2018 1 commit
  23. 18 Jan, 2018 1 commit
    • P
      KVM: PPC: Book3S HV: Allow HPT and radix on the same core for POWER9 v2.2 · 00608e1f
      Paul Mackerras committed
      POWER9 chip versions starting with "Nimbus" v2.2 can support running
      with some threads of a core in HPT mode and others in radix mode.
      This means that we don't have to prohibit independent-threads mode
      when running a HPT guest on a radix host, and we don't have to do any
      of the synchronization between threads that was introduced in commit
      c0101509 ("KVM: PPC: Book3S HV: Run HPT guests on POWER9 radix
      hosts", 2017-10-19).
      
      Rather than using up another CPU feature bit, we just do an
      explicit test on the PVR (processor version register) at module
      startup time to determine whether we have to take steps to avoid
      having some threads in HPT mode and some in radix mode (so-called
      "mixed mode").  We test for "Nimbus" (indicated by 0 or 1 in the top
      nibble of the lower 16 bits) v2.2 or later, or "Cumulus" (indicated by
      2 or 3 in that nibble) v1.1 or later.
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      00608e1f
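The PVR test described in 00608e1f might look roughly like the sketch below. Several details are assumptions that the commit message does not state: the POWER9 processor type value `0x004E` in the upper 16 bits, the version being held in the low 12 bits with the major number in bits 8-11 (so v2.2 reads as `0x202`), and the treatment of non-POWER9 or unrecognized chips. The function name is hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed POWER9 processor type in the upper 16 bits of the PVR. */
#define PVR_POWER9_SKETCH 0x004Eu

/*
 * Hypothetical helper: return true if the chip cannot run HPT and radix
 * threads on the same core, i.e. "mixed mode" must be avoided.
 */
bool must_avoid_mixed_mode(uint32_t pvr)
{
	uint32_t family, version;

	if ((pvr >> 16) != PVR_POWER9_SKETCH)
		return false;              /* assumption: only POWER9 is checked */

	family = (pvr >> 12) & 0xf;        /* top nibble of the lower 16 bits */
	version = pvr & 0xfff;             /* assumed layout: major.minor */

	if (family <= 1)                   /* 0 or 1: "Nimbus" */
		return version < 0x202;    /* older than v2.2 */
	if (family <= 3)                   /* 2 or 3: "Cumulus" */
		return version < 0x101;    /* older than v1.1 */
	return false;                      /* assumption: other variants are fine */
}
```

A module would evaluate such a test once at startup and cache the result, as the commit message describes, rather than consuming another CPU feature bit.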
  24. 16 Jan, 2018 1 commit
    • P
      KVM: PPC: Book3S HV: Enable migration of decrementer register · 5855564c
      Paul Mackerras committed
      This adds a register identifier for use with the one_reg interface
      to allow the decrementer expiry time to be read and written by
      userspace.  The decrementer expiry time is in guest timebase units
      and is equal to the sum of the decrementer and the guest timebase.
      (The expiry time is used rather than the decrementer value itself
      because the expiry time is not constantly changing, though the
      decrementer value is, while the guest vcpu is not running.)
      
      Without this, a guest vcpu migrated to a new host will see its
      decrementer set to some random value.  On POWER8 and earlier, the
      decrementer is 32 bits wide and counts down at 512MHz, so the
      guest vcpu will potentially see no decrementer interrupts for up
      to about 4 seconds, which will lead to a stall.  With POWER9, the
      decrementer is now 56 bits wide, so the stall can be much longer
      (up to 2.23 years) and more noticeable.
      
      To help work around the problem in cases where userspace has not been
      updated to migrate the decrementer expiry time, we now set the
      default decrementer expiry at vcpu creation time to the current time
      rather than the maximum possible value.  This should mean an
      immediate decrementer interrupt when a migrated vcpu starts
      running.  In cases where the decrementer is 32 bits wide and more
      than 4 seconds elapse between the creation of the vcpu and when it
      first runs, the decrementer would have wrapped around to positive
      values and there may still be a stall - but this is no worse than
      the current situation.  In the large-decrementer case, we are sure
      to get an immediate decrementer interrupt (assuming the time from
      vcpu creation to first run is less than 2.23 years) and we thus
      avoid a very long stall.
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      5855564c
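The arithmetic behind the figures in 5855564c can be checked with a short sketch. The function names are hypothetical, and the stall bound assumes the decrementer is treated as a signed value, so the longest wait corresponds to its largest positive value counting down at 512MHz.

```c
#include <stdint.h>

/* Decrementer tick rate stated in the commit message. */
#define TB_HZ 512000000ULL

/* Expiry time in guest timebase units: decrementer plus guest timebase. */
uint64_t dec_expiry(uint64_t guest_tb, int64_t dec)
{
	return guest_tb + (uint64_t)dec;
}

/*
 * Whole seconds until a decrementer of the given width, starting at its
 * largest positive value, counts down to zero. Assumes 2 <= dec_bits <= 64.
 */
uint64_t max_stall_seconds(unsigned int dec_bits)
{
	uint64_t max_positive = (1ULL << (dec_bits - 1)) - 1;

	return max_positive / TB_HZ;
}
```

With these assumptions, a 32-bit decrementer gives about 4 seconds, and a 56-bit decrementer gives 70368744 seconds, which is about 2.23 years.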
  25. 23 Nov, 2017 2 commits
    • P
      KVM: PPC: Book3S HV: Fix conditions for starting vcpu · c0093f1a
      Paul Mackerras committed
      This corrects the test that determines whether a vcpu that has just
      become able to run in the guest (e.g. it has just finished handling
      a hypercall or hypervisor page fault) and whose virtual core is
      already running somewhere as a "piggybacked" vcore can start
      immediately or not.  (A piggybacked vcore is one which is executing
      along with another vcore as a result of dynamic micro-threading.)
      
      Previously the test tried to lock the piggybacked vcore using
      spin_trylock, which would always fail because the vcore was already
      locked, and so the vcpu would have to wait until its vcore exited
      the guest before it could enter.
      
      In fact the vcpu can enter if its vcore is in VCORE_PIGGYBACK state
      and not already exiting (or exited) the guest, so the test in
      VCORE_PIGGYBACK state is basically the same as for VCORE_RUNNING
      state.
      
      Coverity detected this as a double unlock issue, which it isn't
      because the spin_trylock would always fail.  This will fix the
      apparent double unlock as well.
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      c0093f1a
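The corrected condition from c0093f1a can be modelled with a simplified predicate. The types and names below are stand-ins invented for the example, not the kernel's structures, and the `exiting` flag is an assumed simplification of however the real code tracks a vcore that is exiting or has exited the guest.

```c
#include <stdbool.h>

/* Simplified stand-in for the vcore states mentioned in the commit message. */
enum vcore_state_sketch {
	VC_INACTIVE,
	VC_PIGGYBACK,
	VC_RUNNING
};

struct vcore_sketch {
	enum vcore_state_sketch state;
	bool exiting;   /* true once the vcore is exiting or has exited the guest */
};

/*
 * Hypothetical predicate: a newly runnable vcpu may start immediately if
 * its vcore is running (directly or piggybacked) and is not on its way
 * out of the guest. No trylock on the vcore is involved.
 */
bool vcpu_can_enter_now(const struct vcore_sketch *vc)
{
	return (vc->state == VC_PIGGYBACK || vc->state == VC_RUNNING) &&
	       !vc->exiting;
}
```

The point of the sketch is that the piggybacked case is handled by the same state check as the running case, instead of by a `spin_trylock` that could never succeed.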
    • P
      KVM: PPC: Book3S HV: Remove useless statement · 4fcf361d
      Paul Mackerras committed
      This removes a statement that has no effect.  It should have been
      removed in commit 898b25b2 ("KVM: PPC: Book3S HV: Simplify dynamic
      micro-threading code", 2017-06-22) along with the loop over the
      piggy-backed virtual cores.
      
      This issue was reported by Coverity.
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      4fcf361d