1. 23 8月, 2019 2 次提交
    • S
      KVM: PPC: Book3S HV: Define usage types for rmap array in guest memslot · d22deab6
      Suraj Jitindar Singh 提交于
      The rmap array in the guest memslot is an array of size number of guest
      pages, allocated at memslot creation time. Each rmap entry in this array
      is used to store information about the guest page to which it
      corresponds. For example for a hpt guest it is used to store a lock bit,
      rc bits, a present bit and the index of a hpt entry in the guest hpt
      which maps this page. For a radix guest which is running nested guests
      it is used to store a pointer to a linked list of nested rmap entries
      which store the nested guest physical address which maps this guest
      address and for which there is a pte in the shadow page table.
      
      As there are currently two uses for the rmap array, and the potential
      for this to expand to more in the future, define a type field (being the
      top 8 bits of the rmap entry) to be used to define the type of the rmap
      entry which is currently present and define two values for this field
      for the two current uses of the rmap array.
      
      Since the nested case uses the rmap entry to store a pointer, define
      this type as having the two high bits set as is expected for a pointer.
      Define the hpt entry type as having bit 56 set (bit 7 IBM bit ordering).
      Signed-off-by: NSuraj Jitindar Singh <sjitindarsingh@gmail.com>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      d22deab6
    • P
      KVM: PPC: Book3S: Mark expected switch fall-through · ff7240cc
      Paul Menzel 提交于
      Fix the error below triggered by `-Wimplicit-fallthrough`, by tagging
      it as an expected fall-through.
      
          arch/powerpc/kvm/book3s_32_mmu.c: In function ‘kvmppc_mmu_book3s_32_xlate_pte’:
          arch/powerpc/kvm/book3s_32_mmu.c:241:21: error: this statement may fall through [-Werror=implicit-fallthrough=]
                pte->may_write = true;
                ~~~~~~~~~~~~~~~^~~~~~
          arch/powerpc/kvm/book3s_32_mmu.c:242:5: note: here
               case 3:
               ^~~~
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      ff7240cc
  2. 16 8月, 2019 4 次提交
    • P
      powerpc/xive: Implement get_irqchip_state method for XIVE to fix shutdown race · da15c03b
      Paul Mackerras 提交于
      Testing has revealed the existence of a race condition where a XIVE
      interrupt being shut down can be in one of the XIVE interrupt queues
      (of which there are up to 8 per CPU, one for each priority) at the
      point where free_irq() is called.  If this happens, can return an
      interrupt number which has been shut down.  This can lead to various
      symptoms:
      
      - irq_to_desc(irq) can be NULL.  In this case, no end-of-interrupt
        function gets called, resulting in the CPU's elevated interrupt
        priority (numerically lowered CPPR) never gets reset.  That then
        means that the CPU stops processing interrupts, causing device
        timeouts and other errors in various device drivers.
      
      - The irq descriptor or related data structures can be in the process
        of being freed as the interrupt code is using them.  This typically
        leads to crashes due to bad pointer dereferences.
      
      This race is basically what commit 62e04686 ("genirq: Add optional
      hardware synchronization for shutdown", 2019-06-28) is intended to
      fix, given a get_irqchip_state() method for the interrupt controller
      being used.  It works by polling the interrupt controller when an
      interrupt is being freed until the controller says it is not pending.
      
      With XIVE, the PQ bits of the interrupt source indicate the state of
      the interrupt source, and in particular the P bit goes from 0 to 1 at
      the point where the hardware writes an entry into the interrupt queue
      that this interrupt is directed towards.  Normally, the code will then
      process the interrupt and do an end-of-interrupt (EOI) operation which
      will reset PQ to 00 (assuming another interrupt hasn't been generated
      in the meantime).  However, there are situations where the code resets
      P even though a queue entry exists (for example, by setting PQ to 01,
      which disables the interrupt source), and also situations where the
      code leaves P at 1 after removing the queue entry (for example, this
      is done for escalation interrupts so they cannot fire again until
      they are explicitly re-enabled).
      
      The code already has a 'saved_p' flag for the interrupt source which
      indicates that a queue entry exists, although it isn't maintained
      consistently.  This patch adds a 'stale_p' flag to indicate that
      P has been left at 1 after processing a queue entry, and adds code
      to set and clear saved_p and stale_p as necessary to maintain a
      consistent indication of whether a queue entry may or may not exist.
      
      With this, we can implement xive_get_irqchip_state() by looking at
      stale_p, saved_p and the ESB PQ bits for the interrupt.
      
      There is some additional code to handle escalation interrupts
      properly; because they are enabled and disabled in KVM assembly code,
      which does not have access to the xive_irq_data struct for the
      escalation interrupt.  Hence, stale_p may be incorrect when the
      escalation interrupt is freed in kvmppc_xive_{,native_}cleanup_vcpu().
      Fortunately, we can fix it up by looking at vcpu->arch.xive_esc_on,
      with some careful attention to barriers in order to ensure the correct
      result if xive_esc_irq() races with kvmppc_xive_cleanup_vcpu().
      
      Finally, this adds code to make noise on the console (pr_crit and
      WARN_ON(1)) if we find an interrupt queue entry for an interrupt
      which does not have a descriptor.  While this won't catch the race
      reliably, if it does get triggered it will be an indication that
      the race is occurring and needs to be debugged.
      
      Fixes: 243e2511 ("powerpc/xive: Native exploitation of the XIVE interrupt controller")
      Cc: stable@vger.kernel.org # v4.12+
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20190813100648.GE9567@blackberry
      da15c03b
    • P
      KVM: PPC: Book3S HV: Don't push XIVE context when not using XIVE device · 8d4ba9c9
      Paul Mackerras 提交于
      At present, when running a guest on POWER9 using HV KVM but not using
      an in-kernel interrupt controller (XICS or XIVE), for example if QEMU
      is run with the kernel_irqchip=off option, the guest entry code goes
      ahead and tries to load the guest context into the XIVE hardware, even
      though no context has been set up.
      
      To fix this, we check that the "CAM word" is non-zero before pushing
      it to the hardware.  The CAM word is initialized to a non-zero value
      in kvmppc_xive_connect_vcpu() and kvmppc_xive_native_connect_vcpu(),
      and is now cleared in kvmppc_xive_{,native_}cleanup_vcpu.
      
      Fixes: 5af50993 ("KVM: PPC: Book3S HV: Native usage of the XIVE interrupt controller")
      Cc: stable@vger.kernel.org # v4.12+
      Reported-by: NCédric Le Goater <clg@kaod.org>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      Reviewed-by: NCédric Le Goater <clg@kaod.org>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20190813100100.GC9567@blackberry
      8d4ba9c9
    • P
      KVM: PPC: Book3S HV: Fix race in re-enabling XIVE escalation interrupts · 959c5d51
      Paul Mackerras 提交于
      Escalation interrupts are interrupts sent to the host by the XIVE
      hardware when it has an interrupt to deliver to a guest VCPU but that
      VCPU is not running anywhere in the system.  Hence we disable the
      escalation interrupt for the VCPU being run when we enter the guest
      and re-enable it when the guest does an H_CEDE hypercall indicating
      it is idle.
      
      It is possible that an escalation interrupt gets generated just as we
      are entering the guest.  In that case the escalation interrupt may be
      using a queue entry in one of the interrupt queues, and that queue
      entry may not have been processed when the guest exits with an H_CEDE.
      The existing entry code detects this situation and does not clear the
      vcpu->arch.xive_esc_on flag as an indication that there is a pending
      queue entry (if the queue entry gets processed, xive_esc_irq() will
      clear the flag).  There is a comment in the code saying that if the
      flag is still set on H_CEDE, we have to abort the cede rather than
      re-enabling the escalation interrupt, lest we end up with two
      occurrences of the escalation interrupt in the interrupt queue.
      
      However, the exit code doesn't do that; it aborts the cede in the sense
      that vcpu->arch.ceded gets cleared, but it still enables the escalation
      interrupt by setting the source's PQ bits to 00.  Instead we need to
      set the PQ bits to 10, indicating that an interrupt has been triggered.
      We also need to avoid setting vcpu->arch.xive_esc_on in this case
      (i.e. vcpu->arch.xive_esc_on seen to be set on H_CEDE) because
      xive_esc_irq() will run at some point and clear it, and if we race with
      that we may end up with an incorrect result (i.e. xive_esc_on set when
      the escalation interrupt has just been handled).
      
      It is extremely unlikely that having two queue entries would cause
      observable problems; theoretically it could cause queue overflow, but
      the CPU would have to have thousands of interrupts targetted to it for
      that to be possible.  However, this fix will also make it possible to
      determine accurately whether there is an unhandled escalation
      interrupt in the queue, which will be needed by the following patch.
      
      Fixes: 9b9b13a6 ("KVM: PPC: Book3S HV: Keep XIVE escalation interrupt masked unless ceded")
      Cc: stable@vger.kernel.org # v4.16+
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20190813100349.GD9567@blackberry
      959c5d51
    • C
      KVM: PPC: Book3S HV: XIVE: Free escalation interrupts before disabling the VP · 237aed48
      Cédric Le Goater 提交于
      When a vCPU is brought done, the XIVE VP (Virtual Processor) is first
      disabled and then the event notification queues are freed. When freeing
      the queues, we check for possible escalation interrupts and free them
      also.
      
      But when a XIVE VP is disabled, the underlying XIVE ENDs also are
      disabled in OPAL. When an END (Event Notification Descriptor) is
      disabled, its ESB pages (ESn and ESe) are disabled and loads return all
      1s. Which means that any access on the ESB page of the escalation
      interrupt will return invalid values.
      
      When an interrupt is freed, the shutdown handler computes a 'saved_p'
      field from the value returned by a load in xive_do_source_set_mask().
      This value is incorrect for escalation interrupts for the reason
      described above.
      
      This has no impact on Linux/KVM today because we don't make use of it
      but we will introduce in future changes a xive_get_irqchip_state()
      handler. This handler will use the 'saved_p' field to return the state
      of an interrupt and 'saved_p' being incorrect, softlockup will occur.
      
      Fix the vCPU cleanup sequence by first freeing the escalation interrupts
      if any, then disable the XIVE VP and last free the queues.
      
      Fixes: 90c73795 ("KVM: PPC: Book3S HV: Add a new KVM device for the XIVE native exploitation mode")
      Fixes: 5af50993 ("KVM: PPC: Book3S HV: Native usage of the XIVE interrupt controller")
      Cc: stable@vger.kernel.org # v4.12+
      Signed-off-by: NCédric Le Goater <clg@kaod.org>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20190806172538.5087-1-clg@kaod.org
      237aed48
  3. 26 7月, 2019 3 次提交
  4. 25 7月, 2019 7 次提交
  5. 24 7月, 2019 7 次提交
  6. 23 7月, 2019 10 次提交
  7. 22 7月, 2019 7 次提交