1. 22 May, 2018 1 commit
  2. 18 May, 2018 13 commits
    • P
      KVM: PPC: Book3S PR: Enable use on POWER9 inside HPT-mode guests · ec531d02
      Committed by Paul Mackerras
      This relaxes the restriction on using PR KVM on POWER9.  The existing
      code does work inside a guest partition running in HPT mode, because
      hypercalls such as H_ENTER use the old HPTE format, not the new
      format used by POWER9, and so no change to PR KVM's HPT manipulation
      code is required.  PR KVM will still refuse to run if the kernel is
      using radix translation or if it is running bare-metal.
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      ec531d02
    • N
      KVM: PPC: Book3S HV: Send kvmppc_bad_interrupt NMIs to Linux handlers · 7c1bd80c
      Committed by Nicholas Piggin
      It's possible to take a SRESET or MCE in these paths due to a bug
      in the host code or an NMI IPI, etc. A recent bug attempting to load
      a virtual address from real mode gave the complete but cryptic error,
      abridged:
      
            Oops: Bad interrupt in KVM entry/exit code, sig: 6 [#1]
            LE SMP NR_CPUS=2048 NUMA PowerNV
            CPU: 53 PID: 6582 Comm: qemu-system-ppc Not tainted
            NIP:  c0000000000155ac LR: c0000000000c2430 CTR: c000000000015580
            REGS: c000000fff76dd80 TRAP: 0200   Not tainted
            MSR:  9000000000201003 <SF,HV,ME,RI,LE>  CR: 48082222  XER: 00000000
            CFAR: 0000000102900ef0 DAR: d00017fffd941a28 DSISR: 00000040 SOFTE: 3
            NIP [c0000000000155ac] perf_trace_tlbie+0x2c/0x1a0
            LR [c0000000000c2430] do_tlbies+0x230/0x2f0
      
      Sending the NMIs through the Linux handlers gives a nicer output:
      
            Severe Machine check interrupt [Not recovered]
              NIP [c0000000000155ac]: perf_trace_tlbie+0x2c/0x1a0
              Initiator: CPU
              Error type: Real address [Load (bad)]
                Effective address: d00017fffcc01a28
            opal: Machine check interrupt unrecoverable: MSR(RI=0)
            opal: Hardware platform error: Unrecoverable Machine Check exception
            CPU: 0 PID: 6700 Comm: qemu-system-ppc Tainted: G   M
            NIP:  c0000000000155ac LR: c0000000000c23c0 CTR: c000000000015580
            REGS: c000000fff9e9d80 TRAP: 0200   Tainted: G   M
            MSR:  9000000000201001 <SF,HV,ME,LE>  CR: 48082222  XER: 00000000
            CFAR: 000000010cbc1a30 DAR: d00017fffcc01a28 DSISR: 00000040 SOFTE: 3
            NIP [c0000000000155ac] perf_trace_tlbie+0x2c/0x1a0
            LR [c0000000000c23c0] do_tlbies+0x1c0/0x280
      Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      7c1bd80c
    • N
      KVM: PPC: Book3S HV: Fix kvmppc_bad_host_intr for real mode interrupts · eadce3b4
      Committed by Nicholas Piggin
      When CONFIG_RELOCATABLE=n, the Linux real mode interrupt handlers call
      into KVM using a real address. This needs to be translated to the kernel
      linear effective address before the MMU is switched on.
      
      kvmppc_bad_host_intr misses adding these bits, so when it is used to
      handle a system reset interrupt (that always gets delivered in real
      mode), it results in an instruction access fault immediately after
      the MMU is turned on.
      
      Fix this by ensuring the top 2 address bits are set when the MMU is
      turned on.
      Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      eadce3b4
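The address fix in the commit above can be illustrated with a minimal sketch. It assumes the 64-bit Book3S layout in which the kernel linear mapping starts at 0xc000000000000000; the helper name `real_to_linear` is hypothetical and is not a kernel function.

```c
#include <stdint.h>

/* Assumption: on 64-bit Book3S the kernel linear mapping starts at
 * 0xc000000000000000, i.e. the top two address bits are set. */
#define KERNEL_LINEAR_BASE UINT64_C(0xc000000000000000)

/* Hypothetical helper (not a kernel function): turn a real address
 * into the matching kernel linear effective address by setting the
 * top two bits. Applying it twice gives the same result. */
static inline uint64_t real_to_linear(uint64_t real_addr)
{
    return real_addr | KERNEL_LINEAR_BASE;
}
```

With this mapping, a handler reached at a real address keeps executing the same code once the MMU is on, because the effective address differs only in the top bits.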
    • N
      KVM: PPC: Book3S HV: radix: Do not clear partition PTE when RC or write bits do not match · 878cf2bb
      Committed by Nicholas Piggin
      Adding the write bit and RC bits to pte permissions does not require a
      pte clear and flush. There should not be other bits changed here,
      because restricting access or changing the PFN must have already
      invalidated any existing ptes (otherwise the race is already lost).
      Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      878cf2bb
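The rule described in the commit above (adding write or RC bits needs no clear and flush, anything else does) can be sketched as a simple predicate. The bit values and the function name `pte_update_needs_flush` are assumptions chosen for illustration; this is not the kernel's code.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative bit values; treat them as assumptions for this sketch. */
#define PTE_WRITE    UINT64_C(0x002)
#define PTE_DIRTY    UINT64_C(0x080)   /* C (changed) bit */
#define PTE_ACCESSED UINT64_C(0x100)   /* R (referenced) bit */

/* Hypothetical predicate: returns false only when the update merely
 * adds write and/or RC bits, which relaxes permissions and therefore
 * does not require clearing and flushing the existing pte. */
static bool pte_update_needs_flush(uint64_t old_pte, uint64_t new_pte)
{
    uint64_t relax_only = PTE_WRITE | PTE_DIRTY | PTE_ACCESSED;
    uint64_t changed = old_pte ^ new_pte;   /* bits that differ */
    uint64_t removed = old_pte & ~new_pte;  /* bits being taken away */

    return (changed & ~relax_only) != 0 || removed != 0;
}
```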
    • N
      KVM: PPC: Book3S HV: radix: Refine IO region partition scope attributes · bc64dd0e
      Committed by Nicholas Piggin
      When the radix fault handler has no page from the process address
      space (e.g., for IO memory), it looks up the process pte and sets the
      partition table pte using that to get attributes like CI and guarded.
      If the process table entry is to be writable, set _PAGE_DIRTY as well
      to avoid an RC update. If not, then ensure _PAGE_DIRTY does not come
      across. Set _PAGE_ACCESSED as well to avoid RC update.
      Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      bc64dd0e
    • N
      KVM: PPC: Book3S HV: Make radix handle process scoped LPID flush in C, with relocation on · 9a4506e1
      Committed by Nicholas Piggin
      The radix guest code has fewer restrictions about what context it
      can run in, so move this flushing out of assembly and have it use the
      Linux TLB flush implementations introduced previously.
      
      This allows powerpc:tlbie trace events to be used.
      
      This changes the tlbiel sequence to only execute RIC=2 flush once on
      the first set flushed, then RIC=0 for the rest of the sets. The end
      result of the flush should be unchanged. This matches the local PID
      flush pattern that was introduced in a5998fcb ("powerpc/mm/radix:
      Optimise tlbiel flush all case").
      Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      9a4506e1
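The flush pattern described in the commit above (RIC=2 on the first set only, RIC=0 on the remaining sets) can be sketched as follows. The function `flush_all_sets` is a hypothetical stand-in that records which RIC value would be used per set instead of issuing tlbiel, and the constant names are assumptions for this example.

```c
#include <stddef.h>

/* Assumed encodings for the RIC field in this sketch. */
#define RIC_FLUSH_TLB 0   /* invalidate TLB entries only */
#define RIC_FLUSH_ALL 2   /* invalidate TLB and cached translations */

/* Hypothetical stand-in for the per-set tlbiel loop: instead of
 * executing the instruction, record the RIC value used for each set. */
static void flush_all_sets(int *ric_per_set, size_t num_sets)
{
    for (size_t set = 0; set < num_sets; set++)
        ric_per_set[set] = (set == 0) ? RIC_FLUSH_ALL : RIC_FLUSH_TLB;
}
```

Only the first iteration pays for the heavier flush; the per-set TLB invalidations that follow use the cheaper form.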
    • N
      KVM: PPC: Book3S HV: Make radix use the Linux translation flush functions for partition scope · d91cb39f
      Committed by Nicholas Piggin
      This has the advantage of consolidating TLB flush code in fewer
      places, and it also implements powerpc:tlbie trace events.
      Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      d91cb39f
    • N
      KVM: PPC: Book3S HV: Recursively unmap all page table entries when unmapping · a5704e83
      Committed by Nicholas Piggin
      When partition scope mappings are unmapped with kvm_unmap_radix, the
      pte is cleared, but the page table structure is left in place. If the
      next page fault requests a different page table geometry (e.g., due to
      THP promotion or split), kvmppc_create_pte is responsible for changing
      the page tables.
      
      When a page table entry is to be converted to a large pte, the page
      table entry is cleared, the PWC flushed, then the page table it points
      to freed. This will cause pte page tables to leak when a 1GB page is
      to replace a pud entry that points to a pmd table with pte tables under
      it: the pmd table will be freed, but its pte tables will be missed.
      
      Fix this by replacing the simple clear and free code with one that
      walks down the page tables and frees children. Care must be taken to
      clear the root entry being unmapped and then flush the PWC before
      freeing any page tables, as explained in comments.
      
      This requires PWC flush to logically become a flush-all-PWC (which it
      already is in hardware, but the KVM API needs to be changed to avoid
      confusion).
      
      This code also checks that no unexpected pte entries exist in any page
      table being freed, and unmaps those and emits a WARN. This is an
      expensive operation for the pte page level, but partition scope
      changes are rare, so it's unconditional for now to iron out bugs. It
      can be put under a CONFIG option or removed after some time.
      Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      a5704e83
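The ordering described in the commit above (clear the root entry, flush the PWC, then free the tables beneath it, children included) can be sketched with a toy page-table tree. Everything here is hypothetical: the `struct table` layout, the fan-out of 4, and the counters are illustration only and do not mirror the kernel's data structures.

```c
#include <stdlib.h>

#define ENTRIES 4                     /* toy fan-out, not the real geometry */

struct table {
    struct table *child[ENTRIES];     /* NULL means "no lower table" */
};

static int tables_freed;              /* bookkeeping for the example */
static int pwc_flushes;

static struct table *table_alloc(void)
{
    struct table *t = calloc(1, sizeof(*t));
    if (!t)
        abort();
    return t;
}

/* Free a table and everything below it, children first, so that
 * lower-level tables are not leaked. */
static void free_subtree(struct table *t)
{
    if (!t)
        return;
    for (int i = 0; i < ENTRIES; i++)
        free_subtree(t->child[i]);
    free(t);
    tables_freed++;
}

/* Unmap an entry that points at a lower-level table. */
static void unmap_entry(struct table **entry)
{
    struct table *old = *entry;

    *entry = NULL;        /* 1. clear the root entry being unmapped */
    pwc_flushes++;        /* 2. stand-in for the PWC flush */
    free_subtree(old);    /* 3. only now free the tables underneath */
}
```

A plain "clear and free" of the pmd-level table alone would correspond to calling `free(old)` in step 3, which is the leak the commit message describes.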
    • N
    • N
      KVM: PPC: Book3S HV: Lockless tlbie for HPT hcalls · b7557451
      Committed by Nicholas Piggin
      tlbies to an LPAR do not have to be serialised since POWER4/PPC970,
      after which the MMU_FTR_LOCKLESS_TLBIE feature was introduced to
      avoid tlbie locking.
      
      Since commit c17b98cf ("KVM: PPC: Book3S HV: Remove code for
      PPC970 processors"), KVM no longer supports processors that do not
      have this feature, so the tlbie locking can be removed completely.
      A sanity check for the feature is put in kvmppc_mmu_hv_init.
      
      Testing was done on a POWER9 system in HPT mode, with a -smp 32 guest
      in HPT mode. 32 instances of the powerpc fork benchmark from selftests
      were run with --fork, and the results measured.
      
      Without this patch, total throughput was about 13.5K/sec, and this is
      the top of the host profile:
      
         74.52%  [k] do_tlbies
          2.95%  [k] kvmppc_book3s_hv_page_fault
          1.80%  [k] calc_checksum
          1.80%  [k] kvmppc_vcpu_run_hv
          1.49%  [k] kvmppc_run_core
      
      After this patch, throughput was about 51K/sec, with this profile:
      
         21.28%  [k] do_tlbies
          5.26%  [k] kvmppc_run_core
          4.88%  [k] kvmppc_book3s_hv_page_fault
          3.30%  [k] _raw_spin_lock_irqsave
          3.25%  [k] gup_pgd_range
      Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      b7557451
    • S
      KVM: PPC: Fix a mmio_host_swabbed uninitialized usage issue · f19d1f36
      Committed by Simon Guo
      When KVM emulates a VMX store, it invokes kvmppc_get_vmx_data() to
      retrieve the VMX register value. kvmppc_get_vmx_data() checks
      mmio_host_swabbed to decide which doubleword of vr[] to use. But
      mmio_host_swabbed can be uninitialized during the VMX store procedure:
      
      kvmppc_emulate_loadstore
      	\- kvmppc_handle_store128_by2x64
      		\- kvmppc_get_vmx_data
      
      So vcpu->arch.mmio_host_swabbed is not meant to be used at all for
      emulation of store instructions, and this patch makes that true for
      VMX stores. This patch also initializes mmio_host_swabbed to avoid
      possible future problems.
      Signed-off-by: Simon Guo <wei.guo.simon@gmail.com>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      f19d1f36
    • S
      KVM: PPC: Move nip/ctr/lr/xer registers to pt_regs in kvm_vcpu_arch · 173c520a
      Committed by Simon Guo
      This patch moves nip/ctr/lr/xer registers from scattered places in
      kvm_vcpu_arch to pt_regs structure.
      
      The cr register is "unsigned long" in pt_regs and u32 in vcpu->arch.
      It will need more consideration and may move in later patches.
      Signed-off-by: Simon Guo <wei.guo.simon@gmail.com>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      173c520a
    • S
      KVM: PPC: Add pt_regs into kvm_vcpu_arch and move vcpu->arch.gpr[] into it · 1143a706
      Committed by Simon Guo
      The current regs are scattered across the kvm_vcpu_arch structure; it
      is neater to organize them into a pt_regs structure.
      
      Also it will enable reimplementation of MMIO emulation code with
      analyse_instr() later.
      Signed-off-by: Simon Guo <wei.guo.simon@gmail.com>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      1143a706
  3. 17 May, 2018 14 commits
  4. 15 May, 2018 1 commit
  5. 27 April, 2018 2 commits
  6. 25 April, 2018 2 commits
    • N
      powerpc: Fix smp_send_stop NMI IPI handling · ac61c115
      Committed by Nicholas Piggin
      The NMI IPI handler for a receiving CPU increments nmi_ipi_busy_count
      over the handler function call, which causes later smp_send_nmi_ipi()
      callers to spin until the call is finished.
      
      The stop_this_cpu() function never returns, so the busy count is never
      decremented, which can cause the system to hang in some cases. For
      example panic() will call smp_send_stop() early on which calls
      stop_this_cpu() on other CPUs, then later in the reboot path,
      pnv_restart() will call smp_send_stop() again, which hangs.
      
      Fix this by adding a special case to the stop_this_cpu() handler to
      decrement the busy count, because it will never return.
      
      Now that the NMI/non-NMI versions of stop_this_cpu() are different,
      split them out into separate functions rather than doing #ifdef tricks
      to share the body between the two functions.
      
      Fixes: 6bed3237 ("powerpc: use NMI IPI for smp_send_stop")
      Reported-by: Abdul Haleem <abdhalee@linux.vnet.ibm.com>
      Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
      [mpe: Split out the functions, tweak change log a bit]
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      ac61c115
    • N
      rtc: opal: Fix OPAL RTC driver OPAL_BUSY loops · 682e6b4d
      Committed by Nicholas Piggin
      The OPAL RTC driver does not sleep when it gets OPAL_BUSY or
      OPAL_BUSY_EVENT from firmware, which causes large scheduling
      latencies. Up to 50 seconds have been observed here when the RTC
      stops responding (a BMC reboot can do it).
      
      Fix this by converting it to the standard form OPAL_BUSY loop that
      sleeps.
      
      Fixes: 628daa8d ("powerpc/powernv: Add RTC and NVRAM support plus RTAS fallbacks")
      Cc: stable@vger.kernel.org # v3.2+
      Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
      Acked-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      682e6b4d
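One plausible shape of a busy loop that sleeps, as the commit above describes, is sketched below as a user-space simulation. The `fake_*` functions, the counters, the return-code values and the 10 ms delay are all assumptions made for this example; they stand in for the firmware call, the kernel sleep and the event poll and are not the driver's actual code.

```c
/* Assumed return codes for this sketch. */
enum { OPAL_SUCCESS = 0, OPAL_BUSY = -2, OPAL_BUSY_EVENT = -12 };

#define BUSY_DELAY_MS 10          /* assumed delay between retries */

static int busy_replies_left;     /* how often the fake firmware says "busy" */
static int sleeps;                /* number of simulated sleeps */
static int polls;                 /* number of simulated event polls */

/* Hypothetical firmware call: alternates between the two busy codes
 * until busy_replies_left reaches zero, then succeeds. */
static int fake_opal_rtc_read(void)
{
    if (busy_replies_left > 0) {
        busy_replies_left--;
        return (busy_replies_left % 2) ? OPAL_BUSY_EVENT : OPAL_BUSY;
    }
    return OPAL_SUCCESS;
}

static void fake_msleep(int ms) { (void)ms; sleeps++; }
static void fake_poll_events(void) { polls++; }

/* Retry until the call stops reporting busy, sleeping between
 * attempts so the CPU can schedule other work. */
static int rtc_read_retry(void)
{
    int rc = OPAL_BUSY;

    while (rc == OPAL_BUSY || rc == OPAL_BUSY_EVENT) {
        rc = fake_opal_rtc_read();
        if (rc == OPAL_BUSY_EVENT) {
            fake_msleep(BUSY_DELAY_MS);
            fake_poll_events();   /* let pending firmware events progress */
        } else if (rc == OPAL_BUSY) {
            fake_msleep(BUSY_DELAY_MS);
        }
    }
    return rc;
}
```

The point of the sketch is the sleep on every busy reply: a loop that retries without sleeping would spin for as long as the firmware stays busy.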
  7. 24 April, 2018 6 commits
    • M
      powerpc/mce: Fix a bug where mce loops on memory UE. · 75ecfb49
      Committed by Mahesh Salgaonkar
      The current code extracts the physical address for UE errors and then
      hooks it up into the memory failure infrastructure. On successful
      extraction of the physical address it wrongly sets "handled = 1", which
      means this UE error has been recovered. Since the MCE handler gets a
      return value of handled = 1, it assumes that the error has been
      recovered and goes back to the same NIP. This causes the MCE interrupt
      again and again in a loop, leading to a hard lockup.
      
      Also, initialize phys_addr to ULONG_MAX so that we don't end up
      queuing an undesired page to hwpoison.
      
      Without this patch we see:
        Severe Machine check interrupt [Recovered]
          NIP: [000000001002588c] PID: 7109 Comm: find
          Initiator: CPU
          Error type: UE [Load/Store]
            Effective address: 00007fffd2755940
            Physical address:  000020181a080000
        ...
        Severe Machine check interrupt [Recovered]
          NIP: [000000001002588c] PID: 7109 Comm: find
          Initiator: CPU
          Error type: UE [Load/Store]
            Effective address: 00007fffd2755940
            Physical address:  000020181a080000
        Severe Machine check interrupt [Recovered]
          NIP: [000000001002588c] PID: 7109 Comm: find
          Initiator: CPU
          Error type: UE [Load/Store]
            Effective address: 00007fffd2755940
            Physical address:  000020181a080000
        Memory failure: 0x20181a08: recovery action for dirty LRU page: Recovered
        Memory failure: 0x20181a08: already hardware poisoned
        Memory failure: 0x20181a08: already hardware poisoned
        Memory failure: 0x20181a08: already hardware poisoned
        Memory failure: 0x20181a08: already hardware poisoned
        Memory failure: 0x20181a08: already hardware poisoned
        Memory failure: 0x20181a08: already hardware poisoned
        ...
        Watchdog CPU:38 Hard LOCKUP
      
      After this patch we see:
      
        Severe Machine check interrupt [Not recovered]
          NIP: [00007fffaae585f4] PID: 7168 Comm: find
          Initiator: CPU
          Error type: UE [Load/Store]
            Effective address: 00007fffaafe28ac
            Physical address:  00002017c0bd0000
        find[7168]: unhandled signal 7 at 00007fffaae585f4 nip 00007fffaae585f4 lr 00007fffaae585e0 code 4
        Memory failure: 0x2017c0bd: recovery action for dirty LRU page: Recovered
      
      Fixes: 01eaac2b ("powerpc/mce: Hookup ierror (instruction) UE errors")
      Fixes: ba41e1e1 ("powerpc/mce: Hookup derror (load/store) UE errors")
      Cc: stable@vger.kernel.org # v4.15+
      Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
      Signed-off-by: Balbir Singh <bsingharora@gmail.com>
      Reviewed-by: Balbir Singh <bsingharora@gmail.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      75ecfb49
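The two behaviours described in the commit above (a UE is never reported as recovered just because its physical address was extracted, and phys_addr starts out as ULONG_MAX) can be modelled in a few lines. The struct and function names are hypothetical and only illustrate the logic; they are not the kernel's MCE code.

```c
#include <limits.h>
#include <stdbool.h>

/* Hypothetical result of handling one UE error. */
struct ue_result {
    int handled;              /* 1 = recovered, 0 = not recovered */
    unsigned long phys_addr;  /* ULONG_MAX = no address was extracted */
};

/* Model of the fixed behaviour: extracting the address records it for
 * the memory-failure path but does not claim the error was recovered. */
static struct ue_result handle_ue(bool addr_found, unsigned long addr)
{
    struct ue_result r = { .handled = 0, .phys_addr = ULONG_MAX };

    if (addr_found)
        r.phys_addr = addr;
    return r;
}

/* Only queue a page for hwpoison when a real address is available. */
static bool should_queue_for_hwpoison(const struct ue_result *r)
{
    return r->phys_addr != ULONG_MAX;
}
```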
    • A
      powerpc/powernv/npu: Do a PID GPU TLB flush when invalidating a large address range · d0cf9b56
      Committed by Alistair Popple
      The NPU has a limited number of address translation shootdown (ATSD)
      registers and the GPU has limited bandwidth to process ATSDs. This can
      result in contention of ATSD registers leading to soft lockups on some
      threads, particularly when invalidating a large address range in
      pnv_npu2_mn_invalidate_range().
      
      At some threshold it becomes more efficient to flush the entire GPU
      TLB for the given MM context (PID) than individually flushing each
      address in the range. This patch will result in ranges greater than
      2MB being converted from 32+ ATSDs into a single ATSD which will flush
      the TLB for the given PID on each GPU.
      
      Fixes: 1ab66d1f ("powerpc/powernv: Introduce address translation services for Nvlink2")
      Cc: stable@vger.kernel.org # v4.12+
      Signed-off-by: Alistair Popple <alistair@popple.id.au>
      Acked-by: Balbir Singh <bsingharora@gmail.com>
      Tested-by: Balbir Singh <bsingharora@gmail.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      d0cf9b56
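The threshold decision described in the commit above can be sketched as a small planning function. The 2MB threshold comes from the commit text; the 64KB page size (which makes 2MB equal to 32 pages), the assumption of one ATSD per page, and all names are hypothetical choices for this example.

```c
#include <stdint.h>

#define PAGE_BYTES      (UINT64_C(64) * 1024)         /* assumed 64KB pages */
#define RANGE_THRESHOLD (UINT64_C(2) * 1024 * 1024)   /* 2MB, per the commit text */

enum flush_kind { FLUSH_BY_ADDRESS, FLUSH_BY_PID };

struct flush_plan {
    enum flush_kind kind;
    uint64_t atsds_per_gpu;   /* ATSDs this plan would issue on each GPU */
};

/* Hypothetical planner: small ranges are flushed page by page, large
 * ranges collapse into a single flush of the whole PID.
 * Simplification: 'start' is assumed to be page aligned. */
static struct flush_plan plan_invalidate(uint64_t start, uint64_t end)
{
    uint64_t len = end - start;
    struct flush_plan plan;

    if (len > RANGE_THRESHOLD) {
        plan.kind = FLUSH_BY_PID;
        plan.atsds_per_gpu = 1;
    } else {
        plan.kind = FLUSH_BY_ADDRESS;
        plan.atsds_per_gpu = (len + PAGE_BYTES - 1) / PAGE_BYTES;
    }
    return plan;
}
```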
    • A
      powerpc/powernv/npu: Prevent overwriting of pnv_npu2_init_context() callback parameters · a1409ada
      Committed by Alistair Popple
      There is a single npu context per set of callback parameters. Callers
      should be prevented from overwriting existing callback values so
      instead return an error if different parameters are passed.
      
      Fixes: 1ab66d1f ("powerpc/powernv: Introduce address translation services for Nvlink2")
      Cc: stable@vger.kernel.org # v4.12+
      Signed-off-by: Alistair Popple <alistair@popple.id.au>
      Reviewed-by: Mark Hairgrove <mhairgrove@nvidia.com>
      Tested-by: Mark Hairgrove <mhairgrove@nvidia.com>
      Reviewed-by: Balbir Singh <bsingharora@gmail.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      a1409ada
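The check described in the commit above (reuse an existing context only when the same callback parameters are passed, otherwise return an error) can be sketched as follows. The struct layout, the reference count, the function name `context_init` and the choice of -EINVAL are assumptions for illustration, not the actual pnv_npu2_init_context() implementation.

```c
#include <errno.h>
#include <stddef.h>

typedef void (*release_cb_t)(void *priv);

/* Hypothetical context holding one set of callback parameters. */
struct npu_context {
    release_cb_t release_cb;
    void *priv;
    int refs;                 /* 0 means the context is not yet in use */
};

/* Returns 0 on success, or -EINVAL if the context already exists with
 * different callback parameters (which must not be overwritten). */
static int context_init(struct npu_context *ctx, release_cb_t cb, void *priv)
{
    if (ctx->refs > 0) {
        if (ctx->release_cb != cb || ctx->priv != priv)
            return -EINVAL;
        ctx->refs++;
        return 0;
    }

    ctx->release_cb = cb;
    ctx->priv = priv;
    ctx->refs = 1;
    return 0;
}
```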
    • A
      powerpc/powernv/npu: Add lock to prevent race in concurrent context init/destroy · 28a5933e
      Committed by Alistair Popple
      The pnv_npu2_init_context() and pnv_npu2_destroy_context() functions
      are used to allocate/free contexts to allow address translation and
      shootdown by the NPU on a particular GPU. Context initialisation is
      implicitly safe as it is protected by the requirement that mmap_sem be
      held in write mode. However, pnv_npu2_destroy_context() does not require
      mmap_sem to be held and it is not safe to call with a concurrent
      initialisation for a different GPU.
      
      It was assumed the driver would ensure destruction was not called
      concurrently with initialisation. However, the driver may be simplified
      by allowing concurrent initialisation and destruction for different
      GPUs. As npu context creation/destruction is not a performance
      critical path and the critical section is not large, a single spinlock
      is used for simplicity.
      
      Fixes: 1ab66d1f ("powerpc/powernv: Introduce address translation services for Nvlink2")
      Cc: stable@vger.kernel.org # v4.12+
      Signed-off-by: Alistair Popple <alistair@popple.id.au>
      Reviewed-by: Mark Hairgrove <mhairgrove@nvidia.com>
      Tested-by: Mark Hairgrove <mhairgrove@nvidia.com>
      Reviewed-by: Balbir Singh <bsingharora@gmail.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      28a5933e
    • B
      powerpc/powernv/memtrace: Let the arch hotunplug code flush cache · 7fd6641d
      Committed by Balbir Singh
      Don't do this via custom code. Now that we have support in the
      arch hotplug/hotunplug code, rely on those routines to do the right
      thing.
      
      The existing flush doesn't work because it uses ppc64_caches.l1d.size
      instead of ppc64_caches.l1d.line_size.
      
      Fixes: 9d5171a8 ("powerpc/powernv: Enable removal of memory for in memory tracing")
      Signed-off-by: Balbir Singh <bsingharora@gmail.com>
      Reviewed-by: Rashmica Gupta <rashmica.g@gmail.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      7fd6641d
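Why stepping by the cache size instead of the line size breaks the flush, as the commit above notes, can be seen by counting loop iterations. This is a sketch with hypothetical helper names; the 128-byte line and 32KB cache used in the usage note are example values, not figures from the commit.

```c
#include <assert.h>
#include <stddef.h>

/* Count how many flush operations a loop over the region issues when
 * it advances by 'stride' bytes each time. Each iteration stands in
 * for flushing the single cache line at offset 'off'. */
static size_t flush_ops(size_t region_bytes, size_t stride)
{
    size_t ops = 0;

    assert(stride > 0);       /* a zero stride would never terminate */
    for (size_t off = 0; off < region_bytes; off += stride)
        ops++;
    return ops;
}

/* Number of cache lines that actually cover the region. */
static size_t lines_in_region(size_t region_bytes, size_t line_size)
{
    assert(line_size > 0);
    return (region_bytes + line_size - 1) / line_size;
}
```

For a 1MB region with an assumed 128-byte line there are 8192 lines to flush, and a stride of 128 visits all of them. A stride equal to an assumed 32KB cache size issues only 32 flushes, so almost every line is skipped.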
    • B
      powerpc/mm: Flush cache on memory hot(un)plug · fb5924fd
      Committed by Balbir Singh
      This patch adds support for flushing potentially dirty cache lines
      when memory is hot-plugged/hot-un-plugged. The support is currently
      limited to 64 bit systems.
      
      The bug was exposed when mappings for a device were actually
      hot-unplugged and plugged back in later. A similar issue was observed
      during the development of memtrace, but memtrace does its own
      flushing of the region via a custom routine.
      
      These patches do a flush on both hotplug and unplug to clear any stale
      data in the cache w.r.t. mappings. There is a small race window where a
      clean cache line may be created again just prior to tearing down the
      mapping.
      
      The patches were tested by disabling the flush routines in memtrace
      and doing I/O on the trace file. The system immediately
      checkstops (quite reliably if, prior to the hot-unplug of the memtrace
      region, we memset the regions we are about to hot unplug). After these
      patches no custom flushing is needed in the memtrace code.
      
      Fixes: 9d5171a8 ("powerpc/powernv: Enable removal of memory for in memory tracing")
      Cc: stable@vger.kernel.org # v4.14+
      Signed-off-by: Balbir Singh <bsingharora@gmail.com>
      Acked-by: Reza Arbab <arbab@linux.ibm.com>
      Reviewed-by: Rashmica Gupta <rashmica.g@gmail.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      fb5924fd
  8. 21 April, 2018 1 commit