  1. 22 May, 2018 1 commit
  2. 18 May, 2018 3 commits
    • KVM: PPC: Book3S HV: Lockless tlbie for HPT hcalls · b7557451
      Committed by Nicholas Piggin
      tlbies to an LPAR do not have to be serialised since POWER4/PPC970,
      after which the MMU_FTR_LOCKLESS_TLBIE feature was introduced to
      avoid tlbie locking.
      
      Since commit c17b98cf ("KVM: PPC: Book3S HV: Remove code for
      PPC970 processors"), KVM no longer supports processors that do not
      have this feature, so the tlbie locking can be removed completely.
      A sanity check for the feature is put in kvmppc_mmu_hv_init.
      
      Testing was done on a POWER9 system in HPT mode, with a -smp 32 guest
      in HPT mode. 32 instances of the powerpc fork benchmark from selftests
      were run with --fork, and the results measured.
      
      Without this patch, total throughput was about 13.5K/sec, and this is
      the top of the host profile:
      
         74.52%  [k] do_tlbies
          2.95%  [k] kvmppc_book3s_hv_page_fault
          1.80%  [k] calc_checksum
          1.80%  [k] kvmppc_vcpu_run_hv
          1.49%  [k] kvmppc_run_core
      
      After this patch, throughput was about 51K/sec, with this profile:
      
         21.28%  [k] do_tlbies
          5.26%  [k] kvmppc_run_core
          4.88%  [k] kvmppc_book3s_hv_page_fault
          3.30%  [k] _raw_spin_lock_irqsave
          3.25%  [k] gup_pgd_range
      Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
    • KVM: PPC: Move nip/ctr/lr/xer registers to pt_regs in kvm_vcpu_arch · 173c520a
      Committed by Simon Guo
      This patch moves nip/ctr/lr/xer registers from scattered places in
      kvm_vcpu_arch to pt_regs structure.
      
      cr register is "unsigned long" in pt_regs and u32 in vcpu->arch.
      It will need more consideration and may move in later patches.
      Signed-off-by: Simon Guo <wei.guo.simon@gmail.com>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
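      The layout described by this patch and the one before it can be
      pictured with a small userspace sketch. The `sketch_` names and the
      reduced field set are assumptions made for illustration; the real
      pt_regs and kvm_vcpu_arch structures carry many more fields.

      ```c
      #include <assert.h>
      #include <stdint.h>

      /* Simplified stand-in for the powerpc pt_regs (hypothetical subset). */
      struct sketch_pt_regs {
          unsigned long gpr[32];
          unsigned long nip;
          unsigned long ctr;
          unsigned long link;   /* lr */
          unsigned long xer;
          unsigned long ccr;    /* cr as pt_regs sees it: unsigned long */
      };

      /* Simplified stand-in for kvm_vcpu_arch after the two patches. */
      struct sketch_vcpu_arch {
          struct sketch_pt_regs regs; /* gpr/nip/ctr/lr/xer live here now */
          uint32_t cr;                /* still separate: u32, not unsigned long */
      };

      /* Accessors go through the embedded pt_regs instead of loose fields. */
      static unsigned long sketch_get_pc(const struct sketch_vcpu_arch *arch)
      {
          return arch->regs.nip;
      }

      static void sketch_set_gpr(struct sketch_vcpu_arch *arch, int num,
                                 unsigned long val)
      {
          assert(num >= 0 && num < 32);
          arch->regs.gpr[num] = val;
      }
      ```

      The width mismatch on cr (4 bytes here versus unsigned long in
      pt_regs) is what the commit message flags as needing further thought.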
    • KVM: PPC: Add pt_regs into kvm_vcpu_arch and move vcpu->arch.gpr[] into it · 1143a706
      Committed by Simon Guo
      The current registers are scattered around the kvm_vcpu_arch structure;
      it is neater to organize them into a pt_regs structure.
      
      Also it will enable reimplementation of MMIO emulation code with
      analyse_instr() later.
      Signed-off-by: Simon Guo <wei.guo.simon@gmail.com>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
  3. 17 May, 2018 4 commits
    • KVM: PPC: Book3S HV: Set RWMR on POWER8 so PURR/SPURR count correctly · 7aa15842
      Committed by Paul Mackerras
      Although Linux doesn't use PURR and SPURR ((Scaled) Processor
      Utilization of Resources Register), other OSes depend on them.
      On POWER8 they count at a rate depending on whether the VCPU is
      idle or running, the activity of the VCPU, and the value in the
      RWMR (Region-Weighting Mode Register).  Hardware expects the
      hypervisor to update the RWMR when a core is dispatched to reflect
      the number of online VCPUs in the vcore.
      
      This adds code to maintain a count in the vcore struct indicating
      how many VCPUs are online.  In kvmppc_run_core we use that count
      to set the RWMR register on POWER8.  If the core is split because
      of a static or dynamic micro-threading mode, we use the value for
      8 threads.  The RWMR value is not relevant when the host is
      executing because Linux does not use the PURR or SPURR register,
      so we don't bother saving and restoring the host value.
      
      For the sake of old userspace which does not set the KVM_REG_PPC_ONLINE
      register, we set online to 1 if it was 0 at the time of a KVM_RUN
      ioctl.
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
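      The bookkeeping described above can be modelled roughly as follows.
      This is a userspace sketch, not the kernel code: every `sketch_`
      name is hypothetical, the actual RWMR values are omitted, and the
      fallback to 8 threads for an out-of-range count is an assumption.

      ```c
      #include <assert.h>

      #define SKETCH_MAX_SMT_THREADS 8

      struct sketch_vcore { int online_count; };
      struct sketch_vcpu { int online; struct sketch_vcore *vcore; };

      /* Model of a write to the 'online' register: keep the vcore count in step. */
      static void sketch_set_online(struct sketch_vcpu *vcpu, int val)
      {
          assert(vcpu && vcpu->vcore);
          val = !!val;
          if (val && !vcpu->online)
              vcpu->vcore->online_count++;
          else if (!val && vcpu->online)
              vcpu->vcore->online_count--;
          vcpu->online = val;
      }

      /* Old userspace never sets the register, so a VCPU that runs is online. */
      static void sketch_on_kvm_run(struct sketch_vcpu *vcpu)
      {
          if (!vcpu->online)
              sketch_set_online(vcpu, 1);
      }

      /* Thread count used to choose the RWMR value: 8 whenever the core is split. */
      static int sketch_rwmr_threads(const struct sketch_vcore *vc, int core_is_split)
      {
          int n = vc->online_count;

          /* Assumption: an out-of-range count is treated like a full core. */
          if (core_is_split || n < 1 || n > SKETCH_MAX_SMT_THREADS)
              n = SKETCH_MAX_SMT_THREADS;
          return n;
      }
      ```

      Setting the register to the value it already holds leaves the count
      unchanged, so repeated writes from userspace cannot skew it.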
    • KVM: PPC: Book3S HV: Add 'online' register to ONE_REG interface · a1f15826
      Committed by Paul Mackerras
      This adds a new KVM_REG_PPC_ONLINE register which userspace can set
      to 0 or 1 via the GET/SET_ONE_REG interface to indicate whether it
      considers the VCPU to be offline (0), that is, not currently running,
      or online (1).  This will be used in a later patch to configure the
      register which controls PURR and SPURR accumulation.
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
    • KVM: PPC: Book3S HV: Snapshot timebase offset on guest entry · 57b8daa7
      Committed by Paul Mackerras
      Currently, the HV KVM guest entry/exit code adds the timebase offset
      from the vcore struct to the timebase on guest entry, and subtracts
      it on guest exit.  Which is fine, except that it is possible for
      userspace to change the offset using the SET_ONE_REG interface while
      the vcore is running, as there is only one timebase offset per vcore
      but potentially multiple VCPUs in the vcore.  If that were to happen,
      KVM would subtract a different offset on guest exit from that which
      it had added on guest entry, leading to the timebase being out of sync
      between cores in the host, which then leads to bad things happening
      such as hangs and spurious watchdog timeouts.
      
      To fix this, we add a new field 'tb_offset_applied' to the vcore struct
      which stores the offset that is currently applied to the timebase.
      This value is set from the vcore tb_offset field on guest entry, and
      is what is subtracted from the timebase on guest exit.  Since it is
      zero when the timebase offset is not applied, we can simplify the
      logic in kvmhv_start_timing and kvmhv_accumulate_time.
      
      In addition, we had secondary threads reading the timebase while
      running concurrently with code on the primary thread which would
      eventually add or subtract the timebase offset from the timebase.
      This occurred while saving or restoring the DEC register value on
      the secondary threads.  Although no specific incorrect behaviour has
      been observed, this is a race which should be fixed.  To fix it, we
      move the DEC saving code to just before we call kvmhv_commence_exit,
      and the DEC restoring code to after the point where we have waited
      for the primary thread to switch the MMU context and add the timebase
      offset.  That way we are sure that the timebase contains the guest
      timebase value in both cases.
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
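      The snapshot idea from the first part of this message can be shown
      in a few lines. This is a minimal userspace model with hypothetical
      `tb_` helper names; it ignores the passage of time and the DEC
      handling, and only shows why subtracting the applied value keeps
      the host timebase consistent.

      ```c
      #include <assert.h>
      #include <stdint.h>

      struct tb_vcore {
          uint64_t tb_offset;          /* userspace may change this at any time */
          uint64_t tb_offset_applied;  /* what is currently added to the timebase */
      };

      /* Guest entry: snapshot the offset, then apply the snapshot. */
      static uint64_t tb_guest_entry(struct tb_vcore *vc, uint64_t tb)
      {
          assert(vc->tb_offset_applied == 0);
          vc->tb_offset_applied = vc->tb_offset;
          return tb + vc->tb_offset_applied;
      }

      /* Guest exit: undo exactly what entry applied, then record that
       * no offset is applied any more. */
      static uint64_t tb_guest_exit(struct tb_vcore *vc, uint64_t tb)
      {
          tb -= vc->tb_offset_applied;
          vc->tb_offset_applied = 0;
          return tb;
      }
      ```

      If tb_offset is rewritten between entry and exit, the exit path
      still subtracts the value that entry added, so the host timebase
      returns to where it started.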
    • powerpc/mm/radix: implement LPID based TLB flushes to be used by KVM · 0078778a
      Committed by Nicholas Piggin
      Implement a local TLB flush for invalidating an LPID with variants for
      process or partition scope. And a global TLB flush for invalidating
      a partition scoped page of an LPID.
      
      These will be used by KVM in subsequent patches.
      Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
  4. 24 April, 2018 1 commit
  5. 14 April, 2018 1 commit
  6. 13 April, 2018 1 commit
    • powerpc/64s: Fix CPU_FTRS_ALWAYS vs DT CPU features · 81b654c2
      Committed by Michael Ellerman
      The cpu_has_feature() mechanism has an optimisation where at build
      time we construct a mask of the CPU feature bits that will always be
      true for the given .config, based on the platform/bitness/etc. that we
      are building for.
      
      That is incompatible with DT CPU features, where the set of CPU
      features is dependent on feature flags that are given to us by
      firmware.
      
      The result is that some feature bits can not be *disabled* by DT CPU
      features. Or more accurately, they can be disabled but they will still
      appear in the ALWAYS mask, meaning cpu_has_feature() will always
      return true for them.
      
      In the past this hasn't really been a problem because on Book3S
      64 (where we support DT CPU features), the set of ALWAYS bits has been
      very small. That was because we always built for POWER4 and later,
      meaning the set of common bits was small.
      
      The only bit that could be cleared by DT CPU features that was also in
      the ALWAYS mask was CPU_FTR_NODSISRALIGN, and that was only used in
      the alignment handler to create a fake DSISR. That code was itself
      deleted in 31bfdb03 ("powerpc: Use instruction emulation
      infrastructure to handle alignment faults") (Sep 2017).
      
      However the set of ALWAYS features changed with the recent commit
      db5ae1c1 ("powerpc/64s: Refine feature sets for little endian
      builds") which restricted the set of feature flags when building
      little endian to Power7 or later. That caused the ALWAYS mask to
      become much larger for little endian builds.
      
      The result is that the following feature bits can currently not
      be *disabled* by DT CPU features:
      
        CPU_FTR_REAL_LE, CPU_FTR_MMCRA, CPU_FTR_CTRL, CPU_FTR_SMT,
        CPU_FTR_PURR, CPU_FTR_SPURR, CPU_FTR_DSCR, CPU_FTR_PKEY,
        CPU_FTR_VMX_COPY, CPU_FTR_CFAR, CPU_FTR_HAS_PPR.
      
      To fix it we need to mask the set of ALWAYS features with the base set
      of DT CPU features, ie. the features that are always enabled by DT CPU
      features. That way there are no bits in the ALWAYS mask that are not
      also always set by DT CPU features.
      
      Fixes: db5ae1c1 ("powerpc/64s: Refine feature sets for little endian builds")
      Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
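      The problem and the fix can be reproduced with three made-up
      feature bits. The bit values, mask names and the helper below are
      assumptions for illustration; they only mirror the shape of the
      check, where a bit in the ALWAYS mask short-circuits to true.

      ```c
      #include <assert.h>
      #include <stdint.h>

      /* Hypothetical feature bits. */
      #define FTR_A 0x1u
      #define FTR_B 0x2u
      #define FTR_C 0x4u

      /* Bits common to every CPU the .config can build for. */
      #define SKETCH_FTRS_ALWAYS_BUILD (FTR_A | FTR_B | FTR_C)
      /* Bits that DT CPU features always enable. */
      #define SKETCH_FTRS_DT_BASE      (FTR_A)
      /* The fix: keep only ALWAYS bits that firmware can never clear. */
      #define SKETCH_FTRS_ALWAYS_FIXED (SKETCH_FTRS_ALWAYS_BUILD & SKETCH_FTRS_DT_BASE)

      static int sketch_has_feature(uint32_t always, uint32_t runtime,
                                    uint32_t feature)
      {
          assert(feature != 0);
          /* A bit in the ALWAYS mask is reported without a runtime check. */
          return (always & feature) || (runtime & feature);
      }
      ```

      With the unmasked ALWAYS set, a feature that firmware disabled is
      still reported as present; with the masked set, the runtime value
      decides.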
  7. 10 April, 2018 1 commit
    • powerpc/powernv: define a standard delay for OPAL_BUSY type retry loops · 34dd25de
      Committed by Nicholas Piggin
      This is the start of an effort to tidy up and standardise all the
      delays. Existing loops have a range of delay/sleep periods from 1ms
      to 20ms, and some have no delay. They all loop forever except rtc,
      which times out after 10 retries, and that uses 10ms delays. So use
      10ms as our standard delay. The OPAL maintainer agrees 10ms is a
      reasonable starting point.
      
      The idea is to use the same recipe everywhere, once this is proven to
      work then it will be documented as an OPAL API standard. Then both
      firmware and OS can agree, and if a particular call needs something
      else, then that can be documented with reasoning.
      
      This is not the end-all of this effort, it's just a relatively easy
      change that fixes some existing high latency delays. There should be
      provision for standardising timeouts and/or interruptible loops where
      possible, so non-fatal firmware errors don't cause hangs.
      Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
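      The recipe, one fixed delay between retries while the call reports
      busy, might look like the sketch below. It runs in userspace
      against a fake firmware; the `sketch_` names, the callback types
      and the numeric busy code are stand-ins, and only the 10ms figure
      comes from the commit message.

      ```c
      #include <assert.h>

      #define SKETCH_OPAL_SUCCESS        0
      #define SKETCH_OPAL_BUSY         (-2)  /* stand-in value */
      #define SKETCH_OPAL_BUSY_DELAY_MS 10   /* the standard delay */

      typedef int (*sketch_call_fn)(void *ctx);
      typedef void (*sketch_sleep_fn)(unsigned int ms, void *ctx);

      /* Retry while the call reports busy, sleeping the standard delay
       * between attempts. Like the loops described, it never times out. */
      static int sketch_retry_while_busy(sketch_call_fn call,
                                         sketch_sleep_fn sleep_ms, void *ctx)
      {
          int rc;

          assert(call && sleep_ms);
          for (;;) {
              rc = call(ctx);
              if (rc != SKETCH_OPAL_BUSY)
                  return rc;
              sleep_ms(SKETCH_OPAL_BUSY_DELAY_MS, ctx);
          }
      }

      /* Fake firmware: busy for a fixed number of calls, then succeeds. */
      struct sketch_fw { int busy_left; unsigned int slept_ms; };

      static int sketch_fake_call(void *ctx)
      {
          struct sketch_fw *fw = ctx;

          if (fw->busy_left > 0) {
              fw->busy_left--;
              return SKETCH_OPAL_BUSY;
          }
          return SKETCH_OPAL_SUCCESS;
      }

      /* Records the requested sleep time instead of actually sleeping. */
      static void sketch_fake_sleep(unsigned int ms, void *ctx)
      {
          struct sketch_fw *fw = ctx;

          fw->slept_ms += ms;
      }
      ```

      Passing the sleep function in as a callback keeps the loop
      testable; a firmware that is busy three times costs 30ms of delay.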
  8. 09 April, 2018 1 commit
    • powerpc/modules: Fix crashes by adding CONFIG_RELOCATABLE to vermagic · 73aca179
      Committed by Michael Ellerman
      If you build the kernel with CONFIG_RELOCATABLE=n, then install the
      modules, rebuild the kernel with CONFIG_RELOCATABLE=y and leave the
      old modules installed, we crash something like:
      
        Unable to handle kernel paging request for data at address 0xd000000018d66cef
        Faulting instruction address: 0xc0000000021ddd08
        Oops: Kernel access of bad area, sig: 11 [#1]
        Modules linked in: x_tables autofs4
        CPU: 2 PID: 1 Comm: systemd Not tainted 4.16.0-rc6-gcc_ubuntu_le-g99fec39e #1
        ...
        NIP check_version.isra.22+0x118/0x170
        Call Trace:
          __ksymtab_xt_unregister_table+0x58/0xfffffffffffffcb8 [x_tables] (unreliable)
          resolve_symbol+0xb4/0x150
          load_module+0x10e8/0x29a0
          SyS_finit_module+0x110/0x140
          system_call+0x58/0x6c
      
      This happens because since commit 71810db2 ("modversions: treat
      symbol CRCs as 32 bit quantities"), a relocatable kernel encodes and
      handles symbol CRCs differently from a non-relocatable kernel.
      
      Although it's possible we could try and detect this situation and
      handle it, it's much more robust to simply make the state of
      CONFIG_RELOCATABLE part of the module vermagic.
      
      Fixes: 71810db2 ("modversions: treat symbol CRCs as 32 bit quantities")
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
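      The effect of putting the option into vermagic can be sketched with
      plain string comparison. The strings and helper below are
      hypothetical and much simpler than the kernel's real vermagic
      handling; they only show that a module built for the other
      CONFIG_RELOCATABLE setting no longer matches.

      ```c
      #include <assert.h>
      #include <string.h>

      /* Hypothetical vermagic fragments; real strings carry more fields. */
      #define SKETCH_VERMAGIC_BASE    "4.16.0 SMP mod_unload "
      #define SKETCH_VERMAGIC_RELOC_Y SKETCH_VERMAGIC_BASE "relocatable "
      #define SKETCH_VERMAGIC_RELOC_N SKETCH_VERMAGIC_BASE

      /* A module may load only if its vermagic equals the running kernel's. */
      static int sketch_module_may_load(const char *kernel_magic,
                                        const char *module_magic)
      {
          assert(kernel_magic && module_magic);
          return strcmp(kernel_magic, module_magic) == 0;
      }
      ```

      A stale module is then refused at load time with a version
      mismatch instead of crashing later in symbol resolution.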
  9. 06 April, 2018 1 commit
    • mm, powerpc: use vma_kernel_pagesize() in vma_mmu_pagesize() · 09135cc5
      Committed by Dan Williams
      Patch series "mm, smaps: MMUPageSize for device-dax", v3.
      
      Similar to commit 31383c68 ("mm, hugetlbfs: introduce ->split() to
      vm_operations_struct") here is another occasion where we want
      special-case hugetlbfs/hstate enabling to also apply to device-dax.
      
      This prompts the question what other hstate conversions we might do
      beyond ->split() and ->pagesize(), but this appears to be the last of
      the usages of hstate_vma() in generic/non-hugetlbfs specific code paths.
      
      This patch (of 3):
      
      The current powerpc definition of vma_mmu_pagesize() open codes looking
      up the page size via hstate.  It is identical to the generic
      vma_kernel_pagesize() implementation.
      
      Now, vma_kernel_pagesize() is growing support for determining the page
      size of Device-DAX vmas in addition to the existing Hugetlbfs page size
      determination.
      
      Ideally, if the powerpc vma_mmu_pagesize() used vma_kernel_pagesize() it
      would automatically benefit from any new vma-type support that is added
      to vma_kernel_pagesize().  However, the powerpc vma_mmu_pagesize() is
      prevented from calling vma_kernel_pagesize() due to a circular header
      dependency that requires vma_mmu_pagesize() to be defined before
      including <linux/hugetlb.h>.
      
      Break this circular dependency by defining the default vma_mmu_pagesize()
      as a __weak symbol to be overridden by the powerpc version.
      
      Link: http://lkml.kernel.org/r/151996254179.27922.2213728278535578744.stgit@dwillia2-desk3.amr.corp.intel.com
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Jane Chu <jane.chu@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
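      The weak-default pattern can be tried outside the kernel with the
      GCC/Clang `weak` attribute. The struct and function names below are
      hypothetical stand-ins, not the kernel definitions.

      ```c
      #include <assert.h>

      /* Stand-in for the VMA: just carries the page size the kernel would report. */
      struct sketch_vma { unsigned long kernel_pagesize; };

      static unsigned long sketch_vma_kernel_pagesize(const struct sketch_vma *vma)
      {
          assert(vma);
          return vma->kernel_pagesize;
      }

      /* Generic default, emitted as a weak symbol. An architecture that needs
       * a different answer provides a strong definition in another object
       * file, and the linker picks that one instead. */
      __attribute__((weak))
      unsigned long sketch_vma_mmu_pagesize(const struct sketch_vma *vma)
      {
          return sketch_vma_kernel_pagesize(vma);
      }
      ```

      Because the override is resolved at link time, the generic code
      does not need to see the architecture's definition in a header,
      which is what sidesteps the circular include.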
  10. 05 April, 2018 2 commits
  11. 04 April, 2018 2 commits
    • powerpc/hw_breakpoint: Only disable hw breakpoint if cpu supports it · 5d6a03eb
      Committed by Naveen N. Rao
      We get the below warning if we try to use kexec on P9:
         kexec_core: Starting new kernel
         WARNING: CPU: 0 PID: 1223 at arch/powerpc/kernel/process.c:826 __set_breakpoint+0xb4/0x140
         [snip]
         NIP __set_breakpoint+0xb4/0x140
         LR  kexec_prepare_cpus_wait+0x58/0x150
         Call Trace:
           0xc0000000ee70fb20 (unreliable)
           0xc0000000ee70fb20
           default_machine_kexec+0x234/0x2c0
           machine_kexec+0x84/0x90
           kernel_kexec+0xd8/0xe0
           SyS_reboot+0x214/0x2c0
           system_call+0x58/0x6c
      
      This happens since we are trying to clear hw breakpoint on POWER9,
      though we don't have CPU_FTR_DAWR enabled. Guard __set_breakpoint()
      within hw_breakpoint_disable() with ppc_breakpoint_available() to
      address this.
      
      Fixes: 96541531 ("powerpc: Disable DAWR in the base POWER9 CPU features")
      Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc/mm/radix: Update pte fragment count from 16 to 256 on radix · fb4e5dbd
      Committed by Aneesh Kumar K.V
      With split PTL (page table lock) config, we allocate the level
      4 (leaf) page table using pte fragment framework instead of slab cache
      like other levels. This was done to enable us to have split page table
      lock at level 4 of the page table. We use the page->ptl of the page
      backing all the level 4 pte fragments for the lock.
      
      Currently with Radix, we use only 16 fragments out of the allocated
      page. In radix each fragment is 256 bytes which means we use only 4k
      out of the allocated 64K page wasting 60k of the allocated memory.
      This was done earlier to keep it closer to hash.
      
      This patch updates the pte fragment count to 256, thereby using the
      full 64K page and reducing memory usage. Performance tests show
      very low impact even with THP disabled. With THP enabled we will
      contend even less on the level 4 ptl, so the impact should be
      lower still.
      
        256 threads:
          without patch (10 runs of ./ebizzy  -m -n 1000 -s 131072 -S 100)
            median = 15678.5
            stdev = 42.1209
      
          with patch:
            median = 15354
            stdev = 194.743
      
      This is with THP disabled. With THP enabled the impact of the patch
      will be less.
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
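      The memory figures in this message follow from simple arithmetic,
      checked here in a small sketch. The macro and function names are
      hypothetical; the 64K page size, 256-byte fragment size and the two
      fragment counts come from the text above.

      ```c
      #include <assert.h>

      #define SKETCH_PAGE_SIZE     (64u * 1024u)  /* 64K page */
      #define SKETCH_RADIX_FRAG_SZ 256u           /* bytes per radix pte fragment */

      /* Bytes of the page actually used for a given fragment count. */
      static unsigned int sketch_frag_bytes_used(unsigned int nr_frags)
      {
          assert(nr_frags * SKETCH_RADIX_FRAG_SZ <= SKETCH_PAGE_SIZE);
          return nr_frags * SKETCH_RADIX_FRAG_SZ;
      }

      /* Bytes of the page left unused for a given fragment count. */
      static unsigned int sketch_frag_bytes_wasted(unsigned int nr_frags)
      {
          return SKETCH_PAGE_SIZE - sketch_frag_bytes_used(nr_frags);
      }
      ```

      With 16 fragments, 16 * 256 = 4096 bytes are used and 61440 bytes
      (60K) are wasted; with 256 fragments the whole 65536-byte page is
      used.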
  12. 03 April, 2018 3 commits
    • powerpc/powernv: Always stop secondaries before reboot/shutdown · f2748bdf
      Committed by Nicholas Piggin
      Currently powernv reboot and shutdown requests just leave secondaries
      to do their own things. This is undesirable because they can trigger
      any number of watchdogs while waiting for reboot, but also we don't
      know what else they might be doing -- they might be causing trouble,
      trampling memory, etc.
      
      The opal scheduled flash update code already ran into watchdog problems
      due to flashing taking a long time, and it was fixed with 2196c6f1
      ("powerpc/powernv: Return secondary CPUs to firmware before FW update"),
      which returns secondaries to opal. It's been found that regular reboots
      can take over 10 seconds, which can result in the hard lockup watchdog
      firing,
      
        reboot: Restarting system
        [  360.038896709,5] OPAL: Reboot request...
        Watchdog CPU:0 Hard LOCKUP
        Watchdog CPU:44 detected Hard LOCKUP other CPUS:16
        Watchdog CPU:16 Hard LOCKUP
        watchdog: BUG: soft lockup - CPU#16 stuck for 3s! [swapper/16:0]
      
      This patch removes the special case for flash update, and calls
      smp_send_stop in all cases before calling reboot/shutdown.
      
      smp_send_stop could return CPUs to OPAL; the main reason it does not is
      that the request could come from a NMI that interrupts OPAL code,
      so re-entry to OPAL can cause a number of problems. Putting
      secondaries into simple spin loops improves the chances of a
      successful reboot.
      Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
      Reviewed-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc: Move default security feature flags · e7347a86
      Committed by Mauricio Faria de Oliveira
      This moves the definition of the default security feature flags
      (i.e., enabled by default) closer to the security feature flags.
      
      This can be used to restore current flags to the default flags.
      Signed-off-by: Mauricio Faria de Oliveira <mauricfo@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc: Fix oops due to bad access of lppaca on bare metal · a6201da3
      Committed by Aneesh Kumar K.V
      Commit 8e0b634b ("powerpc/64s: Do not allocate lppaca if we are
      not virtualized") removed allocation of lppaca on bare metal
      platforms. But with CONFIG_PPC_SPLPAR enabled, we still access the
      lppaca on bare metal in some code paths.
      
      Fix this by adding runtime checks for SPLPAR (shared processor LPAR).
      
      Fixes: 8e0b634b ("powerpc/64s: Do not allocate lppaca if we are not virtualized")
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
  13. 01 April, 2018 1 commit
  14. 31 March, 2018 6 commits
  15. 30 March, 2018 12 commits