1. 19 4月, 2018 1 次提交
    • D
      compat: Move compat_timespec/ timeval to compat_time.h · 0d55303c
      Deepa Dinamani 提交于
      All the current architecture specific defines for these
      are the same. Refactor these common defines to a common
      header file.
      
      The new common linux/compat_time.h is also useful as it
      will eventually be used to hold all the defines that
      are needed for compat time types that support non y2038
      safe types. New architectures need not have to define these
      new types as they will only use new y2038 safe syscalls.
      This file can be deleted after y2038 when we stop supporting
      non y2038 safe syscalls.
      
      The patch also requires an operation similar to:
      
      git grep "asm/compat\.h" | cut -d ":" -f 1 |  xargs -n 1 sed -i -e "s%asm/compat.h%linux/compat.h%g"
      
      Cc: acme@kernel.org
      Cc: benh@kernel.crashing.org
      Cc: borntraeger@de.ibm.com
      Cc: catalin.marinas@arm.com
      Cc: cmetcalf@mellanox.com
      Cc: cohuck@redhat.com
      Cc: davem@davemloft.net
      Cc: deller@gmx.de
      Cc: devel@driverdev.osuosl.org
      Cc: gerald.schaefer@de.ibm.com
      Cc: gregkh@linuxfoundation.org
      Cc: heiko.carstens@de.ibm.com
      Cc: hoeppner@linux.vnet.ibm.com
      Cc: hpa@zytor.com
      Cc: jejb@parisc-linux.org
      Cc: jwi@linux.vnet.ibm.com
      Cc: linux-kernel@vger.kernel.org
      Cc: linux-mips@linux-mips.org
      Cc: linux-parisc@vger.kernel.org
      Cc: linuxppc-dev@lists.ozlabs.org
      Cc: linux-s390@vger.kernel.org
      Cc: mark.rutland@arm.com
      Cc: mingo@redhat.com
      Cc: mpe@ellerman.id.au
      Cc: oberpar@linux.vnet.ibm.com
      Cc: oprofile-list@lists.sf.net
      Cc: paulus@samba.org
      Cc: peterz@infradead.org
      Cc: ralf@linux-mips.org
      Cc: rostedt@goodmis.org
      Cc: rric@kernel.org
      Cc: schwidefsky@de.ibm.com
      Cc: sebott@linux.vnet.ibm.com
      Cc: sparclinux@vger.kernel.org
      Cc: sth@linux.vnet.ibm.com
      Cc: ubraun@linux.vnet.ibm.com
      Cc: will.deacon@arm.com
      Cc: x86@kernel.org
      Signed-off-by: NArnd Bergmann <arnd@arndb.de>
      Signed-off-by: NDeepa Dinamani <deepa.kernel@gmail.com>
      Acked-by: NSteven Rostedt (VMware) <rostedt@goodmis.org>
      Acked-by: NCatalin Marinas <catalin.marinas@arm.com>
      Acked-by: NJames Hogan <jhogan@kernel.org>
      Acked-by: NHelge Deller <deller@gmx.de>
      Signed-off-by: NArnd Bergmann <arnd@arndb.de>
      0d55303c
  2. 30 3月, 2018 1 次提交
  3. 23 3月, 2018 2 次提交
    • P
      KVM: PPC: Book3S HV: Work around transactional memory bugs in POWER9 · 4bb3c7a0
      Paul Mackerras 提交于
      POWER9 has hardware bugs relating to transactional memory and thread
      reconfiguration (changes to hardware SMT mode).  Specifically, the core
      does not have enough storage to store a complete checkpoint of all the
      architected state for all four threads.  The DD2.2 version of POWER9
      includes hardware modifications designed to allow hypervisor software
      to implement workarounds for these problems.  This patch implements
      those workarounds in KVM code so that KVM guests see a full, working
      transactional memory implementation.
      
      The problems center around the use of TM suspended state, where the
      CPU has a checkpointed state but execution is not transactional.  The
      workaround is to implement a "fake suspend" state, which looks to the
      guest like suspended state but the CPU does not store a checkpoint.
      In this state, any instruction that would cause a transition to
      transactional state (rfid, rfebb, mtmsrd, tresume) or would use the
      checkpointed state (treclaim) causes a "soft patch" interrupt (vector
      0x1500) to the hypervisor so that it can be emulated.  The trechkpt
      instruction also causes a soft patch interrupt.
      
      On POWER9 DD2.2, we avoid returning to the guest in any state which
      would require a checkpoint to be present.  The trechkpt in the guest
      entry path which would normally create that checkpoint is replaced by
      either a transition to fake suspend state, if the guest is in suspend
      state, or a rollback to the pre-transactional state if the guest is in
      transactional state.  Fake suspend state is indicated by a flag in the
      PACA plus a new bit in the PSSCR.  The new PSSCR bit is write-only and
      reads back as 0.
      
      On exit from the guest, if the guest is in fake suspend state, we still
      do the treclaim instruction as we would in real suspend state, in order
      to get into non-transactional state, but we do not save the resulting
      register state since there was no checkpoint.
      
      Emulation of the instructions that cause a softpatch interrupt is
      handled in two paths.  If the guest is in real suspend mode, we call
      kvmhv_p9_tm_emulation_early() to handle the cases where the guest is
      transitioning to transactional state.  This is called before we do the
      treclaim in the guest exit path; because we haven't done treclaim, we
      can get back to the guest with the transaction still active.  If the
      instruction is a case that kvmhv_p9_tm_emulation_early() doesn't
      handle, or if the guest is in fake suspend state, then we proceed to
      do the complete guest exit path and subsequently call
      kvmhv_p9_tm_emulation() in host context with the MMU on.  This handles
      all the cases including the cases that generate program interrupts
      (illegal instruction or TM Bad Thing) and facility unavailable
      interrupts.
      
      The emulation is reasonably straightforward and is mostly concerned
      with checking for exception conditions and updating the state of
      registers such as MSR and CR0.  The treclaim emulation takes care to
      ensure that the TEXASR register gets updated as if it were the guest
      treclaim instruction that had done failure recording, not the treclaim
      done in hypervisor state in the guest exit path.
      
      With this, the KVM_CAP_PPC_HTM capability returns true (1) even if
      transactional memory is not available to host userspace.
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      4bb3c7a0
    • P
      powerpc/powernv: Provide a way to force a core into SMT4 mode · 7672691a
      Paul Mackerras 提交于
      POWER9 processors up to and including "Nimbus" v2.2 have hardware
      bugs relating to transactional memory and thread reconfiguration.
      One of these bugs has a workaround which is to get the core into
      SMT4 state temporarily.  This workaround is only needed when
      running bare-metal.
      
      This patch provides a function which gets the core into SMT4 mode
      by preventing threads from going to a stop state, and waking up
      those which are already in a stop state.  Once at least 3 threads
      are not in a stop state, the core will be in SMT4 and we can
      continue.
      
      To do this, we add a "dont_stop" flag to the paca to tell the
      thread not to go into a stop state.  If this flag is set,
      power9_idle_stop() just returns immediately with a return value
      of 0.  The pnv_power9_force_smt4_catch() function does the following:
      
      1. Set the dont_stop flag for each thread in the core, except
         ourselves (in fact we use an atomic_inc() in case more than
         one thread is calling this function concurrently).
      2. See how many threads are awake, indicated by their
         requested_psscr field in the paca being 0.  If this is at
         least 3, skip to step 5.
      3. Send a doorbell interrupt to each thread that was seen as
         being in a stop state in step 2.
      4. Until at least 3 threads are awake, scan the threads to which
         we sent a doorbell interrupt and check if they are awake now.
      
      This relies on the following properties:
      
      - Once dont_stop is non-zero, requested_psccr can't go from zero to
        non-zero, except transiently (and without the thread doing stop).
      - requested_psscr being zero guarantees that the thread isn't in
        a state-losing stop state where thread reconfiguration could occur.
      - Doing stop with a PSSCR value of 0 won't be a state-losing stop
        and thus won't allow thread reconfiguration.
      - Once threads_per_core/2 + 1 (i.e. 3) threads are awake, the core
        must be in SMT4 mode, since SMT modes are powers of 2.
      
      This does add a sync to power9_idle_stop(), which is necessary to
      provide the correct ordering between setting requested_psscr and
      checking dont_stop.  The overhead of the sync should be unnoticeable
      compared to the latency of going into and out of a stop state.
      
      Because some objected to incurring this extra latency on systems where
      the XER[SO] bug is not relevant, I have put the test in
      power9_idle_stop inside a feature section.  This means that
      pnv_power9_force_smt4_catch() WILL NOT WORK correctly on systems
      without the CPU_FTR_P9_TM_XER_SO_BUG feature bit set, and will
      probably hang the system.
      
      In order to cater for uses where the caller has an operation that
      has to be done while the core is in SMT4, the core continues to be
      kept in SMT4 after pnv_power9_force_smt4_catch() function returns,
      until the pnv_power9_force_smt4_release() function is called.
      It undoes the effect of step 1 above and allows the other threads
      to go into a stop state.
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      7672691a
  4. 23 1月, 2018 1 次提交
    • N
      powerpc/64s: Improve RFI L1-D cache flush fallback · bdcb1aef
      Nicholas Piggin 提交于
      The fallback RFI flush is used when firmware does not provide a way
      to flush the cache. It's a "displacement flush" that evicts useful
      data by displacing it with an uninteresting buffer.
      
      The flush has to take care to work with implementation specific cache
      replacment policies, so the recipe has been in flux. The initial
      slow but conservative approach is to touch all lines of a congruence
      class, with dependencies between each load. It has since been
      determined that a linear pattern of loads without dependencies is
      sufficient, and is significantly faster.
      
      Measuring the speed of a null syscall with RFI fallback flush enabled
      gives the relative improvement:
      
      P8 - 1.83x
      P9 - 1.75x
      
      The flush also becomes simpler and more adaptable to different cache
      geometries.
      Signed-off-by: NNicholas Piggin <npiggin@gmail.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      bdcb1aef
  5. 19 1月, 2018 3 次提交
  6. 10 1月, 2018 1 次提交
    • M
      powerpc/64s: Add support for RFI flush of L1-D cache · aa8a5e00
      Michael Ellerman 提交于
      On some CPUs we can prevent the Meltdown vulnerability by flushing the
      L1-D cache on exit from kernel to user mode, and from hypervisor to
      guest.
      
      This is known to be the case on at least Power7, Power8 and Power9. At
      this time we do not know the status of the vulnerability on other CPUs
      such as the 970 (Apple G5), pasemi CPUs (AmigaOne X1000) or Freescale
      CPUs. As more information comes to light we can enable this, or other
      mechanisms on those CPUs.
      
      The vulnerability occurs when the load of an architecturally
      inaccessible memory region (eg. userspace load of kernel memory) is
      speculatively executed to the point where its result can influence the
      address of a subsequent speculatively executed load.
      
      In order for that to happen, the first load must hit in the L1,
      because before the load is sent to the L2 the permission check is
      performed. Therefore if no kernel addresses hit in the L1 the
      vulnerability can not occur. We can ensure that is the case by
      flushing the L1 whenever we return to userspace. Similarly for
      hypervisor vs guest.
      
      In order to flush the L1-D cache on exit, we add a section of nops at
      each (h)rfi location that returns to a lower privileged context, and
      patch that with some sequence. Newer firmwares are able to advertise
      to us that there is a special nop instruction that flushes the L1-D.
      If we do not see that advertised, we fall back to doing a displacement
      flush in software.
      
      For guest kernels we support migration between some CPU versions, and
      different CPUs may use different flush instructions. So that we are
      prepared to migrate to a machine with a different flush instruction
      activated, we may have to patch more than one flush instruction at
      boot if the hypervisor tells us to.
      
      In the end this patch is mostly the work of Nicholas Piggin and
      Michael Ellerman. However a cast of thousands contributed to analysis
      of the issue, earlier versions of the patch, back ports testing etc.
      Many thanks to all of them.
      Tested-by: NJon Masters <jcm@redhat.com>
      Signed-off-by: NNicholas Piggin <npiggin@gmail.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      aa8a5e00
  7. 04 12月, 2017 1 次提交
    • S
      powerpc/vdso64: Add support for CLOCK_{REALTIME/MONOTONIC}_COARSE · 5c929885
      Santosh Sivaraj 提交于
      Current vDSO64 implementation does not have support for coarse clocks
      (CLOCK_MONOTONIC_COARSE, CLOCK_REALTIME_COARSE), for which it falls back
      to system call, increasing the response time, vDSO implementation reduces
      the cycle time. Below is a benchmark of the difference in execution times.
      
      (Non-coarse clocks are also included just for completion)
      
      clock-gettime-realtime: syscall: 172 nsec/call
      clock-gettime-realtime:    libc: 28 nsec/call
      clock-gettime-realtime:    vdso: 22 nsec/call
      clock-gettime-monotonic: syscall: 171 nsec/call
      clock-gettime-monotonic:    libc: 30 nsec/call
      clock-gettime-monotonic:    vdso: 25 nsec/call
      clock-gettime-realtime-coarse: syscall: 153 nsec/call
      clock-gettime-realtime-coarse:    libc: 16 nsec/call
      clock-gettime-realtime-coarse:    vdso: 10 nsec/call
      clock-gettime-monotonic-coarse: syscall: 167 nsec/call
      clock-gettime-monotonic-coarse:    libc: 17 nsec/call
      clock-gettime-monotonic-coarse:    vdso: 11 nsec/call
      
      CC: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Reviewed-by: NNaveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
      Signed-off-by: NSantosh Sivaraj <santosh@fossix.org>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      5c929885
  8. 13 11月, 2017 1 次提交
  9. 06 11月, 2017 1 次提交
    • M
      powerpc/64s: Replace CONFIG_PPC_STD_MMU_64 with CONFIG_PPC_BOOK3S_64 · 4e003747
      Michael Ellerman 提交于
      CONFIG_PPC_STD_MMU_64 indicates support for the "standard" powerpc MMU
      on 64-bit CPUs. The "standard" MMU refers to the hash page table MMU
      found in "server" processors, from IBM mainly.
      
      Currently CONFIG_PPC_STD_MMU_64 is == CONFIG_PPC_BOOK3S_64. While it's
      annoying to have two symbols that always have the same value, it's not
      quite annoying enough to bother removing one.
      
      However with the arrival of Power9, we now have the situation where
      CONFIG_PPC_STD_MMU_64 is enabled, but the kernel is running using the
      Radix MMU - *not* the "standard" MMU. So it is now actively confusing
      to use it, because it implies that code is disabled or inactive when
      the Radix MMU is in use, however that is not necessarily true.
      
      So s/CONFIG_PPC_STD_MMU_64/CONFIG_PPC_BOOK3S_64/, and do some minor
      formatting updates of some of the affected lines.
      
      This will be a pain for backports, but c'est la vie.
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      4e003747
  10. 01 11月, 2017 1 次提交
    • P
      KVM: PPC: Book3S HV: Run HPT guests on POWER9 radix hosts · c0101509
      Paul Mackerras 提交于
      This patch removes the restriction that a radix host can only run
      radix guests, allowing us to run HPT (hashed page table) guests as
      well.  This is useful because it provides a way to run old guest
      kernels that know about POWER8 but not POWER9.
      
      Unfortunately, POWER9 currently has a restriction that all threads
      in a given code must either all be in HPT mode, or all in radix mode.
      This means that when entering a HPT guest, we have to obtain control
      of all 4 threads in the core and get them to switch their LPIDR and
      LPCR registers, even if they are not going to run a guest.  On guest
      exit we also have to get all threads to switch LPIDR and LPCR back
      to host values.
      
      To make this feasible, we require that KVM not be in the "independent
      threads" mode, and that the CPU cores be in single-threaded mode from
      the host kernel's perspective (only thread 0 online; threads 1, 2 and
      3 offline).  That allows us to use the same code as on POWER8 for
      obtaining control of the secondary threads.
      
      To manage the LPCR/LPIDR changes required, we extend the kvm_split_info
      struct to contain the information needed by the secondary threads.
      All threads perform a barrier synchronization (where all threads wait
      for every other thread to reach the synchronization point) on guest
      entry, both before and after loading LPCR and LPIDR.  On guest exit,
      they all once again perform a barrier synchronization both before
      and after loading host values into LPCR and LPIDR.
      
      Finally, it is also currently necessary to flush the entire TLB every
      time we enter a HPT guest on a radix host.  We do this on thread 0
      with a loop of tlbiel instructions.
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      c0101509
  11. 01 8月, 2017 1 次提交
    • G
      powerpc/powernv: Save/Restore additional SPRs for stop4 cpuidle · e1c1cfed
      Gautham R. Shenoy 提交于
      The stop4 idle state on POWER9 is a deep idle state which loses
      hypervisor resources, but whose latency is low enough that it can be
      exposed via cpuidle.
      
      Until now, the deep idle states which lose hypervisor resources (eg:
      winkle) were only exposed via CPU-Hotplug.  Hence currently on wakeup
      from such states, barring a few SPRs which need to be restored to
      their older value, rest of the SPRS are reinitialized to their values
      corresponding to that at boot time.
      
      When stop4 is used in the context of cpuidle, we want these additional
      SPRs to be restored to their older value, to ensure that the context
      on the CPU coming back from idle is same as it was before going idle.
      
      In this patch, we define a SPR save area in PACA (since we have used
      up the volatile register space in the stack) and on POWER9, we restore
      SPRN_PID, SPRN_LDBAR, SPRN_FSCR, SPRN_HFSCR, SPRN_MMCRA, SPRN_MMCR1,
      SPRN_MMCR2 to the values they had before entering stop.
      Signed-off-by: NGautham R. Shenoy <ego@linux.vnet.ibm.com>
      Reviewed-by: NNicholas Piggin <npiggin@gmail.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      e1c1cfed
  12. 27 6月, 2017 1 次提交
  13. 21 6月, 2017 1 次提交
    • A
      KVM: PPC: Book3S HV: Add new capability to control MCE behaviour · 134764ed
      Aravinda Prasad 提交于
      This introduces a new KVM capability to control how KVM behaves
      on machine check exception (MCE) in HV KVM guests.
      
      If this capability has not been enabled, KVM redirects machine check
      exceptions to guest's 0x200 vector, if the address in error belongs to
      the guest. With this capability enabled, KVM will cause a guest exit
      with the exit reason indicating an NMI.
      
      The new capability is required to avoid problems if a new kernel/KVM
      is used with an old QEMU, running a guest that doesn't issue
      "ibm,nmi-register".  As old QEMU does not understand the NMI exit
      type, it treats it as a fatal error.  However, the guest could have
      handled the machine check error if the exception was delivered to
      guest's 0x200 interrupt vector instead of NMI exit in case of old
      QEMU.
      
      [paulus@ozlabs.org - Reworded the commit message to be clearer,
       enable only on HV KVM.]
      Signed-off-by: NAravinda Prasad <aravinda@linux.vnet.ibm.com>
      Reviewed-by: NDavid Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: NMahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      134764ed
  14. 19 6月, 2017 3 次提交
    • N
      powerpc/64s: msgclr when handling doorbell exceptions from system reset · a9af97aa
      Nicholas Piggin 提交于
      msgsnd doorbell exceptions are cleared when the doorbell interrupt is
      taken. However if a doorbell exception causes a system reset interrupt
      wake from power saving state, the message is not cleared. Processing
      the doorbell from the system reset interrupt requires msgclr to avoid
      taking the exception again.
      
      Testing this plus the previous wakup direct patch gives:
      
                                      original         wakeup direct     msgclr
      Different threads, same core:   315k/s           264k/s            345k/s
      Different cores:                235k/s           242k/s            242k/s
      
      Net speedup is +10% for same core, and +3% for different core.
      Reviewed-by: NGautham R. Shenoy <ego@linux.vnet.ibm.com>
      Signed-off-by: NNicholas Piggin <npiggin@gmail.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      a9af97aa
    • P
      KVM: PPC: Book3S HV: Virtualize doorbell facility on POWER9 · 57900694
      Paul Mackerras 提交于
      On POWER9, we no longer have the restriction that we had on POWER8
      where all threads in a core have to be in the same partition, so
      the CPU threads are now independent.  However, we still want to be
      able to run guests with a virtual SMT topology, if only to allow
      migration of guests from POWER8 systems to POWER9.
      
      A guest that has a virtual SMT mode greater than 1 will expect to
      be able to use the doorbell facility; it will expect the msgsndp
      and msgclrp instructions to work appropriately and to be able to read
      sensible values from the TIR (thread identification register) and
      DPDES (directed privileged doorbell exception status) special-purpose
      registers.  However, since each CPU thread is a separate sub-processor
      in POWER9, these instructions and registers can only be used within
      a single CPU thread.
      
      In order for these instructions to appear to act correctly according
      to the guest's virtual SMT mode, we have to trap and emulate them.
      We cause them to trap by clearing the HFSCR_MSGP bit in the HFSCR
      register.  The emulation is triggered by the hypervisor facility
      unavailable interrupt that occurs when the guest uses them.
      
      To cause a doorbell interrupt to occur within the guest, we set the
      DPDES register to 1.  If the guest has interrupts enabled, the CPU
      will generate a doorbell interrupt and clear the DPDES register in
      hardware.  The DPDES hardware register for the guest is saved in the
      vcpu->arch.vcore->dpdes field.  Since this gets written by the guest
      exit code, other VCPUs wishing to cause a doorbell interrupt don't
      write that field directly, but instead set a vcpu->arch.doorbell_request
      flag.  This is consumed and set to 0 by the guest entry code, which
      then sets DPDES to 1.
      
      Emulating reads of the DPDES register is somewhat involved, because
      it requires reading the doorbell pending interrupt status of all of the
      VCPU threads in the virtual core, and if any of those VCPUs are
      running, their doorbell status is only up-to-date in the hardware
      DPDES registers of the CPUs where they are running.  In order to get
      a reasonable approximation of the current doorbell status, we send
      those CPUs an IPI, causing an exit from the guest which will update
      the vcpu->arch.vcore->dpdes field.  We then use that value in
      constructing the emulated DPDES register value.
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      57900694
    • P
      KVM: PPC: Book3S HV: Context-switch HFSCR between host and guest on POWER9 · 769377f7
      Paul Mackerras 提交于
      This adds code to allow us to use a different value for the HFSCR
      (Hypervisor Facilities Status and Control Register) when running the
      guest from that which applies in the host.  The reason for doing this
      is to allow us to trap the msgsndp instruction and related operations
      in future so that they can be virtualized.  We also save the value of
      HFSCR when a hypervisor facility unavailable interrupt occurs, because
      the high byte of HFSCR indicates which facility the guest attempted to
      access.
      
      We save and restore the host value on guest entry/exit because some
      bits of it affect host userspace execution.
      
      We only do all this on POWER9, not on POWER8, because we are not
      intending to virtualize any of the facilities controlled by HFSCR on
      POWER8.  In particular, the HFSCR bit that controls execution of
      msgsndp and related operations does not exist on POWER8.  The HFSCR
      doesn't exist at all on POWER7.
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      769377f7
  15. 30 5月, 2017 1 次提交
    • G
      powerpc/powernv/idle: Use Requested Level for restoring state on P9 DD1 · 22c6663d
      Gautham R. Shenoy 提交于
      On Power9 DD1 due to a hardware bug the Power-Saving Level Status
      field (PLS) of the PSSCR for a thread waking up from a deep state can
      under-report if some other thread in the core is in a shallow stop
      state. The scenario in which this can manifest is as follows:
      
         1) All the threads of the core are in deep stop.
         2) One of the threads is woken up. The PLS for this thread will
            correctly reflect that it is waking up from deep stop.
         3) The thread that has woken up now executes a shallow stop.
         4) When some other thread in the core is woken, its PLS will reflect
            the shallow stop state.
      
      Thus, the subsequent thread for which the PLS is under-reporting the
      wakeup state will not restore the hypervisor resources.
      
      Hence, on DD1 systems, use the Requested Level (RL) field as a
      workaround to restore the contents of the hypervisor resources on the
      wakeup from the stop state.
      Signed-off-by: NGautham R. Shenoy <ego@linux.vnet.ibm.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      22c6663d
  16. 28 4月, 2017 3 次提交
    • N
      powerpc/64s: Dedicated system reset interrupt stack · b1ee8a3d
      Nicholas Piggin 提交于
      The system reset interrupt is used for crash/debug situations, so it is
      desirable to have as little impact on the normal state of the system as
      possible.
      
      Currently it uses the current kernel stack to process the exception.
      This stores into the stack which may be involved with the crash. The
      stack pointer may be corrupted, or it may have overflowed.
      
      Avoid or minimise these problems by creating a dedicated NMI stack for
      the system reset interrupt to use.
      Signed-off-by: NNicholas Piggin <npiggin@gmail.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      b1ee8a3d
    • N
      powerpc/64s: Disallow system reset vs system reset reentrancy · c4f3b52c
      Nicholas Piggin 提交于
      In preparation for using a dedicated stack for system reset interrupts,
      prevent a nested system reset from recovering, in order to simplify
      code that is called in crash/debug path. This allows a system reset
      interrupt to just use the base stack pointer.
      
      Keep an in_nmi nesting counter similarly to the in_mce counter. Consider
      the interrrupt non-recoverable if it is taken inside another system
      reset.
      
      Interrupt nesting could be allowed similarly to MCE, but system reset
      is a special case that's not for normal operation, so simplicity wins
      until there is requirement for nested system reset interrupts.
      Signed-off-by: NNicholas Piggin <npiggin@gmail.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      c4f3b52c
    • N
      powerpc/64s: Fix system reset vs general interrupt reentrancy · a3d96f70
      Nicholas Piggin 提交于
      The system reset interrupt can occur when MSR_EE=0, and it currently
      uses the PACA_EXGEN save area.
      
      Some PACA_EXGEN interrupts have a window where MSR_RI=1 and MSR_EE=0
      when the save area is still in use. A system reset interrupt in this
      window can lead to undetected corruption when the save area gets
      overwritten.
      
      This patch introduces PACA_EXNMI save area for system reset exceptions,
      which closes this corruption window. It's also helpful to retain the
      EXGEN state for debugging situations, even if not considering the
      recoverability aspect.
      
      This patch also moves the PACA_EXMC area down to a less frequently used
      part of the paca with the new save area.
      Signed-off-by: NNicholas Piggin <npiggin@gmail.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      a3d96f70
  17. 27 4月, 2017 1 次提交
  18. 12 4月, 2017 1 次提交
    • M
      powerpc/mm: Fix swapper_pg_dir size on 64-bit hash w/64K pages · 03dfee6d
      Michael Ellerman 提交于
      Recently in commit f6eedbba ("powerpc/mm/hash: Increase VA range to 128TB"),
      we increased H_PGD_INDEX_SIZE to 15 when we're building with 64K pages. This
      makes it larger than RADIX_PGD_INDEX_SIZE (13), which means the logic to
      calculate MAX_PGD_INDEX_SIZE in book3s/64/pgtable.h is wrong.
      
      The end result is that the PGD (Page Global Directory, ie top level page table)
      of the kernel (aka. swapper_pg_dir), is too small.
      
      This generally doesn't lead to a crash, as we don't use the full range in normal
      operation. However if we try to dump the kernel pagetables we can trigger a
      crash because we walk off the end of the pgd into other memory and eventually
      try to dereference something bogus:
      
        $ cat /sys/kernel/debug/kernel_pagetables
        Unable to handle kernel paging request for data at address 0xe8fece0000000000
        Faulting instruction address: 0xc000000000072314
        cpu 0xc: Vector: 380 (Data SLB Access) at [c0000000daa13890]
            pc: c000000000072314: ptdump_show+0x164/0x430
            lr: c000000000072550: ptdump_show+0x3a0/0x430
           dar: e802cf0000000000
        seq_read+0xf8/0x560
        full_proxy_read+0x84/0xc0
        __vfs_read+0x6c/0x1d0
        vfs_read+0xbc/0x1b0
        SyS_read+0x6c/0x110
        system_call+0x38/0xfc
      
      The root cause is that MAX_PGD_INDEX_SIZE isn't actually computed to be
      the max of H_PGD_INDEX_SIZE or RADIX_PGD_INDEX_SIZE. To fix that move
      the calculation into asm-offsets.c where we can do it easily using
      max().
      
      Fixes: f6eedbba ("powerpc/mm/hash: Increase VA range to 128TB")
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      03dfee6d
  19. 11 4月, 2017 1 次提交
    • G
      powerpc/powernv: Recover correct PACA on wakeup from a stop on P9 DD1 · 17ed4c8f
      Gautham R. Shenoy 提交于
      POWER9 DD1.0 hardware has a bug where the SPRs of a thread waking up
      from stop 0,1,2 with ESL=1 can endup being misplaced in the core. Thus
      the HSPRG0 of a thread waking up from can contain the paca pointer of
      its sibling.
      
      This patch implements a context recovery framework within threads of a
      core, by provisioning space in paca_struct for saving every sibling
      threads's paca pointers. Basically, we should be able to arrive at the
      right paca pointer from any of the thread's existing paca pointer.
      
      At bootup, during powernv idle-init, we save the paca address of every
      CPU in each one its siblings paca_struct in the slot corresponding to
      this CPU's index in the core.
      
      On wakeup from a stop, the thread will determine its index in the core
      from the TIR register and recover its PACA pointer by indexing into
      the correct slot in the provisioned space in the current PACA.
      
      Furthermore, ensure that the NVGPRs are restored from the stack on the
      way out by setting the NAPSTATELOST in paca.
      
      [Changelog written with inputs from svaidy@linux.vnet.ibm.com]
      Signed-off-by: NGautham R. Shenoy <ego@linux.vnet.ibm.com>
      Reviewed-by: NNicholas Piggin <npiggin@gmail.com>
      [mpe: Call it a bug]
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      17ed4c8f
  20. 01 4月, 2017 1 次提交
  21. 15 2月, 2017 2 次提交
  22. 06 2月, 2017 2 次提交
  23. 31 1月, 2017 1 次提交
  24. 24 1月, 2017 1 次提交
    • M
      powerpc: Revert the initial stack protector support · f2574030
      Michael Ellerman 提交于
      Unfortunately the stack protector support we merged recently only works
      on some toolchains. If the toolchain is built without glibc support
      everything works fine, but if glibc is built then it leads to a panic
      at boot.
      
      The solution is not rc5 material, so revert the support for now. This
      reverts commits:
      
      6533b7c1 ("powerpc: Initial stack protector (-fstack-protector) support")
      902e06eb ("powerpc/32: Change the stack protector canary value per task")
      
      Fixes: 6533b7c1 ("powerpc: Initial stack protector (-fstack-protector) support")
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      f2574030
  25. 14 1月, 2017 1 次提交
  26. 24 11月, 2016 2 次提交
    • P
      KVM: PPC: Book3S HV: Adapt TLB invalidations to work on POWER9 · 7c5b06ca
      Paul Mackerras 提交于
      POWER9 adds new capabilities to the tlbie (TLB invalidate entry)
      and tlbiel (local tlbie) instructions.  Both instructions get a
      set of new parameters (RIC, PRS and R) which appear as bits in the
      instruction word.  The tlbiel instruction now has a second register
      operand, which contains a PID and/or LPID value if needed, and
      should otherwise contain 0.
      
      This adapts KVM-HV's usage of tlbie and tlbiel to work on POWER9
      as well as older processors.  Since we only handle HPT guests so
      far, we need RIC=0 PRS=0 R=0, which ends up with the same instruction
      word as on previous processors, so we don't need to conditionally
      execute different instructions depending on the processor.
      
      The local flush on first entry to a guest in book3s_hv_rmhandlers.S
      is a loop which depends on the number of TLB sets.  Rather than
      using feature sections to set the number of iterations based on
      which CPU we're on, we now work out this number at VM creation time
      and store it in the kvm_arch struct.  That will make it possible to
      get the number from the device tree in future, which will help with
      compatibility with future processors.
      
      Since mmu_partition_table_set_entry() does a global flush of the
      whole LPID, we don't need to do the TLB flush on first entry to the
      guest on each processor.  Therefore we don't set all bits in the
      tlb_need_flush bitmap on VM startup on POWER9.
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      7c5b06ca
    • P
      KVM: PPC: Book3S HV: Add new POWER9 guest-accessible SPRs · e9cf1e08
      Paul Mackerras 提交于
      This adds code to handle two new guest-accessible special-purpose
      registers on POWER9: TIDR (thread ID register) and PSSCR (processor
      stop status and control register).  They are context-switched
      between host and guest, and the guest values can be read and set
      via the one_reg interface.
      
      The PSSCR contains some fields which are guest-accessible and some
      which are only accessible in hypervisor mode.  We only allow the
      guest-accessible fields to be read or set by userspace.
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      e9cf1e08
  27. 23 11月, 2016 1 次提交
    • C
      powerpc/32: Change the stack protector canary value per task · 902e06eb
      Christophe Leroy 提交于
      Partially copied from commit df0698be ("ARM: stack protector:
      change the canary value per task")
      
      A new random value for the canary is stored in the task struct whenever
      a new task is forked.  This is meant to allow for different canary values
      per task.  On powerpc, GCC expects the canary value to be found in a global
      variable called __stack_chk_guard.  So this variable has to be updated
      with the value stored in the task struct whenever a task switch occurs.
      
      Because the variable GCC expects is global, this cannot work on SMP
      unfortunately.  So, on SMP, the same initial canary value is kept
      throughout, making this feature a bit less effective although it is still
      useful.
      Signed-off-by: NChristophe Leroy <christophe.leroy@c-s.fr>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      902e06eb
  28. 21 11月, 2016 1 次提交
    • P
      KVM: PPC: Book3S HV: Save/restore XER in checkpointed register state · 0d808df0
      Paul Mackerras 提交于
      When switching from/to a guest that has a transaction in progress,
      we need to save/restore the checkpointed register state.  Although
      XER is part of the CPU state that gets checkpointed, the code that
      does this saving and restoring doesn't save/restore XER.
      
      This fixes it by saving and restoring the XER.  To allow userspace
      to read/write the checkpointed XER value, we also add a new ONE_REG
      specifier.
      
      The visible effect of this bug is that the guest may see its XER
      value being corrupted when it uses transactions.
      
      Fixes: e4e38121 ("KVM: PPC: Book3S HV: Add transactional memory support")
      Fixes: 0a8eccef ("KVM: PPC: Book3S HV: Add missing code for transaction reclaim on guest exit")
      Cc: stable@vger.kernel.org # v3.15+
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      Reviewed-by: NThomas Huth <thuth@redhat.com>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      0d808df0
  29. 04 10月, 2016 1 次提交
  30. 27 9月, 2016 1 次提交
    • P
      KVM: PPC: Book3S: Treat VTB as a per-subcore register, not per-thread · 88b02cf9
      Paul Mackerras 提交于
      POWER8 has one virtual timebase (VTB) register per subcore, not one
      per CPU thread.  The HV KVM code currently treats VTB as a per-thread
      register, which can lead to spurious soft lockup messages from guests
      which use the VTB as the time source for the soft lockup detector.
      (CPUs before POWER8 did not have the VTB register.)
      
      For HV KVM, this fixes the problem by making only the primary thread
      in each virtual core save and restore the VTB value.  With this,
      the VTB state becomes part of the kvmppc_vcore structure.  This
      also means that "piggybacking" of multiple virtual cores onto one
      subcore is not possible on POWER8, because then the virtual cores
      would share a single VTB register.
      
      PR KVM emulates a VTB register, which is per-vcpu because PR KVM
      has no notion of CPU threads or SMT.  For PR KVM we move the VTB
      state into the kvmppc_vcpu_book3s struct.
      
      Cc: stable@vger.kernel.org # v3.14+
      Reported-by: NThomas Huth <thuth@redhat.com>
      Tested-by: NThomas Huth <thuth@redhat.com>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      88b02cf9