- 19 4月, 2018 1 次提交
-
-
由 Deepa Dinamani 提交于
All the current architecture specific defines for these are the same. Refactor these common defines to a common header file. The new common linux/compat_time.h is also useful as it will eventually be used to hold all the defines that are needed for compat time types that support non y2038 safe types. New architectures need not have to define these new types as they will only use new y2038 safe syscalls. This file can be deleted after y2038 when we stop supporting non y2038 safe syscalls. The patch also requires an operation similar to: git grep "asm/compat\.h" | cut -d ":" -f 1 | xargs -n 1 sed -i -e "s%asm/compat.h%linux/compat.h%g" Cc: acme@kernel.org Cc: benh@kernel.crashing.org Cc: borntraeger@de.ibm.com Cc: catalin.marinas@arm.com Cc: cmetcalf@mellanox.com Cc: cohuck@redhat.com Cc: davem@davemloft.net Cc: deller@gmx.de Cc: devel@driverdev.osuosl.org Cc: gerald.schaefer@de.ibm.com Cc: gregkh@linuxfoundation.org Cc: heiko.carstens@de.ibm.com Cc: hoeppner@linux.vnet.ibm.com Cc: hpa@zytor.com Cc: jejb@parisc-linux.org Cc: jwi@linux.vnet.ibm.com Cc: linux-kernel@vger.kernel.org Cc: linux-mips@linux-mips.org Cc: linux-parisc@vger.kernel.org Cc: linuxppc-dev@lists.ozlabs.org Cc: linux-s390@vger.kernel.org Cc: mark.rutland@arm.com Cc: mingo@redhat.com Cc: mpe@ellerman.id.au Cc: oberpar@linux.vnet.ibm.com Cc: oprofile-list@lists.sf.net Cc: paulus@samba.org Cc: peterz@infradead.org Cc: ralf@linux-mips.org Cc: rostedt@goodmis.org Cc: rric@kernel.org Cc: schwidefsky@de.ibm.com Cc: sebott@linux.vnet.ibm.com Cc: sparclinux@vger.kernel.org Cc: sth@linux.vnet.ibm.com Cc: ubraun@linux.vnet.ibm.com Cc: will.deacon@arm.com Cc: x86@kernel.org Signed-off-by: NArnd Bergmann <arnd@arndb.de> Signed-off-by: NDeepa Dinamani <deepa.kernel@gmail.com> Acked-by: NSteven Rostedt (VMware) <rostedt@goodmis.org> Acked-by: NCatalin Marinas <catalin.marinas@arm.com> Acked-by: NJames Hogan <jhogan@kernel.org> Acked-by: NHelge Deller <deller@gmx.de> Signed-off-by: NArnd Bergmann <arnd@arndb.de>
-
- 30 3月, 2018 1 次提交
-
-
由 Nicholas Piggin 提交于
The "lppaca" is a structure registered with the hypervisor. This is unnecessary when running on non-virtualised platforms. One field from the lppaca (pmcregs_in_use) is also used by the host, so move the host part out into the paca (lppaca field is still updated in guest mode). Signed-off-by: NNicholas Piggin <npiggin@gmail.com> [mpe: Fix non-pseries build with some #ifdefs] Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
-
- 23 3月, 2018 2 次提交
-
-
由 Paul Mackerras 提交于
POWER9 has hardware bugs relating to transactional memory and thread reconfiguration (changes to hardware SMT mode). Specifically, the core does not have enough storage to store a complete checkpoint of all the architected state for all four threads. The DD2.2 version of POWER9 includes hardware modifications designed to allow hypervisor software to implement workarounds for these problems. This patch implements those workarounds in KVM code so that KVM guests see a full, working transactional memory implementation. The problems center around the use of TM suspended state, where the CPU has a checkpointed state but execution is not transactional. The workaround is to implement a "fake suspend" state, which looks to the guest like suspended state but the CPU does not store a checkpoint. In this state, any instruction that would cause a transition to transactional state (rfid, rfebb, mtmsrd, tresume) or would use the checkpointed state (treclaim) causes a "soft patch" interrupt (vector 0x1500) to the hypervisor so that it can be emulated. The trechkpt instruction also causes a soft patch interrupt. On POWER9 DD2.2, we avoid returning to the guest in any state which would require a checkpoint to be present. The trechkpt in the guest entry path which would normally create that checkpoint is replaced by either a transition to fake suspend state, if the guest is in suspend state, or a rollback to the pre-transactional state if the guest is in transactional state. Fake suspend state is indicated by a flag in the PACA plus a new bit in the PSSCR. The new PSSCR bit is write-only and reads back as 0. On exit from the guest, if the guest is in fake suspend state, we still do the treclaim instruction as we would in real suspend state, in order to get into non-transactional state, but we do not save the resulting register state since there was no checkpoint. Emulation of the instructions that cause a softpatch interrupt is handled in two paths. If the guest is in real suspend mode, we call kvmhv_p9_tm_emulation_early() to handle the cases where the guest is transitioning to transactional state. This is called before we do the treclaim in the guest exit path; because we haven't done treclaim, we can get back to the guest with the transaction still active. If the instruction is a case that kvmhv_p9_tm_emulation_early() doesn't handle, or if the guest is in fake suspend state, then we proceed to do the complete guest exit path and subsequently call kvmhv_p9_tm_emulation() in host context with the MMU on. This handles all the cases including the cases that generate program interrupts (illegal instruction or TM Bad Thing) and facility unavailable interrupts. The emulation is reasonably straightforward and is mostly concerned with checking for exception conditions and updating the state of registers such as MSR and CR0. The treclaim emulation takes care to ensure that the TEXASR register gets updated as if it were the guest treclaim instruction that had done failure recording, not the treclaim done in hypervisor state in the guest exit path. With this, the KVM_CAP_PPC_HTM capability returns true (1) even if transactional memory is not available to host userspace. Signed-off-by: NPaul Mackerras <paulus@ozlabs.org> Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
-
由 Paul Mackerras 提交于
POWER9 processors up to and including "Nimbus" v2.2 have hardware bugs relating to transactional memory and thread reconfiguration. One of these bugs has a workaround which is to get the core into SMT4 state temporarily. This workaround is only needed when running bare-metal. This patch provides a function which gets the core into SMT4 mode by preventing threads from going to a stop state, and waking up those which are already in a stop state. Once at least 3 threads are not in a stop state, the core will be in SMT4 and we can continue. To do this, we add a "dont_stop" flag to the paca to tell the thread not to go into a stop state. If this flag is set, power9_idle_stop() just returns immediately with a return value of 0. The pnv_power9_force_smt4_catch() function does the following: 1. Set the dont_stop flag for each thread in the core, except ourselves (in fact we use an atomic_inc() in case more than one thread is calling this function concurrently). 2. See how many threads are awake, indicated by their requested_psscr field in the paca being 0. If this is at least 3, skip to step 5. 3. Send a doorbell interrupt to each thread that was seen as being in a stop state in step 2. 4. Until at least 3 threads are awake, scan the threads to which we sent a doorbell interrupt and check if they are awake now. This relies on the following properties: - Once dont_stop is non-zero, requested_psccr can't go from zero to non-zero, except transiently (and without the thread doing stop). - requested_psscr being zero guarantees that the thread isn't in a state-losing stop state where thread reconfiguration could occur. - Doing stop with a PSSCR value of 0 won't be a state-losing stop and thus won't allow thread reconfiguration. - Once threads_per_core/2 + 1 (i.e. 3) threads are awake, the core must be in SMT4 mode, since SMT modes are powers of 2. This does add a sync to power9_idle_stop(), which is necessary to provide the correct ordering between setting requested_psscr and checking dont_stop. The overhead of the sync should be unnoticeable compared to the latency of going into and out of a stop state. Because some objected to incurring this extra latency on systems where the XER[SO] bug is not relevant, I have put the test in power9_idle_stop inside a feature section. This means that pnv_power9_force_smt4_catch() WILL NOT WORK correctly on systems without the CPU_FTR_P9_TM_XER_SO_BUG feature bit set, and will probably hang the system. In order to cater for uses where the caller has an operation that has to be done while the core is in SMT4, the core continues to be kept in SMT4 after pnv_power9_force_smt4_catch() function returns, until the pnv_power9_force_smt4_release() function is called. It undoes the effect of step 1 above and allows the other threads to go into a stop state. Signed-off-by: NPaul Mackerras <paulus@ozlabs.org> Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
-
- 23 1月, 2018 1 次提交
-
-
由 Nicholas Piggin 提交于
The fallback RFI flush is used when firmware does not provide a way to flush the cache. It's a "displacement flush" that evicts useful data by displacing it with an uninteresting buffer. The flush has to take care to work with implementation specific cache replacment policies, so the recipe has been in flux. The initial slow but conservative approach is to touch all lines of a congruence class, with dependencies between each load. It has since been determined that a linear pattern of loads without dependencies is sufficient, and is significantly faster. Measuring the speed of a null syscall with RFI fallback flush enabled gives the relative improvement: P8 - 1.83x P9 - 1.75x The flush also becomes simpler and more adaptable to different cache geometries. Signed-off-by: NNicholas Piggin <npiggin@gmail.com> Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
-
- 19 1月, 2018 3 次提交
-
-
由 Madhavan Srinivasan 提交于
Rename the paca->soft_enabled to paca->irq_soft_mask as it is no longer used as a flag for interrupt state, but a mask. Signed-off-by: NMadhavan Srinivasan <maddy@linux.vnet.ibm.com> Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
-
由 Benjamin Herrenschmidt 提交于
This works on top of the single escalation support. When in single escalation, with this change, we will keep the escalation interrupt disabled unless the VCPU is in H_CEDE (idle). In any other case, we know the VCPU will be rescheduled and thus there is no need to take escalation interrupts in the host whenever a guest interrupt fires. Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
-
由 Benjamin Herrenschmidt 提交于
The prodded flag is only cleared at the beginning of H_CEDE, so every time we have an escalation, we will cause the *next* H_CEDE to return immediately. Instead use a dedicated "irq_pending" flag to indicate that a guest interrupt is pending for the VCPU. We don't reuse the existing exception bitmap so as to avoid expensive atomic ops. Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
-
- 10 1月, 2018 1 次提交
-
-
由 Michael Ellerman 提交于
On some CPUs we can prevent the Meltdown vulnerability by flushing the L1-D cache on exit from kernel to user mode, and from hypervisor to guest. This is known to be the case on at least Power7, Power8 and Power9. At this time we do not know the status of the vulnerability on other CPUs such as the 970 (Apple G5), pasemi CPUs (AmigaOne X1000) or Freescale CPUs. As more information comes to light we can enable this, or other mechanisms on those CPUs. The vulnerability occurs when the load of an architecturally inaccessible memory region (eg. userspace load of kernel memory) is speculatively executed to the point where its result can influence the address of a subsequent speculatively executed load. In order for that to happen, the first load must hit in the L1, because before the load is sent to the L2 the permission check is performed. Therefore if no kernel addresses hit in the L1 the vulnerability can not occur. We can ensure that is the case by flushing the L1 whenever we return to userspace. Similarly for hypervisor vs guest. In order to flush the L1-D cache on exit, we add a section of nops at each (h)rfi location that returns to a lower privileged context, and patch that with some sequence. Newer firmwares are able to advertise to us that there is a special nop instruction that flushes the L1-D. If we do not see that advertised, we fall back to doing a displacement flush in software. For guest kernels we support migration between some CPU versions, and different CPUs may use different flush instructions. So that we are prepared to migrate to a machine with a different flush instruction activated, we may have to patch more than one flush instruction at boot if the hypervisor tells us to. In the end this patch is mostly the work of Nicholas Piggin and Michael Ellerman. However a cast of thousands contributed to analysis of the issue, earlier versions of the patch, back ports testing etc. Many thanks to all of them. Tested-by: NJon Masters <jcm@redhat.com> Signed-off-by: NNicholas Piggin <npiggin@gmail.com> Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
-
- 04 12月, 2017 1 次提交
-
-
由 Santosh Sivaraj 提交于
Current vDSO64 implementation does not have support for coarse clocks (CLOCK_MONOTONIC_COARSE, CLOCK_REALTIME_COARSE), for which it falls back to system call, increasing the response time, vDSO implementation reduces the cycle time. Below is a benchmark of the difference in execution times. (Non-coarse clocks are also included just for completion) clock-gettime-realtime: syscall: 172 nsec/call clock-gettime-realtime: libc: 28 nsec/call clock-gettime-realtime: vdso: 22 nsec/call clock-gettime-monotonic: syscall: 171 nsec/call clock-gettime-monotonic: libc: 30 nsec/call clock-gettime-monotonic: vdso: 25 nsec/call clock-gettime-realtime-coarse: syscall: 153 nsec/call clock-gettime-realtime-coarse: libc: 16 nsec/call clock-gettime-realtime-coarse: vdso: 10 nsec/call clock-gettime-monotonic-coarse: syscall: 167 nsec/call clock-gettime-monotonic-coarse: libc: 17 nsec/call clock-gettime-monotonic-coarse: vdso: 11 nsec/call CC: Benjamin Herrenschmidt <benh@kernel.crashing.org> Reviewed-by: NNaveen N. Rao <naveen.n.rao@linux.vnet.ibm.com> Signed-off-by: NSantosh Sivaraj <santosh@fossix.org> Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
-
- 13 11月, 2017 1 次提交
-
-
由 Nicholas Piggin 提交于
Radix keeps no meaningful state in addr_limit, so remove it from radix code and rename to slb_addr_limit to make it clear it applies to hash only. Signed-off-by: NNicholas Piggin <npiggin@gmail.com> Reviewed-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
-
- 06 11月, 2017 1 次提交
-
-
由 Michael Ellerman 提交于
CONFIG_PPC_STD_MMU_64 indicates support for the "standard" powerpc MMU on 64-bit CPUs. The "standard" MMU refers to the hash page table MMU found in "server" processors, from IBM mainly. Currently CONFIG_PPC_STD_MMU_64 is == CONFIG_PPC_BOOK3S_64. While it's annoying to have two symbols that always have the same value, it's not quite annoying enough to bother removing one. However with the arrival of Power9, we now have the situation where CONFIG_PPC_STD_MMU_64 is enabled, but the kernel is running using the Radix MMU - *not* the "standard" MMU. So it is now actively confusing to use it, because it implies that code is disabled or inactive when the Radix MMU is in use, however that is not necessarily true. So s/CONFIG_PPC_STD_MMU_64/CONFIG_PPC_BOOK3S_64/, and do some minor formatting updates of some of the affected lines. This will be a pain for backports, but c'est la vie. Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
-
- 01 11月, 2017 1 次提交
-
-
由 Paul Mackerras 提交于
This patch removes the restriction that a radix host can only run radix guests, allowing us to run HPT (hashed page table) guests as well. This is useful because it provides a way to run old guest kernels that know about POWER8 but not POWER9. Unfortunately, POWER9 currently has a restriction that all threads in a given code must either all be in HPT mode, or all in radix mode. This means that when entering a HPT guest, we have to obtain control of all 4 threads in the core and get them to switch their LPIDR and LPCR registers, even if they are not going to run a guest. On guest exit we also have to get all threads to switch LPIDR and LPCR back to host values. To make this feasible, we require that KVM not be in the "independent threads" mode, and that the CPU cores be in single-threaded mode from the host kernel's perspective (only thread 0 online; threads 1, 2 and 3 offline). That allows us to use the same code as on POWER8 for obtaining control of the secondary threads. To manage the LPCR/LPIDR changes required, we extend the kvm_split_info struct to contain the information needed by the secondary threads. All threads perform a barrier synchronization (where all threads wait for every other thread to reach the synchronization point) on guest entry, both before and after loading LPCR and LPIDR. On guest exit, they all once again perform a barrier synchronization both before and after loading host values into LPCR and LPIDR. Finally, it is also currently necessary to flush the entire TLB every time we enter a HPT guest on a radix host. We do this on thread 0 with a loop of tlbiel instructions. Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
-
- 01 8月, 2017 1 次提交
-
-
由 Gautham R. Shenoy 提交于
The stop4 idle state on POWER9 is a deep idle state which loses hypervisor resources, but whose latency is low enough that it can be exposed via cpuidle. Until now, the deep idle states which lose hypervisor resources (eg: winkle) were only exposed via CPU-Hotplug. Hence currently on wakeup from such states, barring a few SPRs which need to be restored to their older value, rest of the SPRS are reinitialized to their values corresponding to that at boot time. When stop4 is used in the context of cpuidle, we want these additional SPRs to be restored to their older value, to ensure that the context on the CPU coming back from idle is same as it was before going idle. In this patch, we define a SPR save area in PACA (since we have used up the volatile register space in the stack) and on POWER9, we restore SPRN_PID, SPRN_LDBAR, SPRN_FSCR, SPRN_HFSCR, SPRN_MMCRA, SPRN_MMCR1, SPRN_MMCR2 to the values they had before entering stop. Signed-off-by: NGautham R. Shenoy <ego@linux.vnet.ibm.com> Reviewed-by: NNicholas Piggin <npiggin@gmail.com> Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
-
- 27 6月, 2017 1 次提交
-
-
由 Michael Neuling 提交于
The asm code assumes the FP regs are at the start of fp_state. While this is true now, it may not always be the case and there is nothing enforcing it. This fixes the asm-offsets to point to the actual FP registers inside the fp_state. Similarly for VMX. Signed-off-by: NMichael Neuling <mikey@neuling.org> Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
-
- 21 6月, 2017 1 次提交
-
-
由 Aravinda Prasad 提交于
This introduces a new KVM capability to control how KVM behaves on machine check exception (MCE) in HV KVM guests. If this capability has not been enabled, KVM redirects machine check exceptions to guest's 0x200 vector, if the address in error belongs to the guest. With this capability enabled, KVM will cause a guest exit with the exit reason indicating an NMI. The new capability is required to avoid problems if a new kernel/KVM is used with an old QEMU, running a guest that doesn't issue "ibm,nmi-register". As old QEMU does not understand the NMI exit type, it treats it as a fatal error. However, the guest could have handled the machine check error if the exception was delivered to guest's 0x200 interrupt vector instead of NMI exit in case of old QEMU. [paulus@ozlabs.org - Reworded the commit message to be clearer, enable only on HV KVM.] Signed-off-by: NAravinda Prasad <aravinda@linux.vnet.ibm.com> Reviewed-by: NDavid Gibson <david@gibson.dropbear.id.au> Signed-off-by: NMahesh Salgaonkar <mahesh@linux.vnet.ibm.com> Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
-
- 19 6月, 2017 3 次提交
-
-
由 Nicholas Piggin 提交于
msgsnd doorbell exceptions are cleared when the doorbell interrupt is taken. However if a doorbell exception causes a system reset interrupt wake from power saving state, the message is not cleared. Processing the doorbell from the system reset interrupt requires msgclr to avoid taking the exception again. Testing this plus the previous wakup direct patch gives: original wakeup direct msgclr Different threads, same core: 315k/s 264k/s 345k/s Different cores: 235k/s 242k/s 242k/s Net speedup is +10% for same core, and +3% for different core. Reviewed-by: NGautham R. Shenoy <ego@linux.vnet.ibm.com> Signed-off-by: NNicholas Piggin <npiggin@gmail.com> Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
-
由 Paul Mackerras 提交于
On POWER9, we no longer have the restriction that we had on POWER8 where all threads in a core have to be in the same partition, so the CPU threads are now independent. However, we still want to be able to run guests with a virtual SMT topology, if only to allow migration of guests from POWER8 systems to POWER9. A guest that has a virtual SMT mode greater than 1 will expect to be able to use the doorbell facility; it will expect the msgsndp and msgclrp instructions to work appropriately and to be able to read sensible values from the TIR (thread identification register) and DPDES (directed privileged doorbell exception status) special-purpose registers. However, since each CPU thread is a separate sub-processor in POWER9, these instructions and registers can only be used within a single CPU thread. In order for these instructions to appear to act correctly according to the guest's virtual SMT mode, we have to trap and emulate them. We cause them to trap by clearing the HFSCR_MSGP bit in the HFSCR register. The emulation is triggered by the hypervisor facility unavailable interrupt that occurs when the guest uses them. To cause a doorbell interrupt to occur within the guest, we set the DPDES register to 1. If the guest has interrupts enabled, the CPU will generate a doorbell interrupt and clear the DPDES register in hardware. The DPDES hardware register for the guest is saved in the vcpu->arch.vcore->dpdes field. Since this gets written by the guest exit code, other VCPUs wishing to cause a doorbell interrupt don't write that field directly, but instead set a vcpu->arch.doorbell_request flag. This is consumed and set to 0 by the guest entry code, which then sets DPDES to 1. Emulating reads of the DPDES register is somewhat involved, because it requires reading the doorbell pending interrupt status of all of the VCPU threads in the virtual core, and if any of those VCPUs are running, their doorbell status is only up-to-date in the hardware DPDES registers of the CPUs where they are running. In order to get a reasonable approximation of the current doorbell status, we send those CPUs an IPI, causing an exit from the guest which will update the vcpu->arch.vcore->dpdes field. We then use that value in constructing the emulated DPDES register value. Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
-
由 Paul Mackerras 提交于
This adds code to allow us to use a different value for the HFSCR (Hypervisor Facilities Status and Control Register) when running the guest from that which applies in the host. The reason for doing this is to allow us to trap the msgsndp instruction and related operations in future so that they can be virtualized. We also save the value of HFSCR when a hypervisor facility unavailable interrupt occurs, because the high byte of HFSCR indicates which facility the guest attempted to access. We save and restore the host value on guest entry/exit because some bits of it affect host userspace execution. We only do all this on POWER9, not on POWER8, because we are not intending to virtualize any of the facilities controlled by HFSCR on POWER8. In particular, the HFSCR bit that controls execution of msgsndp and related operations does not exist on POWER8. The HFSCR doesn't exist at all on POWER7. Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
-
- 30 5月, 2017 1 次提交
-
-
由 Gautham R. Shenoy 提交于
On Power9 DD1 due to a hardware bug the Power-Saving Level Status field (PLS) of the PSSCR for a thread waking up from a deep state can under-report if some other thread in the core is in a shallow stop state. The scenario in which this can manifest is as follows: 1) All the threads of the core are in deep stop. 2) One of the threads is woken up. The PLS for this thread will correctly reflect that it is waking up from deep stop. 3) The thread that has woken up now executes a shallow stop. 4) When some other thread in the core is woken, its PLS will reflect the shallow stop state. Thus, the subsequent thread for which the PLS is under-reporting the wakeup state will not restore the hypervisor resources. Hence, on DD1 systems, use the Requested Level (RL) field as a workaround to restore the contents of the hypervisor resources on the wakeup from the stop state. Signed-off-by: NGautham R. Shenoy <ego@linux.vnet.ibm.com> Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
-
- 28 4月, 2017 3 次提交
-
-
由 Nicholas Piggin 提交于
The system reset interrupt is used for crash/debug situations, so it is desirable to have as little impact on the normal state of the system as possible. Currently it uses the current kernel stack to process the exception. This stores into the stack which may be involved with the crash. The stack pointer may be corrupted, or it may have overflowed. Avoid or minimise these problems by creating a dedicated NMI stack for the system reset interrupt to use. Signed-off-by: NNicholas Piggin <npiggin@gmail.com> Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
-
由 Nicholas Piggin 提交于
In preparation for using a dedicated stack for system reset interrupts, prevent a nested system reset from recovering, in order to simplify code that is called in crash/debug path. This allows a system reset interrupt to just use the base stack pointer. Keep an in_nmi nesting counter similarly to the in_mce counter. Consider the interrrupt non-recoverable if it is taken inside another system reset. Interrupt nesting could be allowed similarly to MCE, but system reset is a special case that's not for normal operation, so simplicity wins until there is requirement for nested system reset interrupts. Signed-off-by: NNicholas Piggin <npiggin@gmail.com> Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
-
由 Nicholas Piggin 提交于
The system reset interrupt can occur when MSR_EE=0, and it currently uses the PACA_EXGEN save area. Some PACA_EXGEN interrupts have a window where MSR_RI=1 and MSR_EE=0 when the save area is still in use. A system reset interrupt in this window can lead to undetected corruption when the save area gets overwritten. This patch introduces PACA_EXNMI save area for system reset exceptions, which closes this corruption window. It's also helpful to retain the EXGEN state for debugging situations, even if not considering the recoverability aspect. This patch also moves the PACA_EXMC area down to a less frequently used part of the paca with the new save area. Signed-off-by: NNicholas Piggin <npiggin@gmail.com> Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
-
- 27 4月, 2017 1 次提交
-
-
由 Benjamin Herrenschmidt 提交于
This patch makes KVM capable of using the XIVE interrupt controller to provide the standard PAPR "XICS" style hypercalls. It is necessary for proper operations when the host uses XIVE natively. This has been lightly tested on an actual system, including PCI pass-through with a TG3 device. Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org> [mpe: Cleanup pr_xxx(), unsplit pr_xxx() strings, etc., fix build failures by adding KVM_XIVE which depends on KVM_XICS and XIVE, and adding empty stubs for the kvm_xive_xxx() routines, fixup subject, integrate fixes from Paul for building PR=y HV=n] Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
-
- 12 4月, 2017 1 次提交
-
-
由 Michael Ellerman 提交于
Recently in commit f6eedbba ("powerpc/mm/hash: Increase VA range to 128TB"), we increased H_PGD_INDEX_SIZE to 15 when we're building with 64K pages. This makes it larger than RADIX_PGD_INDEX_SIZE (13), which means the logic to calculate MAX_PGD_INDEX_SIZE in book3s/64/pgtable.h is wrong. The end result is that the PGD (Page Global Directory, ie top level page table) of the kernel (aka. swapper_pg_dir), is too small. This generally doesn't lead to a crash, as we don't use the full range in normal operation. However if we try to dump the kernel pagetables we can trigger a crash because we walk off the end of the pgd into other memory and eventually try to dereference something bogus: $ cat /sys/kernel/debug/kernel_pagetables Unable to handle kernel paging request for data at address 0xe8fece0000000000 Faulting instruction address: 0xc000000000072314 cpu 0xc: Vector: 380 (Data SLB Access) at [c0000000daa13890] pc: c000000000072314: ptdump_show+0x164/0x430 lr: c000000000072550: ptdump_show+0x3a0/0x430 dar: e802cf0000000000 seq_read+0xf8/0x560 full_proxy_read+0x84/0xc0 __vfs_read+0x6c/0x1d0 vfs_read+0xbc/0x1b0 SyS_read+0x6c/0x110 system_call+0x38/0xfc The root cause is that MAX_PGD_INDEX_SIZE isn't actually computed to be the max of H_PGD_INDEX_SIZE or RADIX_PGD_INDEX_SIZE. To fix that move the calculation into asm-offsets.c where we can do it easily using max(). Fixes: f6eedbba ("powerpc/mm/hash: Increase VA range to 128TB") Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
-
- 11 4月, 2017 1 次提交
-
-
由 Gautham R. Shenoy 提交于
POWER9 DD1.0 hardware has a bug where the SPRs of a thread waking up from stop 0,1,2 with ESL=1 can endup being misplaced in the core. Thus the HSPRG0 of a thread waking up from can contain the paca pointer of its sibling. This patch implements a context recovery framework within threads of a core, by provisioning space in paca_struct for saving every sibling threads's paca pointers. Basically, we should be able to arrive at the right paca pointer from any of the thread's existing paca pointer. At bootup, during powernv idle-init, we save the paca address of every CPU in each one its siblings paca_struct in the slot corresponding to this CPU's index in the core. On wakeup from a stop, the thread will determine its index in the core from the TIR register and recover its PACA pointer by indexing into the correct slot in the provisioned space in the current PACA. Furthermore, ensure that the NVGPRs are restored from the stack on the way out by setting the NAPSTATELOST in paca. [Changelog written with inputs from svaidy@linux.vnet.ibm.com] Signed-off-by: NGautham R. Shenoy <ego@linux.vnet.ibm.com> Reviewed-by: NNicholas Piggin <npiggin@gmail.com> [mpe: Call it a bug] Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
-
- 01 4月, 2017 1 次提交
-
-
由 Aneesh Kumar K.V 提交于
We optmize the slice page size array copy to paca by copying only the range based on addr_limit. This will require us to not look at page size array beyond addr_limit in PACA on slb fault. To enable that copy task size to paca which will be used during slb fault. Signed-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> [mpe: Rename from task_size to addr_limit, consolidate #ifdefs] Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
-
- 15 2月, 2017 2 次提交
-
-
由 Rashmica Gupta 提交于
There are quite a few entries in asm-offests.c which look like: DEFINE(REG, STACK_FRAME_OVERHEAD+offsetof(struct pt_regs, reg)); So define a macro to do it once. Signed-off-by: NRashmica Gupta <rashmicy@gmail.com> [mpe: Rename to STACK_PT_REGS_OFFSET for excruciating explicitness] Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
-
由 Rashmica Gupta 提交于
A lot of entries in asm-offests.c look like this: DEFINE(TI_FLAGS, offsetof(struct thread_info, flags)); But there is a common macro, OFFSET, which makes this cleaner: OFFSET(TI_flags, thread_info, flags) So use it. Signed-off-by: NRashmica Gupta <rashmicy@gmail.com> Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
-
- 06 2月, 2017 2 次提交
-
-
由 Benjamin Herrenschmidt 提交于
We have two set of identical struct members for the I and D sides and mostly identical bunches of code to parse the device-tree to populate them. Instead make a ppc_cache_info structure with one copy for I and one for D Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
-
由 Benjamin Herrenschmidt 提交于
In a number of places we called "cache line size" what is actually the cache block size, which in the powerpc architecture, means the effective size to use with cache management instructions (it can be different from the actual cache line size). We fix the naming across the board and properly retrieve both pieces of information when available in the device-tree. Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
-
- 31 1月, 2017 1 次提交
-
-
由 Paul Mackerras 提交于
This adds code to branch around the parts that radix guests don't need - clearing and loading the SLB with the guest SLB contents, saving the guest SLB contents on exit, and restoring the host SLB contents. Since the host is now using radix, we need to save and restore the host value for the PID register. On hypervisor data/instruction storage interrupts, we don't do the guest HPT lookup on radix, but just save the guest physical address for the fault (from the ASDR register) in the vcpu struct. Signed-off-by: NPaul Mackerras <paulus@ozlabs.org> Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
-
- 24 1月, 2017 1 次提交
-
-
由 Michael Ellerman 提交于
Unfortunately the stack protector support we merged recently only works on some toolchains. If the toolchain is built without glibc support everything works fine, but if glibc is built then it leads to a panic at boot. The solution is not rc5 material, so revert the support for now. This reverts commits: 6533b7c1 ("powerpc: Initial stack protector (-fstack-protector) support") 902e06eb ("powerpc/32: Change the stack protector canary value per task") Fixes: 6533b7c1 ("powerpc: Initial stack protector (-fstack-protector) support") Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
-
- 14 1月, 2017 1 次提交
-
-
由 Frederic Weisbecker 提交于
In order to prepare for CONFIG_VIRT_CPU_ACCOUNTING_NATIVE=y to delay cputime accounting to the tick, provide finegrained accumulators to powerpc in order to store the cputime until flushing. While at it, normalize the name of several fields according to common cputime naming. Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com> Acked-by: NThomas Gleixner <tglx@linutronix.de> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Paul Mackerras <paulus@samba.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Stanislaw Gruszka <sgruszka@redhat.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Wanpeng Li <wanpeng.li@hotmail.com> Link: http://lkml.kernel.org/r/1483636310-6557-6-git-send-email-fweisbec@gmail.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
- 24 11月, 2016 2 次提交
-
-
由 Paul Mackerras 提交于
POWER9 adds new capabilities to the tlbie (TLB invalidate entry) and tlbiel (local tlbie) instructions. Both instructions get a set of new parameters (RIC, PRS and R) which appear as bits in the instruction word. The tlbiel instruction now has a second register operand, which contains a PID and/or LPID value if needed, and should otherwise contain 0. This adapts KVM-HV's usage of tlbie and tlbiel to work on POWER9 as well as older processors. Since we only handle HPT guests so far, we need RIC=0 PRS=0 R=0, which ends up with the same instruction word as on previous processors, so we don't need to conditionally execute different instructions depending on the processor. The local flush on first entry to a guest in book3s_hv_rmhandlers.S is a loop which depends on the number of TLB sets. Rather than using feature sections to set the number of iterations based on which CPU we're on, we now work out this number at VM creation time and store it in the kvm_arch struct. That will make it possible to get the number from the device tree in future, which will help with compatibility with future processors. Since mmu_partition_table_set_entry() does a global flush of the whole LPID, we don't need to do the TLB flush on first entry to the guest on each processor. Therefore we don't set all bits in the tlb_need_flush bitmap on VM startup on POWER9. Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
-
由 Paul Mackerras 提交于
This adds code to handle two new guest-accessible special-purpose registers on POWER9: TIDR (thread ID register) and PSSCR (processor stop status and control register). They are context-switched between host and guest, and the guest values can be read and set via the one_reg interface. The PSSCR contains some fields which are guest-accessible and some which are only accessible in hypervisor mode. We only allow the guest-accessible fields to be read or set by userspace. Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
-
- 23 11月, 2016 1 次提交
-
-
由 Christophe Leroy 提交于
Partially copied from commit df0698be ("ARM: stack protector: change the canary value per task") A new random value for the canary is stored in the task struct whenever a new task is forked. This is meant to allow for different canary values per task. On powerpc, GCC expects the canary value to be found in a global variable called __stack_chk_guard. So this variable has to be updated with the value stored in the task struct whenever a task switch occurs. Because the variable GCC expects is global, this cannot work on SMP unfortunately. So, on SMP, the same initial canary value is kept throughout, making this feature a bit less effective although it is still useful. Signed-off-by: NChristophe Leroy <christophe.leroy@c-s.fr> Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
-
- 21 11月, 2016 1 次提交
-
-
由 Paul Mackerras 提交于
When switching from/to a guest that has a transaction in progress, we need to save/restore the checkpointed register state. Although XER is part of the CPU state that gets checkpointed, the code that does this saving and restoring doesn't save/restore XER. This fixes it by saving and restoring the XER. To allow userspace to read/write the checkpointed XER value, we also add a new ONE_REG specifier. The visible effect of this bug is that the guest may see its XER value being corrupted when it uses transactions. Fixes: e4e38121 ("KVM: PPC: Book3S HV: Add transactional memory support") Fixes: 0a8eccef ("KVM: PPC: Book3S HV: Add missing code for transaction reclaim on guest exit") Cc: stable@vger.kernel.org # v3.15+ Signed-off-by: NPaul Mackerras <paulus@ozlabs.org> Reviewed-by: NThomas Huth <thuth@redhat.com> Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
-
- 04 10月, 2016 1 次提交
-
-
由 Cyril Bur 提交于
Make the structures being used for checkpointed state named consistently with the pt_regs/ckpt_regs. Signed-off-by: NCyril Bur <cyrilbur@gmail.com> Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
-
- 27 9月, 2016 1 次提交
-
-
由 Paul Mackerras 提交于
POWER8 has one virtual timebase (VTB) register per subcore, not one per CPU thread. The HV KVM code currently treats VTB as a per-thread register, which can lead to spurious soft lockup messages from guests which use the VTB as the time source for the soft lockup detector. (CPUs before POWER8 did not have the VTB register.) For HV KVM, this fixes the problem by making only the primary thread in each virtual core save and restore the VTB value. With this, the VTB state becomes part of the kvmppc_vcore structure. This also means that "piggybacking" of multiple virtual cores onto one subcore is not possible on POWER8, because then the virtual cores would share a single VTB register. PR KVM emulates a VTB register, which is per-vcpu because PR KVM has no notion of CPU threads or SMT. For PR KVM we move the VTB state into the kvmppc_vcpu_book3s struct. Cc: stable@vger.kernel.org # v3.14+ Reported-by: NThomas Huth <thuth@redhat.com> Tested-by: NThomas Huth <thuth@redhat.com> Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
-