- 02 Feb 2015, 1 commit
-
-
Committed by Ryan Grimm

When unbinding and rebinding the driver on a system with a card in PHB0, this error condition is reached after a few attempts:

    ERROR: Bad of_node_put() on /pciex@3fffe40000000
    CPU: 0 PID: 3040 Comm: bash Not tainted 3.18.0-rc3-12545-g3627ffe #152
    Call Trace:
    [c000000721acb5c0] [c00000000086ef94] .dump_stack+0x84/0xb0 (unreliable)
    [c000000721acb640] [c00000000073a0a8] .of_node_release+0xd8/0xe0
    [c000000721acb6d0] [c00000000044bc44] .kobject_release+0x74/0xe0
    [c000000721acb760] [c0000000007394fc] .of_node_put+0x1c/0x30
    [c000000721acb7d0] [c000000000545cd8] .cxl_probe+0x1a98/0x1d50
    [c000000721acb900] [c0000000004845a0] .local_pci_probe+0x40/0xc0
    [c000000721acb980] [c000000000484998] .pci_device_probe+0x128/0x170
    [c000000721acba30] [c00000000052400c] .driver_probe_device+0xac/0x2a0
    [c000000721acbad0] [c000000000522468] .bind_store+0x108/0x160
    [c000000721acbb70] [c000000000521448] .drv_attr_store+0x38/0x60
    [c000000721acbbe0] [c000000000293840] .sysfs_kf_write+0x60/0xa0
    [c000000721acbc50] [c000000000292500] .kernfs_fop_write+0x140/0x1d0
    [c000000721acbcf0] [c000000000208648] .vfs_write+0xd8/0x260
    [c000000721acbd90] [c000000000208b18] .SyS_write+0x58/0x100
    [c000000721acbe30] [c000000000009258] syscall_exit+0x0/0x98

We are missing a call to of_node_get(): pnv_pci_to_phb_node() should call of_node_get(), otherwise np's reference count isn't incremented and it might go away. Rename pnv_pci_to_phb_node() to pnv_pci_get_phb_node() so it's clear it calls of_node_get().

Signed-off-by: Ryan Grimm <grimm@linux.vnet.ibm.com>
Acked-by: Ian Munsie <imunsie@au1.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
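For readers unfamiliar with the device-tree refcounting convention, here is a minimal sketch of the get/put pairing the renamed helper makes explicit. The probe-side code below is illustrative only, not the actual cxl/powernv source:

    #include <linux/pci.h>
    #include <linux/of.h>

    /* A helper that hands out a node pointer should take its own reference
     * (of_node_get) so the caller's eventual of_node_put() is balanced. */
    struct device_node *pnv_pci_get_phb_node(struct pci_dev *dev)
    {
            struct pci_controller *hose = pci_bus_to_host(dev->bus);

            return of_node_get(hose->dn);
    }

    /* Caller side (sketch): use the node, then drop the reference. */
    static void example_use(struct pci_dev *dev)
    {
            struct device_node *np = pnv_pci_get_phb_node(dev);

            if (!np)
                    return;
            /* ... use np ... */
            of_node_put(np);        /* no longer underflows the refcount */
    }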
-
- 30 Jan 2015, 4 commits
-
-
Committed by LEROY Christophe

When pages are not 4K, the PGDIR table is allocated with kmalloc(). In order to optimise the TLB handlers, aligned memory is needed. kmalloc() doesn't provide aligned memory blocks, so let's use a kmem_cache pool instead.

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Scott Wood <scottwood@freescale.com>
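A minimal sketch of the kmem_cache approach is below; the cache name and the PGD_TABLE_SIZE symbol are placeholders for whatever the 8xx code actually uses:

    #include <linux/slab.h>

    static struct kmem_cache *pgd_cache;

    void __init pgd_cache_init(void)
    {
            /* Unlike kmalloc(), kmem_cache_create() lets us demand a specific
             * alignment -- here the table is aligned to its own size. */
            pgd_cache = kmem_cache_create("pgd_cache", PGD_TABLE_SIZE,
                                          PGD_TABLE_SIZE, SLAB_PANIC, NULL);
    }

    pgd_t *pgd_alloc_example(void)
    {
            return kmem_cache_zalloc(pgd_cache, GFP_KERNEL);
    }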
-
Committed by LEROY Christophe

On the powerpc 8xx, the 0x400 bit in TLB entries is set to 1 for read-only pages and to 0 for RW pages, so we should use _PAGE_RO instead of _PAGE_RW.

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Scott Wood <scottwood@freescale.com>
-
Committed by LEROY Christophe

Some powerpc variants, like the 8xx, don't have an RW bit in their PTE bits but an RO (Read Only) bit instead. This patch implements handling of a _PAGE_RO flag to be used in place of _PAGE_RW.

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
[scottwood@freescale.com: fix whitespace]
Signed-off-by: Scott Wood <scottwood@freescale.com>
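One way such a flag can be folded into the generic accessors, sketched under the assumption that exactly one of _PAGE_RW/_PAGE_RO is non-zero on any given platform (an illustration, not the literal patch):

    /* Writable if the RW bit is set (RW platforms) or the RO bit is clear
     * (RO platforms); the unused flag is defined as 0, so the same
     * expression works for both. */
    static inline int pte_write_example(pte_t pte)
    {
            return (pte_val(pte) & (_PAGE_RW | _PAGE_RO)) != _PAGE_RO;
    }

    static inline pte_t pte_wrprotect_example(pte_t pte)
    {
            return __pte((pte_val(pte) & ~_PAGE_RW) | _PAGE_RO);
    }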
-
Committed by Kim Phillips

Fix this:

      CC      arch/powerpc/sysdev/fsl_pci.o
    arch/powerpc/sysdev/fsl_pci.c: In function 'fsl_pcie_check_link':
    arch/powerpc/sysdev/fsl_pci.c:91:1: error: the frame size of 1360 bytes is larger than 1024 bytes [-Werror=frame-larger-than=]

which appears when configuring FRAME_WARN, by refactoring indirect_read_config() to take the hose and bus number instead of the 1344-byte struct pci_bus.

Signed-off-by: Kim Phillips <kim.phillips@freescale.com>
Signed-off-by: Scott Wood <scottwood@freescale.com>
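The shape of the refactoring, reduced to the two prototypes (a sketch; the real code lives in arch/powerpc/sysdev/, and the exact parameter names may differ):

    /* Before: callers had to materialise a full struct pci_bus (1344 bytes
     * on this config) on their stack just to convey the bus number. */
    int indirect_read_config(struct pci_bus *bus, unsigned int devfn,
                             int offset, int len, u32 *val);

    /* After: pass only what is actually needed -- the controller and the
     * bus number -- so fsl_pcie_check_link() needs no dummy pci_bus. */
    int indirect_read_config(struct pci_controller *hose, u8 bus_number,
                             unsigned int devfn, int offset, int len,
                             u32 *val);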
-
- 28 Jan 2015, 1 commit
-
-
Committed by Michael Ellerman

Remove slice_set_psize() which is not used. It was added in 3a8247cc "powerpc: Only demote individual slices rather than whole process" but was never used.

Remove vsx_assist_exception() which is not used. It was added in ce48b210 "powerpc: Add VSX context save/restore, ptrace and signal support" but was never used.

Remove generic_mach_cpu_die() which is not used. Its last caller was removed in 375f561a "powerpc/powernv: Always go into nap mode when CPU is offline".

Remove mpc7448_hpc2_power_off() and mpc7448_hpc2_halt() which are unused. These were introduced in c5d56332 "[POWERPC] Add general support for mpc7448hpc2 (Taiga) platform" but were never used.

This was partially found by using the static code analysis program cppcheck.

Signed-off-by: Rickard Strandqvist <rickard_strandqvist@spectrumdigital.se>
[mpe: Update changelog with details on when/why they are unused]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
- 27 Jan 2015, 1 commit
-
-
Committed by Cyril Bur

RTAS events require arguments to be passed in big endian, while hypercalls have their arguments passed in registers and the values should therefore be in CPU endian.

The "ibm,suspend_me" 'RTAS' call makes a sequence of hypercalls to set up one true RTAS call. This means that "ibm,suspend_me" is handled specially in the ppc_rtas() syscall. The ppc_rtas() syscall has its arguments in big endian and can therefore pass these arguments directly to the RTAS call. "ibm,suspend_me" is handled specially from within ppc_rtas() (by calling rtas_ibm_suspend_me()), which has left an endian bug on little endian systems due to the requirement of hypercalls. The return value from rtas_ibm_suspend_me() gets returned in CPU endian and is left unconverted, also a bug on little endian systems.

rtas_ibm_suspend_me() does not actually make use of the rtas_args that it is passed. This patch removes the convoluted use of the rtas_args struct to pass params to rtas_ibm_suspend_me() in favour of passing what it needs as actual arguments. This patch also ensures the two callers of rtas_ibm_suspend_me() pass function parameters in CPU endian and, in the case of ppc_rtas(), converts the return value. migrate_store() (the other caller of rtas_ibm_suspend_me()) is from a sysfs file which deals with everything in CPU endian, so this function only underwent cleanup.

This patch has been tested with KVM, both LE and BE, and on PowerVM, both LE and BE. Under QEMU/KVM the migration happens without touching these code paths. For PowerVM there is no obvious regression on BE, and the LE code path now provides the correct parameters to the hypervisor.

Signed-off-by: Cyril Bur <cyrilbur@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
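As a rough illustration of the endian rules involved (a sketch, not the patch itself; the signature of rtas_ibm_suspend_me() after the change is abbreviated here): RTAS argument buffers are big endian, so anything pulled out of them for use as a plain C argument must be converted, and a CPU-endian return code must be converted back before being stored into the big-endian return slot.

    /* args->args[] follows the RTAS big-endian convention */
    u64 handle = ((u64)be32_to_cpu(args->args[0]) << 32) |
                 be32_to_cpu(args->args[1]);
    int rc = rtas_ibm_suspend_me(handle);   /* plain C: CPU endian in/out */

    args->rets[0] = cpu_to_be32(rc);        /* back to the RTAS convention */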
-
- 23 Jan 2015, 6 commits
-
-
Committed by Gavin Shan

When a PE's frozen count hits the maximal allowed frozen times, which is currently 5, it is forced offline permanently. Once the PE is removed permanently, rebooting the machine is required to bring the PE back. That's not convenient when testing EEH functionality.

The patch exports the maximal allowed frozen times through the debugfs entry /sys/kernel/debug/powerpc/eeh_max_freezes.

Requested-by: Ryan Grimm <grimm@linux.vnet.ibm.com>
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
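Exporting such a tunable through debugfs is a one-liner; a minimal sketch follows (the variable type and the powerpc debugfs root symbol are assumptions, not copied from the patch):

    #include <linux/debugfs.h>
    #include <asm/debug.h>          /* powerpc_debugfs_root */

    static u32 eeh_max_freezes = 5; /* default maximal allowed frozen times */

    static int __init eeh_debugfs_example_init(void)
    {
            /* appears as /sys/kernel/debug/powerpc/eeh_max_freezes */
            debugfs_create_u32("eeh_max_freezes", 0600, powerpc_debugfs_root,
                               &eeh_max_freezes);
            return 0;
    }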
-
Committed by Gavin Shan

A PE whose frozen count exceeds the maximal allowed times (EEH_MAX_ALLOWED_FREEZES) while it is in the isolated or recovery state has implicitly been removed permanently. The patch introduces the flag EEH_PE_REMOVED to indicate that explicitly, so that we don't depend on the fixed maximal allowed times, which can be varied as we do in a subsequent patch.

Flag EEH_PE_REMOVED is expected to be marked for a PE whose frozen count exceeds the maximal allowed times, or which has just failed recovery.

Requested-by: Ryan Grimm <grimm@linux.vnet.ibm.com>
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Committed by Gavin Shan

PE#0 should be regarded as valid for P7IOC, while it's invalid for PHB3. The patch adds the flag EEH_VALID_PE_ZERO to differentiate those two cases. Without the patch, we can possibly see the frozen PE#0 state cleared without EEH recovery being taken on P7IOC, as the following kernel logs indicate:

    [root@ltcfbl8eb ~]# dmesg
    :
    pci 0000:00 : [PE# 000] Secondary bus 0 associated with PE#0
    pci 0000:01 : [PE# 001] Secondary bus 1 associated with PE#1
    pci 0001:00 : [PE# 000] Secondary bus 0 associated with PE#0
    pci 0001:01 : [PE# 001] Secondary bus 1 associated with PE#1
    pci 0002:00 : [PE# 000] Secondary bus 0 associated with PE#0
    pci 0002:01 : [PE# 001] Secondary bus 1 associated with PE#1
    pci 0003:00 : [PE# 000] Secondary bus 0 associated with PE#0
    pci 0003:01 : [PE# 001] Secondary bus 1 associated with PE#1
    pci 0003:20 : [PE# 002] Secondary bus 32..63 associated with PE#2
    :
    EEH: Clear non-existing PHB#3-PE#0
    EEH: PHB location: U78AE.001.WZS00M9-P1-002

Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Committed by Naveen N. Rao

Currently, all non-dot symbols are being treated as function descriptors in ABIv1. This is incorrect and results in perf probe not working:

    # perf probe do_fork
    Added new event:
    Failed to write event: Invalid argument
      Error: Failed to add events.
    # dmesg | tail -1
    [192268.073063] Could not insert probe at _text+768432: -22

perf probe bases all kernel probes on _text and writes, for example, "p:probe/do_fork _text+768432" to /sys/kernel/debug/tracing/kprobe_events. In-kernel, _text is being considered to be a function descriptor, which results in the above error.

Fix this by changing how we look up symbol addresses on ppc64. We first check for the dot variant of a symbol and look at the non-dot variant only if that fails. In this manner, we avoid having to look at the function descriptor. While at it, also separate out how this works on ABIv2, where we don't have dot symbols but need to use the local entry point.

Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Committed by Michael Ellerman

Once upon a time, at least 9 years ago (< 2.6.12), _TIF_SYSCALL_T_OR_A meant "TRACE or AUDIT". But these days it means TRACE or AUDIT or SECCOMP or TRACEPOINT or NOHZ. All of those are implemented via syscall_dotrace(), so rename the flag to match, to try and clarify things.

Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Committed by Michael Ellerman
We removed the last usage of CPU_FTR_IABR in commit 1ad7d705 "powerpc/xmon: Enable HW instruction breakpoint on POWER8". Mark it as free.

Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
- 22 Jan 2015, 1 commit
-
-
Committed by Ryan Grimm

Turning snoops on is the last step in CAPP recovery. Sapphire is expected to have reinitialized the PHB and done the previous recovery steps. Add a mode argument to the OPAL call to do this. The driver can also turn snoops off, although it does not currently do so.

Signed-off-by: Ryan Grimm <grimm@linux.vnet.ibm.com>
Acked-by: Ian Munsie <imunsie@au1.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
- 18 Dec 2014, 1 commit
-
-
Committed by Michael S. Tsirkin

At the moment, if p and x are both of the same bitwise type (eg. __le32), get_user(x, p) produces a sparse warning. This is because *p is loaded into a long then cast back to typeof(*p). When typeof(*p) is a bitwise type (which is uncommon), such a cast needs __force, otherwise sparse produces a warning.

For non-bitwise types __force should have no effect, and should not hide any legitimate errors. Note that we are casting to typeof(*p), not typeof(x). Even with the cast, if x and *p are of different types we should still get the warning, so I think we are not losing the ability to detect any actual errors.

virtio would like to use bitwise types with get_user(), so fix these spurious warnings by adding __force.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
[mpe: Fill in changelog with more details]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
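The kind of caller this unblocks looks like the following; with the __force cast inside get_user() the sparse warning on the bitwise type goes away (an illustrative function, not code from the patch):

    #include <linux/uaccess.h>
    #include <linux/types.h>

    static int read_le32_example(__le32 __user *up, __le32 *out)
    {
            __le32 v;

            /* x and *ptr are the same bitwise type (__le32) */
            if (get_user(v, up))
                    return -EFAULT;
            *out = v;
            return 0;
    }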
-
- 17 Dec 2014, 4 commits
-
-
Committed by Sam Bobroff

Currently the H_CONFER hcall is implemented in kernel virtual mode, meaning that whenever a guest thread does an H_CONFER, all the threads in that virtual core have to exit the guest. This is bad for performance because it interrupts the other threads even if they are doing useful work.

The H_CONFER hcall is called by a guest VCPU when it is spinning on a spinlock and it detects that the spinlock is held by a guest VCPU that is currently not running on a physical CPU. The idea is to give this VCPU's time slice to the holder VCPU so that it can make progress towards releasing the lock.

To avoid having the other threads exit the guest unnecessarily, we add a real-mode implementation of H_CONFER that checks whether the other threads are doing anything. If all the other threads are idle (i.e. in H_CEDE) or trying to confer (i.e. in H_CONFER), it returns H_TOO_HARD, which causes a guest exit and allows the H_CONFER to be handled in virtual mode. Otherwise it spins for a short time (up to 10 microseconds) to give other threads the chance to observe that this thread is trying to confer. The spin loop also terminates when any thread exits the guest or when all other threads are idle or trying to confer. If the timeout is reached, the H_CONFER returns H_SUCCESS. In this case the guest VCPU will recheck the spinlock word and most likely call H_CONFER again.

This also improves the implementation of the H_CONFER virtual mode handler. If the VCPU is part of a virtual core (vcore) which is runnable, there will be a 'runner' VCPU which has taken responsibility for running the vcore. In this case we yield to the runner VCPU rather than the target VCPU. We also introduce a check on the target VCPU's yield count: if it differs from the yield count passed to H_CONFER, the target VCPU has run since H_CONFER was called and may have already released the lock. This check is required by PAPR.

Signed-off-by: Sam Bobroff <sam.bobroff@au1.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
-
Committed by Paul Mackerras

There are two ways in which a guest instruction can be obtained from the guest in the guest exit code in book3s_hv_rmhandlers.S. If the exit was caused by a Hypervisor Emulation interrupt (i.e. an illegal instruction), the offending instruction is in the HEIR register (Hypervisor Emulation Instruction Register). If the exit was caused by a load or store to an emulated MMIO device, we load the instruction from the guest by turning data relocation on and loading the instruction with an lwz instruction.

Unfortunately, in the case where the guest has opposite endianness to the host, these two methods give results of different endianness, but both get put into vcpu->arch.last_inst. The HEIR value has been loaded using guest endianness, whereas the lwz will load the instruction using host endianness. The rest of the code that uses vcpu->arch.last_inst assumes it was loaded using host endianness.

To fix this, we define a new vcpu field to store the HEIR value. Then, in kvmppc_handle_exit_hv(), we transfer the value from this new field to vcpu->arch.last_inst, doing a byte-swap if the guest and host endianness differ.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
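A sketch of the swap on the exit path; the field holding the saved HEIR value is called emul_inst here purely as an assumption, and the endianness test is simplified:

    #include <linux/swab.h>

    static void fixup_last_inst_example(struct kvm_vcpu *vcpu, bool guest_is_le)
    {
            u32 inst = vcpu->arch.emul_inst;        /* assumed HEIR save slot */

            /* HEIR was fetched with guest endianness; last_inst is consumed
             * assuming host endianness, so swap only when the two differ. */
            if (guest_is_le != IS_ENABLED(CONFIG_CPU_LITTLE_ENDIAN))
                    inst = swab32(inst);
            vcpu->arch.last_inst = inst;
    }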
-
Committed by Paul Mackerras
This removes the code that was added to enable HV KVM to work on PPC970 processors. The PPC970 is an old CPU that doesn't support virtualizing guest memory.

Removing PPC970 support also lets us remove the code for allocating and managing contiguous real-mode areas, the code for the !kvm->arch.using_mmu_notifiers case, the code for pinning pages of guest memory when first accessed and keeping track of which pages have been pinned, and the code for handling H_ENTER hypercalls in virtual mode.

Book3S HV KVM is now supported only on POWER7 and POWER8 processors. The KVM_CAP_PPC_RMA capability now always returns 0.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
-
Committed by Paul Mackerras

Currently the calculation of stolen time for PPC Book3S HV guests uses fields in both the vcpu struct and the kvmppc_vcore struct. The fields in the kvmppc_vcore struct are protected by the vcpu->arch.tbacct_lock of the vcpu that has taken responsibility for running the virtual core. This works correctly but confuses lockdep, because it sees that the code takes the tbacct_lock for a vcpu in kvmppc_remove_runnable() and then takes another vcpu's tbacct_lock in vcore_stolen_time(), and it thinks there is a possibility of deadlock, causing it to print reports like this:

    =============================================
    [ INFO: possible recursive locking detected ]
    3.18.0-rc7-kvm-00016-g8db4bc6 #89 Not tainted
    ---------------------------------------------
    qemu-system-ppc/6188 is trying to acquire lock:
     (&(&vcpu->arch.tbacct_lock)->rlock){......}, at: [<d00000000ecb1fe8>] .vcore_stolen_time+0x48/0xd0 [kvm_hv]

    but task is already holding lock:
     (&(&vcpu->arch.tbacct_lock)->rlock){......}, at: [<d00000000ecb25a0>] .kvmppc_remove_runnable.part.3+0x30/0xd0 [kvm_hv]

    other info that might help us debug this:
     Possible unsafe locking scenario:

           CPU0
           ----
      lock(&(&vcpu->arch.tbacct_lock)->rlock);
      lock(&(&vcpu->arch.tbacct_lock)->rlock);

     *** DEADLOCK ***

     May be due to missing lock nesting notation

    3 locks held by qemu-system-ppc/6188:
     #0: (&vcpu->mutex){+.+.+.}, at: [<d00000000eb93f98>] .vcpu_load+0x28/0xe0 [kvm]
     #1: (&(&vcore->lock)->rlock){+.+...}, at: [<d00000000ecb41b0>] .kvmppc_vcpu_run_hv+0x530/0x1530 [kvm_hv]
     #2: (&(&vcpu->arch.tbacct_lock)->rlock){......}, at: [<d00000000ecb25a0>] .kvmppc_remove_runnable.part.3+0x30/0xd0 [kvm_hv]

    stack backtrace:
    CPU: 40 PID: 6188 Comm: qemu-system-ppc Not tainted 3.18.0-rc7-kvm-00016-g8db4bc6 #89
    Call Trace:
    [c000000b2754f3f0] [c000000000b31b6c] .dump_stack+0x88/0xb4 (unreliable)
    [c000000b2754f470] [c0000000000faeb8] .__lock_acquire+0x1878/0x2190
    [c000000b2754f600] [c0000000000fbf0c] .lock_acquire+0xcc/0x1a0
    [c000000b2754f6d0] [c000000000b2954c] ._raw_spin_lock_irq+0x4c/0x70
    [c000000b2754f760] [d00000000ecb1fe8] .vcore_stolen_time+0x48/0xd0 [kvm_hv]
    [c000000b2754f7f0] [d00000000ecb25b4] .kvmppc_remove_runnable.part.3+0x44/0xd0 [kvm_hv]
    [c000000b2754f880] [d00000000ecb43ec] .kvmppc_vcpu_run_hv+0x76c/0x1530 [kvm_hv]
    [c000000b2754f9f0] [d00000000eb9f46c] .kvmppc_vcpu_run+0x2c/0x40 [kvm]
    [c000000b2754fa60] [d00000000eb9c9a4] .kvm_arch_vcpu_ioctl_run+0x54/0x160 [kvm]
    [c000000b2754faf0] [d00000000eb94538] .kvm_vcpu_ioctl+0x498/0x760 [kvm]
    [c000000b2754fcb0] [c000000000267eb4] .do_vfs_ioctl+0x444/0x770
    [c000000b2754fd90] [c0000000002682a4] .SyS_ioctl+0xc4/0xe0
    [c000000b2754fe30] [c0000000000092e4] syscall_exit+0x0/0x98

In order to make the locking easier to analyse, we change the code to use a spinlock in the kvmppc_vcore struct to protect the stolen_tb and preempt_tb fields. This lock needs to be an irq-safe lock since it is used in the kvmppc_core_vcpu_load_hv() and kvmppc_core_vcpu_put_hv() functions, which are called with the scheduler rq lock held, which is an irq-safe lock.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
-
- 15 Dec 2014, 5 commits
-
-
Committed by Paul Mackerras
The B (segment size) field in the RB operand for the tlbie instruction is two bits, which we get from the top two bits of the first doubleword of the HPT entry to be invalidated. These bits go in bits 8 and 9 of the RB operand (bits 54 and 55 in IBM bit numbering).

The compute_tlbie_rb() function gets these bits as v >> (62 - 8), which is not correct as it will bring in the top 10 bits, not just the top two. These extra bits could corrupt the AP, AVAL and L fields in the RB value. To fix this we shift right 62 bits and then shift left 8 bits, so we only get the two bits of the B field.

The first doubleword of the HPT entry is under the control of the guest kernel. In fact, Linux guests will always put zeroes in bits 54 -- 61 (IBM bits 2 -- 9), but we should not rely on guests doing this.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
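The arithmetic in question, as a standalone sketch:

    static unsigned long tlbie_rb_b_field_example(unsigned long v)
    {
            /* old: v >> (62 - 8) kept the top 10 bits of v and could corrupt
             * the AP, AVAL and L fields of the RB operand;
             * new: isolate just the top two bits, then place them at bits 8-9 */
            return (v >> 62) << 8;
    }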
-
Committed by Shreyas B. Prabhu

Winkle is a deep idle state supported in power8 chips. A core enters winkle when all the threads of the core enter winkle. In this state, power supply to the entire chiplet, i.e. core, private L2 and private L3, is turned off. As a result it gives higher powersavings compared to sleep.

But entering winkle results in a total hypervisor state loss. Hence the hypervisor context has to be preserved before entering winkle and restored upon wake up.

The Power-on Reset Engine (PORE) is a dedicated engine which is responsible for powering on the chiplet during wake up. It can be programmed to restore the contents of a few specific registers. This patch uses PORE to restore register state wherever possible and uses the stack to save and restore the rest of the necessary registers.

With hypervisor state restore, things fall under three categories: per-core state, per-subcore state and per-thread state. To manage this, extend the infrastructure introduced for sleep. Mainly we add a paca variable subcore_sibling_mask. Using this and the core_idle_state we can distinguish the first thread in the core and subcore.

Signed-off-by: Shreyas B. Prabhu <shreyas@linux.vnet.ibm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: linuxppc-dev@lists.ozlabs.org
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Committed by Shreyas B. Prabhu

Deep idle states like sleep and winkle are per-core idle states. A core enters these states only when all the threads enter either the particular idle state or a deeper one. There are tasks like the fastsleep hardware bug workaround and hypervisor core state save which have to be done only by the last thread of the core entering the deep idle state, and similarly tasks like timebase resync and hypervisor core register restore that have to be done only by the first thread waking up from these states.

The current idle state management does not have a way to distinguish the first/last thread of the core waking/entering idle states. Tasks like timebase resync are done for all the threads. This is not only suboptimal, but can cause functionality issues when subcores and kvm are involved.

This patch adds the necessary infrastructure to track idle states of threads in a per-core structure. It uses this info to perform tasks like the fastsleep workaround and timebase resync only once per core.

Signed-off-by: Shreyas B. Prabhu <shreyas@linux.vnet.ibm.com>
Originally-by: Preeti U. Murthy <preeti@linux.vnet.ibm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
Cc: linux-pm@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Committed by Shreyas B. Prabhu

The secondary threads should enter deep idle states so as to gain maximum powersavings when the entire core is offline. To do so, the offline path must be made aware of the available deepest idle state. Hence probe the device tree for the possible idle states in powernv core code and expose the deepest idle state through flags.

Since the device tree is probed by the cpuidle driver as well, move the parameters required to discover the idle states into an appropriate place common to both the driver and the powernv core code.

Another point is that the fastsleep idle state may require workarounds in the kernel to function properly. This workaround is introduced in the subsequent patches. However, neither the cpuidle driver nor the hotplug path need be bothered about this workaround. They will be taken care of by the core powernv code.

Originally-by: Srivatsa S. Bhat <srivatsa@mit.edu>
Signed-off-by: Preeti U. Murthy <preeti@linux.vnet.ibm.com>
Signed-off-by: Shreyas B. Prabhu <shreyas@linux.vnet.ibm.com>
Reviewed-by: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
Cc: linux-pm@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Committed by Paul Mackerras

Currently, when going idle, we set the flag indicating that we are in nap mode (paca->kvm_hstate.hwthread_state) and then execute the nap (or sleep or rvwinkle) instruction, all with the MMU on. This is bad for two reasons: (a) the architecture specifies that those instructions must be executed with the MMU off, and in fact with only the SF, HV, ME and possibly RI bits set, and (b) this introduces a race, because as soon as we set the flag, another thread can switch the MMU to a guest context. If the race is lost, this thread will typically start looping on relocation-on ISIs at 0xc...4400.

This fixes it by setting the MSR as required by the architecture before setting the flag or executing the nap/sleep/rvwinkle instruction.

Cc: stable@vger.kernel.org
[shreyas@linux.vnet.ibm.com: Edited to handle LE]
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Shreyas B. Prabhu <shreyas@linux.vnet.ibm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: linuxppc-dev@lists.ozlabs.org
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
- 14 Dec 2014, 1 commit
-
-
Committed by Neelesh Gupta

The patch exposes the available i2c buses on the PowerNV platform to the kernel and implements the bus driver to support i2c and smbus commands. The driver uses the platform device infrastructure to probe the buses on the platform and registers them with the i2c driver framework.

Signed-off-by: Neelesh Gupta <neelegup@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Acked-by: Wolfram Sang <wsa@the-dreams.de> (I2C part, excluding the bindings)
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
- 12 Dec 2014, 3 commits
-
-
Committed by Richard Guy Briggs

Since both ppc and ppc64 have LE variants which are now reported by uname, add that flag (__AUDIT_ARCH_LE) to syscall_get_arch() and add an AUDIT_ARCH_PPC64LE variant.

Without this, perf trace and auditctl fail. The mainline kernel reports ppc64le (per a0588015) but there is no matching AUDIT_ARCH_PPC64LE.

Since 32-bit PPC LE is not supported by audit, don't advertise it in the AUDIT_ARCH_PPC* variants.

See:
https://www.redhat.com/archives/linux-audit/2014-August/msg00082.html
https://www.redhat.com/archives/linux-audit/2014-December/msg00004.html

Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
Acked-by: Paul Moore <paul@paul-moore.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
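The resulting selection logic looks roughly like this (a sketch of the idea; the helper and config names are the usual powerpc ones but are not copied from the patch):

    static inline int syscall_get_arch_example(void)
    {
            if (is_32bit_task())
                    return AUDIT_ARCH_PPC;          /* 32-bit LE not advertised */
            if (IS_ENABLED(CONFIG_CPU_LITTLE_ENDIAN))
                    return AUDIT_ARCH_PPC64LE;      /* AUDIT_ARCH_PPC64 | __AUDIT_ARCH_LE */
            return AUDIT_ARCH_PPC64;
    }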
-
Committed by Alexander Duyck

There are a number of situations where the mandatory barriers rmb() and wmb() are used to order memory/memory operations in device drivers, and those barriers are much heavier than they actually need to be. For example, in the case of PowerPC, wmb() calls the heavy-weight sync instruction when for coherent memory operations all that is really needed is an lwsync or eieio instruction.

This commit adds a coherent-only version of the mandatory memory barriers rmb() and wmb(). In most cases this should result in the barrier being the same as the SMP barriers for the SMP case; however, in some cases we use a barrier that is somewhere in between rmb() and smp_rmb(). For example, on ARM the rmb barriers break down as follows:

    Barrier    Call      Explanation
    ---------  --------  ----------------------------------
    rmb()      dsb()     Data synchronization barrier - system
    dma_rmb()  dmb(osh)  data memory barrier - outer sharable
    smp_rmb()  dmb(ish)  data memory barrier - inner sharable

These new barriers are not as safe as the standard rmb() and wmb(). Specifically, they do not guarantee ordering between coherent and incoherent memories. The primary use case for these would be to enforce ordering of reads and writes when accessing coherent memory that is shared between the CPU and a device.

It may also be noted that there is no dma_mb(). Most architectures don't provide a good mechanism for performing a coherent-only full barrier without resorting to the same mechanism used in mb(). As such there isn't much to be gained in trying to define such a function.

Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Cc: Michael Ellerman <michael@ellerman.id.au>
Cc: Michael Neuling <mikey@neuling.org>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: David Miller <davem@davemloft.net>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Acked-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
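A typical driver-side use of the new lightweight barriers; every identifier here (the descriptor layout, the flag values, process()) is illustrative only:

    /* Producer: fill a descriptor in coherent DMA memory, then publish it.
     * dma_wmb() orders the payload stores before the ownership store without
     * paying for a full wmb()/sync. */
    desc->addr  = cpu_to_le64(buf_dma);
    desc->len   = cpu_to_le32(len);
    dma_wmb();
    desc->flags = cpu_to_le32(DESC_OWNED_BY_HW);

    /* Consumer: check the completion flag, then read the rest of the
     * descriptor; dma_rmb() keeps those reads in that order. */
    if (le32_to_cpu(desc->flags) & DESC_DONE) {
            dma_rmb();
            process(le64_to_cpu(desc->addr), le32_to_cpu(desc->len));
    }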
-
Committed by Alexander Duyck

This patch is meant to clean up the handling of read_barrier_depends and smp_read_barrier_depends. In multiple spots in the kernel headers read_barrier_depends is defined as "do {} while (0)"; however, we then go into the SMP vs non-SMP sections and have the SMP version reference read_barrier_depends while the non-SMP version defines it as yet another empty do/while.

With this commit I went through and cleaned out the duplicate definitions and reduced the number of definitions down to 2 per header. In addition I moved the 50-line comments for the macro from the x86 and mips headers, which defined it as an empty do/while, to those that were actually defining the macro, alpha and blackfin.

Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
- 11 Dec 2014, 2 commits
-
-
Committed by Kirill A. Shutemov

As with the small zero page, the huge zero page should not be accounted in smaps reports as a normal page. For small pages we rely on vm_normal_page() to filter out the zero page, but vm_normal_page() is not designed to handle pmds. We only get here due to a hackish cast of pmd to pte in smaps_pte_range() -- pte and pmd formats are not necessarily compatible on each and every architecture.

Let's add a separate codepath to handle pmds. follow_trans_huge_pmd() will detect the huge zero page for us. We need a pmd_dirty() helper to do this properly. The patch adds it to the THP-enabled architectures which don't yet have one.

[akpm@linux-foundation.org: use do_div to fix 32-bit build]
Signed-off-by: "Kirill A. Shutemov" <kirill@shutemov.name>
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Tested-by: Fengwei Yin <yfw.kernel@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
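On an architecture whose huge-page PMDs reuse the PTE bit layout (powerpc book3s64 is one such case), the new helper can be a thin wrapper; a hedged sketch:

    static inline int pmd_dirty_example(pmd_t pmd)
    {
            /* huge-page PMDs carry PTE-format bits, so reuse the PTE accessor */
            return pte_dirty(pmd_pte(pmd));
    }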
-
Committed by Daniel Borkmann

As there are now no remaining users of arch_fast_hash(), let's kill it entirely. This basically reverts commit 71ae8aac ("lib: introduce arch optimized hash library") and follow-up work, that is, e.g., commit 23721754 ("lib: hash: follow-up fixups for arch hash"), commit e3fec2f7 ("lib: Add missing arch generic-y entries for asm-generic/hash.h") and last but not least commit 6a02652d ("perf tools: Fix include for non x86 architectures").

Cc: Francesco Fusco <fusco@ntop.org>
Cc: Thomas Graf <tgraf@suug.ch>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
- 08 Dec 2014, 1 commit
-
-
Committed by Paul Mackerras

When a secondary hardware thread has finished running a KVM guest, we currently put that thread into nap mode using a nap instruction in the KVM code. This changes the code so that instead of doing a nap instruction directly, we instead cause the call to power7_nap() that put the thread into nap mode to return. The reason for doing this is to avoid having the KVM code having to know what low-power mode to put the thread into.

In the case of a secondary thread used to run a KVM guest, the thread will be offline from the point of view of the host kernel, and the relevant power7_nap() call is the one in pnv_smp_cpu_disable(). In this case we don't want to clear pending IPIs in the offline loop in that function, since that might cause us to miss the wakeup for the next time the thread needs to run a guest. To tell whether or not to clear the interrupt, we use the SRR1 value returned from power7_nap() and check if it indicates an external interrupt. We arrange that the return from power7_nap() when we have finished running a guest returns 0, so pending interrupts don't get flushed in that case.

Note that it is important that a secondary thread that has finished executing in the guest, or that didn't have a guest to run, should not return to power7_nap's caller while the kvm_hstate.hwthread_req flag in the PACA is non-zero, because the return from power7_nap will reenable the MMU, and the MMU might still be in guest context. In this situation we spin at low priority in real mode waiting for hwthread_req to become zero.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
- 06 Dec 2014, 1 commit
-
-
Committed by Alexei Starovoitov

Introduce a new setsockopt() command:

    setsockopt(sock, SOL_SOCKET, SO_ATTACH_BPF, &prog_fd, sizeof(prog_fd))

where prog_fd was received from the syscall bpf(BPF_PROG_LOAD, attr, ...) with attr->prog_type == BPF_PROG_TYPE_SOCKET_FILTER.

setsockopt() calls bpf_prog_get(), which increments the refcnt of the program, so it doesn't get unloaded while the socket is using the program. The same eBPF program can be attached to multiple sockets. User task exit automatically closes the socket, which calls sk_filter_uncharge(), which decrements the refcnt of the eBPF program.

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
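From user space the new command is used like this (a sketch with error handling trimmed; prog_fd is assumed to have been obtained from a prior bpf(BPF_PROG_LOAD, ...) call):

    #include <sys/socket.h>

    int attach_filter_example(int sock, int prog_fd)
    {
            /* the kernel takes its own reference on the program, so it stays
             * loaded for as long as the socket uses it */
            return setsockopt(sock, SOL_SOCKET, SO_ATTACH_BPF,
                              &prog_fd, sizeof(prog_fd));
    }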
-
- 05 Dec 2014, 1 commit
-
-
Committed by Aneesh Kumar K.V

updatepp can get called for a nohpte fault when we find from the Linux page table that the translation was hashed before. In that case we are sure that there is no existing translation, hence we could avoid doing tlbie.

We could possibly race with a parallel fault filling the TLB. But that should be ok because updatepp is only ever relaxing permissions. We also look at the Linux pte permission bits when filling hash pte permission bits. We also hold the Linux pte busy bits while inserting/updating a hashpte entry, hence a parallel update of the Linux pte is not possible. On the other hand, mprotect involves ptep_modify_prot_start, which causes a hpte invalidate and not updatepp.

Performance numbers: we use random_access_bench written by Anton, on a kernel with THP disabled and a smaller hash page table size:

    86.60%  random_access_b  [kernel.kallsyms]    [k] .native_hpte_updatepp
     2.10%  random_access_b  random_access_bench  [.] doit
     1.99%  random_access_b  [kernel.kallsyms]    [k] .do_raw_spin_lock
     1.85%  random_access_b  [kernel.kallsyms]    [k] .native_hpte_insert
     1.26%  random_access_b  [kernel.kallsyms]    [k] .native_flush_hash_range
     1.18%  random_access_b  [kernel.kallsyms]    [k] .__delay
     0.69%  random_access_b  [kernel.kallsyms]    [k] .native_hpte_remove
     0.37%  random_access_b  [kernel.kallsyms]    [k] .clear_user_page
     0.34%  random_access_b  [kernel.kallsyms]    [k] .__hash_page_64K
     0.32%  random_access_b  [kernel.kallsyms]    [k] fast_exception_return
     0.30%  random_access_b  [kernel.kallsyms]    [k] .hash_page_mm

With fix:

    27.54%  random_access_b  random_access_bench  [.] doit
    22.90%  random_access_b  [kernel.kallsyms]    [k] .native_hpte_insert
     5.76%  random_access_b  [kernel.kallsyms]    [k] .native_hpte_remove
     5.20%  random_access_b  [kernel.kallsyms]    [k] fast_exception_return
     5.12%  random_access_b  [kernel.kallsyms]    [k] .__hash_page_64K
     4.80%  random_access_b  [kernel.kallsyms]    [k] .hash_page_mm
     3.31%  random_access_b  [kernel.kallsyms]    [k] data_access_common
     1.84%  random_access_b  [kernel.kallsyms]    [k] .trace_hardirqs_on_caller

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
- 02 Dec 2014, 5 commits
-
-
Committed by Aneesh Kumar K.V
If we know that user address space has never executed on other cpus we could use tlbiel.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
-
Committed by Aneesh Kumar K.V
Rename invalidate_old_hpte to flush_hash_hugepage and use that in other places.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
-
Committed by Mahesh Salgaonkar

Clean up the OpalMCE_* definitions/declarations and other related code which is not used anymore.

Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
-
Committed by Gavin Shan

On the PowerNV platform, PHB diag-data is dumped after stopping device drivers. In case of recursive EEH errors, the kernel usually crashes before dumping the PHB diag-data for the second EEH error, and it's hard to locate the root cause of that second error without its diag-data.

The patch adds one more EEH option, "eeh=early_log", which dumps PHB diag-data immediately once a frozen PE is detected, in order to get the PHB diag-data for the second EEH error.

Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
-
Committed by Gavin Shan

The patch introduces an additional flag, EEH_PE_RESET, to indicate that the corresponding PE is under reset. In turn, the PE state retrieval backend on the PowerNV platform can return an unfrozen state so the EEH core can move forward. Flag EEH_PE_CFG_BLOCKED isn't the correct one for this purpose.

In the PCI passthrough case the problem is worse: the guest doesn't recover from the 6th EEH error. The PE is left in the isolated (frozen) and config-blocked state on Broadcom adapters, and we can't retrieve the PE's state correctly any more, even from the host side via sysfs /sys/bus/pci/devices/xxx/eeh_pe_state.

Reported-by: Rajeshkumar Subramanian <rajeshkumars@in.ibm.com>
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
-
- 24 Nov 2014, 1 commit
-
-
Committed by Benjamin Herrenschmidt
This is now fully replaced with the generic "no_64bit_msi" one that is set by the respective drivers directly.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
-
- 18 Nov 2014, 1 commit
-
-
Committed by Joerg Roedel

The IOMMU API gained support for a new iommu_map_sg function. This causes compile failures on powerpc because the function name is already used globally there. This patch adds a ppc_ prefix to these functions to solve the compile problem.

Signed-off-by: Joerg Roedel <jroedel@suse.de>
-