- 05 3月, 2012 19 次提交
-
-
由 Paul Mackerras 提交于
With this, if a guest does an H_ENTER with a read/write HPTE on a page which is currently read-only, we make the actual HPTE inserted be a read-only version of the HPTE. We now intercept protection faults as well as HPTE not found faults, and for a protection fault we work out whether it should be reflected to the guest (e.g. because the guest HPTE didn't allow write access to usermode) or handled by switching to kernel context and calling kvmppc_book3s_hv_page_fault, which will then request write access to the page and update the actual HPTE. Signed-off-by: NPaul Mackerras <paulus@samba.org> Signed-off-by: NAlexander Graf <agraf@suse.de> Signed-off-by: NAvi Kivity <avi@redhat.com>
-
由 Paul Mackerras 提交于
This adds the infrastructure to enable us to page out pages underneath a Book3S HV guest, on processors that support virtualized partition memory, that is, POWER7. Instead of pinning all the guest's pages, we now look in the host userspace Linux page tables to find the mapping for a given guest page. Then, if the userspace Linux PTE gets invalidated, kvm_unmap_hva() gets called for that address, and we replace all the guest HPTEs that refer to that page with absent HPTEs, i.e. ones with the valid bit clear and the HPTE_V_ABSENT bit set, which will cause an HDSI when the guest tries to access them. Finally, the page fault handler is extended to reinstantiate the guest HPTE when the guest tries to access a page which has been paged out. Since we can't intercept the guest DSI and ISI interrupts on PPC970, we still have to pin all the guest pages on PPC970. We have a new flag, kvm->arch.using_mmu_notifiers, that indicates whether we can page guest pages out. If it is not set, the MMU notifier callbacks do nothing and everything operates as before. Signed-off-by: NPaul Mackerras <paulus@samba.org> Signed-off-by: NAlexander Graf <agraf@suse.de> Signed-off-by: NAvi Kivity <avi@redhat.com>
-
由 Paul Mackerras 提交于
This provides the low-level support for MMIO emulation in Book3S HV guests. When the guest tries to map a page which is not covered by any memslot, that page is taken to be an MMIO emulation page. Instead of inserting a valid HPTE, we insert an HPTE that has the valid bit clear but another hypervisor software-use bit set, which we call HPTE_V_ABSENT, to indicate that this is an absent page. An absent page is treated much like a valid page as far as guest hcalls (H_ENTER, H_REMOVE, H_READ etc.) are concerned, except of course that an absent HPTE doesn't need to be invalidated with tlbie since it was never valid as far as the hardware is concerned. When the guest accesses a page for which there is an absent HPTE, it will take a hypervisor data storage interrupt (HDSI) since we now set the VPM1 bit in the LPCR. Our HDSI handler for HPTE-not-present faults looks up the hash table and if it finds an absent HPTE mapping the requested virtual address, will switch to kernel mode and handle the fault in kvmppc_book3s_hv_page_fault(), which at present just calls kvmppc_hv_emulate_mmio() to set up the MMIO emulation. This is based on an earlier patch by Benjamin Herrenschmidt, but since heavily reworked. Signed-off-by: NPaul Mackerras <paulus@samba.org> Signed-off-by: NAlexander Graf <agraf@suse.de> Signed-off-by: NAvi Kivity <avi@redhat.com>
-
由 Paul Mackerras 提交于
This expands the reverse mapping array to contain two links for each HPTE which are used to link together HPTEs that correspond to the same guest logical page. Each circular list of HPTEs is pointed to by the rmap array entry for the guest logical page, pointed to by the relevant memslot. Links are 32-bit HPT entry indexes rather than full 64-bit pointers, to save space. We use 3 of the remaining 32 bits in the rmap array entries as a lock bit, a referenced bit and a present bit (the present bit is needed since HPTE index 0 is valid). The bit lock for the rmap chain nests inside the HPTE lock bit. Signed-off-by: NPaul Mackerras <paulus@samba.org> Signed-off-by: NAlexander Graf <agraf@suse.de> Signed-off-by: NAvi Kivity <avi@redhat.com>
-
由 Paul Mackerras 提交于
This provides for the case where userspace maps an I/O device into the address range of a memory slot using a VM_PFNMAP mapping. In that case, we work out the pfn from vma->vm_pgoff, and record the cache enable bits from vma->vm_page_prot in two low-order bits in the slot_phys array entries. Then, in kvmppc_h_enter() we check that the cache bits in the HPTE that the guest wants to insert match the cache bits in the slot_phys array entry. However, we do allow the guest to create what it thinks is a non-cacheable or write-through mapping to memory that is actually cacheable, so that we can use normal system memory as part of an emulated device later on. In that case the actual HPTE we insert is a cacheable HPTE. Signed-off-by: NPaul Mackerras <paulus@samba.org> Signed-off-by: NAlexander Graf <agraf@suse.de> Signed-off-by: NAvi Kivity <avi@redhat.com>
-
由 Paul Mackerras 提交于
This relaxes the requirement that the guest memory be provided as 16MB huge pages, allowing it to be provided as normal memory, i.e. in pages of PAGE_SIZE bytes (4k or 64k). To allow this, we index the kvm->arch.slot_phys[] arrays with a small page index, even if huge pages are being used, and use the low-order 5 bits of each entry to store the order of the enclosing page with respect to normal pages, i.e. log_2(enclosing_page_size / PAGE_SIZE). Signed-off-by: NPaul Mackerras <paulus@samba.org> Signed-off-by: NAlexander Graf <agraf@suse.de> Signed-off-by: NAvi Kivity <avi@redhat.com>
-
由 Paul Mackerras 提交于
This removes the code from kvmppc_core_prepare_memory_region() that looked up the VMA for the region being added and called hva_to_page to get the pfns for the memory. We have no guarantee that there will be anything mapped there at the time of the KVM_SET_USER_MEMORY_REGION ioctl call; userspace can do that ioctl and then map memory into the region later. Instead we defer looking up the pfn for each memory page until it is needed, which generally means when the guest does an H_ENTER hcall on the page. Since we can't call get_user_pages in real mode, if we don't already have the pfn for the page, kvmppc_h_enter() will return H_TOO_HARD and we then call kvmppc_virtmode_h_enter() once we get back to kernel context. That calls kvmppc_get_guest_page() to get the pfn for the page, and then calls back to kvmppc_h_enter() to redo the HPTE insertion. When the first vcpu starts executing, we need to have the RMO or VRMA region mapped so that the guest's real mode accesses will work. Thus we now have a check in kvmppc_vcpu_run() to see if the RMO/VRMA is set up and if not, call kvmppc_hv_setup_rma(). It checks if the memslot starting at guest physical 0 now has RMO memory mapped there; if so it sets it up for the guest, otherwise on POWER7 it sets up the VRMA. The function that does that, kvmppc_map_vrma, is now a bit simpler, as it calls kvmppc_virtmode_h_enter instead of creating the HPTE itself. Since we are now potentially updating entries in the slot_phys[] arrays from multiple vcpu threads, we now have a spinlock protecting those updates to ensure that we don't lose track of any references to pages. Signed-off-by: NPaul Mackerras <paulus@samba.org> Signed-off-by: NAlexander Graf <agraf@suse.de> Signed-off-by: NAvi Kivity <avi@redhat.com>
-
由 Paul Mackerras 提交于
At present, our implementation of H_ENTER only makes one try at locking each slot that it looks at, and doesn't even retry the ldarx/stdcx. atomic update sequence that it uses to attempt to lock the slot. Thus it can return the H_PTEG_FULL error unnecessarily, particularly when the H_EXACT flag is set, meaning that the caller wants a specific PTEG slot. This improves the situation by making a second pass when no free HPTE slot is found, where we spin until we succeed in locking each slot in turn and then check whether it is full while we hold the lock. If the second pass fails, then we return H_PTEG_FULL. This also moves lock_hpte to a header file (since later commits in this series will need to use it from other source files) and renames it to try_lock_hpte, which is a somewhat less misleading name. Signed-off-by: NPaul Mackerras <paulus@samba.org> Signed-off-by: NAlexander Graf <agraf@suse.de> Signed-off-by: NAvi Kivity <avi@redhat.com>
-
由 Paul Mackerras 提交于
This adds two new functions, kvmppc_pin_guest_page() and kvmppc_unpin_guest_page(), and uses them to pin the guest pages where the guest has registered areas of memory for the hypervisor to update, (i.e. the per-cpu virtual processor areas, SLB shadow buffers and dispatch trace logs) and then unpin them when they are no longer required. Although it is not strictly necessary to pin the pages at this point, since all guest pages are already pinned, later commits in this series will mean that guest pages aren't all pinned. Signed-off-by: NPaul Mackerras <paulus@samba.org> Signed-off-by: NAlexander Graf <agraf@suse.de> Signed-off-by: NAvi Kivity <avi@redhat.com>
-
由 Paul Mackerras 提交于
This allocates an array for each memory slot that is added to store the physical addresses of the pages in the slot. This array is vmalloc'd and accessed in kvmppc_h_enter using real_vmalloc_addr(). This allows us to remove the ram_pginfo field from the kvm_arch struct, and removes the 64GB guest RAM limit that we had. We use the low-order bits of the array entries to store a flag indicating that we have done get_page on the corresponding page, and therefore need to call put_page when we are finished with the page. Currently this is set for all pages except those in our special RMO regions. Signed-off-by: NPaul Mackerras <paulus@samba.org> Signed-off-by: NAlexander Graf <agraf@suse.de> Signed-off-by: NAvi Kivity <avi@redhat.com>
-
由 Paul Mackerras 提交于
This adds an array that parallels the guest hashed page table (HPT), that is, it has one entry per HPTE, used to store the guest's view of the second doubleword of the corresponding HPTE. The first doubleword in the HPTE is the same as the guest's idea of it, so we don't need to store a copy, but the second doubleword in the HPTE has the real page number rather than the guest's logical page number. This allows us to remove the back_translate() and reverse_xlate() functions. This "reverse mapping" array is vmalloc'd, meaning that to access it in real mode we have to walk the kernel's page tables explicitly. That is done by the new real_vmalloc_addr() function. (In fact this returns an address in the linear mapping, so the result is usable both in real mode and in virtual mode.) There are also some minor cleanups here: moving the definitions of HPT_ORDER etc. to a header file and defining HPT_NPTE for HPT_NPTEG << 3. Signed-off-by: NPaul Mackerras <paulus@samba.org> Signed-off-by: NAlexander Graf <agraf@suse.de> Signed-off-by: NAvi Kivity <avi@redhat.com>
-
由 Scott Wood 提交于
The hardware maintains a per-set next victim hint. Using this reduces conflicts, especially on e500v2 where a single guest TLB entry is mapped to two shadow TLB entries (user and kernel). We want those two entries to go to different TLB ways. sesel is now only used for TLB1. Reported-by: NLiu Yu <yu.liu@freescale.com> Signed-off-by: NScott Wood <scottwood@freescale.com> Signed-off-by: NAlexander Graf <agraf@suse.de> Signed-off-by: NAvi Kivity <avi@redhat.com>
-
由 Alexander Graf 提交于
When running the 64-bit Book3s PR code without CONFIG_PREEMPT_NONE, we were doing a few things wrong, most notably access to PACA fields without making sure that the pointers stay stable accross the access (preempt_disable()). This patch moves to_svcpu towards a get/put model which allows us to disable preemption while accessing the shadow vcpu fields in the PACA. That way we can run preemptible and everyone's happy! Reported-by: NJörg Sommer <joerg@alea.gnuu.de> Signed-off-by: NAlexander Graf <agraf@suse.de> Signed-off-by: NAvi Kivity <avi@redhat.com>
-
由 Scott Wood 提交于
Decrementers are now properly driven by TCR/TSR, and the guest has full read/write access to these registers. The decrementer keeps ticking (and setting the TSR bit) regardless of whether the interrupts are enabled with TCR. The decrementer stops at zero, rather than going negative. Decrementers (and FITs, once implemented) are delivered as level-triggered interrupts -- dequeued when the TSR bit is cleared, not on delivery. Signed-off-by: NLiu Yu <yu.liu@freescale.com> [scottwood@freescale.com: significant changes] Signed-off-by: NScott Wood <scottwood@freescale.com> Signed-off-by: NAlexander Graf <agraf@suse.de> Signed-off-by: NAvi Kivity <avi@redhat.com>
-
由 Scott Wood 提交于
This allows additional registers to be accessed by the guest in PR-mode KVM without trapping. SPRG4-7 are readable from userspace. On booke, KVM will sync these registers when it enters the guest, so that accesses from guest userspace will work. The guest kernel, OTOH, must consistently use either the real registers or the shared area between exits. This also applies to the already-paravirted SPRG3. On non-booke, it's not clear to what extent SPRG4-7 are supported (they're not architected for book3s, but exist on at least some classic chips). They are copied in the get/set regs ioctls, but I do not see any non-booke emulation. I also do not see any syncing with real registers (in PR-mode) including the user-readable SPRG3. This patch should not make that situation any worse. Signed-off-by: NScott Wood <scottwood@freescale.com> Signed-off-by: NAlexander Graf <agraf@suse.de> Signed-off-by: NAvi Kivity <avi@redhat.com>
-
由 Scott Wood 提交于
This function also updates paravirt int_pending, so rename it to be more obvious that this is a collection of checks run prior to (re)entering a guest. Signed-off-by: NScott Wood <scottwood@freescale.com> Signed-off-by: NAlexander Graf <agraf@suse.de> Signed-off-by: NAvi Kivity <avi@redhat.com>
-
由 Scott Wood 提交于
This implements a shared-memory API for giving host userspace access to the guest's TLB. Signed-off-by: NScott Wood <scottwood@freescale.com> Signed-off-by: NAlexander Graf <agraf@suse.de> Signed-off-by: NAvi Kivity <avi@redhat.com>
-
由 Scott Wood 提交于
Split out the portions of tlbe_priv that should be associated with host entries into tlbe_ref. Base victim selection on the number of hardware entries, not guest entries. For TLB1, where one guest entry can be mapped by multiple host entries, we use the host tlbe_ref for tracking page references. For the guest TLB0 entries, we still track it with gtlb_priv, to avoid having to retranslate if the entry is evicted from the host TLB but not the guest TLB. Signed-off-by: NScott Wood <scottwood@freescale.com> Signed-off-by: NAlexander Graf <agraf@suse.de> Signed-off-by: NAvi Kivity <avi@redhat.com>
-
由 Christian Borntraeger 提交于
On some cpus the overhead for virtualization instructions is in the same range as a system call. Having to call multiple ioctls to get set registers will make certain userspace handled exits more expensive than necessary. Lets provide a section in kvm_run that works as a shared save area for guest registers. We also provide two 64bit flags fields (architecture specific), that will specify 1. which parts of these fields are valid. 2. which registers were modified by userspace Each bit for these flag fields will define a group of registers (like general purpose) or a single register. Signed-off-by: NChristian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: NMarcelo Tosatti <mtosatti@redhat.com> Signed-off-by: NAvi Kivity <avi@redhat.com>
-
- 14 2月, 2012 2 次提交
-
-
EEH may happen during a PCI driver probe. If the driver is trying to access some register in a loop, the EEH code will try to print the driver name. But the driver pointer in struct pci_dev is not set until probe returns successfully. Use a function to test if the device and the driver pointer is NULL before accessing the driver's name. Signed-off-by: NThadeu Lima de Souza Cascardo <cascardo@linux.vnet.ibm.com> Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
-
由 Srikar Dronamraju 提交于
With this change, helpers such as instruction_pointer() et al, get defined in the generic header in terms of GET_IP Removed the unnecessary definition of profile_pc in !CONFIG_SMP case as suggested by Mike Frysinger. Signed-off-by: NSrikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: NAnanth N Mavinakayanahalli <ananth@in.ibm.com> Acked-by: NMike Frysinger <vapier@gentoo.org> Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
-
- 18 1月, 2012 1 次提交
-
-
由 Eric Paris 提交于
The audit system previously expected arches calling to audit_syscall_exit to supply as arguments if the syscall was a success and what the return code was. Audit also provides a helper AUDITSC_RESULT which was supposed to simplify things by converting from negative retcodes to an audit internal magic value stating success or failure. This helper was wrong and could indicate that a valid pointer returned to userspace was a failed syscall. The fix is to fix the layering foolishness. We now pass audit_syscall_exit a struct pt_reg and it in turns calls back into arch code to collect the return value and to determine if the syscall was a success or failure. We also define a generic is_syscall_success() macro which determines success/failure based on if the value is < -MAX_ERRNO. This works for arches like x86 which do not use a separate mechanism to indicate syscall failure. We make both the is_syscall_success() and regs_return_value() static inlines instead of macros. The reason is because the audit function must take a void* for the regs. (uml calls theirs struct uml_pt_regs instead of just struct pt_regs so audit_syscall_exit can't take a struct pt_regs). Since the audit function takes a void* we need to use static inlines to cast it back to the arch correct structure to dereference it. The other major change is that on some arches, like ia64, MIPS and ppc, we change regs_return_value() to give us the negative value on syscall failure. THE only other user of this macro, kretprobe_example.c, won't notice and it makes the value signed consistently for the audit functions across all archs. In arch/sh/kernel/ptrace_64.c I see that we were using regs[9] in the old audit code as the return value. But the ptrace_64.h code defined the macro regs_return_value() as regs[3]. I have no idea which one is correct, but this patch now uses the regs_return_value() function, so it now uses regs[3]. For powerpc we previously used regs->result but now use the regs_return_value() function which uses regs->gprs[3]. regs->gprs[3] is always positive so the regs_return_value(), much like ia64 makes it negative before calling the audit code when appropriate. Signed-off-by: NEric Paris <eparis@redhat.com> Acked-by: H. Peter Anvin <hpa@zytor.com> [for x86 portion] Acked-by: Tony Luck <tony.luck@intel.com> [for ia64] Acked-by: Richard Weinberger <richard@nod.at> [for uml] Acked-by: David S. Miller <davem@davemloft.net> [for sparc] Acked-by: Ralf Baechle <ralf@linux-mips.org> [for mips] Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> [for ppc]
-
- 07 1月, 2012 2 次提交
-
-
由 Bjorn Helgaas 提交于
CC: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: NBjorn Helgaas <bhelgaas@google.com> Signed-off-by: NJesse Barnes <jbarnes@virtuousgeek.org>
-
由 Myron Stowe 提交于
This patch converts PowerPC's architecture-specific 'pcibios_set_master()' routine to a non-inlined function. This will allow follow on patches to create a generic 'pcibios_set_master()' function using the '__weak' attribute which can be used by all architectures as a default which, if necessary, can then be over- ridden by architecture-specific code. Converting 'pci_bios_set_master()' to a non-inlined function will allow PowerPC's 'pcibios_set_master()' implementation to remain architecture-specific after the generic version is introduced and thus, not change current behavior. No functional change. Acked-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: NMyron Stowe <myron.stowe@redhat.com> Signed-off-by: NJesse Barnes <jbarnes@virtuousgeek.org>
-
- 05 1月, 2012 1 次提交
-
-
由 Prabhakar Kushwaha 提交于
Integrated Flash Controller supports various flashes like NOR, NAND and other devices using NOR, NAND and GPCM Machine available on it. IFC supports four chip selects. Signed-off-by: NDipen Dudhat <Dipen.Dudhat@freescale.com> Signed-off-by: NScott Wood <scottwood@freescale.com> Signed-off-by: NLi Yang <leoli@freescale.com> Signed-off-by: NLiu Shuo <b35362@freescale.com> Signed-off-by: NPrabhakar Kushwaha <prabhakar@freescale.com> Signed-off-by: NKumar Gala <galak@kernel.crashing.org>
-
- 04 1月, 2012 2 次提交
-
-
由 Al Viro 提交于
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
由 Al Viro 提交于
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
- 30 12月, 2011 1 次提交
-
-
由 Andreas Schwab 提交于
Commit 2a95ea6c ("procfs: do not overflow get_{idle,iowait}_time for nohz") did not take into account that one some architectures jiffies and cputime use different units. This causes get_idle_time() to return numbers in the wrong units, making the idle time fields in /proc/stat wrong. Instead of converting the usec value returned by get_cpu_{idle,iowait}_time_us to units of jiffies, use the new function usecs_to_cputime64 to convert it to the correct unit of cputime64_t. Signed-off-by: NAndreas Schwab <schwab@linux-m68k.org> Acked-by: NMichal Hocko <mhocko@suse.cz> Cc: Arnd Bergmann <arnd@arndb.de> Cc: "Artem S. Tashkinov" <t.artem@mailcity.com> Cc: Dave Jones <davej@redhat.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 27 12月, 2011 1 次提交
-
-
由 Alexander Graf 提交于
kvm.h had sparse whitespace at the end of the line. Clean it up so syncing with QEMU gets easier. Signed-off-by: NAlexander Graf <agraf@suse.de>
-
- 26 12月, 2011 1 次提交
-
-
由 Andreas Schwab 提交于
compute_tlbie_rb is only used on ppc64 and cannot be compiled on ppc32. Signed-off-by: NAndreas Schwab <schwab@linux-m68k.org> Signed-off-by: NAlexander Graf <agraf@suse.de>
-
- 22 12月, 2011 1 次提交
-
-
由 Kay Sievers 提交于
This moves the 'cpu sysdev_class' over to a regular 'cpu' subsystem and converts the devices to regular devices. The sysdev drivers are implemented as subsystem interfaces now. After all sysdev classes are ported to regular driver core entities, the sysdev implementation will be entirely removed from the kernel. Userspace relies on events and generic sysfs subsystem infrastructure from sysdev devices, which are made available with this conversion. Cc: Haavard Skinnemoen <hskinnemoen@gmail.com> Cc: Hans-Christian Egtvedt <egtvedt@samfundet.no> Cc: Tony Luck <tony.luck@intel.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Paul Mundt <lethal@linux-sh.org> Cc: "David S. Miller" <davem@davemloft.net> Cc: Chris Metcalf <cmetcalf@tilera.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Borislav Petkov <bp@amd64.org> Cc: Tigran Aivazian <tigran@aivazian.fsnet.co.uk> Cc: Len Brown <lenb@kernel.org> Cc: Zhang Rui <rui.zhang@intel.com> Cc: Dave Jones <davej@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Russell King <rmk+kernel@arm.linux.org.uk> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: "Rafael J. Wysocki" <rjw@sisk.pl> Cc: "Srivatsa S. Bhat" <srivatsa.bhat@linux.vnet.ibm.com> Signed-off-by: NKay Sievers <kay.sievers@vrfy.org> Signed-off-by: NGreg Kroah-Hartman <gregkh@suse.de>
-
- 20 12月, 2011 2 次提交
-
-
由 Suzuki Poulose 提交于
We find the runtime address of _stext and relocate ourselves based on the following calculation. virtual_base = ALIGN(KERNELBASE,KERNEL_TLB_PIN_SIZE) + MODULO(_stext.run,KERNEL_TLB_PIN_SIZE) relocate() is called with the Effective Virtual Base Address (as shown below) | Phys. Addr| Virt. Addr | Page |------------------------| Boundary | | | | | | | | | Kernel Load |___________|_ __ _ _ _ _|<- Effective Addr(_stext)| | ^ |Virt. Base Addr | | | | | | | | | |reloc_offset| | | | | | | | | | |______v_____|<-(KERNELBASE)%TLB_SIZE | | | | | | | | | Page |-----------|------------| Boundary | | | On BookE, we need __va() & __pa() early in the boot process to access the device tree. Currently this has been defined as : #define __va(x) ((void *)(unsigned long)((phys_addr_t)(x) - PHYSICAL_START + KERNELBASE) where: PHYSICAL_START is kernstart_addr - a variable updated at runtime. KERNELBASE is the compile time Virtual base address of kernel. This won't work for us, as kernstart_addr is dynamic and will yield different results for __va()/__pa() for same mapping. e.g., Let the kernel be loaded at 64MB and KERNELBASE be 0xc0000000 (same as PAGE_OFFSET). In this case, we would be mapping 0 to 0xc0000000, and kernstart_addr = 64M Now __va(1MB) = (0x100000) - (0x4000000) + 0xc0000000 = 0xbc100000 , which is wrong. it should be : 0xc0000000 + 0x100000 = 0xc0100000 On platforms which support AMP, like PPC_47x (based on 44x), the kernel could be loaded at highmem. Hence we cannot always depend on the compile time constants for mapping. Here are the possible solutions: 1) Update kernstart_addr(PHSYICAL_START) to match the Physical address of compile time KERNELBASE value, instead of the actual Physical_Address(_stext). The disadvantage is that we may break other users of PHYSICAL_START. They could be replaced with __pa(_stext). 2) Redefine __va() & __pa() with relocation offset #ifdef CONFIG_RELOCATABLE_PPC32 #define __va(x) ((void *)(unsigned long)((phys_addr_t)(x) - PHYSICAL_START + (KERNELBASE + RELOC_OFFSET))) #define __pa(x) ((unsigned long)(x) + PHYSICAL_START - (KERNELBASE + RELOC_OFFSET)) #endif where, RELOC_OFFSET could be a) A variable, say relocation_offset (like kernstart_addr), updated at boot time. This impacts performance, as we have to load an additional variable from memory. OR b) #define RELOC_OFFSET ((PHYSICAL_START & PPC_PIN_SIZE_OFFSET_MASK) - \ (KERNELBASE & PPC_PIN_SIZE_OFFSET_MASK)) This introduces more calculations for doing the translation. 3) Redefine __va() & __pa() with a new variable i.e, #define __va(x) ((void *)(unsigned long)((phys_addr_t)(x) + VIRT_PHYS_OFFSET)) where VIRT_PHYS_OFFSET : #ifdef CONFIG_RELOCATABLE_PPC32 #define VIRT_PHYS_OFFSET virt_phys_offset #else #define VIRT_PHYS_OFFSET (KERNELBASE - PHYSICAL_START) #endif /* CONFIG_RELOCATABLE_PPC32 */ where virt_phy_offset is updated at runtime to : Effective KERNELBASE - kernstart_addr. Taking our example, above: virt_phys_offset = effective_kernelstart_vaddr - kernstart_addr = 0xc0400000 - 0x400000 = 0xc0000000 and __va(0x100000) = 0xc0000000 + 0x100000 = 0xc0100000 which is what we want. I have implemented (3) in the following patch which has same cost of operation as the existing one. I have tested the patches on 440x platforms only. However this should work fine for PPC_47x also, as we only depend on the runtime address and the current TLB XLAT entry for the startup code, which is available in r25. I don't have access to a 47x board yet. So, it would be great if somebody could test this on 47x. Signed-off-by: NSuzuki K. Poulose <suzuki@in.ibm.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Kumar Gala <galak@kernel.crashing.org> Cc: linuxppc-dev <linuxppc-dev@lists.ozlabs.org> Signed-off-by: NJosh Boyer <jwboyer@gmail.com>
-
由 Suzuki Poulose 提交于
The current implementation of CONFIG_RELOCATABLE in BookE is based on mapping the page aligned kernel load address to KERNELBASE. This approach however is not enough for platforms, where the TLB page size is large (e.g, 256M on 44x). So we are renaming the RELOCATABLE used currently in BookE to DYNAMIC_MEMSTART to reflect the actual method. The CONFIG_RELOCATABLE for PPC32(BookE) based on processing of the dynamic relocations will be introduced in the later in the patch series. This change would allow the use of the old method of RELOCATABLE for platforms which can afford to enforce the page alignment (platforms with smaller TLB size). Changes since v3: * Introduced a new config, NONSTATIC_KERNEL, to denote a kernel which is either a RELOCATABLE or DYNAMIC_MEMSTART(Suggested by: Josh Boyer) Suggested-by: NScott Wood <scottwood@freescale.com> Tested-by: NScott Wood <scottwood@freescale.com> Signed-off-by: NSuzuki K. Poulose <suzuki@in.ibm.com> Cc: Scott Wood <scottwood@freescale.com> Cc: Kumar Gala <galak@kernel.crashing.org> Cc: Josh Boyer <jwboyer@gmail.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: linux ppc dev <linuxppc-dev@lists.ozlabs.org> Signed-off-by: NJosh Boyer <jwboyer@gmail.com>
-
- 19 12月, 2011 4 次提交
-
-
由 Anton Blanchard 提交于
We support 16TB of user address space and half a million contexts so update the comment to reflect this. Signed-off-by: NAnton Blanchard <anton@samba.org> Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
-
由 Andreas Schwab 提交于
Commit d57af9b2 (taskstats: use real microsecond granularity for CPU times) renamed msecs_to_cputime to usecs_to_cputime, but failed to update all numbers on the way. This causes nonsensical cpu idle/iowait values to be displayed in /proc/stat (the only user of usecs_to_cputime so far). This also renames __cputime_msec_factor to __cputime_usec_factor, adapting its value and using it directly in cputime_to_usecs instead of doing two multiplications. Signed-off-by: NAndreas Schwab <schwab@linux-m68k.org> Acked-by: NAnton Blanchard <anton@samba.org> Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
-
由 Matt Evans 提交于
PPC64 uses long long for u64 in the kernel, but powerpc's asm/types.h prevents 64-bit userland from seeing this definition, instead defaulting to u64 == long in userspace. Some user programs (e.g. kvmtool) may actually want LL64, so this patch adds a check for __SANE_USERSPACE_TYPES__ so that, if defined, int-ll64.h is included instead. Signed-off-by: NMatt Evans <matt@ozlabs.org> Acked-by: NIngo Molnar <mingo@elte.hu> Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
-
由 Anton Blanchard 提交于
Implement a POWER7 optimised copy_to_user/copy_from_user using VMX. For large aligned copies this new loop is over 10% faster, and for large unaligned copies it is over 200% faster. If we take a fault we fall back to the old version, this keeps things relatively simple and easy to verify. On POWER7 unaligned stores rarely slow down - they only flush when a store crosses a 4KB page boundary. Furthermore this flush is handled completely in hardware and should be 20-30 cycles. Unaligned loads on the other hand flush much more often - whenever crossing a 128 byte cache line, or a 32 byte sector if either sector is an L1 miss. Considering this information we really want to get the loads aligned and not worry about the alignment of the stores. Microbenchmarks confirm that this approach is much faster than the current unaligned copy loop that uses shifts and rotates to ensure both loads and stores are aligned. We also want to try and do the stores in cacheline aligned, cacheline sized chunks. If the store queue is unable to merge an entire cacheline of stores then the L2 cache will have to do a read/modify/write. Even worse, we will serialise this with the stores in the next iteration of the copy loop since both iterations hit the same cacheline. Based on this, the new loop does the following things: 1 - 127 bytes Get the source 8 byte aligned and use 8 byte loads and stores. Pretty boring and similar to how the current loop works. 128 - 4095 bytes Get the source 8 byte aligned and use 8 byte loads and stores, 1 cacheline at a time. We aren't doing the stores in cacheline aligned chunks so we will potentially serialise once per cacheline. Even so it is much better than the loop we have today. 4096 - bytes If both source and destination have the same alignment get them both 16 byte aligned, then get the destination cacheline aligned. Do cacheline sized loads and stores using VMX. If source and destination do not have the same alignment, we get the destination cacheline aligned, and use permute to do aligned loads. In both cases the VMX loop should be optimal - we always do aligned loads and stores and are always doing stores in cacheline aligned, cacheline sized chunks. To be able to use VMX we must be careful about interrupts and sleeping. We don't use the VMX loop when in an interrupt (which should be rare anyway) and we wrap the VMX loop in disable/enable_pagefault and fall back to the existing copy_tofrom_user loop if we do need to sleep. The VMX breakpoint of 4096 bytes was chosen using this microbenchmark: http://ozlabs.org/~anton/junkcode/copy_to_user.c Since we are using VMX and there is a cost to saving and restoring the user VMX state there are two broad cases we need to benchmark: - Best case - userspace never uses VMX - Worst case - userspace always uses VMX In reality a userspace process will sit somewhere between these two extremes. Since we need to test both aligned and unaligned copies we end up with 4 combinations. The point at which the VMX loop begins to win is: 0% VMX aligned 2048 bytes unaligned 2048 bytes 100% VMX aligned 16384 bytes unaligned 8192 bytes Considering this is a microbenchmark, the data is hot in cache and the VMX loop has better store queue merging properties we set the breakpoint to 4096 bytes, a little below the unaligned breakpoints. Some future optimisations we can look at: - Looking at the perf data, a significant part of the cost when a task is always using VMX is the extra exception we take to restore the VMX state. As such we should do something similar to the x86 optimisation that restores FPU state for heavy users. ie: /* * If the task has used fpu the last 5 timeslices, just do a full * restore of the math state immediately to avoid the trap; the * chances of needing FPU soon are obviously high now */ preload_fpu = tsk_used_math(next_p) && next_p->fpu_counter > 5; and /* * fpu_counter contains the number of consecutive context switches * that the FPU is used. If this is over a threshold, the lazy fpu * saving becomes unlazy to save the trap. This is an unsigned char * so that after 256 times the counter wraps and the behavior turns * lazy again; this to deal with bursty apps that only use FPU for * a short time */ - We could create a paca bit to mirror the VMX enabled MSR bit and check that first, avoiding multiple calls to calling enable_kernel_altivec. That should help with iovec based system calls like readv. - We could have two VMX breakpoints, one for when we know the user VMX state is loaded into the registers and one when it isn't. This could be a second bit in the paca so we can calculate the break points quickly. - One suggestion from Ben was to save and restore the VSX registers we use inline instead of using enable_kernel_altivec. [BenH: Fixed a problem with preempt and fixed build without CONFIG_ALTIVEC] Signed-off-by: NAnton Blanchard <anton@samba.org> Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
-
- 16 12月, 2011 1 次提交
-
-
由 Richard Kuo 提交于
As of commit dd472da3, rwsem.h was moved into asm-generic. This patch removes the arch file and points the build at its new location. Signed-off-by: NRichard Kuo <rkuo@codeaurora.org> Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
-
- 15 12月, 2011 1 次提交
-
-
由 Martin Schwidefsky 提交于
Make cputime_t and cputime64_t nocast to enable sparse checking to detect incorrect use of cputime. Drop the cputime macros for simple scalar operations. The conversion macros are still needed. Signed-off-by: NMartin Schwidefsky <schwidefsky@de.ibm.com>
-
- 09 12月, 2011 1 次提交
-
-
由 Tony Breeds 提交于
Based on original work by David 'Shaggy' Kleikamp. Signed-off-by: NTony Breeds <tony@bakeyournoodle.com> Signed-off-by: NJosh Boyer <jwboyer@gmail.com>
-