提交 · 4cf302bc106566c5bad523337296ea8b72df63f5 · openanolis / cloud-kernel

05 3月, 2012 19 次提交

KVM: PPC: Allow for read-only pages backing a Book3S HV guest · 4cf302bc

由 Paul Mackerras 提交于 12月 12, 2011

With this, if a guest does an H_ENTER with a read/write HPTE on a page
which is currently read-only, we make the actual HPTE inserted be a
read-only version of the HPTE. We now intercept protection faults as
well as HPTE not found faults, and for a protection fault we work out
whether it should be reflected to the guest (e.g. because the guest HPTE
didn't allow write access to usermode) or handled by switching to
kernel context and calling kvmppc_book3s_hv_page_fault, which will then
request write access to the page and update the actual HPTE.
Signed-off-by: NPaul Mackerras <paulus@samba.org>
Signed-off-by: NAlexander Graf <agraf@suse.de>
Signed-off-by: NAvi Kivity <avi@redhat.com>

4cf302bc

KVM: PPC: Implement MMU notifiers for Book3S HV guests · 342d3db7

由 Paul Mackerras 提交于 12月 12, 2011

This adds the infrastructure to enable us to page out pages underneath
a Book3S HV guest, on processors that support virtualized partition
memory, that is, POWER7.  Instead of pinning all the guest's pages,
we now look in the host userspace Linux page tables to find the
mapping for a given guest page.  Then, if the userspace Linux PTE
gets invalidated, kvm_unmap_hva() gets called for that address, and
we replace all the guest HPTEs that refer to that page with absent
HPTEs, i.e. ones with the valid bit clear and the HPTE_V_ABSENT bit
set, which will cause an HDSI when the guest tries to access them.
Finally, the page fault handler is extended to reinstantiate the
guest HPTE when the guest tries to access a page which has been paged
out.

Since we can't intercept the guest DSI and ISI interrupts on PPC970,
we still have to pin all the guest pages on PPC970.  We have a new flag,
kvm->arch.using_mmu_notifiers, that indicates whether we can page
guest pages out.  If it is not set, the MMU notifier callbacks do
nothing and everything operates as before.
Signed-off-by: NPaul Mackerras <paulus@samba.org>
Signed-off-by: NAlexander Graf <agraf@suse.de>
Signed-off-by: NAvi Kivity <avi@redhat.com>

342d3db7

KVM: PPC: Implement MMIO emulation support for Book3S HV guests · 697d3899

由 Paul Mackerras 提交于 12月 12, 2011

This provides the low-level support for MMIO emulation in Book3S HV
guests.  When the guest tries to map a page which is not covered by
any memslot, that page is taken to be an MMIO emulation page.  Instead
of inserting a valid HPTE, we insert an HPTE that has the valid bit
clear but another hypervisor software-use bit set, which we call
HPTE_V_ABSENT, to indicate that this is an absent page.  An
absent page is treated much like a valid page as far as guest hcalls
(H_ENTER, H_REMOVE, H_READ etc.) are concerned, except of course that
an absent HPTE doesn't need to be invalidated with tlbie since it
was never valid as far as the hardware is concerned.

When the guest accesses a page for which there is an absent HPTE, it
will take a hypervisor data storage interrupt (HDSI) since we now set
the VPM1 bit in the LPCR.  Our HDSI handler for HPTE-not-present faults
looks up the hash table and if it finds an absent HPTE mapping the
requested virtual address, will switch to kernel mode and handle the
fault in kvmppc_book3s_hv_page_fault(), which at present just calls
kvmppc_hv_emulate_mmio() to set up the MMIO emulation.

This is based on an earlier patch by Benjamin Herrenschmidt, but since
heavily reworked.
Signed-off-by: NPaul Mackerras <paulus@samba.org>
Signed-off-by: NAlexander Graf <agraf@suse.de>
Signed-off-by: NAvi Kivity <avi@redhat.com>

697d3899

KVM: PPC: Maintain a doubly-linked list of guest HPTEs for each gfn · 06ce2c63

由 Paul Mackerras 提交于 12月 12, 2011

This expands the reverse mapping array to contain two links for each
HPTE which are used to link together HPTEs that correspond to the
same guest logical page. Each circular list of HPTEs is pointed to
by the rmap array entry for the guest logical page, pointed to by
the relevant memslot. Links are 32-bit HPT entry indexes rather than
full 64-bit pointers, to save space. We use 3 of the remaining 32
bits in the rmap array entries as a lock bit, a referenced bit and
a present bit (the present bit is needed since HPTE index 0 is valid).
The bit lock for the rmap chain nests inside the HPTE lock bit.
Signed-off-by: NPaul Mackerras <paulus@samba.org>
Signed-off-by: NAlexander Graf <agraf@suse.de>
Signed-off-by: NAvi Kivity <avi@redhat.com>

06ce2c63

KVM: PPC: Allow I/O mappings in memory slots · 9d0ef5ea

由 Paul Mackerras 提交于 12月 12, 2011

This provides for the case where userspace maps an I/O device into the
address range of a memory slot using a VM_PFNMAP mapping.  In that
case, we work out the pfn from vma->vm_pgoff, and record the cache
enable bits from vma->vm_page_prot in two low-order bits in the
slot_phys array entries.  Then, in kvmppc_h_enter() we check that the
cache bits in the HPTE that the guest wants to insert match the cache
bits in the slot_phys array entry.  However, we do allow the guest to
create what it thinks is a non-cacheable or write-through mapping to
memory that is actually cacheable, so that we can use normal system
memory as part of an emulated device later on.  In that case the actual
HPTE we insert is a cacheable HPTE.
Signed-off-by: NPaul Mackerras <paulus@samba.org>
Signed-off-by: NAlexander Graf <agraf@suse.de>
Signed-off-by: NAvi Kivity <avi@redhat.com>

9d0ef5ea

KVM: PPC: Allow use of small pages to back Book3S HV guests · da9d1d7f

由 Paul Mackerras 提交于 12月 12, 2011

This relaxes the requirement that the guest memory be provided as
16MB huge pages, allowing it to be provided as normal memory, i.e.
in pages of PAGE_SIZE bytes (4k or 64k).  To allow this, we index
the kvm->arch.slot_phys[] arrays with a small page index, even if
huge pages are being used, and use the low-order 5 bits of each
entry to store the order of the enclosing page with respect to
normal pages, i.e. log_2(enclosing_page_size / PAGE_SIZE).
Signed-off-by: NPaul Mackerras <paulus@samba.org>
Signed-off-by: NAlexander Graf <agraf@suse.de>
Signed-off-by: NAvi Kivity <avi@redhat.com>

da9d1d7f

KVM: PPC: Only get pages when actually needed, not in prepare_memory_region() · c77162de

由 Paul Mackerras 提交于 12月 12, 2011

This removes the code from kvmppc_core_prepare_memory_region() that
looked up the VMA for the region being added and called hva_to_page
to get the pfns for the memory.  We have no guarantee that there will
be anything mapped there at the time of the KVM_SET_USER_MEMORY_REGION
ioctl call; userspace can do that ioctl and then map memory into the
region later.

Instead we defer looking up the pfn for each memory page until it is
needed, which generally means when the guest does an H_ENTER hcall on
the page.  Since we can't call get_user_pages in real mode, if we don't
already have the pfn for the page, kvmppc_h_enter() will return
H_TOO_HARD and we then call kvmppc_virtmode_h_enter() once we get back
to kernel context.  That calls kvmppc_get_guest_page() to get the pfn
for the page, and then calls back to kvmppc_h_enter() to redo the HPTE
insertion.

When the first vcpu starts executing, we need to have the RMO or VRMA
region mapped so that the guest's real mode accesses will work.  Thus
we now have a check in kvmppc_vcpu_run() to see if the RMO/VRMA is set
up and if not, call kvmppc_hv_setup_rma().  It checks if the memslot
starting at guest physical 0 now has RMO memory mapped there; if so it
sets it up for the guest, otherwise on POWER7 it sets up the VRMA.
The function that does that, kvmppc_map_vrma, is now a bit simpler,
as it calls kvmppc_virtmode_h_enter instead of creating the HPTE itself.

Since we are now potentially updating entries in the slot_phys[]
arrays from multiple vcpu threads, we now have a spinlock protecting
those updates to ensure that we don't lose track of any references
to pages.
Signed-off-by: NPaul Mackerras <paulus@samba.org>
Signed-off-by: NAlexander Graf <agraf@suse.de>
Signed-off-by: NAvi Kivity <avi@redhat.com>

c77162de

KVM: PPC: Make the H_ENTER hcall more reliable · 075295dd

由 Paul Mackerras 提交于 12月 12, 2011

At present, our implementation of H_ENTER only makes one try at locking
each slot that it looks at, and doesn't even retry the ldarx/stdcx.
atomic update sequence that it uses to attempt to lock the slot.  Thus
it can return the H_PTEG_FULL error unnecessarily, particularly when
the H_EXACT flag is set, meaning that the caller wants a specific PTEG
slot.

This improves the situation by making a second pass when no free HPTE
slot is found, where we spin until we succeed in locking each slot in
turn and then check whether it is full while we hold the lock.  If the
second pass fails, then we return H_PTEG_FULL.

This also moves lock_hpte to a header file (since later commits in this
series will need to use it from other source files) and renames it to
try_lock_hpte, which is a somewhat less misleading name.
Signed-off-by: NPaul Mackerras <paulus@samba.org>
Signed-off-by: NAlexander Graf <agraf@suse.de>
Signed-off-by: NAvi Kivity <avi@redhat.com>

075295dd

KVM: PPC: Add an interface for pinning guest pages in Book3s HV guests · 93e60249

由 Paul Mackerras 提交于 12月 12, 2011

This adds two new functions, kvmppc_pin_guest_page() and
kvmppc_unpin_guest_page(), and uses them to pin the guest pages where
the guest has registered areas of memory for the hypervisor to update,
(i.e. the per-cpu virtual processor areas, SLB shadow buffers and
dispatch trace logs) and then unpin them when they are no longer
required.

Although it is not strictly necessary to pin the pages at this point,
since all guest pages are already pinned, later commits in this series
will mean that guest pages aren't all pinned.
Signed-off-by: NPaul Mackerras <paulus@samba.org>
Signed-off-by: NAlexander Graf <agraf@suse.de>
Signed-off-by: NAvi Kivity <avi@redhat.com>

93e60249

KVM: PPC: Keep page physical addresses in per-slot arrays · b2b2f165

由 Paul Mackerras 提交于 12月 12, 2011

This allocates an array for each memory slot that is added to store
the physical addresses of the pages in the slot.  This array is
vmalloc'd and accessed in kvmppc_h_enter using real_vmalloc_addr().
This allows us to remove the ram_pginfo field from the kvm_arch
struct, and removes the 64GB guest RAM limit that we had.

We use the low-order bits of the array entries to store a flag
indicating that we have done get_page on the corresponding page,
and therefore need to call put_page when we are finished with the
page.  Currently this is set for all pages except those in our
special RMO regions.
Signed-off-by: NPaul Mackerras <paulus@samba.org>
Signed-off-by: NAlexander Graf <agraf@suse.de>
Signed-off-by: NAvi Kivity <avi@redhat.com>

b2b2f165

KVM: PPC: Keep a record of HV guest view of hashed page table entries · 8936dda4

由 Paul Mackerras 提交于 12月 12, 2011

This adds an array that parallels the guest hashed page table (HPT),
that is, it has one entry per HPTE, used to store the guest's view
of the second doubleword of the corresponding HPTE.  The first
doubleword in the HPTE is the same as the guest's idea of it, so we
don't need to store a copy, but the second doubleword in the HPTE has
the real page number rather than the guest's logical page number.
This allows us to remove the back_translate() and reverse_xlate()
functions.

This "reverse mapping" array is vmalloc'd, meaning that to access it
in real mode we have to walk the kernel's page tables explicitly.
That is done by the new real_vmalloc_addr() function.  (In fact this
returns an address in the linear mapping, so the result is usable
both in real mode and in virtual mode.)

There are also some minor cleanups here: moving the definitions of
HPT_ORDER etc. to a header file and defining HPT_NPTE for HPT_NPTEG << 3.
Signed-off-by: NPaul Mackerras <paulus@samba.org>
Signed-off-by: NAlexander Graf <agraf@suse.de>
Signed-off-by: NAvi Kivity <avi@redhat.com>

8936dda4

KVM: PPC: e500: use hardware hint when loading TLB0 entries · 57013524

由 Scott Wood 提交于 11月 29, 2011

The hardware maintains a per-set next victim hint.  Using this
reduces conflicts, especially on e500v2 where a single guest
TLB entry is mapped to two shadow TLB entries (user and kernel).
We want those two entries to go to different TLB ways.

sesel is now only used for TLB1.
Reported-by: NLiu Yu <yu.liu@freescale.com>
Signed-off-by: NScott Wood <scottwood@freescale.com>
Signed-off-by: NAlexander Graf <agraf@suse.de>
Signed-off-by: NAvi Kivity <avi@redhat.com>

57013524

KVM: PPC: Use get/set for to_svcpu to help preemption · 468a12c2

由 Alexander Graf 提交于 12月 09, 2011

When running the 64-bit Book3s PR code without CONFIG_PREEMPT_NONE, we were
doing a few things wrong, most notably access to PACA fields without making
sure that the pointers stay stable accross the access (preempt_disable()).

This patch moves to_svcpu towards a get/put model which allows us to disable
preemption while accessing the shadow vcpu fields in the PACA. That way we
can run preemptible and everyone's happy!
Reported-by: NJörg Sommer <joerg@alea.gnuu.de>
Signed-off-by: NAlexander Graf <agraf@suse.de>
Signed-off-by: NAvi Kivity <avi@redhat.com>

468a12c2

KVM: PPC: booke: Improve timer register emulation · dfd4d47e

由 Scott Wood 提交于 11月 17, 2011

Decrementers are now properly driven by TCR/TSR, and the guest
has full read/write access to these registers.

The decrementer keeps ticking (and setting the TSR bit) regardless of
whether the interrupts are enabled with TCR.

The decrementer stops at zero, rather than going negative.

Decrementers (and FITs, once implemented) are delivered as
level-triggered interrupts -- dequeued when the TSR bit is cleared, not
on delivery.
Signed-off-by: NLiu Yu <yu.liu@freescale.com>
[scottwood@freescale.com: significant changes]
Signed-off-by: NScott Wood <scottwood@freescale.com>
Signed-off-by: NAlexander Graf <agraf@suse.de>
Signed-off-by: NAvi Kivity <avi@redhat.com>

dfd4d47e

KVM: PPC: Paravirtualize SPRG4-7, ESR, PIR, MASn · b5904972

由 Scott Wood 提交于 11月 08, 2011

This allows additional registers to be accessed by the guest
in PR-mode KVM without trapping.

SPRG4-7 are readable from userspace.  On booke, KVM will sync
these registers when it enters the guest, so that accesses from
guest userspace will work.  The guest kernel, OTOH, must consistently
use either the real registers or the shared area between exits.  This
also applies to the already-paravirted SPRG3.

On non-booke, it's not clear to what extent SPRG4-7 are supported
(they're not architected for book3s, but exist on at least some classic
chips).  They are copied in the get/set regs ioctls, but I do not see any
non-booke emulation.  I also do not see any syncing with real registers
(in PR-mode) including the user-readable SPRG3.  This patch should not
make that situation any worse.
Signed-off-by: NScott Wood <scottwood@freescale.com>
Signed-off-by: NAlexander Graf <agraf@suse.de>
Signed-off-by: NAvi Kivity <avi@redhat.com>

b5904972

KVM: PPC: Rename deliver_interrupts to prepare_to_enter · 7e28e60e

由 Scott Wood 提交于 11月 08, 2011

This function also updates paravirt int_pending, so rename it
to be more obvious that this is a collection of checks run prior
to (re)entering a guest.
Signed-off-by: NScott Wood <scottwood@freescale.com>
Signed-off-by: NAlexander Graf <agraf@suse.de>
Signed-off-by: NAvi Kivity <avi@redhat.com>

7e28e60e

KVM: PPC: e500: MMU API · dc83b8bc

由 Scott Wood 提交于 8月 18, 2011

This implements a shared-memory API for giving host userspace access to
the guest's TLB.
Signed-off-by: NScott Wood <scottwood@freescale.com>
Signed-off-by: NAlexander Graf <agraf@suse.de>
Signed-off-by: NAvi Kivity <avi@redhat.com>

dc83b8bc

KVM: PPC: e500: clear up confusion between host and guest entries · 0164c0f0

由 Scott Wood 提交于 8月 18, 2011

Split out the portions of tlbe_priv that should be associated with host
entries into tlbe_ref.  Base victim selection on the number of hardware
entries, not guest entries.

For TLB1, where one guest entry can be mapped by multiple host entries,
we use the host tlbe_ref for tracking page references.  For the guest
TLB0 entries, we still track it with gtlb_priv, to avoid having to
retranslate if the entry is evicted from the host TLB but not the
guest TLB.
Signed-off-by: NScott Wood <scottwood@freescale.com>
Signed-off-by: NAlexander Graf <agraf@suse.de>
Signed-off-by: NAvi Kivity <avi@redhat.com>

0164c0f0

KVM: provide synchronous registers in kvm_run · b9e5dc8d

由 Christian Borntraeger 提交于 1月 11, 2012

On some cpus the overhead for virtualization instructions is in the same
range as a system call. Having to call multiple ioctls to get set registers
will make certain userspace handled exits more expensive than necessary.
Lets provide a section in kvm_run that works as a shared save area
for guest registers.
We also provide two 64bit flags fields (architecture specific), that will
specify
1. which parts of these fields are valid.
2. which registers were modified by userspace

Each bit for these flag fields will define a group of registers (like
general purpose) or a single register.
Signed-off-by: NChristian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: NMarcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: NAvi Kivity <avi@redhat.com>

b9e5dc8d

14 2月, 2012 2 次提交

powerpc/pseries/eeh: Fix crash when error happens during device probe · 778a785f

由 Thadeu Lima de Souza Cascardo 提交于 1月 11, 2012

EEH may happen during a PCI driver probe. If the driver is trying to
access some register in a loop, the EEH code will try to print the
driver name. But the driver pointer in struct pci_dev is not set until
probe returns successfully.

Use a function to test if the device and the driver pointer is NULL
before accessing the driver's name.
Signed-off-by: NThadeu Lima de Souza Cascardo <cascardo@linux.vnet.ibm.com>
Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>

778a785f

powerpc: Implement GET_IP/SET_IP · e6289427

由 Srikar Dronamraju 提交于 2月 08, 2012

With this change, helpers such as instruction_pointer() et al, get defined
in the generic header in terms of GET_IP

Removed the unnecessary definition of profile_pc in !CONFIG_SMP case as
suggested by Mike Frysinger.
Signed-off-by: NSrikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: NAnanth N Mavinakayanahalli <ananth@in.ibm.com>
Acked-by: NMike Frysinger <vapier@gentoo.org>
Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>

e6289427

18 1月, 2012 1 次提交

Audit: push audit success and retcode into arch ptrace.h · d7e7528b

由 Eric Paris 提交于 1月 03, 2012

The audit system previously expected arches calling to audit_syscall_exit to
supply as arguments if the syscall was a success and what the return code was.
Audit also provides a helper AUDITSC_RESULT which was supposed to simplify things
by converting from negative retcodes to an audit internal magic value stating
success or failure. This helper was wrong and could indicate that a valid
pointer returned to userspace was a failed syscall. The fix is to fix the
layering foolishness. We now pass audit_syscall_exit a struct pt_reg and it
in turns calls back into arch code to collect the return value and to
determine if the syscall was a success or failure. We also define a generic
is_syscall_success() macro which determines success/failure based on if the
value is < -MAX_ERRNO. This works for arches like x86 which do not use a
separate mechanism to indicate syscall failure.

We make both the is_syscall_success() and regs_return_value() static inlines
instead of macros. The reason is because the audit function must take a void*
for the regs. (uml calls theirs struct uml_pt_regs instead of just struct
pt_regs so audit_syscall_exit can't take a struct pt_regs). Since the audit
function takes a void* we need to use static inlines to cast it back to the
arch correct structure to dereference it.

The other major change is that on some arches, like ia64, MIPS and ppc, we
change regs_return_value() to give us the negative value on syscall failure.
THE only other user of this macro, kretprobe_example.c, won't notice and it
makes the value signed consistently for the audit functions across all archs.

In arch/sh/kernel/ptrace_64.c I see that we were using regs[9] in the old
audit code as the return value. But the ptrace_64.h code defined the macro
regs_return_value() as regs[3]. I have no idea which one is correct, but this
patch now uses the regs_return_value() function, so it now uses regs[3].

For powerpc we previously used regs->result but now use the
regs_return_value() function which uses regs->gprs[3]. regs->gprs[3] is
always positive so the regs_return_value(), much like ia64 makes it negative
before calling the audit code when appropriate.
Signed-off-by: NEric Paris <eparis@redhat.com>
Acked-by: H. Peter Anvin <hpa@zytor.com> [for x86 portion]
Acked-by: Tony Luck <tony.luck@intel.com> [for ia64]
Acked-by: Richard Weinberger <richard@nod.at> [for uml]
Acked-by: David S. Miller <davem@davemloft.net> [for sparc]
Acked-by: Ralf Baechle <ralf@linux-mips.org> [for mips]
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> [for ppc]

d7e7528b

07 1月, 2012 2 次提交

powerpc/PCI: make pcibios_setup_phb_resources() static · a46770f5

由 Bjorn Helgaas 提交于 10月 28, 2011

CC: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: NBjorn Helgaas <bhelgaas@google.com>
Signed-off-by: NJesse Barnes <jbarnes@virtuousgeek.org>

a46770f5

PCI: PowerPC: convert pcibios_set_master() to a non-inlined function · 79c8be83

由 Myron Stowe 提交于 10月 28, 2011

This patch converts PowerPC's architecture-specific
'pcibios_set_master()' routine to a non-inlined function.  This will
allow follow on patches to create a generic 'pcibios_set_master()'
function using the '__weak' attribute which can be used by all
architectures as a default which, if necessary, can then be over-
ridden by architecture-specific code.

Converting 'pci_bios_set_master()' to a non-inlined function will
allow PowerPC's 'pcibios_set_master()' implementation to remain
architecture-specific after the generic version is introduced and
thus, not change current behavior.

No functional change.
Acked-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: NMyron Stowe <myron.stowe@redhat.com>
Signed-off-by: NJesse Barnes <jbarnes@virtuousgeek.org>

79c8be83

05 1月, 2012 1 次提交

powerpc/fsl: Add support for Integrated Flash Controller · a20cbdef

由 Prabhakar Kushwaha 提交于 12月 27, 2011

Integrated Flash Controller supports various flashes like NOR, NAND
and other devices using NOR, NAND and GPCM Machine available on it.
IFC supports four chip selects.
Signed-off-by: NDipen Dudhat <Dipen.Dudhat@freescale.com>
Signed-off-by: NScott Wood <scottwood@freescale.com>
Signed-off-by: NLi Yang <leoli@freescale.com>
Signed-off-by: NLiu Shuo <b35362@freescale.com>
Signed-off-by: NPrabhakar Kushwaha <prabhakar@freescale.com>
Signed-off-by: NKumar Gala <galak@kernel.crashing.org>

a20cbdef

04 1月, 2012 2 次提交
- A
  consolidate umode_t declarations · 0583fcc9
  由 Al Viro 提交于 7月 26, 2011
```
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
```
  0583fcc9
- A
  switch spufs guts to umode_t · c6684b26
  由 Al Viro 提交于 7月 26, 2011
```
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
```
  c6684b26
30 12月, 2011 1 次提交

procfs: do not confuse jiffies with cputime64_t · 34845636

由 Andreas Schwab 提交于 12月 28, 2011

Commit 2a95ea6c ("procfs: do not overflow get_{idle,iowait}_time
for nohz") did not take into account that one some architectures jiffies
and cputime use different units.

This causes get_idle_time() to return numbers in the wrong units, making
the idle time fields in /proc/stat wrong.

Instead of converting the usec value returned by
get_cpu_{idle,iowait}_time_us to units of jiffies, use the new function
usecs_to_cputime64 to convert it to the correct unit of cputime64_t.
Signed-off-by: NAndreas Schwab <schwab@linux-m68k.org>
Acked-by: NMichal Hocko <mhocko@suse.cz>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: "Artem S. Tashkinov" <t.artem@mailcity.com>
Cc: Dave Jones <davej@redhat.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

34845636

27 12月, 2011 1 次提交

KVM: PPC: Whitespace fix for kvm.h · da69dee0

由 Alexander Graf 提交于 9月 19, 2011

kvm.h had sparse whitespace at the end of the line. Clean it
up so syncing with QEMU gets easier.
Signed-off-by: NAlexander Graf <agraf@suse.de>

da69dee0

26 12月, 2011 1 次提交

KVM: PPC: move compute_tlbie_rb to book3s_64 common header · 36cc66d6

由 Andreas Schwab 提交于 11月 08, 2011

compute_tlbie_rb is only used on ppc64 and cannot be compiled on ppc32.
Signed-off-by: NAndreas Schwab <schwab@linux-m68k.org>
Signed-off-by: NAlexander Graf <agraf@suse.de>

36cc66d6

22 12月, 2011 1 次提交

cpu: convert 'cpu' and 'machinecheck' sysdev_class to a regular subsystem · 8a25a2fd

由 Kay Sievers 提交于 12月 21, 2011

This moves the 'cpu sysdev_class' over to a regular 'cpu' subsystem
and converts the devices to regular devices. The sysdev drivers are
implemented as subsystem interfaces now.

After all sysdev classes are ported to regular driver core entities, the
sysdev implementation will be entirely removed from the kernel.

Userspace relies on events and generic sysfs subsystem infrastructure
from sysdev devices, which are made available with this conversion.

Cc: Haavard Skinnemoen <hskinnemoen@gmail.com>
Cc: Hans-Christian Egtvedt <egtvedt@samfundet.no>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Borislav Petkov <bp@amd64.org>
Cc: Tigran Aivazian <tigran@aivazian.fsnet.co.uk>
Cc: Len Brown <lenb@kernel.org>
Cc: Zhang Rui <rui.zhang@intel.com>
Cc: Dave Jones <davej@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Russell King <rmk+kernel@arm.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: "Srivatsa S. Bhat" <srivatsa.bhat@linux.vnet.ibm.com>
Signed-off-by: NKay Sievers <kay.sievers@vrfy.org>
Signed-off-by: NGreg Kroah-Hartman <gregkh@suse.de>

8a25a2fd

20 12月, 2011 2 次提交

powerpc: Define virtual-physical translations for RELOCATABLE · 368ff8f1

由 Suzuki Poulose 提交于 12月 14, 2011

We find the runtime address of _stext and relocate ourselves based
on the following calculation.

	virtual_base = ALIGN(KERNELBASE,KERNEL_TLB_PIN_SIZE) +
			MODULO(_stext.run,KERNEL_TLB_PIN_SIZE)

relocate() is called with the Effective Virtual Base Address (as
shown below)

            | Phys. Addr| Virt. Addr |
Page        |------------------------|
Boundary    |           |            |
            |           |            |
            |           |            |
Kernel Load |___________|_ __ _ _ _ _|<- Effective
Addr(_stext)|           |      ^     |Virt. Base Addr
            |           |      |     |
            |           |      |     |
            |           |reloc_offset|
            |           |      |     |
            |           |      |     |
            |           |______v_____|<-(KERNELBASE)%TLB_SIZE
            |           |            |
            |           |            |
            |           |            |
Page        |-----------|------------|
Boundary    |           |            |

On BookE, we need __va() & __pa() early in the boot process to access
the device tree.

Currently this has been defined as :

#define __va(x) ((void *)(unsigned long)((phys_addr_t)(x) -
						PHYSICAL_START + KERNELBASE)
where:
 PHYSICAL_START is kernstart_addr - a variable updated at runtime.
 KERNELBASE	is the compile time Virtual base address of kernel.

This won't work for us, as kernstart_addr is dynamic and will yield different
results for __va()/__pa() for same mapping.

e.g.,

Let the kernel be loaded at 64MB and KERNELBASE be 0xc0000000 (same as
PAGE_OFFSET).

In this case, we would be mapping 0 to 0xc0000000, and kernstart_addr = 64M

Now __va(1MB) = (0x100000) - (0x4000000) + 0xc0000000
		= 0xbc100000 , which is wrong.

it should be : 0xc0000000 + 0x100000 = 0xc0100000

On platforms which support AMP, like PPC_47x (based on 44x), the kernel
could be loaded at highmem. Hence we cannot always depend on the compile
time constants for mapping.

Here are the possible solutions:

1) Update kernstart_addr(PHSYICAL_START) to match the Physical address of
compile time KERNELBASE value, instead of the actual Physical_Address(_stext).

The disadvantage is that we may break other users of PHYSICAL_START. They
could be replaced with __pa(_stext).

2) Redefine __va() & __pa() with relocation offset

#ifdef	CONFIG_RELOCATABLE_PPC32
#define __va(x) ((void *)(unsigned long)((phys_addr_t)(x) - PHYSICAL_START + (KERNELBASE + RELOC_OFFSET)))
#define __pa(x) ((unsigned long)(x) + PHYSICAL_START - (KERNELBASE + RELOC_OFFSET))
#endif

where, RELOC_OFFSET could be

  a) A variable, say relocation_offset (like kernstart_addr), updated
     at boot time. This impacts performance, as we have to load an additional
     variable from memory.

		OR

  b) #define RELOC_OFFSET ((PHYSICAL_START & PPC_PIN_SIZE_OFFSET_MASK) - \
                      (KERNELBASE & PPC_PIN_SIZE_OFFSET_MASK))

   This introduces more calculations for doing the translation.

3) Redefine __va() & __pa() with a new variable

i.e,

#define __va(x) ((void *)(unsigned long)((phys_addr_t)(x) + VIRT_PHYS_OFFSET))

where VIRT_PHYS_OFFSET :

#ifdef CONFIG_RELOCATABLE_PPC32
#define VIRT_PHYS_OFFSET virt_phys_offset
#else
#define VIRT_PHYS_OFFSET (KERNELBASE - PHYSICAL_START)
#endif /* CONFIG_RELOCATABLE_PPC32 */

where virt_phy_offset is updated at runtime to :

	Effective KERNELBASE - kernstart_addr.

Taking our example, above:

virt_phys_offset = effective_kernelstart_vaddr - kernstart_addr
		 = 0xc0400000 - 0x400000
		 = 0xc0000000
	and

	__va(0x100000) = 0xc0000000 + 0x100000 = 0xc0100000
	 which is what we want.

I have implemented (3) in the following patch which has same cost of
operation as the existing one.

I have tested the patches on 440x platforms only. However this should
work fine for PPC_47x also, as we only depend on the runtime address
and the current TLB XLAT entry for the startup code, which is available
in r25. I don't have access to a 47x board yet. So, it would be great if
somebody could test this on 47x.
Signed-off-by: NSuzuki K. Poulose <suzuki@in.ibm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Kumar Gala <galak@kernel.crashing.org>
Cc: linuxppc-dev <linuxppc-dev@lists.ozlabs.org>
Signed-off-by: NJosh Boyer <jwboyer@gmail.com>

368ff8f1

powerpc: Rename mapping based RELOCATABLE to DYNAMIC_MEMSTART for BookE · 0f890c8d

由 Suzuki Poulose 提交于 12月 14, 2011

The current implementation of CONFIG_RELOCATABLE in BookE is based
on mapping the page aligned kernel load address to KERNELBASE. This
approach however is not enough for platforms, where the TLB page size
is large (e.g, 256M on 44x). So we are renaming the RELOCATABLE used
currently in BookE to DYNAMIC_MEMSTART to reflect the actual method.

The CONFIG_RELOCATABLE for PPC32(BookE) based on processing of the
dynamic relocations will be introduced in the later in the patch series.

This change would allow the use of the old method of RELOCATABLE for
platforms which can afford to enforce the page alignment (platforms with
smaller TLB size).

Changes since v3:

* Introduced a new config, NONSTATIC_KERNEL, to denote a kernel which is
  either a RELOCATABLE or DYNAMIC_MEMSTART(Suggested by: Josh Boyer)
Suggested-by: NScott Wood <scottwood@freescale.com>
Tested-by: NScott Wood <scottwood@freescale.com>
Signed-off-by: NSuzuki K. Poulose <suzuki@in.ibm.com>
Cc: Scott Wood <scottwood@freescale.com>
Cc: Kumar Gala <galak@kernel.crashing.org>
Cc: Josh Boyer <jwboyer@gmail.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: linux ppc dev <linuxppc-dev@lists.ozlabs.org>
Signed-off-by: NJosh Boyer <jwboyer@gmail.com>

0f890c8d

19 12月, 2011 4 次提交

powerpc: Fix comment explaining our VSID layout · b206590c

由 Anton Blanchard 提交于 12月 12, 2011

We support 16TB of user address space and half a million contexts
so update the comment to reflect this.
Signed-off-by: NAnton Blanchard <anton@samba.org>
Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>

b206590c

powerpc: Fix wrong divisor in usecs_to_cputime · 9f5072d4

由 Andreas Schwab 提交于 12月 09, 2011

Commit d57af9b2 (taskstats: use real microsecond granularity for CPU times)
renamed msecs_to_cputime to usecs_to_cputime, but failed to update all
numbers on the way. This causes nonsensical cpu idle/iowait values to be
displayed in /proc/stat (the only user of usecs_to_cputime so far).

This also renames __cputime_msec_factor to __cputime_usec_factor, adapting
its value and using it directly in cputime_to_usecs instead of doing two
multiplications.
Signed-off-by: NAndreas Schwab <schwab@linux-m68k.org>
Acked-by: NAnton Blanchard <anton@samba.org>
Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>

9f5072d4

powerpc: Add __SANE_USERSPACE_TYPES__ to asm/types.h for LL64 · 2c9c6ce0

由 Matt Evans 提交于 12月 07, 2011

PPC64 uses long long for u64 in the kernel, but powerpc's asm/types.h
prevents 64-bit userland from seeing this definition, instead defaulting
to u64 == long in userspace.  Some user programs (e.g. kvmtool) may actually
want LL64, so this patch adds a check for __SANE_USERSPACE_TYPES__ so that,
if defined, int-ll64.h is included instead.
Signed-off-by: NMatt Evans <matt@ozlabs.org>
Acked-by: NIngo Molnar <mingo@elte.hu>
Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>

2c9c6ce0

powerpc: POWER7 optimised copy_to_user/copy_from_user using VMX · a66086b8

由 Anton Blanchard 提交于 12月 07, 2011

Implement a POWER7 optimised copy_to_user/copy_from_user using VMX.
For large aligned copies this new loop is over 10% faster, and for
large unaligned copies it is over 200% faster.

If we take a fault we fall back to the old version, this keeps
things relatively simple and easy to verify.

On POWER7 unaligned stores rarely slow down - they only flush when
a store crosses a 4KB page boundary. Furthermore this flush is
handled completely in hardware and should be 20-30 cycles.

Unaligned loads on the other hand flush much more often - whenever
crossing a 128 byte cache line, or a 32 byte sector if either sector
is an L1 miss.

Considering this information we really want to get the loads aligned
and not worry about the alignment of the stores. Microbenchmarks
confirm that this approach is much faster than the current unaligned
copy loop that uses shifts and rotates to ensure both loads and
stores are aligned.

We also want to try and do the stores in cacheline aligned, cacheline
sized chunks. If the store queue is unable to merge an entire
cacheline of stores then the L2 cache will have to do a
read/modify/write. Even worse, we will serialise this with the stores
in the next iteration of the copy loop since both iterations hit
the same cacheline.

Based on this, the new loop does the following things:

1 - 127 bytes
Get the source 8 byte aligned and use 8 byte loads and stores. Pretty
boring and similar to how the current loop works.

128 - 4095 bytes
Get the source 8 byte aligned and use 8 byte loads and stores,
1 cacheline at a time. We aren't doing the stores in cacheline
aligned chunks so we will potentially serialise once per cacheline.
Even so it is much better than the loop we have today.

4096 - bytes
If both source and destination have the same alignment get them both
16 byte aligned, then get the destination cacheline aligned. Do
cacheline sized loads and stores using VMX.

If source and destination do not have the same alignment, we get the
destination cacheline aligned, and use permute to do aligned loads.

In both cases the VMX loop should be optimal - we always do aligned
loads and stores and are always doing stores in cacheline aligned,
cacheline sized chunks.

To be able to use VMX we must be careful about interrupts and
sleeping. We don't use the VMX loop when in an interrupt (which should
be rare anyway) and we wrap the VMX loop in disable/enable_pagefault
and fall back to the existing copy_tofrom_user loop if we do need to
sleep.

The VMX breakpoint of 4096 bytes was chosen using this microbenchmark:

http://ozlabs.org/~anton/junkcode/copy_to_user.c

Since we are using VMX and there is a cost to saving and restoring
the user VMX state there are two broad cases we need to benchmark:

- Best case - userspace never uses VMX

- Worst case - userspace always uses VMX

In reality a userspace process will sit somewhere between these two
extremes. Since we need to test both aligned and unaligned copies we
end up with 4 combinations. The point at which the VMX loop begins to
win is:

0% VMX
aligned		2048 bytes
unaligned	2048 bytes

100% VMX
aligned		16384 bytes
unaligned	8192 bytes

Considering this is a microbenchmark, the data is hot in cache and
the VMX loop has better store queue merging properties we set the
breakpoint to 4096 bytes, a little below the unaligned breakpoints.

Some future optimisations we can look at:

- Looking at the perf data, a significant part of the cost when a
  task is always using VMX is the extra exception we take to restore
  the VMX state. As such we should do something similar to the x86
  optimisation that restores FPU state for heavy users. ie:

        /*
         * If the task has used fpu the last 5 timeslices, just do a full
         * restore of the math state immediately to avoid the trap; the
         * chances of needing FPU soon are obviously high now
         */
        preload_fpu = tsk_used_math(next_p) && next_p->fpu_counter > 5;

  and

        /*
         * fpu_counter contains the number of consecutive context switches
         * that the FPU is used. If this is over a threshold, the lazy fpu
         * saving becomes unlazy to save the trap. This is an unsigned char
         * so that after 256 times the counter wraps and the behavior turns
         * lazy again; this to deal with bursty apps that only use FPU for
         * a short time
         */

- We could create a paca bit to mirror the VMX enabled MSR bit and check
  that first, avoiding multiple calls to calling enable_kernel_altivec.
  That should help with iovec based system calls like readv.

- We could have two VMX breakpoints, one for when we know the user VMX
  state is loaded into the registers and one when it isn't. This could
  be a second bit in the paca so we can calculate the break points quickly.

- One suggestion from Ben was to save and restore the VSX registers
  we use inline instead of using enable_kernel_altivec.

[BenH: Fixed a problem with preempt and fixed build without CONFIG_ALTIVEC]
Signed-off-by: NAnton Blanchard <anton@samba.org>
Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>

a66086b8

16 12月, 2011 1 次提交

powerpc: Use rwsem.h from generic location · 0766387b

由 Richard Kuo 提交于 11月 29, 2011

As of commit dd472da3, rwsem.h was moved into asm-generic.
This patch removes the arch file and points the build at
its new location.
Signed-off-by: NRichard Kuo <rkuo@codeaurora.org>
Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>

0766387b

15 12月, 2011 1 次提交

[S390] cputime: add sparse checking and cleanup · 64861634

由 Martin Schwidefsky 提交于 12月 15, 2011

Make cputime_t and cputime64_t nocast to enable sparse checking to
detect incorrect use of cputime. Drop the cputime macros for simple
scalar operations. The conversion macros are still needed.
Signed-off-by: NMartin Schwidefsky <schwidefsky@de.ibm.com>

64861634

09 12月, 2011 1 次提交

powerpc/476fpe: Add 476fpe SoC code · df777bd3

由 Tony Breeds 提交于 11月 30, 2011

Based on original work by David 'Shaggy' Kleikamp.
Signed-off-by: NTony Breeds <tony@bakeyournoodle.com>
Signed-off-by: NJosh Boyer <jwboyer@gmail.com>

df777bd3

openanolis / cloud-kernel 1 年多 前同步成功

openanolis / cloud-kernel
1 年多前同步成功