提交 · fb5a515704d7e84c139140a83c5eff515adfc000 · gsplhtlxg / clone-Linux

11 6月, 2014 4 次提交

powerpc: Remove platforms/wsp and associated pieces · fb5a5157

由 Michael Ellerman 提交于 6月 02, 2014

__attribute__ ((unused))

WSP is the last user of CONFIG_PPC_A2, so we remove that as well.

Although CONFIG_PPC_ICSWX still exists, it's no longer selectable for
any Book3E platform, so we can remove the code in mmu-book3e.h that
depended on it.
Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>

fb5a5157

powerpc: Remove check for CONFIG_SERIAL_TEXT_DEBUG · 94314290

由 Paul Bolle 提交于 5月 20, 2014

The Kconfig symbol SERIAL_TEXT_DEBUG was removed from
arch/powerpc/Kconfig.debug in v2.6.22. (In v2.6.27 it was also removed
from arch/ppc/Kconfig.debug.) So the check for its macro has evaluated
to false for over five years now. Remove that check and the few lines
of code hidden behind it.
Signed-off-by: NPaul Bolle <pebolle@tiscali.nl>
Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>

94314290

powerpc: Add AT_HWCAP2 to indicate V.CRYPTO category support · dd58a092

由 Benjamin Herrenschmidt 提交于 6月 10, 2014

The Vector Crypto category instructions are supported by current POWER8
chips, advertise them to userspace using a specific bit to properly
differentiate with chips of the same architecture level that might not
have them.
Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
CC: <stable@vger.kernel.org> [v3.10+]

dd58a092

booke/watchdog: refine and clean up the codes · d2deebab

由 Tang Yuantian 提交于 5月 08, 2014

Basically, this patch does the following:
1. Move the codes of parsing boot parameters from setup-common.c
   to driver. In this way, code reader can know directly that
   there are boot parameters that can change the timeout.
2. Make boot parameter 'booke_wdt_period' effective.
   currently, when driver is loaded, default timeout is always
   being used in stead of booke_wdt_period.
3. Wrap up the watchdog timeout in device struct and clean up
   unnecessary codes.
Signed-off-by: NTang Yuantian <yuantian.tang@freescale.com>
Acked-by: NScott Wood <scottwood@freescale.com>
Reviewed-by: NLi Yang <leoli@freescale.com>
Reviewed-by: NGuenter Roeck <linux@roeck-us.net>
Signed-off-by: NWim Van Sebroeck <wim@iguana.be>

d2deebab

07 6月, 2014 2 次提交

powerpc: update comments for generic idle conversion · 0d2b7ea9

由 Geert Uytterhoeven 提交于 6月 06, 2014

As of commit 799fef06 ("powerpc: Use generic idle loop"), this
applies to arch_cpu_idle() instead of cpu_idle().
Signed-off-by: NGeert Uytterhoeven <geert+renesas@glider.be>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

0d2b7ea9

powerpc/powernv: Add missing include to LPC code · 0c0a3e5a

由 Benjamin Herrenschmidt 提交于 6月 07, 2014

kbuild bot spotted that one:

  arch/powerpc/platforms/powernv/opal-lpc.c: In function 'opal_lpc_init_debugfs':
>> arch/powerpc/platforms/powernv/opal-lpc.c:319:35: error: 'powerpc_debugfs_root' undeclared (first use in this function)
     root = debugfs_create_dir("lpc", powerpc_debugfs_root);
                                      ^
We neet to include the definition explicitely.
Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>

0c0a3e5a

06 6月, 2014 1 次提交

powerpc/mm: Check paca psize is up to date for huge mappings · 09567e7f

由 Michael Ellerman 提交于 5月 28, 2014

We have a bug in our hugepage handling which exhibits as an infinite
loop of hash faults. If the fault is being taken in the kernel it will
typically trigger the softlockup detector, or the RCU stall detector.

The bug is as follows:

 1. mmap(0xa0000000, ..., MAP_FIXED | MAP_HUGE_TLB | MAP_ANONYMOUS ..)
 2. Slice code converts the slice psize to 16M.
 3. The code on lines 539-540 of slice.c in slice_get_unmapped_area()
    synchronises the mm->context with the paca->context. So the paca slice
    mask is updated to include the 16M slice.
 3. Either:
    * mmap() fails because there are no huge pages available.
    * mmap() succeeds and the mapping is then munmapped.
    In both cases the slice psize remains at 16M in both the paca & mm.
 4. mmap(0xa0000000, ..., MAP_FIXED | MAP_ANONYMOUS ..)
 5. The slice psize is converted back to 64K. Because of the check on line 539
    of slice.c we DO NOT update the paca->context. The paca slice mask is now
    out of sync with the mm slice mask.
 6. User/kernel accesses 0xa0000000.
 7. The SLB miss handler slb_allocate_realmode() **uses the paca slice mask**
    to create an SLB entry and inserts it in the SLB.
18. With the 16M SLB entry in place the hardware does a hash lookup, no entry
    is found so a data access exception is generated.
19. The data access handler calls do_page_fault() -> handle_mm_fault().
10. __handle_mm_fault() creates a THP mapping with do_huge_pmd_anonymous_page().
11. The hardware retries the access, there is still nothing in the hash table
    so once again a data access exception is generated.
12. hash_page() calls into __hash_page_thp() and inserts a mapping in the
    hash. Although the THP mapping maps 16M the hashing is done using 64K
    as the segment page size.
13. hash_page() returns immediately after calling __hash_page_thp(), skipping
    over the code at line 1125. Resulting in the mismatch between the
    paca->context and mm->context not being detected.
14. The hardware retries the access, the hash it generates using the 16M
    SLB entry does NOT match the hash we inserted.
15. We take another data access and go into __hash_page_thp().
16. We see a valid entry in the hpte_slot_array and so we call updatepp()
    which succeeds.
17. Goto 14.

We could fix this in two ways. The first would be to remove or modify
the check on line 539 of slice.c.

The second option is to cause the check of paca psize in hash_page() on
line 1125 to also be done for THP pages.

We prefer the latter, because the check & update of the paca psize is
not done until we know it's necessary. It's also done only on the
current cpu, so we don't need to IPI all other cpus.

Without further rearranging the code, the simplest fix is to pull out
the code that checks paca psize and call it in two places. Firstly for
THP/hugetlb, and secondly for other mappings as before.

Thanks to Dave Jones for trinity, which originally found this bug.
Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
Reviewed-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
CC: stable@vger.kernel.org [v3.11+]

09567e7f

05 6月, 2014 16 次提交

powerpc/powernv: Pass buffer size to OPAL validate flash call · 8b8f7bf4

由 Vasant Hegde 提交于 6月 05, 2014

We pass actual buffer size to opal_validate_flash() OPAL API call
and in return it contains output buffer size.

Commit cc146d1d (Fix little endian issues) missed to set the size
param before making OPAL call. So firmware image validation fails.

This patch sets size variable before making OPAL call.
Signed-off-by: NVasant Hegde <hegdevasant@linux.vnet.ibm.com>
Tested-by: NThomas Falcon <tlfalcon@linux.vnet.ibm.com>
Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>

8b8f7bf4

powerpc/pseries: hcall functions are exported to modules, need _GLOBAL_TOC() · c1931e21

由 Anton Blanchard 提交于 5月 13, 2014

The hcall macros may call out to c code for tracing, so we need
to set up a valid r2. This fixes an oops found when testing
ibmvscsi as a module.
Signed-off-by: NAnton Blanchard <anton@samba.org>
Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>

c1931e21

powerpc: Exported functions __clear_user and copy_page use r2 so need _GLOBAL_TOC() · 2ac7b016

由 Anton Blanchard 提交于 6月 05, 2014

__clear_user and copy_page load from the TOC and are also exported
to modules. This means we have to use _GLOBAL_TOC() so that we
create the global entry point that sets up the TOC.
Signed-off-by: NAnton Blanchard <anton@samba.org>
Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>

2ac7b016

powerpc/powernv: Set memory_block_size_bytes to 256MB · 6d97d7a2

由 Anton Blanchard 提交于 6月 04, 2014

powerpc sets a low SECTION_SIZE_BITS to accomodate small pseries
boxes. We default to 16MB memory blocks, and boxes with a lot
of memory end up with enormous numbers of sysfs memory nodes.

Set a more reasonable default for powernv of 256MB.
Signed-off-by: NAnton Blanchard <anton@samba.org>
Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>

6d97d7a2

powerpc: Allow ppc_md platform hook to override memory_block_size_bytes · a5d86257

由 Anton Blanchard 提交于 6月 04, 2014

The pseries platform code unconditionally overrides
memory_block_size_bytes regardless of the running platform.

Create a ppc_md hook that so each platform can choose to
do what it wants.
Signed-off-by: NAnton Blanchard <anton@samba.org>
Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>

a5d86257

powerpc/powernv: Fix endian issues in memory error handling code · 223ca9d8

由 Anton Blanchard 提交于 6月 04, 2014

struct OpalMemoryErrorData is passed to us from firmware, so we
have to byteswap it.
Signed-off-by: NAnton Blanchard <anton@samba.org>
Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>

223ca9d8

powerpc/eeh: Skip eeh sysfs when eeh is disabled · 2213fb14

由 Wei Yang 提交于 6月 04, 2014

When eeh is not enabled, and hotplug two pci devices on the same bus, eeh
related sysfs would be added twice for the first added pci device. Since the
eeh_dev is not created when eeh is not enabled.

This patch adds the check, if eeh is not enabled, eeh sysfs will not be
created.

After applying this patch, following warnings are reduced:

sysfs: cannot create duplicate filename '/devices/pci0000:00/0000:00:00.0/eeh_mode'
sysfs: cannot create duplicate filename '/devices/pci0000:00/0000:00:00.0/eeh_config_addr'
sysfs: cannot create duplicate filename '/devices/pci0000:00/0000:00:00.0/eeh_pe_config_addr'
Signed-off-by: NWei Yang <weiyang@linux.vnet.ibm.com>
Acked-by: NGavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>

2213fb14

powerpc: 64bit sendfile is capped at 2GB · 5d73320a

由 Anton Blanchard 提交于 6月 04, 2014

commit 8f9c0119 (compat: fs: Generic compat_sys_sendfile
implementation) changed the PowerPC 64bit sendfile call from
sys_sendile64 to sys_sendfile.

Unfortunately this broke sendfile of lengths greater than 2G because
sys_sendfile caps at MAX_NON_LFS. Restore what we had previously which
fixes the bug.

Cc: stable@vger.kernel.org
Signed-off-by: NAnton Blanchard <anton@samba.org>
Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>

5d73320a

powerpc/powernv: Provide debugfs access to the LPC bus via OPAL · fa2dbe2e

由 Benjamin Herrenschmidt 提交于 6月 03, 2014

This provides debugfs files to access the LPC bus on Power8
non-virtualized using the appropriate OPAL firmware calls.

The usage is simple: one file per space (IO, MEM and FW),
lseek to the address and read/write the data. IO and MEM always
generate series of byte accesses. FW can generate word and dword
accesses if aligned properly.

Based on an original patch from Rob Lippert and reworked.
Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>

fa2dbe2e

powerpc/serial: Use saner flags when creating legacy ports · c4cad90f

由 Benjamin Herrenschmidt 提交于 6月 03, 2014

We had a mix & match of flags used when creating legacy ports
depending on where we found them in the device-tree. Among others
we were missing UPF_SKIP_TEST for some kind of ISA ports which is
a problem as quite a few UARTs out there don't support the loopback
test (such as a lot of BMCs).

Let's pick the set of flags used by the SoC code and generalize it
which means autoconf, no loopback test, irq maybe shared and fixed
port.

Sending to stable as the lack of UPF_SKIP_TEST is breaking
serial on some machines so I want this back into distros
Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
CC: stable@vger.kernel.org

c4cad90f

powerpc/xmon: Fix up xmon format strings · 736256e4

由 Michael Ellerman 提交于 5月 26, 2014

There are a couple of places where xmon is using %x to print values that
are unsigned long.

I found this out the hard way recently:

 0:mon> p c000000000d0e7c8 c00000033dc90000 00000000a0000089 c000000000000000
 return value is 0x96300500

Which is calling find_linux_pte_or_hugepte(), the result should be a
kernel pointer. After decoding the page tables by hand I discovered the
correct value was c000000396300500.

So fix up that case and a few others.

We also use a mix of 0x%x, %x and %u to print cpu numbers. So
standardise on 0x%x.
Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>

736256e4

powerpc/powernv: Add calls to support little endian host · 4926616c

由 Benjamin Herrenschmidt 提交于 5月 20, 2014

When running as a powernv "host" system on P8, we need to switch
the endianness of interrupt handlers. This does it via the appropriate
call to the OPAL firmware which may result in just switching HID0:HILE
but depending on the processor version might need to do a few more
things. This call must be done early before any other processor has
been brought out of firmware.
Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: NAndy Whitcroft <apw@canonical.com>
Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>

4926616c

sys_sgetmask/sys_ssetmask: add CONFIG_SGETMASK_SYSCALL · f6187769

由 Fabian Frederick 提交于 6月 04, 2014

sys_sgetmask and sys_ssetmask are obsolete system calls no longer
supported in libc.

This patch replaces architecture related __ARCH_WANT_SYS_SGETMAX by expert
mode configuration.That option is enabled by default for those
architectures.
Signed-off-by: NFabian Frederick <fabf@skynet.be>
Cc: Steven Miao <realmz6@gmail.com>
Cc: Mikael Starvik <starvik@axis.com>
Cc: Jesper Nilsson <jesper.nilsson@axis.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Koichi Yasutake <yasutake.koichi@jp.panasonic.com>
Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
Cc: Helge Deller <deller@gmx.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Greg Ungerer <gerg@uclinux.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

f6187769

mm: disable zone_reclaim_mode by default · 4f9b16a6

由 Mel Gorman 提交于 6月 04, 2014

When it was introduced, zone_reclaim_mode made sense as NUMA distances
punished and workloads were generally partitioned to fit into a NUMA
node.  NUMA machines are now common but few of the workloads are
NUMA-aware and it's routine to see major performance degradation due to
zone_reclaim_mode being enabled but relatively few can identify the
problem.

Those that require zone_reclaim_mode are likely to be able to detect
when it needs to be enabled and tune appropriately so lets have a
sensible default for the bulk of users.

This patch (of 2):

zone_reclaim_mode causes processes to prefer reclaiming memory from
local node instead of spilling over to other nodes.  This made sense
initially when NUMA machines were almost exclusively HPC and the
workload was partitioned into nodes.  The NUMA penalties were
sufficiently high to justify reclaiming the memory.  On current machines
and workloads it is often the case that zone_reclaim_mode destroys
performance but not all users know how to detect this.  Favour the
common case and disable it by default.  Users that are sophisticated
enough to know they need zone_reclaim_mode will detect it.
Signed-off-by: NMel Gorman <mgorman@suse.de>
Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
Reviewed-by: NZhang Yanfei <zhangyanfei@cn.fujitsu.com>
Acked-by: NMichal Hocko <mhocko@suse.cz>
Reviewed-by: NChristoph Lameter <cl@linux.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

4f9b16a6

x86: define _PAGE_NUMA by reusing software bits on the PMD and PTE levels · c46a7c81

由 Mel Gorman 提交于 6月 04, 2014

_PAGE_NUMA is currently an alias of _PROT_PROTNONE to trap NUMA hinting
faults on x86.  Care is taken such that _PAGE_NUMA is used only in
situations where the VMA flags distinguish between NUMA hinting faults
and prot_none faults.  This decision was x86-specific and conceptually
it is difficult requiring special casing to distinguish between PROTNONE
and NUMA ptes based on context.

Fundamentally, we only need the _PAGE_NUMA bit to tell the difference
between an entry that is really unmapped and a page that is protected
for NUMA hinting faults as if the PTE is not present then a fault will
be trapped.

Swap PTEs on x86-64 use the bits after _PAGE_GLOBAL for the offset.
This patch shrinks the maximum possible swap size and uses the bit to
uniquely distinguish between NUMA hinting ptes and swap ptes.
Signed-off-by: NMel Gorman <mgorman@suse.de>
Cc: David Vrabel <david.vrabel@citrix.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Anvin <hpa@zytor.com>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Steven Noonan <steven@uplinklabs.net>
Cc: Rik van Riel <riel@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Cyrill Gorcunov <gorcunov@gmail.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

c46a7c81

hugetlb: restrict hugepage_migration_support() to x86_64 · c177c81e

由 Naoya Horiguchi 提交于 6月 04, 2014

Currently hugepage migration is available for all archs which support
pmd-level hugepage, but testing is done only for x86_64 and there're
bugs for other archs.  So to avoid breaking such archs, this patch
limits the availability strictly to x86_64 until developers of other
archs get interested in enabling this feature.

Simply disabling hugepage migration on non-x86_64 archs is not enough to
fix the reported problem where sys_move_pages() hits the BUG_ON() in
follow_page(FOLL_GET), so let's fix this by checking if hugepage
migration is supported in vma_migratable().
Signed-off-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Reported-by: NMichael Ellerman <mpe@ellerman.id.au>
Tested-by: NMichael Ellerman <mpe@ellerman.id.au>
Acked-by: NHugh Dickins <hughd@google.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: James Hogan <james.hogan@imgtec.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: David Miller <davem@davemloft.net>
Cc: <stable@vger.kernel.org>	[3.12+]
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

c177c81e

02 6月, 2014 1 次提交
- B
  powerpc: Wire renameat2() syscall · 8212f58a
  由 Benjamin Herrenschmidt 提交于 6月 02, 2014
```
Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
```
  8212f58a
30 5月, 2014 16 次提交

KVM: PPC: Book3S PR: Rework SLB switching code · d8d164a9

由 Alexander Graf 提交于 5月 15, 2014

On LPAR guest systems Linux enables the shadow SLB to indicate to the
hypervisor a number of SLB entries that always have to be available.

Today we go through this shadow SLB and disable all ESID's valid bits.
However, pHyp doesn't like this approach very much and honors us with
fancy machine checks.

Fortunately the shadow SLB descriptor also has an entry that indicates
the number of valid entries following. During the lifetime of a guest
we can just swap that value to 0 and don't have to worry about the
SLB restoration magic.

While we're touching the code, let's also make it more readable (get
rid of rldicl), allow it to deal with a dynamic number of bolted
SLB entries and only do shadow SLB swizzling on LPAR systems.
Signed-off-by: NAlexander Graf <agraf@suse.de>

d8d164a9

KVM: PPC: Book3S PR: Use SLB entry 0 · 207438d4

由 Alexander Graf 提交于 5月 15, 2014

We didn't make use of SLB entry 0 because ... of no good reason. SLB entry 0
will always be used by the Linux linear SLB entry, so the fact that slbia
does not invalidate it doesn't matter as we overwrite SLB 0 on exit anyway.

Just enable use of SLB entry 0 for our shadow SLB code.
Signed-off-by: NAlexander Graf <agraf@suse.de>

207438d4

KVM: PPC: Book3S HV: Fix machine check delivery to guest · 000a25dd

由 Paul Mackerras 提交于 5月 26, 2014

The code that delivered a machine check to the guest after handling
it in real mode failed to load up r11 before calling kvmppc_msr_interrupt,
which needs the old MSR value in r11 so it can see the transactional
state there.  This adds the missing load.
Signed-off-by: NPaul Mackerras <paulus@samba.org>
Signed-off-by: NAlexander Graf <agraf@suse.de>

000a25dd

KVM: PPC: Book3S HV: Work around POWER8 performance monitor bugs · 9bc01a9b

由 Paul Mackerras 提交于 5月 26, 2014

This adds workarounds for two hardware bugs in the POWER8 performance
monitor unit (PMU), both related to interrupt generation.  The effect
of these bugs is that PMU interrupts can get lost, leading to tools
such as perf reporting fewer counts and samples than they should.

The first bug relates to the PMAO (perf. mon. alert occurred) bit in
MMCR0; setting it should cause an interrupt, but doesn't.  The other
bug relates to the PMAE (perf. mon. alert enable) bit in MMCR0.
Setting PMAE when a counter is negative and counter negative
conditions are enabled to cause alerts should cause an alert, but
doesn't.

The workaround for the first bug is to create conditions where a
counter will overflow, whenever we are about to restore a MMCR0
value that has PMAO set (and PMAO_SYNC clear).  The workaround for
the second bug is to freeze all counters using MMCR2 before reading
MMCR0.
Signed-off-by: NPaul Mackerras <paulus@samba.org>
Signed-off-by: NAlexander Graf <agraf@suse.de>

9bc01a9b

KVM: PPC: Book3S HV: Make sure we don't miss dirty pages · 6c576e74

由 Paul Mackerras 提交于 5月 26, 2014

Current, when testing whether a page is dirty (when constructing the
bitmap for the KVM_GET_DIRTY_LOG ioctl), we test the C (changed) bit
in the HPT entries mapping the page, and if it is 0, we consider the
page to be clean.  However, the Power ISA doesn't require processors
to set the C bit to 1 immediately when writing to a page, and in fact
allows them to delay the writeback of the C bit until they receive a
TLB invalidation for the page.  Thus it is possible that the page
could be dirty and we miss it.

Now, if there are vcpus running, this is not serious since the
collection of the dirty log is racy already - some vcpu could dirty
the page just after we check it.  But if there are no vcpus running we
should return definitive results, in case we are in the final phase of
migrating the guest.

Also, if the permission bits in the HPTE don't allow writing, then we
know that no CPU can set C.  If the HPTE was previously writable and
the page was modified, any C bit writeback would have been flushed out
by the tlbie that we did when changing the HPTE to read-only.

Otherwise we need to do a TLB invalidation even if the C bit is 0, and
then check the C bit.
Signed-off-by: NPaul Mackerras <paulus@samba.org>
Signed-off-by: NAlexander Graf <agraf@suse.de>

6c576e74

KVM: PPC: Book3S HV: Fix dirty map for hugepages · 687414be

由 Alexey Kardashevskiy 提交于 5月 26, 2014

The dirty map that we construct for the KVM_GET_DIRTY_LOG ioctl has
one bit per system page (4K/64K).  Currently, we only set one bit in
the map for each HPT entry with the Change bit set, even if the HPT is
for a large page (e.g., 16MB).  Userspace then considers only the
first system page dirty, though in fact the guest may have modified
anywhere in the large page.

To fix this, we make kvm_test_clear_dirty() return the actual number
of pages that are dirty (and rename it to kvm_test_clear_dirty_npages()
to emphasize that that's what it returns).  In kvmppc_hv_get_dirty_log()
we then set that many bits in the dirty map.
Signed-off-by: NAlexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: NPaul Mackerras <paulus@samba.org>
Signed-off-by: NAlexander Graf <agraf@suse.de>

687414be

KVM: PPC: Book3S HV: Put huge-page HPTEs in rmap chain for base address · 1066f772

由 Paul Mackerras 提交于 5月 26, 2014

Currently, when a huge page is faulted in for a guest, we select the
rmap chain to insert the HPTE into based on the guest physical address
that the guest tried to access.  Since there is an rmap chain for each
system page, there are many rmap chains for the area covered by a huge
page (e.g. 256 for 16MB pages when PAGE_SIZE = 64kB), and the huge-page
HPTE could end up in any one of them.

For consistency, and to make the huge-page HPTEs easier to find, we now
put huge-page HPTEs in the rmap chain corresponding to the base address
of the huge page.
Signed-off-by: NPaul Mackerras <paulus@samba.org>
Signed-off-by: NAlexander Graf <agraf@suse.de>

1066f772

KVM: PPC: Book3S HV: Fix check for running inside guest in global_invalidates() · 55765483

由 Paul Mackerras 提交于 5月 26, 2014

The global_invalidates() function contains a check that is intended
to tell whether we are currently executing in the context of a hypercall
issued by the guest.  The reason is that the optimization of using a
local TLB invalidate instruction is only valid in that context.  The
check was testing local_paca->kvm_hstate.kvm_vcore, which gets set
when entering the guest but no longer gets cleared when exiting the
guest.  To fix this, we use the kvm_vcpu field instead, which does
get cleared when exiting the guest, by the kvmppc_release_hwthread()
calls inside kvmppc_run_core().

The effect of having the check wrong was that when kvmppc_do_h_remove()
got called from htab_write() on the destination machine during a
migration, it cleared the current cpu's bit in kvm->arch.need_tlb_flush.
This meant that when the guest started running in the destination VM,
it may miss out on doing a complete TLB flush, and therefore may end
up using stale TLB entries from a previous guest that used the same
LPID value.

This should make migration more reliable.
Signed-off-by: NPaul Mackerras <paulus@samba.org>
Signed-off-by: NAlexander Graf <agraf@suse.de>

55765483

KVM: PPC: Book3S: Move KVM_REG_PPC_WORT to an unused register number · e1d8a96d

由 Paul Mackerras 提交于 5月 26, 2014

Commit b005255e ("KVM: PPC: Book3S HV: Context-switch new POWER8
SPRs") added a definition of KVM_REG_PPC_WORT with the same register
number as the existing KVM_REG_PPC_VRSAVE (though in fact the
definitions are not identical because of the different register sizes.)

For clarity, this moves KVM_REG_PPC_WORT to the next unused number,
and also adds it to api.txt.
Signed-off-by: NPaul Mackerras <paulus@samba.org>
Signed-off-by: NAlexander Graf <agraf@suse.de>

e1d8a96d

KVM: PPC: Add CAP to indicate hcall fixes · f2e91042

由 Alexander Graf 提交于 5月 22, 2014

We worked around some nasty KVM magic page hcall breakages:

  1) NX bit not honored, so ignore NX when we detect it
  2) LE guests swizzle hypercall instruction

Without these fixes in place, there's no way it would make sense to expose kvm
hypercalls to a guest. Chances are immensely high it would trip over and break.

So add a new CAP that gives user space a hint that we have workarounds for the
bugs above in place. It can use those as hint to disable PV hypercalls when
the guest CPU is anything POWER7 or higher and the host does not have fixes
in place.
Signed-off-by: NAlexander Graf <agraf@suse.de>

f2e91042

KVM: PPC: MPIC: Reset IRQ source private members · aae65596

由 Alexander Graf 提交于 5月 22, 2014

When we reset the in-kernel MPIC controller, we forget to reset some hidden
state such as destmask and output. This state is usually set when the guest
writes to the IDR register for a specific IRQ line.

To make sure we stay in sync and don't forget hidden state, treat reset of
the IDR register as a simple write of the IDR register. That automatically
updates all the hidden state as well.
Reported-by: NPaul Janzen <pcj@pauljanzen.org>
Signed-off-by: NAlexander Graf <agraf@suse.de>

aae65596

KVM: PPC: Graciously fail broken LE hypercalls · 42188365

由 Alexander Graf 提交于 5月 13, 2014

There are LE Linux guests out there that don't handle hypercalls correctly.
Instead of interpreting the instruction stream from device tree as big endian
they assume it's a little endian instruction stream and fail.

When we see an illegal instruction from such a byte reversed instruction stream,
bail out graciously and just declare every hcall as error.
Signed-off-by: NAlexander Graf <agraf@suse.de>

42188365

PPC: ePAPR: Fix hypercall on LE guest · 235959be

由 Alexander Graf 提交于 5月 13, 2014

We get an array of instructions from the hypervisor via device tree that
we write into a buffer that gets executed whenever we want to make an
ePAPR compliant hypercall.

However, the hypervisor passes us these instructions in BE order which
we have to manually convert to LE when we want to run them in LE mode.

With this fixup in place, I can successfully run LE kernels with KVM
PV enabled on PR KVM.
Signed-off-by: NAlexander Graf <agraf@suse.de>

235959be

KVM: PPC: BOOK3S: Remove open coded make_dsisr in alignment handler · ddca156a

由 Aneesh Kumar K.V 提交于 5月 12, 2014

Use make_dsisr instead of open coding it. This also have
the added benefit of handling alignment interrupt on additional
instructions.
Signed-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: NAlexander Graf <agraf@suse.de>

ddca156a

KVM: PPC: BOOK3S: Always use the saved DAR value · 7310f3a5

由 Aneesh Kumar K.V 提交于 5月 12, 2014

Although it's optional, IBM POWER cpus always had DAR value set on
alignment interrupt. So don't try to compute these values.
Signed-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: NAlexander Graf <agraf@suse.de>

7310f3a5

PPC: KVM: Make NX bit available with magic page · 5c165aec

由 Alexander Graf 提交于 5月 12, 2014

Because old kernels enable the magic page and then choke on NXed trampoline
code we have to disable NX by default in KVM when we use the magic page.

However, since commit b18db0b8 we have successfully fixed that and can now
leave NX enabled, so tell the hypervisor about this.
Signed-off-by: NAlexander Graf <agraf@suse.de>

5c165aec