- 14 Nov 2016, 1 commit

Committed by Nicholas Piggin
Exception handlers are aligned to 128 bytes (the L1 cache line) on 64s, which is overkill. It can reduce the icache footprint of any individual exception path. However, taken as a whole, the expansion in icache footprint seems likely to be counter-productive and cause more total misses.

Create IFETCH_ALIGN_SHIFT/BYTES, which should give optimal ifetch alignment with much more reasonable alignment. This saves 1792 bytes from head_64.o text with an allmodconfig build.

Other subarchitectures should define appropriate IFETCH_ALIGN_SHIFT values if this becomes more widely used.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
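
A minimal sketch of the constants this commit describes, assuming the shift value of 4 that the mainline header uses for 64-bit Book3S (the exact value is a per-CPU tuning decision, not stated in this log):

    /* ifetch block alignment: smaller than an L1 line, still fetch-friendly */
    #define IFETCH_ALIGN_SHIFT  4
    #define IFETCH_ALIGN_BYTES  (1 << IFETCH_ALIGN_SHIFT)

Assembly entry points can then use ".align IFETCH_ALIGN_SHIFT" instead of aligning every handler to the full 128-byte cache line.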

- 12 Mar 2016, 1 commit

Committed by Christophe Leroy
This patch adds inline functions so that dcbz, dcbi, dcbf and dcbst can be used from C functions.

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Scott Wood <oss@buserror.net>
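
A sketch of what such wrappers look like, following the one-line-inline-asm-per-instruction pattern the mainline header uses (each operates on the cache line containing addr):

    static inline void dcbz(void *addr)
    {
            __asm__ __volatile__ ("dcbz 0, %0" : : "r" (addr) : "memory");
    }

    static inline void dcbf(void *addr)
    {
            __asm__ __volatile__ ("dcbf 0, %0" : : "r" (addr) : "memory");
    }

dcbi and dcbst follow the same shape, swapping in the respective mnemonic.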

- 21 Oct 2015, 1 commit

Committed by Paul Mackerras
This reverts commit 9678cdaa ("Use the POWER8 Micro Partition Prefetch Engine in KVM HV on POWER8") because the original commit had multiple, partly self-cancelling bugs that could cause occasional memory corruption.

In fact the logmpp instruction was incorrectly using register r0 as the source of the buffer address and operation code, and depending on what was in r0, it would either do nothing or corrupt the 64k page pointed to by r0. The logmpp instruction encoding and the operation code definitions could be corrected, but then there is the problem that there is no clearly defined way to know when the hardware has finished writing to the buffer. The original commit attempted to work around this by aborting the write-out before starting the prefetch, but this is ineffective in the case where the virtual core is now executing on a different physical core from the one where the write-out was initiated.

These problems, plus advice from the hardware designers not to use the function (since the measured performance improvement from using the feature was actually mostly negative), mean that reverting the code is the best option.

Fixes: 9678cdaa ("Use the POWER8 Micro Partition Prefetch Engine in KVM HV on POWER8")
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>

- 17 Mar 2015, 1 commit

Committed by Kyle Moffett
These functions are only used from one place each. If the cacheable_* versions really are more efficient, then those changes should be migrated into the common code instead.

NOTE: The old routines are just flat buggy on kernels that support hardware with different cacheline sizes.

Signed-off-by: Kyle Moffett <Kyle.D.Moffett@boeing.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>

- 28 Jul 2014, 1 commit

Committed by Stewart Smith
The POWER8 processor has a Micro Partition Prefetch Engine, which is a fancy way of saying "has a way to store and load the contents of the L2, or of the L2 plus the MRU way of the L3 cache". We initiate storing of the log (list of addresses) using the logmpp instruction and start the restore by writing to an SPR.

The logmpp instruction takes its parameters in a single 64-bit register:
- starting address of the table to store the log of L2/L2+L3 cache contents
  - 32kb for L2
  - 128kb for L2+L3
  - aligned relative to the maximum size of the table (32kb or 128kb)
- log control (no-op, L2 only, L2 and L3, abort logout)

We should abort any ongoing logging before initiating a new one.

To initiate the restore, we write to the MPPR SPR. The format of what to write to the SPR is similar to the logmpp instruction parameter:
- starting address of the table to read from (same alignment requirements)
- table size (no data, until end of table)
- prefetch rate (from fastest possible to slower: about every 8, 16, 24 or 32 cycles)

The idea behind loading and storing the contents of the L2/L3 cache is to reduce memory latency in a system that is frequently swapping vcores on a physical CPU. The best-case scenario for doing this is when some vcores are doing very cache-heavy workloads. The worst case is when they have about zero cache hits, so we just generate needless memory operations.

This implementation just does the L2 store/load. In my benchmarks this proves to be useful.

Benchmark 1:
- 16-core POWER8
- 3x Ubuntu 14.04 LTS guests (LE) with 8 VCPUs each
- no split core/SMT
- two guests running the sysbench memory test:
  sysbench --test=memory --num-threads=8 run
- one guest running apache bench (of the default HTML page):
  ab -n 490000 -c 400 http://localhost/

This benchmark aims to measure the performance of a real-world application (apache) where other guests are cache-hot with their own workloads. The sysbench memory benchmark does pointer-sized writes to a (small) memory buffer in a loop. In this benchmark with this patch I can see an improvement both in requests per second (~5%) and in mean and median response times (again, about 5%). The spread of minimum and maximum response times was largely unchanged.

Benchmark 2:
- same VM config as benchmark 1
- all three guests running the sysbench memory benchmark

This benchmark aims to see if there is a positive or negative effect on this cache-heavy benchmark. Due to the nature of the benchmark (stores), we may not see a difference in performance, but rather, hopefully, an improvement in consistency of performance (when a vcore is switched in, it doesn't have to wait many times for cache lines to be pulled in). The results of this benchmark are improvements in consistency of performance rather than in performance itself. With this patch, the few outliers in duration go away and we get more consistent performance in each guest.

Benchmark 3:
- same 3 guests and CPU configuration as benchmarks 1 and 2
- two idle guests
- 1 guest running the STREAM benchmark

This scenario also saw a performance improvement with this patch. On the Copy and Scale workloads from STREAM, I got a 5-6% improvement with this patch. For Add and Triad, it was around 10% (or more).

Benchmark 4:
- same 3 guests as the previous benchmarks
- two guests running sysbench --memory, a distinctly different cache-heavy workload
- one guest running the STREAM benchmark

Similar improvements to benchmark 3.

Benchmark 5:
- 1 guest, 8 VCPUs, Ubuntu 14.04
- host configured with split core (SMT8, subcores-per-core=4)
- STREAM benchmark

In this benchmark, we see a 10-20% performance improvement across the board of the STREAM benchmark results with this patch.

Based on preliminary investigation and microbenchmarks by Prerna Saxena <prerna@linux.vnet.ibm.com>

Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
Acked-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
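
A sketch of the save flow described above, with hypothetical names: MPP_ADDRESS_MASK, LOGMPP_ABORT, LOGMPP_LOG_L2 and the logmpp() wrapper are illustrative, not the patch's actual identifiers, and the bit layout is assumed from the description; virt_to_phys() is a real kernel helper.

    /* illustrative bit layout: low bits of the operand carry the op code */
    #define MPP_ADDRESS_MASK    (~0xffffUL)   /* buffer is size-aligned */
    #define LOGMPP_ABORT        3UL           /* abort an in-flight logout */
    #define LOGMPP_LOG_L2       2UL           /* log L2 contents */

    static void start_l2_cache_save(void *mpp_buffer)
    {
            unsigned long addr = virt_to_phys(mpp_buffer) & MPP_ADDRESS_MASK;

            /* abort any logout still in flight before starting a new one */
            logmpp(addr | LOGMPP_ABORT);
            logmpp(addr | LOGMPP_LOG_L2);
    }

Assemblers of the day did not know the logmpp mnemonic, so the original patch emitted the instruction through a raw-opcode macro; the logmpp() wrapper above assumes such a macro exists.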

- 02 Dec 2013, 1 commit

Committed by Kevin Hao
As Benjamin Herrenschmidt has indicated, we still need a dummy icbi to purge all the prefetched instructions from the ifetch buffers for the snooping icache. We also need a sync before the icbi, to order the icbi after the actual stores to memory that might have modified the instructions.

Signed-off-by: Kevin Hao <haokexin@gmail.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
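
A sketch of the ordering this describes (the function name is illustrative; the mnemonics are the standard PowerPC ones, and the exact kernel sequence may differ):

    static inline void flush_instruction_prefetch(void *addr)
    {
            /*
             * sync orders the stores that modified the instructions;
             * the dummy icbi then purges anything already prefetched,
             * and isync resynchronizes this CPU's instruction stream.
             */
            __asm__ __volatile__ ("sync; icbi 0, %0; isync"
                                  : : "r" (addr) : "memory");
    }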

- 29 Mar 2012, 1 commit

Committed by David Howells
Disintegrate asm/system.h for PowerPC.

Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
cc: linuxppc-dev@lists.ozlabs.org

- 05 May 2010, 1 commit

Committed by Dave Kleikamp
This patch adds the base support for the 476 processor. The code was primarily written by Ben Herrenschmidt and Torez Smith, but I've been maintaining it for a while.

The goal is to have a single binary that will run on 44x and 47x, but we still have some details to work out. The biggest is that the L1 cache line size differs on the two platforms, but it's currently a compile-time option.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Torez Smith <lnxtorez@linux.vnet.ibm.com>
Signed-off-by: Dave Kleikamp <shaggy@linux.vnet.ibm.com>
Signed-off-by: Josh Boyer <jwboyer@linux.vnet.ibm.com>
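
A sketch of the compile-time choice the commit mentions, assuming the values the mainline header settled on (128-byte L1 lines on 47x, 32-byte on the other 32-bit parts):

    #ifdef CONFIG_PPC_47x
    #define L1_CACHE_SHIFT  7       /* 128-byte L1 lines */
    #else
    #define L1_CACHE_SHIFT  5       /* 32-byte L1 lines */
    #endif
    #define L1_CACHE_BYTES  (1 << L1_CACHE_SHIFT)

A single binary for both platforms would instead need to carry the line size in a runtime variable, which is exactly the detail still to be worked out.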

- 03 Mar 2010, 1 commit

Committed by Denys Vlasenko
Signed-off-by: Denys Vlasenko <vda.linux@googlemail.com>
Signed-off-by: Michal Marek <mmarek@suse.cz>

- 04 Aug 2008, 1 commit

Committed by Stephen Rothwell
Move the powerpc include files to arch/powerpc/include/asm from include/asm-powerpc. This is the result of a

    mkdir arch/powerpc/include/asm
    git mv include/asm-powerpc/* arch/powerpc/include/asm

followed by a few documentation/comment fixups and a couple of places where <asm-powerpc/...> was being used explicitly. Of the latter, only one was outside the arch code, and it is a driver only built for powerpc.

Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Paul Mackerras <paulus@samba.org>

- 19 Jun 2008, 1 commit

Committed by Kumar Gala
The new e500mc core from Freescale is based on the e500v2 but with the following changes:

* Supports only the Enhanced Debug Architecture (DSRR0/1, etc.)
* Floating Point
* No SPE
* Supports lwsync
* Doorbell Exceptions
* Hypervisor
* Cache line size is now 64 bytes (e500v1/v2 have a 32-byte cache line)

Signed-off-by: Kumar Gala <galak@kernel.crashing.org>
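
In cache.h terms, the last item is a one-line change; a sketch assuming the CONFIG_PPC_E500MC symbol the platform uses:

    #ifdef CONFIG_PPC_E500MC
    #define L1_CACHE_SHIFT  6       /* 64-byte lines on e500mc */
    #else
    #define L1_CACHE_SHIFT  5       /* 32-byte lines on e500v1/v2 */
    #endif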

- 10 Jul 2007, 1 commit

Committed by Tony Breeds
Signed-off-by: Tony Breeds <tony@bakeyournoodle.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>

- 26 Apr 2006, 1 commit

Committed by David Woodhouse
Signed-off-by: David Woodhouse <dwmw2@infradead.org>

- 09 Jan 2006, 1 commit

Committed by Ravikiran G Thirumalai
Kill L1_CACHE_SHIFT_MAX from all arches: since L1_CACHE_SHIFT_MAX is not used anymore with the introduction of INTERNODE_CACHE, there is no reason to keep it.

Signed-off-by: Ravikiran Thirumalai <kiran@scalex86.org>
Signed-off-by: Shai Fultheim <shai@scalex86.org>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

- 10 Nov 2005, 1 commit

Committed by David Gibson
The ppc32 and ppc64 versions of cacheflush.h were almost identical. The two versions of cache.h are fairly similar, except for a bunch of register definitions in the ppc32 version which probably belong better elsewhere. This patch, therefore, merges both headers. Notable points:

- There are several functions in cacheflush.h which exist only on ppc32 or only on ppc64. These are handled by #ifdef for now, but they should probably be consolidated, along with the actual code behind them, later.

- Confusingly, both ppc32 and ppc64 have a flush_dcache_range(), but they're subtly different: it uses dcbf on ppc32 and dcbst on ppc64, and ppc64 has a flush_inval_dcache_range() which uses dcbf. These too should be merged and consolidated later.

- Also, flush_dcache_range() was defined in cacheflush.h on ppc64 and in cache.h on ppc32. In the merged version it's in cacheflush.h.

- On ppc32, flush_icache_range() is a normal function from misc.S. On ppc64 it was a wrapper, testing a feature bit before calling __flush_icache_range(), which does the actual flush. This patch takes the ppc64 approach, which amounts to no change on ppc32, since CPU_FTR_COHERENT_ICACHE will never be set there, but does mean renaming flush_icache_range() to __flush_icache_range() in arch/ppc/kernel/misc.S and arch/powerpc/kernel/misc_32.S.

- The PReP register info from asm-ppc/cache.h has moved to arch/ppc/platforms/prep_setup.c.

- The 8xx register info from asm-ppc/cache.h has moved to a new asm-powerpc/reg_8xx.h, included from reg.h.

- flush_dcache_all() was defined on ppc32 (only), but was never called (although it was exported). Thus this patch removes it from cacheflush.h and from ARCH=powerpc (misc_32.S) entirely. It's left in ARCH=ppc for now, with the prototype moved to ppc_ksyms.c.

Built for Walnut (ARCH=ppc) and 32-bit multiplatform (pmac, CHRP and PReP ARCH=ppc; pmac and CHRP ARCH=powerpc). Built and booted on POWER5 LPAR (ARCH=powerpc and ARCH=ppc64). Built for 32-bit powermac (ARCH=ppc and ARCH=powerpc). Built and booted on G5 (ARCH=powerpc).

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@samba.org>
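
A sketch of the ppc64-style wrapper described in the fourth point (cpu_has_feature() and CPU_FTR_COHERENT_ICACHE are the real kernel names; the inline form here is simplified):

    static inline void flush_icache_range(unsigned long start,
                                          unsigned long stop)
    {
            /* CPUs with a coherent icache need no explicit flush */
            if (!cpu_has_feature(CPU_FTR_COHERENT_ICACHE))
                    __flush_icache_range(start, stop);
    }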