提交 · e42389c5f19fed0a7c578769b00edc2cf23ee319 · openeuler / Kernel

07 8月, 2018 9 次提交

M
powerpc/64s: Rename STD_RELON_EXCEPTION_PSERIES to STD_RELON_EXCEPTION · e42389c5
由 Michael Ellerman 提交于 7月 26, 2018
```
Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
```
e42389c5
M
powerpc/64s: Rename STD_EXCEPTION_PSERIES_OOL to STD_EXCEPTION_OOL · 75e8bef3
由 Michael Ellerman 提交于 7月 26, 2018
```
Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
```
75e8bef3

powerpc/64s: Rename STD_EXCEPTION_PSERIES to STD_EXCEPTION · e899fce5

由 Michael Ellerman 提交于 7月 26, 2018

The "PSERIES" in STD_EXCEPTION_PSERIES is to differentiate the macros
from the legacy iSeries versions, which are called
STD_EXCEPTION_ISERIES. It is not anything to do with pseries vs
powernv or powermac etc.

We removed the legacy iSeries code in 2012, in commit 8ee3e0d6x
("powerpc: Remove the main legacy iSerie platform code").

So remove "PSERIES" from the macros.
Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>

e899fce5

powerpc/64s: Move SET_SCRATCH0() into EXCEPTION_RELON_PROLOG_PSERIES() · 92b6d65c

由 Michael Ellerman 提交于 7月 26, 2018

EXCEPTION_RELON_PROLOG_PSERIES() only has two users,
STD_RELON_EXCEPTION_PSERIES() and STD_RELON_EXCEPTION_HV() both of
which "call" SET_SCRATCH0(), so just move SET_SCRATCH0() into
EXCEPTION_RELON_PROLOG_PSERIES().
Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>

92b6d65c

powerpc/64s: Move SET_SCRATCH0() into EXCEPTION_PROLOG_PSERIES() · 4a7a0a84

由 Michael Ellerman 提交于 7月 26, 2018

EXCEPTION_PROLOG_PSERIES() only has two users, STD_EXCEPTION_PSERIES()
and STD_EXCEPTION_HV() both of which "call" SET_SCRATCH0(), so just
move SET_SCRATCH0() into EXCEPTION_PROLOG_PSERIES().
Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>

4a7a0a84

powerpc/lib: Implement strlen() in assembly for PPC32 · 9412b234

由 Christophe Leroy 提交于 8月 01, 2018

The generic implementation of strlen() reads strings byte per byte.

This patch implements strlen() in assembly based on a read of entire
words, in the same spirit as what some other arches and glibc do.

On a 8xx the time spent in strlen is reduced by 3/4 for long strings.

strlen() selftest on an 8xx provides the following values:

Before the patch (ie with the generic strlen() in lib/string.c):

  len 256 : time = 1.195055
  len 016 : time = 0.083745
  len 008 : time = 0.046828
  len 004 : time = 0.028390

After the patch:

  len 256 : time = 0.272185 ==> 78% improvment
  len 016 : time = 0.040632 ==> 51% improvment
  len 008 : time = 0.033060 ==> 29% improvment
  len 004 : time = 0.029149 ==> 2% degradation

On a 832x:

Before the patch:

  len 256 : time = 0.236125
  len 016 : time = 0.018136
  len 008 : time = 0.011000
  len 004 : time = 0.007229

After the patch:

  len 256 : time = 0.094950 ==> 60% improvment
  len 016 : time = 0.013357 ==> 26% improvment
  len 008 : time = 0.010586 ==> 4% improvment
  len 004 : time = 0.008784
Signed-off-by: NChristophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>

9412b234

powerpc/pseries: Defer the logging of rtas error to irq work queue. · 94675cce

由 Mahesh Salgaonkar 提交于 7月 04, 2018

rtas_log_buf is a buffer to hold RTAS event data that are communicated
to kernel by hypervisor. This buffer is then used to pass RTAS event
data to user through proc fs. This buffer is allocated from
vmalloc (non-linear mapping) area.

On Machine check interrupt, register r3 points to RTAS extended event
log passed by hypervisor that contains the MCE event. The pseries
machine check handler then logs this error into rtas_log_buf. The
rtas_log_buf is a vmalloc-ed (non-linear) buffer we end up taking up a
page fault (vector 0x300) while accessing it. Since machine check
interrupt handler runs in NMI context we can not afford to take any
page fault. Page faults are not honored in NMI context and causes
kernel panic. Apart from that, as Nick pointed out,
pSeries_log_error() also takes a spin_lock while logging error which
is not safe in NMI context. It may endup in deadlock if we get another
MCE before releasing the lock. Fix this by deferring the logging of
rtas error to irq work queue.

Current implementation uses two different buffers to hold rtas error
log depending on whether extended log is provided or not. This makes
bit difficult to identify which buffer has valid data that needs to
logged later in irq work. Simplify this using single buffer, one per
paca, and copy rtas log to it irrespective of whether extended log is
provided or not. Allocate this buffer below RMA region so that it can
be accessed in real mode mce handler.

Fixes: b96672dd ("powerpc: Machine check interrupt is a non-maskable interrupt")
Cc: stable@vger.kernel.org # v4.14+
Reviewed-by: NNicholas Piggin <npiggin@gmail.com>
Signed-off-by: NMahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>

94675cce

powerpc/xive: Remove xive_kexec_teardown_cpu() · e27e0a94

由 Benjamin Herrenschmidt 提交于 4月 11, 2018

It's identical to xive_teardown_cpu() so just use the latter
Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>

e27e0a94

powerpc/64s/radix: tlb do not flush on page size when fullmm · 5a609934

由 Nicholas Piggin 提交于 7月 25, 2018

When the mm is being torn down there will be a full PID flush so
there is no need to flush the TLB on page size changes.
Signed-off-by: NNicholas Piggin <npiggin@gmail.com>
Reviewed-by: NAneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>

5a609934

31 7月, 2018 3 次提交

powerpc/powernv: Add support to enable sensor groups · 04baaf28

由 Shilpasri G Bhat 提交于 7月 24, 2018

Adds support to enable/disable a sensor group at runtime. This
can be used to select the sensor groups that needs to be copied to
main memory by OCC. Sensor groups like power, temperature, current,
voltage, frequency, utilization can be enabled/disabled at runtime.
Signed-off-by: NShilpasri G Bhat <shilpa.bhat@linux.vnet.ibm.com>
Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>

04baaf28

powernv/cpuidle: Use parsed device tree values for cpuidle_init · 1961acad

由 Akshay Adiga 提交于 7月 05, 2018

Export pnv_idle_states and nr_pnv_idle_states so that its accessible to
cpuidle driver. Use properties from pnv_idle_states structure for powernv
cpuidle_init.
Signed-off-by: NAkshay Adiga <akshay.adiga@linux.vnet.ibm.com>
Reviewed-by: NNicholas Piggin <npiggin@gmail.com>
Reviewed-by: NGautham R. Shenoy <ego@linux.vnet.ibm.com>
Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>

1961acad

powernv/cpuidle: Parse dt idle properties into global structure · 9c7b185a

由 Akshay Adiga 提交于 7月 05, 2018

Device-tree parsing happens twice, once while deciding idle state to be
used for hotplug and once during cpuidle init. Hence, parsing the device
tree and caching it will reduce code duplication. Parsing code has been
moved to pnv_parse_cpuidle_dt() from pnv_probe_idle_states(). In addition
to the properties in the device tree the number of available states is
also required.
Signed-off-by: NAkshay Adiga <akshay.adiga@linux.vnet.ibm.com>
Reviewed-by: NNicholas Piggin <npiggin@gmail.com>
Reviewed-by: NGautham R. Shenoy <ego@linux.vnet.ibm.com>
Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>

9c7b185a

30 7月, 2018 15 次提交

powerpc/mm: Don't report PUDs as memory leaks when using kmemleak · a984506c

由 Michael Ellerman 提交于 7月 20, 2018

Paul Menzel reported that kmemleak was producing reports such as:

  unreferenced object 0xc0000000f8b80000 (size 16384):
    comm "init", pid 1, jiffies 4294937416 (age 312.240s)
    hex dump (first 32 bytes):
      00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
      00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    backtrace:
      [<00000000d997deb7>] __pud_alloc+0x80/0x190
      [<0000000087f2e8a3>] move_page_tables+0xbac/0xdc0
      [<00000000091e51c2>] shift_arg_pages+0xc0/0x210
      [<00000000ab88670c>] setup_arg_pages+0x22c/0x2a0
      [<0000000060871529>] load_elf_binary+0x41c/0x1648
      [<00000000ecd9d2d4>] search_binary_handler.part.11+0xbc/0x280
      [<0000000034e0cdd7>] __do_execve_file.isra.13+0x73c/0x940
      [<000000005f953a6e>] sys_execve+0x58/0x70
      [<000000009700a858>] system_call+0x5c/0x70

Indicating that a PUD was being leaked.

However what's really happening is that kmemleak is not able to
recognise the references from the PGD to the PUD, because they are not
fully qualified pointers.

We can confirm that in xmon, eg:

Find the task struct for pid 1 "init":
  0:mon> P
       task_struct     ->thread.ksp    PID   PPID S  P CMD
  c0000001fe7c0000 c0000001fe803960      1      0 S 13 systemd

Dump virtual address 0 to find the PGD:
  0:mon> dv 0 c0000001fe7c0000
  pgd  @ 0xc0000000f8b01000

Dump the memory of the PGD:
  0:mon> d c0000000f8b01000
  c0000000f8b01000 00000000f8b90000 0000000000000000  |................|
  c0000000f8b01010 0000000000000000 0000000000000000  |................|
  c0000000f8b01020 0000000000000000 0000000000000000  |................|
  c0000000f8b01030 0000000000000000 00000000f8b80000  |................|
                                    ^^^^^^^^^^^^^^^^

There we can see the reference to our supposedly leaked PUD. But
because it's missing the leading 0xc, kmemleak won't recognise it.

We can confirm it's still in use by translating an address that is
mapped via it:
  0:mon> dv 7fff94000000 c0000001fe7c0000
  pgd  @ 0xc0000000f8b01000
  pgdp @ 0xc0000000f8b01038 = 0x00000000f8b80000 <--
  pudp @ 0xc0000000f8b81ff8 = 0x00000000037c4000
  pmdp @ 0xc0000000037c5ca0 = 0x00000000fbd89000
  ptep @ 0xc0000000fbd89000 = 0xc0800001d5ce0386
  Maps physical address = 0x00000001d5ce0000
  Flags = Accessed Dirty Read Write

The fix is fairly simple. We need to tell kmemleak to ignore PUD
allocations and never report them as leaks. We can also tell it not to
scan the PGD, because it will never find pointers in there. However it
will still notice if we allocate a PGD and then leak it.
Reported-by: NPaul Menzel <pmenzel@molgen.mpg.de>
Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
Tested-by: NPaul Menzel <pmenzel@molgen.mpg.de>
Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>

a984506c

powerpc: split asm/tlbflush.h · 405cb402

由 Christophe Leroy 提交于 7月 05, 2018

Split asm/tlbflush.h into:
asm/nohash/tlbflush.h
asm/book3s/32/tlbflush.h
asm/book3s/64/tlbflush.h (already existing)
Signed-off-by: NChristophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>

405cb402

powerpc: remove unnecessary inclusion of asm/tlbflush.h · 45ef5992

由 Christophe Leroy 提交于 7月 05, 2018

asm/tlbflush.h is only needed for:
- using functions xxx_flush_tlb_xxx()
- using MMU_NO_CONTEXT
- including asm-generic/pgtable.h
Signed-off-by: NChristophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>

45ef5992

powerpc/44x: remove page.h from mmu-44x.h · 7bc39695

由 Christophe Leroy 提交于 7月 05, 2018

mmu-44x.h doesn't need asm/page.h if PAGE_SHIFT are replaced by CONFIG_PPC_XX_PAGES
Signed-off-by: NChristophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>

7bc39695

powerpc/nohash: fix hash related comments in pgtable.h · 0c295d0e

由 Christophe Leroy 提交于 7月 05, 2018

Signed-off-by: NChristophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>

0c295d0e

powerpc: fix includes in asm/processor.h · 62b84265

由 Christophe Leroy 提交于 7月 05, 2018

Remove superflous includes and add missing ones
Signed-off-by: NChristophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>

62b84265

powerpc/book3s: Remove PPC_PIN_SIZE · 6b622669

由 Christophe Leroy 提交于 7月 05, 2018

PPC_PIN_SIZE is specific to the 44x and is defined in mmu.h
Signed-off-by: NChristophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>

6b622669

powerpc: declare set_breakpoint() static · b5ac51d7

由 Christophe Leroy 提交于 7月 05, 2018

set_breakpoint() is only used in process.c so make it static
Signed-off-by: NChristophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>

b5ac51d7

powerpc: remove superflous inclusions of asm/fixmap.h · e8cb7a55

由 Christophe Leroy 提交于 7月 05, 2018

Files not using fixmap consts or functions don't need asm/fixmap.h
Signed-off-by: NChristophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>

e8cb7a55

powerpc: clean inclusions of asm/feature-fixups.h · 2c86cd18

由 Christophe Leroy 提交于 7月 05, 2018

files not using feature fixup don't need asm/feature-fixups.h
files using feature fixup need asm/feature-fixups.h
Signed-off-by: NChristophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>

2c86cd18

powerpc: clean the inclusion of stringify.h · 5c35a02c

由 Christophe Leroy 提交于 7月 05, 2018

Only include linux/stringify.h is files using __stringify()
Signed-off-by: NChristophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>

5c35a02c

powerpc: move ASM_CONST and stringify_in_c() into asm-const.h · ec0c464c

由 Christophe Leroy 提交于 7月 05, 2018

This patch moves ASM_CONST() and stringify_in_c() into
dedicated asm-const.h, then cleans all related inclusions.
Signed-off-by: NChristophe Leroy <christophe.leroy@c-s.fr>
[mpe: asm-compat.h should include asm-const.h]
Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>

ec0c464c

powerpc/405: move PPC405_ERR77 in asm-405.h · 36a7eeaf

由 Christophe Leroy 提交于 7月 05, 2018

Signed-off-by: NChristophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>

36a7eeaf

powerpc: remove unneeded inclusions of cpu_has_feature.h · 8c58259b

由 Christophe Leroy 提交于 7月 05, 2018

Files not using cpu_has_feature() don't need cpu_has_feature.h
Signed-off-by: NChristophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>

8c58259b

powerpc: remove kdump.h from page.h · db0a2b63

由 Christophe Leroy 提交于 7月 05, 2018

page.h doesn't need kdump.h
Signed-off-by: NChristophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>

db0a2b63

24 7月, 2018 10 次提交

powerpc/powernv: implement opal_put_chars_atomic · 17cc1dd4

由 Nicholas Piggin 提交于 5月 01, 2018

The RAW console does not need writes to be atomic, so relax
opal_put_chars to be able to do partial writes, and implement an
_atomic variant which does not take a spinlock. This API is used
in xmon, so the less locking that is used, the better chance there
is that a crash can be debugged.

Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: NNicholas Piggin <npiggin@gmail.com>
Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>

17cc1dd4

powerpc/powernv: Implement and use opal_flush_console · d2a2262e

由 Nicholas Piggin 提交于 5月 01, 2018

A new console flushing firmware API was introduced to replace event
polling loops, and implemented in opal-kmsg with affddff6
("powerpc/powernv: Add a kmsg_dumper that flushes console output on
panic"), to flush the console in the panic path.

The OPAL console driver has other situations where interrupts are off
and it needs to flush the console synchronously. These still use a
polling loop.

So move the opal-kmsg flush code to opal_flush_console, and use the
new function in opal-kmsg and opal_put_chars.

Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Reviewed-by: NRussell Currey <ruscur@russell.cc>
Signed-off-by: NNicholas Piggin <npiggin@gmail.com>
Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>

d2a2262e

powerpc/64: enhance memcmp() with VMX instruction for long bytes comparision · d58badfb

由 Simon Guo 提交于 6月 07, 2018

This patch add VMX primitives to do memcmp() in case the compare size
is equal or greater than 4K bytes. KSM feature can benefit from this.

Test result with following test program(replace the "^>" with ""):
------
># cat tools/testing/selftests/powerpc/stringloops/memcmp.c
>#include <malloc.h>
>#include <stdlib.h>
>#include <string.h>
>#include <time.h>
>#include "utils.h"
>#define SIZE (1024 * 1024 * 900)
>#define ITERATIONS 40

int test_memcmp(const void *s1, const void *s2, size_t n);

static int testcase(void)
{
        char *s1;
        char *s2;
        unsigned long i;

        s1 = memalign(128, SIZE);
        if (!s1) {
                perror("memalign");
                exit(1);
        }

        s2 = memalign(128, SIZE);
        if (!s2) {
                perror("memalign");
                exit(1);
        }

        for (i = 0; i < SIZE; i++)  {
                s1[i] = i & 0xff;
                s2[i] = i & 0xff;
        }
        for (i = 0; i < ITERATIONS; i++) {
		int ret = test_memcmp(s1, s2, SIZE);

		if (ret) {
			printf("return %d at[%ld]! should have returned zero\n", ret, i);
			abort();
		}
	}

        return 0;
}

int main(void)
{
        return test_harness(testcase, "memcmp");
}
------
Without this patch (but with the first patch "powerpc/64: Align bytes
before fall back to .Lshort in powerpc64 memcmp()." in the series):
	4.726728762 seconds time elapsed                                          ( +-  3.54%)
With VMX patch:
	4.234335473 seconds time elapsed                                          ( +-  2.63%)
		There is ~+10% improvement.

Testing with unaligned and different offset version (make s1 and s2 shift
random offset within 16 bytes) can archieve higher improvement than 10%..
Signed-off-by: NSimon Guo <wei.guo.simon@gmail.com>
Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>

d58badfb

powerpc: add vcmpequd/vcmpequb ppc instruction macro · f1ecbaf4

由 Simon Guo 提交于 6月 07, 2018

Some old tool chains don't know about instructions like vcmpequd.

This patch adds .long macro for vcmpequd and vcmpequb, which is
a preparation to optimize ppc64 memcmp with VMX instructions.
Signed-off-by: NSimon Guo <wei.guo.simon@gmail.com>
Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>

f1ecbaf4

powerpc/mm/hash: Add hpte_get_old_v and use that instead of opencoding · a833280b

由 Aneesh Kumar K.V 提交于 6月 29, 2018

No functional change
Signed-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>

a833280b

powerpc/mm: Increase MAX_PHYSMEM_BITS to 128TB with SPARSEMEM_VMEMMAP config · 7d4340bb

由 Aneesh Kumar K.V 提交于 6月 21, 2018

We do this only with VMEMMAP config so that our page_to_[nid/section] etc are not
impacted.
Signed-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>

7d4340bb

powerpc: NMI IPI make NMI IPIs fully sychronous · 5b73151f

由 Nicholas Piggin 提交于 4月 25, 2018

There is an asynchronous aspect to smp_send_nmi_ipi. The caller waits
for all CPUs to call in to the handler, but it does not wait for
completion of the handler. This is a needless complication, so remove
it and always wait synchronously.

The synchronous wait allows the caller to easily time out and clear
the wait for completion (zero nmi_ipi_busy_count) in the case of badly
behaved handlers. This would have prevented the recent smp_send_stop
NMI IPI bug from causing the system to hang.
Signed-off-by: NNicholas Piggin <npiggin@gmail.com>
Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>

5b73151f

powerpc/64s: make PACA_IRQ_HARD_DIS track MSR[EE] closely · 9b81c021

由 Nicholas Piggin 提交于 6月 03, 2018

When the masked interrupt handler clears MSR[EE] for an interrupt in
the PACA_IRQ_MUST_HARD_MASK set, it does not set PACA_IRQ_HARD_DIS.
This makes them get out of synch.

With that taken into account, it's only low level irq manipulation
(and interrupt entry before reconcile) where they can be out of synch.
This makes the code less surprising.

It also allows the IRQ replay code to rely on the IRQ_HARD_DIS value
and not have to mtmsrd again in this case (e.g., for an external
interrupt that has been masked). The bigger benefit might just be
that there is not such an element of surprise in these two bits of
state.
Signed-off-by: NNicholas Piggin <npiggin@gmail.com>
Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>

9b81c021

powerpc/pkeys: make protection key 0 less special · 07f522d2

由 Ram Pai 提交于 7月 17, 2018

Applications need the ability to associate an address-range with some
key and latter revert to its initial default key. Pkey-0 comes close to
providing this function but falls short, because the current
implementation disallows applications to explicitly associate pkey-0 to
the address range.

Lets make pkey-0 less special and treat it almost like any other key.
Thus it can be explicitly associated with any address range, and can be
freed. This gives the application more flexibility and power.  The
ability to free pkey-0 must be used responsibily, since pkey-0 is
associated with almost all address-range by default.

Even with this change pkey-0 continues to be slightly more special
from the following point of view.
(a) it is implicitly allocated.
(b) it is the default key assigned to any address-range.
(c) its permissions cannot be modified by userspace.

NOTE: (c) is specific to powerpc only. pkey-0 is associated by default
with all pages including kernel pages, and pkeys are also active in
kernel mode. If any permission is denied on pkey-0, the kernel running
in the context of the application will be unable to operate.

Tested on powerpc.
Signed-off-by: NRam Pai <linuxram@us.ibm.com>
[mpe: Drop #define PKEY_0 0 in favour of plain old 0]
Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>

07f522d2

powerpc/pkeys: key allocation/deallocation must not change pkey registers · 4a4a5e5d

由 Ram Pai 提交于 7月 17, 2018

Key allocation and deallocation has the side effect of programming the
UAMOR/AMR/IAMR registers. This is wrong, since its the responsibility of
the application and not that of the kernel, to modify the permission on
the key.

Do not modify the pkey registers at key allocation/deallocation.

This patch also fixes a bug where a sys_pkey_free() resets the UAMOR
bits of the key, thus making its permissions unmodifiable from user
space. Later if the same key gets reallocated from a different thread
this thread will no longer be able to change the permissions on the key.

Fixes: cf43d3b2 ("powerpc: Enable pkey subsystem")
Cc: stable@vger.kernel.org # v4.16+
Reviewed-by: NThiago Jung Bauermann <bauerman@linux.ibm.com>
Signed-off-by: NRam Pai <linuxram@us.ibm.com>
Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>

4a4a5e5d

16 7月, 2018 3 次提交

powerpc/powernv/ioda: Allocate indirect TCE levels on demand · a68bd126

由 Alexey Kardashevskiy 提交于 7月 04, 2018

At the moment we allocate the entire TCE table, twice (hardware part and
userspace translation cache). This normally works as we normally have
contigous memory and the guest will map entire RAM for 64bit DMA.

However if we have sparse RAM (one example is a memory device), then
we will allocate TCEs which will never be used as the guest only maps
actual memory for DMA. If it is a single level TCE table, there is nothing
we can really do but if it a multilevel table, we can skip allocating
TCEs we know we won't need.

This adds ability to allocate only first level, saving memory.

This changes iommu_table::free() to avoid allocating of an extra level;
iommu_table::set() will do this when needed.

This adds @alloc parameter to iommu_table::exchange() to tell the callback
if it can allocate an extra level; the flag is set to "false" for
the realmode KVM handlers of H_PUT_TCE hcalls and the callback returns
H_TOO_HARD.

This still requires the entire table to be counted in mm::locked_vm.

To be conservative, this only does on-demand allocation when
the usespace cache table is requested which is the case of VFIO.

The example math for a system replicating a powernv setup with NVLink2
in a guest:
16GB RAM mapped at 0x0
128GB GPU RAM window (16GB of actual RAM) mapped at 0x244000000000

the table to cover that all with 64K pages takes:
(((0x244000000000 + 0x2000000000) >> 16)*8)>>20 = 4556MB

If we allocate only necessary TCE levels, we will only need:
(((0x400000000 + 0x400000000) >> 16)*8)>>20 = 4MB (plus some for indirect
levels).
Signed-off-by: NAlexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: NDavid Gibson <david@gibson.dropbear.id.au>
Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>

a68bd126

powerpc/powernv: Add indirect levels to it_userspace · 090bad39

由 Alexey Kardashevskiy 提交于 7月 04, 2018

We want to support sparse memory and therefore huge chunks of DMA windows
do not need to be mapped. If a DMA window big enough to require 2 or more
indirect levels, and a DMA window is used to map all RAM (which is
a default case for 64bit window), we can actually save some memory by
not allocation TCE for regions which we are not going to map anyway.

The hardware tables alreary support indirect levels but we also keep
host-physical-to-userspace translation array which is allocated by
vmalloc() and is a flat array which might use quite some memory.

This converts it_userspace from vmalloc'ed array to a multi level table.

As the format becomes platform dependend, this replaces the direct access
to it_usespace with a iommu_table_ops::useraddrptr hook which returns
a pointer to the userspace copy of a TCE; future extension will return
NULL if the level was not allocated.

This should not change non-KVM handling of TCE tables and it_userspace
will not be allocated for non-KVM tables.
Reviewed-by: NDavid Gibson <david@gibson.dropbear.id.au>
Signed-off-by: NAlexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>

090bad39

KVM: PPC: Make iommu_table::it_userspace big endian · 00a5c58d

由 Alexey Kardashevskiy 提交于 7月 04, 2018

We are going to reuse multilevel TCE code for the userspace copy of
the TCE table and since it is big endian, let's make the copy big endian
too.
Reviewed-by: NDavid Gibson <david@gibson.dropbear.id.au>
Signed-off-by: NAlexey Kardashevskiy <aik@ozlabs.ru>
Acked-by: NPaul Mackerras <paulus@ozlabs.org>
Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>

00a5c58d

openeuler / Kernel 1 年多 前同步成功

openeuler / Kernel
1 年多前同步成功