提交 · d148c94c27a87213f995f0a7519b231719cfb919 · openanolis / cloud-kernel

19 4月, 2017 11 次提交

powerpc/perf: Support to export SIERs bit in Power9 · d148c94c

由 Madhavan Srinivasan 提交于 4月 11, 2017

Patch to export SIER bits to userspace via
perf_mem_data_src and perf_sample_data struct.
Signed-off-by: NMadhavan Srinivasan <maddy@linux.vnet.ibm.com>
Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>

d148c94c

powerpc/perf: Support to export SIERs bit in Power8 · 453ce7a9

由 Madhavan Srinivasan 提交于 4月 11, 2017

Patch to export SIER bits to userspace via
perf_mem_data_src and perf_sample_data struct.
Signed-off-by: NSukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Signed-off-by: NMadhavan Srinivasan <maddy@linux.vnet.ibm.com>
Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>

453ce7a9

powerpc/perf: Support to export MMCRA[TEC*] field to userspace · 170a315f

由 Madhavan Srinivasan 提交于 4月 11, 2017

Threshold feature when used with MMCRA [Threshold Event Counter Event],
MMCRA[Threshold Start event] and MMCRA[Threshold End event] will update
MMCRA[Threashold Event Counter Exponent] and MMCRA[Threshold Event
Counter Multiplier] with the corresponding threshold event count values.
Patch to export MMCRA[TECX/TECM] to userspace in 'weight' field of
struct perf_sample_data.
Signed-off-by: NMadhavan Srinivasan <maddy@linux.vnet.ibm.com>
Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>

170a315f

powerpc/perf: Export memory hierarchy info to user space · 79e96f8f

由 Madhavan Srinivasan 提交于 4月 11, 2017

The LDST field and DATA_SRC in SIER identifies the memory hierarchy level
(eg: L1, L2 etc), from which a data-cache miss for a marked instruction
was satisfied. Use the 'perf_mem_data_src' object to export this
hierarchy level to user space.
Signed-off-by: NSukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Signed-off-by: NMadhavan Srinivasan <maddy@linux.vnet.ibm.com>
Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>

79e96f8f

powerpc/perf: Define big-endian version of perf_mem_data_src · 8c5073db

由 Sukadev Bhattiprolu 提交于 4月 11, 2017

perf_mem_data_src is a union that is initialized in the kernel via the ->val
field and accessed by userspace via the mem_xxx bitfields. For this to work
correctly on big endian platforms, we need a big-endian definition for the
bitfields.

Currently on a big endian system, if a user requests PERF_SAMPLE_DATA_SRC (perf
report -d), they will get the default value from perf_sample_data_init(), which
is PERF_MEM_NA. The value for PERF_MEM_NA is constructed using shifts:

  /* TLB access */
  #define PERF_MEM_TLB_NA		0x01 /* not available */
  ...
  #define PERF_MEM_TLB_SHIFT	26

  #define PERF_MEM_S(a, s) \
	(((__u64)PERF_MEM_##a##_##s) << PERF_MEM_##a##_SHIFT)

  #define PERF_MEM_NA (PERF_MEM_S(OP, NA)   |\
		    PERF_MEM_S(LVL, NA)   |\
		    PERF_MEM_S(SNOOP, NA) |\
		    PERF_MEM_S(LOCK, NA)  |\
		    PERF_MEM_S(TLB, NA))

Which works out as:

  ((0x01 << 0) | (0x01 << 5) | (0x01 << 19) | (0x01 << 24) | (0x01 << 26))

Which means the PERF_MEM_NA value comes out of the kernel as 0x5080021
in CPU endian.

But then in the perf tool, the code uses the bitfields to inspect the value, and
currently the bitfields are defined using little endian ordering.

So eg. in perf_mem__tlb_scnprintf() we see:
  data_src->val = 0x5080021
             op = 0x0
            lvl = 0x0
          snoop = 0x0
           lock = 0x0
           dtlb = 0x0
           rsvd = 0x5080021

Because of the way the perf tool code is written this is still displayed to the
user as "N/A", so there is no bug visible at the UI level.

Currently there are no big endian architectures which export a meaningful
value (ie. other than PERF_MEM_NA), so the extent of the bug on big endian
platforms is that the PERF_MEM_NA value is exported incorrectly as described
above. Subsequent patches will add support on big endian powerpc for populating
the data source value.

This patch does a minimal fix of adding big endian definition of the bitfields
to match the values that are already exported by the kernel on big endian. And
it makes no change on little endian.
Signed-off-by: NSukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Signed-off-by: NMadhavan Srinivasan <maddy@linux.vnet.ibm.com>
Acked-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>

8c5073db

powerpc/iommu: Do not call PageTransHuge() on tail pages · e889e96e

由 Alexey Kardashevskiy 提交于 4月 11, 2017

The CMA pages migration code does not support compound pages at
the moment so it performs few tests before proceeding to actual page
migration.

One of the tests - PageTransHuge() - has VM_BUG_ON_PAGE(PageTail()) as
it is designed to be called on head pages only. Since we also test for
PageCompound(), and it contains PageTail() and PageHead(), we can
simplify the check by leaving just PageCompound() and therefore avoid
possible VM_BUG_ON_PAGE.

Fixes: 2e5bbb54 ("KVM: PPC: Book3S HV: Migrate pinned pages out of CMA")
Cc: stable@vger.kernel.org # v4.9+
Signed-off-by: NAlexey Kardashevskiy <aik@ozlabs.ru>
Acked-by: NBalbir Singh <bsingharora@gmail.com>
Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>

e889e96e

powerpc/mmap: Any hint > 128TB searches the full VA space · 321f7d29

由 Aneesh Kumar K.V 提交于 4月 18, 2017

As part of the new large address space support, processes start out life with a
128TB virtual address space. However when calling mmap() a process can pass a
hint address, and if that hint is > 128TB the kernel will use the full 512TB
address space to try and satisfy the mmap() request.

Currently we have a check that the hint is > 128TB and < 512TB (TASK_SIZE),
which was added as an optimisation to avoid updating addr_limit unnecessarily
and also to avoid calling slice_flush_segments() on all CPUs more than
necessary.

However this has the user-visible side effect that an mmap() hint above 512TB
does not search the full address space unless a preceding mmap() used a hint
value > 128TB && < 512TB.

So fix it to treat any hint above 128TB as a hint to search the full address
space, instead of checking the hint against TASK_SIZE, we instead check if the
addr_limit is already == TASK_SIZE.

This also brings the ABI in-line with what is proposed on x86. ie, that a hint
address above 128TB up to and including (2^64)-1 is an indication to search the
full address space.

Fixes: f4ea6dcb (powerpc/mm: Enable mappings above 128TB)
Signed-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>

321f7d29

cxl: Enable PCI device IDs for future IBM CXL adapters · 41e20d95

由 Matthew R. Ochs 提交于 3月 24, 2017

Add support for future IBM Coherent Accelerator (CXL) devices
with an IDs of 0x0623 and 0x0628.
Signed-off-by: NMatthew R. Ochs <mrochs@linux.vnet.ibm.com>
Signed-off-by: NUma Krishnan <ukrishn@linux.vnet.ibm.com>
Acked-by: NFrederic Barrat <fbarrat@linux.vnet.ibm.com>
Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>

41e20d95

powerpc/64s: Minor fix for MCE TLB flush for radix · 95dbdf4f

由 Nicholas Piggin 提交于 4月 17, 2017

The TLB flush for radix first flushes TLB for radix configuration,
then flushes for hash configuration. The second flush is unnecessary
but does not affect correctness.

Fixes: 1a472c9d ("powerpc/mm/radix: Add tlbflush routines")
Signed-off-by: NNicholas Piggin <npiggin@gmail.com>
Reviewed-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>

95dbdf4f

powerpc/mm/radix: Use mm->task_size for boundary checking instead of addr_limit · be77e999

由 Aneesh Kumar K.V 提交于 4月 14, 2017

We don't init addr_limit correctly for 32 bit applications. So default to using
mm->task_size for boundary condition checking. We use addr_limit to only control
free space search. This makes sure that we do the right thing with 32 bit
applications.

We should consolidate the usage of TASK_SIZE/mm->task_size and
mm->context.addr_limit later.

This partially reverts commit fbfef902 (powerpc/mm: Switch some
TASK_SIZE checks to use mm_context addr_limit).

Fixes: fbfef902 ("powerpc/mm: Switch some TASK_SIZE checks to use mm_context addr_limit")
Signed-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>

be77e999

powerpc/64s: Revert setting of LPCR[LPES] on POWER9 · 8d1b48ef

由 Nicholas Piggin 提交于 4月 19, 2017

The XIVE enablement patches included a change to set the LPES (Logical
Partitioning Environment Selector) bit (bit # 3) in LPCR (Logical Partitioning
Control Register) on POWER9 hosts. This bit sets external interrupts to guest
delivery mode, which uses SRR0/1. The host's EE interrupt handler is written to
expect HSRR0/1 (for earlier CPUs). This should be fine because XIVE is
configured not to deliver EEs to the host (Hypervisor Virtulization Interrupt is
used instead) so the EE handler should never be executed.

However a bug in interrupt controller code, hardware, or odd configuration of a
simulator could result in the host getting an EE incorrectly. Keeping the EE
delivery mode matching the host EE handler prevents strange crashes due to using
the wrong exception registers.

KVM will configure the LPCR to set LPES prior to running a guest so that EEs are
delivered to the guest using SRR0/1.

Fixes: 08a1e650 ("powerpc: Fixup LPCR:PECE and HEIC setting on POWER9")
Signed-off-by: NNicholas Piggin <npiggin@gmail.com>
[mpe: Massage change log to avoid referring to LPES0 which is now renamed LPES]
Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>

8d1b48ef

13 4月, 2017 18 次提交

powerpc/pseries: Always enable SMP when building pseries · 270e2dc9

由 Michael Ellerman 提交于 4月 05, 2017

The pseries platform supports Power4 and later CPUs, all of which are
multithreaded and/or multicore.

In practice no one ever builds a SMP=n kernel for these machines. So as
we did for powernv, have the pseries platform imply SMP=y.
Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>

270e2dc9

powerpc/powernv: Always enable SMP when building powernv · 40e27565

由 Michael Ellerman 提交于 4月 05, 2017

The powernv platform supports Power7 and later CPUs, all of which are
multithreaded and multicore.

As such we never build a SMP=n kernel for those machines, other than
possibly for debugging or running in a simulator.

In the debugging case we can get a similar effect by booting with
nr_cpus=1, or there's always the option of building a custom kernel with
SMP hacked out.

For running in simulators the code size reduction from building without
SMP is not particularly important, what matters is the number of
instructions executed. A quick test shows that a SMP=y kernel takes ~6%
more instructions to boot to a shell. Booting with nr_cpus=1 recovers
about half that deficit.

On the flip side, keeping the SMP=n kernel building can be a pain at
times. And although we've mostly kept it building in recent years, no
one is regularly testing that the SMP=n kernel actually boots and works
well on these machines.
Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>

40e27565

powerpc: Allow platforms to force-enable CONFIG_SMP · ebbe9d7d

由 Michael Ellerman 提交于 4月 05, 2017

Of the 64-bit Book3S platforms, only powermac supports booting on an
actual non-SMP system. The other platforms can be built with SMP
disabled, but it doesn't make a lot of sense given the CPUs they support
are all multicore or multithreaded.

So give platforms the option of forcing SMP=y.
Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>

ebbe9d7d

powerpc: Drop include of linux/io.h from asm/io.h · 590c369e

由 Michael Ellerman 提交于 4月 13, 2017

Currently powerpc's asm/io.h includes linux/io.h, and linux/io.h
includes asm/io.h.

This can cause problems because depending on which is included first the
order of definitions between the two files will change.

The include of linux/io.h was added back in 2008 in commit b41e5fff
("[POWERPC] devres: Add devm_ioremap_prot()"). It's not entirely clear
it was needed then, but devm_ioremap_prot() has since been removed
entirely as unused, in dedd24a1 ("powerpc: Remove unused
devm_ioremap_prot()").

So it seems to be unnecessary and can potentially cause problems, so
remove the include of linux/io.h from asm/io.h
Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>

590c369e

powerpc/powernv: POWER9 support for msgsnd/doorbell IPI · 6b3edefe

由 Nicholas Piggin 提交于 4月 13, 2017

POWER9 requires msgsync for receiver-side synchronization, and a DD1
workaround restricts IPIs to core-local.
Signed-off-by: NNicholas Piggin <npiggin@gmail.com>
[mpe: Drop no longer needed asm feature macro changes]
Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>

6b3edefe

powerpc/64s: Avoid a branch for ppc_msgsnd · a5adf282

由 Nicholas Piggin 提交于 4月 13, 2017

IPIs are a pretty hot path and we already have the ability to do asm feature
patching, so use it.
Signed-off-by: NNicholas Piggin <npiggin@gmail.com>
[mpe: Change log detail]
Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>

a5adf282

powerpc: Introduce msgsnd/doorbell barrier primitives · b87ac021

由 Nicholas Piggin 提交于 4月 13, 2017

POWER9 changes requirements and adds new instructions for
synchronization.
Signed-off-by: NNicholas Piggin <npiggin@gmail.com>
Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>

b87ac021

powerpc: Change the doorbell IPI calling convention · b866cc21

由 Nicholas Piggin 提交于 4月 13, 2017

Change the doorbell callers to know about their msgsnd addressing,
rather than have them set a per-cpu target data tag at boot that gets
sent to the cause_ipi functions. The data is only used for doorbell IPI
functions, no other IPI types, so it makes sense to keep that detail
local to doorbell.

Have the platform code understand doorbell IPIs, rather than the
interrupt controller code understand them. Platform code can look at
capabilities it has available and decide which to use.
Signed-off-by: NNicholas Piggin <npiggin@gmail.com>
Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>

b866cc21

powerpc/64s: Add SCV FSCR bit for ISA v3.0 · 9b7ff0c6

由 Nicholas Piggin 提交于 4月 07, 2017

Add the bit definition and use it in facility_unavailable_exception() so we can
intelligently report the cause if we take a fault for SCV. This doesn't actually
enable SCV.
Signed-off-by: NNicholas Piggin <npiggin@gmail.com>
[mpe: Drop whitespace changes to the existing entries, flush out change log]
Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>

9b7ff0c6

N
powerpc/64s: Add msgp facility unavailable log string · 794464f4
由 Nicholas Piggin 提交于 4月 07, 2017
```
Signed-off-by: NNicholas Piggin <npiggin@gmail.com>
Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
```
794464f4

powerpc/mm/hash: Don't open code VMALLOC_INDEX · 5f812261

由 Aneesh Kumar K.V 提交于 4月 12, 2017

We have a #define for it, so use it.
Signed-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>

5f812261

cxl: Add psl9 specific code · f24be42a

由 Christophe Lombard 提交于 4月 12, 2017

The new Coherent Accelerator Interface Architecture, level 2, for the
IBM POWER9 brings new content and features:
- POWER9 Service Layer
- Registers
- Radix mode
- Process element entry
- Dedicated-Shared Process Programming Model
- Translation Fault Handling
- CAPP
- Memory Context ID
    If a valid mm_struct is found the memory context id is used for each
    transaction associated with the process handle. The PSL uses the
    context ID to find the corresponding process element.
Signed-off-by: NChristophe Lombard <clombard@linux.vnet.ibm.com>
Acked-by: NFrederic Barrat <fbarrat@linux.vnet.ibm.com>
[mpe: Fixup comment formatting, unsplit long strings]
Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>

f24be42a

cxl: Isolate few psl8 specific calls · abd1d99b

由 Christophe Lombard 提交于 4月 07, 2017

Point out the specific Coherent Accelerator Interface Architecture,
level 1, registers.
Code and functions specific to PSL8 (CAIA1) must be framed.
Signed-off-by: NChristophe Lombard <clombard@linux.vnet.ibm.com>
Acked-by: NFrederic Barrat <fbarrat@linux.vnet.ibm.com>
[mpe: Don't split long strings, it makes them hard to grep for]
Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>

abd1d99b

cxl: Rename some psl8 specific functions · 64663f37

由 Christophe Lombard 提交于 4月 07, 2017

Rename a few functions, changing the '_psl' suffix to '_psl8', to make
clear that the implementation is psl8 specific.
Those functions will have an equivalent implementation for the psl9 in
a later patch.
Signed-off-by: NChristophe Lombard <clombard@linux.vnet.ibm.com>
Reviewed-by: NAndrew Donnellan <andrew.donnellan@au1.ibm.com>
Acked-by: NFrederic Barrat <fbarrat@linux.vnet.ibm.com>
Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>

64663f37

cxl: Update implementation service layer · bdd2e715

由 Christophe Lombard 提交于 4月 07, 2017

The service layer API (in cxl.h) lists some low-level functions whose
implementation is different on PSL8, PSL9 and XSL:
- Init implementation for the adapter and the afu.
- Invalidate TLB/SLB.
- Attach process for dedicated/directed models.
- Handle psl interrupts.
- Debug registers for the adapter and the afu.
- Traces.
Each environment implements its own functions, and the common code uses
them through function pointers, defined in cxl_service_layer_ops.
Signed-off-by: NChristophe Lombard <clombard@linux.vnet.ibm.com>
Reviewed-by: NAndrew Donnellan <andrew.donnellan@au1.ibm.com>
Acked-by: NFrederic Barrat <fbarrat@linux.vnet.ibm.com>
Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>

bdd2e715

cxl: Keep track of mm struct associated with a context · 6dd2d234

由 Christophe Lombard 提交于 4月 07, 2017

The mm_struct corresponding to the current task is acquired each time
an interrupt is raised. So to simplify the code, we only get the
mm_struct when attaching an AFU context to the process.
The mm_count reference is increased to ensure that the mm_struct can't
be freed. The mm_struct will be released when the context is detached.
A reference on mm_users is not kept to avoid a circular dependency if
the process mmaps its cxl mmio and forget to unmap before exiting.
The field glpid (pid of the group leader associated with the pid), of
the structure cxl_context, is removed because it's no longer useful.
Signed-off-by: NChristophe Lombard <clombard@linux.vnet.ibm.com>
Reviewed-by: NAndrew Donnellan <andrew.donnellan@au1.ibm.com>
Acked-by: NFrederic Barrat <fbarrat@linux.vnet.ibm.com>
Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>

6dd2d234

cxl: Remove unused values in bare-metal environment. · 66ef20c7

由 Christophe Lombard 提交于 4月 07, 2017

The two previously fields pid and tid, located in the structure
cxl_irq_info, are only used in the guest environment. To avoid confusion,
it's not necessary to fill the fields in the bare-metal environment.
Pid_tid is now renamed to 'reserved' to avoid undefined behavior on
bare-metal. The PSL Process and Thread Identification Register
(CXL_PSL_PID_TID_An) is only used when attaching a dedicated process
for PSL8 only. This register goes away in CAIA2.
Signed-off-by: NChristophe Lombard <clombard@linux.vnet.ibm.com>
Reviewed-by: NAndrew Donnellan <andrew.donnellan@au1.ibm.com>
Acked-by: NFrederic Barrat <fbarrat@linux.vnet.ibm.com>
Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>

66ef20c7

cxl: Read vsec perst load image · aba81433

由 Christophe Lombard 提交于 4月 07, 2017

This bit is used to cause a flash image load for programmable
CAIA-compliant implementation. If this bit is set to ‘0’, a power
cycle of the adapter is required to load a programmable CAIA-com-
pliant implementation from flash.
This field will be used by the following patches.
Signed-off-by: NChristophe Lombard <clombard@linux.vnet.ibm.com>
Reviewed-by: NAndrew Donnellan <andrew.donnellan@au1.ibm.com>
Acked-by: NFrederic Barrat <fbarrat@linux.vnet.ibm.com>
Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>

aba81433

12 4月, 2017 6 次提交

powerpc/mm: Fix hash table dump when memory is not contiguous · 9e4114b3

由 Rashmica Gupta 提交于 4月 10, 2017

The current behaviour of the hash table dump assumes that memory is contiguous
and iterates from the start of memory to (start + size of memory). When memory
isn't physically contiguous, this doesn't work.

If memory exists at 0-5 GB and 6-10 GB then the current approach will check if
entries exist in the hash table from 0GB to 9GB. This patch changes the
behaviour to iterate over any holes up to the end of memory.

Fixes: 1515ab93 ("powerpc/mm: Dump hash table")
Signed-off-by: NRashmica Gupta <rashmica.g@gmail.com>
Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>

9e4114b3

powerpc/mm: Add physical address to Linux page table dump · aaa22952

由 Oliver O'Halloran 提交于 3月 31, 2017

The current page table dumper scans the Linux page tables and coalesces mappings
with adjacent virtual addresses and similar PTE flags. This behaviour is
somewhat broken when you consider the IOREMAP space where entirely unrelated
mappings will appear to be virtually contiguous. This patch modifies the range
coalescing so that only ranges that are both physically and virtually contiguous
are combined. This patch also adds to the dump output the physical address at
the start of each range.

Fixes: 8eb07b18 ("powerpc/mm: Dump linux pagetables")
Signed-off-by: NOliver O'Halloran <oohall@gmail.com>
[mpe: Print the physicall address with 0x like the other addresses]
Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>

aaa22952

powerpc/mm: Fix missing _PAGE_NON_IDEMPOTENT in pgtable dump · 70538eaa

由 Oliver O'Halloran 提交于 3月 31, 2017

On Book3s we have two PTE flags used to mark cache-inhibited mappings:
_PAGE_TOLERANT and _PAGE_NON_IDEMPOTENT. Currently the kernel page table dumper
only looks at the generic _PAGE_NO_CACHE which is defined to be _PAGE_TOLERANT.
This patch modifies the dumper so both flags are shown in the dump.

Fixes: 8eb07b18 ("powerpc/mm: Dump linux pagetables")
Signed-off-by: NOliver O'Halloran <oohall@gmail.com>
Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>

70538eaa

powerpc/tracing: Allow tracing of mmap syscalls · 9c355917

由 Balbir Singh 提交于 4月 12, 2017

Currently sys_mmap() and sys_mmap2() (32-bit only), are not visible to the
syscall tracing machinery. This means users are not able to see the execution of
mmap() syscalls using the syscall tracer.

Fix that by using SYSCALL_DEFINE6 for sys_mmap() and sys_mmap2() so that the
meta-data associated with these syscalls is visible to the syscall tracer.

A side-effect of this change is that the return type has changed from unsigned
long to long. However this should have no effect, the only code in the kernel
which uses the result of these syscalls is in the syscall return path, which is
written in asm and treats the result as unsigned regardless.

Example output:
  cat-3399  [001] ....   196.542410: sys_mmap(addr: 7fff922a0000, len: 20000, prot: 3, flags: 812, fd: 3, offset: 1b0000)
  cat-3399  [001] ....   196.542443: sys_mmap -> 0x7fff922a0000
  cat-3399  [001] ....   196.542668: sys_munmap(addr: 7fff922c0000, len: 6d2c)
  cat-3399  [001] ....   196.542677: sys_munmap -> 0x0
Signed-off-by: NBalbir Singh <bsingharora@gmail.com>
[mpe: Massage change log, add detail on return type change]
Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>

9c355917

powerpc/mm: Fix swapper_pg_dir size on 64-bit hash w/64K pages · 03dfee6d

由 Michael Ellerman 提交于 4月 12, 2017

Recently in commit f6eedbba ("powerpc/mm/hash: Increase VA range to 128TB"),
we increased H_PGD_INDEX_SIZE to 15 when we're building with 64K pages. This
makes it larger than RADIX_PGD_INDEX_SIZE (13), which means the logic to
calculate MAX_PGD_INDEX_SIZE in book3s/64/pgtable.h is wrong.

The end result is that the PGD (Page Global Directory, ie top level page table)
of the kernel (aka. swapper_pg_dir), is too small.

This generally doesn't lead to a crash, as we don't use the full range in normal
operation. However if we try to dump the kernel pagetables we can trigger a
crash because we walk off the end of the pgd into other memory and eventually
try to dereference something bogus:

  $ cat /sys/kernel/debug/kernel_pagetables
  Unable to handle kernel paging request for data at address 0xe8fece0000000000
  Faulting instruction address: 0xc000000000072314
  cpu 0xc: Vector: 380 (Data SLB Access) at [c0000000daa13890]
      pc: c000000000072314: ptdump_show+0x164/0x430
      lr: c000000000072550: ptdump_show+0x3a0/0x430
     dar: e802cf0000000000
  seq_read+0xf8/0x560
  full_proxy_read+0x84/0xc0
  __vfs_read+0x6c/0x1d0
  vfs_read+0xbc/0x1b0
  SyS_read+0x6c/0x110
  system_call+0x38/0xfc

The root cause is that MAX_PGD_INDEX_SIZE isn't actually computed to be
the max of H_PGD_INDEX_SIZE or RADIX_PGD_INDEX_SIZE. To fix that move
the calculation into asm-offsets.c where we can do it easily using
max().

Fixes: f6eedbba ("powerpc/mm/hash: Increase VA range to 128TB")
Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>

03dfee6d

Merge branch 'topic/xive' (early part) into next · 3c19d5ad

由 Michael Ellerman 提交于 4月 12, 2017

This merges the arch part of the XIVE support, leaving the final commit
with the KVM specific pieces dangling on the branch for Paul to merge
via the kvm-ppc tree.

3c19d5ad

11 4月, 2017 5 次提交

powerpc/powernv: Recover correct PACA on wakeup from a stop on P9 DD1 · 17ed4c8f

由 Gautham R. Shenoy 提交于 3月 22, 2017

POWER9 DD1.0 hardware has a bug where the SPRs of a thread waking up
from stop 0,1,2 with ESL=1 can endup being misplaced in the core. Thus
the HSPRG0 of a thread waking up from can contain the paca pointer of
its sibling.

This patch implements a context recovery framework within threads of a
core, by provisioning space in paca_struct for saving every sibling
threads's paca pointers. Basically, we should be able to arrive at the
right paca pointer from any of the thread's existing paca pointer.

At bootup, during powernv idle-init, we save the paca address of every
CPU in each one its siblings paca_struct in the slot corresponding to
this CPU's index in the core.

On wakeup from a stop, the thread will determine its index in the core
from the TIR register and recover its PACA pointer by indexing into
the correct slot in the provisioned space in the current PACA.

Furthermore, ensure that the NVGPRs are restored from the stack on the
way out by setting the NAPSTATELOST in paca.

[Changelog written with inputs from svaidy@linux.vnet.ibm.com]
Signed-off-by: NGautham R. Shenoy <ego@linux.vnet.ibm.com>
Reviewed-by: NNicholas Piggin <npiggin@gmail.com>
[mpe: Call it a bug]
Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>

17ed4c8f

powerpc/powernv/idle: Don't override default/deepest directly in kernel · f3b3f284

由 Gautham R. Shenoy 提交于 3月 22, 2017

Currently during idle-init on power9, if we don't find suitable stop
states in the device tree that can be used as the
default_stop/deepest_stop, we set stop0 (ESL=1,EC=1) as the default
stop state psscr to be used by power9_idle and deepest stop state
which is used by CPU-Hotplug.

However, if the platform firmware has not configured or enabled a stop
state, the kernel should not make any assumptions and fallback to a
default choice.

If the kernel uses a stop state that is not configured by the platform
firmware, it may lead to further failures which should be avoided.

In this patch, we modify the init code to ensure that the kernel uses
only the stop states exposed by the firmware through the device
tree. When a suitable default stop state isn't found, we disable
ppc_md.power_save for power9. Similarly, when a suitable
deepest_stop_state is not found in the device tree exported by the
firmware, fall back to the default busy-wait loop in the CPU-Hotplug
code.

[Changelog written with inputs from svaidy@linux.vnet.ibm.com]
Reviewed-by: NNicholas Piggin <npiggin@gmail.com>
Signed-off-by: NGautham R. Shenoy <ego@linux.vnet.ibm.com>
Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>

f3b3f284

powerpc/powernv/smp: Add busy-wait loop as fall back for CPU-Hotplug · 90061231

由 Gautham R. Shenoy 提交于 3月 22, 2017

Currently, the powernv cpu-offline function assumes that platform idle
states such as stop on POWER9, winkle/sleep/nap on POWER8 are always
available. On POWER8, it picks nap as the default state if other deep
idle states like sleep/winkle are not available and enabled in the
platform.

On POWER9, nap is not available and all idle states are managed by
STOP instruction.  The parameters to the idle state are passed through
processor stop status control register (PSSCR).  Hence as such
executing STOP would take parameters from current PSSCR. We do not
want to make any assumptions in kernel on what STOP states and PSSCR
features are configured by the platform.

Ideally platform will configure a good set of stop states that can be
used in the kernel.  We would like to start with a clean slate, if the
platform choose to not configure any state or there is an error in
platform firmware that lead to no stop states being configured or
allowed to be requested.

This patch adds a fallback method for CPU-Hotplug that is similar to
snooze loop at idle where the threads are left to spin at low priority
and hence reduce the cycles consumed.

This is a safe fallback mechanism in the case when no stop state would
be requested if the platform firmware did not configure them most
likely due to an error condition.

Requesting a stop state when the platform has not configured them or
enabled them would lead to further error conditions which could be
difficult to debug.

[Changelog written with inputs from svaidy@linux.vnet.ibm.com]
Reviewed-by: NNicholas Piggin <npiggin@gmail.com>
Signed-off-by: NGautham R. Shenoy <ego@linux.vnet.ibm.com>
Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>

90061231

powerpc/powernv: Move CPU-Offline idle state invocation from smp.c to idle.c · a7cd88da

由 Gautham R. Shenoy 提交于 3月 22, 2017

Move the piece of code in powernv/smp.c::pnv_smp_cpu_kill_self() which
transitions the CPU to the deepest available platform idle state to a
new function named pnv_cpu_offline() in powernv/idle.c. The rationale
behind this code movement is that the data required to determine the
deepest available platform state resides in powernv/idle.c.
Reviewed-by: NNicholas Piggin <npiggin@gmail.com>
Signed-off-by: NGautham R. Shenoy <ego@linux.vnet.ibm.com>
Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>

a7cd88da

powerpc/hugetlb: Add ABI defines for supported HugeTLB page sizes · 2c9faa76

由 Anshuman Khandual 提交于 4月 07, 2017

Add user space exported API definitions for 512KB, 1MB, 2MB, 8MB, 16MB,
1GB, 16GB non default huge page sizes to be used with mmap() system
call.
Signed-off-by: NAnshuman Khandual <khandual@linux.vnet.ibm.com>
[mpe: Reword the comment to emphasise that these are only needed to use
 the non-default huge page size, and updated the change log.]
Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>

2c9faa76

openanolis / cloud-kernel 1 年多 前同步成功

openanolis / cloud-kernel
1 年多前同步成功