提交 · 65eef33550f68e9a7f7d2dc64da94fb6cb85be2c · openanolis / cloud-kernel

22 4月, 2014 1 次提交

KVM: s390: Adding skey bit to mmu context · 65eef335

由 Dominik Dingel 提交于 1月 14, 2014

For lazy storage key handling, we need a mechanism to track if the
process ever issued a storage key operation.

This patch adds the basic infrastructure for making the storage
key handling optional, but still leaves it enabled for now by default.
Signed-off-by: NDominik Dingel <dingel@linux.vnet.ibm.com>
Acked-by: NMartin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: NChristian Borntraeger <borntraeger@de.ibm.com>

65eef335

03 4月, 2014 3 次提交

s390/uaccess: rework uaccess code - fix locking issues · 457f2180

由 Heiko Carstens 提交于 3月 21, 2014

The current uaccess code uses a page table walk in some circumstances,
e.g. in case of the in atomic futex operations or if running on old
hardware which doesn't support the mvcos instruction.

However it turned out that the page table walk code does not correctly
lock page tables when accessing page table entries.
In other words: a different cpu may invalidate a page table entry while
the current cpu inspects the pte. This may lead to random data corruption.

Adding correct locking however isn't trivial for all uaccess operations.
Especially copy_in_user() is problematic since that requires to hold at
least two locks, but must be protected against ABBA deadlock when a
different cpu also performs a copy_in_user() operation.

So the solution is a different approach where we change address spaces:

User space runs in primary address mode, or access register mode within
vdso code, like it currently already does.

The kernel usually also runs in home space mode, however when accessing
user space the kernel switches to primary or secondary address mode if
the mvcos instruction is not available or if a compare-and-swap (futex)
instruction on a user space address is performed.
KVM however is special, since that requires the kernel to run in home
address space while implicitly accessing user space with the sie
instruction.

So we end up with:

User space:
- runs in primary or access register mode
- cr1 contains the user asce
- cr7 contains the user asce
- cr13 contains the kernel asce

Kernel space:
- runs in home space mode
- cr1 contains the user or kernel asce
  -> the kernel asce is loaded when a uaccess requires primary or
     secondary address mode
- cr7 contains the user or kernel asce, (changed with set_fs())
- cr13 contains the kernel asce

In case of uaccess the kernel changes to:
- primary space mode in case of a uaccess (copy_to_user) and uses
  e.g. the mvcp instruction to access user space. However the kernel
  will stay in home space mode if the mvcos instruction is available
- secondary space mode in case of futex atomic operations, so that the
  instructions come from primary address space and data from secondary
  space

In case of kvm the kernel runs in home space mode, but cr1 gets switched
to contain the gmap asce before the sie instruction gets executed. When
the sie instruction is finished cr1 will be switched back to contain the
user asce.

A context switch between two processes will always load the kernel asce
for the next process in cr1. So the first exit to user space is a bit
more expensive (one extra load control register instruction) than before,
however keeps the code rather simple.

In sum this means there is no need to perform any error prone page table
walks anymore when accessing user space.

The patch seems to be rather large, however it mainly removes the
the page table walk code and restores the previously deleted "standard"
uaccess code, with a couple of changes.

The uaccess without mvcos mode can be enforced with the "uaccess_primary"
kernel parameter.
Reported-by: NChristian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: NHeiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: NMartin Schwidefsky <schwidefsky@de.ibm.com>

457f2180

s390/mm,tlb: optimize TLB flushing for zEC12 · 1b948d6c

由 Martin Schwidefsky 提交于 4月 03, 2014

The zEC12 machines introduced the local-clearing control for the IDTE
and IPTE instruction. If the control is set only the TLB of the local
CPU is cleared of entries, either all entries of a single address space
for IDTE, or the entry for a single page-table entry for IPTE.
Without the local-clearing control the TLB flush is broadcasted to all
CPUs in the configuration, which is expensive.

The reset of the bit mask of the CPUs that need flushing after a
non-local IDTE is tricky. As TLB entries for an address space remain
in the TLB even if the address space is detached a new bit field is
required to keep track of attached CPUs vs. CPUs in the need of a
flush. After a non-local flush with IDTE the bit-field of attached CPUs
is copied to the bit-field of CPUs in need of a flush. The ordering
of operations on cpu_attach_mask, attach_count and mm_cpumask(mm) is
such that an underindication in mm_cpumask(mm) is prevented but an
overindication in mm_cpumask(mm) is possible.
Signed-off-by: NMartin Schwidefsky <schwidefsky@de.ibm.com>

1b948d6c

s390/mm,tlb: safeguard against speculative TLB creation · 02a8f3ab

由 Martin Schwidefsky 提交于 4月 03, 2014

The principles of operations states that the CPU is allowed to create
TLB entries for an address space anytime while an ASCE is loaded to
the control register. This is true even if the CPU is running in the
kernel and the user address space is not (actively) accessed.

In theory this can affect two aspects of the TLB flush logic.
For full-mm flushes the ASCE of the dying process is still attached.
The approach to flush first with IDTE and then just free all page
tables can in theory lead to stale TLB entries. Use the batched
free of page tables for the full-mm flushes as well.

For operations that can have a stale ASCE in the control register,
e.g. a delayed update_user_asce in switch_mm, load the kernel ASCE
to prevent invalid TLBs from being created.
Signed-off-by: NMartin Schwidefsky <schwidefsky@de.ibm.com>

02a8f3ab

21 2月, 2014 1 次提交

s390/mm,tlb: race of lazy TLB flush vs. recreation of TLB entries · 53e857f3

由 Martin Schwidefsky 提交于 9月 10, 2012

Git commit 050eef36 "[S390] fix tlb flushing vs. concurrent
/proc accesses" introduced the attach counter to avoid using the
mm_users value to decide between IPTE for every PTE and lazy TLB
flushing with IDTE. That fixed the problem with mm_users but it
introduced another subtle race, fortunately one that is very hard
to hit.
The background is the requirement of the architecture that a valid
PTE may not be changed while it can be used concurrently by another
cpu. The decision between IPTE and lazy TLB flushing needs to be
done while the PTE is still valid. Now if the virtual cpu is
temporarily stopped after the decision to use lazy TLB flushing but
before the invalid bit of the PTE has been set, another cpu can attach
the mm, find that flush_mm is set, do the IDTE, return to userspace,
and recreate a TLB that uses the PTE in question. When the first,
stopped cpu continues it will change the PTE while it is attached on
another cpu. The first cpu will do another IDTE shortly after the
modification of the PTE which makes the race window quite short.

To fix this race the CPU that wants to attach the address space of a
user space thread needs to wait for the end of the PTE modification.
The number of concurrent TLB flushers for an mm is tracked in the
upper 16 bits of the attach_count and finish_arch_post_lock_switch
is used to wait for the end of the flush operation if required.
Signed-off-by: NMartin Schwidefsky <schwidefsky@de.ibm.com>

53e857f3

24 10月, 2013 1 次提交

s390/uaccess: always run the kernel in home space · e258d719

由 Martin Schwidefsky 提交于 9月 24, 2013

Simplify the uaccess code by removing the user_mode=home option.
The kernel will now always run in the home space mode.
Signed-off-by: NMartin Schwidefsky <schwidefsky@de.ibm.com>

e258d719

22 8月, 2013 1 次提交

s390/mm: introduce ptep_flush_lazy helper · 5c474a1e

由 Martin Schwidefsky 提交于 8月 16, 2013

Isolate the logic of IDTE vs. IPTE flushing of ptes in two functions,
ptep_flush_lazy and __tlb_flush_mm_lazy.
Signed-off-by: NMartin Schwidefsky <schwidefsky@de.ibm.com>

5c474a1e

29 7月, 2013 1 次提交

KVM: s390: allow sie enablement for multi-threaded programs · 3eabaee9

由 Martin Schwidefsky 提交于 7月 26, 2013

Improve the code to upgrade the standard 2K page tables to 4K page tables
with PGSTEs to allow the operation to happen when the program is already
multi-threaded.
Signed-off-by: NMartin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: NChristian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

3eabaee9

26 9月, 2012 1 次提交

s390/mm: rename addressing_mode to s390_user_mode · d1b0d842

由 Heiko Carstens 提交于 9月 02, 2012

Renaming the globally visible variable "user_mode" to "addressing_mode" in
order to fix a name clash was not a good idea. (Commit 37fe1d73 "s390/mm:
rename user_mode variable to addressing_mode")
Looking at the code after a couple of weeks one thinks: addressing mode of
what?
So rename the variable again. This time to s390_user_mode. Which hopefully
makes more sense.
Signed-off-by: NHeiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: NMartin Schwidefsky <schwidefsky@de.ibm.com>

d1b0d842

30 7月, 2012 1 次提交

s390/mm: rename user_mode variable to addressing_mode · 37fe1d73

由 Heiko Carstens 提交于 7月 27, 2012

Fix name clash with user_mode() define which is also used in common code.
Signed-off-by: NHeiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: NMartin Schwidefsky <schwidefsky@de.ibm.com>

37fe1d73

26 7月, 2012 1 次提交

s390/mm: downgrade page table after fork of a 31 bit process · 0f6f281b

由 Martin Schwidefsky 提交于 7月 26, 2012

The downgrade of the 4 level page table created by init_new_context is
currently done only in start_thread31. If a 31 bit process forks the
new mm uses a 4 level page table, including the task size of 2<<42
that goes along with it. This is incorrect as now a 31 bit process
can map memory beyond 2GB. Define arch_dup_mmap to do the downgrade
after fork.

Cc: stable@vger.kernel.org
Signed-off-by: NMartin Schwidefsky <schwidefsky@de.ibm.com>

0f6f281b

20 7月, 2012 1 次提交

s390/comments: unify copyright messages and remove file names · a53c8fab

由 Heiko Carstens 提交于 7月 20, 2012

Remove the file name from the comment at top of many files. In most
cases the file name was wrong anyway, so it's rather pointless.

Also unify the IBM copyright statement. We did have a lot of sightly
different statements and wanted to change them one after another
whenever a file gets touched. However that never happened. Instead
people start to take the old/"wrong" statements to use as a template
for new files.
So unify all of them in one go.
Signed-off-by: NHeiko Carstens <heiko.carstens@de.ibm.com>

a53c8fab

24 5月, 2012 1 次提交

s390/headers: replace __s390x__ with CONFIG_64BIT where possible · f4815ac6

由 Heiko Carstens 提交于 5月 23, 2012

Replace __s390x__ with CONFIG_64BIT in all places that are not exported
to userspace or guarded with #ifdef __KERNEL__.
Signed-off-by: NHeiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: NMartin Schwidefsky <schwidefsky@de.ibm.com>

f4815ac6

29 3月, 2012 1 次提交

Disintegrate asm/system.h for S390 · a0616cde

由 David Howells 提交于 3月 28, 2012

Disintegrate asm/system.h for S390.
Signed-off-by: NDavid Howells <dhowells@redhat.com>
cc: linux-s390@vger.kernel.org

a0616cde

23 5月, 2011 1 次提交

[S390] Remove data execution protection · 043d0708

由 Martin Schwidefsky 提交于 5月 23, 2011

The noexec support on s390 does not rely on a bit in the page table
entry but utilizes the secondary space mode to distinguish between
memory accesses for instructions vs. data. The noexec code relies
on the assumption that the cpu will always use the secondary space
page table for data accesses while it is running in the secondary
space mode. Up to the z9-109 class machines this has been the case.
Unfortunately this is not true anymore with z10 and later machines.
The load-relative-long instructions lrl, lgrl and lgfrl access the
memory operand using the same addressing-space mode that has been
used to fetch the instruction.
This breaks the noexec mode for all user space binaries compiled
with march=z10 or later. The only option is to remove the current
noexec support.
Signed-off-by: NMartin Schwidefsky <schwidefsky@de.ibm.com>

043d0708

10 5月, 2011 1 次提交

[S390] fix alloc_pgste check in init_new_context · badb8bb9

由 Martin Schwidefsky 提交于 5月 10, 2011

Processes started with kernel_execve from a kernel thread will have
current->mm==NULL. Reading current->mm->context.alloc_pgste will
read a more or less random bit from lowcore in this case. If the
bit turns out to be set the whole process tree started this way
will allocate page table extensions although they have no need
for it.
Signed-off-by: NMartin Schwidefsky <schwidefsky@de.ibm.com>

badb8bb9

24 8月, 2010 1 次提交

[S390] fix tlb flushing vs. concurrent /proc accesses · 050eef36

由 Martin Schwidefsky 提交于 8月 24, 2010

The tlb flushing code uses the mm_users field of the mm_struct to
decide if each page table entry needs to be flushed individually with
IPTE or if a global flush for the mm_struct is sufficient after all page
table updates have been done. The comment for mm_users says "How many
users with user space?" but the /proc code increases mm_users after it
found the process structure by pid without creating a new user process.
Which makes mm_users useless for the decision between the two tlb
flusing methods. The current code can be confused to not flush tlb
entries by a concurrent access to /proc files if e.g. a fork is in
progres. The solution for this problem is to make the tlb flushing
logic independent from the mm_users field.
Signed-off-by: NMartin Schwidefsky <schwidefsky@de.ibm.com>

050eef36

07 12月, 2009 1 次提交

[S390] Improve address space mode selection. · b11b5334

由 Martin Schwidefsky 提交于 12月 07, 2009

Introduce user_mode to replace the two variables switch_amode and
s390_noexec. There are three valid combinations of the old values:
  1) switch_amode == 0 && s390_noexec == 0
  2) switch_amode == 1 && s390_noexec == 0
  3) switch_amode == 1 && s390_noexec == 1
They get replaced by
  1) user_mode == HOME_SPACE_MODE
  2) user_mode == PRIMARY_SPACE_MODE
  3) user_mode == SECONDARY_SPACE_MODE
The new kernel parameter user_mode=[primary,secondary,home] lets
you choose the address space mode the user space processes should
use. In addition the CONFIG_S390_SWITCH_AMODE config option
is removed.
Signed-off-by: NMartin Schwidefsky <schwidefsky@de.ibm.com>

b11b5334

26 3月, 2009 1 次提交

[S390] cpumask: use mm_cpumask() wrapper · 005f8eee

由 Rusty Russell 提交于 3月 26, 2009

Makes code futureproof against the impending change to mm->cpu_vm_mask.

It's also a chance to use the new cpumask_ ops which take a pointer
(the older ones are deprecated, but there's no hurry for arch code).
Signed-off-by: NRusty Russell <rusty@rustcorp.com.au>
Signed-off-by: NMartin Schwidefsky <schwidefsky@de.ibm.com>

005f8eee

28 10月, 2008 1 次提交

[S390] pgtables: Fix race in enable_sie vs. page table ops · 250cf776

由 Christian Borntraeger 提交于 10月 28, 2008

The current enable_sie code sets the mm->context.pgstes bit to tell
dup_mm that the new mm should have extended page tables. This bit is also
used by the s390 specific page table primitives to decide about the page
table layout - which means context.pgstes has two meanings. This can cause
any kind of bugs. For example  - e.g. shrink_zone can call
ptep_clear_flush_young while enable_sie is running. ptep_clear_flush_young
will test for context.pgstes. Since enable_sie changed that value of the old
struct mm without changing the page table layout ptep_clear_flush_young will
do the wrong thing.
The solution is to split pgstes into two bits
- one for the allocation
- one for the current state
Signed-off-by: NChristian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: NMartin Schwidefsky <schwidefsky@de.ibm.com>

250cf776

02 8月, 2008 1 次提交
- M
  [S390] move include/asm-s390 to arch/s390/include/asm · c6557e7f
  由 Martin Schwidefsky 提交于 8月 01, 2008
```
Signed-off-by: NMartin Schwidefsky <schwidefsky@de.ibm.com>
```
  c6557e7f
27 4月, 2008 1 次提交

s390: KVM preparation: provide hook to enable pgstes in user pagetable · 402b0862

由 Carsten Otte 提交于 3月 25, 2008

The SIE instruction on s390 uses the 2nd half of the page table page to
virtualize the storage keys of a guest. This patch offers the s390_enable_sie
function, which reorganizes the page tables of a single-threaded process to
reserve space in the page table:
s390_enable_sie makes sure that the process is single threaded and then uses
dup_mm to create a new mm with reorganized page tables. The old mm is freed
and the process has now a page status extended field after every page table.

Code that wants to exploit pgstes should SELECT CONFIG_PGSTE.

This patch has a small common code hit, namely making dup_mm non-static.

Edit (Carsten): I've modified Martin's patch, following Jeremy Fitzhardinge's
review feedback. Now we do have the prototype for dup_mm in
include/linux/sched.h. Following Martin's suggestion, s390_enable_sie() does now
call task_lock() to prevent race against ptrace modification of mm_users.
Signed-off-by: NMartin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: NCarsten Otte <cotte@de.ibm.com>
Acked-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NAvi Kivity <avi@qumranet.com>

402b0862

10 2月, 2008 3 次提交

[S390] dynamic page tables. · 6252d702

由 Martin Schwidefsky 提交于 2月 09, 2008

Add support for different number of page table levels dependent
on the highest address used for a process. This will cause a 31 bit
process to use a two level page table instead of the four level page
table that is the default after the pud has been introduced. Likewise
a normal 64 bit process will use three levels instead of four. Only
if a process runs out of the 4 tera bytes which can be addressed with
a three level page table the fourth level is dynamically added. Then
the process can use up to 8 peta byte.
Signed-off-by: NMartin Schwidefsky <schwidefsky@de.ibm.com>

6252d702

M
[S390] Add four level page tables for CONFIG_64BIT=y. · 5a216a20
由 Martin Schwidefsky 提交于 2月 09, 2008
```
Signed-off-by: NMartin Schwidefsky <schwidefsky@de.ibm.com>
```
5a216a20

[S390] 1K/2K page table pages. · 146e4b3c

由 Martin Schwidefsky 提交于 2月 09, 2008

This patch implements 1K/2K page table pages for s390.
Signed-off-by: NMartin Schwidefsky <schwidefsky@de.ibm.com>

146e4b3c

26 1月, 2008 1 次提交

[S390] Fix tlb flushing with idte. · 6f457e1a

由 Martin Schwidefsky 提交于 1月 26, 2008

The clear-by-asce operation of the idte instruction gets an asce
(address-space-control-element) as argument to specify which TLBs
need to get flushed. The current code passes a plain pointer to
the start of the pgd without the additional bits which would make
the pointer an asce. The current machines don't mind the difference
but a future model might want to use the designation type control
bits in the asce as a filter for the TLBs to flush.
Signed-off-by: NMartin Schwidefsky <schwidefsky@de.ibm.com>

6f457e1a

22 10月, 2007 1 次提交

[S390] Cleanup page table definitions. · 3610cce8

由 Martin Schwidefsky 提交于 10月 22, 2007

- De-confuse the defines for the address-space-control-elements
  and the segment/region table entries.
- Create out of line functions for page table allocation / freeing.
- Simplify get_shadow_xxx functions.
Signed-off-by: NMartin Schwidefsky <schwidefsky@de.ibm.com>

3610cce8

03 5月, 2007 1 次提交

[PATCH] x86: PARAVIRT: add hooks to intercept mm creation and destruction · d6dd61c8

由 Jeremy Fitzhardinge 提交于 5月 02, 2007

Add hooks to allow a paravirt implementation to track the lifetime of
an mm.  Paravirtualization requires three hooks, but only two are
needed in common code.  They are:

arch_dup_mmap, which is called when a new mmap is created at fork

arch_exit_mmap, which is called when the last process reference to an
  mm is dropped, which typically happens on exit and exec.

The third hook is activate_mm, which is called from the arch-specific
activate_mm() macro/function, and so doesn't need stub versions for
other architectures.  It's called when an mm is first used.
Signed-off-by: NJeremy Fitzhardinge <jeremy@xensource.com>
Signed-off-by: NAndi Kleen <ak@suse.de>
Cc: linux-arch@vger.kernel.org
Cc: James Bottomley <James.Bottomley@SteelEye.com>
Acked-by: NIngo Molnar <mingo@elte.hu>

d6dd61c8

06 2月, 2007 1 次提交

[S390] noexec protection · c1821c2e

由 Gerald Schaefer 提交于 2月 05, 2007

This provides a noexec protection on s390 hardware. Our hardware does
not have any bits left in the pte for a hw noexec bit, so this is a
different approach using shadow page tables and a special addressing
mode that allows separate address spaces for code and data.

As a special feature of our "secondary-space" addressing mode, separate
page tables can be specified for the translation of data addresses
(storage operands) and instruction addresses. The shadow page table is
used for the instruction addresses and the standard page table for the
data addresses.
The shadow page table is linked to the standard page table by a pointer
in page->lru.next of the struct page corresponding to the page that
contains the standard page table (since page->private is not really
private with the pte_lock and the page table pages are not in the LRU
list).
Depending on the software bits of a pte, it is either inserted into
both page tables or just into the standard (data) page table. Pages of
a vma that does not have the VM_EXEC bit set get mapped only in the
data address space. Any try to execute code on such a page will cause a
page translation exception. The standard reaction to this is a SIGSEGV
with two exceptions: the two system call opcodes 0x0a77 (sys_sigreturn)
and 0x0aad (sys_rt_sigreturn) are allowed. They are stored by the
kernel to the signal stack frame. Unfortunately, the signal return
mechanism cannot be modified to use an SA_RESTORER because the
exception unwinding code depends on the system call opcode stored
behind the signal stack frame.

This feature requires that user space is executed in secondary-space
mode and the kernel in home-space mode, which means that the addressing
modes need to be switched and that the noexec protection only works
for user space.
After switching the addressing modes, we cannot use the mvcp/mvcs
instructions anymore to copy between kernel and user space. A new
mvcos instruction has been added to the z9 EC/BC hardware which allows
to copy between arbitrary address spaces, but on older hardware the
page tables need to be walked manually.
Signed-off-by: NGerald Schaefer <geraldsc@de.ibm.com>
Signed-off-by: NMartin Schwidefsky <schwidefsky@de.ibm.com>

c1821c2e

09 11月, 2005 1 次提交

[PATCH] s390: "extern inline" -> "static inline" · 4448aaf0

由 Adrian Bunk 提交于 11月 08, 2005

"extern inline" -> "static inline"
Signed-off-by: NAdrian Bunk <bunk@stusta.de>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: NAndrew Morton <akpm@osdl.org>
Signed-off-by: NLinus Torvalds <torvalds@osdl.org>

4448aaf0

17 4月, 2005 1 次提交

Linux-2.6.12-rc2 · 1da177e4

由 Linus Torvalds 提交于 4月 16, 2005

Initial git repository build. I'm not bothering with the full history,
even though we have it. We can create a separate "historical" git
archive of that later if we want to, and in the meantime it's about
3.2GB when imported into git - space that would just make the early
git days unnecessarily complicated, when we don't have a lot of good
infrastructure for it.

Let it rip!

1da177e4

openanolis / cloud-kernel 大约 1 年 前同步成功

openanolis / cloud-kernel
大约 1 年前同步成功