1. 29 May 2011 (1 commit)
    • [S390] mm: fix storage key handling · a43a9d93
      Authored by Heiko Carstens
      page_get_storage_key() and page_set_storage_key() expect a page address
      and not its page frame number. This became inconsistent with commit 2d42552d
      "[S390] merge page_test_dirty and page_clear_dirty".
      
      The result is that we read/write storage keys of random pages and do not
      have working dirty bit tracking at all.
      E.g. SetPageUptodate() doesn't clear the dirty bit of requested pages, which
      for example ext4 doesn't like very much and panics after a while.
      
      Unable to handle kernel paging request at virtual user address (null)
      Oops: 0004 [#1] PREEMPT SMP DEBUG_PAGEALLOC
      Modules linked in:
      CPU: 1 Not tainted 2.6.39-07551-g139f37f5-dirty #152
      Process flush-94:0 (pid: 1576, task: 000000003eb34538, ksp: 000000003c287b70)
      Krnl PSW : 0704c00180000000 0000000000316b12 (jbd2_journal_file_inode+0x10e/0x138)
                 R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:0 PM:0 EA:3
      Krnl GPRS: 0000000000000000 0000000000000000 0000000000000000 0700000000000000
                 0000000000316a62 000000003eb34cd0 0000000000000025 000000003c287b88
                 0000000000000001 000000003c287a70 000000003f1ec678 000000003f1ec000
                 0000000000000000 000000003e66ec00 0000000000316a62 000000003c287988
      Krnl Code: 0000000000316b04: f0a0000407f4       srp     4(11,%r0),2036,0
                 0000000000316b0a: b9020022           ltgr    %r2,%r2
                 0000000000316b0e: a7740015           brc     7,316b38
                >0000000000316b12: e3d0c0000024       stg     %r13,0(%r12)
                 0000000000316b18: 4120c010           la      %r2,16(%r12)
                 0000000000316b1c: 4130d060           la      %r3,96(%r13)
                 0000000000316b20: e340d0600004       lg      %r4,96(%r13)
                 0000000000316b26: c0e50002b567       brasl   %r14,36d5f4
      Call Trace:
      ([<0000000000316a62>] jbd2_journal_file_inode+0x5e/0x138)
       [<00000000002da13c>] mpage_da_map_and_submit+0x2e8/0x42c
       [<00000000002daac2>] ext4_da_writepages+0x2da/0x504
       [<00000000002597e8>] writeback_single_inode+0xf8/0x268
       [<0000000000259f06>] writeback_sb_inodes+0xd2/0x18c
       [<000000000025a700>] writeback_inodes_wb+0x80/0x168
       [<000000000025aa92>] wb_writeback+0x2aa/0x324
       [<000000000025abde>] wb_do_writeback+0xd2/0x274
       [<000000000025ae3a>] bdi_writeback_thread+0xba/0x1c4
       [<00000000001737be>] kthread+0xa6/0xb0
       [<000000000056c1da>] kernel_thread_starter+0x6/0xc
       [<000000000056c1d4>] kernel_thread_starter+0x0/0xc
      INFO: lockdep is turned off.
      Last Breaking-Event-Address:
       [<0000000000316a8a>] jbd2_journal_file_inode+0x86/0x138
      Reported-by: Sebastian Ott <sebott@linux.vnet.ibm.com>
      Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
      a43a9d93
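      For context, the distinction at the heart of this fix is page address
      versus page frame number. A minimal sketch of the broken and fixed call
      shapes (not the literal patch; helper signatures are simplified):

          unsigned char key;
          unsigned long pfn  = page_to_pfn(page);
          unsigned long addr = pfn << PAGE_SHIFT;   /* what the helpers expect */

          /* broken: a pfn where an address is expected, so the storage key of
           * some unrelated page is read/written                              */
          /* key = page_get_storage_key(pfn); */

          /* fixed: pass the page address */
          key = page_get_storage_key(addr);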
  2. 23 May 2011 (3 commits)
    • [S390] refactor page table functions for better pgste support · b2fa47e6
      Authored by Martin Schwidefsky
      Rework the architecture page table functions to access the bits in the
      page table extension array (pgste). There are a number of changes:
      1) Fix missing pgste update if the attach_count for the mm is <= 1.
      2) For every operation that affects the invalid bit in the pte or the
         rcp byte in the pgste, the pcl lock needs to be acquired (see the
         sketch after this commit entry). The function pgste_get_lock gets
         the pcl lock and returns the current pgste value for a pte pointer.
         The function pgste_set_unlock stores the pgste and releases the
         lock. Between these two calls the bits in the pgste can be shuffled.
      3) Define two software bits in the pte _PAGE_SWR and _PAGE_SWC to avoid
         calling SetPageDirty and SetPageReferenced from pgtable.h. If the
         host reference backup bit or the host change backup bit has been
         set, the dirty/referenced state is transferred to the pte. The common
         code will pick up the state from the pte.
      4) Add ptep_modify_prot_start and ptep_modify_prot_commit for mprotect.
      5) Remove pgd_populate_kernel, pud_populate_kernel, pmd_populate_kernel,
         pgd_clear_kernel, pud_clear_kernel, pmd_clear_kernel and ptep_invalidate.
      6) Rename kvm_s390_test_and_clear_page_dirty to
         ptep_test_and_clear_user_dirty and add ptep_test_and_clear_user_young.
      7) Define the mm_exclusive() and mm_has_pgste() helpers to improve readability.
      Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
      b2fa47e6
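      The locking convention from point 2 above, as a minimal sketch (assuming
      only the function names described in the commit message; the real
      implementation carries more state through the pgste value):

          pgste_t pgste;

          pgste = pgste_get_lock(ptep);    /* take the pcl lock, read the pgste   */
          /* ... shuffle bits: move referenced/changed state, update the pte ... */
          pgste_set_unlock(ptep, pgste);   /* write back the pgste, drop the lock */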
    • [S390] merge page_test_dirty and page_clear_dirty · 2d42552d
      Authored by Martin Schwidefsky
      The page_clear_dirty primitive always sets the default storage key
      which resets the access control bits and the fetch protection bit.
      That will surprise a KVM guest that sets non-zero access control
      bits or the fetch protection bit. Merge page_test_dirty and
      page_clear_dirty back to a single function and only clear the
      dirty bit from the storage key.
      
      In addition, move the functions page_test_and_clear_dirty and
      page_test_and_clear_young to page.h where they belong. This
      requires changing the parameter from a struct page * to a page
      frame number.
      Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
      2d42552d
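      A hedged sketch of the merged primitive described above (bit names and
      signatures are simplified; _PAGE_CHANGED stands for the change/dirty bit
      of the storage key):

          static inline int page_test_and_clear_dirty(unsigned long pfn)
          {
                  unsigned char skey;

                  skey = page_get_storage_key(pfn << PAGE_SHIFT);
                  if (!(skey & _PAGE_CHANGED))
                          return 0;
                  /* clear only the change bit; keep the access control bits and
                   * the fetch protection bit a KVM guest may have set          */
                  page_set_storage_key(pfn << PAGE_SHIFT, skey & ~_PAGE_CHANGED);
                  return 1;
          }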
    • [S390] Remove data execution protection · 043d0708
      Authored by Martin Schwidefsky
      The noexec support on s390 does not rely on a bit in the page table
      entry but utilizes the secondary space mode to distinguish between
      memory accesses for instructions vs. data. The noexec code relies
      on the assumption that the cpu will always use the secondary space
      page table for data accesses while it is running in the secondary
      space mode. Up to the z9-109 class machines this has been the case.
      Unfortunately this is not true anymore with z10 and later machines.
      The load-relative-long instructions lrl, lgrl and lgfrl access the
      memory operand using the same addressing-space mode that has been
      used to fetch the instruction.
      This breaks the noexec mode for all user space binaries compiled
      with -march=z10 or later. The only option is to remove the current
      noexec support.
      Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
      043d0708
  3. 27 October 2010 (1 commit)
  4. 25 October 2010 (5 commits)
  5. 24 August 2010 (1 commit)
    • [S390] fix tlb flushing vs. concurrent /proc accesses · 050eef36
      Authored by Martin Schwidefsky
      The tlb flushing code uses the mm_users field of the mm_struct to
      decide if each page table entry needs to be flushed individually with
      IPTE or if a global flush for the mm_struct is sufficient after all page
      table updates have been done. The comment for mm_users says "How many
      users with user space?", but the /proc code increases mm_users after it
      has found the process structure by pid, without creating a new user
      process. This makes mm_users useless for the decision between the two
      tlb flushing methods. The current code can be tricked into not flushing
      tlb entries by a concurrent access to /proc files, e.g. if a fork is in
      progress. The solution for this problem is to make the tlb flushing
      logic independent of the mm_users field.
      Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
      050eef36
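      The resulting decision can be sketched roughly as follows (hedged: the
      attach_count bookkeeping is the mm_users replacement described above,
      the helper name is invented for illustration, and the real check
      involves additional state):

          /* flush each pte individually with IPTE only if another cpu might
           * still have the mm attached; otherwise a single full flush of the
           * mm after the batch of updates is enough                          */
          static inline int flush_ptes_individually(struct mm_struct *mm)
          {
                  return atomic_read(&mm->context.attach_count) > 1;
          }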
  6. 09 April 2010 (1 commit)
  7. 21 February 2010 (1 commit)
    • MM: Pass a PTE pointer to update_mmu_cache() rather than the PTE itself · 4b3073e1
      Authored by Russell King
      On VIVT ARM, when we have multiple shared mappings of the same file
      in the same MM, we need to ensure that we have coherency across all
      copies.  We do this via make_coherent() by making the pages
      uncacheable.
      
      This used to work fine, until we allowed highmem with highpte - we
      now have a page table which is mapped as required, and is not available
      for modification via update_mmu_cache().
      
      Ralf Baechle suggested getting rid of the PTE value passed to
      update_mmu_cache():
      
        On MIPS update_mmu_cache() calls __update_tlb() which walks pagetables
        to construct a pointer to the pte again.  Passing a pte_t * is much
        more elegant.  Maybe we might even replace the pte argument with the
        pte_t?
      
      Ben Herrenschmidt would also like the pte pointer for PowerPC:
      
        Passing the ptep in there is exactly what I want.  I want that
        -instead- of the PTE value, because I have issue on some ppc cases,
        for I$/D$ coherency, where set_pte_at() may decide to mask out the
        _PAGE_EXEC.
      
      So, pass in the mapped page table pointer into update_mmu_cache(), and
      remove the PTE value, updating all implementations and call sites to
      suit.
      
      Includes a fix from Stephen Rothwell:
      
        sparc: fix fallout from update_mmu_cache API change
      Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
      Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
      4b3073e1
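      Concretely, the interface changes from passing a pte value to passing the
      pte pointer the caller already holds (sketch of the prototypes only):

          /* before */
          void update_mmu_cache(struct vm_area_struct *vma,
                                unsigned long address, pte_t pte);

          /* after: the architecture reads *ptep itself, so MIPS does not have
           * to re-walk the page tables and PowerPC sees the pte as written by
           * set_pte_at() rather than a pre-masked value                       */
          void update_mmu_cache(struct vm_area_struct *vma,
                                unsigned long address, pte_t *ptep);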
  8. 07 December 2009 (1 commit)
  9. 12 June 2009 (1 commit)
  10. 27 November 2008 (1 commit)
    • [S390] pgtable.h: Fix oops in unmap_vmas for KVM processes · 2944a5c9
      Authored by Christian Borntraeger
      When running several kvm processes with lots of memory overcommitment,
      we have seen an oops during process shutdown:
      ------------[ cut here ]------------
      Kernel BUG at 0000000000193434 [verbose debug info unavailable]
      addressing exception: 0005 [#1] PREEMPT SMP
      Modules linked in: kvm sunrpc qeth_l2 dm_mod qeth ccwgroup
      CPU: 10 Not tainted 2.6.28-rc4-kvm-bigiron-00521-g0ccca08-dirty #8
      Process kuli (pid: 14460, task: 0000000149822338, ksp: 0000000024f57650)
      Krnl PSW : 0704e00180000000 0000000000193434 (unmap_vmas+0x884/0xf10)
      R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:2 PM:0 EA:3
      Krnl GPRS: 0000000000000002 0000000000000000 000000051008d000 000003e05e6034e0
                 00000000001933f6 00000000000001e9 0000000407259e0a 00000002be88c400
                 00000200001c1000 0000000407259608 0000000407259e08 0000000024f577f0
                 0000000407259e09 0000000000445fa8 00000000001933f6 0000000024f577f0
      Krnl Code: 0000000000193426: eb22000c000d sllg %r2,%r2,12
                 000000000019342c: a7180000 lhi %r1,0
                 0000000000193430: b2290012 iske %r1,%r2
                >0000000000193434: a7110002 tmll %r1,2
                 0000000000193438: a7840006 brc 8,193444
                 000000000019343c: 9602c000 oi 0(%r12),2
                 0000000000193440: 96806000 oi 0(%r6),128
                 0000000000193444: a7110004 tmll %r1,4
      Call Trace:
      ([<00000000001933f6>] unmap_vmas+0x846/0xf10)
      [<0000000000199680>] exit_mmap+0x210/0x458
      [<000000000012a8f8>] mmput+0x54/0xfc
      [<000000000012f714>] exit_mm+0x134/0x144
      [<000000000013120c>] do_exit+0x240/0x878
      [<00000000001318dc>] do_group_exit+0x98/0xc8
      [<000000000013e6b0>] get_signal_to_deliver+0x30c/0x358
      [<000000000010bee0>] do_signal+0xec/0x860
      [<0000000000112e30>] sysc_sigpending+0xe/0x22
      [<000002000013198a>] 0x2000013198a
      INFO: lockdep is turned off.
      Last Breaking-Event-Address:
      [<00000000001a68d0>] free_swap_and_cache+0x1a0/0x1a4
      <4>---[ end trace bc19f1d51ac9db7c ]---
      
      The faulting instruction is the storage key operation (iske) in
      ptep_rcp_copy (called by pte_clear, called by unmap_vmas). iske
      reads dirty and reference bit information for a physical page and
      requires a valid physical address. Since we are in pte_clear, we
      cannot rely on the pte containing a valid address. Fortunately we
      don't need this information in pte_clear - after all, there is no
      mapping. The best fix is to remove the needless call to ptep_rcp_copy
      that contains the iske.
      Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
      Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
      2944a5c9
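      The shape of the fix, as a hedged sketch (macro and function names only
      approximate the s390 code of that time):

          static inline void pte_clear(struct mm_struct *mm,
                                       unsigned long addr, pte_t *ptep)
          {
                  /* no ptep_rcp_copy() here anymore: it executed iske on the
                   * physical address taken from the pte, which need not be
                   * valid at this point, and the referenced/changed state is
                   * not needed for an entry that is being cleared anyway     */
                  pte_val(*ptep) = _PAGE_TYPE_EMPTY;
          }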
  11. 28 October 2008 (1 commit)
    • [S390] pgtables: Fix race in enable_sie vs. page table ops · 250cf776
      Authored by Christian Borntraeger
      The current enable_sie code sets the mm->context.pgstes bit to tell
      dup_mm that the new mm should have extended page tables. This bit is also
      used by the s390 specific page table primitives to decide about the page
      table layout - which means context.pgstes has two meanings. This can cause
      all kinds of bugs. For example, shrink_zone can call
      ptep_clear_flush_young while enable_sie is running. ptep_clear_flush_young
      will test for context.pgstes. Since enable_sie changed that value of the old
      struct mm without changing the page table layout, ptep_clear_flush_young will
      do the wrong thing.
      The solution is to split pgstes into two bits
      - one for the allocation
      - one for the current state
      Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
      Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
      250cf776
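      A hedged sketch of the two bits described above (the field names are
      assumptions for illustration; the real mm_context_t carries more fields):

          typedef struct {
                  unsigned int alloc_pgste:1;  /* new page tables for this mm and
                                                  for children created by dup_mm
                                                  get the extended pgste format  */
                  unsigned int has_pgste:1;    /* the page tables attached right
                                                  now actually have that format,
                                                  which is what the pte
                                                  primitives must test           */
                  /* ... remaining s390 mm context fields ... */
          } mm_context_t;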
  12. 11 October 2008 (1 commit)
  13. 02 August 2008 (1 commit)
  14. 14 July 2008 (1 commit)
  15. 08 July 2008 (1 commit)
  16. 30 April 2008 (2 commits)
  17. 28 April 2008 (2 commits)
    • s390: implement pte special bit · a08cb629
      Authored by Nick Piggin
      Convert XIP to support non-struct page backed memory, using VM_MIXEDMAP for
      the user mappings.
      
      This requires the get_xip_page API to be changed to an address based one.
      Improve the API layering a little bit too, while we're here.
      
      This is required in order to support XIP filesystems on memory that isn't
      backed with struct page (but memory with struct page is still supported too).
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Acked-by: Carsten Otte <cotte@de.ibm.com>
      Cc: Jared Hulbert <jaredeh@gmail.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a08cb629
    • mm: introduce pte_special pte bit · 7e675137
      Authored by Nick Piggin
      s390 for one, cannot implement VM_MIXEDMAP with pfn_valid, due to their memory
      model (which is more dynamic than most).  Instead, they had proposed to
      implement it with an additional path through vm_normal_page(), using a bit in
      the pte to determine whether or not the page should be refcounted:
      
      vm_normal_page()
      {
      	...
              if (unlikely(vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP))) {
                      if (vma->vm_flags & VM_MIXEDMAP) {
      #ifdef s390
      			if (!mixedmap_refcount_pte(pte))
      				return NULL;
      #else
                              if (!pfn_valid(pfn))
                                      return NULL;
      #endif
                              goto out;
                      }
      	...
      }
      
      This is fine; however, if we are allowed to use a bit in the pte to determine
      refcountedness, we can use that to _completely_ replace all the vma based
      schemes.  So instead of adding more cases to the already complex vma-based
      scheme, we can have a clearly separate and simple pte-based scheme (and get
      slightly better code generation in the process):
      
      vm_normal_page()
      {
      #ifdef s390
      	if (!mixedmap_refcount_pte(pte))
      		return NULL;
      	return pte_page(pte);
      #else
      	...
      #endif
      }
      
      And finally, we may rather make this concept usable by any architecture rather
      than making it s390 only, so implement a new type of pte state for this.
      Unfortunately the old vma based code must stay, because some architectures may
      not be able to spare pte bits.  This makes vm_normal_page a little bit more
      ugly than we would like, but the two cases are clearly separate.
      
      So introduce a pte_special pte state, and use it in mm/memory.c.  It is
      currently a noop for all architectures, so this doesn't actually result in any
      compiled code changes to mm/memory.o.
      
      BTW:
      I haven't put vm_normal_page() into arch code as-per an earlier suggestion.
      The reason is that, regardless of where vm_normal_page is actually
      implemented, the *abstraction* is still exactly the same. Also, while it
      depends on whether the architecture has pte_special or not, those are the
      only two possible cases, and it really isn't an arch specific function --
      the role of the arch code should be to provide primitive functions and
      accessors with which to build the core code; pte_special does that. We do
      not want architectures to know or care about vm_normal_page itself, and
      we definitely don't want them being able to invent something new there
      out of sight of mm/ code. If we made vm_normal_page an arch function, then
      we have to make vm_insert_mixed (next patch) an arch function too. So I
      don't think moving it to arch code fundamentally improves any abstractions,
      while it does practically make the code more difficult to follow, for both
      mm and arch developers, and easier to misuse.
      
      [akpm@linux-foundation.org: build fix]
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Acked-by: Carsten Otte <cotte@de.ibm.com>
      Cc: Jared Hulbert <jaredeh@gmail.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7e675137
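      For architectures that do provide the bit, the resulting core-code shape
      is roughly (hedged and heavily simplified from mm/memory.c):

          struct page *vm_normal_page(struct vm_area_struct *vma,
                                      unsigned long addr, pte_t pte)
          {
                  if (HAVE_PTE_SPECIAL) {
                          if (likely(!pte_special(pte)))
                                  return pte_page(pte);  /* normal, refcounted page */
                          return NULL;                   /* special: no struct page */
                  }
                  /* otherwise fall back to the VM_PFNMAP/VM_MIXEDMAP vma checks */
                  ...
          }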
  18. 27 April 2008 (3 commits)
  19. 10 February 2008 (4 commits)
  20. 05 February 2008 (1 commit)
    • [S390] Remove BUILD_BUG_ON() in vmem code. · 0189103c
      Authored by Heiko Carstens
      Remove BUILD_BUG_ON() in vmem code since it causes build failures if
      the size of struct page increases. Instead calculate at compile time
      the address of the highest physical address that can be added to the
      1:1 mapping.
      This is supposed to fix a build failure with the page owner tracking leak
      detector patches as reported by akpm.
      
      page-owner-tracking-leak-detector-broken-on-s390.patch can be removed
      from -mm again when this is merged.
      
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
      Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
      0189103c
  21. 26 January 2008 (1 commit)
    • [S390] Change vmalloc definitions · 5fd9c6e2
      Authored by Christian Borntraeger
      Currently the vmalloc area starts at a dynamic address depending on
      the memory size. There was also an 8MB guard hole after the
      physical memory to catch out-of-bounds accesses.
      We can simplify the code by putting the vmalloc area explicitly at
      the top of the kernel mapping and setting the vmalloc size to a fixed
      value of 128MB/128GB for 31bit/64bit systems. Part of the vmalloc
      area will be used for the vmem_map. This leaves an area of 96MB/1GB
      for normal vmalloc allocations.
      Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
      Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
      5fd9c6e2
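      A hedged sketch of the fixed layout for the 31 bit case (the constants
      are illustrative values derived from the description above; the actual
      macro values live in pgtable.h and may differ):

          #define VMALLOC_SIZE   (128UL << 20)                 /* fixed 128 MB        */
          #define VMALLOC_END    (0x80000000UL)                /* top of the 31 bit
                                                                  kernel mapping      */
          #define VMALLOC_START  (VMALLOC_END - VMALLOC_SIZE)
          /* part of this area holds the vmem_map, leaving roughly 96 MB for
           * ordinary vmalloc allocations                                      */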
  22. 17 December 2007 (1 commit)
    • [S390] pud_present/pmd_present bug. · 0d017923
      Authored by Martin Schwidefsky
      Git commit 3610cce8 (yeah my own :-/)
      introduced a bug in regard to pud/pmd table entries.
      If the address of the page table referred to by a pud/pmd value happens
      to have zeroes in the lower 32 bits, pud_present and pmd_present return
      false. The obvious effect is that this triggers the BUG_ON in exit_mmap
      because some ptes will not get released on process end.  Worse is that
      the next fault for memory covered by that pud/pmd will allocate another
      pmd/pte table and populate the pud/pmd entry. The old page table
      entries hanging below this entry are lost!
      
      The fix is simple: properly check against 0. The check is added for
      pud_none/pmd_none as well, even though these two functions work because
      the invalid bit is in the lower 32 bits.
      Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
      0d017923
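      The fixed check, as a hedged sketch (the pmd variant is analogous, using
      the segment table origin mask):

          static inline int pud_present(pud_t pud)
          {
                  /* compare the full 64 bit value against 0; returning the
                   * masked origin directly gets truncated to the int return
                   * value, so a table address with zeroes in the low 32 bits
                   * would wrongly read as "not present"                      */
                  return (pud_val(pud) & _REGION_ENTRY_ORIGIN) != 0UL;
          }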
  23. 22 October 2007 (3 commits)
    • [S390] 4level-fixup cleanup · 190a1d72
      Authored by Martin Schwidefsky
      Get independent from asm-generic/4level-fixup.h
      Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
      190a1d72
    • [S390] Cleanup page table definitions. · 3610cce8
      Authored by Martin Schwidefsky
      - De-confuse the defines for the address-space-control-elements
        and the segment/region table entries.
      - Create out of line functions for page table allocation / freeing.
      - Simplify get_shadow_xxx functions.
      Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
      3610cce8
    • [S390] tlb flush fix. · ba8a9229
      Authored by Martin Schwidefsky
      The current tlb flushing code for page table entries violates the
      s390 architecture in a small detail. The relevant section from the
      principles of operation (SA22-7832-02 page 3-47):
      
         "A valid table entry must not be changed while it is attached
         to any CPU and may be used for translation by that CPU except to
         (1) invalidate the entry by using INVALIDATE PAGE TABLE ENTRY or
         INVALIDATE DAT TABLE ENTRY, (2) alter bits 56-63 of a page-table
         entry, or (3) make a change by means of a COMPARE AND SWAP AND
         PURGE instruction that purges the TLB."
      
      That means if one thread of a multithreaded application uses a vma
      while another thread does an unmap on it, the page table entries of
      that vma need to be removed with IPTE, IDTE or CSP. In some strange
      and rare situations a cpu could check-stop (die) because an entry has
      been pushed out of the TLB that is still needed to complete a
      (milli-coded) instruction. I've never seen it happen with the current
      code on any of the supported machines, so right now this is a
      theoretical problem. But I want to fix it nevertheless, to avoid
      headaches in the future.
      
      To get this implemented correctly without changing common code the
      primitives ptep_get_and_clear, ptep_get_and_clear_full and
      ptep_set_wrprotect need to use the IPTE instruction to invalidate the
      pte before the new pte value gets stored. If IPTE is always used for
      the three primitives, three important operations will take a performance
      hit: fork, mprotect and exit_mmap. Time for some workarounds:
      
      * 1: ptep_get_and_clear_full is used in unmap_vmas to remove page
      table entries in a batched tlb gather operation. If the mmu_gather
      context passed to unmap_vmas has been started with full_mm_flush==1
      or if only one cpu is online or if the only user of a mm_struct is the
      current process then the fullmm indication in the mmu_gather context is
      set to one. All TLBs for mm_struct are flushed by the tlb_gather_mmu
      call. No new TLBs can be created while the unmap is in progress. In
      this case ptep_get_and_clear_full clears the ptes with a simple store.
      
      * 2: ptep_get_and_clear is used in change_protection to clear the
      ptes from the page tables before they are reentered with the new
      access flags. At the end of the update flush_tlb_range clears the
      remaining TLBs. In general the ptep_get_and_clear has to issue IPTE
      for each pte and flush_tlb_range is a nop. But if there is only one
      user of the mm_struct then ptep_get_and_clear uses simple stores
      to do the update and flush_tlb_range will flush the TLBs.
      
      * 3: Similar to 2, ptep_set_wrprotect is used in copy_page_range
      for a fork to make all ptes of a cow mapping read-only. At the end of
      of copy_page_range dup_mmap will flush the TLBs with a call to
      flush_tlb_mm.  Check for mm->mm_users and if there is only one user
      avoid using IPTE in ptep_set_wrprotect and let flush_tlb_mm clear the
      TLBs.
      
      Overall, for single threaded programs the tlb flush code now performs
      better; for multithreaded programs it is slightly worse. In particular
      exit_mmap() now does a single IDTE for the mm and then just frees every
      page cache reference and every page table page directly without a delay
      over the mmu_gather structure.
      Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
      ba8a9229
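      The common shape of workarounds 2 and 3 above, as a hedged sketch (the
      code of that time was macro-based and the exact condition differs
      slightly):

          static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
                                                 unsigned long addr, pte_t *ptep)
          {
                  pte_t pte = *ptep;

                  if (atomic_read(&mm->mm_users) > 1 || mm != current->active_mm)
                          ptep_invalidate(addr, ptep);  /* IPTE: flush this entry */
                  else
                          pte_clear(mm, addr, ptep);    /* single user: plain store;
                                                           flush_tlb_range/_mm does
                                                           the flushing later      */
                  return pte;
          }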
  24. 12 October 2007 (1 commit)
  25. 18 July 2007 (1 commit)