1. 10 August 2016, 1 commit
    • s390/pageattr: handle numpages parameter correctly · 4d81aaa5
      Committed by Heiko Carstens
      Both set_memory_ro() and set_memory_rw() will modify the page
      attributes of at least one page, even if the numpages parameter is
      zero.
      
      The author expected that calling these functions with numpages == 0
      would never happen. However, with the new 444d13ff ("modules: add
      ro_after_init support") feature, this happens frequently.
      
      Therefore do the right thing and make these two functions return
      gracefully if there is nothing to do.
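
      A minimal sketch of that guard, assuming the early return sits in
      set_memory_ro()/set_memory_rw() themselves (illustrative only, not the
      exact upstream hunk):

      /* Sketch: bail out early when numpages is zero, instead of always
       * touching at least one page of the kernel mapping.
       */
      int set_memory_ro(unsigned long addr, int numpages)
      {
              if (!numpages)
                      return 0;
              /* ... mark numpages pages of the kernel mapping read-only ... */
              return 0;
      }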
      
      Fixes crashes on module load like this one:
      
      Unable to handle kernel pointer dereference in virtual kernel address space
      Failing address: 000003ff80008000 TEID: 000003ff80008407
      Fault in home space mode while using kernel ASCE.
      AS:0000000000d18007 R3:00000001e6aa4007 S:00000001e6a10800 P:00000001e34ee21d
      Oops: 0004 ilc:3 [#1] SMP
      Modules linked in: x_tables
      CPU: 10 PID: 1 Comm: systemd Not tainted 4.7.0-11895-g3fa9045 #4
      Hardware name: IBM              2964 N96              703              (LPAR)
      task: 00000001e9118000 task.stack: 00000001e9120000
      Krnl PSW : 0704e00180000000 00000000005677f8 (rb_erase+0xf0/0x4d0)
                 R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:2 PM:0 RI:0 EA:3
      Krnl GPRS: 000003ff80008b20 000003ff80008b20 000003ff80008b70 0000000000b9d608
                 000003ff80008b20 0000000000000000 00000001e9123e88 000003ff80008950
                 00000001e485ab40 000003ff00000000 000003ff80008b00 00000001e4858480
                 0000000100000000 000003ff80008b68 00000000001d5998 00000001e9123c28
      Krnl Code: 00000000005677e8: ec1801c3007c        cgij    %r1,0,8,567b6e
                 00000000005677ee: e32010100020        cg      %r2,16(%r1)
                #00000000005677f4: a78401c2            brc     8,567b78
                >00000000005677f8: e35010080024        stg     %r5,8(%r1)
                 00000000005677fe: ec5801af007c        cgij    %r5,0,8,567b5c
                 0000000000567804: e30050000024        stg     %r0,0(%r5)
                 000000000056780a: ebacf0680004        lmg     %r10,%r12,104(%r15)
                 0000000000567810: 07fe                bcr     15,%r14
      Call Trace:
      ([<000003ff80008900>] __this_module+0x0/0xffffffffffffd700 [x_tables])
      ([<0000000000264fd4>] do_init_module+0x12c/0x220)
      ([<00000000001da14a>] load_module+0x24e2/0x2b10)
      ([<00000000001da976>] SyS_finit_module+0xbe/0xd8)
      ([<0000000000803b26>] system_call+0xd6/0x264)
      Last Breaking-Event-Address:
       [<000000000056771a>] rb_erase+0x12/0x4d0
       Kernel panic - not syncing: Fatal exception: panic_on_oops
      Reported-by: Christian Borntraeger <borntraeger@de.ibm.com>
      Reported-and-tested-by: Sebastian Ott <sebott@linux.vnet.ibm.com>
      Fixes: e8a97e42 ("s390/pageattr: allow kernel page table splitting")
      Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
      Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
  2. 31 July 2016, 1 commit
    • s390/mm: clean up pte/pmd encoding · bc29b7ac
      Committed by Gerald Schaefer
      The hugetlbfs pte<->pmd conversion functions currently assume that the pmd
      bit layout is consistent with the pte layout, which is not really true.
      
      The SW read and write bits are encoded as the sequence "wr" in a pte, but
      in a pmd it is "rw". The hugetlbfs conversion assumes that the sequence
      is identical in both cases, which results in swapped read and write bits
      in the pmd. In practice this is not a problem, because those pmd bits are
      only relevant for THP pmds and not for hugetlbfs pmds. The hugetlbfs code
      works on (fake) ptes, and the converted pte bits are correct.
      
      There is another variation in pte/pmd encoding which affects dirty
      prot-none ptes/pmds. In this case, a pmd has both its HW read-only and
      invalid bit set, while it is only the invalid bit for a pte. This also has
      no effect in practice, but it is better to be consistent.
      
      This patch fixes both inconsistencies by changing the SW read/write bit
      layout for pmds as well as the PAGE_NONE encoding for ptes. It also makes
      the hugetlbfs conversion functions more robust by introducing a
      move_set_bit() macro that uses the pte/pmd bit #defines instead of
      constant shifts.
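
      A plausible shape for the move_set_bit() helper mentioned above
      (illustrative; the upstream definition may differ; flag names as in
      arch/s390/include/asm/pgtable.h):

      /*
       * Sketch: move a single software bit from its position in one entry
       * format to its position in another, using the bit #defines instead
       * of hard-coded shift counts.  Only valid for single-bit masks.
       */
      #define move_set_bit(val, bit, new_bit) \
              ((((val) & (bit)) / (bit)) * (new_bit))

      /* e.g. carry the pte software write bit over into the pmd entry */
      pmd_val(pmd) |= move_set_bit(pte_val(pte), _PAGE_WRITE,
                                   _SEGMENT_ENTRY_WRITE);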
      Signed-off-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
      Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
  3. 27 July 2016, 1 commit
  4. 13 July 2016, 1 commit
  5. 06 July 2016, 1 commit
  6. 28 June 2016, 1 commit
  7. 25 June 2016, 1 commit
  8. 20 June 2016, 18 commits
  9. 14 June 2016, 1 commit
    • s390/mm: fix compile for PAGE_DEFAULT_KEY != 0 · de3fa841
      Committed by Heiko Carstens
      The usual problem for code that is ifdef'ed out is that it no longer
      compiles after a while. That's also the case for the storage key
      initialisation code, if it were used (i.e. PAGE_DEFAULT_KEY set to a
      non-zero value):
      
      ./arch/s390/include/asm/page.h: In function 'storage_key_init_range':
      ./arch/s390/include/asm/page.h:36:2: error: implicit declaration of function '__storage_key_init_range'
      
      Since the code itself has been useful for debugging purposes several
      times, remove the ifdefs and make sure the code gets compiler
      coverage. The cost for this is eight bytes.
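
      A sketch of what removing the ifdefs can look like, so the compiler
      always sees the call and simply optimizes it away while
      PAGE_DEFAULT_KEY is zero (illustrative, not the exact upstream hunk):

      /* arch/s390/include/asm/page.h (sketch) */
      static inline void storage_key_init_range(unsigned long start,
                                                unsigned long end)
      {
              /* always compiled; a constant-zero PAGE_DEFAULT_KEY lets the
               * compiler drop the call, but the code keeps getting compile
               * coverage
               */
              if (PAGE_DEFAULT_KEY)
                      __storage_key_init_range(start, end);
      }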
      Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
      Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
  10. 13 June 2016, 14 commits
    • s390: avoid extable collisions · 6c22c986
      Committed by Heiko Carstens
      We have some inline assemblies where the extable entry points to a
      label at the end of an inline assembly which is not followed by an
      instruction.
      
      On the other hand we have also inline assemblies where the extable
      entry points to the first instruction of an inline assembly.
      
      If a first-type inline asm (extable points to an empty label at the end)
      is directly followed by a second-type inline asm (extable points to its
      first instruction), then we would have two different extable entries
      that point to the same instruction but have different target
      addresses.
      
      This can lead to quite random behaviour, depending on sorting order.
      
      I verified that we currently do not have such collisions within the
      kernel. However to avoid such subtle bugs add a couple of nop
      instructions to those inline assemblies which contain an extable that
      points to an empty label.
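
      A schematic example of such a pattern and the added nop (the function
      name and variables are illustrative; EX_TABLE and nopr are existing
      s390 primitives):

      static inline int diag308_sketch(unsigned long subcode, void *addr)
      {
              register unsigned long _addr asm("0") = (unsigned long) addr;
              register unsigned long _rc asm("1") = 0;

              asm volatile(
                      "       diag    %0,%2,0x308\n"
                      "0:     nopr    %%r7\n" /* added: label 0 gets its own insn */
                      EX_TABLE(0b, 0b)
                      : "+d" (_addr), "+d" (_rc)
                      : "d" (subcode)
                      : "cc", "memory");
              return _rc;
      }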
      Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
      Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
    • s390: add proper __ro_after_init support · d07a980c
      Committed by Heiko Carstens
      On s390 __ro_after_init is currently mapped to __read_mostly which
      means that data marked as __ro_after_init will not be protected.
      
      The reason for this is that the common code __ro_after_init implementation
      is x86 centric: the ro_after_init data section was added to rodata,
      since x86 enables write protection to kernel text and rodata very
      late. On s390 we have write protection for these sections enabled with
      the initial page tables. So adding the ro_after_init data section to
      rodata does not work on s390.
      
      In order to make __ro_after_init work properly on s390, move the
      ro_after_init data right behind rodata. Unlike the rodata section, it
      will be marked read-only later after all init calls happened.
      
      This s390 specific implementation adds new __start_ro_after_init and
      __end_ro_after_init labels. Everything in between will be marked
      read-only after the init calls have happened. In addition to the
      __ro_after_init data, also move the exception table there, since from a
      practical point of view it fits the __ro_after_init requirements.
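
      A sketch of the late protection step described above (the helper name
      is hypothetical; set_memory_ro() and the section labels are the ones
      the message introduces):

      extern char __start_ro_after_init[], __end_ro_after_init[];

      /* hypothetical helper, run once all init calls have finished */
      static void mark_ro_after_init_ro(void)
      {
              unsigned long start = (unsigned long) __start_ro_after_init;
              unsigned long end = (unsigned long) __end_ro_after_init;

              set_memory_ro(start, (end - start) >> PAGE_SHIFT);
      }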
      Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
      Reviewed-by: Kees Cook <keescook@chromium.org>
      Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
    • s390/mm: simplify the TLB flushing code · 64f31d58
      Committed by Martin Schwidefsky
      ptep_flush_lazy and pmdp_flush_lazy use mm->context.attach_count to
      decide between a lazy TLB flush vs an immediate TLB flush. The field
      contains two 16-bit counters, the number of CPUs that have the mm
      attached and can create TLB entries for it and the number of CPUs in
      the middle of a page table update.
      
      The __tlb_flush_asce, ptep_flush_direct and pmdp_flush_direct functions
      use the attach counter and a mask check with mm_cpumask(mm) to decide
      between a local flush of the current CPU and a global flush.
      
      For all these functions the decision between lazy vs immediate and
      local vs global TLB flush can be based on CPU masks. There are two
      masks:  the mm->context.cpu_attach_mask with the CPUs that are actively
      using the mm, and the mm_cpumask(mm) with the CPUs that have used the
      mm since the last full flush. The decision between lazy vs immediate
      flush is based on the mm->context.cpu_attach_mask, to decide between
      local vs global flush the mm_cpumask(mm) is used.
      
      With this patch all checks will use the CPU masks, the old counter
      mm->context.attach_count with its two 16-bit values is turned into a
      single counter mm->context.flush_count that keeps track of the number
      of CPUs with incomplete page table updates. The sole user of this
      counter is finish_arch_post_lock_switch() which waits for the end of
      all page table updates.
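
      A sketch of the resulting decisions (field and helper names follow the
      commit message; the function bodies are schematic, not the upstream
      code):

      /* lazy vs. immediate: defer only if no other CPU has the mm attached */
      static inline void ptep_flush_lazy_sketch(struct mm_struct *mm,
                                                unsigned long addr, pte_t *ptep)
      {
              if (cpumask_equal(&mm->context.cpu_attach_mask,
                                cpumask_of(smp_processor_id())))
                      pte_val(*ptep) |= _PAGE_INVALID;        /* flush later */
              else
                      ptep_flush_direct(mm, addr, ptep);      /* flush now   */
      }

      /* local vs. global: flush only this CPU if no other CPU used the mm */
      static inline void tlb_flush_mm_sketch(struct mm_struct *mm)
      {
              if (cpumask_equal(mm_cpumask(mm), cpumask_of(smp_processor_id())))
                      __tlb_flush_local();
              else
                      __tlb_flush_global();
      }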
      Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
    • s390/mm: fix vunmap vs finish_arch_post_lock_switch · a9809407
      Committed by Martin Schwidefsky
      The vunmap_pte_range() function calls ptep_get_and_clear() without any
      locking. ptep_get_and_clear() uses ptep_xchg_lazy()/ptep_flush_direct()
      for the page table update. ptep_flush_direct requires that preemption
      is disabled, but without any locking this is not the case. If the kernel
      preempts the task while the attach_counter is increased, an endless loop
      in finish_arch_post_lock_switch() will occur the next time the task is
      scheduled.
      
      Add explicit preempt_disable()/preempt_enable() calls to the relevant
      functions in arch/s390/mm/pgtable.c.
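
      A sketch of the shape of the fix in one of the affected functions (the
      body is schematic; the point is the preempt_disable()/preempt_enable()
      bracket):

      pte_t ptep_xchg_lazy(struct mm_struct *mm, unsigned long addr,
                           pte_t *ptep, pte_t new)
      {
              pte_t old;

              preempt_disable();      /* ptep_flush_direct() must not race
                                       * with finish_arch_post_lock_switch() */
              old = ptep_flush_lazy(mm, addr, ptep);
              *ptep = new;
              preempt_enable();
              return old;
      }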
      Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
    • s390/mm: align swapper_pg_dir to 16k · 0ccb32c9
      Committed by Heiko Carstens
      The segment/region table that is part of the kernel image must be
      properly aligned to 16k in order to make the crdte inline assembly
      work.
      Otherwise the crdte instruction calculates a wrong segment/region table
      start address and accesses incorrect memory locations if swapper_pg_dir
      is not aligned to 16k.
      
      Therefore define BSS_FIRST_SECTIONS in order to put the swapper_pg_dir
      at the beginning of the bss section and also align the bss section to
      16k just like other architectures did.
      Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
      Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
    • s390/pgtable: add mapping statistics · 37cd944c
      Committed by Heiko Carstens
      Add statistics that show how memory is mapped within the kernel
      identity mapping. This is more or less the same as git
      commit ce0c0e50 ("x86, generic: CPA add statistics about state
      of direct mapping v4") for x86.
      
      I also intentionally copied the lower case "k" within DirectMap4k vs
      the upper case "M" and "G" within the two other lines. Let's have
      consistent inconsistencies across architectures.
      
      The output of /proc/meminfo now contains these additional lines:
      
      DirectMap4k:        2048 kB
      DirectMap1M:     3991552 kB
      DirectMap2G:     4194304 kB
      
      The implementation on s390 is lockless, unlike the x86 version, since I
      assume changes to the kernel mapping are a very rare event. Therefore
      it really doesn't matter if these statistics are potentially
      inconsistent when read while kernel page tables are being changed.
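
      A sketch of how such counters can be kept and reported via the generic
      arch_report_meminfo() hook (the counter names and granularity enum are
      illustrative):

      enum {
              PG_DIRECT_MAP_4K = 0,
              PG_DIRECT_MAP_1M,
              PG_DIRECT_MAP_2G,
              PG_DIRECT_MAP_MAX
      };

      static atomic_long_t direct_pages_count[PG_DIRECT_MAP_MAX];

      /* lockless by design: mapping changes are rare, slight skew is fine */
      void arch_report_meminfo(struct seq_file *m)
      {
              seq_printf(m, "DirectMap4k:    %8lu kB\n",
                         atomic_long_read(&direct_pages_count[PG_DIRECT_MAP_4K]) << 2);
              seq_printf(m, "DirectMap1M:    %8lu kB\n",
                         atomic_long_read(&direct_pages_count[PG_DIRECT_MAP_1M]) << 10);
              seq_printf(m, "DirectMap2G:    %8lu kB\n",
                         atomic_long_read(&direct_pages_count[PG_DIRECT_MAP_2G]) << 21);
      }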
      Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
      Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
    • s390/vmem: simplify vmem code for read-only mappings · bab247ff
      Committed by Heiko Carstens
      For the kernel identity mapping, map everything read-writable and
      subsequently call set_memory_ro() to make the ro section read-only.
      This simplifies the code a lot.
      Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
      Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
    • s390/pageattr: allow kernel page table splitting · e8a97e42
      Committed by Heiko Carstens
      set_memory_ro() and set_memory_rw() currently only work on 4k
      mappings, which is good enough for module code aka the vmalloc area.
      
      However we stumbled already twice into the need to make this also work
      on larger mappings:
      - the ro after init patch set
      - the crash kernel resize code
      
      Therefore this patch implements automatic kernel page table splitting
      if e.g. set_memory_ro() is called on parts of a 2G mapping.
      This works much the same as the x86 code, but is much simpler.
      
      In order to make this work and to be architecturally compliant we now
      always use the csp, cspg or crdte instructions to replace valid page
      table entries. This means that set_memory_ro() and set_memory_rw()
      will be much more expensive than before. In order to avoid huge
      latencies the code contains a couple of cond_resched() calls.
      
      The current code only splits page tables, but does not merge them if
      it would be possible. The reason for this is that currently there is
      no real-life scenario where this would really happen. All current use
      cases that I know of only change access rights once during the
      lifetime. If that should change, we can still implement kernel page table
      merging at a later time.
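
      A schematic of the splitting walk described above (all helper names are
      illustrative; the real code replaces live entries with csp/cspg/crdte):

      /* Schematic: split large mappings on demand, yield between steps. */
      static int change_attr_sketch(unsigned long addr, unsigned long end,
                                    bool make_ro)
      {
              while (addr < end) {
                      if (mapped_by_large_entry(addr))        /* illustrative */
                              split_large_entry(addr);        /* csp/cspg/crdte */
                      update_4k_range(addr, make_ro);         /* illustrative */
                      addr += PAGE_SIZE;
                      cond_resched();         /* keep latencies bounded */
              }
              return 0;
      }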
      Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
      Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
    • s390/mm: always use PAGE_KERNEL when mapping pages · 3e76ee99
      Committed by Heiko Carstens
      Always use PAGE_KERNEL when re-enabling pages within the kernel
      mapping due to debug pagealloc. Without this pgprot value,
      pte_mkwrite() and pte_wrprotect() no longer work on such mappings after
      an unmap -> map cycle.
      Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
      Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
    • s390/vmem: make use of pte_clear() · 5aa29975
      Committed by Heiko Carstens
      Use pte_clear() instead of open-coding it.
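
      For illustration (addr and ptep are placeholders):

      pte_val(*ptep) = _PAGE_INVALID;         /* open-coded clear */
      pte_clear(&init_mm, addr, ptep);        /* helper doing the same */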
      Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
      Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
    • s390/pgtable: get rid of _REGION3_ENTRY_RO · c126aa83
      Committed by Heiko Carstens
      _REGION3_ENTRY_RO is a duplicate of _REGION_ENTRY_PROTECT.
      Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
      Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
    • s390/vmem: introduce and use SEGMENT_KERNEL and REGION3_KERNEL · 2dffdcba
      Committed by Heiko Carstens
      Instead of open-coding the SEGMENT_KERNEL and REGION3_KERNEL values, use
      defines. Also, to make e.g. pmd_wrprotect() work on the kernel mapping,
      a couple more flags must be set. Therefore add the missing flags as well.
      
      In order to make everything symmetrical this patch also adds software
      dirty, young, read and write bits for region 3 table entries.
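
      An illustrative shape of such a define (the authoritative flag list is
      what the commit introduces; flag names as in
      arch/s390/include/asm/pgtable.h):

      #define SEGMENT_KERNEL  __pgprot(_SEGMENT_ENTRY |       \
                                       _SEGMENT_ENTRY_LARGE | \
                                       _SEGMENT_ENTRY_YOUNG | \
                                       _SEGMENT_ENTRY_DIRTY | \
                                       _SEGMENT_ENTRY_READ |  \
                                       _SEGMENT_ENTRY_WRITE)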
      Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
      Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
    • s390/vmem: align segment and region tables to 16k · 2e9996fc
      Committed by Heiko Carstens
      Usually segment and region tables are 16k aligned due to the way the
      buddy allocator works. This is not true for the vmem code, which only
      asks for a 4k alignment. In order to be consistent, enforce a 16k
      alignment here as well.
      
      This alignment will be assumed and therefore is required by the
      pageattr code.
      Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
      Acked-by: Christian Borntraeger <borntraeger@de.ibm.com>
      Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
    • KVM: s390/mm: Fix CMMA reset during reboot · 1c343f7b
      Committed by Christian Borntraeger
      Commit 1e133ab2 ("s390/mm: split arch/s390/mm/pgtable.c") factored
      out the page table handling code from __gmap_zap and __s390_reset_cmma
      into ptep_zap_unused and added a simple flag that tells which of the two
      behaviours (reset or not) is wanted. This also changed the behaviour,
      as it now also zaps unused page table entries on reset.
      It turns out that this is wrong, as s390_reset_cmma uses the page walker,
      which DOES NOT take the ptl lock.
      
      The simplest fix is to not do the zapping part on reset (which uses
      the walker).
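
      A sketch of the flag-based behaviour after the fix (the signature
      follows the 1e133ab2 split; the helpers in the body are illustrative):

      void ptep_zap_unused(struct mm_struct *mm, unsigned long addr,
                           pte_t *ptep, int reset)
      {
              if (reset) {
                      /* walker path, no ptl held: only reset the CMMA
                       * usage state, do not zap the entry
                       */
                      reset_cmma_state(ptep);                 /* illustrative */
              } else if (pte_is_unused_sketch(*ptep)) {       /* illustrative */
                      /* __gmap_zap path, ptl held: zapping is safe here */
                      pte_clear(mm, addr, ptep);
              }
      }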
      Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
      Fixes: 1e133ab2 ("s390/mm: split arch/s390/mm/pgtable.c")
      Cc: stable@vger.kernel.org # 4.6+
      Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>