- 16 12月, 2015 1 次提交
-
-
由 Dominik Dingel 提交于
While the userspace interface requests the maximum size the gmap code expects to get a maximum address. This error resulted in bigger page tables than necessary for some guest sizes, e.g. a 2GB guest used 3 levels instead of 2. At the same time we introduce KVM_S390_NO_MEM_LIMIT, which allows in a bright future that a guest spans the complete 64 bit address space. We also switch to TASK_MAX_SIZE for the initial memory size, this is a cosmetic change as the previous size also resulted in a 4 level pagetable creation. Reported-by: NDavid Hildenbrand <dahi@linux.vnet.ibm.com> Reviewed-by: NCornelia Huck <cornelia.huck@de.ibm.com> Signed-off-by: NDominik Dingel <dingel@linux.vnet.ibm.com> Signed-off-by: NChristian Borntraeger <borntraeger@de.ibm.com>
-
- 19 8月, 2015 1 次提交
-
-
由 Martin Schwidefsky 提交于
With the removal of the dynamic reallocation of page tables for KVM (see git commit 0b46e0a3) the page table allocation / freeing code can be simplified. The page table free code can now use the alloc_pgste bit in the mm context to decide if a page table is 2K or 4K, there is no mix of different sized page tables anymore. This eliminates the need to use "page->_mapcount == 0" to check for 4K page table. Use the lower two bits in page->_mapcount to indicate which 2K fragments of the 4K page are in use. As 31-bit support is gone, remove the two defines ALLOC_ORDER and FRAG_MASK and use the constants directly where appropriate. Signed-off-by: NMartin Schwidefsky <schwidefsky@de.ibm.com>
-
- 18 7月, 2015 2 次提交
-
-
由 Dominik Dingel 提交于
Heiko noticed that the current check for hugepage support on s390 is a little bit too harsh as systems which do not support will crash. The reason is that pageblock_order can now get negative when we set HPAGE_SHIFT to 0. To avoid all this and to avoid opening another can of worms with enabling HUGETLB_PAGE_SIZE_VARIABLE I think it would be best to simply allow architectures to define their own hugepages_supported(). Revert bea41197 ("s390/mm: make hugepages_supported a boot time decision") in preparation. Signed-off-by: NDominik Dingel <dingel@linux.vnet.ibm.com> Acked-by: NMartin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Michael Holzheu <holzheu@linux.vnet.ibm.com> Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Dominik Dingel 提交于
Heiko noticed that the current check for hugepage support on s390 is a little bit too harsh as systems which do not support will crash. The reason is that pageblock_order can now get negative when we set HPAGE_SHIFT to 0. To avoid all this and to avoid opening another can of worms with enabling HUGETLB_PAGE_SIZE_VARIABLE I think it would be best to simply allow architectures to define their own hugepages_supported(). This patch (of 4): revert commit cf54e2fc ("s390/mm: change HPAGE_SHIFT type to int") in preparation. Signed-off-by: NDominik Dingel <dingel@linux.vnet.ibm.com> Acked-by: NMartin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Michael Holzheu <holzheu@linux.vnet.ibm.com> Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 26 6月, 2015 2 次提交
-
-
由 Dominik Dingel 提交于
With making HPAGE_SHIFT an unsigned integer we also accidentally changed pageblock_order. In order to avoid compiler warnings we make HPAGE_SHFIT an int again. Signed-off-by: NDominik Dingel <dingel@linux.vnet.ibm.com> Suggested-by: NAndrew Morton <akpm@linux-foundation.org> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Dominik Dingel 提交于
There is a potential bug with KVM and hugetlbfs if the hardware does not support hugepages (EDAT1). We fix this by making EDAT1 a hard requirement for hugepages and therefore removing and simplifying code. As s390, with the sw-emulated hugepages, was the only user of arch_prepare/release_hugepage I also removed theses calls from common and other architecture code. This patch (of 5): By dropping support for hugepages on machines which do not have the hardware feature EDAT1, we fix a potential s390 KVM bug. The bug would happen if a guest is backed by hugetlbfs (not supported currently), but does not get pagetables with PGSTE. This would lead to random memory overwrites. Signed-off-by: NDominik Dingel <dingel@linux.vnet.ibm.com> Acked-by: NMartin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 23 4月, 2015 1 次提交
-
-
由 Martin Schwidefsky 提交于
Replacing a 2K page table with a 4K page table while a VMA is active for the affected memory region is fundamentally broken. Rip out the page table reallocation code and replace it with a simple system control 'vm.allocate_pgste'. If the system control is set the page tables for all processes are allocated as full 4K pages, even for processes that do not need it. Signed-off-by: NMartin Schwidefsky <schwidefsky@de.ibm.com>
-
- 25 3月, 2015 1 次提交
-
-
由 Heiko Carstens 提交于
Remove the 31 bit support in order to reduce maintenance cost and effectively remove dead code. Since a couple of years there is no distribution left that comes with a 31 bit kernel. The 31 bit kernel also has been broken since more than a year before anybody noticed. In addition I added a removal warning to the kernel shown at ipl for 5 minutes: a960062e ("s390: add 31 bit warning message") which let everybody know about the plan to remove 31 bit code. We didn't get any response. Given that the last 31 bit only machine was introduced in 1999 let's remove the code. Anybody with 31 bit user space code can still use the compat mode. Signed-off-by: NHeiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: NMartin Schwidefsky <schwidefsky@de.ibm.com>
-
- 08 1月, 2015 2 次提交
-
-
由 Heiko Carstens 提交于
Get rid of warnings like this one: warning: constant 0xffe0000000000000 is so big it is unsigned long Signed-off-by: NHeiko Carstens <heiko.carstens@de.ibm.com>
-
由 Martin Schwidefsky 提交于
pmd_to_page() is only available if USE_SPLIT_PMD_PTLOCKS is defined. The use of pmd_to_page in the gmap code can cause compile errors if NR_CPUS is smaller than SPLIT_PTLOCK_CPUS. Do not use pmd_to_page outside of USE_SPLIT_PMD_PTLOCKS sections. Reported-by: NMike Frysinger <vapier@gentoo.org> Signed-off-by: NMartin Schwidefsky <schwidefsky@de.ibm.com>
-
- 28 11月, 2014 1 次提交
-
-
由 Jason J. Herne 提交于
Define get_guest_storage_key which can be used to get the value of a guest storage key. This compliments the functionality provided by the helper function set_guest_storage_key. Both functions are needed for live migration of s390 guests that use storage keys. Signed-off-by: NJason J. Herne <jjherne@linux.vnet.ibm.com> Reviewed-by: NDavid Hildenbrand <dahi@linux.vnet.ibm.com> Signed-off-by: NChristian Borntraeger <borntraeger@de.ibm.com>
-
- 03 11月, 2014 1 次提交
-
-
由 Martin Schwidefsky 提交于
The page table lock is acquired with a call to get_locked_pte, replace the plain spin_unlock with the correct unlock function pte_unmap_unlock. Signed-off-by: NMartin Schwidefsky <schwidefsky@de.ibm.com>
-
- 28 10月, 2014 1 次提交
-
-
由 Jason J. Herne 提交于
In set_guest_storage_key, we really want to reference the mm struct given as a parameter to the function. So replace the current->mm reference with the mm struct passed in by the caller. Signed-off-by: NJason J. Herne <jjherne@us.ibm.com> Signed-off-by: NChristian Borntraeger <borntraeger@de.ibm.com>
-
- 27 10月, 2014 4 次提交
-
-
由 Dominik Dingel 提交于
After fixup_user_fault does not fail we have a writeable pte. That pte might transform but it should not vanish. Signed-off-by: NDominik Dingel <dingel@linux.vnet.ibm.com> Signed-off-by: NMartin Schwidefsky <schwidefsky@de.ibm.com>
-
由 Dominik Dingel 提交于
When storage keys are enabled unmerge already merged pages and prevent new pages from being merged. Signed-off-by: NDominik Dingel <dingel@linux.vnet.ibm.com> Acked-by: NChristian Borntraeger <borntraeger@de.ibm.com> Reviewed-by: NPaolo Bonzini <pbonzini@redhat.com> Signed-off-by: NMartin Schwidefsky <schwidefsky@de.ibm.com>
-
由 Dominik Dingel 提交于
As soon as storage keys are enabled we need to stop working on zero page mappings to prevent inconsistencies between storage keys and pgste. Otherwise following data corruption could happen: 1) guest enables storage key 2) guest sets storage key for not mapped page X -> change goes to PGSTE 3) guest reads from page X -> as X was not dirty before, the page will be zero page backed, storage key from PGSTE for X will go to storage key for zero page 4) guest sets storage key for not mapped page Y (same logic as above 5) guest reads from page Y -> as Y was not dirty before, the page will be zero page backed, storage key from PGSTE for Y will got to storage key for zero page overwriting storage key for X While holding the mmap sem, we are safe against changes on entries we already fixed, as every fault would need to take the mmap_sem (read). Other vCPUs executing storage key instructions will get a one time interception and be serialized also with mmap_sem. Signed-off-by: NDominik Dingel <dingel@linux.vnet.ibm.com> Reviewed-by: NPaolo Bonzini <pbonzini@redhat.com> Signed-off-by: NMartin Schwidefsky <schwidefsky@de.ibm.com>
-
由 Dominik Dingel 提交于
Replace the s390 specific page table walker for the pgste updates with a call to the common code walk_page_range function. There are now two pte modification functions, one for the reset of the CMMA state and another one for the initialization of the storage keys. Signed-off-by: NDominik Dingel <dingel@linux.vnet.ibm.com> Signed-off-by: NMartin Schwidefsky <schwidefsky@de.ibm.com>
-
- 15 10月, 2014 1 次提交
-
-
由 Dominik Dingel 提交于
pte_unmap works on page table entry pointers, derefencing should be avoided. As on s390 pte_unmap is a NOP, this is more a cleanup if we want to supply later such function. Signed-off-by: NDominik Dingel <dingel@linux.vnet.ibm.com> Reviewed-by: NThomas Huth <thuth@linux.vnet.ibm.com> Signed-off-by: NMartin Schwidefsky <schwidefsky@de.ibm.com>
-
- 29 8月, 2014 1 次提交
-
-
由 Christian Borntraeger 提交于
commit ab3f285f ("KVM: s390/mm: try a cow on read only pages for key ops")' misaligned a code block. Let's fixup the indentation. Reported-by: NBen Hutchings <ben@decadent.org.uk> Signed-off-by: NChristian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
- 26 8月, 2014 2 次提交
-
-
由 Martin Schwidefsky 提交于
Add an addressing limit to the gmap address spaces and only allocate the page table levels that are needed for the given limit. The limit is fixed and can not be changed after a gmap has been created. Signed-off-by: NMartin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: NChristian Borntraeger <borntraeger@de.ibm.com>
-
由 Martin Schwidefsky 提交于
Store the target address for the gmap segments in a radix tree instead of using invalid segment table entries. gmap_translate becomes a simple radix_tree_lookup, gmap_fault is split into the address translation with gmap_translate and the part that does the linking of the gmap shadow page table with the process page table. A second radix tree is used to keep the pointers to the segment table entries for segments that are mapped in the guest address space. On unmap of a segment the pointer is retrieved from the radix tree and is used to carry out the segment invalidation in the gmap shadow page table. As the radix tree can only store one pointer, each host segment may only be mapped to exactly one guest location. Signed-off-by: NMartin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: NChristian Borntraeger <borntraeger@de.ibm.com>
-
- 25 8月, 2014 3 次提交
-
-
由 Martin Schwidefsky 提交于
Make the order of arguments for the gmap calls more consistent, if the gmap pointer is passed it is always the first argument. In addition distinguish between guest address and user address by naming the variables gaddr for a guest address and vmaddr for a user address. Signed-off-by: NMartin Schwidefsky <schwidefsky@de.ibm.com> Reviewed-by: NCornelia Huck <cornelia.huck@de.ibm.com> Signed-off-by: NChristian Borntraeger <borntraeger@de.ibm.com>
-
由 Martin Schwidefsky 提交于
Revert git commit c3a23b9874c1 ("remove unnecessary parameter from gmap_do_ipte_notify"). Signed-off-by: NMartin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: NChristian Borntraeger <borntraeger@de.ibm.com>
-
由 Christian Borntraeger 提交于
The PFMF instruction handler blindly wrote the storage key even if the page was mapped R/O in the host. Lets try a COW before continuing and bail out in case of errors. Signed-off-by: NChristian Borntraeger <borntraeger@de.ibm.com> Reviewed-by: NDominik Dingel <dingel@linux.vnet.ibm.com> Cc: stable@vger.kernel.org
-
- 01 8月, 2014 2 次提交
-
-
由 Martin Schwidefsky 提交于
The large segment table entry format has block of bits for the ACC/F values for the large page. These bits are valid only if another bit (AV bit 0x10000) of the segment table entry is set. The ACC/F bits do not have a meaning if the AV bit is off. This allows to put the THP splitting bit, the segment young bit and the new segment dirty bit into the ACC/F bits as long as the AV bit stays off. The dirty and young information is only available if the pmd is large. Signed-off-by: NMartin Schwidefsky <schwidefsky@de.ibm.com>
-
由 Christian Borntraeger 提交于
commit ec66ad66 (s390/mm: enable split page table lock for PMD level) activated the split pmd lock for s390. Turns out that we missed one place: We also have to take the pmd lock instead of the page table lock when we reallocate the page tables (==> changing entries in the PMD) during sie enablement. Cc: stable@vger.kernel.org # 3.15+ Signed-off-by: NChristian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: NMartin Schwidefsky <schwidefsky@de.ibm.com>
-
- 20 5月, 2014 1 次提交
-
-
由 Martin Schwidefsky 提交于
Always switch to the kernel ASCE in switch_mm. Load the secondary space ASCE in finish_arch_post_lock_switch after checking that any pending page table operations have completed. The primary ASCE is loaded in entry[64].S. With this the update_primary_asce call can be removed from the switch_to macro and from the start of switch_mm function. Remove the load_primary argument from update_user_asce/clear_user_asce, rename update_user_asce to set_user_asce and rename update_primary_asce to load_kernel_asce. Signed-off-by: NMartin Schwidefsky <schwidefsky@de.ibm.com>
-
- 16 5月, 2014 1 次提交
-
-
由 Martin Schwidefsky 提交于
Use the mm semaphore to serialize multiple invocations of s390_enable_skey. The second CPU faulting on a storage key operation needs to wait for the completion of the page table update. Taking the mm semaphore writable has the positive side-effect that it prevents any host faults from taking place which does have implications on keys vs PGSTE. Signed-off-by: NMartin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: NChristian Borntraeger <borntraeger@de.ibm.com>
-
- 22 4月, 2014 4 次提交
-
-
由 Dominik Dingel 提交于
For live migration kvm needs to test and clear the dirty bit of guest pages. That for is ptep_test_and_clear_user_dirty, to be sure we are not racing with other code, we protect the pte. This needs to be done within the architecture memory management code. Signed-off-by: NDominik Dingel <dingel@linux.vnet.ibm.com> Acked-by: NMartin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: NChristian Borntraeger <borntraeger@de.ibm.com>
-
由 Martin Schwidefsky 提交于
Switch the user dirty bit detection used for migration from the hardware provided host change-bit in the pgste to a fault based detection method. This reduced the dependency of the host from the storage key to a point where it becomes possible to enable the RCP bypass for KVM guests. The fault based dirty detection will only indicate changes caused by accesses via the guest address space. The hardware based method can detect all changes, even those caused by I/O or accesses via the kernel page table. The KVM/qemu code needs to take this into account. Signed-off-by: NMartin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: NDominik Dingel <dingel@linux.vnet.ibm.com> Signed-off-by: NChristian Borntraeger <borntraeger@de.ibm.com>
-
由 Dominik Dingel 提交于
Introduce a new function s390_enable_skey(), which enables storage key handling via setting the use_skey flag in the mmu context. This function is only useful within the context of kvm. Note that enabling storage keys will cause a one-time hickup when walking the page table; however, it saves us special effort for cases like clear reset while making it possible for us to be architecture conform. s390_enable_skey() takes the page table lock to prevent reseting storage keys triggered from multiple vcpus. Signed-off-by: NDominik Dingel <dingel@linux.vnet.ibm.com> Acked-by: NMartin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: NChristian Borntraeger <borntraeger@de.ibm.com>
-
由 Dominik Dingel 提交于
page_table_reset_pgste() already does a complete page table walk to reset the pgste. Enhance it to initialize the storage keys to PAGE_DEFAULT_KEY if requested by the caller. This will be used for lazy storage key handling. Also provide an empty stub for !CONFIG_PGSTE Lets adopt the current code (diag 308) to not clear the keys. Signed-off-by: NDominik Dingel <dingel@linux.vnet.ibm.com> Acked-by: NMartin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: NChristian Borntraeger <borntraeger@de.ibm.com>
-
- 08 4月, 2014 1 次提交
-
-
由 Alex Thorlton 提交于
The main motivation behind this patch is to provide a way to disable THP for jobs where the code cannot be modified, and using a malloc hook with madvise is not an option (i.e. statically allocated data). This patch allows us to do just that, without affecting other jobs running on the system. We need to do this sort of thing for jobs where THP hurts performance, due to the possibility of increased remote memory accesses that can be created by situations such as the following: When you touch 1 byte of an untouched, contiguous 2MB chunk, a THP will be handed out, and the THP will be stuck on whatever node the chunk was originally referenced from. If many remote nodes need to do work on that same chunk, they'll be making remote accesses. With THP disabled, 4K pages can be handed out to separate nodes as they're needed, greatly reducing the amount of remote accesses to memory. This patch is based on some of my work combined with some suggestions/patches given by Oleg Nesterov. The main goal here is to add a prctl switch to allow us to disable to THP on a per mm_struct basis. Here's a bit of test data with the new patch in place... First with the flag unset: # perf stat -a ./prctl_wrapper_mmv3 0 ./thp_pthread -C 0 -m 0 -c 512 -b 256g Setting thp_disabled for this task... thp_disable: 0 Set thp_disabled state to 0 Process pid = 18027 PF/ MAX MIN TOTCPU/ TOT_PF/ TOT_PF/ WSEC/ TYPE: CPUS WALL WALL SYS USER TOTCPU CPU WALL_SEC SYS_SEC CPU NODES 512 1.120 0.060 0.000 0.110 0.110 0.000 28571428864 -9223372036854775808 55803572 23 Performance counter stats for './prctl_wrapper_mmv3_hack 0 ./thp_pthread -C 0 -m 0 -c 512 -b 256g': 273719072.841402 task-clock # 641.026 CPUs utilized [100.00%] 1,008,986 context-switches # 0.000 M/sec [100.00%] 7,717 CPU-migrations # 0.000 M/sec [100.00%] 1,698,932 page-faults # 0.000 M/sec 355,222,544,890,379 cycles # 1.298 GHz [100.00%] 536,445,412,234,588 stalled-cycles-frontend # 151.02% frontend cycles idle [100.00%] 409,110,531,310,223 stalled-cycles-backend # 115.17% backend cycles idle [100.00%] 148,286,797,266,411 instructions # 0.42 insns per cycle # 3.62 stalled cycles per insn [100.00%] 27,061,793,159,503 branches # 98.867 M/sec [100.00%] 1,188,655,196 branch-misses # 0.00% of all branches 427.001706337 seconds time elapsed Now with the flag set: # perf stat -a ./prctl_wrapper_mmv3 1 ./thp_pthread -C 0 -m 0 -c 512 -b 256g Setting thp_disabled for this task... thp_disable: 1 Set thp_disabled state to 1 Process pid = 144957 PF/ MAX MIN TOTCPU/ TOT_PF/ TOT_PF/ WSEC/ TYPE: CPUS WALL WALL SYS USER TOTCPU CPU WALL_SEC SYS_SEC CPU NODES 512 0.620 0.260 0.250 0.320 0.570 0.001 51612901376 128000000000 100806448 23 Performance counter stats for './prctl_wrapper_mmv3_hack 1 ./thp_pthread -C 0 -m 0 -c 512 -b 256g': 138789390.540183 task-clock # 641.959 CPUs utilized [100.00%] 534,205 context-switches # 0.000 M/sec [100.00%] 4,595 CPU-migrations # 0.000 M/sec [100.00%] 63,133,119 page-faults # 0.000 M/sec 147,977,747,269,768 cycles # 1.066 GHz [100.00%] 200,524,196,493,108 stalled-cycles-frontend # 135.51% frontend cycles idle [100.00%] 105,175,163,716,388 stalled-cycles-backend # 71.07% backend cycles idle [100.00%] 180,916,213,503,160 instructions # 1.22 insns per cycle # 1.11 stalled cycles per insn [100.00%] 26,999,511,005,868 branches # 194.536 M/sec [100.00%] 714,066,351 branch-misses # 0.00% of all branches 216.196778807 seconds time elapsed As with previous versions of the patch, We're getting about a 2x performance increase here. Here's a link to the test case I used, along with the little wrapper to activate the flag: http://oss.sgi.com/projects/memtests/thp_pthread_mmprctlv3.tar.gz This patch (of 3): Revert commit 8e72033f and add in code to fix up any issues caused by the revert. The revert is necessary because hugepage_madvise would return -EINVAL when VM_NOHUGEPAGE is set, which will break subsequent chunks of this patch set. Here's a snip of an e-mail from Gerald detailing the original purpose of this code, and providing justification for the revert: "The intent of commit 8e72033f was to guard against any future programming errors that may result in an madvice(MADV_HUGEPAGE) on guest mappings, which would crash the kernel. Martin suggested adding the bit to arch/s390/mm/pgtable.c, if 8e72033f was to be reverted, because that check will also prevent a kernel crash in the case described above, it will now send a SIGSEGV instead. This would now also allow to do the madvise on other parts, if needed, so it is a more flexible approach. One could also say that it would have been better to do it this way right from the beginning..." Signed-off-by: NAlex Thorlton <athorlton@sgi.com> Suggested-by: NOleg Nesterov <oleg@redhat.com> Tested-by: NChristian Borntraeger <borntraeger@de.ibm.com> Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: David Rientjes <rientjes@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 03 4月, 2014 3 次提交
-
-
由 Heiko Carstens 提交于
The current uaccess code uses a page table walk in some circumstances, e.g. in case of the in atomic futex operations or if running on old hardware which doesn't support the mvcos instruction. However it turned out that the page table walk code does not correctly lock page tables when accessing page table entries. In other words: a different cpu may invalidate a page table entry while the current cpu inspects the pte. This may lead to random data corruption. Adding correct locking however isn't trivial for all uaccess operations. Especially copy_in_user() is problematic since that requires to hold at least two locks, but must be protected against ABBA deadlock when a different cpu also performs a copy_in_user() operation. So the solution is a different approach where we change address spaces: User space runs in primary address mode, or access register mode within vdso code, like it currently already does. The kernel usually also runs in home space mode, however when accessing user space the kernel switches to primary or secondary address mode if the mvcos instruction is not available or if a compare-and-swap (futex) instruction on a user space address is performed. KVM however is special, since that requires the kernel to run in home address space while implicitly accessing user space with the sie instruction. So we end up with: User space: - runs in primary or access register mode - cr1 contains the user asce - cr7 contains the user asce - cr13 contains the kernel asce Kernel space: - runs in home space mode - cr1 contains the user or kernel asce -> the kernel asce is loaded when a uaccess requires primary or secondary address mode - cr7 contains the user or kernel asce, (changed with set_fs()) - cr13 contains the kernel asce In case of uaccess the kernel changes to: - primary space mode in case of a uaccess (copy_to_user) and uses e.g. the mvcp instruction to access user space. However the kernel will stay in home space mode if the mvcos instruction is available - secondary space mode in case of futex atomic operations, so that the instructions come from primary address space and data from secondary space In case of kvm the kernel runs in home space mode, but cr1 gets switched to contain the gmap asce before the sie instruction gets executed. When the sie instruction is finished cr1 will be switched back to contain the user asce. A context switch between two processes will always load the kernel asce for the next process in cr1. So the first exit to user space is a bit more expensive (one extra load control register instruction) than before, however keeps the code rather simple. In sum this means there is no need to perform any error prone page table walks anymore when accessing user space. The patch seems to be rather large, however it mainly removes the the page table walk code and restores the previously deleted "standard" uaccess code, with a couple of changes. The uaccess without mvcos mode can be enforced with the "uaccess_primary" kernel parameter. Reported-by: NChristian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: NHeiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: NMartin Schwidefsky <schwidefsky@de.ibm.com>
-
由 Martin Schwidefsky 提交于
The zEC12 machines introduced the local-clearing control for the IDTE and IPTE instruction. If the control is set only the TLB of the local CPU is cleared of entries, either all entries of a single address space for IDTE, or the entry for a single page-table entry for IPTE. Without the local-clearing control the TLB flush is broadcasted to all CPUs in the configuration, which is expensive. The reset of the bit mask of the CPUs that need flushing after a non-local IDTE is tricky. As TLB entries for an address space remain in the TLB even if the address space is detached a new bit field is required to keep track of attached CPUs vs. CPUs in the need of a flush. After a non-local flush with IDTE the bit-field of attached CPUs is copied to the bit-field of CPUs in need of a flush. The ordering of operations on cpu_attach_mask, attach_count and mm_cpumask(mm) is such that an underindication in mm_cpumask(mm) is prevented but an overindication in mm_cpumask(mm) is possible. Signed-off-by: NMartin Schwidefsky <schwidefsky@de.ibm.com>
-
由 Martin Schwidefsky 提交于
The principles of operations states that the CPU is allowed to create TLB entries for an address space anytime while an ASCE is loaded to the control register. This is true even if the CPU is running in the kernel and the user address space is not (actively) accessed. In theory this can affect two aspects of the TLB flush logic. For full-mm flushes the ASCE of the dying process is still attached. The approach to flush first with IDTE and then just free all page tables can in theory lead to stale TLB entries. Use the batched free of page tables for the full-mm flushes as well. For operations that can have a stale ASCE in the control register, e.g. a delayed update_user_asce in switch_mm, load the kernel ASCE to prevent invalid TLBs from being created. Signed-off-by: NMartin Schwidefsky <schwidefsky@de.ibm.com>
-
- 21 3月, 2014 2 次提交
-
-
由 Dominik Dingel 提交于
Signed-off-by: NDominik Dingel <dingel@linux.vnet.ibm.com> Signed-off-by: NMartin Schwidefsky <schwidefsky@de.ibm.com>
-
由 Dominik Dingel 提交于
Signed-off-by: NDominik Dingel <dingel@linux.vnet.ibm.com> Signed-off-by: NMartin Schwidefsky <schwidefsky@de.ibm.com>
-
- 21 2月, 2014 2 次提交
-
-
由 Martin Schwidefsky 提交于
Add the pgtable_pmd_page_ctor/pgtable_pmd_page_dtor calls to the pmd allocation and free functions and enable ARCH_ENABLE_SPLIT_PMD_PTLOCK for 64 bit. Signed-off-by: NMartin Schwidefsky <schwidefsky@de.ibm.com>
-
由 Martin Schwidefsky 提交于
The guest page state needs to be reset to stable for all pages on initial program load via diagnose 0x308. Signed-off-by: NMartin Schwidefsky <schwidefsky@de.ibm.com>
-