- 13 7月, 2017 13 次提交
-
-
由 Ladi Prosek 提交于
The backwards_tsc_observed global introduced in commit 16a96021 is never reset to false. If a VM happens to be running while the host is suspended (a common source of the TSC jumping backwards), master clock will never be enabled again for any VM. In contrast, if no VM is running while the host is suspended, master clock is unaffected. This is inconsistent and unnecessarily strict. Let's track the backwards_tsc_observed variable separately and let each VM start with a clean slate. Real world impact: My Windows VMs get slower after my laptop undergoes a suspend/resume cycle. The only way to get the perf back is unloading and reloading the kvm module. Signed-off-by: NLadi Prosek <lprosek@redhat.com> Signed-off-by: NRadim Krčmář <rkrcmar@redhat.com>
-
由 Janakarajan Natarajan 提交于
Enable the Virtual VMLOAD VMSAVE feature. This is done by setting bit 1 at position B8h in the vmcb. The processor must have nested paging enabled, be in 64-bit mode and have support for the Virtual VMLOAD VMSAVE feature for the bit to be set in the vmcb. Signed-off-by: NJanakarajan Natarajan <Janakarajan.Natarajan@amd.com> Reviewed-by: NPaolo Bonzini <pbonzini@redhat.com> Signed-off-by: NRadim Krčmář <rkrcmar@redhat.com>
-
由 Janakarajan Natarajan 提交于
Define a new cpufeature definition for Virtual VMLOAD VMSAVE. Signed-off-by: NJanakarajan Natarajan <Janakarajan.Natarajan@amd.com> Reviewed-by: NPaolo Bonzini <pbonzini@redhat.com> Signed-off-by: NRadim Krčmář <rkrcmar@redhat.com>
-
由 Janakarajan Natarajan 提交于
Rename the lbr_ctl variable to better reflect the purpose of the field - provide support for virtualization extensions. Signed-off-by: NJanakarajan Natarajan <Janakarajan.Natarajan@amd.com> Reviewed-by: NPaolo Bonzini <pbonzini@redhat.com> Signed-off-by: NRadim Krčmář <rkrcmar@redhat.com>
-
由 Janakarajan Natarajan 提交于
The lbr_ctl variable in the vmcb control area is used to enable or disable Last Branch Record (LBR) virtualization. However, this is to be done using only bit 0 of the variable. To correct this and to prepare for a new feature, change the current usage to work only on a particular bit. Signed-off-by: NJanakarajan Natarajan <Janakarajan.Natarajan@amd.com> Reviewed-by: NPaolo Bonzini <pbonzini@redhat.com> Signed-off-by: NRadim Krčmář <rkrcmar@redhat.com>
-
由 Ladi Prosek 提交于
kvm_skip_emulated_instruction handles the singlestep debug exception which is something we almost always want. This commit (specifically the change in rdmsr_interception) makes the debug.flat KVM unit test pass on AMD. Two call sites still call skip_emulated_instruction directly: * In svm_queue_exception where it's used only for moving the rip forward * In task_switch_interception which is analogous to handle_task_switch in VMX Signed-off-by: NLadi Prosek <lprosek@redhat.com> Signed-off-by: NRadim Krčmář <rkrcmar@redhat.com>
-
由 Radim Krčmář 提交于
kvm_vm_release() did not have slots_lock when calling kvm_io_bus_unregister_dev() and this went unnoticed until 4a12f951 ("KVM: mark kvm->busses as rcu protected") added dynamic checks. Luckily, there should be no race at that point: ============================= WARNING: suspicious RCU usage 4.12.0.kvm+ #0 Not tainted ----------------------------- ./include/linux/kvm_host.h:479 suspicious rcu_dereference_check() usage! lockdep_rcu_suspicious+0xc5/0x100 kvm_io_bus_unregister_dev+0x173/0x190 [kvm] kvm_free_pit+0x28/0x80 [kvm] kvm_arch_sync_events+0x2d/0x30 [kvm] kvm_put_kvm+0xa7/0x2a0 [kvm] kvm_vm_release+0x21/0x30 [kvm] Reviewed-by: NDavid Hildenbrand <david@redhat.com> Signed-off-by: NRadim Krčmář <rkrcmar@redhat.com>
-
由 Jim Mattson 提交于
vmx_complete_atomic_exit should call kvm_machine_check for any VM-entry failure due to a machine-check event. Such an exit should be recognized solely by its basic exit reason (i.e. the low 16 bits of the VMCS exit reason field). None of the other VMCS exit information fields contain valid information when the VM-exit is due to "VM-entry failure due to machine-check event". Signed-off-by: NJim Mattson <jmattson@google.com> Reviewed-by: NXiao Guangrong <xiaoguangrong@tencent.com> [Changed VM_EXIT_INTR_INFO condition to better describe its reason.] Signed-off-by: NRadim Krčmář <rkrcmar@redhat.com>
-
由 Radim Krčmář 提交于
kvm master clock usually has a different frequency than the kernel boot clock. This is not a problem until the master clock is updated; update uses the current kernel boot clock to compute new kvm clock, which erases any kvm clock cycles that might have built up due to frequency difference over a long period. KVM_SET_CLOCK is one of places where we can safely update master clock as the guest-visible clock is going to be shifted anyway. The problem with current code is that it updates the kvm master clock after updating the offset. If the master clock was enabled before calling KVM_SET_CLOCK, then it might have built up a significant delta from kernel boot clock. In the worst case, the time set by userspace would be shifted by so much that it couldn't have been set at any point during KVM_SET_CLOCK. To fix this, move kvm_gen_update_masterclock() before computing kvmclock_offset, which means that the master clock and kernel boot clock will be sufficiently close together. Another solution would be to replace get_kvmclock_ns() with "ktime_get_boot_ns() + ka->kvmclock_offset", which is marginally more accurate, but would break symmetry with KVM_GET_CLOCK. Signed-off-by: NRadim Krčmář <rkrcmar@redhat.com>
-
由 Jim Mattson 提交于
Inconsistencies result from shadowing only accesses to the full 64-bits of a 64-bit VMCS field, but not shadowing accesses to the high 32-bits of the field. The "high" part of a 64-bit field should be shadowed whenever the full 64-bit field is shadowed. Signed-off-by: NJim Mattson <jmattson@google.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Jim Mattson 提交于
Allow the L1 guest to specify the last page of addressable guest physical memory for an L2 MSR permission bitmap. Also remove the vmcs12_read_any() check that should never fail. Fixes: 3af18d9c ("KVM: nVMX: Prepare for using hardware MSR bitmap") Signed-off-by: NJim Mattson <jmattson@google.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Jim Mattson 提交于
According to the SDM, if the "use I/O bitmaps" VM-execution control is 1, bits 11:0 of each I/O-bitmap address must be 0. Neither address should set any bits beyond the processor's physical-address width. Signed-off-by: NJim Mattson <jmattson@google.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Jim Mattson 提交于
The VMCS launch state is not set to "launched" unless the VMLAUNCH actually succeeds. VMLAUNCH failure includes VM-exits with bit 31 set. Note that this change does not address the general problem that a failure to launch/resume vmcs02 (i.e. vmx->fail) is not handled correctly. Signed-off-by: NJim Mattson <jmattson@google.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
- 10 7月, 2017 1 次提交
-
-
由 Paolo Bonzini 提交于
This exit ended up being reported, but the currently exposed data does not provide much of a starting point for debugging. In the reported case, the vmexit was an EPT misconfiguration (MMIO access). Let userspace report ethe exit qualification and, if relevant, the GPA. Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
- 07 7月, 2017 6 次提交
-
-
由 Punit Agrawal 提交于
A poisoned or migrated hugepage is stored as a swap entry in the page tables. On architectures that support hugepages consisting of contiguous page table entries (such as on arm64) this leads to ambiguity in determining the page table entry to return in huge_pte_offset() when a poisoned entry is encountered. Let's remove the ambiguity by adding a size parameter to convey additional information about the requested address. Also fixup the definition/usage of huge_pte_offset() throughout the tree. Link: http://lkml.kernel.org/r/20170522133604.11392-4-punit.agrawal@arm.comSigned-off-by: NPunit Agrawal <punit.agrawal@arm.com> Acked-by: NSteve Capper <steve.capper@arm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will.deacon@arm.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: James Hogan <james.hogan@imgtec.com> (odd fixer:METAG ARCHITECTURE) Cc: Ralf Baechle <ralf@linux-mips.org> (supporter:MIPS) Cc: "James E.J. Bottomley" <jejb@parisc-linux.org> Cc: Helge Deller <deller@gmx.de> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Cc: Rich Felker <dalias@libc.org> Cc: "David S. Miller" <davem@davemloft.net> Cc: Chris Metcalf <cmetcalf@mellanox.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Mark Rutland <mark.rutland@arm.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Aneesh Kumar K.V 提交于
This moves the #ifdef in C code to a Kconfig dependency. Also we move the gigantic_page_supported() function to be arch specific. This allows architectures to conditionally enable runtime allocation of gigantic huge page. Architectures like ppc64 supports different gigantic huge page size (16G and 1G) based on the translation mode selected. This provides an opportunity for ppc64 to enable runtime allocation only w.r.t 1G hugepage. No functional change in this patch. Link: http://lkml.kernel.org/r/1494995292-4443-1-git-send-email-aneesh.kumar@linux.vnet.ibm.comSigned-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> (powerpc) Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Michal Hocko 提交于
arch_add_memory gets for_device argument which then controls whether we want to create memblocks for created memory sections. Simplify the logic by telling whether we want memblocks directly rather than going through pointless negation. This also makes the api easier to understand because it is clear what we want rather than nothing telling for_device which can mean anything. This shouldn't introduce any functional change. Link: http://lkml.kernel.org/r/20170515085827.16474-13-mhocko@kernel.orgSigned-off-by: NMichal Hocko <mhocko@suse.com> Tested-by: NDan Williams <dan.j.williams@intel.com> Acked-by: NVlastimil Babka <vbabka@suse.cz> Cc: Andi Kleen <ak@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Daniel Kiper <daniel.kiper@oracle.com> Cc: David Rientjes <rientjes@google.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Igor Mammedov <imammedo@redhat.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: Joonsoo Kim <js1304@gmail.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Reza Arbab <arbab@linux.vnet.ibm.com> Cc: Tobias Regnery <tobias.regnery@gmail.com> Cc: Toshi Kani <toshi.kani@hpe.com> Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: Xishi Qiu <qiuxishi@huawei.com> Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Michal Hocko 提交于
The current memory hotplug implementation relies on having all the struct pages associate with a zone/node during the physical hotplug phase (arch_add_memory->__add_pages->__add_section->__add_zone). In the vast majority of cases this means that they are added to ZONE_NORMAL. This has been so since 9d99aaa3 ("[PATCH] x86_64: Support memory hotadd without sparsemem") and it wasn't a big deal back then because movable onlining didn't exist yet. Much later memory hotplug wanted to (ab)use ZONE_MOVABLE for movable onlining 511c2aba ("mm, memory-hotplug: dynamic configure movable memory and portion memory") and then things got more complicated. Rather than reconsidering the zone association which was no longer needed (because the memory hotplug already depended on SPARSEMEM) a convoluted semantic of zone shifting has been developed. Only the currently last memblock or the one adjacent to the zone_movable can be onlined movable. This essentially means that the online type changes as the new memblocks are added. Let's simulate memory hot online manually $ echo 0x100000000 > /sys/devices/system/memory/probe $ grep . /sys/devices/system/memory/memory32/valid_zones Normal Movable $ echo $((0x100000000+(128<<20))) > /sys/devices/system/memory/probe $ grep . /sys/devices/system/memory/memory3?/valid_zones /sys/devices/system/memory/memory32/valid_zones:Normal /sys/devices/system/memory/memory33/valid_zones:Normal Movable $ echo $((0x100000000+2*(128<<20))) > /sys/devices/system/memory/probe $ grep . /sys/devices/system/memory/memory3?/valid_zones /sys/devices/system/memory/memory32/valid_zones:Normal /sys/devices/system/memory/memory33/valid_zones:Normal /sys/devices/system/memory/memory34/valid_zones:Normal Movable $ echo online_movable > /sys/devices/system/memory/memory34/state $ grep . /sys/devices/system/memory/memory3?/valid_zones /sys/devices/system/memory/memory32/valid_zones:Normal /sys/devices/system/memory/memory33/valid_zones:Normal Movable /sys/devices/system/memory/memory34/valid_zones:Movable Normal This is an awkward semantic because an udev event is sent as soon as the block is onlined and an udev handler might want to online it based on some policy (e.g. association with a node) but it will inherently race with new blocks showing up. This patch changes the physical online phase to not associate pages with any zone at all. All the pages are just marked reserved and wait for the onlining phase to be associated with the zone as per the online request. There are only two requirements - existing ZONE_NORMAL and ZONE_MOVABLE cannot overlap - ZONE_NORMAL precedes ZONE_MOVABLE in physical addresses the latter one is not an inherent requirement and can be changed in the future. It preserves the current behavior and made the code slightly simpler. This is subject to change in future. This means that the same physical online steps as above will lead to the following state: Normal Movable /sys/devices/system/memory/memory32/valid_zones:Normal Movable /sys/devices/system/memory/memory33/valid_zones:Normal Movable /sys/devices/system/memory/memory32/valid_zones:Normal Movable /sys/devices/system/memory/memory33/valid_zones:Normal Movable /sys/devices/system/memory/memory34/valid_zones:Normal Movable /sys/devices/system/memory/memory32/valid_zones:Normal Movable /sys/devices/system/memory/memory33/valid_zones:Normal Movable /sys/devices/system/memory/memory34/valid_zones:Movable Implementation: The current move_pfn_range is reimplemented to check the above requirements (allow_online_pfn_range) and then updates the respective zone (move_pfn_range_to_zone), the pgdat and links all the pages in the pfn range with the zone/node. __add_pages is updated to not require the zone and only initializes sections in the range. This allowed to simplify the arch_add_memory code (s390 could get rid of quite some of code). devm_memremap_pages is the only user of arch_add_memory which relies on the zone association because it only hooks into the memory hotplug only half way. It uses it to associate the new memory with ZONE_DEVICE but doesn't allow it to be {on,off}lined via sysfs. This means that this particular code path has to call move_pfn_range_to_zone explicitly. The original zone shifting code is kept in place and will be removed in the follow up patch for an easier review. Please note that this patch also changes the original behavior when offlining a memory block adjacent to another zone (Normal vs. Movable) used to allow to change its movable type. This will be handled later. [richard.weiyang@gmail.com: simplify zone_intersects()] Link: http://lkml.kernel.org/r/20170616092335.5177-1-richard.weiyang@gmail.com [richard.weiyang@gmail.com: remove duplicate call for set_page_links] Link: http://lkml.kernel.org/r/20170616092335.5177-2-richard.weiyang@gmail.com [akpm@linux-foundation.org: remove unused local `i'] Link: http://lkml.kernel.org/r/20170515085827.16474-12-mhocko@kernel.orgSigned-off-by: NMichal Hocko <mhocko@suse.com> Signed-off-by: NWei Yang <richard.weiyang@gmail.com> Tested-by: NDan Williams <dan.j.williams@intel.com> Tested-by: NReza Arbab <arbab@linux.vnet.ibm.com> Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com> # For s390 bits Acked-by: NVlastimil Babka <vbabka@suse.cz> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Daniel Kiper <daniel.kiper@oracle.com> Cc: David Rientjes <rientjes@google.com> Cc: Igor Mammedov <imammedo@redhat.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: Joonsoo Kim <js1304@gmail.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Tobias Regnery <tobias.regnery@gmail.com> Cc: Toshi Kani <toshi.kani@hpe.com> Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: Xishi Qiu <qiuxishi@huawei.com> Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Michal Hocko 提交于
Device memory hotplug hooks into regular memory hotplug only half way. It needs memory sections to track struct pages but there is no need/desire to associate those sections with memory blocks and export them to the userspace via sysfs because they cannot be onlined anyway. This is currently expressed by for_device argument to arch_add_memory which then makes sure to associate the given memory range with ZONE_DEVICE. register_new_memory then relies on is_zone_device_section to distinguish special memory hotplug from the regular one. While this works now, later patches in this series want to move __add_zone outside of arch_add_memory path so we have to come up with something else. Add want_memblock down the __add_pages path and use it to control whether the section->memblock association should be done. arch_add_memory then just trivially want memblock for everything but for_device hotplug. remove_memory_section doesn't need is_zone_device_section either. We can simply skip all the memblock specific cleanup if there is no memblock for the given section. This shouldn't introduce any functional change. Link: http://lkml.kernel.org/r/20170515085827.16474-5-mhocko@kernel.orgSigned-off-by: NMichal Hocko <mhocko@suse.com> Tested-by: NDan Williams <dan.j.williams@intel.com> Acked-by: NVlastimil Babka <vbabka@suse.cz> Cc: Andi Kleen <ak@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Daniel Kiper <daniel.kiper@oracle.com> Cc: David Rientjes <rientjes@google.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Igor Mammedov <imammedo@redhat.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: Joonsoo Kim <js1304@gmail.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Reza Arbab <arbab@linux.vnet.ibm.com> Cc: Tobias Regnery <tobias.regnery@gmail.com> Cc: Toshi Kani <toshi.kani@hpe.com> Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: Xishi Qiu <qiuxishi@huawei.com> Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Huang Ying 提交于
Patch series "THP swap: Delay splitting THP during swapping out", v11. This patchset is to optimize the performance of Transparent Huge Page (THP) swap. Recently, the performance of the storage devices improved so fast that we cannot saturate the disk bandwidth with single logical CPU when do page swap out even on a high-end server machine. Because the performance of the storage device improved faster than that of single logical CPU. And it seems that the trend will not change in the near future. On the other hand, the THP becomes more and more popular because of increased memory size. So it becomes necessary to optimize THP swap performance. The advantages of the THP swap support include: - Batch the swap operations for the THP to reduce lock acquiring/releasing, including allocating/freeing the swap space, adding/deleting to/from the swap cache, and writing/reading the swap space, etc. This will help improve the performance of the THP swap. - The THP swap space read/write will be 2M sequential IO. It is particularly helpful for the swap read, which are usually 4k random IO. This will improve the performance of the THP swap too. - It will help the memory fragmentation, especially when the THP is heavily used by the applications. The 2M continuous pages will be free up after THP swapping out. - It will improve the THP utilization on the system with the swap turned on. Because the speed for khugepaged to collapse the normal pages into the THP is quite slow. After the THP is split during the swapping out, it will take quite long time for the normal pages to collapse back into the THP after being swapped in. The high THP utilization helps the efficiency of the page based memory management too. There are some concerns regarding THP swap in, mainly because possible enlarged read/write IO size (for swap in/out) may put more overhead on the storage device. To deal with that, the THP swap in should be turned on only when necessary. For example, it can be selected via "always/never/madvise" logic, to be turned on globally, turned off globally, or turned on only for VMA with MADV_HUGEPAGE, etc. This patchset is the first step for the THP swap support. The plan is to delay splitting THP step by step, finally avoid splitting THP during the THP swapping out and swap out/in the THP as a whole. As the first step, in this patchset, the splitting huge page is delayed from almost the first step of swapping out to after allocating the swap space for the THP and adding the THP into the swap cache. This will reduce lock acquiring/releasing for the locks used for the swap cache management. With the patchset, the swap out throughput improves 15.5% (from about 3.73GB/s to about 4.31GB/s) in the vm-scalability swap-w-seq test case with 8 processes. The test is done on a Xeon E5 v3 system. The swap device used is a RAM simulated PMEM (persistent memory) device. To test the sequential swapping out, the test case creates 8 processes, which sequentially allocate and write to the anonymous pages until the RAM and part of the swap device is used up. This patch (of 5): In this patch, splitting huge page is delayed from almost the first step of swapping out to after allocating the swap space for the THP (Transparent Huge Page) and adding the THP into the swap cache. This will batch the corresponding operation, thus improve THP swap out throughput. This is the first step for the THP swap optimization. The plan is to delay splitting the THP step by step and avoid splitting the THP finally. In this patch, one swap cluster is used to hold the contents of each THP swapped out. So, the size of the swap cluster is changed to that of the THP (Transparent Huge Page) on x86_64 architecture (512). For other architectures which want such THP swap optimization, ARCH_USES_THP_SWAP_CLUSTER needs to be selected in the Kconfig file for the architecture. In effect, this will enlarge swap cluster size by 2 times on x86_64. Which may make it harder to find a free cluster when the swap space becomes fragmented. So that, this may reduce the continuous swap space allocation and sequential write in theory. The performance test in 0day shows no regressions caused by this. In the future of THP swap optimization, some information of the swapped out THP (such as compound map count) will be recorded in the swap_cluster_info data structure. The mem cgroup swap accounting functions are enhanced to support charge or uncharge a swap cluster backing a THP as a whole. The swap cluster allocate/free functions are added to allocate/free a swap cluster for a THP. A fair simple algorithm is used for swap cluster allocation, that is, only the first swap device in priority list will be tried to allocate the swap cluster. The function will fail if the trying is not successful, and the caller will fallback to allocate a single swap slot instead. This works good enough for normal cases. If the difference of the number of the free swap clusters among multiple swap devices is significant, it is possible that some THPs are split earlier than necessary. For example, this could be caused by big size difference among multiple swap devices. The swap cache functions is enhanced to support add/delete THP to/from the swap cache as a set of (HPAGE_PMD_NR) sub-pages. This may be enhanced in the future with multi-order radix tree. But because we will split the THP soon during swapping out, that optimization doesn't make much sense for this first step. The THP splitting functions are enhanced to support to split THP in swap cache during swapping out. The page lock will be held during allocating the swap cluster, adding the THP into the swap cache and splitting the THP. So in the code path other than swapping out, if the THP need to be split, the PageSwapCache(THP) will be always false. The swap cluster is only available for SSD, so the THP swap optimization in this patchset has no effect for HDD. [ying.huang@intel.com: fix two issues in THP optimize patch] Link: http://lkml.kernel.org/r/87k25ed8zo.fsf@yhuang-dev.intel.com [hannes@cmpxchg.org: extensive cleanups and simplifications, reduce code size] Link: http://lkml.kernel.org/r/20170515112522.32457-2-ying.huang@intel.comSigned-off-by: N"Huang, Ying" <ying.huang@intel.com> Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org> Suggested-by: Andrew Morton <akpm@linux-foundation.org> [for config option] Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> [for changes in huge_memory.c and huge_mm.h] Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Ebru Akagunduz <ebru.akagunduz@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Tejun Heo <tj@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Shaohua Li <shli@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 05 7月, 2017 4 次提交
-
-
由 Chen Yu 提交于
Add the real e820_tabel_firmware[] that will not be modified by the kernel or the EFI boot stub under any circumstance. In addition to that modify the code so that e820_table_firmwarep[] is exposed via sysfs to represent the real firmware memory layout, rather than exposing the e820_table_kexec[] table. This fixes a hibernation bug/warning, which uses e820_table_kexec[] to check RAM layout consistency across hibernation/resume: The suspend kernel: [ 0.000000] e820: update [mem 0x76671018-0x76679457] usable ==> usable The resume kernel: [ 0.000000] e820: update [mem 0x7666f018-0x76677457] usable ==> usable ... [ 15.752088] PM: Using 3 thread(s) for decompression. [ 15.752088] PM: Loading and decompressing image data (471870 pages)... [ 15.764971] Hibernate inconsistent memory map detected! [ 15.770833] PM: Image mismatch: architecture specific data Actually it is safe to restore these pages because E820_TYPE_RAM and E820_TYPE_RESERVED_KERN are treated the same during hibernation, so the original e820 table provided by the bootloader is used for hibernation MD5 fingerprint checking. The side effect is that, this newly introduced variable might increase the kernel size at compile time. Suggested-by: NIngo Molnar <mingo@redhat.com> Signed-off-by: NChen Yu <yu.c.chen@intel.com> Cc: Dave Young <dyoung@redhat.com> Cc: Len Brown <lenb@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rafael J. Wysocki <rjw@rjwysocki.net> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Xunlei Pang <xlpang@redhat.com> Cc: linux-kernel@vger.kernel.org Signed-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Chen Yu 提交于
Currently the e820_table_firmware[] table is mainly used by the kexec, and it is not what it's supposed to be - despite its name it might be modified by the kernel. So change its name to e820_table_kexec[]. In the next patch we will introduce the real e820_table_firmware[] table. No functional change. Signed-off-by: NChen Yu <yu.c.chen@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Xunlei Pang <xlpang@redhat.com> Cc: linux-kernel@vger.kernel.org Signed-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Chen Yu 提交于
The following commit in 2013: 77ea8c94 ("x86: Reserve setup_data ranges late after parsing memmap cmdline") has fixed the issue of losing setup_data information by deferring the e820_reserve_setup_data() call until the early params have been parsed. But this also introduced a new problem that, during early params parsing, the kexec kernel might fake a mptable and saves it into the e820_table_firmware[] table (without saving the mptable to the e820_table[]), however the subsequent invoking of e820_reserve_setup_data() will overwrite the e820_table_firmware[] according to the e820_table[], thus the fake mptable information is lost. Fix this issue by updating the e820_table_firmware[] according to the setup_data information, but without overwriting it. Signed-off-by: NChen Yu <yu.c.chen@intel.com> Cc: Dave Young <dyoung@redhat.com> Cc: Len Brown <lenb@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rafael J. Wysocki <rjw@rjwysocki.net> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Xunlei Pang <xlpang@redhat.com> Cc: linux-kernel@vger.kernel.org Signed-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Mikulas Patocka 提交于
The pat_enabled() logic is broken on CPUs which do not support PAT and where the initialization code fails to call pat_init(). Due to that the enabled flag stays true and pat_enabled() returns true wrongfully. As a consequence the mappings, e.g. for Xorg, are set up with the wrong caching mode and the required MTRR setups are omitted. To cure this the following changes are required: 1) Make pat_enabled() return true only if PAT initialization was invoked and successful. 2) Invoke init_cache_modes() unconditionally in setup_arch() and remove the extra callsites in pat_disable() and the pat disabled code path in pat_init(). Also rename __pat_enabled to pat_disabled to reflect the real purpose of this variable. Fixes: 9cd25aac ("x86/mm/pat: Emulate PAT when it is disabled") Signed-off-by: NMikulas Patocka <mpatocka@redhat.com> Signed-off-by: NThomas Gleixner <tglx@linutronix.de> Cc: Bernhard Held <berny156@gmx.de> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Brian Gerst <brgerst@gmail.com> Cc: "Luis R. Rodriguez" <mcgrof@suse.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Andy Lutomirski <luto@kernel.org> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: stable@vger.kernel.org Link: http://lkml.kernel.org/r/alpine.LRH.2.02.1707041749300.3456@file01.intranet.prod.int.rdu2.redhat.com
-
- 04 7月, 2017 2 次提交
-
-
由 Colin Ian King 提交于
The functions handle_uv2_busy, uv_flush_send_and_wait and find_another_by_swack are local to the source, so make them static. Also remove normal_busy as it is no longer used. Fixes various smatch warnings, such as: "symbol 'find_another_by_swack' was not declared. Should it be static?" "symbol 'handle_uv2_busy' was not declared. Should it be static?" Signed-off-by: NColin Ian King <colin.king@canonical.com> Signed-off-by: NThomas Gleixner <tglx@linutronix.de> Cc: Mike Travis <mike.travis@hpe.com> Cc: Andrew Banman <abanman@hpe.com> Cc: kernel-janitors@vger.kernel.org Cc: Dimitri Sivanich <sivanich@sgi.com> Cc: Dou Liyang <douly.fnst@cn.fujitsu.com> Link: http://lkml.kernel.org/r/20170704083129.10559-1-colin.king@canonical.com
-
由 Haozhong Zhang 提交于
It's easier for host applications, such as QEMU, if they can always access guest MSR_IA32_BNDCFGS in VMCS, even though MPX is disabled in guest cpuid. Cc: stable@vger.kernel.org Signed-off-by: NHaozhong Zhang <haozhong.zhang@intel.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
- 03 7月, 2017 7 次提交
-
-
由 Peter Feiner 提交于
EPT A/D was enabled in the vmcs02 EPTP regardless of the vmcs12's EPTP value. The problem is that enabling A/D changes the behavior of L2's x86 page table walks as seen by L1. With A/D enabled, x86 page table walks are always treated as EPT writes. Commit ae1e2d10 ("kvm: nVMX: support EPT accessed/dirty bits", 2017-03-30) tried to work around this problem by clearing the write bit in the exit qualification for EPT violations triggered by page walks. However, that fixup introduced the opposite bug: page-table walks that actually set x86 A/D bits were *missing* the write bit in the exit qualification. This patch fixes the problem by disabling EPT A/D in the shadow MMU when EPT A/D is disabled in vmcs12's EPTP. Signed-off-by: NPeter Feiner <pfeiner@google.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
Userspace application can do a hypercall through /dev/xen/privcmd, and some for some hypercalls argument is a pointers to user-provided structure. When SMAP is supported and enabled, hypervisor can't access. So, lets allow it. The same applies to HYPERVISOR_dm_op, where additionally privcmd driver carefully verify buffer addresses. Cc: stable@vger.kernel.org Signed-off-by: NMarek Marczykowski-Górecki <marmarek@invisiblethingslab.com> Reviewed-by: NJuergen Gross <jgross@suse.com> Signed-off-by: NJuergen Gross <jgross@suse.com>
-
由 Gustavo A. R. Silva 提交于
Remove unnecessary variable mfn in function xen_foreach_remap_area() and, refactor the code. Variable mfn at line 518:mfn = xen_remap_buf.mfns[i]; is only being used to store a value to be passed as an argument to the xen_update_mem_tables() function. This value can be passed directly, which makes variable mfn unnecessary. Also, value assigned to variable mfn at line 534:mfn = xen_remap_mfn; is never used. Addresses-Coverity-ID: 1260110 Signed-off-by: NGustavo A. R. Silva <garsilva@embeddedor.com> Reviewed-by: NJuergen Gross <jgross@suse.com> Signed-off-by: NJuergen Gross <jgross@suse.com>
-
由 Peter Feiner 提交于
Adds the plumbing to disable A/D bits in the MMU based on a new role bit, ad_disabled. When A/D is disabled, the MMU operates as though A/D aren't available (i.e., using access tracking faults instead). To avoid SP -> kvm_mmu_page.role.ad_disabled lookups all over the place, A/D disablement is now stored in the SPTE. This state is stored in the SPTE by tweaking the use of SPTE_SPECIAL_MASK for access tracking. Rather than just setting SPTE_SPECIAL_MASK when an access-tracking SPTE is non-present, we now always set SPTE_SPECIAL_MASK for access-tracking SPTEs. Signed-off-by: NPeter Feiner <pfeiner@google.com> [Use role.ad_disabled even for direct (non-shadow) EPT page tables. Add documentation and a few MMU_WARN_ONs. - Paolo] Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Peter Feiner 提交于
Specify both a mask (i.e., bits to consider) and a value (i.e., pattern of bits that indicates a special PTE) for mmio SPTEs. On Intel, this lets us pack even more information into the (SPTE_SPECIAL_MASK | EPT_VMX_RWX_MASK) mask we use for access tracking liberating all (SPTE_SPECIAL_MASK | (non-misconfigured-RWX)) values. Signed-off-by: NPeter Feiner <pfeiner@google.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Peter Feiner 提交于
The MMU always has hardware A bits or access tracking support, thus it's unnecessary to handle the scenario where we have neither. Signed-off-by: NPeter Feiner <pfeiner@google.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Jork Loeser 提交于
Update the Hyper-V vPCI driver to use the Server-2016 version of the vPCI protocol, fixing MSI creation and retargeting issues. Signed-off-by: NJork Loeser <jloeser@microsoft.com> Signed-off-by: NBjorn Helgaas <bhelgaas@google.com> Reviewed-by: NK. Y. Srinivasan <kys@microsoft.com> Acked-by: NK. Y. Srinivasan <kys@microsoft.com>
-
- 02 7月, 2017 1 次提交
-
-
由 Oliver O'Halloran 提交于
Currently ZONE_DEVICE depends on X86_64 and this will get unwieldly as new architectures (and platforms) get ZONE_DEVICE support. Move to an arch selected Kconfig option to save us the trouble. Cc: linux-mm@kvack.org Acked-by: NIngo Molnar <mingo@kernel.org> Acked-by: NBalbir Singh <bsingharora@gmail.com> Signed-off-by: NOliver O'Halloran <oohall@gmail.com> Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
-
- 01 7月, 2017 2 次提交
-
-
由 Vikas Shivappa 提交于
If mount fails, the kn_info directory is not freed causing memory leak. Add the missing error handling path. Fixes: 4e978d06 ("x86/intel_rdt: Add "info" files to resctrl file system") Signed-off-by: NVikas Shivappa <vikas.shivappa@linux.intel.com> Signed-off-by: NThomas Gleixner <tglx@linutronix.de> Cc: ravi.v.shankar@intel.com Cc: tony.luck@intel.com Cc: fenghua.yu@intel.com Cc: peterz@infradead.org Cc: vikas.shivappa@intel.com Cc: andi.kleen@intel.com Cc: stable@vger.kernel.org Link: http://lkml.kernel.org/r/1498503368-20173-3-git-send-email-vikas.shivappa@linux.intel.com
-
由 Kai-Heng Feng 提交于
On an AMD Carrizo laptop, when EHCI runtime PM is enabled, EHCI ports do not assert PME# for device plug/unplug events while in D3. As Alan Stern points out [1], the PME signal is not enabled when controller is in D3, therefore it's not being woken up when new devices get plugged in. Testing shows PME signal works when the EHCI power state is D2. Clear the PCI_PM_CAP_PME_D3 and PCI_PM_CAP_PME_D3cold bits in dev->pme_support to indicate the device will not assert PME# from those states. [1] http://lkml.kernel.org/r/Pine.LNX.4.44L0.1706121010010.2092-100000@iolanthe.rowland.org Link: https://bugzilla.kernel.org/show_bug.cgi?id=196091 Link: https://support.amd.com/TechDocs/46837.pdf (Section 23) Link: https://support.amd.com/TechDocs/42413.pdf (Appendix A2) Signed-off-by: NKai-Heng Feng <kai.heng.feng@canonical.com> [bhelgaas: changelog, add parens in quirk] Signed-off-by: NBjorn Helgaas <bhelgaas@google.com>
-
- 30 6月, 2017 4 次提交
-
-
由 Nick Desaulniers 提交于
The macro insn_fetch marks the 'type' argument as having a specified alignment. Type attributes can only be applied to structs, unions, or enums, but insn_fetch is only ever invoked with integral types, so Clang produces 19 -Wignored-attributes warnings for this source file. Signed-off-by: NNick Desaulniers <nick.desaulniers@gmail.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Josh Poimboeuf 提交于
In preparation for an objtool rewrite which will have broader checks, whitelist functions and files which cause problems because they do unusual things with the stack. These whitelists serve as a TODO list for which functions and files don't yet have undwarf unwinder coverage. Eventually most of the whitelists can be removed in favor of manual CFI hint annotations or objtool improvements. Signed-off-by: NJosh Poimboeuf <jpoimboe@redhat.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Jiri Slaby <jslaby@suse.cz> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: live-patching@vger.kernel.org Link: http://lkml.kernel.org/r/7f934a5d707a574bda33ea282e9478e627fb1829.1498659915.git.jpoimboe@redhat.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Andy Lutomirski 提交于
The comment describes the old explicit IPI-based flush logic, which is long gone. Signed-off-by: NAndy Lutomirski <luto@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Nadav Amit <nadav.amit@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-mm@kvack.org Link: http://lkml.kernel.org/r/55e44997e56086528140c5180f8337dc53fb7ffc.1498751203.git.luto@kernel.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Andy Lutomirski 提交于
It was historically possible to have two concurrent TLB flushes targetting the same CPU: one initiated locally and one initiated remotely. This can now cause an OOPS in leave_mm() at arch/x86/mm/tlb.c:47: if (this_cpu_read(cpu_tlbstate.state) == TLBSTATE_OK) BUG(); with this call trace: flush_tlb_func_local arch/x86/mm/tlb.c:239 [inline] flush_tlb_mm_range+0x26d/0x370 arch/x86/mm/tlb.c:317 Without reentrancy, this OOPS is impossible: leave_mm() is only called if we're not in TLBSTATE_OK, but then we're unexpectedly in TLBSTATE_OK in leave_mm(). This can be caused by flush_tlb_func_remote() happening between the two checks and calling leave_mm(), resulting in two consecutive leave_mm() calls on the same CPU with no intervening switch_mm() calls. We never saw this OOPS before because the old leave_mm() implementation didn't put us back in TLBSTATE_OK, so the assertion didn't fire. Nadav noticed the reentrancy issue in a different context, but neither of us realized that it caused a problem yet. Reported-by: NLevin, Alexander (Sasha Levin) <alexander.levin@verizon.com> Signed-off-by: NAndy Lutomirski <luto@kernel.org> Reviewed-by: NNadav Amit <nadav.amit@gmail.com> Reviewed-by: NThomas Gleixner <tglx@linutronix.de> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: linux-mm@kvack.org Fixes: 3d28ebce ("x86/mm: Rework lazy TLB to track the actual loaded mm") Link: http://lkml.kernel.org/r/855acf733268d521c9f2e191faee2dcc23a29729.1498751203.git.luto@kernel.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
-