- 17 12月, 2018 10 次提交
-
-
由 Suraj Jitindar Singh 提交于
Allow for a device which is being emulated at L0 (the host) for an L1 guest to be passed through to a nested (L2) guest. The existing kvmppc_hv_emulate_mmio function can be used here. The main challenge is that for a load the result must be stored into the L2 gpr, not an L1 gpr as would normally be the case after going out to qemu to complete the operation. This presents a challenge as at this point the L2 gpr state has been written back into L1 memory. To work around this we store the address in L1 memory of the L2 gpr where the result of the load is to be stored and use the new io_gpr value KVM_MMIO_REG_NESTED_GPR to indicate that this is a nested load for which completion must be done when returning back into the kernel. Then in kvmppc_complete_mmio_load() the resultant value is written into L1 memory at the location of the indicated L2 gpr. Note that we don't currently let an L1 guest emulate a device for an L2 guest which is then passed through to an L3 guest. Signed-off-by: NSuraj Jitindar Singh <sjitindarsingh@gmail.com> Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
-
由 Suraj Jitindar Singh 提交于
The functions kvmppc_st and kvmppc_ld are used to access guest memory from the host using a guest effective address. They do so by translating through the process table to obtain a guest real address and then using kvm_read_guest or kvm_write_guest to make the access with the guest real address. This method of access however only works for L1 guests and will give the incorrect results for a nested guest. We can however use the store_to_eaddr and load_from_eaddr kvmppc_ops to perform the access for a nested guesti (and a L1 guest). So attempt this method first and fall back to the old method if this fails and we aren't running a nested guest. At this stage there is no fall back method to perform the access for a nested guest and this is left as a future improvement. For now we will return to the nested guest and rely on the fact that a translation should be faulted in before retrying the access. Signed-off-by: NSuraj Jitindar Singh <sjitindarsingh@gmail.com> Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
-
由 Suraj Jitindar Singh 提交于
The kvmppc_ops struct is used to store function pointers to kvm implementation specific functions. Introduce two new functions load_from_eaddr and store_to_eaddr to be used to load from and store to a guest effective address respectively. Also implement these for the kvm-hv module. If we are using the radix mmu then we can call the functions to access quadrant 1 and 2. Signed-off-by: NSuraj Jitindar Singh <sjitindarsingh@gmail.com> Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
-
由 Suraj Jitindar Singh 提交于
The POWER9 radix mmu has the concept of quadrants. The quadrant number is the two high bits of the effective address and determines the fully qualified address to be used for the translation. The fully qualified address consists of the effective lpid, the effective pid and the effective address. This gives then 4 possible quadrants 0, 1, 2, and 3. When accessing these quadrants the fully qualified address is obtained as follows: Quadrant | Hypervisor | Guest -------------------------------------------------------------------------- | EA[0:1] = 0b00 | EA[0:1] = 0b00 0 | effLPID = 0 | effLPID = LPIDR | effPID = PIDR | effPID = PIDR -------------------------------------------------------------------------- | EA[0:1] = 0b01 | 1 | effLPID = LPIDR | Invalid Access | effPID = PIDR | -------------------------------------------------------------------------- | EA[0:1] = 0b10 | 2 | effLPID = LPIDR | Invalid Access | effPID = 0 | -------------------------------------------------------------------------- | EA[0:1] = 0b11 | EA[0:1] = 0b11 3 | effLPID = 0 | effLPID = LPIDR | effPID = 0 | effPID = 0 -------------------------------------------------------------------------- In the Guest; Quadrant 3 is normally used to address the operating system since this uses effPID=0 and effLPID=LPIDR, meaning the PID register doesn't need to be switched. Quadrant 0 is normally used to address user space since the effLPID and effPID are taken from the corresponding registers. In the Host; Quadrant 0 and 3 are used as above, however the effLPID is always 0 to address the host. Quadrants 1 and 2 can be used by the host to address guest memory using a guest effective address. Since the effLPID comes from the LPID register, the host loads the LPID of the guest it would like to access (and the PID of the process) and can perform accesses to a guest effective address. This means quadrant 1 can be used to address the guest user space and quadrant 2 can be used to address the guest operating system from the hypervisor, using a guest effective address. Access to the quadrants can cause a Hypervisor Data Storage Interrupt (HDSI) due to being unable to perform partition scoped translation. Previously this could only be generated from a guest and so the code path expects us to take the KVM trampoline in the interrupt handler. This is no longer the case so we modify the handler to call bad_page_fault() to check if we were expecting this fault so we can handle it gracefully and just return with an error code. In the hash mmu case we still raise an unknown exception since quadrants aren't defined for the hash mmu. Signed-off-by: NSuraj Jitindar Singh <sjitindarsingh@gmail.com> Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
-
由 Suraj Jitindar Singh 提交于
There exists a function kvm_is_radix() which is used to determine if a kvm instance is using the radix mmu. However this only applies to the first level (L1) guest. Add a function kvmhv_vcpu_is_radix() which can be used to determine if the current execution context of the vcpu is radix, accounting for if the vcpu is running a nested guest. Currently all nested guests must be radix but this may change in the future. Signed-off-by: NSuraj Jitindar Singh <sjitindarsingh@gmail.com> Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
-
由 Suraj Jitindar Singh 提交于
The kvm capability KVM_CAP_SPAPR_TCE_VFIO is used to indicate the availability of in kernel tce acceleration for vfio. However it is currently the case that this is only available on a powernv machine, not for a pseries machine. Thus make this capability dependent on having the cpu feature CPU_FTR_HVMODE. [paulus@ozlabs.org - fixed compilation for Book E.] Signed-off-by: NSuraj Jitindar Singh <sjitindarsingh@gmail.com> Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
-
由 Paul Mackerras 提交于
This adds code to flush the partition-scoped page tables for a radix guest when dirty tracking is turned on or off for a memslot. Only the guest real addresses covered by the memslot are flushed. The reason for this is to get rid of any 2M PTEs in the partition-scoped page tables that correspond to host transparent huge pages, so that page dirtiness is tracked at a system page (4k or 64k) granularity rather than a 2M granularity. The page tables are also flushed when turning dirty tracking off so that the memslot's address space can be repopulated with THPs if possible. To do this, we add a new function kvmppc_radix_flush_memslot(). Since this does what's needed for kvmppc_core_flush_memslot_hv() on a radix guest, we now make kvmppc_core_flush_memslot_hv() call the new kvmppc_radix_flush_memslot() rather than calling kvm_unmap_radix() for each page in the memslot. This has the effect of fixing a bug in that kvmppc_core_flush_memslot_hv() was previously calling kvm_unmap_radix() without holding the kvm->mmu_lock spinlock, which is required to be held. Signed-off-by: NPaul Mackerras <paulus@ozlabs.org> Reviewed-by: NSuraj Jitindar Singh <sjitindarsingh@gmail.com> Reviewed-by: NDavid Gibson <david@gibson.dropbear.id.au> Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
-
由 Paul Mackerras 提交于
This adds 'const' to the declarations for the struct kvm_memory_slot pointer parameters of some functions, which will make it possible to call those functions from kvmppc_core_commit_memory_region_hv() in the next patch. This also fixes some comments about locking. Signed-off-by: NPaul Mackerras <paulus@ozlabs.org> Reviewed-by: NSuraj Jitindar Singh <sjitindarsingh@gmail.com> Reviewed-by: NDavid Gibson <david@gibson.dropbear.id.au> Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
-
由 Paul Mackerras 提交于
For radix guests, this makes KVM map guest memory as individual pages when dirty page logging is enabled for the memslot corresponding to the guest real address. Having a separate partition-scoped PTE for each system page mapped to the guest means that we have a separate dirty bit for each page, thus making the reported dirty bitmap more accurate. Without this, if part of guest memory is backed by transparent huge pages, the dirty status is reported at a 2MB granularity rather than a 64kB (or 4kB) granularity for that part, causing userspace to have to transmit more data when migrating the guest. Signed-off-by: NPaul Mackerras <paulus@ozlabs.org> Reviewed-by: NSuraj Jitindar Singh <sjitindarsingh@gmail.com> Reviewed-by: NDavid Gibson <david@gibson.dropbear.id.au> Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
-
由 Bharata B Rao 提交于
Currently, kvm_arch_commit_memory_region() gets called with a parameter indicating what type of change is being made to the memslot, but it doesn't pass it down to the platform-specific memslot commit functions. This adds the `change' parameter to the lower-level functions so that they can use it in future. [paulus@ozlabs.org - fix book E also.] Signed-off-by: NBharata B Rao <bharata@linux.vnet.ibm.com> Reviewed-by: NSuraj Jitindar Singh <sjitindarsingh@gmail.com> Reviewed-by: NDavid Gibson <david@gibson.dropbear.id.au> Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
-
- 14 12月, 2018 3 次提交
-
-
由 Suraj Jitindar Singh 提交于
When booting a kvm-pr guest on a POWER9 machine the following message is observed: "qemu-system-ppc64: KVM does not support 1TiB segments which guest expects" This is because the guest is expecting to be able to use 1T segments however we don't indicate support for it. This is because we don't set the BOOK3S_HFLAG_MULTI_PGSIZE flag in the hflags in kvmppc_set_pvr_pr() on POWER9. POWER9 does indeed have support for 1T segments, so add a case for POWER9 to the switch statement to ensure it is set. Signed-off-by: NSuraj Jitindar Singh <sjitindarsingh@gmail.com> Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
-
由 Yangtao Li 提交于
Use DEFINE_SHOW_ATTRIBUTE macro to simplify the code. Signed-off-by: NYangtao Li <tiny.windzz@gmail.com> Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
-
由 Paul Mackerras 提交于
Testing has revealed an occasional crash which appears to be caused by a race between kvmppc_switch_mmu_to_hpt and kvm_unmap_hva_range_hv. The symptom is a NULL pointer dereference in __find_linux_pte() called from kvm_unmap_radix() with kvm->arch.pgtable == NULL. Looking at kvmppc_switch_mmu_to_hpt(), it does indeed clear kvm->arch.pgtable (via kvmppc_free_radix()) before setting kvm->arch.radix to NULL, and there is nothing to prevent kvm_unmap_hva_range_hv() or the other MMU callback functions from being called concurrently with kvmppc_switch_mmu_to_hpt() or kvmppc_switch_mmu_to_radix(). This patch therefore adds calls to spin_lock/unlock on the kvm->mmu_lock around the assignments to kvm->arch.radix, and makes sure that the partition-scoped radix tree or HPT is only freed after changing kvm->arch.radix. This also takes the kvm->mmu_lock in kvmppc_rmap_reset() to make sure that the clearing of each rmap array (one per memslot) doesn't happen concurrently with use of the array in the kvm_unmap_hva_range_hv() or the other MMU callbacks. Fixes: 18c3640c ("KVM: PPC: Book3S HV: Add infrastructure for running HPT guests on radix host") Cc: stable@vger.kernel.org # v4.15+ Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
-
- 27 11月, 2018 14 次提交
-
-
由 Jim Mattson 提交于
Previously, we only called indirect_branch_prediction_barrier on the logical CPU that freed a vmcb. This function should be called on all logical CPUs that last loaded the vmcb in question. Fixes: 15d45071 ("KVM/x86: Add IBPB support") Reported-by: NNeel Natu <neelnatu@google.com> Signed-off-by: NJim Mattson <jmattson@google.com> Reviewed-by: NKonrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: stable@vger.kernel.org Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Junaid Shahid 提交于
When a guest page table is updated via an emulated write, kvm_mmu_pte_write() is called to update the shadow PTE using the just written guest PTE value. But if two emulated guest PTE writes happened concurrently, it is possible that the guest PTE and the shadow PTE end up being out of sync. Emulated writes do not mark the shadow page as unsync-ed, so this inconsistency will not be resolved even by a guest TLB flush (unless the page was marked as unsync-ed at some other point). This is fixed by re-reading the current value of the guest PTE after the MMU lock has been acquired instead of just using the value that was written prior to calling kvm_mmu_pte_write(). Signed-off-by: NJunaid Shahid <junaids@google.com> Reviewed-by: NWanpeng Li <wanpengli@tencent.com> Cc: stable@vger.kernel.org Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Liran Alon 提交于
vmcs12 represents the per-CPU cache of L1 active vmcs12. This cache can be loaded by one of the following: 1) Guest making a vmcs12 active by exeucting VMPTRLD 2) Guest specifying eVMCS in VP assist page and executing VMLAUNCH/VMRESUME. Either way, vmcs12 should have revision_id of VMCS12_REVISION. Which is not equal to eVMCS revision_id which specifies used VersionNumber of eVMCS struct (e.g. KVM_EVMCS_VERSION). Specifically, this causes an issue in restoring a nested VM state because vmx_set_nested_state() verifies that vmcs12->revision_id is equal to VMCS12_REVISION which was not true in case vmcs12 was populated from an eVMCS by vmx_get_nested_state() which calls copy_enlightened_to_vmcs12(). Reviewed-by: NDarren Kenny <darren.kenny@oracle.com> Signed-off-by: NLiran Alon <liran.alon@oracle.com> Reviewed-by: NVitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Liran Alon 提交于
According to TLFS section 16.11.2 Enlightened VMCS, the first u32 field of eVMCS should specify eVMCS VersionNumber. This version should be in the range of supported eVMCS versions exposed to guest via CPUID.0x4000000A.EAX[0:15]. The range which KVM expose to guest in this CPUID field should be the same as the value returned in vmcs_version by nested_enable_evmcs(). According to the above, eVMCS VMPTRLD should verify that version specified in given eVMCS is in the supported range. However, current code mistakenly verfies this field against VMCS12_REVISION. One can also see that when KVM use eVMCS, it makes sure that alloc_vmcs_cpu() sets allocated eVMCS revision_id to KVM_EVMCS_VERSION. Obvious fix should just change eVMCS VMPTRLD to verify first u32 field of eVMCS is equal to KVM_EVMCS_VERSION. However, it turns out that Microsoft Hyper-V fails to comply to their own invented interface: When Hyper-V use eVMCS, it just sets first u32 field of eVMCS to revision_id specified in MSR_IA32_VMX_BASIC (In our case: VMCS12_REVISION). Instead of used eVMCS version number which is one of the supported versions specified in CPUID.0x4000000A.EAX[0:15]. To overcome Hyper-V bug, we accept either a supported eVMCS version or VMCS12_REVISION as valid values for first u32 field of eVMCS. Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Reviewed-by: NNikita Leshenko <nikita.leshchenko@oracle.com> Reviewed-by: NMark Kanda <mark.kanda@oracle.com> Signed-off-by: NLiran Alon <liran.alon@oracle.com> Reviewed-by: NVitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Leonid Shatz 提交于
Since commit e79f245d ("X86/KVM: Properly update 'tsc_offset' to represent the running guest"), vcpu->arch.tsc_offset meaning was changed to always reflect the tsc_offset value set on active VMCS. Regardless if vCPU is currently running L1 or L2. However, above mentioned commit failed to also change kvm_vcpu_write_tsc_offset() to set vcpu->arch.tsc_offset correctly. This is because vmx_write_tsc_offset() could set the tsc_offset value in active VMCS to given offset parameter *plus vmcs12->tsc_offset*. However, kvm_vcpu_write_tsc_offset() just sets vcpu->arch.tsc_offset to given offset parameter. Without taking into account the possible addition of vmcs12->tsc_offset. (Same is true for SVM case). Fix this issue by changing kvm_x86_ops->write_tsc_offset() to return actually set tsc_offset in active VMCS and modify kvm_vcpu_write_tsc_offset() to set returned value in vcpu->arch.tsc_offset. In addition, rename write_tsc_offset() callback to write_l1_tsc_offset() to make it clear that it is meant to set L1 TSC offset. Fixes: e79f245d ("X86/KVM: Properly update 'tsc_offset' to represent the running guest") Reviewed-by: NLiran Alon <liran.alon@oracle.com> Reviewed-by: NMihai Carabas <mihai.carabas@oracle.com> Reviewed-by: NKrish Sadhukhan <krish.sadhukhan@oracle.com> Signed-off-by: NLeonid Shatz <leonid.shatz@oracle.com> Cc: stable@vger.kernel.org Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Yi Wang 提交于
The inline keyword which is not at the beginning of the function declaration may trigger the following build warnings, so let's fix it: arch/x86/kvm/vmx.c:1309:1: warning: ‘inline’ is not at beginning of declaration [-Wold-style-declaration] arch/x86/kvm/vmx.c:5947:1: warning: ‘inline’ is not at beginning of declaration [-Wold-style-declaration] arch/x86/kvm/vmx.c:5985:1: warning: ‘inline’ is not at beginning of declaration [-Wold-style-declaration] arch/x86/kvm/vmx.c:6023:1: warning: ‘inline’ is not at beginning of declaration [-Wold-style-declaration] Signed-off-by: NYi Wang <wang.yi59@zte.com.cn> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Yi Wang 提交于
We get the following warnings about empty statements when building with 'W=1': arch/x86/kvm/lapic.c:632:53: warning: suggest braces around empty body in an ‘if’ statement [-Wempty-body] arch/x86/kvm/lapic.c:1907:42: warning: suggest braces around empty body in an ‘if’ statement [-Wempty-body] arch/x86/kvm/lapic.c:1936:65: warning: suggest braces around empty body in an ‘if’ statement [-Wempty-body] arch/x86/kvm/lapic.c:1975:44: warning: suggest braces around empty body in an ‘if’ statement [-Wempty-body] Rework the debug helper macro to get rid of these warnings. Signed-off-by: NYi Wang <wang.yi59@zte.com.cn> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Liran Alon 提交于
When guest transitions from/to long-mode by modifying MSR_EFER.LMA, the list of shared MSRs to be saved/restored on guest<->host transitions is updated (See vmx_set_efer() call to setup_msrs()). On every entry to guest, vcpu_enter_guest() calls vmx_prepare_switch_to_guest(). This function should also take care of setting the shared MSRs to be saved/restored. However, the function does nothing in case we are already running with loaded guest state (vmx->loaded_cpu_state != NULL). This means that even when guest modifies MSR_EFER.LMA which results in updating the list of shared MSRs, it isn't being taken into account by vmx_prepare_switch_to_guest() because it happens while we are running with loaded guest state. To fix above mentioned issue, add a flag to mark that the list of shared MSRs has been updated and modify vmx_prepare_switch_to_guest() to set shared MSRs when running with host state *OR* list of shared MSRs has been updated. Note that this issue was mistakenly introduced by commit 678e315e ("KVM: vmx: add dedicated utility to access guest's kernel_gs_base") because previously vmx_set_efer() always called vmx_load_host_state() which resulted in vmx_prepare_switch_to_guest() to set shared MSRs. Fixes: 678e315e ("KVM: vmx: add dedicated utility to access guest's kernel_gs_base") Reported-by: NEyal Moscovici <eyal.moscovici@oracle.com> Reviewed-by: NMihai Carabas <mihai.carabas@oracle.com> Reviewed-by: NLiam Merwick <liam.merwick@oracle.com> Reviewed-by: NJim Mattson <jmattson@google.com> Signed-off-by: NLiran Alon <liran.alon@oracle.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Liran Alon 提交于
kvm_pv_clock_pairing() allocates local var "struct kvm_clock_pairing clock_pairing" on stack and initializes all it's fields besides padding (clock_pairing.pad[]). Because clock_pairing var is written completely (including padding) to guest memory, failure to init struct padding results in kernel info-leak. Fix the issue by making sure to also init the padding with zeroes. Fixes: 55dd00a7 ("KVM: x86: add KVM_HC_CLOCK_PAIRING hypercall") Reported-by: syzbot+a8ef68d71211ba264f56@syzkaller.appspotmail.com Reviewed-by: NMark Kanda <mark.kanda@oracle.com> Signed-off-by: NLiran Alon <liran.alon@oracle.com> Cc: stable@vger.kernel.org Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Liran Alon 提交于
Consider the case that userspace enables KVM_CAP_HYPERV_ENLIGHTENED_VMCS twice: 1) kvm_vcpu_ioctl_enable_cap() is called to enable KVM_CAP_HYPERV_ENLIGHTENED_VMCS which calls nested_enable_evmcs(). 2) nested_enable_evmcs() sets enlightened_vmcs_enabled to true and fills vmcs_version which is then copied to userspace. 3) kvm_vcpu_ioctl_enable_cap() is called again to enable KVM_CAP_HYPERV_ENLIGHTENED_VMCS which calls nested_enable_evmcs(). 4) This time nested_enable_evmcs() just returns 0 as enlightened_vmcs_enabled is already true. *Without filling vmcs_version*. 5) kvm_vcpu_ioctl_enable_cap() continues as usual and copies *uninitialized* vmcs_version to userspace which leads to kernel info-leak. Fix this issue by simply changing nested_enable_evmcs() to always fill vmcs_version output argument. Even when enlightened_vmcs_enabled is already set to true. Note that SVM's nested_enable_evmcs() should not be modified because it always returns a non-zero value (-ENODEV) which results in kvm_vcpu_ioctl_enable_cap() skipping the copy of vmcs_version to userspace (as it should). Fixes: 57b119da ("KVM: nVMX: add KVM_CAP_HYPERV_ENLIGHTENED_VMCS capability") Reported-by: syzbot+cfbc368e283d381f8cef@syzkaller.appspotmail.com Reviewed-by: NKrish Sadhukhan <krish.sadhukhan@oracle.com> Reviewed-by: NVitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: NLiran Alon <liran.alon@oracle.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Wei Wang 提交于
There is a race condition when accessing kvm->arch.apic_access_page_done. Due to it, x86_set_memory_region will fail when creating the second vcpu for a svm guest. Add a mutex_lock to serialize the accesses to apic_access_page_done. This lock is also used by vmx for the same purpose. Signed-off-by: NWei Wang <wawei@amazon.de> Signed-off-by: NAmadeusz Juskowiak <ajusk@amazon.de> Signed-off-by: NJulian Stecklina <jsteckli@amazon.de> Signed-off-by: NSuravee Suthikulpanit <suravee.suthikulpanit@amd.com> Reviewed-by: NJoerg Roedel <jroedel@suse.de> Cc: stable@vger.kernel.org Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Wanpeng Li 提交于
Reported by syzkaller: BUG: unable to handle kernel NULL pointer dereference at 00000000000001c8 PGD 80000003ec4da067 P4D 80000003ec4da067 PUD 3f7bfa067 PMD 0 Oops: 0000 [#1] PREEMPT SMP PTI CPU: 7 PID: 5059 Comm: debug Tainted: G OE 4.19.0-rc5 #16 RIP: 0010:__lock_acquire+0x1a6/0x1990 Call Trace: lock_acquire+0xdb/0x210 _raw_spin_lock+0x38/0x70 kvm_ioapic_scan_entry+0x3e/0x110 [kvm] vcpu_enter_guest+0x167e/0x1910 [kvm] kvm_arch_vcpu_ioctl_run+0x35c/0x610 [kvm] kvm_vcpu_ioctl+0x3e9/0x6d0 [kvm] do_vfs_ioctl+0xa5/0x690 ksys_ioctl+0x6d/0x80 __x64_sys_ioctl+0x1a/0x20 do_syscall_64+0x83/0x6e0 entry_SYSCALL_64_after_hwframe+0x49/0xbe The reason is that the testcase writes hyperv synic HV_X64_MSR_SINT6 msr and triggers scan ioapic logic to load synic vectors into EOI exit bitmap. However, irqchip is not initialized by this simple testcase, ioapic/apic objects should not be accessed. This can be triggered by the following program: #define _GNU_SOURCE #include <endian.h> #include <stdint.h> #include <stdio.h> #include <stdlib.h> #include <string.h> #include <sys/syscall.h> #include <sys/types.h> #include <unistd.h> uint64_t r[3] = {0xffffffffffffffff, 0xffffffffffffffff, 0xffffffffffffffff}; int main(void) { syscall(__NR_mmap, 0x20000000, 0x1000000, 3, 0x32, -1, 0); long res = 0; memcpy((void*)0x20000040, "/dev/kvm", 9); res = syscall(__NR_openat, 0xffffffffffffff9c, 0x20000040, 0, 0); if (res != -1) r[0] = res; res = syscall(__NR_ioctl, r[0], 0xae01, 0); if (res != -1) r[1] = res; res = syscall(__NR_ioctl, r[1], 0xae41, 0); if (res != -1) r[2] = res; memcpy( (void*)0x20000080, "\x01\x00\x00\x00\x00\x5b\x61\xbb\x96\x00\x00\x40\x00\x00\x00\x00\x01\x00" "\x08\x00\x00\x00\x00\x00\x0b\x77\xd1\x78\x4d\xd8\x3a\xed\xb1\x5c\x2e\x43" "\xaa\x43\x39\xd6\xff\xf5\xf0\xa8\x98\xf2\x3e\x37\x29\x89\xde\x88\xc6\x33" "\xfc\x2a\xdb\xb7\xe1\x4c\xac\x28\x61\x7b\x9c\xa9\xbc\x0d\xa0\x63\xfe\xfe" "\xe8\x75\xde\xdd\x19\x38\xdc\x34\xf5\xec\x05\xfd\xeb\x5d\xed\x2e\xaf\x22" "\xfa\xab\xb7\xe4\x42\x67\xd0\xaf\x06\x1c\x6a\x35\x67\x10\x55\xcb", 106); syscall(__NR_ioctl, r[2], 0x4008ae89, 0x20000080); syscall(__NR_ioctl, r[2], 0xae80, 0); return 0; } This patch fixes it by bailing out scan ioapic if ioapic is not initialized in kernel. Reported-by: NWei Wu <ww9210@gmail.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Cc: Wei Wu <ww9210@gmail.com> Signed-off-by: NWanpeng Li <wanpengli@tencent.com> Cc: stable@vger.kernel.org Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Wanpeng Li 提交于
Reported by syzkaller: BUG: unable to handle kernel NULL pointer dereference at 0000000000000014 PGD 800000040410c067 P4D 800000040410c067 PUD 40410d067 PMD 0 Oops: 0000 [#1] PREEMPT SMP PTI CPU: 3 PID: 2567 Comm: poc Tainted: G OE 4.19.0-rc5 #16 RIP: 0010:kvm_pv_send_ipi+0x94/0x350 [kvm] Call Trace: kvm_emulate_hypercall+0x3cc/0x700 [kvm] handle_vmcall+0xe/0x10 [kvm_intel] vmx_handle_exit+0xc1/0x11b0 [kvm_intel] vcpu_enter_guest+0x9fb/0x1910 [kvm] kvm_arch_vcpu_ioctl_run+0x35c/0x610 [kvm] kvm_vcpu_ioctl+0x3e9/0x6d0 [kvm] do_vfs_ioctl+0xa5/0x690 ksys_ioctl+0x6d/0x80 __x64_sys_ioctl+0x1a/0x20 do_syscall_64+0x83/0x6e0 entry_SYSCALL_64_after_hwframe+0x49/0xbe The reason is that the apic map has not yet been initialized, the testcase triggers pv_send_ipi interface by vmcall which results in kvm->arch.apic_map is dereferenced. This patch fixes it by checking whether or not apic map is NULL and bailing out immediately if that is the case. Fixes: 4180bf1b (KVM: X86: Implement "send IPI" hypercall) Reported-by: NWei Wu <ww9210@gmail.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Cc: Wei Wu <ww9210@gmail.com> Signed-off-by: NWanpeng Li <wanpengli@tencent.com> Cc: stable@vger.kernel.org Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Luiz Capitulino 提交于
Apparently, the ple_gap parameter was accidentally removed by commit c8e88717. Add it back. Signed-off-by: NLuiz Capitulino <lcapitulino@redhat.com> Cc: stable@vger.kernel.org Fixes: c8e88717Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
- 26 11月, 2018 1 次提交
-
-
由 Paolo Bonzini 提交于
Merge tag 'kvm-ppc-fixes-4.20-1' of https://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpc into HEAD PPC KVM fixes for 4.20 This has a single 1-line patch which fixes a bug in the recently-merged nested HV KVM support.
-
- 19 11月, 2018 12 次提交
-
-
由 Linus Torvalds 提交于
-
git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm由 Linus Torvalds 提交于
Pull libnvdimm fixes from Dan Williams: "A small batch of fixes for v4.20-rc3. The overflow continuation fix addresses something that has been broken for several releases. Arguably it could wait even longer, but it's a one line fix and this finishes the last of the known address range scrub bug reports. The revert addresses a lockdep regression. The unit tests are not critical to fix, but no reason to hold this fix back. Summary: - Address Range Scrub overflow continuation handling has been broken since it was initially merged. It was only recently that error injection and platform-BIOS support enabled this corner case to be exercised. - The recent attempt to provide more isolation for the kernel Address Range Scrub state machine from userapace initiated sessions triggers a lockdep report. Revert and try again at the next merge window. - Fix a kasan reported buffer overflow in libnvdimm unit test infrastrucutre (nfit_test)" * tag 'libnvdimm-fixes-4.20-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: Revert "acpi, nfit: Further restrict userspace ARS start requests" acpi, nfit: Fix ARS overflow continuation tools/testing/nvdimm: Fix the array size for dimm devices.
-
由 Linus Torvalds 提交于
Merge misc fixes from Andrew Morton: "16 fixes" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: mm/memblock.c: fix a typo in __next_mem_pfn_range() comments mm, page_alloc: check for max order in hot path scripts/spdxcheck.py: make python3 compliant tmpfs: make lseek(SEEK_DATA/SEK_HOLE) return ENXIO with a negative offset lib/ubsan.c: don't mark __ubsan_handle_builtin_unreachable as noreturn mm/vmstat.c: fix NUMA statistics updates mm/gup.c: fix follow_page_mask() kerneldoc comment ocfs2: free up write context when direct IO failed scripts/faddr2line: fix location of start_kernel in comment mm: don't reclaim inodes with many attached pages mm, memory_hotplug: check zone_movable in has_unmovable_pages mm/swapfile.c: use kvzalloc for swap_info_struct allocation MAINTAINERS: update OMAP MMC entry hugetlbfs: fix kernel BUG at fs/hugetlbfs/inode.c:444! kernel/sched/psi.c: simplify cgroup_move_task() z3fold: fix possible reclaim races
-
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip由 Linus Torvalds 提交于
Pull scheduler fix from Ingo Molnar: "Fix an exec() related scalability/performance regression, which was caused by incorrectly calculating load and migrating tasks on exec() when they shouldn't be" * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/fair: Fix cpu_util_wake() for 'execl' type workloads
-
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip由 Linus Torvalds 提交于
Pull perf fixes from Ingo Molnar: "Fix uncore PMU enumeration for CofeeLake CPUs" * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf/x86/intel/uncore: Support CoffeeLake 8th CBOX perf/x86/intel/uncore: Add more IMC PCI IDs for KabyLake and CoffeeLake CPUs
-
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip由 Linus Torvalds 提交于
Pull EFI fixes from Ingo Molnar: "Misc fixes: two warning splat fixes, a leak fix and persistent memory allocation fixes for ARM" * 'efi-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: efi: Permit calling efi_mem_reserve_persistent() from atomic context efi/arm: Defer persistent reservations until after paging_init() efi/arm/libstub: Pack FDT after populating it efi/arm: Revert deferred unmap of early memmap mapping efi: Fix debugobjects warning on 'efi_rts_work'
-
git://git.armlinux.org.uk/~rmk/linux-arm由 Linus Torvalds 提交于
Pull ARM spectre updates from Russell King: "These are the currently known final bits that resolve the Spectre issues. big.Little systems used to be sufficiently identical in that there were no differences between individual CPUs in the system that mattered to the kernel. With the advent of the Spectre problem, the CPUs now have differences in how the workaround is applied. As a result of previous Spectre patches, these systems ended up reporting quite a lot of: "CPUx: Spectre v2: incorrect context switching function, system vulnerable" messages due to the action of the big.Little switcher causing the CPUs to be re-initialised regularly. This series resolves that issue by making the CPU vtable unique to each CPU. However, since this is used very early, before per-cpu is setup, per-cpu can't be used. We also have a problem that two of the methods are not called from preempt-safe paths, but thankfully these remain identical between all CPUs in the system. To make sure, we validate that these are identical during boot" * 'spectre' of git://git.armlinux.org.uk/~rmk/linux-arm: ARM: spectre-v2: per-CPU vtables to work around big.Little systems ARM: add PROC_VTABLE and PROC_TABLE macros ARM: clean up per-processor check_bugs method call ARM: split out processor lookup ARM: make lookup_processor_type() non-__init
-
由 Chen Chang 提交于
Link: http://lkml.kernel.org/r/20181107100247.13359-1-rainccrun@gmail.comSigned-off-by: NChen Chang <rainccrun@gmail.com> Acked-by: NMichal Hocko <mhocko@suse.com> Acked-by: NMike Rapoport <rppt@linux.ibm.com> Reviewed-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Michal Hocko 提交于
Konstantin has noticed that kvmalloc might trigger the following warning: WARNING: CPU: 0 PID: 6676 at mm/vmstat.c:986 __fragmentation_index+0x54/0x60 [...] Call Trace: fragmentation_index+0x76/0x90 compaction_suitable+0x4f/0xf0 shrink_node+0x295/0x310 node_reclaim+0x205/0x250 get_page_from_freelist+0x649/0xad0 __alloc_pages_nodemask+0x12a/0x2a0 kmalloc_large_node+0x47/0x90 __kmalloc_node+0x22b/0x2e0 kvmalloc_node+0x3e/0x70 xt_alloc_table_info+0x3a/0x80 [x_tables] do_ip6t_set_ctl+0xcd/0x1c0 [ip6_tables] nf_setsockopt+0x44/0x60 SyS_setsockopt+0x6f/0xc0 do_syscall_64+0x67/0x120 entry_SYSCALL_64_after_hwframe+0x3d/0xa2 the problem is that we only check for an out of bound order in the slow path and the node reclaim might happen from the fast path already. This is fixable by making sure that kvmalloc doesn't ever use kmalloc for requests that are larger than KMALLOC_MAX_SIZE but this also shows that the code is rather fragile. A recent UBSAN report just underlines that by the following report UBSAN: Undefined behaviour in mm/page_alloc.c:3117:19 shift exponent 51 is too large for 32-bit type 'int' CPU: 0 PID: 6520 Comm: syz-executor1 Not tainted 4.19.0-rc2 #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0xd2/0x148 lib/dump_stack.c:113 ubsan_epilogue+0x12/0x94 lib/ubsan.c:159 __ubsan_handle_shift_out_of_bounds+0x2b6/0x30b lib/ubsan.c:425 __zone_watermark_ok+0x2c7/0x400 mm/page_alloc.c:3117 zone_watermark_fast mm/page_alloc.c:3216 [inline] get_page_from_freelist+0xc49/0x44c0 mm/page_alloc.c:3300 __alloc_pages_nodemask+0x21e/0x640 mm/page_alloc.c:4370 alloc_pages_current+0xcc/0x210 mm/mempolicy.c:2093 alloc_pages include/linux/gfp.h:509 [inline] __get_free_pages+0x12/0x60 mm/page_alloc.c:4414 dma_mem_alloc+0x36/0x50 arch/x86/include/asm/floppy.h:156 raw_cmd_copyin drivers/block/floppy.c:3159 [inline] raw_cmd_ioctl drivers/block/floppy.c:3206 [inline] fd_locked_ioctl+0xa00/0x2c10 drivers/block/floppy.c:3544 fd_ioctl+0x40/0x60 drivers/block/floppy.c:3571 __blkdev_driver_ioctl block/ioctl.c:303 [inline] blkdev_ioctl+0xb3c/0x1a30 block/ioctl.c:601 block_ioctl+0x105/0x150 fs/block_dev.c:1883 vfs_ioctl fs/ioctl.c:46 [inline] do_vfs_ioctl+0x1c0/0x1150 fs/ioctl.c:687 ksys_ioctl+0x9e/0xb0 fs/ioctl.c:702 __do_sys_ioctl fs/ioctl.c:709 [inline] __se_sys_ioctl fs/ioctl.c:707 [inline] __x64_sys_ioctl+0x7e/0xc0 fs/ioctl.c:707 do_syscall_64+0xc4/0x510 arch/x86/entry/common.c:290 entry_SYSCALL_64_after_hwframe+0x49/0xbe Note that this is not a kvmalloc path. It is just that the fast path really depends on having sanitzed order as well. Therefore move the order check to the fast path. Link: http://lkml.kernel.org/r/20181113094305.GM15120@dhcp22.suse.czSigned-off-by: NMichal Hocko <mhocko@suse.com> Reported-by: NKonstantin Khlebnikov <khlebnikov@yandex-team.ru> Reported-by: NKyungtae Kim <kt0755@gmail.com> Acked-by: NVlastimil Babka <vbabka@suse.cz> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Pavel Tatashin <pavel.tatashin@microsoft.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Aaron Lu <aaron.lu@intel.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Byoungyoung Lee <lifeasageek@gmail.com> Cc: "Dae R. Jeong" <threeearcat@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Uwe Kleine-König 提交于
Without this change the following happens when using Python3 (3.6.6): $ echo "GPL-2.0" | python3 scripts/spdxcheck.py - FAIL: 'str' object has no attribute 'decode' Traceback (most recent call last): File "scripts/spdxcheck.py", line 253, in <module> parser.parse_lines(sys.stdin, args.maxlines, '-') File "scripts/spdxcheck.py", line 171, in parse_lines line = line.decode(locale.getpreferredencoding(False), errors='ignore') AttributeError: 'str' object has no attribute 'decode' So as the line is already a string, there is no need to decode it and the line can be dropped. /usr/bin/python on Arch is Python 3. So this would indeed be worth going into 4.19. Link: http://lkml.kernel.org/r/20181023070802.22558-1-u.kleine-koenig@pengutronix.deSigned-off-by: NUwe Kleine-König <u.kleine-koenig@pengutronix.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Joe Perches <joe@perches.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: <stable@vger.kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Yufen Yu 提交于
Other filesystems such as ext4, f2fs and ubifs all return ENXIO when lseek (SEEK_DATA or SEEK_HOLE) requests a negative offset. man 2 lseek says : EINVAL whence is not valid. Or: the resulting file offset would be : negative, or beyond the end of a seekable device. : : ENXIO whence is SEEK_DATA or SEEK_HOLE, and the file offset is beyond : the end of the file. Make tmpfs return ENXIO under these circumstances as well. After this, tmpfs also passes xfstests's generic/448. [akpm@linux-foundation.org: rewrite changelog] Link: http://lkml.kernel.org/r/1540434176-14349-1-git-send-email-yuyufen@huawei.comSigned-off-by: NYufen Yu <yuyufen@huawei.com> Reviewed-by: NAndrew Morton <akpm@linux-foundation.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Hugh Dickins <hughd@google.com> Cc: William Kucharski <william.kucharski@oracle.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Arnd Bergmann 提交于
gcc-8 complains about the prototype for this function: lib/ubsan.c:432:1: error: ignoring attribute 'noreturn' in declaration of a built-in function '__ubsan_handle_builtin_unreachable' because it conflicts with attribute 'const' [-Werror=attributes] This is actually a GCC's bug. In GCC internals __ubsan_handle_builtin_unreachable() declared with both 'noreturn' and 'const' attributes instead of only 'noreturn': https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84210 Workaround this by removing the noreturn attribute. [aryabinin: add information about GCC bug in changelog] Link: http://lkml.kernel.org/r/20181107144516.4587-1-aryabinin@virtuozzo.comSigned-off-by: NArnd Bergmann <arnd@arndb.de> Signed-off-by: NAndrey Ryabinin <aryabinin@virtuozzo.com> Acked-by: NOlof Johansson <olof@lixom.net> Cc: <stable@vger.kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-