1. 22 September 2021 (9 commits)
    • kvm: fix wrong exception emulation in check_rdtsc · e9337c84
      Authored by Hou Wenlong
      According to Intel's SDM Vol. 2 and AMD's APM Vol. 3, when
      CR4.TSD is set, executing the rdtsc/rdtscp instructions at any
      privilege level other than 0 should trigger a #GP.
      
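      A minimal sketch of the corrected emulator check, assuming it uses
      the emulator's existing get_cr()/cpl() callbacks (illustrative, not
      the literal patch):

        static int check_rdtsc(struct x86_emulate_ctxt *ctxt)
        {
        	u64 cr4 = ctxt->ops->get_cr(ctxt, 4);

        	/* CR4.TSD restricts RDTSC/RDTSCP to CPL 0; fault with
        	 * #GP(0), not #UD, at any lower privilege level. */
        	if ((cr4 & X86_CR4_TSD) && ctxt->ops->cpl(ctxt))
        		return emulate_gp(ctxt, 0);

        	return X86EMUL_CONTINUE;
        }
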
      Fixes: d7eb8203 ("KVM: SVM: Add intercept checks for remaining group7 instructions")
      Signed-off-by: Hou Wenlong <houwenlong93@linux.alibaba.com>
      Message-Id: <1297c0dd3f1bb47a6d089f850b629c7aa0247040.1629257115.git.houwenlong93@linux.alibaba.com>
      Reviewed-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Jim Mattson <jmattson@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: SEV: Pin guest memory for write for RECEIVE_UPDATE_DATA · 50c03801
      Authored by Sean Christopherson
      Require the target guest page to be writable when pinning memory for
      RECEIVE_UPDATE_DATA.  Per the SEV API, the PSP writes to guest memory:
      
        The result is then encrypted with GCTX.VEK and written to the memory
        pointed to by GUEST_PADDR field.
      
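      A sketch of the corresponding pinning call, assuming the SVM code's
      sev_pin_memory() helper whose final argument selects write access
      (illustrative only):

        /* The PSP writes the re-encrypted result to this page, so it
         * must be pinned writable, not read-only. */
        guest_page = sev_pin_memory(kvm, params.guest_uaddr & PAGE_MASK,
        			    PAGE_SIZE, &n, 1);
        if (IS_ERR(guest_page))
        	return PTR_ERR(guest_page);
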
      Fixes: 15fb7de1 ("KVM: SVM: Add KVM_SEV_RECEIVE_UPDATE_DATA command")
      Cc: stable@vger.kernel.org
      Cc: Peter Gonda <pgonda@google.com>
      Cc: Marc Orr <marcorr@google.com>
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Cc: Brijesh Singh <brijesh.singh@amd.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210914210951.2994260-2-seanjc@google.com>
      Reviewed-by: Brijesh Singh <brijesh.singh@amd.com>
      Reviewed-by: Peter Gonda <pgonda@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: SVM: fix missing sev_decommission in sev_receive_start · f1815e0a
      Authored by Mingwei Zhang
      DECOMMISSION the current SEV context if binding an ASID fails after
      RECEIVE_START.  Per AMD's SEV API, RECEIVE_START generates a new guest
      context and thus needs to be paired with DECOMMISSION:
      
           The RECEIVE_START command is the only command other than the LAUNCH_START
           command that generates a new guest context and guest handle.
      
      The missing DECOMMISSION can result in subsequent SEV launch failures,
      as the firmware leaks memory and might not be able to allocate more
      SEV guest contexts in the future.
      
      Note, LAUNCH_START suffered the same bug, but was previously fixed by
      commit 934002cd ("KVM: SVM: Call SEV Guest Decommission if ASID
      binding fails").
      
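      A sketch of the fixed error path, assuming helpers along the lines of
      the SEV code's existing sev_bind_asid()/sev_decommission()
      (illustrative, not the literal patch):

        /* RECEIVE_START created a new guest context; if binding an ASID
         * to it fails, DECOMMISSION the context instead of leaking it. */
        ret = sev_bind_asid(kvm, start.handle, error);
        if (ret) {
        	sev_decommission(start.handle);
        	goto e_free_session;
        }
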
      Cc: Alper Gun <alpergun@google.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brijesh Singh <brijesh.singh@amd.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Marc Orr <marcorr@google.com>
      Cc: John Allen <john.allen@amd.com>
      Cc: Peter Gonda <pgonda@google.com>
      Cc: Sean Christopherson <seanjc@google.com>
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Cc: Vipin Sharma <vipinsh@google.com>
      Cc: stable@vger.kernel.org
      Reviewed-by: Marc Orr <marcorr@google.com>
      Acked-by: Brijesh Singh <brijesh.singh@amd.com>
      Fixes: af43cbbf ("KVM: SVM: Add support for KVM_SEV_RECEIVE_START command")
      Signed-off-by: Mingwei Zhang <mizhang@google.com>
      Reviewed-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210912181815.3899316-1-mizhang@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: SEV: Acquire vcpu mutex when updating VMSA · bb18a677
      Authored by Peter Gonda
      The update-VMSA ioctl touches data stored in struct kvm_vcpu, and
      therefore should not be performed concurrently with any VCPU ioctl
      that might cause KVM or the processor to use the same data.
      
      Add a vcpu mutex guard to the VMSA updating code, and refactor out a
      __sev_launch_update_vmsa() function to handle the per-vCPU parts of
      sev_launch_update_vmsa().
      
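      A sketch of the resulting loop, assuming the refactored helper named
      above (illustrative; error handling abbreviated):

        kvm_for_each_vcpu(i, vcpu, kvm) {
        	/* Serialize against concurrent vCPU ioctls that touch
        	 * the same kvm_vcpu state. */
        	ret = mutex_lock_killable(&vcpu->mutex);
        	if (ret)
        		return ret;

        	ret = __sev_launch_update_vmsa(kvm, vcpu, &argp->error);

        	mutex_unlock(&vcpu->mutex);
        	if (ret)
        		return ret;
        }
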
      Fixes: ad73109a ("KVM: SVM: Provide support to launch and run an SEV-ES guest")
      Signed-off-by: Peter Gonda <pgonda@google.com>
      Cc: Marc Orr <marcorr@google.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Sean Christopherson <seanjc@google.com>
      Cc: Brijesh Singh <brijesh.singh@amd.com>
      Cc: kvm@vger.kernel.org
      Cc: stable@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Message-Id: <20210915171755.3773766-1-pgonda@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: nVMX: fix comments of handle_vmon() · ed7023a1
      Authored by Yu Zhang
      "VMXON pointer" is saved in vmx->nested.vmxon_ptr since
      commit 3573e22c ("KVM: nVMX: additional checks on
      vmxon region"). Also, handle_vmptrld() & handle_vmclear()
      now have logic to check the VMCS pointer against the VMXON
      pointer.
      
      So just remove the obsolete comments of handle_vmon().
      Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
      Message-Id: <20210908171731.18885-1-yu.c.zhang@linux.intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Handle SRCU initialization failure during page track init · eb7511bf
      Authored by Haimin Zhang
      Check the return of init_srcu_struct(), which can fail due to OOM, when
      initializing the page track mechanism.  Lack of checking leads to a NULL
      pointer deref found by a modified syzkaller.
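
      A sketch of the fix, assuming the page-track init function is made to
      propagate the SRCU error (illustrative only):

        int kvm_page_track_init(struct kvm *kvm)
        {
        	struct kvm_page_track_notifier_head *head;

        	head = &kvm->arch.track_notifier_head;
        	INIT_HLIST_HEAD(&head->track_notifier_list);
        	/* init_srcu_struct() allocates and can fail under OOM;
        	 * return the error instead of ignoring it. */
        	return init_srcu_struct(&head->track_srcu);
        }
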
      Reported-by: TCS Robot <tcs_robot@tencent.com>
      Signed-off-by: Haimin Zhang <tcs_kernel@tencent.com>
      Message-Id: <1630636626-12262-1-git-send-email-tcs_kernel@tencent.com>
      [Move the call towards the beginning of kvm_arch_init_vm. - Paolo]
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: VMX: Remove defunct "nr_active_uret_msrs" field · cd36ae87
      Authored by Sean Christopherson
      Remove vcpu_vmx.nr_active_uret_msrs and its associated comment, which are
      both defunct now that KVM keeps the list constant and instead explicitly
      tracks which entries need to be loaded into hardware.
      
      No functional change intended.
      
      Fixes: ee9d22e0 ("KVM: VMX: Use flag to indicate "active" uret MSRs instead of sorting list")
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210908002401.1947049-1-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Clear KVM's cached guest CR3 at RESET/INIT · 03a6e840
      Authored by Sean Christopherson
      Explicitly zero the guest's CR3 and mark it available+dirty at RESET/INIT.
      Per Intel's SDM and AMD's APM, CR3 is zeroed at both RESET and INIT.  For
      RESET, this is a nop as the vcpu is zero-allocated.  For INIT, the bug has
      likely escaped notice because no firmware/kernel puts its page table root
      at PA=0, let alone relies on INIT to get the desired CR3 for such page
      tables.
      
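      A sketch of the RESET/INIT path, assuming KVM's register-cache
      helpers (illustrative only):

        /* CR3 is architecturally zeroed at RESET and INIT; mark the
         * cached value dirty so it is propagated to hardware. */
        vcpu->arch.cr3 = 0;
        kvm_register_mark_dirty(vcpu, VCPU_EXREG_CR3);
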
      Cc: stable@vger.kernel.org
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210921000303.400537-3-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Mark all registers as avail/dirty at vCPU creation · 7117003f
      Authored by Sean Christopherson
      Mark all registers as available and dirty at vCPU creation, as the vCPU has
      obviously not been loaded into hardware, let alone been given the chance to
      be modified in hardware.  On SVM, reading from "uninitialized" hardware is
      a non-issue as VMCBs are zero allocated (thus not truly uninitialized) and
      hardware does not allow for arbitrary field encoding schemes.
      
      On VMX, backing memory for VMCSes is also zero allocated, but true
      initialization of the VMCS _technically_ requires VMWRITEs, as the VMX
      architectural specification technically allows CPU implementations to
      encode fields with arbitrary schemes.  E.g. a CPU could theoretically store
      the inverted value of every field, which would result in a VMREAD of a
      zero-allocated field returning all ones.
      
      In practice, only the AR_BYTES fields are known to be manipulated by
      hardware during VMREAD/VMWRITE; no known hardware or VMM (for nested VMX)
      does fancy encoding of cacheable field values (CR0, CR3, CR4, etc...).  In
      other words, this is technically a bug fix, but practically speaking it's
      a glorified nop.
      
      Failure to mark registers as available has been a lurking bug for quite
      some time.  The original register caching supported only GPRs (+RIP, which
      is kinda sorta a GPR), with the masks initialized at ->vcpu_reset().  That
      worked because the two cacheable registers, RIP and RSP, are generally
      speaking not read as side effects in other flows.
      
      Arguably, commit aff48baa ("KVM: Fetch guest cr3 from hardware on
      demand") was the first instance of failure to mark regs available.  While
      _just_ marking CR3 available during vCPU creation wouldn't have fixed the
      VMREAD from an uninitialized VMCS bug because ept_update_paging_mode_cr0()
      unconditionally read vmcs.GUEST_CR3, marking CR3 _and_ intentionally not
      reading GUEST_CR3 when it's available would have avoided VMREAD to a
      technically-uninitialized VMCS.
      
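      A sketch of the change at vCPU creation, assuming the regs_avail and
      regs_dirty bitmasks used by KVM's register cache (illustrative only):

        /* Nothing has been loaded into hardware yet, so every
         * register's cached value is authoritative and needs to be
         * written out on first entry. */
        vcpu->arch.regs_avail = ~0;
        vcpu->arch.regs_dirty = ~0;
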
      Fixes: aff48baa ("KVM: Fetch guest cr3 from hardware on demand")
      Fixes: 6de4f3ad ("KVM: Cache pdptrs")
      Fixes: 6de12732 ("KVM: VMX: Optimize vmx_get_rflags()")
      Fixes: 2fb92db1 ("KVM: VMX: Cache vmcs segment fields")
      Fixes: bd31fe49 ("KVM: VMX: Add proper cache tracking for CR0")
      Fixes: f98c1e77 ("KVM: VMX: Add proper cache tracking for CR4")
      Fixes: 5addc235 ("KVM: VMX: Cache vmcs.EXIT_QUALIFICATION using arch avail_reg flags")
      Fixes: 87915858 ("KVM: VMX: Cache vmcs.EXIT_INTR_INFO using arch avail_reg flags")
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210921000303.400537-2-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  2. 19 September 2021 (1 commit)
  3. 15 September 2021 (3 commits)
    • xen: fix usage of pmd_populate in mremap for pv guests · 36c9b592
      Authored by Juergen Gross
      Commit 0881ace2 ("mm/mremap: use pmd/pud_poplulate to update page
      table entries") introduced a regression when running as a Xen PV guest.

      Today pmd_populate() for Xen PV assumes that the inserted PFN
      references a page table that is not yet in use. In the case of
      move_normal_pmd() this is not true, resulting in WARN splats like:
      
      [34321.304270] ------------[ cut here ]------------
      [34321.304277] WARNING: CPU: 0 PID: 23628 at arch/x86/xen/multicalls.c:102 xen_mc_flush+0x176/0x1a0
      [34321.304288] Modules linked in:
      [34321.304291] CPU: 0 PID: 23628 Comm: apt-get Not tainted 5.14.1-20210906-doflr-mac80211debug+ #1
      [34321.304294] Hardware name: MSI MS-7640/890FXA-GD70 (MS-7640)  , BIOS V1.8B1 09/13/2010
      [34321.304296] RIP: e030:xen_mc_flush+0x176/0x1a0
      [34321.304300] Code: 89 45 18 48 c1 e9 3f 48 89 ce e9 20 ff ff ff e8 60 03 00 00 66 90 5b 5d 41 5c 41 5d c3 48 c7 45 18 ea ff ff ff be 01 00 00 00 <0f> 0b 8b 55 00 48 c7 c7 10 97 aa 82 31 db 49 c7 c5 38 97 aa 82 65
      [34321.304303] RSP: e02b:ffffc90000a97c90 EFLAGS: 00010002
      [34321.304305] RAX: ffff88807d416398 RBX: ffff88807d416350 RCX: ffff88807d416398
      [34321.304306] RDX: 0000000000000001 RSI: 0000000000000001 RDI: deadbeefdeadf00d
      [34321.304308] RBP: ffff88807d416300 R08: aaaaaaaaaaaaaaaa R09: ffff888006160cc0
      [34321.304309] R10: deadbeefdeadf00d R11: ffffea000026a600 R12: 0000000000000000
      [34321.304310] R13: ffff888012f6b000 R14: 0000000012f6b000 R15: 0000000000000001
      [34321.304320] FS:  00007f5071177800(0000) GS:ffff88807d400000(0000) knlGS:0000000000000000
      [34321.304322] CS:  10000e030 DS: 0000 ES: 0000 CR0: 0000000080050033
      [34321.304323] CR2: 00007f506f542000 CR3: 00000000160cc000 CR4: 0000000000000660
      [34321.304326] Call Trace:
      [34321.304331]  xen_alloc_pte+0x294/0x320
      [34321.304334]  move_pgt_entry+0x165/0x4b0
      [34321.304339]  move_page_tables+0x6fa/0x8d0
      [34321.304342]  move_vma.isra.44+0x138/0x500
      [34321.304345]  __x64_sys_mremap+0x296/0x410
      [34321.304348]  do_syscall_64+0x3a/0x80
      [34321.304352]  entry_SYSCALL_64_after_hwframe+0x44/0xae
      [34321.304355] RIP: 0033:0x7f507196301a
      [34321.304358] Code: 73 01 c3 48 8b 0d 76 0e 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 49 89 ca b8 19 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 46 0e 0c 00 f7 d8 64 89 01 48
      [34321.304360] RSP: 002b:00007ffda1eecd38 EFLAGS: 00000246 ORIG_RAX: 0000000000000019
      [34321.304362] RAX: ffffffffffffffda RBX: 000056205f950f30 RCX: 00007f507196301a
      [34321.304363] RDX: 0000000001a00000 RSI: 0000000001900000 RDI: 00007f506dc56000
      [34321.304364] RBP: 0000000001a00000 R08: 0000000000000010 R09: 0000000000000004
      [34321.304365] R10: 0000000000000001 R11: 0000000000000246 R12: 00007f506dc56060
      [34321.304367] R13: 00007f506dc56000 R14: 00007f506dc56060 R15: 000056205f950f30
      [34321.304368] ---[ end trace a19885b78fe8f33e ]---
      [34321.304370] 1 of 2 multicall(s) failed: cpu 0
      [34321.304371]   call  2: op=12297829382473034410 arg=[aaaaaaaaaaaaaaaa] result=-22
      
      Fix that by modifying xen_alloc_ptpage() to only pin the page table in
      case it wasn't pinned already.
      
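      A sketch of the guard added in xen_alloc_ptpage(), assuming the
      existing PagePinned()/SetPagePinned() page-flag helpers used by the
      Xen PV mmu code (illustrative; the surrounding pinning details are
      elided):

        /* move_normal_pmd() can re-insert an already-pinned page
         * table; only pin it via the MMUEXT_PIN_* hypercall the
         * first time it is seen. */
        if (!PagePinned(page)) {
        	pin_pagetable_pfn(MMUEXT_PIN_L1_TABLE, pfn);
        	SetPagePinned(page);
        }
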
      Fixes: 0881ace2 ("mm/mremap: use pmd/pud_poplulate to update page table entries")
      Cc: <stable@vger.kernel.org>
      Reported-by: Sander Eikelenboom <linux@eikelenboom.it>
      Tested-by: Sander Eikelenboom <linux@eikelenboom.it>
      Signed-off-by: Juergen Gross <jgross@suse.com>
      Link: https://lore.kernel.org/r/20210908073640.11299-1-jgross@suse.com
      Signed-off-by: Juergen Gross <jgross@suse.com>
    • xen: reset legacy rtc flag for PV domU · f68aa100
      Authored by Juergen Gross
      A Xen PV guest doesn't have a legacy RTC device, so reset the legacy
      RTC flag. Otherwise the following WARN splat will occur at boot:
      
      [    1.333404] WARNING: CPU: 1 PID: 1 at /home/gross/linux/head/drivers/rtc/rtc-mc146818-lib.c:25 mc146818_get_time+0x1be/0x210
      [    1.333404] Modules linked in:
      [    1.333404] CPU: 1 PID: 1 Comm: swapper/0 Tainted: G        W         5.14.0-rc7-default+ #282
      [    1.333404] RIP: e030:mc146818_get_time+0x1be/0x210
      [    1.333404] Code: c0 64 01 c5 83 fd 45 89 6b 14 7f 06 83 c5 64 89 6b 14 41 83 ec 01 b8 02 00 00 00 44 89 63 10 5b 5d 41 5c 41 5d 41 5e 41 5f c3 <0f> 0b 48 c7 c7 30 0e ef 82 4c 89 e6 e8 71 2a 24 00 48 c7 c0 ff ff
      [    1.333404] RSP: e02b:ffffc90040093df8 EFLAGS: 00010002
      [    1.333404] RAX: 00000000000000ff RBX: ffffc90040093e34 RCX: 0000000000000000
      [    1.333404] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 000000000000000d
      [    1.333404] RBP: ffffffff82ef0e30 R08: ffff888005013e60 R09: 0000000000000000
      [    1.333404] R10: ffffffff82373e9b R11: 0000000000033080 R12: 0000000000000200
      [    1.333404] R13: 0000000000000000 R14: 0000000000000002 R15: ffffffff82cdc6d4
      [    1.333404] FS:  0000000000000000(0000) GS:ffff88807d440000(0000) knlGS:0000000000000000
      [    1.333404] CS:  10000e030 DS: 0000 ES: 0000 CR0: 0000000080050033
      [    1.333404] CR2: 0000000000000000 CR3: 000000000260a000 CR4: 0000000000050660
      [    1.333404] Call Trace:
      [    1.333404]  ? wakeup_sources_sysfs_init+0x30/0x30
      [    1.333404]  ? rdinit_setup+0x2b/0x2b
      [    1.333404]  early_resume_init+0x23/0xa4
      [    1.333404]  ? cn_proc_init+0x36/0x36
      [    1.333404]  do_one_initcall+0x3e/0x200
      [    1.333404]  kernel_init_freeable+0x232/0x28e
      [    1.333404]  ? rest_init+0xd0/0xd0
      [    1.333404]  kernel_init+0x16/0x120
      [    1.333404]  ret_from_fork+0x1f/0x30
      
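      A sketch of the fix, assuming it clears the x86_platform.legacy.rtc
      quirk flag during PV guest setup (illustrative only):

        /* PV domUs have no CMOS/RTC; clear the legacy quirk so the
         * mc146818 code is never reached. */
        if (!xen_initial_domain())
        	x86_platform.legacy.rtc = 0;
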
      Cc: <stable@vger.kernel.org>
      Fixes: 8d152e7a ("x86/rtc: Replace paravirt rtc check with platform legacy quirk")
      Signed-off-by: Juergen Gross <jgross@suse.com>
      Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Link: https://lore.kernel.org/r/20210903084937.19392-3-jgross@suse.com
      Signed-off-by: Juergen Gross <jgross@suse.com>
    • memblock: introduce saner 'memblock_free_ptr()' interface · 77e02cf5
      Authored by Linus Torvalds
      The boot-time allocation interface for memblock is a mess, with
      'memblock_alloc()' returning a virtual pointer, but then you are
      supposed to free it with 'memblock_free()' that takes a _physical_
      address.
      
      Not only is that all kinds of strange and illogical, but it actually
      causes bugs, when people then use it like a normal allocation function,
      and it fails spectacularly on a NULL pointer:
      
         https://lore.kernel.org/all/20210912140820.GD25450@xsang-OptiPlex-9020/
      
      or just random memory corruption if the debug checks don't catch it:
      
         https://lore.kernel.org/all/61ab2d0c-3313-aaab-514c-e15b7aa054a0@suse.cz/
      
      I really don't want to apply patches that treat the symptoms, when the
      fundamental cause is this horribly confusing interface.
      
      I started out looking at just automating a sane replacement sequence,
      but because of this mix of virtual and physical addresses, and because
      people have used the "__pa()" macro that can take either a regular
      kernel pointer, or just the raw "unsigned long" address, it's all quite
      messy.
      
      So this just introduces a new saner interface for freeing a virtual
      address that was allocated using 'memblock_alloc()', and that was kept
      as a regular kernel pointer.  And then it converts a couple of users
      that are obvious and easy to test, including the 'xbc_nodes' case in
      lib/bootconfig.c that caused problems.
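
      A sketch of the new interface, under the assumption that it simply
      wraps the physical-address variant (illustrative only):

        /* Free boot memory allocated with memblock_alloc() and kept
         * as a regular (virtual) kernel pointer. */
        void __init memblock_free_ptr(void *ptr, size_t size)
        {
        	if (ptr)
        		memblock_free(__pa(ptr), size);
        }
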
      Reported-by: kernel test robot <oliver.sang@intel.com>
      Fixes: 40caa127 ("init: bootconfig: Remove all bootconfig data when the init memory is removed")
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  4. 14 September 2021 (2 commits)
    • x86/mce: Avoid infinite loop for copy from user recovery · 81065b35
      Authored by Tony Luck
      There are two cases for machine check recovery:
      
      1) The machine check was triggered by ring3 (application) code.
         This is the simpler case. The machine check handler simply queues
         work to be executed on return to user. That code unmaps the page
         from all users and arranges to send a SIGBUS to the task that
         triggered the poison.
      
      2) The machine check was triggered in kernel code that is covered by
         an exception table entry. In this case the machine check handler
         still queues a work entry to unmap the page, etc. but this will
         not be called right away because the #MC handler returns to the
         fix up code address in the exception table entry.
      
      Problems occur if the kernel triggers another machine check before the
      return to user processes the first queued work item.
      
      Specifically, the work is queued using the ->mce_kill_me callback
      structure in the task struct for the current thread. Attempting to queue
      a second work item using this same callback results in a loop in the
      linked list of work functions to call. So when the kernel does return to
      user, it enters an infinite loop processing the same entry for ever.
      
      There are some legitimate scenarios where the kernel may take a second
      machine check before returning to the user.
      
      1) Some code (e.g. futex) first tries a get_user() with page faults
         disabled. If this fails, the code retries with page faults enabled
         expecting that this will resolve the page fault.
      
      2) Copy from user code retries a copy in byte-at-a-time mode to check
         whether any additional bytes can be copied.
      
      On the other side of the fence are some bad drivers that do not check
      the return value from individual get_user() calls and may access
      multiple user addresses without noticing that some/all calls have
      failed.
      
      Fix by adding a counter (current->mce_count) to keep track of repeated
      machine checks before task_work() is called. The first machine check
      saves the address information and calls task_work_add(). Subsequent
      machine checks taken before that task_work callback executes check that
      the address is in the same page as the first machine check (since the
      callback will offline exactly one page).
      
      Expected worst case is four machine checks before moving on (e.g. one
      user access with page faults disabled, then a repeat to the same address
       with page faults enabled ... repeat in copy tail bytes). Just in case
       there is some code that loops forever, enforce a limit of 10.
      
       [ bp: Massage commit message, drop noinstr, fix typo, extend panic
         messages. ]
      
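      A sketch of the queueing logic, assuming the mce_count/mce_addr
      fields described above (illustrative, not the literal patch):

        /* Only queue the task_work once per return-to-user. */
        if (current->mce_count++ == 0) {
        	current->mce_addr = m->addr;
        	task_work_add(current, &current->mce_kill_me, TWA_RESUME);
        }

        /* Later #MCs must hit the same page, and must not repeat
         * forever. */
        if (current->mce_count > 10)
        	mce_panic("Too many consecutive machine checks", m, NULL);
        if ((current->mce_addr >> PAGE_SHIFT) != (m->addr >> PAGE_SHIFT))
        	mce_panic("Machine checks to different user pages", m, NULL);
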
      Fixes: 5567d11c ("x86/mce: Send #MC singal from task work")
      Signed-off-by: Tony Luck <tony.luck@intel.com>
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Cc: <stable@vger.kernel.org>
      Link: https://lkml.kernel.org/r/YT/IJ9ziLqmtqEPu@agluck-desk2.amr.corp.intel.com
    • x86/uaccess: Fix 32-bit __get_user_asm_u64() when CC_HAS_ASM_GOTO_OUTPUT=y · a69ae291
      Authored by Will Deacon
      Commit 865c50e1 ("x86/uaccess: utilize CONFIG_CC_HAS_ASM_GOTO_OUTPUT")
      added an optimised version of __get_user_asm() for x86 using 'asm goto'.
      
      Like the non-optimised code, the 32-bit implementation of 64-bit
      get_user() expands to a pair of 32-bit accesses.  Unlike the
      non-optimised code, the _original_ pointer is incremented to copy the
      high word instead of loading through a new pointer explicitly
      constructed to point at a 32-bit type.  Consequently, if the pointer
      points at a 64-bit type then we end up loading the wrong data for the
      upper 32-bits.
      
      This was observed as a mount() failure in Android targeting i686 after
      b0cfcdd9 ("d_path: make 'prepend()' fill up the buffer exactly on
      overflow") because the call to copy_from_kernel_nofault() from
      prepend_copy() ends up in __get_kernel_nofault() and casts the source
      pointer to a 'u64 __user *'.  An attempt to mount at "/debug_ramdisk"
      therefore ends up failing trying to mount "/debumdismdisk".
      
      Use the existing '__gu_ptr' source pointer, which points to unsigned int,
      for the 32-bit __get_user_asm_u64() instead of the original pointer.
      
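      The pointer-arithmetic pitfall can be shown with a small stand-alone
      C program (hypothetical demo, not kernel code):

        #include <stdint.h>
        #include <stdio.h>

        int main(void)
        {
        	uint64_t buf[2] = { 0x1122334455667788ULL, 0xddccbbaaULL };
        	uint64_t *p64 = buf;
        	uint32_t *p32 = (uint32_t *)buf;

        	/* Buggy: "+ 1" on a u64-typed pointer advances 8 bytes,
        	 * so the "high word" is read from the next u64 entirely. */
        	printf("wrong high word: %08x\n", *(uint32_t *)(p64 + 1));

        	/* Fixed: step through a u32-typed pointer to reach the
        	 * real upper 32 bits at byte offset 4 (little-endian). */
        	printf("right high word: %08x\n", *(p32 + 1));
        	return 0;
        }
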
      Cc: Bill Wendling <morbo@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Reported-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Fixes: 865c50e1 ("x86/uaccess: utilize CONFIG_CC_HAS_ASM_GOTO_OUTPUT")
      Signed-off-by: Will Deacon <will@kernel.org>
      Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
      Tested-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  5. 11 September 2021 (1 commit)
  6. 09 September 2021 (5 commits)
    • arch: remove compat_alloc_user_space · a7a08b27
      Authored by Arnd Bergmann
      All users of compat_alloc_user_space() and copy_in_user() have been
      removed from the kernel; only a few functions in sparc remain, and they
      can be changed to call arch_copy_in_user() instead.
      
      Link: https://lkml.kernel.org/r/20210727144859.4150043-7-arnd@kernel.org
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Eric Biederman <ebiederm@xmission.com>
      Cc: Feng Tang <feng.tang@intel.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Helge Deller <deller@gmx.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • compat: remove some compat entry points · 59ab844e
      Authored by Arnd Bergmann
      These are all handled correctly when calling the native system call entry
      point, so remove the special cases.
      
      Link: https://lkml.kernel.org/r/20210727144859.4150043-6-arnd@kernel.org
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Eric Biederman <ebiederm@xmission.com>
      Cc: Feng Tang <feng.tang@intel.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Helge Deller <deller@gmx.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • x86/mm: Fix kern_addr_valid() to cope with existing but not present entries · 34b1999d
      Authored by Mike Rapoport
      Jiri Olsa reported a fault when running:
      
        # cat /proc/kallsyms | grep ksys_read
        ffffffff8136d580 T ksys_read
        # objdump -d --start-address=0xffffffff8136d580 --stop-address=0xffffffff8136d590 /proc/kcore
      
        /proc/kcore:     file format elf64-x86-64
      
        Segmentation fault
      
        general protection fault, probably for non-canonical address 0xf887ffcbff000: 0000 [#1] SMP PTI
        CPU: 12 PID: 1079 Comm: objdump Not tainted 5.14.0-rc5qemu+ #508
        Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.14.0-4.fc34 04/01/2014
        RIP: 0010:kern_addr_valid
        Call Trace:
         read_kcore
         ? rcu_read_lock_sched_held
         ? rcu_read_lock_sched_held
         ? rcu_read_lock_sched_held
         ? trace_hardirqs_on
         ? rcu_read_lock_sched_held
         ? lock_acquire
         ? lock_acquire
         ? rcu_read_lock_sched_held
         ? lock_acquire
         ? rcu_read_lock_sched_held
         ? rcu_read_lock_sched_held
         ? rcu_read_lock_sched_held
         ? lock_release
         ? _raw_spin_unlock
         ? __handle_mm_fault
         ? rcu_read_lock_sched_held
         ? lock_acquire
         ? rcu_read_lock_sched_held
         ? lock_release
         proc_reg_read
         ? vfs_read
         vfs_read
         ksys_read
         do_syscall_64
         entry_SYSCALL_64_after_hwframe
      
      The fault happens because kern_addr_valid() dereferences an existing but
      not present PMD in the high kernel mappings.
      
      Such PMDs are created when free_kernel_image_pages() frees regions larger
      than 2Mb. In this case, a part of the freed memory is mapped with PMDs and
      the set_memory_np_noalias() -> ... -> __change_page_attr() sequence will
      mark the PMD as not present rather than wipe it completely.
      
      Have kern_addr_valid() check whether higher level page table entries are
      present before trying to dereference them to fix this issue and to avoid
      similar issues in the future.
      
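      A sketch of the hardened walk, assuming the usual pXd_offset()/
      pXd_present() accessors (illustrative; only the PMD level is shown):

        pmd = pmd_offset(pud, addr);
        /* A cleared-present PMD still "exists"; a pmd_none() check
         * would miss it and the walk would dereference a not-present
         * entry. */
        if (!pmd_present(*pmd))
        	return 0;

        if (pmd_large(*pmd))
        	return pfn_valid(pmd_pfn(*pmd));
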
      Stable backporting note:
      ------------------------
      
      Note that the stable marking is for all active stable branches because
      there could be cases where pagetable entries exist but are not valid -
      see 9a14aefc ("x86: cpa, fix lookup_address"), for example. So make
      sure to be on the safe side here and use pXY_present() accessors rather
      than pXY_none() which could #GP when accessing pages in the direct map.
      
      Also see:
      
        c40a56a7 ("x86/mm/init: Remove freed kernel image areas from alias mapping")
      
      for more info.
      Reported-by: Jiri Olsa <jolsa@redhat.com>
      Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Acked-by: Dave Hansen <dave.hansen@intel.com>
      Tested-by: Jiri Olsa <jolsa@redhat.com>
      Cc: <stable@vger.kernel.org>	# 4.4+
      Link: https://lkml.kernel.org/r/20210819132717.19358-1-rppt@kernel.org
    • configs: remove the obsolete CONFIG_INPUT_POLLDEV · 4cb398fe
      Authored by Zenghui Yu
      This CONFIG option was removed in commit 278b13ce ("Input: remove
      input_polled_dev implementation") so there's no point in keeping it in
      defconfigs any longer.
      
      Get rid of the leftover for all arches.
      
      Link: https://lkml.kernel.org/r/20210726074741.1062-1-yuzenghui@huawei.com
      Signed-off-by: Zenghui Yu <yuzenghui@huawei.com>
      Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/memory_hotplug: remove nid parameter from arch_remove_memory() · 65a2aa5f
      Authored by David Hildenbrand
      The parameter is unused, let's remove it.
      
      Link: https://lkml.kernel.org/r/20210712124052.26491-3-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Acked-by: Catalin Marinas <catalin.marinas@arm.com>
      Acked-by: Michael Ellerman <mpe@ellerman.id.au> [powerpc]
      Acked-by: Heiko Carstens <hca@linux.ibm.com>	[s390]
      Reviewed-by: Pankaj Gupta <pankaj.gupta@ionos.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Ard Biesheuvel <ardb@kernel.org>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Laurent Dufour <ldufour@linux.ibm.com>
      Cc: Sergei Trofimovich <slyfox@gentoo.org>
      Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: Michel Lespinasse <michel@lespinasse.org>
      Cc: Christophe Leroy <christophe.leroy@c-s.fr>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
      Cc: Thiago Jung Bauermann <bauerman@linux.ibm.com>
      Cc: Joe Perches <joe@perches.com>
      Cc: Pierre Morel <pmorel@linux.ibm.com>
      Cc: Jia He <justin.he@arm.com>
      Cc: Anton Blanchard <anton@ozlabs.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: Jason Wang <jasowang@redhat.com>
      Cc: Len Brown <lenb@kernel.org>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Nathan Lynch <nathanl@linux.ibm.com>
      Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Cc: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Cc: Scott Cheloha <cheloha@linux.ibm.com>
      Cc: Vishal Verma <vishal.l.verma@intel.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  7. 06 September 2021 (12 commits)
    • KVM: x86: Update vCPU's hv_clock before back to guest when tsc_offset is adjusted · d9130a2d
      Authored by Zelin Deng
      When MSR_IA32_TSC_ADJUST is written by the guest due to the TSC ADJUST
      feature, especially when there's a big tsc warp (e.g. a new vCPU is
      hot-added into a VM which has been up for a long time), a large value
      is added to tsc_offset before going back to the guest. This causes the
      system time to jump, as tsc_timestamp is not adjusted in the meantime,
      violating pvclock's monotonic character.
      To fix this, just notify kvm to update the vCPU's guest time before
      going back to the guest.
      
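      A sketch of the fix on the MSR write path, assuming KVM's existing
      request machinery (illustrative only):

        /* tsc_offset changed; refresh this vCPU's pvclock (hv_clock)
         * before the next guest entry so system time stays monotonic. */
        kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
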
      Cc: stable@vger.kernel.org
      Signed-off-by: Zelin Deng <zelin.deng@linux.alibaba.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Message-Id: <1619576521-81399-2-git-send-email-zelin.deng@linux.alibaba.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: MMU: mark role_regs and role accessors as maybe unused · 4ac21457
      Authored by Paolo Bonzini
      It is reasonable for these functions to be used only in some configurations,
      for example only if the host is 64-bit (and therefore supports 64-bit
      guests).  It is also reasonable to keep the role_regs and role accessors
      in sync even though some of the accessors may be used only for one of the
      two sets (as is the case currently for CR4.LA57).

      Because clang reports warnings for unused inlines declared in a .c file,
      mark both sets of accessors as __maybe_unused.
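
      A sketch of the annotation on the generated accessors (hypothetical
      macro shape, shown only to illustrate where __maybe_unused lands):

        #define BUILD_MMU_ROLE_REGS_ACCESSOR(reg, name, flag)	\
        static inline bool __maybe_unused			\
        ____is_##reg##_##name(struct kvm_mmu_role_regs *regs)	\
        {							\
        	return !!(regs->reg & flag);			\
        }
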
      Reported-by: kernel test robot <lkp@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • x86/kvm: Don't enable IRQ when IRQ enabled in kvm_wait · a40b2fd0
      Authored by Lai Jiangshan
      Commit f4e61f0c ("x86/kvm: Fix broken irq restoration in kvm_wait")
      replaced "local_irq_restore() when IRQ enabled" with "local_irq_enable()
      when IRQ enabled" to suppress a warning.

      Although there is no similar debugging warning for calling
      local_irq_enable() with IRQs already enabled, doing so is no less broken
      than the local_irq_restore() it replaced, and we'd better avoid it.
      
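      A sketch of the resulting wait logic, assuming the pv spinlock halt
      path (illustrative only):

        static void kvm_wait(u8 *ptr, u8 val)
        {
        	/* Halt with IRQs in whatever state the caller had; never
        	 * enable IRQs that were already enabled. */
        	if (irqs_disabled()) {
        		if (READ_ONCE(*ptr) == val)
        			halt();
        	} else {
        		local_irq_disable();
        		if (READ_ONCE(*ptr) == val)
        			safe_halt();	/* sti;hlt atomically */
        		local_irq_enable();
        	}
        }
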
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
      Message-Id: <20210814035129.154242-1-jiangshanlai@gmail.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Move lpage_disallowed_link further "down" in kvm_mmu_page · 1148bfc4
      Authored by Sean Christopherson
      Move "lpage_disallowed_link" out of the first 64 bytes, i.e. out of the
      first cache line, of kvm_mmu_page so that "spt" and to a lesser extent
      "gfns" land in the first cache line.  "lpage_disallowed_link" is accessed
      relatively infrequently compared to "spt", which is accessed any time KVM
      is walking and/or manipulating the shadow page tables.
      
      No functional change intended.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210901221023.1303578-4-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Relocate kvm_mmu_page.tdp_mmu_page for better cache locality · ca41c34c
      Authored by Sean Christopherson
      Move "tdp_mmu_page" into the 1-byte void left by the recently removed
      "mmio_cached" so that it resides in the first 64 bytes of kvm_mmu_page,
      i.e. in the same cache line as the most commonly accessed fields.
      
      Don't bother wrapping tdp_mmu_page in CONFIG_X86_64, including the field in
      32-bit builds doesn't affect the size of kvm_mmu_page, and a future patch
      can always wrap the field in the unlikely event KVM gains a 1-byte flag
      that is 32-bit specific.
      
      Note, the size of kvm_mmu_page is also unchanged on CONFIG_X86_64=y due
      to it previously sharing an 8-byte chunk with write_flooding_count.
      
      No functional change intended.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210901221023.1303578-3-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • Revert "KVM: x86: mmu: Add guest physical address check in translate_gpa()" · e7177339
      Authored by Sean Christopherson
      Revert a misguided illegal GPA check when "translating" a non-nested GPA.
      The check is woefully incomplete as it does not fill in @exception as
      expected by all callers, which leads to KVM attempting to inject a bogus
      exception, potentially exposing kernel stack information in the process.
      
       WARNING: CPU: 0 PID: 8469 at arch/x86/kvm/x86.c:525 exception_type+0x98/0xb0 arch/x86/kvm/x86.c:525
       CPU: 1 PID: 8469 Comm: syz-executor531 Not tainted 5.14.0-rc7-syzkaller #0
       RIP: 0010:exception_type+0x98/0xb0 arch/x86/kvm/x86.c:525
       Call Trace:
        x86_emulate_instruction+0xef6/0x1460 arch/x86/kvm/x86.c:7853
        kvm_mmu_page_fault+0x2f0/0x1810 arch/x86/kvm/mmu/mmu.c:5199
        handle_ept_misconfig+0xdf/0x3e0 arch/x86/kvm/vmx/vmx.c:5336
        __vmx_handle_exit arch/x86/kvm/vmx/vmx.c:6021 [inline]
        vmx_handle_exit+0x336/0x1800 arch/x86/kvm/vmx/vmx.c:6038
        vcpu_enter_guest+0x2a1c/0x4430 arch/x86/kvm/x86.c:9712
        vcpu_run arch/x86/kvm/x86.c:9779 [inline]
        kvm_arch_vcpu_ioctl_run+0x47d/0x1b20 arch/x86/kvm/x86.c:10010
        kvm_vcpu_ioctl+0x49e/0xe50 arch/x86/kvm/../../../virt/kvm/kvm_main.c:3652
      
      The bug has escaped notice because practically speaking the GPA check is
      useless.  The GPA check in question only comes into play when KVM is
      walking guest page tables (or "translating" CR3), and KVM already handles
      illegal GPA checks by setting reserved bits in rsvd_bits_mask for each
      PxE, or in the case of CR3 for loading PTDPTRs, manually checks for an
      illegal CR3.  This particular failure doesn't hit the existing reserved
      bits checks because syzbot sets guest.MAXPHYADDR=1, and IA32 architecture
      simply doesn't allow for such an absurd MAXPHYADDR, e.g. 32-bit paging
      doesn't define any reserved PA bits checks, which KVM emulates by only
      incorporating the reserved PA bits into the "high" bits, i.e. bits 63:32.
      
      Simply remove the bogus check.  There is zero meaningful value and no
      architectural justification for supporting guest.MAXPHYADDR < 32, and
      properly filling the exception would introduce non-trivial complexity.
      
      This reverts commit ec7771ab.
      
      Fixes: ec7771ab ("KVM: x86: mmu: Add guest physical address check in translate_gpa()")
      Cc: stable@vger.kernel.org
      Reported-by: syzbot+200c08e88ae818f849ce@syzkaller.appspotmail.com
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210831164224.1119728-2-seanjc@google.com>
      Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Remove unused field mmio_cached in struct kvm_mmu_page · 678a305b
      Authored by Jia He
      After reverting and restoring the fast tlb invalidation patch series,
      mmio_cached was not removed. Hence an unused field is left in
      kvm_mmu_page.
      
      Cc: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Jia He <justin.he@arm.com>
      Message-Id: <20210830145336.27183-1-justin.he@arm.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • kvm: x86: Increase KVM_SOFT_MAX_VCPUS to 710 · 1dbaf04c
      Authored by Eduardo Habkost
      Support for 710 VCPUs was tested by Red Hat since RHEL-8.4,
      so increase KVM_SOFT_MAX_VCPUS to 710.
      Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
      Message-Id: <20210903211600.2002377-4-ehabkost@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • kvm: x86: Increase MAX_VCPUS to 1024 · 074c82c8
      Authored by Eduardo Habkost
      Increase KVM_MAX_VCPUS to 1024, so we can test larger VMs.
      
      I'm not changing KVM_SOFT_MAX_VCPUS yet because I'm afraid it
      might involve complicated questions around the meaning of
      "supported" and "recommended" in the upstream tree.
      KVM_SOFT_MAX_VCPUS will be changed in a separate patch.
      
      For reference, visible effects of this change are:
      - KVM_CAP_MAX_VCPUS will now return 1024 (of course)
      - Default value for CPUID[HYPERV_CPUID_IMPLEMENT_LIMITS (0x40000005)].EAX
        will now be 1024
      - KVM_MAX_VCPU_ID will change from 1151 to 4096
      - Size of struct kvm will increase from 19328 to 22272 bytes
        (in x86_64)
      - Size of struct kvm_ioapic will increase from 1780 to 5084 bytes
        (in x86_64)
      - Bitmap stack variables that will grow:
        - At kvm_hv_flush_tlb() and kvm_hv_send_ipi(),
          vp_bitmap[] and vcpu_bitmap[] will now be 128 bytes long
        - vcpu_bitmap at ioapic_write_indirect() will be 128 bytes long
          once patch "KVM: x86: Fix stack-out-of-bounds memory access
          from ioapic_write_indirect()" is applied
      Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
      Message-Id: <20210903211600.2002377-3-ehabkost@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • kvm: x86: Set KVM_MAX_VCPU_ID to 4*KVM_MAX_VCPUS · 4ddacd52
      Authored by Eduardo Habkost
      Instead of requiring KVM_MAX_VCPU_ID to be manually increased
      every time we increase KVM_MAX_VCPUS, set it to 4*KVM_MAX_VCPUS.
      This should be enough for CPU topologies where Cores-per-Package
      and Packages-per-Socket are not powers of 2.
      
      In practice, this increases KVM_MAX_VCPU_ID from 1023 to 1152.
      The only side effect of this change is making some fields in
      struct kvm_ioapic larger, increasing the struct size from 1628 to
      1780 bytes (in x86_64).
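
      A sketch of the resulting definition (illustrative only):

        /* Allow vCPU IDs up to 4x the vCPU count, accommodating sparse
         * APIC ID layouts with non-power-of-2 cores/packages. */
        #define KVM_MAX_VCPU_ID (4 * KVM_MAX_VCPUS)
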
      Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
      Message-Id: <20210903211600.2002377-2-ehabkost@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: VMX: avoid running vmx_handle_exit_irqoff in case of emulation · 81b4b56d
      Authored by Maxim Levitsky
      If we are emulating an invalid guest state, we don't have a correct
      exit reason, and thus we shouldn't do anything in this function.
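
      A sketch of the early return, assuming the emulation_required flag
      tracked by vcpu_vmx (illustrative only):

        static void vmx_handle_exit_irqoff(struct kvm_vcpu *vcpu)
        {
        	struct vcpu_vmx *vmx = to_vmx(vcpu);

        	/* The exit reason is stale while emulating invalid guest
        	 * state; don't act on it. */
        	if (vmx->emulation_required)
        		return;

        	/* ... normal NMI/#MC/IRQ handling follows ... */
        }
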
      Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
      Message-Id: <20210826095750.1650467-2-mlevitsk@redhat.com>
      Cc: stable@vger.kernel.org
      Fixes: 95b5a48c ("KVM: VMX: Handle NMIs, #MCs and async #PFs in common irqs-disabled fn", 2019-06-18)
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Don't freak out if pml5_root is NULL on 4-level host · a717a780
      Authored by Sean Christopherson
      Include pml5_root in the set of special roots if and only if the host,
      and thus NPT, is using 5-level paging.  mmu_alloc_special_roots() expects
      special roots to be allocated as a bundle, i.e. they're either all valid
      or all NULL.  But for pml5_root, that expectation only holds true if the
      host uses 5-level paging, which causes KVM to WARN about pml5_root being
      NULL when the other special roots are valid.
      
      The silver lining of 4-level vs. 5-level NPT being tied to the host
      kernel's paging level is that KVM's shadow root level is constant; unlike
      VMX's EPT, KVM can't choose 4-level NPT based on guest.MAXPHYADDR.  That
      means KVM can still expect pml5_root to be bundled with the other special
      roots, it just needs to be conditioned on the shadow root level.
      
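      A sketch of the conditioned check in mmu_alloc_special_roots(),
      assuming the mmu's shadow_root_level field (illustrative, not the
      literal patch):

        /* pml5_root is only part of the bundle when NPT itself is
         * 5-level, i.e. when the host kernel uses 5-level paging. */
        bool need_pml5 = mmu->shadow_root_level > PT64_ROOT_4LEVEL;

        if (WARN_ON_ONCE(!mmu->pae_root || !mmu->pml4_root ||
        		 (need_pml5 && !mmu->pml5_root)))
        	return -EIO;
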
      Fixes: cb0f722a ("KVM: x86/mmu: Support shadowing NPT when 5-level paging is enabled in host")
      Reported-by: Maxim Levitsky <mlevitsk@redhat.com>
      Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210824005824.205536-1-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  8. 04 September 2021 (5 commits)
  9. 03 September 2021 (2 commits)