- 22 9月, 2021 9 次提交
-
-
由 Hou Wenlong 提交于
According to Intel's SDM Vol2 and AMD's APM Vol3, when CR4.TSD is set, use rdtsc/rdtscp instruction above privilege level 0 should trigger a #GP. Fixes: d7eb8203 ("KVM: SVM: Add intercept checks for remaining group7 instructions") Signed-off-by: NHou Wenlong <houwenlong93@linux.alibaba.com> Message-Id: <1297c0dd3f1bb47a6d089f850b629c7aa0247040.1629257115.git.houwenlong93@linux.alibaba.com> Reviewed-by: NSean Christopherson <seanjc@google.com> Reviewed-by: NJim Mattson <jmattson@google.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Sean Christopherson 提交于
Require the target guest page to be writable when pinning memory for RECEIVE_UPDATE_DATA. Per the SEV API, the PSP writes to guest memory: The result is then encrypted with GCTX.VEK and written to the memory pointed to by GUEST_PADDR field. Fixes: 15fb7de1 ("KVM: SVM: Add KVM_SEV_RECEIVE_UPDATE_DATA command") Cc: stable@vger.kernel.org Cc: Peter Gonda <pgonda@google.com> Cc: Marc Orr <marcorr@google.com> Cc: Tom Lendacky <thomas.lendacky@amd.com> Cc: Brijesh Singh <brijesh.singh@amd.com> Signed-off-by: NSean Christopherson <seanjc@google.com> Message-Id: <20210914210951.2994260-2-seanjc@google.com> Reviewed-by: NBrijesh Singh <brijesh.singh@amd.com> Reviewed-by: NPeter Gonda <pgonda@google.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Mingwei Zhang 提交于
DECOMMISSION the current SEV context if binding an ASID fails after RECEIVE_START. Per AMD's SEV API, RECEIVE_START generates a new guest context and thus needs to be paired with DECOMMISSION: The RECEIVE_START command is the only command other than the LAUNCH_START command that generates a new guest context and guest handle. The missing DECOMMISSION can result in subsequent SEV launch failures, as the firmware leaks memory and might not able to allocate more SEV guest contexts in the future. Note, LAUNCH_START suffered the same bug, but was previously fixed by commit 934002cd ("KVM: SVM: Call SEV Guest Decommission if ASID binding fails"). Cc: Alper Gun <alpergun@google.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Brijesh Singh <brijesh.singh@amd.com> Cc: David Rienjes <rientjes@google.com> Cc: Marc Orr <marcorr@google.com> Cc: John Allen <john.allen@amd.com> Cc: Peter Gonda <pgonda@google.com> Cc: Sean Christopherson <seanjc@google.com> Cc: Tom Lendacky <thomas.lendacky@amd.com> Cc: Vipin Sharma <vipinsh@google.com> Cc: stable@vger.kernel.org Reviewed-by: NMarc Orr <marcorr@google.com> Acked-by: NBrijesh Singh <brijesh.singh@amd.com> Fixes: af43cbbf ("KVM: SVM: Add support for KVM_SEV_RECEIVE_START command") Signed-off-by: NMingwei Zhang <mizhang@google.com> Reviewed-by: NSean Christopherson <seanjc@google.com> Message-Id: <20210912181815.3899316-1-mizhang@google.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Peter Gonda 提交于
The update-VMSA ioctl touches data stored in struct kvm_vcpu, and therefore should not be performed concurrently with any VCPU ioctl that might cause KVM or the processor to use the same data. Adds vcpu mutex guard to the VMSA updating code. Refactors out __sev_launch_update_vmsa() function to deal with per vCPU parts of sev_launch_update_vmsa(). Fixes: ad73109a ("KVM: SVM: Provide support to launch and run an SEV-ES guest") Signed-off-by: NPeter Gonda <pgonda@google.com> Cc: Marc Orr <marcorr@google.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Sean Christopherson <seanjc@google.com> Cc: Brijesh Singh <brijesh.singh@amd.com> Cc: kvm@vger.kernel.org Cc: stable@vger.kernel.org Cc: linux-kernel@vger.kernel.org Message-Id: <20210915171755.3773766-1-pgonda@google.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Yu Zhang 提交于
"VMXON pointer" is saved in vmx->nested.vmxon_ptr since commit 3573e22c ("KVM: nVMX: additional checks on vmxon region"). Also, handle_vmptrld() & handle_vmclear() now have logic to check the VMCS pointer against the VMXON pointer. So just remove the obsolete comments of handle_vmon(). Signed-off-by: NYu Zhang <yu.c.zhang@linux.intel.com> Message-Id: <20210908171731.18885-1-yu.c.zhang@linux.intel.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Haimin Zhang 提交于
Check the return of init_srcu_struct(), which can fail due to OOM, when initializing the page track mechanism. Lack of checking leads to a NULL pointer deref found by a modified syzkaller. Reported-by: NTCS Robot <tcs_robot@tencent.com> Signed-off-by: NHaimin Zhang <tcs_kernel@tencent.com> Message-Id: <1630636626-12262-1-git-send-email-tcs_kernel@tencent.com> [Move the call towards the beginning of kvm_arch_init_vm. - Paolo] Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Sean Christopherson 提交于
Remove vcpu_vmx.nr_active_uret_msrs and its associated comment, which are both defunct now that KVM keeps the list constant and instead explicitly tracks which entries need to be loaded into hardware. No functional change intended. Fixes: ee9d22e0 ("KVM: VMX: Use flag to indicate "active" uret MSRs instead of sorting list") Signed-off-by: NSean Christopherson <seanjc@google.com> Message-Id: <20210908002401.1947049-1-seanjc@google.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Sean Christopherson 提交于
Explicitly zero the guest's CR3 and mark it available+dirty at RESET/INIT. Per Intel's SDM and AMD's APM, CR3 is zeroed at both RESET and INIT. For RESET, this is a nop as vcpu is zero-allocated. For INIT, the bug has likely escaped notice because no firmware/kernel puts its page tables root at PA=0, let alone relies on INIT to get the desired CR3 for such page tables. Cc: stable@vger.kernel.org Signed-off-by: NSean Christopherson <seanjc@google.com> Message-Id: <20210921000303.400537-3-seanjc@google.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Sean Christopherson 提交于
Mark all registers as available and dirty at vCPU creation, as the vCPU has obviously not been loaded into hardware, let alone been given the chance to be modified in hardware. On SVM, reading from "uninitialized" hardware is a non-issue as VMCBs are zero allocated (thus not truly uninitialized) and hardware does not allow for arbitrary field encoding schemes. On VMX, backing memory for VMCSes is also zero allocated, but true initialization of the VMCS _technically_ requires VMWRITEs, as the VMX architectural specification technically allows CPU implementations to encode fields with arbitrary schemes. E.g. a CPU could theoretically store the inverted value of every field, which would result in VMREAD to a zero-allocated field returns all ones. In practice, only the AR_BYTES fields are known to be manipulated by hardware during VMREAD/VMREAD; no known hardware or VMM (for nested VMX) does fancy encoding of cacheable field values (CR0, CR3, CR4, etc...). In other words, this is technically a bug fix, but practically speakings it's a glorified nop. Failure to mark registers as available has been a lurking bug for quite some time. The original register caching supported only GPRs (+RIP, which is kinda sorta a GPR), with the masks initialized at ->vcpu_reset(). That worked because the two cacheable registers, RIP and RSP, are generally speaking not read as side effects in other flows. Arguably, commit aff48baa ("KVM: Fetch guest cr3 from hardware on demand") was the first instance of failure to mark regs available. While _just_ marking CR3 available during vCPU creation wouldn't have fixed the VMREAD from an uninitialized VMCS bug because ept_update_paging_mode_cr0() unconditionally read vmcs.GUEST_CR3, marking CR3 _and_ intentionally not reading GUEST_CR3 when it's available would have avoided VMREAD to a technically-uninitialized VMCS. Fixes: aff48baa ("KVM: Fetch guest cr3 from hardware on demand") Fixes: 6de4f3ad ("KVM: Cache pdptrs") Fixes: 6de12732 ("KVM: VMX: Optimize vmx_get_rflags()") Fixes: 2fb92db1 ("KVM: VMX: Cache vmcs segment fields") Fixes: bd31fe49 ("KVM: VMX: Add proper cache tracking for CR0") Fixes: f98c1e77 ("KVM: VMX: Add proper cache tracking for CR4") Fixes: 5addc235 ("KVM: VMX: Cache vmcs.EXIT_QUALIFICATION using arch avail_reg flags") Fixes: 87915858 ("KVM: VMX: Cache vmcs.EXIT_INTR_INFO using arch avail_reg flags") Signed-off-by: NSean Christopherson <seanjc@google.com> Message-Id: <20210921000303.400537-2-seanjc@google.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
- 19 9月, 2021 1 次提交
-
-
由 Nathan Chancellor 提交于
clang does not support -falign-jumps and only recently gained support for -falign-loops. When one of the configuration options that adds these flags is enabled, clang warns and all cc-{disable-warning,option} that follow fail because -Werror gets added to test for the presence of this warning: clang-14: warning: optimization flag '-falign-jumps=0' is not supported [-Wignored-optimization-argument] To resolve this, add a couple of cc-option calls when building with clang; gcc has supported these options since 3.2 so there is no point in testing for their support. -falign-functions was implemented in clang-7, -falign-loops was implemented in clang-14, and -falign-jumps has not been implemented yet. Link: https://lore.kernel.org/r/YSQE2f5teuvKLkON@Ryzen-9-3900X.localdomain/ Link: https://lore.kernel.org/r/20210824022640.2170859-2-nathan@kernel.org/Reported-by: Nkernel test robot <lkp@intel.com> Reviewed-by: NNick Desaulniers <ndesaulniers@google.com> Acked-by: NBorislav Petkov <bp@suse.de> Signed-off-by: NNathan Chancellor <nathan@kernel.org> Signed-off-by: NMasahiro Yamada <masahiroy@kernel.org>
-
- 15 9月, 2021 3 次提交
-
-
由 Juergen Gross 提交于
Commit 0881ace2 ("mm/mremap: use pmd/pud_poplulate to update page table entries") introduced a regression when running as Xen PV guest. Today pmd_populate() for Xen PV assumes that the PFN inserted is referencing a not yet used page table. In case of move_normal_pmd() this is not true, resulting in WARN splats like: [34321.304270] ------------[ cut here ]------------ [34321.304277] WARNING: CPU: 0 PID: 23628 at arch/x86/xen/multicalls.c:102 xen_mc_flush+0x176/0x1a0 [34321.304288] Modules linked in: [34321.304291] CPU: 0 PID: 23628 Comm: apt-get Not tainted 5.14.1-20210906-doflr-mac80211debug+ #1 [34321.304294] Hardware name: MSI MS-7640/890FXA-GD70 (MS-7640) , BIOS V1.8B1 09/13/2010 [34321.304296] RIP: e030:xen_mc_flush+0x176/0x1a0 [34321.304300] Code: 89 45 18 48 c1 e9 3f 48 89 ce e9 20 ff ff ff e8 60 03 00 00 66 90 5b 5d 41 5c 41 5d c3 48 c7 45 18 ea ff ff ff be 01 00 00 00 <0f> 0b 8b 55 00 48 c7 c7 10 97 aa 82 31 db 49 c7 c5 38 97 aa 82 65 [34321.304303] RSP: e02b:ffffc90000a97c90 EFLAGS: 00010002 [34321.304305] RAX: ffff88807d416398 RBX: ffff88807d416350 RCX: ffff88807d416398 [34321.304306] RDX: 0000000000000001 RSI: 0000000000000001 RDI: deadbeefdeadf00d [34321.304308] RBP: ffff88807d416300 R08: aaaaaaaaaaaaaaaa R09: ffff888006160cc0 [34321.304309] R10: deadbeefdeadf00d R11: ffffea000026a600 R12: 0000000000000000 [34321.304310] R13: ffff888012f6b000 R14: 0000000012f6b000 R15: 0000000000000001 [34321.304320] FS: 00007f5071177800(0000) GS:ffff88807d400000(0000) knlGS:0000000000000000 [34321.304322] CS: 10000e030 DS: 0000 ES: 0000 CR0: 0000000080050033 [34321.304323] CR2: 00007f506f542000 CR3: 00000000160cc000 CR4: 0000000000000660 [34321.304326] Call Trace: [34321.304331] xen_alloc_pte+0x294/0x320 [34321.304334] move_pgt_entry+0x165/0x4b0 [34321.304339] move_page_tables+0x6fa/0x8d0 [34321.304342] move_vma.isra.44+0x138/0x500 [34321.304345] __x64_sys_mremap+0x296/0x410 [34321.304348] do_syscall_64+0x3a/0x80 [34321.304352] entry_SYSCALL_64_after_hwframe+0x44/0xae [34321.304355] RIP: 0033:0x7f507196301a [34321.304358] Code: 73 01 c3 48 8b 0d 76 0e 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 49 89 ca b8 19 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 46 0e 0c 00 f7 d8 64 89 01 48 [34321.304360] RSP: 002b:00007ffda1eecd38 EFLAGS: 00000246 ORIG_RAX: 0000000000000019 [34321.304362] RAX: ffffffffffffffda RBX: 000056205f950f30 RCX: 00007f507196301a [34321.304363] RDX: 0000000001a00000 RSI: 0000000001900000 RDI: 00007f506dc56000 [34321.304364] RBP: 0000000001a00000 R08: 0000000000000010 R09: 0000000000000004 [34321.304365] R10: 0000000000000001 R11: 0000000000000246 R12: 00007f506dc56060 [34321.304367] R13: 00007f506dc56000 R14: 00007f506dc56060 R15: 000056205f950f30 [34321.304368] ---[ end trace a19885b78fe8f33e ]--- [34321.304370] 1 of 2 multicall(s) failed: cpu 0 [34321.304371] call 2: op=12297829382473034410 arg=[aaaaaaaaaaaaaaaa] result=-22 Fix that by modifying xen_alloc_ptpage() to only pin the page table in case it wasn't pinned already. Fixes: 0881ace2 ("mm/mremap: use pmd/pud_poplulate to update page table entries") Cc: <stable@vger.kernel.org> Reported-by: NSander Eikelenboom <linux@eikelenboom.it> Tested-by: NSander Eikelenboom <linux@eikelenboom.it> Signed-off-by: NJuergen Gross <jgross@suse.com> Link: https://lore.kernel.org/r/20210908073640.11299-1-jgross@suse.comSigned-off-by: NJuergen Gross <jgross@suse.com>
-
由 Juergen Gross 提交于
A Xen PV guest doesn't have a legacy RTC device, so reset the legacy RTC flag. Otherwise the following WARN splat will occur at boot: [ 1.333404] WARNING: CPU: 1 PID: 1 at /home/gross/linux/head/drivers/rtc/rtc-mc146818-lib.c:25 mc146818_get_time+0x1be/0x210 [ 1.333404] Modules linked in: [ 1.333404] CPU: 1 PID: 1 Comm: swapper/0 Tainted: G W 5.14.0-rc7-default+ #282 [ 1.333404] RIP: e030:mc146818_get_time+0x1be/0x210 [ 1.333404] Code: c0 64 01 c5 83 fd 45 89 6b 14 7f 06 83 c5 64 89 6b 14 41 83 ec 01 b8 02 00 00 00 44 89 63 10 5b 5d 41 5c 41 5d 41 5e 41 5f c3 <0f> 0b 48 c7 c7 30 0e ef 82 4c 89 e6 e8 71 2a 24 00 48 c7 c0 ff ff [ 1.333404] RSP: e02b:ffffc90040093df8 EFLAGS: 00010002 [ 1.333404] RAX: 00000000000000ff RBX: ffffc90040093e34 RCX: 0000000000000000 [ 1.333404] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 000000000000000d [ 1.333404] RBP: ffffffff82ef0e30 R08: ffff888005013e60 R09: 0000000000000000 [ 1.333404] R10: ffffffff82373e9b R11: 0000000000033080 R12: 0000000000000200 [ 1.333404] R13: 0000000000000000 R14: 0000000000000002 R15: ffffffff82cdc6d4 [ 1.333404] FS: 0000000000000000(0000) GS:ffff88807d440000(0000) knlGS:0000000000000000 [ 1.333404] CS: 10000e030 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 1.333404] CR2: 0000000000000000 CR3: 000000000260a000 CR4: 0000000000050660 [ 1.333404] Call Trace: [ 1.333404] ? wakeup_sources_sysfs_init+0x30/0x30 [ 1.333404] ? rdinit_setup+0x2b/0x2b [ 1.333404] early_resume_init+0x23/0xa4 [ 1.333404] ? cn_proc_init+0x36/0x36 [ 1.333404] do_one_initcall+0x3e/0x200 [ 1.333404] kernel_init_freeable+0x232/0x28e [ 1.333404] ? rest_init+0xd0/0xd0 [ 1.333404] kernel_init+0x16/0x120 [ 1.333404] ret_from_fork+0x1f/0x30 Cc: <stable@vger.kernel.org> Fixes: 8d152e7a ("x86/rtc: Replace paravirt rtc check with platform legacy quirk") Signed-off-by: NJuergen Gross <jgross@suse.com> Reviewed-by: NBoris Ostrovsky <boris.ostrovsky@oracle.com> Link: https://lore.kernel.org/r/20210903084937.19392-3-jgross@suse.comSigned-off-by: NJuergen Gross <jgross@suse.com>
-
由 Linus Torvalds 提交于
The boot-time allocation interface for memblock is a mess, with 'memblock_alloc()' returning a virtual pointer, but then you are supposed to free it with 'memblock_free()' that takes a _physical_ address. Not only is that all kinds of strange and illogical, but it actually causes bugs, when people then use it like a normal allocation function, and it fails spectacularly on a NULL pointer: https://lore.kernel.org/all/20210912140820.GD25450@xsang-OptiPlex-9020/ or just random memory corruption if the debug checks don't catch it: https://lore.kernel.org/all/61ab2d0c-3313-aaab-514c-e15b7aa054a0@suse.cz/ I really don't want to apply patches that treat the symptoms, when the fundamental cause is this horribly confusing interface. I started out looking at just automating a sane replacement sequence, but because of this mix or virtual and physical addresses, and because people have used the "__pa()" macro that can take either a regular kernel pointer, or just the raw "unsigned long" address, it's all quite messy. So this just introduces a new saner interface for freeing a virtual address that was allocated using 'memblock_alloc()', and that was kept as a regular kernel pointer. And then it converts a couple of users that are obvious and easy to test, including the 'xbc_nodes' case in lib/bootconfig.c that caused problems. Reported-by: Nkernel test robot <oliver.sang@intel.com> Fixes: 40caa127 ("init: bootconfig: Remove all bootconfig data when the init memory is removed") Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Mike Rapoport <rppt@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 14 9月, 2021 2 次提交
-
-
由 Tony Luck 提交于
There are two cases for machine check recovery: 1) The machine check was triggered by ring3 (application) code. This is the simpler case. The machine check handler simply queues work to be executed on return to user. That code unmaps the page from all users and arranges to send a SIGBUS to the task that triggered the poison. 2) The machine check was triggered in kernel code that is covered by an exception table entry. In this case the machine check handler still queues a work entry to unmap the page, etc. but this will not be called right away because the #MC handler returns to the fix up code address in the exception table entry. Problems occur if the kernel triggers another machine check before the return to user processes the first queued work item. Specifically, the work is queued using the ->mce_kill_me callback structure in the task struct for the current thread. Attempting to queue a second work item using this same callback results in a loop in the linked list of work functions to call. So when the kernel does return to user, it enters an infinite loop processing the same entry for ever. There are some legitimate scenarios where the kernel may take a second machine check before returning to the user. 1) Some code (e.g. futex) first tries a get_user() with page faults disabled. If this fails, the code retries with page faults enabled expecting that this will resolve the page fault. 2) Copy from user code retries a copy in byte-at-time mode to check whether any additional bytes can be copied. On the other side of the fence are some bad drivers that do not check the return value from individual get_user() calls and may access multiple user addresses without noticing that some/all calls have failed. Fix by adding a counter (current->mce_count) to keep track of repeated machine checks before task_work() is called. First machine check saves the address information and calls task_work_add(). Subsequent machine checks before that task_work call back is executed check that the address is in the same page as the first machine check (since the callback will offline exactly one page). Expected worst case is four machine checks before moving on (e.g. one user access with page faults disabled, then a repeat to the same address with page faults enabled ... repeat in copy tail bytes). Just in case there is some code that loops forever enforce a limit of 10. [ bp: Massage commit message, drop noinstr, fix typo, extend panic messages. ] Fixes: 5567d11c ("x86/mce: Send #MC singal from task work") Signed-off-by: NTony Luck <tony.luck@intel.com> Signed-off-by: NBorislav Petkov <bp@suse.de> Cc: <stable@vger.kernel.org> Link: https://lkml.kernel.org/r/YT/IJ9ziLqmtqEPu@agluck-desk2.amr.corp.intel.com
-
由 Will Deacon 提交于
Commit 865c50e1 ("x86/uaccess: utilize CONFIG_CC_HAS_ASM_GOTO_OUTPUT") added an optimised version of __get_user_asm() for x86 using 'asm goto'. Like the non-optimised code, the 32-bit implementation of 64-bit get_user() expands to a pair of 32-bit accesses. Unlike the non-optimised code, the _original_ pointer is incremented to copy the high word instead of loading through a new pointer explicitly constructed to point at a 32-bit type. Consequently, if the pointer points at a 64-bit type then we end up loading the wrong data for the upper 32-bits. This was observed as a mount() failure in Android targeting i686 after b0cfcdd9 ("d_path: make 'prepend()' fill up the buffer exactly on overflow") because the call to copy_from_kernel_nofault() from prepend_copy() ends up in __get_kernel_nofault() and casts the source pointer to a 'u64 __user *'. An attempt to mount at "/debug_ramdisk" therefore ends up failing trying to mount "/debumdismdisk". Use the existing '__gu_ptr' source pointer to unsigned int for 32-bit __get_user_asm_u64() instead of the original pointer. Cc: Bill Wendling <morbo@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Peter Zijlstra <peterz@infradead.org> Reported-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org> Fixes: 865c50e1 ("x86/uaccess: utilize CONFIG_CC_HAS_ASM_GOTO_OUTPUT") Signed-off-by: NWill Deacon <will@kernel.org> Reviewed-by: NNick Desaulniers <ndesaulniers@google.com> Tested-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 11 9月, 2021 1 次提交
-
-
由 Wei Liu 提交于
It is not a good practice to allocate a cpumask on stack, given it may consume up to 1 kilobytes of stack space if the kernel is configured to have 8192 cpus. The internal helper functions __send_ipi_mask{,_ex} need to loop over the provided mask anyway, so it is not too difficult to skip `self' there. We can thus do away with the on-stack cpumask in hv_send_ipi_mask_allbutself. Adjust call sites of __send_ipi_mask as needed. Reported-by: NLinus Torvalds <torvalds@linux-foundation.org> Suggested-by: NMichael Kelley <mikelley@microsoft.com> Suggested-by: NLinus Torvalds <torvalds@linux-foundation.org> Fixes: 68bb7bfb ("X86/Hyper-V: Enable IPI enlightenments") Signed-off-by: NWei Liu <wei.liu@kernel.org> Reviewed-by: NMichael Kelley <mikelley@microsoft.com> Link: https://lore.kernel.org/r/20210910185714.299411-3-wei.liu@kernel.org
-
- 09 9月, 2021 5 次提交
-
-
由 Arnd Bergmann 提交于
All users of compat_alloc_user_space() and copy_in_user() have been removed from the kernel, only a few functions in sparc remain that can be changed to calling arch_copy_in_user() instead. Link: https://lkml.kernel.org/r/20210727144859.4150043-7-arnd@kernel.orgSigned-off-by: NArnd Bergmann <arnd@arndb.de> Reviewed-by: NChristoph Hellwig <hch@lst.de> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: "David S. Miller" <davem@davemloft.net> Cc: Eric Biederman <ebiederm@xmission.com> Cc: Feng Tang <feng.tang@intel.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Helge Deller <deller@gmx.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Paul Mackerras <paulus@samba.org> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Arnd Bergmann 提交于
These are all handled correctly when calling the native system call entry point, so remove the special cases. Link: https://lkml.kernel.org/r/20210727144859.4150043-6-arnd@kernel.orgSigned-off-by: NArnd Bergmann <arnd@arndb.de> Reviewed-by: NChristoph Hellwig <hch@lst.de> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: "David S. Miller" <davem@davemloft.net> Cc: Eric Biederman <ebiederm@xmission.com> Cc: Feng Tang <feng.tang@intel.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Helge Deller <deller@gmx.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Paul Mackerras <paulus@samba.org> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Mike Rapoport 提交于
Jiri Olsa reported a fault when running: # cat /proc/kallsyms | grep ksys_read ffffffff8136d580 T ksys_read # objdump -d --start-address=0xffffffff8136d580 --stop-address=0xffffffff8136d590 /proc/kcore /proc/kcore: file format elf64-x86-64 Segmentation fault general protection fault, probably for non-canonical address 0xf887ffcbff000: 0000 [#1] SMP PTI CPU: 12 PID: 1079 Comm: objdump Not tainted 5.14.0-rc5qemu+ #508 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.14.0-4.fc34 04/01/2014 RIP: 0010:kern_addr_valid Call Trace: read_kcore ? rcu_read_lock_sched_held ? rcu_read_lock_sched_held ? rcu_read_lock_sched_held ? trace_hardirqs_on ? rcu_read_lock_sched_held ? lock_acquire ? lock_acquire ? rcu_read_lock_sched_held ? lock_acquire ? rcu_read_lock_sched_held ? rcu_read_lock_sched_held ? rcu_read_lock_sched_held ? lock_release ? _raw_spin_unlock ? __handle_mm_fault ? rcu_read_lock_sched_held ? lock_acquire ? rcu_read_lock_sched_held ? lock_release proc_reg_read ? vfs_read vfs_read ksys_read do_syscall_64 entry_SYSCALL_64_after_hwframe The fault happens because kern_addr_valid() dereferences existent but not present PMD in the high kernel mappings. Such PMDs are created when free_kernel_image_pages() frees regions larger than 2Mb. In this case, a part of the freed memory is mapped with PMDs and the set_memory_np_noalias() -> ... -> __change_page_attr() sequence will mark the PMD as not present rather than wipe it completely. Have kern_addr_valid() check whether higher level page table entries are present before trying to dereference them to fix this issue and to avoid similar issues in the future. Stable backporting note: ------------------------ Note that the stable marking is for all active stable branches because there could be cases where pagetable entries exist but are not valid - see 9a14aefc ("x86: cpa, fix lookup_address"), for example. So make sure to be on the safe side here and use pXY_present() accessors rather than pXY_none() which could #GP when accessing pages in the direct map. Also see: c40a56a7 ("x86/mm/init: Remove freed kernel image areas from alias mapping") for more info. Reported-by: NJiri Olsa <jolsa@redhat.com> Signed-off-by: NMike Rapoport <rppt@linux.ibm.com> Signed-off-by: NBorislav Petkov <bp@suse.de> Reviewed-by: NDavid Hildenbrand <david@redhat.com> Acked-by: NDave Hansen <dave.hansen@intel.com> Tested-by: NJiri Olsa <jolsa@redhat.com> Cc: <stable@vger.kernel.org> # 4.4+ Link: https://lkml.kernel.org/r/20210819132717.19358-1-rppt@kernel.org
-
由 Zenghui Yu 提交于
This CONFIG option was removed in commit 278b13ce ("Input: remove input_polled_dev implementation") so there's no point to keep it in defconfigs any longer. Get rid of the leftover for all arches. Link: https://lkml.kernel.org/r/20210726074741.1062-1-yuzenghui@huawei.comSigned-off-by: NZenghui Yu <yuzenghui@huawei.com> Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 David Hildenbrand 提交于
The parameter is unused, let's remove it. Link: https://lkml.kernel.org/r/20210712124052.26491-3-david@redhat.comSigned-off-by: NDavid Hildenbrand <david@redhat.com> Acked-by: NCatalin Marinas <catalin.marinas@arm.com> Acked-by: Michael Ellerman <mpe@ellerman.id.au> [powerpc] Acked-by: Heiko Carstens <hca@linux.ibm.com> [s390] Reviewed-by: NPankaj Gupta <pankaj.gupta@ionos.com> Reviewed-by: NOscar Salvador <osalvador@suse.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will@kernel.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Cc: Rich Felker <dalias@libc.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Baoquan He <bhe@redhat.com> Cc: Laurent Dufour <ldufour@linux.ibm.com> Cc: Sergei Trofimovich <slyfox@gentoo.org> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Michel Lespinasse <michel@lespinasse.org> Cc: Christophe Leroy <christophe.leroy@c-s.fr> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> Cc: Thiago Jung Bauermann <bauerman@linux.ibm.com> Cc: Joe Perches <joe@perches.com> Cc: Pierre Morel <pmorel@linux.ibm.com> Cc: Jia He <justin.he@arm.com> Cc: Anton Blanchard <anton@ozlabs.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Jiang <dave.jiang@intel.com> Cc: Jason Wang <jasowang@redhat.com> Cc: Len Brown <lenb@kernel.org> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Nathan Lynch <nathanl@linux.ibm.com> Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com> Cc: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: Scott Cheloha <cheloha@linux.ibm.com> Cc: Vishal Verma <vishal.l.verma@intel.com> Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Yang <richard.weiyang@linux.alibaba.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 06 9月, 2021 12 次提交
-
-
由 Zelin Deng 提交于
When MSR_IA32_TSC_ADJUST is written by guest due to TSC ADJUST feature especially there's a big tsc warp (like a new vCPU is hot-added into VM which has been up for a long time), tsc_offset is added by a large value then go back to guest. This causes system time jump as tsc_timestamp is not adjusted in the meantime and pvclock monotonic character. To fix this, just notify kvm to update vCPU's guest time before back to guest. Cc: stable@vger.kernel.org Signed-off-by: NZelin Deng <zelin.deng@linux.alibaba.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com> Message-Id: <1619576521-81399-2-git-send-email-zelin.deng@linux.alibaba.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Paolo Bonzini 提交于
It is reasonable for these functions to be used only in some configurations, for example only if the host is 64-bits (and therefore supports 64-bit guests). It is also reasonable to keep the role_regs and role accessors in sync even though some of the accessors may be used only for one of the two sets (as is the case currently for CR4.LA57).. Because clang reports warnings for unused inlines declared in a .c file, mark both sets of accessors as __maybe_unused. Reported-by: Nkernel test robot <lkp@intel.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Lai Jiangshan 提交于
Commit f4e61f0c ("x86/kvm: Fix broken irq restoration in kvm_wait") replaced "local_irq_restore() when IRQ enabled" with "local_irq_enable() when IRQ enabled" to suppress a warnning. Although there is no similar debugging warnning for doing local_irq_enable() when IRQ enabled as doing local_irq_restore() in the same IRQ situation. But doing local_irq_enable() when IRQ enabled is no less broken as doing local_irq_restore() and we'd better avoid it. Cc: Mark Rutland <mark.rutland@arm.com> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: NLai Jiangshan <laijs@linux.alibaba.com> Message-Id: <20210814035129.154242-1-jiangshanlai@gmail.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Sean Christopherson 提交于
Move "lpage_disallowed_link" out of the first 64 bytes, i.e. out of the first cache line, of kvm_mmu_page so that "spt" and to a lesser extent "gfns" land in the first cache line. "lpage_disallowed_link" is accessed relatively infrequently compared to "spt", which is accessed any time KVM is walking and/or manipulating the shadow page tables. No functional change intended. Signed-off-by: NSean Christopherson <seanjc@google.com> Message-Id: <20210901221023.1303578-4-seanjc@google.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Sean Christopherson 提交于
Move "tdp_mmu_page" into the 1-byte void left by the recently removed "mmio_cached" so that it resides in the first 64 bytes of kvm_mmu_page, i.e. in the same cache line as the most commonly accessed fields. Don't bother wrapping tdp_mmu_page in CONFIG_X86_64, including the field in 32-bit builds doesn't affect the size of kvm_mmu_page, and a future patch can always wrap the field in the unlikely event KVM gains a 1-byte flag that is 32-bit specific. Note, the size of kvm_mmu_page is also unchanged on CONFIG_X86_64=y due to it previously sharing an 8-byte chunk with write_flooding_count. No functional change intended. Signed-off-by: NSean Christopherson <seanjc@google.com> Message-Id: <20210901221023.1303578-3-seanjc@google.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Sean Christopherson 提交于
Revert a misguided illegal GPA check when "translating" a non-nested GPA. The check is woefully incomplete as it does not fill in @exception as expected by all callers, which leads to KVM attempting to inject a bogus exception, potentially exposing kernel stack information in the process. WARNING: CPU: 0 PID: 8469 at arch/x86/kvm/x86.c:525 exception_type+0x98/0xb0 arch/x86/kvm/x86.c:525 CPU: 1 PID: 8469 Comm: syz-executor531 Not tainted 5.14.0-rc7-syzkaller #0 RIP: 0010:exception_type+0x98/0xb0 arch/x86/kvm/x86.c:525 Call Trace: x86_emulate_instruction+0xef6/0x1460 arch/x86/kvm/x86.c:7853 kvm_mmu_page_fault+0x2f0/0x1810 arch/x86/kvm/mmu/mmu.c:5199 handle_ept_misconfig+0xdf/0x3e0 arch/x86/kvm/vmx/vmx.c:5336 __vmx_handle_exit arch/x86/kvm/vmx/vmx.c:6021 [inline] vmx_handle_exit+0x336/0x1800 arch/x86/kvm/vmx/vmx.c:6038 vcpu_enter_guest+0x2a1c/0x4430 arch/x86/kvm/x86.c:9712 vcpu_run arch/x86/kvm/x86.c:9779 [inline] kvm_arch_vcpu_ioctl_run+0x47d/0x1b20 arch/x86/kvm/x86.c:10010 kvm_vcpu_ioctl+0x49e/0xe50 arch/x86/kvm/../../../virt/kvm/kvm_main.c:3652 The bug has escaped notice because practically speaking the GPA check is useless. The GPA check in question only comes into play when KVM is walking guest page tables (or "translating" CR3), and KVM already handles illegal GPA checks by setting reserved bits in rsvd_bits_mask for each PxE, or in the case of CR3 for loading PTDPTRs, manually checks for an illegal CR3. This particular failure doesn't hit the existing reserved bits checks because syzbot sets guest.MAXPHYADDR=1, and IA32 architecture simply doesn't allow for such an absurd MAXPHYADDR, e.g. 32-bit paging doesn't define any reserved PA bits checks, which KVM emulates by only incorporating the reserved PA bits into the "high" bits, i.e. bits 63:32. Simply remove the bogus check. There is zero meaningful value and no architectural justification for supporting guest.MAXPHYADDR < 32, and properly filling the exception would introduce non-trivial complexity. This reverts commit ec7771ab. Fixes: ec7771ab ("KVM: x86: mmu: Add guest physical address check in translate_gpa()") Cc: stable@vger.kernel.org Reported-by: syzbot+200c08e88ae818f849ce@syzkaller.appspotmail.com Signed-off-by: NSean Christopherson <seanjc@google.com> Message-Id: <20210831164224.1119728-2-seanjc@google.com> Reviewed-by: NVitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Jia He 提交于
After reverting and restoring the fast tlb invalidation patch series, the mmio_cached is not removed. Hence a unused field is left in kvm_mmu_page. Cc: Sean Christopherson <seanjc@google.com> Signed-off-by: NJia He <justin.he@arm.com> Message-Id: <20210830145336.27183-1-justin.he@arm.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Eduardo Habkost 提交于
Support for 710 VCPUs was tested by Red Hat since RHEL-8.4, so increase KVM_SOFT_MAX_VCPUS to 710. Signed-off-by: NEduardo Habkost <ehabkost@redhat.com> Message-Id: <20210903211600.2002377-4-ehabkost@redhat.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Eduardo Habkost 提交于
Increase KVM_MAX_VCPUS to 1024, so we can test larger VMs. I'm not changing KVM_SOFT_MAX_VCPUS yet because I'm afraid it might involve complicated questions around the meaning of "supported" and "recommended" in the upstream tree. KVM_SOFT_MAX_VCPUS will be changed in a separate patch. For reference, visible effects of this change are: - KVM_CAP_MAX_VCPUS will now return 1024 (of course) - Default value for CPUID[HYPERV_CPUID_IMPLEMENT_LIMITS (00x40000005)].EAX will now be 1024 - KVM_MAX_VCPU_ID will change from 1151 to 4096 - Size of struct kvm will increase from 19328 to 22272 bytes (in x86_64) - Size of struct kvm_ioapic will increase from 1780 to 5084 bytes (in x86_64) - Bitmap stack variables that will grow: - At kvm_hv_flush_tlb() kvm_hv_send_ipi(), vp_bitmap[] and vcpu_bitmap[] will now be 128 bytes long - vcpu_bitmap at bioapic_write_indirect() will be 128 bytes long once patch "KVM: x86: Fix stack-out-of-bounds memory access from ioapic_write_indirect()" is applied Signed-off-by: NEduardo Habkost <ehabkost@redhat.com> Message-Id: <20210903211600.2002377-3-ehabkost@redhat.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Eduardo Habkost 提交于
Instead of requiring KVM_MAX_VCPU_ID to be manually increased every time we increase KVM_MAX_VCPUS, set it to 4*KVM_MAX_VCPUS. This should be enough for CPU topologies where Cores-per-Package and Packages-per-Socket are not powers of 2. In practice, this increases KVM_MAX_VCPU_ID from 1023 to 1152. The only side effect of this change is making some fields in struct kvm_ioapic larger, increasing the struct size from 1628 to 1780 bytes (in x86_64). Signed-off-by: NEduardo Habkost <ehabkost@redhat.com> Message-Id: <20210903211600.2002377-2-ehabkost@redhat.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Maxim Levitsky 提交于
If we are emulating an invalid guest state, we don't have a correct exit reason, and thus we shouldn't do anything in this function. Signed-off-by: NMaxim Levitsky <mlevitsk@redhat.com> Message-Id: <20210826095750.1650467-2-mlevitsk@redhat.com> Cc: stable@vger.kernel.org Fixes: 95b5a48c ("KVM: VMX: Handle NMIs, #MCs and async #PFs in common irqs-disabled fn", 2019-06-18) Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Sean Christopherson 提交于
Include pml5_root in the set of special roots if and only if the host, and thus NPT, is using 5-level paging. mmu_alloc_special_roots() expects special roots to be allocated as a bundle, i.e. they're either all valid or all NULL. But for pml5_root, that expectation only holds true if the host uses 5-level paging, which causes KVM to WARN about pml5_root being NULL when the other special roots are valid. The silver lining of 4-level vs. 5-level NPT being tied to the host kernel's paging level is that KVM's shadow root level is constant; unlike VMX's EPT, KVM can't choose 4-level NPT based on guest.MAXPHYADDR. That means KVM can still expect pml5_root to be bundled with the other special roots, it just needs to be conditioned on the shadow root level. Fixes: cb0f722a ("KVM: x86/mmu: Support shadowing NPT when 5-level paging is enabled in host") Reported-by: NMaxim Levitsky <mlevitsk@redhat.com> Reviewed-by: NMaxim Levitsky <mlevitsk@redhat.com> Signed-off-by: NSean Christopherson <seanjc@google.com> Message-Id: <20210824005824.205536-1-seanjc@google.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
- 04 9月, 2021 5 次提交
-
-
由 Suren Baghdasaryan 提交于
Split off from prev patch in the series that implements the syscall. Link: https://lkml.kernel.org/r/20210809185259.405936-2-surenb@google.comSigned-off-by: NSuren Baghdasaryan <surenb@google.com> Acked-by: NGeert Uytterhoeven <geert@linux-m68k.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Christian Brauner <christian.brauner@ubuntu.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: David Hildenbrand <david@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Florian Weimer <fweimer@redhat.com> Cc: Jan Engelhardt <jengelh@inai.de> Cc: Jann Horn <jannh@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Rik van Riel <riel@surriel.com> Cc: Roman Gushchin <guro@fb.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Tim Murray <timmurray@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Mike Rapoport 提交于
There are a lot of uses of memblock_find_in_range() along with memblock_reserve() from the times memblock allocation APIs did not exist. memblock_find_in_range() is the very core of memblock allocations, so any future changes to its internal behaviour would mandate updates of all the users outside memblock. Replace the calls to memblock_find_in_range() with an equivalent calls to memblock_phys_alloc() and memblock_phys_alloc_range() and make memblock_find_in_range() private method of memblock. This simplifies the callers, ensures that (unlikely) errors in memblock_reserve() are handled and improves maintainability of memblock_find_in_range(). Link: https://lkml.kernel.org/r/20210816122622.30279-1-rppt@kernel.orgSigned-off-by: NMike Rapoport <rppt@linux.ibm.com> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> [arm64] Acked-by: NKirill A. Shutemov <kirill.shtuemov@linux.intel.com> Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> [ACPI] Acked-by: NRussell King (Oracle) <rmk+kernel@armlinux.org.uk> Acked-by: Nick Kossifidis <mick@ics.forth.gr> [riscv] Tested-by: NGuenter Roeck <linux@roeck-us.net> Acked-by: NRob Herring <robh@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Vasily Averin 提交于
Each task can request own LDT and force the kernel to allocate up to 64Kb memory per-mm. There are legitimate workloads with hundreds of processes and there can be hundreds of workloads running on large machines. The unaccounted memory can cause isolation issues between the workloads particularly on highly utilized machines. It makes sense to account for this objects to restrict the host's memory consumption from inside the memcg-limited container. Link: https://lkml.kernel.org/r/38010594-50fe-c06d-7cb0-d1f77ca422f3@virtuozzo.comSigned-off-by: NVasily Averin <vvs@virtuozzo.com> Acked-by: NBorislav Petkov <bp@suse.de> Reviewed-by: NShakeel Butt <shakeelb@google.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Andrei Vagin <avagin@gmail.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Christian Brauner <christian.brauner@ubuntu.com> Cc: Dmitry Safonov <0x7f454c46@gmail.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: "J. Bruce Fields" <bfields@fieldses.org> Cc: Jeff Layton <jlayton@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Jiri Slaby <jirislaby@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kirill Tkhai <ktkhai@virtuozzo.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Roman Gushchin <guro@fb.com> Cc: Serge Hallyn <serge@hallyn.com> Cc: Tejun Heo <tj@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Yutian Yang <nglaive@gmail.com> Cc: Zefan Li <lizefan.x@bytedance.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 David Hildenbrand 提交于
At exec time when we mmap the new executable via MAP_DENYWRITE we have it opened via do_open_execat() and already deny_write_access()'ed the file successfully. Once exec completes, we allow_write_acces(); however, we set mm->exe_file in begin_new_exec() via set_mm_exe_file() and also deny_write_access() as long as mm->exe_file remains set. We'll effectively deny write access to our executable via mm->exe_file until mm->exe_file is changed -- when the process is removed, on new exec, or via sys_prctl(PR_SET_MM_MAP/EXE_FILE). Let's remove all usage of MAP_DENYWRITE, it's no longer necessary for mm->exe_file. In case of an elf interpreter, we'll now only deny write access to the file during exec. This is somewhat okay, because the interpreter behaves (and sometime is) a shared library; all shared libraries, especially the ones loaded directly in user space like via dlopen() won't ever be mapped via MAP_DENYWRITE, because we ignore that from user space completely; these shared libraries can always be modified while mapped and executed. Let's only special-case the main executable, denying write access while being executed by a process. This can be considered a minor user space visible change. While this is a cleanup, it also fixes part of a problem reported with VM_DENYWRITE on overlayfs, as VM_DENYWRITE is effectively unused with this patch and will be removed next: "Overlayfs did not honor positive i_writecount on realfile for VM_DENYWRITE mappings." [1] [1] https://lore.kernel.org/r/YNHXzBgzRrZu1MrD@miu.piliscsaba.redhat.com/Reported-by: NChengguang Xu <cgxu519@mykernel.net> Acked-by: N"Eric W. Biederman" <ebiederm@xmission.com> Acked-by: NChristian König <christian.koenig@amd.com> Signed-off-by: NDavid Hildenbrand <david@redhat.com>
-
由 David Hildenbrand 提交于
uselib() is the legacy systemcall for loading shared libraries. Nowadays, applications use dlopen() to load shared libraries, completely implemented in user space via mmap(). For example, glibc uses MAP_COPY to mmap shared libraries. While this maps to MAP_PRIVATE | MAP_DENYWRITE on Linux, Linux ignores any MAP_DENYWRITE specification from user space in mmap. With this change, all remaining in-tree users of MAP_DENYWRITE use it to map an executable. We will be able to open shared libraries loaded via uselib() writable, just as we already can via dlopen() from user space. This is one step into the direction of removing MAP_DENYWRITE from the kernel. This can be considered a minor user space visible change. Acked-by: N"Eric W. Biederman" <ebiederm@xmission.com> Acked-by: NChristian König <christian.koenig@amd.com> Signed-off-by: NDavid Hildenbrand <david@redhat.com>
-
- 03 9月, 2021 2 次提交
-
-
由 Nick Desaulniers 提交于
As noted in the comment, -mtune= has been supported since GCC 3.4. The minimum required version of GCC to build the kernel (as specified in Documentation/process/changes.rst) is GCC 4.9. tune is not immediately expanded. Instead it defines a macro that will test via cc-option later values for -mtune=. But we can skip the test whether to use -mtune= vs. -mcpu=. Signed-off-by: NNick Desaulniers <ndesaulniers@google.com> Reviewed-by: NNathan Chancellor <nathan@kernel.org> Reviewed-by: NMiguel Ojeda <ojeda@kernel.org> Signed-off-by: NMasahiro Yamada <masahiroy@kernel.org>
-
由 Masahiro Yamada 提交于
Add FORCE so that if_changed can detect the command line change. Signed-off-by: NMasahiro Yamada <masahiroy@kernel.org>
-