1. 12 September 2020 (8 commits)
    • KVM: VMX: Don't freeze guest when event delivery causes an APIC-access exit · 99b82a14
      Wanpeng Li authored
      According to SDM 27.2.4, event delivery causes an APIC-access VM exit.
      Don't report an internal error and freeze the guest when event delivery
      causes an APIC-access exit; it is handleable and the event will be
      re-injected during the next vmentry.
      Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
      Message-Id: <1597827327-25055-2-git-send-email-wanpengli@tencent.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      99b82a14
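      A rough sketch of the shape of the fix (names follow the mainline
      vmx_handle_exit() of that era; not the verbatim hunk): the APIC-access
      exit reason is added to the set of exits that are tolerated while event
      delivery is in flight, so KVM no longer reports an internal error for it.
      
      	if ((vectoring_info & VECTORING_INFO_VALID_MASK) &&
      	    (exit_reason != EXIT_REASON_EXCEPTION_NMI &&
      	     exit_reason != EXIT_REASON_EPT_VIOLATION &&
      	     exit_reason != EXIT_REASON_PML_FULL &&
      	     exit_reason != EXIT_REASON_APIC_ACCESS &&	/* now tolerated */
      	     exit_reason != EXIT_REASON_TASK_SWITCH)) {
      		/* ... report KVM_EXIT_INTERNAL_ERROR as before ... */
      	}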
    • KVM: SVM: avoid emulation with stale next_rip · e42c6828
      Wanpeng Li authored
      svm->next_rip is reset in svm_vcpu_run() only after calling
      svm_exit_handlers_fastpath(), which will cause SVM's
      skip_emulated_instruction() to write a stale RIP.
      
      We can move svm_exit_handlers_fastpath towards the end of
      svm_vcpu_run().  To align VMX with SVM, keep svm_complete_interrupts()
      close as well.
      Suggested-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Cc: Paul K. <kronenpj@kronenpj.dyndns.org>
      Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
      [Also move vmcb_mark_all_clean before any possible write to the VMCB.
       - Paolo]
      e42c6828
    • KVM: x86: always allow writing '0' to MSR_KVM_ASYNC_PF_EN · d831de17
      Vitaly Kuznetsov authored
      Even without in-kernel LAPIC we should allow writing '0' to
      MSR_KVM_ASYNC_PF_EN as we're not enabling the mechanism. In
      particular, QEMU with 'kernel-irqchip=off' fails to start
      a guest with
      
      qemu-system-x86_64: error: failed to set MSR 0x4b564d02 to 0x0
      
      Fixes: 9d3c447c ("KVM: X86: Fix async pf caused null-ptr-deref")
      Reported-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
      Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Message-Id: <20200911093147.484565-1-vkuznets@redhat.com>
      [Actually commit the version proposed by Sean Christopherson. - Paolo]
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      d831de17
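      The resulting check in kvm_pv_enable_async_pf() looks roughly like the
      sketch below (function names as in mainline at the time; not the verbatim
      hunk): a write that disables the mechanism always succeeds, and the
      in-kernel LAPIC is only required when async PF is actually being enabled.
      
      	vcpu->arch.apf.msr_en_val = data;
      	if (!kvm_pv_async_pf_enabled(vcpu)) {
      		/* Disabling always succeeds, even with kernel-irqchip=off. */
      		kvm_clear_async_pf_completion_queue(vcpu);
      		kvm_async_pf_hash_reset(vcpu);
      		return 0;
      	}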
    • KVM: SVM: Periodically schedule when unregistering regions on destroy · 7be74942
      David Rientjes authored
      There may be many encrypted regions that need to be unregistered when a
      SEV VM is destroyed.  This can lead to soft lockups.  For example, on a
      host running 4.15:
      
      watchdog: BUG: soft lockup - CPU#206 stuck for 11s! [t_virtual_machi:194348]
      CPU: 206 PID: 194348 Comm: t_virtual_machi
      RIP: 0010:free_unref_page_list+0x105/0x170
      ...
      Call Trace:
       [<0>] release_pages+0x159/0x3d0
       [<0>] sev_unpin_memory+0x2c/0x50 [kvm_amd]
       [<0>] __unregister_enc_region_locked+0x2f/0x70 [kvm_amd]
       [<0>] svm_vm_destroy+0xa9/0x200 [kvm_amd]
       [<0>] kvm_arch_destroy_vm+0x47/0x200
       [<0>] kvm_put_kvm+0x1a8/0x2f0
       [<0>] kvm_vm_release+0x25/0x30
       [<0>] do_exit+0x335/0xc10
       [<0>] do_group_exit+0x3f/0xa0
       [<0>] get_signal+0x1bc/0x670
       [<0>] do_signal+0x31/0x130
      
      Although the CLFLUSH is no longer issued on every encrypted region to be
      unregistered, there are no other changes that can prevent soft lockups for
      very large SEV VMs in the latest kernel.
      
      Periodically schedule if necessary.  This still holds kvm->lock across the
      resched, but since this only happens when the VM is destroyed this is
      assumed to be acceptable.
      Signed-off-by: David Rientjes <rientjes@google.com>
      Message-Id: <alpine.DEB.2.23.453.2008251255240.2987727@chino.kir.corp.google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      7be74942
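      The shape of the fix in the SEV teardown path is roughly (a sketch, not
      the verbatim hunk):
      
      	list_for_each_safe(pos, q, head) {
      		__unregister_enc_region_locked(kvm,
      			list_entry(pos, struct enc_region, list));
      		/* Avoid soft lockups when tearing down huge SEV guests. */
      		cond_resched();
      	}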
    • kvm x86/mmu: use KVM_REQ_MMU_SYNC to sync when needed · f6f6195b
      Lai Jiangshan authored
      When kvm_mmu_get_page() gets a page with unsynced children, the spt
      pagetable is unsynchronized with the guest pagetable. But the
      guest might not issue a "flush" operation on it when the pagetable
      entry is changed from zero, among other cases. The hypervisor has the
      responsibility to synchronize the pagetables.
      
      KVM behaved as above for many years, but commit 8c8560b8
      ("KVM: x86/mmu: Use KVM_REQ_TLB_FLUSH_CURRENT for MMU specific flushes")
      inadvertently included a line of code to change it without giving any
      reason in the changelog. It is clear that the commit's intention was to
      change KVM_REQ_TLB_FLUSH -> KVM_REQ_TLB_FLUSH_CURRENT, so we don't
      needlessly flush other contexts; however, one of the hunks changed
      a nearby KVM_REQ_MMU_SYNC instead.  This patch changes it back.
      
      Link: https://lore.kernel.org/lkml/20200320212833.3507-26-sean.j.christopherson@intel.com/
      Cc: Sean Christopherson <sean.j.christopherson@intel.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
      Message-Id: <20200902135421.31158-1-jiangshanlai@gmail.com>
      Fixes: 8c8560b8 ("KVM: x86/mmu: Use KVM_REQ_TLB_FLUSH_CURRENT for MMU specific flushes")
      Cc: stable@vger.kernel.org
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      f6f6195b
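      The fix itself is a one-line revert inside kvm_mmu_get_page(), restoring
      the original request (sketch):
      
      	if (sp->unsync_children)
      		kvm_make_request(KVM_REQ_MMU_SYNC, vcpu);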
    • KVM: nVMX: Fix the update value of nested load IA32_PERF_GLOBAL_CTRL control · c6b177a3
      Chenyi Qiang authored
      A minor fix for the update of VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL field
      in exit_ctls_high.
      
      Fixes: 03a8871a ("KVM: nVMX: Expose load IA32_PERF_GLOBAL_CTRL
      VM-{Entry,Exit} control")
      Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
      Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
      Message-Id: <20200828085622.8365-5-chenyi.qiang@intel.com>
      Reviewed-by: Jim Mattson <jmattson@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      c6b177a3
    • KVM: Check the allocation of pv cpu mask · 0f990222
      Haiwei Li authored
      Check the allocation of the per-cpu __pv_cpu_mask. Initialize ops only
      when the allocation succeeds.
      Signed-off-by: Haiwei Li <lihaiwei@tencent.com>
      Message-Id: <d59f05df-e6d3-3d31-a036-cc25a2b2f33f@gmail.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      0f990222
    • KVM: nVMX: Update VMCS02 when L2 PAE PDPTE updates detected · 43fea4e4
      Peter Shier authored
      When L2 uses PAE, L0 intercepts of L2 writes to CR0/CR3/CR4 call
      load_pdptrs to read the possibly updated PDPTEs from the guest
      physical address referenced by CR3.  It loads them into
      vcpu->arch.walk_mmu->pdptrs and sets VCPU_EXREG_PDPTR in
      vcpu->arch.regs_dirty.
      
      At the subsequent assumed reentry into L2, the mmu will call
      vmx_load_mmu_pgd which calls ept_load_pdptrs. ept_load_pdptrs sees
      VCPU_EXREG_PDPTR set in vcpu->arch.regs_dirty and loads
      VMCS02.GUEST_PDPTRn from vcpu->arch.walk_mmu->pdptrs[]. This all works
      if the L2 CRn write intercept always resumes L2.
      
      The resume path calls vmx_check_nested_events which checks for
      exceptions, MTF, and expired VMX preemption timers. If
      vmx_check_nested_events finds any of these conditions pending it will
      reflect the corresponding exit into L1. Live migration at this point
      would also cause a missed immediate reentry into L2.
      
      After L1 exits, vmx_vcpu_run calls vmx_register_cache_reset which
      clears VCPU_EXREG_PDPTR in vcpu->arch.regs_dirty.  When L2 next
      resumes, ept_load_pdptrs finds VCPU_EXREG_PDPTR clear in
      vcpu->arch.regs_dirty and does not load VMCS02.GUEST_PDPTRn from
      vcpu->arch.walk_mmu->pdptrs[]. prepare_vmcs02 will then load
      VMCS02.GUEST_PDPTRn from vmcs12->pdptr0/1/2/3 which contain the stale
      values stored at last L2 exit. A repro of this bug showed L2 entering
      triple fault immediately due to the bad VMCS02.GUEST_PDPTRn values.
      
      When L2 is in PAE paging mode, add a call to ept_load_pdptrs before
      leaving L2. This will update VMCS02.GUEST_PDPTRn if they are dirty in
      vcpu->arch.walk_mmu->pdptrs[].
      
      Tested:
      kvm-unit-tests with new directed test: vmx_mtf_pdpte_test.
      Verified that test fails without the fix.
      
      Also ran Google internal VMM with an Ubuntu 16.04 4.4.0-83 guest running a
      custom hypervisor with a 32-bit Windows XP L2 guest using PAE. Prior to fix
      would repro readily. Ran 14 simultaneous L2s for 140 iterations with no
      failures.
      Signed-off-by: Peter Shier <pshier@google.com>
      Reviewed-by: Jim Mattson <jmattson@google.com>
      Message-Id: <20200820230545.2411347-1-pshier@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      43fea4e4
  2. 22 August 2020 (1 commit)
    • KVM: Pass MMU notifier range flags to kvm_unmap_hva_range() · fdfe7cbd
      Will Deacon authored
      The 'flags' field of 'struct mmu_notifier_range' is used to indicate
      whether invalidate_range_{start,end}() are permitted to block. In the
      case of kvm_mmu_notifier_invalidate_range_start(), this field is not
      forwarded on to the architecture-specific implementation of
      kvm_unmap_hva_range() and therefore the backend cannot sensibly decide
      whether or not to block.
      
      Add an extra 'flags' parameter to kvm_unmap_hva_range() so that
      architectures are aware as to whether or not they are permitted to block.
      
      Cc: <stable@vger.kernel.org>
      Cc: Marc Zyngier <maz@kernel.org>
      Cc: Suzuki K Poulose <suzuki.poulose@arm.com>
      Cc: James Morse <james.morse@arm.com>
      Signed-off-by: Will Deacon <will@kernel.org>
      Message-Id: <20200811102725.7121-2-will@kernel.org>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      fdfe7cbd
  3. 21 August 2020 (1 commit)
  4. 18 August 2020 (3 commits)
    • kvm: x86: Toggling CR4.PKE does not load PDPTEs in PAE mode · cb957adb
      Jim Mattson authored
      See the SDM, volume 3, section 4.4.1:
      
      If PAE paging would be in use following an execution of MOV to CR0 or
      MOV to CR4 (see Section 4.1.1) and the instruction is modifying any of
      CR0.CD, CR0.NW, CR0.PG, CR4.PAE, CR4.PGE, CR4.PSE, or CR4.SMEP; then
      the PDPTEs are loaded from the address in CR3.
      
      Fixes: b9baba86 ("KVM, pkeys: expose CPUID/CR4 to guest")
      Cc: Huaitong Han <huaitong.han@intel.com>
      Signed-off-by: Jim Mattson <jmattson@google.com>
      Reviewed-by: Peter Shier <pshier@google.com>
      Reviewed-by: Oliver Upton <oupton@google.com>
      Message-Id: <20200817181655.3716509-1-jmattson@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      cb957adb
    • kvm: x86: Toggling CR4.SMAP does not load PDPTEs in PAE mode · 427890af
      Jim Mattson authored
      See the SDM, volume 3, section 4.4.1:
      
      If PAE paging would be in use following an execution of MOV to CR0 or
      MOV to CR4 (see Section 4.1.1) and the instruction is modifying any of
      CR0.CD, CR0.NW, CR0.PG, CR4.PAE, CR4.PGE, CR4.PSE, or CR4.SMEP; then
      the PDPTEs are loaded from the address in CR3.
      
      Fixes: 0be0226f ("KVM: MMU: fix SMAP virtualization")
      Cc: Xiao Guangrong <guangrong.xiao@linux.intel.com>
      Signed-off-by: Jim Mattson <jmattson@google.com>
      Reviewed-by: Peter Shier <pshier@google.com>
      Reviewed-by: Oliver Upton <oupton@google.com>
      Message-Id: <20200817181655.3716509-2-jmattson@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      427890af
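      Both of these patches boil down to dropping bits from the mask that
      kvm_set_cr4() uses to decide when the PDPTEs must be reloaded; the
      resulting mask is roughly (a sketch, per SDM 4.4.1):
      
      	unsigned long pdptr_bits = X86_CR4_PGE | X86_CR4_PSE | X86_CR4_PAE |
      				   X86_CR4_SMEP;
      	/* CR4.SMAP and CR4.PKE no longer force a PDPTE reload. */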
    • KVM: x86: fix access code passed to gva_to_gpa · 19cf4b7e
      Paolo Bonzini authored
      The PK bit of the error code is computed dynamically in permission_fault
      and therefore need not be passed to gva_to_gpa: only the access bits
      (fetch, user, write) need to be passed down.
      
      Not doing so causes a splat in the pku test:
      
         WARNING: CPU: 25 PID: 5465 at arch/x86/kvm/mmu.h:197 paging64_walk_addr_generic+0x594/0x750 [kvm]
         Hardware name: Intel Corporation WilsonCity/WilsonCity, BIOS WLYDCRB1.SYS.0014.D62.2001092233 01/09/2020
         RIP: 0010:paging64_walk_addr_generic+0x594/0x750 [kvm]
         Code: <0f> 0b e9 db fe ff ff 44 8b 43 04 4c 89 6c 24 30 8b 13 41 39 d0 89
         RSP: 0018:ff53778fc623fb60 EFLAGS: 00010202
         RAX: 0000000000000001 RBX: ff53778fc623fbf0 RCX: 0000000000000007
         RDX: 0000000000000001 RSI: 0000000000000002 RDI: ff4501efba818000
         RBP: 0000000000000020 R08: 0000000000000005 R09: 00000000004000e7
         R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000007
         R13: ff4501efba818388 R14: 10000000004000e7 R15: 0000000000000000
         FS:  00007f2dcf31a700(0000) GS:ff4501f1c8040000(0000) knlGS:0000000000000000
         CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
         CR2: 0000000000000000 CR3: 0000001dea475005 CR4: 0000000000763ee0
         DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
         DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
         PKRU: 55555554
         Call Trace:
          paging64_gva_to_gpa+0x3f/0xb0 [kvm]
          kvm_fixup_and_inject_pf_error+0x48/0xa0 [kvm]
          handle_exception_nmi+0x4fc/0x5b0 [kvm_intel]
          kvm_arch_vcpu_ioctl_run+0x911/0x1c10 [kvm]
          kvm_vcpu_ioctl+0x23e/0x5d0 [kvm]
          ksys_ioctl+0x92/0xb0
          __x64_sys_ioctl+0x16/0x20
          do_syscall_64+0x3e/0xb0
          entry_SYSCALL_64_after_hwframe+0x44/0xa9
         ---[ end trace d17eb998aee991da ]---
      Reported-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Fixes: 89786147 ("KVM: x86: Add helper functions for illegal GPA checking and page fault injection")
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      19cf4b7e
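      The fix masks the error code down to the access bits before handing it to
      the page-table walker; kvm_fixup_and_inject_pf_error() ends up looking
      roughly like this (a sketch, not the verbatim hunk):
      
      	u64 access = error_code &
      		(PFERR_WRITE_MASK | PFERR_FETCH_MASK | PFERR_USER_MASK);
      
      	if (!(error_code & PFERR_PRESENT_MASK) ||
      	    vcpu->arch.walk_mmu->gva_to_gpa(vcpu, gva, access, &fault) != UNMAPPED_GVA) {
      		/* ... inject the page fault as before ... */
      	}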
  5. 15 August 2020 (2 commits)
    • all arch: remove system call sys_sysctl · 88db0aa2
      Xiaoming Ni authored
      Since commit 61a47c1a ("sysctl: Remove the sysctl system call"),
      sys_sysctl is actually unavailable: any input can only return an error.
      
      We have been warning about people using the sysctl system call for years
      and believe there are no more users.  Even if there are users of this
      interface if they have not complained or fixed their code by now they
      probably are not going to, so there is no point in warning them any
      longer.
      
      So completely remove sys_sysctl on all architectures.
      
      [nixiaoming@huawei.com: s390: fix build error for sys_call_table_emu]
      Link: http://lkml.kernel.org/r/20200618141426.16884-1-nixiaoming@huawei.com
      Signed-off-by: Xiaoming Ni <nixiaoming@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Will Deacon <will@kernel.org>		[arm/arm64]
      Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Aleksa Sarai <cyphar@cyphar.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Bin Meng <bin.meng@windriver.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: chenzefeng <chenzefeng2@huawei.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Christian Brauner <christian@brauner.io>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: David Howells <dhowells@redhat.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Diego Elio Pettenò <flameeyes@flameeyes.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Dominik Brodowski <linux@dominikbrodowski.net>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Helge Deller <deller@gmx.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Iurii Zaikin <yzaikin@google.com>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Kars de Jong <jongk@linux-m68k.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Krzysztof Kozlowski <krzk@kernel.org>
      Cc: Luis Chamberlain <mcgrof@kernel.org>
      Cc: Marco Elver <elver@google.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Martin K. Petersen <martin.petersen@oracle.com>
      Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: Miklos Szeredi <mszeredi@redhat.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
      Cc: Nick Piggin <npiggin@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Olof Johansson <olof@lixom.net>
      Cc: Paul Burton <paulburton@kernel.org>
      Cc: "Paul E. McKenney" <paulmck@kernel.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Ravi Bangoria <ravi.bangoria@linux.ibm.com>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Sami Tolvanen <samitolvanen@google.com>
      Cc: Sargun Dhillon <sargun@sargun.me>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Sudeep Holla <sudeep.holla@arm.com>
      Cc: Sven Schnelle <svens@stackframe.org>
      Cc: Thiago Jung Bauermann <bauerman@linux.ibm.com>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Cc: Zhou Yanjie <zhouyanjie@wanyeetech.com>
      Link: http://lkml.kernel.org/r/20200616030734.87257-1-nixiaoming@huawei.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      88db0aa2
    • x86/fsgsbase/64: Fix NULL deref in 86_fsgsbase_read_task · 8ab49526
      Eric Dumazet authored
      syzbot found its way into x86_fsgsbase_read_task() and triggered this oops:
      
         KASAN: null-ptr-deref in range [0x0000000000000008-0x000000000000000f]
         CPU: 0 PID: 6866 Comm: syz-executor262 Not tainted 5.8.0-syzkaller #0
         RIP: 0010:x86_fsgsbase_read_task+0x16d/0x310 arch/x86/kernel/process_64.c:393
         Call Trace:
           putreg32+0x3ab/0x530 arch/x86/kernel/ptrace.c:876
           genregs32_set arch/x86/kernel/ptrace.c:1026 [inline]
           genregs32_set+0xa4/0x100 arch/x86/kernel/ptrace.c:1006
           copy_regset_from_user include/linux/regset.h:326 [inline]
           ia32_arch_ptrace arch/x86/kernel/ptrace.c:1061 [inline]
           compat_arch_ptrace+0x36c/0xd90 arch/x86/kernel/ptrace.c:1198
           __do_compat_sys_ptrace kernel/ptrace.c:1420 [inline]
           __se_compat_sys_ptrace kernel/ptrace.c:1389 [inline]
           __ia32_compat_sys_ptrace+0x220/0x2f0 kernel/ptrace.c:1389
           do_syscall_32_irqs_on arch/x86/entry/common.c:84 [inline]
           __do_fast_syscall_32+0x57/0x80 arch/x86/entry/common.c:126
           do_fast_syscall_32+0x2f/0x70 arch/x86/entry/common.c:149
           entry_SYSENTER_compat_after_hwframe+0x4d/0x5c
      
      This can happen if ptrace() or sigreturn() pokes an LDT selector into FS
      or GS for a task with no LDT and something tries to read the base before
      a return to usermode notices the bad selector and fixes it.
      
      The fix is to make sure the ldt pointer is not NULL.
      
      Fixes: 07e1d88a ("x86/fsgsbase/64: Fix ptrace() to read the FS/GS base accurately")
      Co-developed-by: Jann Horn <jannh@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Acked-by: Andy Lutomirski <luto@kernel.org>
      Cc: Chang S. Bae <chang.seok.bae@intel.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Markus T Metzger <markus.t.metzger@intel.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ravi Shankar <ravi.v.shankar@intel.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8ab49526
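      The fix adds a NULL check on the LDT pointer in x86_fsgsbase_read_task();
      the relevant code ends up roughly as below (a sketch):
      
      	mutex_lock(&task->mm->context.lock);
      	ldt = task->mm->context.ldt;
      	if (unlikely(!ldt || idx >= ldt->nr_entries))
      		base = 0;
      	else
      		base = get_desc_base(ldt->entries + idx);
      	mutex_unlock(&task->mm->context.lock);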
  6. 14 August 2020 (3 commits)
  7. 13 August 2020 (6 commits)
    • x86/alternatives: Acquire pte lock with interrupts enabled · a6d996cb
      Sebastian Andrzej Siewior authored
      pte lock is never acquired in-IRQ context so it does not require interrupts
      to be disabled. The lock is a regular spinlock which cannot be acquired
      with interrupts disabled on RT.
      
      RT complains about pte_lock() in __text_poke() because it's invoked after
      disabling interrupts.
      
      __text_poke() has to disable interrupts as use_temporary_mm() expects
      interrupts to be off because it invokes switch_mm_irqs_off() and uses
      per-CPU (current active mm) data.
      
      Move the PTE lock handling outside the interrupt disabled region.
      Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lore.kernel.org/r/20200813105026.bvugytmsso6muljw@linutronix.de
      a6d996cb
    • mm/x86: use general page fault accounting · 968614fc
      Peter Xu authored
      Use the general page fault accounting by passing regs into
      handle_mm_fault().
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Link: http://lkml.kernel.org/r/20200707225021.200906-23-peterx@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      968614fc
    • mm: do page fault accounting in handle_mm_fault · bce617ed
      Peter Xu authored
      Patch series "mm: Page fault accounting cleanups", v5.
      
      This is v5 of the pf accounting cleanup series.  It originates from Gerald
      Schaefer's report on an issue a week ago regarding incorrect page fault
      accounting for retried page faults after commit 4064b982 ("mm: allow
      VM_FAULT_RETRY for multiple times"):
      
        https://lore.kernel.org/lkml/20200610174811.44b94525@thinkpad/
      
      What this series did:
      
        - Correct page fault accounting: we do accounting for a page fault
          (no matter whether it's from #PF handling, or gup, or anything else)
          only with the one that completed the fault.  For example, page fault
          retries should not be counted in page fault counters.  Same to the
          perf events.
      
        - Unify definition of PERF_COUNT_SW_PAGE_FAULTS: currently this perf
          event is used in an adhoc way across different archs.
      
          Case (1): for many archs it's done at the entry of a page fault
      handler, so that it will also cover e.g. erroneous faults.
      
          Case (2): for some other archs, it is only accounted when the page
          fault is resolved successfully.
      
          Case (3): there're still quite some archs that have not enabled
          this perf event.
      
          Since this series will touch merely all the archs, we unify this
          perf event to always follow case (1), which is the one that makes most
          sense.  And since we moved the accounting into handle_mm_fault, the
          other two MAJ/MIN perf events are well taken care of naturally.
      
        - Unify definition of "major faults": the definition of "major
          fault" is slightly changed when used in accounting (not
          VM_FAULT_MAJOR).  More information in patch 1.
      
        - Always account the page fault onto the one that triggered the page
          fault.  This does not matter much for #PF handlings, but mostly for
          gup.  More information on this in patch 25.
      
      Patchset layout:
      
      Patch 1:     Introduced the accounting in handle_mm_fault(), not enabled.
      Patch 2-23:  Enable the new accounting for arch #PF handlers one by one.
      Patch 24:    Enable the new accounting for the rest outliers (gup, iommu, etc.)
      Patch 25:    Cleanup GUP task_struct pointer since it's not needed any more
      
      This patch (of 25):
      
      This is a preparation patch to move page fault accountings into the
      general code in handle_mm_fault().  This includes both the per task
      flt_maj/flt_min counters, and the major/minor page fault perf events.  To
      do this, the pt_regs pointer is passed into handle_mm_fault().
      
      PERF_COUNT_SW_PAGE_FAULTS should still be kept in per-arch page fault
      handlers.
      
      So far, all the pt_regs pointers passed into handle_mm_fault() are
      NULL, which means this patch should have no intended functional change.
      Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Albert Ou <aou@eecs.berkeley.edu>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Cain <bcain@codeaurora.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
      Cc: Greentime Hu <green.hu@gmail.com>
      Cc: Guo Ren <guoren@kernel.org>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Helge Deller <deller@gmx.de>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Jonas Bonn <jonas@southpole.se>
      Cc: Ley Foon Tan <ley.foon.tan@intel.com>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: Nick Hu <nickhu@andestech.com>
      Cc: Palmer Dabbelt <palmer@dabbelt.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Stafford Horne <shorne@gmail.com>
      Cc: Stefan Kristiansson <stefan.kristiansson@saunalahti.fi>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Vincent Chen <deanbo422@gmail.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Link: http://lkml.kernel.org/r/20200707225021.200906-1-peterx@redhat.com
      Link: http://lkml.kernel.org/r/20200707225021.200906-2-peterx@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bce617ed
    • uaccess: remove segment_eq · 428e2976
      Christoph Hellwig authored
      segment_eq is only used to implement uaccess_kernel.  Just open code
      uaccess_kernel in the arch uaccess headers and remove one layer of
      indirection.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
      Acked-by: Greentime Hu <green.hu@gmail.com>
      Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Nick Hu <nickhu@andestech.com>
      Cc: Vincent Chen <deanbo422@gmail.com>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Palmer Dabbelt <palmer@dabbelt.com>
      Link: http://lkml.kernel.org/r/20200710135706.537715-5-hch@lst.de
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      428e2976
    • mm/memory_hotplug: introduce default dummy memory_add_physaddr_to_nid() · d622ecec
      Jia He authored
      This is to introduce a general dummy helper.  memory_add_physaddr_to_nid()
      is a fallback option to get the nid in case NUMA_NO_NODE is detected.
      
      After this patch, arm64/sh/s390 can simply use the general dummy version.
      PowerPC/x86/ia64 will still use their specific version.
      
      This is the preparation to set a fallback value for dev_dax->target_node.
      Signed-off-by: Jia He <justin.he@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Vishal Verma <vishal.l.verma@intel.com>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Chuhong Yuan <hslester96@gmail.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Logan Gunthorpe <logang@deltatee.com>
      Cc: Masahiro Yamada <masahiroy@kernel.org>
      Cc: Jonathan Cameron <Jonathan.Cameron@Huawei.com>
      Cc: Kaly Xin <Kaly.Xin@arm.com>
      Link: http://lkml.kernel.org/r/20200710031619.18762-2-justin.he@arm.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d622ecec
    • x86/mm: use max memory block size on bare metal · fe124c95
      Daniel Jordan authored
      Some of our servers spend significant time at kernel boot initializing
      memory block sysfs directories and then creating symlinks between them and
      the corresponding nodes.  The slowness happens because the machines get
      stuck with the smallest supported memory block size on x86 (128M), which
      results in 16,288 directories to cover the 2T of installed RAM.  The
      search for each memory block is noticeable even with commit 4fb6eabf
      ("drivers/base/memory.c: cache memory blocks in xarray to accelerate
      lookup").
      
      Commit 078eb6aa ("x86/mm/memory_hotplug: determine block size based on
      the end of boot memory") chooses the block size based on alignment with
      memory end.  That addresses hotplug failures in qemu guests, but for bare
      metal systems whose memory end isn't aligned to even the smallest size, it
      leaves them at 128M.
      
      Make kernels that aren't running on a hypervisor use the largest supported
      size (2G) to minimize overhead on big machines.  Kernel boot goes 7%
      faster on the aforementioned servers, shaving off half a second.
      
      [daniel.m.jordan@oracle.com: v3]
        Link: http://lkml.kernel.org/r/20200714205450.945834-1-daniel.m.jordan@oracle.com
      Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: David Hildenbrand <david@redhat.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Sistare <steven.sistare@oracle.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20200609225451.3542648-1-daniel.m.jordan@oracle.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      fe124c95
  8. 11 August 2020 (5 commits)
  9. 10 August 2020 (2 commits)
    • x86: Expose SERIALIZE for supported cpuid · 43bd9ef4
      Paolo Bonzini authored
      The SERIALIZE instruction is supported by Intel processors such as
      Sapphire Rapids.  SERIALIZE is a faster serializing instruction which
      does not modify registers, arithmetic flags or memory, and does not cause
      a VM exit.  Its availability is indicated by CPUID.(EAX=7,ECX=0):EDX[bit 14].
      
      Expose it in the KVM-supported CPUID.  This way, KVM can pass this
      information to guests and they can make use of the feature accordingly.
      Signed-off-by: Cathy Zhang <cathy.zhang@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      43bd9ef4
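      The change amounts to adding the feature bit to the CPUID.7.0:EDX
      capability mask that KVM reports, roughly (a sketch; the surrounding flags
      are abbreviated and the exact helper in this tree may differ):
      
      	kvm_cpu_cap_mask(CPUID_7_EDX,
      		/* ... existing F() bits ... */
      		F(MD_CLEAR) | F(AVX512_VP2INTERSECT) | F(FSRM) | F(SERIALIZE)
      	);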
    • KVM: x86: Don't attempt to load PDPTRs when 64-bit mode is enabled · 05487215
      Sean Christopherson authored
      Don't attempt to load PDPTRs if EFER.LME=1, i.e. if 64-bit mode is
      enabled.  A recent change to reload the PDPTRs when CR0.CD or CR0.NW is
      toggled botched the EFER.LME handling and sends KVM down the PDPTR path
      when is_paging() is true, i.e. when the guest toggles CD/NW in 64-bit
      mode.
      
      Split the CR0 checks for 64-bit vs. 32-bit PAE into separate paths.  The
      64-bit path is specifically checking state when paging is toggled on,
      i.e. CR0.PG transitions from 0->1.  The PDPTR path now needs to run if
      the new CR0 state has paging enabled, irrespective of whether paging was
      already enabled.  Trying to shave a few cycles to make the PDPTR path an
      "else if" case is a mess.
      
      Fixes: d42e3fae ("kvm: x86: Read PDPTEs on CR0.CD and CR0.NW changes")
      Cc: Jim Mattson <jmattson@google.com>
      Cc: Oliver Upton <oupton@google.com>
      Cc: Peter Shier <pshier@google.com>
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Reviewed-by: Jim Mattson <jmattson@google.com>
      Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
      Message-Id: <20200714015732.32426-1-sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      05487215
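      A sketch of the resulting split in kvm_set_cr0() (approximate; pdptr_bits
      covers CR0.CD, CR0.NW and CR0.PG): the 64-bit leg only checks the PG 0->1
      transition, while the PAE leg reloads PDPTEs whenever paging is enabled.
      
      #ifdef CONFIG_X86_64
      	if ((vcpu->arch.efer & EFER_LME) && !is_paging(vcpu) &&
      	    (cr0 & X86_CR0_PG)) {
      		int cs_db, cs_l;
      
      		if (!is_pae(vcpu))
      			return 1;
      		kvm_x86_ops.get_cs_db_l_bits(vcpu, &cs_db, &cs_l);
      		if (cs_l)
      			return 1;
      	}
      #endif
      	if (!(vcpu->arch.efer & EFER_LME) && (cr0 & X86_CR0_PG) &&
      	    is_pae(vcpu) && ((cr0 ^ old_cr0) & pdptr_bits) &&
      	    !load_pdptrs(vcpu, vcpu->arch.walk_mmu, kvm_read_cr3(vcpu)))
      		return 1;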
  10. 08 August 2020 (9 commits)
    • mm/sparse: cleanup the code surrounding memory_present() · c89ab04f
      Mike Rapoport authored
      After removal of CONFIG_HAVE_MEMBLOCK_NODE_MAP we have two equivalent
      functions that call memory_present() for each region in memblock.memory:
      sparse_memory_present_with_active_regions() and memblocks_present().
      
      Moreover, all architectures have a call to either of these functions
      preceding the call to sparse_init() and in the most cases they are called
      one after the other.
      
      Mark the regions from memblock.memory as present during sparse_init() by
      making sparse_init() call memblocks_present(), make memblocks_present()
      and memory_present() functions static and remove redundant
      sparse_memory_present_with_active_regions() function.
      
      Also remove no longer required HAVE_MEMORY_PRESENT configuration option.
      Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Link: http://lkml.kernel.org/r/20200712083130.22919-1-rppt@kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c89ab04f
    • mm/sparsemem: enable vmem_altmap support in vmemmap_alloc_block_buf() · 56993b4e
      Anshuman Khandual authored
      There are many instances where vmemmap allocation is switched between
      regular memory and device memory just based on whether altmap is available
      or not.  vmemmap_alloc_block_buf() is used in various platforms to
      allocate vmemmap mappings.  Let's also enable it to handle altmap based
      device memory allocation along with existing regular memory allocations.
      This will help in avoiding the altmap based allocation switch in many
      places.  To summarize there are two different methods to call
      vmemmap_alloc_block_buf().
      
      vmemmap_alloc_block_buf(size, node, NULL)   /* Allocate from system RAM */
      vmemmap_alloc_block_buf(size, node, altmap) /* Allocate from altmap */
      
      This converts altmap_alloc_block_buf() into a static function, drops its
      entry from the header and updates Documentation/vm/memory-model.rst.
      Suggested-by: Robin Murphy <robin.murphy@arm.com>
      Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Tested-by: Jia He <justin.he@arm.com>
      Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Will Deacon <will@kernel.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Hsin-Yi Wang <hsinyi@chromium.org>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Palmer Dabbelt <palmer@dabbelt.com>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Steve Capper <steve.capper@arm.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Yu Zhao <yuzhao@google.com>
      Link: http://lkml.kernel.org/r/1594004178-8861-3-git-send-email-anshuman.khandual@arm.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      56993b4e
    • mm/sparsemem: enable vmem_altmap support in vmemmap_populate_basepages() · 1d9cfee7
      Anshuman Khandual authored
      Patch series "arm64: Enable vmemmap mapping from device memory", v4.
      
      This series enables vmemmap backing memory allocation from device memory
      ranges on arm64.  But before that, it enables vmemmap_populate_basepages()
      and vmemmap_alloc_block_buf() to accommodate struct vmem_altmap based
      allocation requests.
      
      This patch (of 3):
      
      vmemmap_populate_basepages() is used across platforms to allocate backing
      memory for vmemmap mapping.  This is used as a standard default choice or
      as a fallback when intended huge pages allocation fails.  This just
      creates entire vmemmap mapping with base pages (PAGE_SIZE).
      
      On arm64 platforms, vmemmap_populate_basepages() is called instead of the
      platform specific vmemmap_populate() when ARM64_SWAPPER_USES_SECTION_MAPS
      is not enabled as in case for ARM64_16K_PAGES and ARM64_64K_PAGES configs.
      
      At present vmemmap_populate_basepages() does not support allocating from
      driver defined struct vmem_altmap while trying to create vmemmap mapping
      for a device memory range.  It prevents ARM64_16K_PAGES and
      ARM64_64K_PAGES configs on arm64 from supporting device memory with
      vmem_altmap requests.
      
      This enables vmem_altmap support in vmemmap_populate_basepages(), unlocking
      device memory allocation for vmemmap mapping on arm64 platforms with 16K or
      64K base page configs.
      
      Each architecture should evaluate and decide on subscribing device memory
      based base page allocation through vmemmap_populate_basepages().  Hence
      let's keep it disabled on all archs in order to preserve the existing
      semantics.  A subsequent patch enables it on arm64.
      Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Tested-by: Jia He <justin.he@arm.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Acked-by: Will Deacon <will@kernel.org>
      Acked-by: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Palmer Dabbelt <palmer@dabbelt.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Hsin-Yi Wang <hsinyi@chromium.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Robin Murphy <robin.murphy@arm.com>
      Cc: Steve Capper <steve.capper@arm.com>
      Cc: Yu Zhao <yuzhao@google.com>
      Link: http://lkml.kernel.org/r/1594004178-8861-1-git-send-email-anshuman.khandual@arm.com
      Link: http://lkml.kernel.org/r/1594004178-8861-2-git-send-email-anshuman.khandual@arm.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1d9cfee7
    • asm-generic: pgalloc: provide generic pgd_free() · f9cb654c
      Mike Rapoport authored
      Most architectures define pgd_free() as a wrapper for free_page().
      
      Provide a generic version in asm-generic/pgalloc.h and enable its use for
      most architectures.
      Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Pekka Enberg <penberg@kernel.org>
      Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>	[m68k]
      Cc: Abdul Haleem <abdhalee@linux.vnet.ibm.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Joerg Roedel <joro@8bytes.org>
      Cc: Joerg Roedel <jroedel@suse.de>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Satheesh Rajendran <sathnaga@linux.vnet.ibm.com>
      Cc: Stafford Horne <shorne@gmail.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Link: http://lkml.kernel.org/r/20200627143453.31835-7-rppt@kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f9cb654c
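      The generic version is the obvious wrapper, guarded so that architectures
      with their own implementation can keep it (a sketch of the asm-generic
      helper; the guard macro name follows the usual convention and is assumed
      here):
      
      #ifndef __HAVE_ARCH_PGD_FREE
      static inline void pgd_free(struct mm_struct *mm, pgd_t *pgd)
      {
      	free_page((unsigned long)pgd);
      }
      #endif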
    • asm-generic: pgalloc: provide generic pud_alloc_one() and pud_free_one() · d9e8b929
      Mike Rapoport authored
      Several architectures define pud_alloc_one() as a wrapper for
      __get_free_page() and pud_free() as a wrapper for free_page().
      
      Provide a generic implementation in asm-generic/pgalloc.h and use it where
      appropriate.
      Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Pekka Enberg <penberg@kernel.org>
      Cc: Abdul Haleem <abdhalee@linux.vnet.ibm.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Joerg Roedel <joro@8bytes.org>
      Cc: Joerg Roedel <jroedel@suse.de>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Satheesh Rajendran <sathnaga@linux.vnet.ibm.com>
      Cc: Stafford Horne <shorne@gmail.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Link: http://lkml.kernel.org/r/20200627143453.31835-6-rppt@kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d9e8b929
    • asm-generic: pgalloc: provide generic pmd_alloc_one() and pmd_free_one() · 1355c31e
      Mike Rapoport authored
      For most architectures that support >2 levels of page tables,
      pmd_alloc_one() is a wrapper for __get_free_pages(), sometimes with
      __GFP_ZERO and sometimes followed by memset(0) instead.
      
      More elaborate versions on arm64 and x86 account memory for the user page
      tables and call pgtable_pmd_page_ctor() as part of PMD page
      initialization.
      
      Move the arm64 version to include/asm-generic/pgalloc.h and use the
      generic version on several architectures.
      
      The pgtable_pmd_page_ctor() is a NOP when ARCH_ENABLE_SPLIT_PMD_PTLOCK is
      not enabled, so there is no functional change for most architectures
      except of the addition of __GFP_ACCOUNT for allocation of user page
      tables.
      
      The pmd_free() is a wrapper for free_page() in all the cases, so no
      functional change here.
      Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Pekka Enberg <penberg@kernel.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Abdul Haleem <abdhalee@linux.vnet.ibm.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Joerg Roedel <joro@8bytes.org>
      Cc: Joerg Roedel <jroedel@suse.de>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Satheesh Rajendran <sathnaga@linux.vnet.ibm.com>
      Cc: Stafford Horne <shorne@gmail.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Link: http://lkml.kernel.org/r/20200627143453.31835-5-rppt@kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1355c31e
    • mm: remove unneeded includes of <asm/pgalloc.h> · ca15ca40
      Mike Rapoport authored
      Patch series "mm: cleanup usage of <asm/pgalloc.h>"
      
      Most architectures have very similar versions of pXd_alloc_one() and
      pXd_free_one() for intermediate levels of page table.  These patches add
      generic versions of these functions in <asm-generic/pgalloc.h> and enable
      use of the generic functions where appropriate.
      
      In addition, functions declared and defined in <asm/pgalloc.h> headers are
      used mostly by core mm and early mm initialization in arch and there is no
      actual reason to have the <asm/pgalloc.h> included all over the place.
      The first patch in this series removes unneeded includes of
      <asm/pgalloc.h>
      
      In the end it didn't work out as neatly as I hoped and moving
      pXd_alloc_track() definitions to <asm-generic/pgalloc.h> would require
      unnecessary changes to arches that have custom page table allocations, so
      I've decided to move lib/ioremap.c to mm/ and make pgalloc-track.h local
      to mm/.
      
      This patch (of 8):
      
      In most cases <asm/pgalloc.h> header is required only for allocations of
      page table memory.  Most of the .c files that include that header do not
      use symbols declared in <asm/pgalloc.h> and do not require that header.
      
      As for the other header files that used to include <asm/pgalloc.h>, it is
      possible to move that include into the .c file that actually uses symbols
      from <asm/pgalloc.h> and drop the include from the header file.
      
      The process was somewhat automated using
      
      	sed -i -E '/[<"]asm\/pgalloc\.h/d' \
                      $(grep -L -w -f /tmp/xx \
                              $(git grep -E -l '[<"]asm/pgalloc\.h'))
      
      where /tmp/xx contains all the symbols defined in
      arch/*/include/asm/pgalloc.h.
      
      [rppt@linux.ibm.com: fix powerpc warning]
      Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Pekka Enberg <penberg@kernel.org>
      Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>	[m68k]
      Cc: Abdul Haleem <abdhalee@linux.vnet.ibm.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Joerg Roedel <joro@8bytes.org>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Satheesh Rajendran <sathnaga@linux.vnet.ibm.com>
      Cc: Stafford Horne <shorne@gmail.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Joerg Roedel <jroedel@suse.de>
      Cc: Matthew Wilcox <willy@infradead.org>
      Link: http://lkml.kernel.org/r/20200627143453.31835-1-rppt@kernel.org
      Link: http://lkml.kernel.org/r/20200627143453.31835-2-rppt@kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ca15ca40
    • mm, treewide: rename kzfree() to kfree_sensitive() · 453431a5
      Waiman Long authored
      As said by Linus:
      
        A symmetric naming is only helpful if it implies symmetries in use.
        Otherwise it's actively misleading.
      
        In "kzalloc()", the z is meaningful and an important part of what the
        caller wants.
      
        In "kzfree()", the z is actively detrimental, because maybe in the
        future we really _might_ want to use that "memfill(0xdeadbeef)" or
        something. The "zero" part of the interface isn't even _relevant_.
      
      The main reason that kzfree() exists is to clear sensitive information
      that should not be leaked to other future users of the same memory
      objects.
      
      Rename kzfree() to kfree_sensitive() to follow the example of the recently
      added kvfree_sensitive() and make the intention of the API more explicit.
      In addition, memzero_explicit() is used to clear the memory to make sure
      that it won't get optimized away by the compiler.
      
      The renaming is done by using the command sequence:
      
        git grep -w --name-only kzfree |\
        xargs sed -i 's/kzfree/kfree_sensitive/'
      
      followed by some editing of the kfree_sensitive() kerneldoc and adding
      a kzfree backward compatibility macro in slab.h.
      
      [akpm@linux-foundation.org: fs/crypto/inline_crypt.c needs linux/slab.h]
      [akpm@linux-foundation.org: fix fs/crypto/inline_crypt.c some more]
      Suggested-by: Joe Perches <joe@perches.com>
      Signed-off-by: Waiman Long <longman@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: David Howells <dhowells@redhat.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com>
      Cc: James Morris <jmorris@namei.org>
      Cc: "Serge E. Hallyn" <serge@hallyn.com>
      Cc: Joe Perches <joe@perches.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Cc: "Jason A . Donenfeld" <Jason@zx2c4.com>
      Link: http://lkml.kernel.org/r/20200616154311.12314-3-longman@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      453431a5
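      For reference, the renamed helper clears the buffer with memzero_explicit()
      before freeing it; the implementation is roughly (a sketch):
      
      void kfree_sensitive(const void *p)
      {
      	size_t ks;
      	void *mem = (void *)p;
      
      	ks = ksize(mem);
      	if (ks)
      		memzero_explicit(mem, ks);
      	kfree(mem);
      }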
    • x86/mm/64: Do not dereference non-present PGD entries · 995909a4
      Joerg Roedel authored
      The code for preallocate_vmalloc_pages() was written under the
      assumption that the p4d_offset() and pud_offset() functions will perform
      present checks before dereferencing the parent entries.
      
      This assumption is wrong and leads to a bug in the code which causes the
      physical address found in the PGD to be used as a page-table page, even if
      the PGD is not present.
      
      So the code flow currently is:
      
      	pgd = pgd_offset_k(addr);
      	p4d = p4d_offset(pgd, addr);
      	if (p4d_none(*p4d))
      		p4d = p4d_alloc(&init_mm, pgd, addr);
      
      This lacks a check for pgd_none() at least, the correct flow would be:
      
      	pgd = pgd_offset_k(addr);
      	if (pgd_none(*pgd))
      		p4d = p4d_alloc(&init_mm, pgd, addr);
      	else
      		p4d = p4d_offset(pgd, addr);
      
      But this is the same flow that the p4d_alloc() and the pud_alloc()
      functions use internally, so there is no need to duplicate them.
      
      Remove the p?d_none() checks from the function and just call into
      p4d_alloc() and pud_alloc() to correctly pre-allocate the PGD entries.
      Reported-and-tested-by: Jason A. Donenfeld <Jason@zx2c4.com>
      Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
      Fixes: 6eb82f99 ("x86/mm: Pre-allocate P4D/PUD pages for vmalloc area")
      Signed-off-by: Joerg Roedel <jroedel@suse.de>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      995909a4
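      With the none-checks removed, preallocate_vmalloc_pages() simply lets the
      allocation helpers do the present checks themselves; the loop body ends up
      roughly as below (a sketch; the error label is assumed):
      
      	p4d = p4d_alloc(&init_mm, pgd, addr);
      	if (!p4d)
      		goto failed;
      
      	if (pgtable_l5_enabled())
      		continue;
      
      	pud = pud_alloc(&init_mm, p4d, addr);
      	if (!pud)
      		goto failed;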