1. 15 November 2020 (1 commit)
    • KVM: x86: Move vendor CR4 validity check to dedicated kvm_x86_ops hook · c2fe3cd4
      Authored by Sean Christopherson
      Split out VMX's checks on CR4.VMXE to a dedicated hook, .is_valid_cr4(),
      and invoke the new hook from kvm_valid_cr4().  This fixes an issue where
      KVM_SET_SREGS would return success while failing to actually set CR4.
      
      Fixing the issue by explicitly checking kvm_x86_ops.set_cr4()'s return
      in __set_sregs() is not a viable option as KVM has already stuffed a
      variety of vCPU state.
      
      Note, kvm_valid_cr4() and is_valid_cr4() have different return types and
      inverted semantics.  This will be remedied in a future patch.
      
      Fixes: 5e1746d6 ("KVM: nVMX: Allow setting the VMXE bit in CR4")
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Message-Id: <20201007014417.29276-5-sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
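
      A condensed sketch of the resulting split, with hook and caller names
      from the commit (the nested_vmx_allowed() check stands in for VMX's
      existing CR4.VMXE logic; surrounding plumbing is omitted):

          /* vmx.c: vendor hook, answers "is this CR4 value acceptable?" */
          static bool vmx_is_valid_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
          {
                  /* VMXE is only legal when nested VMX is exposed to the guest */
                  if ((cr4 & X86_CR4_VMXE) && !nested_vmx_allowed(vcpu))
                          return false;
                  return true;
          }

          /* x86.c: the common check now consults the vendor hook */
          static int kvm_valid_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
          {
                  if (cr4 & cr4_reserved_bits)
                          return -EINVAL;

                  if (!kvm_x86_ops.is_valid_cr4(vcpu, cr4))
                          return -EINVAL; /* note the inverted bool vs. int semantics */

                  return 0;
          }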
  2. 13 November 2020 (1 commit)
    • KVM: x86: Introduce cr3_lm_rsvd_bits in kvm_vcpu_arch · 0107973a
      Authored by Babu Moger
      SEV guests fail to boot on a system that supports the PCID feature.
      
      While emulating the RSM instruction, KVM reads the guest CR3
      and calls kvm_set_cr3(). If the vCPU is in long mode,
      kvm_set_cr3() performs a sanity check on the CR3 value,
      validating that no reserved bits are set. The reserved
      bit range is 63:cpuid_maxphyaddr(). When AMD memory
      encryption is enabled, the memory encryption bit is set in
      the CR3 value. That bit may fall within KVM's reserved bit
      range, causing the emulation to fail.
      
      Introduce a new field cr3_lm_rsvd_bits in kvm_vcpu_arch which will
      cache the reserved bits in the CR3 value. This will be initialized
      to rsvd_bits(cpuid_maxphyaddr(vcpu), 63).
      
      If the architecture has any special bits (like the AMD SEV encryption
      bit) that need to be masked out of the reserved bits, they should be
      cleared in the vendor-specific kvm_x86_ops.vcpu_after_set_cpuid handler.
      
      Fixes: a780a3ea ("KVM: X86: Fix reserved bits check for MOV to CR3")
      Signed-off-by: Babu Moger <babu.moger@amd.com>
      Message-Id: <160521947657.32054.3264016688005356563.stgit@bmoger-ubuntu>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
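
      A sketch of the scheme: common code caches the mask, and SVM's
      vcpu_after_set_cpuid() clears the SEV C-bit from it (the C-bit position
      comes from CPUID 0x8000001F[EBX]; context around the snippets is trimmed):

          /* kvm_vcpu_after_set_cpuid(): cache the long-mode CR3 reserved bits */
          vcpu->arch.cr3_lm_rsvd_bits = rsvd_bits(cpuid_maxphyaddr(vcpu), 63);

          /* svm_vcpu_after_set_cpuid(): unreserve the memory encryption bit
           * so a SEV guest's CR3 passes the kvm_set_cr3() sanity check */
          if (sev_guest(vcpu->kvm)) {
                  best = kvm_find_cpuid_entry(vcpu, 0x8000001F, 0);
                  if (best)
                          vcpu->arch.cr3_lm_rsvd_bits &= ~(1UL << (best->ebx & 0x3f));
          }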
  3. 26 October 2020 (1 commit)
  4. 24 October 2020 (1 commit)
    • x86/uaccess: fix code generation in put_user() · 9c5743df
      Authored by Rasmus Villemoes
      Quoting https://gcc.gnu.org/onlinedocs/gcc/Local-Register-Variables.html:
      
        You can define a local register variable and associate it with a
        specified register...
      
        The only supported use for this feature is to specify registers for
        input and output operands when calling Extended asm (see Extended
        Asm). This may be necessary if the constraints for a particular
        machine don't provide sufficient control to select the desired
        register.
      
      On 32-bit x86, this is used to ensure that gcc will put an 8-byte value
      into the %edx:%eax pair, while all other cases will just use the single
      register %eax (%rax on x86-64).  While the _ASM_AX actually just expands
      to "%eax", note this comment next to get_user() which does something
      very similar:
      
       * The use of _ASM_DX as the register specifier is a bit of a
       * simplification, as gcc only cares about it as the starting point
       * and not size: for a 64-bit value it will use %ecx:%edx on 32 bits
       * (%ecx being the next register in gcc's x86 register sequence), and
       * %rdx on 64 bits.
      
      However, getting this to work requires that there is no code between the
      assignment to the local register variable and its use as an input to the
      asm() which can possibly clobber any of the registers involved -
      including evaluation of the expressions making up other inputs.
      
      In the current code, the ptr expression used directly as an input may
      cause such code to be emitted.  For example, Sean Christopherson
      observed that with KASAN enabled and ptr being current->set_child_tid
      (from schedule_tail()), the load of current->set_child_tid causes a call
      to __asan_load8() to be emitted immediately prior to the __put_user_4
      call, and Naresh Kamboju reports that various mmstress tests fail on
      KASAN-enabled builds.
      
      It's also possible to synthesize a broken case without KASAN if one uses
      "foo()" as the ptr argument, with foo being some "extern u64 __user
      *foo(void);" (though I don't know if that appears in real code).
      
      Fix it by making sure ptr gets evaluated before the assignment to
      __val_pu, and add a comment that __val_pu must be the last thing
      computed before the asm() is entered.
      
      Cc: Sean Christopherson <sean.j.christopherson@intel.com>
      Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org>
      Tested-by: Naresh Kamboju <naresh.kamboju@linaro.org>
      Fixes: d55564cf ("x86: Make __put_user() generate an out-of-line call")
      Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
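
      A condensed sketch of the fixed macro body (constraint plumbing trimmed
      to what matters here): the pointer expression is forced into a temporary
      before the register variable is bound.

          void __user *__ptr_pu;
          register __typeof__(*(ptr)) __val_pu asm("%"_ASM_AX);
          int __ret_pu;

          __ptr_pu = (ptr);   /* may emit arbitrary code, e.g. __asan_load8() */
          __val_pu = (x);     /* the LAST thing computed before the asm() */
          asm volatile("call __put_user_%P[size]"
                       : "=c" (__ret_pu), ASM_CALL_CONSTRAINT
                       : "0" (__ptr_pu), "r" (__val_pu),
                         [size] "i" (sizeof(*(ptr)))
                       : "ebx");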
  5. 23 October 2020 (1 commit)
  6. 22 October 2020 (6 commits)
  7. 15 October 2020 (1 commit)
  8. 14 October 2020 (1 commit)
    • x86/numa: cleanup configuration dependent command-line options · 2dd57d34
      Authored by Dan Williams
      Patch series "device-dax: Support sub-dividing soft-reserved ranges", v5.
      
      The device-dax facility allows an address range to be directly mapped
      through a chardev, or optionally hotplugged to the core kernel page
      allocator as System-RAM. It is the mechanism for converting persistent
      memory (pmem) into another volatile memory pool, i.e. the current
      Memory Tiering hot topic on linux-mm.
      
      In the case of pmem the nvdimm-namespace-label mechanism can sub-divide
      it, but that labeling mechanism is not available / applicable to
      soft-reserved ("EFI specific purpose") memory [3].  This series provides a
      sysfs-mechanism for the daxctl utility to enable provisioning of
      volatile-soft-reserved memory ranges.
      
      The motivations for this facility are:
      
      1/ Allow performance differentiated memory ranges to be split between
         kernel-managed and directly-accessed use cases.
      
      2/ Allow physical memory to be provisioned along performance relevant
         address boundaries. For example, divide a memory-side cache [4] along
         cache-color boundaries.
      
      3/ Parcel out soft-reserved memory to VMs using device-dax as a security
         / permissions boundary [5]. Specifically I have seen people (ab)using
         memmap=nn!ss (mark System-RAM as Persistent Memory) just to get the
         device-dax interface on custom address ranges. A follow-on for the VM
         use case is to teach device-dax to dynamically allocate 'struct page' at
         runtime to reduce the duplication of 'struct page' space in both the
         guest and the host kernel for the same physical pages.
      
      [2]: http://lore.kernel.org/r/20200713160837.13774-11-joao.m.martins@oracle.com
      [3]: http://lore.kernel.org/r/157309097008.1579826.12818463304589384434.stgit@dwillia2-desk3.amr.corp.intel.com
      [4]: http://lore.kernel.org/r/154899811738.3165233.12325692939590944259.stgit@dwillia2-desk3.amr.corp.intel.com
      [5]: http://lore.kernel.org/r/20200110190313.17144-1-joao.m.martins@oracle.com
      
      This patch (of 23):
      
      In preparation for adding a new numa= option, clean up the existing ones
      to avoid ifdefs in numa_setup(), and provide feedback when the numa=fake=
      option is invalid due to the kernel configuration (see the sketch after
      this entry). The same does not need to be done for numa=noacpi, since
      that capability is already hard disabled at compile time.
      Suggested-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Ard Biesheuvel <ardb@kernel.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Ben Skeggs <bskeggs@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brice Goglin <Brice.Goglin@inria.fr>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Daniel Vetter <daniel@ffwll.ch>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: David Airlie <airlied@linux.ie>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: Jason Gunthorpe <jgg@mellanox.com>
      Cc: Jeff Moyer <jmoyer@redhat.com>
      Cc: Jia He <justin.he@arm.com>
      Cc: Joao Martins <joao.m.martins@oracle.com>
      Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Paul Mackerras <paulus@ozlabs.org>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Cc: Vishal Verma <vishal.l.verma@intel.com>
      Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Cc: Bjorn Helgaas <bhelgaas@google.com>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Hulk Robot <hulkci@huawei.com>
      Cc: Jason Yan <yanaijie@huawei.com>
      Cc: "Jérôme Glisse" <jglisse@redhat.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: kernel test robot <lkp@intel.com>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Stefano Stabellini <sstabellini@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Link: https://lkml.kernel.org/r/160106109960.30709.7379926726669669398.stgit@dwillia2-desk3.amr.corp.intel.com
      Link: https://lkml.kernel.org/r/159643094279.4062302.17779410714418721328.stgit@dwillia2-desk3.amr.corp.intel.com
      Link: https://lkml.kernel.org/r/159643094925.4062302.14979872973043772305.stgit@dwillia2-desk3.amr.corp.intel.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
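
      A sketch of the cleanup pattern for the numa=fake= case (the warning
      text is illustrative; the stub's -EINVAL is what produces the
      command-line feedback):

          #ifdef CONFIG_NUMA_EMU
          int numa_emu_cmdline(char *str);
          #else
          static inline int numa_emu_cmdline(char *str)
          {
                  pr_warn("numa=fake= ignored: CONFIG_NUMA_EMU is not enabled\n");
                  return -EINVAL;
          }
          #endif

          /* numa_setup() can then parse unconditionally, with no ifdef: */
          if (!strncmp(opt, "fake=", 5))
                  return numa_emu_cmdline(opt + 5);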
  9. 13 October 2020 (3 commits)
    • x86/uaccess: utilize CONFIG_CC_HAS_ASM_GOTO_OUTPUT · 865c50e1
      Authored by Nick Desaulniers
      Clang 11 shipped support for outputs to asm goto statements along the
      fallthrough path. Double up some of the get_user() and related macros
      to take advantage of this GNU C extension. This should improve the
      generated code's performance for these accesses.
      
      Cc: Bill Wendling <morbo@google.com>
      Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Nick Desaulniers <ndesaulniers@google.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
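
      A standalone sketch of the pattern (not the kernel macro itself): with
      asm goto outputs, the loaded value is an output along the fallthrough
      path and the fault path is a plain C label, so no error flag has to be
      threaded through a register and tested.

          #ifdef CONFIG_CC_HAS_ASM_GOTO_OUTPUT
          /* A real implementation adds an exception-table entry routing a
           * fault at label 1: to err_label. */
          #define my_get_user(x, ptr, err_label)          \
                  asm goto("1: mov %1, %0"                \
                           : "=r" (x)                     \
                           : "m" (*(ptr))                 \
                           : /* no clobbers */            \
                           : err_label)
          #endif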
    • x86: Make __put_user() generate an out-of-line call · d55564cf
      Authored by Linus Torvalds
      Instead of inlining the stac/mov/clac sequence (which also requires
      individual exception table entries and several asm instruction
      alternatives entries), just generate "call __put_user_nocheck_X" for the
      __put_user() cases, the same way we changed __get_user earlier.
      
      Unlike the get_user() case, we didn't have the same nice infrastructure
      to just generate the call with a single case, so this actually has to
      change some of the infrastructure in order to do this.  But that only
      cleans up the code further.
      
      So now, instead of using a case statement for the sizes, we just do the
      same thing we've done on the get_user() side for a long time: use the
      size as an immediate constant to the asm, and generate the asm that way
      directly.
      
      In order to handle the special case of 64-bit data on a 32-bit kernel, I
      needed to change the calling convention slightly: the data is passed in
      %eax[:%edx], the pointer in %ecx, and the return value is also returned
      in %ecx.  It used to be returned in %eax, but because %eax can now be a
      double-register input, we don't want to mix that with a single-register
      output.
      
      The actual low-level asm is easier to handle: we'll just share the code
      between the checking and non-checking case, with the non-checking case
      jumping into the middle of the function.  That may sound a bit too
      special, but this code is all very very special anyway, so...
      
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
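
      The convention, expressed as a condensed call-site sketch (the full
      macro has more plumbing, and the 24 October entry above later fixes the
      evaluation order of ptr here): the register variable pins the value to
      %eax, which gcc widens to the %edx:%eax pair for an 8-byte value on
      32-bit, while the "c"/"=c" pair carries the pointer in and the error
      code back out of %ecx.

          register __typeof__(*(ptr)) __val_pu asm("%"_ASM_AX);
          int __ret_pu;

          __val_pu = (x);
          asm volatile("call __put_user_%P[size]"
                       : "=c" (__ret_pu), ASM_CALL_CONSTRAINT
                       : "0" (ptr), "r" (__val_pu),
                         [size] "i" (sizeof(*(ptr)))
                       : "ebx");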
    • x86: Make __get_user() generate an out-of-line call · ea6f043f
      Authored by Linus Torvalds
      Instead of inlining the whole stac/lfence/mov/clac sequence (which also
      requires individual exception table entries and several asm instruction
      alternatives entries), just generate "call __get_user_nocheck_X" for the
      __get_user() cases.
      
      We can use all the same infrastructure that we already do for the
      regular "get_user()", and the end result is simpler source code, and
      much simpler code generation.
      
      It also means that when I introduce asm goto with input for
      "unsafe_get_user()", there are no nasty interactions with the
      __get_user() code.
      
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
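
      A sketch of the resulting shape, mirroring the long-standing get_user()
      pattern (condensed; __inttype() picks an integer type of matching size):

          register __inttype(*(ptr)) __val_gu asm("%"_ASM_DX);
          int __ret_gu;

          asm volatile("call __get_user_nocheck_%P[size]"
                       : "=a" (__ret_gu), "=r" (__val_gu),
                         ASM_CALL_CONSTRAINT
                       : "0" (ptr), [size] "i" (sizeof(*(ptr))));
          (x) = (__force __typeof__(*(ptr))) __val_gu;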
  10. 08 October 2020 (1 commit)
  11. 07 October 2020 (13 commits)
  12. 06 October 2020 (3 commits)
    • x86, powerpc: Rename memcpy_mcsafe() to copy_mc_to_{user, kernel}() · ec6347bb
      Authored by Dan Williams
      In reaction to a proposal to introduce a memcpy_mcsafe_fast()
      implementation, Linus points out that memcpy_mcsafe() is poorly named:
      the name does not communicate the scope of the interface, specifically
      which addresses are valid to pass as source and destination, and which
      faults / exceptions are handled.
      
      Of particular concern is that even though x86 might be able to handle
      the semantics of copy_mc_to_user() with its common copy_user_generic()
      implementation, other archs likely need / want an explicit path for this
      case:
      
        On Fri, May 1, 2020 at 11:28 AM Linus Torvalds <torvalds@linux-foundation.org> wrote:
        >
        > On Thu, Apr 30, 2020 at 6:21 PM Dan Williams <dan.j.williams@intel.com> wrote:
        > >
        > > However now I see that copy_user_generic() works for the wrong reason.
        > > It works because the exception on the source address due to poison
        > > looks no different than a write fault on the user address to the
        > > caller, it's still just a short copy. So it makes copy_to_user() work
        > > for the wrong reason relative to the name.
        >
        > Right.
        >
        > And it won't work that way on other architectures. On x86, we have a
        > generic function that can take faults on either side, and we use it
        > for both cases (and for the "in_user" case too), but that's an
        > artifact of the architecture oddity.
        >
        > In fact, it's probably wrong even on x86 - because it can hide bugs -
        > but writing those things is painful enough that everybody prefers
        > having just one function.
      
      Replace a single top-level memcpy_mcsafe() with either
      copy_mc_to_user(), or copy_mc_to_kernel().
      
      Introduce an x86 copy_mc_fragile() name as the rename for the
      low-level x86 implementation formerly named memcpy_mcsafe(). It is used
      as the slow / careful backend that is supplanted by a fast
      copy_mc_generic() in a follow-on patch.
      
      One side-effect of this reorganization is that separating copy_mc_64.S
      to its own file means that perf no longer needs to track dependencies
      for its memcpy_64.S benchmarks.
      
       [ bp: Massage a bit. ]
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Reviewed-by: Tony Luck <tony.luck@intel.com>
      Acked-by: Michael Ellerman <mpe@ellerman.id.au>
      Cc: <stable@vger.kernel.org>
      Link: http://lore.kernel.org/r/CAHk-=wjSqtXAqfUJxFtWNwmguFASTgB0dz1dT3V-78Quiezqbg@mail.gmail.com
      Link: https://lkml.kernel.org/r/160195561680.2163339.11574962055305783722.stgit@dwillia2-desk3.amr.corp.intel.com
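
      The resulting top-level interfaces, sketched as prototypes (exact
      annotations vary by architecture): both return the number of bytes left
      uncopied, like copy_to_user().

          /* dst is a kernel address; tolerates machine-check faults on src */
          unsigned long __must_check
          copy_mc_to_kernel(void *dst, const void *src, unsigned int len);

          /* dst is a user address; tolerates MC faults on src and write
           * faults on dst */
          unsigned long __must_check
          copy_mc_to_user(void __user *dst, const void *src, unsigned int len);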
    • dma-mapping: move dma-debug.h to kernel/dma/ · a1fd09e8
      Authored by Christoph Hellwig
      Most of dma-debug.h is not required by anything outside of kernel/dma.
      Move the four declarations needed by dma-mapping.h or dma-ops providers
      into dma-mapping.h and dma-map-ops.h, and move the remainder of the
      file to kernel/dma/debug.h.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
    • dma-mapping: merge <linux/dma-contiguous.h> into <linux/dma-map-ops.h> · 0b1abd1f
      Authored by Christoph Hellwig
      Merge dma-contiguous.h into dma-map-ops.h, after moving the comment
      describing the contiguous allocator into kernel/dma/contiguous.c.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
  13. 03 October 2020 (1 commit)
  14. 01 October 2020 (1 commit)
  15. 28 September 2020 (5 commits)
    • x86: Use tracepoint_enabled() for msr tracepoints instead of open coding it · fdb46fae
      Authored by Steven Rostedt (VMware)
      Commit 7f47d8cc ("x86, tracing, perf: Add trace point for MSR accesses")
      added tracing of MSR reads and writes. But because of the complexity of
      having tracepoints in headers, even more so in a core header like msr.h,
      not to mention the bloat a tracepoint adds to inline functions, a helper
      function needs to be called from the header instead.
      
      Use the new tracepoint_enabled() macro in tracepoint-defs.h to test if the
      tracepoint is active before calling the helper function, instead of open
      coding the same logic, which requires knowing the internals of a tracepoint.
      
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
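
      The pattern on the MSR read path, per the commit (do_trace_read_msr()
      is the existing out-of-line helper):

          static __always_inline unsigned long long native_read_msr(unsigned int msr)
          {
                  unsigned long long val = __rdmsr(msr);

                  /* replaces open-coded peeking at the tracepoint's static key */
                  if (tracepoint_enabled(read_msr))
                          do_trace_read_msr(msr, val, 0);

                  return val;
          }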
    • KVM: x86: rename KVM_REQ_GET_VMCS12_PAGES · 729c15c2
      Authored by Paolo Bonzini
      We are going to use it for SVM too, so use a more generic name.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Introduce MSR filtering · 1a155254
      Authored by Alexander Graf
      It's not desirable to have all MSRs always handled by KVM kernel space.
      Some MSRs would be useful to handle in user space, either to emulate
      behavior (like uCode updates) or to determine whether they are valid
      based on the CPU model.
      
      To allow user space to specify which MSRs it wants to see handled by KVM,
      this patch introduces a new ioctl to push filter rules with bitmaps into
      KVM. Based on these bitmaps, KVM can then decide whether to reject MSR access.
      With the addition of KVM_CAP_X86_USER_SPACE_MSR it can also deflect the
      denied MSR events to user space to operate on.
      
      If no filter is populated, MSR handling stays identical to before.
      Signed-off-by: Alexander Graf <graf@amazon.com>
      
      Message-Id: <20200925143422.21718-8-graf@amazon.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
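
      A hypothetical user-space sketch of pushing one filter range (struct
      and flag names from the uapi this series adds; 0x174 is an arbitrary
      example base, and bitmap semantics follow the API documentation of the
      time: a 1 bit allows, a 0 bit denies):

          __u8 bitmap = 0;        /* one byte covers 8 MSRs; deny all of them */
          struct kvm_msr_filter filter = {
                  .flags = KVM_MSR_FILTER_DEFAULT_ALLOW,
                  .ranges[0] = {
                          .flags  = KVM_MSR_FILTER_READ | KVM_MSR_FILTER_WRITE,
                          .base   = 0x174,
                          .nmsrs  = 8,
                          .bitmap = &bitmap,
                  },
          };

          if (ioctl(vm_fd, KVM_X86_SET_MSR_FILTER, &filter))
                  err(1, "KVM_X86_SET_MSR_FILTER");

      Denied accesses can then be deflected to user space via
      KVM_CAP_X86_USER_SPACE_MSR, as described in the "Allow deflecting
      unknown MSR accesses" entry below.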
    • KVM: x86: Add infrastructure for MSR filtering · 51de8151
      Authored by Alexander Graf
      In the following commits we will add pieces of MSR filtering.
      To ensure that code compiles even with the feature half-merged, let's add
      a few stubs and struct definitions before the real patches start.
      Signed-off-by: Alexander Graf <graf@amazon.com>
      
      Message-Id: <20200925143422.21718-4-graf@amazon.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Allow deflecting unknown MSR accesses to user space · 1ae09954
      Authored by Alexander Graf
      MSRs are weird. Some of them are normal control registers, such as EFER.
      Some however are registers that really are model specific, not very
      interesting to virtualization workloads, and not performance critical.
      Others again are really just windows into package configuration.
      
      Out of these MSRs, only the first category is necessary to implement in
      kernel space. Rarely accessed MSRs, MSRs that should be fine-tuned
      against certain CPU models, and MSRs that contain information on the
      package level are much better suited for user space to process. However,
      over time we have accumulated a lot of MSRs that are not in the first
      category, but are still handled by in-kernel KVM code.
      
      This patch adds a generic interface to handle WRMSR and RDMSR from user
      space. With this, any future MSR that is part of the latter categories can
      be handled in user space.
      
      Furthermore, it allows us to replace the existing "ignore_msrs" logic
      with something that applies per-VM rather than to the full system. That
      way you can run production VMs in parallel with experimental ones where
      you don't care about proper MSR handling.
      Signed-off-by: Alexander Graf <graf@amazon.com>
      Reviewed-by: Jim Mattson <jmattson@google.com>
      
      Message-Id: <20200925143422.21718-3-graf@amazon.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
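
      A hypothetical sketch of the user-space side (kvm_run field names per
      the uapi in this series; my_emulate_rdmsr()/my_emulate_wrmsr() are
      made-up helpers, and the exact KVM_ENABLE_CAP argument convention should
      be checked against Documentation/virt/kvm/api.rst):

          struct kvm_enable_cap cap = {
                  .cap = KVM_CAP_X86_USER_SPACE_MSR,
                  .args[0] = KVM_MSR_EXIT_REASON_UNKNOWN, /* deflect unhandled MSRs */
          };
          ioctl(vm_fd, KVM_ENABLE_CAP, &cap);

          /* ... later, in the vCPU run loop: */
          switch (run->exit_reason) {
          case KVM_EXIT_X86_RDMSR:
                  run->msr.data  = my_emulate_rdmsr(run->msr.index);
                  run->msr.error = 0;     /* nonzero asks KVM to inject a #GP */
                  break;
          case KVM_EXIT_X86_WRMSR:
                  my_emulate_wrmsr(run->msr.index, run->msr.data);
                  run->msr.error = 0;
                  break;
          }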