1. 06 Jan 2019, 1 commit
    • jump_label: move 'asm goto' support test to Kconfig · e9666d10
      Authored by Masahiro Yamada
      Currently, CONFIG_JUMP_LABEL just means "I _want_ to use jump label".
      
      The jump label is controlled by HAVE_JUMP_LABEL, which is defined
      like this:
      
        #if defined(CC_HAVE_ASM_GOTO) && defined(CONFIG_JUMP_LABEL)
        # define HAVE_JUMP_LABEL
        #endif
      
      We can improve this by testing 'asm goto' support in Kconfig, then
      make JUMP_LABEL depend on CC_HAS_ASM_GOTO.
      
      The ugly #ifdef HAVE_JUMP_LABEL will go away, and CONFIG_JUMP_LABEL will
      match the real kernel capability.
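      
      As a hedged illustration of what the Kconfig test probes for, here is a
      minimal, self-contained C program using the 'asm goto' construct (a GCC
      extension; the function name is purely illustrative, not kernel code):
      
        #include <stdio.h>
      
        /* Minimal 'asm goto' usage: the asm statement may jump to a C label. */
        static int asm_goto_works(void)
        {
                asm goto("jmp %l[yes]" : : : : yes);
                return 0;       /* not reached when the jump is taken */
        yes:
                return 1;
        }
      
        int main(void)
        {
                printf("asm goto taken: %d\n", asm_goto_works());
                return 0;
        }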
      Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
      Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
      Tested-by: Sedat Dilek <sedat.dilek@gmail.com>
  2. 05 Jan 2019, 7 commits
    • x86: re-introduce non-generic memcpy_{to,from}io · 170d13ca
      Authored by Linus Torvalds
      This has been broken forever, and nobody ever really noticed because
      it's purely a performance issue.
      
      Long long ago, in commit 6175ddf0 ("x86: Clean up mem*io functions")
      Brian Gerst simplified the memory copies to and from iomem, since on
      x86, the instructions to access iomem are exactly the same as the
      regular instructions.
      
      That is technically true, and things worked, and nobody said anything.
      Besides, back then the regular memcpy was pretty simple and worked fine.
      
      Nobody noticed except for David Laight, that is.  David was testing a
      TLP monitor he was writing for an FPGA, and has been occasionally
      complaining about how memcpy_toio() writes things one byte at a time.
      
      Which is completely unacceptable from a performance standpoint, even if
      it happens to technically work.
      
      The reason it's writing one byte at a time is because while it's
      technically true that accesses to iomem are the same as accesses to
      regular memory on x86, the _granularity_ (and ordering) of accesses
      matter to iomem in ways that they don't matter to regular cached memory.
      
      In particular, when ERMS is set, we default to using "rep movsb" for
      larger memory copies.  That is indeed perfectly fine for real memory,
      since the whole point is that the CPU is going to do cacheline
      optimizations and executes the memory copy efficiently for cached
      memory.
      
      With iomem? Not so much.  With iomem, "rep movsb" will indeed work, but
      it will copy things one byte at a time. Slowly and ponderously.
      
      Now, originally, back in 2010 when commit 6175ddf0 was done, we
      didn't use ERMS, and this was much less noticeable.
      
      Our normal memcpy() was simpler in other ways too.
      
      Because in fact, it's not just about using the string instructions.  Our
      memcpy() these days does things like "read and write overlapping values"
      to handle the last bytes of the copy.  Again, for normal memory,
      overlapping accesses aren't an issue.  For iomem? They can be.
      
      So this re-introduces the specialized memcpy_toio(), memcpy_fromio() and
      memset_io() functions.  It doesn't particularly optimize them, but it
      tries to at least not be horrid, or do overlapping accesses.  In fact,
      this uses the existing __inline_memcpy() function that we still had
      lying around that uses our very traditional "rep movsl" loop followed by
      movsw/movsb for the final bytes.
      
      Somebody may decide to try to improve on it, but if we've gone almost a
      decade with only one person really ever noticing and complaining, maybe
      it's not worth worrying about further, once it's not _completely_ broken?
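      
      A rough userspace sketch of the access pattern being reinstated
      (illustrative only, not the kernel implementation; iomem_style_copy and
      the alignment assumption are mine):
      
        #include <stddef.h>
        #include <stdint.h>
      
        /* Copy with fixed-width, non-overlapping accesses -- 4-byte chunks
         * plus a byte tail -- roughly the way a "rep movsl" + movsw/movsb
         * loop behaves, instead of whatever width and overlap the generic
         * memcpy() picks.  Assumes dst/src are aligned for 4-byte accesses. */
        static void iomem_style_copy(volatile void *dst, const void *src, size_t n)
        {
                volatile uint32_t *d4 = dst;
                const uint32_t *s4 = src;
      
                while (n >= 4) {                /* fixed 4-byte stores */
                        *d4++ = *s4++;
                        n -= 4;
                }
      
                {
                        volatile uint8_t *d1 = (volatile uint8_t *)d4;
                        const uint8_t *s1 = (const uint8_t *)s4;
      
                        while (n--)             /* byte tail, never overlapping */
                                *d1++ = *s1++;
                }
        }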
      Reported-by: David Laight <David.Laight@aculab.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Use __put_user_goto in __put_user_size() and unsafe_put_user() · a959dc88
      Authored by Linus Torvalds
      This actually enables the __put_user_goto() functionality in
      unsafe_put_user().
      
      For an example of the effect of this, this is the code generated for the
      
              unsafe_put_user(signo, &infop->si_signo, Efault);
      
      in the waitid() system call:
      
      	movl %ecx,(%rbx)        # signo, MEM[(struct __large_struct *)_2]
      
      It's just one single store instruction, along with generating an
      exception table entry pointing to the Efault label case in case that
      instruction faults.
      
      Before, we would generate this:
      
      	xorl    %edx, %edx
      	movl    %ecx,(%rbx)     # signo, MEM[(struct __large_struct *)_3]
      	testl   %edx, %edx
      	jne     .L309
      
      with the exception table generated for that 'mov' instruction causing us
      to jump to a stub that set %edx to -EFAULT and then jumped back to the
      'testl' instruction.
      
      So not only do we now get rid of the extra code in the normal sequence,
      we also avoid unnecessarily keeping that extra error register live
      across it all.
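      
      For context, the call pattern looks roughly like this in kernel code (a
      hedged sketch modeled on the waitid() example above, with the error
      handling shortened):
      
        if (!user_access_begin(infop, sizeof(*infop)))
                return -EFAULT;
        unsafe_put_user(signo, &infop->si_signo, Efault);
        unsafe_put_user(0, &infop->si_errno, Efault);
        user_access_end();
        return 0;
      Efault:
        user_access_end();
        return -EFAULT;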
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • x86 uaccess: Introduce __put_user_goto · 4a789213
      Authored by Linus Torvalds
      This is finally the actual reason for the odd error handling in the
      "unsafe_get/put_user()" functions, introduced over three years ago.
      
      Using a "jump to error label" interface is somewhat odd, but very
      convenient as a programming interface, and more importantly, it fits
      very well with simply making the target be the exception handler address
      directly from the inline asm.
      
      The reason it took over three years to actually do this? We need "asm
      goto" support for it, which only became the default on x86 last year.
      It's now been a year that we've forced asm goto support (see commit
      e501ce95 "x86: Force asm-goto"), and so let's just do it here too.
      
      [ Side note: this commit was originally done back in 2016. The above
        commentary about timing is obviously about it only now getting merged
        into my real upstream tree     - Linus ]
      
      Sadly, gcc still only supports "asm goto" with asms that do not have any
      outputs, so we are limited to only the put_user case for this.  Maybe in
      several more years we can do the get_user case too.
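      
      A simplified, hedged sketch of the shape of such a macro for a 4-byte
      store (the real kernel macro is parameterized on the operand size and
      uses asm_volatile_goto plus the x86 extable helpers; this assumes those
      headers are available and the macro name is made up):
      
        #define put_user_goto_32(x, ptr, err_label)                     \
                asm goto("1:    movl %0, %1\n\t"                        \
                         _ASM_EXTABLE_UA(1b, %l2)                       \
                         : /* 'asm goto' cannot have outputs */         \
                         : "ir"(x), "m"(*(ptr))                         \
                         : : err_label)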
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: select HAVE_MOVE_PMD on x86 for faster mremap · 9f132f7e
      Authored by Joel Fernandes (Google)
      Moving page-tables at the PMD-level on x86 is known to be safe.  Enable
      this option so that we can do fast mremap when possible.
      
      Link: http://lkml.kernel.org/r/20181108181201.88826-4-joelaf@google.com
      Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
      Suggested-by: Kirill A. Shutemov <kirill@shutemov.name>
      Acked-by: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Julia Lawall <Julia.Lawall@lip6.fr>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: William Kucharski <william.kucharski@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: treewide: remove unused address argument from pte_alloc functions · 4cf58924
      Authored by Joel Fernandes (Google)
      Patch series "Add support for fast mremap".
      
      This series speeds up the mremap(2) syscall by copying page tables at
      the PMD level even for non-THP systems.  There is a concern that the
      extra 'address' argument that mremap passes to pte_alloc may do
      something subtle and architecture-related in the future that could make
      the scheme not work.  We also find that there is no point in passing
      the 'address' to pte_alloc since it is unused.  This patch therefore
      removes the argument tree-wide, resulting in a nice negative diff as
      well.  Along the way we also make sure that the enabled architectures
      do not do anything funky with the 'address' argument that would go
      unnoticed by the optimization.
      
      Build and boot tested on x86-64.  Build tested on arm64.  The config
      enablement patch for arm64 will be posted in the future after more
      testing.
      
      The changes were obtained by applying the following Coccinelle script
      (thanks to Julia for answering all the Coccinelle questions!).
      The following fix-ups were done manually:
      * Removal of the address argument from pte_fragment_alloc
      * Removal of the pte_alloc_one_fast definitions from m68k and microblaze.
      
      // Options: --include-headers --no-includes
      // Note: I split the 'identifier fn' line, so if you are manually
      // running it, please unsplit it so it runs for you.
      
      virtual patch
      
      @pte_alloc_func_def depends on patch exists@
      identifier E2;
      identifier fn =~
      "^(__pte_alloc|pte_alloc_one|pte_alloc|__pte_alloc_kernel|pte_alloc_one_kernel)$";
      type T2;
      @@
      
       fn(...
      - , T2 E2
       )
       { ... }
      
      @pte_alloc_func_proto_noarg depends on patch exists@
      type T1, T2, T3, T4;
      identifier fn =~ "^(__pte_alloc|pte_alloc_one|pte_alloc|__pte_alloc_kernel|pte_alloc_one_kernel)$";
      @@
      
      (
      - T3 fn(T1, T2);
      + T3 fn(T1);
      |
      - T3 fn(T1, T2, T4);
      + T3 fn(T1, T2);
      )
      
      @pte_alloc_func_proto depends on patch exists@
      identifier E1, E2, E4;
      type T1, T2, T3, T4;
      identifier fn =~
      "^(__pte_alloc|pte_alloc_one|pte_alloc|__pte_alloc_kernel|pte_alloc_one_kernel)$";
      @@
      
      (
      - T3 fn(T1 E1, T2 E2);
      + T3 fn(T1 E1);
      |
      - T3 fn(T1 E1, T2 E2, T4 E4);
      + T3 fn(T1 E1, T2 E2);
      )
      
      @pte_alloc_func_call depends on patch exists@
      expression E2;
      identifier fn =~
      "^(__pte_alloc|pte_alloc_one|pte_alloc|__pte_alloc_kernel|pte_alloc_one_kernel)$";
      @@
      
       fn(...
      -,  E2
       )
      
      @pte_alloc_macro depends on patch exists@
      identifier fn =~
      "^(__pte_alloc|pte_alloc_one|pte_alloc|__pte_alloc_kernel|pte_alloc_one_kernel)$";
      identifier a, b, c;
      expression e;
      position p;
      @@
      
      (
      - #define fn(a, b, c) e
      + #define fn(a, b) e
      |
      - #define fn(a, b) e
      + #define fn(a) e
      )
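      
      As an illustration, the net effect on one affected prototype looks like
      this (a hedged example, not an actual hunk from the series):
      
        /* before */
        pte_t *pte_alloc_one_kernel(struct mm_struct *mm, unsigned long address);
      
        /* after */
        pte_t *pte_alloc_one_kernel(struct mm_struct *mm);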
      
      Link: http://lkml.kernel.org/r/20181108181201.88826-2-joelaf@google.com
      Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
      Suggested-by: Kirill A. Shutemov <kirill@shutemov.name>
      Acked-by: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Julia Lawall <Julia.Lawall@lip6.fr>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: William Kucharski <william.kucharski@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • fls: change parameter to unsigned int · 3fc2579e
      Authored by Matthew Wilcox
      When testing in userspace, UBSAN pointed out that shifting into the sign
      bit is undefined behaviour.  It doesn't really make sense to ask for the
      highest set bit of a negative value, so just turn the argument type into
      an unsigned int.
      
      Some architectures (eg ppc) already had it declared as an unsigned int,
      so I don't expect too many problems.
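      
      A minimal, hedged sketch of a generic fls() over an unsigned argument
      (illustrative only, not the optimized per-architecture versions):
      
        #include <stdio.h>
      
        /* Returns the 1-based index of the most significant set bit, or 0
         * if no bits are set.  Shifting an unsigned value never involves a
         * sign bit, so there is no undefined behaviour. */
        static int fls_generic(unsigned int x)
        {
                int r = 0;
      
                while (x) {
                        x >>= 1;
                        r++;
                }
                return r;
        }
      
        int main(void)
        {
                printf("%d %d %d\n", fls_generic(0), fls_generic(1),
                       fls_generic(0x80000000u));
                return 0;
        }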
      
      Link: http://lkml.kernel.org/r/20181105221117.31828-1-willy@infradead.org
      Signed-off-by: Matthew Wilcox <willy@infradead.org>
      Acked-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: <linux-arch@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • make 'user_access_begin()' do 'access_ok()' · 594cc251
      Authored by Linus Torvalds
      Originally, the rule used to be that you'd have to do access_ok()
      separately, and then user_access_begin() before actually doing the
      direct (optimized) user access.
      
      But experience has shown that people then decide not to do access_ok()
      at all, and instead rely on it being implied by other operations or
      similar.  Which makes it very hard to verify that the access has
      actually been range-checked.
      
      If you use the unsafe direct user accesses, hardware features (either
      SMAP - Supervisor Mode Access Protection - on x86, or PAN - Privileged
      Access Never - on ARM) do force you to use user_access_begin().  But
      nothing really forces the range check.
      
      By putting the range check into user_access_begin(), we actually force
      people to do the right thing (tm), and the range check will be visible
      near the actual accesses.  We have way too long a history of people
      trying to avoid them.
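      
      The resulting x86 helper is roughly of this shape (a hedged sketch; the
      real one may use the nospec variant of the STAC wrapper):
      
        static __must_check inline bool
        user_access_begin(const void __user *ptr, size_t len)
        {
                if (unlikely(!access_ok(ptr, len)))
                        return false;
                __uaccess_begin();      /* STAC: open the user-access window */
                return true;
        }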
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  3. 04 Jan 2019, 1 commit
    • Remove 'type' argument from access_ok() function · 96d4f267
      Authored by Linus Torvalds
      Nobody has actually used the type (VERIFY_READ vs VERIFY_WRITE) argument
      of the user address range verification function since we got rid of the
      old racy i386-only code to walk page tables by hand.
      
      It existed because the original 80386 would not honor the write protect
      bit when in kernel mode, so you had to do COW by hand before doing any
      user access.  But we haven't supported that in a long time, and these
      days the 'type' argument is a purely historical artifact.
      
      A discussion about extending 'user_access_begin()' to do the range
      checking resulted in this patch, because there is no way we're going to
      move the old VERIFY_xyz interface to that model.  And it's best done at
      the end of the merge window when I've done most of my merges, so let's
      just get this done once and for all.
      
      This patch was mostly done with a sed-script, with manual fix-ups for
      the cases that weren't of the trivial 'access_ok(VERIFY_xyz' form.
      
      There were a couple of notable cases:
      
       - csky still had the old "verify_area()" name as an alias.
      
       - the iter_iov code had magical hardcoded knowledge of the actual
         values of VERIFY_{READ,WRITE} (not that they mattered, since nothing
         really used it)
      
       - microblaze used the type argument for a debug printout
      
      but other than those oddities this should be a total no-op patch.
      
      I tried to fix up all architectures, did fairly extensive grepping for
      access_ok() uses, and the changes are trivial, but I may have missed
      something.  Any missed conversion should be trivially fixable, though.
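      
      A typical trivial conversion looks like this (an illustrative example,
      not an actual hunk from the patch):
      
        /* before */
        if (!access_ok(VERIFY_WRITE, buf, count))
                return -EFAULT;
      
        /* after */
        if (!access_ok(buf, count))
                return -EFAULT;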
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  4. 30 Dec 2018, 2 commits
    • kgdb/treewide: constify struct kgdb_arch arch_kgdb_ops · cc028297
      Authored by Christophe Leroy
      checkpatch.pl reports the following:
      
        WARNING: struct kgdb_arch should normally be const
        #28: FILE: arch/mips/kernel/kgdb.c:397:
        +struct kgdb_arch arch_kgdb_ops = {
      
      This report makes sense: like all other ops structs, this one
      should also be const.  This patch makes the change.
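      
      The resulting definitions are of this shape (a hedged sketch; the field
      values below are placeholders, not taken from any particular arch):
      
        const struct kgdb_arch arch_kgdb_ops = {
                .gdb_bpt_instr  = { 0xcc },     /* breakpoint opcode, placeholder */
                .flags          = 0,
        };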
      
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Cc: Richard Kuo <rkuo@codeaurora.org>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Paul Burton <paul.burton@mips.com>
      Cc: James Hogan <jhogan@kernel.org>
      Cc: Ley Foon Tan <lftan@altera.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Rich Felker <dalias@libc.org>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: x86@kernel.org
      Acked-by: Daniel Thompson <daniel.thompson@linaro.org>
      Acked-by: Paul Burton <paul.burton@mips.com>
      Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
      Acked-by: Borislav Petkov <bp@suse.de>
      Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
      Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
    • kgdb: Remove irq flags from roundup · 9ef7fa50
      Authored by Douglas Anderson
      The function kgdb_roundup_cpus() was passed a parameter that was
      documented as:
      
      > the flags that will be used when restoring the interrupts. There is
      > local_irq_save() call before kgdb_roundup_cpus().
      
      Nobody used those flags.  Anyone who wanted to temporarily turn on
      interrupts just did local_irq_enable() and local_irq_disable() without
      looking at them.  So we can definitely remove the flags.
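      
      The interface change is simply the following (illustrative prototypes):
      
        /* before: the flags argument was passed but never used */
        extern void kgdb_roundup_cpus(unsigned long flags);
      
        /* after */
        extern void kgdb_roundup_cpus(void);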
      Signed-off-by: Douglas Anderson <dianders@chromium.org>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Richard Kuo <rkuo@codeaurora.org>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Paul Burton <paul.burton@mips.com>
      Cc: James Hogan <jhogan@kernel.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Cc: Rich Felker <dalias@libc.org>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Acked-by: Will Deacon <will.deacon@arm.com>
      Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
  5. 29 Dec 2018, 7 commits
  6. 23 Dec 2018, 15 commits
  7. 21 Dec 2018, 7 commits
    • treewide: surround Kconfig file paths with double quotes · 8636a1f9
      Authored by Masahiro Yamada
      The Kconfig lexer supports special characters such as '.' and '/' in
      the parameter context. In my understanding, the reason is just to
      support bare file paths in the source statement.
      
      I do not see a good reason to complicate Kconfig just to leave room
      for this ambiguity.
      
      The majority of code already surrounds file paths with double quotes,
      and it makes sense since file paths are constant string literals.
      
      Make it treewide consistent now.
      Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
      Acked-by: Wolfram Sang <wsa@the-dreams.de>
      Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
      Acked-by: Ingo Molnar <mingo@kernel.org>
    • R
      a0aea130
    • KVM: VMX: Move VM-Enter + VM-Exit handling to non-inline sub-routines · 453eafbe
      Authored by Sean Christopherson
      Transitioning to/from a VMX guest requires KVM to manually save/load
      the bulk of CPU state that the guest is allowed to directly access,
      e.g. XSAVE state, CR2, GPRs, etc...  For obvious reasons, loading the
      guest's GPR snapshot prior to VM-Enter and saving the snapshot after
      VM-Exit is done via handcoded assembly.  The assembly blob is written
      as inline asm so that it can easily access KVM-defined structs that
      are used to hold guest state, e.g. moving the blob to a standalone
      assembly file would require generating defines for struct offsets.
      
      The other relevant aspect of VMX transitions in KVM is the handling of
      VM-Exits.  KVM doesn't employ a separate VM-Exit handler per se, but
      rather treats the VMX transition as a mega instruction (with many side
      effects), i.e. sets the VMCS.HOST_RIP to a label immediately following
      VMLAUNCH/VMRESUME.  The label is then exposed to C code via a global
      variable definition in the inline assembly.
      
      Because of the global variable, KVM takes steps to (attempt to) ensure
      only a single instance of the owning C function, e.g. vmx_vcpu_run, is
      generated by the compiler.  The earliest approach placed the inline
      assembly in a separate noinline function[1].  Later, the assembly was
      folded back into vmx_vcpu_run() and tagged with __noclone[2][3], which
      is still used today.
      
      After moving to __noclone, an edge case was encountered where GCC's
      -ftracer optimization resulted in the inline assembly blob being
      duplicated.  This was "fixed" by explicitly disabling -ftracer in the
      __noclone definition[4].
      
      Recently, it was found that disabling -ftracer causes build warnings
      for unsuspecting users of __noclone[5], and more importantly for KVM,
      prevents the compiler from properly optimizing vmx_vcpu_run()[6].  And
      perhaps most importantly of all, it was pointed out that there is no
      way to prevent duplication of a function with 100% reliability[7],
      i.e. more edge cases may be encountered in the future.
      
      So to summarize, the only way to prevent the compiler from duplicating
      the global variable definition is to move the variable out of inline
      assembly, which has been suggested several times over[1][7][8].
      
      Resolve the aforementioned issues by moving the VMLAUNCH+VMRESUME and
      VM-Exit "handler" to standalone assembly sub-routines.  Moving only
      the core VMX transition code allows the struct indexing to remain as
      inline assembly and also allows the sub-routines to be used by
      nested_vmx_check_vmentry_hw().  Reusing the sub-routines has a happy
      side-effect of eliminating two VMWRITEs in the nested_early_check path
      as there is no longer a need to dynamically change VMCS.HOST_RIP.
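      
      As a generic, hedged illustration of the underlying technique (not the
      actual KVM code; my_transition and use_it are made-up names): defining
      the routine as a standalone global assembly symbol, instead of a label
      emitted from inline asm inside a C function, means the compiler can
      never duplicate the definition:
      
        /* x86-64 SysV example: a trivial "transition" routine defined in
         * file-scope asm, then declared and called from C like any function. */
        asm(".pushsection .text\n"
            ".globl my_transition\n"
            "my_transition:\n"
            "        movl $1, %eax\n"
            "        ret\n"
            ".popsection\n");
      
        extern int my_transition(void);
      
        int use_it(void)
        {
                return my_transition();   /* ordinary CALL to the asm symbol */
        }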
      
      Note that callers to vmx_vmenter() must account for the CALL modifying
      RSP, e.g. must subtract op-size from RSP when synchronizing RSP with
      VMCS.HOST_RSP and "restore" RSP prior to the CALL.  There are no great
      alternatives to fudging RSP.  Saving RSP in vmx_vmenter() is difficult
      because doing so requires a second register (VMWRITE does not provide
      an immediate encoding for the VMCS field and KVM supports Hyper-V's
      memory-based eVMCS ABI).  The other more drastic alternative would be
      to eschew VMCS.HOST_RSP and manually save/load RSP using a per-cpu
      variable (which can be encoded as e.g. gs:[imm]).  But because a valid
      stack is needed at the time of VM-Exit (NMIs aren't blocked and a user
      could theoretically insert INT3/INT1ICEBRK at the VM-Exit handler), a
      dedicated per-cpu VM-Exit stack would be required.  A dedicated stack
      isn't difficult to implement, but it would require at least one page
      per CPU and knowledge of the stack in the dumpstack routines.  And in
      most cases there is essentially zero overhead in dynamically updating
      VMCS.HOST_RSP, e.g. the VMWRITE can be avoided for all but the first
      VMLAUNCH unless nested_early_check=1, which is not a fast path.  In
      other words, avoiding the VMCS.HOST_RSP by using a dedicated stack
      would only make the code marginally less ugly while requiring at least
      one page per CPU and forcing the kernel to be aware (and approve) of
      the VM-Exit stack shenanigans.
      
      [1] cea15c24ca39 ("KVM: Move KVM context switch into own function")
      [2] a3b5ba49 ("KVM: VMX: add the __noclone attribute to vmx_vcpu_run")
      [3] 104f226b ("KVM: VMX: Fold __vmx_vcpu_run() into vmx_vcpu_run()")
      [4] 95272c29 ("compiler-gcc: disable -ftracer for __noclone functions")
      [5] https://lkml.kernel.org/r/20181218140105.ajuiglkpvstt3qxs@treble
      [6] https://patchwork.kernel.org/patch/8707981/#21817015
      [7] https://lkml.kernel.org/r/ri6y38lo23g.fsf@suse.cz
      [8] https://lkml.kernel.org/r/20181218212042.GE25620@tassilo.jf.intel.com
      Suggested-by: Andi Kleen <ak@linux.intel.com>
      Suggested-by: Martin Jambor <mjambor@suse.cz>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Martin Jambor <mjambor@suse.cz>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Miroslav Benes <mbenes@suse.cz>
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Reviewed-by: Andi Kleen <ak@linux.intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: VMX: Explicitly reference RCX as the vcpu_vmx pointer in asm blobs · 051a2d3e
      Authored by Sean Christopherson
      Use '%% " _ASM_CX"' instead of '%0' to dereference RCX, i.e. the
      'struct vcpu_vmx' pointer, in the VM-Enter asm blobs of vmx_vcpu_run()
      and nested_vmx_check_vmentry_hw().  Referencing the pointer through the
      positional '%0' operand means that adding or removing an output
      parameter renumbers the operands and so requires "rewriting" almost all
      of the asm blob, which makes it nearly impossible to understand what's
      being changed in even the most minor patches.
      
      Opportunistically improve the code comments.
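      
      A minimal userspace illustration of the idea (not the KVM blob;
      load_first_field is a made-up helper): the asm template names RCX
      explicitly while the constraint pins the pointer there, so adding or
      removing outputs later does not renumber anything inside the template:
      
        static long load_first_field(long *p)
        {
                long val;
      
                asm volatile("movq (%%rcx), %%rax"
                             : "=a"(val)        /* result in RAX */
                             : "c"(p)           /* pointer pinned in RCX */
                             : "memory");
                return val;
        }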
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Reviewed-by: Andi Kleen <ak@linux.intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Use jmp to invoke kvm_spurious_fault() from .fixup · e8143499
      Authored by Sean Christopherson
      ____kvm_handle_fault_on_reboot() provides a generic exception fixup
      handler that is used to cleanly handle faults on VMX/SVM instructions
      during reboot (or at least try to).  If there isn't a reboot in
      progress, ____kvm_handle_fault_on_reboot() treats any exception as
      fatal to KVM and invokes kvm_spurious_fault(), which in turn generates
      a BUG() to get a stack trace and die.
      
      When it was originally added by commit 4ecac3fd ("KVM: Handle
      virtualization instruction #UD faults during reboot"), the "call" to
      kvm_spurious_fault() was handcoded as PUSH+JMP, where the PUSH'd value
      is the RIP of the faulting instruction.
      
      The PUSH+JMP trickery is necessary because the exception fixup handler
      code lies outside of its associated function, e.g. right after the
      function.  An actual CALL from the .fixup code would show a slightly
      bogus stack trace, e.g. an extra "random" function would be inserted
      into the trace, as the return RIP on the stack would point to no known
      function (and the unwinder will likely try to guess who owns the RIP).
      
      Unfortunately, the JMP was replaced with a CALL when the macro was
      reworked to not spin indefinitely during reboot (commit b7c4145b
      "KVM: Don't spin on virt instruction faults during reboot").  This
      causes the aforementioned behavior where a bogus function is inserted
      into the stack trace, e.g. my builds like to blame free_kvm_area().
      
      Revert the CALL back to a JMP.  The changelog for commit b7c4145b
      ("KVM: Don't spin on virt instruction faults during reboot") contains
      nothing that indicates the switch to CALL was deliberate.  This is
      backed up by the fact that the PUSH <insn RIP> was left intact.
      
      Note that an alternative to the PUSH+JMP magic would be to JMP back
      to the "real" code and CALL from there, but that would require adding
      a JMP in the non-faulting path to avoid calling kvm_spurious_fault()
      and would add no value, i.e. the stack trace would be the same.
      
      Using CALL:
      
      ------------[ cut here ]------------
      kernel BUG at /home/sean/go/src/kernel.org/linux/arch/x86/kvm/x86.c:356!
      invalid opcode: 0000 [#1] SMP
      CPU: 4 PID: 1057 Comm: qemu-system-x86 Not tainted 4.20.0-rc6+ #75
      Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
      RIP: 0010:kvm_spurious_fault+0x5/0x10 [kvm]
      Code: <0f> 0b 66 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 41 55 49 89 fd 41
      RSP: 0018:ffffc900004bbcc8 EFLAGS: 00010046
      RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffffffffffffff
      RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
      RBP: ffff888273fd8000 R08: 00000000000003e8 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000784 R12: ffffc90000371fb0
      R13: 0000000000000000 R14: 000000026d763cf4 R15: ffff888273fd8000
      FS:  00007f3d69691700(0000) GS:ffff888277800000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 000055f89bc56fe0 CR3: 0000000271a5a001 CR4: 0000000000362ee0
      Call Trace:
       free_kvm_area+0x1044/0x43ea [kvm_intel]
       ? vmx_vcpu_run+0x156/0x630 [kvm_intel]
       ? kvm_arch_vcpu_ioctl_run+0x447/0x1a40 [kvm]
       ? kvm_vcpu_ioctl+0x368/0x5c0 [kvm]
       ? kvm_vcpu_ioctl+0x368/0x5c0 [kvm]
       ? __set_task_blocked+0x38/0x90
       ? __set_current_blocked+0x50/0x60
       ? __fpu__restore_sig+0x97/0x490
       ? do_vfs_ioctl+0xa1/0x620
       ? __x64_sys_futex+0x89/0x180
       ? ksys_ioctl+0x66/0x70
       ? __x64_sys_ioctl+0x16/0x20
       ? do_syscall_64+0x4f/0x100
       ? entry_SYSCALL_64_after_hwframe+0x44/0xa9
      Modules linked in: vhost_net vhost tap kvm_intel kvm irqbypass bridge stp llc
      ---[ end trace 9775b14b123b1713 ]---
      
      Using JMP:
      
      ------------[ cut here ]------------
      kernel BUG at /home/sean/go/src/kernel.org/linux/arch/x86/kvm/x86.c:356!
      invalid opcode: 0000 [#1] SMP
      CPU: 6 PID: 1067 Comm: qemu-system-x86 Not tainted 4.20.0-rc6+ #75
      Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
      RIP: 0010:kvm_spurious_fault+0x5/0x10 [kvm]
      Code: <0f> 0b 66 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 41 55 49 89 fd 41
      RSP: 0018:ffffc90000497cd0 EFLAGS: 00010046
      RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffffffffffffff
      RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
      RBP: ffff88827058bd40 R08: 00000000000003e8 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000784 R12: ffffc90000369fb0
      R13: 0000000000000000 R14: 00000003c8fc6642 R15: ffff88827058bd40
      FS:  00007f3d7219e700(0000) GS:ffff888277900000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00007f3d64001000 CR3: 0000000271c6b004 CR4: 0000000000362ee0
      Call Trace:
       vmx_vcpu_run+0x156/0x630 [kvm_intel]
       ? kvm_arch_vcpu_ioctl_run+0x447/0x1a40 [kvm]
       ? kvm_vcpu_ioctl+0x368/0x5c0 [kvm]
       ? kvm_vcpu_ioctl+0x368/0x5c0 [kvm]
       ? __set_task_blocked+0x38/0x90
       ? __set_current_blocked+0x50/0x60
       ? __fpu__restore_sig+0x97/0x490
       ? do_vfs_ioctl+0xa1/0x620
       ? __x64_sys_futex+0x89/0x180
       ? ksys_ioctl+0x66/0x70
       ? __x64_sys_ioctl+0x16/0x20
       ? do_syscall_64+0x4f/0x100
       ? entry_SYSCALL_64_after_hwframe+0x44/0xa9
      Modules linked in: vhost_net vhost tap kvm_intel kvm irqbypass bridge stp llc
      ---[ end trace f9daedb85ab3ddba ]---
      
      Fixes: b7c4145b ("KVM: Don't spin on virt instruction faults during reboot")
      Cc: stable@vger.kernel.org
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM/x86: Use SVM assembly instruction mnemonics instead of .byte streams · ac5ffda2
      Authored by Uros Bizjak
      Recently the minimum required version of binutils was changed to 2.20,
      which supports all SVM instruction mnemonics. The patch removes
      all .byte #defines and uses real instruction mnemonics instead.
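      An illustrative before/after for one such instruction, VMLOAD (hedged;
      the exact constraints and wrappers in KVM differ):
      
        /* before: hand-encoded opcode because old binutils lacked the mnemonic */
        asm volatile(".byte 0x0f, 0x01, 0xda" : : "a"(phys_addr) : "memory");
      
        /* after: binutils >= 2.20 understands the mnemonic */
        asm volatile("vmload %0" : : "a"(phys_addr) : "memory");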
      Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM/MMU: Flush tlb directly in the kvm_zap_gfn_range() · 71883a62
      Authored by Lan Tianyu
      Originally, the TLB flush is done by slot_handle_level_range().  This patch
      moves the flush directly to kvm_zap_gfn_range() when range flush is
      available, so that only the requested range can be flushed.
      Signed-off-by: Lan Tianyu <Tianyu.Lan@microsoft.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>