1. 13 6月, 2017 2 次提交
  2. 08 6月, 2017 1 次提交
  3. 05 6月, 2017 5 次提交
    • A
      x86/mm, KVM: Teach KVM's VMX code that CR3 isn't a constant · d6e41f11
      Andy Lutomirski 提交于
      When PCID is enabled, CR3's PCID bits can change during context
      switches, so KVM won't be able to treat CR3 as a per-mm constant any
      more.
      
      I structured this like the existing CR4 handling.  Under ordinary
      circumstances (PCID disabled or if the current PCID and the value
      that's already in the VMCS match), then we won't do an extra VMCS
      write, and we'll never do an extra direct CR3 read.  The overhead
      should be minimal.
      
      I disallowed using the new helper in non-atomic context because
      PCID support will cause CR3 to stop being constant in non-atomic
      process context.
      
      (Frankly, it also scares me a bit that KVM ever treated CR3 as
      constant, but it looks like it was okay before.)
      Signed-off-by: NAndy Lutomirski <luto@kernel.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Cc: Borislav Petkov <bpetkov@suse.de>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: kvm@vger.kernel.org
      Cc: linux-mm@kvack.org
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      d6e41f11
    • A
      x86/mm: Rework lazy TLB to track the actual loaded mm · 3d28ebce
      Andy Lutomirski 提交于
      Lazy TLB state is currently managed in a rather baroque manner.
      AFAICT, there are three possible states:
      
       - Non-lazy.  This means that we're running a user thread or a
         kernel thread that has called use_mm().  current->mm ==
         current->active_mm == cpu_tlbstate.active_mm and
         cpu_tlbstate.state == TLBSTATE_OK.
      
       - Lazy with user mm.  We're running a kernel thread without an mm
         and we're borrowing an mm_struct.  We have current->mm == NULL,
         current->active_mm == cpu_tlbstate.active_mm, cpu_tlbstate.state
         != TLBSTATE_OK (i.e. TLBSTATE_LAZY or 0).  The current cpu is set
         in mm_cpumask(current->active_mm).  CR3 points to
         current->active_mm->pgd.  The TLB is up to date.
      
       - Lazy with init_mm.  This happens when we call leave_mm().  We
         have current->mm == NULL, current->active_mm ==
         cpu_tlbstate.active_mm, but that mm is only relelvant insofar as
         the scheduler is tracking it for refcounting.  cpu_tlbstate.state
         != TLBSTATE_OK.  The current cpu is clear in
         mm_cpumask(current->active_mm).  CR3 points to swapper_pg_dir,
         i.e. init_mm->pgd.
      
      This patch simplifies the situation.  Other than perf, x86 stops
      caring about current->active_mm at all.  We have
      cpu_tlbstate.loaded_mm pointing to the mm that CR3 references.  The
      TLB is always up to date for that mm.  leave_mm() just switches us
      to init_mm.  There are no longer any special cases for mm_cpumask,
      and switch_mm() switches mms without worrying about laziness.
      
      After this patch, cpu_tlbstate.state serves only to tell the TLB
      flush code whether it may switch to init_mm instead of doing a
      normal flush.
      
      This makes fairly extensive changes to xen_exit_mmap(), which used
      to look a bit like black magic.
      
      Perf is unchanged.  With or without this change, perf may behave a bit
      erratically if it tries to read user memory in kernel thread context.
      We should build on this patch to teach perf to never look at user
      memory when cpu_tlbstate.loaded_mm != current->mm.
      Signed-off-by: NAndy Lutomirski <luto@kernel.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Cc: Borislav Petkov <bpetkov@suse.de>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-mm@kvack.org
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      3d28ebce
    • A
      x86/mm: Remove the UP asm/tlbflush.h code, always use the (formerly) SMP code · ce4a4e56
      Andy Lutomirski 提交于
      The UP asm/tlbflush.h generates somewhat nicer code than the SMP version.
      Aside from that, it's fallen quite a bit behind the SMP code:
      
       - flush_tlb_mm_range() didn't flush individual pages if the range
         was small.
      
       - The lazy TLB code was much weaker.  This usually wouldn't matter,
         but, if a kernel thread flushed its lazy "active_mm" more than
         once (due to reclaim or similar), it wouldn't be unlazied and
         would instead pointlessly flush repeatedly.
      
       - Tracepoints were missing.
      
      Aside from that, simply having the UP code around was a maintanence
      burden, since it means that any change to the TLB flush code had to
      make sure not to break it.
      
      Simplify everything by deleting the UP code.
      Signed-off-by: NAndy Lutomirski <luto@kernel.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Cc: Borislav Petkov <bpetkov@suse.de>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-mm@kvack.org
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      ce4a4e56
    • A
      x86/mm: Refactor flush_tlb_mm_range() to merge local and remote cases · 454bbad9
      Andy Lutomirski 提交于
      The local flush path is very similar to the remote flush path.
      Merge them.
      
      This is intended to make no difference to behavior whatsoever.  It
      removes some code and will make future changes to the flushing
      mechanics simpler.
      
      This patch does remove one small optimization: flush_tlb_mm_range()
      now has an unconditional smp_mb() instead of using MOV to CR3 or
      INVLPG as a full barrier when applicable.  I think this is okay for
      a few reasons.  First, smp_mb() is quite cheap compared to the cost
      of a TLB flush.  Second, this rearrangement makes a bigger
      optimization available: with some work on the SMP function call
      code, we could do the local and remote flushes in parallel.  Third,
      I'm planning a rework of the TLB flush algorithm that will require
      an atomic operation at the beginning of each flush, and that
      operation will replace the smp_mb().
      Signed-off-by: NAndy Lutomirski <luto@kernel.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Cc: Borislav Petkov <bpetkov@suse.de>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-mm@kvack.org
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      454bbad9
    • A
      x86/mm: Pass flush_tlb_info to flush_tlb_others() etc · a2055abe
      Andy Lutomirski 提交于
      Rather than passing all the contents of flush_tlb_info to
      flush_tlb_others(), pass a pointer to the structure directly. For
      consistency, this also removes the unnecessary cpu parameter from
      uv_flush_tlb_others() to make its signature match the other
      *flush_tlb_others() functions.
      
      This serves two purposes:
      
       - It will dramatically simplify future patches that change struct
         flush_tlb_info, which I'm planning to do.
      
       - struct flush_tlb_info is an adequate description of what to do
         for a local flush, too, so by reusing it we can remove duplicated
         code between local and remove flushes in a future patch.
      Signed-off-by: NAndy Lutomirski <luto@kernel.org>
      Acked-by: NRik van Riel <riel@redhat.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Borislav Petkov <bpetkov@suse.de>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-mm@kvack.org
      [ Fix build warning. ]
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      a2055abe
  4. 24 5月, 2017 2 次提交
    • A
      mm, x86/mm: Make the batched unmap TLB flush API more generic · e73ad5ff
      Andy Lutomirski 提交于
      try_to_unmap_flush() used to open-code a rather x86-centric flush
      sequence: local_flush_tlb() + flush_tlb_others().  Rearrange the
      code so that the arch (only x86 for now) provides
      arch_tlbbatch_add_mm() and arch_tlbbatch_flush() and the core code
      calls those functions instead.
      
      I'll want this for x86 because, to enable address space ids, I can't
      support the flush_tlb_others() mode used by exising
      try_to_unmap_flush() implementation with good performance.  I can
      support the new API fairly easily, though.
      
      I imagine that other architectures may be in a similar position.
      Architectures with strong remote flush primitives (arm64?) may have
      even worse performance problems with flush_tlb_others() the way that
      try_to_unmap_flush() uses it.
      Signed-off-by: NAndy Lutomirski <luto@kernel.org>
      Acked-by: NKees Cook <keescook@chromium.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Borislav Petkov <bpetkov@suse.de>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-mm@kvack.org
      Link: http://lkml.kernel.org/r/19f25a8581f9fb77876b7ff3b001f89835e34ea3.1495492063.git.luto@kernel.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
      e73ad5ff
    • A
      x86/mm: Reimplement flush_tlb_page() using flush_tlb_mm_range() · ca6c99c0
      Andy Lutomirski 提交于
      flush_tlb_page() was very similar to flush_tlb_mm_range() except that
      it had a couple of issues:
      
       - It was missing an smp_mb() in the case where
         current->active_mm != mm.  (This is a longstanding bug reported by Nadav Amit)
      
       - It was missing tracepoints and vm counter updates.
      
      The only reason that I can see for keeping it at as a separate
      function is that it could avoid a few branches that
      flush_tlb_mm_range() needs to decide to flush just one page.  This
      hardly seems worthwhile.  If we decide we want to get rid of those
      branches again, a better way would be to introduce an
      __flush_tlb_mm_range() helper and make both flush_tlb_page() and
      flush_tlb_mm_range() use it.
      Signed-off-by: NAndy Lutomirski <luto@kernel.org>
      Acked-by: NKees Cook <keescook@chromium.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Borislav Petkov <bpetkov@suse.de>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-mm@kvack.org
      Link: http://lkml.kernel.org/r/3cc3847cf888d8907577569b8bac3f01992ef8f9.1495492063.git.luto@kernel.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
      ca6c99c0
  5. 22 5月, 2017 4 次提交
    • L
      x86: fix 32-bit case of __get_user_asm_u64() · 33c9e972
      Linus Torvalds 提交于
      The code to fetch a 64-bit value from user space was entirely buggered,
      and has been since the code was merged in early 2016 in commit
      b2f68038 ("x86/mm/32: Add support for 64-bit __get_user() on 32-bit
      kernels").
      
      Happily the buggered routine is almost certainly entirely unused, since
      the normal way to access user space memory is just with the non-inlined
      "get_user()", and the inlined version didn't even historically exist.
      
      The normal "get_user()" case is handled by external hand-written asm in
      arch/x86/lib/getuser.S that doesn't have either of these issues.
      
      There were two independent bugs in __get_user_asm_u64():
      
       - it still did the STAC/CLAC user space access marking, even though
         that is now done by the wrapper macros, see commit 11f1a4b9
         ("x86: reorganize SMAP handling in user space accesses").
      
         This didn't result in a semantic error, it just means that the
         inlined optimized version was hugely less efficient than the
         allegedly slower standard version, since the CLAC/STAC overhead is
         quite high on modern Intel CPU's.
      
       - the double register %eax/%edx was marked as an output, but the %eax
         part of it was touched early in the asm, and could thus clobber other
         inputs to the asm that gcc didn't expect it to touch.
      
         In particular, that meant that the generated code could look like
         this:
      
              mov    (%eax),%eax
              mov    0x4(%eax),%edx
      
         where the load of %edx obviously was _supposed_ to be from the 32-bit
         word that followed the source of %eax, but because %eax was
         overwritten by the first instruction, the source of %edx was
         basically random garbage.
      
      The fixes are trivial: remove the extraneous STAC/CLAC entries, and mark
      the 64-bit output as early-clobber to let gcc know that no inputs should
      alias with the output register.
      
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Benjamin LaHaise <bcrl@kvack.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: stable@kernel.org   # v4.8+
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      33c9e972
    • L
      Clean up x86 unsafe_get/put_user() type handling · 334a023e
      Linus Torvalds 提交于
      Al noticed that unsafe_put_user() had type problems, and fixed them in
      commit a7cc722f ("fix unsafe_put_user()"), which made me look more
      at those functions.
      
      It turns out that unsafe_get_user() had a type issue too: it limited the
      largest size of the type it could handle to "unsigned long".  Which is
      fine with the current users, but doesn't match our existing normal
      get_user() semantics, which can also handle "u64" even when that does
      not fit in a long.
      
      While at it, also clean up the type cast in unsafe_put_user().  We
      actually want to just make it an assignment to the expected type of the
      pointer, because we actually do want warnings from types that don't
      convert silently.  And it makes the code more readable by not having
      that one very long and complex line.
      
      [ This patch might become stable material if we ever end up back-porting
        any new users of the unsafe uaccess code, but as things stand now this
        doesn't matter for any current existing uses. ]
      
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      334a023e
    • B
      x86/MCE: Export memory_error() · 2d1f4061
      Borislav Petkov 提交于
      Export the function which checks whether an MCE is a memory error to
      other users so that we can reuse the logic. Drop the boot_cpu_data use,
      while at it, as mce.cpuvendor already has the CPU vendor in there.
      
      Integrate a piece from a patch from Vishal Verma
      <vishal.l.verma@intel.com> to export it for modules (nfit).
      
      The main reason we're exporting it is that the nfit handler
      nfit_handle_mce() needs to detect a memory error properly before doing
      its recovery actions.
      Signed-off-by: NBorislav Petkov <bp@suse.de>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vishal Verma <vishal.l.verma@intel.com>
      Cc: <stable@vger.kernel.org>
      Link: http://lkml.kernel.org/r/20170519093915.15413-2-bp@alien8.deSigned-off-by: NThomas Gleixner <tglx@linutronix.de>
      2d1f4061
    • A
      fix unsafe_put_user() · a7cc722f
      Al Viro 提交于
      __put_user_size() relies upon its first argument having the same type as what
      the second one points to; the only other user makes sure of that and
      unsafe_put_user() should do the same.
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      a7cc722f
  6. 17 5月, 2017 1 次提交
    • P
      KVM: x86: lower default for halt_poll_ns · b401ee0b
      Paolo Bonzini 提交于
      In some fio benchmarks, halt_poll_ns=400000 caused CPU utilization to
      increase heavily even in cases where the performance improvement was
      small.  In particular, bandwidth divided by CPU usage was as much as
      60% lower.
      
      To some extent this is the expected effect of the patch, and the
      additional CPU utilization is only visible when running the
      benchmarks.  However, halving the threshold also halves the extra
      CPU utilization (from +30-130% to +20-70%) and has no negative
      effect on performance.
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: NRadim Krčmář <rkrcmar@redhat.com>
      b401ee0b
  7. 10 5月, 2017 1 次提交
  8. 09 5月, 2017 3 次提交
  9. 08 5月, 2017 1 次提交
    • X
      x86/mm: Add support for gbpages to kernel_ident_mapping_init() · 66aad4fd
      Xunlei Pang 提交于
      Kernel identity mappings on x86-64 kernels are created in two
      ways: by the early x86 boot code, or by kernel_ident_mapping_init().
      
      Native kernels (which is the dominant usecase) use the former,
      but the kexec and the hibernation code uses kernel_ident_mapping_init().
      
      There's a subtle difference between these two ways of how identity
      mappings are created, the current kernel_ident_mapping_init() code
      creates identity mappings always using 2MB page(PMD level) - while
      the native kernel boot path also utilizes gbpages where available.
      
      This difference is suboptimal both for performance and for memory
      usage: kernel_ident_mapping_init() needs to allocate pages for the
      page tables when creating the new identity mappings.
      
      This patch adds 1GB page(PUD level) support to kernel_ident_mapping_init()
      to address these concerns.
      
      The primary advantage would be better TLB coverage/performance,
      because we'd utilize 1GB TLBs instead of 2MB ones.
      
      It is also useful for machines with large number of memory to
      save paging structure allocations(around 4MB/TB using 2MB page)
      when setting identity mappings for all the memory, after using
      1GB page it will consume only 8KB/TB.
      
      ( Note that this change alone does not activate gbpages in kexec,
        we are doing that in a separate patch. )
      Signed-off-by: NXunlei Pang <xlpang@redhat.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Dave Young <dyoung@redhat.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Eric Biederman <ebiederm@xmission.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: akpm@linux-foundation.org
      Cc: kexec@lists.infradead.org
      Link: http://lkml.kernel.org/r/1493862171-8799-1-git-send-email-xlpang@redhat.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      66aad4fd
  10. 05 5月, 2017 1 次提交
    • M
      x86/mm/kaslr: Use the _ASM_MUL macro for multiplication to work around Clang incompatibility · 121843eb
      Matthias Kaehlcke 提交于
      The constraint "rm" allows the compiler to put mix_const into memory.
      When the input operand is a memory location then MUL needs an operand
      size suffix, since Clang can't infer the multiplication width from the
      operand.
      
      Add and use the _ASM_MUL macro which determines the operand size and
      resolves to the NUL instruction with the corresponding suffix.
      
      This fixes the following error when building with clang:
      
        CC      arch/x86/lib/kaslr.o
        /tmp/kaslr-dfe1ad.s: Assembler messages:
        /tmp/kaslr-dfe1ad.s:182: Error: no instruction mnemonic suffix given and no register operands; can't size instruction
      Signed-off-by: NMatthias Kaehlcke <mka@chromium.org>
      Cc: Grant Grundler <grundler@chromium.org>
      Cc: Greg Hackmann <ghackmann@google.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Michael Davidson <md@google.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20170501224741.133938-1-mka@chromium.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
      121843eb
  11. 02 5月, 2017 5 次提交
  12. 29 4月, 2017 2 次提交
  13. 27 4月, 2017 3 次提交
    • L
      KVM: x86: fix emulation of RSM and IRET instructions · 6ed071f0
      Ladi Prosek 提交于
      On AMD, the effect of set_nmi_mask called by emulate_iret_real and em_rsm
      on hflags is reverted later on in x86_emulate_instruction where hflags are
      overwritten with ctxt->emul_flags (the kvm_set_hflags call). This manifests
      as a hang when rebooting Windows VMs with QEMU, OVMF, and >1 vcpu.
      
      Instead of trying to merge ctxt->emul_flags into vcpu->arch.hflags after
      an instruction is emulated, this commit deletes emul_flags altogether and
      makes the emulator access vcpu->arch.hflags using two new accessors. This
      way all changes, on the emulator side as well as in functions called from
      the emulator and accessing vcpu state with emul_to_vcpu, are preserved.
      
      More details on the bug and its manifestation with Windows and OVMF:
      
        It's a KVM bug in the interaction between SMI/SMM and NMI, specific to AMD.
        I believe that the SMM part explains why we started seeing this only with
        OVMF.
      
        KVM masks and unmasks NMI when entering and leaving SMM. When KVM emulates
        the RSM instruction in em_rsm, the set_nmi_mask call doesn't stick because
        later on in x86_emulate_instruction we overwrite arch.hflags with
        ctxt->emul_flags, effectively reverting the effect of the set_nmi_mask call.
        The AMD-specific hflag of interest here is HF_NMI_MASK.
      
        When rebooting the system, Windows sends an NMI IPI to all but the current
        cpu to shut them down. Only after all of them are parked in HLT will the
        initiating cpu finish the restart. If NMI is masked, other cpus never get
        the memo and the initiating cpu spins forever, waiting for
        hal!HalpInterruptProcessorsStarted to drop. That's the symptom we observe.
      
      Fixes: a584539b ("KVM: x86: pass the whole hflags field to emulator and back")
      Signed-off-by: NLadi Prosek <lprosek@redhat.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      6ed071f0
    • P
      KVM: mark requests that need synchronization · 7a97cec2
      Paolo Bonzini 提交于
      kvm_make_all_requests() provides a synchronization that waits until all
      kicked VCPUs have acknowledged the kick.  This is important for
      KVM_REQ_MMU_RELOAD as it prevents freeing while lockless paging is
      underway.
      
      This patch adds the synchronization property into all requests that are
      currently being used with kvm_make_all_requests() in order to preserve
      the current behavior and only introduce a new framework.  Removing it
      from requests where it is not necessary is left for future patches.
      Signed-off-by: NRadim Krčmář <rkrcmar@redhat.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      7a97cec2
    • R
      KVM: mark requests that do not need a wakeup · 930f7fd6
      Radim Krčmář 提交于
      Some operations must ensure that the guest is not running with stale
      data, but if the guest is halted, then the update can wait until another
      event happens.  kvm_make_all_requests() currently doesn't wake up, so we
      can mark all requests used with it.
      
      First 8 bits were arbitrarily reserved for request numbers.
      
      Most uses of requests have the request type as a constant, so a compiler
      will optimize the '&'.
      
      An alternative would be to have an inline function that would return
      whether the request needs a wake-up or not, but I like this one better
      even though it might produce worse assembly.
      Signed-off-by: NRadim Krčmář <rkrcmar@redhat.com>
      Reviewed-by: NAndrew Jones <drjones@redhat.com>
      Reviewed-by: NCornelia Huck <cornelia.huck@de.ibm.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      930f7fd6
  14. 26 4月, 2017 2 次提交
    • A
      x86/mm: Remove flush_tlb() and flush_tlb_current_task() · 29961b59
      Andy Lutomirski 提交于
      I was trying to figure out what how flush_tlb_current_task() would
      possibly work correctly if current->mm != current->active_mm, but I
      realized I could spare myself the effort: it has no callers except
      the unused flush_tlb() macro.
      Signed-off-by: NAndy Lutomirski <luto@kernel.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/e52d64c11690f85e9f1d69d7b48cc2269cd2e94b.1492844372.git.luto@kernel.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
      29961b59
    • D
      x86, dax, pmem: remove indirection around memcpy_from_pmem() · 6abccd1b
      Dan Williams 提交于
      memcpy_from_pmem() maps directly to memcpy_mcsafe(). The wrapper
      serves no real benefit aside from affording a more generic function name
      than the x86-specific 'mcsafe'. However this would not be the first time
      that x86 terminology leaked into the global namespace. For lack of
      better name, just use memcpy_mcsafe() directly.
      
      This conversion also catches a place where we should have been using
      plain memcpy, acpi_nfit_blk_single_io().
      
      Cc: <x86@kernel.org>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jeff Moyer <jmoyer@redhat.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Matthew Wilcox <mawilcox@microsoft.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Acked-by: NTony Luck <tony.luck@intel.com>
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      6abccd1b
  15. 23 4月, 2017 1 次提交
    • I
      Revert "x86/mm/gup: Switch GUP to the generic get_user_page_fast() implementation" · 6dd29b3d
      Ingo Molnar 提交于
      This reverts commit 2947ba05.
      
      Dan Williams reported dax-pmem kernel warnings with the following signature:
      
         WARNING: CPU: 8 PID: 245 at lib/percpu-refcount.c:155 percpu_ref_switch_to_atomic_rcu+0x1f5/0x200
         percpu ref (dax_pmem_percpu_release [dax_pmem]) <= 0 (0) after switching to atomic
      
      ... and bisected it to this commit, which suggests possible memory corruption
      caused by the x86 fast-GUP conversion.
      
      He also pointed out:
      
       "
        This is similar to the backtrace when we were not properly handling
        pud faults and was fixed with this commit: 220ced16 "mm: fix
        get_user_pages() vs device-dax pud mappings"
      
        I've found some missing _devmap checks in the generic
        get_user_pages_fast() path, but this does not fix the regression
        [...]
       "
      
      So given that there are known bugs, and a pretty robust looking bisection
      points to this commit suggesting that are unknown bugs in the conversion
      as well, revert it for the time being - we'll re-try in v4.13.
      Reported-by: NDan Williams <dan.j.williams@intel.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: aneesh.kumar@linux.vnet.ibm.com
      Cc: dann.frazier@canonical.com
      Cc: dave.hansen@intel.com
      Cc: steve.capper@linaro.org
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      6dd29b3d
  16. 21 4月, 2017 1 次提交
  17. 20 4月, 2017 1 次提交
  18. 19 4月, 2017 2 次提交
  19. 18 4月, 2017 1 次提交
    • A
      Remove compat_sys_getdents64() · 2611dc19
      Al Viro 提交于
      Unlike normal compat syscall variants, it is needed only for
      biarch architectures that have different alignement requirements for
      u64 in 32bit and 64bit ABI *and* have __put_user() that won't handle
      a store of 64bit value at 32bit-aligned address.  We used to have one
      such (ia64), but its biarch support has been gone since 2010 (after
      being broken in 2008, which went unnoticed since nobody had been using
      it).
      
      It had escaped removal at the same time only because back in 2004
      a patch that switched several syscalls on amd64 from private wrappers to
      generic compat ones had switched to use of compat_sys_getdents64(), which
      hadn't needed (or used) a compat wrapper on amd64.
      
      Let's bury it - it's at least 7 years overdue.
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      2611dc19
  20. 17 4月, 2017 1 次提交