1. 20 May, 2016 1 commit
    • x86/mm/mpx: Work around MPX erratum SKD046 · 0f6ff2bc
      Committed by Dave Hansen
      This erratum essentially causes the CPU to forget which privilege
      level it is operating on (kernel vs. user) for the purposes of MPX.
      
      This erratum can only be triggered when a system is not using
      Supervisor Mode Execution Prevention (SMEP).  Our workaround for
      the erratum is to ensure that MPX can only be used in cases where
      SMEP is present in the processor and is enabled.
      
      This erratum only affects Core processors.  Atom is unaffected.
      But, there is no architectural way to determine Atom vs. Core.
      So, we just apply this workaround to all processors.  It's
      possible that it will mistakenly disable MPX on some Atom
      processors or future unaffected Core processors.  There are
      currently no processors that have MPX and not SMEP.  It would
      take something akin to a hypervisor masking SMEP out on an Atom
      processor for this to present itself on current hardware.
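
      The policy described above can be sketched in C as follows. This is
      an illustration only: struct demo_cpu and demo_mpx_usable() are
      hypothetical names, not the kernel's actual code, and the two flags
      stand in for the real CPUID and CR4 checks.

```c
#include <stdbool.h>

/* Hypothetical feature summary; the kernel tracks these via cpufeature bits. */
struct demo_cpu {
	bool has_mpx;       /* processor enumerates MPX */
	bool smep_enabled;  /* SMEP is present and enabled */
};

/*
 * Apply the SKD046 workaround to every processor: MPX may only be used
 * when SMEP is enabled, because Atom and Core cannot be told apart
 * architecturally.
 */
bool demo_mpx_usable(const struct demo_cpu *cpu)
{
	return cpu->has_mpx && cpu->smep_enabled;
}
```

      With this rule, a processor that has MPX but runs with SMEP masked
      out (for example by a hypervisor) simply loses MPX.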
      
      More details can be found at:
      
        http://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/desktop-6th-gen-core-family-spec-update.pdf
      
      "
        SKD046 Branch Instructions May Initialize MPX Bound Registers Incorrectly
      
        Problem:
      
        Depending on the current Intel MPX (Memory Protection
        Extensions) configuration, execution of certain branch
        instructions (near CALL, near RET, near JMP, and Jcc
        instructions) without a BND prefix (F2H) initialize the MPX bound
        registers. Due to this erratum, such a branch instruction that is
        executed both with CPL = 3 and with CPL < 3 may not use the
        correct MPX configuration register (BNDCFGU or BNDCFGS,
        respectively) for determining whether to initialize the bound
        registers; it may thus initialize the bound registers when it
        should not, or fail to initialize them when it should.
      
        Implication:
      
        A branch instruction that has executed both in user mode and in
        supervisor mode (from the same linear address) may cause a #BR
        (bound range fault) when it should not have or may not cause a
        #BR when it should have.

        Workaround:

        An operating system can avoid this erratum by setting
        CR4.SMEP[bit 20] to enable supervisor-mode execution prevention
        (SMEP). When SMEP is enabled, no code can be executed both with
        CPL = 3 and with CPL < 3.
      "
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Dave Hansen <dave@sr71.net>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20160512220400.3B35F1BC@viggo.jf.intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  2. 19 May, 2016 1 commit
    • x86/entry/64: Fix stack return address retrieval in thunk · d4bf7078
      Committed by Josh Poimboeuf
      With CONFIG_FRAME_POINTER enabled, a thunk can pass a bad return address
      value to the called function.  '9*8(%rsp)' actually gets the frame
      pointer, not the return address.
      
      The only users of the 'put_ret_addr_in_rdi' option are two functions
      which trace the enabling and disabling of interrupts, so this bug can
      result in bad debug or tracing information with CONFIG_IRQSOFF_TRACER or
      CONFIG_PROVE_LOCKING.
      
      Fix this by implementing the suggestion of Linus: explicitly push
      the frame pointer all the time and constify the stack offsets that
      way. This is both correct and easier to read.
      Reported-by: Matt Fleming <matt@codeblueprint.co.uk>
      Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
      [ Extended the changelog a bit. ]
      Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Alex Thorlton <athorlton@sgi.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Fixes: 058fb732 ("x86/asm/entry: Create stack frames in thunk functions")
      Link: http://lkml.kernel.org/r/20160517180606.v5o7wcgdni7443ol@treble
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  3. 17 May, 2016 1 commit
    • x86/efi: Fix 7-parameter efi_call()s · 683ad809
      Committed by Linus Torvalds
      Alex Thorlton reported that the SGI/UV code crashes in the efi_call()
      code when invoked with 7 parameters, due to:
      
              mov (%rsp), %rax
              mov 8(%rax), %rax
              ...
              mov %rax, 40(%rsp)
      
      Offset 8 is only correct if CONFIG_FRAME_POINTER is disabled;
      with frame pointers enabled it should be 16.
      
      Furthermore, the SAVE_XMM code saves the old stack pointer, but
      that's just crazy. It saves the stack pointer *AFTER* we've done
      the:
      
              FRAME_BEGIN
      
      ... which will have *changed* the stack pointer, depending on whether
      stack frames are enabled or not.
      
      So when the code then does:
      
              mov (%rsp), %rax
      
      ... we now move that old stack pointer into %rax, but the offset off that
      stack pointer will depend on whether that FRAME_BEGIN saved off %rbp
      or not.
      
      So that whole 8-vs-16 offset confusion depends on the frame pointer!
      If frame pointers were enabled, it will be 16. If they weren't, it
      will be 8.
      
      The right fix is to just get rid of that silly conditional frame
      pointer thing, and always use frame pointers in this stub function.
      And then we don't need that (odd) load to get the old stack
      pointer into %rax - we can just use the frame pointer.
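
      The 8-vs-16 arithmetic can be written out as a small C sketch. It
      assumes the System V x86-64 calling convention, where the first
      stack-passed argument sits 8 bytes above the stack pointer at
      function entry (just above the return address).
      demo_stack_arg_offset() is a hypothetical helper, not kernel code.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Offset of the first stack-passed argument, relative to the stack
 * pointer value that was saved after the optional frame-pointer push.
 */
size_t demo_stack_arg_offset(bool frame_pointer_pushed)
{
	size_t offset = 8;              /* skip the return address */

	if (frame_pointer_pushed)
		offset += 8;            /* skip the saved %rbp as well */
	return offset;
}
```

      Always pushing the frame pointer removes the conditional, which is
      why the fix can address the argument at a fixed offset from %rbp.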
      Reported-by: Alex Thorlton <athorlton@sgi.com>
      Tested-by: Alex Thorlton <athorlton@sgi.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Matt Fleming <matt@codeblueprint.co.uk>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Link: http://lkml.kernel.org/r/CA%2B55aFzBS2v%3DWnEH83cUDg7XkOremFqJ30BJwF40dCYjReBkUQ@mail.gmail.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  4. 16 May, 2016 1 commit
    • x86/cpufeature, x86/mm/pkeys: Fix broken compile-time disabling of pkeys · e8df1a95
      Committed by Dave Hansen
      When I added support for the Memory Protection Keys processor
      feature, I had to reindent the REQUIRED/DISABLED_MASK macros, and
      also consult the later cpufeature words.
      
      I'm not quite sure how I bungled it, but I consulted the wrong
      word at the end.  This only affected required or disabled cpu
      features in cpufeature words 14, 15 and 16.  So, only Protection
      Keys itself was screwed over here.
      
      The result was that if you disabled pkeys in your .config, you
      might still see some code show up that should have been compiled
      out.  There should be no functional problems, though.
      
      In verifying this patch I also realized that the DISABLE_PKU/OSPKE
      macros were defined backwards and that the cpu_has() check in
      setup_pku() was not doing the compile-time disabled checks.
      
      So also fix the macro for DISABLE_PKU/OSPKE and add a compile-time
      check for pkeys being enabled in setup_pku().
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: <stable@vger.kernel.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Dave Hansen <dave@sr71.net>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Fixes: dfb4a70f ("x86/cpufeature, x86/mm/pkeys: Add protection keys related CPUID definitions")
      Link: http://lkml.kernel.org/r/20160513221328.C200930B@viggo.jf.intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  5. 12 May, 2016 3 commits
  6. 11 May, 2016 2 commits
  7. 10 May, 2016 1 commit
  8. 07 May, 2016 1 commit
  9. 06 May, 2016 2 commits
    • x86/tsc: Read all ratio bits from MSR_PLATFORM_INFO · 886123fb
      Committed by Chen Yu
      Currently we read the TSC ratio as: ratio = (MSR_PLATFORM_INFO >> 8) & 0x1f;

      Thus we get bits 8-12 of MSR_PLATFORM_INFO; however, according to the SDM
      (35.5), the ratio bits are bits 8-15.
      
      Ignoring the upper bits can result in an incorrect tsc ratio, which causes the
      TSC calibration and the Local APIC timer frequency to be incorrect.
      
      Fix this problem by masking 0xff instead.
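
      The effect of the mask change can be seen in a minimal C sketch.
      The two helpers below are hypothetical (they are not the kernel's
      functions); they only reproduce the two bit extractions named in
      the changelog on a raw MSR_PLATFORM_INFO value.

```c
#include <stdint.h>

/* Old extraction: keeps only bits 8-12 of the MSR value. */
uint32_t demo_tsc_ratio_old(uint64_t platform_info)
{
	return (uint32_t)((platform_info >> 8) & 0x1f);
}

/* Fixed extraction: keeps all ratio bits, 8-15. */
uint32_t demo_tsc_ratio_fixed(uint64_t platform_info)
{
	return (uint32_t)((platform_info >> 8) & 0xff);
}
```

      For a raw value of 0x2400 (ratio 36), the old mask yields 4 while
      the fixed mask yields 36; ratios below 32 are unaffected.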
      
      [ tglx: Massaged changelog ]
      
      Fixes: 7da7c156 "x86, tsc: Add static (MSR) TSC calibration on Intel Atom SoCs"
      Signed-off-by: Chen Yu <yu.c.chen@intel.com>
      Cc: "Rafael J. Wysocki" <rafael@kernel.org>
      Cc: stable@vger.kernel.org
      Cc: Bin Gao <bin.gao@intel.com>
      Cc: Len Brown <lenb@kernel.org>
      Link: http://lkml.kernel.org/r/1462505619-5516-1-git-send-email-yu.c.chen@intel.com
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    • mm: thp: kvm: fix memory corruption in KVM with THP enabled · 127393fb
      Committed by Andrea Arcangeli
      After the THP refcounting change, obtaining a compound page from
      get_user_pages() no longer allows us to assume the entire compound page
      is immediately mappable from a secondary MMU.
      
      A secondary MMU doesn't want to call get_user_pages() more than once for
      each compound page, in order to know if it can map the whole compound
      page.  So a secondary MMU needs to know from a single get_user_pages()
      invocation when it can map immediately the entire compound page to avoid
      a flood of unnecessary secondary MMU faults and spurious
      atomic_inc()/atomic_dec() (pages don't have to be pinned by MMU notifier
      users).
      
      Ideally instead of the page->_mapcount < 1 check, get_user_pages()
      should return the granularity of the "page" mapping in the "mm" passed
      to get_user_pages().  However it's a non-trivial change to pass the "pmd"
      status belonging to the "mm" walked by get_user_pages() up the stack (up
      to the caller of get_user_pages()).  So the fix just checks that there is
      not a single pte mapping on the page returned by get_user_pages(), in
      which case the caller can assume that the whole compound page is mapped in
      the current "mm" (in a pmd_trans_huge()).  In that case the entire
      compound page is safe to map into the secondary MMU without additional
      get_user_pages() calls on the surrounding tail/head pages.  In addition
      to being faster, not having to run other get_user_pages() calls also
      reduces the memory footprint of the secondary MMU fault in case the pmd
      split happened as a result of memory pressure.
      
      Without this fix, after a MADV_DONTNEED (as invoked by QEMU during
      postcopy live migration or ballooning) or after generic swapping (with a
      failure in split_huge_page() that would only result in pmd splitting and
      not a physical page split), KVM would map the whole compound page into
      the shadow pagetables, even though regular faults or userfaults (like
      UFFDIO_COPY) may map regular pages into the primary MMU as a result of
      the pte faults, leading to the guest mode and userland mode going out of
      sync and not working on the same memory at all times.
      
      Any other secondary MMU notifier manager (KVM is just one of the many
      MMU notifier users) will need the same information if it doesn't want to
      run a flood of get_user_pages_fast and it can support multiple
      granularity in the secondary MMU mappings, so I think it is justified to
      be exposed not just to KVM.
      
      The other option would be to move transparent_hugepage_adjust to
      mm/huge_memory.c but that currently has all kinds of KVM data structures
      in it, so it's definitely not a cut-and-paste job, and I couldn't do a
      fix as clean as this one for 4.6.
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: "Li, Liang Z" <liang.z.li@intel.com>
      Cc: Amit Shah <amit.shah@redhat.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  10. 05 May, 2016 4 commits
  11. 04 May, 2016 1 commit
    • x86/efi-bgrt: Switch all pr_err() to pr_notice() for invalid BGRT · 7f9b474c
      Committed by Josh Boyer
      The promise of pretty boot splashes from firmware via BGRT was at
      best only that; a promise.  The kernel diligently checks to make
      sure the BGRT data firmware gives it is valid, and dutifully warns
      the user when it isn't.  However, it does so via the pr_err log
      level which seems unnecessary.  The user cannot do anything about
      this and there really isn't an error on the part of Linux to
      correct.
      
      This lowers the log level by using pr_notice instead.  Users will
      no longer have their boot process uglified by the kernel reminding
      us that firmware can be, and often is, broken when the 'quiet' kernel
      parameter is specified.  Ironic, considering BGRT is supposed to
      make boot pretty to begin with.
      Signed-off-by: Josh Boyer <jwboyer@fedoraproject.org>
      Signed-off-by: Matt Fleming <matt@codeblueprint.co.uk>
      Reviewed-by: Josh Triplett <josh@joshtriplett.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Môshe van der Sterre <me@moshe.nl>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-efi@vger.kernel.org
      Link: http://lkml.kernel.org/r/1462303781-8686-4-git-send-email-matt@codeblueprint.co.uk
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  12. 28 April, 2016 4 commits
    • perf/x86/intel: Fix incorrect lbr_sel_mask value · cf3beb7c
      Committed by Kan Liang
      This patch fixes a bug which was introduced by:
      
       b16a5b52 ("perf/x86: Add option to disable reading branch flags/cycles")
      
      In that commit, lbr_sel_mask is used to mask the lbr_select. But LBR_SEL_MASK
      doesn't include the bit for LBR_CALL_STACK. So LBR call stack will never be
      set in lbr_select.

      This patch corrects the LBR_SEL_MASK by including all valid bits in
      LBR_SELECT. Also, the LBR_CALL_STACK bit differs from the other bits in
      LBR_SELECT: it does not operate in suppress mode, so it needs to be
      handled specially in intel_pmu_setup_hw_lbr_filter.
      Signed-off-by: Kan Liang <kan.liang@intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Link: http://lkml.kernel.org/r/1461231010-4399-1-git-send-email-kan.liang@intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • perf/x86/intel/pt: Don't die on VMXON · 1c5ac21a
      Committed by Alexander Shishkin
      Some versions of Intel PT do not support tracing across VMXON, more
      specifically, VMXON will clear TraceEn control bit and any attempt to
      set it before VMXOFF will throw a #GP, which in the current state of
      things will crash the kernel. Namely:
      
        $ perf record -e intel_pt// kvm -nographic
      
      on such a machine will kill it.
      
      To avoid this, notify the intel_pt driver before VMXON and after
      VMXOFF so that it knows when not to enable itself.
      Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Gleb Natapov <gleb@kernel.org>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: hpa@zytor.com
      Link: http://lkml.kernel.org/r/87oa9dwrfk.fsf@ashishki-desk.ger.corp.intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • perf/x86/amd: Set the size of event map array to PERF_COUNT_HW_MAX · 0a25556f
      Committed by Adam Borowski
      The entry for PERF_COUNT_HW_REF_CPU_CYCLES is not used on AMD, but is
      referenced by filter_events() which expects undefined events to have a
      value of 0.
      
      Found via UBSAN:
      
        UBSAN: Undefined behaviour in arch/x86/events/amd/core.c:132:30
        index 9 is out of range for type 'u64 [9]'
        UBSAN: Undefined behaviour in arch/x86/events/amd/core.c:132:9
        load of address ffffffff81c021c8 with insufficient space for an object of type 'const u64'
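
      The idea behind the fix can be sketched in C as follows. The enum
      and the table are scaled-down, hypothetical stand-ins for the
      PERF_COUNT_HW_* ids and the AMD event map, and the event codes are
      illustrative values only. Sizing the array with the enum's MAX
      entry means every id has a slot, and slots without an initializer
      are zero.

```c
#include <stdint.h>

enum demo_hw_id {               /* stand-in for the PERF_COUNT_HW_* enum */
	DEMO_HW_CPU_CYCLES = 0,
	DEMO_HW_INSTRUCTIONS,
	DEMO_HW_REF_CPU_CYCLES, /* no hardware event behind this id */
	DEMO_HW_MAX
};

/* Sized by DEMO_HW_MAX: ids without an initializer read back as 0. */
static const uint64_t demo_event_map[DEMO_HW_MAX] = {
	[DEMO_HW_CPU_CYCLES]   = 0x0076,
	[DEMO_HW_INSTRUCTIONS] = 0x00c0,
};

/* Returns the event code for an id, or 0 if the id is undefined. */
uint64_t demo_event_lookup(int id)
{
	if (id < 0 || id >= DEMO_HW_MAX)
		return 0;
	return demo_event_map[id];
}
```

      A filter that treats 0 as "undefined event" can then index the
      table with any valid id without reading past its end.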
      Signed-off-by: Adam Borowski <kilobyte@angband.pl>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Link: http://lkml.kernel.org/r/1461749731-30979-1-git-send-email-kilobyte@angband.pl
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • x86/apic: Handle zero vector gracefully in clear_vector_irq() · 1bdb8970
      Committed by Keith Busch
      If x86_vector_alloc_irq() fails, x86_vector_free_irqs() is invoked to clean up
      the already allocated vectors. This subsequently calls clear_vector_irq().
      
      The failed irq has no vector assigned, which triggers the BUG_ON(!vector) in
      clear_vector_irq().
      
      We cannot suppress the call to x86_vector_free_irqs() for the failed
      interrupt, because the other data related to this irq must be cleaned up as
      well. So calling clear_vector_irq() with vector == 0 is legitimate.
      
      Remove the BUG_ON and return if vector is zero.
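
      A minimal C sketch of that early-return pattern, using hypothetical
      demo_* names and a simplified table rather than the kernel's real
      per-CPU vector bookkeeping:

```c
#define DEMO_NR_VECTORS 256

struct demo_irq_data {
	unsigned int vector;    /* 0 means no vector was assigned */
};

static int demo_vector_irq[DEMO_NR_VECTORS];

/* Returns 1 if a vector was released, 0 if there was nothing to clear. */
int demo_clear_vector_irq(struct demo_irq_data *data)
{
	if (!data->vector)
		return 0;       /* failed allocation: legitimate, not a bug */

	if (data->vector < DEMO_NR_VECTORS)
		demo_vector_irq[data->vector] = -1;
	data->vector = 0;
	return 1;
}
```

      The cleanup path can then call the function for every interrupt,
      including the one whose allocation failed.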
      
      [ tglx: Massaged changelog ]
      
      Fixes: b5dc8e6c "x86/irq: Use hierarchical irqdomain to manage CPU interrupt vectors"
      Signed-off-by: Keith Busch <keith.busch@intel.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
  13. 27 April, 2016 1 commit
  14. 23 April, 2016 3 commits
    • perf/x86/intel/rapl: Add missing Haswell model · e1089602
      Committed by Srinivas Pandruvada
      Added one missing Haswell model.
      Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: bp@alien8.de
      Cc: hpa@zytor.com
      Link: http://lkml.kernel.org/r/1460907809-11897-1-git-send-email-srinivas.pandruvada@linux.intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • perf/x86/intel: Add model number for Skylake Server to perf · b89c1737
      Committed by Andi Kleen
      Everything the same as base Skylake, just a new model number.
      Signed-off-by: Andi Kleen <ak@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Link: http://lkml.kernel.org/r/1460751933-2264-1-git-send-email-andi@firstfloor.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • xen/qspinlock: Don't kick CPU if IRQ is not initialized · 707e59ba
      Committed by Ross Lagerwall
      The following commit:
      
        1fb3a8b2 ("xen/spinlock: Fix locking path engaging too soon under PVHVM.")
      
      ... moved the initialization of the kicker interrupt until after
      native_cpu_up() is called.
      
      However, when using qspinlocks, a CPU may try to kick another CPU that is
      spinning (because it has not yet initialized its kicker interrupt), resulting
      in the following crash during boot:
      
        kernel BUG at /build/linux-Ay7j_C/linux-4.4.0/drivers/xen/events/events_base.c:1210!
        invalid opcode: 0000 [#1] SMP
        ...
        RIP: 0010:[<ffffffff814c97c9>]  [<ffffffff814c97c9>] xen_send_IPI_one+0x59/0x60
        ...
        Call Trace:
         [<ffffffff8102be9e>] xen_qlock_kick+0xe/0x10
         [<ffffffff810cabc2>] __pv_queued_spin_unlock+0xb2/0xf0
         [<ffffffff810ca6d1>] ? __raw_callee_save___pv_queued_spin_unlock+0x11/0x20
         [<ffffffff81052936>] ? check_tsc_warp+0x76/0x150
         [<ffffffff81052aa6>] check_tsc_sync_source+0x96/0x160
         [<ffffffff81051e28>] native_cpu_up+0x3d8/0x9f0
         [<ffffffff8102b315>] xen_hvm_cpu_up+0x35/0x80
         [<ffffffff8108198c>] _cpu_up+0x13c/0x180
         [<ffffffff81081a4a>] cpu_up+0x7a/0xa0
         [<ffffffff81f80dfc>] smp_init+0x7f/0x81
         [<ffffffff81f5a121>] kernel_init_freeable+0xef/0x212
         [<ffffffff81817f30>] ? rest_init+0x80/0x80
         [<ffffffff81817f3e>] kernel_init+0xe/0xe0
         [<ffffffff8182488f>] ret_from_fork+0x3f/0x70
         [<ffffffff81817f30>] ? rest_init+0x80/0x80
      
      To fix this, only send the kick if the target CPU's interrupt has been
      initialized. This check isn't racy, because the target is waiting for
      the spinlock, so it won't have initialized the interrupt in the
      meantime.
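
      The check can be sketched in C as below. The array and function
      names are hypothetical stand-ins for the per-CPU kicker interrupt
      state, with -1 assumed to mean "not initialized yet":

```c
#define DEMO_NR_CPUS 4

/* -1 means the CPU has not set up its kicker interrupt yet. */
static int demo_kicker_irq[DEMO_NR_CPUS] = { -1, -1, -1, -1 };

/* Called by a CPU once its kicker interrupt is bound. */
void demo_init_kicker(int cpu, int irq)
{
	demo_kicker_irq[cpu] = irq;
}

/* Returns 1 if a kick would be sent, 0 if it is skipped. */
int demo_kick(int cpu)
{
	if (demo_kicker_irq[cpu] == -1)
		return 0;       /* target is still spinning; nothing to kick */

	/* A real implementation would send the IPI here. */
	return 1;
}
```

      Skipping the kick is safe in this model because a CPU without an
      initialized interrupt cannot be sleeping on it.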
      Signed-off-by: Ross Lagerwall <ross.lagerwall@citrix.com>
      Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: David Vrabel <david.vrabel@citrix.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Cc: xen-devel@lists.xenproject.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  15. 22 April, 2016 1 commit
    • x86/mm/xen: Suppress hugetlbfs in PV guests · 103f6112
      Committed by Jan Beulich
      Huge pages are not normally available to PV guests. Not suppressing
      hugetlbfs use results in an endless loop of page faults when user mode
      code tries to access a hugetlbfs mapped area (since the hypervisor
      refuses to create such PTEs, but error indications can't be
      propagated out of xen_set_pte_at(), just like for various of its
      siblings), and, once the process is killed, in an oops like this:
      
        kernel BUG at .../fs/hugetlbfs/inode.c:428!
        invalid opcode: 0000 [#1] SMP
        ...
        RIP: e030:[<ffffffff811c333b>]  [<ffffffff811c333b>] remove_inode_hugepages+0x25b/0x320
        ...
        Call Trace:
         [<ffffffff811c3415>] hugetlbfs_evict_inode+0x15/0x40
         [<ffffffff81167b3d>] evict+0xbd/0x1b0
         [<ffffffff8116514a>] __dentry_kill+0x19a/0x1f0
         [<ffffffff81165b0e>] dput+0x1fe/0x220
         [<ffffffff81150535>] __fput+0x155/0x200
         [<ffffffff81079fc0>] task_work_run+0x60/0xa0
         [<ffffffff81063510>] do_exit+0x160/0x400
         [<ffffffff810637eb>] do_group_exit+0x3b/0xa0
         [<ffffffff8106e8bd>] get_signal+0x1ed/0x470
         [<ffffffff8100f854>] do_signal+0x14/0x110
         [<ffffffff810030e9>] prepare_exit_to_usermode+0xe9/0xf0
         [<ffffffff814178a5>] retint_user+0x8/0x13
      
      This is CVE-2016-3961 / XSA-174.
      Reported-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Signed-off-by: Jan Beulich <jbeulich@suse.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: David Vrabel <david.vrabel@citrix.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Juergen Gross <JGross@suse.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Luis R. Rodriguez <mcgrof@suse.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Toshi Kani <toshi.kani@hp.com>
      Cc: stable@vger.kernel.org
      Cc: xen-devel <xen-devel@lists.xenproject.org>
      Link: http://lkml.kernel.org/r/57188ED802000078000E431C@prv-mh.provo.novell.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  16. 16 April, 2016 1 commit
    • x86/hyperv: Avoid reporting bogus NMI status for Gen2 instances · 1e2ae9ec
      Committed by Vitaly Kuznetsov
      Generation2 instances don't support reporting the NMI status on port 0x61:
      a read from there returns 'ff', and we end up reporting a nonsensical PCI
      error (as there is no PCI bus in these instances) on all NMIs:
      
          NMI: PCI system error (SERR) for reason ff on CPU 0.
          Dazed and confused, but trying to continue
      
      Fix the issue by overriding x86_platform.get_nmi_reason. Use 'booted on
      EFI' flag to detect Gen2 instances.
      Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Cathy Avery <cavery@redhat.com>
      Cc: Haiyang Zhang <haiyangz@microsoft.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: K. Y. Srinivasan <kys@microsoft.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: devel@linuxdriverproject.org
      Link: http://lkml.kernel.org/r/1460728232-31433-1-git-send-email-vkuznets@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  17. 15 April, 2016 2 commits
  18. 13 April, 2016 1 commit
  19. 11 April, 2016 3 commits
    • KVM: x86: mask CPUID(0xD,0x1).EAX against host value · 316314ca
      Committed by Paolo Bonzini
      This ensures that the guest doesn't see XSAVE extensions
      (e.g. xgetbv1 or xsavec) that the host lacks.
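
      The masking itself is a bitwise AND, sketched here in C. The
      function name is hypothetical; the bit positions are those of
      CPUID.(EAX=0xD,ECX=1).EAX, and the DEMO_ macro names are local to
      this example.

```c
#include <stdint.h>

/* Feature bits reported in CPUID.(EAX=0xD,ECX=1).EAX */
#define DEMO_XSAVEOPT (1u << 0)
#define DEMO_XSAVEC   (1u << 1)
#define DEMO_XGETBV1  (1u << 2)
#define DEMO_XSAVES   (1u << 3)

/* Keep only the features that the host also reports. */
uint32_t demo_mask_guest_eax(uint32_t guest_eax, uint32_t host_eax)
{
	return guest_eax & host_eax;
}
```

      If userspace requests XSAVEOPT, XSAVEC and XGETBV1 for the guest
      but the host only reports XSAVEOPT, the guest ends up with
      XSAVEOPT alone.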
      
      Cc: stable@vger.kernel.org
      Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • kvm: x86: do not leak guest xcr0 into host interrupt handlers · fc5b7f3b
      Committed by David Matlack
      An interrupt handler that uses the fpu can kill a KVM VM, if it runs
      under the following conditions:
       - the guest's xcr0 register is loaded on the cpu
       - the guest's fpu context is not loaded
       - the host is using eagerfpu
      
      Note that the guest's xcr0 register and fpu context are not loaded as
      part of the atomic world switch into "guest mode". They are loaded by
      KVM while the cpu is still in "host mode".
      
      Usage of the fpu in interrupt context is gated by irq_fpu_usable(). The
      interrupt handler will look something like this:
      
      if (irq_fpu_usable()) {
              kernel_fpu_begin();
      
              [... code that uses the fpu ...]
      
              kernel_fpu_end();
      }
      
      As long as the guest's fpu is not loaded and the host is using eager
      fpu, irq_fpu_usable() returns true (interrupted_kernel_fpu_idle()
      returns true). The interrupt handler proceeds to use the fpu with
      the guest's xcr0 live.
      
      kernel_fpu_begin() saves the current fpu context. If this uses
      XSAVE[OPT], it may leave the xsave area in an undesirable state.
      According to the SDM, during XSAVE bit i of XSTATE_BV is not modified
      if bit i is 0 in xcr0. So it's possible that XSTATE_BV[i] == 1 and
      xcr0[i] == 0 following an XSAVE.
      
      kernel_fpu_end() restores the fpu context. Now if any bit i in
      XSTATE_BV == 1 while xcr0[i] == 0, XRSTOR generates a #GP. The
      fault is trapped and SIGSEGV is delivered to the current process.
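
      The save/restore mismatch can be modelled with two small C
      helpers. This is a simplified sketch, not the hardware's full
      behaviour: it assumes the instruction mask covers every component
      enabled in xcr0, and the demo_* names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Model of the XSTATE_BV update done by XSAVE: bits enabled in xcr0
 * are rewritten from the in-use state, all other bits keep their old
 * value.
 */
uint64_t demo_xsave_update_bv(uint64_t old_bv, uint64_t in_use, uint64_t xcr0)
{
	return (old_bv & ~xcr0) | (in_use & xcr0);
}

/* XRSTOR faults if XSTATE_BV has a bit set that xcr0 does not enable. */
bool demo_xrstor_would_fault(uint64_t xstate_bv, uint64_t xcr0)
{
	return (xstate_bv & ~xcr0) != 0;
}
```

      With an old XSTATE_BV of 0x7 and a guest xcr0 of 0x3 live, the
      save leaves bit 2 set, so the following restore would fault; with
      the host's xcr0 of 0x7 loaded instead, it would not.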
      
      Only pre-4.2 kernels appear to be vulnerable to this sequence of
      events. Commit 653f52c3 ("kvm,x86: load guest FPU context more eagerly")
      from 4.2 forces the guest's fpu to always be loaded on eagerfpu hosts.
      
      This patch fixes the bug by keeping the host's xcr0 loaded outside
      of the interrupts-disabled region where KVM switches into guest mode.
      
      Cc: stable@vger.kernel.org
      Suggested-by: Andy Lutomirski <luto@amacapital.net>
      Signed-off-by: David Matlack <dmatlack@google.com>
      [Move load after goto cancel_injection. - Paolo]
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: MMU: fix permission_fault() · 7a98205d
      Committed by Xiao Guangrong
      kvm-unit-tests complained that the PFEC is not set properly, e.g.:
      test pte.rw pte.d pte.nx pde.p pde.rw pde.pse user fetch: FAIL: error code 15
      expected 5
      Dump mapping: address: 0x123400000000
      ------L4: 3e95007
      ------L3: 3e96007
      ------L2: 2000083
      
      This happens because the PFEC returned to the guest is copied from the
      PFEC triggered by the shadow page table.

      This patch fixes it and makes the logic of updating errcode cleaner.
      Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
      [Do not assume pfec.p=1. - Paolo]
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  20. 08 April, 2016 1 commit
  21. 07 April, 2016 1 commit
  22. 05 April, 2016 1 commit
    • kvm: x86: make lapic hrtimer pinned · 61abdbe0
      Committed by Luiz Capitulino
      When a vCPU runs on a nohz_full core, the hrtimer used by
      the lapic emulation code can be migrated to another core.
      When this happens, it's possible to observe millisecond
      latency when delivering timer IRQs to KVM guests.
      
      The huge latency is mainly due to the fact that
      apic_timer_fn() expects to run during a kvm exit. It
      sets KVM_REQ_PENDING_TIMER and lets it be handled on kvm
      entry. However, if the timer fires on a different core,
      we have to wait until the next kvm exit for the guest
      to see KVM_REQ_PENDING_TIMER set.
      
      This problem became visible after commit 9642d18e. This
      commit changed the timer migration code to always attempt
      to migrate timers away from nohz_full cores. While it's
      debatable whether this is correct/desirable (I don't think
      it is), it's clear that the lapic emulation code has
      a requirement on firing the hrtimer on the same core
      where it was started. This is achieved by making the
      hrtimer pinned.
      
      Lastly, note that KVM has code to migrate timers when a
      vCPU is scheduled to run on a different core. However, this
      forced migration may fail. When this happens, we can have
      the same problem. If we want 100% correctness, we'll have
      to modify apic_timer_fn() to cause a kvm exit when it runs
      on a different core than the vCPU. Not sure if this is
      possible.
      
      Here's a reproducer for the issue being fixed:
      
       1. Set all cores but core0 to be nohz_full cores
       2. Start a guest with a single vCPU
       3. Trace apic_timer_fn() and kvm_inject_apic_timer_irqs()
      
      You'll see that apic_timer_fn() will run in core0 while
      kvm_inject_apic_timer_irqs() runs in a different core. If
      you get both on core0, try running a program that takes 100%
      of the CPU and pin it to core0 to force the vCPU out.
      Signed-off-by: Luiz Capitulino <lcapitulino@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  23. 02 April, 2016 2 commits
  24. 01 April, 2016 1 commit