1. 05 June 2017, 6 commits
    • x86/mm: Rework lazy TLB to track the actual loaded mm · 3d28ebce
      Andy Lutomirski authored
      Lazy TLB state is currently managed in a rather baroque manner.
      AFAICT, there are three possible states:
      
       - Non-lazy.  This means that we're running a user thread or a
         kernel thread that has called use_mm().  current->mm ==
         current->active_mm == cpu_tlbstate.active_mm and
         cpu_tlbstate.state == TLBSTATE_OK.
      
       - Lazy with user mm.  We're running a kernel thread without an mm
         and we're borrowing an mm_struct.  We have current->mm == NULL,
         current->active_mm == cpu_tlbstate.active_mm, cpu_tlbstate.state
         != TLBSTATE_OK (i.e. TLBSTATE_LAZY or 0).  The current cpu is set
         in mm_cpumask(current->active_mm).  CR3 points to
         current->active_mm->pgd.  The TLB is up to date.
      
       - Lazy with init_mm.  This happens when we call leave_mm().  We
         have current->mm == NULL, current->active_mm ==
         cpu_tlbstate.active_mm, but that mm is only relevant insofar as
         the scheduler is tracking it for refcounting.  cpu_tlbstate.state
         != TLBSTATE_OK.  The current cpu is clear in
         mm_cpumask(current->active_mm).  CR3 points to swapper_pg_dir,
         i.e. init_mm->pgd.
      
      This patch simplifies the situation.  Other than perf, x86 stops
      caring about current->active_mm at all.  We have
      cpu_tlbstate.loaded_mm pointing to the mm that CR3 references.  The
      TLB is always up to date for that mm.  leave_mm() just switches us
      to init_mm.  There are no longer any special cases for mm_cpumask,
      and switch_mm() switches mms without worrying about laziness.
      
      After this patch, cpu_tlbstate.state serves only to tell the TLB
      flush code whether it may switch to init_mm instead of doing a
      normal flush.
      
      This makes fairly extensive changes to xen_exit_mmap(), which used
      to look a bit like black magic.
      
      Perf is unchanged.  With or without this change, perf may behave a bit
      erratically if it tries to read user memory in kernel thread context.
      We should build on this patch to teach perf to never look at user
      memory when cpu_tlbstate.loaded_mm != current->mm.
      Signed-off-by: Andy Lutomirski <luto@kernel.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Cc: Borislav Petkov <bpetkov@suse.de>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-mm@kvack.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      3d28ebce
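
      To make the new model concrete, here is a minimal user-space C sketch of the state described above; the names mirror the commit text, but this is an illustration, not the kernel implementation.

      #include <stdio.h>

      struct mm_struct { const char *name; };

      static struct mm_struct init_mm = { "init_mm" };

      /* After the patch, the per-CPU TLB state only records which mm CR3
       * references; the TLB is always up to date for that mm. */
      static struct {
          struct mm_struct *loaded_mm;
          int state;              /* only tells the flush code whether it may
                                     switch to init_mm instead of flushing */
      } cpu_tlbstate = { &init_mm, 0 };

      static void switch_mm(struct mm_struct *next)
      {
          cpu_tlbstate.loaded_mm = next;   /* models "CR3 := next->pgd" */
      }

      /* leave_mm() is now just a switch to init_mm; no mm_cpumask special cases. */
      static void leave_mm(void)
      {
          switch_mm(&init_mm);
      }

      int main(void)
      {
          struct mm_struct user_mm = { "user_mm" };

          switch_mm(&user_mm);
          leave_mm();
          printf("loaded_mm = %s\n", cpu_tlbstate.loaded_mm->name);
          return 0;
      }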
    • x86/mm: Remove the UP asm/tlbflush.h code, always use the (formerly) SMP code · ce4a4e56
      Andy Lutomirski authored
      The UP asm/tlbflush.h generates somewhat nicer code than the SMP version.
      Aside from that, it's fallen quite a bit behind the SMP code:
      
       - flush_tlb_mm_range() didn't flush individual pages if the range
         was small.
      
       - The lazy TLB code was much weaker.  This usually wouldn't matter,
         but, if a kernel thread flushed its lazy "active_mm" more than
         once (due to reclaim or similar), it wouldn't be unlazied and
         would instead pointlessly flush repeatedly.
      
       - Tracepoints were missing.
      
      Aside from that, simply having the UP code around was a maintenance
      burden, since it means that any change to the TLB flush code had to
      make sure not to break it.
      
      Simplify everything by deleting the UP code.
      Signed-off-by: Andy Lutomirski <luto@kernel.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Cc: Borislav Petkov <bpetkov@suse.de>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-mm@kvack.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      ce4a4e56
    • x86/mm: Use new merged flush logic in arch_tlbbatch_flush() · 3f79e4c7
      Andy Lutomirski authored
      Now there's only one copy of the local tlb flush logic for
      non-kernel pages on SMP kernels.
      
      The only functional change is that arch_tlbbatch_flush() will now
      leave_mm() on the local CPU if that CPU is in the batch and is in
      TLBSTATE_LAZY mode.
      Signed-off-by: Andy Lutomirski <luto@kernel.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Cc: Borislav Petkov <bpetkov@suse.de>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-mm@kvack.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      3f79e4c7
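
      The one functional change reduces to a small predicate; this is an illustrative sketch with stand-in types, not the kernel code.

      #include <stdbool.h>

      enum { TLBSTATE_OK = 1, TLBSTATE_LAZY = 2 };

      /* If the local CPU is in the flush batch but is lazy, switching to init_mm
       * via leave_mm() is preferable to really flushing a TLB that is about to be
       * abandoned anyway. */
      static bool local_cpu_should_leave_mm(bool cpu_in_batch, int tlb_state)
      {
          return cpu_in_batch && tlb_state == TLBSTATE_LAZY;
      }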
    • x86/mm: Refactor flush_tlb_mm_range() to merge local and remote cases · 454bbad9
      Andy Lutomirski authored
      The local flush path is very similar to the remote flush path.
      Merge them.
      
      This is intended to make no difference to behavior whatsoever.  It
      removes some code and will make future changes to the flushing
      mechanics simpler.
      
      This patch does remove one small optimization: flush_tlb_mm_range()
      now has an unconditional smp_mb() instead of using MOV to CR3 or
      INVLPG as a full barrier when applicable.  I think this is okay for
      a few reasons.  First, smp_mb() is quite cheap compared to the cost
      of a TLB flush.  Second, this rearrangement makes a bigger
      optimization available: with some work on the SMP function call
      code, we could do the local and remote flushes in parallel.  Third,
      I'm planning a rework of the TLB flush algorithm that will require
      an atomic operation at the beginning of each flush, and that
      operation will replace the smp_mb().
      Signed-off-by: Andy Lutomirski <luto@kernel.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Cc: Borislav Petkov <bpetkov@suse.de>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-mm@kvack.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      454bbad9
    • x86/mm: Change the leave_mm() condition for local TLB flushes · 59f537c1
      Andy Lutomirski authored
      On a remote TLB flush, we leave_mm() if we're TLBSTATE_LAZY.  For a
      local flush_tlb_mm_range(), we leave_mm() if !current->mm.  These
      are approximately the same condition -- the scheduler sets lazy TLB
      mode when switching to a thread with no mm.
      
      I'm about to merge the local and remote flush code, but for ease of
      verifying and bisecting the patch, I want the local and remote flush
      behavior to match first.  This patch changes the local code to match
      the remote code.
      Signed-off-by: Andy Lutomirski <luto@kernel.org>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Cc: Borislav Petkov <bpetkov@suse.de>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-mm@kvack.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      59f537c1
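
      The two conditions being unified can be written down as follows (stand-in types; illustrative only, not the kernel code).

      #include <stdbool.h>
      #include <stddef.h>

      enum { TLBSTATE_OK = 1, TLBSTATE_LAZY = 2 };

      struct task { void *mm; };

      /* Old local-flush condition: leave the mm whenever the task has no mm. */
      static bool local_should_leave_old(const struct task *current_task)
      {
          return current_task->mm == NULL;
      }

      /* New condition, matching the remote path: key off the lazy-TLB state. */
      static bool should_leave_new(int cpu_tlbstate_state)
      {
          return cpu_tlbstate_state == TLBSTATE_LAZY;
      }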
    • x86/mm: Pass flush_tlb_info to flush_tlb_others() etc · a2055abe
      Andy Lutomirski authored
      Rather than passing all the contents of flush_tlb_info to
      flush_tlb_others(), pass a pointer to the structure directly. For
      consistency, this also removes the unnecessary cpu parameter from
      uv_flush_tlb_others() to make its signature match the other
      *flush_tlb_others() functions.
      
      This serves two purposes:
      
       - It will dramatically simplify future patches that change struct
         flush_tlb_info, which I'm planning to do.
      
       - struct flush_tlb_info is an adequate description of what to do
         for a local flush, too, so by reusing it we can remove duplicated
         code between local and remote flushes in a future patch.
      Signed-off-by: Andy Lutomirski <luto@kernel.org>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Borislav Petkov <bpetkov@suse.de>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-mm@kvack.org
      [ Fix build warning. ]
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      a2055abe
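
      The shape of the interface change looks roughly like this; the field list and prototypes are simplified assumptions for illustration, not the exact kernel signatures.

      struct mm_struct;
      struct cpumask;

      struct flush_tlb_info {
          struct mm_struct *mm;
          unsigned long     start;
          unsigned long     end;
      };

      /* Before: every field travelled as a separate parameter. */
      void flush_tlb_others_old(const struct cpumask *cpumask, struct mm_struct *mm,
                                unsigned long start, unsigned long end);

      /* After: callers fill in one flush_tlb_info and pass a pointer, so adding a
       * field later only touches producers and consumers of the struct, and the
       * same structure can describe a local flush as well. */
      void flush_tlb_others_new(const struct cpumask *cpumask,
                                const struct flush_tlb_info *info);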
  2. 03 June 2017, 1 commit
  3. 02 June 2017, 1 commit
    • ARM64/ACPI: Fix BAD_MADT_GICC_ENTRY() macro implementation · cb7cf772
      Lorenzo Pieralisi authored
      The BAD_MADT_GICC_ENTRY() macro checks whether a GICC MADT entry passes
      muster from an ACPI specification standpoint. The current macro derives
      the expected MADT GICC entry length from the ACPI firmware version (it
      changed from 76 to 80 bytes in the transition from ACPI 5.1 to ACPI 6.0),
      but it erroneously always uses the length of the latest ACPICA struct
      (struct acpi_madt_generic_interrupt, which is 80 bytes long) to check
      whether the current GICC entry runs past the end of the MADT as defined
      by the MADT table header itself. This may result in false positives
      depending on the ACPI firmware version and on how the MADT entries are
      laid out in memory: on ACPI 5.1 firmware, MADT GICC entries are 76 bytes
      long, so adding 80 to a GICC entry's start address may well land past the
      actual MADT end and flag a valid entry as bad.
      
      Fix the BAD_MADT_GICC_ENTRY() macro by reshuffling the condition checks
      and update them to always use the firmware version specific MADT GICC
      entry length in order to carry out boundary checks.
      
      Fixes: b6cfb277 ("ACPI / ARM64: add BAD_MADT_GICC_ENTRY() macro")
      Reported-by: Julien Grall <julien.grall@arm.com>
      Acked-by: Will Deacon <will.deacon@arm.com>
      Acked-by: Marc Zyngier <marc.zyngier@arm.com>
      Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
      Cc: Julien Grall <julien.grall@arm.com>
      Cc: Hanjun Guo <hanjun.guo@linaro.org>
      Cc: Al Stone <ahs3@redhat.com>
      Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
      cb7cf772
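
      A user-space model of the corrected boundary check; the two lengths come from the commit text, while the names, types, and version test are simplified assumptions rather than the real ACPI code.

      #include <stdbool.h>
      #include <stdint.h>

      #define GICC_LEN_ACPI_5_1  76   /* GICC entry size up to ACPI 5.1 */
      #define GICC_LEN_ACPI_6_0  80   /* GICC entry size from ACPI 6.0 on */

      struct madt_gicc_entry { uint8_t type; uint8_t length; /* ... */ };

      /* Returns true when the entry is bad: wrong length for this firmware
       * revision, or its (version-specific) length runs past the MADT end. */
      static bool bad_madt_gicc_entry(const struct madt_gicc_entry *entry,
                                      unsigned int acpi_major, uintptr_t madt_end)
      {
          unsigned int len = (acpi_major >= 6) ? GICC_LEN_ACPI_6_0
                                               : GICC_LEN_ACPI_5_1;

          if (!entry || entry->length != len)
              return true;

          /* Using len here, rather than the latest 80-byte struct size, keeps a
           * valid 76-byte entry at the very end of an ACPI 5.1 MADT from being
           * flagged as bad. */
          return (uintptr_t)entry + len > madt_end;
      }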
  4. 01 June 2017, 3 commits
    • Revert "x86/PAT: Fix Xorg regression on CPUs that don't support PAT" · c08d5174
      Ingo Molnar authored
      This reverts commit cbed27cd.
      
      As Andy Lutomirski observed:
      
       "I think this patch is bogus. pat_enabled() sure looks like it's
        supposed to return true if PAT is *enabled*, and these days PAT is
        'enabled' even if there's no HW PAT support."
      Reported-by: Bernhard Held <berny156@gmx.de>
      Reported-by: Chris Wilson <chris@chris-wilson.co.uk>
      Acked-by: Andy Lutomirski <luto@kernel.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Luis R. Rodriguez <mcgrof@suse.com>
      Cc: Mikulas Patocka <mpatocka@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Toshi Kani <toshi.kani@hp.com>
      Cc: stable@vger.kernel.org # v4.2+
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      c08d5174
    • KVM: x86: Fix nmi injection failure when vcpu got blocked · 47a66eed
      ZhuangYanying authored
      When a spin_lock_irqsave() deadlock occurs inside the guest, the vCPU
      threads other than the lock holder enter the S state because of
      pvspinlock. If an NMI is then injected via the libvirt "inject-nmi" API,
      the NMI never reaches the VM.

      The reason is:
      1. QEMU's KVM_NMI ioctl sets nmi_queued to 1 and, in
         do_inject_external_nmi(), also sets cpu->kvm_vcpu_dirty to true.
      2. Because cpu->kvm_vcpu_dirty is true, process_nmi() resets nmi_queued
         to 0 before entering the guest.

      Checking nmi_queued alone is therefore not enough to decide whether to
      stay in vcpu_block(); an NMI should be injected immediately in any
      situation. Also check nmi_pending, and test KVM_REQ_NMI instead of
      nmi_queued in kvm_vcpu_has_events().

      Do the same for SMIs.
      Signed-off-by: Zhuang Yanying <ann.zhuangyanying@huawei.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      47a66eed
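
      A simplified sketch of the resulting wake-up condition (stand-in fields, not the actual KVM structures).

      #include <stdbool.h>

      struct vcpu_model {
          unsigned int nmi_pending;   /* NMIs already moved over by process_nmi() */
          bool         req_nmi;       /* models a pending KVM_REQ_NMI request */
          unsigned int smi_pending;
          bool         req_smi;       /* models a pending KVM_REQ_SMI request */
      };

      /* A blocked vCPU must wake up both for NMIs/SMIs still queued as requests
       * and for ones that have already been transferred to *_pending. */
      static bool vcpu_has_nmi_or_smi_event(const struct vcpu_model *v)
      {
          return v->nmi_pending || v->req_nmi || v->smi_pending || v->req_smi;
      }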
    • KVM: SVM: do not zero out segment attributes if segment is unusable or not present · d9c1b543
      Roman Pen authored
      This is a fix for the problem [1], where VMCB.CPL was set to 0 and an
      interrupt was taken on the userspace stack. The root cause lies in the
      specific AMD CPU behaviour which manifests itself as unusable segment
      attributes on SYSRET. The corresponding workaround for the kernel is the
      following:
      
      61f01dd9 ("x86_64, asm: Work around AMD SYSRET SS descriptor attribute issue")
      
      In turn, the virtualization side treated the unusable segment incorrectly
      and restored CPL from the SS attributes, which had been zeroed out a few
      lines above.

      This patch only ensures that the P bit is cleared in the VMCB.save state
      and that segment attributes are not zeroed out if the segment is not
      present or is unusable, so that CPL can be safely restored from the DPL
      field.

      This is only one part of the fix, since the QEMU side should be fixed
      accordingly not to zero out the attributes on its side. A corresponding
      patch will follow.
      
      [1] Message id: CAJrWOzD6Xq==b-zYCDdFLgSRMPM-NkNuTSDFEtX=7MreT45i7Q@mail.gmail.com
      Signed-off-by: Roman Pen <roman.penyaev@profitbricks.com>
      Signed-off-by: Mikhail Sennikovskii <mikhail.sennikovskii@profitbricks.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: kvm@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      d9c1b543
  5. 30 May 2017, 3 commits
    • KVM: SVM: ignore type when setting segment registers · 8eae9570
      Gioh Kim authored
      Commit 19bca6ab ("KVM: SVM: Fix cross vendor migration issue with
      unusable bit") added a type check when setting the unusable flag, so
      unusable is set if present is 0 OR type is 0. According to the AMD
      processor manual, long mode ignores the type value in the segment
      descriptor, and type can legitimately be 0 for a read-only data segment.
      The type value is therefore unrelated to the unusable flag.
      
      This patch is based on linux-next v4.12.0-rc3.
      Signed-off-by: Gioh Kim <gi-oh.kim@profitbricks.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      8eae9570
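
      In essence the change reduces to this (simplified illustration, not the SVM code).

      #include <stdbool.h>

      struct seg_desc { bool present; unsigned int type; };

      static bool segment_unusable(const struct seg_desc *seg)
      {
          /* Old behaviour: "!present || type == 0" could mark a valid read-only
           * data segment (type == 0) as unusable.  New behaviour: only the
           * present bit decides. */
          return !seg->present;
      }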
    • KVM: nVMX: fix nested_vmx_check_vmptr failure paths under debugging · cbf71279
      Radim Krčmář authored
      kvm_skip_emulated_instruction() will return 0 if userspace is
      single-stepping the guest.
      
      kvm_skip_emulated_instruction() uses return status convention of exit
      handler: 0 means "exit to userspace" and 1 means "continue vm entries".
      The problem is that nested_vmx_check_vmptr() return status means
      something else: 0 is ok, 1 is error.
      
      This means we would continue executing after a failure.  Static checker
      noticed it because vmptr was not initialized.
      Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
      Fixes: 6affcbed ("KVM: x86: Add kvm_skip_emulated_instruction and use it.")
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      cbf71279
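
      The clash between the two return conventions can be illustrated like this (stand-in functions, not the KVM code).

      /* Exit-handler convention: 0 = exit to userspace, 1 = continue VM entries. */
      static int skip_emulated_instruction(int single_stepping)
      {
          return single_stepping ? 0 : 1;
      }

      /* Pre-fix helper convention: 0 = ok, 1 = error.  Passing the exit-handler
       * status straight through means "exit to userspace" (0) is read by the
       * caller as "ok", so execution continues after what was really a failure. */
      static int check_vmptr_old(int single_stepping)
      {
          /* ... vmptr validation elided ... */
          return skip_emulated_instruction(single_stepping);
      }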
    • kthread: fix boot hang (regression) on MIPS/OpenRISC · b0f5a8f3
      Vegard Nossum authored
      This fixes a regression in commit 4d6501dc where I didn't notice
      that MIPS and OpenRISC were reinitialising p->{set,clear}_child_tid to
      NULL after our initialisation in copy_process().
      
      We can simply get rid of the arch-specific initialisation here since it
      is now always done in copy_process() before hitting copy_thread{,_tls}().
      
      Review notes:
      
       - As far as I can tell, copy_process() is the only user of
         copy_thread_tls(), which is the only caller of copy_thread() for
         architectures that don't implement copy_thread_tls().
      
       - After this patch, there is no arch-specific code touching
         p->set_child_tid or p->clear_child_tid whatsoever.
      
       - It may look like MIPS/OpenRISC wanted to always have these fields be
         NULL, but that's not true, as copy_process() would unconditionally
         set them again _after_ calling copy_thread_tls() before commit
         4d6501dc.
      
      Fixes: 4d6501dc ("kthread: Fix use-after-free if kthread fork fails")
      Reported-by: Guenter Roeck <linux@roeck-us.net>
      Tested-by: Guenter Roeck <linux@roeck-us.net> # MIPS only
      Acked-by: Stafford Horne <shorne@gmail.com>
      Acked-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: linux-mips@linux-mips.org
      Cc: Jonas Bonn <jonas@southpole.se>
      Cc: Stefan Kristiansson <stefan.kristiansson@saunalahti.fi>
      Cc: openrisc@lists.librecores.org
      Cc: Jamie Iles <jamie.iles@oracle.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b0f5a8f3
  6. 29 May 2017, 2 commits
  7. 28 May 2017, 3 commits
    • x86/efi: Correct EFI identity mapping under 'efi=old_map' when KASLR is enabled · 94133e46
      Baoquan He authored
      For EFI with the 'efi=old_map' kernel option specified, the kernel will panic
      when KASLR is enabled:
      
        BUG: unable to handle kernel paging request at 000000007febd57e
        IP: 0x7febd57e
        PGD 1025a067
        PUD 0
      
        Oops: 0010 [#1] SMP
        Call Trace:
         efi_enter_virtual_mode()
         start_kernel()
         x86_64_start_reservations()
         x86_64_start_kernel()
         start_cpu()
      
      The root cause is that the identity mapping is not built correctly
      in the 'efi=old_map' case.
      
      On 'nokaslr' kernels, PAGE_OFFSET is 0xffff880000000000 which is PGDIR_SIZE
      aligned. We can borrow the PUD table from the direct mappings safely. Given a
      physical address X, we have pud_index(X) == pud_index(__va(X)).
      
      However, on KASLR kernels, PAGE_OFFSET is PUD_SIZE aligned. For a given physical
      address X, pud_index(X) != pud_index(__va(X)). We can't just copy the PGD entry
      from direct mapping to build identity mapping, instead we need to copy the
      PUD entries one by one from the direct mapping.
      
      Fix it.
      Signed-off-by: Baoquan He <bhe@redhat.com>
      Signed-off-by: Matt Fleming <matt@codeblueprint.co.uk>
      Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Cc: Bhupesh Sharma <bhsharma@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Dave Young <dyoung@redhat.com>
      Cc: Frank Ramsay <frank.ramsay@hpe.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Russ Anderson <rja@sgi.com>
      Cc: Thomas Garnier <thgarnie@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-efi@vger.kernel.org
      Link: http://lkml.kernel.org/r/20170526113652.21339-5-matt@codeblueprint.co.uk
      [ Fixed and reworded the changelog and code comments to be more readable. ]
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      94133e46
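
      The core of the reasoning as a standalone sketch; macros and names are simplified stand-ins (a single PUD table per side), not the real EFI old_map page-table code.

      #include <stdint.h>

      #define PUD_SHIFT     30
      #define PUD_SIZE      (1ULL << PUD_SHIFT)
      #define PTRS_PER_PUD  512

      typedef uint64_t pudval_t;

      static unsigned int pud_index(uint64_t addr)
      {
          return (addr >> PUD_SHIFT) & (PTRS_PER_PUD - 1);
      }

      /* With KASLR, PAGE_OFFSET is only PUD-aligned, so pud_index(X) and
       * pud_index(__va(X)) differ and whole PGD entries cannot simply be shared
       * with the direct map.  Instead, copy the PUD entries one by one, from
       * the direct-map index to the identity-map index. */
      static void build_ident_puds(pudval_t *ident_pud, const pudval_t *direct_pud,
                                   uint64_t phys, uint64_t page_offset, unsigned int n)
      {
          for (unsigned int i = 0; i < n; i++) {
              uint64_t pa = phys + (uint64_t)i * PUD_SIZE;

              ident_pud[pud_index(pa)] = direct_pud[pud_index(pa + page_offset)];
          }
      }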
    • x86/efi: Disable runtime services on kexec kernel if booted with efi=old_map · 4e52797d
      Sai Praneeth authored
      Booting a kexec kernel with "efi=old_map" on the kernel command line hits
      the kernel panic shown below.
      
       BUG: unable to handle kernel paging request at ffff88007fe78070
       IP: virt_efi_set_variable.part.7+0x63/0x1b0
       PGD 7ea28067
       PUD 7ea2b067
       PMD 7ea2d067
       PTE 0
       [...]
       Call Trace:
        virt_efi_set_variable()
        efi_delete_dummy_variable()
        efi_enter_virtual_mode()
        start_kernel()
        x86_64_start_reservations()
        x86_64_start_kernel()
        start_cpu()
      
      [ efi=old_map was never intended to work with kexec. The problem with
        using efi=old_map is that the virtual addresses are assigned from the
        memory region used by other kernel mappings; vmalloc() space.
        Potentially there could be collisions when booting kexec if something
        else is mapped at the virtual address we allocated for runtime service
        regions in the initial boot - Matt Fleming ]
      
      Since kexec was never intended to work with efi=old_map, disable
      runtime services in kexec if booted with efi=old_map, so that we don't
      panic.
      Tested-by: Lee Chun-Yi <jlee@suse.com>
      Signed-off-by: Sai Praneeth Prakhya <sai.praneeth.prakhya@intel.com>
      Signed-off-by: Matt Fleming <matt@codeblueprint.co.uk>
      Acked-by: Dave Young <dyoung@redhat.com>
      Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ravi Shankar <ravi.v.shankar@intel.com>
      Cc: Ricardo Neri <ricardo.neri@intel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-efi@vger.kernel.org
      Link: http://lkml.kernel.org/r/20170526113652.21339-4-matt@codeblueprint.co.uk
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      4e52797d
    • efi: Don't issue error message when booted under Xen · 1ea34adb
      Juergen Gross authored
      When booted as Xen dom0 there won't be an EFI memmap allocated. Avoid
      issuing an error message in this case:
      
        [    0.144079] efi: Failed to allocate new EFI memmap
      Signed-off-by: Juergen Gross <jgross@suse.com>
      Signed-off-by: Matt Fleming <matt@codeblueprint.co.uk>
      Cc: <stable@vger.kernel.org> # v4.9+
      Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-efi@vger.kernel.org
      Link: http://lkml.kernel.org/r/20170526113652.21339-2-matt@codeblueprint.co.uk
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      1ea34adb
  8. 27 May 2017, 4 commits
    • x86/ftrace: Make sure that ftrace trampolines are not RWX · 6ee98ffe
      Thomas Gleixner authored
      ftrace uses module_alloc() to allocate trampoline pages. The
      module_alloc() mapping is RWX, which makes sense as the memory is written
      to right after allocation, but nothing makes these pages RO once they
      have been written.
      
      Add proper set_memory_rw/ro() calls to protect the trampolines after
      modification.
      
      Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1705251056410.1862@nanos
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      6ee98ffe
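
      A user-space analogue of the pattern, with mprotect() standing in for the kernel's set_memory_rw()/set_memory_ro() helpers: open a write window only around the update, then seal the pages again.

      #include <string.h>
      #include <sys/mman.h>

      static int update_trampoline(void *page, size_t len,
                                   const void *code, size_t code_len)
      {
          if (mprotect(page, len, PROT_READ | PROT_WRITE))   /* writable, not executable */
              return -1;

          memcpy(page, code, code_len);                      /* patch the trampoline */

          return mprotect(page, len, PROT_READ | PROT_EXEC); /* sealed: RX, never RWX */
      }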
    • x86/mm/ftrace: Do not bug in early boot on irqs_disabled in cpu_flush_range() · a53276e2
      Steven Rostedt (VMware) authored
      With function tracing starting in early bootup and its trampoline pages
      being made read-only, the following bug triggered:
      
      kernel BUG at arch/x86/mm/pageattr.c:189!
      invalid opcode: 0000 [#1] SMP
      Modules linked in:
      CPU: 0 PID: 0 Comm: swapper Not tainted 4.12.0-rc2-test+ #3
      Hardware name: MSI MS-7823/CSM-H87M-G43 (MS-7823), BIOS V1.6 02/22/2014
      task: ffffffffb4222500 task.stack: ffffffffb4200000
      RIP: 0010:change_page_attr_set_clr+0x269/0x302
      RSP: 0000:ffffffffb4203c88 EFLAGS: 00010046
      RAX: 0000000000000046 RBX: 0000000000000000 RCX: 00000001b6000000
      RDX: ffffffffb4203d40 RSI: 0000000000000000 RDI: ffffffffb4240d60
      RBP: ffffffffb4203d18 R08: 00000001b6000000 R09: 0000000000000001
      R10: ffffffffb4203aa8 R11: 0000000000000003 R12: ffffffffc029b000
      R13: ffffffffb4203d40 R14: 0000000000000001 R15: 0000000000000000
      FS:  0000000000000000(0000) GS:ffff9a639ea00000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: ffff9a636b384000 CR3: 00000001ea21d000 CR4: 00000000000406b0
      Call Trace:
       change_page_attr_clear+0x1f/0x21
       set_memory_ro+0x1e/0x20
       arch_ftrace_update_trampoline+0x207/0x21c
       ? ftrace_caller+0x64/0x64
       ? 0xffffffffc029b000
       ftrace_startup+0xf4/0x198
       register_ftrace_function+0x26/0x3c
       function_trace_init+0x5e/0x73
       tracer_init+0x1e/0x23
       tracing_set_tracer+0x127/0x15a
       register_tracer+0x19b/0x1bc
       init_function_trace+0x90/0x92
       early_trace_init+0x236/0x2b3
       start_kernel+0x200/0x3f5
       x86_64_start_reservations+0x29/0x2b
       x86_64_start_kernel+0x17c/0x18f
       secondary_startup_64+0x9f/0x9f
       ? secondary_startup_64+0x9f/0x9f
      
      Interrupts should not be enabled this early in the boot process. It is
      also fine to leave interrupts disabled during this time: there is only
      one CPU running, so on_each_cpu() will only run on the current CPU anyway.

      If early_boot_irqs_disabled is set, it is safe to run cpu_flush_range() with
      interrupts disabled. Don't trigger a BUG_ON() in that case.
      
      Link: http://lkml.kernel.org/r/20170526093717.0be3b849@gandalf.local.home
      Suggested-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      a53276e2
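
      The relaxed assertion boils down to something like this (simplified; the real check sits in the cpa flush path).

      #include <stdbool.h>

      /* Only treat "interrupts disabled" as a bug once boot has passed the point
       * where interrupts are expected to be on; with a single CPU running,
       * on_each_cpu() only touches the local CPU anyway. */
      static bool flush_context_is_buggy(bool irqs_disabled, bool early_boot)
      {
          return irqs_disabled && !early_boot;
      }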
    • kprobes/x86: Fix to set RWX bits correctly before releasing trampoline · c93f5cf5
      Masami Hiramatsu authored
      Fix kprobes to set (recover) the RWX bits correctly on the trampoline
      buffer before releasing it. Releasing a read-only page to
      module_memfree() crashes the kernel.

      Without this fix, if a kprobes user registers a bunch of kprobes in
      function bodies (since kprobes on function entry usually use ftrace) and
      then unregisters them, the kernel hits a BUG and crashes.
      
      Link: http://lkml.kernel.org/r/149570868652.3518.14120169373590420503.stgit@devbox
      Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
      Fixes: d0381c81 ("kprobes/x86: Set kprobes pages read-only")
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      c93f5cf5
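
      A user-space analogue of the fix, with mprotect() standing in for the kernel's set_memory_*() helpers and a pool free standing in for module_memfree(): make the buffer ordinary writable memory again before the allocator gets it back.

      #include <stddef.h>
      #include <sys/mman.h>

      static void release_exec_buffer(void *buf, size_t len,
                                      void (*pool_free)(void *))
      {
          /* The buffer was sealed read-only + executable while live; the
           * allocator will write into it (free-list links, poisoning), so
           * restore ordinary RW permissions first. */
          mprotect(buf, len, PROT_READ | PROT_WRITE);
          pool_free(buf);
      }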
    • KVM: x86: Fix virtual wire mode · 52b54190
      Jan H. Schönherr authored
      The Intel SDM says that at most one LAPIC should be configured for ExtINT
      delivery. KVM configures all LAPICs this way. This causes pic_unlock()
      to kick the first available vCPU from the internal KVM data structures.
      If this vCPU is not the BSP, but some not-yet-booted AP, the BSP may
      never realize that there is an interrupt.
      
      Fix that by enabling ExtINT delivery only for the BSP.
      
      This allows booting a Linux guest without a TSC in the above situation.
      Otherwise the BSP gets stuck in calibrate_delay_converge().
      Signed-off-by: Jan H. Schönherr <jschoenh@amazon.de>
      Reviewed-by: Wanpeng Li <wanpeng.li@hotmail.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      52b54190
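
      The rule the fix enforces, as a one-line predicate (illustrative stand-ins, not KVM's lapic code).

      #include <stdbool.h>

      struct vcpu_info { bool is_bsp; };

      /* Per the SDM, at most one local APIC should accept ExtINT delivery; make
       * it the BSP, which is guaranteed to be running during early guest boot. */
      static bool lapic_accepts_extint(const struct vcpu_info *vcpu)
      {
          return vcpu->is_bsp;
      }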
  9. 26 5月, 2017 3 次提交
    • KVM: nVMX: Fix handling of lmsw instruction · e1d39b17
      Jan H. Schönherr authored
      The decision whether or not to exit from L2 to L1 on an lmsw instruction is
      based on bogus values: instead of using the information encoded within the
      exit qualification, it uses the data also used for the mov-to-cr
      instruction, which boils down to using whatever is in %eax at that point.
      
      Use the correct values instead.
      
      Without this fix, an L1 may not get notified when a 32-bit Linux L2
      switches its secondary CPUs to protected mode; the L1 is only notified on
      the next modification of CR0. This short time window poses a problem, when
      there is some other reason to exit to L1 in between. Then, L2 will be
      resumed in real mode and chaos ensues.
      Signed-off-by: Jan H. Schönherr <jschoenh@amazon.de>
      Reviewed-by: Wanpeng Li <wanpeng.li@hotmail.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      e1d39b17
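
      For reference, the lmsw operand is carried in the exit qualification itself rather than in a guest register; a hedged sketch of extracting it, based on the SDM's exit-qualification layout for control-register accesses (LMSW source data in bits 31:16).

      #include <stdint.h>

      /* For a CR-access VM exit caused by lmsw, the 16-bit source operand lives
       * in bits 31:16 of the exit qualification, not in the GPR field used by
       * mov-to-CR exits. */
      static uint16_t lmsw_source_data(uint64_t exit_qualification)
      {
          return (uint16_t)(exit_qualification >> 16);
      }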
    • KVM: X86: Fix preempt the preemption timer cancel · 5acc1ca4
      Wanpeng Li authored
      Preemption can occur while the preemption timer is being cancelled,
      leaving inconsistent state in the lapic, vmx, and vmcs fields.
      
                CPU0                    CPU1
      
        preemption timer vmexit
        handle_preemption_timer(vCPU0)
          kvm_lapic_expired_hv_timer
            vmx_cancel_hv_timer
              vmx->hv_deadline_tsc = -1
              vmcs_clear_bits
              /* hv_timer_in_use still true */
        sched_out
                                 sched_in
                                 kvm_arch_vcpu_load
                                   vmx_set_hv_timer
                                     write vmx->hv_deadline_tsc
                                     vmcs_set_bits
                                 /* back in kvm_lapic_expired_hv_timer */
                                 hv_timer_in_use = false
                                 ...
                                 vmx_vcpu_run
                                   vmx_arm_hv_run
                                     write preemption timer deadline
                                   spurious preemption timer vmexit
                                     handle_preemption_timer(vCPU0)
                                       kvm_lapic_expired_hv_timer
                                         WARN_ON(!apic->lapic_timer.hv_timer_in_use);
      
      This can be reproduced sporadically during boot of L2 on a
      preemptible L1, causing a splat on L1.
      
       WARNING: CPU: 3 PID: 1952 at arch/x86/kvm/lapic.c:1529 kvm_lapic_expired_hv_timer+0xb5/0xd0 [kvm]
       CPU: 3 PID: 1952 Comm: qemu-system-x86 Not tainted 4.12.0-rc1+ #24 RIP: 0010:kvm_lapic_expired_hv_timer+0xb5/0xd0 [kvm]
        Call Trace:
        handle_preemption_timer+0xe/0x20 [kvm_intel]
        vmx_handle_exit+0xc9/0x15f0 [kvm_intel]
        ? lock_acquire+0xdb/0x250
        ? lock_acquire+0xdb/0x250
        ? kvm_arch_vcpu_ioctl_run+0xdf3/0x1ce0 [kvm]
        kvm_arch_vcpu_ioctl_run+0xe55/0x1ce0 [kvm]
        kvm_vcpu_ioctl+0x384/0x7b0 [kvm]
        ? kvm_vcpu_ioctl+0x384/0x7b0 [kvm]
        ? __fget+0xf3/0x210
        do_vfs_ioctl+0xa4/0x700
        ? __fget+0x114/0x210
        SyS_ioctl+0x79/0x90
        do_syscall_64+0x8f/0x750
        ? trace_hardirqs_on_thunk+0x1a/0x1c
        entry_SYSCALL64_slow_path+0x25/0x25
      
      This patch fixes it by disabling preemption while cancelling the
      preemption timer. That way cancel_hv_timer is atomic with respect to
      kvm_arch_vcpu_load.
      
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      5acc1ca4
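
      The shape of the fix, as a self-contained sketch: in the kernel, preempt_disable()/preempt_enable() are the real primitives and the rest is the VMX hv-timer code; here everything is a stand-in so the snippet compiles on its own.

      /* Stand-ins so the sketch is self-contained; not the real kernel symbols. */
      static void preempt_disable(void) { }
      static void preempt_enable(void)  { }

      struct hv_timer_state { int hv_timer_in_use; };
      static void cancel_hv_timer_hw(struct hv_timer_state *t) { (void)t; }

      /* Clear the hardware deadline and hv_timer_in_use in one non-preemptible
       * section, so a vCPU load on another CPU cannot re-arm the timer in
       * between and later trigger the spurious-exit WARN shown above. */
      static void expire_hv_timer(struct hv_timer_state *t)
      {
          preempt_disable();
          cancel_hv_timer_hw(t);
          t->hv_timer_in_use = 0;
          preempt_enable();
      }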
    • x86/timers: Move simple_udelay_calibration past init_hypervisor_platform · 702644ec
      Jan Kiszka authored
      This ensures that adjustments to x86_platform made by the hypervisor
      setup are already respected by this simple calibration.

      The current user of this, introduced by 1b5aeebf ("x86/earlyprintk:
      Add support for earlyprintk via USB3 debug port"), comes into play much
      later.
      
      Fixes: dd759d93 ("x86/timers: Add simple udelay calibration")
      Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Lu Baolu <baolu.lu@linux.intel.com>
      Link: http://lkml.kernel.org/r/5e89fe60-aab3-2c1c-aba8-32f8ad376189@siemens.com
      702644ec
  10. 25 May 2017, 5 commits
  11. 24 May 2017, 9 commits