1. 09 9月, 2016 5 次提交
    • S
      KVM: PPC: Book3S HV: Convert kvmppc_read_intr to a C function · 37f55d30
      Suresh Warrier 提交于
      Modify kvmppc_read_intr to make it a C function.  Because it is called
      from kvmppc_check_wake_reason, any of the assembler code that calls
      either kvmppc_read_intr or kvmppc_check_wake_reason now has to assume
      that the volatile registers might have been modified.
      
      This also adds in the optimization of clearing saved_xirr in the case
      where we completely handle and EOI an IPI.  Without this, the next
      device interrupt will require two trips through the host interrupt
      handling code.
      
      [paulus@ozlabs.org - made kvmppc_check_wake_reason create a stack frame
       when it is calling kvmppc_read_intr, which means we can set r12 to
       the trap number (0x500) after the call to kvmppc_read_intr, instead
       of using r31.  Also moved the deliver_guest_interrupt label so as to
       restore XER and CTR, plus other minor tweaks.]
      Signed-off-by: NSuresh Warrier <warrier@linux.vnet.ibm.com>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      37f55d30
    • P
      powerpc: move hmi.c to arch/powerpc/kvm/ · 3f257774
      Paolo Bonzini 提交于
      hmi.c functions are unused unless sibling_subcore_state is nonzero, and
      that in turn happens only if KVM is in use.  So move the code to
      arch/powerpc/kvm/, putting it under CONFIG_KVM_BOOK3S_HV_POSSIBLE
      rather than CONFIG_PPC_BOOK3S_64.  The sibling_subcore_state is also
      included in struct paca_struct only if KVM is supported by the kernel.
      
      Cc: Daniel Axtens <dja@axtens.net>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: linuxppc-dev@lists.ozlabs.org
      Cc: kvm-ppc@vger.kernel.org
      Cc: kvm@vger.kernel.org
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      3f257774
    • S
      powerpc/powernv: Provide facilities for EOI, usable from real mode · 4ee11c1a
      Suresh Warrier 提交于
      This adds a new function pnv_opal_pci_msi_eoi() which does the part of
      end-of-interrupt (EOI) handling of an MSI which involves doing an
      OPAL call.  This function can be called in real mode.  This doesn't
      just export pnv_ioda2_msi_eoi() because that does a call to
      icp_native_eoi(), which does not work in real mode.
      
      This also adds a function, is_pnv_opal_msi(), which KVM can call to
      check whether an interrupt is one for which we should be calling
      pnv_opal_pci_msi_eoi() when we need to do an EOI.
      
      [paulus@ozlabs.org - split out the addition of pnv_opal_pci_msi_eoi()
       from Suresh's patch "KVM: PPC: Book3S HV: Handle passthrough
       interrupts in guest"; added is_pnv_opal_msi(); wrote description.]
      Signed-off-by: NSuresh Warrier <warrier@linux.vnet.ibm.com>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      4ee11c1a
    • S
      powerpc: Add simple cache inhibited MMIO accessors · 07b1fdf5
      Suresh Warrier 提交于
      Add simple cache inhibited accessors for memory mapped I/O.
      Unlike the accessors built from the DEF_MMIO_* macros, these
      don't include any hardware memory barriers, callers need to
      manage memory barriers on their own. These can only be called
      in hypervisor real mode.
      Signed-off-by: NSuresh Warrier <warrier@linux.vnet.ibm.com>
      [paulus@ozlabs.org - added line to comment]
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      07b1fdf5
    • P
      powerpc/mm: Speed up computation of base and actual page size for a HPTE · 0eeede0c
      Paul Mackerras 提交于
      This replaces a 2-D search through an array with a simple 8-bit table
      lookup for determining the actual and/or base page size for a HPT entry.
      
      The encoding in the second doubleword of the HPTE is designed to encode
      the actual and base page sizes without using any more bits than would be
      needed for a 4k page number, by using between 1 and 8 low-order bits of
      the RPN (real page number) field to encode the page sizes.  A single
      "large page" bit in the first doubleword indicates that these low-order
      bits are to be interpreted like this.
      
      We can determine the page sizes by using the low-order 8 bits of the RPN
      to look up a 256-entry table.  For actual page sizes less than 1MB, some
      of the upper bits of these 8 bits are going to be real address bits, but
      we can cope with that by replicating the entries for those smaller page
      sizes.
      
      While we're at it, let's move the hpte_page_size() and hpte_base_page_size()
      functions from a KVM-specific header to a header for 64-bit HPT systems,
      since this computation doesn't have anything specifically to do with KVM.
      Reviewed-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      0eeede0c
  2. 08 9月, 2016 5 次提交
    • S
      KVM: PPC: Implement existing and add new halt polling vcpu stats · 2a27f514
      Suraj Jitindar Singh 提交于
      vcpu stats are used to collect information about a vcpu which can be viewed
      in the debugfs. For example halt_attempted_poll and halt_successful_poll
      are used to keep track of the number of times the vcpu attempts to and
      successfully polls. These stats are currently not used on powerpc.
      
      Implement incrementation of the halt_attempted_poll and
      halt_successful_poll vcpu stats for powerpc. Since these stats are summed
      over all the vcpus for all running guests it doesn't matter which vcpu
      they are attributed to, thus we choose the current runner vcpu of the
      vcore.
      
      Also add new vcpu stats: halt_poll_success_ns, halt_poll_fail_ns and
      halt_wait_ns to be used to accumulate the total time spend polling
      successfully, polling unsuccessfully and waiting respectively, and
      halt_successful_wait to accumulate the number of times the vcpu waits.
      Given that halt_poll_success_ns, halt_poll_fail_ns and halt_wait_ns are
      expressed in nanoseconds it is necessary to represent these as 64-bit
      quantities, otherwise they would overflow after only about 4 seconds.
      
      Given that the total time spend either polling or waiting will be known and
      the number of times that each was done, it will be possible to determine
      the average poll and wait times. This will give the ability to tune the kvm
      module parameters based on the calculated average wait and poll times.
      Signed-off-by: NSuraj Jitindar Singh <sjitindarsingh@gmail.com>
      Reviewed-by: NDavid Matlack <dmatlack@google.com>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      2a27f514
    • S
      KVM: Add provisioning for ulong vm stats and u64 vcpu stats · 8a7e75d4
      Suraj Jitindar Singh 提交于
      vms and vcpus have statistics associated with them which can be viewed
      within the debugfs. Currently it is assumed within the vcpu_stat_get() and
      vm_stat_get() functions that all of these statistics are represented as
      u32s, however the next patch adds some u64 vcpu statistics.
      
      Change all vcpu statistics to u64 and modify vcpu_stat_get() accordingly.
      Since vcpu statistics are per vcpu, they will only be updated by a single
      vcpu at a time so this shouldn't present a problem on 32-bit machines
      which can't atomically increment 64-bit numbers. However vm statistics
      could potentially be updated by multiple vcpus from that vm at a time.
      To avoid the overhead of atomics make all vm statistics ulong such that
      they are 64-bit on 64-bit systems where they can be atomically incremented
      and are 32-bit on 32-bit systems which may not be able to atomically
      increment 64-bit numbers. Modify vm_stat_get() to expect ulongs.
      Signed-off-by: NSuraj Jitindar Singh <sjitindarsingh@gmail.com>
      Reviewed-by: NDavid Matlack <dmatlack@google.com>
      Acked-by: NChristian Borntraeger <borntraeger@de.ibm.com>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      8a7e75d4
    • S
      KVM: PPC: Book3S HV: Implement halt polling · 0cda69dd
      Suraj Jitindar Singh 提交于
      This patch introduces new halt polling functionality into the kvm_hv kernel
      module. When a vcore is idle it will poll for some period of time before
      scheduling itself out.
      
      When all of the runnable vcpus on a vcore have ceded (and thus the vcore is
      idle) we schedule ourselves out to allow something else to run. In the
      event that we need to wake up very quickly (for example an interrupt
      arrives), we are required to wait until we get scheduled again.
      
      Implement halt polling so that when a vcore is idle, and before scheduling
      ourselves, we poll for vcpus in the runnable_threads list which have
      pending exceptions or which leave the ceded state. If we poll successfully
      then we can get back into the guest very quickly without ever scheduling
      ourselves, otherwise we schedule ourselves out as before.
      
      There exists generic halt_polling code in virt/kvm_main.c, however on
      powerpc the polling conditions are different to the generic case. It would
      be nice if we could just implement an arch specific kvm_check_block()
      function, but there is still other arch specific things which need to be
      done for kvm_hv (for example manipulating vcore states) which means that a
      separate implementation is the best option.
      
      Testing of this patch with a TCP round robin test between two guests with
      virtio network interfaces has found a decrease in round trip time of ~15us
      on average. A performance gain is only seen when going out of and
      back into the guest often and quickly, otherwise there is no net benefit
      from the polling. The polling interval is adjusted such that when we are
      often scheduled out for long periods of time it is reduced, and when we
      often poll successfully it is increased. The rate at which the polling
      interval increases or decreases, and the maximum polling interval, can
      be set through module parameters.
      
      Based on the implementation in the generic kvm module by Wanpeng Li and
      Paolo Bonzini, and on direction from Paul Mackerras.
      Signed-off-by: NSuraj Jitindar Singh <sjitindarsingh@gmail.com>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      0cda69dd
    • S
      KVM: PPC: Book3S HV: Change vcore element runnable_threads from linked-list to array · 7b5f8272
      Suraj Jitindar Singh 提交于
      The struct kvmppc_vcore is a structure used to store various information
      about a virtual core for a kvm guest. The runnable_threads element of the
      struct provides a list of all of the currently runnable vcpus on the core
      (those in the KVMPPC_VCPU_RUNNABLE state). The previous implementation of
      this list was a linked_list. The next patch requires that the list be able
      to be iterated over without holding the vcore lock.
      
      Reimplement the runnable_threads list in the kvmppc_vcore struct as an
      array. Implement function to iterate over valid entries in the array and
      update access sites accordingly.
      Signed-off-by: NSuraj Jitindar Singh <sjitindarsingh@gmail.com>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      7b5f8272
    • S
      KVM: PPC: Book3S HV: Move struct kvmppc_vcore from kvm_host.h to kvm_book3s.h · e64fb7e2
      Suraj Jitindar Singh 提交于
      The next commit will introduce a member to the kvmppc_vcore struct which
      references MAX_SMT_THREADS which is defined in kvm_book3s_asm.h, however
      this file isn't included in kvm_host.h directly. Thus compiling for
      certain platforms such as pmac32_defconfig and ppc64e_defconfig with KVM
      fails due to MAX_SMT_THREADS not being defined.
      
      Move the struct kvmppc_vcore definition to kvm_book3s.h which explicitly
      includes kvm_book3s_asm.h.
      Signed-off-by: NSuraj Jitindar Singh <sjitindarsingh@gmail.com>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      e64fb7e2
  3. 25 8月, 2016 1 次提交
    • P
      KVM: PPC: Always select KVM_VFIO, plus Makefile cleanup · 4b3d173d
      Paul Mackerras 提交于
      As discussed recently on the kvm mailing list, David Gibson's
      intention in commit 178a7875 ("vfio: Enable VFIO device for
      powerpc", 2016-02-01) was to have the KVM VFIO device built in
      on all powerpc platforms.  This patch adds the "select KVM_VFIO"
      statement that makes this happen.
      
      Currently, arch/powerpc/kvm/Makefile doesn't include vfio.o for
      the 64-bit kvm module, because the list of objects doesn't use
      the $(common-objs-y) list.  The reason it doesn't is because we
      don't necessarily want coalesced_mmio.o or emulate.o (for example
      if HV KVM is the only target), and common-objs-y includes both.
      
      Since this is confusing, this patch adjusts the definitions so that
      we now use $(common-objs-y) in the list for the 64-bit kvm.ko
      module, emulate.o is removed from common-objs-y and added in the
      places that need it, and the inclusion of coalesced_mmio.o now
      depends on CONFIG_KVM_MMIO.
      Reviewed-by: NPaolo Bonzini <pbonzini@redhat.com>
      Reviewed-by: NDavid Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      4b3d173d
  4. 19 8月, 2016 2 次提交
    • P
      KVM: PPC: Implement kvm_arch_intc_initialized() for PPC · 34a75b0f
      Paul Mackerras 提交于
      It doesn't make sense to create irqfds for a VM that doesn't have
      in-kernel interrupt controller emulation.  There is an existing
      interface for architecture code to tell the irqfd code whether or
      not any interrupt controller has been initialized, called
      kvm_arch_intc_initialized(), so let's implement that for powerpc.
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      34a75b0f
    • P
      KVM: PPC: Book3S: Don't crash if irqfd used with no in-kernel XICS emulation · e48ba1cb
      Paul Mackerras 提交于
      It turns out that if userspace creates a pseries-type VM without
      in-kernel XICS (interrupt controller) emulation, and then connects
      an eventfd to the VM as an irqfd, and the eventfd gets signalled,
      that the code will try to deliver an interrupt via the non-existent
      XICS object and crash the host kernel with a NULL pointer dereference.
      
      To fix this, we check for the presence of the XICS object before
      trying to deliver the interrupt, and return with an error if not.
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      e48ba1cb
  5. 13 8月, 2016 7 次提交
    • G
      h8300: Add missing include file to asm/io.h · 2b05980d
      Guenter Roeck 提交于
      h8300 builds fail with
      
      arch/h8300/include/asm/io.h:9:15: error: unknown type name ‘u8’
      arch/h8300/include/asm/io.h:15:15: error: unknown type name ‘u16’
      arch/h8300/include/asm/io.h:21:15: error: unknown type name ‘u32’
      
      and many related errors.
      
      Fixes: 23c82d41bdf4 ("kexec-allow-architectures-to-override-boot-mapping-fix")
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NGuenter Roeck <linux@roeck-us.net>
      2b05980d
    • G
      unicore32: mm: Add missing parameter to arch_vma_access_permitted · 783011b1
      Guenter Roeck 提交于
      unicore32 fails to compile with the following errors.
      
      mm/memory.c: In function ‘__handle_mm_fault’:
      mm/memory.c:3381: error:
      	too many arguments to function ‘arch_vma_access_permitted’
      mm/gup.c: In function ‘check_vma_flags’:
      mm/gup.c:456: error:
      	too many arguments to function ‘arch_vma_access_permitted’
      mm/gup.c: In function ‘vma_permits_fault’:
      mm/gup.c:640: error:
      	too many arguments to function ‘arch_vma_access_permitted’
      
      Fixes: d61172b4 ("mm/core, x86/mm/pkeys: Differentiate instruction fetches")
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: NGuenter Roeck <linux@roeck-us.net>
      Acked-by: NGuan Xuetao <gxt@mprc.pku.edu.cn>
      783011b1
    • M
      arm64: defconfig: enable CONFIG_LOCALVERSION_AUTO · 53fb45d3
      Masahiro Yamada 提交于
      When CONFIG_LOCALVERSION_AUTO is disabled, the version string is
      just a tag name (or with a '+' appended if HEAD is not a tagged
      commit).
      
      During the development (and especially when git-bisecting), longer
      version string would be helpful to identify the commit we are running.
      
      This is a default y option, so drop the unset to enable it.
      Signed-off-by: NMasahiro Yamada <yamada.masahiro@socionext.com>
      Signed-off-by: NCatalin Marinas <catalin.marinas@arm.com>
      53fb45d3
    • R
      arm64: defconfig: add options for virtualization and containers · 2323439f
      Riku Voipio 提交于
      Enable options commonly needed by popular virtualization
      and container applications. Use modules when possible to
      avoid too much overhead for users not interested.
      
      - add namespace and cgroup options needed
      - add seccomp - optional, but enhances Qemu etc
      - bridge, nat, veth, macvtap and multicast for routing
        guests and containers
      - btfrs and overlayfs modules for container COW backends
      - while near it, make fuse a module instead of built-in.
      
      Generated with make saveconfig and dropping unrelated spurious
      change hunks while commiting. bloat-o-meter old-vmlinux vmlinux:
      
      add/remove: 905/390 grow/shrink: 767/229 up/down: 183513/-94861 (88652)
      ....
      Total: Before=10515408, After=10604060, chg +0.84%
      Signed-off-by: NRiku Voipio <riku.voipio@linaro.org>
      Signed-off-by: NCatalin Marinas <catalin.marinas@arm.com>
      2323439f
    • M
      arm64: hibernate: handle allocation failures · dfbca61a
      Mark Rutland 提交于
      In create_safe_exec_page(), we create a copy of the hibernate exit text,
      along with some page tables to map this via TTBR0. We then install the
      new tables in TTBR0.
      
      In swsusp_arch_resume() we call create_safe_exec_page() before trying a
      number of operations which may fail (e.g. copying the linear map page
      tables). If these fail, we bail out of swsusp_arch_resume() and return
      an error code, but leave TTBR0 as-is. Subsequently, the core hibernate
      code will call free_basic_memory_bitmaps(), which will free all of the
      memory allocations we made, including the page tables installed in
      TTBR0.
      
      Thus, we may have TTBR0 pointing at dangling freed memory for some
      period of time. If the hibernate attempt was triggered by a user
      requesting a hibernate test via the reboot syscall, we may return to
      userspace with the clobbered TTBR0 value.
      
      Avoid these issues by reorganising swsusp_arch_resume() such that we
      have no failure paths after create_safe_exec_page(). We also add a check
      that the zero page allocation succeeded, matching what we have for other
      allocations.
      
      Fixes: 82869ac5 ("arm64: kernel: Add support for hibernate/suspend-to-disk")
      Signed-off-by: NMark Rutland <mark.rutland@arm.com>
      Acked-by: NJames Morse <james.morse@arm.com>
      Cc: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: <stable@vger.kernel.org> # 4.7+
      Signed-off-by: NCatalin Marinas <catalin.marinas@arm.com>
      dfbca61a
    • M
      arm64: hibernate: avoid potential TLB conflict · 0194e760
      Mark Rutland 提交于
      In create_safe_exec_page we install a set of global mappings in TTBR0,
      then subsequently invalidate TLBs. While TTBR0 points at the zero page,
      and the TLBs should be free of stale global entries, we may have stale
      ASID-tagged entries (e.g. from the EFI runtime services mappings) for
      the same VAs. Per the ARM ARM these ASID-tagged entries may conflict
      with newly-allocated global entries, and we must follow a
      Break-Before-Make approach to avoid issues resulting from this.
      
      This patch reworks create_safe_exec_page to invalidate TLBs while the
      zero page is still in place, ensuring that there are no potential
      conflicts when the new TTBR0 value is installed. As a single CPU is
      online while this code executes, we do not need to perform broadcast TLB
      maintenance, and can call local_flush_tlb_all(), which also subsumes
      some barriers. The remaining assembly is converted to use write_sysreg()
      and isb().
      
      Other than this, we safely manipulate TTBRs in the hibernate dance. The
      code we install as part of the new TTBR0 mapping (the hibernated
      kernel's swsusp_arch_suspend_exit) installs a zero page into TTBR1,
      invalidates TLBs, then installs its preferred value. Upon being restored
      to the middle of swsusp_arch_suspend, the new image will call
      __cpu_suspend_exit, which will call cpu_uninstall_idmap, installing the
      zero page in TTBR0 and invalidating all TLB entries.
      
      Fixes: 82869ac5 ("arm64: kernel: Add support for hibernate/suspend-to-disk")
      Signed-off-by: NMark Rutland <mark.rutland@arm.com>
      Acked-by: NJames Morse <james.morse@arm.com>
      Tested-by: NJames Morse <james.morse@arm.com>
      Cc: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: <stable@vger.kernel.org> # 4.7+
      Signed-off-by: NCatalin Marinas <catalin.marinas@arm.com>
      0194e760
    • L
      arm64: Handle el1 synchronous instruction aborts cleanly · 9adeb8e7
      Laura Abbott 提交于
      Executing from a non-executable area gives an ugly message:
      
      lkdtm: Performing direct entry EXEC_RODATA
      lkdtm: attempting ok execution at ffff0000084c0e08
      lkdtm: attempting bad execution at ffff000008880700
      Bad mode in Synchronous Abort handler detected on CPU2, code 0x8400000e -- IABT (current EL)
      CPU: 2 PID: 998 Comm: sh Not tainted 4.7.0-rc2+ #13
      Hardware name: linux,dummy-virt (DT)
      task: ffff800077e35780 ti: ffff800077970000 task.ti: ffff800077970000
      PC is at lkdtm_rodata_do_nothing+0x0/0x8
      LR is at execute_location+0x74/0x88
      
      The 'IABT (current EL)' indicates the error but it's a bit cryptic
      without knowledge of the ARM ARM. There is also no indication of the
      specific address which triggered the fault. The increase in kernel
      page permissions makes hitting this case more likely as well.
      Handling the case in the vectors gives a much more familiar looking
      error message:
      
      lkdtm: Performing direct entry EXEC_RODATA
      lkdtm: attempting ok execution at ffff0000084c0840
      lkdtm: attempting bad execution at ffff000008880680
      Unable to handle kernel paging request at virtual address ffff000008880680
      pgd = ffff8000089b2000
      [ffff000008880680] *pgd=00000000489b4003, *pud=0000000048904003, *pmd=0000000000000000
      Internal error: Oops: 8400000e [#1] PREEMPT SMP
      Modules linked in:
      CPU: 1 PID: 997 Comm: sh Not tainted 4.7.0-rc1+ #24
      Hardware name: linux,dummy-virt (DT)
      task: ffff800077f9f080 ti: ffff800008a1c000 task.ti: ffff800008a1c000
      PC is at lkdtm_rodata_do_nothing+0x0/0x8
      LR is at execute_location+0x74/0x88
      Acked-by: NMark Rutland <mark.rutland@arm.com>
      Signed-off-by: NLaura Abbott <labbott@redhat.com>
      Signed-off-by: NCatalin Marinas <catalin.marinas@arm.com>
      9adeb8e7
  6. 12 8月, 2016 12 次提交
    • J
      MIPS: KVM: Propagate kseg0/mapped tlb fault errors · 9b731bcf
      James Hogan 提交于
      Propagate errors from kvm_mips_handle_kseg0_tlb_fault() and
      kvm_mips_handle_mapped_seg_tlb_fault(), usually triggering an internal
      error since they normally indicate the guest accessed bad physical
      memory or the commpage in an unexpected way.
      
      Fixes: 858dd5d4 ("KVM/MIPS32: MMU/TLB operations for the Guest.")
      Fixes: e685c689 ("KVM/MIPS32: Privileged instruction/target branch emulation.")
      Signed-off-by: NJames Hogan <james.hogan@imgtec.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: "Radim Krčmář" <rkrcmar@redhat.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: linux-mips@linux-mips.org
      Cc: kvm@vger.kernel.org
      Cc: <stable@vger.kernel.org> # 3.10.x-
      Signed-off-by: NRadim Krčmář <rkrcmar@redhat.com>
      9b731bcf
    • J
      MIPS: KVM: Fix gfn range check in kseg0 tlb faults · 0741f52d
      James Hogan 提交于
      Two consecutive gfns are loaded into host TLB, so ensure the range check
      isn't off by one if guest_pmap_npages is odd.
      
      Fixes: 858dd5d4 ("KVM/MIPS32: MMU/TLB operations for the Guest.")
      Signed-off-by: NJames Hogan <james.hogan@imgtec.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: "Radim Krčmář" <rkrcmar@redhat.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: linux-mips@linux-mips.org
      Cc: kvm@vger.kernel.org
      Cc: <stable@vger.kernel.org> # 3.10.x-
      Signed-off-by: NRadim Krčmář <rkrcmar@redhat.com>
      0741f52d
    • J
      MIPS: KVM: Add missing gfn range check · 8985d503
      James Hogan 提交于
      kvm_mips_handle_mapped_seg_tlb_fault() calculates the guest frame number
      based on the guest TLB EntryLo values, however it is not range checked
      to ensure it lies within the guest_pmap. If the physical memory the
      guest refers to is out of range then dump the guest TLB and emit an
      internal error.
      
      Fixes: 858dd5d4 ("KVM/MIPS32: MMU/TLB operations for the Guest.")
      Signed-off-by: NJames Hogan <james.hogan@imgtec.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: "Radim Krčmář" <rkrcmar@redhat.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: linux-mips@linux-mips.org
      Cc: kvm@vger.kernel.org
      Cc: <stable@vger.kernel.org> # 3.10.x-
      Signed-off-by: NRadim Krčmář <rkrcmar@redhat.com>
      8985d503
    • J
      MIPS: KVM: Fix mapped fault broken commpage handling · c604cffa
      James Hogan 提交于
      kvm_mips_handle_mapped_seg_tlb_fault() appears to map the guest page at
      virtual address 0 to PFN 0 if the guest has created its own mapping
      there. The intention is unclear, but it may have been an attempt to
      protect the zero page from being mapped to anything but the comm page in
      code paths you wouldn't expect from genuine commpage accesses (guest
      kernel mode cache instructions on that address, hitting trapping
      instructions when executing from that address with a coincidental TLB
      eviction during the KVM handling, and guest user mode accesses to that
      address).
      
      Fix this to check for mappings exactly at KVM_GUEST_COMMPAGE_ADDR (it
      may not be at address 0 since commit 42aa12e7 ("MIPS: KVM: Move
      commpage so 0x0 is unmapped")), and set the corresponding EntryLo to be
      interpreted as 0 (invalid).
      
      Fixes: 858dd5d4 ("KVM/MIPS32: MMU/TLB operations for the Guest.")
      Signed-off-by: NJames Hogan <james.hogan@imgtec.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: "Radim Krčmář" <rkrcmar@redhat.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: linux-mips@linux-mips.org
      Cc: kvm@vger.kernel.org
      Cc: <stable@vger.kernel.org> # 3.10.x-
      Signed-off-by: NRadim Krčmář <rkrcmar@redhat.com>
      c604cffa
    • C
      KVM: Protect device ops->create and list_add with kvm->lock · a28ebea2
      Christoffer Dall 提交于
      KVM devices were manipulating list data structures without any form of
      synchronization, and some implementations of the create operations also
      suffered from a lack of synchronization.
      
      Now when we've split the xics create operation into create and init, we
      can hold the kvm->lock mutex while calling the create operation and when
      manipulating the devices list.
      
      The error path in the generic code gets slightly ugly because we have to
      take the mutex again and delete the device from the list, but holding
      the mutex during anon_inode_getfd or releasing/locking the mutex in the
      common non-error path seemed wrong.
      Signed-off-by: NChristoffer Dall <christoffer.dall@linaro.org>
      Reviewed-by: NPaolo Bonzini <pbonzini@redhat.com>
      Acked-by: NChristian Borntraeger <borntraeger@de.ibm.com>
      Signed-off-by: NRadim Krčmář <rkrcmar@redhat.com>
      a28ebea2
    • C
      KVM: PPC: Move xics_debugfs_init out of create · 023e9fdd
      Christoffer Dall 提交于
      As we are about to hold the kvm->lock during the create operation on KVM
      devices, we should move the call to xics_debugfs_init into its own
      function, since holding a mutex over extended amounts of time might not
      be a good idea.
      
      Introduce an init operation on the kvm_device_ops struct which cannot
      fail and call this, if configured, after the device has been created.
      Signed-off-by: NChristoffer Dall <christoffer.dall@linaro.org>
      Reviewed-by: NPaolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: NRadim Krčmář <rkrcmar@redhat.com>
      023e9fdd
    • J
      KVM: s390: reset KVM_REQ_MMU_RELOAD if mapping the prefix failed · aca411a4
      Julius Niedworok 提交于
      When triggering KVM_RUN without a user memory region being mapped
      (KVM_SET_USER_MEMORY_REGION) a validity intercept occurs. This could
      happen, if the user memory region was not mapped initially or if it
      was unmapped after the vcpu is initialized. The function
      kvm_s390_handle_requests checks for the KVM_REQ_MMU_RELOAD bit. The
      check function always clears this bit. If gmap_mprotect_notify
      returns an error code, the mapping failed, but the KVM_REQ_MMU_RELOAD
      was not set anymore. So the next time kvm_s390_handle_requests is
      called, the execution would fall trough the check for
      KVM_REQ_MMU_RELOAD. The bit needs to be resetted, if
      gmap_mprotect_notify returns an error code. Resetting the bit with
      kvm_make_request(KVM_REQ_MMU_RELOAD, vcpu) fixes the bug.
      Reviewed-by: NDavid Hildenbrand <dahi@linux.vnet.ibm.com>
      Signed-off-by: NJulius Niedworok <jniedwor@linux.vnet.ibm.com>
      Signed-off-by: NChristian Borntraeger <borntraeger@de.ibm.com>
      aca411a4
    • J
      KVM: s390: set the prefix initially properly · 75a4615c
      Julius Niedworok 提交于
      When KVM_RUN is triggered on a VCPU without an initial reset, a
      validity intercept occurs.
      Setting the prefix will set the KVM_REQ_MMU_RELOAD bit initially,
      thus preventing the bug.
      Reviewed-by: NDavid Hildenbrand <dahi@linux.vnet.ibm.com>
      Acked-by: NCornelia Huck <cornelia.huck@de.ibm.com>
      Signed-off-by: NJulius Niedworok <jniedwor@linux.vnet.ibm.com>
      Signed-off-by: NChristian Borntraeger <borntraeger@de.ibm.com>
      75a4615c
    • K
      perf/x86/intel/uncore: Add enable_box for client MSR uncore · 95f3be79
      Kan Liang 提交于
      There are bug reports about miscounting uncore counters on some
      client machines like Sandybridge, Broadwell and Skylake. It is
      very likely to be observed on idle systems.
      
      This issue is caused by a hardware issue. PERF_GLOBAL_CTL could be
      cleared after Package C7, and nothing will be count.
      The related errata (HSD 158) could be found in:
      
        www.intel.com/content/dam/www/public/us/en/documents/specification-updates/4th-gen-core-family-desktop-specification-update.pdf
      
      This patch tries to work around this issue by re-enabling PERF_GLOBAL_CTL
      in ->enable_box(). The workaround does not cover all cases. It helps for new
      events after returning from C7. But it cannot prevent C7, it will still
      miscount if a counter is already active.
      
      There is no drawback in leaving it enabled, so it does not need
      disable_box() here.
      Signed-off-by: NKan Liang <kan.liang@intel.com>
      Cc: <stable@vger.kernel.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Link: http://lkml.kernel.org/r/1470925874-59943-1-git-send-email-kan.liang@intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      95f3be79
    • K
      perf/x86/intel/uncore: Fix uncore num_counters · 10e9e7bd
      Kan Liang 提交于
      Some uncore boxes' num_counters value for Haswell server and
      Broadwell server are not correct (too large, off by one).
      
      This issue was found by comparing the code with the document. Although
      there is no bug report from users yet, accessing non-existent counters
      is dangerous and the behavior is undefined: it may cause miscounting or
      even crashes.
      
      This patch makes them consistent with the uncore document.
      Reported-by: NLukasz Odzioba <lukasz.odzioba@intel.com>
      Signed-off-by: NKan Liang <kan.liang@intel.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: <stable@vger.kernel.org>
      Link: http://lkml.kernel.org/r/1470925820-59847-1-git-send-email-kan.liang@intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      10e9e7bd
    • D
      uprobes/x86: Fix RIP-relative handling of EVEX-encoded instructions · 68187872
      Denys Vlasenko 提交于
      Since instruction decoder now supports EVEX-encoded instructions, two fixes
      are needed to correctly handle them in uprobes.
      
      Extended bits for MODRM.rm field need to be sanitized just like we do it
      for VEX3, to avoid encoding wrong register for register-relative access.
      
      EVEX has _two_ extended bits: b and x. Theoretically, EVEX.x should be
      ignored by the CPU (since GPRs go only up to 15, not 31), but let's be
      paranoid here: proper encoding for register-relative access
      should have EVEX.x = 1.
      
      Secondly, we should fetch vex.vvvv for EVEX too.
      This is now super easy because instruction decoder populates
      vex_prefix.bytes[2] for all flavors of (e)vex encodings, even for VEX2.
      Signed-off-by: NDenys Vlasenko <dvlasenk@redhat.com>
      Acked-by: NMasami Hiramatsu <mhiramat@kernel.org>
      Acked-by: NSrikar Dronamraju <srikar@linux.vnet.ibm.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Jim Keniston <jkenisto@us.ibm.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: linux-kernel@vger.kernel.org
      Cc: <stable@vger.kernel.org> # v4.1+
      Fixes: 8a764a87 ("x86/asm/decoder: Create artificial 3rd byte for 2-byte VEX")
      Link: http://lkml.kernel.org/r/20160811154521.20469-1-dvlasenk@redhat.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      68187872
    • D
      arm64: Remove stack duplicating code from jprobes · ad05711c
      David A. Long 提交于
      Because the arm64 calling standard allows stacked function arguments to be
      anywhere in the stack frame, do not attempt to duplicate the stack frame for
      jprobes handler functions.
      
      Documentation changes to describe this issue have been broken out into a
      separate patch in order to simultaneously address them in other
      architecture(s).
      Signed-off-by: NDavid A. Long <dave.long@linaro.org>
      Acked-by: NMasami Hiramatsu <mhiramat@kernel.org>
      Acked-by: NMarc Zyngier <marc.zyngier@arm.com>
      Signed-off-by: NCatalin Marinas <catalin.marinas@arm.com>
      ad05711c
  7. 11 8月, 2016 8 次提交
    • S
      x86/apic/x2apic, smp/hotplug: Don't use before alloc in x2apic_cluster_probe() · d52c0569
      Sebastian Andrzej Siewior 提交于
      I made a mistake while converting the driver to the hotplug state
      machine and as a result x2apic_cluster_probe() was accessing
      cpus_in_cluster before allocating it.
      
      This patch fixes it by setting the cpumask after the allocation the
      memory succeeded.
      
      While at it, I marked two functions static which are only used within
      this file.
      Reported-by: NLaura Abbott <labbott@redhat.com>
      Signed-off-by: NSebastian Andrzej Siewior <bigeasy@linutronix.de>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Fixes: 6b2c2847 ("x86/x2apic: Convert to CPU hotplug state machine")
      Link: http://lkml.kernel.org/r/1470924515-9444-1-git-send-email-bigeasy@linutronix.deSigned-off-by: NIngo Molnar <mingo@kernel.org>
      d52c0569
    • A
      x86/platform/uv: Skip UV runtime services mapping in the efi_runtime_disabled case · f72075c9
      Alex Thorlton 提交于
      This problem has actually been in the UV code for a while, but we didn't
      catch it until recently, because we had been relying on EFI_OLD_MEMMAP
      to allow our systems to boot for a period of time.  We noticed the issue
      when trying to kexec a recent community kernel, where we hit this NULL
      pointer dereference in efi_sync_low_kernel_mappings():
      
       [    0.337515] BUG: unable to handle kernel NULL pointer dereference at 0000000000000880
       [    0.346276] IP: [<ffffffff8105df8d>] efi_sync_low_kernel_mappings+0x5d/0x1b0
      
      The problem doesn't show up with EFI_OLD_MEMMAP because we skip the
      chunk of setup_efi_state() that sets the efi_loader_signature for the
      kexec'd kernel.  When the kexec'd kernel boots, it won't set EFI_BOOT in
      setup_arch, so we completely avoid the bug.
      
      We always kexec with noefi on the command line, so this shouldn't be an
      issue, but since we're not actually checking for efi_runtime_disabled in
      uv_bios_init(), we end up trying to do EFI runtime callbacks when we
      shouldn't be. This patch just adds a check for efi_runtime_disabled in
      uv_bios_init() so that we don't map in uv_systab when runtime_disabled ==
      true.
      Signed-off-by: NAlex Thorlton <athorlton@sgi.com>
      Signed-off-by: NMatt Fleming <matt@codeblueprint.co.uk>
      Cc: <stable@vger.kernel.org> # v4.7
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Travis <travis@sgi.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Russ Anderson <rja@sgi.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-efi@vger.kernel.org
      Link: http://lkml.kernel.org/r/1470912120-22831-2-git-send-email-matt@codeblueprint.co.ukSigned-off-by: NIngo Molnar <mingo@kernel.org>
      f72075c9
    • A
      x86/efi: Allocate a trampoline if needed in efi_free_boot_services() · 5bc653b7
      Andy Lutomirski 提交于
      On my Dell XPS 13 9350 with firmware 1.4.4 and SGX on, if I boot
      Fedora 24's grub2-efi off a hard disk, my first 1MB of RAM looks
      like:
      
       efi: mem00: [Runtime Data       |RUN|  |  |  |  |  |  |   |WB|WT|WC|UC] range=[0x0000000000000000-0x0000000000000fff] (0MB)
       efi: mem01: [Boot Data          |   |  |  |  |  |  |  |   |WB|WT|WC|UC] range=[0x0000000000001000-0x0000000000027fff] (0MB)
       efi: mem02: [Loader Data        |   |  |  |  |  |  |  |   |WB|WT|WC|UC] range=[0x0000000000028000-0x0000000000029fff] (0MB)
       efi: mem03: [Reserved           |   |  |  |  |  |  |  |   |WB|WT|WC|UC] range=[0x000000000002a000-0x000000000002bfff] (0MB)
       efi: mem04: [Runtime Data       |RUN|  |  |  |  |  |  |   |WB|WT|WC|UC] range=[0x000000000002c000-0x000000000002cfff] (0MB)
       efi: mem05: [Loader Data        |   |  |  |  |  |  |  |   |WB|WT|WC|UC] range=[0x000000000002d000-0x000000000002dfff] (0MB)
       efi: mem06: [Conventional Memory|   |  |  |  |  |  |  |   |WB|WT|WC|UC] range=[0x000000000002e000-0x0000000000057fff] (0MB)
       efi: mem07: [Reserved           |   |  |  |  |  |  |  |   |WB|WT|WC|UC] range=[0x0000000000058000-0x0000000000058fff] (0MB)
       efi: mem08: [Conventional Memory|   |  |  |  |  |  |  |   |WB|WT|WC|UC] range=[0x0000000000059000-0x000000000009ffff] (0MB)
      
      My EBDA is at 0x2c000, which blocks off everything from 0x2c000 and
      up, and my trampoline is 0x6000 bytes (6 pages), so it doesn't fit
      in the loader data range at 0x28000.
      
      Without this patch, it panics due to a failure to allocate the
      trampoline.  With this patch, it works:
      
       [  +0.001744] Base memory trampoline at [ffff880000001000] 1000 size 24576
      Signed-off-by: NAndy Lutomirski <luto@kernel.org>
      Reviewed-by: NMatt Fleming <matt@codeblueprint.co.uk>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mario Limonciello <mario_limonciello@dell.com>
      Cc: Matt Fleming <mfleming@suse.de>
      Cc: Matthew Garrett <mjg59@srcf.ucam.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/998c77b3bf709f3dfed85cb30701ed1a5d8a438b.1470821230.git.luto@kernel.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
      5bc653b7
    • A
      x86/boot: Rework reserve_real_mode() to allow multiple tries · 5ff3e2c3
      Andy Lutomirski 提交于
      If reserve_real_mode() fails, panicing immediately means we're
      doomed.  Make it safe to try more than once to allocate the
      trampoline:
      
       - Degrade a failure from panic() to pr_info().  (If we make it to
         setup_real_mode() without reserving the trampoline, we'll panic
         them.)
      
       - Factor out helpers so that platform code can supply a specific
         address to try.
      
       - Warn if reserve_real_mode() is called after we're done with the
         memblock allocator.  If that were to happen, we would behave
         unpredictably.
      Signed-off-by: NAndy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mario Limonciello <mario_limonciello@dell.com>
      Cc: Matt Fleming <mfleming@suse.de>
      Cc: Matthew Garrett <mjg59@srcf.ucam.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/876e383038f3e9971aa72fd20a4f5da05f9d193d.1470821230.git.luto@kernel.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
      5ff3e2c3
    • A
      x86/boot: Defer setup_real_mode() to early_initcall time · d0de0f68
      Andy Lutomirski 提交于
      There's no need to run setup_real_mode() as early as we run it.
      Defer it to the same early_initcall that sets up the page
      permissions for the real mode code.
      
      This should be a code size reduction.  More importantly, it give us
      a longer window in which we can allocate the real mode trampoline.
      Signed-off-by: NAndy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mario Limonciello <mario_limonciello@dell.com>
      Cc: Matt Fleming <mfleming@suse.de>
      Cc: Matthew Garrett <mjg59@srcf.ucam.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/fd62f0da4f79357695e9bf3e365623736b05f119.1470821230.git.luto@kernel.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
      d0de0f68
    • A
      x86/boot: Synchronize trampoline_cr4_features and mmu_cr4_features directly · 18bc7bd5
      Andy Lutomirski 提交于
      The initialization process for trampoline_cr4_features and
      mmu_cr4_features was confusing.  The intent is for mmu_cr4_features
      and *trampoline_cr4_features to stay in sync, but
      trampoline_cr4_features is NULL until setup_real_mode() runs.  The
      old code synchronized *trampoline_cr4_features *twice*, once in
      setup_real_mode() and once in setup_arch().  It also initialized
      mmu_cr4_features in setup_real_mode(), which causes the actual value
      of mmu_cr4_features to potentially depend on when setup_real_mode()
      is called.
      
      With this patch, mmu_cr4_features is initialized directly in
      setup_arch(), and *trampoline_cr4_features is synchronized to
      mmu_cr4_features when the trampoline is set up.
      
      After this patch, it should be safe to defer setup_real_mode().
      Signed-off-by: NAndy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mario Limonciello <mario_limonciello@dell.com>
      Cc: Matt Fleming <mfleming@suse.de>
      Cc: Matthew Garrett <mjg59@srcf.ucam.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/d48a263f9912389b957dd495a7127b009259ffe0.1470821230.git.luto@kernel.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
      18bc7bd5
    • A
      x86/boot: Run reserve_bios_regions() after we initialize the memory map · 007b7560
      Andy Lutomirski 提交于
      reserve_bios_regions() is a quirk that reserves memory that we might
      otherwise think is available.  There's no need to run it so early,
      and running it before we have the memory map initialized with its
      non-quirky inputs makes it hard to make reserve_bios_regions() more
      intelligent.
      
      Move it right after we populate the memblock state.
      Signed-off-by: NAndy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mario Limonciello <mario_limonciello@dell.com>
      Cc: Matt Fleming <mfleming@suse.de>
      Cc: Matthew Garrett <mjg59@srcf.ucam.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/59f58618911005c799c6c9979ce6ae4881d907c2.1470821230.git.luto@kernel.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
      007b7560
    • A
      x86/irq: Do not substract irq_tlb_count from irq_call_count · 82ba4fac
      Aaron Lu 提交于
      Since commit:
      
        52aec330 ("x86/tlb: replace INVALIDATE_TLB_VECTOR by CALL_FUNCTION_VECTOR")
      
      the TLB remote shootdown is done through call function vector. That
      commit didn't take care of irq_tlb_count, which a later commit:
      
        fd0f5869 ("x86: Distinguish TLB shootdown interrupts from other functions call interrupts")
      
      ... tried to fix.
      
      The fix assumes every increase of irq_tlb_count has a corresponding
      increase of irq_call_count. So the irq_call_count is always bigger than
      irq_tlb_count and we could substract irq_tlb_count from irq_call_count.
      
      Unfortunately this is not true for the smp_call_function_single() case.
      The IPI is only sent if the target CPU's call_single_queue is empty when
      adding a csd into it in generic_exec_single. That means if two threads
      are both adding flush tlb csds to the same CPU's call_single_queue, only
      one IPI is sent. In other words, the irq_call_count is incremented by 1
      but irq_tlb_count is incremented by 2. Over time, irq_tlb_count will be
      bigger than irq_call_count and the substract will produce a very large
      irq_call_count value due to overflow.
      
      Considering that:
      
        1) it's not worth to send more IPIs for the sake of accurate counting of
           irq_call_count in generic_exec_single();
      
        2) it's not easy to tell if the call function interrupt is for TLB
           shootdown in __smp_call_function_single_interrupt().
      
      Not to exclude TLB shootdown from call function count seems to be the
      simplest fix and this patch just does that.
      
      This bug was found by LKP's cyclic performance regression tracking recently
      with the vm-scalability test suite. I have bisected to commit:
      
        3dec0ba0 ("mm/rmap: share the i_mmap_rwsem")
      
      This commit didn't do anything wrong but revealed the irq_call_count
      problem. IIUC, the commit makes rwc->remap_one in rmap_walk_file
      concurrent with multiple threads.  When remap_one is try_to_unmap_one(),
      then multiple threads could queue flush TLB to the same CPU but only
      one IPI will be sent.
      
      Since the commit was added in Linux v3.19, the counting problem only
      shows up from v3.19 onwards.
      Signed-off-by: NAaron Lu <aaron.lu@intel.com>
      Cc: Alex Shi <alex.shi@linaro.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tomoki Sekiyama <tomoki.sekiyama.qu@hitachi.com>
      Link: http://lkml.kernel.org/r/20160811074430.GA18163@aaronlu.sh.intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      82ba4fac