1. 08 Aug 2018 (2 commits)
  2. 07 Aug 2018 (1 commit)
    • cpu/hotplug: Fix SMT supported evaluation · bc2d8d26
      Thomas Gleixner authored
      Josh reported that the late SMT evaluation in cpu_smt_state_init() sets
      cpu_smt_control to CPU_SMT_NOT_SUPPORTED when 'nosmt' is supplied on the
      kernel command line, because it cannot differentiate between SMT disabled
      by BIOS and SMT soft-disabled via 'nosmt'. That wrecks the state and
      makes the sysfs interface unusable.
      
      Rework this so that during bringup of the non boot CPUs the availability of
      SMT is determined in cpu_smt_allowed(). If a newly booted CPU is not a
      'primary' thread then set the local cpu_smt_available marker and evaluate
      this explicitly right after the initial SMP bringup has finished.
      
      SMT evaluation on x86 is a trainwreck as the firmware has all the
      information _before_ booting the kernel, but there is no interface to query
      it.
      
      Fixes: 73d5e2b4 ("cpu/hotplug: detect SMT disabled by BIOS")
      Reported-by: Josh Poimboeuf <jpoimboe@redhat.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      bc2d8d26
  3. 05 Aug 2018 (11 commits)
    • KVM: VMX: Tell the nested hypervisor to skip L1D flush on vmentry · 5b76a3cf
      Paolo Bonzini authored
      When nested virtualization is in use, VMENTER operations from the nested
      hypervisor into the nested guest will always be processed by the bare metal
      hypervisor, and KVM's "conditional cache flushes" mode in particular does a
      flush on nested vmentry.  Therefore, include the "skip L1D flush on
      vmentry" bit in KVM's suggested ARCH_CAPABILITIES setting.
      
      Add the relevant Documentation.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      5b76a3cf
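      In rough terms, the hypervisor-side change boils down to ORing the
      "skip L1D flush on vmentry" bit into the ARCH_CAPABILITIES value that
      KVM reports to its guests whenever KVM itself already flushes (or no
      flush is needed). The snippet below is an illustrative sketch, not the
      literal patch; identifier names follow the upstream kernel of that era.

        #define ARCH_CAP_SKIP_VMENTRY_L1DFLUSH  (1ULL << 3)

        u64 kvm_get_arch_capabilities(void)
        {
                u64 data;

                /* Start from what the host CPU itself reports. */
                rdmsrl_safe(MSR_IA32_ARCH_CAPABILITIES, &data);

                /*
                 * If KVM flushes L1D on vmentry in any mode other than
                 * "never", a nested hypervisor on top of it may skip its
                 * own flush.
                 */
                if (l1tf_vmx_mitigation != VMENTER_L1D_FLUSH_NEVER)
                        data |= ARCH_CAP_SKIP_VMENTRY_L1DFLUSH;

                return data;
        }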
    • x86/speculation: Use ARCH_CAPABILITIES to skip L1D flush on vmentry · 8e0b2b91
      Paolo Bonzini authored
      Bit 3 of ARCH_CAPABILITIES tells a hypervisor that L1D flush on vmentry is
      not needed.  Add a new value to enum vmx_l1d_flush_state, which is used
      either if there is no L1TF bug at all, or if bit 3 is set in ARCH_CAPABILITIES.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      8e0b2b91
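      The consumer side can be pictured with the following condensed sketch
      of the mitigation-selection path: if ARCH_CAPABILITIES (possibly
      synthesized by an outer hypervisor) has bit 3 set, pick the new
      "flush not required" state and set up no flush at all. Names follow
      upstream, but the control flow is simplified.

        if (boot_cpu_has(X86_FEATURE_ARCH_CAPABILITIES)) {
                u64 msr;

                rdmsrl(MSR_IA32_ARCH_CAPABILITIES, msr);
                if (msr & ARCH_CAP_SKIP_VMENTRY_L1DFLUSH) {
                        /* The host/outer hypervisor flushes for us. */
                        l1tf_vmx_mitigation = VMENTER_L1D_FLUSH_NOT_REQUIRED;
                        return 0;
                }
        }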
    • x86/speculation: Simplify sysfs report of VMX L1TF vulnerability · ea156d19
      Paolo Bonzini authored
      Three changes to the content of the sysfs file:
      
       - If EPT is disabled, L1TF cannot be exploited even across threads on the
         same core, and SMT is irrelevant.
      
       - If mitigation is completely disabled, and SMT is enabled, print "vulnerable"
         instead of "vulnerable, SMT vulnerable"
      
       - Reorder the two parts so that the main vulnerability state comes first
         and the detail on SMT is second.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      ea156d19
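      Taken together, the reporting logic ends up with roughly the shape
      below. This is a simplified sketch of the sysfs show function; the
      exact strings and helpers in arch/x86/kernel/cpu/bugs.c may differ in
      detail.

        static ssize_t l1tf_show_state(char *buf)
        {
                if (l1tf_vmx_mitigation == VMENTER_L1D_FLUSH_AUTO)
                        return sprintf(buf, "%s\n", L1TF_DEFAULT_MSG);

                /*
                 * EPT disabled, or flush disabled while SMT is on: the SMT
                 * detail adds nothing, so print only the main state.
                 */
                if (l1tf_vmx_mitigation == VMENTER_L1D_FLUSH_EPT_DISABLED ||
                    (l1tf_vmx_mitigation == VMENTER_L1D_FLUSH_NEVER &&
                     cpu_smt_control == CPU_SMT_ENABLED))
                        return sprintf(buf, "%s; VMX: %s\n", L1TF_DEFAULT_MSG,
                                       l1tf_vmx_states[l1tf_vmx_mitigation]);

                /* Main vulnerability state first, SMT detail second. */
                return sprintf(buf, "%s; VMX: %s, SMT %s\n", L1TF_DEFAULT_MSG,
                               l1tf_vmx_states[l1tf_vmx_mitigation],
                               cpu_smt_control == CPU_SMT_ENABLED ?
                                       "vulnerable" : "disabled");
        }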
    • x86/KVM/VMX: Don't set l1tf_flush_l1d from vmx_handle_external_intr() · 18b57ce2
      Nicolai Stange authored
      For VMEXITs caused by external interrupts, vmx_handle_external_intr()
      indirectly calls into the interrupt handlers through the host's IDT.
      
      It follows that these interrupts get accounted for in the
      kvm_cpu_l1tf_flush_l1d per-cpu flag.
      
      The subsequently executed vmx_l1d_flush() will thus be aware that some
      interrupts have happened and conduct a L1d flush anyway.
      
      Setting l1tf_flush_l1d from vmx_handle_external_intr() isn't needed
      anymore. Drop it.
      Signed-off-by: Nicolai Stange <nstange@suse.de>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      18b57ce2
    • x86/irq: Let interrupt handlers set kvm_cpu_l1tf_flush_l1d · ffcba43f
      Nicolai Stange authored
      The last missing piece to having vmx_l1d_flush() take interrupts after
      VMEXIT into account is to set the kvm_cpu_l1tf_flush_l1d per-cpu flag on
      irq entry.
      
      Issue calls to kvm_set_cpu_l1tf_flush_l1d() from entering_irq(),
      ipi_entering_ack_irq(), smp_reschedule_interrupt() and
      uv_bau_message_interrupt().
      Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Nicolai Stange <nstange@suse.de>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      ffcba43f
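      Conceptually the hook is a one-liner at each of those entry points,
      for example in entering_irq(); the snippet below is a sketch of the
      shape of the change rather than the exact diff.

        static inline void entering_irq(void)
        {
                irq_enter();
                /*
                 * Remember that host interrupt code ran since the last
                 * VMENTER, so the conditional L1D flush logic can take it
                 * into account.
                 */
                kvm_set_cpu_l1tf_flush_l1d();
        }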
    • x86: Don't include linux/irq.h from asm/hardirq.h · 447ae316
      Nicolai Stange authored
      The next patch in this series will have to make the definition of
      irq_cpustat_t available to entering_irq().
      
      Inclusion of asm/hardirq.h into asm/apic.h would cause circular header
      dependencies like
      
        asm/smp.h
          asm/apic.h
            asm/hardirq.h
              linux/irq.h
                linux/topology.h
                  linux/smp.h
                    asm/smp.h
      
      or
      
        linux/gfp.h
          linux/mmzone.h
            asm/mmzone.h
              asm/mmzone_64.h
                asm/smp.h
                  asm/apic.h
                    asm/hardirq.h
                      linux/irq.h
                        linux/irqdesc.h
                          linux/kobject.h
                            linux/sysfs.h
                              linux/kernfs.h
                                linux/idr.h
                                  linux/gfp.h
      
      and others.
      
      This causes compilation errors because of the header guards becoming
      effective in the second inclusion: symbols/macros that had been defined
      before wouldn't be available to intermediate headers in the #include chain
      anymore.
      
      A possible workaround would be to move the definition of irq_cpustat_t
      into its own header and include that from both, asm/hardirq.h and
      asm/apic.h.
      
      However, this wouldn't solve the real problem, namely asm/hardirq.h
      unnecessarily pulling in all the linux/irq.h cruft: nothing in
      asm/hardirq.h itself requires it. Also, note that there are some other
      archs, like e.g. arm64, which don't have that #include in their
      asm/hardirq.h.
      
      Remove the linux/irq.h #include from x86' asm/hardirq.h.
      
      Fix resulting compilation errors by adding appropriate #includes to *.c
      files as needed.
      
      Note that some of these *.c files could be cleaned up a bit with respect to their
      set of #includes, but that should better be done from separate patches, if
      at all.
      Signed-off-by: Nicolai Stange <nstange@suse.de>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      447ae316
    • x86/KVM/VMX: Introduce per-host-cpu analogue of l1tf_flush_l1d · 45b575c0
      Nicolai Stange authored
      Part of the L1TF mitigation for vmx includes flushing the L1D cache upon
      VMENTRY.
      
      L1D flushes are costly and two modes of operations are provided to users:
      "always" and the more selective "conditional" mode.
      
      If operating in the latter, the cache would get flushed only if a host side
      code path considered unconfined had been traversed. "Unconfined" in this
      context means that it might have pulled in sensitive data like user data
      or kernel crypto keys.
      
      The need for L1D flushes is tracked by means of the per-vcpu flag
      l1tf_flush_l1d. KVM exit handlers considered unconfined set it. A
      vmx_l1d_flush() subsequently invoked before the next VMENTER will conduct a
      L1d flush based on its value and reset that flag again.
      
      Currently, interrupts delivered "normally" while in root operation between
      VMEXIT and VMENTER are not taken into account. Part of the reason is that
      these don't leave any traces and thus, the vmx code is unable to tell if
      any such has happened.
      
      As proposed by Paolo Bonzini, prepare for tracking all interrupts by
      introducing a new per-cpu flag, "kvm_cpu_l1tf_flush_l1d". It will be in
      strong analogy to the per-vcpu ->l1tf_flush_l1d.
      
      A later patch will make interrupt handlers set it.
      
      For the sake of cache locality, group kvm_cpu_l1tf_flush_l1d into x86'
      per-cpu irq_cpustat_t as suggested by Peter Zijlstra.
      
      Provide the helpers kvm_set_cpu_l1tf_flush_l1d(),
      kvm_clear_cpu_l1tf_flush_l1d() and kvm_get_cpu_l1tf_flush_l1d(). Make them
      trivial or non-existent, respectively, for !CONFIG_KVM_INTEL.
      
      Let vmx_l1d_flush() handle kvm_cpu_l1tf_flush_l1d in the same way as
      l1tf_flush_l1d.
      Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
      Suggested-by: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Nicolai Stange <nstange@suse.de>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
      45b575c0
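      The shape of the new per-cpu state and its helpers is roughly the
      following. This is a sketch based on the x86 irq_cpustat_t of that
      time; field order, config guards and the remaining counters are elided.

        typedef struct {
                u16 __softirq_pending;
        #if IS_ENABLED(CONFIG_KVM_INTEL)
                u8  kvm_cpu_l1tf_flush_l1d;     /* IRQ seen since VMEXIT */
        #endif
                /* ... remaining per-cpu irq counters ... */
        } ____cacheline_aligned irq_cpustat_t;

        DECLARE_PER_CPU_SHARED_ALIGNED(irq_cpustat_t, irq_stat);

        #if IS_ENABLED(CONFIG_KVM_INTEL)
        static inline void kvm_set_cpu_l1tf_flush_l1d(void)
        {
                __this_cpu_write(irq_stat.kvm_cpu_l1tf_flush_l1d, 1);
        }

        static inline void kvm_clear_cpu_l1tf_flush_l1d(void)
        {
                __this_cpu_write(irq_stat.kvm_cpu_l1tf_flush_l1d, 0);
        }

        static inline bool kvm_get_cpu_l1tf_flush_l1d(void)
        {
                return __this_cpu_read(irq_stat.kvm_cpu_l1tf_flush_l1d);
        }
        #else
        static inline void kvm_set_cpu_l1tf_flush_l1d(void) { }
        #endif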
    • x86/irq: Demote irq_cpustat_t::__softirq_pending to u16 · 9aee5f8a
      Nicolai Stange authored
      An upcoming patch will extend KVM's L1TF mitigation in conditional mode
      to also cover interrupts after VMEXITs. For tracking those, stores to a
      new per-cpu flag from interrupt handlers will become necessary.
      
      In order to improve cache locality, this new flag will be added to x86's
      irq_cpustat_t.
      
      Make some space available there by shrinking the ->softirq_pending bitfield
      from 32 to 16 bits: the number of bits actually used is only NR_SOFTIRQS,
      i.e. 10.
      Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Nicolai Stange <nstange@suse.de>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
      9aee5f8a
    • x86/KVM/VMX: Move the l1tf_flush_l1d test to vmx_l1d_flush() · 5b6ccc6c
      Nicolai Stange authored
      Currently, vmx_vcpu_run() checks if l1tf_flush_l1d is set and invokes
      vmx_l1d_flush() if so.
      
      This test is unnecessary for the "always flush L1D" mode.
      
      Move the check to vmx_l1d_flush()'s conditional mode code path.
      
      Notes:
      - vmx_l1d_flush() is likely to get inlined anyway and thus, there's no
        extra function call.
        
      - This inverts the (static) branch prediction, but there hadn't been any
        explicit likely()/unlikely() annotations before and so it stays as is.
      Signed-off-by: Nicolai Stange <nstange@suse.de>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      5b6ccc6c
    • x86/KVM/VMX: Replace 'vmx_l1d_flush_always' with 'vmx_l1d_flush_cond' · 427362a1
      Nicolai Stange authored
      The vmx_l1d_flush_always static key is only ever evaluated if
      vmx_l1d_should_flush is enabled. In that case however, there are only two
      L1d flushing modes possible: "always" and "conditional".
      
      The "conditional" mode's implementation tends to require more sophisticated
      logic than the "always" mode.
      
      Avoid inverted logic by replacing the 'vmx_l1d_flush_always' static key
      with a 'vmx_l1d_flush_cond' one.
      
      There is no change in functionality.
      Signed-off-by: Nicolai Stange <nstange@suse.de>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      427362a1
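      Together with the test move in the previous entry, vmx_l1d_flush()
      ends up with a structure along these lines (a condensed sketch, not
      the literal upstream function):

        static void vmx_l1d_flush(struct kvm_vcpu *vcpu)
        {
                /* Only the "conditional" mode has to reason about state. */
                if (static_branch_likely(&vmx_l1d_flush_cond)) {
                        bool flush = vcpu->arch.l1tf_flush_l1d;

                        /* Re-armed by vcpu_run() or an unsafe exit handler. */
                        vcpu->arch.l1tf_flush_l1d = false;
                        if (!flush)
                                return;
                }
                /*
                 * "always" mode, or conditional mode with work to do:
                 * evict L1D via MSR_IA32_FLUSH_CMD or the software loop.
                 */
        }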
    • x86/KVM/VMX: Don't set l1tf_flush_l1d to true from vmx_l1d_flush() · 379fd0c7
      Nicolai Stange authored
      vmx_l1d_flush() gets invoked only if l1tf_flush_l1d is true. There's no
      point in setting l1tf_flush_l1d to true from there again.
      Signed-off-by: Nicolai Stange <nstange@suse.de>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      379fd0c7
  4. 28 Jul 2018 (1 commit)
    • Revert "MIPS: BCM47XX: Enable 74K Core ExternalSync for PCIe erratum" · d5ea019f
      Rafał Miłecki authored
      This reverts commit 2a027b47 ("MIPS: BCM47XX: Enable 74K Core
      ExternalSync for PCIe erratum").
      
      Enabling ExternalSync caused a regression for BCM4718A1 (used e.g. in
      Netgear E3000 and ASUS RT-N16): it simply hangs during PCIe
      initialization. It's likely that BCM4717A1 is also affected.
      
      I didn't notice that earlier as the only BCM47XX devices with PCIe I
      own are:
      1) BCM4706 with 2 x 14e4:4331
      2) BCM4706 with 14e4:4360 and 14e4:4331
      It appears that BCM4706 is unaffected.
      
      While BCM5300X-ES300-RDS.pdf seems to document that erratum and its
      workarounds (according to quotes provided by Tokunori) it seems not even
      Broadcom follows them.
      
      According to the provided info, Broadcom should define CONF7_ES in their
      SDK's mipsinc.h and implement the workaround in si_mips_init(). Checking
      both didn't reveal such code. It *could* mean Broadcom also had some
      problems with the given workaround.
      Signed-off-by: Rafał Miłecki <rafal@milecki.pl>
      Signed-off-by: Paul Burton <paul.burton@mips.com>
      Reported-by: Michael Marley <michael@michaelmarley.com>
      Patchwork: https://patchwork.linux-mips.org/patch/20032/
      URL: https://bugs.openwrt.org/index.php?do=details&task_id=1688
      Cc: Tokunori Ikegami <ikegami@allied-telesis.co.jp>
      Cc: Hauke Mehrtens <hauke@hauke-m.de>
      Cc: Chris Packham <chris.packham@alliedtelesis.co.nz>
      Cc: James Hogan <jhogan@kernel.org>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: linux-mips@linux-mips.org
      d5ea019f
  5. 27 Jul 2018 (2 commits)
  6. 25 Jul 2018 (2 commits)
    • arm64: fix vmemmap BUILD_BUG_ON() triggering on !vmemmap setups · 7b0eb6b4
      Johannes Weiner authored
      Arnd reports the following arm64 randconfig build error with the PSI
      patches that add another page flag:
      
        /git/arm-soc/arch/arm64/mm/init.c: In function 'mem_init':
        /git/arm-soc/include/linux/compiler.h:357:38: error: call to
        '__compiletime_assert_618' declared with attribute error: BUILD_BUG_ON
        failed: sizeof(struct page) > (1 << STRUCT_PAGE_MAX_SHIFT)
      
      The additional page flag causes other information stored in
      page->flags to get bumped into their own struct page member:
      
        #if SECTIONS_WIDTH+ZONES_WIDTH+NODES_SHIFT+LAST_CPUPID_SHIFT <=
        BITS_PER_LONG - NR_PAGEFLAGS
        #define LAST_CPUPID_WIDTH LAST_CPUPID_SHIFT
        #else
        #define LAST_CPUPID_WIDTH 0
        #endif
      
        #if defined(CONFIG_NUMA_BALANCING) && LAST_CPUPID_WIDTH == 0
        #define LAST_CPUPID_NOT_IN_PAGE_FLAGS
        #endif
      
      which in turn causes the struct page size to exceed the size set in
      STRUCT_PAGE_MAX_SHIFT. This value is an estimate used to size the
      VMEMMAP page array according to address space and struct page size.
      
      However, the check is performed - and triggers here - on a !VMEMMAP
      config, which consumes an additional 22 page bits for the sparse
      section id. When VMEMMAP is enabled, those bits are returned, cpupid
      doesn't need its own member, and the page passes the VMEMMAP check.
      
      Restrict that check to the situation it was meant to check: that we
      are sizing the VMEMMAP page array correctly.
      
      Says Arnd:
      
          Further experiments show that the build error already existed before,
          but was only triggered with larger values of CONFIG_NR_CPU and/or
          CONFIG_NODES_SHIFT that might be used in actual configurations but
          not in randconfig builds.
      
          With longer CPU and node masks, I could recreate the problem with
          kernels as old as linux-4.7 when arm64 NUMA support got added.
      Reported-by: Arnd Bergmann <arnd@arndb.de>
      Tested-by: Arnd Bergmann <arnd@arndb.de>
      Cc: stable@vger.kernel.org
      Fixes: 1a2db300 ("arm64, numa: Add NUMA support for arm64 platforms.")
      Fixes: 3e1907d5 ("arm64: mm: move vmemmap region right below the linear region")
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Will Deacon <will.deacon@arm.com>
      7b0eb6b4
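      The fix amounts to guarding the assertion so it only fires on
      configurations where the estimate is actually used, roughly:

        /* arch/arm64/mm/init.c, mem_init(): simplified sketch */
        #ifdef CONFIG_SPARSEMEM_VMEMMAP
                /*
                 * The upper bound on sizeof(struct page) only matters when
                 * it sizes the vmemmap array; !VMEMMAP configs carry extra
                 * section bits in page->flags and may legitimately exceed it.
                 */
                BUILD_BUG_ON(sizeof(struct page) > (1 << STRUCT_PAGE_MAX_SHIFT));
        #endif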
    • arm64: Check for errata before evaluating cpu features · dc0e3658
      Dirk Mueller authored
      Since commit d3aec8a2 ("arm64: capabilities: Restrict KPTI
      detection to boot-time CPUs") we rely on errata flags being already
      populated during feature enumeration. The order of errata and
      features was flipped as part of commit ed478b3f ("arm64:
      capabilities: Group handling of features and errata workarounds").
      
      Return to the original order of errata and feature evaluation to
      ensure errata flags are present during feature evaluation.
      
      Fixes: ed478b3f ("arm64: capabilities: Group handling of features and errata workarounds")
      CC: Suzuki K Poulose <suzuki.poulose@arm.com>
      CC: Marc Zyngier <marc.zyngier@arm.com>
      Signed-off-by: Dirk Mueller <dmueller@suse.com>
      Reviewed-by: Suzuki K Poulose <suzuki.poulose@arm.com>
      Signed-off-by: Will Deacon <will.deacon@arm.com>
      dc0e3658
  7. 24 Jul 2018 (1 commit)
    • s390: disable gcc plugins · 2fba3573
      Martin Schwidefsky authored
      The s390 build currently fails with the latent entropy plugin:
      
      arch/s390/kernel/als.o: In function `verify_facilities':
      als.c:(.init.text+0x24): undefined reference to `latent_entropy'
      als.c:(.init.text+0xae): undefined reference to `latent_entropy'
      make[3]: *** [arch/s390/boot/compressed/vmlinux] Error 1
      make[2]: *** [arch/s390/boot/compressed/vmlinux] Error 2
      make[1]: *** [bzImage] Error 2
      
      This will be fixed with the early boot rework from Vasily, which
      is planned for the 4.19 merge window.
      
      For 4.18 the simplest solution is to disable the gcc plugins and
      reenable them after the early boot rework is upstream.
      Reported-by: Guenter Roeck <linux@roeck-us.net>
      Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
      2fba3573
  8. 23 Jul 2018 (1 commit)
  9. 22 Jul 2018 (2 commits)
    • mm: make vm_area_alloc() initialize core fields · 490fc053
      Linus Torvalds authored
      Like vm_area_dup(), it initializes the anon_vma_chain head, and the
      basic mm pointer.
      
      The rest of the fields end up being different for different users,
      although the plan is to also initialize the 'vm_ops' field to a dummy
      entry.
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      490fc053
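      After this change the allocator looks roughly like the sketch below;
      the exact set of fields initialized upstream may differ slightly.

        struct vm_area_struct *vm_area_alloc(struct mm_struct *mm)
        {
                struct vm_area_struct *vma;

                vma = kmem_cache_zalloc(vm_area_cachep, GFP_KERNEL);
                if (vma) {
                        vma->vm_mm = mm;                      /* basic mm pointer */
                        INIT_LIST_HEAD(&vma->anon_vma_chain); /* list head */
                }
                return vma;
        }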
    • mm: use helper functions for allocating and freeing vm_area structs · 3928d4f5
      Linus Torvalds authored
      The vm_area_struct is one of the most fundamental memory management
      objects, but the management of it is entirely open-coded everywhere,
      ranging from allocation and freeing (using kmem_cache_[z]alloc and
      kmem_cache_free) to initializing all the fields.
      
      We want to unify this in order to end up having some unified
      initialization of the vmas, and the first step to this is to at least
      have basic allocation functions.
      
      Right now those functions are literally just wrappers around the
      kmem_cache_*() calls.  This is a purely mechanical conversion:
      
          # new vma:
          kmem_cache_zalloc(vm_area_cachep, GFP_KERNEL) -> vm_area_alloc()
      
          # copy old vma
          kmem_cache_alloc(vm_area_cachep, GFP_KERNEL) -> vm_area_dup(old)
      
          # free vma
          kmem_cache_free(vm_area_cachep, vma) -> vm_area_free(vma)
      
      to the point where the old vma passed in to the vm_area_dup() function
      isn't even used yet (because I've left all the old manual initialization
      alone).
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3928d4f5
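      At this stage the helpers are literally thin wrappers around the slab
      calls, something like the sketch below.

        struct vm_area_struct *vm_area_alloc(void)
        {
                return kmem_cache_zalloc(vm_area_cachep, GFP_KERNEL);
        }

        struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
        {
                /* 'orig' is unused for now; callers still copy fields by hand. */
                return kmem_cache_alloc(vm_area_cachep, GFP_KERNEL);
        }

        void vm_area_free(struct vm_area_struct *vma)
        {
                kmem_cache_free(vm_area_cachep, vma);
        }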
  10. 21 Jul 2018 (1 commit)
  11. 20 Jul 2018 (3 commits)
    • ARM: dts: imx6: RDU2: fix irq type for mv88e6xxx switch · e01a06c8
      Uwe Kleine-König authored
      The Marvell switches report their interrupts in a level-sensitive way.
      When using edge-sensitive detection, a race condition in the interrupt
      handler of the switch might cause the OS to miss all future events,
      which can make the switch non-functional.
      
      The problem is that both mv88e6xxx_g2_irq_thread_fn() and
      mv88e6xxx_g1_irq_thread_work() sample the irq cause register
      (MV88E6XXX_G2_INT_SRC and MV88E6XXX_G1_STS respectively) once and then
      handle the observed sources. If a new irq source becomes active after
      sampling but before all observed irq sources have been handled, this is
      not noticed by the handler, which returns unsuspecting; but the interrupt
      line stays active, which prevents the edge detector from kicking in.
      
      All device trees but imx6qdl-zii-rdu2 get this right (most of them by
      not specifying an interrupt parent). So fix imx6qdl-zii-rdu2
      accordingly.
      Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
      Fixes: f64992d1 ("ARM: dts: imx6: RDU2: Add Switch interrupts")
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: Shawn Guo <shawnguo@kernel.org>
      e01a06c8
    • bpf, ppc64: fix unexpected r0=0 exit path inside bpf_xadd · b9c1e60e
      Daniel Borkmann authored
      None of the JITs is allowed to implement exit paths from the BPF
      insn mappings other than BPF_JMP | BPF_EXIT. In the BPF core code
      we have a couple of rewrites in eBPF (e.g. LD_ABS / LD_IND) and
      in eBPF to cBPF translation to retain old existing behavior where
      exceptions may occur; they are also tightly controlled by the
      verifier where it disallows some of the features such as BPF to
      BPF calls when legacy LD_ABS / LD_IND ops are present in the BPF
      program. During recent review of all BPF_XADD JIT implementations
      I noticed that the ppc64 one is buggy in that it contains two
      jumps to exit paths. This is problematic as this can bypass verifier
      expectations e.g. pointed out in commit f6b1b3bf ("bpf: fix
      subprog verifier bypass by div/mod by 0 exception"). The first
      exit path is obsoleted by the fix in ca369602 ("bpf: allow xadd
      only on aligned memory") anyway, and for the second one we need to
      do a fetch, add and store loop if the reservation from lwarx/ldarx
      was lost in the meantime.
      
      Fixes: 156d0e29 ("powerpc/ebpf/jit: Implement JIT compiler for extended BPF")
      Reviewed-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
      Reviewed-by: Sandipan Das <sandipan@linux.vnet.ibm.com>
      Tested-by: Sandipan Das <sandipan@linux.vnet.ibm.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      b9c1e60e
    • ARCv2: [plat-hsdk]: Save accl reg pair by default · af1fc5ba
      Vineet Gupta authored
      This manifested as strace segfaulting on HSDK because gcc was targeting
      the accumulator registers as GPRs, which the kernel was not
      saving/restoring by default.
      
      Cc: stable@vger.kernel.org   #4.14+
      Signed-off-by: Vineet Gupta <vgupta@synopsys.com>
      af1fc5ba
  12. 19 Jul 2018 (1 commit)
    • x86/KVM/VMX: Initialize the vmx_l1d_flush_pages' content · 288d152c
      Nicolai Stange authored
      The slow path in vmx_l1d_flush() reads from vmx_l1d_flush_pages in order
      to evict the L1d cache.
      
      However, these pages are never cleared and, in theory, their data could be
      leaked.
      
      More importantly, KSM could merge a nested hypervisor's vmx_l1d_flush_pages
      to fewer than 1 << L1D_CACHE_ORDER host physical pages and this would break
      the L1d flushing algorithm: L1D on x86_64 is tagged by physical addresses.
      
      Fix this by initializing the individual vmx_l1d_flush_pages with a
      different pattern each.
      
      Rename the "empty_zp" asm constraint identifier in vmx_l1d_flush() to
      "flush_pages" to reflect this change.
      
      Fixes: a47dd5f0 ("x86/KVM/VMX: Add L1D flush algorithm")
      Signed-off-by: Nicolai Stange <nstange@suse.de>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      288d152c
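      The initialization amounts to writing a distinct byte pattern into
      each of the flush pages right after they are allocated, roughly as in
      this sketch of the set-up path:

        struct page *page;
        int i;

        page = alloc_pages(GFP_KERNEL, L1D_CACHE_ORDER);
        if (!page)
                return -ENOMEM;
        vmx_l1d_flush_pages = page_address(page);

        /*
         * A different pattern per page keeps KSM in a nested hypervisor
         * from merging them, which would break the physically tagged L1D
         * eviction loop.
         */
        for (i = 0; i < (1 << L1D_CACHE_ORDER); i++)
                memset(vmx_l1d_flush_pages + i * PAGE_SIZE, i + 1, PAGE_SIZE);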
  13. 18 Jul 2018 (5 commits)
    • powerpc/powernv: Fix save/restore of SPRG3 on entry/exit from stop (idle) · b03897cf
      Gautham R. Shenoy authored
      On 64-bit servers, SPRN_SPRG3 and its userspace read-only mirror
      SPRN_USPRG3 are used as userspace VDSO write and read registers
      respectively.
      
      SPRN_SPRG3 is lost when we enter stop4 and above, and is currently not
      restored.  As a result, any read from SPRN_USPRG3 returns zero on an
      exit from stop4 (Power9 only) and above.
      
      Thus in this situation, on POWER9, any call from sched_getcpu() always
      returns zero, as on powerpc, we call __kernel_getcpu() which relies
      upon SPRN_USPRG3 to report the CPU and NUMA node information.
      
      Fix this by restoring SPRN_SPRG3 on wake up from a deep stop state
      with the sprg_vdso value that is cached in PACA.
      
      Fixes: e1c1cfed ("powerpc/powernv: Save/Restore additional SPRs for stop4 cpuidle")
      Cc: stable@vger.kernel.org # v4.14+
      Reported-by: Florian Weimer <fweimer@redhat.com>
      Signed-off-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
      Reviewed-by: Michael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      b03897cf
    • powerpc/Makefile: Assemble with -me500 when building for E500 · 4e4a4b75
      James Clarke authored
      Some of the assembly files use instructions specific to BookE or E500,
      which are rejected with the now-default -mcpu=powerpc, so we must pass
      -me500 to the assembler just as we pass -me200 for E200.
      
      Fixes: 4bf4f42a ("powerpc/kbuild: Set default generic machine type for 32-bit compile")
      Signed-off-by: James Clarke <jrtc27@jrtc27.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      4e4a4b75
    • kvmclock: fix TSC calibration for nested guests · e10f7805
      Peng Hao authored
      Inside a nested guest, access to hardware can be slow enough that
      tsc_read_refs() always returns ULLONG_MAX, causing tsc_refine_calibration_work
      to be called periodically and the nested guest to spend a lot of time
      reading the ACPI timer.
      
      However, if the TSC frequency is available from the pvclock page, we can
      just set X86_FEATURE_TSC_KNOWN_FREQ and avoid the 'refine' recalibration
      operation entirely.
      Suggested-by: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Peng Hao <peng.hao2@zte.com.cn>
      [Commit message rewritten. - Paolo]
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      e10f7805
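      In essence the kvmclock code now declares the TSC frequency as known
      when it hands it out. A sketch of the idea is below; this_cpu_pvti()
      stands in for the per-cpu pvclock lookup and is purely illustrative.

        static unsigned long kvm_get_tsc_khz(void)
        {
                struct pvclock_vcpu_time_info *src = this_cpu_pvti();

                /*
                 * The frequency comes straight from the pvclock page, so
                 * the refined, ACPI-timer based recalibration is skipped.
                 */
                setup_force_cpu_cap(X86_FEATURE_TSC_KNOWN_FREQ);
                return pvclock_tsc_khz(src);
        }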
    • KVM: VMX: Mark VMXArea with revision_id of physical CPU even when eVMCS enabled · 2307af1c
      Liran Alon authored
      When eVMCS is enabled, all VMCS allocated to be used by KVM are marked
      with revision_id of KVM_EVMCS_VERSION instead of revision_id reported
      by MSR_IA32_VMX_BASIC.
      
      However, even though not explicitly documented by TLFS, the VMXArea passed
      as the VMXON argument should still be marked with the revision_id reported
      by the physical CPU.
      
      This issue was found by the following setup:
      * L0 = KVM which exposes eVMCS to its L1 guest.
      * L1 = KVM which consumes eVMCS reported by L0.
      This setup caused the following to occur:
      1) L1 executes hardware_enable().
      2) hardware_enable() calls kvm_cpu_vmxon() to execute VMXON.
      3) L0 intercepts L1 VMXON and executes handle_vmon(), which notes
      vmxarea->revision_id != VMCS12_REVISION and therefore fails with
      nested_vmx_failInvalid(), which sets RFLAGS.CF.
      4) L1 kvm_cpu_vmxon() doesn't check RFLAGS.CF for failure and therefore
      hardware_enable() continues as usual.
      5) L1 hardware_enable() then calls ept_sync_global() which executes
      INVEPT.
      6) L0 intercepts INVEPT and executes handle_invept() which notes
      !vmx->nested.vmxon and thus raises a #UD to L1.
      7) The raised #UD causes L1 to panic.
      Reviewed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
      Cc: stable@vger.kernel.org
      Fixes: 773e8a04
      Signed-off-by: Liran Alon <liran.alon@oracle.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      2307af1c
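      The fix is essentially: whenever a VMCS page is going to be handed to
      VMXON, stamp it with the revision_id from MSR_IA32_VMX_BASIC rather
      than KVM_EVMCS_VERSION. A simplified sketch of the per-cpu VMXON
      region setup:

        for_each_possible_cpu(cpu) {
                struct vmcs *vmcs = alloc_vmcs_cpu(cpu);

                if (!vmcs)
                        return -ENOMEM;
                /*
                 * With eVMCS enabled, alloc_vmcs_cpu() stamps
                 * KVM_EVMCS_VERSION; the VMXON region must carry the
                 * physical CPU's revision_id instead.
                 */
                if (static_branch_unlikely(&enable_evmcs))
                        vmcs->revision_id = vmcs_config.revision_id;

                per_cpu(vmxarea, cpu) = vmcs;
        }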
    • KVM: PPC: Check if IOMMU page is contained in the pinned physical page · 76fa4975
      Alexey Kardashevskiy authored
      A VM which has:
       - a DMA capable device passed through to it (eg. network card);
       - a malicious kernel that ignores H_PUT_TCE failure;
       - the capability of using IOMMU pages bigger than physical pages
      can create an IOMMU mapping that exposes (for example) 16MB of
      the host physical memory to the device when only 64K was allocated to the VM.
      
      The remaining 16MB - 64K will be some other content of host memory, possibly
      including pages of the VM, but also pages of host kernel memory, host
      programs or other VMs.
      
      The attacking VM does not control the location of the page it can map,
      and is only allowed to map as many pages as it has pages of RAM.
      
      We already have a check in drivers/vfio/vfio_iommu_spapr_tce.c that
      an IOMMU page is contained in the physical page so the PCI hardware won't
      get access to unassigned host memory; however this check is missing in
      the KVM fastpath (H_PUT_TCE accelerated code). We were lucky so far and
      did not hit this yet: the very first time the mapping happens
      we do not have tbl::it_userspace allocated yet and fall back to
      userspace, which in turn calls the VFIO IOMMU driver; this fails and
      the guest does not retry.
      
      This stores the smallest preregistered page size in the preregistered
      region descriptor and changes the mm_iommu_xxx API to check this against
      the IOMMU page size.
      
      This calculates maximum page size as a minimum of the natural region
      alignment and compound page size. For the page shift this uses the shift
      returned by find_linux_pte() which indicates how the page is mapped to
      the current userspace - if the page is huge and this is not a zero, then
      it is a leaf pte and the page is mapped within the range.
      
      Fixes: 121f80ba ("KVM: PPC: VFIO: Add in-kernel acceleration for VFIO")
      Cc: stable@vger.kernel.org # v4.12+
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      76fa4975
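      The rule being enforced can be stated in a few lines: a TCE mapping
      with IOMMU page shift S is only legal if the host page backing that
      address is mapped with a shift of at least S. The helper below is a
      hypothetical, self-contained illustration, not the kernel API; the
      real check compares the shift cached in the preregistered-region
      descriptor with the shift returned by find_linux_pte().

        static bool iommu_page_contained(unsigned int host_page_shift,
                                         unsigned int iommu_page_shift)
        {
                /*
                 * e.g. a 16MB TCE (shift 24) backed by a 64K host page
                 * (shift 16) must be rejected.
                 */
                return host_page_shift >= iommu_page_shift;
        }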
  14. 17 Jul 2018 (3 commits)
  15. 16 Jul 2018 (3 commits)
    • x86/apm: Don't access __preempt_count with zeroed fs · 6f6060a5
      Ville Syrjälä authored
      APM_DO_POP_SEGS does not restore fs/gs which were zeroed by
      APM_DO_ZERO_SEGS. Trying to access __preempt_count with
      zeroed fs doesn't really work.
      
      Move the ibrs call outside the APM_DO_SAVE_SEGS/APM_DO_RESTORE_SEGS
      invocations so that fs is actually restored before calling
      preempt_enable().
      
      Fixes the following sort of oopses:
      [    0.313581] general protection fault: 0000 [#1] PREEMPT SMP
      [    0.313803] Modules linked in:
      [    0.314040] CPU: 0 PID: 268 Comm: kapmd Not tainted 4.16.0-rc1-triton-bisect-00090-gdd84441a #19
      [    0.316161] EIP: __apm_bios_call_simple+0xc8/0x170
      [    0.316161] EFLAGS: 00210016 CPU: 0
      [    0.316161] EAX: 00000102 EBX: 00000000 ECX: 00000102 EDX: 00000000
      [    0.316161] ESI: 0000530e EDI: dea95f64 EBP: dea95f18 ESP: dea95ef0
      [    0.316161]  DS: 007b ES: 007b FS: 0000 GS: 0000 SS: 0068
      [    0.316161] CR0: 80050033 CR2: 00000000 CR3: 015d3000 CR4: 000006d0
      [    0.316161] Call Trace:
      [    0.316161]  ? cpumask_weight.constprop.15+0x20/0x20
      [    0.316161]  on_cpu0+0x44/0x70
      [    0.316161]  apm+0x54e/0x720
      [    0.316161]  ? __switch_to_asm+0x26/0x40
      [    0.316161]  ? __schedule+0x17d/0x590
      [    0.316161]  kthread+0xc0/0xf0
      [    0.316161]  ? proc_apm_show+0x150/0x150
      [    0.316161]  ? kthread_create_worker_on_cpu+0x20/0x20
      [    0.316161]  ret_from_fork+0x2e/0x38
      [    0.316161] Code: da 8e c2 8e e2 8e ea 57 55 2e ff 1d e0 bb 5d b1 0f 92 c3 5d 5f 07 1f 89 47 0c 90 8d b4 26 00 00 00 00 90 8d b4 26 00 00 00 00 90 <64> ff 0d 84 16 5c b1 74 7f 8b 45 dc 8e e0 8b 45 d8 8e e8 8b 45
      [    0.316161] EIP: __apm_bios_call_simple+0xc8/0x170 SS:ESP: 0068:dea95ef0
      [    0.316161] ---[ end trace 656253db2deaa12c ]---
      
      Fixes: dd84441a ("x86/speculation: Use IBRS if available before calling into firmware")
      Signed-off-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: stable@vger.kernel.org
      Cc: David Woodhouse <dwmw@amazon.co.uk>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: x86@kernel.org
      Link: https://lkml.kernel.org/r/20180709133534.5963-1-ville.syrjala@linux.intel.com
      6f6060a5
    • MIPS: Fix off-by-one in pci_resource_to_user() · 38c0a74f
      Paul Burton authored
      The MIPS implementation of pci_resource_to_user() introduced in v3.12 by
      commit 4c2924b7 ("MIPS: PCI: Use pci_resource_to_user to map pci
      memory space properly") incorrectly sets *end to the address of the
      byte after the resource, rather than the last byte of the resource.
      
      This results in userland seeing resources as a byte larger than they
      actually are, for example a 32 byte BAR will be reported by a tool such
      as lspci as being 33 bytes in size:
      
          Region 2: I/O ports at 1000 [disabled] [size=33]
      
      Correct this by subtracting one from the calculated end address,
      reporting the correct address to userland.
      Signed-off-by: Paul Burton <paul.burton@mips.com>
      Reported-by: Rui Wang <rui.wang@windriver.com>
      Fixes: 4c2924b7 ("MIPS: PCI: Use pci_resource_to_user to map pci memory space properly")
      Cc: James Hogan <jhogan@kernel.org>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Wolfgang Grandegger <wg@grandegger.com>
      Cc: linux-mips@linux-mips.org
      Cc: stable@vger.kernel.org # v3.12+
      Patchwork: https://patchwork.linux-mips.org/patch/19829/
      38c0a74f
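      The fix is a classic inclusive-end correction; a simplified sketch of
      the fixed helper in arch/mips/pci/pci.c:

        void pci_resource_to_user(const struct pci_dev *dev, int bar,
                                  const struct resource *rsrc,
                                  resource_size_t *start, resource_size_t *end)
        {
                phys_addr_t size = resource_size(rsrc);

                *start = fixup_bigphys_addr(rsrc->start, size);
                *end   = rsrc->start + size - 1;  /* last byte, not one past it */
        }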
    • x86/asm/memcpy_mcsafe: Fix copy_to_user_mcsafe() exception handling · 092b31aa
      Dan Williams authored
      All copy_to_user() implementations need to be prepared to handle faults
      accessing userspace. The __memcpy_mcsafe() implementation handles both
      mmu-faults on the user destination and machine-check-exceptions on the
      source buffer. However, the memcpy_mcsafe() wrapper may silently
      fallback to memcpy() depending on build options and cpu-capabilities.
      
      Force copy_to_user_mcsafe() to always use __memcpy_mcsafe() when
      available, and otherwise disable all of the copy_to_user_mcsafe()
      infrastructure when __memcpy_mcsafe() is not available, i.e.
      CONFIG_X86_MCE=n.
      
      This fixes crashes of the form:
          run fstests generic/323 at 2018-07-02 12:46:23
          BUG: unable to handle kernel paging request at 00007f0d50001000
          RIP: 0010:__memcpy+0x12/0x20
          [..]
          Call Trace:
           copyout_mcsafe+0x3a/0x50
           _copy_to_iter_mcsafe+0xa1/0x4a0
           ? dax_alive+0x30/0x50
           dax_iomap_actor+0x1f9/0x280
           ? dax_iomap_rw+0x100/0x100
           iomap_apply+0xba/0x130
           ? dax_iomap_rw+0x100/0x100
           dax_iomap_rw+0x95/0x100
           ? dax_iomap_rw+0x100/0x100
           xfs_file_dax_read+0x7b/0x1d0 [xfs]
           xfs_file_read_iter+0xa7/0xc0 [xfs]
           aio_read+0x11c/0x1a0
      Reported-by: Ross Zwisler <ross.zwisler@linux.intel.com>
      Tested-by: Ross Zwisler <ross.zwisler@linux.intel.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tony Luck <tony.luck@intel.com>
      Fixes: 8780356e ("x86/asm/memcpy_mcsafe: Define copy_to_iter_mcsafe()")
      Link: http://lkml.kernel.org/r/153108277790.37979.1486841789275803399.stgit@dwillia2-desk3.amr.corp.intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      092b31aa
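      Conceptually, the wrapper has to call the exception-aware primitive
      directly instead of the fallback-capable memcpy_mcsafe(). A sketch of
      the shape of the fix:

        static __always_inline unsigned long
        copy_to_user_mcsafe(void *to, const void *from, unsigned len)
        {
                unsigned long ret;

                __uaccess_begin();
                /*
                 * __memcpy_mcsafe() handles both #MC on the source and
                 * faults on the user destination; plain memcpy() does not.
                 */
                ret = __memcpy_mcsafe(to, from, len);
                __uaccess_end();
                return ret;
        }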
  16. 15 Jul 2018 (1 commit)