1. 22 5月, 2018 1 次提交
  2. 18 5月, 2018 13 次提交
    • P
      KVM: PPC: Book3S PR: Enable use on POWER9 inside HPT-mode guests · ec531d02
      Paul Mackerras 提交于
      This relaxes the restriction on using PR KVM on POWER9.  The existing
      code does work inside a guest partition running in HPT mode, because
      hypercalls such as H_ENTER use the old HPTE format, not the new
      format used by POWER9, and so no change to PR KVM's HPT manipulation
      code is required.  PR KVM will still refuse to run if the kernel is
      using radix translation or if it is running bare-metal.
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      ec531d02
    • N
      KVM: PPC: Book3S HV: Send kvmppc_bad_interrupt NMIs to Linux handlers · 7c1bd80c
      Nicholas Piggin 提交于
      It's possible to take a SRESET or MCE in these paths due to a bug
      in the host code or a NMI IPI, etc. A recent bug attempting to load
      a virtual address from real mode gave th complete but cryptic error,
      abridged:
      
            Oops: Bad interrupt in KVM entry/exit code, sig: 6 [#1]
            LE SMP NR_CPUS=2048 NUMA PowerNV
            CPU: 53 PID: 6582 Comm: qemu-system-ppc Not tainted
            NIP:  c0000000000155ac LR: c0000000000c2430 CTR: c000000000015580
            REGS: c000000fff76dd80 TRAP: 0200   Not tainted
            MSR:  9000000000201003 <SF,HV,ME,RI,LE>  CR: 48082222  XER: 00000000
            CFAR: 0000000102900ef0 DAR: d00017fffd941a28 DSISR: 00000040 SOFTE: 3
            NIP [c0000000000155ac] perf_trace_tlbie+0x2c/0x1a0
            LR [c0000000000c2430] do_tlbies+0x230/0x2f0
      
      Sending the NMIs through the Linux handlers gives a nicer output:
      
            Severe Machine check interrupt [Not recovered]
              NIP [c0000000000155ac]: perf_trace_tlbie+0x2c/0x1a0
              Initiator: CPU
              Error type: Real address [Load (bad)]
                Effective address: d00017fffcc01a28
            opal: Machine check interrupt unrecoverable: MSR(RI=0)
            opal: Hardware platform error: Unrecoverable Machine Check exception
            CPU: 0 PID: 6700 Comm: qemu-system-ppc Tainted: G   M
            NIP:  c0000000000155ac LR: c0000000000c23c0 CTR: c000000000015580
            REGS: c000000fff9e9d80 TRAP: 0200   Tainted: G   M
            MSR:  9000000000201001 <SF,HV,ME,LE>  CR: 48082222  XER: 00000000
            CFAR: 000000010cbc1a30 DAR: d00017fffcc01a28 DSISR: 00000040 SOFTE: 3
            NIP [c0000000000155ac] perf_trace_tlbie+0x2c/0x1a0
            LR [c0000000000c23c0] do_tlbies+0x1c0/0x280
      Signed-off-by: NNicholas Piggin <npiggin@gmail.com>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      7c1bd80c
    • N
      KVM: PPC: Book3S HV: Fix kvmppc_bad_host_intr for real mode interrupts · eadce3b4
      Nicholas Piggin 提交于
      When CONFIG_RELOCATABLE=n, the Linux real mode interrupt handlers call
      into KVM using real address. This needs to be translated to the kernel
      linear effective address before the MMU is switched on.
      
      kvmppc_bad_host_intr misses adding these bits, so when it is used to
      handle a system reset interrupt (that always gets delivered in real
      mode), it results in an instruction access fault immediately after
      the MMU is turned on.
      
      Fix this by ensuring the top 2 address bits are set when the MMU is
      turned on.
      Signed-off-by: NNicholas Piggin <npiggin@gmail.com>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      eadce3b4
    • N
      KVM: PPC: Book3S HV: radix: Do not clear partition PTE when RC or write bits do not match · 878cf2bb
      Nicholas Piggin 提交于
      Adding the write bit and RC bits to pte permissions does not require a
      pte clear and flush. There should not be other bits changed here,
      because restricting access or changing the PFN must have already
      invalidated any existing ptes (otherwise the race is already lost).
      Signed-off-by: NNicholas Piggin <npiggin@gmail.com>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      878cf2bb
    • N
      KVM: PPC: Book3S HV: radix: Refine IO region partition scope attributes · bc64dd0e
      Nicholas Piggin 提交于
      When the radix fault handler has no page from the process address
      space (e.g., for IO memory), it looks up the process pte and sets
      partition table pte using that to get attributes like CI and guarded.
      If the process table entry is to be writable, set _PAGE_DIRTY as well
      to avoid an RC update. If not, then ensure _PAGE_DIRTY does not come
      across. Set _PAGE_ACCESSED as well to avoid RC update.
      Signed-off-by: NNicholas Piggin <npiggin@gmail.com>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      bc64dd0e
    • N
      KVM: PPC: Book3S HV: Make radix handle process scoped LPID flush in C, with relocation on · 9a4506e1
      Nicholas Piggin 提交于
      The radix guest code can has fewer restrictions about what context it
      can run in, so move this flushing out of assembly and have it use the
      Linux TLB flush implementations introduced previously.
      
      This allows powerpc:tlbie trace events to be used.
      
      This changes the tlbiel sequence to only execute RIC=2 flush once on
      the first set flushed, then RIC=0 for the rest of the sets. The end
      result of the flush should be unchanged. This matches the local PID
      flush pattern that was introduced in a5998fcb ("powerpc/mm/radix:
      Optimise tlbiel flush all case").
      Signed-off-by: NNicholas Piggin <npiggin@gmail.com>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      9a4506e1
    • N
      KVM: PPC: Book3S HV: Make radix use the Linux translation flush functions for partition scope · d91cb39f
      Nicholas Piggin 提交于
      This has the advantage of consolidating TLB flush code in fewer
      places, and it also implements powerpc:tlbie trace events.
      Signed-off-by: NNicholas Piggin <npiggin@gmail.com>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      d91cb39f
    • N
      KVM: PPC: Book3S HV: Recursively unmap all page table entries when unmapping · a5704e83
      Nicholas Piggin 提交于
      When partition scope mappings are unmapped with kvm_unmap_radix, the
      pte is cleared, but the page table structure is left in place. If the
      next page fault requests a different page table geometry (e.g., due to
      THP promotion or split), kvmppc_create_pte is responsible for changing
      the page tables.
      
      When a page table entry is to be converted to a large pte, the page
      table entry is cleared, the PWC flushed, then the page table it points
      to freed. This will cause pte page tables to leak when a 1GB page is
      to replace a pud entry points to a pmd table with pte tables under it:
      The pmd table will be freed, but its pte tables will be missed.
      
      Fix this by replacing the simple clear and free code with one that
      walks down the page tables and frees children. Care must be taken to
      clear the root entry being unmapped then flushing the PWC before
      freeing any page tables, as explained in comments.
      
      This requires PWC flush to logically become a flush-all-PWC (which it
      already is in hardware, but the KVM API needs to be changed to avoid
      confusion).
      
      This code also checks that no unexpected pte entries exist in any page
      table being freed, and unmaps those and emits a WARN. This is an
      expensive operation for the pte page level, but partition scope
      changes are rare, so it's unconditional for now to iron out bugs. It
      can be put under a CONFIG option or removed after some time.
      Signed-off-by: NNicholas Piggin <npiggin@gmail.com>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      a5704e83
    • N
    • N
      KVM: PPC: Book3S HV: Lockless tlbie for HPT hcalls · b7557451
      Nicholas Piggin 提交于
      tlbies to an LPAR do not have to be serialised since POWER4/PPC970,
      after which the MMU_FTR_LOCKLESS_TLBIE feature was introduced to
      avoid tlbie locking.
      
      Since commit c17b98cf ("KVM: PPC: Book3S HV: Remove code for
      PPC970 processors"), KVM no longer supports processors that do not
      have this feature, so the tlbie locking can be removed completely.
      A sanity check for the feature is put in kvmppc_mmu_hv_init.
      
      Testing was done on a POWER9 system in HPT mode, with a -smp 32 guest
      in HPT mode. 32 instances of the powerpc fork benchmark from selftests
      were run with --fork, and the results measured.
      
      Without this patch, total throughput was about 13.5K/sec, and this is
      the top of the host profile:
      
         74.52%  [k] do_tlbies
          2.95%  [k] kvmppc_book3s_hv_page_fault
          1.80%  [k] calc_checksum
          1.80%  [k] kvmppc_vcpu_run_hv
          1.49%  [k] kvmppc_run_core
      
      After this patch, throughput was about 51K/sec, with this profile:
      
         21.28%  [k] do_tlbies
          5.26%  [k] kvmppc_run_core
          4.88%  [k] kvmppc_book3s_hv_page_fault
          3.30%  [k] _raw_spin_lock_irqsave
          3.25%  [k] gup_pgd_range
      Signed-off-by: NNicholas Piggin <npiggin@gmail.com>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      b7557451
    • S
      KVM: PPC: Fix a mmio_host_swabbed uninitialized usage issue · f19d1f36
      Simon Guo 提交于
      When KVM emulates VMX store, it will invoke kvmppc_get_vmx_data() to
      retrieve VMX reg val. kvmppc_get_vmx_data() will check mmio_host_swabbed
      to decide which double word of vr[] to be used. But the
      mmio_host_swabbed can be uninitialized during VMX store procedure:
      
      kvmppc_emulate_loadstore
      	\- kvmppc_handle_store128_by2x64
      		\- kvmppc_get_vmx_data
      
      So vcpu->arch.mmio_host_swabbed is not meant to be used at all for
      emulation of store instructions, and this patch makes that true for
      VMX stores. This patch also initializes mmio_host_swabbed to avoid
      possible future problems.
      Signed-off-by: NSimon Guo <wei.guo.simon@gmail.com>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      f19d1f36
    • S
      KVM: PPC: Move nip/ctr/lr/xer registers to pt_regs in kvm_vcpu_arch · 173c520a
      Simon Guo 提交于
      This patch moves nip/ctr/lr/xer registers from scattered places in
      kvm_vcpu_arch to pt_regs structure.
      
      cr register is "unsigned long" in pt_regs and u32 in vcpu->arch.
      It will need more consideration and may move in later patches.
      Signed-off-by: NSimon Guo <wei.guo.simon@gmail.com>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      173c520a
    • S
      KVM: PPC: Add pt_regs into kvm_vcpu_arch and move vcpu->arch.gpr[] into it · 1143a706
      Simon Guo 提交于
      Current regs are scattered at kvm_vcpu_arch structure and it will
      be more neat to organize them into pt_regs structure.
      
      Also it will enable reimplementation of MMIO emulation code with
      analyse_instr() later.
      Signed-off-by: NSimon Guo <wei.guo.simon@gmail.com>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      1143a706
  3. 17 5月, 2018 14 次提交
  4. 15 5月, 2018 1 次提交
  5. 06 5月, 2018 1 次提交
    • A
      KVM: x86: remove APIC Timer periodic/oneshot spikes · ecf08dad
      Anthoine Bourgeois 提交于
      Since the commit "8003c9ae: add APIC Timer periodic/oneshot mode VMX
      preemption timer support", a Windows 10 guest has some erratic timer
      spikes.
      
      Here the results on a 150000 times 1ms timer without any load:
      	  Before 8003c9ae | After 8003c9ae
      Max           1834us          |  86000us
      Mean          1100us          |   1021us
      Deviation       59us          |    149us
      Here the results on a 150000 times 1ms timer with a cpu-z stress test:
      	  Before 8003c9ae | After 8003c9ae
      Max          32000us          | 140000us
      Mean          1006us          |   1997us
      Deviation      140us          |  11095us
      
      The root cause of the problem is starting hrtimer with an expiry time
      already in the past can take more than 20 milliseconds to trigger the
      timer function.  It can be solved by forward such past timers
      immediately, rather than submitting them to hrtimer_start().
      In case the timer is periodic, update the target expiration and call
      hrtimer_start with it.
      
      v2: Check if the tsc deadline is already expired. Thank you Mika.
      v3: Execute the past timers immediately rather than submitting them to
      hrtimer_start().
      v4: Rearm the periodic timer with advance_periodic_target_expiration() a
      simpler version of set_target_expiration(). Thank you Paolo.
      
      Cc: Mika Penttilä <mika.penttila@nextfour.com>
      Cc: Wanpeng Li <kernellwp@gmail.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: NAnthoine Bourgeois <anthoine.bourgeois@blade-group.com>
      8003c9ae ("KVM: LAPIC: add APIC Timer periodic/oneshot mode VMX preemption timer support")
      Signed-off-by: NRadim Krčmář <rkrcmar@redhat.com>
      ecf08dad
  6. 04 5月, 2018 2 次提交
    • J
      arm64: vgic-v2: Fix proxying of cpuif access · b220244d
      James Morse 提交于
      Proxying the cpuif accesses at EL2 makes use of vcpu_data_guest_to_host
      and co, which check the endianness, which call into vcpu_read_sys_reg...
      which isn't mapped at EL2 (it was inlined before, and got moved OoL
      with the VHE optimizations).
      
      The result is of course a nice panic. Let's add some specialized
      cruft to keep the broken platforms that require this hack alive.
      
      But, this code used vcpu_data_guest_to_host(), which expected us to
      write the value to host memory, instead we have trapped the guest's
      read or write to an mmio-device, and are about to replay it using the
      host's readl()/writel() which also perform swabbing based on the host
      endianness. This goes wrong when both host and guest are big-endian,
      as readl()/writel() will undo the guest's swabbing, causing the
      big-endian value to be written to device-memory.
      
      What needs doing?
      A big-endian guest will have pre-swabbed data before storing, undo this.
      If its necessary for the host, writel() will re-swab it.
      
      For a read a big-endian guest expects to swab the data after the load.
      The hosts's readl() will correct for host endianness, giving us the
      device-memory's value in the register. For a big-endian guest, swab it
      as if we'd only done the load.
      
      For a little-endian guest, nothing needs doing as readl()/writel() leave
      the correct device-memory value in registers.
      
      Tested on Juno with that rarest of things: a big-endian 64K host.
      Based on a patch from Marc Zyngier.
      Reported-by: NSuzuki K Poulose <suzuki.poulose@arm.com>
      Fixes: bf8feb39 ("arm64: KVM: vgic-v2: Add GICV access from HYP")
      Signed-off-by: NJames Morse <james.morse@arm.com>
      Signed-off-by: NMarc Zyngier <marc.zyngier@arm.com>
      b220244d
    • J
      KVM: arm64: Fix order of vcpu_write_sys_reg() arguments · 1975fa56
      James Morse 提交于
      A typo in kvm_vcpu_set_be()'s call:
      | vcpu_write_sys_reg(vcpu, SCTLR_EL1, sctlr)
      causes us to use the 32bit register value as an index into the sys_reg[]
      array, and sail off the end of the linear map when we try to bring up
      big-endian secondaries.
      
      | Unable to handle kernel paging request at virtual address ffff80098b982c00
      | Mem abort info:
      |  ESR = 0x96000045
      |  Exception class = DABT (current EL), IL = 32 bits
      |   SET = 0, FnV = 0
      |   EA = 0, S1PTW = 0
      | Data abort info:
      |   ISV = 0, ISS = 0x00000045
      |   CM = 0, WnR = 1
      | swapper pgtable: 4k pages, 48-bit VAs, pgdp = 000000002ea0571a
      | [ffff80098b982c00] pgd=00000009ffff8803, pud=0000000000000000
      | Internal error: Oops: 96000045 [#1] PREEMPT SMP
      | Modules linked in:
      | CPU: 2 PID: 1561 Comm: kvm-vcpu-0 Not tainted 4.17.0-rc3-00001-ga912e2261ca6-dirty #1323
      | Hardware name: ARM Juno development board (r1) (DT)
      | pstate: 60000005 (nZCv daif -PAN -UAO)
      | pc : vcpu_write_sys_reg+0x50/0x134
      | lr : vcpu_write_sys_reg+0x50/0x134
      
      | Process kvm-vcpu-0 (pid: 1561, stack limit = 0x000000006df4728b)
      | Call trace:
      |  vcpu_write_sys_reg+0x50/0x134
      |  kvm_psci_vcpu_on+0x14c/0x150
      |  kvm_psci_0_2_call+0x244/0x2a4
      |  kvm_hvc_call_handler+0x1cc/0x258
      |  handle_hvc+0x20/0x3c
      |  handle_exit+0x130/0x1ec
      |  kvm_arch_vcpu_ioctl_run+0x340/0x614
      |  kvm_vcpu_ioctl+0x4d0/0x840
      |  do_vfs_ioctl+0xc8/0x8d0
      |  ksys_ioctl+0x78/0xa8
      |  sys_ioctl+0xc/0x18
      |  el0_svc_naked+0x30/0x34
      | Code: 73620291 604d00b0 00201891 1ab10194 (957a33f8)
      |---[ end trace 4b4a4f9628596602 ]---
      
      Fix the order of the arguments.
      
      Fixes: 8d404c4c ("KVM: arm64: Rewrite system register accessors to read/write functions")
      CC: Christoffer Dall <cdall@cs.columbia.edu>
      Signed-off-by: NJames Morse <james.morse@arm.com>
      Signed-off-by: NMarc Zyngier <marc.zyngier@arm.com>
      1975fa56
  7. 03 5月, 2018 4 次提交
    • H
      parisc: Fix section mismatches · 8d73b180
      Helge Deller 提交于
      Fix three section mismatches:
      1) Section mismatch in reference from the function ioread8() to the
         function .init.text:pcibios_init_bridge()
      2) Section mismatch in reference from the function free_initmem() to the
         function .init.text:map_pages()
      3) Section mismatch in reference from the function ccio_ioc_init() to
         the function .init.text:count_parisc_driver()
      Signed-off-by: NHelge Deller <deller@gmx.de>
      8d73b180
    • H
      parisc: drivers.c: Fix section mismatches · b819439f
      Helge Deller 提交于
      Fix two section mismatches in drivers.c:
      1) Section mismatch in reference from the function alloc_tree_node() to
         the function .init.text:create_tree_node().
      2) Section mismatch in reference from the function walk_native_bus() to
         the function .init.text:alloc_pa_dev().
      Signed-off-by: NHelge Deller <deller@gmx.de>
      b819439f
    • D
      bpf, x64: fix memleak when not converging on calls · 39f56ca9
      Daniel Borkmann 提交于
      The JIT logic in jit_subprogs() is as follows: for all subprogs we
      allocate a bpf_prog_alloc(), populate it (prog->is_func = 1 here),
      and pass it to bpf_int_jit_compile(). If a failure occurred during
      JIT and prog->jited is not set, then we bail out from attempting to
      JIT the whole program, and punt to the interpreter instead. In case
      JITing went successful, we fixup BPF call offsets and do another
      pass to bpf_int_jit_compile() (extra_pass is true at that point) to
      complete JITing calls. Given that requires to pass JIT context around
      addrs and jit_data from x86 JIT are freed in the extra_pass in
      bpf_int_jit_compile() when calls are involved (if not, they can
      be freed immediately). However, if in the original pass, the JIT
      image didn't converge then we leak addrs and jit_data since image
      itself is NULL, the prog->is_func is set and extra_pass is false
      in that case, meaning both will become unreachable and are never
      cleaned up, therefore we need to free as well on !image. Only x64
      JIT is affected.
      
      Fixes: 1c2a088a ("bpf: x64: add JIT support for multi-function programs")
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Acked-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      39f56ca9
    • D
      bpf, x64: fix memleak when not converging after image · 3aab8884
      Daniel Borkmann 提交于
      While reviewing x64 JIT code, I noticed that we leak the prior allocated
      JIT image in the case where proglen != oldproglen during the JIT passes.
      Prior to the commit e0ee9c12 ("x86: bpf_jit: fix two bugs in eBPF JIT
      compiler") we would just break out of the loop, and using the image as the
      JITed prog since it could only shrink in size anyway. After e0ee9c12,
      we would bail out to out_addrs label where we free addrs and jit_data but
      not the image coming from bpf_jit_binary_alloc().
      
      Fixes: e0ee9c12 ("x86: bpf_jit: fix two bugs in eBPF JIT compiler")
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Acked-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      3aab8884
  8. 02 5月, 2018 4 次提交