1. 24 Feb 2017, 6 commits
  2. 23 Feb 2017, 9 commits
  3. 22 Feb 2017, 2 commits
  4. 21 Feb 2017, 10 commits
    • x86/kvm: Provide optimized version of vcpu_is_preempted() for x86-64 · dd0fd8bc
      Authored by Waiman Long
      It was found that, when running a fio sequential write test with an XFS
      ramdisk on a KVM guest running on a 2-socket x86-64 system, the %CPU
      times as reported by perf were as follows:
      
       69.75%  0.59%  fio  [k] down_write
       69.15%  0.01%  fio  [k] call_rwsem_down_write_failed
       67.12%  1.12%  fio  [k] rwsem_down_write_failed
       63.48% 52.77%  fio  [k] osq_lock
        9.46%  7.88%  fio  [k] __raw_callee_save___kvm_vcpu_is_preempt
        3.93%  3.93%  fio  [k] __kvm_vcpu_is_preempted
      
      Making vcpu_is_preempted() a callee-save function has a relatively
      high cost on x86-64, primarily because of at least one more cacheline of
      data access from saving and restoring registers (8 of them) to and from
      the stack, as well as one more level of function call.
      
      To reduce this overhead, an optimized assembly version of the
      __raw_callee_save___kvm_vcpu_is_preempt() function is provided for
      x86-64.
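
      A sketch of what such a hand-coded helper can look like is shown below
      (illustrative only; the KVM_STEAL_TIME_preempted asm-offsets constant and
      the steal_time per-cpu symbol are assumptions, not taken verbatim from
      the patch):

        #ifdef CONFIG_X86_64
        /*
         * Read steal_time.preempted for the target vCPU directly, without the
         * generic callee-save thunk pushing and popping eight registers.
         */
        asm(
        ".pushsection .text;"
        ".global __raw_callee_save___kvm_vcpu_is_preempt;"
        ".type __raw_callee_save___kvm_vcpu_is_preempt, @function;"
        "__raw_callee_save___kvm_vcpu_is_preempt:"
        "movq  __per_cpu_offset(,%rdi,8), %rax;"
        "cmpb  $0, " __stringify(KVM_STEAL_TIME_preempted) "+steal_time(%rax);"
        "setne %al;"
        "ret;"
        ".popsection");
        #endif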
      
      With this patch applied on a KVM guest on a 2-socket 16-core 32-thread
      system with 16 parallel jobs (8 on each socket), the aggregate
      bandwidth of the fio test on an XFS ramdisk was as follows:
      
         I/O Type      w/o patch    with patch
         --------      ---------    ----------
         random read   8141.2 MB/s  8497.1 MB/s
         seq read      8229.4 MB/s  8304.2 MB/s
         random write  1675.5 MB/s  1701.5 MB/s
         seq write     1681.3 MB/s  1699.9 MB/s
      
      The patch yields small increases in aggregate bandwidth across all
      I/O types.
      
      The perf data now became:
      
       70.78%  0.58%  fio  [k] down_write
       70.20%  0.01%  fio  [k] call_rwsem_down_write_failed
       69.70%  1.17%  fio  [k] rwsem_down_write_failed
       59.91% 55.42%  fio  [k] osq_lock
       10.14% 10.14%  fio  [k] __kvm_vcpu_is_preempted
      
      The assembly code was verified with a test kernel module that compares
      the output of the C __kvm_vcpu_is_preempted() with that of the assembly
      __raw_callee_save___kvm_vcpu_is_preempt() and confirms that they match.
      Suggested-by: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Waiman Long <longman@redhat.com>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • x86/paravirt: Change vcpu_is_preempted() arg type to long · 6c62985d
      Authored by Waiman Long
      The cpu argument in the function prototype of vcpu_is_preempted()
      is changed from int to long. That makes it easier to provide a
      better-optimized assembly version of that function.
      
      For Xen, vcpu_is_preempted(long) calls xen_vcpu_stolen(int); the
      downcast from long to int is not a problem, as the vCPU number won't
      exceed 32 bits.
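
      For reference, a minimal sketch of KVM's steal-time based implementation
      with the new prototype (an illustration, not the full diff):

        __visible bool __kvm_vcpu_is_preempted(long cpu)
        {
                struct kvm_steal_time *src = &per_cpu(steal_time, cpu);

                /* With a 'long' argument the vCPU number arrives in a full
                 * 64-bit register, so an assembly version can index the
                 * __per_cpu_offset table directly without sign extension. */
                return !!src->preempted;
        }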
      Signed-off-by: Waiman Long <longman@redhat.com>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: VMX: use correct vmcs_read/write for guest segment selector/base · 96794e4e
      Authored by Chao Peng
      The guest segment selector is a 16-bit field and the guest segment base
      is a natural-width field. Fix two incorrect invocations accordingly.
      
      Without this patch, the build fails when aggressive inlining is used
      with ICC.
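
      Illustrative shape of the fix (the sf->selector / sf->base field handles
      are assumptions taken from the VMX segment-field table, not the exact
      diff):

        /* selector is a 16-bit VMCS field, base is a natural-width field */
        vmcs_write16(sf->selector, var->selector);   /* was vmcs_write32() */
        vmcs_writel(sf->base, var->base);            /* was vmcs_write32() */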
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • x86/kvm/vmx: Defer TR reload after VM exit · b7ffc44d
      Authored by Andy Lutomirski
      Intel's VMX is daft and resets the hidden TSS limit register to 0x67
      on VMX reload, and the 0x67 is not configurable.  KVM currently
      reloads TR using the LTR instruction on every exit, but this is quite
      slow because LTR is serializing.
      
      The 0x67 limit is entirely harmless unless ioperm() is in use, so
      defer the reload until a task using ioperm() is actually running.
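
      A sketch of the deferral described above (the helper and per-CPU flag
      names are assumptions based on this description, not necessarily the
      exact patch):

        static inline void invalidate_tss_limit(void)
        {
                /* A task using ioperm() needs the real limit immediately;
                 * otherwise just note that TR must be refreshed later. */
                if (unlikely(test_thread_flag(TIF_IO_BITMAP)))
                        force_reload_TR();
                else
                        this_cpu_write(need_tr_refresh, true);
        }

        static inline void refresh_tss_limit(void)
        {
                /* Called before a task that uses ioperm() starts running. */
                if (unlikely(this_cpu_read(need_tr_refresh))) {
                        force_reload_TR();
                        this_cpu_write(need_tr_refresh, false);
                }
        }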
      
      Here's some poorly done benchmarking using kvm-unit-tests:
      
      Before:
      
      cpuid 1313
      vmcall 1195
      mov_from_cr8 11
      mov_to_cr8 17
      inl_from_pmtimer 6770
      inl_from_qemu 6856
      inl_from_kernel 2435
      outl_to_kernel 1402
      
      After:
      
      cpuid 1291
      vmcall 1181
      mov_from_cr8 11
      mov_to_cr8 16
      inl_from_pmtimer 6457
      inl_from_qemu 6209
      inl_from_kernel 2339
      outl_to_kernel 1391
      Signed-off-by: Andy Lutomirski <luto@kernel.org>
      [Force-reload TR in invalidate_tss_limit. - Paolo]
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • x86/asm/64: Drop __cacheline_aligned from struct x86_hw_tss · d3273dea
      Authored by Andy Lutomirski
      Historically, the entire TSS + io bitmap structure was cacheline
      aligned, but commit ca241c75 ("x86: unify tss_struct") changed it
      (presumably inadvertently) so that the fixed-layout hardware part is
      cacheline-aligned and the io bitmap is after the padding.  This wastes
      24 bytes (the hardware part should be 104 bytes, but the alignment pads
      it to 128 bytes), serves no purpose, and causes sizeof(struct
      x86_hw_tss) to have a confusing value.
      
      Drop the pointless alignment.
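
      Illustratively (most fields elided), the change drops the alignment
      attribute from the hardware-defined part only; the enclosing tss_struct
      keeps its own alignment:

        struct x86_hw_tss {
                u32 reserved1;
                u64 sp0;
                /* ... fixed hardware layout, 104 bytes in total ... */
        } __attribute__((packed));   /* was: packed ____cacheline_aligned */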
      Signed-off-by: Andy Lutomirski <luto@kernel.org>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • x86/kvm/vmx: Simplify segment_base() · 8c2e41f7
      Authored by Andy Lutomirski
      Use actual pointer types for pointers (instead of unsigned long) and
      replace hardcoded constants with the appropriate self-documenting
      macros.
      
      The function is still a bit messy, but this seems a lot better than
      before to me.
      
      This is mostly borrowed from a patch by Thomas Garnier.
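
      A condensed sketch of the cleaned-up helper (details are assumptions;
      the point is real pointer types plus the SEGMENT_* macros instead of
      magic masks):

        static unsigned long segment_base(u16 selector)
        {
                struct desc_ptr *gdt = this_cpu_ptr(&host_gdt);
                struct desc_struct *table = (struct desc_struct *)gdt->address;
                struct desc_struct *d;

                if (!(selector & ~SEGMENT_RPL_MASK))
                        return 0;

                if ((selector & SEGMENT_TI_MASK) == SEGMENT_LDT) {
                        u16 ldt_selector = kvm_read_ldt();

                        if (!(ldt_selector & ~SEGMENT_RPL_MASK))
                                return 0;

                        /* The LDT's own base comes from its GDT entry. */
                        table = (struct desc_struct *)segment_base(ldt_selector);
                }
                d = &table[selector >> 3];
                return get_desc_base(d);
        }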
      
      Cc: Thomas Garnier <thgarnie@google.com>
      Cc: Jim Mattson <jmattson@google.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Andy Lutomirski <luto@kernel.org>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • x86/kvm/vmx: Get rid of segment_base() on 64-bit kernels · e28baead
      Authored by Andy Lutomirski
      It was a bit buggy (it didn't list all segment types that needed
      64-bit fixups), but the bug was irrelevant because it wasn't called
      in any interesting context on 64-bit kernels and was only used for
      data segments on 32-bit kernels.
      
      To avoid confusion, make it explicitly 32-bit only.
      
      Cc: Thomas Garnier <thgarnie@google.com>
      Cc: Jim Mattson <jmattson@google.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Andy Lutomirski <luto@kernel.org>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • x86/kvm/vmx: Don't fetch the TSS base from the GDT · e0c23063
      Authored by Andy Lutomirski
      The current CPU's TSS base is a foregone conclusion, so there's no need
      to parse it out of the segment tables.  This should save a couple cycles
      (as STR is surely microcoded and poorly optimized) but, more importantly,
      it's a cleanup and it means that segment_base() will never be called on
      64-bit kernels.
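
      Roughly, instead of STR plus a descriptor-table walk, the host TR base
      can be written directly from the per-CPU TSS (the cpu_tss symbol name is
      an assumption for this kernel version):

        vmcs_writel(HOST_TR_BASE,
                    (unsigned long)this_cpu_ptr(&cpu_tss));   /* SDM 22.2.4 */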
      
      Cc: Thomas Garnier <thgarnie@google.com>
      Cc: Jim Mattson <jmattson@google.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Andy Lutomirski <luto@kernel.org>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • x86/asm: Define the kernel TSS limit in a macro · 4f53ab14
      Authored by Andy Lutomirski
      Rather than open-coding the kernel TSS limit in set_tss_desc(), make
      it a real macro near the TSS layout definition.
      
      This is purely a cleanup.
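
      A sketch of such a macro next to the TSS layout definitions (the
      IO_BITMAP_* helper names are assumptions from processor.h):

        /*
         * The limit covers the hardware TSS, the I/O bitmap and the extra
         * terminating long; the limit field addresses the last valid byte,
         * hence the -1.
         */
        #define __KERNEL_TSS_LIMIT \
                (IO_BITMAP_OFFSET + IO_BITMAP_BYTES + sizeof(unsigned long) - 1)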
      
      Cc: Thomas Garnier <thgarnie@google.com>
      Cc: Jim Mattson <jmattson@google.com>
      Signed-off-by: Andy Lutomirski <luto@kernel.org>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • kvm: fix page struct leak in handle_vmon · 06ce521a
      Authored by Paolo Bonzini
      handle_vmon gets a reference on the VMXON region page
      but does not release it. Release the reference.
      
      Found by syzkaller; based on a patch by Dmitry.
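
      The fix amounts to dropping the reference once the VMXON region has been
      validated, along these lines (helper names assumed from the nested-VMX
      code):

        kunmap(page);
        nested_release_page_clean(page);   /* balances nested_get_page() */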
      Reported-by: Dmitry Vyukov <dvyukov@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  5. 20 Feb 2017, 1 commit
  6. 19 Feb 2017, 1 commit
  7. 18 Feb 2017, 3 commits
    • KVM: PPC: Book3S HV: Disable HPT resizing on POWER9 for now · bcd3bb63
      Authored by Paul Mackerras
      The new HPT resizing code added in commit b5baa687 ("KVM: PPC:
      Book3S HV: KVM-HV HPT resizing implementation", 2016-12-20) doesn't
      have code to handle the new HPTE format which POWER9 uses.  Thus it
      would be best not to advertise it to userspace on POWER9 systems
      until it works properly.
      
      Also, since resize_hpt_rehash_hpte() contains BUG_ON() calls that
      could be hit on POWER9, let's prevent it from being called on POWER9
      for now.
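
      The guard is along these lines (a sketch; CPU_FTR_ARCH_300 is the
      POWER9 / ISA v3.00 feature bit):

        /* Reject HPT resizing until the new HPTE format is handled. */
        if (cpu_has_feature(CPU_FTR_ARCH_300))
                return -ENODEV;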
      Acked-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
    • bpf: make jited programs visible in traces · 74451e66
      Authored by Daniel Borkmann
      A long-standing issue with JITed programs is that stack traces from
      function tracing check whether a given address is kernel code
      through {__,}kernel_text_address(), which checks for code in the core
      kernel, modules and dynamically allocated ftrace trampolines. What
      is still missing is BPF JITed programs (interpreted programs are
      not an issue as __bpf_prog_run() will be attributed to them), so
      when a stack trace is triggered, the code walking the stack won't
      see any of the JITed ones. The same applies to address correlation
      done from user space by reading /proc/kallsyms. This is read by
      tools like perf, but the latter is also useful for permanent live
      tracing with eBPF itself in combination with stack maps when other
      eBPF types are part of the callchain. See the offwaketime example
      on dumping a stack from a map.
      
      This work tackles that issue by making the addresses and symbols
      known to the kernel. The lookup from *kernel_text_address() is
      implemented through a latched RB tree that can be read under RCU on
      the fast path and is also shared for symbol/size/offset lookup for a
      specific given address in kallsyms. The slow-path iteration through
      all symbols in the seq file is done via an RCU list, which holds only
      a tiny fraction of all exported ksyms, usually below 0.1 percent.
      Function symbols are exported as bpf_prog_<tag>, in order to aid
      debugging and attribution. This facility is currently enabled for
      root only when bpf_jit_kallsyms is set to 1, and disabled if hardening
      is active in any mode. The rationale behind this is that a lot of
      systems still ship with world read permissions on kallsyms, so
      addresses should not suddenly become exposed on them. If that
      situation gets much better in future, we always have the option to
      change the default. Likewise, unprivileged programs are not allowed
      to add entries there either, but that is less of a concern as most
      such program types relevant in this context are root-only anyway.
      If enabled, call graphs and stack traces will then show a correct
      attribution; one example is illustrated below, where the trace is
      now visible in tooling such as perf script --kallsyms=/proc/kallsyms
      and friends.
      
      Before:
      
        7fff8166889d bpf_clone_redirect+0x80007f0020ed (/lib/modules/4.9.0-rc8+/build/vmlinux)
               f5d80 __sendmsg_nocancel+0xffff006451f1a007 (/usr/lib64/libc-2.18.so)
      
      After:
      
        7fff816688b7 bpf_clone_redirect+0x80007f002107 (/lib/modules/4.9.0-rc8+/build/vmlinux)
        7fffa0575728 bpf_prog_33c45a467c9e061a+0x8000600020fb (/lib/modules/4.9.0-rc8+/build/vmlinux)
        7fffa07ef1fc cls_bpf_classify+0x8000600020dc (/lib/modules/4.9.0-rc8+/build/vmlinux)
        7fff81678b68 tc_classify+0x80007f002078 (/lib/modules/4.9.0-rc8+/build/vmlinux)
        7fff8164d40b __netif_receive_skb_core+0x80007f0025fb (/lib/modules/4.9.0-rc8+/build/vmlinux)
        7fff8164d718 __netif_receive_skb+0x80007f002018 (/lib/modules/4.9.0-rc8+/build/vmlinux)
        7fff8164e565 process_backlog+0x80007f002095 (/lib/modules/4.9.0-rc8+/build/vmlinux)
        7fff8164dc71 net_rx_action+0x80007f002231 (/lib/modules/4.9.0-rc8+/build/vmlinux)
        7fff81767461 __softirqentry_text_start+0x80007f0020d1 (/lib/modules/4.9.0-rc8+/build/vmlinux)
        7fff817658ac do_softirq_own_stack+0x80007f00201c (/lib/modules/4.9.0-rc8+/build/vmlinux)
        7fff810a2c20 do_softirq+0x80007f002050 (/lib/modules/4.9.0-rc8+/build/vmlinux)
        7fff810a2cb5 __local_bh_enable_ip+0x80007f002085 (/lib/modules/4.9.0-rc8+/build/vmlinux)
        7fff8168d452 ip_finish_output2+0x80007f002152 (/lib/modules/4.9.0-rc8+/build/vmlinux)
        7fff8168ea3d ip_finish_output+0x80007f00217d (/lib/modules/4.9.0-rc8+/build/vmlinux)
        7fff8168f2af ip_output+0x80007f00203f (/lib/modules/4.9.0-rc8+/build/vmlinux)
        [...]
        7fff81005854 do_syscall_64+0x80007f002054 (/lib/modules/4.9.0-rc8+/build/vmlinux)
        7fff817649eb return_from_SYSCALL_64+0x80007f002000 (/lib/modules/4.9.0-rc8+/build/vmlinux)
               f5d80 __sendmsg_nocancel+0xffff01c484812007 (/usr/lib64/libc-2.18.so)
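
      For instance, the fast-path text-address check can then be a plain
      RCU-protected lookup in that latched tree (a sketch based on the
      description above; the lookup helper name is an assumption):

        bool is_bpf_text_address(unsigned long addr)
        {
                bool ret;

                /* Latched RB tree of JITed program symbols, walked under RCU. */
                rcu_read_lock();
                ret = bpf_prog_kallsyms_find(addr) != NULL;
                rcu_read_unlock();

                return ret;
        }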
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf: remove stubs for cBPF from arch code · 9383191d
      Authored by Daniel Borkmann
      Remove the dummy bpf_jit_compile() stubs for eBPF JITs and make
      that a single __weak function in the core that can be overridden
      similarly to the eBPF one. Also remove stale pr_err() mentions
      of bpf_jit_compile.
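
      The resulting core stub can be as simple as the following sketch;
      architectures with a classic BPF JIT override it:

        /* Weak stub: only cBPF JITs provide a real implementation. */
        void __weak bpf_jit_compile(struct bpf_prog *prog)
        {
        }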
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  8. 17 Feb 2017, 8 commits