1. 01 March 2017 (1 commit)
    • KVM: x86: never specify a sample period for virtualized in_tx_cp counters · bba82fd7
      Authored by Robert O'Callahan
      pmc_reprogram_counter() always sets a sample period based on the value of
      pmc->counter. However, hsw_hw_config() rejects sample periods less than
      2^31 - 1. So for example, if a KVM guest does
      
          struct perf_event_attr attr;
          memset(&attr, 0, sizeof(attr));
          attr.type = PERF_TYPE_RAW;
          attr.size = sizeof(attr);
          attr.config = 0x2005101c4; // conditional branches retired IN_TXCP
          attr.sample_period = 0;
          int fd = syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
          ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
          ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
      
      the guest kernel counts some conditional branch events, then updates the
      virtual PMU register with a nonzero count. The host reaches
      pmc_reprogram_counter() with nonzero pmc->counter, triggers EOPNOTSUPP
      in hsw_hw_config(), prints "kvm_pmu: event creation failed" in
      pmc_reprogram_counter(), and silently (from the guest's point of view) stops
      counting events.
      
      We fix event counting by forcing attr.sample_period to always be zero for
      in_tx_cp counters. Sampling doesn't work, but it already didn't work and
      can't be fixed without major changes to the approach in hsw_hw_config().
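
      As a hedged sketch, the fix described above amounts to the following
      inside pmc_reprogram_counter() (the in_tx_cp flag and the
      HSW_IN_TX_CHECKPOINTED bit are the checkpointed-event pieces involved;
      the counter-derived period in the else branch is shown for contrast and
      may differ in detail from the actual code):

          /* Checkpointed (IN_TXCP) events cannot be sampled, so never ask
           * perf for a sample period; plain counting still works. */
          if (in_tx_cp) {
                  attr.config |= HSW_IN_TX_CHECKPOINTED;
                  attr.sample_period = 0;
          } else {
                  attr.sample_period = (-pmc->counter) & pmc_bitmask(pmc);
          }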
      Signed-off-by: Robert O'Callahan <robert@ocallahan.org>
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
      bba82fd7
  2. 22 February 2017 (2 commits)
  3. 21 February 2017 (10 commits)
    • x86/kvm: Provide optimized version of vcpu_is_preempted() for x86-64 · dd0fd8bc
      Authored by Waiman Long
      It was found that, when running a fio sequential write test with an XFS
      ramdisk on a KVM guest running on a 2-socket x86-64 system, the %CPU
      times as reported by perf were as follows:
      
       69.75%  0.59%  fio  [k] down_write
       69.15%  0.01%  fio  [k] call_rwsem_down_write_failed
       67.12%  1.12%  fio  [k] rwsem_down_write_failed
       63.48% 52.77%  fio  [k] osq_lock
        9.46%  7.88%  fio  [k] __raw_callee_save___kvm_vcpu_is_preempt
        3.93%  3.93%  fio  [k] __kvm_vcpu_is_preempted
      
      Making vcpu_is_preempted() a callee-save function has a relatively
      high cost on x86-64 primarily due to at least one more cacheline of
      data access from the saving and restoring of registers (8 of them)
      to and from the stack, as well as one more level of function call.
      
      To reduce this performance overhead, an optimized assembly version
      of the __raw_callee_save___kvm_vcpu_is_preempt() function is
      provided for x86-64.
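
      A hedged sketch of the kind of hand-written wrapper this refers to: the
      cpu argument arrives in %rdi (a long, see the companion patch below),
      the per-cpu steal_time.preempted byte is tested directly, and no
      callee-saved registers are spilled. The KVM_STEAL_TIME_preempted offset
      is assumed to come from asm-offsets; details may differ from the actual
      patch.

          asm(
          ".pushsection .text;"
          ".global __raw_callee_save___kvm_vcpu_is_preempt;"
          ".type __raw_callee_save___kvm_vcpu_is_preempt, @function;"
          "__raw_callee_save___kvm_vcpu_is_preempt:"
          "movq  __per_cpu_offset(,%rdi,8), %rax;"   /* cpu -> per-cpu base */
          "cmpb  $0, " __stringify(KVM_STEAL_TIME_preempted) "+steal_time(%rax);"
          "setne %al;"                               /* return bool in %al  */
          "ret;"
          ".popsection");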
      
      With this patch applied on a KVM guest on a 2-socket 16-core 32-thread
      system with 16 parallel jobs (8 on each socket), the aggregate
      bandwidths of the fio test on an XFS ramdisk were as follows:
      
         I/O Type      w/o patch    with patch
         --------      ---------    ----------
         random read   8141.2 MB/s  8497.1 MB/s
         seq read      8229.4 MB/s  8304.2 MB/s
         random write  1675.5 MB/s  1701.5 MB/s
         seq write     1681.3 MB/s  1699.9 MB/s
      
      There are some increases in the aggregated bandwidth because of
      the patch.
      
      The perf data now became:
      
       70.78%  0.58%  fio  [k] down_write
       70.20%  0.01%  fio  [k] call_rwsem_down_write_failed
       69.70%  1.17%  fio  [k] rwsem_down_write_failed
       59.91% 55.42%  fio  [k] osq_lock
       10.14% 10.14%  fio  [k] __kvm_vcpu_is_preempted
      
      The assembly code was verified with a test kernel module that compares
      the output of the C __kvm_vcpu_is_preempted() with that of the assembly
      __raw_callee_save___kvm_vcpu_is_preempt() and confirms that they match.
      Suggested-by: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Waiman Long <longman@redhat.com>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      dd0fd8bc
    • x86/paravirt: Change vcp_is_preempted() arg type to long · 6c62985d
      Authored by Waiman Long
      The cpu argument in the function prototype of vcpu_is_preempted()
      is changed from int to long. That makes it easier to provide a better
      optimized assembly version of that function.
      
      For Xen, vcpu_is_preempted(long) calls xen_vcpu_stolen(int); the
      downcast from long to int is not a problem as the vCPU number won't
      exceed 32 bits.
      Signed-off-by: Waiman Long <longman@redhat.com>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      6c62985d
    • KVM: VMX: use correct vmcs_read/write for guest segment selector/base · 96794e4e
      Authored by Chao Peng
      The guest segment selector is a 16-bit field and the guest segment base
      is a natural-width field. Fix two incorrect invocations accordingly.
      
      Without this patch, build fails when aggressive inlining is used with ICC.
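
      For reference, a brief sketch of the width-specific VMCS accessors this
      refers to (GUEST_CS_* is used purely as an example field; the log does
      not show which call sites the patch actually touches):

          u16 sel            = vmcs_read16(GUEST_CS_SELECTOR);  /* 16-bit field        */
          unsigned long base = vmcs_readl(GUEST_CS_BASE);       /* natural-width field */

          vmcs_write16(GUEST_CS_SELECTOR, sel);
          vmcs_writel(GUEST_CS_BASE, base);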
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      96794e4e
    • x86/kvm/vmx: Defer TR reload after VM exit · b7ffc44d
      Authored by Andy Lutomirski
      Intel's VMX is daft and resets the hidden TSS limit register to 0x67
      on VMX reload, and the 0x67 is not configurable.  KVM currently
      reloads TR using the LTR instruction on every exit, but this is quite
      slow because LTR is serializing.
      
      The 0x67 limit is entirely harmless unless ioperm() is in use, so
      defer the reload until a task using ioperm() is actually running.
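
      A hedged sketch of the deferral idea (the per-cpu flag and the ioperm()
      check are illustrative; per the note below, invalidate_tss_limit()
      force-reloads TR when the io bitmap is already in use):

          static DEFINE_PER_CPU(bool, need_tr_refresh);

          /* Called when the hidden TSS limit may have been clobbered
           * (e.g. after a VM exit).  Avoid the serializing LTR unless the
           * current task actually depends on the io bitmap. */
          static void invalidate_tss_limit(void)
          {
                  if (current_task_uses_ioperm())        /* illustrative check */
                          force_reload_TR();
                  else
                          this_cpu_write(need_tr_refresh, true);
          }

          /* Called on the ioperm()/iopl() path before the io bitmap matters. */
          static void refresh_tss_limit(void)
          {
                  if (this_cpu_read(need_tr_refresh)) {
                          force_reload_TR();
                          this_cpu_write(need_tr_refresh, false);
                  }
          }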
      
      Here's some poorly done benchmarking using kvm-unit-tests:
      
      Before:
      
      cpuid 1313
      vmcall 1195
      mov_from_cr8 11
      mov_to_cr8 17
      inl_from_pmtimer 6770
      inl_from_qemu 6856
      inl_from_kernel 2435
      outl_to_kernel 1402
      
      After:
      
      cpuid 1291
      vmcall 1181
      mov_from_cr8 11
      mov_to_cr8 16
      inl_from_pmtimer 6457
      inl_from_qemu 6209
      inl_from_kernel 2339
      outl_to_kernel 1391
      Signed-off-by: Andy Lutomirski <luto@kernel.org>
      [Force-reload TR in invalidate_tss_limit. - Paolo]
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      b7ffc44d
    • x86/asm/64: Drop __cacheline_aligned from struct x86_hw_tss · d3273dea
      Authored by Andy Lutomirski
      Historically, the entire TSS + io bitmap structure was cacheline
      aligned, but commit ca241c75 ("x86: unify tss_struct") changed it
      (presumably inadvertently) so that the fixed-layout hardware part is
      cacheline-aligned and the io bitmap is after the padding.  This wastes
      24 bytes (the hardware part should be 104 bytes, but this pads it to
      128 bytes), serves no purpose, and causes sizeof(struct x86_hw_tss)
      to have a confusing value.
      
      Drop the pointless alignment.
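
      A minimal sketch of the resulting layout (field lists abbreviated; the
      point is only where the alignment attribute lives):

          struct x86_hw_tss {
                  u32     reserved1;
                  u64     sp0, sp1, sp2;
                  /* ... remaining fixed, architecturally defined fields ... */
                  u16     io_bitmap_base;
          } __attribute__((packed));      /* 104 bytes, no __cacheline_aligned */

          struct tss_struct {
                  struct x86_hw_tss       x86_tss;
                  unsigned long           io_bitmap[IO_BITMAP_LONGS + 1];
                  /* ... other software state elided ... */
          } ____cacheline_aligned;        /* alignment stays on the outer struct */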
      Signed-off-by: Andy Lutomirski <luto@kernel.org>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      d3273dea
    • x86/kvm/vmx: Simplify segment_base() · 8c2e41f7
      Authored by Andy Lutomirski
      Use actual pointer types for pointers (instead of unsigned long) and
      replace hardcoded constants with the appropriate self-documenting
      macros.
      
      The function is still a bit messy, but this seems a lot better than
      before to me.
      
      This is mostly borrowed from a patch by Thomas Garnier.
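
      A hedged sketch of the flavor of the result (typed descriptor-table
      pointers plus self-documenting selector masks instead of unsigned long
      arithmetic and magic numbers; not the verbatim function):

          struct desc_struct *table = (struct desc_struct *)gdt->address;

          if (!(selector & ~SEGMENT_RPL_MASK))                 /* NULL selector */
                  return 0;
          if ((selector & SEGMENT_TI_MASK) == SEGMENT_LDT)     /* LDT-relative  */
                  table = (struct desc_struct *)segment_base(kvm_read_ldt());

          return get_desc_base(&table[selector >> 3]);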
      
      Cc: Thomas Garnier <thgarnie@google.com>
      Cc: Jim Mattson <jmattson@google.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Andy Lutomirski <luto@kernel.org>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      8c2e41f7
    • x86/kvm/vmx: Get rid of segment_base() on 64-bit kernels · e28baead
      Authored by Andy Lutomirski
      It was a bit buggy (it didn't list all segment types that needed
      64-bit fixups), but the bug was irrelevant because it wasn't called
      in any interesting context on 64-bit kernels and was only used for
      data segments on 32-bit kernels.
      
      To avoid confusion, make it explicitly 32-bit only.
      
      Cc: Thomas Garnier <thgarnie@google.com>
      Cc: Jim Mattson <jmattson@google.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Andy Lutomirski <luto@kernel.org>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      e28baead
    • x86/kvm/vmx: Don't fetch the TSS base from the GDT · e0c23063
      Authored by Andy Lutomirski
      The current CPU's TSS base is a foregone conclusion, so there's no need
      to parse it out of the segment tables.  This should save a couple cycles
      (as STR is surely microcoded and poorly optimized) but, more importantly,
      it's a cleanup and it means that segment_base() will never be called on
      64-bit kernels.
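
      A hedged sketch of the idea (the per-cpu TSS symbol is assumed to be
      cpu_tss in this era; the point is that no STR/GDT parsing is needed):

          /* The host TR base is simply this CPU's TSS; write it directly
           * instead of decoding the descriptor referenced by STR. */
          vmcs_writel(HOST_TR_BASE, (unsigned long)this_cpu_ptr(&cpu_tss));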
      
      Cc: Thomas Garnier <thgarnie@google.com>
      Cc: Jim Mattson <jmattson@google.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Andy Lutomirski <luto@kernel.org>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      e0c23063
    • x86/asm: Define the kernel TSS limit in a macro · 4f53ab14
      Authored by Andy Lutomirski
      Rather than open-coding the kernel TSS limit in set_tss_desc(), make
      it a real macro near the TSS layout definition.
      
      This is purely a cleanup.
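
      A hedged sketch of what such a macro looks like next to the TSS layout
      (IO_BITMAP_OFFSET and IO_BITMAP_BYTES are pre-existing constants in the
      same header; the exact expression in the patch may differ):

          /* The limit covers the hardware TSS, the io bitmap and its
           * terminating word, minus 1 because segment limits are inclusive. */
          #define __KERNEL_TSS_LIMIT      \
                  (IO_BITMAP_OFFSET + IO_BITMAP_BYTES + sizeof(unsigned long) - 1)

          /* set_tss_desc() can then use __KERNEL_TSS_LIMIT instead of an
           * open-coded constant. */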
      
      Cc: Thomas Garnier <thgarnie@google.com>
      Cc: Jim Mattson <jmattson@google.com>
      Signed-off-by: Andy Lutomirski <luto@kernel.org>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      4f53ab14
    • kvm: fix page struct leak in handle_vmon · 06ce521a
      Authored by Paolo Bonzini
      handle_vmon gets a reference on the VMXON region page,
      but does not release it. Release the reference.
      
      Found by syzkaller; based on a patch by Dmitry.
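
      A hedged sketch of the shape of the fix inside handle_vmon() (helper
      names follow the nested-VMX page helpers of this era; validation and
      error handling are abbreviated):

          struct page *page = nested_get_page(vcpu, vmptr);   /* takes a reference */

          if (page == NULL)
                  return 1;                 /* error path, details elided */

          /* ... map the page and validate the VMXON region's revision id ... */

          kunmap(page);
          nested_release_page_clean(page);  /* the previously missing reference drop */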
      Reported-by: Dmitry Vyukov <dvyukov@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      06ce521a
  4. 18 February 2017 (2 commits)
    • bpf: make jited programs visible in traces · 74451e66
      Authored by Daniel Borkmann
      A long-standing issue with JITed programs is that stack traces from
      function tracing check whether a given address is kernel code
      through {__,}kernel_text_address(), which checks for code in core
      kernel, modules and dynamically allocated ftrace trampolines. But
      what is still missing is BPF JITed programs (interpreted programs
      are not an issue as __bpf_prog_run() will be attributed to them),
      thus when a stack trace is triggered, the code walking the stack
      won't see any of the JITed ones. The same for address correlation
      done from user space via reading /proc/kallsyms. This is read by
      tools like perf, but the latter is also useful for permanent live
      tracing with eBPF itself in combination with stack maps when other
      eBPF types are part of the callchain. See offwaketime example on
      dumping stack from a map.
      
      This work tries to tackle that issue by making the addresses and
      symbols known to the kernel. The lookup from *kernel_text_address()
      is implemented through a latched RB tree that can be read under
      RCU in the fast path and that is also shared for symbol/size/offset
      lookup for a specific given address in kallsyms. The slow-path
      iteration through all symbols in the seq file is done via an RCU list,
      which holds a tiny fraction of all exported ksyms, usually below
      0.1 percent. Function symbols are exported as bpf_prog_<tag>, in
      order to aid
      debugging and attribution. This facility is currently enabled for
      root-only when bpf_jit_kallsyms is set to 1, and disabled if hardening
      is active in any mode. The rationale behind this is that a lot of
      systems still ship with world-read permissions on kallsyms, so addresses
      should not suddenly get exposed for them. If that situation gets
      much better in the future, we always have the option to change the
      default on this. Likewise, unprivileged programs are not allowed
      to add entries there either, but that is less of a concern as most
      such program types relevant in this context are root-only anyway.
      If enabled, call graphs and stack traces will then show a correct
      attribution; one example is illustrated below, where the trace is
      now visible in tooling such as perf script --kallsyms=/proc/kallsyms
      and friends.
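
      As a small illustration, a userspace sketch that lists the JITed
      program symbols once the facility is enabled (sysctl
      net.core.bpf_jit_kallsyms=1, reading /proc/kallsyms as root):

          #include <stdio.h>
          #include <string.h>

          int main(void)
          {
                  FILE *f = fopen("/proc/kallsyms", "r");
                  char line[256];

                  if (!f)
                          return 1;
                  while (fgets(line, sizeof(line), f))
                          if (strstr(line, " bpf_prog_"))   /* addr type bpf_prog_<tag> */
                                  fputs(line, stdout);
                  fclose(f);
                  return 0;
          }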
      
      Before:
      
        7fff8166889d bpf_clone_redirect+0x80007f0020ed (/lib/modules/4.9.0-rc8+/build/vmlinux)
               f5d80 __sendmsg_nocancel+0xffff006451f1a007 (/usr/lib64/libc-2.18.so)
      
      After:
      
        7fff816688b7 bpf_clone_redirect+0x80007f002107 (/lib/modules/4.9.0-rc8+/build/vmlinux)
        7fffa0575728 bpf_prog_33c45a467c9e061a+0x8000600020fb (/lib/modules/4.9.0-rc8+/build/vmlinux)
        7fffa07ef1fc cls_bpf_classify+0x8000600020dc (/lib/modules/4.9.0-rc8+/build/vmlinux)
        7fff81678b68 tc_classify+0x80007f002078 (/lib/modules/4.9.0-rc8+/build/vmlinux)
        7fff8164d40b __netif_receive_skb_core+0x80007f0025fb (/lib/modules/4.9.0-rc8+/build/vmlinux)
        7fff8164d718 __netif_receive_skb+0x80007f002018 (/lib/modules/4.9.0-rc8+/build/vmlinux)
        7fff8164e565 process_backlog+0x80007f002095 (/lib/modules/4.9.0-rc8+/build/vmlinux)
        7fff8164dc71 net_rx_action+0x80007f002231 (/lib/modules/4.9.0-rc8+/build/vmlinux)
        7fff81767461 __softirqentry_text_start+0x80007f0020d1 (/lib/modules/4.9.0-rc8+/build/vmlinux)
        7fff817658ac do_softirq_own_stack+0x80007f00201c (/lib/modules/4.9.0-rc8+/build/vmlinux)
        7fff810a2c20 do_softirq+0x80007f002050 (/lib/modules/4.9.0-rc8+/build/vmlinux)
        7fff810a2cb5 __local_bh_enable_ip+0x80007f002085 (/lib/modules/4.9.0-rc8+/build/vmlinux)
        7fff8168d452 ip_finish_output2+0x80007f002152 (/lib/modules/4.9.0-rc8+/build/vmlinux)
        7fff8168ea3d ip_finish_output+0x80007f00217d (/lib/modules/4.9.0-rc8+/build/vmlinux)
        7fff8168f2af ip_output+0x80007f00203f (/lib/modules/4.9.0-rc8+/build/vmlinux)
        [...]
        7fff81005854 do_syscall_64+0x80007f002054 (/lib/modules/4.9.0-rc8+/build/vmlinux)
        7fff817649eb return_from_SYSCALL_64+0x80007f002000 (/lib/modules/4.9.0-rc8+/build/vmlinux)
               f5d80 __sendmsg_nocancel+0xffff01c484812007 (/usr/lib64/libc-2.18.so)
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: David S. Miller <davem@davemloft.net>
      74451e66
    • bpf: remove stubs for cBPF from arch code · 9383191d
      Authored by Daniel Borkmann
      Remove the dummy bpf_jit_compile() stubs for eBPF JITs and make
      that a single __weak function in the core that can be overridden
      similarly to the eBPF one. Also remove stale pr_err() mentions
      of bpf_jit_compile.
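
      A hedged sketch of the core-side default this describes (prototype per
      the existing cBPF JIT entry point; arches with a cBPF JIT override the
      weak symbol):

          /* No-op unless the architecture provides its own cBPF JIT. */
          void __weak bpf_jit_compile(struct bpf_prog *prog)
          {
          }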
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      9383191d
  5. 17 February 2017 (6 commits)
    • KVM: x86: remove code for lazy FPU handling · bd7e5b08
      Authored by Paolo Bonzini
      The FPU is always active now when running KVM.
      Reviewed-by: David Matlack <dmatlack@google.com>
      Reviewed-by: Bandan Das <bsd@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      bd7e5b08
    • KVM: race-free exit from KVM_RUN without POSIX signals · 460df4c1
      Authored by Paolo Bonzini
      The purpose of the KVM_SET_SIGNAL_MASK API is to let userspace "kick"
      a VCPU out of KVM_RUN through a POSIX signal.  A signal is attached
      to a dummy signal handler; by blocking the signal outside KVM_RUN and
      unblocking it inside, this possible race is closed:
      
                VCPU thread                     service thread
         --------------------------------------------------------------
              check flag
                                                set flag
                                                raise signal
              (signal handler does nothing)
              KVM_RUN
      
      However, one issue with KVM_SET_SIGNAL_MASK is that it has to take
      tsk->sighand->siglock on every KVM_RUN.  This lock is often on a
      remote NUMA node, because it is on the node of a thread's creator.
      Taking this lock can be very expensive if there are many userspace
      exits (as is the case for SMP Windows VMs without Hyper-V reference
      time counter).
      
      As an alternative, we can put the flag directly in kvm_run so that
      KVM can see it:
      
                VCPU thread                     service thread
         --------------------------------------------------------------
                                                raise signal
              signal handler
                set run->immediate_exit
              KVM_RUN
                check run->immediate_exit
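
      A hedged userspace sketch of the new protocol (struct kvm_run is
      described above as gaining an immediate_exit flag; KVM_RUN is assumed
      to return -EINTR when it is set):

          volatile struct kvm_run *run;      /* per-vCPU mmap()ed region      */

          static void kick_handler(int sig)  /* delivered to the VCPU thread  */
          {
                  run->immediate_exit = 1;   /* no sigprocmask dance required */
          }

          /* VCPU loop: */
          if (ioctl(vcpu_fd, KVM_RUN, 0) < 0 && errno == EINTR) {
                  run->immediate_exit = 0;   /* acknowledge and service the kick */
          }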
      Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      460df4c1
    • x86/mm/ptdump: Add address marker for KASAN shadow region · 025205f8
      Authored by Andrey Ryabinin
      Annotate the KASAN shadow with address markers in page table
      dump output:
      
      $ cat /sys/kernel/debug/kernel_page_tables
      ...
      
      ---[ Vmemmap ]---
      0xffffea0000000000-0xffffea0003000000          48M     RW         PSE     GLB NX pmd
      0xffffea0003000000-0xffffea0004000000          16M                               pmd
      0xffffea0004000000-0xffffea0005000000          16M     RW         PSE     GLB NX pmd
      0xffffea0005000000-0xffffea0040000000         944M                               pmd
      0xffffea0040000000-0xffffea8000000000         511G                               pud
      0xffffea8000000000-0xffffec0000000000        1536G                               pgd
      ---[ KASAN shadow ]---
      0xffffec0000000000-0xffffed0000000000           1T     ro                 GLB NX pte
      0xffffed0000000000-0xffffed0018000000         384M     RW         PSE     GLB NX pmd
      0xffffed0018000000-0xffffed0020000000         128M                               pmd
      0xffffed0020000000-0xffffed0028200000         130M     RW         PSE     GLB NX pmd
      0xffffed0028200000-0xffffed0040000000         382M                               pmd
      0xffffed0040000000-0xffffed8000000000         511G                               pud
      0xffffed8000000000-0xfffff50000000000        7680G                               pgd
      0xfffff50000000000-0xfffffbfff0000000     7339776M     ro                 GLB NX pte
      0xfffffbfff0000000-0xfffffbfff0200000           2M                               pmd
      0xfffffbfff0200000-0xfffffbfff0a00000           8M     RW         PSE     GLB NX pmd
      0xfffffbfff0a00000-0xfffffbffffe00000         244M                               pmd
      0xfffffbffffe00000-0xfffffc0000000000           2M     ro                 GLB NX pte
      ---[ KASAN shadow end ]---
      0xfffffc0000000000-0xffffff0000000000           3T                               pgd
      ---[ ESPfix Area ]---
      ...
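
      A minimal sketch of the markers behind the two new banners above (an
      addr_marker pairs a start address with the label printed in the dump;
      their placement in the x86-64 marker table is not shown here):

          { KASAN_SHADOW_START,   "KASAN shadow" },
          { KASAN_SHADOW_END,     "KASAN shadow end" },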
      Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Reviewed-by: Alexander Potapenko <glider@google.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: kasan-dev@googlegroups.com
      Cc: Tobias Regnery <tobias.regnery@gmail.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Link: http://lkml.kernel.org/r/20170214100839.17186-2-aryabinin@virtuozzo.com
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      025205f8
    • x86/mm/ptdump: Optimize check for W+X mappings for CONFIG_KASAN=y · 243b72aa
      Authored by Andrey Ryabinin
      Enabling both DEBUG_WX=y and KASAN=y options significantly increases
      boot time (dozens of seconds at least).
      KASAN fills kernel page tables with repeated values to map several
      TBs of the virtual memory to the single kasan_zero_page:
      
          kasan_zero_pud ->
              kasan_zero_pmd->
                  kasan_zero_pte->
                      kasan_zero_page
      
      So the page table walker used to find W+X mappings checks the same
      kasan_zero_p?d page table entries many times over.
      With this patch, the pud walker will skip a pud if it has the same
      value as the previous one. Skipping is done only when we search for
      W+X mappings, so this optimization won't affect the page table dump
      via debugfs.
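
      A hedged sketch of the skip test (names illustrative; it is applied only
      on the W+X check path, so the page table dump via debugfs is unaffected):

          static bool skip_repeated_pud(pud_t *cur, pud_t *prev, bool checkwx)
          {
                  /* Identical consecutive pud values point at the same
                   * kasan_zero_pmd subtree; re-walking it cannot turn up
                   * new W+X pages. */
                  return checkwx && prev && pud_val(*cur) == pud_val(*prev);
          }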
      
      This drops the time spent in the W+X check from ~30 sec to a reasonable 0.1 sec:
      
      Before:
      [    4.579991] Freeing unused kernel memory: 1000K
      [   35.257523] x86/mm: Checked W+X mappings: passed, no W+X pages found.
      
      After:
      [    5.138756] Freeing unused kernel memory: 1000K
      [    5.266496] x86/mm: Checked W+X mappings: passed, no W+X pages found.
      Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Reviewed-by: Alexander Potapenko <glider@google.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: kasan-dev@googlegroups.com
      Cc: Tobias Regnery <tobias.regnery@gmail.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Link: http://lkml.kernel.org/r/20170214100839.17186-1-aryabinin@virtuozzo.com
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      243b72aa
    • KVM: Support vCPU-based gfn->hva cache · bbd64115
      Authored by Cao, Lei
      Provide versions of struct gfn_to_hva_cache functions that
      take vcpu as a parameter instead of struct kvm.  The existing functions
      are not needed anymore, so delete them.  This allows dirty pages to
      be logged in the vcpu dirty ring, instead of the global dirty ring,
      for ring-based dirty memory tracking.
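
      A hedged sketch of the vCPU-based variants this introduces (prototypes
      mirror the existing struct kvm based helpers; exact names and the init
      helper are assumptions based on the description above):

          int kvm_vcpu_gfn_to_hva_cache_init(struct kvm_vcpu *vcpu,
                                             struct gfn_to_hva_cache *ghc,
                                             gpa_t gpa, unsigned long len);
          int kvm_vcpu_read_guest_cached(struct kvm_vcpu *vcpu,
                                         struct gfn_to_hva_cache *ghc,
                                         void *data, unsigned long len);
          int kvm_vcpu_write_guest_cached(struct kvm_vcpu *vcpu,
                                          struct gfn_to_hva_cache *ghc,
                                          void *data, unsigned long len);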
      Signed-off-by: Lei Cao <lei.cao@stratus.com>
      Message-Id: <CY1PR08MB19929BD2AC47A291FD680E83F04F0@CY1PR08MB1992.namprd08.prod.outlook.com>
      Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      bbd64115
  6. 16 February 2017 (3 commits)
  7. 15 February 2017 (15 commits)
  8. 14 February 2017 (1 commit)