Commits · 3a4d5c94e959359ece6d6b55045c3f046677f55c · openanolis / cloud-kernel

15 Jan, 2010 1 commit

vhost_net: a kernel-level virtio server · 3a4d5c94

Committed by Michael S. Tsirkin on Jan 14, 2010

What it is: vhost net is a character device that can be used to reduce
the number of system calls involved in virtio networking.
Existing virtio net code is used in the guest without modification.

There's similarity with vringfd, with some differences and reduced scope:
- uses eventfd for signalling
- structures can be moved around in memory at any time (good for
  migration, bug work-arounds in userspace)
- write logging is supported (good for migration)
- supports a memory table and not just an offset (needed for kvm)

Common virtio-related code has been put in a separate file, vhost.c, and
can be made into a separate module if/when more backends appear.  I used
Rusty's lguest.c as the source for developing this part: this supplied
me with witty comments I wouldn't be able to write myself.

What it is not: vhost net is not a bus, and not a generic new system
call. No assumptions are made on how guest performs hypercalls.
Userspace hypervisors are supported as well as kvm.

How it works: Basically, we connect virtio frontend (configured by
userspace) to a backend. The backend could be a network device, or a tap
device.  Backend is also configured by userspace, including vlan/mac
etc.

Status: This works for me, and I haven't seen any crashes.
Compared to userspace, people reported improved latency (as I save up to
4 system calls per packet), as well as better bandwidth and CPU
utilization.

Features that I plan to look at in the future:
- mergeable buffers
- zero copy
- scalability tuning: figure out the best threading model to use

Note on RCU usage (this is also documented in vhost.h, near
private_pointer which is the value protected by this variant of RCU):
what is happening is that the rcu_dereference() is being used in a
workqueue item.  The role of rcu_read_lock() is taken on by the start of
execution of the workqueue item, of rcu_read_unlock() by the end of
execution of the workqueue item, and of synchronize_rcu() by
flush_workqueue()/flush_work(). In the future we might need to apply
some gcc attribute or sparse annotation to the function passed to
INIT_WORK(). Paul's ack below is for this RCU usage.

(Includes fixes by Alan Cox <alan@linux.intel.com>,
David L Stevens <dlstevens@us.ibm.com>,
Chris Wright <chrisw@redhat.com>)
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

3a4d5c94
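The vhost_net message above lists eventfd as its signalling mechanism. As a rough illustration of what an eventfd provides, here is a minimal userspace sketch using only the standard `eventfd(2)`, `write(2)` and `read(2)` calls. It does not touch the vhost character device or its ioctls, and the function name `kick_and_drain` is made up for this example.

```c
#include <errno.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

/*
 * Signal an eventfd `kicks` times, then drain it with a single read.
 * Returns the accumulated count, 0 if nothing was signalled, -1 on error.
 * Hypothetical helper: it only demonstrates eventfd counter semantics.
 */
long kick_and_drain(unsigned int kicks)
{
    uint64_t one = 1, total = 0;
    int fd = eventfd(0, EFD_NONBLOCK);

    if (fd < 0)
        return -1;

    /* Each write adds to the 64-bit counter held by the kernel. */
    for (unsigned int i = 0; i < kicks; i++) {
        if (write(fd, &one, sizeof(one)) != (ssize_t)sizeof(one)) {
            close(fd);
            return -1;
        }
    }

    /* One read returns the sum of all pending signals and resets it. */
    if (read(fd, &total, sizeof(total)) != (ssize_t)sizeof(total)) {
        int saved = errno;

        close(fd);
        return saved == EAGAIN ? 0 : -1;
    }

    close(fd);
    return (long)total;
}
```

Several signals collapse into a single wakeup carrying a count, so the consumer needs one read to learn how many notifications arrived.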

10 Dec, 2009 1 commit

x86: i8254.c: Add pr_fmt(fmt) · a78d9626

Committed by Joe Perches on Dec 09, 2009

- Add pr_fmt(fmt) "pit: " fmt
- Strip pit: prefixes from pr_debug
Signed-off-by: Joe Perches <joe@perches.com>
LKML-Reference: <bbd4de532f18bb7c11f64ba20d224c08291cb126.1260383912.git.joe@perches.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

a78d9626
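The pr_fmt() change above relies on string-literal concatenation: the prefix is defined once and every message macro picks it up. The sketch below mimics this in userspace. `pr_debug_buf` and `format_pit_message` are hypothetical stand-ins that format into a buffer instead of the kernel log, and the message text is invented.

```c
#include <stdio.h>

/* The definition described in the commit message; it must precede any use. */
#define pr_fmt(fmt) "pit: " fmt

/*
 * Userspace stand-in for pr_debug(): writes into buf rather than the kernel
 * log. ##__VA_ARGS__ is a GNU extension, which the kernel also relies on.
 */
#define pr_debug_buf(buf, size, fmt, ...) \
    snprintf((buf), (size), pr_fmt(fmt), ##__VA_ARGS__)

/* The call site no longer spells out the "pit: " prefix itself. */
int format_pit_message(char *buf, size_t size, int channel)
{
    return pr_debug_buf(buf, size, "channel %d reset\n", channel);
}
```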

03 Dec, 2009 38 commits

KVM: VMX: Fix comparison of guest efer with stale host value · d5696725

Committed by Avi Kivity on Dec 02, 2009

update_transition_efer() masks out some efer bits when deciding whether
to switch the msr during guest entry; for example, NX is emulated using the
mmu so we don't need to disable it, and LMA/LME are handled by the hardware.

However, with shared msrs, the comparison is made against a stale value;
at the time of the guest switch we may be running with another guest's efer.

Fix by deferring the mask/compare to the actual point of guest entry.

Noted by Marcelo.
Signed-off-by: Avi Kivity <avi@redhat.com>

d5696725
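The mask-and-compare step described in the EFER fix above can be sketched as follows. The bit positions are the architectural EFER bits; the ignore mask follows the commit message (NX, LMA, LME) and may differ from what the kernel actually uses. `efer_needs_switch` is a hypothetical name.

```c
#include <stdbool.h>
#include <stdint.h>

#define EFER_SCE (1ULL << 0)   /* SYSCALL enable */
#define EFER_LME (1ULL << 8)   /* long mode enable */
#define EFER_LMA (1ULL << 10)  /* long mode active */
#define EFER_NX  (1ULL << 11)  /* no-execute enable */

/* Bits the commit message says need not trigger an MSR switch. */
#define EFER_IGNORE_BITS (EFER_NX | EFER_LMA | EFER_LME)

/*
 * Decide whether EFER must be rewritten before entering the guest.
 * loaded_efer must be the value currently in the MSR at guest entry; with
 * shared MSRs that may be another guest's value, not the host's.
 */
bool efer_needs_switch(uint64_t guest_efer, uint64_t loaded_efer)
{
    return (guest_efer & ~EFER_IGNORE_BITS) !=
           (loaded_efer & ~EFER_IGNORE_BITS);
}
```

Comparing against a cached host value can report "no switch needed" while another guest's EFER is still loaded. This is why the message defers the comparison to the point of guest entry.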

KVM: Drop user return notifier when disabling virtualization on a cpu · 3548bab5

Committed by Avi Kivity on Nov 28, 2009

This way, we don't leave a dangling notifier on cpu hotunplug or module
unload.  In particular, module unload leaves the notifier pointing into
freed memory.
Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

3548bab5

KVM: VMX: Disable unrestricted guest when EPT disabled · 046d8710

Committed by Sheng Yang on Nov 27, 2009

Otherwise, using ept=0 on processors that support unrestricted guest would
cause a VMEntry failure.
Signed-off-by: Sheng Yang <sheng@linux.intel.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

046d8710

KVM: x86 emulator: limit instructions to 15 bytes · eb3c79e6

Committed by Avi Kivity on Nov 24, 2009

While we are never normally passed an instruction that exceeds 15 bytes,
smp games can cause us to attempt to interpret one, which will cause
large latencies in non-preempt hosts.

Cc: stable@kernel.org
Signed-off-by: Avi Kivity <avi@redhat.com>

eb3c79e6
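The 15-byte limit above amounts to a bounds check in the decoder's fetch path. The sketch below is not the emulator's code: `count_prefixes` is a hypothetical decoder that only skips 0x66 operand-size prefixes, which is enough to show how an endless prefix run is cut off.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_INSN_LEN 15  /* architectural x86 instruction length limit */

/*
 * Count leading 0x66 prefix bytes in front of an opcode.
 * Returns the number of prefixes, or -1 if no opcode byte is found within
 * the first MAX_INSN_LEN bytes (or within the `avail` bytes supplied).
 */
int count_prefixes(const uint8_t *code, size_t avail)
{
    size_t fetched = 0;

    for (;;) {
        /* Refuse to fetch past the limit instead of decoding forever. */
        if (fetched >= MAX_INSN_LEN || fetched >= avail)
            return -1;
        if (code[fetched] != 0x66)
            return (int)fetched;
        fetched++;
    }
}
```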

KVM: x86: Add KVM_GET/SET_VCPU_EVENTS · 3cfc3092

Committed by Jan Kiszka on Nov 12, 2009

This new IOCTL exports all state related to exceptions, interrupts, and
NMIs that was so far invisible to user space. Together with appropriate
user space changes, this fixes sporadic problems of vmsave/restore, live
migration and system reset.

[avi: future-proof abi by adding a flags field]
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Avi Kivity <avi@redhat.com>

3cfc3092
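Several messages in this log mention future-proofing an ABI with a flags field. The general pattern is sketched below with entirely hypothetical names (`struct demo_events`, `demo_set_events`, `DEMO_FLAG_EXTRA`); it is not the layout of the real KVM structure. The receiver rejects any flag bit it does not understand, so new bits can be given a meaning later without old kernels silently misreading them.

```c
#include <errno.h>
#include <stdint.h>

#define DEMO_FLAG_EXTRA  (1u << 0)       /* the only flag defined so far */
#define DEMO_KNOWN_FLAGS DEMO_FLAG_EXTRA

/* Hypothetical ioctl payload: fixed size, with room to grow. */
struct demo_events {
    uint32_t pending;
    uint32_t flags;        /* says which optional fields are valid */
    uint32_t reserved[4];  /* space for future fields */
};

/* Apply the payload, refusing any flag bit this version does not know. */
int demo_set_events(const struct demo_events *ev, uint32_t *state)
{
    if (ev->flags & ~DEMO_KNOWN_FLAGS)
        return -EINVAL;

    *state = ev->pending;
    return 0;
}
```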

KVM: VMX: Report unexpected simultaneous exceptions as internal errors · 65ac7264

Committed by Avi Kivity on Nov 04, 2009

These happen when we trap an exception when another exception is being
delivered; we only expect these with MCEs and page faults.  If something
unexpected happens, things probably went south and we're better off reporting
an internal error and freezing.
Signed-off-by: Avi Kivity <avi@redhat.com>

65ac7264

KVM: Allow internal errors reported to userspace to carry extra data · a9c7399d

Committed by Avi Kivity on Nov 04, 2009

Usually userspace will freeze the guest so we can inspect it, but some
internal state is not available.  Add extra data to internal error
reporting so we can expose it to the debugger.  Extra data is specific
to the suberror.
Signed-off-by: Avi Kivity <avi@redhat.com>

a9c7399d

KVM: x86: Polish exception injection via KVM_SET_GUEST_DEBUG · 4f926bf2

Committed by Jan Kiszka on Oct 30, 2009

Decouple KVM_GUESTDBG_INJECT_DB and KVM_GUESTDBG_INJECT_BP from
KVM_GUESTDBG_ENABLE; they are actually orthogonal. While at it,
avoid triggering the WARN_ON in kvm_queue_exception if there is already
an exception pending and reject such invalid requests.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

4f926bf2

KVM: x86: disallow KVM_{SET,GET}_LAPIC without allocated in-kernel lapic · 2204ae3c

Committed by Marcelo Tosatti on Oct 29, 2009

Otherwise kvm might attempt to dereference a NULL pointer.
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>

2204ae3c

KVM: x86: disallow multiple KVM_CREATE_IRQCHIP · 3ddea128

Committed by Marcelo Tosatti on Oct 29, 2009

Otherwise kvm will leak memory on multiple KVM_CREATE_IRQCHIP.
Also serialize multiple accesses with kvm->lock.
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>

3ddea128

KVM: VMX: Remove vmx->msr_offset_efer · 92c0d900

Committed by Avi Kivity on Oct 29, 2009

This variable is used to communicate between a caller and a callee; switch
to a function argument instead.
Signed-off-by: Avi Kivity <avi@redhat.com>

92c0d900

KVM: MMU: update invlpg handler comment · 5f5c35aa

Committed by Marcelo Tosatti on Oct 26, 2009

Large page translations are always synchronized (either in level 3
or level 2), so it's not necessary to properly deal with them
in the invlpg handler.
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>

5f5c35aa

KVM: VMX: move CR3/PDPTR update to vmx_set_cr3 · 7c93be44

Committed by Marcelo Tosatti on Oct 26, 2009

GUEST_CR3 is updated via kvm_set_cr3 whenever CR3 is modified from
outside guest context. Similarly pdptrs are updated via load_pdptrs.

Let kvm_set_cr3 perform the update, removing it from the vcpu_run
fast path.
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Acked-by: Sheng Yang <sheng@linux.intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>

7c93be44

KVM: remove duplicated task_switch check · 1655e3a3

Committed by Gleb Natapov on Oct 25, 2009

Probably introduced by a bad merge.
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>

1655e3a3

KVM: VMX: Use shared msr infrastructure · 26bb0981

Committed by Avi Kivity on Sep 07, 2009

Instead of reloading syscall MSRs on every preemption, use the new shared
msr infrastructure to reload them at the last possible minute (just before
exit to userspace).

Improves vcpu/idle/vcpu switches by about 2000 cycles (when EFER needs to be
reloaded as well).

[jan: fix slot index missing indirection]
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Avi Kivity <avi@redhat.com>

26bb0981

KVM: x86 shared msr infrastructure · 18863bdd

Committed by Avi Kivity on Sep 07, 2009

The various syscall-related MSRs are fairly expensive to switch.  Currently
we switch them on every vcpu preemption, which is far too often:

- if we're switching to a kernel thread (idle task, threaded interrupt,
  kernel-mode virtio server (vhost-net), for example) and back, then
  there's no need to switch those MSRs since kernel threads won't
  be exiting to userspace.

- if we're switching to another guest running an identical OS, most likely
  those MSRs will have the same value, so there's little point in reloading
  them.

- if we're running the same OS on the guest and host, the MSRs will have
  identical values and reloading is unnecessary.

This patch uses the new user return notifiers to implement last-minute
switching, and checks the msr values to avoid unnecessary reloading.
Signed-off-by: Avi Kivity <avi@redhat.com>

18863bdd
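The shared-MSR message above combines two ideas: skip the write when the value is already loaded, and restore host values only when the CPU is about to return to userspace. The sketch below models both with a fake MSR array and a write counter. All names (`demo_shared_msrs`, `demo_set_shared_msr`, `demo_on_user_return`) are hypothetical, and the real code keeps this state per CPU and hooks into user return notifiers.

```c
#include <stdint.h>

#define DEMO_NR_SHARED_MSRS 4

/* Per-CPU bookkeeping in the real design; a single instance here. */
struct demo_shared_msrs {
    uint64_t host[DEMO_NR_SHARED_MSRS];  /* values userspace expects */
    uint64_t curr[DEMO_NR_SHARED_MSRS];  /* values currently loaded */
    int notifier_registered;             /* restore pending on user return */
    unsigned int writes;                 /* counts the expensive MSR writes */
};

/* Stand-in for the costly wrmsr instruction. */
static void demo_wrmsr(struct demo_shared_msrs *s, int slot, uint64_t value)
{
    s->curr[slot] = value;
    s->writes++;
}

/* Called before guest entry: write only if the value actually differs. */
void demo_set_shared_msr(struct demo_shared_msrs *s, int slot, uint64_t value)
{
    if (s->curr[slot] == value)
        return;
    demo_wrmsr(s, slot, value);
    s->notifier_registered = 1;
}

/* Called just before returning to userspace: lazily restore host values. */
void demo_on_user_return(struct demo_shared_msrs *s)
{
    for (int slot = 0; slot < DEMO_NR_SHARED_MSRS; slot++)
        if (s->curr[slot] != s->host[slot])
            demo_wrmsr(s, slot, s->host[slot]);
    s->notifier_registered = 0;
}
```

A switch to a kernel thread and back never reaches `demo_on_user_return`, so it costs no MSR writes at all in this model.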

KVM: VMX: Move MSR_KERNEL_GS_BASE out of the vmx autoload msr area · 44ea2b17

Committed by Avi Kivity on Sep 06, 2009

Currently MSR_KERNEL_GS_BASE is saved and restored as part of the
guest/host msr reloading.  Since we wish to lazy-restore all the other
msrs, save and reload MSR_KERNEL_GS_BASE explicitly instead of using
the common code.
Signed-off-by: Avi Kivity <avi@redhat.com>

44ea2b17

KVM: SVM: init_vmcb(): remove redundant save->cr0 initialization · 3ce672d4

Committed by Eduardo Habkost on Oct 24, 2009

The svm_set_cr0() call will initialize save->cr0 properly even when npt is
enabled, clearing the NW and CD bits as expected, so we don't need to
initialize it manually for npt_enabled anymore.
Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>

3ce672d4

KVM: SVM: Reset cr0 properly on vcpu reset · 18fa000a

Committed by Eduardo Habkost on Oct 24, 2009

svm_vcpu_reset() was not properly resetting the contents of the guest-visible
cr0 register, causing the following issue:
https://bugzilla.redhat.com/show_bug.cgi?id=525699

Without resetting cr0 properly, the vcpu was running the SIPI bootstrap routine
with paging enabled, making the vcpu get a pagefault exception while trying to
run it.

Instead of setting vmcb->save.cr0 directly, the new code just resets
kvm->arch.cr0 and calls kvm_set_cr0(). The bits that were set/cleared on
vmcb->save.cr0 (PG, WP, !CD, !NW) will be set properly by svm_set_cr0().

kvm_set_cr0() is used instead of calling svm_set_cr0() directly to make sure
kvm_mmu_reset_context() is called to reset the mmu to nonpaging mode.
Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>

18fa000a

KVM: VMX: Use macros instead of hex value on cr0 initialization · fa40052c

Committed by Eduardo Habkost on Oct 24, 2009

This should have no effect; it is just to make the code clearer.
Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>

fa40052c

KVM: allow userspace to adjust kvmclock offset · afbcf7ab

Committed by Glauber Costa on Oct 16, 2009

When we migrate a kvm guest that uses pvclock between two hosts, we may
suffer a large skew. This is because there can be significant differences
between the monotonic clock of the hosts involved. When a new host with
a much larger monotonic time starts running the guest, the view of time
will be significantly impacted.

Situation is much worse when we do the opposite, and migrate to a host with
a smaller monotonic clock.

This proposed ioctl will allow userspace to tell us the monotonic clock
value of the source host, so we can keep the time skew small and, more
importantly, ensure that time never goes backwards. Userspace may also need
to trigger the current data, since from the first migration onwards, it won't
be reflected by a simple call to clock_gettime() anymore.

[marcelo: future-proof abi with a flags field]
[jan: fix KVM_GET_CLOCK by clearing flags field instead of checking it]
Signed-off-by: Glauber Costa <glommer@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>

afbcf7ab
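The arithmetic behind the kvmclock offset adjustment above can be sketched as follows, assuming the guest clock is modelled as "host monotonic time plus a per-VM offset". The names `demo_clock`, `demo_set_clock` and `demo_get_clock` are hypothetical and do not match the ioctl structures.

```c
#include <stdint.h>

/* Per-VM state: the difference between guest clock and host monotonic time. */
struct demo_clock {
    int64_t offset_ns;
};

/*
 * On the destination host: userspace passes the clock value captured on the
 * source; store the offset that makes the local clock continue from there.
 */
void demo_set_clock(struct demo_clock *c, int64_t source_clock_ns,
                    int64_t host_monotonic_ns)
{
    c->offset_ns = source_clock_ns - host_monotonic_ns;
}

/* Guest-visible clock: local monotonic time shifted by the stored offset. */
int64_t demo_get_clock(const struct demo_clock *c, int64_t host_monotonic_ns)
{
    return host_monotonic_ns + c->offset_ns;
}
```

The offset may be negative, so migrating to a host with a smaller monotonic clock still yields a guest clock that continues forward from the source value.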

KVM: SVM: Cleanup NMI singlestep · 6be7d306

Committed by Jan Kiszka on Oct 18, 2009

Push the NMI-related singlestep variable into vcpu_svm. It's dealing
with an AMD-specific deficit, nothing generic for x86.
Acked-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>

 arch/x86/include/asm/kvm_host.h |    1 -
 arch/x86/kvm/svm.c              |   12 +++++++-----
 2 files changed, 7 insertions(+), 6 deletions(-)
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

6be7d306

KVM: x86: Fix guest single-stepping while interruptible · 94fe45da

Committed by Jan Kiszka on Oct 18, 2009

Commit 705c5323 opened the doors of hell by unconditionally injecting
single-step flags as long as guest_debug signaled this. This doesn't
work when the guest branches into some interrupt or exception handler
and triggers a vmexit with flag reloading.

Fix it by saving cs:rip when user space requests single-stepping and
restricting the trace flag injection to this guest code position.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

94fe45da
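The single-stepping fix above records where stepping was requested and injects the trace flag only at that position. A simplified sketch, with hypothetical names (`demo_vcpu`, `demo_request_singlestep`, `demo_rflags_for_entry`) and a plain selector/offset comparison instead of whatever the kernel compares internally:

```c
#include <stdbool.h>
#include <stdint.h>

#define X86_EFLAGS_TF (1ULL << 8)  /* trap flag */

struct demo_vcpu {
    bool singlestep;          /* user space asked for single-stepping */
    uint16_t singlestep_cs;   /* code position saved at request time */
    uint64_t singlestep_rip;
};

/* User space requests a single step: remember the current cs:rip. */
void demo_request_singlestep(struct demo_vcpu *v, uint16_t cs, uint64_t rip)
{
    v->singlestep = true;
    v->singlestep_cs = cs;
    v->singlestep_rip = rip;
}

/*
 * Compute the flags to load on guest entry. TF is injected only if the guest
 * is still at the saved position; after a branch into an interrupt or
 * exception handler, the flags are passed through untouched.
 */
uint64_t demo_rflags_for_entry(const struct demo_vcpu *v, uint16_t cs,
                               uint64_t rip, uint64_t rflags)
{
    if (v->singlestep && cs == v->singlestep_cs && rip == v->singlestep_rip)
        rflags |= X86_EFLAGS_TF;
    return rflags;
}
```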

KVM: Xen PV-on-HVM guest support · ffde22ac

Committed by Ed Swierk on Oct 15, 2009

Support for Xen PV-on-HVM guests can be implemented almost entirely in
userspace, except for handling one annoying MSR that maps a Xen
hypercall blob into guest address space.

A generic mechanism to delegate MSR writes to userspace seems overkill
and risks encouraging similar MSR abuse in the future.  Thus this patch
adds special support for the Xen HVM MSR.

I implemented a new ioctl, KVM_XEN_HVM_CONFIG, that lets userspace tell
KVM which MSR the guest will write to, as well as the starting address
and size of the hypercall blobs (one each for 32-bit and 64-bit) that
userspace has loaded from files.  When the guest writes to the MSR, KVM
copies one page of the blob from userspace to the guest.

I've tested this patch with a hacked-up version of Gerd's userspace
code, booting a number of guests (CentOS 5.3 i386 and x86_64, and
FreeBSD 8.0-RC1 amd64) and exercising PV network and block devices.

[jan: fix i386 build warning]
[avi: future proof abi with a flags field]
Signed-off-by: Ed Swierk <eswierk@aristanetworks.com>
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>

ffde22ac

KVM: x86: Drop unneeded CONFIG_HAS_IOMEM check · 94c30d9c

Committed by Jan Kiszka on Oct 12, 2009

This (broken) check dates back to the days when this code was shared
across architectures. x86 has IOMEM, so drop it.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

94c30d9c

KVM: VMX: fix handle_pause declaration · 9fb41ba8

Committed by Marcelo Tosatti on Oct 12, 2009

There's no kvm_run argument anymore.
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

9fb41ba8

KVM: x86: Harden against cpufreq · 6b7d7e76

Committed by Zachary Amsden on Oct 09, 2009

If cpufreq can't determine the CPU khz, or cpufreq is not compiled in,
we should fall back to the measured TSC khz.
Signed-off-by: Zachary Amsden <zamsden@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

6b7d7e76

KVM: SVM: Support Pause Filter in AMD processors · 565d0998

Committed by Mark Langsdorf on Oct 06, 2009

New AMD processors (Family 0x10 models 8+) support the Pause
Filter Feature.  This feature creates a new field in the VMCB
called Pause Filter Count.  If Pause Filter Count is greater
than 0 and intercepting PAUSEs is enabled, the processor will
increment an internal counter when a PAUSE instruction occurs
instead of intercepting.  When the internal counter reaches the
Pause Filter Count value, a PAUSE intercept will occur.

This feature can be used to detect contended spinlocks,
especially when the lock holding VCPU is not scheduled.
Rescheduling another VCPU prevents the VCPU seeking the
lock from wasting its quantum by spinning idly.

Experimental results show that most spinlocks are held
for less than 1000 PAUSE cycles or more than a few
thousand.  Default the Pause Filter Counter to 3000 to
detect the contended spinlocks.

Processor support for this feature is indicated by a CPUID
bit.

On a 24 core system running 4 guests each with 16 VCPUs,
this patch improved overall performance of each guest's
32 job kernbench by approximately 3-5% when combined
with a scheduler algorithm that caused the VCPU to
sleep for a brief period. Further performance improvement
may be possible with a more sophisticated yield algorithm.
Signed-off-by: Mark Langsdorf <mark.langsdorf@amd.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

565d0998
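The Pause Filter behaviour described above can be modelled as a small counter. This sketch follows the wording of the commit message (count up, intercept on reaching the configured value) rather than the hardware specification. `demo_pause_filter` and `demo_on_pause` are hypothetical names, and resetting the counter after an intercept is an assumption of the model.

```c
#include <stdint.h>

struct demo_pause_filter {
    uint16_t filter_count;  /* the VMCB Pause Filter Count; 0 disables filtering */
    uint16_t counter;       /* internal count of PAUSEs seen so far */
};

/*
 * Model one PAUSE executed by the guest while PAUSE interception is enabled.
 * Returns 1 if this PAUSE causes an intercept, 0 if it is filtered.
 */
int demo_on_pause(struct demo_pause_filter *pf)
{
    /* Without a filter count, every PAUSE is intercepted directly. */
    if (pf->filter_count == 0)
        return 1;

    if (++pf->counter >= pf->filter_count) {
        pf->counter = 0;  /* assumption: start counting afresh */
        return 1;
    }
    return 0;
}
```

With the default of 3000 quoted in the message, short spin loops never reach the hypervisor and only a long-spinning VCPU triggers an exit.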

KVM: VMX: Add support for Pause-Loop Exiting · 4b8d54f9

Committed by Zhai, Edwin on Oct 09, 2009

New NHM processors will support Pause-Loop Exiting by adding 2 VM-execution
control fields:
PLE_Gap    - upper bound on the amount of time between two successive
             executions of PAUSE in a loop.
PLE_Window - upper bound on the amount of time a guest is allowed to execute in
             a PAUSE loop

If the time between this execution of PAUSE and the previous one exceeds
PLE_Gap, the processor considers this PAUSE to belong to a new loop.
Otherwise, the processor determines the total execution time of this loop (since
the 1st PAUSE in this loop), and triggers a VM exit if the total time exceeds
PLE_Window.
* Refer to SDM volume 3b, sections 21.6.13 & 22.1.3.

Pause-Loop Exiting can be used to detect Lock-Holder Preemption, where one VP
is scheduled out after taking a spinlock, and other VPs waiting for the same
lock are then scheduled in only to waste CPU time.

Our tests indicate that most spinlocks are held for less than 212 cycles.
Performance tests show that with 2X LP over-commitment we can get a +2% perf
improvement for kernel build (even more perf gain with more LPs).
Signed-off-by: Zhai Edwin <edwin.zhai@intel.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

4b8d54f9
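The PLE_Gap / PLE_Window rule above translates into a short state machine. The sketch below is a software model of the rule as worded in the commit message, not of the hardware. The names `demo_ple` and `demo_ple_on_pause` are hypothetical, and starting a fresh loop after a VM exit is an assumption.

```c
#include <stdint.h>

struct demo_ple {
    uint64_t gap;         /* PLE_Gap: max time between PAUSEs of one loop */
    uint64_t window;      /* PLE_Window: max time allowed inside one loop */
    uint64_t last_pause;  /* timestamp of the previous PAUSE */
    uint64_t loop_start;  /* timestamp of the first PAUSE of this loop */
    int in_loop;          /* whether a previous PAUSE has been recorded */
};

/*
 * Model one PAUSE executed at time `now` (in the same units as gap/window).
 * Returns 1 if this PAUSE triggers a VM exit, 0 otherwise.
 */
int demo_ple_on_pause(struct demo_ple *s, uint64_t now)
{
    int vm_exit = 0;

    if (!s->in_loop || now - s->last_pause > s->gap) {
        /* Too long since the previous PAUSE: this one starts a new loop. */
        s->loop_start = now;
        s->in_loop = 1;
    } else if (now - s->loop_start > s->window) {
        /* Same loop, and it has been spinning for longer than the window. */
        vm_exit = 1;
        s->in_loop = 0;  /* assumption: the next PAUSE starts a new loop */
    }

    s->last_pause = now;
    return vm_exit;
}
```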

KVM: SVM: Remove nsvm_printk debugging code · d36f19e9