提交 0d681009 编写于 作者: L Linus Torvalds

Merge git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-lguest

* git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-lguest: (45 commits)
  Use "struct boot_params" in example launcher
  Loading bzImage directly.
  Revert lguest magic and use hook in head.S
  Update lguest documentation to reflect the new virtual block device name.
  generalize lgread_u32/lgwrite_u32.
  Example launcher handle guests not being ready for input
  Update example launcher for virtio
  Lguest support for Virtio
  Remove old lguest I/O infrrasructure.
  Remove old lguest bus and drivers.
  Virtio helper routines for a descriptor ringbuffer implementation
  Module autoprobing support for virtio drivers.
  Virtio console driver
  Block driver using virtio.
  Net driver using virtio
  Virtio interface
  Boot with virtual == physical to get closer to native Linux.
  Allow guest to specify syscall vector to use.
  Rename "cr3" to "gpgdir" to avoid x86-specific naming.
  Pagetables to use normal kernel types
  ...
# This creates the demonstration utility "lguest" which runs a Linux guest.
# For those people that have a separate object dir, look there for .config
KBUILD_OUTPUT := ../..
ifdef O
ifeq ("$(origin O)", "command line")
KBUILD_OUTPUT := $(O)
endif
endif
# We rely on CONFIG_PAGE_OFFSET to know where to put lguest binary.
include $(KBUILD_OUTPUT)/.config
LGUEST_GUEST_TOP := ($(CONFIG_PAGE_OFFSET) - 0x08000000)
CFLAGS:=-Wall -Wmissing-declarations -Wmissing-prototypes -O3 -Wl,-T,lguest.lds
CFLAGS:=-Wall -Wmissing-declarations -Wmissing-prototypes -O3 -I../../include
LDLIBS:=-lz
# Removing this works for some versions of ld.so (eg. Ubuntu Feisty) and
# not others (eg. FC7).
LDFLAGS+=-static
all: lguest.lds lguest
# The linker script on x86 is so complex the only way of creating one
# which will link our binary in the right place is to mangle the
# default one.
lguest.lds:
$(LD) --verbose | awk '/^==========/ { PRINT=1; next; } /SIZEOF_HEADERS/ { gsub(/0x[0-9A-F]*/, "$(LGUEST_GUEST_TOP)") } { if (PRINT) print $$0; }' > $@
all: lguest
clean:
rm -f lguest.lds lguest
rm -f lguest
此差异已折叠。
......@@ -6,7 +6,7 @@ Lguest is designed to be a minimal hypervisor for the Linux kernel, for
Linux developers and users to experiment with virtualization with the
minimum of complexity. Nonetheless, it should have sufficient
features to make it useful for specific tasks, and, of course, you are
encouraged to fork and enhance it.
encouraged to fork and enhance it (see drivers/lguest/README).
Features:
......@@ -23,19 +23,30 @@ Developer features:
Running Lguest:
- Lguest runs the same kernel as guest and host. You can configure
them differently, but usually it's easiest not to.
- The easiest way to run lguest is to use same kernel as guest and host.
You can configure them differently, but usually it's easiest not to.
You will need to configure your kernel with the following options:
CONFIG_HIGHMEM64G=n ("High Memory Support" "64GB")[1]
CONFIG_TUN=y/m ("Universal TUN/TAP device driver support")
CONFIG_EXPERIMENTAL=y ("Prompt for development and/or incomplete code/drivers")
CONFIG_PARAVIRT=y ("Paravirtualization support (EXPERIMENTAL)")
CONFIG_LGUEST=y/m ("Linux hypervisor example code")
and I recommend:
CONFIG_HZ=100 ("Timer frequency")[2]
"General setup":
"Prompt for development and/or incomplete code/drivers" = Y
(CONFIG_EXPERIMENTAL=y)
"Processor type and features":
"Paravirtualized guest support" = Y
"Lguest guest support" = Y
"High Memory Support" = off/4GB
"Alignment value to which kernel should be aligned" = 0x100000
(CONFIG_PARAVIRT=y, CONFIG_LGUEST_GUEST=y, CONFIG_HIGHMEM64G=n and
CONFIG_PHYSICAL_ALIGN=0x100000)
"Device Drivers":
"Network device support"
"Universal TUN/TAP device driver support" = M/Y
(CONFIG_TUN=m)
"Virtualization"
"Linux hypervisor example code" = M/Y
(CONFIG_LGUEST=m)
- A tool called "lguest" is available in this directory: type "make"
to build it. If you didn't build your kernel in-tree, use "make
......@@ -51,14 +62,17 @@ Running Lguest:
dd if=/dev/zero of=rootfile bs=1M count=2048
qemu -cdrom image.iso -hda rootfile -net user -net nic -boot d
Make sure that you install a getty on /dev/hvc0 if you want to log in on the
console!
- "modprobe lg" if you built it as a module.
- Run an lguest as root:
Documentation/lguest/lguest 64m vmlinux --tunnet=192.168.19.1 --block=rootfile root=/dev/lgba
Documentation/lguest/lguest 64 vmlinux --tunnet=192.168.19.1 --block=rootfile root=/dev/vda
Explanation:
64m: the amount of memory to use.
64: the amount of memory to use, in MB.
vmlinux: the kernel image found in the top of your build directory. You
can also use a standard bzImage.
......@@ -66,10 +80,10 @@ Running Lguest:
--tunnet=192.168.19.1: configures a "tap" device for networking with this
IP address.
--block=rootfile: a file or block device which becomes /dev/lgba
--block=rootfile: a file or block device which becomes /dev/vda
inside the guest.
root=/dev/lgba: this (and anything else on the command line) are
root=/dev/vda: this (and anything else on the command line) are
kernel boot parameters.
- Configuring networking. I usually have the host masquerade, using
......@@ -99,31 +113,7 @@ Running Lguest:
"--sharenet=<filename>": any two guests using the same file are on
the same network. This file is created if it does not exist.
Lguest I/O model:
Lguest uses a simplified DMA model plus shared memory for I/O. Guests
can communicate with each other if they share underlying memory
(usually by the lguest program mmaping the same file), but they can
use any non-shared memory to communicate with the lguest process.
Guests can register DMA buffers at any key (must be a valid physical
address) using the LHCALL_BIND_DMA(key, dmabufs, num<<8|irq)
hypercall. "dmabufs" is the physical address of an array of "num"
"struct lguest_dma": each contains a used_len, and an array of
physical addresses and lengths. When a transfer occurs, the
"used_len" field of one of the buffers which has used_len 0 will be
set to the length transferred and the irq will fire.
There is a helpful mailing list at http://ozlabs.org/mailman/listinfo/lguest
Using an irq value of 0 unbinds the dma buffers.
To send DMA, the LHCALL_SEND_DMA(key, dma_physaddr) hypercall is used,
and the bytes used is written to the used_len field. This can be 0 if
noone else has bound a DMA buffer to that key or some other error.
DMA buffers bound by the same guest are ignored.
Cheers!
Good luck!
Rusty Russell rusty@rustcorp.com.au.
[1] These are on various places on the TODO list, waiting for you to
get annoyed enough at the limitation to fix it.
[2] Lguest is not yet tickless when idle. See [1].
......@@ -227,28 +227,40 @@ config SCHED_NO_NO_OMIT_FRAME_POINTER
If in doubt, say "Y".
config PARAVIRT
bool "Paravirtualization support (EXPERIMENTAL)"
depends on EXPERIMENTAL
bool
depends on !(X86_VISWS || X86_VOYAGER)
help
Paravirtualization is a way of running multiple instances of
Linux on the same machine, under a hypervisor. This option
changes the kernel so it can modify itself when it is run
under a hypervisor, improving performance significantly.
However, when run without a hypervisor the kernel is
theoretically slower. If in doubt, say N.
This changes the kernel so it can modify itself when it is run
under a hypervisor, potentially improving performance significantly
over full virtualization. However, when run without a hypervisor
the kernel is theoretically slower and slightly larger.
menuconfig PARAVIRT_GUEST
bool "Paravirtualized guest support"
help
Say Y here to get to see options related to running Linux under
various hypervisors. This option alone does not add any kernel code.
If you say N, all options in this submenu will be skipped and disabled.
if PARAVIRT_GUEST
source "arch/x86/xen/Kconfig"
config VMI
bool "VMI Paravirt-ops support"
depends on PARAVIRT
bool "VMI Guest support"
select PARAVIRT
depends on !(X86_VISWS || X86_VOYAGER)
help
VMI provides a paravirtualized interface to the VMware ESX server
(it could be used by other hypervisors in theory too, but is not
at the moment), by linking the kernel to a GPL-ed ROM module
provided by the hypervisor.
source "arch/x86/lguest/Kconfig"
endif
config ACPI_SRAT
bool
default y
......
......@@ -99,6 +99,9 @@ core-$(CONFIG_X86_ES7000) := arch/x86/mach-es7000/
# Xen paravirtualization support
core-$(CONFIG_XEN) += arch/x86/xen/
# lguest paravirtualization support
core-$(CONFIG_LGUEST_GUEST) += arch/x86/lguest/
# default subarch .h files
mflags-y += -Iinclude/asm-x86/mach-default
......
......@@ -136,6 +136,7 @@ void foo(void)
#ifdef CONFIG_LGUEST_GUEST
BLANK();
OFFSET(LGUEST_DATA_irq_enabled, lguest_data, irq_enabled);
OFFSET(LGUEST_DATA_pgdir, lguest_data, pgdir);
OFFSET(LGUEST_PAGES_host_gdt_desc, lguest_pages, state.host_gdt_desc);
OFFSET(LGUEST_PAGES_host_idt_desc, lguest_pages, state.host_idt_desc);
OFFSET(LGUEST_PAGES_host_cr3, lguest_pages, state.host_cr3);
......
config LGUEST_GUEST
bool "Lguest guest support"
select PARAVIRT
depends on !X86_PAE
select VIRTIO
select VIRTIO_RING
select VIRTIO_CONSOLE
help
Lguest is a tiny in-kernel hypervisor. Selecting this will
allow your kernel to boot under lguest. This option will increase
your kernel size by about 6k. If in doubt, say N.
If you say Y here, make sure you say Y (or M) to the virtio block
and net drivers which lguest needs.
obj-y := i386_head.o boot.o
......@@ -55,7 +55,7 @@
#include <linux/clockchips.h>
#include <linux/lguest.h>
#include <linux/lguest_launcher.h>
#include <linux/lguest_bus.h>
#include <linux/virtio_console.h>
#include <asm/paravirt.h>
#include <asm/param.h>
#include <asm/page.h>
......@@ -65,6 +65,7 @@
#include <asm/e820.h>
#include <asm/mce.h>
#include <asm/io.h>
#include <asm/i387.h>
/*G:010 Welcome to the Guest!
*
......@@ -85,9 +86,10 @@ struct lguest_data lguest_data = {
.hcall_status = { [0 ... LHCALL_RING_SIZE-1] = 0xFF },
.noirq_start = (u32)lguest_noirq_start,
.noirq_end = (u32)lguest_noirq_end,
.kernel_address = PAGE_OFFSET,
.blocked_interrupts = { 1 }, /* Block timer interrupts */
.syscall_vec = SYSCALL_VECTOR,
};
struct lguest_device_desc *lguest_devices;
static cycle_t clock_base;
/*G:035 Notice the lazy_hcall() above, rather than hcall(). This is our first
......@@ -146,10 +148,10 @@ void async_hcall(unsigned long call,
/* Table full, so do normal hcall which will flush table. */
hcall(call, arg1, arg2, arg3);
} else {
lguest_data.hcalls[next_call].eax = call;
lguest_data.hcalls[next_call].edx = arg1;
lguest_data.hcalls[next_call].ebx = arg2;
lguest_data.hcalls[next_call].ecx = arg3;
lguest_data.hcalls[next_call].arg0 = call;
lguest_data.hcalls[next_call].arg1 = arg1;
lguest_data.hcalls[next_call].arg2 = arg2;
lguest_data.hcalls[next_call].arg3 = arg3;
/* Arguments must all be written before we mark it to go */
wmb();
lguest_data.hcall_status[next_call] = 0;
......@@ -160,46 +162,6 @@ void async_hcall(unsigned long call,
}
/*:*/
/* Wrappers for the SEND_DMA and BIND_DMA hypercalls. This is mainly because
* Jeff Garzik complained that __pa() should never appear in drivers, and this
* helps remove most of them. But also, it wraps some ugliness. */
void lguest_send_dma(unsigned long key, struct lguest_dma *dma)
{
/* The hcall might not write this if something goes wrong */
dma->used_len = 0;
hcall(LHCALL_SEND_DMA, key, __pa(dma), 0);
}
int lguest_bind_dma(unsigned long key, struct lguest_dma *dmas,
unsigned int num, u8 irq)
{
/* This is the only hypercall which actually wants 5 arguments, and we
* only support 4. Fortunately the interrupt number is always less
* than 256, so we can pack it with the number of dmas in the final
* argument. */
if (!hcall(LHCALL_BIND_DMA, key, __pa(dmas), (num << 8) | irq))
return -ENOMEM;
return 0;
}
/* Unbinding is the same hypercall as binding, but with 0 num & irq. */
void lguest_unbind_dma(unsigned long key, struct lguest_dma *dmas)
{
hcall(LHCALL_BIND_DMA, key, __pa(dmas), 0);
}
/* For guests, device memory can be used as normal memory, so we cast away the
* __iomem to quieten sparse. */
void *lguest_map(unsigned long phys_addr, unsigned long pages)
{
return (__force void *)ioremap(phys_addr, PAGE_SIZE*pages);
}
void lguest_unmap(void *addr)
{
iounmap((__force void __iomem *)addr);
}
/*G:033
* Here are our first native-instruction replacements: four functions for
* interrupt control.
......@@ -680,6 +642,7 @@ static struct clocksource lguest_clock = {
.mask = CLOCKSOURCE_MASK(64),
.mult = 1 << 22,
.shift = 22,
.flags = CLOCK_SOURCE_IS_CONTINUOUS,
};
/* The "scheduler clock" is just our real clock, adjusted to start at zero */
......@@ -761,11 +724,9 @@ static void lguest_time_init(void)
* the TSC, otherwise it's a dumb nanosecond-resolution clock. Either
* way, the "rating" is initialized so high that it's always chosen
* over any other clocksource. */
if (lguest_data.tsc_khz) {
if (lguest_data.tsc_khz)
lguest_clock.mult = clocksource_khz2mult(lguest_data.tsc_khz,
lguest_clock.shift);
lguest_clock.flags = CLOCK_SOURCE_IS_CONTINUOUS;
}
clock_base = lguest_clock_read();
clocksource_register(&lguest_clock);
......@@ -889,6 +850,23 @@ static __init char *lguest_memory_setup(void)
return "LGUEST";
}
/* Before virtqueues are set up, we use LHCALL_NOTIFY on normal memory to
* produce console output. */
static __init int early_put_chars(u32 vtermno, const char *buf, int count)
{
char scratch[17];
unsigned int len = count;
if (len > sizeof(scratch) - 1)
len = sizeof(scratch) - 1;
scratch[len] = '\0';
memcpy(scratch, buf, len);
hcall(LHCALL_NOTIFY, __pa(scratch), 0, 0);
/* This routine returns the number of bytes actually written. */
return len;
}
/*G:050
* Patching (Powerfully Placating Performance Pedants)
*
......@@ -950,18 +928,8 @@ static unsigned lguest_patch(u8 type, u16 clobber, void *ibuf,
/*G:030 Once we get to lguest_init(), we know we're a Guest. The pv_ops
* structures in the kernel provide points for (almost) every routine we have
* to override to avoid privileged instructions. */
__init void lguest_init(void *boot)
__init void lguest_init(void)
{
/* Copy boot parameters first: the Launcher put the physical location
* in %esi, and head.S converted that to a virtual address and handed
* it to us. We use "__memcpy" because "memcpy" sometimes tries to do
* tricky things to go faster, and we're not ready for that. */
__memcpy(&boot_params, boot, PARAM_SIZE);
/* The boot parameters also tell us where the command-line is: save
* that, too. */
__memcpy(boot_command_line, __va(boot_params.hdr.cmd_line_ptr),
COMMAND_LINE_SIZE);
/* We're under lguest, paravirt is enabled, and we're running at
* privilege level 1, not 0 as normal. */
pv_info.name = "lguest";
......@@ -1033,11 +1001,7 @@ __init void lguest_init(void *boot)
/*G:070 Now we've seen all the paravirt_ops, we return to
* lguest_init() where the rest of the fairly chaotic boot setup
* occurs.
*
* The Host expects our first hypercall to tell it where our "struct
* lguest_data" is, so we do that first. */
hcall(LHCALL_LGUEST_INIT, __pa(&lguest_data), 0, 0);
* occurs. */
/* The native boot code sets up initial page tables immediately after
* the kernel itself, and sets init_pg_tables_end so they're not
......@@ -1050,11 +1014,6 @@ __init void lguest_init(void *boot)
* the normal data segment to get through booting. */
asm volatile ("mov %0, %%fs" : : "r" (__KERNEL_DS) : "memory");
/* Clear the part of the kernel data which is expected to be zero.
* Normally it will be anyway, but if we're loading from a bzImage with
* CONFIG_RELOCATALE=y, the relocations will be sitting here. */
memset(__bss_start, 0, __bss_stop - __bss_start);
/* The Host uses the top of the Guest's virtual address space for the
* Host<->Guest Switcher, and it tells us how much it needs in
* lguest_data.reserve_mem, set up on the LGUEST_INIT hypercall. */
......@@ -1092,6 +1051,9 @@ __init void lguest_init(void *boot)
* adapted for lguest's use. */
add_preferred_console("hvc", 0, NULL);
/* Register our very early console. */
virtio_cons_early_init(early_put_chars);
/* Last of all, we set the power management poweroff hook to point to
* the Guest routine to power off. */
pm_power_off = lguest_power_off;
......
#include <linux/linkage.h>
#include <linux/lguest.h>
#include <asm/lguest_hcall.h>
#include <asm/asm-offsets.h>
#include <asm/thread_info.h>
#include <asm/processor-flags.h>
/*G:020 This is where we begin: we have a magic signature which the launcher
* looks for. The plan is that the Linux boot protocol will be extended with a
* "platform type" field which will guide us here from the normal entry point,
* but for the moment this suffices. The normal boot code uses %esi for the
* boot header, so we do too. We convert it to a virtual address by adding
* PAGE_OFFSET, and hand it to lguest_init() as its argument (ie. %eax).
/*G:020 This is where we begin: head.S notes that the boot header's platform
* type field is "1" (lguest), so calls us here. The boot header is in %esi.
*
* WARNING: be very careful here! We're running at addresses equal to physical
* addesses (around 0), not above PAGE_OFFSET as most code expectes
* (eg. 0xC0000000). Jumps are relative, so they're OK, but we can't touch any
* data.
*
* The .section line puts this code in .init.text so it will be discarded after
* boot. */
.section .init.text, "ax", @progbits
.ascii "GenuineLguest"
/* Set up initial stack. */
ENTRY(lguest_entry)
/* Make initial hypercall now, so we can set up the pagetables. */
movl $LHCALL_LGUEST_INIT, %eax
movl $lguest_data - __PAGE_OFFSET, %edx
int $LGUEST_TRAP_ENTRY
/* The Host put the toplevel pagetable in lguest_data.pgdir. The movsl
* instruction uses %esi implicitly. */
movl lguest_data - __PAGE_OFFSET + LGUEST_DATA_pgdir, %esi
/* Copy first 32 entries of page directory to __PAGE_OFFSET entries.
* This means the first 128M of kernel memory will be mapped at
* PAGE_OFFSET where the kernel expects to run. This will get it far
* enough through boot to switch to its own pagetables. */
movl $32, %ecx
movl %esi, %edi
addl $((__PAGE_OFFSET >> 22) * 4), %edi
rep
movsl
/* Set up the initial stack so we can run C code. */
movl $(init_thread_union+THREAD_SIZE),%esp
movl %esi, %eax
addl $__PAGE_OFFSET, %eax
jmp lguest_init
/* Jumps are relative, and we're running __PAGE_OFFSET too low at the
* moment. */
jmp lguest_init+__PAGE_OFFSET
/*G:055 We create a macro which puts the assembler code between lgstart_ and
* lgend_ markers. These templates are put in the .text section: they can't be
......
......@@ -3,8 +3,9 @@
#
config XEN
bool "Enable support for Xen hypervisor"
depends on PARAVIRT && X86_CMPXCHG && X86_TSC && !NEED_MULTIPLE_NODES
bool "Xen guest support"
select PARAVIRT
depends on X86_CMPXCHG && X86_TSC && !NEED_MULTIPLE_NODES && !(X86_VISWS || X86_VOYAGER)
help
This is the Linux Xen port. Enabling this will allow the
kernel to boot in a paravirtualized environment under the
......
......@@ -94,5 +94,5 @@ source "drivers/kvm/Kconfig"
source "drivers/uio/Kconfig"
source "drivers/lguest/Kconfig"
source "drivers/virtio/Kconfig"
endmenu
......@@ -91,3 +91,4 @@ obj-$(CONFIG_HID) += hid/
obj-$(CONFIG_PPC_PS3) += ps3/
obj-$(CONFIG_OF) += of/
obj-$(CONFIG_SSB) += ssb/
obj-$(CONFIG_VIRTIO) += virtio/
......@@ -425,4 +425,10 @@ config XEN_BLKDEV_FRONTEND
block device driver. It communicates with a back-end driver
in another domain which drives the actual block device.
config VIRTIO_BLK
tristate "Virtio block driver (EXPERIMENTAL)"
depends on EXPERIMENTAL && VIRTIO
---help---
This is the virtual block driver for lguest. Say Y or M.
endif # BLK_DEV
......@@ -25,10 +25,10 @@ obj-$(CONFIG_SUNVDC) += sunvdc.o
obj-$(CONFIG_BLK_DEV_UMEM) += umem.o
obj-$(CONFIG_BLK_DEV_NBD) += nbd.o
obj-$(CONFIG_BLK_DEV_CRYPTOLOOP) += cryptoloop.o
obj-$(CONFIG_VIRTIO_BLK) += virtio_blk.o
obj-$(CONFIG_VIODASD) += viodasd.o
obj-$(CONFIG_BLK_DEV_SX8) += sx8.o
obj-$(CONFIG_BLK_DEV_UB) += ub.o
obj-$(CONFIG_XEN_BLKDEV_FRONTEND) += xen-blkfront.o
obj-$(CONFIG_LGUEST_BLOCK) += lguest_blk.o
/*D:400
* The Guest block driver
*
* This is a simple block driver, which appears as /dev/lgba, lgbb, lgbc etc.
* The mechanism is simple: we place the information about the request in the
* device page, then use SEND_DMA (containing the data for a write, or an empty
* "ping" DMA for a read).
:*/
/* Copyright 2006 Rusty Russell <rusty@rustcorp.com.au> IBM Corporation
*
* This program is free software; you can redistribute it and/or modify
* it under the terms of the GNU General Public License as published by
* the Free Software Foundation; either version 2 of the License, or
* (at your option) any later version.
*
* This program is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with this program; if not, write to the Free Software
* Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
*/
//#define DEBUG
#include <linux/init.h>
#include <linux/types.h>
#include <linux/blkdev.h>
#include <linux/interrupt.h>
#include <linux/lguest_bus.h>
static char next_block_index = 'a';
/*D:420 Here is the structure which holds all the information we need about
* each Guest block device.
*
* I'm sure at this stage, you're wondering "hey, where was the adventure I was
* promised?" and thinking "Rusty sucks, I shall say nasty things about him on
* my blog". I think Real adventures have boring bits, too, and you're in the
* middle of one. But it gets better. Just not quite yet. */
struct blockdev
{
/* The block queue infrastructure wants a spinlock: it is held while it
* calls our block request function. We grab it in our interrupt
* handler so the responses don't mess with new requests. */
spinlock_t lock;
/* The disk structure registered with kernel. */
struct gendisk *disk;
/* The major device number for this disk, and the interrupt. We only
* really keep them here for completeness; we'd need them if we
* supported device unplugging. */
int major;
int irq;
/* The physical address of this device's memory page */
unsigned long phys_addr;
/* The mapped memory page for convenient acces. */
struct lguest_block_page *lb_page;
/* We only have a single request outstanding at a time: this is it. */
struct lguest_dma dma;
struct request *req;
};
/*D:495 We originally used end_request() throughout the driver, but it turns
* out that end_request() is deprecated, and doesn't actually end the request
* (which seems like a good reason to deprecate it!). It simply ends the first
* bio. So if we had 3 bios in a "struct request" we would do all 3,
* end_request(), do 2, end_request(), do 1 and end_request(): twice as much
* work as we needed to do.
*
* This reinforced to me that I do not understand the block layer.
*
* Nonetheless, Jens Axboe gave me this nice helper to end all chunks of a
* request. This improved disk speed by 130%. */
static void end_entire_request(struct request *req, int uptodate)
{
if (end_that_request_first(req, uptodate, req->hard_nr_sectors))
BUG();
add_disk_randomness(req->rq_disk);
blkdev_dequeue_request(req);
end_that_request_last(req, uptodate);
}
/* I'm told there are only two stories in the world worth telling: love and
* hate. So there used to be a love scene here like this:
*
* Launcher: We could make beautiful I/O together, you and I.
* Guest: My, that's a big disk!
*
* Unfortunately, it was just too raunchy for our otherwise-gentle tale. */
/*D:490 This is the interrupt handler, called when a block read or write has
* been completed for us. */
static irqreturn_t lgb_irq(int irq, void *_bd)
{
/* We handed our "struct blockdev" as the argument to request_irq(), so
* it is passed through to us here. This tells us which device we're
* dealing with in case we have more than one. */
struct blockdev *bd = _bd;
unsigned long flags;
/* We weren't doing anything? Strange, but could happen if we shared
* interrupts (we don't!). */
if (!bd->req) {
pr_debug("No work!\n");
return IRQ_NONE;
}
/* Not done yet? That's equally strange. */
if (!bd->lb_page->result) {
pr_debug("No result!\n");
return IRQ_NONE;
}
/* We have to grab the lock before ending the request. */
spin_lock_irqsave(&bd->lock, flags);
/* "result" is 1 for success, 2 for failure: end_entire_request() wants
* to know whether this succeeded or not. */
end_entire_request(bd->req, bd->lb_page->result == 1);
/* Clear out request, it's done. */
bd->req = NULL;
/* Reset incoming DMA for next time. */
bd->dma.used_len = 0;
/* Ready for more reads or writes */
blk_start_queue(bd->disk->queue);
spin_unlock_irqrestore(&bd->lock, flags);
/* The interrupt was for us, we dealt with it. */
return IRQ_HANDLED;
}
/*D:480 The block layer's "struct request" contains a number of "struct bio"s,
* each of which contains "struct bio_vec"s, each of which contains a page, an
* offset and a length.
*
* Fortunately there are iterators to help us walk through the "struct
* request". Even more fortunately, there were plenty of places to steal the
* code from. We pack the "struct request" into our "struct lguest_dma" and
* return the total length. */
static unsigned int req_to_dma(struct request *req, struct lguest_dma *dma)
{
unsigned int i = 0, len = 0;
struct req_iterator iter;
struct bio_vec *bvec;
rq_for_each_segment(bvec, req, iter) {
/* We told the block layer not to give us too many. */
BUG_ON(i == LGUEST_MAX_DMA_SECTIONS);
/* If we had a zero-length segment, it would look like
* the end of the data referred to by the "struct
* lguest_dma", so make sure that doesn't happen. */
BUG_ON(!bvec->bv_len);
/* Convert page & offset to a physical address */
dma->addr[i] = page_to_phys(bvec->bv_page)
+ bvec->bv_offset;
dma->len[i] = bvec->bv_len;
len += bvec->bv_len;
i++;
}
/* If the array isn't full, we mark the end with a 0 length */
if (i < LGUEST_MAX_DMA_SECTIONS)
dma->len[i] = 0;
return len;
}
/* This creates an empty DMA, useful for prodding the Host without sending data
* (ie. when we want to do a read) */
static void empty_dma(struct lguest_dma *dma)
{
dma->len[0] = 0;
}
/*D:470 Setting up a request is fairly easy: */
static void setup_req(struct blockdev *bd,
int type, struct request *req, struct lguest_dma *dma)
{
/* The type is 1 (write) or 0 (read). */
bd->lb_page->type = type;
/* The sector on disk where the read or write starts. */
bd->lb_page->sector = req->sector;
/* The result is initialized to 0 (unfinished). */
bd->lb_page->result = 0;
/* The current request (so we can end it in the interrupt handler). */
bd->req = req;
/* The number of bytes: returned as a side-effect of req_to_dma(),
* which packs the block layer's "struct request" into our "struct
* lguest_dma" */
bd->lb_page->bytes = req_to_dma(req, dma);
}
/*D:450 Write is pretty straightforward: we pack the request into a "struct
* lguest_dma", then use SEND_DMA to send the request. */
static void do_write(struct blockdev *bd, struct request *req)
{
struct lguest_dma send;
pr_debug("lgb: WRITE sector %li\n", (long)req->sector);
setup_req(bd, 1, req, &send);
lguest_send_dma(bd->phys_addr, &send);
}
/* Read is similar to write, except we pack the request into our receive
* "struct lguest_dma" and send through an empty DMA just to tell the Host that
* there's a request pending. */
static void do_read(struct blockdev *bd, struct request *req)
{
struct lguest_dma ping;
pr_debug("lgb: READ sector %li\n", (long)req->sector);
setup_req(bd, 0, req, &bd->dma);
empty_dma(&ping);
lguest_send_dma(bd->phys_addr, &ping);
}
/*D:440 This where requests come in: we get handed the request queue and are
* expected to pull a "struct request" off it until we've finished them or
* we're waiting for a reply: */
static void do_lgb_request(struct request_queue *q)
{
struct blockdev *bd;
struct request *req;
again:
/* This sometimes returns NULL even on the very first time around. I
* wonder if it's something to do with letting elves handle the request
* queue... */
req = elv_next_request(q);
if (!req)
return;
/* We attached the struct blockdev to the disk: get it back */
bd = req->rq_disk->private_data;
/* Sometimes we get repeated requests after blk_stop_queue(), but we
* can only handle one at a time. */
if (bd->req)
return;
/* We only do reads and writes: no tricky business! */
if (!blk_fs_request(req)) {
pr_debug("Got non-command 0x%08x\n", req->cmd_type);
req->errors++;
end_entire_request(req, 0);
goto again;
}
if (rq_data_dir(req) == WRITE)
do_write(bd, req);
else
do_read(bd, req);
/* We've put out the request, so stop any more coming in until we get
* an interrupt, which takes us to lgb_irq() to re-enable the queue. */
blk_stop_queue(q);
}
/*D:430 This is the "struct block_device_operations" we attach to the disk at
* the end of lguestblk_probe(). It doesn't seem to want much. */
static struct block_device_operations lguestblk_fops = {
.owner = THIS_MODULE,
};
/*D:425 Setting up a disk device seems to involve a lot of code. I'm not sure
* quite why. I do know that the IDE code sent two or three of the maintainers
* insane, perhaps this is the fringe of the same disease?
*
* As in the console code, the probe function gets handed the generic
* lguest_device from lguest_bus.c: */
static int lguestblk_probe(struct lguest_device *lgdev)
{
struct blockdev *bd;
int err;
int irqflags = IRQF_SHARED;
/* First we allocate our own "struct blockdev" and initialize the easy
* fields. */
bd = kmalloc(sizeof(*bd), GFP_KERNEL);
if (!bd)
return -ENOMEM;
spin_lock_init(&bd->lock);
bd->irq = lgdev_irq(lgdev);
bd->req = NULL;
bd->dma.used_len = 0;
bd->dma.len[0] = 0;
/* The descriptor in the lguest_devices array provided by the Host
* gives the Guest the physical page number of the device's page. */
bd->phys_addr = (lguest_devices[lgdev->index].pfn << PAGE_SHIFT);
/* We use lguest_map() to get a pointer to the device page */
bd->lb_page = lguest_map(bd->phys_addr, 1);
if (!bd->lb_page) {
err = -ENOMEM;
goto out_free_bd;
}
/* We need a major device number: 0 means "assign one dynamically". */
bd->major = register_blkdev(0, "lguestblk");
if (bd->major < 0) {
err = bd->major;
goto out_unmap;
}
/* This allocates a "struct gendisk" where we pack all the information
* about the disk which the rest of Linux sees. The argument is the
* number of minor devices desired: we need one minor for the main
* disk, and one for each partition. Of course, we can't possibly know
* how many partitions are on the disk (add_disk does that).
*/
bd->disk = alloc_disk(16);
if (!bd->disk) {
err = -ENOMEM;
goto out_unregister_blkdev;
}
/* Every disk needs a queue for requests to come in: we set up the
* queue with a callback function (the core of our driver) and the lock
* to use. */
bd->disk->queue = blk_init_queue(do_lgb_request, &bd->lock);
if (!bd->disk->queue) {
err = -ENOMEM;
goto out_put_disk;
}
/* We can only handle a certain number of pointers in our SEND_DMA
* call, so we set that with blk_queue_max_hw_segments(). This is not
* to be confused with blk_queue_max_phys_segments() of course! I
* know, who could possibly confuse the two?
*
* Well, it's simple to tell them apart: this one seems to work and the
* other one didn't. */
blk_queue_max_hw_segments(bd->disk->queue, LGUEST_MAX_DMA_SECTIONS);
/* Due to technical limitations of our Host (and simple coding) we
* can't have a single buffer which crosses a page boundary. Tell it
* here. This means that our maximum request size is 16
* (LGUEST_MAX_DMA_SECTIONS) pages. */
blk_queue_segment_boundary(bd->disk->queue, PAGE_SIZE-1);
/* We name our disk: this becomes the device name when udev does its
* magic thing and creates the device node, such as /dev/lgba.
* next_block_index is a global which starts at 'a'. Unfortunately
* this simple increment logic means that the 27th disk will be called
* "/dev/lgb{". In that case, I recommend having at least 29 disks, so
* your /dev directory will be balanced. */
sprintf(bd->disk->disk_name, "lgb%c", next_block_index++);
/* We look to the device descriptor again to see if this device's
* interrupts are expected to be random. If they are, we tell the irq
* subsystem. At the moment this bit is always set. */
if (lguest_devices[lgdev->index].features & LGUEST_DEVICE_F_RANDOMNESS)
irqflags |= IRQF_SAMPLE_RANDOM;
/* Now we have the name and irqflags, we can request the interrupt; we
* give it the "struct blockdev" we have set up to pass to lgb_irq()
* when there is an interrupt. */
err = request_irq(bd->irq, lgb_irq, irqflags, bd->disk->disk_name, bd);
if (err)
goto out_cleanup_queue;
/* We bind our one-entry DMA pool to the key for this block device so
* the Host can reply to our requests. The key is equal to the
* physical address of the device's page, which is conveniently
* unique. */
err = lguest_bind_dma(bd->phys_addr, &bd->dma, 1, bd->irq);
if (err)
goto out_free_irq;
/* We finish our disk initialization and add the disk to the system. */
bd->disk->major = bd->major;
bd->disk->first_minor = 0;
bd->disk->private_data = bd;
bd->disk->fops = &lguestblk_fops;
/* This is initialized to the disk size by the Launcher. */
set_capacity(bd->disk, bd->lb_page->num_sectors);
add_disk(bd->disk);
printk(KERN_INFO "%s: device %i at major %d\n",
bd->disk->disk_name, lgdev->index, bd->major);
/* We don't need to keep the "struct blockdev" around, but if we ever
* implemented device removal, we'd need this. */
lgdev->private = bd;
return 0;
out_free_irq:
free_irq(bd->irq, bd);
out_cleanup_queue:
blk_cleanup_queue(bd->disk->queue);
out_put_disk:
put_disk(bd->disk);
out_unregister_blkdev:
unregister_blkdev(bd->major, "lguestblk");
out_unmap:
lguest_unmap(bd->lb_page);
out_free_bd:
kfree(bd);
return err;
}
/*D:410 The boilerplate code for registering the lguest block driver is just
* like the console: */
static struct lguest_driver lguestblk_drv = {
.name = "lguestblk",
.owner = THIS_MODULE,
.device_type = LGUEST_DEVICE_T_BLOCK,
.probe = lguestblk_probe,
};
static __init int lguestblk_init(void)
{
return register_lguest_driver(&lguestblk_drv);
}
module_init(lguestblk_init);
MODULE_DESCRIPTION("Lguest block driver");
MODULE_LICENSE("GPL");
//#define DEBUG
#include <linux/spinlock.h>
#include <linux/blkdev.h>
#include <linux/hdreg.h>
#include <linux/virtio.h>
#include <linux/virtio_blk.h>
#include <linux/virtio_blk.h>
static unsigned char virtblk_index = 'a';
struct virtio_blk
{
spinlock_t lock;
struct virtio_device *vdev;
struct virtqueue *vq;
/* The disk structure for the kernel. */
struct gendisk *disk;
/* Request tracking. */
struct list_head reqs;
mempool_t *pool;
/* Scatterlist: can be too big for stack. */
struct scatterlist sg[3+MAX_PHYS_SEGMENTS];
};
struct virtblk_req
{
struct list_head list;
struct request *req;
struct virtio_blk_outhdr out_hdr;
struct virtio_blk_inhdr in_hdr;
};
static bool blk_done(struct virtqueue *vq)
{
struct virtio_blk *vblk = vq->vdev->priv;
struct virtblk_req *vbr;
unsigned int len;
unsigned long flags;
spin_lock_irqsave(&vblk->lock, flags);
while ((vbr = vblk->vq->vq_ops->get_buf(vblk->vq, &len)) != NULL) {
int uptodate;
switch (vbr->in_hdr.status) {
case VIRTIO_BLK_S_OK:
uptodate = 1;
break;
case VIRTIO_BLK_S_UNSUPP:
uptodate = -ENOTTY;
break;
default:
uptodate = 0;
break;
}
end_dequeued_request(vbr->req, uptodate);
list_del(&vbr->list);
mempool_free(vbr, vblk->pool);
}
/* In case queue is stopped waiting for more buffers. */
blk_start_queue(vblk->disk->queue);
spin_unlock_irqrestore(&vblk->lock, flags);
return true;
}
static bool do_req(struct request_queue *q, struct virtio_blk *vblk,
struct request *req)
{
unsigned long num, out, in;
struct virtblk_req *vbr;
vbr = mempool_alloc(vblk->pool, GFP_ATOMIC);
if (!vbr)
/* When another request finishes we'll try again. */
return false;
vbr->req = req;
if (blk_fs_request(vbr->req)) {
vbr->out_hdr.type = 0;
vbr->out_hdr.sector = vbr->req->sector;
vbr->out_hdr.ioprio = vbr->req->ioprio;
} else if (blk_pc_request(vbr->req)) {
vbr->out_hdr.type = VIRTIO_BLK_T_SCSI_CMD;
vbr->out_hdr.sector = 0;
vbr->out_hdr.ioprio = vbr->req->ioprio;
} else {
/* We don't put anything else in the queue. */
BUG();
}
if (blk_barrier_rq(vbr->req))
vbr->out_hdr.type |= VIRTIO_BLK_T_BARRIER;
/* We have to zero this, otherwise blk_rq_map_sg gets upset. */
memset(vblk->sg, 0, sizeof(vblk->sg));
sg_set_buf(&vblk->sg[0], &vbr->out_hdr, sizeof(vbr->out_hdr));
num = blk_rq_map_sg(q, vbr->req, vblk->sg+1);
sg_set_buf(&vblk->sg[num+1], &vbr->in_hdr, sizeof(vbr->in_hdr));
if (rq_data_dir(vbr->req) == WRITE) {
vbr->out_hdr.type |= VIRTIO_BLK_T_OUT;
out = 1 + num;
in = 1;
} else {
vbr->out_hdr.type |= VIRTIO_BLK_T_IN;
out = 1;
in = 1 + num;
}
if (vblk->vq->vq_ops->add_buf(vblk->vq, vblk->sg, out, in, vbr)) {
mempool_free(vbr, vblk->pool);
return false;
}
list_add_tail(&vbr->list, &vblk->reqs);
return true;
}
static void do_virtblk_request(struct request_queue *q)
{
struct virtio_blk *vblk = NULL;
struct request *req;
unsigned int issued = 0;
while ((req = elv_next_request(q)) != NULL) {
vblk = req->rq_disk->private_data;
BUG_ON(req->nr_phys_segments > ARRAY_SIZE(vblk->sg));
/* If this request fails, stop queue and wait for something to
finish to restart it. */
if (!do_req(q, vblk, req)) {
blk_stop_queue(q);
break;
}
blkdev_dequeue_request(req);
issued++;
}
if (issued)
vblk->vq->vq_ops->kick(vblk->vq);
}
static int virtblk_ioctl(struct inode *inode, struct file *filp,
unsigned cmd, unsigned long data)
{
return scsi_cmd_ioctl(filp, inode->i_bdev->bd_disk->queue,
inode->i_bdev->bd_disk, cmd,
(void __user *)data);
}
static struct block_device_operations virtblk_fops = {
.ioctl = virtblk_ioctl,
.owner = THIS_MODULE,
};
static int virtblk_probe(struct virtio_device *vdev)
{
struct virtio_blk *vblk;
int err, major;
void *token;
unsigned int len;
u64 cap;
u32 v;
vdev->priv = vblk = kmalloc(sizeof(*vblk), GFP_KERNEL);
if (!vblk) {
err = -ENOMEM;
goto out;
}
INIT_LIST_HEAD(&vblk->reqs);
spin_lock_init(&vblk->lock);
vblk->vdev = vdev;
/* We expect one virtqueue, for output. */
vblk->vq = vdev->config->find_vq(vdev, blk_done);
if (IS_ERR(vblk->vq)) {
err = PTR_ERR(vblk->vq);
goto out_free_vblk;
}
vblk->pool = mempool_create_kmalloc_pool(1,sizeof(struct virtblk_req));
if (!vblk->pool) {
err = -ENOMEM;
goto out_free_vq;
}
major = register_blkdev(0, "virtblk");
if (major < 0) {
err = major;
goto out_mempool;
}
/* FIXME: How many partitions? How long is a piece of string? */
vblk->disk = alloc_disk(1 << 4);
if (!vblk->disk) {
err = -ENOMEM;
goto out_unregister_blkdev;
}
vblk->disk->queue = blk_init_queue(do_virtblk_request, &vblk->lock);
if (!vblk->disk->queue) {
err = -ENOMEM;
goto out_put_disk;
}
sprintf(vblk->disk->disk_name, "vd%c", virtblk_index++);
vblk->disk->major = major;
vblk->disk->first_minor = 0;
vblk->disk->private_data = vblk;
vblk->disk->fops = &virtblk_fops;
/* If barriers are supported, tell block layer that queue is ordered */
token = vdev->config->find(vdev, VIRTIO_CONFIG_BLK_F, &len);
if (virtio_use_bit(vdev, token, len, VIRTIO_BLK_F_BARRIER))
blk_queue_ordered(vblk->disk->queue, QUEUE_ORDERED_TAG, NULL);
err = virtio_config_val(vdev, VIRTIO_CONFIG_BLK_F_CAPACITY, &cap);
if (err) {
dev_err(&vdev->dev, "Bad/missing capacity in config\n");
goto out_put_disk;
}
/* If capacity is too big, truncate with warning. */
if ((sector_t)cap != cap) {
dev_warn(&vdev->dev, "Capacity %llu too large: truncating\n",
(unsigned long long)cap);
cap = (sector_t)-1;
}
set_capacity(vblk->disk, cap);
err = virtio_config_val(vdev, VIRTIO_CONFIG_BLK_F_SIZE_MAX, &v);
if (!err)
blk_queue_max_segment_size(vblk->disk->queue, v);
else if (err != -ENOENT) {
dev_err(&vdev->dev, "Bad SIZE_MAX in config\n");
goto out_put_disk;
}
err = virtio_config_val(vdev, VIRTIO_CONFIG_BLK_F_SEG_MAX, &v);
if (!err)
blk_queue_max_hw_segments(vblk->disk->queue, v);
else if (err != -ENOENT) {
dev_err(&vdev->dev, "Bad SEG_MAX in config\n");
goto out_put_disk;
}
add_disk(vblk->disk);
return 0;
out_put_disk:
put_disk(vblk->disk);
out_unregister_blkdev:
unregister_blkdev(major, "virtblk");
out_mempool:
mempool_destroy(vblk->pool);
out_free_vq:
vdev->config->del_vq(vblk->vq);
out_free_vblk:
kfree(vblk);
out:
return err;
}
static void virtblk_remove(struct virtio_device *vdev)
{
struct virtio_blk *vblk = vdev->priv;
int major = vblk->disk->major;
BUG_ON(!list_empty(&vblk->reqs));
blk_cleanup_queue(vblk->disk->queue);
put_disk(vblk->disk);
unregister_blkdev(major, "virtblk");
mempool_destroy(vblk->pool);
kfree(vblk);
}
static struct virtio_device_id id_table[] = {
{ VIRTIO_ID_BLOCK, VIRTIO_DEV_ANY_ID },
{ 0 },
};
static struct virtio_driver virtio_blk = {
.driver.name = KBUILD_MODNAME,
.driver.owner = THIS_MODULE,
.id_table = id_table,
.probe = virtblk_probe,
.remove = __devexit_p(virtblk_remove),
};
static int __init init(void)
{
return register_virtio_driver(&virtio_blk);
}
static void __exit fini(void)
{
unregister_virtio_driver(&virtio_blk);
}
module_init(init);
module_exit(fini);
MODULE_DEVICE_TABLE(virtio, id_table);
MODULE_DESCRIPTION("Virtio block driver");
MODULE_LICENSE("GPL");
......@@ -613,6 +613,10 @@ config HVC_XEN
help
Xen virtual console device driver
config VIRTIO_CONSOLE
bool
select HVC_DRIVER
config HVCS
tristate "IBM Hypervisor Virtual Console Server support"
depends on PPC_PSERIES
......
......@@ -42,7 +42,6 @@ obj-$(CONFIG_SYNCLINK_GT) += synclink_gt.o
obj-$(CONFIG_N_HDLC) += n_hdlc.o
obj-$(CONFIG_AMIGA_BUILTIN_SERIAL) += amiserial.o
obj-$(CONFIG_SX) += sx.o generic_serial.o
obj-$(CONFIG_LGUEST_GUEST) += hvc_lguest.o
obj-$(CONFIG_RIO) += rio/ generic_serial.o
obj-$(CONFIG_HVC_CONSOLE) += hvc_vio.o hvsi.o
obj-$(CONFIG_HVC_ISERIES) += hvc_iseries.o
......@@ -50,6 +49,7 @@ obj-$(CONFIG_HVC_RTAS) += hvc_rtas.o
obj-$(CONFIG_HVC_BEAT) += hvc_beat.o
obj-$(CONFIG_HVC_DRIVER) += hvc_console.o
obj-$(CONFIG_HVC_XEN) += hvc_xen.o
obj-$(CONFIG_VIRTIO_CONSOLE) += virtio_console.o
obj-$(CONFIG_RAW_DRIVER) += raw.o
obj-$(CONFIG_SGI_SNSC) += snsc.o snsc_event.o
obj-$(CONFIG_MSPEC) += mspec.o
......
/*D:300
* The Guest console driver
*
* This is a trivial console driver: we use lguest's DMA mechanism to send
* bytes out, and register a DMA buffer to receive bytes in. It is assumed to
* be present and available from the very beginning of boot.
*
* Writing console drivers is one of the few remaining Dark Arts in Linux.
* Fortunately for us, the path of virtual consoles has been well-trodden by
* the PowerPC folks, who wrote "hvc_console.c" to generically support any
......@@ -16,7 +12,7 @@
/*M:002 The console can be flooded: while the Guest is processing input the
* Host can send more. Buffering in the Host could alleviate this, but it is a
* difficult problem in general. :*/
/* Copyright (C) 2006 Rusty Russell, IBM Corporation
/* Copyright (C) 2006, 2007 Rusty Russell, IBM Corporation
*
* This program is free software; you can redistribute it and/or modify
* it under the terms of the GNU General Public License as published by
......@@ -34,144 +30,196 @@
*/
#include <linux/err.h>
#include <linux/init.h>
#include <linux/lguest_bus.h>
#include <asm/paravirt.h>
#include <linux/virtio.h>
#include <linux/virtio_console.h>
#include "hvc_console.h"
/*D:340 This is our single console input buffer, with associated "struct
* lguest_dma" referring to it. Note the 0-terminated length array, and the
* use of physical address for the buffer itself. */
static char inbuf[256];
static struct lguest_dma cons_input = { .used_len = 0,
.addr[0] = __pa(inbuf),
.len[0] = sizeof(inbuf),
.len[1] = 0 };
/*D:340 These represent our input and output console queues, and the virtio
* operations for them. */
static struct virtqueue *in_vq, *out_vq;
static struct virtio_device *vdev;
/* This is our input buffer, and how much data is left in it. */
static unsigned int in_len;
static char *in, *inbuf;
/* The operations for our console. */
static struct hv_ops virtio_cons;
/*D:310 The put_chars() callback is pretty straightforward.
*
* First we put the pointer and length in a "struct lguest_dma": we only have
* one pointer, so we set the second length to 0. Then we use SEND_DMA to send
* the data to (Host) buffers attached to the console key. Usually a device's
* key is a physical address within the device's memory, but because the
* console device doesn't have any associated physical memory, we use the
* LGUEST_CONSOLE_DMA_KEY constant (aka 0). */
* We turn the characters into a scatter-gather list, add it to the output
* queue and then kick the Host. Then we sit here waiting for it to finish:
* inefficient in theory, but in practice implementations will do it
* immediately (lguest's Launcher does). */
static int put_chars(u32 vtermno, const char *buf, int count)
{
struct lguest_dma dma;
/* FIXME: DMA buffers in a "struct lguest_dma" are not allowed
* to go over page boundaries. This never seems to happen,
* but if it did we'd need to fix this code. */
dma.len[0] = count;
dma.len[1] = 0;
dma.addr[0] = __pa(buf);
struct scatterlist sg[1];
unsigned int len;
/* This is a convenient routine to initialize a single-elem sg list */
sg_init_one(sg, buf, count);
/* add_buf wants a token to identify this buffer: we hand it any
* non-NULL pointer, since there's only ever one buffer. */
if (out_vq->vq_ops->add_buf(out_vq, sg, 1, 0, (void *)1) == 0) {
/* Tell Host to go! */
out_vq->vq_ops->kick(out_vq);
/* Chill out until it's done with the buffer. */
while (!out_vq->vq_ops->get_buf(out_vq, &len))
cpu_relax();
}
lguest_send_dma(LGUEST_CONSOLE_DMA_KEY, &dma);
/* We're expected to return the amount of data we wrote: all of it. */
return count;
}
/* Create a scatter-gather list representing our input buffer and put it in the
* queue. */
static void add_inbuf(void)
{
struct scatterlist sg[1];
sg_init_one(sg, inbuf, PAGE_SIZE);
/* We should always be able to add one buffer to an empty queue. */
if (in_vq->vq_ops->add_buf(in_vq, sg, 0, 1, inbuf) != 0)
BUG();
in_vq->vq_ops->kick(in_vq);
}
/*D:350 get_chars() is the callback from the hvc_console infrastructure when
* an interrupt is received.
*
* Firstly we see if our buffer has been filled: if not, we return. The rest
* of the code deals with the fact that the hvc_console() infrastructure only
* asks us for 16 bytes at a time. We keep a "cons_offset" variable for
* partially-read buffers. */
* Most of the code deals with the fact that the hvc_console() infrastructure
* only asks us for 16 bytes at a time. We keep in_offset and in_used fields
* for partially-filled buffers. */
static int get_chars(u32 vtermno, char *buf, int count)
{
static int cons_offset;
/* If we don't have an input queue yet, we can't get input. */
BUG_ON(!in_vq);
/* Nothing left to see here... */
if (!cons_input.used_len)
/* No buffer? Try to get one. */
if (!in_len) {
in = in_vq->vq_ops->get_buf(in_vq, &in_len);
if (!in)
return 0;
}
/* You want more than we have to give? Well, try wanting less! */
if (cons_input.used_len - cons_offset < count)
count = cons_input.used_len - cons_offset;
if (in_len < count)
count = in_len;
/* Copy across to their buffer and increment offset. */
memcpy(buf, inbuf + cons_offset, count);
cons_offset += count;
/* Finished? Zero offset, and reset cons_input so Host will use it
* again. */
if (cons_offset == cons_input.used_len) {
cons_offset = 0;
cons_input.used_len = 0;
}
memcpy(buf, in, count);
in += count;
in_len -= count;
/* Finished? Re-register buffer so Host will use it again. */
if (in_len == 0)
add_inbuf();
return count;
}
/*:*/
static struct hv_ops lguest_cons = {
.get_chars = get_chars,
.put_chars = put_chars,
};
/*D:320 Console drivers are initialized very early so boot messages can go
* out. At this stage, the console is output-only. Our driver checks we're a
* Guest, and if so hands hvc_instantiate() the console number (0), priority
* (0), and the struct hv_ops containing the put_chars() function. */
static int __init cons_init(void)
/*D:320 Console drivers are initialized very early so boot messages can go out,
* so we do things slightly differently from the generic virtio initialization
* of the net and block drivers.
*
* At this stage, the console is output-only. It's too early to set up a
* virtqueue, so we let the drivers do some boutique early-output thing. */
int __init virtio_cons_early_init(int (*put_chars)(u32, const char *, int))
{
if (strcmp(pv_info.name, "lguest") != 0)
return 0;
return hvc_instantiate(0, 0, &lguest_cons);
virtio_cons.put_chars = put_chars;
return hvc_instantiate(0, 0, &virtio_cons);
}
console_initcall(cons_init);
/*D:370 To set up and manage our virtual console, we call hvc_alloc() and
* stash the result in the private pointer of the "struct lguest_device".
* Since we never remove the console device we never need this pointer again,
* but using ->private is considered good form, and you never know who's going
* to copy your driver.
/*D:370 Once we're further in boot, we get probed like any other virtio device.
* At this stage we set up the output virtqueue.
*
* Once the console is set up, we bind our input buffer ready for input. */
static int lguestcons_probe(struct lguest_device *lgdev)
* To set up and manage our virtual console, we call hvc_alloc(). Since we
* never remove the console device we never need this pointer again.
*
* Finally we put our input buffer in the input queue, ready to receive. */
static int virtcons_probe(struct virtio_device *dev)
{
int err;
struct hvc_struct *hvc;
vdev = dev;
/* This is the scratch page we use to receive console input */
inbuf = kmalloc(PAGE_SIZE, GFP_KERNEL);
if (!inbuf) {
err = -ENOMEM;
goto fail;
}
/* Find the input queue. */
/* FIXME: This is why we want to wean off hvc: we do nothing
* when input comes in. */
in_vq = vdev->config->find_vq(vdev, NULL);
if (IS_ERR(in_vq)) {
err = PTR_ERR(in_vq);
goto free;
}
out_vq = vdev->config->find_vq(vdev, NULL);
if (IS_ERR(out_vq)) {
err = PTR_ERR(out_vq);
goto free_in_vq;
}
/* Start using the new console output. */
virtio_cons.get_chars = get_chars;
virtio_cons.put_chars = put_chars;
/* The first argument of hvc_alloc() is the virtual console number, so
* we use zero. The second argument is the interrupt number.
* we use zero. The second argument is the interrupt number; we
* currently leave this as zero: it would be better not to use the
* hvc mechanism and fix this (FIXME!).
*
* The third argument is a "struct hv_ops" containing the put_chars()
* and get_chars() pointers. The final argument is the output buffer
* size: we use 256 and expect the Host to have room for us to send
* that much. */
lgdev->private = hvc_alloc(0, lgdev_irq(lgdev), &lguest_cons, 256);
if (IS_ERR(lgdev->private))
return PTR_ERR(lgdev->private);
/* We bind a single DMA buffer at key LGUEST_CONSOLE_DMA_KEY.
* "cons_input" is that statically-initialized global DMA buffer we saw
* above, and we also give the interrupt we want. */
err = lguest_bind_dma(LGUEST_CONSOLE_DMA_KEY, &cons_input, 1,
lgdev_irq(lgdev));
if (err)
printk("lguest console: failed to bind buffer.\n");
* size: we can do any size, so we put PAGE_SIZE here. */
hvc = hvc_alloc(0, 0, &virtio_cons, PAGE_SIZE);
if (IS_ERR(hvc)) {
err = PTR_ERR(hvc);
goto free_out_vq;
}
/* Register the input buffer the first time. */
add_inbuf();
return 0;
free_out_vq:
vdev->config->del_vq(out_vq);
free_in_vq:
vdev->config->del_vq(in_vq);
free:
kfree(inbuf);
fail:
return err;
}
/* Note the use of lgdev_irq() for the interrupt number. We tell hvc_alloc()
* to expect input when this interrupt is triggered, and then tell
* lguest_bind_dma() that is the interrupt to send us when input comes in. */
/*D:360 From now on the console driver follows standard Guest driver form:
* register_lguest_driver() registers the device type and probe function, and
* the probe function sets up the device.
*
* The standard "struct lguest_driver": */
static struct lguest_driver lguestcons_drv = {
.name = "lguestcons",
.owner = THIS_MODULE,
.device_type = LGUEST_DEVICE_T_CONSOLE,
.probe = lguestcons_probe,
static struct virtio_device_id id_table[] = {
{ VIRTIO_ID_CONSOLE, VIRTIO_DEV_ANY_ID },
{ 0 },
};
static struct virtio_driver virtio_console = {
.driver.name = KBUILD_MODNAME,
.driver.owner = THIS_MODULE,
.id_table = id_table,
.probe = virtcons_probe,
};
/* The standard init function */
static int __init hvc_lguest_init(void)
static int __init init(void)
{
return register_lguest_driver(&lguestcons_drv);
return register_virtio_driver(&virtio_console);
}
module_init(hvc_lguest_init);
module_init(init);
MODULE_DEVICE_TABLE(virtio, id_table);
MODULE_DESCRIPTION("Virtio console driver");
MODULE_LICENSE("GPL");
......@@ -47,4 +47,8 @@ config KVM_AMD
Provides support for KVM on AMD processors equipped with the AMD-V
(SVM) extensions.
# OK, it's a little counter-intuitive to do this, but it puts it neatly under
# the virtualization menu.
source drivers/lguest/Kconfig
endif # VIRTUALIZATION
config LGUEST
tristate "Linux hypervisor example code"
depends on X86 && PARAVIRT && EXPERIMENTAL && !X86_PAE && FUTEX
select LGUEST_GUEST
depends on X86_32 && EXPERIMENTAL && !X86_PAE && FUTEX && !(X86_VISWS || X86_VOYAGER)
select HVC_DRIVER
---help---
This is a very simple module which allows you to run
......@@ -18,13 +17,3 @@ config LGUEST_GUEST
The guest needs code built-in, even if the host has lguest
support as a module. The drivers are tiny, so we build them
in too.
config LGUEST_NET
tristate
default y
depends on LGUEST_GUEST && NET
config LGUEST_BLOCK
tristate
default y
depends on LGUEST_GUEST && BLOCK
# Guest requires the paravirt_ops replacement and the bus driver.
obj-$(CONFIG_LGUEST_GUEST) += lguest.o lguest_asm.o lguest_bus.o
# Guest requires the device configuration and probing code.
obj-$(CONFIG_LGUEST_GUEST) += lguest_device.o
# Host requires the other files, which can be a module.
obj-$(CONFIG_LGUEST) += lg.o
lg-y := core.o hypercalls.o page_tables.o interrupts_and_traps.o \
segments.o io.o lguest_user.o switcher.o
lg-y = core.o hypercalls.o page_tables.o interrupts_and_traps.o \
segments.o lguest_user.o
lg-$(CONFIG_X86_32) += x86/switcher_32.o x86/core.o
Preparation Preparation!: PREFIX=P
Guest: PREFIX=G
......
此差异已折叠。
......@@ -25,17 +25,13 @@
#include <linux/mm.h>
#include <asm/page.h>
#include <asm/pgtable.h>
#include <irq_vectors.h>
#include "lg.h"
/*H:120 This is the core hypercall routine: where the Guest gets what it
* wants. Or gets killed. Or, in the case of LHCALL_CRASH, both.
*
* Remember from the Guest: %eax == which call to make, and the arguments are
* packed into %edx, %ebx and %ecx if needed. */
static void do_hcall(struct lguest *lg, struct lguest_regs *regs)
/*H:120 This is the core hypercall routine: where the Guest gets what it wants.
* Or gets killed. Or, in the case of LHCALL_CRASH, both. */
static void do_hcall(struct lguest *lg, struct hcall_args *args)
{
switch (regs->eax) {
switch (args->arg0) {
case LHCALL_FLUSH_ASYNC:
/* This call does nothing, except by breaking out of the Guest
* it makes us process all the asynchronous hypercalls. */
......@@ -51,7 +47,7 @@ static void do_hcall(struct lguest *lg, struct lguest_regs *regs)
char msg[128];
/* If the lgread fails, it will call kill_guest() itself; the
* kill_guest() with the message will be ignored. */
lgread(lg, msg, regs->edx, sizeof(msg));
__lgread(lg, msg, args->arg1, sizeof(msg));
msg[sizeof(msg)-1] = '\0';
kill_guest(lg, "CRASH: %s", msg);
break;
......@@ -59,67 +55,49 @@ static void do_hcall(struct lguest *lg, struct lguest_regs *regs)
case LHCALL_FLUSH_TLB:
/* FLUSH_TLB comes in two flavors, depending on the
* argument: */
if (regs->edx)
if (args->arg1)
guest_pagetable_clear_all(lg);
else
guest_pagetable_flush_user(lg);
break;
case LHCALL_BIND_DMA:
/* BIND_DMA really wants four arguments, but it's the only call
* which does. So the Guest packs the number of buffers and
* the interrupt number into the final argument, and we decode
* it here. This can legitimately fail, since we currently
* place a limit on the number of DMA pools a Guest can have.
* So we return true or false from this call. */
regs->eax = bind_dma(lg, regs->edx, regs->ebx,
regs->ecx >> 8, regs->ecx & 0xFF);
break;
/* All these calls simply pass the arguments through to the right
* routines. */
case LHCALL_SEND_DMA:
send_dma(lg, regs->edx, regs->ebx);
break;
case LHCALL_LOAD_GDT:
load_guest_gdt(lg, regs->edx, regs->ebx);
break;
case LHCALL_LOAD_IDT_ENTRY:
load_guest_idt_entry(lg, regs->edx, regs->ebx, regs->ecx);
break;
case LHCALL_NEW_PGTABLE:
guest_new_pagetable(lg, regs->edx);
guest_new_pagetable(lg, args->arg1);
break;
case LHCALL_SET_STACK:
guest_set_stack(lg, regs->edx, regs->ebx, regs->ecx);
guest_set_stack(lg, args->arg1, args->arg2, args->arg3);
break;
case LHCALL_SET_PTE:
guest_set_pte(lg, regs->edx, regs->ebx, mkgpte(regs->ecx));
guest_set_pte(lg, args->arg1, args->arg2, __pte(args->arg3));
break;
case LHCALL_SET_PMD:
guest_set_pmd(lg, regs->edx, regs->ebx);
break;
case LHCALL_LOAD_TLS:
guest_load_tls(lg, regs->edx);
guest_set_pmd(lg, args->arg1, args->arg2);
break;
case LHCALL_SET_CLOCKEVENT:
guest_set_clockevent(lg, regs->edx);
guest_set_clockevent(lg, args->arg1);
break;
case LHCALL_TS:
/* This sets the TS flag, as we saw used in run_guest(). */
lg->ts = regs->edx;
lg->ts = args->arg1;
break;
case LHCALL_HALT:
/* Similarly, this sets the halted flag for run_guest(). */
lg->halted = 1;
break;
case LHCALL_NOTIFY:
lg->pending_notify = args->arg1;
break;
default:
kill_guest(lg, "Bad hypercall %li\n", regs->eax);
if (lguest_arch_do_hcall(lg, args))
kill_guest(lg, "Bad hypercall %li\n", args->arg0);
}
}
/*:*/
/* Asynchronous hypercalls are easy: we just look in the array in the Guest's
* "struct lguest_data" and see if there are any new ones marked "ready".
/*H:124 Asynchronous hypercalls are easy: we just look in the array in the
* Guest's "struct lguest_data" to see if any new ones are marked "ready".
*
* We are careful to do these in order: obviously we respect the order the
* Guest put them in the ring, but we also promise the Guest that they will
......@@ -134,10 +112,9 @@ static void do_async_hcalls(struct lguest *lg)
if (copy_from_user(&st, &lg->lguest_data->hcall_status, sizeof(st)))
return;
/* We process "struct lguest_data"s hcalls[] ring once. */
for (i = 0; i < ARRAY_SIZE(st); i++) {
struct lguest_regs regs;
struct hcall_args args;
/* We remember where we were up to from last time. This makes
* sure that the hypercalls are done in the order the Guest
* places them in the ring. */
......@@ -152,18 +129,16 @@ static void do_async_hcalls(struct lguest *lg)
if (++lg->next_hcall == LHCALL_RING_SIZE)
lg->next_hcall = 0;
/* We copy the hypercall arguments into a fake register
* structure. This makes life simple for do_hcall(). */
if (get_user(regs.eax, &lg->lguest_data->hcalls[n].eax)
|| get_user(regs.edx, &lg->lguest_data->hcalls[n].edx)
|| get_user(regs.ecx, &lg->lguest_data->hcalls[n].ecx)
|| get_user(regs.ebx, &lg->lguest_data->hcalls[n].ebx)) {
/* Copy the hypercall arguments into a local copy of
* the hcall_args struct. */
if (copy_from_user(&args, &lg->lguest_data->hcalls[n],
sizeof(struct hcall_args))) {
kill_guest(lg, "Fetching async hypercalls");
break;
}
/* Do the hypercall, same as a normal one. */
do_hcall(lg, &regs);
do_hcall(lg, &args);
/* Mark the hypercall done. */
if (put_user(0xFF, &lg->lguest_data->hcall_status[n])) {
......@@ -171,9 +146,9 @@ static void do_async_hcalls(struct lguest *lg)
break;
}
/* Stop doing hypercalls if we've just done a DMA to the
* Launcher: it needs to service this first. */
if (lg->dma_is_pending)
/* Stop doing hypercalls if they want to notify the Launcher:
* it needs to service this first. */
if (lg->pending_notify)
break;
}
}
......@@ -182,76 +157,35 @@ static void do_async_hcalls(struct lguest *lg)
* Guest makes a hypercall, we end up here to set things up: */
static void initialize(struct lguest *lg)
{
u32 tsc_speed;
/* You can't do anything until you're initialized. The Guest knows the
* rules, so we're unforgiving here. */
if (lg->regs->eax != LHCALL_LGUEST_INIT) {
kill_guest(lg, "hypercall %li before LGUEST_INIT",
lg->regs->eax);
if (lg->hcall->arg0 != LHCALL_LGUEST_INIT) {
kill_guest(lg, "hypercall %li before INIT", lg->hcall->arg0);
return;
}
/* We insist that the Time Stamp Counter exist and doesn't change with
* cpu frequency. Some devious chip manufacturers decided that TSC
* changes could be handled in software. I decided that time going
* backwards might be good for benchmarks, but it's bad for users.
*
* We also insist that the TSC be stable: the kernel detects unreliable
* TSCs for its own purposes, and we use that here. */
if (boot_cpu_has(X86_FEATURE_CONSTANT_TSC) && !check_tsc_unstable())
tsc_speed = tsc_khz;
else
tsc_speed = 0;
/* The pointer to the Guest's "struct lguest_data" is the only
* argument. */
lg->lguest_data = (struct lguest_data __user *)lg->regs->edx;
/* If we check the address they gave is OK now, we can simply
* copy_to_user/from_user from now on rather than using lgread/lgwrite.
* I put this in to show that I'm not immune to writing stupid
* optimizations. */
if (!lguest_address_ok(lg, lg->regs->edx, sizeof(*lg->lguest_data))) {
if (lguest_arch_init_hypercalls(lg))
kill_guest(lg, "bad guest page %p", lg->lguest_data);
return;
}
/* The Guest tells us where we're not to deliver interrupts by putting
* the range of addresses into "struct lguest_data". */
if (get_user(lg->noirq_start, &lg->lguest_data->noirq_start)
|| get_user(lg->noirq_end, &lg->lguest_data->noirq_end)
/* We tell the Guest that it can't use the top 4MB of virtual
* addresses used by the Switcher. */
|| put_user(4U*1024*1024, &lg->lguest_data->reserve_mem)
|| put_user(tsc_speed, &lg->lguest_data->tsc_khz)
/* We also give the Guest a unique id, as used in lguest_net.c. */
|| put_user(lg->guestid, &lg->lguest_data->guestid))
|| get_user(lg->noirq_end, &lg->lguest_data->noirq_end))
kill_guest(lg, "bad guest page %p", lg->lguest_data);
/* We write the current time into the Guest's data page once now. */
write_timestamp(lg);
/* page_tables.c will also do some setup. */
page_table_guest_data_init(lg);
/* This is the one case where the above accesses might have been the
* first write to a Guest page. This may have caused a copy-on-write
* fault, but the Guest might be referring to the old (read-only)
* page. */
guest_pagetable_clear_all(lg);
}
/* Now we've examined the hypercall code; our Guest can make requests. There
* is one other way we can do things for the Guest, as we see in
* emulate_insn(). */
/*H:110 Tricky point: we mark the hypercall as "done" once we've done it.
* Normally we don't need to do this: the Guest will run again and update the
* trap number before we come back around the run_guest() loop to
* do_hypercalls().
*
* However, if we are signalled or the Guest sends DMA to the Launcher, that
* loop will exit without running the Guest. When it comes back it would try
* to re-run the hypercall. */
static void clear_hcall(struct lguest *lg)
{
lg->regs->trapnum = 255;
}
/*H:100
* Hypercalls
......@@ -261,16 +195,12 @@ static void clear_hcall(struct lguest *lg)
*/
void do_hypercalls(struct lguest *lg)
{
/* Not initialized yet? */
/* Not initialized yet? This hypercall must do it. */
if (unlikely(!lg->lguest_data)) {
/* Did the Guest make a hypercall? We might have come back for
* some other reason (an interrupt, a different trap). */
if (lg->regs->trapnum == LGUEST_TRAP_ENTRY) {
/* Set up the "struct lguest_data" */
initialize(lg);
/* The hypercall is done. */
clear_hcall(lg);
}
/* Hcall is done. */
lg->hcall = NULL;
return;
}
......@@ -280,12 +210,21 @@ void do_hypercalls(struct lguest *lg)
do_async_hcalls(lg);
/* If we stopped reading the hypercall ring because the Guest did a
* SEND_DMA to the Launcher, we want to return now. Otherwise if the
* Guest asked us to do a hypercall, we do it. */
if (!lg->dma_is_pending && lg->regs->trapnum == LGUEST_TRAP_ENTRY) {
do_hcall(lg, lg->regs);
/* The hypercall is done. */
clear_hcall(lg);
* NOTIFY to the Launcher, we want to return now. Otherwise we do
* the hypercall. */
if (!lg->pending_notify) {
do_hcall(lg, lg->hcall);
/* Tricky point: we reset the hcall pointer to mark the
* hypercall as "done". We use the hcall pointer rather than
* the trap number to indicate a hypercall is pending.
* Normally it doesn't matter: the Guest will run again and
* update the trap number before we come back here.
*
* However, if we are signalled or the Guest sends DMA to the
* Launcher, the run_guest() loop will exit without running the
* Guest. When it comes back it would try to re-run the
* hypercall. */
lg->hcall = NULL;
}
}
......@@ -295,6 +234,6 @@ void write_timestamp(struct lguest *lg)
{
struct timespec now;
ktime_get_real_ts(&now);
if (put_user(now, &lg->lguest_data->time))
if (copy_to_user(&lg->lguest_data->time, &now, sizeof(struct timespec)))
kill_guest(lg, "Writing timestamp");
}
......@@ -12,8 +12,14 @@
* them first, so we also have a way of "reflecting" them into the Guest as if
* they had been delivered to it directly. :*/
#include <linux/uaccess.h>
#include <linux/interrupt.h>
#include <linux/module.h>
#include "lg.h"
/* Allow Guests to use a non-128 (ie. non-Linux) syscall trap. */
static unsigned int syscall_vector = SYSCALL_VECTOR;
module_param(syscall_vector, uint, 0444);
/* The address of the interrupt handler is split into two bits: */
static unsigned long idt_address(u32 lo, u32 hi)
{
......@@ -39,7 +45,7 @@ static void push_guest_stack(struct lguest *lg, unsigned long *gstack, u32 val)
{
/* Stack grows upwards: move stack then write value. */
*gstack -= 4;
lgwrite_u32(lg, *gstack, val);
lgwrite(lg, *gstack, u32, val);
}
/*H:210 The set_guest_interrupt() routine actually delivers the interrupt or
......@@ -56,8 +62,9 @@ static void push_guest_stack(struct lguest *lg, unsigned long *gstack, u32 val)
* it). */
static void set_guest_interrupt(struct lguest *lg, u32 lo, u32 hi, int has_err)
{
unsigned long gstack;
unsigned long gstack, origstack;
u32 eflags, ss, irq_enable;
unsigned long virtstack;
/* There are two cases for interrupts: one where the Guest is already
* in the kernel, and a more complex one where the Guest is in
......@@ -65,8 +72,10 @@ static void set_guest_interrupt(struct lguest *lg, u32 lo, u32 hi, int has_err)
if ((lg->regs->ss&0x3) != GUEST_PL) {
/* The Guest told us their kernel stack with the SET_STACK
* hypercall: both the virtual address and the segment */
gstack = guest_pa(lg, lg->esp1);
virtstack = lg->esp1;
ss = lg->ss1;
origstack = gstack = guest_pa(lg, virtstack);
/* We push the old stack segment and pointer onto the new
* stack: when the Guest does an "iret" back from the interrupt
* handler the CPU will notice they're dropping privilege
......@@ -75,8 +84,10 @@ static void set_guest_interrupt(struct lguest *lg, u32 lo, u32 hi, int has_err)
push_guest_stack(lg, &gstack, lg->regs->esp);
} else {
/* We're staying on the same Guest (kernel) stack. */
gstack = guest_pa(lg, lg->regs->esp);
virtstack = lg->regs->esp;
ss = lg->regs->ss;
origstack = gstack = guest_pa(lg, virtstack);
}
/* Remember that we never let the Guest actually disable interrupts, so
......@@ -102,7 +113,7 @@ static void set_guest_interrupt(struct lguest *lg, u32 lo, u32 hi, int has_err)
/* Now we've pushed all the old state, we change the stack, the code
* segment and the address to execute. */
lg->regs->ss = ss;
lg->regs->esp = gstack + lg->page_offset;
lg->regs->esp = virtstack + (gstack - origstack);
lg->regs->cs = (__KERNEL_CS|GUEST_PL);
lg->regs->eip = idt_address(lo, hi);
......@@ -165,7 +176,7 @@ void maybe_do_interrupt(struct lguest *lg)
/* Look at the IDT entry the Guest gave us for this interrupt. The
* first 32 (FIRST_EXTERNAL_VECTOR) entries are for traps, so we skip
* over them. */
idt = &lg->idt[FIRST_EXTERNAL_VECTOR+irq];
idt = &lg->arch.idt[FIRST_EXTERNAL_VECTOR+irq];
/* If they don't have a handler (yet?), we just ignore it */
if (idt_present(idt->a, idt->b)) {
/* OK, mark it no longer pending and deliver it. */
......@@ -183,6 +194,47 @@ void maybe_do_interrupt(struct lguest *lg)
* timer interrupt. */
write_timestamp(lg);
}
/*:*/
/* Linux uses trap 128 for system calls. Plan9 uses 64, and Ron Minnich sent
* me a patch, so we support that too. It'd be a big step for lguest if half
* the Plan 9 user base were to start using it.
*
* Actually now I think of it, it's possible that Ron *is* half the Plan 9
* userbase. Oh well. */
static bool could_be_syscall(unsigned int num)
{
/* Normal Linux SYSCALL_VECTOR or reserved vector? */
return num == SYSCALL_VECTOR || num == syscall_vector;
}
/* The syscall vector it wants must be unused by Host. */
bool check_syscall_vector(struct lguest *lg)
{
u32 vector;
if (get_user(vector, &lg->lguest_data->syscall_vec))
return false;
return could_be_syscall(vector);
}
int init_interrupts(void)
{
/* If they want some strange system call vector, reserve it now */
if (syscall_vector != SYSCALL_VECTOR
&& test_and_set_bit(syscall_vector, used_vectors)) {
printk("lg: couldn't reserve syscall %u\n", syscall_vector);
return -EBUSY;
}
return 0;
}
void free_interrupts(void)
{
if (syscall_vector != SYSCALL_VECTOR)
clear_bit(syscall_vector, used_vectors);
}
/*H:220 Now we've got the routines to deliver interrupts, delivering traps
* like page fault is easy. The only trick is that Intel decided that some
......@@ -197,14 +249,14 @@ int deliver_trap(struct lguest *lg, unsigned int num)
{
/* Trap numbers are always 8 bit, but we set an impossible trap number
* for traps inside the Switcher, so check that here. */
if (num >= ARRAY_SIZE(lg->idt))
if (num >= ARRAY_SIZE(lg->arch.idt))
return 0;
/* Early on the Guest hasn't set the IDT entries (or maybe it put a
* bogus one in): if we fail here, the Guest will be killed. */
if (!idt_present(lg->idt[num].a, lg->idt[num].b))
if (!idt_present(lg->arch.idt[num].a, lg->arch.idt[num].b))
return 0;
set_guest_interrupt(lg, lg->idt[num].a, lg->idt[num].b, has_err(num));
set_guest_interrupt(lg, lg->arch.idt[num].a, lg->arch.idt[num].b, has_err(num));
return 1;
}
......@@ -218,28 +270,20 @@ int deliver_trap(struct lguest *lg, unsigned int num)
* system calls down from 1750ns to 270ns. Plus, if lguest didn't do it, all
* the other hypervisors would tease it.
*
* This routine determines if a trap can be delivered directly. */
static int direct_trap(const struct lguest *lg,
const struct desc_struct *trap,
unsigned int num)
* This routine indicates if a particular trap number could be delivered
* directly. */
static int direct_trap(unsigned int num)
{
/* Hardware interrupts don't go to the Guest at all (except system
* call). */
if (num >= FIRST_EXTERNAL_VECTOR && num != SYSCALL_VECTOR)
if (num >= FIRST_EXTERNAL_VECTOR && !could_be_syscall(num))
return 0;
/* The Host needs to see page faults (for shadow paging and to save the
* fault address), general protection faults (in/out emulation) and
* device not available (TS handling), and of course, the hypercall
* trap. */
if (num == 14 || num == 13 || num == 7 || num == LGUEST_TRAP_ENTRY)
return 0;
/* Only trap gates (type 15) can go direct to the Guest. Interrupt
* gates (type 14) disable interrupts as they are entered, which we
* never let the Guest do. Not present entries (type 0x0) also can't
* go direct, of course 8) */
return idt_type(trap->a, trap->b) == 0xF;
return num != 14 && num != 13 && num != 7 && num != LGUEST_TRAP_ENTRY;
}
/*:*/
......@@ -348,15 +392,11 @@ void load_guest_idt_entry(struct lguest *lg, unsigned int num, u32 lo, u32 hi)
* to copy this again. */
lg->changed |= CHANGED_IDT;
/* The IDT which we keep in "struct lguest" only contains 32 entries
* for the traps and LGUEST_IRQS (32) entries for interrupts. We
* ignore attempts to set handlers for higher interrupt numbers, except
* for the system call "interrupt" at 128: we have a special IDT entry
* for that. */
if (num < ARRAY_SIZE(lg->idt))
set_trap(lg, &lg->idt[num], num, lo, hi);
else if (num == SYSCALL_VECTOR)
set_trap(lg, &lg->syscall_idt, num, lo, hi);
/* Check that the Guest doesn't try to step outside the bounds. */
if (num >= ARRAY_SIZE(lg->arch.idt))
kill_guest(lg, "Setting idt entry %u", num);
else
set_trap(lg, &lg->arch.idt[num], num, lo, hi);
}
/* The default entry for each interrupt points into the Switcher routines which
......@@ -399,20 +439,21 @@ void copy_traps(const struct lguest *lg, struct desc_struct *idt,
/* We can simply copy the direct traps, otherwise we use the default
* ones in the Switcher: they will return to the Host. */
for (i = 0; i < FIRST_EXTERNAL_VECTOR; i++) {
if (direct_trap(lg, &lg->idt[i], i))
idt[i] = lg->idt[i];
for (i = 0; i < ARRAY_SIZE(lg->arch.idt); i++) {
/* If no Guest can ever override this trap, leave it alone. */
if (!direct_trap(i))
continue;
/* Only trap gates (type 15) can go direct to the Guest.
* Interrupt gates (type 14) disable interrupts as they are
* entered, which we never let the Guest do. Not present
* entries (type 0x0) also can't go direct, of course. */
if (idt_type(lg->arch.idt[i].a, lg->arch.idt[i].b) == 0xF)
idt[i] = lg->arch.idt[i];
else
/* Reset it to the default. */
default_idt_entry(&idt[i], i, def[i]);
}
/* Don't forget the system call trap! The IDT entries for other
* interupts never change, so no need to copy them. */
i = SYSCALL_VECTOR;
if (direct_trap(lg, &lg->syscall_idt, i))
idt[i] = lg->syscall_idt;
else
default_idt_entry(&idt[i], i, def[i]);
}
void guest_set_clockevent(struct lguest *lg, unsigned long delta)
......
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
......@@ -73,14 +73,14 @@ static void fixup_gdt_table(struct lguest *lg, unsigned start, unsigned end)
/* Segment descriptors contain a privilege level: the Guest is
* sometimes careless and leaves this as 0, even though it's
* running at privilege level 1. If so, we fix it here. */
if ((lg->gdt[i].b & 0x00006000) == 0)
lg->gdt[i].b |= (GUEST_PL << 13);
if ((lg->arch.gdt[i].b & 0x00006000) == 0)
lg->arch.gdt[i].b |= (GUEST_PL << 13);
/* Each descriptor has an "accessed" bit. If we don't set it
* now, the CPU will try to set it when the Guest first loads
* that entry into a segment register. But the GDT isn't
* writable by the Guest, so bad things can happen. */
lg->gdt[i].b |= 0x00000100;
lg->arch.gdt[i].b |= 0x00000100;
}
}
......@@ -106,12 +106,12 @@ void setup_default_gdt_entries(struct lguest_ro_state *state)
void setup_guest_gdt(struct lguest *lg)
{
/* Start with full 0-4G segments... */
lg->gdt[GDT_ENTRY_KERNEL_CS] = FULL_EXEC_SEGMENT;
lg->gdt[GDT_ENTRY_KERNEL_DS] = FULL_SEGMENT;
lg->arch.gdt[GDT_ENTRY_KERNEL_CS] = FULL_EXEC_SEGMENT;
lg->arch.gdt[GDT_ENTRY_KERNEL_DS] = FULL_SEGMENT;
/* ...except the Guest is allowed to use them, so set the privilege
* level appropriately in the flags. */
lg->gdt[GDT_ENTRY_KERNEL_CS].b |= (GUEST_PL << 13);
lg->gdt[GDT_ENTRY_KERNEL_DS].b |= (GUEST_PL << 13);
lg->arch.gdt[GDT_ENTRY_KERNEL_CS].b |= (GUEST_PL << 13);
lg->arch.gdt[GDT_ENTRY_KERNEL_DS].b |= (GUEST_PL << 13);
}
/* Like the IDT, we never simply use the GDT the Guest gives us. We set up the
......@@ -126,7 +126,7 @@ void copy_gdt_tls(const struct lguest *lg, struct desc_struct *gdt)
unsigned int i;
for (i = GDT_ENTRY_TLS_MIN; i <= GDT_ENTRY_TLS_MAX; i++)
gdt[i] = lg->gdt[i];
gdt[i] = lg->arch.gdt[i];
}
/* This is the full version */
......@@ -138,7 +138,7 @@ void copy_gdt(const struct lguest *lg, struct desc_struct *gdt)
* replaced. See ignored_gdt() above. */
for (i = 0; i < GDT_ENTRIES; i++)
if (!ignored_gdt(i))
gdt[i] = lg->gdt[i];
gdt[i] = lg->arch.gdt[i];
}
/* This is where the Guest asks us to load a new GDT (LHCALL_LOAD_GDT). */
......@@ -146,12 +146,12 @@ void load_guest_gdt(struct lguest *lg, unsigned long table, u32 num)
{
/* We assume the Guest has the same number of GDT entries as the
* Host, otherwise we'd have to dynamically allocate the Guest GDT. */
if (num > ARRAY_SIZE(lg->gdt))
if (num > ARRAY_SIZE(lg->arch.gdt))
kill_guest(lg, "too many gdt entries %i", num);
/* We read the whole thing in, then fix it up. */
lgread(lg, lg->gdt, table, num * sizeof(lg->gdt[0]));
fixup_gdt_table(lg, 0, ARRAY_SIZE(lg->gdt));
__lgread(lg, lg->arch.gdt, table, num * sizeof(lg->arch.gdt[0]));
fixup_gdt_table(lg, 0, ARRAY_SIZE(lg->arch.gdt));
/* Mark that the GDT changed so the core knows it has to copy it again,
* even if the Guest is run on the same CPU. */
lg->changed |= CHANGED_GDT;
......@@ -159,9 +159,9 @@ void load_guest_gdt(struct lguest *lg, unsigned long table, u32 num)
void guest_load_tls(struct lguest *lg, unsigned long gtls)
{
struct desc_struct *tls = &lg->gdt[GDT_ENTRY_TLS_MIN];
struct desc_struct *tls = &lg->arch.gdt[GDT_ENTRY_TLS_MIN];
lgread(lg, tls, gtls, sizeof(*tls)*GDT_ENTRY_TLS_ENTRIES);
__lgread(lg, tls, gtls, sizeof(*tls)*GDT_ENTRY_TLS_ENTRIES);
fixup_gdt_table(lg, GDT_ENTRY_TLS_MIN, GDT_ENTRY_TLS_MAX+1);
lg->changed |= CHANGED_GDT_TLS;
}
......
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
obj-$(CONFIG_VIRTIO) += virtio.o
obj-$(CONFIG_VIRTIO_RING) += virtio_ring.o
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
unifdef-y += sisfb.h uvesafb.h
unifdef-y += edid.h
此差异已折叠。
此差异已折叠。
Markdown is supported
0% .
You are about to add 0 people to the discussion. Proceed with caution.
先完成此消息的编辑!
想要评论请 注册