Commits · dd1a239f6f2d4d3eedd318583ec319aa145b324c · OpenHarmony / kernel_linux

28 Apr, 2008 9 commits

mm: have zonelist contains structs with both a zone pointer and zone_idx · dd1a239f

Committed by Mel Gorman on Apr 28, 2008

Filtering zonelists requires very frequent use of zone_idx().  This is costly
as it involves a lookup of another structure and a subtraction operation.  As
the zone_idx is often required, it should be quickly accessible.  The node idx
could also be stored here if it was found that accessing zone->node is
significant which may be the case on workloads where nodemasks are heavily
used.

This patch introduces a struct zoneref to store a zone pointer and a zone
index.  The zonelist then consists of an array of these struct zonerefs which
are looked up as necessary.  Helpers are given for accessing the zone index as
well as the node index.

[kamezawa.hiroyu@jp.fujitsu.com: Suggested struct zoneref instead of embedding information in pointers]
[hugh@veritas.com: mm-have-zonelist: fix memcg ooms]
[hugh@veritas.com: just return do_try_to_free_pages]
[hugh@veritas.com: do_try_to_free_pages gfp_mask redundant]
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Christoph Lameter <clameter@sgi.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

dd1a239f
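
The layout described in dd1a239f can be pictured with a small userspace sketch. It is not the kernel's definition: the field names, the helper names and `MAX_ZONEREFS` are assumptions made for illustration. The point is only that each zonelist entry carries its zone index next to the zone pointer, so filtering code does not have to derive the index again.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for the kernel's struct zone (illustrative fields). */
struct zone {
    int node;            /* node this zone belongs to */
    const char *name;
};

/* One zonelist entry: the zone pointer plus its cached index. */
struct zoneref {
    struct zone *zone;   /* NULL terminates the list */
    int zone_idx;        /* stored here so filtering need not recompute it */
};

#define MAX_ZONEREFS 8   /* hypothetical capacity */

/* The zonelist becomes an array of zonerefs rather than bare zone pointers. */
struct zonelist {
    struct zoneref zonerefs[MAX_ZONEREFS + 1];
};

/* Hypothetical accessors for the zone, the zone index and the node index. */
static inline struct zone *zoneref_zone(const struct zoneref *z) { return z->zone; }
static inline int zoneref_idx(const struct zoneref *z) { return z->zone_idx; }
static inline int zoneref_node(const struct zoneref *z) { return z->zone->node; }
```

A caller that only needs the index reads `zoneref_idx(ref)` and never touches the zone structure itself; the node index still goes through the zone, which matches the commit's remark that it could be cached as well if that access proved significant.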

mm: use two zonelist that are filtered by GFP mask · 54a6eb5c

Committed by Mel Gorman on Apr 28, 2008

Currently a node has two sets of zonelists, one for each zone type in the
system and a second set for GFP_THISNODE allocations.  Based on the zones
allowed by a gfp mask, one of these zonelists is selected.  All of these
zonelists consume memory and occupy cache lines.

This patch replaces the multiple zonelists per-node with two zonelists.  The
first contains all populated zones in the system, ordered by distance, for
fallback allocations when the target/preferred node has no free pages.  The
second contains all populated zones in the node suitable for GFP_THISNODE
allocations.

An iterator macro is introduced called for_each_zone_zonelist() that iterates
through each zone allowed by the GFP flags in the selected zonelist.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

54a6eb5c
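
The filtered iteration that 54a6eb5c describes could look roughly like the sketch below. It assumes the GFP mask has already been reduced to a "highest allowed zone index"; the macro and function names are hypothetical and the kernel's for_each_zone_zonelist() may be structured differently.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified zone that knows its own index (illustrative). */
struct zone {
    const char *name;
    int idx;
};

/* Skip entries whose index is above the highest index the caller allows. */
static struct zone **next_allowed(struct zone **z, int highest_idx)
{
    while (*z != NULL && (*z)->idx > highest_idx)
        z++;
    return z;
}

/* Visit every zone in a NULL-terminated list that the limit permits. */
#define for_each_allowed_zone(zone, z, zones, highest_idx)              \
    for ((z) = next_allowed((zones), (highest_idx)), (zone) = *(z);     \
         (zone) != NULL;                                                \
         (z) = next_allowed((z) + 1, (highest_idx)), (zone) = *(z))

/* Count the zones an allocation limited to highest_idx may use. */
static int count_allowed(struct zone **zones, int highest_idx)
{
    struct zone **z;
    struct zone *zone;
    int n = 0;

    for_each_allowed_zone(zone, z, zones, highest_idx)
        n++;
    return n;
}
```

With one shared list ordered from the highest zone down, an allocation that may only use lower zones simply skips the leading entries instead of needing a separate list of its own.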

mm: remember what the preferred zone is for zone_statistics · 18ea7e71

Committed by Mel Gorman on Apr 28, 2008

On NUMA, zone_statistics() is used to record events like numa hit, miss and
foreign.  It assumes that the first zone in a zonelist is the preferred zone.
When multiple zonelists are replaced by one that is filtered, this is no
longer the case.

This patch records what the preferred zone is rather than assuming the first
zone in the zonelist is it.  This simplifies the reading of later patches in
this set.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Christoph Lameter <clameter@sgi.com>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

18ea7e71

mm: introduce node_zonelist() for accessing the zonelist for a GFP mask · 0e88460d

Committed by Mel Gorman on Apr 28, 2008

Introduce a node_zonelist() helper function.  It is used to look up the
appropriate zonelist given a node and a GFP mask.  The patch on its own is a
cleanup but it helps clarify parts of the two-zonelist-per-node patchset.  If
necessary, it can be merged with the next patch in this set without problems.
Reviewed-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

0e88460d
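
As a rough illustration of what a node_zonelist()-style lookup does, the sketch below assumes the two-lists-per-node layout from the rest of this series: one fallback list and one node-local list. The flag value, the structure names and the array size are hypothetical, and the helper in 0e88460d itself may index its lists differently.

```c
#include <assert.h>

#define SKETCH_GFP_THISNODE 0x1u   /* hypothetical flag bit, not the kernel's value */
#define MAX_NODES 4                /* hypothetical node count */

struct zonelist {
    int id;
};

struct node_data {
    /* [0]: fallback list across nodes, [1]: node-local list for THISNODE requests */
    struct zonelist zonelists[2];
};

static struct node_data nodes[MAX_NODES];

/* Pick the zonelist for a node from the allocation flags. */
static struct zonelist *node_zonelist_sketch(int nid, unsigned int gfp_flags)
{
    assert(nid >= 0 && nid < MAX_NODES);
    int which = (gfp_flags & SKETCH_GFP_THISNODE) ? 1 : 0;
    return &nodes[nid].zonelists[which];
}
```

Keeping the selection in one helper means callers never index the per-node array themselves, so a later change to how the lists are organised touches a single function.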

mm: use zonelists instead of zones when direct reclaiming pages · dac1d27b

Committed by Mel Gorman on Apr 28, 2008

The following patches replace multiple zonelists per node with two zonelists
that are filtered based on the GFP flags.  The patches as a set fix a bug with
regard to the use of MPOL_BIND and ZONE_MOVABLE.  With this patchset, the
MPOL_BIND will apply to the two highest zones when the highest zone is
ZONE_MOVABLE.  This should be considered as an alternative fix for the
MPOL_BIND+ZONE_MOVABLE in 2.6.23 to the previously discussed hack that filters
only custom zonelists.

The first patch cleans up an inconsistency where direct reclaim uses
zonelist->zones where other places use zonelist.

The second patch introduces a helper function node_zonelist() for looking up
the appropriate zonelist for a GFP mask which simplifies patches later in the
set.

The third patch defines/remembers the "preferred zone" for numa statistics, as
it is no longer always the first zone in a zonelist.

The fourth patch replaces multiple zonelists with two zonelists that are
filtered.  The two zonelists are due to the fact that the memoryless patchset
introduces a second set of zonelists for __GFP_THISNODE.

The fifth patch introduces helper macros for retrieving the zone and node
indices of entries in a zonelist.

The final patch introduces filtering of the zonelists based on a nodemask.
Two zonelists exist per node, one for normal allocations and one for
__GFP_THISNODE.

Performance results varied depending on the machine configuration.  In real
workloads the gain/loss will depend on how much the userspace portion of the
benchmark benefits from having more cache available due to reduced referencing
of zonelists.

This is the range of performance losses/gains when running against
2.6.24-rc4-mm1.  The test machines are a mix of i386, x86_64 and ppc64,
both NUMA and non-NUMA.

                             loss   to  gain
Total CPU time on Kernbench: -0.86% to  1.13%
Elapsed   time on Kernbench: -0.79% to  0.76%
page_test from aim9:         -4.37% to  0.79%
brk_test  from aim9:         -0.71% to  4.07%
fork_test from aim9:         -1.84% to  4.60%
exec_test from aim9:         -0.71% to  1.08%

This patch:

The allocator deals with zonelists which indicate the order in which zones
should be targeted for an allocation.  Similarly, direct reclaim of pages
iterates over an array of zones.  For consistency, this patch converts direct
reclaim to use a zonelist.  No functionality is changed by this patch.  This
simplifies zonelist iterators in the next patch.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

dac1d27b

mm: remove nopage · 3c18ddd1

Committed by Nick Piggin on Apr 28, 2008

Nothing in the tree uses nopage any more.  Remove support for it in the
core mm code and documentation (and a few stray references to it in
comments).
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

3c18ddd1

remove sparse warning for mmzone.h · ddc81ed2

Committed by Harvey Harrison on Apr 28, 2008

include/linux/mmzone.h:640:22: warning: potentially expensive pointer subtraction

Calculate the offset into the node_zones array rather than the index
using casts to (char *) and comparing against the index * sizeof(struct zone).

On X86_32 this saves a sar, but code size increases by one byte per
is_highmem() use due to 32-bit cmps rather than 16 bit cmps.

Before:
 207:   2b 80 8c 07 00 00       sub    0x78c(%eax),%eax
 20d:   c1 f8 0b                sar    $0xb,%eax
 210:   83 f8 02                cmp    $0x2,%eax
 213:   74 16                   je     22b <kmap_atomic_prot+0x144>
 215:   83 f8 03                cmp    $0x3,%eax
 218:   0f 85 8f 00 00 00       jne    2ad <kmap_atomic_prot+0x1c6>
 21e:   83 3d 00 00 00 00 02    cmpl   $0x2,0x0
 225:   0f 85 82 00 00 00       jne    2ad <kmap_atomic_prot+0x1c6>
 22b:   64 a1 00 00 00 00       mov    %fs:0x0,%eax

After:
 207:   2b 80 8c 07 00 00       sub    0x78c(%eax),%eax
 20d:   3d 00 10 00 00          cmp    $0x1000,%eax
 212:   74 18                   je     22c <kmap_atomic_prot+0x145>
 214:   3d 00 18 00 00          cmp    $0x1800,%eax
 219:   0f 85 8f 00 00 00       jne    2ae <kmap_atomic_prot+0x1c7>
 21f:   83 3d 00 00 00 00 02    cmpl   $0x2,0x0
 226:   0f 85 82 00 00 00       jne    2ae <kmap_atomic_prot+0x1c7>
 22c:   64 a1 00 00 00 00       mov    %fs:0x0,%eax

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

ddc81ed2
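
The trade-off in ddc81ed2 can be reproduced in plain C. The sketch below uses a 2048-byte stand-in for struct zone, chosen only because it matches the `sar $0xb` in the listing; the function names are hypothetical. The first helper computes an index by pointer subtraction, which implies a division (here a shift) by the element size. The second compares byte offsets against `idx * sizeof(struct zone)`, which the compiler can fold into a constant.

```c
#include <assert.h>
#include <stddef.h>

struct zone {
    char pad[2048];                /* size chosen only for illustration */
};

struct pgdat {
    struct zone node_zones[4];
};

/* Index via pointer subtraction: implies a divide by sizeof(struct zone). */
static int zone_index(const struct pgdat *p, const struct zone *z)
{
    return (int)(z - p->node_zones);
}

/* Same question asked on byte offsets: no divide, just a compare. */
static int zone_is_index(const struct pgdat *p, const struct zone *z, int idx)
{
    ptrdiff_t off = (const char *)z - (const char *)p->node_zones;
    return off == (ptrdiff_t)(idx * sizeof(struct zone));
}
```

With this element size, index 2 corresponds to byte offset 0x1000 and index 3 to 0x1800, the two constants that appear in the "After" disassembly.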

Remove set_migrateflags() · 488514d1

Committed by Christoph Lameter on Apr 28, 2008

Migrate flags must be set on slab creation as agreed upon when the antifrag
logic was reviewed.  Otherwise some slabs of a slabcache will end up in the
unmovable and others in the reclaimable section depending on which flag was
active when a new slab page was allocated.

This likely slid in somehow when antifrag was merged. Remove it.

The buffer_heads are always allocated with __GFP_RECLAIMABLE because the
SLAB_RECLAIM_ACCOUNT option is set.  The set_migrateflags() never had any
effect there.

Radix tree allocations are not directly reclaimable but they are allocated
with __GFP_RECLAIMABLE set on each allocation.  We now set
SLAB_RECLAIM_ACCOUNT on radix tree slab creation making sure that radix
tree slabs are consistently placed in the reclaimable section.  Radix tree
slabs will also be accounted as such.

There is then no user left of set_migrateflags(). So remove it.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

488514d1

hotplug memory remove: generic __remove_pages() support · ea01ea93

Committed by Badari Pulavarty on Apr 28, 2008

Generic helper function to remove section mappings and sysfs entries for the
section of the memory we are removing.  offline_pages() correctly adjusted
zone and marked the pages reserved.

TODO: Yasunori Goto is working on patches to free up allocations from bootmem.
Signed-off-by: Badari Pulavarty <pbadari@us.ibm.com>
Acked-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

ea01ea93

27 Apr, 2008 31 commits

KVM: kill file->f_count abuse in kvm · 66c0b394

Committed by Al Viro on Apr 19, 2008

Use kvm's own refcounting instead of playing with ->filp->f_count.
That will allow us to get rid of a lot of crap in anon_inode_getfd() and
kill a race in kvm_dev_ioctl_create_vm() (file might have been closed
immediately by another thread, so ->filp might point to already freed
struct file when we get around to setting it).
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Avi Kivity <avi@qumranet.com>

66c0b394

KVM: ppc: Add DCR access information to struct kvm_run · b2312f05

Committed by Hollis Blanchard on Apr 16, 2008

Device Control Registers are essentially another address space found on PowerPC
4xx processors, analogous to PIO on x86. DCRs are always 32 bits, and can be
identified by a 32-bit number. We forward most DCR accesses to userspace for
emulation (with the exception of CPR0 registers, which can be read directly
for simplicity in timebase frequency determination).
Signed-off-by: Hollis Blanchard <hollisb@us.ibm.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>

b2312f05

KVM: Rename debugfs_dir to kvm_debugfs_dir · 76f7c879

Committed by Hollis Blanchard on Apr 15, 2008

It's a globally exported symbol now.
Signed-off-by: Hollis Blanchard <hollisb@us.ibm.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>

76f7c879

KVM: add ioctls to save/store mpstate · 62d9f0db

Committed by Marcelo Tosatti on Apr 11, 2008

So userspace can save/restore the mpstate during migration.

[avi: export the #define constants describing the value]
[christian: add s390 stubs]
[avi: ditto for ia64]
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Carsten Otte <cotte@de.ibm.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>

62d9f0db

ide: remove now unused ide_pci_create_host_proc() · fd0949e6

Committed by Alexey Dobriyan on Apr 27, 2008

It creates files in proc with obsoleted ->get_info interface.
Signed-off-by: Alexey Dobriyan <adobriyan@openvz.org>
Cc: Andrew Morton <akpm@osdl.org>
Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>

fd0949e6

ide: add struct ide_io_ports (take 3) · 4c3032d8

Committed by Bartlomiej Zolnierkiewicz on Apr 27, 2008

* Add struct ide_io_ports and use it instead of `unsigned long io_ports[]`
  in ide_hwif_t.

* Rename io_ports[] in hw_regs_t to io_ports_array[].

* Use un-named union for 'unsigned long io_ports_array[]' and 'struct
  ide_io_ports io_ports' in hw_regs_t.

* Remove IDE_*_OFFSET defines.

v2:
* scc_pata.c build fix from Stephen Rothwell.

v3:
* Fix ctl_adrr typo in Sparc-specific part of ns87415.c.
  (Noticed by Andrew Morton)
Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>

4c3032d8
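
The un-named union from 4c3032d8 lets the same storage be addressed either by array index or by field name. The sketch below shows the idea with a cut-down, hypothetical set of register fields; the real struct ide_io_ports has its own members, and only the general shape is taken from the commit message.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical register names; the real struct ide_io_ports may differ. */
struct io_ports_sketch {
    unsigned long data_addr;
    unsigned long error_addr;
    unsigned long nsect_addr;
    unsigned long ctl_addr;
};

#define NR_PORTS (sizeof(struct io_ports_sketch) / sizeof(unsigned long))

struct hw_regs_sketch {
    union {                                   /* un-named: both views share storage */
        unsigned long io_ports_array[NR_PORTS];
        struct io_ports_sketch io_ports;
    };
};
```

Code that loops over all ports can keep using `hw.io_ports_array[i]`, while code that means one specific register can write `hw.io_ports.ctl_addr`, which is what makes numeric offset defines such as IDE_*_OFFSET unnecessary.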

ide: make ide_unregister() take 'ide_hwif_t *' as an argument (take 2) · 387750c3

Committed by Bartlomiej Zolnierkiewicz on Apr 27, 2008

* Make ide_unregister() take 'ide_hwif_t *hwif' instead of 'unsigned int
  index' (hwif->index) as an argument and update all users accordingly.

While at it:

* Remove unnecessary checks for hwif != NULL from ide-pnp.c::idepnp_remove()
  and delkin_cb.c::delkin_cb_remove().

* Remove needless hwif->chipset assignment from scc_pata.c::scc_remove().

v2:
* Fixup ide_unregister() documentation.
Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>

387750c3

ide: add "noacpi" / "acpigtf" / "acpionboot" parameters · 1dbfeb4b

Committed by Bartlomiej Zolnierkiewicz on Apr 27, 2008

* Rename ide_noacpi{tfs,onboot} to ide_acpi{gtf,onboot} (+ reverse logic).

* Move ide_*acpi* variables to ide-acpi.c and remove unnecessary initializers.

* Add "noacpi" / "acpigtf" / "acpionboot" parameters.

* Obsolete "ide=noacpi" / "ide=acpigtf" / "ide=acpionboot" kernel parameters.
Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>

1dbfeb4b

ide: remove obsoleted "hdx=autotune" kernel parameter · 207daeaa

Committed by Bartlomiej Zolnierkiewicz on Apr 27, 2008

* Remove obsoleted "hdx=autotune" kernel parameter
  (we always auto-tune PIO if possible nowadays).

* Remove no longer needed ide_drive_t.autotune flag.
Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>

207daeaa

ide: remove IDE_HFLAG_NO_AUTOTUNE host flag · e160124f

Committed by Bartlomiej Zolnierkiewicz on Apr 27, 2008

* Don't set IDE_HFLAG_NO_AUTOTUNE host flag in sgiioc4 and icside
  host drivers - there is no need for it as they don't implement
  ->set_pio_mode method.

* Remove no longer needed IDE_HFLAG_NO_AUTOTUNE host flag.

There should be no functional changes caused by this patch.
Acked-by: Sergei Shtylyov <sshtylyov@ru.mvista.com>
Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>

e160124f

ide: add "vlb|pci_clock=" parameter · ebae41a5

Committed by Bartlomiej Zolnierkiewicz on Apr 27, 2008

* Add "vlb_clock=" parameter for specifying VLB clock frequency (in MHz).

* Add "pci_clock=" parameter for specifying PCI bus clock frequency (in MHz).

While at it:

* qd65xx.c: rename {active,recovery}_cycle variables to {act,rec}_cyc.

Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Acked-by: Sergei Shtylyov <sshtylyov@ru.mvista.com>
Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>

ebae41a5

ide: remove obsoleted "hdx=noautotune" kernel parameter · cc12175f

Committed by Bartlomiej Zolnierkiewicz on Apr 27, 2008

Remove obsoleted "hdx=noautotune" kernel parameter
(it has been obsoleted since 1 Nov 2004).

Then make ide_hwif_t.autotune a single bit flag
and remove no longer needed IDE_TUNE_* defines.
Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>

cc12175f

ide: remove obsoleted "idex=reset" kernel parameter · e460a597

Committed by Bartlomiej Zolnierkiewicz on Apr 27, 2008

Remove obsoleted "idex=reset" kernel parameter
(it has been obsoleted since 1 Nov 2004).

Then remove corresponding code from ide_probe_port()
and no longer used ->reset field from ide_hwif_t.
Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>

e460a597

ide: add "ignore_cable" parameter (take 2) · 9fd91d95

Committed by Bartlomiej Zolnierkiewicz on Apr 27, 2008

Add "ignore_cable" parameter:

* "ide_core.ignore_cable=[interface_number]" boot option if IDE is built-in
  (i.e. "ide_core.ignore_cable=1" to force ignoring cable for "ide1")

* "ignore_cable=[interface_number]" module parameter (for ide_core module)
  if IDE is compiled as module

v2:
* Add ide_port_apply_params() helper
  - use it in ide_device_add_all() and ide_scan_port().

* Make it possible to later disable ignoring cable detection by passing
  "[interface_number]:0" to /sys/module/ide_core/parameters/ignore_cable
  (however, the sysfs interface is not enabled yet since it needs some other
   IDE changes to make it work reliably).
Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>

9fd91d95

KVM: hlt emulation should take in-kernel APIC/PIT timers into account · 3d80840d

Committed by Marcelo Tosatti on Apr 11, 2008

Timers that fire between guest hlt and vcpu_block's add_wait_queue() are
ignored, possibly resulting in hangs.

Also make sure that atomic_inc and waitqueue_active tests happen in the
specified order, otherwise the following race is open:

CPU0                                        CPU1
                                            if (waitqueue_active(wq))
add_wait_queue()
if (!atomic_read(pit_timer->pending))
    schedule()
                                            atomic_inc(pit_timer->pending)
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>

3d80840d
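
The ordering argument in 3d80840d can be walked through with a single-threaded model in which each step of the two CPUs is a separate function. This is only a sketch: the structure and function names are hypothetical, and a sequential model says nothing about memory barriers, which the real code also has to get right.

```c
#include <assert.h>
#include <stdbool.h>

struct vcpu_sketch {
    int pending;         /* timer ticks not yet delivered */
    bool on_waitqueue;   /* vcpu has queued itself to sleep */
    bool asleep;
    bool woken;
};

/* Sleeper, step 1: queue first ... */
static void sleeper_enqueue(struct vcpu_sketch *v)
{
    v->on_waitqueue = true;
}

/* Sleeper, step 2: ... then sleep only if nothing is pending. */
static void sleeper_maybe_sleep(struct vcpu_sketch *v)
{
    if (!v->pending)
        v->asleep = true;
}

/* Timer, step 1: publish the tick first ... */
static void timer_mark_pending(struct vcpu_sketch *v)
{
    v->pending++;
}

/* Timer, step 2: ... then wake the vcpu if it is queued. */
static void timer_wake_if_waiting(struct vcpu_sketch *v)
{
    if (v->on_waitqueue)
        v->woken = true;
}

/* A wakeup is lost when the vcpu sleeps with a tick pending and nobody woke it. */
static bool wakeup_lost(const struct vcpu_sketch *v)
{
    return v->asleep && v->pending > 0 && !v->woken;
}
```

Running the timer steps in the wrong order (wake check first, increment last) around the two sleeper steps reproduces the interleaving in the diagram and leaves `wakeup_lost()` true. With the increment first, either the sleeper sees the pending tick or the timer sees the queued vcpu.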

KVM: Add kvm trace userspace interface · d4c9ff2d

Committed by Feng (Eric) Liu on Apr 10, 2008

This interface allows a userspace application to read the trace of kvm
related events through relayfs.
Signed-off-by: Feng (Eric) Liu <eric.e.liu@intel.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>

d4c9ff2d

KVM: Add trace markers · 2714d1d3

Committed by Feng (Eric) Liu on Apr 10, 2008

Trace markers allow userspace to trace execution of a virtual machine
in order to monitor its performance.
Signed-off-by: Feng (Eric) Liu <eric.e.liu@intel.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>

2714d1d3

KVM: MMU: Don't assume struct page for x86 · 35149e21

Committed by Anthony Liguori on Apr 02, 2008

This patch introduces a gfn_to_pfn() function and corresponding functions like
kvm_release_pfn_dirty().  Using these new functions, we can modify the x86
MMU to no longer assume that it can always get a struct page for any given gfn.

We don't want to eliminate gfn_to_page() entirely because a number of places
assume they can do gfn_to_page() and then kmap() the results.  When we support
IO memory, gfn_to_page() will fail for IO pages although gfn_to_pfn() will
succeed.

This does not implement support for avoiding reference counting for reserved
RAM or for IO memory.  However, it should make those things pretty
straightforward.

Since we're only introducing new common symbols, I don't think it will break
the non-x86 architectures but I haven't tested those.  I've tested Intel,
AMD, NPT, and hugetlbfs with Windows and Linux guests.

[avi: fix overflow when shifting left pfns by adding casts]
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>

35149e21
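
The bracketed note in 35149e21 about overflow when shifting pfns left is easy to reproduce outside the kernel. The sketch below assumes 4 KiB pages and a 32-bit frame-number type; both are illustrative choices and the type and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT_SKETCH 12   /* 4 KiB pages, assumed for illustration */

typedef uint32_t pfn32_t;      /* stand-in for a 32-bit page frame number */

/* Shifting in 32 bits drops the high bits of the physical address. */
static uint64_t pfn_to_addr_truncated(pfn32_t pfn)
{
    return (uint32_t)(pfn << PAGE_SHIFT_SKETCH);
}

/* Widening before the shift keeps the full 64-bit address. */
static uint64_t pfn_to_addr(pfn32_t pfn)
{
    return (uint64_t)pfn << PAGE_SHIFT_SKETCH;
}
```

Frame number 0x100000 is the first page at 4 GiB: the widened version returns 0x100000000, while the 32-bit shift wraps to 0.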

KVM: add vm refcounting · d39f13b0

Committed by Izik Eidus on Mar 30, 2008

The main purpose of adding these functions is the ability to release the
spinlock that protects the kvm list while still being able to do operations
on a specific kvm in a safe way.
Signed-off-by: Izik Eidus <izike@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>

d39f13b0
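
The pattern d39f13b0 enables is the usual get/put pair: take a reference while holding the list lock, drop the lock, work on the object, then drop the reference. The sketch below models only the counter, using C11 atomics; the structure and function names are hypothetical and the lock itself is left out.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

struct vm_sketch {
    atomic_int users;    /* number of live references */
    bool destroyed;      /* set when the last reference is dropped */
};

/* A new VM starts with one reference owned by its creator. */
static void vm_init(struct vm_sketch *vm)
{
    atomic_init(&vm->users, 1);
    vm->destroyed = false;
}

/* Take an extra reference, e.g. while still holding the list lock. */
static void vm_get(struct vm_sketch *vm)
{
    atomic_fetch_add(&vm->users, 1);
}

/* Drop a reference; whoever drops the last one tears the VM down. */
static void vm_put(struct vm_sketch *vm)
{
    if (atomic_fetch_sub(&vm->users, 1) == 1)
        vm->destroyed = true;   /* real code would free the VM here */
}
```

A list walker would call `vm_get()` under the lock, release the lock, operate on the VM, and finish with `vm_put()`; the VM cannot disappear in between even if its creator drops its own reference.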

KVM: s390: intercepts for diagnose instructions · e28acfea

Committed by Christian Borntraeger on Mar 25, 2008

This patch introduces interpretation of some diagnose instruction intercepts.
Diagnose is our classic architected way of doing a hypercall. This patch
features the following diagnose codes:
- vm storage size, that tells the guest about its memory layout
- time slice end, which is used by the guest to indicate that it waits
  for a lock and thus cannot use up its time slice in a useful way
- ipl functions, which a guest can use to reset and reboot itself

In order to implement ipl functions, we also introduce an exit reason that
causes userspace to perform various resets on the virtual machine. All resets
are described in the principles of operation book, except KVM_S390_RESET_IPL
which causes a reboot of the machine.
Acked-by: Martin Schwidefsky <martin.schwidefsky@de.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Carsten Otte <cotte@de.ibm.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>

e28acfea

KVM: s390: interrupt subsystem, cpu timer, waitpsw · ba5c1e9b

Committed by Carsten Otte on Mar 25, 2008

This patch contains the s390 interrupt subsystem (similar to in kernel apic)
including timer interrupts (similar to in-kernel-pit) and enabled wait
(similar to in kernel hlt).

In order to achieve that, this patch also introduces intercept handling
for instruction intercepts, and it implements load control instructions.

This patch introduces an ioctl KVM_S390_INTERRUPT which is valid for both
the vm file descriptors and the vcpu file descriptors. In case this ioctl is
issued against a vm file descriptor, the interrupt is considered floating.
Floating interrupts may be delivered to any virtual cpu in the configuration.

The following interrupts are supported:
SIGP STOP       - interprocessor signal that stops a remote cpu
SIGP SET PREFIX - interprocessor signal that sets the prefix register of a
                  (stopped) remote cpu
INT EMERGENCY   - interprocessor interrupt, usually used to signal need_resched
                  and for smp_call_function() in the guest.
PROGRAM INT     - exception during program execution such as page fault, illegal
                  instruction and friends
RESTART         - interprocessor signal that starts a stopped cpu
INT VIRTIO      - floating interrupt for virtio signalisation
INT SERVICE     - floating interrupt for signalisations from the system
                  service processor

struct kvm_s390_interrupt, which is submitted as ioctl parameter when injecting
an interrupt, also carries parameter data for interrupts along with the interrupt
type. Interrupts on s390 usually have a state that represents the current
operation, or identifies which device has caused the interruption on s390.

kvm_s390_handle_wait() does handle waitpsw in two flavors: in case of a
disabled wait (that is, disabled for interrupts), we exit to userspace. In case
of an enabled wait we set up a timer that equals the cpu clock comparator value
and sleep on a wait queue.

[christian: change virtio interrupt to 0x2603]
Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Carsten Otte <cotte@de.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>

ba5c1e9b

KVM: s390: sie intercept handling · 8f2abe6a

Committed by Christian Borntraeger on Mar 25, 2008

This patch introduces handling of sie intercepts in three flavors: Intercepts
are either handled completely in-kernel by kvm_handle_sie_intercept(),
or passed to userspace with corresponding data in struct kvm_run in case
kvm_handle_sie_intercept() returns -ENOTSUPP.
In case of partial execution in kernel with the need of userspace support,
kvm_handle_sie_intercept() may choose to set up struct kvm_run and return
-EREMOTE.

The trivial intercept reasons are handled in this patch:
handle_noop() just does nothing for intercepts that don't require our support
  at all
handle_stop() is called when a cpu enters stopped state, and it drops out to
  userland after updating our vcpu state
handle_validity() faults in the cpu lowcore if needed, or passes the request
  to userland
Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Carsten Otte <cotte@de.ibm.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>

8f2abe6a
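
The three outcomes described in 8f2abe6a amount to a small dispatch on the handler's return code. The sketch below is one possible reading, not the kernel's run loop: the error-code values are local stand-ins, the enum and function names are hypothetical, and treating a zero return as "resume the guest" is an assumption.

```c
#include <assert.h>

/* Local stand-ins for the kernel's error codes; values are illustrative. */
#define SK_ENOTSUPP 524
#define SK_EREMOTE   66

enum exit_action { RESUME_GUEST, EXIT_TO_USERSPACE };

/*
 * Map the in-kernel handler's return code to what the run loop does next.
 * *run_prepared reports whether the handler already filled struct kvm_run.
 */
static enum exit_action after_intercept(int rc, int *run_prepared)
{
    *run_prepared = 0;

    if (rc == 0)
        return RESUME_GUEST;           /* handled completely in the kernel */

    if (rc == -SK_EREMOTE) {
        *run_prepared = 1;             /* partial handling, kvm_run is set up */
        return EXIT_TO_USERSPACE;
    }

    /* -ENOTSUPP (and, in this sketch, anything else): hand the intercept over */
    return EXIT_TO_USERSPACE;
}
```

The distinction between the two userspace exits is who fills in struct kvm_run: for -ENOTSUPP the caller copies the intercept data, for -EREMOTE the handler has already done so.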

KVM: s390: arch backend for the kvm kernel module · b0c632db

Committed by Heiko Carstens on Mar 25, 2008

This patch contains the port of Qumranet's kvm kernel module to IBM zSeries
(aka s390x, mainframe) architecture. It uses the mainframe's virtualization
instruction SIE to run virtual machines with up to 64 virtual CPUs each.
This port is only usable on 64bit host kernels, and can only run 64bit guest
kernels. However, running 31bit applications in guest userspace is possible.

The following source files are introduced by this patch:
arch/s390/kvm/kvm-s390.c
  similar to arch/x86/kvm/x86.c, this implements all arch callbacks for
  kvm. __vcpu_run calls back into sie64a to enter the guest machine context
arch/s390/kvm/sie64a.S
  assembler function sie64a, which enters guest context via SIE, and
  switches world before and after that
include/asm-s390/kvm_host.h
  contains all vital data structures needed to run virtual machines on
  the mainframe
include/asm-s390/kvm.h
  defines kvm_regs and friends for user access to guest register content
arch/s390/kvm/gaccess.h
  functions similar to uaccess to access guest memory
arch/s390/kvm/kvm-s390.h
  header file for kvm-s390 internals, extended by later patches
Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Carsten Otte <cotte@de.ibm.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>

b0c632db

s390: KVM preparation: provide hook to enable pgstes in user pagetable · 402b0862

Committed by Carsten Otte on Mar 25, 2008

The SIE instruction on s390 uses the 2nd half of the page table page to
virtualize the storage keys of a guest. This patch offers the s390_enable_sie
function, which reorganizes the page tables of a single-threaded process to
reserve space in the page table:
s390_enable_sie makes sure that the process is single threaded and then uses
dup_mm to create a new mm with reorganized page tables. The old mm is freed
and the process now has a page status extended field after every page table.

Code that wants to exploit pgstes should SELECT CONFIG_PGSTE.

This patch has a small common code hit, namely making dup_mm non-static.

Edit (Carsten): I've modified Martin's patch, following Jeremy Fitzhardinge's
review feedback. Now we do have the prototype for dup_mm in
include/linux/sched.h. Following Martin's suggestion, s390_enable_sie() does now
call task_lock() to prevent race against ptrace modification of mm_users.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Carsten Otte <cotte@de.ibm.com>
Acked-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Avi Kivity <avi@qumranet.com>

402b0862

KVM: Move some x86 specific constants and structures to include/asm-x86 · 69a9f69b

Committed by Avi Kivity on Mar 21, 2008

Signed-off-by: Avi Kivity <avi@qumranet.com>

69a9f69b

KVM: kvm.h: __user requires compiler.h · 97646202

Committed by Christian Borntraeger on Mar 12, 2008

include/linux/kvm.h defines struct kvm_dirty_log to
	[...]
	union {
		void __user *dirty_bitmap; /* one bit per page */
		__u64 padding;
	};

__user requires compiler.h to compile. Currently, this works on x86
only coincidentally due to other include files. This patch makes
kvm.h compile in all cases.
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>

97646202
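
The dependency that 97646202 fixes can be seen in a userspace sketch: `__user` is an annotation macro, so a header that uses it fails to compile unless something has defined it first. Below, the macro is defined as empty (an assumption about how it behaves outside sparse), and only the union quoted in the commit message is reproduced; the elided leading fields are deliberately left out and the structure name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Without a definition of __user the declaration below would not compile. */
#ifndef __user
#define __user
#endif

/* Only the union quoted in the commit message; other fields are omitted. */
struct dirty_log_tail_sketch {
    union {
        void __user *dirty_bitmap;   /* one bit per page */
        uint64_t padding;            /* keeps the union 64 bits wide */
    };
};
```

The `padding` member also explains the union itself: it keeps the field 8 bytes wide whether pointers are 32 or 64 bits, so the layout seen by userspace does not depend on the pointer size.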

KVM: MMU: hypercall based pte updates and TLB flushes · 2f333bcb

Committed by Marcelo Tosatti on Feb 22, 2008

Hypercall based pte updates are faster than faults, and also allow use
of the lazy MMU mode to batch operations.

Don't report the feature if two dimensional paging is enabled.

[avi:
 - one mmu_op hypercall instead of one per op
 - allow 64-bit gpa on hypercall
 - don't pass host errors (-ENOMEM) to guest]

[akpm: warning fix on i386]
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Avi Kivity <avi@qumranet.com>

2f333bcb

x86: KVM guest: add basic paravirt support · 0cf1bfd2

Committed by Marcelo Tosatti on Feb 22, 2008

Add basic KVM paravirt support. Avoid vm-exits on IO delays.
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>

0cf1bfd2

KVM: add basic paravirt support · a28e4f5a

Committed by Marcelo Tosatti on Feb 22, 2008

Add basic KVM paravirt support. Avoid vm-exits on IO delays.
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>

a28e4f5a

KVM: Add save/restore supporting of in kernel PIT · e0f63cb9

Committed by Sheng Yang on Mar 04, 2008

Signed-off-by: Sheng Yang <sheng.yang@intel.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>

e0f63cb9

KVM: In kernel PIT model · 7837699f

Committed by Sheng Yang on Jan 28, 2008

The patch moves the PIT model from userspace to kernel, and increases
the timer accuracy greatly.

[marcelo: make last_injected_time per-guest]
Signed-off-by: Sheng Yang <sheng.yang@intel.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Tested-and-Acked-by: Alex Davis <alex14641@yahoo.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>

7837699f

OpenHarmony / kernel_linux · last synced more than 3 years ago
