提交 · 6dbde3530850d4d8bfc1b6bd4006d92786a2787f · openeuler / raspberrypi-kernel

16 1月, 2009 2 次提交

percpu: add optimized generic percpu accessors · 6dbde353

由 Ingo Molnar 提交于 1月 15, 2009

It is an optimization and a cleanup, and adds the following new
generic percpu methods:

  percpu_read()
  percpu_write()
  percpu_add()
  percpu_sub()
  percpu_and()
  percpu_or()
  percpu_xor()

and implements support for them on x86. (other architectures will fall
back to a default implementation)

The advantage is that for example to read a local percpu variable,
instead of this sequence:

 return __get_cpu_var(var);

 ffffffff8102ca2b:	48 8b 14 fd 80 09 74 	mov    -0x7e8bf680(,%rdi,8),%rdx
 ffffffff8102ca32:	81
 ffffffff8102ca33:	48 c7 c0 d8 59 00 00 	mov    $0x59d8,%rax
 ffffffff8102ca3a:	48 8b 04 10          	mov    (%rax,%rdx,1),%rax

We can get a single instruction by using the optimized variants:

 return percpu_read(var);

 ffffffff8102ca3f:	65 48 8b 05 91 8f fd 	mov    %gs:0x7efd8f91(%rip),%rax

I also cleaned up the x86-specific APIs and made the x86 code use
these new generic percpu primitives.

tj: * fixed generic percpu_sub() definition as Roel Kluin pointed out
    * added percpu_and() for completeness's sake
    * made generic percpu ops atomic against preemption
Signed-off-by: NIngo Molnar <mingo@elte.hu>
Signed-off-by: NTejun Heo <tj@kernel.org>

6dbde353

x86: misc clean up after the percpu update · 004aa322

由 Tejun Heo 提交于 1月 13, 2009

Do the following cleanups:

* kill x86_64_init_pda() which now is equivalent to pda_init()

* use per_cpu_offset() instead of cpu_pda() when initializing
  initial_gs
Signed-off-by: NTejun Heo <tj@kernel.org>
Signed-off-by: NIngo Molnar <mingo@elte.hu>

004aa322

12 1月, 2009 1 次提交

x86: change flush_tlb_others to take a const struct cpumask · 4595f962

由 Rusty Russell 提交于 1月 10, 2009

Impact: reduce stack usage, use new cpumask API.

This is made a little more tricky by uv_flush_tlb_others which
actually alters its argument, for an IPI to be sent to the remaining
cpus in the mask.

I solve this by allocating a cpumask_var_t for this case and falling back
to IPI should this fail.

To eliminate temporaries in the caller, all flush_tlb_others implementations
now do the this-cpu-elimination step themselves.

Note also the curious "cpus_or(f->flush_cpumask, cpumask, f->flush_cpumask)"
which has been there since pre-git and yet f->flush_cpumask is always zero
at this point.
Signed-off-by: NRusty Russell <rusty@rustcorp.com.au>
Signed-off-by: NMike Travis <travis@sgi.com>

4595f962

17 12月, 2008 2 次提交

xen: clean up asm/xen/hypervisor.h · ecbf29cd

由 Jeremy Fitzhardinge 提交于 12月 16, 2008

Impact: cleanup

hypervisor.h had accumulated a lot of crud, including lots of spurious
#includes.  Clean it all up, and go around fixing up everything else
accordingly.
Signed-off-by: NJeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: NIngo Molnar <mingo@elte.hu>

ecbf29cd

xen: whitespace/checkpatch cleanup · f63c2f24

由 Tej 提交于 12月 16, 2008

Impact: cleanup
Signed-off-by: NTej <bewith.tej@gmail.com>
Signed-off-by: NJeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: NIngo Molnar <mingo@elte.hu>

f63c2f24

07 11月, 2008 1 次提交

xen: make sure stray alias mappings are gone before pinning · d05fdf31

由 Jeremy Fitzhardinge 提交于 10月 28, 2008

Xen requires that all mappings of pagetable pages are read-only, so
that they can't be updated illegally.  As a result, if a page is being
turned into a pagetable page, we need to make sure all its mappings
are RO.

If the page had been used for ioremap or vmalloc, it may still have
left over mappings as a result of not having been lazily unmapped.
This change makes sure we explicitly mop them all up before pinning
the page.

Unlike aliases created by kmap, the there can be vmalloc aliases even
for non-high pages, so we must do the flush unconditionally.
Signed-off-by: NJeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: Linux Memory Management List <linux-mm@kvack.org>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: NIngo Molnar <mingo@elte.hu>

d05fdf31

20 10月, 2008 1 次提交

mm: rewrite vmap layer · db64fe02

由 Nick Piggin 提交于 10月 18, 2008

Rewrite the vmap allocator to use rbtrees and lazy tlb flushing, and
provide a fast, scalable percpu frontend for small vmaps (requires a
slightly different API, though).

The biggest problem with vmap is actually vunmap.  Presently this requires
a global kernel TLB flush, which on most architectures is a broadcast IPI
to all CPUs to flush the cache.  This is all done under a global lock.  As
the number of CPUs increases, so will the number of vunmaps a scaled
workload will want to perform, and so will the cost of a global TLB flush.
 This gives terrible quadratic scalability characteristics.

Another problem is that the entire vmap subsystem works under a single
lock.  It is a rwlock, but it is actually taken for write in all the fast
paths, and the read locking would likely never be run concurrently anyway,
so it's just pointless.

This is a rewrite of vmap subsystem to solve those problems.  The existing
vmalloc API is implemented on top of the rewritten subsystem.

The TLB flushing problem is solved by using lazy TLB unmapping.  vmap
addresses do not have to be flushed immediately when they are vunmapped,
because the kernel will not reuse them again (would be a use-after-free)
until they are reallocated.  So the addresses aren't allocated again until
a subsequent TLB flush.  A single TLB flush then can flush multiple
vunmaps from each CPU.

XEN and PAT and such do not like deferred TLB flushing because they can't
always handle multiple aliasing virtual addresses to a physical address.
They now call vm_unmap_aliases() in order to flush any deferred mappings.
That call is very expensive (well, actually not a lot more expensive than
a single vunmap under the old scheme), however it should be OK if not
called too often.

The virtual memory extent information is stored in an rbtree rather than a
linked list to improve the algorithmic scalability.

There is a per-CPU allocator for small vmaps, which amortizes or avoids
global locking.

To use the per-CPU interface, the vm_map_ram / vm_unmap_ram interfaces
must be used in place of vmap and vunmap.  Vmalloc does not use these
interfaces at the moment, so it will not be quite so scalable (although it
will use lazy TLB flushing).

As a quick test of performance, I ran a test that loops in the kernel,
linearly mapping then touching then unmapping 4 pages.  Different numbers
of tests were run in parallel on an 4 core, 2 socket opteron.  Results are
in nanoseconds per map+touch+unmap.

threads           vanilla         vmap rewrite
1                 14700           2900
2                 33600           3000
4                 49500           2800
8                 70631           2900

So with a 8 cores, the rewritten version is already 25x faster.

In a slightly more realistic test (although with an older and less
scalable version of the patch), I ripped the not-very-good vunmap batching
code out of XFS, and implemented the large buffer mapping with vm_map_ram
and vm_unmap_ram...  along with a couple of other tricks, I was able to
speed up a large directory workload by 20x on a 64 CPU system.  I believe
vmap/vunmap is actually sped up a lot more than 20x on such a system, but
I'm running into other locks now.  vmap is pretty well blown off the
profiles.

Before:
1352059 total                                      0.1401
798784 _write_lock                              8320.6667 <- vmlist_lock
529313 default_idle                             1181.5022
 15242 smp_call_function                         15.8771  <- vmap tlb flushing
  2472 __get_vm_area_node                         1.9312  <- vmap
  1762 remove_vm_area                             4.5885  <- vunmap
   316 map_vm_area                                0.2297  <- vmap
   312 kfree                                      0.1950
   300 _spin_lock                                 3.1250
   252 sn_send_IPI_phys                           0.4375  <- tlb flushing
   238 vmap                                       0.8264  <- vmap
   216 find_lock_page                             0.5192
   196 find_next_bit                              0.3603
   136 sn2_send_IPI                               0.2024
   130 pio_phys_write_mmr                         2.0312
   118 unmap_kernel_range                         0.1229

After:
 78406 total                                      0.0081
 40053 default_idle                              89.4040
 33576 ia64_spinlock_contention                 349.7500
  1650 _spin_lock                                17.1875
   319 __reg_op                                   0.5538
   281 _atomic_dec_and_lock                       1.0977
   153 mutex_unlock                               1.5938
   123 iget_locked                                0.1671
   117 xfs_dir_lookup                             0.1662
   117 dput                                       0.1406
   114 xfs_iget_core                              0.0268
    92 xfs_da_hashname                            0.1917
    75 d_alloc                                    0.0670
    68 vmap_page_range                            0.0462 <- vmap
    58 kmem_cache_alloc                           0.0604
    57 memset                                     0.0540
    52 rb_next                                    0.1625
    50 __copy_user                                0.0208
    49 bitmap_find_free_region                    0.2188 <- vmap
    46 ia64_sn_udelay                             0.1106
    45 find_inode_fast                            0.1406
    42 memcmp                                     0.2188
    42 finish_task_switch                         0.1094
    42 __d_lookup                                 0.0410
    40 radix_tree_lookup_slot                     0.1250
    37 _spin_unlock_irqrestore                    0.3854
    36 xfs_bmapi                                  0.0050
    36 kmem_cache_free                            0.0256
    35 xfs_vn_getattr                             0.0322
    34 radix_tree_lookup                          0.1062
    33 __link_path_walk                           0.0035
    31 xfs_da_do_buf                              0.0091
    30 _xfs_buf_find                              0.0204
    28 find_get_page                              0.0875
    27 xfs_iread                                  0.0241
    27 __strncpy_from_user                        0.2812
    26 _xfs_buf_initialize                        0.0406
    24 _xfs_buf_lookup_pages                      0.0179
    24 vunmap_page_range                          0.0250 <- vunmap
    23 find_lock_page                             0.0799
    22 vm_map_ram                                 0.0087 <- vmap
    20 kfree                                      0.0125
    19 put_page                                   0.0330
    18 __kmalloc                                  0.0176
    17 xfs_da_node_lookup_int                     0.0086
    17 _read_lock                                 0.0885
    17 page_waitqueue                             0.0664

vmap has gone from being the top 5 on the profiles and flushing the crap
out of all TLBs, to using less than 1% of kernel time.

[akpm@linux-foundation.org: cleanups, section fix]
[akpm@linux-foundation.org: fix build on alpha]
Signed-off-by: NNick Piggin <npiggin@suse.de>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Cc: Krzysztof Helt <krzysztof.h1@poczta.fm>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

db64fe02

10 10月, 2008 1 次提交

xen: do not reserve 2 pages of padding between hypervisor and fixmap. · 5dc64a34

由 Ian Campbell 提交于 10月 10, 2008

When reserving space for the hypervisor the Xen paravirt backend adds
an extra two pages (this was carried forward from the 2.6.18-xen tree
which had them "for safety"). Depending on various CONFIG options this
can cause the boot time fixmaps to span multiple PMDs which is not
supported and triggers a WARN in early_ioremap_init().

This was exposed by 2216d199 which
moved the dmi table parsing earlier.
    x86: fix CONFIG_X86_RESERVE_LOW_64K=y

    The bad_bios_dmi_table() quirk never triggered because we do DMI setup
    too late. Move it a bit earlier.

There is no real reason to reserve these two extra pages and the
fixmap already incorporates FIX_HOLE which serves the same
purpose. None of the other callers of reserve_top_address do this.
Signed-off-by: NIan Campbell <ian.campbell@citrix.com>
Signed-off-by: NIngo Molnar <mingo@elte.hu>

5dc64a34

03 10月, 2008 1 次提交

xen: clean up x86-64 warnings · db053b86

由 Jeremy Fitzhardinge 提交于 10月 02, 2008

There are a couple of Xen features which rely on directly accessing
per-cpu data via a segment register, which is not yet available on
x86-64. In the meantime, just disable direct access to the vcpu info
structure; this leaves some of the code as dead, but it will come to
life in time, and the warnings are suppressed.
Signed-off-by: NJeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: NIngo Molnar <mingo@elte.hu>

db053b86

10 9月, 2008 1 次提交

xen: fix pinning when not using split pte locks · 6a9e9184

由 Jeremy Fitzhardinge 提交于 9月 09, 2008

We only pin PTE pages when using split PTE locks, so don't do the
pin/unpin when attaching/detaching pte pages to a pinned pagetable.
Signed-off-by: NJeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: NIngo Molnar <mingo@elte.hu>

6a9e9184

07 9月, 2008 1 次提交

x86, xen: Use native_pte_flags instead of native_pte_val for .pte_flags · e4a6be4d

由 Eduardo Habkost 提交于 7月 24, 2008

Using native_pte_val triggers the BUG_ON() in the paravirt_ops
version of pte_flags().
Signed-off-by: NEduardo Habkost <ehabkost@redhat.com>
Acked-by: NJeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: NIngo Molnar <mingo@elte.hu>

e4a6be4d

22 8月, 2008 1 次提交

x86, paravirt_ops: use unsigned long instead of u32 for alloc_p*() pfn args · f8639939

由 Eduardo Habkost 提交于 7月 30, 2008

This patch changes the pfn args from 'u32' to 'unsigned long'
on alloc_p*() functions on paravirt_ops, and the corresponding
implementations for Xen and VMI. The prototypes for CONFIG_PARAVIRT=n
are already using unsigned long, so paravirt.h now matches the prototypes
on asm-x86/pgalloc.h.

It shouldn't result in any changes on generated code on 32-bit, with
or without CONFIG_PARAVIRT. On both cases, 'codiff -f' didn't show any
change after applying this patch.

On 64-bit, there are (expected) binary changes only when CONFIG_PARAVIRT
is enabled, as the patch is really supposed to change the size of the
pfn args.

[ v2: KVM_GUEST: use the right parameter type on kvm_release_pt() ]
Signed-off-by: NEduardo Habkost <ehabkost@redhat.com>
Acked-by: NJeremy Fitzhardinge <jeremy@goop.org>
Acked-by: NZachary Amsden <zach@vmware.com>
Signed-off-by: NIngo Molnar <mingo@elte.hu>

f8639939

20 8月, 2008 1 次提交

xen: clean up domain mode predicates · 6e833587

由 Jeremy Fitzhardinge 提交于 8月 19, 2008

There are four operating modes Xen code may find itself running in:
 - native
 - hvm domain
 - pv dom0
 - pv domU

Clean up predicates for testing for these states to make them more consistent.
Signed-off-by: NJeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: Xen-devel <xen-devel@lists.xensource.com>
Signed-off-by: NIngo Molnar <mingo@elte.hu>

6e833587

31 7月, 2008 3 次提交

xen_alloc_ptpage: cast PFN_PHYS() argument to unsigned long · 169ad16b

由 Eduardo Habkost 提交于 7月 28, 2008

Currently paravirt_ops alloc_p*() uses u32 for the pfn args. We should
change that later, but while the pfn parameter is still u32, we need to
cast the PFN_PHYS() argument at xen_alloc_ptpage() to unsigned long,
otherwise it will lose bits on the shift.

I think PFN_PHYS() should behave better when fed with smaller integers,
but a cast to unsigned long won't be enough for all cases on 32-bit PAE,
and a cast to u64 would be overkill for most users of PFN_PHYS().

We could have two different flavors of PFN_PHYS: one for low pages
only (unsigned long) and another that works for any page (u64)),
but while we don't have it, we will need the cast to unsigned long on
xen_alloc_ptpage().
Signed-off-by: NEduardo Habkost <ehabkost@redhat.com>
Acked-by: NJeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: NIngo Molnar <mingo@elte.hu>

169ad16b

xen: fix allocation and use of large ldts, cleanup · cef43bf6

由 Jeremy Fitzhardinge 提交于 7月 28, 2008

Add a proper comment for set_aliased_prot() and fix an
unsigned long/void * warning.
Signed-off-by: NJeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: NIngo Molnar <mingo@elte.hu>

cef43bf6

xen: compile irq functions without -pg for ftrace · 0d1edf46

由 Jeremy Fitzhardinge 提交于 7月 28, 2008

For some reason I managed to miss a bunch of irq-related functions
which also need to be compiled without -pg when using ftrace.  This
patch moves them into their own file, and starts a cleanup process
I've been meaning to do anyway.
Signed-off-by: NJeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: Sam Ravnborg <sam@ravnborg.org>
Cc: "Alex Nixon (Intern)" <Alex.Nixon@eu.citrix.com>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Signed-off-by: NIngo Molnar <mingo@elte.hu>

0d1edf46

28 7月, 2008 2 次提交

xen: suppress known wrmsrs · d89961e2

由 Jeremy Fitzhardinge 提交于 7月 24, 2008

In general, Xen doesn't support wrmsr from an unprivileged domain; it
just ends up ignoring the instruction and printing a message on the
console.

Given that there are sets of MSRs we know the kernel will try to write
to, but we don't care, just eat them in xen_write_msr to cut down on
console noise.
Signed-off-by: NJeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: NIngo Molnar <mingo@elte.hu>

d89961e2

xen: fix allocation and use of large ldts · a05d2eba

由 Jeremy Fitzhardinge 提交于 7月 27, 2008

When the ldt gets to more than 1 page in size, the kernel uses vmalloc
to allocate it.  This means that:

 - when making the ldt RO, we must update the pages in both the vmalloc
   mapping and the linear mapping to make sure there are no RW aliases.

 - we need to use arbitrary_virt_to_machine to compute the machine addr
   for each update
Signed-off-by: NJeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: NIngo Molnar <mingo@elte.hu>

a05d2eba

26 7月, 2008 1 次提交

x86, xen: Use native_pte_flags instead of native_pte_val for .pte_flags · b56afe1d

由 Eduardo Habkost 提交于 7月 24, 2008

Using native_pte_val triggers the BUG_ON() in the paravirt_ops
version of pte_flags().
Signed-off-by: NEduardo Habkost <ehabkost@redhat.com>
Acked-by: NJeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: NIngo Molnar <mingo@elte.hu>

b56afe1d

24 7月, 2008 1 次提交

x86/paravirt/xen: properly fill out the ldt ops · 38ffbe66

由 Jeremy Fitzhardinge 提交于 7月 23, 2008

LTP testing showed that Xen does not properly implement
sys_modify_ldt().  This patch does the final little bits needed to
make the ldt work properly.
Signed-off-by: NJeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: NIngo Molnar <mingo@elte.hu>

38ffbe66

22 7月, 2008 1 次提交

x86: rename PTE_MASK to PTE_PFN_MASK · 59438c9f

由 Jeremy Fitzhardinge 提交于 7月 21, 2008

Rusty, in his peevish way, complained that macros defining constants
should have a name which somewhat accurately reflects the actual
purpose of the constant.

Aside from the fact that PTE_MASK gives no clue as to what's actually
being masked, and is misleadingly similar to the functionally entirely
different PMD_MASK, PUD_MASK and PGD_MASK, I don't really see what the
problem is.

But if this patch silences the incessent noise, then it will have
achieved its goal (TODO: write test-case).
Signed-off-by: NJeremy Fitzhardinge <jeremy@goop.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: NIngo Molnar <mingo@elte.hu>

59438c9f

20 7月, 2008 1 次提交

x86, xen: fix apic_ops build on UP · caf43bf7

由 Ingo Molnar 提交于 7月 20, 2008

fix:

 arch/x86/xen/enlighten.c:615: error: variable ‘xen_basic_apic_ops’ has initializer but incomplete type
 arch/x86/xen/enlighten.c:616: error: unknown field ‘read’ specified in initializer
 [...]
Signed-off-by: NIngo Molnar <mingo@elte.hu>

caf43bf7

18 7月, 2008 2 次提交

xen: report hypervisor version · 95c7c23b

由 Jeremy Fitzhardinge 提交于 7月 15, 2008

Various versions of the hypervisor have differences in what ABIs and
features they support.  Print some details into the boot log to help
with remote debugging.
Signed-off-by: NJeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: NIngo Molnar <mingo@elte.hu>

95c7c23b

x86: APIC: remove apic_write_around(); use alternatives · 593f4a78

由 Maciej W. Rozycki 提交于 7月 16, 2008

Use alternatives to select the workaround for the 11AP Pentium erratum
for the affected steppings on the fly rather than build time.  Remove the
X86_GOOD_APIC configuration option and replace all the calls to
apic_write_around() with plain apic_write(), protecting accesses to the
ESR as appropriate due to the 3AP Pentium erratum.  Remove
apic_read_around() and all its invocations altogether as not needed.
Remove apic_write_atomic() and all its implementing backends.  The use of
ASM_OUTPUT2() is not strictly needed for input constraints, but I have
used it for readability's sake.

I had the feeling no one else was brave enough to do it, so I went ahead
and here it is.  Verified by checking the generated assembly and tested
with both a 32-bit and a 64-bit configuration, also with the 11AP
"feature" forced on and verified with gdb on /proc/kcore to work as
expected (as an 11AP machines are quite hard to get hands on these days).
Some script complained about the use of "volatile", but apic_write() needs
it for the same reason and is effectively a replacement for writel(), so I
have disregarded it.

I am not sure what the policy wrt defconfig files is, they are generated
and there is risk of a conflict resulting from an unrelated change, so I
have left changes to them out.  The option will get removed from them at
the next run.

Some testing with machines other than mine will be needed to avoid some
stupid mistake, but despite its volume, the change is not really that
intrusive, so I am fairly confident that because it works for me, it will
everywhere.
Signed-off-by: NMaciej W. Rozycki <macro@linux-mips.org>
Signed-off-by: NIngo Molnar <mingo@elte.hu>

593f4a78

16 7月, 2008 16 次提交

xen64: fix build error on 32-bit + !HIGHMEM · b3fe1243

由 Ingo Molnar 提交于 7月 09, 2008

fix:

arch/x86/xen/enlighten.c: In function 'xen_set_fixmap':
arch/x86/xen/enlighten.c:1127: error: 'FIX_KMAP_BEGIN' undeclared (first use in this function)
arch/x86/xen/enlighten.c:1127: error: (Each undeclared identifier is reported only once
arch/x86/xen/enlighten.c:1127: error: for each function it appears in.)
arch/x86/xen/enlighten.c:1127: error: 'FIX_KMAP_END' undeclared (first use in this function)
make[1]: *** [arch/x86/xen/enlighten.o] Error 1
make: *** [arch/x86/xen/enlighten.o] Error 2

FIX_KMAP_BEGIN is only available on HIGHMEM.
Signed-off-by: NIngo Molnar <mingo@elte.hu>

b3fe1243

xen: implement Xen write_msr operation · 1153968a

由 Jeremy Fitzhardinge 提交于 7月 08, 2008

64-bit uses MSRs for important things like the base for fs and
gs-prefixed addresses.  It's more efficient to use a hypercall to
update these, rather than go via the trap and emulate path.

Other MSR writes are just passed through; in an unprivileged domain
they do nothing, but it might be useful later.
Signed-off-by: NJeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: Stephen Tweedie <sct@redhat.com>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Cc: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: NIngo Molnar <mingo@elte.hu>

1153968a

xen64: set up userspace syscall patch · bf18bf94

由 Jeremy Fitzhardinge 提交于 7月 08, 2008

64-bit userspace expects the vdso to be mapped at a specific fixed
address, which happens to be in the middle of the kernel address
space.  Because we have split user and kernel pagetables, we need to
make special arrangements for the vsyscall mapping to appear in the
kernel part of the user pagetable.
Signed-off-by: NJeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: Stephen Tweedie <sct@redhat.com>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Cc: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: NIngo Molnar <mingo@elte.hu>

bf18bf94

xen64: set up syscall and sysenter entrypoints for 64-bit · 6fcac6d3

由 Jeremy Fitzhardinge 提交于 7月 08, 2008

We set up entrypoints for syscall and sysenter.  sysenter is only used
for 32-bit compat processes, whereas syscall can be used in by both 32
and 64-bit processes.
Signed-off-by: NJeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: Stephen Tweedie <sct@redhat.com>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Cc: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: NIngo Molnar <mingo@elte.hu>

6fcac6d3

xen64: allocate and manage user pagetables · d6182fbf

由 Jeremy Fitzhardinge 提交于 7月 08, 2008

Because the x86_64 architecture does not enforce segment limits, Xen
cannot protect itself with them as it does in 32-bit mode.  Therefore,
to protect itself, it runs the guest kernel in ring 3.  Since it also
runs the guest userspace in ring3, the guest kernel must maintain a
second pagetable for its userspace, which does not map kernel space.
Naturally, the guest kernel pagetables map both kernel and userspace.

The userspace pagetable is attached to the corresponding kernel
pagetable via the pgd's page->private field.  It is allocated and
freed at the same time as the kernel pgd via the
paravirt_pgd_alloc/free hooks.

Fortunately, the user pagetable is almost entirely shared with the
kernel pagetable; the only difference is the pgd page itself.  set_pgd
will populate all entries in the kernel pagetable, and also set the
corresponding user pgd entry if the address is less than
STACK_TOP_MAX.

The user pagetable must be pinned and unpinned with the kernel one,
but because the pagetables are aliased, pgd_walk() only needs to be
called on the kernel pagetable.  The user pgd page is then
pinned/unpinned along with the kernel pgd page.

xen_write_cr3 must write both the kernel and user cr3s.

The init_mm.pgd pagetable never has a user pagetable allocated for it,
because it can never be used while running usermode.

One awkward area is that early in boot the page structures are not
available.  No user pagetable can exist at that point, but it
complicates the logic to avoid looking at the page structure.
Signed-off-by: NJeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: Stephen Tweedie <sct@redhat.com>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Cc: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: NIngo Molnar <mingo@elte.hu>

d6182fbf

xen64: Clear %fs on xen_load_tls() · 8a95408e

由 Eduardo Habkost 提交于 7月 08, 2008

We need to do this, otherwise we can get a GPF on hypercall return
after TLS descriptor is cleared but %fs is still pointing to it.
Signed-off-by: NEduardo Habkost <ehabkost@redhat.com>
Signed-off-by: NJeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: Stephen Tweedie <sct@redhat.com>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Cc: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: NIngo Molnar <mingo@elte.hu>

8a95408e

xen: make sure the kernel command line is right · b7c3c5c1

由 Jeremy Fitzhardinge 提交于 7月 08, 2008

Point the boot params cmd_line_ptr to the domain-builder-provided
command line.
Signed-off-by: NJeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: Stephen Tweedie <sct@redhat.com>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Cc: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: NIngo Molnar <mingo@elte.hu>

b7c3c5c1

xen64: implement xen_load_gs_index() · a8fc1089

由 Eduardo Habkost 提交于 7月 08, 2008

xen-64: implement xen_load_gs_index()
Signed-off-by: NEduardo Habkost <ehabkost@redhat.com>
Signed-off-by: NJeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: Stephen Tweedie <sct@redhat.com>
Cc: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: NIngo Molnar <mingo@elte.hu>

a8fc1089

xen64: add identity irq->vector map · 0725cbb9

由 Jeremy Fitzhardinge 提交于 7月 08, 2008

The x86_64 interrupt subsystem is oriented towards vectors, as opposed
to a flat irq space as it is in x86-32.  This patch adds a simple
identity irq->vector mapping so that we can continue to feed irqs into
do_IRQ() and get a good result.

Ideally x86_32 will unify with the 64-bit code and use vectors too.
At that point we can move to mapping event channels to vectors, which
will allow us to economise on irqs (so per-cpu event channels can
share irqs, rather than having to allocte one per cpu, for example).
Signed-off-by: NJeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: Stephen Tweedie <sct@redhat.com>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Cc: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: NIngo Molnar <mingo@elte.hu>

0725cbb9

xen64: add pvop for swapgs · 952d1d70

由 Jeremy Fitzhardinge 提交于 7月 08, 2008

swapgs is a no-op under Xen, because the hypervisor makes sure the
right version of %gs is current when switching between user and kernel
modes.  This means that the swapgs "implementation" can be inlined and
used when the stack is unsafe (usermode).  Unfortunately, it means
that disabling patching will result in a non-booting kernel...
Signed-off-by: NJeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: Stephen Tweedie <sct@redhat.com>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Cc: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: NIngo Molnar <mingo@elte.hu>

952d1d70

xen64: deal with extra words Xen pushes onto exception frames · 997409d3

由 Jeremy Fitzhardinge 提交于 7月 08, 2008

Xen pushes two extra words containing the values of rcx and r11.  This
pvop hook copies the words back into their appropriate registers, and
cleans them off the stack.  This leaves the stack in native form, so
the normal handler can run unchanged.
Signed-off-by: NJeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: Stephen Tweedie <sct@redhat.com>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Cc: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: NIngo Molnar <mingo@elte.hu>

997409d3

xen64: xen_write_idt_entry() and cvt_gate_to_trap() · e176d367

由 Eduardo Habkost 提交于 7月 08, 2008

Changed to use the (to-be-)unified descriptor structs.
Signed-off-by: NEduardo Habkost <ehabkost@Rawhide-64.localdomain>
Signed-off-by: NJeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: Stephen Tweedie <sct@redhat.com>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Cc: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: NIngo Molnar <mingo@elte.hu>

e176d367

xen64: defer setting pagetable alloc/release ops · 8745f8b0

由 Jeremy Fitzhardinge 提交于 7月 08, 2008

We need to wait until the page structure is available to use the
proper pagetable page alloc/release operations, since they use struct
page to determine if a pagetable is pinned.

This happened to work in 32bit because nobody allocated new pagetable
pages in the interim between xen_pagetable_setup_done and
xen_post_allocator_init, but the 64-bit kenrel needs to allocate more
pagetable levels.
Signed-off-by: NJeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: Stephen Tweedie <sct@redhat.com>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Cc: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: NIngo Molnar <mingo@elte.hu>

8745f8b0

xen32: create initial mappings like 64-bit · 39dbc5bd

由 Jeremy Fitzhardinge 提交于 7月 08, 2008

Rearrange the pagetable initialization to share code with the 64-bit
kernel.  Rather than deferring anything to pagetable_setup_start, just
set up an initial pagetable in swapper_pg_dir early at startup, and
create an additional 8MB of physical memory mappings.  This matches
the native head_32.S mappings to a large degree, and allows the rest
of the pagetable setup to continue without much Xen vs. native
difference.
Signed-off-by: NJeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: Stephen Tweedie <sct@redhat.com>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Cc: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: NIngo Molnar <mingo@elte.hu>

39dbc5bd

xen64: map an initial chunk of physical memory · d114e198

由 Jeremy Fitzhardinge 提交于 7月 08, 2008

Early in boot, map a chunk of extra physical memory for use later on.
We need a pool of mapped pages to allocate further pages to construct
pagetables mapping all physical memory.
Signed-off-by: NJeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: Stephen Tweedie <sct@redhat.com>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Cc: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: NIngo Molnar <mingo@elte.hu>

d114e198

xen64: 64-bit starts using set_pte from very early · 22911b3f

由 Jeremy Fitzhardinge 提交于 7月 08, 2008

It also doesn't need the 32-bit hack version of set_pte for initial
pagetable construction, so just make it use the real thing.
Signed-off-by: NJeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: Stephen Tweedie <sct@redhat.com>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Cc: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: NIngo Molnar <mingo@elte.hu>

22911b3f