提交 · f1d7062a235d057e5d85ed2860bef609e0160cde · openeuler / Kernel

31 8月, 2009 2 次提交

x86: Move xen_post_allocator_init into xen_pagetable_setup_done · f1d7062a

由 Thomas Gleixner 提交于 8月 20, 2009

We really do not need two paravirt/x86_init_ops functions which are
called in two consecutive source lines. Move the only user of
post_allocator_init into the already existing pagetable_setup_done
function.
Signed-off-by: NThomas Gleixner <tglx@linutronix.de>

f1d7062a

T
x86: Move paravirt pagetable_setup to x86_init_ops · 030cb6c0
由 Thomas Gleixner 提交于 8月 20, 2009
```
Replace more paravirt hackery by proper x86_init_ops.
Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
```
030cb6c0

13 5月, 2009 1 次提交

xen: use header for EXPORT_SYMBOL_GPL · 44408ad7

由 Randy Dunlap 提交于 5月 12, 2009

mmu.c needs to #include module.h to prevent these warnings:

 arch/x86/xen/mmu.c:239: warning: data definition has no type or storage class
 arch/x86/xen/mmu.c:239: warning: type defaults to 'int' in declaration of 'EXPORT_SYMBOL_GPL'
 arch/x86/xen/mmu.c:239: warning: parameter names (without types) in function declaration

[ Impact: cleanup ]
Signed-off-by: NRandy Dunlap <randy.dunlap@oracle.com>
Acked-by: NJeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
LKML-Reference: <new-submission>
Signed-off-by: NIngo Molnar <mingo@elte.hu>

44408ad7

08 5月, 2009 1 次提交

x86: xen, i386: reserve Xen pagetables · 33df4db0

由 Jeremy Fitzhardinge 提交于 5月 07, 2009

The Xen pagetables are no longer implicitly reserved as part of the other
i386_start_kernel reservations, so make sure we explicitly reserve them.
This prevents them from being released into the general kernel free page
pool and reused.

[ Impact: fix Xen guest crash ]
Also-Bisected-by: NBryan Donlan <bdonlan@gmail.com>
Signed-off-by: NJeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: Xen-devel <xen-devel@lists.xensource.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
LKML-Reference: <4A032EEC.30509@goop.org>
Signed-off-by: NIngo Molnar <mingo@elte.hu>

33df4db0

11 4月, 2009 1 次提交

x86: fix set_fixmap to use phys_addr_t · 9b987aeb

由 Masami Hiramatsu 提交于 4月 09, 2009

Impact: fix kprobes crash on 32-bit with RAM above 4G

Use phys_addr_t for receiving a physical address argument
instead of unsigned long. This allows fixmap to handle
pages higher than 4GB on x86-32.
Signed-off-by: NMasami Hiramatsu <mhiramat@redhat.com>
Acked-by: NMathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: systemtap-ml <systemtap@sources.redhat.com>
Cc: Gary Hade <garyhade@us.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
LKML-Reference: <49DE3695.6040800@redhat.com>
Signed-off-by: NIngo Molnar <mingo@elte.hu>

9b987aeb

10 4月, 2009 2 次提交

x86: fix set_fixmap to use phys_addr_t · 3b3809ac

由 Masami Hiramatsu 提交于 4月 09, 2009

Use phys_addr_t for receiving a physical address argument instead of
unsigned long.  This allows fixmap to handle pages higher than 4GB on
x86-32.
Signed-off-by: NMasami Hiramatsu <mhiramat@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Acked-by: NMathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

3b3809ac

xen: add FIX_TEXT_POKE to fixmap · e7c06488

由 Jeremy Fitzhardinge 提交于 3月 07, 2009

FIX_TEXT_POKE[01] are used to map kernel addresses, so they're mapping
pfns, not mfns.
Signed-off-by: NJeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>

e7c06488

09 4月, 2009 6 次提交

xen: add FIX_TEXT_POKE to fixmap · 3ecb1b7d

由 Jeremy Fitzhardinge 提交于 3月 07, 2009

FIX_TEXT_POKE[01] are used to map kernel addresses, so they're mapping
pfns, not mfns.
Signed-off-by: NJeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>

3ecb1b7d

xen/mmu: weaken flush_tlb_other test · e3f8a74e

由 Jeremy Fitzhardinge 提交于 3月 04, 2009

Impact: fixes crashing bug

There's no particular problem with getting an empty cpu mask,
so just shortcut-return if we get one.

Avoids crash reported by Christophe Saout <christophe@saout.de>
Signed-off-by: NJeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>

e3f8a74e

xen/mmu: some early pagetable cleanups · b96229b5

由 Jeremy Fitzhardinge 提交于 3月 17, 2009

1. make sure early-allocated ptes are pinned, so they can be later
   unpinned
2. don't pin pmd+pud, just make them RO
3. scatter some __inits around
Signed-off-by: NJeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>

b96229b5

xen: split construction of p2m mfn tables from registration · cdaead6b

由 Jeremy Fitzhardinge 提交于 2月 27, 2009

Build the p2m_mfn_list_list early with the rest of the p2m table, but
register it later when the real shared_info structure is in place.
Signed-off-by: NJeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>

cdaead6b

xen: separate p2m allocation from setting · e791ca0f

由 Jeremy Fitzhardinge 提交于 2月 26, 2009

When doing very early p2m setting, we need to separate setting
from allocation, so split things up accordingly.
Signed-off-by: NJeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>

e791ca0f

xen: disable preempt for leave_lazy_mmu · d6382bf7

由 Jeremy Fitzhardinge 提交于 2月 20, 2009

xen_mc_flush() requires preemption to be disabled for its own sanity,
so disable it while we're flushing.
Signed-off-by: NJeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>

d6382bf7

31 3月, 2009 3 次提交

xen/mmu: weaken flush_tlb_other test · 8de07bbd

由 Jeremy Fitzhardinge 提交于 3月 04, 2009

Impact: fixes crashing bug

There's no particular problem with getting an empty cpu mask,
so just shortcut-return if we get one.

Avoids crash reported by Christophe Saout <christophe@saout.de>
Signed-off-by: NJeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>

8de07bbd

xen/mmu: some early pagetable cleanups · 4185f354

由 Jeremy Fitzhardinge 提交于 3月 17, 2009

1. make sure early-allocated ptes are pinned, so they can be later
   unpinned
2. don't pin pmd+pud, just make them RO
3. scatter some __inits around
Signed-off-by: NJeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>

4185f354

xen: split construction of p2m mfn tables from registration · 7571a604

由 Jeremy Fitzhardinge 提交于 2月 27, 2009

7571a604

30 3月, 2009 5 次提交

xen: separate p2m allocation from setting · 59d71871

由 Jeremy Fitzhardinge 提交于 2月 26, 2009

When doing very early p2m setting, we need to separate setting
from allocation, so split things up accordingly.
Signed-off-by: NJeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>

59d71871

xen: disable preempt for leave_lazy_mmu · 5caecb94

由 Jeremy Fitzhardinge 提交于 2月 20, 2009

xen_mc_flush() requires preemption to be disabled for its own sanity,
so disable it while we're flushing.
Signed-off-by: NJeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>

5caecb94

x86/paravirt: allow preemption with lazy mmu mode · 2829b449

由 Jeremy Fitzhardinge 提交于 2月 17, 2009

Impact: remove obsolete checks, simplification

Lift restrictions on preemption with lazy mmu mode, as it is now allowed.
Signed-off-by: NJeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Acked-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>

2829b449

x86/paravirt: flush pending mmu updates on context switch · b407fc57

由 Jeremy Fitzhardinge 提交于 2月 17, 2009

Impact: allow preemption during lazy mmu updates

If we're in lazy mmu mode when context switching, leave
lazy mmu mode, but remember the task's state in
TIF_LAZY_MMU_UPDATES.  When we resume the task, check this
flag and re-enter lazy mmu mode if its set.

This sets things up for allowing lazy mmu mode while preemptible,
though that won't actually be active until the next change.
Signed-off-by: NJeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Acked-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>

b407fc57

x86/pvops: replace arch_enter_lazy_cpu_mode with arch_start_context_switch · 7fd7d83d

由 Jeremy Fitzhardinge 提交于 2月 17, 2009

Impact: simplification, prepare for later changes

Make lazy cpu mode more specific to context switching, so that
it makes sense to do more context-switch specific things in
the callbacks.
Signed-off-by: NJeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Acked-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>

7fd7d83d

19 3月, 2009 1 次提交

x86: with the last user gone, remove set_pte_present · 71ff49d7

由 Jeremy Fitzhardinge 提交于 3月 18, 2009

Impact: cleanup

set_pte_present() is no longer used, directly or indirectly,
so remove it.
Signed-off-by: NJeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: Xen-devel <xen-devel@lists.xensource.com>
Cc: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: Alok Kataria <akataria@vmware.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Avi Kivity <avi@redhat.com>
LKML-Reference: <1237406613-2929-2-git-send-email-jeremy@goop.org>
Signed-off-by: NIngo Molnar <mingo@elte.hu>

71ff49d7

15 3月, 2009 1 次提交

x86: add brk allocation for very, very early allocations · 93dbda7c

由 Jeremy Fitzhardinge 提交于 2月 26, 2009

Impact: new interface

Add a brk()-like allocator which effectively extends the bss in order
to allow very early code to do dynamic allocations.  This is better than
using statically allocated arrays for data in subsystems which may never
get used.

The space for brk allocations is in the bss ELF segment, so that the
space is mapped properly by the code which maps the kernel, and so
that bootloaders keep the space free rather than putting a ramdisk or
something into it.

The bss itself, delimited by __bss_stop, ends before the brk area
(__brk_base to __brk_limit).  The kernel text, data and bss is reserved
up to __bss_stop.

Any brk-allocated data is reserved separately just before the kernel
pagetable is built, as that code allocates from unreserved spaces
in the e820 map, potentially allocating from any unused brk memory.
Ultimately any unused memory in the brk area is used in the general
kernel memory pool.

Initially the brk space is set to 1MB, which is probably much larger
than any user needs (the largest current user is i386 head_32.S's code
to build the pagetables to map the kernel, which can get fairly large
with a big kernel image and no PSE support).  So long as the system
has sufficient memory for the bootloader to reserve the kernel+1MB brk,
there are no bad effects resulting from an over-large brk.
Signed-off-by: NJeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: NH. Peter Anvin <hpa@zytor.com>

93dbda7c

02 3月, 2009 1 次提交

xen: deal with virtually mapped percpu data · 9976b39b

由 Jeremy Fitzhardinge 提交于 2月 27, 2009

The virtually mapped percpu space causes us two problems:

 - for hypercalls which take an mfn, we need to do a full pagetable
   walk to convert the percpu va into an mfn, and

 - when a hypercall requires a page to be mapped RO via all its aliases,
   we need to make sure its RO in both the percpu mapping and in the
   linear mapping

This primarily affects the gdt and the vcpu info structure.
Signed-off-by: NJeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: Xen-devel <xen-devel@lists.xensource.com>
Cc: Gerd Hoffmann <kraxel@redhat.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Tejun Heo <htejun@gmail.com>
Signed-off-by: NIngo Molnar <mingo@elte.hu>

9976b39b

13 2月, 2009 1 次提交

xen: fix xen_flush_tlb_others · 694aa960

由 Ian Campbell 提交于 2月 12, 2009

The commit
    commit 4595f962
    Author: Rusty Russell <rusty@rustcorp.com.au>
    Date:   Sat Jan 10 21:58:09 2009 -0800

        x86: change flush_tlb_others to take a const struct cpumask

causes xen_flush_tlb_others to allocate a multicall and then issue it
without initializing it in the case where the cpumask is empty,
leading to:

        [    8.354898] 1 multicall(s) failed: cpu 1
        [    8.354921] Pid: 2213, comm: bootclean Not tainted 2.6.29-rc3-x86_32p-xenU-tip #135
        [    8.354937] Call Trace:
        [    8.354955]  [<c01036e3>] xen_mc_flush+0x133/0x1b0
        [    8.354971]  [<c0105d2a>] ? xen_force_evtchn_callback+0x1a/0x30
        [    8.354988]  [<c0105a60>] xen_flush_tlb_others+0xb0/0xd0
        [    8.355003]  [<c0126643>] flush_tlb_page+0x53/0xa0
        [    8.355018]  [<c0176a80>] do_wp_page+0x2a0/0x7c0
        [    8.355034]  [<c0238f0a>] ? notify_remote_via_irq+0x3a/0x70
        [    8.355049]  [<c0178950>] handle_mm_fault+0x7b0/0xa50
        [    8.355065]  [<c0131a3e>] ? wake_up_new_task+0x8e/0xb0
        [    8.355079]  [<c01337b5>] ? do_fork+0xe5/0x320
        [    8.355095]  [<c0121919>] do_page_fault+0xe9/0x240
        [    8.355109]  [<c0121830>] ? do_page_fault+0x0/0x240
        [    8.355125]  [<c032457a>] error_code+0x72/0x78
        [    8.355139]   call  1/1: op=2863311530 arg=[aaaaaaaa] result=-38     xen_flush_tlb_others+0x41/0xd0

Since empty cpumasks are rare and undoing an xen_mc_entry() is tricky
just issue such requests normally.
Signed-off-by: NIan Campbell <ian.campbell@citrix.com>
Signed-off-by: NIngo Molnar <mingo@elte.hu>

694aa960

05 2月, 2009 1 次提交

xen: fix 32-bit build resulting from mmu move · 1f4f9315

由 Jeremy Fitzhardinge 提交于 2月 02, 2009

Moving the mmu code from enlighten.c to mmu.c inadvertently broke the
32-bit build.  Fix it.
Signed-off-by: NJeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: NH. Peter Anvin <hpa@linux.intel.com>

1f4f9315

31 1月, 2009 2 次提交

x86/paravirt: use callee-saved convention for pte_val/make_pte/etc · da5de7c2

由 Jeremy Fitzhardinge 提交于 1月 28, 2009

Impact: Optimization

In the native case, pte_val, make_pte, etc are all just identity
functions, so there's no need to clobber a lot of registers over them.

(This changes the 32-bit callee-save calling convention to return both
EAX and EDX so functions can return 64-bit values.)
Signed-off-by: NJeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: NH. Peter Anvin <hpa@zytor.com>

da5de7c2

xen: move remaining mmu-related stuff into mmu.c · 319f3ba5

由 Jeremy Fitzhardinge 提交于 1月 28, 2009

Impact: Cleanup

Move remaining mmu-related stuff into mmu.c.
A general cleanup, and lay the groundwork for later patches.
Signed-off-by: NJeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: NH. Peter Anvin <hpa@zytor.com>

319f3ba5

18 1月, 2009 1 次提交
- B
  x86-64: Move TLB state from PDA to per-cpu and consolidate with 32-bit. · 9eb912d1
  由 Brian Gerst 提交于 1月 19, 2009
```
Signed-off-by: NBrian Gerst <brgerst@gmail.com>
Signed-off-by: NTejun Heo <tj@kernel.org>
```
  9eb912d1
16 1月, 2009 1 次提交

percpu: add optimized generic percpu accessors · 6dbde353

由 Ingo Molnar 提交于 1月 15, 2009

It is an optimization and a cleanup, and adds the following new
generic percpu methods:

  percpu_read()
  percpu_write()
  percpu_add()
  percpu_sub()
  percpu_and()
  percpu_or()
  percpu_xor()

and implements support for them on x86. (other architectures will fall
back to a default implementation)

The advantage is that for example to read a local percpu variable,
instead of this sequence:

 return __get_cpu_var(var);

 ffffffff8102ca2b:	48 8b 14 fd 80 09 74 	mov    -0x7e8bf680(,%rdi,8),%rdx
 ffffffff8102ca32:	81
 ffffffff8102ca33:	48 c7 c0 d8 59 00 00 	mov    $0x59d8,%rax
 ffffffff8102ca3a:	48 8b 04 10          	mov    (%rax,%rdx,1),%rax

We can get a single instruction by using the optimized variants:

 return percpu_read(var);

 ffffffff8102ca3f:	65 48 8b 05 91 8f fd 	mov    %gs:0x7efd8f91(%rip),%rax

I also cleaned up the x86-specific APIs and made the x86 code use
these new generic percpu primitives.

tj: * fixed generic percpu_sub() definition as Roel Kluin pointed out
    * added percpu_and() for completeness's sake
    * made generic percpu ops atomic against preemption
Signed-off-by: NIngo Molnar <mingo@elte.hu>
Signed-off-by: NTejun Heo <tj@kernel.org>

6dbde353

17 12月, 2008 2 次提交

x86: xen: use smp_call_function_many() · e4d98207

由 Mike Travis 提交于 12月 16, 2008

Impact: use new API, remove cpumask from stack.

Change smp_call_function_mask() callers to smp_call_function_many().

This removes a cpumask from the stack, and falls back should allocating
the cpumask var fail (only possible with CONFIG_CPUMASKS_OFFSTACK).
Signed-off-by: NRusty Russell <rusty@rustcorp.com.au>
Signed-off-by: NMike Travis <travis@sgi.com>
Cc: jeremy@xensource.com

e4d98207

xen: whitespace/checkpatch cleanup · f63c2f24

由 Tej 提交于 12月 16, 2008

Impact: cleanup
Signed-off-by: NTej <bewith.tej@gmail.com>
Signed-off-by: NJeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: NIngo Molnar <mingo@elte.hu>

f63c2f24

23 11月, 2008 1 次提交

xen: pin correct PGD on suspend · 86bbc2c2

由 Ian Campbell 提交于 11月 21, 2008

Impact: fix Xen guest boot failure

commit eefb47f6 ("xen: use
spin_lock_nest_lock when pinning a pagetable") changed xen_pgd_walk to
walk over mm->pgd rather than taking pgd as an argument.

This breaks xen_mm_(un)pin_all() because it makes init_mm.pgd readonly
instead of the pgd we are interested in and therefore the pin subsequently
fails.

(XEN) mm.c:2280:d15 Bad type (saw 00000000e8000001 != exp 0000000060000000) for mfn bc464 (pfn 21ca7)
(XEN) mm.c:2665:d15 Error while pinning mfn bc464

[   14.586913] 1 multicall(s) failed: cpu 0
[   14.586926] Pid: 14, comm: kstop/0 Not tainted 2.6.28-rc5-x86_32p-xenU-00172-gee2f6cc7 #200
[   14.586940] Call Trace:
[   14.586955]  [<c030c17a>] ? printk+0x18/0x1e
[   14.586972]  [<c0103df3>] xen_mc_flush+0x163/0x1d0
[   14.586986]  [<c0104bc1>] __xen_pgd_pin+0xa1/0x110
[   14.587000]  [<c015a330>] ? stop_cpu+0x0/0xf0
[   14.587015]  [<c0104d7b>] xen_mm_pin_all+0x4b/0x70
[   14.587029]  [<c022bcb9>] xen_suspend+0x39/0xe0
[   14.587042]  [<c015a330>] ? stop_cpu+0x0/0xf0
[   14.587054]  [<c015a3cd>] stop_cpu+0x9d/0xf0
[   14.587067]  [<c01417cd>] run_workqueue+0x8d/0x150
[   14.587080]  [<c030e4b3>] ? _spin_unlock_irqrestore+0x23/0x40
[   14.587094]  [<c014558a>] ? prepare_to_wait+0x3a/0x70
[   14.587107]  [<c0141918>] worker_thread+0x88/0xf0
[   14.587120]  [<c01453c0>] ? autoremove_wake_function+0x0/0x50
[   14.587133]  [<c0141890>] ? worker_thread+0x0/0xf0
[   14.587146]  [<c014509c>] kthread+0x3c/0x70
[   14.587157]  [<c0145060>] ? kthread+0x0/0x70
[   14.587170]  [<c0109d1b>] kernel_thread_helper+0x7/0x10
[   14.587181]   call  1/3: op=14 arg=[c0415000] result=0
[   14.587192]   call  2/3: op=14 arg=[e1ca2000] result=0
[   14.587204]   call  3/3: op=26 arg=[c1808860] result=-22
Signed-off-by: NIan Campbell <ian.campbell@citrix.com>
Acked-by: NJeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: NIngo Molnar <mingo@elte.hu>

86bbc2c2

07 11月, 2008 2 次提交

xen: make sure stray alias mappings are gone before pinning · d05fdf31

由 Jeremy Fitzhardinge 提交于 10月 28, 2008

Xen requires that all mappings of pagetable pages are read-only, so
that they can't be updated illegally.  As a result, if a page is being
turned into a pagetable page, we need to make sure all its mappings
are RO.

If the page had been used for ioremap or vmalloc, it may still have
left over mappings as a result of not having been lazily unmapped.
This change makes sure we explicitly mop them all up before pinning
the page.

Unlike aliases created by kmap, the there can be vmalloc aliases even
for non-high pages, so we must do the flush unconditionally.
Signed-off-by: NJeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: Linux Memory Management List <linux-mm@kvack.org>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: NIngo Molnar <mingo@elte.hu>

d05fdf31

x86, xen: fix use of pgd_page now that it really does return a page · 47cb2ed9

由 Jeremy Fitzhardinge 提交于 11月 06, 2008

Impact: fix 32-bit Xen guest boot crash

On 32-bit PAE, pud_page, for no good reason, didn't really return a
struct page *.  Since Jan Beulich's fix "i386/PAE: fix pud_page()",
pud_page does return a struct page *.

Because PAE has 3 pagetable levels, the pud level is folded into the
pgd level, so pgd_page() is the same as pud_page(), and now returns
a struct page *.  Update the xen/mmu.c code which uses pgd_page()
accordingly.
Signed-off-by: NJeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: NIngo Molnar <mingo@elte.hu>

47cb2ed9

27 10月, 2008 1 次提交

xen: fix Xen domU boot with batched mprotect · 9f32d21c

由 Chris Lalancette 提交于 10月 23, 2008

Impact: fix guest kernel boot crash on certain configs

Recent i686 2.6.27 kernels with a certain amount of memory (between
736 and 855MB) have a problem booting under a hypervisor that supports
batched mprotect (this includes the RHEL-5 Xen hypervisor as well as
any 3.3 or later Xen hypervisor).

The problem ends up being that xen_ptep_modify_prot_commit() is using
virt_to_machine to calculate which pfn to update.  However, this only
works for pages that are in the p2m list, and the pages coming from
change_pte_range() in mm/mprotect.c are kmap_atomic pages.  Because of
this, we can run into the situation where the lookup in the p2m table
returns an INVALID_MFN, which we then try to pass to the hypervisor,
which then (correctly) denies the request to a totally bogus pfn.

The right thing to do is to use arbitrary_virt_to_machine, so that we
can be sure we are modifying the right pfn.  This unfortunately
introduces a performance penalty because of a full page-table-walk,
but we can avoid that penalty for pages in the p2m list by checking if
virt_addr_valid is true, and if so, just doing the lookup in the p2m
table.

The attached patch implements this, and allows my 2.6.27 i686 based
guest with 768MB of memory to boot on a RHEL-5 hypervisor again.
Thanks to Jeremy for the suggestions about how to fix this particular
issue.
Signed-off-by: NChris Lalancette <clalance@redhat.com>
Signed-off-by: NJeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: Chris Lalancette <clalance@redhat.com>
Signed-off-by: NIngo Molnar <mingo@elte.hu>

9f32d21c

20 10月, 2008 1 次提交

mm: rewrite vmap layer · db64fe02

由 Nick Piggin 提交于 10月 18, 2008

Rewrite the vmap allocator to use rbtrees and lazy tlb flushing, and
provide a fast, scalable percpu frontend for small vmaps (requires a
slightly different API, though).

The biggest problem with vmap is actually vunmap.  Presently this requires
a global kernel TLB flush, which on most architectures is a broadcast IPI
to all CPUs to flush the cache.  This is all done under a global lock.  As
the number of CPUs increases, so will the number of vunmaps a scaled
workload will want to perform, and so will the cost of a global TLB flush.
 This gives terrible quadratic scalability characteristics.

Another problem is that the entire vmap subsystem works under a single
lock.  It is a rwlock, but it is actually taken for write in all the fast
paths, and the read locking would likely never be run concurrently anyway,
so it's just pointless.

This is a rewrite of vmap subsystem to solve those problems.  The existing
vmalloc API is implemented on top of the rewritten subsystem.

The TLB flushing problem is solved by using lazy TLB unmapping.  vmap
addresses do not have to be flushed immediately when they are vunmapped,
because the kernel will not reuse them again (would be a use-after-free)
until they are reallocated.  So the addresses aren't allocated again until
a subsequent TLB flush.  A single TLB flush then can flush multiple
vunmaps from each CPU.

XEN and PAT and such do not like deferred TLB flushing because they can't
always handle multiple aliasing virtual addresses to a physical address.
They now call vm_unmap_aliases() in order to flush any deferred mappings.
That call is very expensive (well, actually not a lot more expensive than
a single vunmap under the old scheme), however it should be OK if not
called too often.

The virtual memory extent information is stored in an rbtree rather than a
linked list to improve the algorithmic scalability.

There is a per-CPU allocator for small vmaps, which amortizes or avoids
global locking.

To use the per-CPU interface, the vm_map_ram / vm_unmap_ram interfaces
must be used in place of vmap and vunmap.  Vmalloc does not use these
interfaces at the moment, so it will not be quite so scalable (although it
will use lazy TLB flushing).

As a quick test of performance, I ran a test that loops in the kernel,
linearly mapping then touching then unmapping 4 pages.  Different numbers
of tests were run in parallel on an 4 core, 2 socket opteron.  Results are
in nanoseconds per map+touch+unmap.

threads           vanilla         vmap rewrite
1                 14700           2900
2                 33600           3000
4                 49500           2800
8                 70631           2900

So with a 8 cores, the rewritten version is already 25x faster.

In a slightly more realistic test (although with an older and less
scalable version of the patch), I ripped the not-very-good vunmap batching
code out of XFS, and implemented the large buffer mapping with vm_map_ram
and vm_unmap_ram...  along with a couple of other tricks, I was able to
speed up a large directory workload by 20x on a 64 CPU system.  I believe
vmap/vunmap is actually sped up a lot more than 20x on such a system, but
I'm running into other locks now.  vmap is pretty well blown off the
profiles.

Before:
1352059 total                                      0.1401
798784 _write_lock                              8320.6667 <- vmlist_lock
529313 default_idle                             1181.5022
 15242 smp_call_function                         15.8771  <- vmap tlb flushing
  2472 __get_vm_area_node                         1.9312  <- vmap
  1762 remove_vm_area                             4.5885  <- vunmap
   316 map_vm_area                                0.2297  <- vmap
   312 kfree                                      0.1950
   300 _spin_lock                                 3.1250
   252 sn_send_IPI_phys                           0.4375  <- tlb flushing
   238 vmap                                       0.8264  <- vmap
   216 find_lock_page                             0.5192
   196 find_next_bit                              0.3603
   136 sn2_send_IPI                               0.2024
   130 pio_phys_write_mmr                         2.0312
   118 unmap_kernel_range                         0.1229

After:
 78406 total                                      0.0081
 40053 default_idle                              89.4040
 33576 ia64_spinlock_contention                 349.7500
  1650 _spin_lock                                17.1875
   319 __reg_op                                   0.5538
   281 _atomic_dec_and_lock                       1.0977
   153 mutex_unlock                               1.5938
   123 iget_locked                                0.1671
   117 xfs_dir_lookup                             0.1662
   117 dput                                       0.1406
   114 xfs_iget_core                              0.0268
    92 xfs_da_hashname                            0.1917
    75 d_alloc                                    0.0670
    68 vmap_page_range                            0.0462 <- vmap
    58 kmem_cache_alloc                           0.0604
    57 memset                                     0.0540
    52 rb_next                                    0.1625
    50 __copy_user                                0.0208
    49 bitmap_find_free_region                    0.2188 <- vmap
    46 ia64_sn_udelay                             0.1106
    45 find_inode_fast                            0.1406
    42 memcmp                                     0.2188
    42 finish_task_switch                         0.1094
    42 __d_lookup                                 0.0410
    40 radix_tree_lookup_slot                     0.1250
    37 _spin_unlock_irqrestore                    0.3854
    36 xfs_bmapi                                  0.0050
    36 kmem_cache_free                            0.0256
    35 xfs_vn_getattr                             0.0322
    34 radix_tree_lookup                          0.1062
    33 __link_path_walk                           0.0035
    31 xfs_da_do_buf                              0.0091
    30 _xfs_buf_find                              0.0204
    28 find_get_page                              0.0875
    27 xfs_iread                                  0.0241
    27 __strncpy_from_user                        0.2812
    26 _xfs_buf_initialize                        0.0406
    24 _xfs_buf_lookup_pages                      0.0179
    24 vunmap_page_range                          0.0250 <- vunmap
    23 find_lock_page                             0.0799
    22 vm_map_ram                                 0.0087 <- vmap
    20 kfree                                      0.0125
    19 put_page                                   0.0330
    18 __kmalloc                                  0.0176
    17 xfs_da_node_lookup_int                     0.0086
    17 _read_lock                                 0.0885
    17 page_waitqueue                             0.0664

vmap has gone from being the top 5 on the profiles and flushing the crap
out of all TLBs, to using less than 1% of kernel time.

[akpm@linux-foundation.org: cleanups, section fix]
[akpm@linux-foundation.org: fix build on alpha]
Signed-off-by: NNick Piggin <npiggin@suse.de>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Cc: Krzysztof Helt <krzysztof.h1@poczta.fm>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

db64fe02

09 10月, 2008 1 次提交

xen: use spin_lock_nest_lock when pinning a pagetable · eefb47f6

由 Jeremy Fitzhardinge 提交于 10月 08, 2008

When pinning/unpinning a pagetable with split pte locks, we can end up
holding multiple pte locks at once (we need to hold the locks while
there's a pending batched hypercall affecting the pte page).  Because
all the pte locks are in the same lock class, lockdep thinks that
we're potentially taking a lock recursively.

This warning is spurious because we always take the pte locks while
holding mm->page_table_lock.  lockdep now has spin_lock_nest_lock to
express this kind of dominant lock use, so use it here so that lockdep
knows what's going on.
Signed-off-by: NJeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: NIngo Molnar <mingo@elte.hu>

eefb47f6

10 9月, 2008 1 次提交

mm: define USE_SPLIT_PTLOCKS rather than repeating expression · f7d0b926

由 Jeremy Fitzhardinge 提交于 9月 09, 2008

Define USE_SPLIT_PTLOCKS as a constant expression rather than repeating
"NR_CPUS >= CONFIG_SPLIT_PTLOCK_CPUS" all over the place.
Signed-off-by: NJeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Acked-by: NHugh Dickins <hugh@veritas.com>
Signed-off-by: NIngo Molnar <mingo@elte.hu>

f7d0b926

21 8月, 2008 1 次提交

xen: add debugfs support · 994025ca

由 Jeremy Fitzhardinge 提交于 8月 20, 2008

Add support for exporting statistics on mmu updates, multicall
batching and pv spinlocks into debugfs. The base path is xen/ and
each subsystem adds its own directory: mmu, multicalls, spinlocks.

In each directory, writing 1 to "zero_stats" will cause the
corresponding stats to be zeroed the next time they're updated.
Signed-off-by: NJeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Acked-by: NJan Beulich <jbeulich@novell.com>
Signed-off-by: NIngo Molnar <mingo@elte.hu>

994025ca

openeuler / Kernel 大约 1 年 前同步成功

openeuler / Kernel
大约 1 年前同步成功