Commits · 70e4a369733a21e3d16b059a6ccdad22a344bf57 · openanolis / cloud-kernel

14 Feb, 2011 3 commits

x86: Scale up the number of TLB invalidate vectors with NR_CPUs, up to 32 · 70e4a369

Authored by Shaohua Li on Jan 17, 2011

Make the maximum number of TLB invalidate vectors depend linearly on
NR_CPUS, with a maximum of 32 vectors.

We currently only have 8 vectors for TLB invalidation and that is clearly
inadequate. If we have a lot of CPUs, the CPUs need to share the 8 vectors
and tlbstate_lock is used to protect them. flush_tlb_page() is
heavily used in page reclaim, which will cause a lot of lock
contention for tlbstate_lock.

Andi Kleen suggested increasing the number of vectors to 32, which should be
good for current typical systems to reduce the tlbstate_lock contention.

My test system has 4 sockets, 64 CPUs and 64G of memory. My
workload creates 64 processes. Each process mmaps and reads a big
empty sparse file. The total size of the files is 2*total_mem,
so this will cause a lot of page reclaim.

Below is the result I get from perf call-graph profiling:

 without the patch:
 ------------------

    24.25%           usemem  [kernel]                                   [k] _raw_spin_lock
                     |
                     --- _raw_spin_lock
                        |
                        |--42.15%-- native_flush_tlb_others

 with the patch:
 ------------------

    14.96%           usemem  [kernel]                                   [k] _raw_spin_lock
                     |
                     --- _raw_spin_lock
                        |--13.89%-- native_flush_tlb_others

So this heavily reduces the tlbstate_lock contention.
Suggested-by: Andi Kleen <andi@firstfloor.org>
Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1295232727.1949.709.camel@sli10-conroe>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
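
As a rough illustration of the scaling rule described above, the sketch below caps the vector count at 32 and maps each sending CPU to a vector. The helper names and the modulo mapping are assumptions made for this example; they are not taken from the patch itself.

```c
#include <assert.h>

/* Cap taken from the commit message: at most 32 invalidate vectors. */
#define MAX_INVALIDATE_TLB_VECTORS 32u

/* Hypothetical helper: number of vectors for a given CPU count. */
static unsigned int num_invalidate_tlb_vectors(unsigned int nr_cpus)
{
	assert(nr_cpus > 0);
	return nr_cpus < MAX_INVALIDATE_TLB_VECTORS ? nr_cpus
						     : MAX_INVALIDATE_TLB_VECTORS;
}

/*
 * Hypothetical helper: the vector (and therefore the lock) a sending
 * CPU would use if senders were spread round-robin over the vectors.
 */
static unsigned int sender_vector(unsigned int cpu, unsigned int nr_cpus)
{
	return cpu % num_invalidate_tlb_vectors(nr_cpus);
}
```

Under these assumptions, 64 CPUs sharing 8 vectors put 8 senders behind each lock, while 32 vectors leave 2 senders per lock.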


x86: Allocate 32 tlb_invalidate_interrupt handler stubs · 3a09fb45

Authored by Shaohua Li on Jan 17, 2011

Add up to 32 invalidate_interrupt handlers. How many handlers are
added depends on NUM_INVALIDATE_TLB_VECTORS. So if
NUM_INVALIDATE_TLB_VECTORS is smaller than 32, we reduce code
size.
Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
LKML-Reference: <1295232725.1949.708.camel@sli10-conroe>
Signed-off-by: Ingo Molnar <mingo@elte.hu>


x86: Cleanup vector usage · 60f6e65d

Authored by Shaohua Li on Jan 17, 2011

Clean up the vector usage and make the vectors contiguous where possible.
Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
LKML-Reference: <1295232722.1949.707.camel@sli10-conroe>
Signed-off-by: Ingo Molnar <mingo@elte.hu>


10 Feb, 2011 1 commit

KVM: SVM: Make sure KERNEL_GS_BASE is valid when loading gs_index · 893a5ab6

Authored by Joerg Roedel on Jan 14, 2011

The gs_index loading code uses the swapgs instruction to
switch to the user gs_base temporarily. This is unsafe in a
lightweight exit-path in KVM on AMD because the
KERNEL_GS_BASE MSR is switched lazily. An NMI happening in
the critical path of load_gs_index may then use the wrong GS_BASE
value, leading to unpredictable behavior, e.g. a
triple-fault.

This patch fixes the issue by making sure that load_gs_index
is called only with a valid KERNEL_GS_BASE value loaded in
KVM.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>


07 Feb, 2011 1 commit

x86, nx: Mark the ACPI resume trampoline code as +x · d344e38b

Authored by H. Peter Anvin on Feb 06, 2011

We reserve lowmem for the things that need it, like the ACPI
wakeup code, way early to guarantee availability.  This happens
before we set up the proper pagetables, so set_memory_x() has no
effect.

Until we have a better solution, use an initcall to mark the
wakeup code executable.
Originally-by: Matthieu Castet <castet.matthieu@free.fr>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Cc: Matthias Hopf <mhopf@suse.de>
Cc: rjw@sisk.pl
Cc: Suresh Siddha <suresh.b.siddha@intel.com>
LKML-Reference: <4D4F8019.2090104@zytor.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>


05 Feb, 2011 1 commit

x86-32: Make sure the stack is set up before we use it · 11d4c3f9

Authored by H. Peter Anvin on Feb 04, 2011

Since checkin ebba638a we call
verify_cpu even in 32-bit mode.  Unfortunately, calling a function
means using the stack, and the stack pointer was not initialized in
the 32-bit setup code!  This code initializes the stack pointer, and
simplifies the interface slightly since it is easier to rely on just a
pointer value rather than a descriptor; we need to have different
values for the segment register anyway.

This retains start_stack as a virtual address, even though a physical
address would be more convenient for 32 bits; the 64-bit code wants
the other way around...
Reported-by: Matthieu Castet <castet.matthieu@free.fr>
LKML-Reference: <4D41E86D.8060205@free.fr>
Tested-by: Kees Cook <kees.cook@canonical.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>


04 Feb, 2011 1 commit

x86, mm: avoid possible bogus tlb entries by clearing prev mm_cpumask after switching mm · 831d52bc

Authored by Suresh Siddha on Feb 03, 2011

Clearing the cpu in prev's mm_cpumask early means the CPU no longer
receives flush TLB IPIs while cr3 is still pointing to the prev mm.  This
window can lead to bogus TLB fills resulting in strange
failures.  One such problematic scenario is described below.

 T1. CPU-1 is context switching from mm1 to mm2 context and gets an NMI
     etc between the point of clearing the cpu from the mm_cpumask(mm1)
     and before reloading the cr3 with the new mm2.

 T2. CPU-2 is tearing down a specific vma for mm1 and will proceed with
     flushing the TLB for mm1.  It doesn't send the flush TLB to CPU-1
     as it doesn't see that cpu listed in the mm_cpumask(mm1).

 T3. After the TLB flush is complete, CPU-2 goes ahead and frees the
     page-table pages associated with the removed vma mapping.

 T4. CPU-2 now allocates those freed page-table pages for something
     else.

 T5. As the CR3 and TLB caches for mm1 are still active on CPU-1, CPU-1
     can potentially speculate and walk through the page-table caches
     and can insert new TLB entries.  As the page-table pages are
     already freed and being used on CPU-2, this page walk can
     potentially insert a bogus global TLB entry depending on the
     (random) contents of the page that is being used on CPU-2.

 T6. This bogus TLB entry being global will be active across future CR3
     changes and can result in weird memory corruption etc.

To avoid this issue, for the prev mm that is handing over the cpu to
another mm, clear the cpu from the mm_cpumask(prev) after the cr3 is
changed.

Marking it for -stable, though we haven't seen any reported failure that
can be attributed to this.
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: stable@kernel.org	[v2.6.32+]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
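
The ordering that the fix relies on can be sketched with a toy single-CPU model. Everything here (struct toy_mm, toy_cr3, the helper names) is hypothetical and only mirrors the idea in the commit message: the CPU stays in prev's mask until after the page-table switch, so a flush for prev can still reach it.

```c
#include <stdbool.h>

/* Toy model: one bit per CPU, not the kernel's struct mm_struct. */
struct toy_mm {
	unsigned long cpumask;
};

/* The mm whose page tables the modelled CPU is currently using. */
static struct toy_mm *toy_cr3;

/* True if a TLB flush for 'mm' would be sent to 'cpu'. */
static bool toy_flush_reaches(const struct toy_mm *mm, unsigned int cpu)
{
	return (mm->cpumask >> cpu) & 1UL;
}

/* Invariant we want: the mm behind toy_cr3 always lists this CPU. */
static bool toy_cpu_is_covered(unsigned int cpu)
{
	return toy_flush_reaches(toy_cr3, cpu);
}

static void toy_switch_mm(struct toy_mm *prev, struct toy_mm *next,
			  unsigned int cpu)
{
	next->cpumask |= 1UL << cpu;	/* start receiving flushes for next */
	toy_cr3 = next;			/* stands in for reloading CR3 */
	prev->cpumask &= ~(1UL << cpu);	/* only now stop flushes for prev */
}
```

With the clear placed last, there is no point in toy_switch_mm() at which toy_cr3 refers to an mm that no longer lists the CPU.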


03 Feb, 2011 2 commits

x86, mtrr: Avoid MTRR reprogramming on BP during boot on UP platforms · f7448548

Authored by Suresh Siddha on Feb 02, 2011

Markus Kohn ran into a hard hang regression on an Acer Aspire
1310 when ACPI is enabled. git bisect showed the following
commit as the bad one that introduced the boot regression.

	commit d0af9eed
	Author: Suresh Siddha <suresh.b.siddha@intel.com>
	Date:   Wed Aug 19 18:05:36 2009 -0700

	    x86, pat/mtrr: Rendezvous all the cpus for MTRR/PAT init

Because of the UP configuration of that platform,
native_smp_prepare_cpus() bailed out (in smp_sanity_check())
before doing the set_mtrr_aps_delayed_init().

Further down the boot path, native_smp_cpus_done() will call the
delayed MTRR initialization for the AP's (mtrr_aps_init()) with
mtrr_aps_delayed_init not set. This resulted in the boot
processor reprogramming its MTRR's to the values seen during the
start of the OS boot. While this is not needed ideally, this
shouldn't have caused any side-effects. This is because the
reprogramming of MTRR's (set_mtrr_state() that gets called via
set_mtrr()) will check if the live register contents are
different from what is being asked to write and will do the actual
write only if they are different.

BP's mtrr state is read during the start of the OS boot and
typically nothing would have changed when we ask to reprogram it
on BP again because of the above scenario on a UP platform. So
on a normal UP platform no reprogramming of BP MTRR MSR's
happens and all is well.

However, on this platform, the BIOS seems to be modifying the fixed
mtrr range registers between the start of OS boot and when we
double check the live registers for reprogramming BP MTRR
registers. And as the live registers are modified, we end up
reprogramming the MTRR's to the state seen during the start of
the OS boot.

During ACPI initialization, something in the BIOS (probably the SMI
handler?) doesn't like this and the result is a hard lockup.

We didn't see this boot hang issue on this platform before the
commit d0af9eed, because only
the AP's (if any) programmed their MTRR's to the value that BP
had at the start of the OS boot.

Fix this issue by checking mtrr_aps_delayed_init before
continuing further in mtrr_aps_init(). Now, only AP's (if
any) will program their MTRR's to the BP values during boot.

Addresses https://bugzilla.novell.com/show_bug.cgi?id=623393

  [ By the way, this behavior of the BIOS modifying MTRR's after the start
    of the OS boot is not common and the kernel is not prepared to
    handle this situation well. Irrespective of this issue, during
    suspend/resume, the Linux kernel will try to reprogram the BP's MTRR values
    to the values seen during the start of the OS boot. So suspend/resume might
    be already broken on this platform for all Linux kernel versions. ]
Reported-and-bisected-by: Markus Kohn <jabber@gmx.org>
Tested-by: Markus Kohn <jabber@gmx.org>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: Thomas Renninger <trenn@novell.com>
Cc: Rafael Wysocki <rjw@novell.com>
Cc: Venkatesh Pallipadi <venki@google.com>
Cc: stable@kernel.org # [v2.6.32+]
LKML-Reference: <1296694975.4418.402.camel@sbsiddha-MOBL3.sc.intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
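
The guard described in the fix can be sketched as follows. Only the flag name mtrr_aps_delayed_init comes from the commit message; the toy_ functions and the counter are hypothetical stand-ins used to make the control flow visible, and the real function bodies are assumed to differ.

```c
#include <stdbool.h>

static bool mtrr_aps_delayed_init;	/* flag name from the commit message */
static int toy_reprogram_count;		/* hypothetical: counts reprogramming runs */

/* Stand-in for the step that a UP boot never reaches. */
static void toy_set_mtrr_aps_delayed_init(void)
{
	mtrr_aps_delayed_init = true;
}

/* Stand-in for mtrr_aps_init() with the early return added. */
static void toy_mtrr_aps_init(void)
{
	if (!mtrr_aps_delayed_init)
		return;			/* nothing was deferred: leave the MTRRs alone */

	toy_reprogram_count++;		/* stands in for the actual reprogramming */
	mtrr_aps_delayed_init = false;
}
```

On the UP path the flag is never set, so the call becomes a no-op and the boot processor's MTRRs are left untouched.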


x86, nx: Don't force pages RW when setting NX bits · f12d3d04

Authored by Matthieu CASTET on Jan 20, 2011

Xen wants page table pages to be read-only.

But the initial page tables (from head_*.S) live in .data or .bss.

That was broken by 64edc8ed.  There is
absolutely no reason to force these pages RW after they have already
been marked RO.
Signed-off-by: Matthieu CASTET <castet.matthieu@free.fr>
Tested-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>


28 Jan, 2011 2 commits

perf: Fix Pentium4 raw event validation · d038b12c

Authored by Stephane Eranian on Jan 25, 2011

This patch fixes some issues with raw event validation on
Pentium 4 (Netburst) based processors.

As I was testing libpfm4 Netburst support, I ran into two
problems in the p4_validate_raw_event() function:

   - the shared field must be checked ONLY when HT is on
   - the binding to ESCR register was missing

The second item was causing raw events to not be encoded
correctly compared to generic PMU events.

With this patch, I can now pass Netburst events to libpfm4
examples and get meaningful results:

  $ task -e global_power_events:running:u  noploop 1
  noploop for 1 seconds
  3,206,304,898 global_power_events:running
Signed-off-by: Stephane Eranian <eranian@google.com>
Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: peterz@infradead.org
Cc: paulus@samba.org
Cc: davem@davemloft.net
Cc: fweisbec@gmail.com
Cc: perfmon2-devel@lists.sf.net
Cc: eranian@gmail.com
Cc: robert.richter@amd.com
Cc: acme@redhat.com
Cc: gorcunov@gmail.com
Cc: ming.m.lin@intel.com
LKML-Reference: <4d3efb2f.1252d80a.1a80.ffffc83f@mx.google.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>


xen/setup: Route halt operations to safe_halt pvop. · 23febedd

Authored by Stefano Stabellini on Jan 26, 2011

With this patch, the cpuidle driver does not load and
does not issue the mwait operations. Instead the hypervisor
is doing them (because we call the safe_halt pvop).

This fixes quite a lot of bootup issues in which the user had
to force interrupts for the bootup to continue.

Details are discussed in:

http://lists.xensource.com/archives/html/xen-devel/2011-01/msg00535.html

[v2: Wrote the commit description]
Reported-by: Daniel De Graaf <dgdegra@tycho.nsa.gov>
Tested-by: Daniel De Graaf <dgdegra@tycho.nsa.gov>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>


27 Jan, 2011 2 commits

xen/e820: Guard against E820_RAM not having page-aligned size or start. · 7cb31b75

Authored by Stefano Stabellini on Jan 27, 2011

On the Dell Inspiron 1525 and Intel SandyBridge SDPs, the
BIOS e820 RAM is not page-aligned:

[   0.000000]  Xen: 0000000000100000 - 00000000df66d800 (usable)

We were not handling that and ended up setting up a pagetable
that included up to df66e000 with the disastrous effect that when

        memset(NODE_DATA(nodeid), 0, sizeof(pg_data_t));

tried to clear the page it would crash at the 2K mark.

Initially reported by Michael Young @
http://lists.xensource.com/archives/html/xen-devel/2011-01/msg00108.html

The fix is to page-align the size and also take into consideration
the start of the E820 (in case that is not page-aligned either). This
fixes the bootup failure on those affected machines.

This patch is a rework of Michael A Young's initial patch and
also handles the case where the start is not page-aligned.
Reported-by: Michael A Young <m.a.young@durham.ac.uk>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Michael A Young <m.a.young@durham.ac.uk>
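
The alignment arithmetic can be worked through with the range quoted in the commit message. This is a minimal sketch assuming 4096-byte pages and a policy of rounding the start up and the end down; the toy_ names are hypothetical and the patch may express the same idea differently.

```c
#include <stdint.h>

#define TOY_PAGE_SIZE 4096ULL	/* assumption: 4 KiB pages */

struct toy_range {
	uint64_t start;
	uint64_t size;
};

static uint64_t toy_page_align_up(uint64_t addr)
{
	return (addr + TOY_PAGE_SIZE - 1) & ~(TOY_PAGE_SIZE - 1);
}

static uint64_t toy_page_align_down(uint64_t addr)
{
	return addr & ~(TOY_PAGE_SIZE - 1);
}

/* Keep only the whole pages that lie inside [start, end). */
static struct toy_range toy_trim_ram(uint64_t start, uint64_t end)
{
	struct toy_range r;
	uint64_t s = toy_page_align_up(start);
	uint64_t e = toy_page_align_down(end);

	r.start = s;
	r.size = e > s ? e - s : 0;
	return r;
}
```

For the range quoted above, the end 0xdf66d800 rounds down to 0xdf66d000, so the partial page that only extends to the 2K mark is left out of the mapping.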


xen/p2m: Mark INVALID_P2M_ENTRY the mfn_list past max_pfn. · cf04d120

Authored by Stefan Bader on Jan 27, 2011

In case the mfn_list does not have enough entries to fill
a p2m page we do not want the entries from max_pfn up to
the boundary to be filled with unknown values. Hence
set them to INVALID_P2M_ENTRY.
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>


26 Jan, 2011 3 commits

percpu, x86: Fix percpu_xchg_op() · 889a7a6a

Authored by Eric Dumazet on Jan 25, 2011

These recent percpu commits:

  2485b646: x86,percpu: Move out of place 64 bit ops into X86_64 section
  8270137a: cpuops: Use cmpxchg for xchg to avoid lock semantics

Caused this 'perf top' crash:

 Kernel panic - not syncing: Fatal exception in interrupt
 Pid: 0, comm: swapper Tainted: G     D
 2.6.38-rc2-00181-gef71723 #413 Call Trace: <IRQ> [<ffffffff810465b5>]
    ? panic
    ? kmsg_dump
    ? kmsg_dump
    ? oops_end
    ? no_context
    ? __bad_area_nosemaphore
    ? perf_output_begin
    ? bad_area_nosemaphore
    ? do_page_fault
    ? __task_pid_nr_ns
    ? perf_event_tid
    ? __perf_event_header__init_id
    ? validate_chain
    ? perf_output_sample
    ? trace_hardirqs_off
    ? page_fault
    ? irq_work_run
    ? update_process_times
    ? tick_sched_timer
    ? tick_sched_timer
    ? __run_hrtimer
    ? hrtimer_interrupt
    ? account_system_vtime
    ? smp_apic_timer_interrupt
    ? apic_timer_interrupt
 ...

Looking at assembly code, I found:

list = this_cpu_xchg(irq_work_list, NULL);

gives this wrong code : (gcc-4.1.2 cross compiler)

ffffffff810bc45e:
	mov    %gs:0xead0,%rax
	cmpxchg %rax,%gs:0xead0
	jne    ffffffff810bc45e <irq_work_run+0x3e>
	test   %rax,%rax
	je     ffffffff810bc4aa <irq_work_run+0x8a>

Tell gcc we dirty eax/rax register in percpu_xchg_op()

Compiler must use another register to store pxo_new__

We also don't need to reload the percpu value after a jump,
since a 'failed' cmpxchg already updated eax/rax

Wrong generated code was :
	xor     %rax,%rax   /* load 0 into %rax */
1:	mov     %gs:0xead0,%rax
	cmpxchg %rax,%gs:0xead0
	jne     1b
	test    %rax,%rax

After patch :

	xor     %rdx,%rdx   /* load 0 into %rdx */
	mov     %gs:0xead0,%rax
1:	cmpxchg %rdx,%gs:0xead0
	jne     1b
	test    %rax,%rax
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Tejun Heo <tj@kernel.org>
LKML-Reference: <1295973114.3588.312.camel@edumazet-laptop>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
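
The same exchange-by-compare-and-swap pattern can be written portably with C11 atomics. This is a user-space sketch, not the kernel's percpu_xchg_op(); toy_xchg is a hypothetical name. It keeps the expected value and the new value in separate variables, which is the property the miscompiled code lost, and it loads the old value only once because a failed compare-exchange writes the current value back into 'old'.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Atomically store new_value into *slot and return the previous value. */
static uintptr_t toy_xchg(_Atomic uintptr_t *slot, uintptr_t new_value)
{
	uintptr_t old = atomic_load(slot);	/* loaded once, before the loop */

	/*
	 * On failure the compare-exchange updates 'old' with the value
	 * currently in *slot, so the loop does not need to reload it.
	 */
	while (!atomic_compare_exchange_weak(slot, &old, new_value))
		;

	return old;
}
```

If 'old' and 'new_value' shared one register, as in the wrong code shown above, every successful compare-exchange would simply store the old value back and the slot would never be cleared.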


x86: Remove left over system_64.h · 9a57c3e4

Authored by Yinghai Lu on Jan 24, 2011

Left-over from the x86 merge ...
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
LKML-Reference: <4D3E23D1.7010405@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>


thp: fix PARAVIRT x86 32bit noPAE · cacf061c

Authored by Andrea Arcangeli on Jan 25, 2011

This fixes TRANSPARENT_HUGEPAGE=y with PARAVIRT=y and HIGHMEM64=n.

The #ifdef that this patch removes was erroneously introduced to fix a
build error for noPAE (where pmd.pmd doesn't exist).  So then the kernel
built but it failed at runtime because set_pmd_at was a noop.  This patch
corrects that by enabling set_pmd_at for noPAE mode too.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reported-by: werner <w.landgraf@ru.ru>
Reported-by: Minchan Kim <minchan.kim@gmail.com>
Tested-by: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


25 Jan, 2011 1 commit

x86-64: Don't use pointer to out-of-scope variable in dump_trace() · 2e5aa682

Authored by Jesper Juhl on Jan 24, 2011

In arch/x86/kernel/dumpstack_64.c::dump_trace() we have this code:

...
  		if (!stack) {
  			unsigned long dummy;
  			stack = &dummy;
  			if (task && task != current)
  				stack = (unsigned long *)task->thread.sp;
  		}

  		bp = stack_frame(task, regs);
  		/*
  		 * Print function call entries in all stacks, starting at the
  		 * current stack address. If the stacks consist of nested
  		 * exceptions
  		 */
  		tinfo = task_thread_info(task);

  		for (;;) {
  			char *id;
  			unsigned long *estack_end;
  			estack_end = in_exception_stack(cpu, (unsigned long)stack,
  							&used, &id);
...

You'll notice that we assign to 'stack' the address of the variable
'dummy' which is only in-scope inside the 'if (!stack)'. So when we later
access stack (at the end of the above, and assuming we did not take the
'if (task && task != current)' branch) we'll be using the address of a
variable that is no longer in scope. I believe this patch is the proper
fix, but I freely admit that I'm not 100% certain.
Signed-off-by: Jesper Juhl <jj@chaosbits.net>
LKML-Reference: <alpine.LNX.2.00.1101242232590.10252@swampdragon.chaosbits.net>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
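
The scope problem and one way to avoid it can be shown in a few lines. The function below is a hypothetical stand-in for dump_trace(), not the patch itself: it declares the fallback variable at function scope, so a pointer to it stays valid for as long as 'stack' is used.

```c
#include <stddef.h>

/* Return the first word of 'stack', or of a local fallback if it is NULL. */
static unsigned long toy_first_word(const unsigned long *stack)
{
	/*
	 * Declared at function scope: if 'dummy' lived inside the
	 * if-block, using 'stack' after the block would be undefined.
	 */
	unsigned long dummy = 0;

	if (!stack)
		stack = &dummy;

	return *stack;
}
```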


23 Jan, 2011 1 commit

x86: Fix jump label with RO/NX module protection crash · 89696913

Authored by matthieu castet on Jan 23, 2011

If a jump table is used in module init code, its entries are marked
as removed in the __jump_table section after init is done.

But we have already applied RO permissions to the module, so
we can't modify a read-only section (crash in
remove_jump_label_module_init).

Make the __jump_table section RW.
Signed-off-by: Matthieu CASTET <castet.matthieu@free.fr>
Cc: Xiaotian Feng <xtfeng@gmail.com>
Cc: Jason Baron <jbaron@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Siarhei Liakh <sliakh.lkml@gmail.com>
Cc: Xuxian Jiang <jiang@cs.ncsu.edu>
Cc: James Morris <jmorris@namei.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Dave Jones <davej@redhat.com>
Cc: Kees Cook <kees.cook@canonical.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
LKML-Reference: <4D3C3F20.7030203@free.fr>
Signed-off-by: Ingo Molnar <mingo@elte.hu>


22 Jan, 2011 2 commits

x86, hotplug: Fix powersavings with offlined cores on AMD · 93789b32

Authored by Borislav Petkov on Jan 20, 2011

ea530692 made a CPU use monitor/mwait
when offline. This is not the optimal choice for AMD wrt power savings
and we'd prefer our cores to halt (i.e. enter C1) instead. For this, the
same selection of whether to use monitor/mwait has to be made as when we
select the idle routine for the machine.

With this patch, offlining cores 1-5 on a X6 machine allows core0 to
boost again.

[ hpa: putting this in urgent since it is a (power) regression fix ]
Reported-by: Andreas Herrmann <andreas.herrmann3@amd.com>
Cc: stable@kernel.org # 37.x
Cc: H. Peter Anvin <hpa@linux.intel.com>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Len Brown <lenb@kernel.org>
Cc: Venkatesh Pallipadi <venki@google.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
LKML-Reference: <1295534572-10730-1-git-send-email-bp@amd64.org>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>


xen: p2m: correctly initialize partial p2m leaf · 8e1b4cf2

Authored by Stefan Bader on Jan 20, 2011

After changing the p2m mapping to a tree by

  commit 58e05027
    xen: convert p2m to a 3 level tree

and trying to boot a DomU with 615MB of memory, the following crash was
observed in the dump:

kernel direct mapping tables up to 26f00000 @ 1ec4000-1fff000
BUG: unable to handle kernel NULL pointer dereference at (null)
IP: [<c0107397>] xen_set_pte+0x27/0x60
*pdpt = 0000000000000000 *pde = 0000000000000000

Adding further debug statements showed that when trying to set up
pfn=0x26700 the returned mapping was invalid.

pfn=0x266ff calling set_pte(0xc1fe77f8, 0x6b3003)
pfn=0x26700 calling set_pte(0xc1fe7800, 0x3)

Although the last_pfn obtained from the startup info is 0x26700, which
should in turn not be hit, the additional 8MB which are added as extra
memory normally seem to be ok. This led to looking into the initial
p2m tree construction, which uses the smaller value and assumes that
there is other code handling the extra memory.

When the p2m tree is set up, the leaves are directly pointed to the
array which the domain builder set up. But if the mapping is not on a
boundary that fits into one p2m page, this will result in the last leaf
being only partially valid. And as the invalid entries are not
initialized in that case, things go badly wrong.

I am trying to fix that by checking whether the current leaf is a
complete map and if not, allocating a completely new page and copying only
the valid pointers there. This may not be the most efficient or elegant
solution, but at least it seems to allow me to boot DomUs with memory
assignments all over the range.

BugLink: http://bugs.launchpad.net/bugs/686692
[v2: Redid a bit of commit wording and fixed a compile warning]
Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
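
The approach described above (build a fresh, fully initialized leaf instead of pointing at a partially valid one) can be sketched like this. The toy_ names, the 4096-byte page size and the all-ones invalid marker are assumptions for this example; only the idea of copying the valid entries and invalidating the rest comes from the commit message.

```c
#include <stddef.h>
#include <string.h>

#define TOY_PAGE_SIZE		4096	/* assumption: 4 KiB pages */
#define TOY_P2M_PER_PAGE	(TOY_PAGE_SIZE / sizeof(unsigned long))
#define TOY_INVALID_P2M_ENTRY	(~0UL)	/* assumption: all-ones marker */

/*
 * Fill one leaf page: copy the 'valid' entries supplied by the domain
 * builder, then mark every remaining slot as invalid so that no
 * uninitialized value can be mistaken for a real mapping.
 */
static void toy_build_leaf(unsigned long *leaf,
			   const unsigned long *mfn_list, size_t valid)
{
	if (valid > TOY_P2M_PER_PAGE)
		valid = TOY_P2M_PER_PAGE;

	memcpy(leaf, mfn_list, valid * sizeof(*leaf));

	for (size_t i = valid; i < TOY_P2M_PER_PAGE; i++)
		leaf[i] = TOY_INVALID_P2M_ENTRY;
}
```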


21 Jan, 2011 4 commits

x86, mcheck, therm_throt.c: Export symbol platform_thermal_notify to allow coretemp to handler intr · f21bbec9

Authored by Fenghua Yu on Jan 20, 2011

In therm_throt.c, commit
9e76a97e doesn't export
the symbol platform_thermal_notify.

Other drivers (e.g. drivers/hwmon/coretemp.c) cannot find the
symbol platform_thermal_notify when defining a threshold
interrupt handler.

Please apply this patch to allow a threshold interrupt handler in
coretemp.
Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Cc: R Durgadoss <durgadoss.r@intel.com>
Cc: khali@linux-fr.org <khali@linux-fr.org>
Cc: lm-sensors@lm-sensors.org <lm-sensors@lm-sensors.org>
Cc: Guenter Roeck <guenter.roeck@ericsson.com>
LKML-Reference: <20110121041239.GB26954@linux-os.sc.intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>


x86: Use asm-generic/cacheflush.h · cc67ba63

Authored by Akinobu Mita on Jan 20, 2011

The implementation of the cache flushing interfaces on the x86
is identical with the default implementation in asm-generic.
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: arnd@arndb.de
LKML-Reference: <1295523136-4277-2-git-send-email-akinobu.mita@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>


kconfig: rename CONFIG_EMBEDDED to CONFIG_EXPERT · 6a108a14

Authored by David Rientjes on Jan 20, 2011

The meaning of CONFIG_EMBEDDED has long since been obsoleted; the option
is used to configure any non-standard kernel with a much larger scope than
only small devices.

This patch renames the option to CONFIG_EXPERT in init/Kconfig and fixes
references to the option throughout the kernel.  A new CONFIG_EMBEDDED
option is added that automatically selects CONFIG_EXPERT when enabled and
can be used in the future to isolate options that should only be
considered for embedded systems (RISC architectures, SLOB, etc).

Calling the option "EXPERT" more accurately represents its intention: only
expert users who understand the impact of the configuration changes they
are making should enable it.
Reviewed-by: Ingo Molnar <mingo@elte.hu>
Acked-by: David Woodhouse <david.woodhouse@intel.com>
Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Greg KH <gregkh@suse.de>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Robin Holt <holt@sgi.com>
Cc: <linux-arch@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


xen: fix non-ANSI function warning in irq.c · 7d81c3b9

Authored by Randy Dunlap on Jan 08, 2011

Fix sparse warning for non-ANSI function declaration:

arch/x86/xen/irq.c:129:30: warning: non-ANSI function declaration of function 'xen_init_irq_ops'
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Cc: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
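
For readers unfamiliar with this class of warning, the sketch below shows the usual shape of the problem and of the fix. It assumes the warning was triggered by an empty parameter list; toy_init_ops is a hypothetical function, not the one named in the warning.

```c
/*
 * Old style: before C23, empty parentheses leave the parameters
 * unspecified rather than declaring "no parameters", which is what
 * sparse reports as a non-ANSI declaration:
 *
 *	static int toy_init_ops() { return 0; }
 */

/* ANSI prototype: '(void)' states that the function takes no arguments. */
static int toy_init_ops(void)
{
	return 0;
}
```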


20 Jan, 2011 5 commits

lockdep: Move early boot local IRQ enable/disable status to init/main.c · 2ce802f6

Authored by Tejun Heo on Jan 20, 2011

During early boot, local IRQ is disabled until IRQ subsystem is
properly initialized.  During this time, no one should enable
local IRQ and some operations which usually are not allowed with
IRQ disabled, e.g. operations which might sleep or require
communications with other processors, are allowed.

lockdep tracked this with early_boot_irqs_off/on() callbacks.
As other subsystems need this information too, move it to
init/main.c and make it generally available.  While at it,
toggle the boolean to early_boot_irqs_disabled instead of
enabled so that it can be initialized with %false and %true
indicates the exceptional condition.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Pekka Enberg <penberg@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
LKML-Reference: <20110120110635.GB6036@htj.dyndns.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>


x86: Update CPU cache attributes table descriptors · fb87ec38

Authored by Dave Jones on Jan 19, 2011

Update to latest definitions in:

   http://www.intel.com/Assets/PDF/appnote/241618.pdf

[ Note, this update of the doc has removed some old values which
  we have listed.  I think until we have clarification that they
  were never used in production, they should be left there. ]
Signed-off-by: Dave Jones <davej@redhat.com>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
LKML-Reference: <20110120012055.GA15985@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>


LGUEST_GUEST: fix unmet direct dependencies (VIRTUALIZATION && VIRTIO) · 2b8216e6

Authored by Randy Dunlap on Jan 01, 2011

Honor the kconfig menu hierarchy to remove kconfig dependency warnings:
VIRTIO and VIRTIO_RING are subordinate to VIRTUALIZATION.

warning: (LGUEST_GUEST) selects VIRTIO which has unmet direct dependencies (VIRTUALIZATION)
warning: (LGUEST_GUEST && VIRTIO_PCI && VIRTIO_BALLOON) selects VIRTIO_RING which has unmet direct dependencies (VIRTUALIZATION && VIRTIO)
Reported-by: Toralf Förster <toralf.foerster@gmx.de>
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>


lguest: compile fixes · ced05dd7

Authored by Rusty Russell on Jan 20, 2011

arch/x86/lguest/boot.c: In function ‘lguest_init_IRQ’:
arch/x86/lguest/boot.c:824: error: macro "__this_cpu_write" requires 2 arguments, but only 1 given
arch/x86/lguest/boot.c:824: error: ‘__this_cpu_write’ undeclared (first use in this function)
arch/x86/lguest/boot.c:824: error: (Each undeclared identifier is reported only once
arch/x86/lguest/boot.c:824: error: for each function it appears in.)

drivers/lguest/x86/core.c: In function ‘copy_in_guest_info’:
drivers/lguest/x86/core.c:94: error: lvalue required as left operand of assignment
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>


lguest: Use this_cpu_ops · c9f29549

Authored by Christoph Lameter on Nov 30, 2010

Use this_cpu_ops in a couple of places in lguest.
Signed-off-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>


19 Jan, 2011 2 commits

x86: Unify "numa=" command line option handling · 90321602

Authored by Jan Beulich on Jan 19, 2011

In order to be able to suppress the use of SRAT tables that
32-bit Linux can't deal with (in one case known to lead to a
non-bootable system, unless disabling ACPI altogether), move the
"numa=" option handling to common code.
Signed-off-by: Jan Beulich <jbeulich@novell.com>
Reviewed-by: Thomas Renninger <trenn@suse.de>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Renninger <trenn@suse.de>
LKML-Reference: <4D36B581020000780002D0FF@vpn.id2.novell.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>


Revert "x86: Make relocatable kernel work with new binutils" · 6b35eb9d

Authored by Ingo Molnar on Jan 19, 2011

This reverts commit 86b1e8dd ("x86: Make relocatable kernel work with
new binutils").

Markus Trippelsdorf reported a boot failure caused by this patch.

The real fix for the problem the original patch addressed will likely
involve an arch-generic way to define overlaid jiffies_64 and jiffies
variables.

Until that's done and tested on all architectures revert this commit to
solve the regression.
Reported-and-bisected-by: Markus Trippelsdorf <markus@trippelsdorf.de>
Acked-by: "H. Peter Anvin" <hpa@zytor.com>
Cc: Shaohua Li <shaohua.li@intel.com>
Cc: "Lu, Hongjiu" <hongjiu.lu@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Sam Ravnborg <sam@ravnborg.org>
LKML-Reference: <4D36A759.60704@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>


18 Jan, 2011 2 commits

x86: Clear irqstack thread_info · 7b698ea3

Authored by Brian Gerst on Jan 17, 2011

Matthias Merz reported that v2.6.37 failed to boot on his
system.

Make sure that the thread_info part of the irqstack is
initialized to zeroes.
Reported-and-Tested-by: Matthias Merz <linux@merz-ka.de>
Signed-off-by: Brian Gerst <brgerst@gmail.com>
Acked-by: Pekka Enberg <penberg@kernel.org>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
LKML-Reference: <AANLkTimyKXfJ1x8tgwrr1hYnNLrPfgE1NTe4z7L6tUDm@mail.gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>


x86: Make relocatable kernel work with new binutils · 86b1e8dd

Authored by Shaohua Li on Jan 18, 2011

The CONFIG_RELOCATABLE=y option is broken with new binutils, which causes
a panic at boot.

According to Lu Hongjiu, the affected binutils are from 2.20.51.0.12 to
2.21.51.0.3, which have been released since Oct 22 this year. At least
Ubuntu 10.10 is using such binutils. See:

    http://sourceware.org/bugzilla/show_bug.cgi?id=12327

The reason for the boot panic is that we have 'jiffies = jiffies_64;' in
vmlinux.lds.S. The jiffies symbol isn't in any section. During the kernel
build, there is a warning saying jiffies is an absolute address and can't
be relocated. At runtime, jiffies will have virtual address 0.

Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Cc: Lu Hongjiu <hongjiu.lu@intel.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Sam Ravnborg <sam@ravnborg.org>
LKML-Reference: <1295312269.1949.725.camel@sli10-conroe>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

86b1e8dd

Jan 15, 2011, 7 commits

xen: export arbitrary_virt_to_machine · de23be5f

Committed by Stephen Rothwell on Jan 15, 2011

Fixes this build error:

 ERROR: "arbitrary_virt_to_machine" [drivers/xen/xen-gntdev.ko] undefined!
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

de23be5f

x86, olpc: Add missing Kconfig dependencies · 76d1f7bf

Committed by H. Peter Anvin on Jan 14, 2011

OLPC uses select for OLPC_OPENFIRMWARE, which means OLPC has to
enforce the dependencies for OLPC_OPENFIRMWARE.  Make sure it does so.
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Cc: Daniel Drake <dsd@laptop.org>
Cc: Andres Salomon <dilinger@queued.net>
Cc: Grant Likely <grant.likely@secretlab.ca>
LKML-Reference: <20100923162846.D8D409D401B@zog.reactivated.net>
Cc: <stable@kernel.org> 2.6.37

76d1f7bf

x86, mrst: Set correct APB timer IRQ affinity for secondary cpu · 6550904d

Committed by Jacob Pan on Jan 13, 2011

Offlining the secondary CPU causes the timer irq affinity to be set to
CPU 0. When the secondary CPU is back online again, the wrong irq
affinity will be used.

This patch ensures that the secondary per-CPU timer always has the correct
IRQ affinity when enabled.
Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
LKML-Reference: <1294963604-18111-1-git-send-email-jacob.jun.pan@linux.intel.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Cc: <stable@kernel.org> 2.6.37

6550904d

x86: tsc: Fix calibration refinement conditionals to avoid divide by zero · 62627bec

Committed by John Stultz on Jan 14, 2011

Konrad Wilk reported that the new delayed calibration crashes with a
divide by zero on Xen. The reason is that Xen sets the pmtimer
address, but reading from it returns 0xffffff. That results in the
ref_start and ref_stop value being the same, so the delta is zero
which causes the divide by zero later in the calculation.

The conditional (!hpet && !ref_start && !ref_stop) which sanity checks
the calibration reference values doesn't really make sense. If the
refs are null, but hpet is on, we still want to break out.

The div by zero would be possible to trigger by chance if both reads
from the hardware provided the exact same value (due to hardware
wrapping).

So checking if both the ref values are the same should handle if we
don't have hardware (both null) or if they are the same value (either by
invalid hardware, or by chance), avoiding the div by zero issue.

[ tglx: Applied the same fix to native_calibrate_tsc() where this
  check was copied from ]
Reported-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Tested-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: John Stultz <johnstul@us.ibm.com>
LKML-Reference: <1295024788-15619-1-git-send-email-johnstul@us.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

62627bec
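
The change of conditional described in this commit can be illustrated with a small stand-alone sketch. This is not the kernel's calibration code: the function names and the helper ticks_per_ref_unit() are hypothetical, and only the two conditions themselves are taken from the commit message.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Old sanity check, as quoted in the commit message: it only fires when
 * there is no HPET and both reference reads are zero.
 */
static bool refs_unusable_old(bool hpet, uint64_t ref_start, uint64_t ref_stop)
{
    return !hpet && !ref_start && !ref_stop;
}

/*
 * New sanity check: two identical reads give a zero delta, whether the
 * hardware is missing (both zero), invalid (a constant such as 0xffffff),
 * or the values merely happen to coincide.
 */
static bool refs_unusable_new(uint64_t ref_start, uint64_t ref_stop)
{
    return ref_start == ref_stop;
}

/* Hypothetical helper: returns 0 instead of dividing by a zero delta. */
static uint64_t ticks_per_ref_unit(uint64_t tsc_delta,
                                   uint64_t ref_start, uint64_t ref_stop)
{
    if (refs_unusable_new(ref_start, ref_stop))
        return 0;
    return tsc_delta / (ref_stop - ref_start);
}
```

With the Xen case from the report, both reads return 0xffffff: the old check lets the values through because they are non-zero, while the new check rejects them before the division.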

x86/PCI: make Broadcom CNB20LE driver EMBEDDED and EXPERIMENTAL · 64a5fed6

Committed by Bjorn Helgaas on Jan 06, 2011

This functionality is known to be incomplete, so discourage its use in
general-purpose kernels.

The only reason to use this driver is to support PCI hotplug on CNB20LE-
based machines that don't have ACPI, and there are very few such
systems.

Reference: https://bugzilla.redhat.com/show_bug.cgi?id=665109
Signed-off-by: Bjorn Helgaas <bjorn.helgaas@hp.com>
Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>

64a5fed6

x86/PCI: don't use native Broadcom CNB20LE driver when ACPI is available · 30e664af

Committed by Bjorn Helgaas on Jan 06, 2011

The broadcom_bus.c quirk was written (without benefit of documentation)
to support PCI hotplug on an old system that doesn't have ACPI. As
such, we should only use it when the system doesn't have ACPI.

If the system does have ACPI and we need the host bridge description, we
should get it from the ACPI _CRS method. On machines older than 2008,
we currently ignore _CRS, but that doesn't mean we should use
broadcom_bus.c. It means we should either (a) do what we've done in the
past and assume everything in the PCI gap is routed to bus 0 (so hotplug
may not work), or (b) arrange to use _CRS. This patch does (a).

Reference: https://bugzilla.redhat.com/show_bug.cgi?id=665109
Acked-by: Ira W. Snyder <iws@ovro.caltech.edu>
Signed-off-by: Bjorn Helgaas <bjorn.helgaas@hp.com>
Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>

30e664af

PCI: enable pci=bfsort by default on future Dell systems · 6e8af08d

Committed by Narendra_K@Dell.com on Dec 14, 2010

This patch enables pci=bfsort by default on future Dell systems.
It reads SMBIOS type 0xB1 vendor specific record and sets pci=bfsort
accordingly.

Offset  Name    Length  Value   Description

04      Flags0  Word    Varies  Bits 9-10
                                - 10:9 = 00  Unknown
                                - 10:9 = 01  Breadth First
                                - 10:9 = 10  Depth First
                                - 10:9 = 11  Reserved

1. Any time pci=bfsort has to be enabled on a system, we need to add the
   model number of the system to the white list. With this patch, that
   is not required.

2. Typically, model number has to be added to the white list when the
   system is under development. With this change, that is not required.
Signed-off-by: Jordan Hargrave <jordan_hargrave@dell.com>
Signed-off-by: Narendra K <narendra_k@dell.com>
Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>

6e8af08d
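
The Flags0 encoding listed in this commit message can be decoded with a shift and a mask. The sketch below is stand-alone C, not the kernel's DMI handling: the enum, the function names, and the assumption that the caller has already read the 16-bit Flags0 word from offset 04 of the type 0xB1 record are all hypothetical.

```c
#include <stdint.h>

/* Values of bits 10:9 of Flags0, as listed in the commit message. */
enum pci_sort_hint {
    SORT_UNKNOWN = 0,       /* 10:9 = 00 */
    SORT_BREADTH_FIRST = 1, /* 10:9 = 01 */
    SORT_DEPTH_FIRST = 2,   /* 10:9 = 10 */
    SORT_RESERVED = 3       /* 10:9 = 11 */
};

/* Extract bits 10:9 from the Flags0 word; all other bits are ignored. */
static enum pci_sort_hint decode_flags0(uint16_t flags0)
{
    return (enum pci_sort_hint)((flags0 >> 9) & 0x3);
}

/* Non-zero only when the firmware asks for breadth-first ordering. */
static int want_bfsort(uint16_t flags0)
{
    return decode_flags0(flags0) == SORT_BREADTH_FIRST;
}
```

For example, a Flags0 value of 0x0200 has bit 9 set and bit 10 clear, so want_bfsort() returns non-zero; 0x0400 selects depth first, and the function returns 0.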
