Commits · 42181186ad4db986fcaa40ca95c6e407e9e79372 · openanolis / cloud-kernel

22 Dec, 2011 6 commits

x86: Add counter when debug stack is used with interrupts enabled · 42181186

Authored by Steven Rostedt on Dec 16, 2011

Mathieu Desnoyers pointed out a case that can cause issues with
NMIs running on the debug stack:

  int3 -> interrupt -> NMI -> int3

Because the interrupt changes the stack, the NMI will not see that
it preempted the debug stack. Looking deeper at this case,
interrupts only happen when the int3 is from userspace or in
a location in the exception table (fixup).

  userspace -> int3 -> interrupt -> NMI -> int3

All other int3s that happen in the kernel should be processed
without ever enabling interrupts, as the do_trap() call will
panic the kernel if it is called to process any other location
within the kernel.

Adding a counter around the sections that enable interrupts while
using the debug stack allows the NMI to also check that case.
If the NMI sees that it either interrupted a task using the debug
stack or the debug counter is non-zero, then it will have to
change the IDT table to make the int3 not change stacks (which will
corrupt the stack if it does).
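
A minimal user-space sketch of the check described above. It is not the kernel code: a plain global stands in for the per-CPU counter, and `nmi_must_switch_idt()` is an illustrative name chosen here.

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in for the per-CPU counter; a single CPU is modeled. */
static int debug_stack_usage;

/* Bracket every section that enables interrupts while on the debug stack. */
void debug_stack_usage_inc(void)
{
	debug_stack_usage++;
}

void debug_stack_usage_dec(void)
{
	assert(debug_stack_usage > 0);
	debug_stack_usage--;
}

/*
 * NMI entry check. Returns true when the NMI has to load the IDT whose int3
 * entry does not switch stacks: either the NMI directly interrupted code on
 * the debug stack, or an int3 handler is inside a section that enabled
 * interrupts (so an interrupt may have moved execution to another stack).
 */
bool nmi_must_switch_idt(bool interrupted_debug_stack)
{
	return interrupted_debug_stack || debug_stack_usage != 0;
}
```

The counter covers exactly the case the stack test misses: after the interrupt has switched stacks, the interrupted stack pointer no longer reveals that an int3 is still in flight.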

Note, I had to move the debug_usage functions out of processor.h
and into debugreg.h because of the static inlined functions to
inc and dec the debug_usage counter. __get_cpu_var() requires
smp.h which includes processor.h, and would fail to build.

Link: http://lkml.kernel.org/r/1323976535.23971.112.camel@gandalf.stny.rr.com
Reported-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: H. Peter Anvin <hpa@linux.intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Paul Turner <pjt@google.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>

x86: Allow NMIs to hit breakpoints in i386 · ccd49c23

Authored by Steven Rostedt on Dec 13, 2011

With i386, NMIs and breakpoints use the current stack and they
do not reset the stack pointer to a fixed point that might corrupt
a previous NMI or breakpoint (as happens on x86_64). But NMIs are
still not made to be re-entrant, so we need to prevent an NMI that
hits a breakpoint (which does an iret) from allowing another NMI
to run.

The fix is to let the NMI be in 3 different states:

1) not running
2) executing
3) latched

When no NMI is executing on a given CPU, the state is "not running".
When the first NMI comes in, the state is switched to "executing".
On exit of that NMI, a cmpxchg is performed to switch the state
back to "not running" and if that fails, the NMI is restarted.

If a breakpoint is hit and does an iret, which re-enables NMIs,
and another NMI comes in before the first NMI finished, it will
detect that the state is not in the "not running" state and the
current NMI is nested. In this case, the state is switched to "latched"
to let the interrupted NMI know to restart the NMI handler, and
the nested NMI exits without doing anything.
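
The three-state scheme can be sketched in user-space C with C11 atomics. This is a single-CPU model under stated assumptions: `nested_nmi_hook` is a hypothetical test hook that simulates an NMI arriving while the handler body runs, and `handler_runs` only exists to make the restart observable.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

enum nmi_state { NMI_NOT_RUNNING, NMI_EXECUTING, NMI_LATCHED };

static _Atomic int nmi_state = NMI_NOT_RUNNING;
static int handler_runs;                 /* how often the handler body ran */
static void (*nested_nmi_hook)(void);    /* hypothetical: simulates nesting */

void handle_nmi_body(void)
{
	/* The body only ever runs in the "executing" state. */
	assert(atomic_load(&nmi_state) == NMI_EXECUTING);
	handler_runs++;

	if (nested_nmi_hook) {
		void (*hook)(void) = nested_nmi_hook;

		nested_nmi_hook = 0;     /* fire only once */
		hook();
	}
}

void do_nmi(void)
{
	/* Nested entry: latch and leave without doing any work. */
	if (atomic_load(&nmi_state) != NMI_NOT_RUNNING) {
		atomic_store(&nmi_state, NMI_LATCHED);
		return;
	}
	atomic_store(&nmi_state, NMI_EXECUTING);

	for (;;) {
		int expected = NMI_EXECUTING;

		handle_nmi_body();

		/* EXECUTING -> NOT_RUNNING; fails if a nested NMI latched. */
		if (atomic_compare_exchange_strong(&nmi_state, &expected,
						   NMI_NOT_RUNNING))
			break;

		/* Latched: consume the latch and run the handler again. */
		atomic_store(&nmi_state, NMI_EXECUTING);
	}
}
```

Setting `nested_nmi_hook = do_nmi` before calling `do_nmi()` makes the body run twice for that call: once for the original NMI and once for the latched one.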

Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: H. Peter Anvin <hpa@linux.intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Paul Turner <pjt@google.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>

x86: Keep current stack in NMI breakpoints · 228bdaa9

Authored by Steven Rostedt on Dec 09, 2011

We want to allow NMI handlers to have breakpoints to be able to
remove stop_machine from ftrace, kprobes and jump_labels. But if
an NMI interrupts a current breakpoint, and then it triggers a
breakpoint itself, it will switch to the breakpoint stack and
corrupt the data on it for the breakpoint processing that it
interrupted.

Instead, have the NMI check if it interrupted breakpoint processing
by checking if the stack that is currently used is a breakpoint
stack. If it is, then load a special IDT that changes the IST
for the debug exception to keep the same stack in kernel context.
When the NMI is done, it puts it back.

This way, if the NMI does trigger a breakpoint, it will keep
using the same stack and not stomp on the breakpoint data for
the breakpoint it interrupted.
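
A rough model of the IDT swap, for illustration only. The stack addresses and size are made-up values, `enum idt_kind` stands in for actually loading an IDT, and `nmi_enter()`/`nmi_exit()` are names invented for this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum idt_kind { IDT_NORMAL, IDT_NMI_KEEP_STACK };

/* Which IDT is loaded on the (single, modeled) CPU. */
static enum idt_kind loaded_idt = IDT_NORMAL;

/* Hypothetical placement of the debug stack; it grows down from the top. */
#define DEBUG_STACK_SIZE 0x2000u
static const uintptr_t debug_stack_top = 0x10000u;

bool is_debug_stack(uintptr_t sp)
{
	return sp <= debug_stack_top &&
	       sp > debug_stack_top - DEBUG_STACK_SIZE;
}

/* NMI entry: returns true if the IDT was switched and must be restored. */
bool nmi_enter(uintptr_t interrupted_sp)
{
	if (!is_debug_stack(interrupted_sp))
		return false;

	loaded_idt = IDT_NMI_KEEP_STACK;
	return true;
}

/* NMI exit: put the normal IDT back if nmi_enter() switched it. */
void nmi_exit(bool switched)
{
	if (switched) {
		assert(loaded_idt == IDT_NMI_KEEP_STACK);
		loaded_idt = IDT_NORMAL;
	}
}
```

The entry function hands back whether it switched, so the exit path restores the IDT only in the case where it was actually changed.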

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>

x86: Add workaround to NMI iret woes · 3f3c8b8c

Authored by Steven Rostedt on Dec 08, 2011

In x86, when an NMI goes off, the CPU goes into an NMI context that
prevents other NMIs from triggering on that CPU. If an NMI is supposed to
trigger, it has to wait till the previous NMI leaves NMI context.
At that time, the next NMI can trigger (note, only one more NMI will
trigger, as only one can be latched at a time).

The way x86 gets out of NMI context is by calling iret. The problem
with this is that it causes trouble if the NMI handler either
triggers an exception, or a breakpoint. Both the exception and the
breakpoint handlers will finish with an iret. If this happens while
in NMI context, the CPU will leave NMI context and a new NMI may come
in. As NMI handlers are not made to be re-entrant, this can cause
havoc with the system, not to mention, the nested NMI will write
all over the previous NMI's stack.

Linus Torvalds proposed the following workaround to this problem:

https://lkml.org/lkml/2010/7/14/264

"In fact, I wonder if we couldn't just do a software NMI disable
instead? Have a per-cpu variable (in the _core_ percpu areas that get
allocated statically) that points to the NMI stack frame, and just
make the NMI code itself do something like

 NMI entry:
 - load percpu NMI stack frame pointer
 - if non-zero we know we're nested, and should ignore this NMI:
    - we're returning to kernel mode, so return immediately by using
"popf/ret", which also keeps NMI's disabled in the hardware until the
"real" NMI iret happens.
    - before the popf/iret, use the NMI stack pointer to make the NMI
return stack be invalid and cause a fault
  - set the NMI stack pointer to the current stack pointer

 NMI exit (not the above "immediate exit because we nested"):
   clear the percpu NMI stack pointer
   Just do the iret.

Now, the thing is, now the "iret" is atomic. If we had a nested NMI,
we'll take a fault, and that re-does our "delayed" NMI - and NMI's
will stay masked.

And if we didn't have a nested NMI, that iret will now unmask NMI's,
and everything is happy."

I first tried to follow this advice but as I started implementing this
code, a few gotchas showed up.

One, is accessing per-cpu variables in the NMI handler.

The problem is that per-cpu variables use the %gs register to get the
variable for the given CPU. But as the NMI may happen in userspace,
we must first perform a SWAPGS to get to it. The NMI handler already
does this later in the code, but it's too late as we have saved off
all the registers and we don't want to do that for a disabled NMI.

Peter Zijlstra suggested to keep all variables on the stack. This
simplifies things greatly and it has the added benefit of cache locality.

Two, faulting on the iret.

I really wanted to make this work, but it was becoming very hacky, and
I never got it to be stable. The iret already had a fault handler for
userspace faulting with bad segment registers, and getting NMI to trigger
a fault and detect it was very tricky. But for strange reasons, the system
would usually take a double fault and crash. I never figured out why
and decided to go with a simple "jmp" approach. The new approach I took
also simplified things.

Finally, the last problem with Linus's approach was to have the nested
NMI handler do a ret instead of an iret to give the first NMI NMI-context
again.

The problem is that ret is much more limited than an iret. I couldn't figure
out how to get the stack back where it belonged. I could have copied the
current stack, pushed the return onto it, but my fear here is that there
may be some place that writes data below the stack pointer. I know that
is not something code should depend on, but I don't want to chance it.
I may add this feature later, but for now, an NMI handler that loses NMI
context will not get it back.

Here's what is done:

When an NMI comes in, the HW pushes the interrupt stack frame onto the
per cpu NMI stack that is selected by the IST.

A special location on the NMI stack holds a variable that is set when
the first NMI handler runs. If this variable is set then we know that
this is a nested NMI and we process the nested NMI code.

There is still a race when this variable is cleared and an NMI comes
in just before the first NMI does the return. For this case, if the
variable is cleared, we also check if the interrupted stack is the
NMI stack. If it is, then we process the nested NMI code.

Why the two tests and not just test the interrupted stack?

Suppose the first NMI hits a breakpoint and loses NMI context, then
hits another breakpoint, and while processing that breakpoint we get a
nested NMI. When processing a breakpoint, the stack changes to the
breakpoint stack. If another NMI comes in here we can't rely on the
interrupted stack to be the NMI stack.
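
The two tests can be put side by side in a small C sketch. The real check lives in assembly on the NMI entry path; `struct nmi_stack` and `nmi_is_nested()` are illustrative names, and the stack bounds are whatever the caller supplies.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct nmi_stack {
	uintptr_t lo;      /* lowest address of the per-CPU NMI stack */
	uintptr_t hi;      /* top of the NMI stack */
	bool executing;    /* the "NMI executing" variable kept on that stack */
};

bool nmi_is_nested(const struct nmi_stack *s, uintptr_t interrupted_sp)
{
	assert(s->lo < s->hi);

	/* Test 1: the first NMI is still running, possibly on the
	 * breakpoint stack, so the stack pointer alone would miss it. */
	if (s->executing)
		return true;

	/* Test 2: the variable was already cleared, but the first NMI has
	 * not executed its final iret and is still on the NMI stack. */
	return interrupted_sp > s->lo && interrupted_sp <= s->hi;
}
```

Neither test is sufficient alone: the variable misses the window just before the final return, and the stack test misses an NMI that is currently running on the breakpoint stack.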

If the variable is not set and the interrupted task's stack is not the
NMI stack, then we know this is the first NMI and we can process things
normally. But in order to do so, we need to do a few things first.

1) Set the stack variable that tells us that we are in an NMI handler

2) Make two copies of the interrupt stack frame.
   One copy is used to return on iret
   The other is used to restore the first one if we have a nested NMI.

This is what the stack will look like:

	  +-------------------------+
	  | original SS             |
	  | original Return RSP     |
	  | original RFLAGS         |
	  | original CS             |
	  | original RIP            |
	  +-------------------------+
	  | temp storage for rdx    |
	  +-------------------------+
	  | NMI executing variable  |
	  +-------------------------+
	  | Saved SS                |
	  | Saved Return RSP        |
	  | Saved RFLAGS            |
	  | Saved CS                |
	  | Saved RIP               |
	  +-------------------------+
	  | copied SS               |
	  | copied Return RSP       |
	  | copied RFLAGS           |
	  | copied CS               |
	  | copied RIP              |
	  +-------------------------+
	  | pt_regs                 |
	  +-------------------------+

The original stack frame contains what the HW put in when we entered
the NMI.

We store %rdx as a temp variable to use. Both the original HW stack
frame and this %rdx storage will be clobbered by nested NMIs so we
can not rely on them later in the first NMI handler.

The next item is the special stack variable that is set when we execute
the rest of the NMI handler.

Then we have two copies of the interrupt stack. The second copy is
modified by any nested NMIs to let the first NMI know that we triggered
a second NMI (latched) and that we should repeat the NMI handler.

If the first NMI hits an exception or breakpoint that takes it out of
NMI context, if a second NMI comes in before the first one finishes,
it will update the copied interrupt stack to point to a fix up location
to trigger another NMI.

When the first NMI calls iret, it will instead jump to the fix up
location. This fix up location will copy the saved interrupt stack back
to the copy and execute the nmi handler again.

Note, the nested NMI knows enough to check if it preempted a previous
NMI handler while it is in the fixup location. If it has, it will not
modify the copied interrupt stack and will just leave as if nothing
happened. As the NMI handler is about to execute again, there's no reason
to latch now.
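
The interplay of the saved and copied frames can be modeled in C as follows. This is a simplified sketch: the real nested path rewrites the whole copied frame, while only the return address is redirected here, and `REPEAT_NMI_START`/`REPEAT_NMI_END` are placeholder addresses for the fixup code, not real kernel symbols' values.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct iret_frame { uint64_t rip, cs, rflags, rsp, ss; };

struct nmi_area {
	bool executing;            /* "NMI executing" variable */
	struct iret_frame saved;   /* pristine copy, never touched by nesting */
	struct iret_frame copied;  /* frame the final iret actually uses */
};

/* Placeholder address range of the fixup code. */
#define REPEAT_NMI_START UINT64_C(0x1000)
#define REPEAT_NMI_END   UINT64_C(0x1040)

/* First NMI: mark as executing and make both copies of the HW frame. */
void first_nmi_setup(struct nmi_area *a, const struct iret_frame *hw)
{
	a->executing = true;
	a->saved = *hw;
	a->copied = *hw;
}

/* Nested NMI: returns true if it latched a repeat of the handler. */
bool nested_nmi_latch(struct nmi_area *a, uint64_t interrupted_rip)
{
	assert(a->executing);

	/* First NMI is already in the fixup code: it will rerun anyway. */
	if (interrupted_rip >= REPEAT_NMI_START &&
	    interrupted_rip < REPEAT_NMI_END)
		return false;

	/* Redirect the first NMI's iret to the fixup location. */
	a->copied.rip = REPEAT_NMI_START;
	return true;
}

/* Fixup code: restore the iret frame before running the handler again. */
void repeat_nmi_fixup(struct nmi_area *a)
{
	a->copied = a->saved;
}
```

Because nested NMIs only ever touch `copied`, the `saved` frame always holds the original return state and the fixup can restore it unconditionally.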

To test all this, I forced the NMI handler to call iret and take itself
out of NMI context. I also added assembly code to write to the serial port to
make sure that it hits the nested path as well as the fix up path.
Everything seems to be working fine.

Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: H. Peter Anvin <hpa@linux.intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Paul Turner <pjt@google.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>

x86: Document the NMI handler about not using paranoid_exit · 1fd466ef

Authored by Steven Rostedt on Dec 08, 2011

Linus cleaned up the NMI handler but it still needs some comments to
explain why it uses save_paranoid but not paranoid_exit. Just to keep
others from adding that in the future, document why it's not used.

Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andi Kleen <andi@firstfloor.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>

x86: Do not schedule while still in NMI context · 549c89b9

Authored by Linus Torvalds on Nov 29, 2011

The NMI handler uses the paranoid_exit routine that checks the
NEED_RESCHED flag, and if it is set and the return is for userspace,
then interrupts are enabled, the stack is swapped to the thread's stack,
and schedule is called. The problem with this is that we are still in an
NMI context until an iret is executed. This means that any new NMIs are
now starved until an interrupt or exception occurs and does the iret.

As NMIs can not be masked and can interrupt any location, they are
treated as a special case. NEED_RESCHED should not be set in an NMI
handler. The interruption by the NMI should not disturb the work flow
for scheduling. Any IPI sent to a processor after sending the
NEED_RESCHED would have to wait for the NMI anyway, and after the IPI
finishes the schedule would be called as required.

There is no reason to do anything special leaving an NMI. Remove the
call to paranoid_exit and do a simple return. This not only fixes the
bug of starved NMIs, but it also cleans up the code.

Link: http://lkml.kernel.org/r/CA+55aFzgM55hXTs4griX5e9=v_O+=ue+7Rj0PTD=M7hFYpyULQ@mail.gmail.com
Acked-by: Andi Kleen <ak@linux.intel.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "H. Peter Anvin" <hpa@linux.intel.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Paul Turner <pjt@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>

09 Dec, 2011 2 commits

x86, efi: Calling __pa() with an ioremap()ed address is invalid · e8c71062

Authored by Matt Fleming on Nov 18, 2011

If we encounter an efi_memory_desc_t without EFI_MEMORY_WB set
in ->attribute we currently call set_memory_uc(), which in turn
calls __pa() on a potentially ioremap'd address.

On CONFIG_X86_32 this is invalid, resulting in the following
oops on some machines:

  BUG: unable to handle kernel paging request at f7f22280
  IP: [<c10257b9>] reserve_ram_pages_type+0x89/0x210
  [...]

  Call Trace:
   [<c104f8ca>] ? page_is_ram+0x1a/0x40
   [<c1025aff>] reserve_memtype+0xdf/0x2f0
   [<c1024dc9>] set_memory_uc+0x49/0xa0
   [<c19334d0>] efi_enter_virtual_mode+0x1c2/0x3aa
   [<c19216d4>] start_kernel+0x291/0x2f2
   [<c19211c7>] ? loglevel+0x1b/0x1b
   [<c19210bf>] i386_start_kernel+0xbf/0xc8

A better approach to this problem is to map the memory region
with the correct attributes from the start, instead of modifying
it after the fact. The uncached case can be handled by
ioremap_nocache() and the cached by ioremap_cache().
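
As a sketch of "map with the correct attributes from the start", the decision reduces to one test of the descriptor's attribute word. `EFI_MEMORY_WB` is the write-back bit from the UEFI memory map; `enum efi_map_kind` and `efi_choose_mapping()` are names made up for this example, standing in for the choice between ioremap_cache() and ioremap_nocache().

```c
#include <stdint.h>

/* UEFI memory attribute bit: region supports write-back caching. */
#define EFI_MEMORY_WB UINT64_C(0x8)

enum efi_map_kind { EFI_MAP_UNCACHED, EFI_MAP_CACHED };

/* Pick the mapping type up front instead of changing it after the fact. */
enum efi_map_kind efi_choose_mapping(uint64_t attribute)
{
	return (attribute & EFI_MEMORY_WB) ? EFI_MAP_CACHED
					   : EFI_MAP_UNCACHED;
}
```

Choosing the mapping type before the region is mapped means nothing like set_memory_uc() has to run on the ioremap()ed address afterwards, which is what led to the invalid __pa() call.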

Despite first impressions, it's not possible to use
ioremap_cache() to map all cached memory regions on
CONFIG_X86_64 because EFI_RUNTIME_SERVICES_DATA regions really
don't like being mapped into the vmalloc space, as detailed in
the following bug report,

	https://bugzilla.redhat.com/show_bug.cgi?id=748516

Therefore, we need to ensure that any EFI_RUNTIME_SERVICES_DATA
regions are covered by the direct kernel mapping table on
CONFIG_X86_64. To accomplish this we now map E820_RESERVED_EFI
regions via the direct kernel mapping with the initial call to
init_memory_mapping() in setup_arch(), whereas previously these
regions wouldn't be mapped if they were after the last E820_RAM
region until efi_ioremap() was called. Doing it this way allows
us to delete efi_ioremap() completely.

Signed-off-by: Matt Fleming <matt.fleming@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Matthew Garrett <mjg@redhat.com>
Cc: Zhang Rui <rui.zhang@intel.com>
Cc: Huang Ying <huang.ying.caritas@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/1321621751-3650-1-git-send-email-matt@console-pimps.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>

x86, hpet: Immediately disable HPET timer 1 if rtc irq is masked · 2ded6e6a

Authored by Mark Langsdorf on Nov 18, 2011

When HPET is operating in RTC mode, the TN_ENABLE bit on timer1
controls whether the HPET or the RTC delivers interrupts to irq8. When
the system goes into suspend, the RTC driver sends a signal to the
HPET driver so that the HPET releases control of irq8, allowing the
RTC to wake the system from suspend. The switchover is accomplished by
a write to the HPET configuration registers which currently only
occurs while servicing the HPET interrupt.

On some systems, I have seen the system suspend before an HPET
interrupt occurs, preventing the write to the HPET configuration
register and leaving the HPET in control of the irq8. As the HPET is
not active during suspend, it does not generate a wake signal and RTC
alarms do not work.

This patch forces the HPET driver to immediately transfer control of
the irq8 channel to the RTC instead of waiting until the next
interrupt event.

Signed-off-by: Mark Langsdorf <mark.langsdorf@amd.com>
Link: http://lkml.kernel.org/r/20111118153306.GB16319@alberich.amd.com
Tested-by: Andreas Herrmann <andreas.herrmann3@amd.com>
Signed-off-by: Andreas Herrmann <andreas.herrmann3@amd.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org

06 Dec, 2011 2 commits

x86, amd: Fix up numa_node information for AMD CPU family 15h model 0-0fh northbridge functions · f62ef5f3

Authored by Andreas Herrmann on Dec 02, 2011

I've received complaints that the numa_node attribute for family
15h model 00-0fh (e.g. Interlagos) northbridge functions shows
-1 instead of the proper node ID.

Correct this with attached quirks (similar to quirks for other
AMD CPU families used in multi-socket systems).

Signed-off-by: Andreas Herrmann <andreas.herrmann3@amd.com>
Cc: Frank Arnold <frank.arnold@amd.com>
Cc: Borislav Petkov <borislav.petkov@amd.com>
Link: http://lkml.kernel.org/r/20111202072143.GA31916@alberich.amd.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>

x86/rtc, mrst: Don't register a platform RTC device for Intel MID platforms · 35d47699

Authored by Mathias Nyman on Nov 15, 2011

Intel MID x86 platforms have a memory mapped virtual RTC
instead.  No MID platform has the default ports (and
accessing them may do weird stuff).

Signed-off-by: Mathias Nyman <mathias.nyman@linux.intel.com>
Signed-off-by: Alan Cox <alan@linux.intel.com>
Cc: feng.tang@intel.com
Cc: Feng Tang <feng.tang@intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

05 Dec, 2011 11 commits

x86: Fix "Acer Aspire 1" reboot hang · 1ef03890

Authored by Peter Chubb on Dec 05, 2011

Looks like on some Acer Aspire 1s with older BIOSes, reboot via BIOS
fails.  It works on my machine (with BIOS version 0.3310), but
not on some others (BIOS version 0.3309).

There's a log of problems at:

  https://bbs.archlinux.org/viewtopic.php?id=124136

This patch adds a different callback to the reboot quirk table,
to allow rebooting via keyboard controller.

Reported-by: Uroš Vampl <mobile.leecher@gmail.com>
Tested-by: Vasily Khoruzhick <anarsoul@gmail.com>
Signed-off-by: Peter Chubb <peter.chubb@nicta.com.au>
Cc: Don Zickus <dzickus@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: stable@kernel.org
Link: http://lkml.kernel.org/r/1323093233-9481-1-git-send-email-anarsoul@gmail.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>

x86/mtrr: Resolve inconsistency with Intel processor manual · 8dbf4a30

Authored by Ajaykumar Hotchandani on Nov 11, 2011

Following is from Notes of section 11.5.3 of Intel processor
manual available at:

  http://www.intel.com/Assets/PDF/manual/325384.pdf

For the Pentium 4 and Intel Xeon processors, after the sequence of
steps given above has been executed, the cache lines containing the
code between the end of the WBINVD instruction and before the
MTRRS have actually been disabled may be retained in the cache
hierarchy. Here, to remove code from the cache completely, a
second WBINVD instruction must be executed after the MTRRs have
been disabled.

This patch provides resolution for that.

Ideally, I would like to make this change only for Pentium 4 and Xeon
processors, but I have not found an easy way to do that.
An extra wbinvd() instruction does not hurt much on other
processors.

Signed-off-by: Ajaykumar Hotchandani <ajaykumar.hotchandani@oracle.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Lucas De Marchi <lucas.demarchi@profusion.mobi>
Link: http://lkml.kernel.org/r/4EBD1CC5.3030008@oracle.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>

x86, microcode: Fix the failure path of microcode update driver init code · bd399063

Authored by Srivatsa S. Bhat on Nov 07, 2011

The microcode update driver's initialization code does not handle
failures correctly. This patch fixes this issue.

Signed-off-by: Jan Beulich <JBeulich@suse.com>
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Link: http://lkml.kernel.org/r/20111107123530.12164.31227.stgit@srivatsabhat.in.ibm.com
Link: http://lkml.kernel.org/r/4ED8E2270200007800065120@nat28.tlf.novell.com
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>

Add TAINT_FIRMWARE_WORKAROUND on MTRR fixup · 644ddf58

Authored by Prarit Bhargava on Oct 18, 2011

TAINT_FIRMWARE_WORKAROUND should be set when an MTRR fixup
is done.

Signed-off-by: Prarit Bhargava <prarit@redhat.com>
Acked-by: David Rientjes <rientjes@google.com>
Link: http://lkml.kernel.org/r/1318958650-12447-1-git-send-email-prarit@redhat.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>

x86/mpparse: Account for bus types other than ISA and PCI · 9e686668

Authored by Bjorn Helgaas on Sep 25, 2011

In commit f8924e77 ("x86: unify mp_bus_info"), the 32-bit
and 64-bit versions of MP_bus_info were rearranged to match each
other better.  Unfortunately it introduced a regression: prior
to that change we used to always set the mp_bus_not_pci bit,
then clear it if we found a PCI bus.  After it, we set
mp_bus_not_pci for ISA buses, clear it for PCI buses, and leave
it alone otherwise.

In the cases of ISA and PCI, there's not much difference.  But
ISA is not the only non-PCI bus, so it's better to always set
mp_bus_not_pci and clear it only for PCI.

Without this change, Dan's Dell PowerEdge 4200 panics on boot
with a log indicating interrupt routing trouble unless the
"noapic" option is supplied.  With this change, the machine
boots reliably without "noapic".
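
The "always set, clear only for PCI" rule can be sketched in plain C. This is an illustration, not the mpparse code: a bool array replaces the kernel bitmap, and `mp_record_bus()` and `MAX_MP_BUSSES` are names assumed for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MAX_MP_BUSSES 256

/* true = bus is not PCI; indexed by the bus ID from the MP table. */
static bool mp_bus_not_pci[MAX_MP_BUSSES];

/* MP-table bus type strings are 6 bytes, space padded (e.g. "PCI   "). */
void mp_record_bus(unsigned char busid, const char *type)
{
	assert(type != NULL);

	/* Default to "not PCI" for every bus type, including unknown ones. */
	mp_bus_not_pci[busid] = true;

	/* Only a PCI bus clears the flag. */
	if (strncmp(type, "PCI", 3) == 0)
		mp_bus_not_pci[busid] = false;
}
```

With this ordering a bus type the code has never heard of ends up flagged as non-PCI, rather than keeping whatever value the flag happened to have.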

Fixes http://bugs.debian.org/586494
Reported-bisected-and-tested-by: Dan McGrath <troubledaemon@gmail.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Cc: stable@vger.kernel.org	# 2.6.26+
Cc: Dan McGrath <troubledaemon@gmail.com>
Cc: Alexey Starikovskiy <aystarik@gmail.com>
[jrnieder@gmail.com: clarified commit message]
Signed-off-by: Jonathan Nieder <jrnieder@gmail.com>
Link: http://lkml.kernel.org/r/20111122215000.GA9151@elie.hsd1.il.comcast.net
Signed-off-by: Ingo Molnar <mingo@elte.hu>

x86/reboot: Blacklist Dell OptiPlex 990 known to require PCI reboot · 6be30bb7

Authored by Rafael J. Wysocki on Nov 16, 2011

Dell OptiPlex 990 is known to require PCI reboot, so add it to
the reboot blacklist in pci_reboot_dmi_table[].

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Link: http://lkml.kernel.org/r/201111160019.51303.rjw@sisk.pl
Signed-off-by: Ingo Molnar <mingo@elte.hu>

x86, UV: Fix UV2 hub part number · b495e039

Authored by Jack Steiner on Nov 29, 2011

There was a mixup when the SGI UV2 hub chip was sent to be
fabricated, and it ended up with the wrong part number in the
HRP_NODE_ID mmr. Future versions of the chip will (may) have the
correct part number. Change the UV infrastructure to recognize
both part numbers as valid IDs of a UV2 hub chip.

Signed-off-by: Jack Steiner <steiner@sgi.com>
Link: http://lkml.kernel.org/r/20111129210058.GA20452@sgi.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>

x86: Add user_mode_vm check in stack_overflow_check · 69682b62

Authored by Mitsuo Hayasaka on Nov 29, 2011

The kernel stack overflow is checked in stack_overflow_check(),
which may wrongly detect the overflow if the stack pointer in
user space points to the kernel stack intentionally or
accidentally. So, the actual overflow is never detected after
this misdetection because WARN_ONCE() is used on the detection
of it.

This patch adds user-mode-vm checking before it to avoid this
problem and bails out early if the user stack is used.
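
A simplified sketch of the early bail-out. It is not the kernel's stack_overflow_check(): `struct regs_snapshot` and `stack_overflow_detected()` are hypothetical, the `from_user_mode` flag stands in for what user_mode_vm() reports, and the valid stack range is passed in by the caller.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Simplified register snapshot taken at interrupt entry. */
struct regs_snapshot {
	bool from_user_mode;   /* stand-in for user_mode_vm(regs) */
	uintptr_t sp;          /* stack pointer when the interrupt hit */
};

/*
 * Returns true when a kernel stack overflow should be reported.
 * [stack_lo, stack_hi] is the valid part of the current kernel stack.
 */
bool stack_overflow_detected(const struct regs_snapshot *regs,
			     uintptr_t stack_lo, uintptr_t stack_hi)
{
	assert(stack_lo < stack_hi);

	/* Interrupted user space: sp is a user-controlled value and says
	 * nothing about the kernel stack, so bail out early. */
	if (regs->from_user_mode)
		return false;

	return regs->sp < stack_lo || regs->sp > stack_hi;
}
```

Bailing out before the range test matters because the report is issued with WARN_ONCE(): a false positive caused by a user-space stack pointer would use up the one warning and hide a later real overflow.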

Signed-off-by: Mitsuo Hayasaka <mitsuo.hayasaka.hu@hitachi.com>
Cc: yrl.pp-manager.tt@hitachi.com
Cc: Randy Dunlap <rdunlap@xenotime.net>
Link: http://lkml.kernel.org/r/20111129060821.11076.55315.stgit@ltc219.sdl.hitachi.co.jp
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: "H. Peter Anvin" <hpa@zytor.com>

perf, x86: Force IBS LVT offset assignment for family 10h · 16e5294e

Authored by Robert Richter on Nov 08, 2011

On AMD family 10h we see firmware bug messages like the following:

 [Firmware Bug]: cpu 6, try to use APIC500 (LVT offset 0) for vector 0x10400, but the register is already in use for vector 0xf9 on another cpu
 [Firmware Bug]: cpu 6, IBS interrupt offset 0 not available (MSRC001103A=0x0000000000000100)
 [Firmware Bug]: using offset 1 for IBS interrupts
 [Firmware Bug]: workaround enabled for IBS LVT offset
 perf: AMD IBS detected (0x00000007)

We always see this, since the offsets are not assigned by the BIOS for
this family. Force LVT offset assignment in this case. If the OS
assignment fails, fall back to BIOS settings and try to set this up.

The fallback to BIOS settings weakens the family check since
force_ibs_eilvt_setup() may fail e.g. in case of virtual machines.
But setup may still succeed if BIOS offsets are correct.

Other families don't have a workaround implemented that assigns LVT
offsets. It's OK to drop the call to force_ibs_eilvt_setup() for those
families.

With the patch the [Firmware Bug] messages vanish. We see now:

 IBS: LVT offset 1 assigned
 perf: AMD IBS detected (0x00000007)

Signed-off-by: Robert Richter <robert.richter@amd.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20111109162225.GO12451@erda.amd.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>

perf, x86: Disable PEBS on SandyBridge chips · 6a600a8b

Authored by Peter Zijlstra on Nov 15, 2011

Cc: Stephane Eranian <eranian@google.com>
Cc: stable@kernel.org
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

x86: Fix boot failures on older AMD CPU's · 8e8da023

Authored by Linus Torvalds on Dec 04, 2011

People with old AMD chips are getting hung boots, because commit
bcb80e53 ("x86, microcode, AMD: Add microcode revision to
/proc/cpuinfo") moved the microcode detection too early into
"early_init_amd()".

At that point we are *so* early in the boot that the exception tables
haven't even been set up yet, so the whole

	rdmsr_safe(MSR_AMD64_PATCH_LEVEL, &c->microcode, &dummy);

doesn't actually work: if the rdmsr does a GP fault (due to non-existent
MSR register on older CPU's), we can't fix it up yet, and the boot fails.

Fix it by simply moving the code to a slightly later point in the boot
(init_amd() instead of early_init_amd()), since the kernel itself
doesn't even really care about the microcode patchlevel at this point
(or really ever: it's made available to user space in /proc/cpuinfo, and
updated if you do a microcode load).

Reported-tested-and-bisected-by: Larry Finger <Larry.Finger@lwfinger.net>
Tested-by: Bob Tracy <rct@gherkin.frus.com>
Acked-by: Borislav Petkov <borislav.petkov@amd.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

04 Dec, 2011 1 commit

xen/pm_idle: Make pm_idle be default_idle under Xen. · e5fd47bf

Authored by Konrad Rzeszutek Wilk on Nov 21, 2011

The idea behind commit d91ee586 ("cpuidle: replace xen access to x86
pm_idle and default_idle") was to have one call - disable_cpuidle()
which would make pm_idle not be molested by other code.  It disallows
cpuidle_idle_call to be set to pm_idle (which is excellent).

But in the select_idle_routine() and idle_setup(), the pm_idle can still
be set to either: amd_e400_idle, mwait_idle or default_idle.  This
depends on some CPU flags (MWAIT) and in AMD case on the type of CPU.

In case of mwait_idle we can hit some instances where the hypervisor
(Amazon EC2 specifically) sets the MWAIT and we get:

  Brought up 2 CPUs
  invalid opcode: 0000 [#1] SMP

  Pid: 0, comm: swapper Not tainted 3.1.0-0.rc6.git0.3.fc16.x86_64 #1
  RIP: e030:[<ffffffff81015d1d>]  [<ffffffff81015d1d>] mwait_idle+0x6f/0xb4
  ...
  Call Trace:
   [<ffffffff8100e2ed>] cpu_idle+0xae/0xe8
   [<ffffffff8149ee78>] cpu_bringup_and_idle+0xe/0x10
  RIP  [<ffffffff81015d1d>] mwait_idle+0x6f/0xb4
   RSP <ffff8801d28ddf10>

In the case of amd_e400_idle we don't get such spectacular crashes, but we
do end up making an MSR access which is trapped in the hypervisor, and then
follow it up with a yield hypercall.  Meaning we end up going to the
hypervisor twice instead of just once.

The previous behavior before v3.0 was that pm_idle was set to
default_idle regardless of select_idle_routine/idle_setup.

We want to do that, but only for one specific case: Xen.  This patch
does that.

Fixes RH BZ #739499 and Ubuntu #881076
Reported-by: Stefan Bader <stefan.bader@canonical.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

20 Nov, 2011 1 commit

KVM guest: prevent tracing recursion with kvmclock · 95ef1e52

Authored by Avi Kivity on Nov 15, 2011

Prevent tracing of preempt_disable() in get_cpu_var() in
kvm_clock_read(). When CONFIG_DEBUG_PREEMPT is enabled,
preempt_disable/enable() are traced and this causes the function_graph
tracer to go into an infinite recursion. By open coding the
preempt_disable() around the get_cpu_var(), we can use the notrace
version which prevents preempt_disable/enable() from being traced and
prevents the recursion.

Based on a similar patch for Xen from Jeremy Fitzhardinge.

Tested-by: Gleb Natapov <gleb@redhat.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Avi Kivity <avi@redhat.com>

14 Nov, 2011 4 commits

x86: Call stop_machine_text_poke() on all CPUs · 78345d2e

Authored by Rabin Vincent on Oct 27, 2011

It appears that stop_machine_text_poke() wants to be called on all CPUs,
like it's done from text_poke_smp().  Fix text_poke_smp_batch() to do
this.

Signed-off-by: Rabin Vincent <rabin@rab.in>
Acked-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Jason Baron <jbaron@redhat.com>
Link: http://lkml.kernel.org/r/1319702072-32676-1-git-send-email-rabin@rab.in
Signed-off-by: Ingo Molnar <mingo@elte.hu>

perf/x86: Enable raw event access to Intel offcore events · ed13ec58

Authored by Peter Zijlstra on Nov 14, 2011

Now that the core offcore support is fixed up (thanks Stephane) and we
have sane generic events utilizing them, re-enable the raw access to
the feature as well.

Note that it doesn't matter if you use event 0x1b7 or 0x1bb to specify
an offcore event, either one works and neither guarantees you'll end
up on a particular offcore MSR.

Based on original patch from: Vince Weaver <vweaver1@eecs.utk.edu>.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Vince Weaver <vweaver1@eecs.utk.edu>
Cc: Stephane Eranian <eranian@google.com>
Link: http://lkml.kernel.org/r/alpine.DEB.2.00.1108031200390.703@cl320.eecs.utk.edu
Signed-off-by: Ingo Molnar <mingo@elte.hu>

ed13ec58

perf: Don't use -ENOSPC for out of PMU resources · aa2bc1ad

Committed by Peter Zijlstra on Nov 09, 2011

People (Linus) objected to using -ENOSPC to signal not having enough
resources on the PMU to satisfy the request. Use -EINVAL.
Requested-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Deng-Cheng Zhu <dengcheng.zhu@gmail.com>
Cc: David Daney <david.daney@cavium.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/n/tip-xv8geaz2zpbjhlx0svmpp28n@git.kernel.org
[ merged to newer kernel, fixed up MIPS impact ]
Signed-off-by: Ingo Molnar <mingo@elte.hu>

aa2bc1ad

perf/x86: Fix PEBS instruction unwind · 57d1c0c0

Committed by Peter Zijlstra on Oct 07, 2011

Masami spotted that we always try to decode the instruction stream as
64bit instructions when running a 64bit kernel, this doesn't work for
ia32-compat proglets.

Use TIF_IA32 to detect if we need to use the 32bit instruction
decoder.
Reported-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: stable@kernel.org
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

57d1c0c0

11 Nov, 2011 · 1 commit

x86, ioapic: Only print ioapic debug information for IRQs belonging to an ioapic chip · 6fd36ba0

Committed by Mathias Nyman on Nov 10, 2011

With "apic=verbose" the print_IO_APIC() function tries to print
IRQ to pin mappings for every active irq. It assumes chip_data
is of type irq_cfg and may cause an oops if not.

As print_IO_APIC() is called from a late_initcall, other
chained irq chips may already be registered with custom
chip_data information, causing an oops. This is the case with
Intel MID SoC devices with gpio demuxers registered as irq_chips.
Signed-off-by: Mathias Nyman <mathias.nyman@linux.intel.com>
Signed-off-by: Alan Cox <alan@linux.intel.com>
[ -v2: fixed build failure ]
Signed-off-by: Ingo Molnar <mingo@elte.hu>

6fd36ba0

10 Nov, 2011 · 4 commits

x86/mrst: Avoid reporting wrong nmi status · 064a59b6

Committed by Jacob Pan on Nov 10, 2011

The Moorestown/Medfield platform does not have port 0x61 to report
NMI status, nor does it have external NMI sources. The only NMI
sources are from the lapic, as a result of perf counter overflow or
IPI, e.g. NMI watchdog or spin lock debug.

Reading port 0x61 on Moorestown returns 0xff, which misleads the
NMI handlers into reporting false critical errors such as a memory
parity error. The subsequent ioport access for NMI handling can also
cause undefined behavior on Moorestown.

This patch allows the kernel to process NMIs due to the watchdog or a
backtrace dump without unnecessary hangs.
Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
[hand applied]
Signed-off-by: Alan Cox <alan@linux.intel.com>

064a59b6

x86/apic: Allow use of lapic timer early calibration result · 1ade93ef

Committed by Jacob Pan on Nov 10, 2011

lapic timer calibration can be combined with tsc in platform
specific calibration functions. If such a calibration result is
obtained early, we can skip the redundant calibration loops.
Signed-off-by: Jacob Pan <jacob.jun.pan@intel.com>
Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
Signed-off-by: Alan Cox <alan@linux.intel.com>
Signed-off-by: Dirk Brandewie <dirk.brandewie@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

1ade93ef

x86/apic: Do not clear nr_irqs_gsi if no legacy irqs · bb84ac2d

Committed by Jacob Pan on Nov 10, 2011

nr_legacy_irqs is set in probe_nr_irqs_gsi, we should not clear
it after that. Otherwise, the result is that MSI irqs will be
allocated from the wrong range for the systems without legacy
PIC.
Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
Signed-off-by: Dirk Brandewie <dirk.brandewie@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

bb84ac2d

x86/platform: Add a wallclock_init func to x86_platforms ops · cf8ff6b6

Committed by Feng Tang on Nov 10, 2011

Some wall clock devices use MMIO based HW registers; this new
function gives them a chance to do some initialization work
before their get/set_time service gets called.
Signed-off-by: Feng Tang <feng.tang@intel.com>
Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
Signed-off-by: Alan Cox <alan@linux.intel.com>
Signed-off-by: Dirk Brandewie <dirk.brandewie@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

cf8ff6b6

08 Nov, 2011 · 1 commit

x86/mce: Make mce_chrdev_ops 'static const' · 66f5ddf3

Committed by Luck, Tony on Nov 03, 2011

Arjan would like to make struct file_operations const, but
mce-inject directly writes to the mce_chrdev_ops to install its
write handler. In an ideal world mce-inject would have its own
character device, but we have a sizable legacy of test scripts
that hardwire "/dev/mcelog", so it would be painful to switch to
a separate device now. Instead, this patch switches to a stub
function in the mce code, with a registration helper that
mce-inject can call when it is loaded.
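
The stub-plus-registration pattern described above can be sketched in plain user-space C. This is an illustrative model under stated assumptions, not the kernel implementation: the `mock_*` names, the simplified `write` signature and the `-EINVAL` fallback when no handler is registered are all chosen for the example.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>

/* Simplified write callback; the real file_operations signature differs. */
typedef ssize_t (*write_handler)(const char *buf, size_t len);

struct mock_file_ops {
	write_handler write;
};

/* Set by the (hypothetical) injector module when it loads. */
static write_handler injected_write;

/* Registration helper the injector calls instead of patching the ops table. */
static void mock_register_write_handler(write_handler fn)
{
	assert(fn);
	injected_write = fn;
}

/* Stub installed in the ops table: forwards to the registered handler. */
static ssize_t mock_stub_write(const char *buf, size_t len)
{
	if (injected_write)
		return injected_write(buf, len);
	return -EINVAL;  /* assumed fallback when nothing is registered */
}

/* The ops table can now be const because nobody writes to it at run time. */
static const struct mock_file_ops mock_chrdev_ops = {
	.write = mock_stub_write,
};

/* Example handler standing in for the injector's write routine. */
static ssize_t mock_inject_write(const char *buf, size_t len)
{
	(void)buf;
	return (ssize_t)len;  /* pretend every byte was consumed */
}
```

Before registration, `mock_chrdev_ops.write(...)` returns the fallback error; after `mock_register_write_handler(mock_inject_write)` the same const table reaches the injector's handler.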

Note that this would also allow for a sane process to allow
mce-inject to be unloaded again (with an unregister function,
and appropriate module_{get,put}() calls), but that is left for
potential future patches.
Reported-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Link: http://lkml.kernel.org/r/4eb2e1971326651a3b@agluck-desktop.sc.intel.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>

66f5ddf3

01 Nov, 2011 · 5 commits

i7core_edac: Drop the edac_mce facility · 4140c542

Committed by Borislav Petkov on Jul 18, 2011

Remove edac_mce pieces and use the normal MCE decoder notifier chain by
retaining the same functionality with considerably less code.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>

4140c542

Cross Memory Attach · fcf63409

Committed by Christopher Yeoh on Oct 31, 2011

The basic idea behind cross memory attach is to allow MPI programs doing
intra-node communication to do a single copy of the message rather than a
double copy of the message via shared memory.

The following patch attempts to achieve this by allowing a destination
process, given an address and size from a source process, to copy memory
directly from the source process into its own address space via a system
call.  There is also a symmetrical ability to copy from the current
process's address space into a destination process's address space.
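
As a rough illustration of the read direction, here is a minimal user-space sketch that uses the `process_vm_readv()` wrapper as glibc exposes it in `<sys/uio.h>`. The helper names `copy_from_process()` and `self_copy_demo()` are made up for this example, and the demo reads from the caller's own pid only so that it needs no second process.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/*
 * Copy len bytes from remote_addr in process pid into local_buf with a
 * single process_vm_readv() call. Returns the number of bytes copied,
 * or -1 with errno set.
 */
static ssize_t copy_from_process(pid_t pid, void *remote_addr,
				 void *local_buf, size_t len)
{
	struct iovec local = { .iov_base = local_buf, .iov_len = len };
	struct iovec remote = { .iov_base = remote_addr, .iov_len = len };

	assert(local_buf && remote_addr);
	/* One local and one remote segment; flags is passed as 0. */
	return process_vm_readv(pid, &local, 1, &remote, 1, 0);
}

/*
 * Demo: read a buffer out of our own address space.
 * Returns 0 on success, 1 if the syscall is unavailable or not
 * permitted in this environment, -1 on a short or mismatched copy.
 */
static int self_copy_demo(void)
{
	char src[] = "cross memory attach";
	char dst[sizeof(src)] = { 0 };
	ssize_t n = copy_from_process(getpid(), src, dst, sizeof(src));

	if (n == -1 && (errno == ENOSYS || errno == EPERM))
		return 1;
	if (n == (ssize_t)sizeof(src) && memcmp(src, dst, sizeof(src)) == 0)
		return 0;
	return -1;
}
```

In a real MPI-style setup the source process would send its buffer address and length to the destination over an existing channel, and the destination would pass the source's pid to the helper instead of `getpid()`.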

- Use of /proc/pid/mem has been considered, but there are issues with
  using it:
  - Does not allow for specifying iovecs for both src and dest; assuming
    preadv or pwritev was implemented, either the area read from or
    written to would need to be contiguous.
  - Currently mem_read allows only processes who are currently
    ptrace'ing the target and are still able to ptrace the target to read
    from the target. This check could possibly be moved to the open call,
    but it's not clear exactly what race this restriction is stopping
    (reason appears to have been lost).
  - Having to send the fd of /proc/self/mem via SCM_RIGHTS on a unix
    domain socket is a bit ugly from a userspace point of view,
    especially when you may have hundreds if not (eventually) thousands
    of processes that all need to do this with each other.
  - Doesn't allow for some future use of the interface we would like to
    consider adding in the future (see below).
  - Interestingly, reading from /proc/pid/mem currently actually
    involves two copies! (But this could be fixed pretty easily.)

As mentioned previously, use of vmsplice instead was considered, but it has
problems.  Since you need the reader and writer working co-operatively, if
the pipe is not drained then you block, which requires some wrapping to
do non-blocking on the send side or polling on the receive side.  In
all-to-all communication it requires ordering, otherwise you can deadlock.
And in the example of many MPI tasks writing to one MPI task, vmsplice
serialises the copying.

There are some cases of MPI collectives where even a single copy interface
does not get us the performance gain we could get.  For example, in an
MPI_Reduce, rather than copy the data from the source we would like to
instead use it directly in a math operation (say the reduce is doing a sum)
as this would save us doing a copy.  We don't need to keep a copy of the
data from the source.  I haven't implemented this, but I think this
interface could in the future do all this through the use of the flags,
e.g. one could specify the math operation and type, and the kernel, rather
than just copying the data, would apply the specified operation between the
source and destination and store the result in the destination.

Although we don't have a "second user" of the interface (though I've had
some nibbles from people who may be interested in using it for intra
process messaging which is not MPI), this interface is something which
hardware vendors are already doing for their custom drivers to implement
fast local communication.  And so in addition to this being useful for
OpenMPI it would mean the driver maintainers don't have to fix things up
when the mm changes.

There was some discussion about how much faster a true zero copy would
go. Here's a link back to the email with some testing I did on that:

http://marc.info/?l=linux-mm&m=130105930902915&w=2

There is a basic man page for the proposed interface here:

http://ozlabs.org/~cyeoh/cma/process_vm_readv.txt

This has been implemented for x86 and powerpc; other architectures should
mainly (I think) just need to add syscall numbers for process_vm_readv
and process_vm_writev. There are 32-bit compatibility versions for
64-bit kernels.

For arch maintainers there are some simple tests to be able to quickly
verify that the syscalls are working correctly here:

http://ozlabs.org/~cyeoh/cma/cma-test-20110718.tgz

Signed-off-by: Chris Yeoh <yeohc@au1.ibm.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: David Howells <dhowells@redhat.com>
Cc: James Morris <jmorris@namei.org>
Cc: <linux-man@vger.kernel.org>
Cc: <linux-arch@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

fcf63409

x86: Fix files explicitly requiring export.h for EXPORT_SYMBOL/THIS_MODULE · 69c60c88

Committed by Paul Gortmaker on May 26, 2011

These files were implicitly getting EXPORT_SYMBOL via device.h
which was including module.h, but that will be fixed up shortly.

By fixing these now, we can avoid seeing things like:

arch/x86/kernel/rtc.c:29: warning: type defaults to ‘int’ in declaration of ‘EXPORT_SYMBOL’
arch/x86/kernel/pci-dma.c:20: warning: type defaults to ‘int’ in declaration of ‘EXPORT_SYMBOL’
arch/x86/kernel/e820.c:69: warning: type defaults to ‘int’ in declaration of ‘EXPORT_SYMBOL_GPL’

[ with input from Randy Dunlap <rdunlap@xenotime.net> and also
  from Stephen Rothwell <sfr@canb.auug.org.au> ]
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>

69c60c88

x86: fix implicit include of <linux/topology.h> in vsyscall_64 · 29574022

Committed by Paul Gortmaker on May 26, 2011

In removing the presence of <linux/module.h> from some of the
more common <linux/something.h> files, this implicit include
of <linux/topology.h> was uncovered.

  CC      arch/x86/kernel/vsyscall_64.o
  arch/x86/kernel/vsyscall_64.c: In function ‘vsyscall_set_cpu’:
  arch/x86/kernel/vsyscall_64.c:259: error: implicit declaration of function ‘cpu_to_node’

Explicitly call it out so the cleanup can take place.
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>

29574022

x86, MCE: Use notifier chain only for MCE decoding · f0cb5452

Committed by Borislav Petkov on Jul 18, 2011

Drop the edac_mce custom hook in favor of the generic notifier
mechanism. Also, do not log the error to mcelog if the notified agent
was able to decode it.
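
The "skip the fallback log when a decoder handled the error" flow can be pictured with a small user-space model. This is a hedged sketch only: the `mock_*` names, the fixed-size handler array and the bank-4 example decoder are assumptions for illustration and do not mirror the kernel's notifier API.

```c
#include <assert.h>
#include <stddef.h>

enum { MOCK_NOTIFY_DONE = 0, MOCK_NOTIFY_STOP = 1 };

struct mock_mce {
	unsigned long long status;
	int bank;
};

typedef int (*decoder_fn)(const struct mock_mce *m);

#define MAX_DECODERS 4
static decoder_fn decoders[MAX_DECODERS];
static size_t nr_decoders;
static int logged_count;  /* stand-in for entries queued to the fallback log */

/* Add a decoder to the chain; returns 0 on success, -1 if full or NULL. */
static int mock_register_decoder(decoder_fn fn)
{
	if (!fn || nr_decoders == MAX_DECODERS)
		return -1;
	decoders[nr_decoders++] = fn;
	return 0;
}

/* Walk the chain in order; stop as soon as one decoder claims the error. */
static int mock_call_chain(const struct mock_mce *m)
{
	for (size_t i = 0; i < nr_decoders; i++)
		if (decoders[i](m) == MOCK_NOTIFY_STOP)
			return MOCK_NOTIFY_STOP;
	return MOCK_NOTIFY_DONE;
}

/* Report an error: only fall back to the raw log if nobody decoded it. */
static void mock_report(const struct mock_mce *m)
{
	assert(m);
	if (mock_call_chain(m) == MOCK_NOTIFY_STOP)
		return;
	logged_count++;
}

/* Example decoder: claims only errors reported by bank 4. */
static int mock_bank4_decoder(const struct mock_mce *m)
{
	return m->bank == 4 ? MOCK_NOTIFY_STOP : MOCK_NOTIFY_DONE;
}
```

With `mock_bank4_decoder` registered, a bank 4 error is consumed by the chain and never reaches the fallback counter, while an error from any other bank still does.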
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>

f0cb5452

26 Oct, 2011 · 2 commits

x86/jump_label: add arch_jump_label_transform_static() · e71a5be1

Committed by Jeremy Fitzhardinge on Sep 29, 2011

This allows jump-label entries to be cheaply updated on code which is
not yet live.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Acked-by: Jason Baron <jbaron@redhat.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>

e71a5be1

x86/jump_label: drop arch_jump_label_text_poke_early() · b7e31558

Committed by Jeremy Fitzhardinge on Sep 29, 2011

It is no longer used.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Acked-by: Jason Baron <jbaron@redhat.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>

b7e31558

openanolis / cloud-kernel · synced successfully more than 1 year ago
