提交 · 5cdaf1834f43b0edc4a3aa683aa4ec98f6bfe8a7 · openanolis / cloud-kernel

26 4月, 2012 1 次提交

x86: Add task_struct argument to smp_ops.cpu_up · 5cdaf183

由 Thomas Gleixner 提交于 4月 20, 2012

Preparatory patch to use the generic idle thread allocation.
Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Cc: x86@kernel.org
Link: http://lkml.kernel.org/r/20120420124557.176604405@linutronix.de

5cdaf183

24 4月, 2012 2 次提交

x86, acpi: Call acpi_enter_sleep_state via an asmlinkage C function from assembler · cd74257b

由 Konrad Rzeszutek Wilk 提交于 4月 22, 2012

With commit a2ef5c4f
"ACPI: Move module parameter gts and bfs to sleep.c" the
wake_sleep_flags is required when calling acpi_enter_sleep_state.

The assembler code in wakeup_*.S did not do that. One solution
is to call it from assembler and stick the wake_sleep_flags on
the stack (for 32-bit) or in %esi (for 64-bit). hpa and rafael
both suggested however to create a wrapper function to call
acpi_enter_sleep_state and call said wrapper function
("acpi_enter_s3") from assembler.

For 32-bit, the acpi_enter_s3 ends up looking as so:

  push   %ebp
  mov    %esp,%ebp
  sub    $0x8,%esp
  movzbl 0xc1809314,%eax [wake_sleep_flags]
  movl   $0x3,(%esp)
  mov    %eax,0x4(%esp)
  call   0xc12d1fa0 <acpi_enter_sleep_state>
  leave
  ret

And 64-bit:

  movzbl 0x9afde1(%rip),%esi        [wake_sleep_flags]
  push   %rbp
  mov    $0x3,%edi
  mov    %rsp,%rbp
  callq  0xffffffff812e9800 <acpi_enter_sleep_state>
  leaveq
  retq
Reviewed-by: NH. Peter Anvin <hpa@zytor.com>
Suggested-by: NH. Peter Anvin <hpa@zytor.com>
[v2: Remove extra assembler operations, per hpa review]
Signed-off-by: NKonrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Link: http://lkml.kernel.org/r/1335150198-21899-3-git-send-email-konrad.wilk@oracle.comSigned-off-by: NH. Peter Anvin <hpa@linux.intel.com>

cd74257b

ACPI: Convert wake_sleep_flags to a value instead of function · 2a14e541

由 Konrad Rzeszutek Wilk 提交于 4月 22, 2012

With commit a2ef5c4f
"ACPI: Move module parameter gts and bfs to sleep.c" the wake_sleep_flags
is required when calling acpi_enter_sleep_state, which means
that if there are functions outside the sleep.c code they
can't get the wake_sleep_flags values.

This converts the function in to a exported value and converts
the module config operands to a function.
Acked-by: NRafael J. Wysocki <rjw@sisk.pl>
Acked-by: NLin Ming <ming.m.lin@intel.com>
[v2: Parameters can be turned on/off dynamically]
[v3: unsigned char -> u8]
[v4: val -> kp->arg]
Signed-off-by: NKonrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Link: http://lkml.kernel.org/r/1335150198-21899-2-git-send-email-konrad.wilk@oracle.comSigned-off-by: NH. Peter Anvin <hpa@linux.intel.com>

2a14e541

19 4月, 2012 1 次提交

x86, apic: APIC code touches invalid MSR on P5 class machines · cbf2829b

由 Bryan O'Donoghue 提交于 4月 18, 2012

Current APIC code assumes MSR_IA32_APICBASE is present for all systems.
Pentium Classic P5 and friends didn't have this MSR. MSR_IA32_APICBASE
was introduced as an architectural MSR by Intel @ P6.

Code paths that can touch this MSR invalidly are when vendor == Intel &&
cpu-family == 5 and APIC bit is set in CPUID - or when you simply pass
lapic on the kernel command line, on a P5.

The below patch stops Linux incorrectly interfering with the
MSR_IA32_APICBASE for P5 class machines. Other code paths exist that
touch the MSR - however those paths are not currently reachable for a
conformant P5.
Signed-off-by: NBryan O'Donoghue <bryan.odonoghue@linux.intel.com>
Link: http://lkml.kernel.org/r/4F8EEDD3.1080404@linux.intel.comSigned-off-by: NH. Peter Anvin <hpa@linux.intel.com>
Cc: <stable@vger.kernel.org>

cbf2829b

17 4月, 2012 2 次提交

i387: ptrace breaks the lazy-fpu-restore logic · 089f9fba

由 Oleg Nesterov 提交于 4月 16, 2012

Starting from 7e16838d "i387: support lazy restore of FPU state"
we assume that fpu_owner_task doesn't need restore_fpu_checking()
on the context switch, its FPU state should match what we already
have in the FPU on this CPU.

However, debugger can change the tracee's FPU state, in this case
we should reset fpu.last_cpu to ensure fpu_lazy_restore() can't
return true.

Change init_fpu() to do this, it is called by user_regset->set()
methods.
Reported-by: NJan Kratochvil <jan.kratochvil@redhat.com>
Suggested-by: NLinus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: NOleg Nesterov <oleg@redhat.com>
Link: http://lkml.kernel.org/r/20120416204815.GB24884@redhat.com
Cc: <stable@vger.kernel.org> v3.3
Signed-off-by: NH. Peter Anvin <hpa@linux.intel.com>

089f9fba

x86/platform: Remove incorrect error message in x86_default_fixup_cpu_id() · 68894632

由 Andreas Herrmann 提交于 4月 02, 2012

It's only called from amd.c:srat_detect_node(). The introduced
condition for calling the fixup code is true for all AMD
multi-node processors, e.g. Magny-Cours and Interlagos. There we
have 2 NUMA nodes on one socket. Thus there are cores having
different numa-node-id but with equal phys_proc_id.

There is no point to print error messages in such a situation.

The confusing/misleading error message was introduced with
commit 64be4c1c ("x86: Add
x86_init platform override to fix up NUMA core numbering").

Remove the default fixup function (especially the error message)
and replace it by a NULL pointer check, move the
Numascale-specific condition for calling the fixup into the
fixup-function itself and slightly adapt the comment.
Signed-off-by: NAndreas Herrmann <andreas.herrmann3@amd.com>
Acked-by: NBorislav Petkov <borislav.petkov@amd.com>
Cc: <stable@kernel.org>
Cc: <sp@numascale.com>
Cc: <bp@amd64.org>
Cc: <daniel@numascale-asia.com>
Link: http://lkml.kernel.org/r/20120402160648.GR27684@alberich.amd.comSigned-off-by: NIngo Molnar <mingo@kernel.org>

68894632

16 4月, 2012 1 次提交

x86/amd: Remove broken links from comment and kernel message · d7de8649

由 Andreas Herrmann 提交于 4月 11, 2012

Signed-off-by: NAndreas Herrmann <andreas.herrmann3@amd.com>
Link: http://lkml.kernel.org/r/20120411151238.GA4794@alberich.amd.comSigned-off-by: NIngo Molnar <mingo@kernel.org>

d7de8649

13 4月, 2012 2 次提交

x86, microcode: Ensure that module is only loaded on supported AMD CPUs · 283c1f25

由 Andreas Herrmann 提交于 4月 12, 2012

Exit early when there's no support for a particular CPU family. Also,
fixup the "no support for this CPU vendor" to be issued only when the
driver is attempted to be loaded on an unsupported vendor.

Cc: stable@vger.kernel.org
Cc: Tigran Aivazian <tigran@aivazian.fsnet.co.uk>
Signed-off-by: NAndreas Herrmann <andreas.herrmann3@amd.com>
Acked-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
Link: http://lkml.kernel.org/r/20120411163849.GE4794@alberich.amd.com
[Boris: add a commit msg because Andreas is lazy]
Signed-off-by: NBorislav Petkov <borislav.petkov@amd.com>

283c1f25

x86, microcode: Fix sysfs warning during module unload on unsupported CPUs · a956bd6f

由 Andreas Herrmann 提交于 4月 12, 2012

Loading the microcode driver on an unsupported CPU and subsequently
unloading the driver causes

 WARNING: at fs/sysfs/group.c:138 mc_device_remove+0x5f/0x70 [microcode]()
 Hardware name: 01972NG
 sysfs group ffffffffa00013d0 not found for kobject 'cpu0'
 Modules linked in: snd_hda_codec_hdmi snd_hda_codec_conexant snd_hda_intel btusb snd_hda_codec bluetooth thinkpad_acpi rfkill microcode(-) [last unloaded: cfg80211]
 Pid: 4560, comm: modprobe Not tainted 3.4.0-rc2-00002-g258f7426 #5
 Call Trace:
  [<ffffffff8103113b>] ? warn_slowpath_common+0x7b/0xc0
  [<ffffffff81031235>] ? warn_slowpath_fmt+0x45/0x50
  [<ffffffff81120e74>] ? sysfs_remove_group+0x34/0x120
  [<ffffffffa00000ef>] ? mc_device_remove+0x5f/0x70 [microcode]
  [<ffffffff81331eb9>] ? subsys_interface_unregister+0x69/0xa0
  [<ffffffff81563526>] ? mutex_lock+0x16/0x40
  [<ffffffffa0000c3e>] ? microcode_exit+0x50/0x92 [microcode]
  [<ffffffff8107051d>] ? sys_delete_module+0x16d/0x260
  [<ffffffff810a0065>] ? wait_iff_congested+0x45/0x110
  [<ffffffff815656af>] ? page_fault+0x1f/0x30
  [<ffffffff81565ba2>] ? system_call_fastpath+0x16/0x1b

on recent kernels.

This is due to commit 8a25a2fd ("cpu: convert 'cpu' and
'machinecheck' sysdev_class to a regular subsystem") which renders
commit 6c53cbfc ("x86, microcode: Correct sysdev_add error path")
useless.

See http://marc.info/?l=linux-kernel&m=133416246406478

Avoid above warning by restoring the old driver behaviour before
6c53cbfc ("x86, microcode: Correct sysdev_add error path").

Cc: stable@vger.kernel.org
Cc: Tigran Aivazian <tigran@aivazian.fsnet.co.uk>
Signed-off-by: NAndreas Herrmann <andreas.herrmann3@amd.com>
Acked-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
Link: http://lkml.kernel.org/r/20120411163849.GE4794@alberich.amd.comSigned-off-by: NBorislav Petkov <borislav.petkov@amd.com>

a956bd6f

06 4月, 2012 3 次提交

x86: vsyscall: Use NULL instead 0 for a pointer argument · 46ed99d1

由 Emil Goode 提交于 4月 01, 2012

This patch silences the following sparse warning:
arch/x86/kernel/vsyscall_64.c:250:34:
       warning: Using plain integer as NULL pointer
Signed-off-by: NEmil Goode <emilgoode@gmail.com>
Acked-by: NAndy Lutomirski <luto@amacapital.net>
Cc: john.stultz@linaro.org
Link: http://lkml.kernel.org/r/1333306084-3776-1-git-send-email-emilgoode@gmail.comSigned-off-by: NThomas Gleixner <tglx@linutronix.de>

46ed99d1

simple_open: automatically convert to simple_open() · 234e3405

由 Stephen Boyd 提交于 4月 05, 2012

Many users of debugfs copy the implementation of default_open() when
they want to support a custom read/write function op.  This leads to a
proliferation of the default_open() implementation across the entire
tree.

Now that the common implementation has been consolidated into libfs we
can replace all the users of this function with simple_open().

This replacement was done with the following semantic patch:

<smpl>
@ open @
identifier open_f != simple_open;
identifier i, f;
@@
-int open_f(struct inode *i, struct file *f)
-{
(
-if (i->i_private)
-f->private_data = i->i_private;
|
-f->private_data = i->i_private;
)
-return 0;
-}

@ has_open depends on open @
identifier fops;
identifier open.open_f;
@@
struct file_operations fops = {
...
-.open = open_f,
+.open = simple_open,
...
};
</smpl>

[akpm@linux-foundation.org: checkpatch fixes]
Signed-off-by: NStephen Boyd <sboyd@codeaurora.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Julia Lawall <Julia.Lawall@lip6.fr>
Acked-by: NIngo Molnar <mingo@elte.hu>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

234e3405

KVM: Resolve RCU vs. async page fault problem · e0875921

由 Gleb Natapov 提交于 4月 04, 2012

"Page ready" async PF can kick vcpu out of idle state much like IRQ.
We need to tell RCU about this.
Reported-by: NSasha Levin <levinsasha928@gmail.com>
Signed-off-by: NGleb Natapov <gleb@redhat.com>
Reviewed-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: NAvi Kivity <avi@redhat.com>

e0875921

03 4月, 2012 1 次提交

perf/x86/p4: Add format attributes · 7b8e6da4

由 Peter Zijlstra 提交于 3月 27, 2012

Steven reported his P4 not booting properly, the missing format
attributes cause a NULL ptr deref. Cure this by adding the
missing format specification.

I took the format description out of the comment near
p4_config_pack*() and hope that comment is still relatively
accurate.
Reported-by: NSteven Rostedt <rostedt@goodmis.org>
Reported-by: NBruno Prémont <bonbons@linux-vserver.org>
Tested-by: NSteven Rostedt <rostedt@goodmis.org>
Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Lin Ming <ming.m.lin@intel.com>
Cc: Stephane Eranian <eranian@google.com>
Link: http://lkml.kernel.org/r/1332859842.16159.227.camel@twinsSigned-off-by: NIngo Molnar <mingo@kernel.org>

7b8e6da4

30 3月, 2012 4 次提交

ACPI: Fix use-after-free in acpi_map_lsapic · ac909ec3

由 Petr Vandrovec 提交于 3月 08, 2012

When processor is being hot-added to the system, acpi_map_lsapic invokes
ACPI _MAT method to find APIC ID and flags, verifies that returned structure
is indeed ACPI's local APIC structure, and that flags contain MADT_ENABLED
bit.  Then saves APIC ID, frees structure - and accesses structure when
computing arguments for acpi_register_lapic call.  Which sometime leads
to acpi_register_lapic call being made with second argument zero, failing
to bring processor online with error 'Unable to map lapic to logical cpu
number'.

As lapic->lapic_flags & ACPI_MADT_ENABLED was already confirmed to be non-zero
few lines above, we can just pass unconditional ACPI_MADT_ENABLED to the
acpi_register_lapic.
Signed-off-by: NPetr Vandrovec <petr@vmware.com>
Signed-off-by: NAlok N Kataria <akataria@vmware.com>
Reviewed-by: NToshi Kani <toshi.kani@hp.com>
Signed-off-by: NLen Brown <len.brown@intel.com>

ac909ec3

idle, x86: Allow off-lined CPU to enter deeper C states · 1a022e3f

由 Boris Ostrovsky 提交于 3月 13, 2012

Currently when a CPU is off-lined it enters either MWAIT-based idle or,
if MWAIT is not desired or supported, HLT-based idle (which places the
processor in C1 state). This patch allows processors without MWAIT
support to stay in states deeper than C1.
Signed-off-by: NBoris Ostrovsky <boris.ostrovsky@amd.com>
Signed-off-by: NLen Brown <len.brown@intel.com>

1a022e3f

x86: Remove the ancient and deprecated disable_hlt() and enable_hlt() facility · f6365201

由 Len Brown 提交于 3月 29, 2012

The X86_32-only disable_hlt/enable_hlt mechanism was used by the
32-bit floppy driver. Its effect was to replace the use of the
HLT instruction inside default_idle() with cpu_relax() - essentially
it turned off the use of HLT.

This workaround was commented in the code as:

 "disable hlt during certain critical i/o operations"

 "This halt magic was a workaround for ancient floppy DMA
  wreckage. It should be safe to remove."

H. Peter Anvin additionally adds:

 "To the best of my knowledge, no-hlt only existed because of
  flaky power distributions on 386/486 systems which were sold to
  run DOS.  Since DOS did no power management of any kind,
  including HLT, the power draw was fairly uniform; when exposed
  to the much hhigher noise levels you got when Linux used HLT
  caused some of these systems to fail.

  They were by far in the minority even back then."

Alan Cox further says:

 "Also for the Cyrix 5510 which tended to go castors up if a HLT
  occurred during a DMA cycle and on a few other boxes HLT during
  DMA tended to go astray.

  Do we care ? I doubt it. The 5510 was pretty obscure, the 5520
  fixed it, the 5530 is probably the oldest still in any kind of
  use."

So, let's finally drop this.
Signed-off-by: NLen Brown <len.brown@intel.com>
Signed-off-by: NJosh Boyer <jwboyer@redhat.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Acked-by: N"H. Peter Anvin" <hpa@zytor.com>
Acked-by: NAlan Cox <alan@lxorguk.ukuu.org.uk>
Cc: Stephen Hemminger <shemminger@vyatta.com
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: <stable@kernel.org>
Link: http://lkml.kernel.org/n/tip-3rhk9bzf0x9rljkv488tloib@git.kernel.org
[ If anyone cares then alternative instruction patching could be
  used to replace HLT with a one-byte NOP instruction. Much simpler. ]
Signed-off-by: NIngo Molnar <mingo@kernel.org>

f6365201

x86,kgdb: Fix DEBUG_RODATA limitation using text_poke() · 3751d3e8

由 Jason Wessel 提交于 3月 23, 2012

There has long been a limitation using software breakpoints with a
kernel compiled with CONFIG_DEBUG_RODATA going back to 2.6.26. For
this particular patch, it will apply cleanly and has been tested all
the way back to 2.6.36.

The kprobes code uses the text_poke() function which accommodates
writing a breakpoint into a read-only page.  The x86 kgdb code can
solve the problem similarly by overriding the default breakpoint
set/remove routines and using text_poke() directly.

The x86 kgdb code will first attempt to use the traditional
probe_kernel_write(), and next try using a the text_poke() function.
The break point install method is tracked such that the correct break
point removal routine will get called later on.

Cc: x86@kernel.org
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: stable@vger.kernel.org # >= 2.6.36
Inspried-by: NMasami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: NJason Wessel <jason.wessel@windriver.com>

3751d3e8

29 3月, 2012 5 次提交

x86: Preserve lazy irq disable semantics in fixup_irqs() · 99dd5497

由 Liu, Chuansheng 提交于 3月 26, 2012

The default irq_disable() sematics are to mark the interrupt disabled,
but keep it unmasked. If the interrupt is delivered while marked
disabled, the low level interrupt handler masks it and marks it
pending. This is important for detecting wakeup interrupts during
suspend and for edge type interrupts to avoid losing interrupts.

fixup_irqs() moves the interrupts away from an offlined cpu. For
certain interrupt types it needs to mask the interrupt line before
changing the affinity. After affinity has changed the interrupt line
is unmasked again, but only if it is not marked disabled.

This breaks the lazy irq disable semantics and causes problems in
suspend as the interrupt can be lost or wakeup functionality is
broken.

Check irqd_irq_masked() instead of irqd_irq_disabled() because
irqd_irq_masked() is only set, when the core code actually masked the
interrupt line. If it's not set, we unmask the interrupt and let the
lazy irq disable logic deal with an eventually incoming interrupt.

[ tglx: Massaged changelog and added a comment ]
Signed-off-by: Nliu chuansheng <chuansheng.liu@intel.com>
Cc: Yanmin Zhang <yanmin_zhang@linux.intel.com>
Link: http://lkml.kernel.org/r/27240C0AC20F114CBF8149A2696CBE4A05DFB3@SHSMSX101.ccr.corp.intel.comSigned-off-by: NThomas Gleixner <tglx@linutronix.de>

99dd5497

kdump x86: fix total mem size calculation for reservation · 09c71bfd

由 Dave Young 提交于 3月 28, 2012

crashkernel reservation need know the total memory size.  Current
get_total_mem simply use max_pfn - min_low_pfn.  It is wrong because it
will including memory holes in the middle.

Especially for kvm guest with memory > 0xe0000000, there's below in qemu
code: qemu split memory as below:

    if (ram_size >= 0xe0000000 ) {
        above_4g_mem_size = ram_size - 0xe0000000;
        below_4g_mem_size = 0xe0000000;
    } else {
        below_4g_mem_size = ram_size;
    }

So for 4G mem guest, seabios will insert a 512M usable region beyond of
4G.  Thus in above case max_pfn - min_low_pfn will be more than original
memsize.

Fixing this issue by using memblock_phys_mem_size() to get the total
memsize.
Signed-off-by: NDave Young <dyoung@redhat.com>
Reviewed-by: NWANG Cong <xiyou.wangcong@gmail.com>
Reviewed-by: NSimon Horman <horms@verge.net.au>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

09c71bfd

x86/apic/amd: Be more verbose about LVT offset assignments · 8abc3122

由 Robert Richter 提交于 3月 27, 2012

Add information about LVT offset assignments to better debug firmware
bugs related to this. See following examples.

 # dmesg | grep -i 'offset\|ibs'
 LVT offset 0 assigned for vector 0xf9
 [Firmware Bug]: cpu 0, try to use APIC500 (LVT offset 0) for vector 0x10400, but the register is already in use for vector 0xf9 on another cpu
 [Firmware Bug]: cpu 0, IBS interrupt offset 0 not available (MSRC001103A=0x0000000000000100)
 Failed to setup IBS, -22

In this case the BIOS assigns both offsets for MCE (0xf9) and IBS
(0x400) vectors to offset 0, which is why the second APIC setup (IBS)
failed.

With correct setup you get:

 # dmesg | grep -i 'offset\|ibs'
 LVT offset 0 assigned for vector 0xf9
 LVT offset 1 assigned for vector 0x400
 IBS: LVT offset 1 assigned
 perf: AMD IBS detected (0x00000007)
 oprofile: AMD IBS detected (0x00000007)

Note: The vector includes also the message type to handle also NMIs
(0x400). In the firmware bug message the format is the same as of the
APIC500 register and includes the mask bit (bit 16) in addition.
Signed-off-by: NRobert Richter <robert.richter@amd.com>
Signed-off-by: NIngo Molnar <mingo@kernel.org>

8abc3122

Disintegrate asm/system.h for X86 · f05e798a

由 David Howells 提交于 3月 28, 2012

Disintegrate asm/system.h for X86.
Signed-off-by: NDavid Howells <dhowells@redhat.com>
Acked-by: NH. Peter Anvin <hpa@zytor.com>
cc: x86@kernel.org

f05e798a

x86, tls: Off by one limit check · 8f0750f1

由 Dan Carpenter 提交于 3月 24, 2012

These are used as offsets into an array of GDT_ENTRY_TLS_ENTRIES members
so GDT_ENTRY_TLS_ENTRIES is one past the end of the array.
Signed-off-by: NDan Carpenter <dan.carpenter@oracle.com>
Link: http://lkml.kernel.org/r/20120324075250.GA28258@elgon.mountain
Cc: <stable@vger.kernel.org>
Signed-off-by: NH. Peter Anvin <hpa@zytor.com>

8f0750f1

28 3月, 2012 2 次提交

X86 & IA64: adapt for dma_map_ops changes · baa676fc

由 Andrzej Pietrasiewicz 提交于 3月 27, 2012

Adapt core x86 and IA64 architecture code for dma_map_ops changes: replace
alloc/free_coherent with generic alloc/free methods.
Signed-off-by: NAndrzej Pietrasiewicz <andrzej.p@samsung.com>
Acked-by: NKyungmin Park <kyungmin.park@samsung.com>
[removed swiotlb related changes and replaced it with wrappers,
 merged with IA64 patch to avoid inter-patch dependences in intel-iommu code]
Signed-off-by: NMarek Szyprowski <m.szyprowski@samsung.com>
Reviewed-by: NArnd Bergmann <arnd@arndb.de>
Acked-by: NTony Luck <tony.luck@intel.com>

baa676fc

x86/ioapic: Add io_apic_ops driver layer to allow interception · 136d249e

由 Jeremy Fitzhardinge 提交于 3月 21, 2012

Xen dom0 needs to paravirtualize IO operations to the IO APIC,
so add a io_apic_ops for it to intercept.  Do this as ops
structure because there's at least some chance that another
paravirtualized environment may want to intercept these.
Signed-off-by: NJeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: NKonrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Acked-by: NSuresh Siddha <suresh.b.siddha@intel.com>
Cc: jwboyer@redhat.com
Cc: yinghai@kernel.org
Link: http://lkml.kernel.org/r/1332385090-18056-2-git-send-email-konrad.wilk@oracle.com
[ Made all the affected code easier on the eyes ]
Signed-off-by: NIngo Molnar <mingo@kernel.org>

136d249e

27 3月, 2012 1 次提交

sched/x86/smp: Do not enable IRQs over calibrate_delay() · bc758133

由 Peter Zijlstra 提交于 3月 26, 2012

We should not ever enable IRQs until we're fully set up. This opens up
a window where interrupts can hit the cpu and interrupts can do
wakeups, wakeups need state that isn't set-up yet, in particular this
cpu isn't elegible to run tasks, so if any cpu-affine task that got
created in CPU_UP_PREPARE manages to get a wakeup, its affinity mask
will get broken and we'll run into lots of 'interesting' problems.
Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/n/tip-yaezmlbriluh166tfkgni22m@git.kernel.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>

bc758133

26 3月, 2012 1 次提交

x86: Merge the x86_32 and x86_64 cpu_idle() functions · 90e24014

由 Richard Weinberger 提交于 3月 25, 2012

Both functions are mostly identical.
The differences are:

- x86_32's cpu_idle() makes use of check_pgt_cache(), which is a
  nop on both x86_32 and x86_64.

- x86_64's cpu_idle() uses enter/__exit_idle/(), on x86_32 these
  function are a nop.

- In contrast to x86_32, x86_64 calls rcu_idle_enter/exit() in
  the innermost loop because idle notifications need RCU.
  Calling these function on x86_32 also in the innermost loop
  does not hurt.

So we can merge both functions.
Signed-off-by: NRichard Weinberger <richard@nod.at>
Acked-by: NFrederic Weisbecker <fweisbec@gmail.com>
Cc: paulmck@linux.vnet.ibm.com
Cc: josh@joshtriplett.org
Cc: tj@kernel.org
Link: http://lkml.kernel.org/r/1332709204-22496-1-git-send-email-richard@nod.atSigned-off-by: NIngo Molnar <mingo@kernel.org>

90e24014

24 3月, 2012 5 次提交

x86: vdso: Put declaration before code · 68fe7b23

由 Thomas Gleixner 提交于 3月 24, 2012

Sigh, warnings are there for a reason.
Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: John Stultz <john.stultz@linaro.org>

68fe7b23

x86: Stop recursive fault in print_context_stack after stack overflow · 65c0ff40

由 Hugh Dickins 提交于 3月 23, 2012

After printing out the first line of a stack backtrace,
print_context_stack() calls print_ftrace_graph_addr() to check
if it's making a graph of function calls, usually not the case.

But unfortunate ordering of assignments causes this to oops if
an earlier stack overflow corrupted threadinfo->task.  Reorder
to avoid that irritation.

( The fact that there was a stack overflow may often be more
  interesting than the stack that can now be shown; but
  integrating that information with this stacktrace is awkward,
  so leave it to overflow reporting. )
Signed-off-by: NHugh Dickins <hughd@google.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Cc: Namhyung Kim <namhyung@gmail.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Link: http://lkml.kernel.org/r/20120323225648.15DD5A033B@akpm.mtv.corp.google.comSigned-off-by: NIngo Molnar <mingo@kernel.org>

65c0ff40

x86: use for_each_clear_bit_from() · 0b2f4d4d

由 Akinobu Mita 提交于 3月 23, 2012

Use for_each_clear_bit() to iterate over all the cleared bit in a
memory region.
Signed-off-by: NAkinobu Mita <akinobu.mita@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

0b2f4d4d

bitops: rename for_each_set_bit_cont() in favor of analogous list.h function · 307b1cd7

由 Akinobu Mita 提交于 3月 23, 2012

This renames for_each_set_bit_cont() to for_each_set_bit_from() because
it is analogous to list_for_each_entry_from() in list.h rather than
list_for_each_entry_continue().

This doesn't remove for_each_set_bit_cont() for now.
Signed-off-by: NAkinobu Mita <akinobu.mita@gmail.com>
Cc: Robert Richter <robert.richter@amd.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

307b1cd7

x86-64: Simplify and optimize vdso clock_gettime monotonic variants · 91ec87d5

由 Andy Lutomirski 提交于 3月 22, 2012

We used to store the wall-to-monotonic offset and the realtime base.
It's faster to precompute the monotonic base.

This is about a 3% speedup on Sandy Bridge for CLOCK_MONOTONIC.
It's much more impressive for CLOCK_MONOTONIC_COARSE.
Signed-off-by: NAndy Lutomirski <luto@amacapital.net>
Signed-off-by: NJohn Stultz <john.stultz@linaro.org>

91ec87d5

23 3月, 2012 6 次提交

x86/io_apic: Move and reenable irq only when CONFIG_GENERIC_PENDING_IRQ=y · 4da7072a

由 Alexander Gordeev 提交于 3月 20, 2012

This patch removes dead code from certain .config variations.

When CONFIG_GENERIC_PENDING_IRQ=n irq move and reenable code is
never get executed, nor do_unmask_irq variable updates its init
value. Move the code under CONFIG_GENERIC_PENDING_IRQ macro.
Signed-off-by: NAlexander Gordeev <agordeev@redhat.com>
Link: http://lkml.kernel.org/r/20120320141935.GA24806@dhcp-26-207.brq.redhat.comSigned-off-by: NIngo Molnar <mingo@kernel.org>

4da7072a

x86/apic: Add separate apic_id_valid() functions for selected apic drivers · b7157acf

由 Steffen Persvold 提交于 3月 16, 2012

As suggested by Suresh Siddha and Yinghai Lu:

For x2apic pre-enabled systems, apic driver is set already early
through early_acpi_boot_init()/early_acpi_process_madt()/
acpi_parse_madt()/default_acpi_madt_oem_check() path so that
apic_id_valid() checking will be sufficient during MADT and SRAT
parsing.

For non-x2apic pre-enabled systems, all apic ids should be less
than 255.

This allows us to substitute the checks in
arch/x86/kernel/acpi/boot.c::acpi_parse_x2apic() and
arch/x86/mm/srat.c::acpi_numa_x2apic_affinity_init() with
apic->apic_id_valid().

In addition we can avoid feigning the x2apic cpu feature in the
NumaChip apic code.

The following apic drivers have separate apic_id_valid()
functions which will accept x2apic type IDs :

 x2apic_phys
 x2apic_cluster
 x2apic_uv_x
 apic_numachip
Signed-off-by: NSteffen Persvold <sp@numascale.com>
Cc: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: Daniel J Blueman <daniel@numascale-asia.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Jack Steiner <steiner@sgi.com>
Link: http://lkml.kernel.org/r/1331925935-13372-1-git-send-email-sp@numascale.comSigned-off-by: NIngo Molnar <mingo@kernel.org>

b7157acf

x86: Fix excessive MSR print out when show_msr is not specified · 0b8b8078

由 Yinghai Lu 提交于 3月 22, 2012

Dave found:

| During bootup, I now have 162 messages like this..
| [    0.227346]  MSR0000001b: 00000000fee00900
| [    0.227465]  MSR00000021: 0000000000000001
| [    0.227584]  MSR0000002a: 00000000c1c81400
|
| commit 21c3fcf3 looks suspect.
| It claims that it will only print these out if show_msr= is
| passed, but that doesn't seem to be the case.

Fix it by changing to the version that checks the index.
Reported-and-tested-by: NDave Jones <davej@redhat.com>
Signed-off-by: NYinghai Lu <yinghai@kernel.org>
Link: http://lkml.kernel.org/r/1332477103-4595-1-git-send-email-yinghai@kernel.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>

0b8b8078

perf: Fix mmap_page capabilities and docs · c7206205

由 Peter Zijlstra 提交于 3月 22, 2012

Complete the syscall-less self-profiling feature and address
all complaints, namely:

 - capabilities, so we can detect what is actually available at runtime

     Add a capabilities field to perf_event_mmap_page to indicate
     what is actually available for use.

 - on x86: RDPMC weirdness due to being 40/48 bits and not sign-extending
   properly.

 - ABI documentation as to how all this stuff works.

Also improve the documentation for the new features.
Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Stephane Eranian <eranian@google.com>
Cc: Vince Weaver <vweaver1@eecs.utk.edu>
Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
Cc: Jiri Olsa <jolsa@redhat.com>
Link: http://lkml.kernel.org/r/1332433596.2487.33.camel@twinsSigned-off-by: NIngo Molnar <mingo@kernel.org>

c7206205

x86-32: Fix endless loop when processing signals for kernel tasks · 29a2e283

由 Dmitry Adamushko 提交于 3月 22, 2012

The problem occurs on !CONFIG_VM86 kernels [1] when a kernel-mode task
returns from a system call with a pending signal.

A real-life scenario is a child of 'khelper' returning from a failed
kernel_execve() in ____call_usermodehelper() [ kernel/kmod.c ].
kernel_execve() fails due to a pending SIGKILL, which is the result of
"kill -9 -1" (at least, busybox's init does it upon reboot).

The loop is as follows:

* syscall_exit_work:
 - work_pending:            // start_of_the_loop
 - work_notify_sig:
   - do_notify_resume()
     - do_signal()
       - if (!user_mode(regs)) return;
 - resume_userspace         // TIF_SIGPENDING is still set
 - work_pending             // so we call work_pending => goto
                            // start_of_the_loop

More information can be found in another LKML thread:
http://www.serverphorums.com/read.php?12,457826

[1] the problem was also seen on MIPS.
Signed-off-by: NDmitry Adamushko <dmitry.adamushko@gmail.com>
Link: http://lkml.kernel.org/r/1332448765.2299.68.camel@dimm
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Roland McGrath <roland@hack.frob.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: NH. Peter Anvin <hpa@zytor.com>

29a2e283

kgdb: x86: Return all segment registers also in 64-bit mode · 639077fb

由 Jan Kiszka 提交于 3月 19, 2012

Even if the content is always 0, gdb expects us to return also ds,
es, fs, and gs while in x86-64 mode. Do this to avoid ugly errors on
"info registers".

[jason.wessel@windriver.com: adjust NUMREGBYTES for two new regs]
Signed-off-by: NJan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: NJason Wessel <jason.wessel@windriver.com>

639077fb

22 3月, 2012 2 次提交

mm: search from free_area_cache for the bigger size · b716ad95

由 Xiao Guangrong 提交于 3月 21, 2012

If the required size is bigger than cached_hole_size it is better to
search from free_area_cache - it is easier to get a free region,
specifically for the 64 bit process whose address space is large enough

Do it just as hugetlb_get_unmapped_area_topdown() in arch/x86/mm/hugetlbpage.c
Signed-off-by: NXiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Hillf Danton <dhillf@gmail.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

b716ad95

mm: thp: fix pmd_bad() triggering in code paths holding mmap_sem read mode · 1a5a9906

由 Andrea Arcangeli 提交于 3月 21, 2012

In some cases it may happen that pmd_none_or_clear_bad() is called with
the mmap_sem hold in read mode.  In those cases the huge page faults can
allocate hugepmds under pmd_none_or_clear_bad() and that can trigger a
false positive from pmd_bad() that will not like to see a pmd
materializing as trans huge.

It's not khugepaged causing the problem, khugepaged holds the mmap_sem
in write mode (and all those sites must hold the mmap_sem in read mode
to prevent pagetables to go away from under them, during code review it
seems vm86 mode on 32bit kernels requires that too unless it's
restricted to 1 thread per process or UP builds).  The race is only with
the huge pagefaults that can convert a pmd_none() into a
pmd_trans_huge().

Effectively all these pmd_none_or_clear_bad() sites running with
mmap_sem in read mode are somewhat speculative with the page faults, and
the result is always undefined when they run simultaneously.  This is
probably why it wasn't common to run into this.  For example if the
madvise(MADV_DONTNEED) runs zap_page_range() shortly before the page
fault, the hugepage will not be zapped, if the page fault runs first it
will be zapped.

Altering pmd_bad() not to error out if it finds hugepmds won't be enough
to fix this, because zap_pmd_range would then proceed to call
zap_pte_range (which would be incorrect if the pmd become a
pmd_trans_huge()).

The simplest way to fix this is to read the pmd in the local stack
(regardless of what we read, no need of actual CPU barriers, only
compiler barrier needed), and be sure it is not changing under the code
that computes its value.  Even if the real pmd is changing under the
value we hold on the stack, we don't care.  If we actually end up in
zap_pte_range it means the pmd was not none already and it was not huge,
and it can't become huge from under us (khugepaged locking explained
above).

All we need is to enforce that there is no way anymore that in a code
path like below, pmd_trans_huge can be false, but pmd_none_or_clear_bad
can run into a hugepmd.  The overhead of a barrier() is just a compiler
tweak and should not be measurable (I only added it for THP builds).  I
don't exclude different compiler versions may have prevented the race
too by caching the value of *pmd on the stack (that hasn't been
verified, but it wouldn't be impossible considering
pmd_none_or_clear_bad, pmd_bad, pmd_trans_huge, pmd_none are all inlines
and there's no external function called in between pmd_trans_huge and
pmd_none_or_clear_bad).

		if (pmd_trans_huge(*pmd)) {
			if (next-addr != HPAGE_PMD_SIZE) {
				VM_BUG_ON(!rwsem_is_locked(&tlb->mm->mmap_sem));
				split_huge_page_pmd(vma->vm_mm, pmd);
			} else if (zap_huge_pmd(tlb, vma, pmd, addr))
				continue;
			/* fall through */
		}
		if (pmd_none_or_clear_bad(pmd))

Because this race condition could be exercised without special
privileges this was reported in CVE-2012-1179.

The race was identified and fully explained by Ulrich who debugged it.
I'm quoting his accurate explanation below, for reference.

====== start quote =======
      mapcount 0 page_mapcount 1
      kernel BUG at mm/huge_memory.c:1384!

    At some point prior to the panic, a "bad pmd ..." message similar to the
    following is logged on the console:

      mm/memory.c:145: bad pmd ffff8800376e1f98(80000000314000e7).

    The "bad pmd ..." message is logged by pmd_clear_bad() before it clears
    the page's PMD table entry.

        143 void pmd_clear_bad(pmd_t *pmd)
        144 {
    ->  145         pmd_ERROR(*pmd);
        146         pmd_clear(pmd);
        147 }

    After the PMD table entry has been cleared, there is an inconsistency
    between the actual number of PMD table entries that are mapping the page
    and the page's map count (_mapcount field in struct page). When the page
    is subsequently reclaimed, __split_huge_page() detects this inconsistency.

       1381         if (mapcount != page_mapcount(page))
       1382                 printk(KERN_ERR "mapcount %d page_mapcount %d\n",
       1383                        mapcount, page_mapcount(page));
    -> 1384         BUG_ON(mapcount != page_mapcount(page));

    The root cause of the problem is a race of two threads in a multithreaded
    process. Thread B incurs a page fault on a virtual address that has never
    been accessed (PMD entry is zero) while Thread A is executing an madvise()
    system call on a virtual address within the same 2 MB (huge page) range.

               virtual address space
              .---------------------.
              |                     |
              |                     |
            .-|---------------------|
            | |                     |
            | |                     |<-- B(fault)
            | |                     |
      2 MB  | |/////////////////////|-.
      huge <  |/////////////////////|  > A(range)
      page  | |/////////////////////|-'
            | |                     |
            | |                     |
            '-|---------------------|
              |                     |
              |                     |
              '---------------------'

    - Thread A is executing an madvise(..., MADV_DONTNEED) system call
      on the virtual address range "A(range)" shown in the picture.

    sys_madvise
      // Acquire the semaphore in shared mode.
      down_read(&current->mm->mmap_sem)
      ...
      madvise_vma
        switch (behavior)
        case MADV_DONTNEED:
             madvise_dontneed
               zap_page_range
                 unmap_vmas
                   unmap_page_range
                     zap_pud_range
                       zap_pmd_range
                         //
                         // Assume that this huge page has never been accessed.
                         // I.e. content of the PMD entry is zero (not mapped).
                         //
                         if (pmd_trans_huge(*pmd)) {
                             // We don't get here due to the above assumption.
                         }
                         //
                         // Assume that Thread B incurred a page fault and
             .---------> // sneaks in here as shown below.
             |           //
             |           if (pmd_none_or_clear_bad(pmd))
             |               {
             |                 if (unlikely(pmd_bad(*pmd)))
             |                     pmd_clear_bad
             |                     {
             |                       pmd_ERROR
             |                         // Log "bad pmd ..." message here.
             |                       pmd_clear
             |                         // Clear the page's PMD entry.
             |                         // Thread B incremented the map count
             |                         // in page_add_new_anon_rmap(), but
             |                         // now the page is no longer mapped
             |                         // by a PMD entry (-> inconsistency).
             |                     }
             |               }
             |
             v
    - Thread B is handling a page fault on virtual address "B(fault)" shown
      in the picture.

    ...
    do_page_fault
      __do_page_fault
        // Acquire the semaphore in shared mode.
        down_read_trylock(&mm->mmap_sem)
        ...
        handle_mm_fault
          if (pmd_none(*pmd) && transparent_hugepage_enabled(vma))
              // We get here due to the above assumption (PMD entry is zero).
              do_huge_pmd_anonymous_page
                alloc_hugepage_vma
                  // Allocate a new transparent huge page here.
                ...
                __do_huge_pmd_anonymous_page
                  ...
                  spin_lock(&mm->page_table_lock)
                  ...
                  page_add_new_anon_rmap
                    // Here we increment the page's map count (starts at -1).
                    atomic_set(&page->_mapcount, 0)
                  set_pmd_at
                    // Here we set the page's PMD entry which will be cleared
                    // when Thread A calls pmd_clear_bad().
                  ...
                  spin_unlock(&mm->page_table_lock)

    The mmap_sem does not prevent the race because both threads are acquiring
    it in shared mode (down_read).  Thread B holds the page_table_lock while
    the page's map count and PMD table entry are updated.  However, Thread A
    does not synchronize on that lock.

====== end quote =======

[akpm@linux-foundation.org: checkpatch fixes]
Reported-by: NUlrich Obergfell <uobergfe@redhat.com>
Signed-off-by: NAndrea Arcangeli <aarcange@redhat.com>
Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Jones <davej@redhat.com>
Acked-by: NLarry Woodman <lwoodman@redhat.com>
Acked-by: NRik van Riel <riel@redhat.com>
Cc: <stable@vger.kernel.org>		[2.6.38+]
Cc: Mark Salter <msalter@redhat.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

1a5a9906

20 3月, 2012 1 次提交

x86: remove the second argument of k[un]map_atomic() · 8fd75e12

由 Cong Wang 提交于 11月 25, 2011

Acked-by: NAvi Kivity <avi@redhat.com>
Acked-by: NHerbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: NCong Wang <amwang@redhat.com>

8fd75e12

openanolis / cloud-kernel 1 年多 前同步成功

openanolis / cloud-kernel
1 年多前同步成功