提交 · a35fd28256e7736cc84af8931a16224f0bfaaf6c · openeuler / raspberrypi-kernel

24 12月, 2011 2 次提交

x86, acpi: Skip acpi x2apic entries if the x2apic feature is not present · a35fd282

由 Yinghai Lu 提交于 12月 21, 2011

If the x2apic feature is not present (either the cpu is not capable of it
or the user has disabled the feature using boot-parameter etc), ignore the
x2apic MADT and SRAT entries provided by the ACPI tables.
Signed-off-by: NYinghai Lu <yinghai@kernel.org>
Link: http://lkml.kernel.org/r/20111222014632.540896503@sbsiddha-desk.sc.intel.comSigned-off-by: NSuresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: NH. Peter Anvin <hpa@linux.intel.com>

a35fd282

x86, apic: Add probe() for apic_flat · e8524b2f

由 Yinghai Lu 提交于 12月 21, 2011

Currently we start with the default apic_flat mode and switch to some other
apic model depending on the apic drivers acpi_madt_oem_check() routines and
later followed by the apic drivers probe() routines.

Once we selected non flat mode there was no case where we fall back to
flat mode again.

Upcoming changes allow bios-enabled x2apic mode to be disabled by the OS
if interrupt-remapping etc is not setup properly by the bios.

We now has a case for the apic to fall back to legacy flat mode during
apic driver probe() seqeuence. Add a simple flat_probe() which allows
the apic_flat mode to be the last fallback option.
Signed-off-by: NYinghai Lu <yinghai@kernel.org>
Link: http://lkml.kernel.org/r/20111222014632.484984298@sbsiddha-desk.sc.intel.comSigned-off-by: NSuresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: NH. Peter Anvin <hpa@linux.intel.com>

e8524b2f

21 12月, 2011 1 次提交

x86: Simplify code by removing a !SMP #ifdefs from 'struct cpuinfo_x86' · 141168c3

由 Kevin Winchester 提交于 12月 20, 2011

Several fields in struct cpuinfo_x86 were not defined for the
!SMP case, likely to save space.  However, those fields still
have some meaning for UP, and keeping them allows some #ifdef
removal from other files.  The additional size of the UP kernel
from this change is not significant enough to worry about
keeping up the distinction:

	   text    data     bss     dec     hex filename
	4737168	 506459	 972040	6215667	 5ed7f3	vmlinux.o.before
	4737444	 506459	 972040	6215943	 5ed907	vmlinux.o.after

for a difference of 276 bytes for an example UP config.

If someone wants those 276 bytes back badly then it should
be implemented in a cleaner way.
Signed-off-by: NKevin Winchester <kjwinchester@gmail.com>
Cc: Steffen Persvold <sp@numascale.com>
Link: http://lkml.kernel.org/r/1324428742-12498-1-git-send-email-kjwinchester@gmail.comSigned-off-by: NIngo Molnar <mingo@elte.hu>

141168c3

18 12月, 2011 1 次提交

x86: Convert per-cpu counter icr_read_retry_count into a member of irq_stat · b49d7d87

由 Fernando Luis Vazquez Cao 提交于 12月 15, 2011

LAPIC related statistics are grouped inside the per-cpu
structure irq_stat, so there is no need for icr_read_retry_count
to be a standalone per-cpu variable.

This patch moves icr_read_retry_count to where it belongs.

Suggested-y: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: NFernando Luis Vazquez Cao <fernando@oss.ntt.co.jp>
Cc: Jörn Engel <joern@logfs.org>
Signed-off-by: NIngo Molnar <mingo@elte.hu>

b49d7d87

14 12月, 2011 1 次提交

x86: Add per-cpu stat counter for APIC ICR read tries · 346b46be

由 Fernando Luis Vázquez Cao 提交于 12月 13, 2011

In the IPI delivery slow path (NMI delivery) we retry the ICR
read to check for delivery completion a limited number of times.

[ The reason for the limited retries is that some of the places
  where it is used (cpu boot, kdump, etc) IPI delivery might not
  succeed (due to a firmware bug or system crash, for example)
  and in such a case it is better to give up and resume
  execution of other code. ]

This patch adds a new entry to /proc/interrupts, RTR, which
tells user space the number of times we retried the ICR read in
the IPI delivery slow path.

This should give some insight into how well the APIC
message delivery hardware is working - if the counts are way
too large then we are hitting a (very-) slow path way too
often.
Signed-off-by: NFernando Luis Vazquez Cao <fernando@oss.ntt.co.jp>
Cc: Jörn Engel <joern@logfs.org>
Cc: Suresh Siddha <suresh.b.siddha@intel.com>
Link: http://lkml.kernel.org/n/tip-vzsp20lo2xdzh5f70g0eis2s@git.kernel.org
[ extended the changelog ]
Signed-off-by: NIngo Molnar <mingo@elte.hu>

346b46be

06 12月, 2011 4 次提交

x86: Fix the !CONFIG_NUMA build of the new CPU ID fixup code support · e4a02b4a

由 Steffen Persvold 提交于 12月 06, 2011

I used "ifdef CONFIG_NUMA" simply because it doesn't make
sense in a non-numa configuration even with SMP enabled.

Besides, the only place where it is called right now is
in kernel/cpu/amd.c:srat_detect_node() within the
"CONFIG_NUMA" protected part.
Signed-off-by: NSteffen Persvold <sp@numascale.com>
Cc: Daniel J Blueman <daniel@numascale-asia.com>
Cc: Jesse Barnes <jbarnes@virtuousgeek.org>
Link: http://lkml.kernel.org/r/1323073238-32686-2-git-send-email-daniel@numascale-asia.comSigned-off-by: NIngo Molnar <mingo@elte.hu>

e4a02b4a

x86: Add NumaChip support · 44b111b5

由 Steffen Persvold 提交于 12月 06, 2011

Adds support for Numascale NumaChip large-SMP systems. It is
needed to enable the booting of more than ~168 cores.

v2:
 - [Steffen] enumerate only accessible northbridges
 - [Daniel] rediffed and validated against 3.1-rc10

v3:
 - [Daniel] use x86_init core numbering override
 - [Daniel] cleanups as per feedback

v4:
 - [Daniel] use updated x86_cpuinit override

v5:
 - drop disabling interrupts locally, as ISR write is atomic; drop delay
 - added read-mostly annotations where appropriate
 - require CONFIG_SMP, so drop conditional path

Workload tested on 96 cores/16 sockets.
Signed-off-by: NSteffen Persvold <sp@numascale.com>
Signed-off-by: NDaniel J Blueman <daniel@numascale-asia.com>
Cc: Jesse Barnes <jbarnes@virtuousgeek.org>
Link: http://lkml.kernel.org/r/1323101246-2400-1-git-send-email-daniel@numascale-asia.comSigned-off-by: NIngo Molnar <mingo@elte.hu>

44b111b5

x86: Add x86_init platform override to fix up NUMA core numbering · 64be4c1c

由 Daniel J Blueman 提交于 12月 05, 2011

Add an x86_init vector for handling inconsistent core numbering.
This is useful for multi-fabric platforms, such as Numascale
NumaConnect.

v2:
 - use struct x86_cpuinit_ops
 - provide default fall-back function to warn
Signed-off-by: NDaniel J Blueman <daniel@numascale-asia.com>
Cc: Steffen Persvold <sp@numascale.com>
Cc: Jesse Barnes <jbarnes@virtuousgeek.org>
Link: http://lkml.kernel.org/r/1323073238-32686-2-git-send-email-daniel@numascale-asia.comSigned-off-by: NIngo Molnar <mingo@elte.hu>

64be4c1c

x86: Make flat_init_apic_ldr() available · 9a0ebfbe

由 Daniel J Blueman 提交于 12月 05, 2011

Allow flat_init_apic_ldr() to be used outside the compilation
unit for similar APIC implementations.
Signed-off-by: NDaniel J Blueman <daniel@numascale-asia.com>
Cc: Steffen Persvold <sp@numascale.com>
Cc: Jesse Barnes <jbarnes@virtuousgeek.org>
Link: http://lkml.kernel.org/r/1323073238-32686-1-git-send-email-daniel@numascale-asia.comSigned-off-by: NIngo Molnar <mingo@elte.hu>

9a0ebfbe

05 12月, 2011 1 次提交

x86: Fix boot failures on older AMD CPU's · 8e8da023

由 Linus Torvalds 提交于 12月 04, 2011

People with old AMD chips are getting hung boots, because commit
bcb80e53 ("x86, microcode, AMD: Add microcode revision to
/proc/cpuinfo") moved the microcode detection too early into
"early_init_amd()".

At that point we are *so* early in the booth that the exception tables
haven't even been set up yet, so the whole

	rdmsr_safe(MSR_AMD64_PATCH_LEVEL, &c->microcode, &dummy);

doesn't actually work: if the rdmsr does a GP fault (due to non-existant
MSR register on older CPU's), we can't fix it up yet, and the boot fails.

Fix it by simply moving the code to a slightly later point in the boot
(init_amd() instead of early_init_amd()), since the kernel itself
doesn't even really care about the microcode patchlevel at this point
(or really ever: it's made available to user space in /proc/cpuinfo, and
updated if you do a microcode load).
Reported-tested-and-bisected-by: NLarry Finger <Larry.Finger@lwfinger.net>
Tested-by: NBob Tracy <rct@gherkin.frus.com>
Acked-by: NBorislav Petkov <borislav.petkov@amd.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

8e8da023

04 12月, 2011 1 次提交

xen/pm_idle: Make pm_idle be default_idle under Xen. · e5fd47bf

由 Konrad Rzeszutek Wilk 提交于 11月 21, 2011

The idea behind commit d91ee586 ("cpuidle: replace xen access to x86
pm_idle and default_idle") was to have one call - disable_cpuidle()
which would make pm_idle not be molested by other code.  It disallows
cpuidle_idle_call to be set to pm_idle (which is excellent).

But in the select_idle_routine() and idle_setup(), the pm_idle can still
be set to either: amd_e400_idle, mwait_idle or default_idle.  This
depends on some CPU flags (MWAIT) and in AMD case on the type of CPU.

In case of mwait_idle we can hit some instances where the hypervisor
(Amazon EC2 specifically) sets the MWAIT and we get:

  Brought up 2 CPUs
  invalid opcode: 0000 [#1] SMP

  Pid: 0, comm: swapper Not tainted 3.1.0-0.rc6.git0.3.fc16.x86_64 #1
  RIP: e030:[<ffffffff81015d1d>]  [<ffffffff81015d1d>] mwait_idle+0x6f/0xb4
  ...
  Call Trace:
   [<ffffffff8100e2ed>] cpu_idle+0xae/0xe8
   [<ffffffff8149ee78>] cpu_bringup_and_idle+0xe/0x10
  RIP  [<ffffffff81015d1d>] mwait_idle+0x6f/0xb4
   RSP <ffff8801d28ddf10>

In the case of amd_e400_idle we don't get so spectacular crashes, but we
do end up making an MSR which is trapped in the hypervisor, and then
follow it up with a yield hypercall.  Meaning we end up going to
hypervisor twice instead of just once.

The previous behavior before v3.0 was that pm_idle was set to
default_idle regardless of select_idle_routine/idle_setup.

We want to do that, but only for one specific case: Xen.  This patch
does that.

Fixes RH BZ #739499 and Ubuntu #881076
Reported-by: NStefan Bader <stefan.bader@canonical.com>
Signed-off-by: NKonrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

e5fd47bf

22 11月, 2011 1 次提交

fix braino in um patchset (mea culpa) · cc11f9ed

由 Al Viro 提交于 11月 21, 2011

wrong register returned...
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

cc11f9ed

20 11月, 2011 1 次提交

KVM guest: prevent tracing recursion with kvmclock · 95ef1e52

由 Avi Kivity 提交于 11月 15, 2011

Prevent tracing of preempt_disable() in get_cpu_var() in
kvm_clock_read(). When CONFIG_DEBUG_PREEMPT is enabled,
preempt_disable/enable() are traced and this causes the function_graph
tracer to go into an infinite recursion. By open coding the
preempt_disable() around the get_cpu_var(), we can use the notrace
version which prevents preempt_disable/enable() from being traced and
prevents the recursion.

Based on a similar patch for Xen from Jeremy Fitzhardinge.
Tested-by: NGleb Natapov <gleb@redhat.com>
Acked-by: NSteven Rostedt <rostedt@goodmis.org>
Signed-off-by: NAvi Kivity <avi@redhat.com>

95ef1e52

17 11月, 2011 5 次提交

G
KVM: VMX: Check for automatic switch msr table overflow · e7fc6f93
由 Gleb Natapov 提交于 10月 05, 2011
```
Signed-off-by: NGleb Natapov <gleb@redhat.com>
Signed-off-by: NAvi Kivity <avi@redhat.com>
```
e7fc6f93

KVM: VMX: Add support for guest/host-only profiling · d7cd9796

由 Gleb Natapov 提交于 10月 05, 2011

Support guest/host-only profiling by switch perf msrs on
a guest entry if needed.
Signed-off-by: NGleb Natapov <gleb@redhat.com>
Signed-off-by: NAvi Kivity <avi@redhat.com>

d7cd9796

KVM: VMX: add support for switching of PERF_GLOBAL_CTRL · 8bf00a52

由 Gleb Natapov 提交于 10月 05, 2011

Some cpus have special support for switching PERF_GLOBAL_CTRL msr.
Add logic to detect if such support exists and works properly and extend
msr switching code to use it if available. Also extend number of generic
msr switching entries to 8.
Signed-off-by: NGleb Natapov <gleb@redhat.com>
Signed-off-by: NAvi Kivity <avi@redhat.com>

8bf00a52

xen:pvhvm: enable PVHVM VCPU placement when using more than 32 CPUs. · 90d4f553

由 Zhenzhong Duan 提交于 10月 27, 2011

PVHVM running with more than 32 vcpus and pv_irq/pv_time enabled
need VCPU placement to work, or else it will softlockup.

CC: stable@kernel.org
Acked-by: NStefano Stabellini <stefano.stabellini@eu.citrix.com>
Signed-off-by: NZhenzhong Duan <zhenzhong.duan@oracle.com>
Signed-off-by: NKonrad Rzeszutek Wilk <konrad.wilk@oracle.com>

90d4f553

xen: map foreign pages for shared rings by updating the PTEs directly · cd12909c

由 David Vrabel 提交于 9月 29, 2011

When mapping a foreign page with xenbus_map_ring_valloc() with the
GNTTABOP_map_grant_ref hypercall, set the GNTMAP_contains_pte flag and
pass a pointer to the PTE (in init_mm).

After the page is mapped, the usual fault mechanism can be used to
update additional MMs.  This allows the vmalloc_sync_all() to be
removed from alloc_vm_area().
Signed-off-by: NDavid Vrabel <david.vrabel@citrix.com>
Acked-by: NAndrew Morton <akpm@linux-foundation.org>
[v1: Squashed fix by Michal for no-mmu case]
Signed-off-by: NKonrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: NMichal Simek <monstr@monstr.eu>

cd12909c

14 11月, 2011 1 次提交

x86: Call stop_machine_text_poke() on all CPUs · 78345d2e

由 Rabin Vincent 提交于 10月 27, 2011

It appears that stop_machine_text_poke() wants to be called on all CPUs,
like it's done from text_poke_smp().  Fix text_poke_smp_batch() to do
this.
Signed-off-by: NRabin Vincent <rabin@rab.in>
Acked-by: NMasami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Jason Baron <jbaron@redhat.com>
Link: http://lkml.kernel.org/r/1319702072-32676-1-git-send-email-rabin@rab.inSigned-off-by: NIngo Molnar <mingo@elte.hu>

78345d2e

12 11月, 2011 3 次提交

bma023: Add SFI translation for this device · 9f80d8b6

由 William Douglas 提交于 11月 10, 2011

This needed the sfi IRQ 0xFF fix to go in first. It simply plumbs in the
bma023 driver with the firmware naming of it.
Signed-off-by: NWilliam Douglas <william.douglas@intel.com>
Signed-off-by: NAlan Cox <alan@linux.intel.com>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

9f80d8b6

vrtc: change its year offset from 1960 to 1972 · 57e6319d

由 Feng Tang 提交于 11月 10, 2011

Real world year equals the value in vrtc YEAR register plus an offset.
We used 1960 as the offset to make leap year consistent, but for a
device's first use, its YEAR register is 0 and the system year will
be parsed as 1960 which is not a valid UNIX time and will cause many
applications to fail mysteriously. So we use 1972 instead to fix this
issue.

Updated patch which adds a sanity check suggested by Mathias

This isn't a change in behaviour for systems, because 1972 is the one we
actually use. It's the old version in upstream which is out of sync with
all devices.
Signed-off-by: NFeng Tang <feng.tang@intel.com>
Signed-off-by: NAlan Cox <alan@linux.intel.com>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

57e6319d

ce4100: fix a build error · f2ee4421

由 Zhang Rui 提交于 11月 10, 2011

Fix a build error. CE4100 with no serial errors because the alternate
function is only a prototype not a null function as intended.
Signed-off-by: NZhang Rui <rui.zhang@intel.com>
Signed-off-by: NAlan Cox <alan@linux.intel.com>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

f2ee4421

11 11月, 2011 1 次提交

x86, ioapic: Only print ioapic debug information for IRQs belonging to an ioapic chip · 6fd36ba0

由 Mathias Nyman 提交于 11月 10, 2011

with "apic=verbose" the print_IO_APIC() function tries to print
IRQ to pin mappings for every active irq. It assumes chip_data
is of type irq_cfg and may cause an oops if not.

As the print_IO_APIC() is called from a late_initcall other
chained irq chips may already be registered with custom
chip_data information, causing an oops. This is the case with
intel MID SoC devices with gpio demuxers registered as irq_chips.
Signed-off-by: NMathias Nyman <mathias.nyman@linux.intel.com>
Signed-off-by: NAlan Cox <alan@linux.intel.com>
[ -v2: fixed build failure ]
Signed-off-by: NIngo Molnar <mingo@elte.hu>

6fd36ba0

10 11月, 2011 5 次提交

x86/mrst: Avoid reporting wrong nmi status · 064a59b6

由 Jacob Pan 提交于 11月 10, 2011

Moorestown/Medfield platform does not have port 0x61 to report
NMI status, nor does it have external NMI sources. The only NMI
sources are from lapic, as results of perf counter overflow or
IPI, e.g. NMI watchdog or spin lock debug.

Reading port 0x61 on Moorestown will return 0xff which misled
NMI handlers to false critical errors such memory parity error.
The subsequent ioport access for NMI handling can also cause
undefined behavior on Moorestown.

This patch allows kernel process NMI due to watchdog or backrace
dump without unnecessary hangs.
Signed-off-by: NJacob Pan <jacob.jun.pan@linux.intel.com>
Signed-off-by: NIngo Molnar <mingo@elte.hu>
[hand applied]
Signed-off-by: NAlan Cox <alan@linux.intel.com>

064a59b6

x86/mrst: Add support for Penwell clock calibration · 0a915326

由 Dirk Brandewie 提交于 11月 10, 2011

Signed-off-by: NDirk Brandewie <dirk.brandewie@gmail.com>
Signed-off-by: NAlan Cox <alan@linux.intel.com>
Signed-off-by: NIngo Molnar <mingo@elte.hu>

0a915326

x86/apic: Allow use of lapic timer early calibration result · 1ade93ef

由 Jacob Pan 提交于 11月 10, 2011

lapic timer calibration can be combined with tsc in platform
specific calibration functions. if such calibration result is
obtained early, we can skip the redundant calibration loops.
Signed-off-by: NJacob Pan <jacob.jun.pan@intel.com>
Signed-off-by: NJacob Pan <jacob.jun.pan@linux.intel.com>
Signed-off-by: NAlan Cox <alan@linux.intel.com>
Signed-off-by: NDirk Brandewie <dirk.brandewie@gmail.com>
Signed-off-by: NIngo Molnar <mingo@elte.hu>

1ade93ef

x86/apic: Do not clear nr_irqs_gsi if no legacy irqs · bb84ac2d

由 Jacob Pan 提交于 11月 10, 2011

nr_legacy_irqs is set in probe_nr_irqs_gsi, we should not clear
it after that. Otherwise, the result is that MSI irqs will be
allocated from the wrong range for the systems without legacy
PIC.
Signed-off-by: NJacob Pan <jacob.jun.pan@linux.intel.com>
Signed-off-by: NDirk Brandewie <dirk.brandewie@gmail.com>
Signed-off-by: NIngo Molnar <mingo@elte.hu>

bb84ac2d

x86/platform: Add a wallclock_init func to x86_platforms ops · cf8ff6b6

由 Feng Tang 提交于 11月 10, 2011

Some wall clock devices use MMIO based HW register, this new
function will give them a chance to do some initialization work
before their get/set_time service get called.
Signed-off-by: NFeng Tang <feng.tang@intel.com>
Signed-off-by: NJacob Pan <jacob.jun.pan@linux.intel.com>
Signed-off-by: NAlan Cox <alan@linux.intel.com>
Signed-off-by: NDirk Brandewie <dirk.brandewie@gmail.com>
Signed-off-by: NIngo Molnar <mingo@elte.hu>

cf8ff6b6

08 11月, 2011 1 次提交

x86/mce: Make mce_chrdev_ops 'static const' · 66f5ddf3

由 Luck, Tony 提交于 11月 03, 2011

Arjan would like to make struct file_operations const, but
mce-inject directly writes to the mce_chrdev_ops to install its
write handler. In an ideal world mce-inject would have its own
character device, but we have a sizable legacy of test scripts
that hardwire "/dev/mcelog", so it would be painful to switch to
a separate device now. Instead, this patch switches to a stub
function in the mce code, with a registration helper that
mce-inject can call when it is loaded.

Note that this would also allow for a sane process to allow
mce-inject to be unloaded again (with an unregister function,
and appropriate module_{get,put}() calls), but that is left for
potential future patches.
Reported-by: NArjan van de Ven <arjan@linux.intel.com>
Signed-off-by: NTony Luck <tony.luck@intel.com>
Link: http://lkml.kernel.org/r/4eb2e1971326651a3b@agluck-desktop.sc.intel.comSigned-off-by: NIngo Molnar <mingo@elte.hu>

66f5ddf3

07 11月, 2011 1 次提交

mrst pmu: update comment · 22f4521d

由 Len Brown 提交于 8月 12, 2011

referenced MeeGo, in particular, but really means Linux, in general.
Signed-off-by: NLen Brown <len.brown@intel.com>

22f4521d

03 11月, 2011 2 次提交

thp: share get_huge_page_tail() · b35a35b5

由 Andrea Arcangeli 提交于 11月 02, 2011

This avoids duplicating the function in every arch gup_fast.
Signed-off-by: NAndrea Arcangeli <aarcange@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <jweiner@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: David Gibson <david@gibson.dropbear.id.au>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: David Miller <davem@davemloft.net>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

b35a35b5

mm: thp: tail page refcounting fix · 70b50f94

由 Andrea Arcangeli 提交于 11月 02, 2011

Michel while working on the working set estimation code, noticed that
calling get_page_unless_zero() on a random pfn_to_page(random_pfn)
wasn't safe, if the pfn ended up being a tail page of a transparent
hugepage under splitting by __split_huge_page_refcount().

He then found the problem could also theoretically materialize with
page_cache_get_speculative() during the speculative radix tree lookups
that uses get_page_unless_zero() in SMP if the radix tree page is freed
and reallocated and get_user_pages is called on it before
page_cache_get_speculative has a chance to call get_page_unless_zero().

So the best way to fix the problem is to keep page_tail->_count zero at
all times.  This will guarantee that get_page_unless_zero() can never
succeed on any tail page.  page_tail->_mapcount is guaranteed zero and
is unused for all tail pages of a compound page, so we can simply
account the tail page references there and transfer them to
tail_page->_count in __split_huge_page_refcount() (in addition to the
head_page->_mapcount).

While debugging this s/_count/_mapcount/ change I also noticed get_page is
called by direct-io.c on pages returned by get_user_pages.  That wasn't
entirely safe because the two atomic_inc in get_page weren't atomic.  As
opposed to other get_user_page users like secondary-MMU page fault to
establish the shadow pagetables would never call any superflous get_page
after get_user_page returns.  It's safer to make get_page universally safe
for tail pages and to use get_page_foll() within follow_page (inside
get_user_pages()).  get_page_foll() is safe to do the refcounting for tail
pages without taking any locks because it is run within PT lock protected
critical sections (PT lock for pte and page_table_lock for
pmd_trans_huge).

The standard get_page() as invoked by direct-io instead will now take
the compound_lock but still only for tail pages.  The direct-io paths
are usually I/O bound and the compound_lock is per THP so very
finegrined, so there's no risk of scalability issues with it.  A simple
direct-io benchmarks with all lockdep prove locking and spinlock
debugging infrastructure enabled shows identical performance and no
overhead.  So it's worth it.  Ideally direct-io should stop calling
get_page() on pages returned by get_user_pages().  The spinlock in
get_page() is already optimized away for no-THP builds but doing
get_page() on tail pages returned by GUP is generally a rare operation
and usually only run in I/O paths.

This new refcounting on page_tail->_mapcount in addition to avoiding new
RCU critical sections will also allow the working set estimation code to
work without any further complexity associated to the tail page
refcounting with THP.
Signed-off-by: NAndrea Arcangeli <aarcange@redhat.com>
Reported-by: NMichel Lespinasse <walken@google.com>
Reviewed-by: NMichel Lespinasse <walken@google.com>
Reviewed-by: NMinchan Kim <minchan.kim@gmail.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <jweiner@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: David Gibson <david@gibson.dropbear.id.au>
Cc: <stable@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

70b50f94

02 11月, 2011 8 次提交

um: Fix kmalloc argument order in um/vdso/vma.c · 0d65ede0

由 Dave Jones 提交于 10月 24, 2011

kmalloc size is 1st arg, not second.
Signed-off-by: NDave Jones <davej@redhat.com>
Signed-off-by: NRichard Weinberger <richard@nod.at>

Cc: <stable@kernel.org> # 3.0.x
[richard@nod.at: on 3.0 the to be patched file is
arch/um/sys-x86_64/vdso/vma.c]

0d65ede0

R
um: we need sys/user.h only on i386 · 38b64aed
由 Richard Weinberger 提交于 8月 18, 2011
```
Signed-off-by: NRichard Weinberger <richard@nod.at>
```
38b64aed
R
um: merge delay_{32,64}.c · d0af6cbf
由 Richard Weinberger 提交于 8月 18, 2011
```
Signed-off-by: NRichard Weinberger <richard@nod.at>
```
d0af6cbf

um: kill system-um.h · a34978cb

由 Al Viro 提交于 8月 18, 2011

most of it belonged in irqflags.h, actually
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
Signed-off-by: NRichard Weinberger <richard@nod.at>

a34978cb

um: segment.h is x86-only and needed only there · 46ecca8a

由 Al Viro 提交于 8月 18, 2011

Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
Signed-off-by: NRichard Weinberger <richard@nod.at>

46ecca8a

um: unify ptrace_user.h · 966e803a

由 Al Viro 提交于 8月 18, 2011

Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
Signed-off-by: NRichard Weinberger <richard@nod.at>

966e803a

um: unify KSTK_... · a10c95d8

由 Al Viro 提交于 8月 18, 2011

... and switch get_thread_register() to HOST_... for register numbers
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
Signed-off-by: NRichard Weinberger <richard@nod.at>

a10c95d8

um: fix gcov build breakage · 4d211093

由 Al Viro 提交于 8月 18, 2011

a) exports in gmon_syms.c duplicate kernel/gcov/* ones
b) excluding -pg in vdso compile is not enough - -fprofile-arcs
and -ftest-coverage also needs to be excluded
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
Signed-off-by: NRichard Weinberger <richard@nod.at>

4d211093