提交 · ffb871bc9156ee2e5cf442f61250c5bd6aad17e3 · openeuler / raspberrypi-kernel

07 12月, 2011 2 次提交

x86, perf: Disable non available architectural events · ffb871bc

由 Gleb Natapov 提交于 11月 10, 2011

Intel CPUs report non-available architectural events in cpuid leaf
0AH.EBX. Use it to disable events that are not available according
to CPU.
Signed-off-by: NGleb Natapov <gleb@redhat.com>
Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1320929850-10480-7-git-send-email-gleb@redhat.comSigned-off-by: NIngo Molnar <mingo@elte.hu>

ffb871bc

jump_label, x86: Fix section mismatch · 9cdbe1cb

由 Peter Zijlstra 提交于 12月 06, 2011

WARNING: arch/x86/kernel/built-in.o(.text+0x4c71): Section mismatch in
reference from the function arch_jump_label_transform_static() to the
function .init.text:text_poke_early()
The function arch_jump_label_transform_static() references
the function __init text_poke_early().
This is often because arch_jump_label_transform_static lacks a __init
annotation or the annotation of text_poke_early is wrong.
Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Jason Baron <jbaron@redhat.com>
Link: http://lkml.kernel.org/n/tip-9lefe89mrvurrwpqw5h8xm8z@git.kernel.orgSigned-off-by: NIngo Molnar <mingo@elte.hu>

9cdbe1cb

06 12月, 2011 3 次提交

perf, x86: Prefer fixed-purpose counters when scheduling · 4defea85

由 Peter Zijlstra 提交于 11月 10, 2011

This avoids a scheduling failure for cases like:

  cycles, cycles, instructions, instructions (on Core2)

Which would end up being programmed like:

  PMC0, PMC1, FP-instructions, fail

Because all events will have the same weight.
Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/n/tip-8tnwb92asqj7xajqqoty4gel@git.kernel.orgSigned-off-by: NIngo Molnar <mingo@elte.hu>

4defea85

perf, x86: Fix event scheduler for constraints with overlapping counters · bc1738f6

由 Robert Richter 提交于 11月 18, 2011

The current x86 event scheduler fails to resolve scheduling problems
of certain combinations of events and constraints. This happens if the
counter mask of such an event is not a subset of any other counter
mask of a constraint with an equal or higher weight, e.g. constraints
of the AMD family 15h pmu:

                        counter mask    weight

 amd_f15_PMC30          0x09            2  <--- overlapping counters
 amd_f15_PMC20          0x07            3
 amd_f15_PMC53          0x38            3

The scheduler does not find then an existing solution. Here is an
example:

 event code     counter         failure         possible solution

 0x02E          PMC[3,0]        0               3
 0x043          PMC[2:0]        1               0
 0x045          PMC[2:0]        2               1
 0x046          PMC[2:0]        FAIL            2

The event scheduler may not select the correct counter in the first
cycle because it needs to know which subsequent events will be
scheduled. It may fail to schedule the events then.

To solve this, we now save the scheduler state of events with
overlapping counter counstraints.  If we fail to schedule the events
we rollback to those states and try to use another free counter.

Constraints with overlapping counters are marked with a new introduced
overlap flag. We set the overlap flag for such constraints to give the
scheduler a hint which events to select for counter rescheduling. The
EVENT_CONSTRAINT_OVERLAP() macro can be used for this.

Care must be taken as the rescheduling algorithm is O(n!) which will
increase scheduling cycles for an over-commited system dramatically.
The number of such EVENT_CONSTRAINT_OVERLAP() macros and its counter
masks must be kept at a minimum. Thus, the current stack is limited to
2 states to limit the number of loops the algorithm takes in the worst
case.

On systems with no overlapping-counter constraints, this
implementation does not increase the loop count compared to the
previous algorithm.

V2:
* Renamed redo -> overlap.
* Reimplementation using perf scheduling helper functions.

V3:
* Added WARN_ON_ONCE() if out of save states.
* Changed function interface of perf_sched_restore_state() to use bool
  as return value.
Signed-off-by: NRobert Richter <robert.richter@amd.com>
Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Stephane Eranian <eranian@google.com>
Link: http://lkml.kernel.org/r/1321616122-1533-3-git-send-email-robert.richter@amd.comSigned-off-by: NIngo Molnar <mingo@elte.hu>

bc1738f6

perf, x86: Implement event scheduler helper functions · 1e2ad28f

由 Robert Richter 提交于 11月 18, 2011

This patch introduces x86 perf scheduler code helper functions. We
need this to later add more complex functionality to support
overlapping counter constraints (next patch).

The algorithm is modified so that the range of weight values is now
generated from the constraints. There shouldn't be other functional
changes.

With the helper functions the scheduler is controlled. There are
functions to initialize, traverse the event list, find unused counters
etc. The scheduler keeps its own state.

V3:
* Added macro for_each_set_bit_cont().
* Changed functions interfaces of perf_sched_find_counter() and
  perf_sched_next_event() to use bool as return value.
* Added some comments to make code better understandable.

V4:
* Fix broken event assignment if weight of the first event is not
  wmin (perf_sched_init()).
Signed-off-by: NRobert Richter <robert.richter@amd.com>
Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Stephane Eranian <eranian@google.com>
Link: http://lkml.kernel.org/r/1321616122-1533-2-git-send-email-robert.richter@amd.comSigned-off-by: NIngo Molnar <mingo@elte.hu>

1e2ad28f

05 12月, 2011 8 次提交

x86/tools: Add decoded instruction dump mode · 9dde9dc0

由 Masami Hiramatsu 提交于 12月 05, 2011

Add instruction dump mode to insn_sanity tool for
checking decoder really decoded instructions.

This mode is enabled when passing double -v (-vv) to
insn_sanity. It is useful for who wants to check whether
the decoder can decode some instructions correctly.
e.g.
 $ echo 0f 73 10 11 | ./insn_sanity -y -vv -i -
 Instruction = {
        .prefixes = {
                .value = 0, bytes[] = {0, 0, 0, 0},
                .got = 1, .nbytes = 0},
        .rex_prefix = {
                .value = 0, bytes[] = {0, 0, 0, 0},
                .got = 1, .nbytes = 0},
        .vex_prefix = {
                .value = 0, bytes[] = {0, 0, 0, 0},
                .got = 1, .nbytes = 0},
        .opcode = {
                .value = 29455, bytes[] = {f, 73, 0, 0},
                .got = 1, .nbytes = 2},
        .modrm = {
                .value = 16, bytes[] = {10, 0, 0, 0},
                .got = 1, .nbytes = 1},
        .sib = {
                .value = 0, bytes[] = {0, 0, 0, 0},
                .got = 1, .nbytes = 0},
        .displacement = {
                .value = 0, bytes[] = {0, 0, 0, 0},
                .got = 1, .nbytes = 0},
        .immediate1 = {
                .value = 17, bytes[] = {11, 0, 0, 0},
                .got = 1, .nbytes = 1},
        .immediate2 = {
                .value = 0, bytes[] = {0, 0, 0, 0},
                .got = 0, .nbytes = 0},
        .attr = 44800, .opnd_bytes = 4, .addr_bytes = 8,
        .length = 4, .x86_64 = 1, .kaddr = 0x7fff0f7d9430}
 Success: decoded and checked 1 given instructions with 0 errors (seed:0x0)
Signed-off-by: NMasami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: yrl.pp-manager.tt@hitachi.com
Link: http://lkml.kernel.org/r/20111205120603.15475.91192.stgit@cloudSigned-off-by: NIngo Molnar <mingo@elte.hu>

9dde9dc0

x86: Update instruction decoder to support new AVX formats · a9c373d0

由 Masami Hiramatsu 提交于 12月 05, 2011

Since new Intel software developers manual introduces
new format for AVX instruction set (including AVX2),
it is important to update x86-opcode-map.txt to fit
those changes.
Signed-off-by: NMasami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: yrl.pp-manager.tt@hitachi.com
Link: http://lkml.kernel.org/r/20111205120557.15475.13236.stgit@cloudSigned-off-by: NIngo Molnar <mingo@elte.hu>

a9c373d0

x86/tools: Fix insn_sanity message outputs · e70825fc

由 Masami Hiramatsu 提交于 12月 05, 2011

Fix x86 instruction decoder test to dump all error messages
to stderr and others to stdout.
Signed-off-by: NMasami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: yrl.pp-manager.tt@hitachi.com
Link: http://lkml.kernel.org/r/20111205120550.15475.70149.stgit@cloudSigned-off-by: NIngo Molnar <mingo@elte.hu>

e70825fc

x86/tools: Fix instruction decoder message output · bfbe9015

由 Masami Hiramatsu 提交于 12月 05, 2011

Fix instruction decoder test (insn_sanity), so that
it doesn't show both info and error messages twice on
same instruction. (In that case, show only error message)
Signed-off-by: NMasami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: yrl.pp-manager.tt@hitachi.com
Link: http://lkml.kernel.org/r/20111205120545.15475.7928.stgit@cloudSigned-off-by: NIngo Molnar <mingo@elte.hu>

bfbe9015

x86: Fix instruction decoder to handle grouped AVX instructions · 130b78b2

由 Masami Hiramatsu 提交于 12月 05, 2011

For reducing memory usage of attribute table, x86 instruction
decoder puts "Group" attribute only on "no-last-prefix"
attribute table (same as vex_p == 0 case).

Thus, the decoder should look no-last-prefix table first, and
then only if it is not a group, move on to "with-last-prefix"
table (vex_p != 0).

However, current implementation, inat_get_avx_attribute()
looks with-last-prefix directly. So, when decoding
a grouped AVX instruction, the decoder fails to find correct
group because there is no "Group" attribute on the table.
This ends up with the mis-decoding of instructions, as Ingo
reported in http://thread.gmane.org/gmane.linux.kernel/1214103

This patch fixes it to check no-last-prefix table first
even if that is an AVX instruction, and get an attribute from
"with last-prefix" table only if that is not a group.
Reported-by: NIngo Molnar <mingo@elte.hu>
Signed-off-by: NMasami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: yrl.pp-manager.tt@hitachi.com
Link: http://lkml.kernel.org/r/20111205120539.15475.91428.stgit@cloudSigned-off-by: NIngo Molnar <mingo@elte.hu>

130b78b2

x86/tools: Fix Makefile to build all test tools · 1056c3e9

由 Masami Hiramatsu 提交于 12月 05, 2011

Fix arch/x86/tools/Makefile to compile both test tools
correctly. This bug leads build error.
Signed-off-by: NMasami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: yrl.pp-manager.tt@hitachi.com
Link: http://lkml.kernel.org/r/20111205120533.15475.62047.stgit@cloudSigned-off-by: NIngo Molnar <mingo@elte.hu>

1056c3e9

perf, x86: Force IBS LVT offset assignment for family 10h · 16e5294e

由 Robert Richter 提交于 11月 08, 2011

On AMD family 10h we see firmware bug messages like the following:

 [Firmware Bug]: cpu 6, try to use APIC500 (LVT offset 0) for vector 0x10400, but the register is already in use for vector 0xf9 on another cpu
 [Firmware Bug]: cpu 6, IBS interrupt offset 0 not available (MSRC001103A=0x0000000000000100)
 [Firmware Bug]: using offset 1 for IBS interrupts
 [Firmware Bug]: workaround enabled for IBS LVT offset
 perf: AMD IBS detected (0x00000007)

We always see this, since the offsets are not assigned by the BIOS for
this family. Force LVT offset assignment in this case. If the OS
assignment fails, fallback to BIOS settings and try to setup this.

The fallback to BIOS settings weakens the family check since
force_ibs_eilvt_setup() may fail e.g. in case of virtual machines.
But setup may still succeed if BIOS offsets are correct.

Other families don't have a workaround implemented that assigns LVT
offsets. It's ok, to drop calling force_ibs_eilvt_setup() for that
families.

With the patch the [Firmware Bug] messages vanish. We see now:

 IBS: LVT offset 1 assigned
 perf: AMD IBS detected (0x00000007)
Signed-off-by: NRobert Richter <robert.richter@amd.com>
Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20111109162225.GO12451@erda.amd.comSigned-off-by: NIngo Molnar <mingo@elte.hu>

16e5294e

perf, x86: Disable PEBS on SandyBridge chips · 6a600a8b

由 Peter Zijlstra 提交于 11月 15, 2011

Cc: Stephane Eranian <eranian@google.com>
Cc: stable@kernel.org
Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: NIngo Molnar <mingo@elte.hu>

6a600a8b

14 11月, 2011 3 次提交

perf/x86: Enable raw event access to Intel offcore events · ed13ec58

由 Peter Zijlstra 提交于 11月 14, 2011

Now that the core offcore support is fixed up (thanks Stephane) and we
have sane generic events utilizing them, re-enable the raw access to
the feature as well.

Note that it doesn't matter if you use event 0x1b7 or 0x1bb to specify
an offcore event, either one works and neither guarantees you'll end
up on a particular offcore MSR.

Based on original patch from: Vince Weaver <vweaver1@eecs.utk.edu>.
Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Vince Weaver <vweaver1@eecs.utk.edu>.
Cc: Stephane Eranian <eranian@google.com>
Link: http://lkml.kernel.org/r/alpine.DEB.2.00.1108031200390.703@cl320.eecs.utk.eduSigned-off-by: NIngo Molnar <mingo@elte.hu>

ed13ec58

perf: Don't use -ENOSPC for out of PMU resources · aa2bc1ad

由 Peter Zijlstra 提交于 11月 09, 2011

People (Linus) objected to using -ENOSPC to signal not having enough
resources on the PMU to satisfy the request. Use -EINVAL.
Requested-by: NLinus Torvalds <torvalds@linux-foundation.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Deng-Cheng Zhu <dengcheng.zhu@gmail.com>
Cc: David Daney <david.daney@cavium.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/n/tip-xv8geaz2zpbjhlx0svmpp28n@git.kernel.org
[ merged to newer kernel, fixed up MIPS impact ]
Signed-off-by: NIngo Molnar <mingo@elte.hu>

aa2bc1ad

perf/x86: Fix PEBS instruction unwind · 57d1c0c0

由 Peter Zijlstra 提交于 10月 07, 2011

Masami spotted that we always try to decode the instruction stream as
64bit instructions when running a 64bit kernel, this doesn't work for
ia32-compat proglets.

Use TIF_IA32 to detect if we need to use the 32bit instruction
decoder.
Reported-by: NMasami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: stable@kernel.org
Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: NIngo Molnar <mingo@elte.hu>

57d1c0c0

12 11月, 2011 3 次提交

bma023: Add SFI translation for this device · 9f80d8b6

由 William Douglas 提交于 11月 10, 2011

This needed the sfi IRQ 0xFF fix to go in first. It simply plumbs in the
bma023 driver with the firmware naming of it.
Signed-off-by: NWilliam Douglas <william.douglas@intel.com>
Signed-off-by: NAlan Cox <alan@linux.intel.com>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

9f80d8b6

vrtc: change its year offset from 1960 to 1972 · 57e6319d

由 Feng Tang 提交于 11月 10, 2011

Real world year equals the value in vrtc YEAR register plus an offset.
We used 1960 as the offset to make leap year consistent, but for a
device's first use, its YEAR register is 0 and the system year will
be parsed as 1960 which is not a valid UNIX time and will cause many
applications to fail mysteriously. So we use 1972 instead to fix this
issue.

Updated patch which adds a sanity check suggested by Mathias

This isn't a change in behaviour for systems, because 1972 is the one we
actually use. It's the old version in upstream which is out of sync with
all devices.
Signed-off-by: NFeng Tang <feng.tang@intel.com>
Signed-off-by: NAlan Cox <alan@linux.intel.com>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

57e6319d

ce4100: fix a build error · f2ee4421

由 Zhang Rui 提交于 11月 10, 2011

Fix a build error. CE4100 with no serial errors because the alternate
function is only a prototype not a null function as intended.
Signed-off-by: NZhang Rui <rui.zhang@intel.com>
Signed-off-by: NAlan Cox <alan@linux.intel.com>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

f2ee4421

10 11月, 2011 1 次提交

x86, perf: Add a build-time sanity test to the x86 decoder · 1ec454ba

由 Masami Hiramatsu 提交于 10月 20, 2011

Add a sanity test of x86 insn decoder against a stream
of randomly generated input, at build time.

This test is also able to reproduce any bug that might
trigger by allowing the passing of random-seed and
iteration-number to the test, or by passing input
which has invalid byte code.

Changes in V2:
 - Code cleanup.
 - Show how to reproduce the error by insn_sanity test.
Signed-off-by: NMasami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: acme@redhat.com
Cc: ming.m.lin@intel.com
Cc: robert.richter@amd.com
Cc: ravitillo@lbl.gov
Cc: yrl.pp-manager.tt@hitachi.com
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20111020140109.20938.92572.stgit@localhost.localdomainSigned-off-by: NIngo Molnar <mingo@elte.hu>

1ec454ba

07 11月, 2011 1 次提交

mrst pmu: update comment · 22f4521d

由 Len Brown 提交于 8月 12, 2011

referenced MeeGo, in particular, but really means Linux, in general.
Signed-off-by: NLen Brown <len.brown@intel.com>

22f4521d

04 11月, 2011 3 次提交

oprofile, x86: Reimplement nmi timer mode using perf event · dcfce4a0

由 Robert Richter 提交于 10月 11, 2011

The legacy x86 nmi watchdog code was removed with the implementation
of the perf based nmi watchdog. This broke Oprofile's nmi timer
mode. To run nmi timer mode we relied on a continuous ticking nmi
source which the nmi watchdog provided. The nmi tick was no longer
available and current watchdog can not be used anymore since it runs
with very long periods in the range of seconds. This patch
reimplements the nmi timer mode using a perf counter nmi source.

V2:
* removing pr_info()
* fix undefined reference to `__udivdi3' for 32 bit build
* fix section mismatch of .cpuinit.data:nmi_timer_cpu_nb
* removed nmi timer setup in arch/x86
* implemented function stubs for op_nmi_init/exit()
* made code more readable in oprofile_init()

V3:
* fix architectural initialization in oprofile_init()
* fix CONFIG_OPROFILE_NMI_TIMER dependencies
Acked-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: NRobert Richter <robert.richter@amd.com>

dcfce4a0

oprofile, x86: Add kernel parameter oprofile.cpu_type=timer · 159a80b2

由 Robert Richter 提交于 10月 11, 2011

We need this to better test x86 NMI timer mode. Otherwise it is very
hard to setup systems with NMI timer enabled, but we have this e.g. in
virtual machine environments.
Signed-off-by: NRobert Richter <robert.richter@amd.com>

159a80b2

oprofile, x86: Fix crash when unloading module (nmi timer mode) · 97f7f818

由 Robert Richter 提交于 10月 10, 2011

If oprofile uses the nmi timer interrupt there is a crash while
unloading the module. The bug can be triggered with oprofile build as
module and kernel parameter nolapic set. This patch fixes this.

oprofile: using NMI timer interrupt.
BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
IP: [<ffffffff8123c226>] unregister_syscore_ops+0x41/0x58
PGD 42dbca067 PUD 41da6a067 PMD 0
Oops: 0002 [#1] PREEMPT SMP
CPU 5
Modules linked in: oprofile(-) [last unloaded: oprofile]

Pid: 2518, comm: modprobe Not tainted 3.1.0-rc7-00019-gb2fb49d #19 Advanced Micro Device Anaheim/Anaheim
RIP: 0010:[<ffffffff8123c226>]  [<ffffffff8123c226>] unregister_syscore_ops+0x41/0x58
RSP: 0018:ffff88041ef71e98  EFLAGS: 00010296
RAX: 0000000000000000 RBX: ffffffffa0017100 RCX: dead000000200200
RDX: 0000000000000000 RSI: dead000000100100 RDI: ffffffff8178c620
RBP: ffff88041ef71ea8 R08: 0000000000000001 R09: 0000000000000082
R10: 0000000000000000 R11: ffff88041ef71de8 R12: 0000000000000080
R13: fffffffffffffff5 R14: 0000000000000001 R15: 0000000000610210
FS:  00007fc902f20700(0000) GS:ffff88042fd40000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000008 CR3: 000000041cdb6000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process modprobe (pid: 2518, threadinfo ffff88041ef70000, task ffff88041d348040)
Stack:
 ffff88041ef71eb8 ffffffffa0017790 ffff88041ef71eb8 ffffffffa0013532
 ffff88041ef71ec8 ffffffffa00132d6 ffff88041ef71ed8 ffffffffa00159b2
 ffff88041ef71f78 ffffffff81073115 656c69666f72706f 0000000000610200
Call Trace:
 [<ffffffffa0013532>] op_nmi_exit+0x15/0x17 [oprofile]
 [<ffffffffa00132d6>] oprofile_arch_exit+0xe/0x10 [oprofile]
 [<ffffffffa00159b2>] oprofile_exit+0x1e/0x20 [oprofile]
 [<ffffffff81073115>] sys_delete_module+0x1c3/0x22f
 [<ffffffff811bf09e>] ? trace_hardirqs_on_thunk+0x3a/0x3f
 [<ffffffff8148070b>] system_call_fastpath+0x16/0x1b
Code: 20 c6 78 81 e8 c5 cc 23 00 48 8b 13 48 8b 43 08 48 be 00 01 10 00 00 00 ad de 48 b9 00 02 20 00 00 00 ad de 48 c7 c7 20 c6 78 81
 89 42 08 48 89 10 48 89 33 48 89 4b 08 e8 a6 c0 23 00 5a 5b
RIP  [<ffffffff8123c226>] unregister_syscore_ops+0x41/0x58
 RSP <ffff88041ef71e98>
CR2: 0000000000000008
---[ end trace 43a541a52956b7b0 ]---

CC: stable@kernel.org # 2.6.37+
Signed-off-by: NRobert Richter <robert.richter@amd.com>

97f7f818

03 11月, 2011 2 次提交

thp: share get_huge_page_tail() · b35a35b5

由 Andrea Arcangeli 提交于 11月 02, 2011

This avoids duplicating the function in every arch gup_fast.
Signed-off-by: NAndrea Arcangeli <aarcange@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <jweiner@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: David Gibson <david@gibson.dropbear.id.au>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: David Miller <davem@davemloft.net>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

b35a35b5

mm: thp: tail page refcounting fix · 70b50f94

由 Andrea Arcangeli 提交于 11月 02, 2011

Michel while working on the working set estimation code, noticed that
calling get_page_unless_zero() on a random pfn_to_page(random_pfn)
wasn't safe, if the pfn ended up being a tail page of a transparent
hugepage under splitting by __split_huge_page_refcount().

He then found the problem could also theoretically materialize with
page_cache_get_speculative() during the speculative radix tree lookups
that uses get_page_unless_zero() in SMP if the radix tree page is freed
and reallocated and get_user_pages is called on it before
page_cache_get_speculative has a chance to call get_page_unless_zero().

So the best way to fix the problem is to keep page_tail->_count zero at
all times.  This will guarantee that get_page_unless_zero() can never
succeed on any tail page.  page_tail->_mapcount is guaranteed zero and
is unused for all tail pages of a compound page, so we can simply
account the tail page references there and transfer them to
tail_page->_count in __split_huge_page_refcount() (in addition to the
head_page->_mapcount).

While debugging this s/_count/_mapcount/ change I also noticed get_page is
called by direct-io.c on pages returned by get_user_pages.  That wasn't
entirely safe because the two atomic_inc in get_page weren't atomic.  As
opposed to other get_user_page users like secondary-MMU page fault to
establish the shadow pagetables would never call any superflous get_page
after get_user_page returns.  It's safer to make get_page universally safe
for tail pages and to use get_page_foll() within follow_page (inside
get_user_pages()).  get_page_foll() is safe to do the refcounting for tail
pages without taking any locks because it is run within PT lock protected
critical sections (PT lock for pte and page_table_lock for
pmd_trans_huge).

The standard get_page() as invoked by direct-io instead will now take
the compound_lock but still only for tail pages.  The direct-io paths
are usually I/O bound and the compound_lock is per THP so very
finegrined, so there's no risk of scalability issues with it.  A simple
direct-io benchmarks with all lockdep prove locking and spinlock
debugging infrastructure enabled shows identical performance and no
overhead.  So it's worth it.  Ideally direct-io should stop calling
get_page() on pages returned by get_user_pages().  The spinlock in
get_page() is already optimized away for no-THP builds but doing
get_page() on tail pages returned by GUP is generally a rare operation
and usually only run in I/O paths.

This new refcounting on page_tail->_mapcount in addition to avoiding new
RCU critical sections will also allow the working set estimation code to
work without any further complexity associated to the tail page
refcounting with THP.
Signed-off-by: NAndrea Arcangeli <aarcange@redhat.com>
Reported-by: NMichel Lespinasse <walken@google.com>
Reviewed-by: NMichel Lespinasse <walken@google.com>
Reviewed-by: NMinchan Kim <minchan.kim@gmail.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <jweiner@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: David Gibson <david@gibson.dropbear.id.au>
Cc: <stable@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

70b50f94

02 11月, 2011 14 次提交

um: Fix kmalloc argument order in um/vdso/vma.c · 0d65ede0

由 Dave Jones 提交于 10月 24, 2011

kmalloc size is 1st arg, not second.
Signed-off-by: NDave Jones <davej@redhat.com>
Signed-off-by: NRichard Weinberger <richard@nod.at>

Cc: <stable@kernel.org> # 3.0.x
[richard@nod.at: on 3.0 the to be patched file is
arch/um/sys-x86_64/vdso/vma.c]

0d65ede0

R
um: we need sys/user.h only on i386 · 38b64aed
由 Richard Weinberger 提交于 8月 18, 2011
```
Signed-off-by: NRichard Weinberger <richard@nod.at>
```
38b64aed
R
um: merge delay_{32,64}.c · d0af6cbf
由 Richard Weinberger 提交于 8月 18, 2011
```
Signed-off-by: NRichard Weinberger <richard@nod.at>
```
d0af6cbf

um: kill system-um.h · a34978cb

由 Al Viro 提交于 8月 18, 2011

most of it belonged in irqflags.h, actually
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
Signed-off-by: NRichard Weinberger <richard@nod.at>

a34978cb

um: segment.h is x86-only and needed only there · 46ecca8a

由 Al Viro 提交于 8月 18, 2011

Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
Signed-off-by: NRichard Weinberger <richard@nod.at>

46ecca8a

um: unify ptrace_user.h · 966e803a

由 Al Viro 提交于 8月 18, 2011

Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
Signed-off-by: NRichard Weinberger <richard@nod.at>

966e803a

um: unify KSTK_... · a10c95d8

由 Al Viro 提交于 8月 18, 2011

... and switch get_thread_register() to HOST_... for register numbers
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
Signed-off-by: NRichard Weinberger <richard@nod.at>

a10c95d8

um: fix gcov build breakage · 4d211093

由 Al Viro 提交于 8月 18, 2011

a) exports in gmon_syms.c duplicate kernel/gcov/* ones
b) excluding -pg in vdso compile is not enough - -fprofile-arcs
and -ftest-coverage also needs to be excluded
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
Signed-off-by: NRichard Weinberger <richard@nod.at>

4d211093

um: irq_vectors.h just shadows x86 one · 3fb77d72

由 Al Viro 提交于 8月 18, 2011

Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
Signed-off-by: NRichard Weinberger <richard@nod.at>

3fb77d72

A
um: required-features.h is there only to shadow x86 one... · ff9586e9
由 Al Viro 提交于 8月 18, 2011
```
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
Signed-off-by: NRichard Weinberger <richard@nod.at>
```
ff9586e9

um: asm/apic.h is there only to shadow the x86 one... · 8807c1d5

由 Al Viro 提交于 8月 18, 2011

... so take it to arch/um/x86/asm.
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
Signed-off-by: NRichard Weinberger <richard@nod.at>

8807c1d5

um: take ldt.h to arch/x86/um/asm/mm_context.h · b3ee571e

由 Al Viro 提交于 8月 18, 2011

it's x86-only and we have no business playing with it in asm/mmu.h; make
the latter have
	struct uml_arch_mm_context arch;
instead of
	struct uml_ldt ldt;
and let arch/<subarch>/um/asm/mm_context.h decide what'll be in there.
While we are at it, kill host_ldt.h - it's not needed in part of places
that include it (we want asm/ldt.h in those) and it can be trivially
expanded into the single remaining one.
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
Signed-off-by: NRichard Weinberger <richard@nod.at>

b3ee571e

um: merge signal_{32,64}.c · f67aa2ff

由 Al Viro 提交于 8月 18, 2011

Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
Signed-off-by: NRichard Weinberger <richard@nod.at>

f67aa2ff

A
um: no need to play with save_sp in signal frame setup anymore · fbe98686
由 Al Viro 提交于 8月 18, 2011
```
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
Signed-off-by: NRichard Weinberger <richard@nod.at>
```
fbe98686