提交 · c074eaac2ab264c94520efff7e896b771de885ae · openanolis / cloud-kernel

18 11月, 2012 14 次提交

由 Yinghai Lu 提交于 11月 16, 2012

Signed-off-by: NYinghai Lu <yinghai@kernel.org>
Link: http://lkml.kernel.org/r/1353123563-3103-44-git-send-email-yinghai@kernel.orgSigned-off-by: NH. Peter Anvin <hpa@linux.intel.com>

c074eaac

x86, mm: Move init_gbpages() out of setup.c · 148b2098

由 Yinghai Lu 提交于 11月 16, 2012

Put it in mm/init.c, and call it from probe_page_mask().
init_mem_mapping is calling probe_page_mask at first.
So calling sequence is not changed.
Signed-off-by: NYinghai Lu <yinghai@kernel.org>
Link: http://lkml.kernel.org/r/1353123563-3103-32-git-send-email-yinghai@kernel.orgSigned-off-by: NH. Peter Anvin <hpa@linux.intel.com>

148b2098

x86, mm, Xen: Remove mapping_pagetable_reserve() · 6f80b68e

由 Yinghai Lu 提交于 11月 16, 2012

Page table area are pre-mapped now after
	x86, mm: setup page table in top-down
	x86, mm: Remove early_memremap workaround for page table accessing on 64bit

mapping_pagetable_reserve is not used anymore, so remove it.

Also remove operation in mask_rw_pte(), as modified allow_low_page
always return pages that are already mapped, moreover
xen_alloc_pte_init, xen_alloc_pmd_init, etc, will mark the page RO
before hooking it into the pagetable automatically.

-v2: add changelog about mask_rw_pte() from Stefano.
Signed-off-by: NYinghai Lu <yinghai@kernel.org>
Link: http://lkml.kernel.org/r/1353123563-3103-27-git-send-email-yinghai@kernel.org
Cc: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Signed-off-by: NH. Peter Anvin <hpa@linux.intel.com>

6f80b68e

x86, mm: Move min_pfn_mapped back to mm/init.c · 9985b4c6

由 Yinghai Lu 提交于 11月 16, 2012

Also change it to static.
Signed-off-by: NYinghai Lu <yinghai@kernel.org>
Link: http://lkml.kernel.org/r/1353123563-3103-26-git-send-email-yinghai@kernel.orgSigned-off-by: NH. Peter Anvin <hpa@linux.intel.com>

9985b4c6

x86, mm: setup page table in top-down · 8d57470d

由 Yinghai Lu 提交于 11月 16, 2012

Get pgt_buf early from BRK, and use it to map PMD_SIZE from top at first.
Then use mapped pages to map more ranges below, and keep looping until
all pages get mapped.

alloc_low_page will use page from BRK at first, after that buffer is used
up, will use memblock to find and reserve pages for page table usage.

Introduce min_pfn_mapped to make sure find new pages from mapped ranges,
that will be updated when lower pages get mapped.

Also add step_size to make sure that don't try to map too big range with
limited mapped pages initially, and increase the step_size when we have
more mapped pages on hand.

We don't need to call pagetable_reserve anymore, reserve work is done
in alloc_low_page() directly.

At last we can get rid of calculation and find early pgt related code.

-v2: update to after fix_xen change,
     also use MACRO for initial pgt_buf size and add comments with it.
-v3: skip big reserved range in memblock.reserved near end.
-v4: don't need fix_xen change now.
-v5: add changelog about moving about reserving pagetable to alloc_low_page.
Suggested-by: N"H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: NYinghai Lu <yinghai@kernel.org>
Link: http://lkml.kernel.org/r/1353123563-3103-22-git-send-email-yinghai@kernel.orgSigned-off-by: NH. Peter Anvin <hpa@linux.intel.com>

8d57470d

x86, mm: relocate initrd under all mem for 64bit · 74f27655

由 Yinghai Lu 提交于 11月 16, 2012

instead of under 4g.

For 64bit, we can use any mapped mem instead of low mem.
Signed-off-by: NYinghai Lu <yinghai@kernel.org>
Link: http://lkml.kernel.org/r/1353123563-3103-17-git-send-email-yinghai@kernel.orgSigned-off-by: NH. Peter Anvin <hpa@linux.intel.com>

74f27655

x86, mm: Only direct map addresses that are marked as E820_RAM · 66520ebc

由 Jacob Shin 提交于 11月 16, 2012

Currently direct mappings are created for [ 0 to max_low_pfn<<PAGE_SHIFT )
and [ 4GB to max_pfn<<PAGE_SHIFT ), which may include regions that are not
backed by actual DRAM. This is fine for holes under 4GB which are covered
by fixed and variable range MTRRs to be UC. However, we run into trouble
on higher memory addresses which cannot be covered by MTRRs.

Our system with 1TB of RAM has an e820 that looks like this:

 BIOS-e820: [mem 0x0000000000000000-0x00000000000983ff] usable
 BIOS-e820: [mem 0x0000000000098400-0x000000000009ffff] reserved
 BIOS-e820: [mem 0x00000000000d0000-0x00000000000fffff] reserved
 BIOS-e820: [mem 0x0000000000100000-0x00000000c7ebffff] usable
 BIOS-e820: [mem 0x00000000c7ec0000-0x00000000c7ed7fff] ACPI data
 BIOS-e820: [mem 0x00000000c7ed8000-0x00000000c7ed9fff] ACPI NVS
 BIOS-e820: [mem 0x00000000c7eda000-0x00000000c7ffffff] reserved
 BIOS-e820: [mem 0x00000000fec00000-0x00000000fec0ffff] reserved
 BIOS-e820: [mem 0x00000000fee00000-0x00000000fee00fff] reserved
 BIOS-e820: [mem 0x00000000fff00000-0x00000000ffffffff] reserved
 BIOS-e820: [mem 0x0000000100000000-0x000000e037ffffff] usable
 BIOS-e820: [mem 0x000000e038000000-0x000000fcffffffff] reserved
 BIOS-e820: [mem 0x0000010000000000-0x0000011ffeffffff] usable

and so direct mappings are created for huge memory hole between
0x000000e038000000 to 0x0000010000000000. Even though the kernel never
generates memory accesses in that region, since the page tables mark
them incorrectly as being WB, our (AMD) processor ends up causing a MCE
while doing some memory bookkeeping/optimizations around that area.

This patch iterates through e820 and only direct maps ranges that are
marked as E820_RAM, and keeps track of those pfn ranges. Depending on
the alignment of E820 ranges, this may possibly result in using smaller
size (i.e. 4K instead of 2M or 1G) page tables.

-v2: move changes from setup.c to mm/init.c, also use for_each_mem_pfn_range
	instead.  - Yinghai Lu
-v3: add calculate_all_table_space_size() to get correct needed page table
	size. - Yinghai Lu
-v4: fix add_pfn_range_mapped() to get correct max_low_pfn_mapped when
     mem map does have hole under 4g that is found by Konard on xen
     domU with 8g ram. - Yinghai
Signed-off-by: NJacob Shin <jacob.shin@amd.com>
Link: http://lkml.kernel.org/r/1353123563-3103-16-git-send-email-yinghai@kernel.orgSigned-off-by: NYinghai Lu <yinghai@kernel.org>
Reviewed-by: NPekka Enberg <penberg@kernel.org>
Signed-off-by: NH. Peter Anvin <hpa@linux.intel.com>

66520ebc

x86, mm: use pfn_range_is_mapped() with reserve_initrd · e8c57d40

由 Yinghai Lu 提交于 11月 16, 2012

We are going to map ram only, so under max_low_pfn_mapped,
between 4g and max_pfn_mapped does not mean mapped at all.

Use pfn_range_is_mapped() to find out if range is mapped for initrd.

That could happen bootloader put initrd in range but user could
use memmap to carve some of range out.

Also during copying need to use early_memmap to map original initrd
for accessing.
Signed-off-by: NYinghai Lu <yinghai@kernel.org>
Link: http://lkml.kernel.org/r/1353123563-3103-15-git-send-email-yinghai@kernel.orgSigned-off-by: NH. Peter Anvin <hpa@linux.intel.com>

e8c57d40

x86, mm: use pfn_range_is_mapped() with gart · 5101730c

由 Yinghai Lu 提交于 11月 16, 2012

We are going to map ram only, so under max_low_pfn_mapped,
between 4g and max_pfn_mapped does not mean mapped at all.

Use pfn_range_is_mapped() directly.
Signed-off-by: NYinghai Lu <yinghai@kernel.org>
Link: http://lkml.kernel.org/r/1353123563-3103-14-git-send-email-yinghai@kernel.orgSigned-off-by: NH. Peter Anvin <hpa@linux.intel.com>

5101730c

x86, mm: Fixup code testing if a pfn is direct mapped · dda56e13

由 Jacob Shin 提交于 11月 16, 2012

Update code that previously assumed pfns [ 0 - max_low_pfn_mapped ) and
[ 4GB - max_pfn_mapped ) were always direct mapped, to now look up
pfn_mapped ranges instead.

-v2: change applying sequence to keep git bisecting working.
     so add dummy pfn_range_is_mapped(). - Yinghai Lu
Signed-off-by: NJacob Shin <jacob.shin@amd.com>
Link: http://lkml.kernel.org/r/1353123563-3103-12-git-send-email-yinghai@kernel.orgSigned-off-by: NYinghai Lu <yinghai@kernel.org>
Signed-off-by: NH. Peter Anvin <hpa@linux.intel.com>

dda56e13

x86, mm: if kernel .text .data .bss are not marked as E820_RAM, complain and fix · 4eea6aa5

由 Jacob Shin 提交于 11月 16, 2012

There could be cases where user supplied memmap=exactmap memory
mappings do not mark the region where the kernel .text .data and
.bss reside as E820_RAM, as reported here:

https://lkml.org/lkml/2012/8/14/86

Handle it by complaining, and adding the range back into the e820.
Signed-off-by: NJacob Shin <jacob.shin@amd.com>
Link: http://lkml.kernel.org/r/1353123563-3103-11-git-send-email-yinghai@kernel.orgSigned-off-by: NYinghai Lu <yinghai@kernel.org>
Reviewed-by: NPekka Enberg <penberg@kernel.org>
Signed-off-by: NH. Peter Anvin <hpa@linux.intel.com>

4eea6aa5

x86, mm: Set memblock initial limit to 1M · dd7dfad7

由 Yinghai Lu 提交于 11月 16, 2012

memblock_x86_fill() could double memory array.
If we set memblock.current_limit to 512M, so memory array could be around 512M.
So kdump will not get big range (like 512M) under 1024M.

Try to put it down under 1M, it would use about 4k or so, and that is limited.
Signed-off-by: NYinghai Lu <yinghai@kernel.org>
Link: http://lkml.kernel.org/r/1353123563-3103-10-git-send-email-yinghai@kernel.orgSigned-off-by: NH. Peter Anvin <hpa@linux.intel.com>

dd7dfad7

x86, mm: Move init_memory_mapping calling out of setup.c · 22ddfcaa

由 Yinghai Lu 提交于 11月 16, 2012

Now init_memory_mapping is called two times, later will be called for every
ram ranges.

Could put all related init_mem calling together and out of setup.c.

Actually, it reverts commit 1bbbbe77
x86: Exclude E820_RESERVED regions and memory holes above 4 GB from direct mapping.
will address that later with complete solution include handling hole under 4g.
Signed-off-by: NYinghai Lu <yinghai@kernel.org>
Link: http://lkml.kernel.org/r/1353123563-3103-5-git-send-email-yinghai@kernel.orgReviewed-by: NPekka Enberg <penberg@kernel.org>
Signed-off-by: NH. Peter Anvin <hpa@linux.intel.com>

22ddfcaa

x86, mm: Add global page_size_mask and probe one time only · fa62aafe

由 Yinghai Lu 提交于 11月 16, 2012

Now we pass around use_gbpages and use_pse for calculating page table size,
Later we will need to call init_memory_mapping for every ram range one by one,
that mean those calculation will be done several times.

Those information are the same for all ram range and could be stored in
page_size_mask and could be probed it one time only.

Move that probing code out of init_memory_mapping into separated function
probe_page_size_mask(), and call it before all init_memory_mapping.
Suggested-by: NIngo Molnar <mingo@elte.hu>
Signed-off-by: NYinghai Lu <yinghai@kernel.org>
Link: http://lkml.kernel.org/r/1353123563-3103-2-git-send-email-yinghai@kernel.orgReviewed-by: NPekka Enberg <penberg@kernel.org>
Signed-off-by: NH. Peter Anvin <hpa@linux.intel.com>

fa62aafe

26 10月, 2012 1 次提交

x86: efi: Turn off efi_enabled after setup on mixed fw/kernel · 5189c2a7

由 Olof Johansson 提交于 10月 24, 2012

When 32-bit EFI is used with 64-bit kernel (or vice versa), turn off
efi_enabled once setup is done. Beyond setup, it is normally used to
determine if runtime services are available and we will have none.

This will resolve issues stemming from efivars modprobe panicking on a
32/64-bit setup, as well as some reboot issues on similar setups.

Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=45991Reported-by: NMarko Kohtala <marko.kohtala@gmail.com>
Reported-by: NMaxim Kammerer <mk@dee.su>
Signed-off-by: NOlof Johansson <olof@lixom.net>
Acked-by: NMaarten Lankhorst <maarten.lankhorst@canonical.com>
Cc: stable@kernel.org # 3.4 - 3.6
Cc: Matthew Garrett <mjg@redhat.com>
Signed-off-by: NMatt Fleming <matt.fleming@intel.com>

5189c2a7

25 10月, 2012 2 次提交

x86, mm: Use memblock memory loop instead of e820_RAM · 1f2ff682

由 Yinghai Lu 提交于 10月 22, 2012

We need to handle E820_RAM and E820_RESERVED_KERNEL at the same time.

Also memblock has page aligned range for ram, so we could avoid mapping
partial pages.
Signed-off-by: NYinghai Lu <yinghai@kernel.org>
Link: http://lkml.kernel.org/r/CAE9FiQVZirvaBMFYRfXMmWEcHbKSicQEHz4VAwUv0xFCk51ZNw@mail.gmail.comAcked-by: NJacob Shin <jacob.shin@amd.com>
Signed-off-by: NH. Peter Anvin <hpa@linux.intel.com>
Cc: <stable@vger.kernel.org>

1f2ff682

x86, mm: Trim memory in memblock to be page aligned · 6ede1fd3

由 Yinghai Lu 提交于 10月 22, 2012

We will not map partial pages, so need to make sure memblock
allocation will not allocate those bytes out.

Also we will use for_each_mem_pfn_range() to loop to map memory
range to keep them consistent.
Signed-off-by: NYinghai Lu <yinghai@kernel.org>
Link: http://lkml.kernel.org/r/CAE9FiQVZirvaBMFYRfXMmWEcHbKSicQEHz4VAwUv0xFCk51ZNw@mail.gmail.comAcked-by: NJacob Shin <jacob.shin@amd.com>
Signed-off-by: NH. Peter Anvin <hpa@linux.intel.com>
Cc: <stable@vger.kernel.org>

6ede1fd3

24 10月, 2012 10 次提交

x86/irq/ioapic: Check for valid irq_cfg pointer in smp_irq_move_cleanup_interrupt · 94777fc5

由 Dimitri Sivanich 提交于 10月 16, 2012

Posting this patch to fix an issue concerning sparse irq's that
I raised a while back.  There was discussion about adding
refcounting to sparse irqs (to fix other potential race
conditions), but that does not appear to have been addressed
yet.  This covers the only issue of this type that I've
encountered in this area.

A NULL pointer dereference can occur in
smp_irq_move_cleanup_interrupt() if we haven't yet setup the
irq_cfg pointer in the irq_desc.irq_data.chip_data.

In create_irq_nr() there is a window where we have set
vector_irq in __assign_irq_vector(), but not yet called
irq_set_chip_data() to set the irq_cfg pointer.

Should an IRQ_MOVE_CLEANUP_VECTOR hit the cpu in question during
this time, smp_irq_move_cleanup_interrupt() will attempt to
process the aforementioned irq, but panic when accessing
irq_cfg.

Only continue processing the irq if irq_cfg is non-NULL.
Signed-off-by: NDimitri Sivanich <sivanich@sgi.com>
Cc: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: Joerg Roedel <joerg.roedel@amd.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Alexander Gordeev <agordeev@redhat.com>
Link: http://lkml.kernel.org/r/20121016125021.GA22935@sgi.comSigned-off-by: NIngo Molnar <mingo@kernel.org>

94777fc5

perf/x86: Remove unused variable in nhmex_rbox_alter_er() · 64dfab8e

由 Wei Yongjun 提交于 10月 22, 2012

The variable port is initialized but never used
otherwise, so remove the unused variable.

dpatch engine is used to auto generate this patch.
(https://github.com/weiyj/dpatch)
Signed-off-by: NWei Yongjun <yongjun_wei@trendmicro.com.cn>
Cc: Yan, Zheng <zheng.z.yan@intel.com>
Cc: a.p.zijlstra@chello.nl
Cc: paulus@samba.org
Cc: acme@ghostprotocols.net
Link: http://lkml.kernel.org/r/CAPgLHd8NZkYSkZm22FpZxiEh6HcA0q-V%3D29vdnheiDhgrJZ%2Byw@mail.gmail.comSigned-off-by: NIngo Molnar <mingo@kernel.org>

64dfab8e

perf/x86: Enable overflow on Intel KNC with a custom knc_pmu_handle_irq() · e4074b30

由 Vince Weaver 提交于 10月 17, 2012

Although based on the Intel P6 design, the interrupt mechnanism
for KNC more closely resembles the Intel architectural
perfmon one.

We can't just re-use that code though, because KNC has different
MSR numbers for the status and ack registers.

In this case we just cut-and paste from perf_event_intel.c
with some minor changes, as it looks like it would not be
worth the trouble to change that code to be MSR-configurable.
Signed-off-by: NVince Weaver <vincent.weaver@maine.edu>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Cc: eranian@gmail.com
Cc: Meadows Lawrence F <lawrence.f.meadows@intel.com>
Link: http://lkml.kernel.org/r/alpine.DEB.2.02.1210171304410.23243@vincent-weaver-1.um.maine.edu
[ Small stylistic edits. ]
Signed-off-by: NIngo Molnar <mingo@kernel.org>

e4074b30

perf/x86: Remove cpuc->enable check on Intl KNC event enable/disable · 7d011962

由 Vince Weaver 提交于 10月 17, 2012

x86_pmu.enable() is called from x86_pmu_enable() with
cpuc->enabled set to 0.  This means we weren't re-enabling the
counters after a context switch.

This patch just removes the check, as it should't be necessary
(and the equivelent x86_ generic code does not have the checks).

The origin of this problem is the KNC driver being based on the
P6 one.   The P6 driver also has this issue, but works anyway
due to various lucky accidents.
Signed-off-by: NVince Weaver <vincent.weaver@maine.edu>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Cc: eranian@gmail.com
Cc: Meadows
Cc: Lawrence F <lawrence.f.meadows@intel.com>
Link: http://lkml.kernel.org/r/alpine.DEB.2.02.1210171303290.23243@vincent-weaver-1.um.maine.eduSigned-off-by: NIngo Molnar <mingo@kernel.org>

7d011962

perf/x86: Make Intel KNC use full 40-bit width of counters · ae5ba47a

由 Vince Weaver 提交于 10月 17, 2012

Early versions of Intel KNC chips have a bug where bits above 32
were not properly set.  We worked around this by only using the
bottom 32 bits (out of 40 that should be available).

It turns out this workaround breaks overflow handling.

The buggy silicon will in theory never be used in production
systems, so remove this workaround so we get proper overflow
support.
Signed-off-by: NVince Weaver <vincent.weaver@maine.edu>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Cc: eranian@gmail.com
Cc: Meadows Lawrence F <lawrence.f.meadows@intel.com>
Link: http://lkml.kernel.org/r/alpine.DEB.2.02.1210171302140.23243@vincent-weaver-1.um.maine.eduSigned-off-by: NIngo Molnar <mingo@kernel.org>

ae5ba47a

perf/x86/uncore: Handle pci_read_config_dword() errors · 032c3851

由 Yan, Zheng 提交于 10月 24, 2012

This, beyond handling corner cases, also fixes some build warnings:

arch/x86/kernel/cpu/perf_event_intel_uncore.c: In function ‘snbep_uncore_pci_disable_box’:
arch/x86/kernel/cpu/perf_event_intel_uncore.c:124:9: warning: ‘config’ is used uninitialized in this function [-Wuninitialized]
arch/x86/kernel/cpu/perf_event_intel_uncore.c: In function ‘snbep_uncore_pci_enable_box’:
arch/x86/kernel/cpu/perf_event_intel_uncore.c:135:9: warning: ‘config’ is used uninitialized in this function [-Wuninitialized]
arch/x86/kernel/cpu/perf_event_intel_uncore.c: In function ‘snbep_uncore_pci_read_counter’:
arch/x86/kernel/cpu/perf_event_intel_uncore.c:164:2: warning: ‘count’ is used uninitialized in this function [-Wuninitialized]
Signed-off-by: NYan, Zheng <zheng.z.yan@intel.com>
Cc: a.p.zijlstra@chello.nl
Link: http://lkml.kernel.org/r/1351068140-13456-1-git-send-email-zheng.z.yan@intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>

032c3851

perf/x86: Remove P6 cpuc->enabled check · 58e9eaf0

由 Vince Weaver 提交于 10月 19, 2012

Between 2.6.33 and 2.6.34 the PMU code was made modular.

The x86_pmu_enable() call was extended to disable cpuc->enabled
and iterate the counters, enabling one at a time, before calling
enable_all() at the end, followed by re-enabling cpuc->enabled.

Since cpuc->enabled was set to 0, that change effectively caused
the "val |= ARCH_PERFMON_EVENTSEL_ENABLE;" code in p6_pmu_enable_event()
and p6_pmu_disable_event() to be dead code that was never called.

This change removes this code (which was confusing) and adds some
extra commentary to make it more clear what is going on.
Signed-off-by: NVince Weaver <vincent.weaver@maine.edu>
Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/alpine.DEB.2.02.1210191732000.14552@vincent-weaver-1.um.maine.eduSigned-off-by: NIngo Molnar <mingo@kernel.org>

58e9eaf0

perf/x86: Update/fix generic events on P6 PMU · e09df478

由 Vince Weaver 提交于 10月 19, 2012

This patch updates the generic events on p6, including some new
extended cache events.

Values for these events were taken from the equivelant PAPI
predefined events.

Tested on a Pentium II.
Signed-off-by: NVince Weaver <vincent.weaver@maine.edu>
Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/alpine.DEB.2.02.1210191730080.14552@vincent-weaver-1.um.maine.eduSigned-off-by: NIngo Molnar <mingo@kernel.org>

e09df478

perf/x86: Fix P6 FP_ASSIST event constraint · 7991c9ca

由 Vince Weaver 提交于 10月 19, 2012

According to Intel SDM Volume 3B, FP_ASSIST is limited to Counter 1 only,
not Counter 0.

Tested on a Pentium II.
Signed-off-by: NVince Weaver <vincent.weaver@maine.edu>
Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/alpine.DEB.2.02.1210191728570.14552@vincent-weaver-1.um.maine.eduSigned-off-by: NIngo Molnar <mingo@kernel.org>

7991c9ca

x86/perf: Fix virtualization sanity check · bffd5fc2

由 Andre Przywara 提交于 10月 09, 2012

In check_hw_exists() we try to detect non-emulated MSR accesses
by writing an arbitrary value into one of the PMU registers
and check if it's value after a readout is still the same.
This algorithm silently assumes that the register does not contain
the magic value already, which is wrong in at least one situation.

Fix the algorithm to really do a read-modify-write cycle. This fixes
a warning under Xen under some circumstances on AMD family 10h CPUs.

The reasons in more details actually sound like a story from
Believe It or Not!:

First you need an AMD family 10h/12h CPU. These do not reset the
PERF_CTR registers on a reboot.
Now you boot bare metal Linux, which goes successfully through this
check, but leaves the magic value of 0xabcd in the register. You
don't use the performance counters, but do a reboot (warm reset).
Then you choose to boot Xen. The check will be triggered with a
recent Linux kernel as Dom0 again, trying to write 0xabcd into the
MSR. Xen silently drops the write (expected), but the subsequent read
will return the value in the register, which just happens to be the
expected magic value. Thus the test misleadingly succeeds, leaving
the kernel in the belief that the PMU is available. This will trigger
the following message:

[    0.020294] ------------[ cut here ]------------
[    0.020311] WARNING: at arch/x86/xen/enlighten.c:730 xen_apic_write+0x15/0x17()
[    0.020318] Hardware name: empty
[    0.020323] Modules linked in:
[    0.020334] Pid: 1, comm: swapper/0 Not tainted 3.3.8 #7
[    0.020340] Call Trace:
[    0.020354]  [<ffffffff81050379>] warn_slowpath_common+0x80/0x98
[    0.020369]  [<ffffffff810503a6>] warn_slowpath_null+0x15/0x17
[    0.020378]  [<ffffffff810034df>] xen_apic_write+0x15/0x17
[    0.020392]  [<ffffffff8101cb2b>] perf_events_lapic_init+0x2e/0x30
[    0.020410]  [<ffffffff81ee4dd0>] init_hw_perf_events+0x250/0x407
[    0.020419]  [<ffffffff81ee4b80>] ? check_bugs+0x2d/0x2d
[    0.020430]  [<ffffffff81002181>] do_one_initcall+0x7a/0x131
[    0.020444]  [<ffffffff81edbbf9>] kernel_init+0x91/0x15d
[    0.020456]  [<ffffffff817caaa4>] kernel_thread_helper+0x4/0x10
[    0.020471]  [<ffffffff817c347c>] ? retint_restore_args+0x5/0x6
[    0.020481]  [<ffffffff817caaa0>] ? gs_change+0x13/0x13
[    0.020500] ---[ end trace a7919e7f17c0a725 ]---

The new code will change every of the 16 low bits read from the
register and tries to write and read-back that modified number
from the MSR.
Signed-off-by: NAndre Przywara <andre.przywara@amd.com>
Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Avi Kivity <avi@redhat.com>
Link: http://lkml.kernel.org/r/1349797115-28346-2-git-send-email-andre.przywara@amd.comSigned-off-by: NIngo Molnar <mingo@kernel.org>

bffd5fc2

23 10月, 2012 1 次提交

KVM guest: exit idleness when handling KVM_PV_REASON_PAGE_NOT_PRESENT · c5e015d4

由 Sasha Levin 提交于 10月 19, 2012

KVM_PV_REASON_PAGE_NOT_PRESENT kicks cpu out of idleness, but we haven't
marked that spot as an exit from idleness.

Not doing so can cause RCU warnings such as:

[  732.788386] ===============================
[  732.789803] [ INFO: suspicious RCU usage. ]
[  732.790032] 3.7.0-rc1-next-20121019-sasha-00002-g6d8d02d-dirty #63 Tainted: G        W
[  732.790032] -------------------------------
[  732.790032] include/linux/rcupdate.h:738 rcu_read_lock() used illegally while idle!
[  732.790032]
[  732.790032] other info that might help us debug this:
[  732.790032]
[  732.790032]
[  732.790032] RCU used illegally from idle CPU!
[  732.790032] rcu_scheduler_active = 1, debug_locks = 1
[  732.790032] RCU used illegally from extended quiescent state!
[  732.790032] 2 locks held by trinity-child31/8252:
[  732.790032]  #0:  (&rq->lock){-.-.-.}, at: [<ffffffff83a67528>] __schedule+0x178/0x8f0
[  732.790032]  #1:  (rcu_read_lock){.+.+..}, at: [<ffffffff81152bde>] cpuacct_charge+0xe/0x200
[  732.790032]
[  732.790032] stack backtrace:
[  732.790032] Pid: 8252, comm: trinity-child31 Tainted: G        W    3.7.0-rc1-next-20121019-sasha-00002-g6d8d02d-dirty #63
[  732.790032] Call Trace:
[  732.790032]  [<ffffffff8118266b>] lockdep_rcu_suspicious+0x10b/0x120
[  732.790032]  [<ffffffff81152c60>] cpuacct_charge+0x90/0x200
[  732.790032]  [<ffffffff81152bde>] ? cpuacct_charge+0xe/0x200
[  732.790032]  [<ffffffff81158093>] update_curr+0x1a3/0x270
[  732.790032]  [<ffffffff81158a6a>] dequeue_entity+0x2a/0x210
[  732.790032]  [<ffffffff81158ea5>] dequeue_task_fair+0x45/0x130
[  732.790032]  [<ffffffff8114ae29>] dequeue_task+0x89/0xa0
[  732.790032]  [<ffffffff8114bb9e>] deactivate_task+0x1e/0x20
[  732.790032]  [<ffffffff83a67c29>] __schedule+0x879/0x8f0
[  732.790032]  [<ffffffff8117e20d>] ? trace_hardirqs_off+0xd/0x10
[  732.790032]  [<ffffffff810a37a5>] ? kvm_async_pf_task_wait+0x1d5/0x2b0
[  732.790032]  [<ffffffff83a67cf5>] schedule+0x55/0x60
[  732.790032]  [<ffffffff810a37c4>] kvm_async_pf_task_wait+0x1f4/0x2b0
[  732.790032]  [<ffffffff81139e50>] ? abort_exclusive_wait+0xb0/0xb0
[  732.790032]  [<ffffffff81139c25>] ? prepare_to_wait+0x25/0x90
[  732.790032]  [<ffffffff810a3a66>] do_async_page_fault+0x56/0xa0
[  732.790032]  [<ffffffff83a6a6e8>] async_page_fault+0x28/0x30
Signed-off-by: NSasha Levin <sasha.levin@oracle.com>
Acked-by: NGleb Natapov <gleb@redhat.com>
Acked-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: NAvi Kivity <avi@redhat.com>

c5e015d4

20 10月, 2012 2 次提交

perf/x86: Disable uncore on virtualized CPUs · a05123bd

由 Yan, Zheng 提交于 8月 21, 2012

Initializing uncore PMU on virtualized CPU may hang the kernel.
This is because kvm does not emulate the entire hardware. Thers
are lots of uncore related MSRs, making kvm enumerate them all
is a non-trival task. So just disable uncore on virtualized CPU.
Signed-off-by: NYan, Zheng <zheng.z.yan@intel.com>
Tested-by: NPekka Enberg <penberg@kernel.org>
Cc: a.p.zijlstra@chello.nl
Cc: eranian@google.com
Cc: andi@firstfloor.org
Cc: avi@redhat.com
Link: http://lkml.kernel.org/r/1345540117-14164-1-git-send-email-zheng.z.yan@intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>

a05123bd

xen/x86: don't corrupt %eip when returning from a signal handler · a349e23d

由 David Vrabel 提交于 10月 19, 2012

In 32 bit guests, if a userspace process has %eax == -ERESTARTSYS
(-512) or -ERESTARTNOINTR (-513) when it is interrupted by an event
/and/ the process has a pending signal then %eip (and %eax) are
corrupted when returning to the main process after handling the
signal.  The application may then crash with SIGSEGV or a SIGILL or it
may have subtly incorrect behaviour (depending on what instruction it
returned to).

The occurs because handle_signal() is incorrectly thinking that there
is a system call that needs to restarted so it adjusts %eip and %eax
to re-execute the system call instruction (even though user space had
not done a system call).

If %eax == -514 (-ERESTARTNOHAND (-514) or -ERESTART_RESTARTBLOCK
(-516) then handle_signal() only corrupted %eax (by setting it to
-EINTR).  This may cause the application to crash or have incorrect
behaviour.

handle_signal() assumes that regs->orig_ax >= 0 means a system call so
any kernel entry point that is not for a system call must push a
negative value for orig_ax.  For example, for physical interrupts on
bare metal the inverse of the vector is pushed and page_fault() sets
regs->orig_ax to -1, overwriting the hardware provided error code.

xen_hypervisor_callback() was incorrectly pushing 0 for orig_ax
instead of -1.

Classic Xen kernels pushed %eax which works as %eax cannot be both
non-negative and -RESTARTSYS (etc.), but using -1 is consistent with
other non-system call entry points and avoids some of the tests in
handle_signal().

There were similar bugs in xen_failsafe_callback() of both 32 and
64-bit guests. If the fault was corrected and the normal return path
was used then 0 was incorrectly pushed as the value for orig_ax.
Signed-off-by: NDavid Vrabel <david.vrabel@citrix.com>
Acked-by: NJan Beulich <JBeulich@suse.com>
Acked-by: NIan Campbell <ian.campbell@citrix.com>
Cc: stable@vger.kernel.org
Signed-off-by: NKonrad Rzeszutek Wilk <konrad.wilk@oracle.com>

a349e23d

19 10月, 2012 1 次提交

x86, MCE: Remove bios_cmci_threshold sysfs attribute · 5bc66170

由 Borislav Petkov 提交于 10月 18, 2012

450cc201 ("x86/mce: Provide boot argument to honour bios-set CMCI
threshold") added the bios_cmci_threshold sysfs attribute which was
supposed to communicate to userspace tools that BIOS CMCI threshold has
been honoured.

However, this info is not of any importance to userspace - it should
rather get the actual error count it has been thresholded already from
MCi_STATUS[38:52].

So drop this before it becomes a used interface (good thing we caught
this early in 3.7-rc1, right after the merge window closed).

Cc: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Acked-by: NTony Luck <tony.luck@intel.com>
Link: http://lkml.kernel.org/r/20121017105940.GA14590@x1.osrc.amd.comSigned-off-by: NBorislav Petkov <borislav.petkov@amd.com>

5bc66170

18 10月, 2012 2 次提交

x86, amd, mce: Avoid NULL pointer reference on CPU northbridge lookup · 21c5e50e

由 Daniel J Blueman 提交于 10月 01, 2012

When booting on a federated multi-server system (NumaScale), the
processor Northbridge lookup returns NULL; add guards to prevent this
causing an oops.

On those systems, the northbridge is accessed through MMIO and the
"normal" northbridge enumeration in amd_nb.c doesn't work since we're
generating the northbridge ID from the initial APIC ID and the last
is not unique on those systems. Long story short, we end up without
northbridge descriptors.
Signed-off-by: NDaniel J Blueman <daniel@numascale-asia.com>
Cc: stable@vger.kernel.org # 3.6
Link: http://lkml.kernel.org/r/1349073725-14093-1-git-send-email-daniel@numascale-asia.com
[ Boris: beef up commit message ]
Signed-off-by: NBorislav Petkov <borislav.petkov@amd.com>
Signed-off-by: NH. Peter Anvin <hpa@linux.intel.com>

21c5e50e

x86: Exclude E820_RESERVED regions and memory holes above 4 GB from direct mapping. · 1bbbbe77

由 Jacob Shin 提交于 10月 20, 2011

On systems with very large memory (1 TB in our case), BIOS may report a
reserved region or a hole in the E820 map, even above the 4 GB range. Exclude
these from the direct mapping.

[ hpa: this should be done not just for > 4 GB but for everything above the legacy
  region (1 MB), at the very least.  That, however, turns out to require significant
  restructuring.  That work is well underway, but is not suitable for rc/stable. ]

Cc: stable@kernel.org   # > 2.6.32
Signed-off-by: NJacob Shin <jacob.shin@amd.com>
Link: http://lkml.kernel.org/r/1319145326-13902-1-git-send-email-jacob.shin@amd.comSigned-off-by: NH. Peter Anvin <hpa@linux.intel.com>

1bbbbe77

16 10月, 2012 1 次提交

perf: Require exclude_guest to use PEBS - kernel side enforcement · 20b279dd

由 Peter Zijlstra 提交于 9月 13, 2012

Intel PEBS in VT-x context uses the DS address as a guest linear
address, even though its programmed by the host as a host linear
address. This either results in guest memory corruption and or the
hardware faulting and 'crashing' the virtual machine.  Therefore we have
to disable PEBS on VT-x enter and re-enable on VT-x exit, enforcing a
strict exclude_guest.

This patch enforces exclude_guest kernel side.
Signed-off-by: NPeter Zijlstra <peterz@infradead.org>
Cc: Avi Kivity <avi@redhat.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Gleb Natapov <gleb@redhat.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Robert Richter <robert.richter@amd.com>
Link: http://lkml.kernel.org/r/1347569955-54626-3-git-send-email-dsahern@gmail.comSigned-off-by: NDavid Ahern <dsahern@gmail.com>
Signed-off-by: NArnaldo Carvalho de Melo <acme@redhat.com>

20b279dd

13 10月, 2012 1 次提交
- A
  x86, um: convert to saner kernel_execve() semantics · 22e2430d
  由 Al Viro 提交于 10月 10, 2012
```
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
```
  22e2430d
12 10月, 2012 1 次提交

kgdb,x86: fix warning about unused variable · 42c12213

由 Jason Wessel 提交于 8月 08, 2012

When compiling without CONFIG_DEBUG_RODATA the following
compiler warning is generated:

arch/x86/kernel/kgdb.c: In function 'kgdb_arch_set_breakpoint':
arch/x86/kernel/kgdb.c:749: warning: unused variable 'opc'

The variable instantiation needs to be inside the #ifdef to
make the warning go away.
Reported-by: NThiago Rafael Becker <trbecker@trbecker.org>
Signed-off-by: NJason Wessel <jason.wessel@windriver.com>

42c12213

08 10月, 2012 1 次提交

uprobes/x86: Only rep+nop can be emulated correctly · b64b9c93

由 Oleg Nesterov 提交于 9月 29, 2012

__skip_sstep() correctly detects the "nontrivial" nop insns,
but since it doesn't update regs->ip we can not really skip
"0x0f 0x1f | 0x0f 0x19 | 0x87 0xc0", the probed application
is killed by SIGILL'ed handle_swbp().

Remove these additional checks. If we want to implement this
correctly we need to know the full insn length to update ->ip.

rep* + nop is fine even without updating ->ip.
Signed-off-by: NOleg Nesterov <oleg@redhat.com>
Acked-by: NSrikar Dronamraju <srikar@linux.vnet.ibm.com>

b64b9c93

06 10月, 2012 1 次提交

sections: fix section conflicts in arch/x86 · 75fdd155

由 Andi Kleen 提交于 10月 04, 2012

Signed-off-by: NAndi Kleen <ak@linux.intel.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

75fdd155

05 10月, 2012 1 次提交

perf/AMD/IBS: Add sysfs support · 2e132b12

由 Robert Richter 提交于 9月 12, 2012

Add sysfs format entries for AMD IBS PMUs:

 # find /sys/bus/event_source/devices/ibs_*/format
 /sys/bus/event_source/devices/ibs_fetch/format
 /sys/bus/event_source/devices/ibs_fetch/format/rand_en
 /sys/bus/event_source/devices/ibs_op/format
 /sys/bus/event_source/devices/ibs_op/format/cnt_ctl

This allows to specify following IBS options:

 $ perf record -e ibs_fetch/rand_en=1/GH ...
 $ perf record -e ibs_op/cnt_ctl=1/GH ...

Option cnt_ctl is only enabled if the IBS_CAPS_OPCNT bit is set in IBS
cpuid feature flags (AMD family 10h RevC and above).
Signed-off-by: NRobert Richter <robert.richter@amd.com>
Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1347447584-28405-1-git-send-email-robert.richter@amd.com
[ Added small readability improvements. ]
Signed-off-by: NIngo Molnar <mingo@kernel.org>

2e132b12

04 10月, 2012 1 次提交

perf/x86: Add support for Intel Xeon-Phi Knights Corner PMU · e717bf4e

由 Vince Weaver 提交于 9月 26, 2012

The following patch adds perf_event support for the Xeon-Phi
PMU, as documented in the "Intel Xeon Phi Coprocessor (codename:
Knights Corner) Performance Monitoring Units" manual.

Even though it is a co-processor, a Phi runs a full Linux
environment and can support performance counters.

This is just barebones support, it does not add support for
interesting new features such as the SPFLT intruction that
allows starting/stopping events without entering the kernel.

The PMU internally is just like that of an original Pentium, but
a "P6-like" MSR interface is provided.  The interface is
different enough from a real P6 that it's not easy (or
practical) to re-use the code in  perf_event_p6.c
Acked-by: NLawrence F Meadows <lawrence.f.meadows@intel.com>
Acked-by: NCyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: NVince Weaver <vincent.weaver@maine.edu>
Acked-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Cc: eranian@gmail.com
Cc: Lawrence F <lawrence.f.meadows@intel.com>
Link: http://lkml.kernel.org/r/alpine.DEB.2.02.1209261405320.8398@vincent-weaver-1.um.maine.eduSigned-off-by: NIngo Molnar <mingo@kernel.org>

e717bf4e

openanolis / cloud-kernel 接近 2 年 前同步成功

openanolis / cloud-kernel
接近 2 年前同步成功