- 27 Dec 2011, 16 commits
-
-
Authored by Chris Wright
The host side pv mmu support has been marked for feature removal in January 2011. It's not in use, is slower than shadow or hardware assisted paging, and a maintenance burden. It's November 2011, time to remove it. Signed-off-by: Chris Wright <chrisw@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>
-
Authored by Chris Wright
This has not been used for some years now. It's time to remove it. Signed-off-by: Chris Wright <chrisw@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>
-
Authored by Jan Kiszka
The vcpu reference of a kvm_timer can't become NULL while the timer is valid, so drop this redundant test. This also makes it pointless to carry a separate __kvm_timer_fn, fold it into kvm_timer_fn. Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
-
Authored by Xiao Guangrong
Write-flooding detection does not work well. When we handle a page write, we treat the page as write-flooded if the last speculative spte has not been accessed. However, speculative sptes are created on many paths (e.g. pte prefetch, page sync), so the last speculative spte may not point to the written page, and the written page may be accessed via other sptes; relying on the Accessed bit of the last speculative spte is therefore not enough. Instead of detecting whether the page was accessed, detect whether the spte is accessed after it is written: if the spte is written frequently but never accessed, treat the page as not being a page table, or as one not used for a long time. Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com> Signed-off-by: Avi Kivity <avi@redhat.com>
-
Authored by Xiao Guangrong
Sometimes we only modify the last byte of a pte to update a status bit; for example, clear_bit() is used to clear the r/w bit in the Linux kernel, and the 'andb' instruction is used in that function. In this case kvm_mmu_pte_write treats the access as misaligned, and the shadow page table is zapped. Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com> Signed-off-by: Avi Kivity <avi@redhat.com>
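A minimal sketch of the kind of alignment check being relaxed, assuming a hypothetical helper name; the real logic lives in kvm_mmu_pte_write() in arch/x86/kvm/mmu.c and differs in detail:

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: decide whether a guest write of 'bytes' bytes at
 * guest physical address 'gpa' should be treated as a misaligned PTE
 * write.  'pte_size' is 4 or 8 depending on the guest paging mode.
 * A one-byte status-bit update (e.g. the "andb" emitted by clear_bit())
 * stays inside a single PTE and must not be flagged as misaligned.
 */
static bool pte_write_misaligned(uint64_t gpa, unsigned int bytes,
                                 unsigned int pte_size)
{
    unsigned int offset = gpa & (pte_size - 1);

    /* Only a write that spills across a PTE boundary is misaligned. */
    return offset + bytes > pte_size;
}
```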
-
Authored by Xiao Guangrong
kvm_mmu_pte_write is too long; split it up for better readability. Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com> Signed-off-by: Avi Kivity <avi@redhat.com>
-
Authored by Xiao Guangrong
In kvm_mmu_pte_write we do not need to allocate shadow pages, so calling kvm_mmu_free_some_pages there is really unnecessary. Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com> Signed-off-by: Avi Kivity <avi@redhat.com>
-
Authored by Xiao Guangrong
Fast prefetch sptes for the unsync shadow page on the invlpg path. Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com> Signed-off-by: Avi Kivity <avi@redhat.com>
-
Authored by Xiao Guangrong
Directly use mmu_page_zap_pte to zap sptes in FNAME(invlpg), and remove the code duplicated between FNAME(invlpg) and FNAME(sync_page). Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com> Signed-off-by: Avi Kivity <avi@redhat.com>
-
Authored by Xiao Guangrong
In the current code the accessed bit is always set when a page fault occurs, so there is no need to set it on the pte write path. Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com> Signed-off-by: Avi Kivity <avi@redhat.com>
-
Authored by Xiao Guangrong
Remove the code duplicated between emulator_pio_in_emulated and emulator_pio_out_emulated. Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com> Signed-off-by: Avi Kivity <avi@redhat.com>
-
Authored by Xiao Guangrong
If the emulation is caused by a #PF and the faulting instruction is not a page-table-writing instruction, the VM exit was caused by shadow page protection; we can zap the shadow page and retry the instruction directly. The idea is from Avi. Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com> Signed-off-by: Avi Kivity <avi@redhat.com>
-
Authored by Xiao Guangrong
The idea is from Avi: | tag instructions that are typically used to modify the page tables, and | drop shadow if any other instruction is used. | The list would include, I'd guess, and, or, bts, btc, mov, xchg, cmpxchg, | and cmpxchg8b. This patch tags those instructions; later, on the write path, the shadow page is dropped if it is written by any other instruction. Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com> Signed-off-by: Avi Kivity <avi@redhat.com>
-
Authored by Xiao Guangrong
kvm_mmu_pte_write is unsafe because it needs to allocate pte_list_desc objects when sptes are prefetched. Unfortunately, we cannot know how many sptes will be prefetched on this path, which means we can run out of free pte_list_desc objects in the cache and trigger the BUG_ON(); in addition, some paths do not fill the cache at all, such as emulation of an INS instruction that does not trigger a page fault. Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com> Signed-off-by: Avi Kivity <avi@redhat.com>
-
Authored by Nadav Har'El
When L0 wishes to inject an interrupt while L2 is running, it emulates an exit to L1 with EXIT_REASON_EXTERNAL_INTERRUPT. This was explained in the original nVMX patch 23, titled "Correct handling of interrupt injection". Unfortunately, it is possible (though rare) that at this point there is valid idt_vectoring_info in vmcs02. For example, L1 injected some interrupt to L2, and when L2 tried to run this interrupt's handler, it got a page fault - so it returns the original interrupt vector in idt_vectoring_info. The problem is that if this is the case, we cannot exit to L1 with EXTERNAL_INTERRUPT like we wished to, because the VMX spec guarantees that idt_vectoring_info and exit_reason_external_interrupt can never happen together. This is not just specified in the spec - a KVM L1 actually prints a kernel warning "unexpected, valid vectoring info" if we violate this guarantee, and some users noticed these warnings in L1's logs. In order to better emulate a processor, which would never return the external interrupt and the idt-vectoring-info together, we need to separate the two injection steps: First, complete L1's injection into L2 (i.e., enter L2, injecting to it the idt-vectoring-info); Second, after entry into L2 succeeds and it exits back to L0, exit to L1 with the EXIT_REASON_EXTERNAL_INTERRUPT. Most of this is already in the code - the only change we need is to remain in L2 (and not exit to L1) in this case. Note that the previous patch ensures (by using KVM_REQ_IMMEDIATE_EXIT) that although we do enter L2 first, it will exit immediately after processing its injection, allowing us to promptly inject to L1. Note how we test vmcs12->idt_vectoring_info_field; This isn't really the vmcs12 value (we haven't exited to L1 yet, so vmcs12 hasn't been updated), but rather the place we save, at the end of vmx_vcpu_run, the vmcs02 value of this field. This was explained in patch 25 ("Correct handling of idt vectoring info") of the original nVMX patch series. Thanks to Dave Allan and to Federico Simoncelli for reporting this bug, to Abel Gordon for helping me figure out the solution, and to Avi Kivity for helping to improve it. Signed-off-by: Nadav Har'El <nyh@il.ibm.com> Signed-off-by: Avi Kivity <avi@redhat.com>
-
Authored by Nadav Har'El
This patch adds a new vcpu->requests bit, KVM_REQ_IMMEDIATE_EXIT. This bit requests that when next entering the guest, we should run it only for as little as possible, and exit again. We use this new option in nested VMX: When L1 launches L2, but L0 wishes L1 to continue running so it can inject an event to it, we unfortunately cannot just pretend to have run L2 for a little while - We must really launch L2, otherwise certain one-off vmcs12 parameters (namely, L1 injection into L2) will be lost. So the existing code runs L2 in this case. But L2 could potentially run for a long time until it exits, and the injection into L1 will be delayed. The new KVM_REQ_IMMEDIATE_EXIT allows us to request that L2 will be entered, as necessary, but will exit as soon as possible after entry. Our implementation of this request uses smp_send_reschedule() to send a self-IPI, with interrupts disabled. The interrupts remain disabled until the guest is entered, and then, after the entry is complete (often including processing an injection and jumping to the relevant handler), the physical interrupt is noticed and causes an exit. On recent Intel processors, we could have achieved the same goal by using MTF instead of a self-IPI. Another technique worth considering in the future is to use VM_EXIT_ACK_INTR_ON_EXIT and a highest-priority vector IPI - to slightly improve performance by avoiding the useless interrupt handler which ends up being called when smp_send_reschedule() is used. Signed-off-by: Nadav Har'El <nyh@il.ibm.com> Signed-off-by: Avi Kivity <avi@redhat.com>
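A hedged, kernel-style sketch of the request pattern described above (not the literal upstream diff); kvm_make_request()/kvm_check_request() and smp_send_reschedule() are existing kernel primitives, while the two wrapper functions are illustrative:

```c
/* Somewhere in the nested-VMX injection path: ask that the next guest
 * entry exits again as soon as possible. */
static void nested_vmx_request_immediate_exit(struct kvm_vcpu *vcpu)
{
	kvm_make_request(KVM_REQ_IMMEDIATE_EXIT, vcpu);
}

/* Conceptually inside vcpu_enter_guest(), with interrupts disabled: */
static void honour_immediate_exit(struct kvm_vcpu *vcpu)
{
	if (kvm_check_request(KVM_REQ_IMMEDIATE_EXIT, vcpu))
		/*
		 * The self-IPI stays pending while interrupts are off; once
		 * the guest is entered (and any injection is processed) the
		 * pending physical interrupt forces an immediate VM exit.
		 */
		smp_send_reschedule(vcpu->cpu);
}
```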
-
- 26 Dec 2011, 1 commit
-
-
Authored by Jan Kiszka
Unlike all of the other cpuid bits, the TSC deadline timer bit is set unconditionally, regardless of what userspace wants. This is broken in several ways:
- if userspace doesn't use KVM_CREATE_IRQCHIP, and doesn't emulate the TSC deadline timer feature, a guest that uses the feature will break
- live migration to older host kernels that don't support the TSC deadline timer will cause the feature to be pulled from under the guest's feet, breaking it
- guests that are broken wrt the feature will fail.
Fix by not enabling the feature automatically; instead report it to userspace. Because the feature depends on KVM_CREATE_IRQCHIP, which we cannot guarantee will be called, we expose it via KVM_CAP_TSC_DEADLINE_TIMER and not KVM_GET_SUPPORTED_CPUID. Fixes the Illumos guest kernel, which uses the TSC deadline timer feature. [avi: add the KVM_CAP + documentation] Reported-by: Alexey Zaytsev <alexey.zaytsev@gmail.com> Tested-by: Alexey Zaytsev <alexey.zaytsev@gmail.com> Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Avi Kivity <avi@redhat.com>
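From userspace the new capability can be probed with KVM_CHECK_EXTENSION before wiring the CPUID bit into the guest; a minimal sketch under that assumption (the CPUID plumbing itself is omitted):

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

int main(void)
{
	int kvm = open("/dev/kvm", O_RDWR);
	if (kvm < 0) {
		perror("open /dev/kvm");
		return 1;
	}

	/* > 0 means the host kernel can emulate the TSC deadline timer. */
	int ret = ioctl(kvm, KVM_CHECK_EXTENSION, KVM_CAP_TSC_DEADLINE_TIMER);
	printf("KVM_CAP_TSC_DEADLINE_TIMER: %s\n",
	       ret > 0 ? "available" : "not available");
	return 0;
}
```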
-
- 25 Dec 2011, 1 commit
-
-
Authored by Jan Kiszka
User space may create the PIT and forget about setting up the irqchips. In that case, firing PIT IRQs will crash the host:
BUG: unable to handle kernel NULL pointer dereference at 0000000000000128
IP: [<ffffffffa10f6280>] kvm_set_irq+0x30/0x170 [kvm]
...
Call Trace:
[<ffffffffa11228c1>] pit_do_work+0x51/0xd0 [kvm]
[<ffffffff81071431>] process_one_work+0x111/0x4d0
[<ffffffff81071bb2>] worker_thread+0x152/0x340
[<ffffffff81075c8e>] kthread+0x7e/0x90
[<ffffffff815a4474>] kernel_thread_helper+0x4/0x10
Prevent this by checking the irqchip mode before starting a timer. We can't deny creating the PIT if the irqchips aren't set up yet, as current user land expects this order to work. Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
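A hedged sketch of the guard being added; irqchip_in_kernel() is the real KVM predicate, while the surrounding function and the timer helper are illustrative stand-ins for the PIT code:

```c
/* Kernel-style illustration, not the literal upstream patch. */
static void pit_load_count_checked(struct kvm *kvm, struct kvm_pit *pit)
{
	/*
	 * User space may create the PIT before any irqchip exists.  Without
	 * an in-kernel irqchip there is nowhere to deliver the PIT IRQ and
	 * kvm_set_irq() would dereference a NULL pointer, so do not start
	 * the timer at all in that case.
	 */
	if (!irqchip_in_kernel(kvm))
		return;

	pit_start_timer(pit);	/* hypothetical helper for the real timer setup */
}
```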
-
- 20 Dec 2011, 2 commits
-
-
Authored by Clemens Ladisch
When printing the code bytes in show_registers(), the markers around the byte at the fault address could make the printk() format string look like a valid log level and facility code. This would prevent this byte from being printed and result in a spurious newline:
[ 7555.765589] Code: 8b 32 e9 94 00 00 00 81 7d 00 ff 00 00 00 0f 87 96 00 00 00 48 8b 83 c0 00 00 00 44 89 e2 44 89 e6 48 89 df 48 8b 80 d8 02 00 00
[ 7555.765683] 8b 48 28 48 89 d0 81 e2 ff 0f 00 00 48 c1 e8 0c 48 c1 e0 04
Add KERN_CONT where needed, and elsewhere in show_registers() for consistency. Signed-off-by: Clemens Ladisch <clemens@ladisch.de> Link: http://lkml.kernel.org/r/4EEFA7AE.9020407@ladisch.de Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
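A small illustration of the fix, assuming nothing beyond the standard printk() level macros: every continuation piece carries KERN_CONT, so a marker such as "<8b>" can no longer be mistaken for a log level/facility prefix:

```c
/* Kernel-style illustration, not the literal show_registers() code. */
static void print_code_bytes(const unsigned char *code, int len, int fault_idx)
{
	int i;

	printk(KERN_EMERG "Code: ");
	for (i = 0; i < len; i++) {
		if (i == fault_idx)
			printk(KERN_CONT "<%02x> ", code[i]);
		else
			printk(KERN_CONT "%02x ", code[i]);
	}
	printk(KERN_CONT "\n");
}
```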
-
Authored by Markus Kötter
x86 jump instruction size is 2 or 5 bytes (near/long jump), not 2 or 6 bytes. In case a conditional jump is followed by a long jump, the conditional jump target ends up one byte past the start of the target instruction. Signed-off-by: Markus Kötter <nepenthesdev@gmail.com> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
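A hedged sketch of the size bookkeeping involved, not the bpf_jit_comp.c code itself: an unconditional x86 jmp is either 2 bytes (EB rel8) or 5 bytes (E9 rel32), never 6, and sizing it wrongly shifts every following jump target by one byte:

```c
#include <stdint.h>

/*
 * Length in bytes of an unconditional jmp emitted at 'addr' that should
 * land on 'target'.  The rel8/rel32 displacement is measured from the
 * end of the instruction, i.e. addr + length.
 */
static int jmp_insn_len(uint32_t addr, uint32_t target)
{
	int32_t rel8 = (int32_t)(target - (addr + 2));

	if (rel8 >= -128 && rel8 <= 127)
		return 2;	/* EB rel8  - short jump */
	return 5;		/* E9 rel32 - near jump  */
}
```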
-
- 16 Dec 2011, 1 commit
-
-
Authored by Ian Campbell
d312ae87 "xen: use maximum reservation to limit amount of usable RAM" clamped the total amount of RAM to the current maximum reservation. This is correct for dom0 but is not correct for guest domains. In order to boot a guest "pre-ballooned" (e.g. with memory=1G but maxmem=2G) in order to allow for future memory expansion the guest must derive max_pfn from the e820 provided by the toolstack and not the current maximum reservation (which can reflect only the current maximum, not the guest lifetime max). The existing algorithm already behaves this correctly if we do not artificially limit the maximum number of pages for the guest case. For a guest booted with maxmem=512, memory=128 this results in: [ 0.000000] BIOS-provided physical RAM map: [ 0.000000] Xen: 0000000000000000 - 00000000000a0000 (usable) [ 0.000000] Xen: 00000000000a0000 - 0000000000100000 (reserved) -[ 0.000000] Xen: 0000000000100000 - 0000000008100000 (usable) -[ 0.000000] Xen: 0000000008100000 - 0000000020800000 (unusable) +[ 0.000000] Xen: 0000000000100000 - 0000000020800000 (usable) ... [ 0.000000] NX (Execute Disable) protection: active [ 0.000000] DMI not present or invalid. [ 0.000000] e820 update range: 0000000000000000 - 0000000000010000 (usable) ==> (reserved) [ 0.000000] e820 remove range: 00000000000a0000 - 0000000000100000 (usable) -[ 0.000000] last_pfn = 0x8100 max_arch_pfn = 0x1000000 +[ 0.000000] last_pfn = 0x20800 max_arch_pfn = 0x1000000 [ 0.000000] initial memory mapped : 0 - 027ff000 [ 0.000000] Base memory trampoline at [c009f000] 9f000 size 4096 -[ 0.000000] init_memory_mapping: 0000000000000000-0000000008100000 -[ 0.000000] 0000000000 - 0008100000 page 4k -[ 0.000000] kernel direct mapping tables up to 8100000 @ 27bb000-27ff000 +[ 0.000000] init_memory_mapping: 0000000000000000-0000000020800000 +[ 0.000000] 0000000000 - 0020800000 page 4k +[ 0.000000] kernel direct mapping tables up to 20800000 @ 26f8000-27ff000 [ 0.000000] xen: setting RW the range 27e8000 - 27ff000 [ 0.000000] 0MB HIGHMEM available. -[ 0.000000] 129MB LOWMEM available. -[ 0.000000] mapped low ram: 0 - 08100000 -[ 0.000000] low ram: 0 - 08100000 +[ 0.000000] 520MB LOWMEM available. +[ 0.000000] mapped low ram: 0 - 20800000 +[ 0.000000] low ram: 0 - 20800000 With this change "xl mem-set <domain> 512M" will successfully increase the guest RAM (by reducing the balloon). There is no change for dom0. Reported-and-Tested-by: NGeorge Shuklin <george.shuklin@gmail.com> Signed-off-by: NIan Campbell <ian.campbell@citrix.com> Cc: stable@kernel.org Reviewed-by: NDavid Vrabel <david.vrabel@citrix.com> Signed-off-by: NKonrad Rzeszutek Wilk <konrad.wilk@oracle.com>
-
- 13 Dec 2011, 1 commit
-
-
Authored by Keith Packard
This hangs my MacBook Air at boot time; I get no console messages at all. I reverted this on top of -rc5 and my machine boots again. This reverts commit e8c71062. Signed-off-by: Matt Fleming <matt.fleming@intel.com> Signed-off-by: Keith Packard <keithp@keithp.com> Acked-by: H. Peter Anvin <hpa@zytor.com> Cc: Matthew Garrett <mjg@redhat.com> Cc: Zhang Rui <rui.zhang@intel.com> Cc: Huang Ying <huang.ying.caritas@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/r/1321621751-3650-1-git-send-email-matt@console Signed-off-by: Ingo Molnar <mingo@elte.hu>
-
- 10 Dec 2011, 1 commit
-
-
Authored by Matt Fleming
efi_call_phys_prelog() sets up a 1:1 mapping of the physical address range in swapper_pg_dir. Instead of replacing then restoring entries in swapper_pg_dir we should be using initial_page_table which already contains the 1:1 mapping. It's safe to blindly switch back to swapper_pg_dir in the epilog because the physical EFI routines are only called before efi_enter_virtual_mode(), e.g. before any user processes have been forked. Therefore, we don't need to track which pgd was in %cr3 when we entered the prelog. The previous code actually contained a bug because it assumed that the kernel was loaded at a physical address within the first 8MB of ram, usually at 0x100000. However, this isn't the case with a CONFIG_RELOCATABLE=y kernel which could have been loaded anywhere in the physical address space. Also delete the ancient (and bogus) comments about the page table being restored after the lock is released. There is no locking. Cc: Matthew Garrett <mjg@redhat.com> Cc: Darrent Hart <dvhart@linux.intel.com> Signed-off-by: Matt Fleming <matt.fleming@intel.com> Link: http://lkml.kernel.org/r/1323346250.3894.74.camel@mfleming-mobl1.ger.corp.intel.com Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
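A hedged sketch of the pgd switch described above; load_cr3(), __flush_tlb_all(), initial_page_table and swapper_pg_dir are real 32-bit kernel symbols, but the two functions are simplified stand-ins for efi_call_phys_prelog()/efi_call_phys_epilog():

```c
/* Kernel-style illustration, not the literal upstream patch. */

static void efi_phys_prelog_sketch(void)
{
	/*
	 * initial_page_table already contains the 1:1 physical mapping,
	 * so switch to it instead of patching entries in swapper_pg_dir.
	 */
	load_cr3(initial_page_table);
	__flush_tlb_all();
}

static void efi_phys_epilog_sketch(void)
{
	/*
	 * Blindly switching back is safe: the physical EFI calls happen
	 * before efi_enter_virtual_mode(), i.e. before any user process
	 * exists, so %cr3 can only have held swapper_pg_dir.
	 */
	load_cr3(swapper_pg_dir);
	__flush_tlb_all();
}
```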
-
- 09 Dec 2011, 3 commits
-
-
Authored by Youquan Song
With the 3.2-rc kernel, IOMMU 2M pages in KVM work. But when I tried to use IOMMU 1GB pages in KVM, I encountered an oops and the 1GB page failed to be used. The root cause is that 1GB page allocation calls gup_huge_pud() while the 2M page path calls gup_huge_pmd(). If compound pages are used and the page is a tail page, gup_huge_pmd() increases _mapcount to record that tail pages are mapped, while gup_huge_pud() does not do that. So when the mapped page is released, it results in a kernel oops because the page is not marked mapped. This patch adds tail-page processing for compound pages in the 1GB huge page path, keeping it in line with the 2M page path. Reproduce like:
1. Add grub boot option: hugepagesz=1G hugepages=8
2. mount -t hugetlbfs -o pagesize=1G hugetlbfs /dev/hugepages
3. qemu-kvm -m 2048 -hda os-kvm.img -cpu kvm64 -smp 4 -mem-path /dev/hugepages -net none -device pci-assign,host=07:00.1
kernel BUG at mm/swap.c:114!
invalid opcode: 0000 [#1] SMP
Call Trace:
put_page+0x15/0x37
kvm_release_pfn_clean+0x31/0x36
kvm_iommu_put_pages+0x94/0xb1
kvm_iommu_unmap_memslots+0x80/0xb6
kvm_assign_device+0xba/0x117
kvm_vm_ioctl_assigned_device+0x301/0xa47
kvm_vm_ioctl+0x36c/0x3a2
do_vfs_ioctl+0x49e/0x4e4
sys_ioctl+0x5a/0x7c
system_call_fastpath+0x16/0x1b
RIP put_compound_page+0xd4/0x168
Signed-off-by: Youquan Song <youquan.song@intel.com> Reviewed-by: Andrea Arcangeli <aarcange@redhat.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Authored by Matt Fleming
If we encounter an efi_memory_desc_t without EFI_MEMORY_WB set in ->attribute we currently call set_memory_uc(), which in turn calls __pa() on a potentially ioremap'd address. On CONFIG_X86_32 this is invalid, resulting in the following oops on some machines:
BUG: unable to handle kernel paging request at f7f22280
IP: [<c10257b9>] reserve_ram_pages_type+0x89/0x210
[...]
Call Trace:
[<c104f8ca>] ? page_is_ram+0x1a/0x40
[<c1025aff>] reserve_memtype+0xdf/0x2f0
[<c1024dc9>] set_memory_uc+0x49/0xa0
[<c19334d0>] efi_enter_virtual_mode+0x1c2/0x3aa
[<c19216d4>] start_kernel+0x291/0x2f2
[<c19211c7>] ? loglevel+0x1b/0x1b
[<c19210bf>] i386_start_kernel+0xbf/0xc8
A better approach to this problem is to map the memory region with the correct attributes from the start, instead of modifying it after the fact. The uncached case can be handled by ioremap_nocache() and the cached by ioremap_cache(). Despite first impressions, it's not possible to use ioremap_cache() to map all cached memory regions on CONFIG_X86_64 because EFI_RUNTIME_SERVICES_DATA regions really don't like being mapped into the vmalloc space, as detailed in the following bug report, https://bugzilla.redhat.com/show_bug.cgi?id=748516 Therefore, we need to ensure that any EFI_RUNTIME_SERVICES_DATA regions are covered by the direct kernel mapping table on CONFIG_X86_64. To accomplish this we now map E820_RESERVED_EFI regions via the direct kernel mapping with the initial call to init_memory_mapping() in setup_arch(), whereas previously these regions wouldn't be mapped if they were after the last E820_RAM region until efi_ioremap() was called. Doing it this way allows us to delete efi_ioremap() completely. Signed-off-by: Matt Fleming <matt.fleming@intel.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Matthew Garrett <mjg@redhat.com> Cc: Zhang Rui <rui.zhang@intel.com> Cc: Huang Ying <huang.ying.caritas@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/r/1321621751-3650-1-git-send-email-matt@console-pimps.org Signed-off-by: Ingo Molnar <mingo@elte.hu>
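A hedged sketch of the mapping decision on the 32-bit side; ioremap_cache()/ioremap_nocache(), EFI_MEMORY_WB and EFI_PAGE_SHIFT are real interfaces, while the helper name is hypothetical:

```c
/* Kernel-style illustration, not the literal upstream patch. */
static void __iomem *efi_map_region_sketch(efi_memory_desc_t *md)
{
	unsigned long size = md->num_pages << EFI_PAGE_SHIFT;

	/*
	 * Map with the right attributes from the start rather than calling
	 * set_memory_uc() on an already-ioremap'd address afterwards.
	 */
	if (md->attribute & EFI_MEMORY_WB)
		return ioremap_cache(md->phys_addr, size);

	return ioremap_nocache(md->phys_addr, size);
}
```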
-
Authored by Mark Langsdorf
When HPET is operating in RTC mode, the TN_ENABLE bit on timer1 controls whether the HPET or the RTC delivers interrupts to irq8. When the system goes into suspend, the RTC driver sends a signal to the HPET driver so that the HPET releases control of irq8, allowing the RTC to wake the system from suspend. The switchover is accomplished by a write to the HPET configuration registers which currently only occurs while servicing the HPET interrupt. On some systems, I have seen the system suspend before an HPET interrupt occurs, preventing the write to the HPET configuration register and leaving the HPET in control of the irq8. As the HPET is not active during suspend, it does not generate a wake signal and RTC alarms do not work. This patch forces the HPET driver to immediately transfer control of the irq8 channel to the RTC instead of waiting until the next interrupt event. Signed-off-by: Mark Langsdorf <mark.langsdorf@amd.com> Link: http://lkml.kernel.org/r/20111118153306.GB16319@alberich.amd.com Tested-by: Andreas Herrmann <andreas.herrmann3@amd.com> Signed-off-by: Andreas Herrmann <andreas.herrmann3@amd.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org
-
- 06 Dec 2011, 5 commits
-
-
Authored by Alan Cox
If we select a symbol it should have a type declared first, otherwise in some situations the config tools get upset. They are currently perhaps a bit too resilient, which is why this wasn't noticed initially. Signed-off-by: Alan Cox <alan@linux.intel.com> Link: http://lkml.kernel.org/r/20111206132811.4041.32549.stgit@bob.linux.org.uk Signed-off-by: Ingo Molnar <mingo@elte.hu>
-
Authored by Alan Cox
We currently fail to build with CONFIG_X86_INTEL_MID=y and CONFIG_X86_MRST unset. We could build all the bits to make generic MID work if you picked the MID platform alone, but that's really silly. Instead use select and two variables. This looks a bit daft right now, but once we add a Medfield selection it'll start to look a good deal more sensible. Reported-by: Ingo Molnar <mingo@elte.hu> Reported-by: Stanislaw Gruszka <sgruszka@redhat.com> Signed-off-by: Alan Cox <alan@linux.intel.com> Link: http://lkml.kernel.org/r/20111205231433.28811.51297.stgit@bob.linux.org.uk Signed-off-by: Ingo Molnar <mingo@elte.hu>
-
Authored by Andreas Herrmann
I've received complaints that the numa_node attribute for family 15h model 00-0fh (e.g. Interlagos) northbridge functions shows -1 instead of the proper node ID. Correct this with attached quirks (similar to quirks for other AMD CPU families used in multi-socket systems). Signed-off-by: Andreas Herrmann <andreas.herrmann3@amd.com> Cc: Frank Arnold <frank.arnold@amd.com> Cc: Borislav Petkov <borislav.petkov@amd.com> Link: http://lkml.kernel.org/r/20111202072143.GA31916@alberich.amd.com Signed-off-by: Ingo Molnar <mingo@elte.hu>
-
Authored by Mathias Nyman
Intel MID x86 platforms have a memory mapped virtual RTC instead. No MID platform has the default ports (and accessing them may do weird stuff). Signed-off-by: Mathias Nyman <mathias.nyman@linux.intel.com> Signed-off-by: Alan Cox <alan@linux.intel.com> Cc: feng.tang@intel.com Cc: Feng Tang <feng.tang@intel.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>
-
Authored by Konrad Rzeszutek Wilk
Fix an outstanding issue that has been reported since 2.6.37. Under a heavily loaded machine, processing "fork()" calls could crash with:
BUG: unable to handle kernel paging request at f573fc8c
IP: [<c01abc54>] swap_count_continued+0x104/0x180
*pdpt = 000000002a3b9027 *pde = 0000000001bed067 *pte = 0000000000000000
Oops: 0000 [#1] SMP
Modules linked in:
Pid: 1638, comm: apache2 Not tainted 3.0.4-linode37 #1
EIP: 0061:[<c01abc54>] EFLAGS: 00210246 CPU: 3
EIP is at swap_count_continued+0x104/0x180
.. snip..
Call Trace:
[<c01ac222>] ? __swap_duplicate+0xc2/0x160
[<c01040f7>] ? pte_mfn_to_pfn+0x87/0xe0
[<c01ac2e4>] ? swap_duplicate+0x14/0x40
[<c01a0a6b>] ? copy_pte_range+0x45b/0x500
[<c01a0ca5>] ? copy_page_range+0x195/0x200
[<c01328c6>] ? dup_mmap+0x1c6/0x2c0
[<c0132cf8>] ? dup_mm+0xa8/0x130
[<c013376a>] ? copy_process+0x98a/0xb30
[<c013395f>] ? do_fork+0x4f/0x280
[<c01573b3>] ? getnstimeofday+0x43/0x100
[<c010f770>] ? sys_clone+0x30/0x40
[<c06c048d>] ? ptregs_clone+0x15/0x48
[<c06bfb71>] ? syscall_call+0x7/0xb
The problem is that in copy_page_range() we turn lazy mode on, and then in swap_entry_free() we call swap_count_continued() which ends up in:
map = kmap_atomic(page, KM_USER0) + offset;
and then later we touch *map. Since we are running in batched mode (lazy) we don't actually set up the PTE mappings and the kmap_atomic is not done synchronously and ends up trying to dereference a page that has not been set. Looking at kmap_atomic_prot_pfn(), it uses 'arch_flush_lazy_mmu_mode' and doing the same in kmap_atomic_prot() and __kunmap_atomic() makes the problem go away. Interestingly, commit b8bcfe99 ("x86/paravirt: remove lazy mode in interrupts") removed part of this to fix an interrupt issue - but it went too far and did not consider this scenario. Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>
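A hedged sketch of the fix, trimmed to the relevant lines; kmap_atomic_prot() and arch_flush_lazy_mmu_mode() are real 32-bit highmem/paravirt interfaces, and the fixmap bookkeeping shown here is simplified:

```c
/* Kernel-style illustration, not the literal highmem_32.c code. */
void *kmap_atomic_prot_sketch(struct page *page, pgprot_t prot)
{
	unsigned long vaddr;
	int idx, type;

	/* ... PageHighMem check and fixmap slot selection elided ... */
	type = kmap_atomic_idx_push();
	idx = type + KM_TYPE_NR * smp_processor_id();
	vaddr = __fix_to_virt(FIX_KMAP_BEGIN + idx);
	set_pte(kmap_pte - idx, mk_pte(page, prot));
	/*
	 * Under a paravirt lazy-MMU batch (e.g. Xen inside fork()'s
	 * copy_page_range()) the set_pte() above may still be queued;
	 * flush it so the mapping is usable immediately.
	 */
	arch_flush_lazy_mmu_mode();

	return (void *)vaddr;
}
```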
-
- 05 Dec 2011, 9 commits
-
-
Authored by Peter Chubb
Looks like on some Acer Aspire 1s with older BIOSes, reboot via the BIOS fails. It works on my machine (with BIOS version 0.3310) but not on some others (BIOS version 0.3309). There's a log of problems at: https://bbs.archlinux.org/viewtopic.php?id=124136 This patch adds a different callback to the reboot quirk table, to allow rebooting via the keyboard controller. Reported-by: Uroš Vampl <mobile.leecher@gmail.com> Tested-by: Vasily Khoruzhick <anarsoul@gmail.com> Signed-off-by: Peter Chubb <peter.chubb@nicta.com.au> Cc: Don Zickus <dzickus@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: stable@kernel.org Link: http://lkml.kernel.org/r/1323093233-9481-1-git-send-email-anarsoul@gmail.com Signed-off-by: Ingo Molnar <mingo@elte.hu>
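A hedged sketch of what such a quirk-table entry looks like; struct dmi_system_id and DMI_MATCH() are the real interfaces used by the reboot quirks, while the callback name and the DMI strings below are assumptions for illustration:

```c
/* Kernel-style illustration; callback name and DMI strings are assumed. */
static struct dmi_system_id __initdata reboot_kbd_quirk_sketch[] = {
	{	/* Acer Aspire One: BIOS reboot hangs, fall back to the KBC. */
		.callback = set_kbd_reboot,	/* assumed helper name */
		.ident = "Acer Aspire One A110",
		.matches = {
			DMI_MATCH(DMI_SYS_VENDOR, "Acer"),
			DMI_MATCH(DMI_PRODUCT_NAME, "AOA110"),
		},
	},
	{ }
};
```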
-
Authored by Ajaykumar Hotchandani
The following is from the Notes of section 11.5.3 of the Intel processor manual available at: http://www.intel.com/Assets/PDF/manual/325384.pdf
For the Pentium 4 and Intel Xeon processors, after the sequence of steps given above has been executed, the cache lines containing the code between the end of the WBINVD instruction and before the MTRRS have actually been disabled may be retained in the cache hierarchy. Here, to remove code from the cache completely, a second WBINVD instruction must be executed after the MTRRs have been disabled.
This patch provides the resolution for that. Ideally, I would like to make changes only for Pentium 4 and Xeon processors, but I am not finding an easier way to do it, and the extra wbinvd() instruction does not hurt much for other processors. Signed-off-by: Ajaykumar Hotchandani <ajaykumar.hotchandani@oracle.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Lucas De Marchi <lucas.demarchi@profusion.mobi> Link: http://lkml.kernel.org/r/4EBD1CC5.3030008@oracle.com Signed-off-by: Ingo Molnar <mingo@elte.hu>
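A hedged sketch of the sequence the manual calls for; wbinvd(), read_cr0()/write_cr0() and X86_CR0_CD are real kernel interfaces, while the function is a simplified stand-in for the MTRR setup code in arch/x86/kernel/cpu/mtrr/generic.c:

```c
/* Kernel-style illustration, not the literal upstream patch. */
static void mtrr_prepare_set_sketch(void)
{
	/* Enter no-fill cache mode (CR0.CD = 1) and flush the caches. */
	write_cr0(read_cr0() | X86_CR0_CD);
	wbinvd();

	/* ... disable the MTRRs via the IA32_MTRR_DEF_TYPE MSR ... */

	/*
	 * On Pentium 4 / Xeon, lines fetched between the first wbinvd()
	 * and the MTRR disable may still sit in the cache hierarchy, so
	 * flush a second time after the MTRRs are off.
	 */
	wbinvd();
}
```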
-
Authored by Borislav Petkov
Recently, I got bitten by using rdmsr_safe too early in the boot process. Document its shortcomings for future reference. Link: http://lkml.kernel.org/r/4ED5B70F.606@lwfinger.net Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
-
Authored by Srivatsa S. Bhat
The microcode update driver's initialization code does not handle failures correctly. This patch fixes this issue. Signed-off-by: Jan Beulich <JBeulich@suse.com> Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com> Link: http://lkml.kernel.org/r/20111107123530.12164.31227.stgit@srivatsabhat.in.ibm.com Link: http://lkml.kernel.org/r/4ED8E2270200007800065120@nat28.tlf.novell.com Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
-
Authored by Prarit Bhargava
TAINT_FIRMWARE_WORKAROUND should be set when an MTRR fixup is done. Signed-off-by: Prarit Bhargava <prarit@redhat.com> Acked-by: David Rientjes <rientjes@google.com> Link: http://lkml.kernel.org/r/1318958650-12447-1-git-send-email-prarit@redhat.com Signed-off-by: Ingo Molnar <mingo@elte.hu>
-
Authored by Bjorn Helgaas
In commit f8924e77 ("x86: unify mp_bus_info"), the 32-bit and 64-bit versions of MP_bus_info were rearranged to match each other better. Unfortunately it introduced a regression: prior to that change we used to always set the mp_bus_not_pci bit, then clear it if we found a PCI bus. After it, we set mp_bus_not_pci for ISA buses, clear it for PCI buses, and leave it alone otherwise. In the cases of ISA and PCI, there's not much difference. But ISA is not the only non-PCI bus, so it's better to always set mp_bus_not_pci and clear it only for PCI. Without this change, Dan's Dell PowerEdge 4200 panics on boot with a log indicating interrupt routing trouble unless the "noapic" option is supplied. With this change, the machine boots reliably without "noapic". Fixes http://bugs.debian.org/586494 Reported-bisected-and-tested-by: Dan McGrath <troubledaemon@gmail.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Cc: stable@vger.kernel.org # 2.6.26+ Cc: Dan McGrath <troubledaemon@gmail.com> Cc: Alexey Starikovskiy <aystarik@gmail.com> [jrnieder@gmail.com: clarified commit message] Signed-off-by: Jonathan Nieder <jrnieder@gmail.com> Link: http://lkml.kernel.org/r/20111122215000.GA9151@elie.hsd1.il.comcast.net Signed-off-by: Ingo Molnar <mingo@elte.hu>
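A hedged sketch of the restored behaviour; mp_bus_not_pci and struct mpc_bus are the real structures used by the MP-table parser, while the function is a simplified stand-in for MP_bus_info():

```c
/* Kernel-style illustration, not the literal upstream patch. */
static void mp_bus_info_sketch(struct mpc_bus *m)
{
	/* Default every bus to "not PCI", as the pre-f8924e77 code did ... */
	set_bit(m->busid, mp_bus_not_pci);

	/* ... and clear the bit only for buses positively identified as PCI. */
	if (strncmp(m->bustype, BUSTYPE_PCI, sizeof(BUSTYPE_PCI) - 1) == 0)
		clear_bit(m->busid, mp_bus_not_pci);
}
```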
-
Authored by Feng Tang
In the latest firmware's SFI tables, pmic_gpio has been set to the IPC type of device, so we need to handle it too. Signed-off-by: Feng Tang <feng.tang@intel.com> Signed-off-by: Alan Cox <alan@linux.intel.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
-
Authored by Jekyll Lai
Add SFI glue for the following devices: tca6416, a gpio expander compatible with max7315; mpu3050, a gyro sensor. Both of these drivers are already upstream. Signed-off-by: Jekyll Lai <jekyll_lai@wistron.com> Signed-off-by: Alan Cox <alan@linux.intel.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
-
Authored by Jacob Pan
On the Intel MID devices SCU commands are issued to manage power off and the like. We need to issue different ones for non-Lincroft based devices. Signed-off-by: Alek Du <alek.du@intel.com> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com> Signed-off-by: Alan Cox <alan@linux.intel.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
-