Commits · 61719a8fff3da865cdda57dd62974e561e16315d · openanolis / cloud-kernel

07 Aug, 2013 8 commits

nEPT: Support shadow paging for guest paging without A/D bits · 61719a8f

Committed by Gleb Natapov on Aug 05, 2013

Some guest paging modes do not support A/D bits. Add support for such
modes in shadow page code. For such modes PT_GUEST_DIRTY_MASK,
PT_GUEST_ACCESSED_MASK, PT_GUEST_DIRTY_SHIFT and PT_GUEST_ACCESSED_SHIFT
should be set to zero.

Reviewed-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
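
To illustrate why zeroed masks are enough for a mode without A/D bits, here is a minimal sketch. It is not the kernel's paging_tmpl.h code: `struct guest_mode` and `mark_accessed_dirty()` are hypothetical names, and only the idea of setting the `PT_GUEST_*` values to zero comes from the commit message.

```c
#include <stdint.h>

/* Hypothetical per-mode description; the kernel uses PT_GUEST_* macros instead. */
struct guest_mode {
    uint64_t accessed_mask; /* 0 when the mode has no accessed bit */
    uint64_t dirty_mask;    /* 0 when the mode has no dirty bit */
};

/* 64-bit x86 page tables: accessed is bit 5, dirty is bit 6. */
static const struct guest_mode mode_with_ad = { 1ULL << 5, 1ULL << 6 };

/* A mode without A/D bits: both masks are zero. */
static const struct guest_mode mode_without_ad = { 0, 0 };

/* Set A (and D on a write) in a guest PTE; with zero masks this is a no-op. */
static uint64_t mark_accessed_dirty(const struct guest_mode *m,
                                    uint64_t gpte, int write)
{
    gpte |= m->accessed_mask;
    if (write)
        gpte |= m->dirty_mask;
    return gpte;
}
```

With zero masks the shared code path needs no special case: the OR operations simply leave the guest PTE unchanged.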

nEPT: make guest's A/D bits depends on guest's paging mode · d8089bac

Committed by Gleb Natapov on Aug 05, 2013

This patch makes the definition of the guest A/D bits depend on the paging
mode, so that when EPT support is added it will be able to define them
differently.

Reviewed-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

nEPT: Move common code to paging_tmpl.h · 0ad805a0

Committed by Nadav Har'El on Aug 05, 2013

In preparation, we just move gpte_access(), prefetch_invalid_gpte(),
is_rsvd_bits_set(), protect_clean_gpte() and is_dirty_gpte() from mmu.c
to paging_tmpl.h.

Reviewed-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
Signed-off-by: Jun Nakajima <jun.nakajima@intel.com>
Signed-off-by: Xinhao Xu <xinhao.xu@intel.com>
Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com>
Signed-off-by: Jun Nakajima <jun.nakajima@intel.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

nEPT: Fix wrong test in kvm_set_cr3 · b7e91450

Committed by Nadav Har'El on Aug 05, 2013

kvm_set_cr3() attempts to check if the new cr3 is a valid guest physical
address. The problem is that with nested EPT, cr3 is an *L2* physical
address, not an L1 physical address as this test expects.

As the comment above this test explains, it isn't necessary, and doesn't
correspond to anything a real processor would do. So this patch removes it.

Note that this wrong test could have also theoretically caused problems
in nested NPT, not just in nested EPT. However, in practice, the problem
was avoided: nested_svm_vmexit()/vmrun() do not call kvm_set_cr3 in the
nested NPT case, and instead set the vmcb (and arch.cr3) directly, thus
circumventing the problem. Additional potential calls to the buggy function
are avoided in that we don't trap cr3 modifications when nested NPT is
enabled. However, because in nested VMX we did want to use kvm_set_cr3()
(as requested in Avi Kivity's review of the original nested VMX patches),
we can't avoid this problem and need to fix it.

Reviewed-by: Orit Wasserman <owasserm@redhat.com>
Reviewed-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
Signed-off-by: Jun Nakajima <jun.nakajima@intel.com>
Signed-off-by: Xinhao Xu <xinhao.xu@intel.com>
Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

nEPT: Fix cr3 handling in nested exit and entry · 3633cfc3

Committed by Nadav Har'El on Aug 05, 2013

The existing code for handling cr3 and related VMCS fields during nested
exit and entry wasn't correct in all cases:

If L2 is allowed to control cr3 (and this is indeed the case in nested EPT),
during nested exit we must copy the modified cr3 from vmcs02 to vmcs12, and
we forgot to do so. This patch adds this copy.

If L0 isn't controlling cr3 when running L2 (i.e., L0 is using EPT), and
whoever does control cr3 (L1 or L2) is using PAE, the processor might have
saved PDPTEs and we should also save them in vmcs12 (and restore later).

Reviewed-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Reviewed-by: Orit Wasserman <owasserm@redhat.com>
Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
Signed-off-by: Jun Nakajima <jun.nakajima@intel.com>
Signed-off-by: Xinhao Xu <xinhao.xu@intel.com>
Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

nEPT: Support LOAD_IA32_EFER entry/exit controls for L1 · 8049d651

Committed by Nadav Har'El on Aug 05, 2013

Recent KVM, since http://kerneltrap.org/mailarchive/linux-kvm/2010/5/2/6261577,
switches the EFER MSR when EPT is used and the host and guest have different
NX bits. So if we add support for nested EPT (L1 guest using EPT to run L2)
and want to be able to run recent KVM as L1, we need to allow L1 to use this
EFER switching feature.

To do this EFER switching, KVM uses VM_ENTRY/EXIT_LOAD_IA32_EFER if available,
and if it isn't, it uses the generic VM_ENTRY/EXIT_MSR_LOAD. This patch adds
support for the former (the latter is still unsupported).

Nested entry and exit emulation (prepare_vmcs02 and load_vmcs12_host_state,
respectively) already handled VM_ENTRY/EXIT_LOAD_IA32_EFER correctly. So all
that's left to do in this patch is to properly advertise this feature to L1.

Note that vmcs12's VM_ENTRY/EXIT_LOAD_IA32_EFER are emulated by L0, by using
vmx_set_efer (which itself sets one of several vmcs02 fields), so we always
support this feature, regardless of whether the host supports it.

Reviewed-by: Orit Wasserman <owasserm@redhat.com>
Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
Signed-off-by: Jun Nakajima <jun.nakajima@intel.com>
Signed-off-by: Xinhao Xu <xinhao.xu@intel.com>
Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

KVM: MMU: fix check the reserved bits on the gpte of L2 · 02766421

Committed by Xiao Guangrong on Aug 05, 2013

The current code always uses arch.mmu to check the reserved bits on the
guest gpte, which is valid only for the L1 guest. We should use
arch.nested_mmu instead when we translate gva to gpa for the L2 guest.

Fix it by using @mmu instead, since it is adapted to the current mmu mode
automatically.

The bug can be triggered when nested npt is used and the L1 and L2 guests
use different mmu modes.

Reported-by: Jan Kiszka <jan.kiszka@siemens.com>
Reviewed-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

KVM: nVMX: correctly set tr base on nested vmexit emulation · 205befd9

Committed by Gleb Natapov on Aug 04, 2013

After commit 21feb4eb, the tr base is zeroed
during vmexit. Set it to L1's HOST_TR_BASE. This should fix
https://bugzilla.kernel.org/show_bug.cgi?id=60679

Reported-by: Yongjie Ren <yongjie.ren@intel.com>
Reviewed-by: Arthur Chunqi Li <yzt356@gmail.com>
Tested-by: Yongjie Ren <yongjie.ren@intel.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

29 Jul, 2013 11 commits

nVMX: reset rflags register cache during nested vmentry. · 63fbf59f

Committed by Gleb Natapov on Jul 28, 2013

During nested vmentry into vm86 mode, the vcpu state is found to be incorrect
because rflags does not have the VM flag set: it is read from the cache
and has L1's value instead of L2's. If emulate_invalid_guest_state=1, L0
KVM tries to emulate it, but emulation does not work for nVMX and it
never should happen anyway. Fix that by using vmx_set_rflags() to set
rflags during nested vmentry, which takes care of updating the register cache.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

KVM: s390: Make KVM_HVA_ERR_BAD usable on s390 · bf640876

Committed by Dominik Dingel on Jul 26, 2013

Current common code uses PAGE_OFFSET to indicate a bad host virtual address.
As this check won't work on architectures that don't map kernel and user memory
into the same address space (e.g. s390), such architectures can now provide
their own KVM_HVA_ERR_BAD defines.

Signed-off-by: Dominik Dingel <dingel@linux.vnet.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
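
The override mechanism described above can be sketched with an `#ifndef` guard. This is an illustration under stated assumptions, not the kernel header: the `PAGE_OFFSET` value below is a placeholder, and the exact layout of the real definitions is assumed rather than quoted.

```c
#include <stdbool.h>

/* Placeholder value for illustration; the real PAGE_OFFSET is per-architecture. */
#define PAGE_OFFSET 0xc0000000UL

/*
 * An architecture header could define KVM_HVA_ERR_BAD and its own
 * kvm_is_error_hva() before this point; the generic fallback below
 * is then skipped entirely.
 */
#ifndef KVM_HVA_ERR_BAD
#define KVM_HVA_ERR_BAD PAGE_OFFSET

/* Generic check: any address at or above PAGE_OFFSET is treated as an error. */
static inline bool kvm_is_error_hva(unsigned long addr)
{
    return addr >= PAGE_OFFSET;
}
#endif
```

An architecture whose user addresses may lie above `PAGE_OFFSET` would supply a different sentinel and check, so valid user addresses are not mistaken for errors.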

KVM: s390: Add helper function for setting condition code · ea828ebf

Committed by Thomas Huth on Jul 26, 2013

Introduced a helper function for setting the CC in the
guest PSW to improve the readability of the code.

Signed-off-by: Thomas Huth <thuth@linux.vnet.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
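
A helper of this kind might look like the sketch below. `set_psw_cc()` is a hypothetical stand-in that works on a plain 64-bit mask value rather than on the vcpu structure; the position of the condition code (two bits at shift 44 of the 64-bit PSW mask) is the only architectural fact it relies on.

```c
#include <stdint.h>

/* The condition code occupies two bits of the 64-bit PSW mask, at shift 44. */
#define PSW_CC_SHIFT 44
#define PSW_CC_MASK  (3ULL << PSW_CC_SHIFT)

/* Hypothetical helper: return psw_mask with its condition code replaced by cc. */
static uint64_t set_psw_cc(uint64_t psw_mask, unsigned int cc)
{
    psw_mask &= ~PSW_CC_MASK;                          /* clear the old CC */
    psw_mask |= ((uint64_t)(cc & 3)) << PSW_CC_SHIFT;  /* insert the new CC */
    return psw_mask;
}
```

Centralising the clear-then-set sequence in one function keeps the shift constant out of every call site.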

KVM: s390: Fix sparse warnings in priv.c · 843200e7

Committed by Thomas Huth on Jul 26, 2013

sparse complained about the missing UL postfix for long constants.

Signed-off-by: Thomas Huth <thuth@linux.vnet.ibm.com>
Acked-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

KVM: s390: declare virtual HW facilities · 78c4b59f

Committed by Michael Mueller on Jul 26, 2013

The patch renames the array holding the HW facility bitmaps.
This allows the variable to be interpreted as a set of virtual
machine specific "virtual" facilities. The basic idea is
to make virtual facilities externally manageable in the future.
An availability test for virtual facilities has been added
as well.

Signed-off-by: Michael Mueller <mimu@linux.vnet.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

KVM: s390: fix task size check · ee6ee55b

Committed by Martin Schwidefsky on Jul 26, 2013

The gmap_map_segment function uses PGDIR_SIZE in the check for the
maximum address in the task's address space. This incorrectly limits
the amount of memory usable for a kvm guest to 4TB. The correct limit
is (1UL << 53). As TASK_SIZE has different values (4TB vs 8PB)
depending on the existence of the fourth page table level, create
a new define 'TASK_MAX_SIZE' for (1UL << 53).

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
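
The two limits quoted above can be checked with a little arithmetic. In this sketch `SIZE_4TB` and `range_fits()` are hypothetical names introduced for illustration; only `TASK_MAX_SIZE` and its value come from the commit message, and the check is a simplified model rather than the code in gmap_map_segment.

```c
/* Limits from the commit message, expressed as shifts. */
#define SIZE_4TB      (1ULL << 42)   /* the old, too-small limit */
#define TASK_MAX_SIZE (1ULL << 53)   /* 8PB, the corrected limit */

/* Hypothetical range check: does [from, from + len) fit below limit? */
int range_fits(unsigned long long from, unsigned long long len,
               unsigned long long limit)
{
    if (from + len < from)   /* reject wrap-around */
        return 0;
    return from + len <= limit;
}
```

A 5TB guest mapping fails against the 4TB limit but passes against `TASK_MAX_SIZE`, which is 2048 times larger.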

KVM: s390: allow sie enablement for multi-threaded programs · 3eabaee9

Committed by Martin Schwidefsky on Jul 26, 2013

Improve the code to upgrade the standard 2K page tables to 4K page tables
with PGSTEs to allow the operation to happen when the program is already
multi-threaded.

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

KVM: x86: handle singlestep during emulation · 663f4c61

Committed by Paolo Bonzini on Jun 25, 2013

This lets debugging work better during emulation of invalid
guest state.

This time the check is done after emulation, but before writeback
of the flags; we need to check the flags *before* execution of the
instruction, and we cannot check singlestep_rip because the CS base may
have already been modified.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

Conflicts:
	arch/x86/kvm/x86.c

KVM: x86: handle hardware breakpoints during emulation · 4a1e10d5

Committed by Paolo Bonzini on May 30, 2013

This lets debugging work better during emulation of invalid
guest state.

The check is done before emulating the instruction, and (in the case
of guest debugging) reuses EMULATE_DO_MMIO to exit with KVM_EXIT_DEBUG.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

KVM: x86: rename EMULATE_DO_MMIO · ac0a48c3

Committed by Paolo Bonzini on Jun 25, 2013

The next patch will reuse it for userspace exits other than MMIO,
namely debug events.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

KVM: introduce __kvm_io_bus_sort_cmp · a343c9b7

Committed by Paolo Bonzini on Jul 16, 2013

kvm_io_bus_sort_cmp is also used directly, not just as a callback for
sort and bsearch.  In these cases, it is handy to have a type-safe
variant.  This patch introduces such a variant, __kvm_io_bus_sort_cmp,
and uses it throughout kvm_main.c.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
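
The pattern of a typed comparison plus a thin `void *` wrapper can be sketched in user space as follows. `struct io_range`, `io_range_cmp()`, `io_range_cmp_cb()` and `sort_ranges()` are simplified, hypothetical stand-ins for the kernel's names, and the comparison rule is an assumption modelled on a range-containment ordering.

```c
#include <stdint.h>
#include <stdlib.h>

/* Simplified stand-in for the kernel's I/O range descriptor. */
struct io_range {
    uint64_t addr;
    int len;
};

/* Type-safe comparison: direct callers pass real struct pointers. */
static int io_range_cmp(const struct io_range *r1, const struct io_range *r2)
{
    if (r1->addr < r2->addr)
        return -1;
    if (r1->addr + r1->len > r2->addr + r2->len)
        return 1;
    return 0;   /* r1 lies within r2 */
}

/* Wrapper with the void * signature that qsort() and bsearch() expect. */
static int io_range_cmp_cb(const void *p1, const void *p2)
{
    return io_range_cmp(p1, p2);
}

/* Sort a table of ranges through the callback form. */
static void sort_ranges(struct io_range *ranges, size_t n)
{
    qsort(ranges, n, sizeof(*ranges), io_range_cmp_cb);
}
```

Direct callers use `io_range_cmp()` and get compile-time type checking; only the library callbacks go through the untyped wrapper.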

25 Jul, 2013 2 commits

KVM: x86: Drop some unused functions from lapic · 9576c4cd

Committed by Jan Kiszka on Jul 25, 2013

Both have no users anymore.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>

KVM: x86: Simplify __apic_accept_irq · 11f5cc05

Committed by Jan Kiszka on Jul 25, 2013

If posted interrupts are enabled, we can no longer track if an IRQ was
coalesced based on IRR. So drop this logic also from the classic
software path and simplify apic_test_and_set_irr to apic_set_irr.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>

20 Jul, 2013 1 commit

perf, kvm: Support the in_tx/in_tx_cp modifiers in KVM arch perfmon emulation v5 · 103af0a9

Committed by Andi Kleen on Jul 18, 2013

[KVM maintainers:
The underlying support for this is in perf/core now. So please merge
this patch into the KVM tree.]

This is not arch perfmon, but older CPUs will just ignore it. This makes
it possible to do at least some TSX measurements from a KVM guest.

v2: Various fixes to address review feedback
v3: Ignore the bits when no CPUID. No #GP. Force raw events with TSX bits.
v4: Use reserved bits for #GP
v5: Remove obsolete argument

Acked-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

18 Jul, 2013 12 commits

KVM: nVMX: Set segment infomation of L1 when L2 exits · 21feb4eb

Committed by Arthur Chunqi Li on Jul 15, 2013

When L2 exits to L1, the segment information of L1 is not set correctly.
According to Intel SDM 27.5.2 (Loading Host Segment and Descriptor
Table Registers), the segment base/limit/access rights of L1 should be
set to specific designated values when L2 exits to L1. This patch fixes
this.

Signed-off-by: Arthur Chunqi Li <yzt356@gmail.com>
Reviewed-by: Gleb Natapov <gnatapov@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

remove sched notifier for cross-cpu migrations · e04c5d76

Committed by Marcelo Tosatti on Jul 10, 2013

Linux as a guest on the KVM hypervisor, the only user of the pvclock
vsyscall interface, does not require notification on task migration
because:

1. cpu ID number maps 1:1 to per-CPU pvclock time info.
2. per-CPU pvclock time info is updated if the
   underlying CPU changes.
3. that version is increased whenever underlying CPU
   changes.

This is sufficient to guarantee that the nanoseconds counter
is calculated properly.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
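
The role of the version field in the three points above can be sketched as a retry loop. This is a simplified model under loud assumptions: `struct pv_time_info` and `read_time()` are hypothetical names, the memory barriers a real reader needs are omitted, and the per-CPU lookup is left out entirely.

```c
#include <stdint.h>

/* Simplified stand-in for the per-CPU pvclock time info. */
struct pv_time_info {
    uint32_t version;      /* odd while an update is in progress */
    uint64_t system_time;  /* nanoseconds value published by the host */
};

/*
 * Read the time value, retrying if the version was odd or changed while
 * we were reading. Barriers are omitted here for brevity.
 */
uint64_t read_time(const volatile struct pv_time_info *ti)
{
    uint32_t v;
    uint64_t t;

    do {
        v = ti->version;
        t = ti->system_time;
    } while ((v & 1) || v != ti->version);

    return t;
}
```

Because any update bumps the version, a reader that raced with a change notices the mismatch and retries, which is why no separate migration notification is required.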

KVM: nVMX: Fix read/write to MSR_IA32_FEATURE_CONTROL · b3897a49

Committed by Nadav Har'El on Jul 08, 2013

Fix read/write to IA32_FEATURE_CONTROL MSR in nested environment.

This patch simulates this MSR in nested_vmx, and the default value is
0x0. BIOS should set it to 0x5 before VMXON. After the lock bit is set,
a write to it will cause #GP(0).

Another QEMU patch is also needed to handle emulation of reset
and migration. A vCPU reset should clear this MSR, and migration
should preserve its value.

This patch is based on Nadav's previous commit.
http://permalink.gmane.org/gmane.comp.emulators.kvm.devel/88478

Signed-off-by: Nadav Har'El <nyh@math.technion.ac.il>
Signed-off-by: Arthur Chunqi Li <yzt356@gmail.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
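
The lock-bit behaviour described above can be modelled in a few lines. `struct nested_state` and `write_feature_control()` are hypothetical names for this sketch; the bit positions (lock at bit 0, VMXON outside SMX at bit 2, together 0x5) follow the MSR layout, while the return convention is an assumption.

```c
#include <stdint.h>

#define FEATURE_CONTROL_LOCKED                    (1ULL << 0)
#define FEATURE_CONTROL_VMXON_ENABLED_OUTSIDE_SMX (1ULL << 2)

/* Hypothetical per-vCPU state; the MSR starts at 0, as the commit describes. */
struct nested_state {
    uint64_t feature_control;
};

/* Emulate a guest write: return 0 on success, 1 if #GP(0) must be injected. */
int write_feature_control(struct nested_state *s, uint64_t data)
{
    if (s->feature_control & FEATURE_CONTROL_LOCKED)
        return 1;               /* locked: every further write faults */
    s->feature_control = data;
    return 0;
}
```

The first write of 0x5 succeeds and sets the lock bit, so any later write is refused until the value is cleared again by a reset.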

KVM: x86: Drop useless cast · 6b61edf7

Committed by Mathias Krause on Jun 26, 2013

Void pointers don't need no casting, drop it.

Signed-off-by: Mathias Krause <minipli@googlemail.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>

KVM: VMX: Use proper types to access const arrays · c2bae893

Committed by Mathias Krause on Jun 26, 2013

Use a const pointer type instead of casting away the const qualifier
from const arrays. Keep the pointer array on the stack, nonetheless.
Making it static just increases the object size.

Signed-off-by: Mathias Krause <minipli@googlemail.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>

KVM: nVMX: Set success rflags when emulate VMXON/VMXOFF in nested virt · a25eb114

Committed by Arthur Chunqi Li on Jul 04, 2013

Set rflags after successfully emulating VMXON/VMXOFF in VMX.

Signed-off-by: Arthur Chunqi Li <yzt356@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

KVM: nVMX: Change location of 3 functions in vmx.c · 0658fbaa

Committed by Arthur Chunqi Li on Jul 04, 2013

Move nested_vmx_succeed/nested_vmx_failInvalid/nested_vmx_failValid
ahead of handle_vmon to eliminate the duplicate declarations in the same
file.

Signed-off-by: Arthur Chunqi Li <yzt356@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

KVM: x86: Avoid zapping mmio sptes twice for generation wraparound · e6dff7d1

Committed by Takuya Yoshikawa on Jul 04, 2013

Now that kvm_arch_memslots_updated() catches every increment of the
memslots->generation, checking if the mmio generation has reached its
maximum value is enough.

Signed-off-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp>
Reviewed-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

KVM: Introduce kvm_arch_memslots_updated() · e59dbe09

Committed by Takuya Yoshikawa on Jul 04, 2013

This is called right after the memslots are updated, i.e. when the result
of update_memslots() gets installed in install_new_memslots().  Since
the memslots need to be updated twice when we delete or move a memslot,
kvm_arch_commit_memory_region() does not correspond to this exactly.

In the following patch, x86 will use this new API to check if the mmio
generation has reached its maximum value, in which case mmio sptes need
to be flushed out.

Signed-off-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp>
Acked-by: Alexander Graf <agraf@suse.de>
Reviewed-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

KVM: s390: use cookies for ioeventfd · 85dfe87e

Committed by Cornelia Huck on Jul 03, 2013

Make use of cookies for the virtio ccw notification hypercall to speed up
lookup of devices on the io bus.

Signed-off-by: Cornelia Huck <cornelia.huck@de.ibm.com>
[Small fix to a comment. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

KVM: kvm-io: support cookies · 126a5af5

Committed by Cornelia Huck on Jul 03, 2013

Add new functions kvm_io_bus_{read,write}_cookie() that allow users of
the kvm io infrastructure to use a cookie value to speed up lookup of a
device on an io bus.

Signed-off-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
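
One way to picture a cookie-assisted lookup is the sketch below, which treats the cookie as an index hint remembered from an earlier lookup. All names here (`struct io_dev`, `find_dev()`, `find_dev_cookie()`) are hypothetical, the linear search merely stands in for the real bus lookup, and the fallback behaviour is an assumption.

```c
#include <stdint.h>

/* Simplified stand-in for a device registered on an I/O bus. */
struct io_dev {
    uint64_t addr;
    int len;
};

/* Full lookup; a linear scan stands in for the real search. */
static int find_dev(const struct io_dev *bus, int count, uint64_t addr, int len)
{
    for (int i = 0; i < count; i++)
        if (addr >= bus[i].addr && addr + len <= bus[i].addr + bus[i].len)
            return i;
    return -1;
}

/*
 * Try the slot named by the cookie first; fall back to the full lookup
 * if the cookie is out of range or no longer matches the access.
 */
int find_dev_cookie(const struct io_dev *bus, int count,
                    uint64_t addr, int len, long cookie)
{
    if (cookie >= 0 && cookie < count &&
        addr >= bus[cookie].addr &&
        addr + len <= bus[cookie].addr + bus[cookie].len)
        return (int)cookie;
    return find_dev(bus, count, addr, len);
}
```

A caller would keep the returned index as the cookie for its next access, so repeated notifications to the same device skip the search.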

KVM: MMU: avoid fast page fault fixing mmio page fault · 1c118b82

Committed by Xiao Guangrong on Jul 18, 2013

Currently, fast page fault incorrectly tries to fix mmio page fault when
the generation number is invalid (spte.gen != kvm.gen). It then returns
to guest to retry the fault since it sees the last spte is nonpresent.
This causes an infinite loop.

Since fast page fault only works for direct mmu, the issue exists when
1) tdp is enabled. It is triggered only on AMD hosts, since on Intel hosts
the mmio page fault is recognized as ept-misconfig, whose handler calls the
page-fault path with error_code = 0

2) guest paging is disabled. In this case, the issue is hard to discover,
since paging is disabled only briefly and the sptes become invalid only
after the memslots have changed 150 times

Fix it by filtering out MMIO page faults in page_fault_can_be_fast.

Reported-by: Markus Trippelsdorf <markus@trippelsdorf.de>
Tested-by: Markus Trippelsdorf <markus@trippelsdorf.de>
Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
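
A filter of this kind could be sketched as follows. The signature is simplified (no vcpu argument), and identifying MMIO faults by the reserved-bit flag in the error code is an assumption of this sketch; the commit message only says that MMIO page faults are filtered out.

```c
#include <stdbool.h>
#include <stdint.h>

/* x86 page-fault error code bits. */
#define PFERR_PRESENT_MASK (1U << 0)
#define PFERR_WRITE_MASK   (1U << 1)
#define PFERR_RSVD_MASK    (1U << 3)

/* Decide whether a fault is eligible for the lockless fast path. */
bool page_fault_can_be_fast(uint32_t error_code)
{
    /* Assumed MMIO marker: a reserved-bit fault never takes the fast path. */
    if (error_code & PFERR_RSVD_MASK)
        return false;

    /* Only write faults on present entries (write-protection) qualify. */
    if (!(error_code & PFERR_PRESENT_MASK) || !(error_code & PFERR_WRITE_MASK))
        return false;

    return true;
}
```

Rejecting the MMIO case up front sends it to the regular fault path, which breaks the retry loop described in the commit message.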

15 Jul, 2013 4 commits

Linux 3.11-rc1 · ad81f054

Committed by Linus Torvalds on Jul 14, 2013

Merge branch 'slab/for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/linux · 54be8200

Committed by Linus Torvalds on Jul 14, 2013

Pull slab update from Pekka Enberg:
 "Highlights:

  - Fix for boot-time problems on some architectures due to
    init_lock_keys() not respecting kmalloc_caches boundaries
    (Christoph Lameter)

  - CONFIG_SLUB_CPU_PARTIAL requested by RT folks (Joonsoo Kim)

  - Fix for excessive slab freelist draining (Wanpeng Li)

  - SLUB and SLOB cleanups and fixes (various people)"

I ended up editing the branch, and this avoids two commits at the end
that were immediately reverted, and I instead just applied the oneliner
fix in between myself.

* 'slab/for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/linux
  slub: Check for page NULL before doing the node_match check
  mm/slab: Give s_next and s_stop slab-specific names
  slob: Check for NULL pointer before calling ctor()
  slub: Make cpu partial slab support configurable
  slab: add kmalloc() to kernel API documentation
  slab: fix init_lock_keys
  slob: use DIV_ROUND_UP where possible
  slub: do not put a slab to cpu partial list when cpu_partial is 0
  mm/slub: Use node_nr_slabs and node_nr_objs in get_slabinfo
  mm/slub: Drop unnecessary nr_partials
  mm/slab: Fix /proc/slabinfo unwriteable for slab
  mm/slab: Sharing s_next and s_stop between slab and slub
  mm/slab: Fix drain freelist excessively
  slob: Rework #ifdeffery in slab.h
  mm, slab: moved kmem_cache_alloc_node comment to correct place

slub: Check for page NULL before doing the node_match check · c25f195e

Committed by Steven Rostedt on Jan 17, 2013

In the -rt kernel (mrg), we hit the following dump:

BUG: unable to handle kernel NULL pointer dereference at           (null)
IP: [<ffffffff811573f1>] kmem_cache_alloc_node+0x51/0x180
PGD a2d39067 PUD b1641067 PMD 0
Oops: 0000 [#1] PREEMPT SMP
Modules linked in: sunrpc cpufreq_ondemand ipv6 tg3 joydev sg serio_raw pcspkr k8temp amd64_edac_mod edac_core i2c_piix4 e100 mii shpchp ext4 mbcache jbd2 sd_mod crc_t10dif sr_mod cdrom sata_svw ata_generic pata_acpi pata_serverworks radeon ttm drm_kms_helper drm hwmon i2c_algo_bit i2c_core dm_mirror dm_region_hash dm_log dm_mod
CPU 3
Pid: 20878, comm: hackbench Not tainted 3.6.11-rt25.14.el6rt.x86_64 #1 empty empty/Tyan Transport GT24-B3992
RIP: 0010:[<ffffffff811573f1>]  [<ffffffff811573f1>] kmem_cache_alloc_node+0x51/0x180
RSP: 0018:ffff8800a9b17d70  EFLAGS: 00010213
RAX: 0000000000000000 RBX: 0000000001200011 RCX: ffff8800a06d8000
RDX: 0000000004d92a03 RSI: 00000000000000d0 RDI: ffff88013b805500
RBP: ffff8800a9b17dc0 R08: ffff88023fd14d10 R09: ffffffff81041cbd
R10: 00007f4e3f06e9d0 R11: 0000000000000246 R12: ffff88013b805500
R13: ffff8801ff46af40 R14: 0000000000000001 R15: 0000000000000000
FS:  00007f4e3f06e700(0000) GS:ffff88023fd00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000000 CR3: 00000000a2d3a000 CR4: 00000000000007e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process hackbench (pid: 20878, threadinfo ffff8800a9b16000, task ffff8800a06d8000)
Stack:
 ffff8800a9b17da0 ffffffff81202e08 ffff8800a9b17de0 000000d001200011
 0000000001200011 0000000001200011 0000000000000000 0000000000000000
 00007f4e3f06e9d0 0000000000000000 ffff8800a9b17e60 ffffffff81041cbd
Call Trace:
 [<ffffffff81202e08>] ? current_has_perm+0x68/0x80
 [<ffffffff81041cbd>] copy_process+0xdd/0x15b0
 [<ffffffff810a2125>] ? rt_up_read+0x25/0x30
 [<ffffffff8104369a>] do_fork+0x5a/0x360
 [<ffffffff8107c66b>] ? migrate_enable+0xeb/0x220
 [<ffffffff8100b068>] sys_clone+0x28/0x30
 [<ffffffff81527423>] stub_clone+0x13/0x20
 [<ffffffff81527152>] ? system_call_fastpath+0x16/0x1b
Code: 89 fc 89 75 cc 41 89 d6 4d 8b 04 24 65 4c 03 04 25 48 ae 00 00 49 8b 50 08 4d 8b 28 49 8b 40 10 4d 85 ed 74 12 41 83 fe ff 74 27 <48> 8b 00 48 c1 e8 3a 41 39 c6 74 1b 8b 75 cc 4c 89 c9 44 89 f2
RIP  [<ffffffff811573f1>] kmem_cache_alloc_node+0x51/0x180
 RSP <ffff8800a9b17d70>
CR2: 0000000000000000
---[ end trace 0000000000000002 ]---

Now, this uses SLUB pretty much unmodified, but as it is the -rt kernel
with CONFIG_PREEMPT_RT set, spinlocks are mutexes, although they do
disable migration. But the SLUB code is relatively lockless, and the
spin_locks there are raw_spin_locks (not converted to mutexes), thus I
believe this bug can happen in mainline without -rt features. The -rt
patch is just good at triggering mainline bugs ;-)

Anyway, looking at where this crashed, it seems that the page variable
can be NULL when passed to the node_match() function (which does not
check if it is NULL). When this happens we get the above panic.

As page is only used in slab_alloc() to check if the node matches, if
it's NULL I'm assuming that we can say it doesn't and call the
__slab_alloc() code. Is this a correct assumption?

Acked-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
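
The shape of the fix suggested by the subject line can be sketched as below. This is a user-space model, not the SLUB source: `struct page` is reduced to a node id, and `needs_slow_path()` is a hypothetical name for the condition that decides whether to fall back to the slow allocation path.

```c
#include <stddef.h>

#define NUMA_NO_NODE (-1)

/* Simplified stand-in for struct page: only the node id matters here. */
struct page {
    int nid;
};

/* Like the helper in the report, this dereferences page without a NULL check. */
static int node_match(const struct page *page, int node)
{
    return node == NUMA_NO_NODE || page->nid == node;
}

/* Return 1 when the allocation must take the slow path. */
int needs_slow_path(const void *object, const struct page *page, int node)
{
    /* !page is tested first, so node_match() never sees a NULL page. */
    return !object || !page || !node_match(page, node);
}
```

Treating a NULL page as "node does not match" simply routes the request to the slow path, which is the assumption the commit message asks about.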

Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs · 41d9884c

Committed by Linus Torvalds on Jul 14, 2013

Pull more vfs stuff from Al Viro:
 "O_TMPFILE ABI changes, Oleg's fput() series, misc cleanups, including
  making simple_lookup() usable for filesystems with non-NULL s_d_op,
  which allows us to get rid of quite a bit of ugliness"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  sunrpc: now we can just set ->s_d_op
  cgroup: we can use simple_lookup() now
  efivarfs: we can use simple_lookup() now
  make simple_lookup() usable for filesystems that set ->s_d_op
  configfs: don't open-code d_alloc_name()
  __rpc_lookup_create_exclusive: pass string instead of qstr
  rpc_create_*_dir: don't bother with qstr
  llist: llist_add() can use llist_add_batch()
  llist: fix/simplify llist_add() and llist_add_batch()
  fput: turn "list_head delayed_fput_list" into llist_head
  fs/file_table.c:fput(): add comment
  Safer ABI for O_TMPFILE

14 Jul, 2013 2 commits

sunrpc: now we can just set ->s_d_op · dae3794f

Committed by Al Viro on Jul 14, 2013

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

cgroup: we can use simple_lookup() now · 786e1448

Committed by Al Viro on Jul 14, 2013

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
