1. 11 Sep, 2015 1 commit
    • kexec: split kexec_load syscall from kexec core code · 2965faa5
      Dave Young committed
      There are two kexec load syscalls, kexec_load and kexec_file_load.
      kexec_file_load has already been split out as kernel/kexec_file.c.  In
      this patch I split the kexec_load syscall code into kernel/kexec.c.
      
      And add a new kconfig option KEXEC_CORE, so we can disable kexec_load and
      use kexec_file_load only, or vice versa.
      
      The original requirement is from Ted Ts'o: he wants the kexec kernel
      signature to be checked with CONFIG_KEXEC_VERIFY_SIG enabled.  But
      kexec-tools, which uses the kexec_load syscall, can bypass the check.
      
      Vivek Goyal proposed to create a common kconfig option so user can compile
      in only one syscall for loading kexec kernel.  KEXEC/KEXEC_FILE selects
      KEXEC_CORE so that old config files still work.
      
      Because there is general code that needs CONFIG_KEXEC_CORE, I updated all
      the architecture Kconfig files with a new option KEXEC_CORE, and let KEXEC
      select KEXEC_CORE in arch Kconfig.  Also updated general kernel code with to
      kexec_load syscall.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: Dave Young <dyoung@redhat.com>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Petr Tesarik <ptesarik@suse.cz>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: Josh Boyer <jwboyer@fedoraproject.org>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  2. 09 Sep, 2015 1 commit
    • mm: rename alloc_pages_exact_node() to __alloc_pages_node() · 96db800f
      Vlastimil Babka committed
      alloc_pages_exact_node() was introduced in commit 6484eb3e ("page
      allocator: do not check NUMA node ID when the caller knows the node is
      valid") as an optimized variant of alloc_pages_node() that doesn't
      fall back to the current node for nid == NUMA_NO_NODE.  Unfortunately the
      name of the function can easily suggest that the allocation is
      restricted to the given node and fails otherwise.  In truth, the node is
      only preferred, unless __GFP_THISNODE is passed among the gfp flags.
      
      The misleading name has led to mistakes in the past, see for example
      commits 5265047a ("mm, thp: really limit transparent hugepage
      allocation to local node") and b360edb4 ("mm, mempolicy:
      migrate_to_node should only migrate to node").
      
      Another issue with the name is that there's a family of
      alloc_pages_exact*() functions where 'exact' means exact size (instead
      of page order), which leads to more confusion.
      
      To prevent further mistakes, this patch effectively renames
      alloc_pages_exact_node() to __alloc_pages_node() to better convey that
      it's an optimized variant of alloc_pages_node() not intended for general
      usage.  Both functions get described in comments.
      
      It has also been considered to really provide a convenience function for
      allocations restricted to a node, but the majority opinion seems to be that
      __GFP_THISNODE already provides that functionality and we shouldn't
      duplicate the API needlessly.  The number of users would be small
      anyway.
      
      Existing callers of alloc_pages_exact_node() are simply converted to
      call __alloc_pages_node(), with the exception of sba_alloc_coherent()
      which open-codes the check for NUMA_NO_NODE, so it is converted to use
      alloc_pages_node() instead.  This means it no longer performs some
      VM_BUG_ON checks, and since the current check for nid in
      alloc_pages_node() uses a 'nid < 0' comparison (which includes
      NUMA_NO_NODE), it may hide wrong values which would previously have been
      exposed.
      
      Both differences will be rectified by the next patch.
      
      To sum up, this patch makes no functional changes, except temporarily
      hiding potentially buggy callers.  Restricting the checks in
      alloc_pages_node() is left for the next patch which can in turn expose
      more existing buggy callers.
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Robin Holt <robinmholt@gmail.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Christoph Lameter <cl@linux.com>
      Acked-by: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Gleb Natapov <gleb@kernel.org>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Cliff Whickman <cpw@sgi.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
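The difference between the two allocator entry points described in this commit message can be sketched in plain userspace C. This is only an illustrative model, not kernel code: the `model_` functions, `MAX_NUMNODES` value, and `current_node` variable are assumptions made up for the example, and the functions return the preferred node instead of allocating pages.

```c
#include <assert.h>

#define NUMA_NO_NODE (-1)
#define MAX_NUMNODES 4          /* assumption: small fixed node count */

/* Hypothetical stand-in for the node the caller is running on. */
static int current_node = 2;

/*
 * Model of __alloc_pages_node(): the caller guarantees a valid nid.
 * The returned node is only *preferred*; per the commit message, the
 * allocation is restricted to it only when __GFP_THISNODE is passed.
 */
static int model___alloc_pages_node(int nid)
{
	assert(nid >= 0 && nid < MAX_NUMNODES); /* stands in for VM_BUG_ON */
	return nid;
}

/*
 * Model of alloc_pages_node() as described at this commit: any
 * negative nid (not just NUMA_NO_NODE) falls back to the current
 * node, and no sanity check is performed on the value.
 */
static int model_alloc_pages_node(int nid)
{
	if (nid < 0)
		nid = current_node;
	return nid;
}
```

In this model, `model_alloc_pages_node(-5)` quietly returns the current node, which is the "hiding wrong values" effect the message mentions for the converted sba_alloc_coherent() caller.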
  3. 06 Sep, 2015 3 commits
  4. 15 Aug, 2015 1 commit
  5. 11 Aug, 2015 2 commits
  6. 07 Aug, 2015 2 commits
    • KVM: x86: Use adjustment in guest cycles when handling MSR_IA32_TSC_ADJUST · d7add054
      Haozhong Zhang committed
      When kvm_set_msr_common() handles a guest's write to
      MSR_IA32_TSC_ADJUST, it will calculate an adjustment based on the data
      written by the guest and then use it to adjust the TSC offset by calling
      the callback adjust_tsc_offset(). The 3rd parameter of adjust_tsc_offset()
      indicates whether the adjustment is in host TSC cycles or in guest TSC
      cycles. If SVM TSC scaling is enabled, adjust_tsc_offset()
      [i.e. svm_adjust_tsc_offset()] will first scale the adjustment;
      otherwise, it will just use the unscaled one. As the MSR write here
      comes from the guest, the adjustment is in guest TSC cycles. However,
      the current kvm_set_msr_common() uses it as a value in host TSC
      cycles (by using true as the 3rd parameter of adjust_tsc_offset()),
      which can result in an incorrect adjustment of TSC offset if SVM TSC
      scaling is enabled. This patch fixes this problem.
      Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com>
      Cc: stable@vger.linux.org
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
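The host-cycle versus guest-cycle distinction in this commit message can be illustrated with a small userspace model. Everything here is an assumption for illustration: the `model_vcpu` struct, the integer ratio representation, and the function names are hypothetical, and the scaling direction (host cycles converted into guest cycles before being added to the offset) is one reading of the description rather than a copy of the KVM code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical, simplified vCPU state: the guest TSC runs at
 * ratio_num/ratio_den times the host TSC frequency.
 */
struct model_vcpu {
	int64_t tsc_offset;   /* assumed to be kept in guest TSC cycles */
	int64_t ratio_num;
	int64_t ratio_den;
};

/*
 * Model of an adjust_tsc_offset() callback: an adjustment given in
 * host cycles is scaled first; one given in guest cycles is applied
 * unchanged.
 */
static void model_adjust_tsc_offset(struct model_vcpu *v, int64_t adj, bool host)
{
	assert(v->ratio_den != 0);
	if (host)
		adj = adj * v->ratio_num / v->ratio_den;
	v->tsc_offset += adj;
}

/*
 * Model of the MSR_IA32_TSC_ADJUST write path after the fix: the
 * delta comes from the guest, so it is passed as guest cycles
 * (host == false) and must not be scaled again.
 */
static void model_write_tsc_adjust(struct model_vcpu *v,
				   int64_t old_val, int64_t new_val)
{
	model_adjust_tsc_offset(v, new_val - old_val, false);
}
```

With a 2:1 ratio, a guest write that changes the MSR by 100 moves the modeled offset by 100; passing `true` instead would move it by 200, which is the kind of incorrect adjustment the patch describes.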
    • KVM: x86: zero IDT limit on entry to SMM · 18c3626e
      Paolo Bonzini committed
      The recent BlackHat 2015 presentation "The Memory Sinkhole"
      mentions that the IDT limit is zeroed on entry to SMM.
      
      This is not documented, and must have changed some time after 2010
      (see http://www.ssi.gouv.fr/uploads/IMG/pdf/IT_Defense_2010_final.pdf).
      KVM was not doing it, but the fix is easy.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  7. 05 Aug, 2015 11 commits
  8. 30 Jul, 2015 1 commit
    • KVM: x86: clean/fix memory barriers in irqchip_in_kernel · 71ba994c
      Paolo Bonzini committed
      The memory barriers are trying to protect against concurrent RCU-based
      interrupt injection, but the IRQ routing table is not valid at the time
      kvm->arch.vpic is written.  Fix this by writing kvm->arch.vpic last.
      kvm_destroy_pic then need not set kvm->arch.vpic to NULL; modify it
      to take a struct kvm_pic* and reuse it if the IOAPIC creation fails.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
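The "write kvm->arch.vpic last" ordering from this commit message is an instance of the general publish-last pattern. The sketch below uses C11 atomics in userspace rather than the kernel's barrier primitives, and the `model_kvm` and `model_pic` types and function names are hypothetical stand-ins, so treat it as an illustration of the idea only.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct model_pic { int dummy; };

struct model_kvm {
	bool routing_valid;               /* stands in for the IRQ routing table */
	_Atomic(struct model_pic *) vpic; /* stands in for kvm->arch.vpic */
};

/*
 * Reader: the acquire load pairs with the release store below, so a
 * reader that sees a non-NULL vpic also sees the routing set up
 * before it.
 */
static bool model_irqchip_in_kernel(struct model_kvm *kvm)
{
	return atomic_load_explicit(&kvm->vpic, memory_order_acquire) != NULL;
}

/* Writer: finish all other setup first, then publish vpic last. */
static void model_create_irqchip(struct model_kvm *kvm, struct model_pic *pic)
{
	assert(pic != NULL);
	kvm->routing_valid = true;
	atomic_store_explicit(&kvm->vpic, pic, memory_order_release);
}
```

If vpic were stored before the routing setup, a concurrent reader could observe a non-NULL pointer while the routing was still invalid, which is the window the commit closes.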
  9. 29 Jul, 2015 2 commits
  10. 23 Jul, 2015 11 commits
  11. 10 Jul, 2015 5 commits
    • kvm: x86: fix load xsave feature warning · ee4100da
      Wanpeng Li committed
      [   68.196974] WARNING: CPU: 1 PID: 2140 at arch/x86/kvm/x86.c:3161 kvm_arch_vcpu_ioctl+0xe88/0x1340 [kvm]()
      [   68.196975] Modules linked in: snd_hda_codec_hdmi i915 rfcomm bnep bluetooth i2c_algo_bit rfkill nfsd drm_kms_helper nfs_acl nfs drm lockd grace sunrpc fscache snd_hda_codec_realtek snd_hda_codec_generic snd_hda_intel snd_hda_codec snd_hda_core snd_hwdep snd_pcm snd_seq_dummy snd_seq_oss x86_pkg_temp_thermal snd_seq_midi kvm_intel snd_seq_midi_event snd_rawmidi kvm snd_seq ghash_clmulni_intel fuse snd_timer aesni_intel parport_pc ablk_helper snd_seq_device cryptd ppdev snd lp parport lrw dcdbas gf128mul i2c_core glue_helper lpc_ich video shpchp mfd_core soundcore serio_raw acpi_cpufreq ext4 mbcache jbd2 sd_mod crc32c_intel ahci libahci libata e1000e ptp pps_core
      [   68.197005] CPU: 1 PID: 2140 Comm: qemu-system-x86 Not tainted 4.2.0-rc1+ #2
      [   68.197006] Hardware name: Dell Inc. OptiPlex 7020/0F5C5X, BIOS A03 01/08/2015
      [   68.197007]  ffffffffa03b0657 ffff8800d984bca8 ffffffff815915a2 0000000000000000
      [   68.197009]  0000000000000000 ffff8800d984bce8 ffffffff81057c0a 00007ff6d0001000
      [   68.197010]  0000000000000002 ffff880211c1a000 0000000000000004 ffff8800ce0288c0
      [   68.197012] Call Trace:
      [   68.197017]  [<ffffffff815915a2>] dump_stack+0x45/0x57
      [   68.197020]  [<ffffffff81057c0a>] warn_slowpath_common+0x8a/0xc0
      [   68.197022]  [<ffffffff81057cfa>] warn_slowpath_null+0x1a/0x20
      [   68.197029]  [<ffffffffa037bed8>] kvm_arch_vcpu_ioctl+0xe88/0x1340 [kvm]
      [   68.197035]  [<ffffffffa037aede>] ? kvm_arch_vcpu_load+0x4e/0x1c0 [kvm]
      [   68.197040]  [<ffffffffa03696a6>] kvm_vcpu_ioctl+0xc6/0x5c0 [kvm]
      [   68.197043]  [<ffffffff811252d2>] ? perf_pmu_enable+0x22/0x30
      [   68.197044]  [<ffffffff8112663e>] ? perf_event_context_sched_in+0x7e/0xb0
      [   68.197048]  [<ffffffff811a6882>] do_vfs_ioctl+0x2c2/0x4a0
      [   68.197050]  [<ffffffff8107bf33>] ? finish_task_switch+0x173/0x220
      [   68.197053]  [<ffffffff8123307f>] ? selinux_file_ioctl+0x4f/0xd0
      [   68.197055]  [<ffffffff8122cac3>] ? security_file_ioctl+0x43/0x60
      [   68.197057]  [<ffffffff811a6ad9>] SyS_ioctl+0x79/0x90
      [   68.197060]  [<ffffffff81597e57>] entry_SYSCALL_64_fastpath+0x12/0x6a
      [   68.197061] ---[ end trace 558a5ebf9445fc80 ]---
      
      After commit 0c4109be ('x86/fpu/xstate: Fix up bad get_xsave_addr()
      assumptions'), it can no longer be assumed that an xsave bit present in
      the hardware (pcntxt_mask) is always present in a given xsave buffer.
      A state that is enabled in 'pcntxt_mask' but *not* present in 'xstate_bv'
      can occur when the last 'xsave' did not request that this feature be
      saved (unlikely) or because the "init optimization" caused it not to be
      saved. This patch removes the assumption.
      Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
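The distinction between "enabled in pcntxt_mask" and "present in xstate_bv" can be modeled in a few lines of userspace C. The buffer layout, the `MODEL_NR_FEATURES` constant, and both functions are hypothetical simplifications (one 64-bit slot per feature, zero as the assumed init value), not the real XSAVE format or the kernel's get_xsave_addr().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_NR_FEATURES 8   /* assumption: small fixed feature count */

/* Hypothetical, simplified xsave buffer: one 64-bit slot per feature. */
struct model_xsave {
	uint64_t xstate_bv;                /* features actually saved here */
	uint64_t slot[MODEL_NR_FEATURES];
};

/*
 * Return the saved data for 'feature', or NULL when the buffer does
 * not hold it. A feature enabled in hardware (pcntxt_mask) can still
 * be absent from xstate_bv, so callers must handle NULL.
 */
static const uint64_t *model_get_xsave_addr(const struct model_xsave *xs,
					     uint64_t pcntxt_mask,
					     unsigned int feature)
{
	uint64_t bit;

	assert(feature < MODEL_NR_FEATURES);
	bit = UINT64_C(1) << feature;
	if (!(pcntxt_mask & bit))
		return NULL;   /* not enabled at all */
	if (!(xs->xstate_bv & bit))
		return NULL;   /* enabled, but nothing was saved for it */
	return &xs->slot[feature];
}

/* Read one feature; an absent feature reads as the assumed init value 0. */
static uint64_t model_read_feature(const struct model_xsave *xs,
				   uint64_t pcntxt_mask, unsigned int feature)
{
	const uint64_t *src = model_get_xsave_addr(xs, pcntxt_mask, feature);

	return src ? *src : 0;
}
```

The point of the sketch is the second NULL return: warning or failing there would reproduce the false assumption that the patch removes.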
    • KVM: x86: apply guest MTRR virtualization on host reserved pages · fd717f11
      Paolo Bonzini committed
      Currently guest MTRR is avoided if kvm_is_reserved_pfn returns true.
      However, the guest could prefer a different page type than UC for
      such pages. A good example is that a passed-through VGA frame buffer is
      not always UC, as the host expects.
      
      This patch enables full use of virtual guest MTRRs.
      Suggested-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
      Tested-by: Joerg Roedel <jroedel@suse.de> (on AMD)
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: SVM: Sync g_pat with guest-written PAT value · e098223b
      Jan Kiszka committed
      When hardware supports the g_pat VMCB field, we can use it for emulating
      the PAT configuration that the guest configures by writing to the
      corresponding MSR.
      Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
      Tested-by: Joerg Roedel <jroedel@suse.de>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: SVM: use NPT page attributes · 3c2e7f7d
      Paolo Bonzini committed
      Right now, NPT page attributes are not used, and the final page
      attribute depends solely on gPAT (which however is not synced
      correctly), the guest MTRRs and the guest page attributes.
      
      However, we can do better by mimicking what is done for VMX.
      In the absence of PCI passthrough, the guest PAT can be ignored
      and the page attributes can be just WB.  If passthrough is being
      used, instead, keep respecting the guest PAT, and emulate the guest
      MTRRs through the PAT field of the nested page tables.
      
      The only snag is that WP memory cannot be emulated correctly,
      because Linux's default PAT setting only includes the other types.
      Tested-by: Joerg Roedel <jroedel@suse.de>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: count number of assigned devices · 5544eb9b
      Paolo Bonzini committed
      If there are no assigned devices, the guest PAT does not provide
      any useful information and can be overridden to writeback; VMX
      always does this because it has the "IPAT" bit in its extended
      page table entries, but SVM does not have anything similar.
      Hook into VFIO and legacy device assignment so that they
      provide this information to KVM.
      Reviewed-by: Alex Williamson <alex.williamson@redhat.com>
      Tested-by: Joerg Roedel <jroedel@suse.de>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
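The policy described across the last two commits (override the guest's memory type to writeback unless a device is assigned) can be sketched as a counter plus one decision function. The `model_vm` struct, the `model_memtype` enum, and all function names are hypothetical and exist only for this example; the real hooks into VFIO and legacy device assignment are not shown.

```c
#include <assert.h>
#include <stdbool.h>

enum model_memtype { MODEL_UC, MODEL_WC, MODEL_WT, MODEL_WB };

/* Hypothetical per-VM state: how many devices are currently assigned. */
struct model_vm {
	int assigned_device_count;
};

/* Called by the (modeled) assignment code when a device is attached. */
static void model_start_assignment(struct model_vm *vm)
{
	vm->assigned_device_count++;
}

/* Called when a device is detached; the count must not go negative. */
static void model_end_assignment(struct model_vm *vm)
{
	assert(vm->assigned_device_count > 0);
	vm->assigned_device_count--;
}

static bool model_has_assigned_device(const struct model_vm *vm)
{
	return vm->assigned_device_count > 0;
}

/*
 * Without passthrough the guest's requested type carries no useful
 * information and is overridden to writeback; with passthrough the
 * guest's choice is respected.
 */
static enum model_memtype model_effective_memtype(const struct model_vm *vm,
						  enum model_memtype guest_type)
{
	if (!model_has_assigned_device(vm))
		return MODEL_WB;
	return guest_type;
}
```

A counter rather than a boolean is used so that detaching one of several devices does not switch the VM back to the writeback override while others are still assigned.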