提交 · 9938b04472d5c59f8bd8152a548533a8599596a2 · openanolis / cloud-kernel

15 4月, 2016 1 次提交

Revert "x86: remove the kernel code/data/bss resources from /proc/iomem" · 4046d6e8

由 Linus Torvalds 提交于 4月 14, 2016

This reverts commit c4004b02.

Sadly, my hope that nobody would actually use the special kernel entries
in /proc/iomem were dashed by kexec.  Which reads /proc/iomem explicitly
to find the kernel base address.  Nasty.

Anyway, that means we can't do the sane and simple thing and just remove
the entries, and we'll instead have to mask them out based on permissions.
Reported-by: NZhengyu Zhang <zhezhang@redhat.com>
Reported-by: NDave Young <dyoung@redhat.com>
Reported-by: NFreeman Zhang <freeman.zhang1992@gmail.com>
Reported-by: NEmrah Demir <ed@abdsec.com>
Reported-by: NBaoquan He <bhe@redhat.com>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

4046d6e8

13 4月, 2016 1 次提交

x86/mce: Avoid using object after free in genpool · a3125494

由 Tony Luck 提交于 4月 06, 2016

When we loop over all queued machine check error records to pass them
to the registered notifiers we use llist_for_each_entry(). But the loop
calls gen_pool_free() for the entry in the body of the loop - and then
the iterator looks at node->next after the free.

Use llist_for_each_entry_safe() instead.
Signed-off-by: NTony Luck <tony.luck@intel.com>
Signed-off-by: NBorislav Petkov <bp@suse.de>
Cc: <stable@vger.kernel.org>
Cc: Gong Chen <gong.chen@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-edac <linux-edac@vger.kernel.org>
Link: http://lkml.kernel.org/r/0205920@agluck-desk.sc.intel.com
Link: http://lkml.kernel.org/r/1459929916-12852-4-git-send-email-bp@alien8.deSigned-off-by: NIngo Molnar <mingo@kernel.org>

a3125494

11 4月, 2016 3 次提交

KVM: x86: mask CPUID(0xD,0x1).EAX against host value · 316314ca

由 Paolo Bonzini 提交于 3月 21, 2016

This ensures that the guest doesn't see XSAVE extensions
(e.g. xgetbv1 or xsavec) that the host lacks.

Cc: stable@vger.kernel.org
Reviewed-by: NRadim Krčmář <rkrcmar@redhat.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

316314ca

kvm: x86: do not leak guest xcr0 into host interrupt handlers · fc5b7f3b

由 David Matlack 提交于 3月 30, 2016

An interrupt handler that uses the fpu can kill a KVM VM, if it runs
under the following conditions:
 - the guest's xcr0 register is loaded on the cpu
 - the guest's fpu context is not loaded
 - the host is using eagerfpu

Note that the guest's xcr0 register and fpu context are not loaded as
part of the atomic world switch into "guest mode". They are loaded by
KVM while the cpu is still in "host mode".

Usage of the fpu in interrupt context is gated by irq_fpu_usable(). The
interrupt handler will look something like this:

if (irq_fpu_usable()) {
        kernel_fpu_begin();

        [... code that uses the fpu ...]

        kernel_fpu_end();
}

As long as the guest's fpu is not loaded and the host is using eager
fpu, irq_fpu_usable() returns true (interrupted_kernel_fpu_idle()
returns true). The interrupt handler proceeds to use the fpu with
the guest's xcr0 live.

kernel_fpu_begin() saves the current fpu context. If this uses
XSAVE[OPT], it may leave the xsave area in an undesirable state.
According to the SDM, during XSAVE bit i of XSTATE_BV is not modified
if bit i is 0 in xcr0. So it's possible that XSTATE_BV[i] == 1 and
xcr0[i] == 0 following an XSAVE.

kernel_fpu_end() restores the fpu context. Now if any bit i in
XSTATE_BV == 1 while xcr0[i] == 0, XRSTOR generates a #GP. The
fault is trapped and SIGSEGV is delivered to the current process.

Only pre-4.2 kernels appear to be vulnerable to this sequence of
events. Commit 653f52c3 ("kvm,x86: load guest FPU context more eagerly")
from 4.2 forces the guest's fpu to always be loaded on eagerfpu hosts.

This patch fixes the bug by keeping the host's xcr0 loaded outside
of the interrupts-disabled region where KVM switches into guest mode.

Cc: stable@vger.kernel.org
Suggested-by: NAndy Lutomirski <luto@amacapital.net>
Signed-off-by: NDavid Matlack <dmatlack@google.com>
[Move load after goto cancel_injection. - Paolo]
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

fc5b7f3b

KVM: MMU: fix permission_fault() · 7a98205d

由 Xiao Guangrong 提交于 3月 25, 2016

kvm-unit-tests complained about the PFEC is not set properly, e.g,:
test pte.rw pte.d pte.nx pde.p pde.rw pde.pse user fetch: FAIL: error code 15
expected 5
Dump mapping: address: 0x123400000000
------L4: 3e95007
------L3: 3e96007
------L2: 2000083

It's caused by the reason that PFEC returned to guest is copied from the
PFEC triggered by shadow page table

This patch fixes it and makes the logic of updating errcode more clean
Signed-off-by: NXiao Guangrong <guangrong.xiao@linux.intel.com>
[Do not assume pfec.p=1. - Paolo]
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

7a98205d

08 4月, 2016 1 次提交

tools/power turbostat: print IRTL MSRs · 5a63426e

由 Len Brown 提交于 4月 06, 2016

Some processors use the Interrupt Response Time Limit (IRTL) MSR value
to describe the maximum IRQ response time latency for deep
package C-states. (Though others have the register, but do not use it)
Lets print it out to give insight into the cases where it is used.

IRTL begain in SNB, with PC3/PC6/PC7, and HSW added PC8/PC9/PC10.
Signed-off-by: NLen Brown <len.brown@intel.com>
Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>

5a63426e

07 4月, 2016 1 次提交

x86: remove the kernel code/data/bss resources from /proc/iomem · c4004b02

由 Linus Torvalds 提交于 4月 06, 2016

Let's see if anybody even notices.  I doubt anybody uses this, and it
does expose addresses that should be randomized, so let's just remove
the code.  It's old and traditional, and it used to be cute, but we
should have removed this long ago.

If it turns out anybody notices and this breaks something, we'll have to
revert this, and maybe we'll end up using other approaches instead
(using %pK or similar).  But removing unnecessary code is always the
preferred option.
Noted-by: NEmrah Demir <ed@abdsec.com>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

c4004b02

05 4月, 2016 1 次提交

kvm: x86: make lapic hrtimer pinned · 61abdbe0

由 Luiz Capitulino 提交于 4月 04, 2016

When a vCPU runs on a nohz_full core, the hrtimer used by
the lapic emulation code can be migrated to another core.
When this happens, it's possible to observe milisecond
latency when delivering timer IRQs to KVM guests.

The huge latency is mainly due to the fact that
apic_timer_fn() expects to run during a kvm exit. It
sets KVM_REQ_PENDING_TIMER and let it be handled on kvm
entry. However, if the timer fires on a different core,
we have to wait until the next kvm exit for the guest
to see KVM_REQ_PENDING_TIMER set.

This problem became visible after commit 9642d18e. This
commit changed the timer migration code to always attempt
to migrate timers away from nohz_full cores. While it's
discussable if this is correct/desirable (I don't think
it is), it's clear that the lapic emulation code has
a requirement on firing the hrtimer in the same core
where it was started. This is achieved by making the
hrtimer pinned.

Lastly, note that KVM has code to migrate timers when a
vCPU is scheduled to run in different core. However, this
forced migration may fail. When this happens, we can have
the same problem. If we want 100% correctness, we'll have
to modify apic_timer_fn() to cause a kvm exit when it runs
on a different core than the vCPU. Not sure if this is
possible.

Here's a reproducer for the issue being fixed:

 1. Set all cores but core0 to be nohz_full cores
 2. Start a guest with a single vCPU
 3. Trace apic_timer_fn() and kvm_inject_apic_timer_irqs()

You'll see that apic_timer_fn() will run in core0 while
kvm_inject_apic_timer_irqs() runs in a different core. If
you get both on core0, try running a program that takes 100%
of the CPU and pin it to core0 to force the vCPU out.
Signed-off-by: NLuiz Capitulino <lcapitulino@redhat.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

61abdbe0

02 4月, 2016 2 次提交

mm/rmap: batched invalidations should use existing api · 858eaaa7

由 Nadav Amit 提交于 4月 01, 2016

The recently introduced batched invalidations mechanism uses its own
mechanism for shootdown.  However, it does wrong accounting of
interrupts (e.g., inc_irq_stat is called for local invalidations),
trace-points (e.g., TLB_REMOTE_SHOOTDOWN for local invalidations) and
may break some platforms as it bypasses the invalidation mechanisms of
Xen and SGI UV.

This patch reuses the existing TLB flushing mechnaisms instead.  We use
NULL as mm to indicate a global invalidation is required.

Fixes 72b252ae ("mm: send one IPI per CPU to TLB flush all entries after unmapping pages")
Signed-off-by: NNadav Amit <namit@vmware.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Ingo Molnar <mingo@kernel.org>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

858eaaa7

x86/mm: TLB_REMOTE_SEND_IPI should count pages · 18c98243

由 Nadav Amit 提交于 4月 01, 2016

TLB_REMOTE_SEND_IPI was recently introduced, but it counts bytes instead
of pages.  In addition, it does not report correctly the case in which
flush_tlb_page flushes a page.  Fix it to be consistent with other TLB
counters.

Fixes: 5b74283a ("x86, mm: trace when an IPI is about to be sent")
Signed-off-by: NNadav Amit <namit@vmware.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Ingo Molnar <mingo@kernel.org>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

18c98243

01 4月, 2016 4 次提交

kvm: set page dirty only if page has been writable · 14f47605

由 Yu Zhao 提交于 3月 30, 2016

In absence of shadow dirty mask, there is no need to set page dirty
if page has never been writable. This is a tiny optimization but
good to have for people who care much about dirty page tracking.
Signed-off-by: NYu Zhao <yuzhao@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

14f47605

KVM: x86: reduce default value of halt_poll_ns parameter · 14ebda33

由 Paolo Bonzini 提交于 3月 29, 2016

Windows lets applications choose the frequency of the timer tick,
and in Windows 10 the maximum rate was changed from 1024 Hz to
2048 Hz.  Unfortunately, because of the way the Windows API
works, most applications who need a higher rate than the default
64 Hz will just do

   timeGetDevCaps(&tc, sizeof(tc));
   timeBeginPeriod(tc.wPeriodMin);

and pick the maximum rate.  This causes very high CPU usage when
playing media or games on Windows 10, even if the guest does not
actually use the CPU very much, because the frequent timer tick
causes halt_poll_ns to kick in.

There is no really good solution, especially because Microsoft
could sooner or later bump the limit to 4096 Hz, but for now
the best we can do is lower a bit the upper limit for
halt_poll_ns. :-(
Reported-by: NJon Panozzo <jonp@lime-technology.com>
Cc: stable@vger.kernel.org
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

14ebda33

KVM: Hyper-V: do not do hypercall userspace exits if SynIC is disabled · a2b5c3c0

由 Paolo Bonzini 提交于 3月 29, 2016

If SynIC is disabled, there is nothing that userspace can do to
handle these exits; on the other hand, userspace probably will
not know about KVM_EXIT_HYPERV_HCALL and complain about it or
even exit.  Just prevent anything bad from happening by handling
the hypercall in KVM and returning an "invalid hypercall" code.

Fixes: 83326e43
Cc: Andrey Smetanin <irqlevel@gmail.com>
Reviewed-by: NRoman Kagan <rkagan@virtuozzo.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

a2b5c3c0

KVM: x86: Inject pending interrupt even if pending nmi exist · 321c5658

由 Yuki Shibuya 提交于 3月 24, 2016

Non maskable interrupts (NMI) are preferred to interrupts in current
implementation. If a NMI is pending and NMI is blocked by the result
of nmi_allowed(), pending interrupt is not injected and
enable_irq_window() is not executed, even if interrupts injection is
allowed.

In old kernel (e.g. 2.6.32), schedule() is often called in NMI context.
In this case, interrupts are needed to execute iret that intends end
of NMI. The flag of blocking new NMI is not cleared until the guest
execute the iret, and interrupts are blocked by pending NMI. Due to
this, iret can't be invoked in the guest, and the guest is starved
until block is cleared by some events (e.g. canceling injection).

This patch injects pending interrupts, when it's allowed, even if NMI
is blocked. And, If an interrupts is pending after executing
inject_pending_event(), enable_irq_window() is executed regardless of
NMI pending counter.

Cc: stable@vger.kernel.org
Signed-off-by: NYuki Shibuya <shibuya.yk@ncos.nec.co.jp>
Suggested-by: NPaolo Bonzini <pbonzini@redhat.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

321c5658

31 3月, 2016 1 次提交

perf/x86/amd/ibs: Fix pmu::stop() nesting · 85dc6002

由 Peter Zijlstra 提交于 3月 21, 2016

Patch 5a50f529 ("perf/x86/ibs: Fix race with IBS_STARTING state")
closed a big hole while opening another, smaller hole.
Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Fixes: 5a50f529 ("perf/x86/ibs: Fix race with IBS_STARTING state")
Signed-off-by: NIngo Molnar <mingo@kernel.org>

85dc6002

29 3月, 2016 9 次提交

xen/x86: Call cpu_startup_entry(CPUHP_AP_ONLINE_IDLE) from xen_play_dead() · dc6416f1

由 Boris Ostrovsky 提交于 3月 17, 2016

This call has always been missing from xen_play dead() but until
recently this was rather benign. With new cpu hotplug framework
(commit 8df3e07e ("cpu/hotplug: Let upcoming cpu bring itself fully up").
however this call is required, otherwise a hot-plugged CPU will not
be properly brough up (by never calling cpuhp_online_idle())
Signed-off-by: NBoris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: NKonrad Rzeszutek Wilk <konrad.wilk@oracle.com>

dc6416f1

x86/build: Build compressed x86 kernels as PIE · 6d92bc9d

由 H.J. Lu 提交于 3月 16, 2016

The 32-bit x86 assembler in binutils 2.26 will generate R_386_GOT32X
relocation to get the symbol address in PIC.  When the compressed x86
kernel isn't built as PIC, the linker optimizes R_386_GOT32X relocations
to their fixed symbol addresses.  However, when the compressed x86
kernel is loaded at a different address, it leads to the following
load failure:

  Failed to allocate space for phdrs

during the decompression stage.

If the compressed x86 kernel is relocatable at run-time, it should be
compiled with -fPIE, instead of -fPIC, if possible and should be built as
Position Independent Executable (PIE) so that linker won't optimize
R_386_GOT32X relocation to its fixed symbol address.

Older linkers generate R_386_32 relocations against locally defined
symbols, _bss, _ebss, _got and _egot, in PIE.  It isn't wrong, just less
optimal than R_386_RELATIVE.  But the x86 kernel fails to properly handle
R_386_32 relocations when relocating the kernel.  To generate
R_386_RELATIVE relocations, we mark _bss, _ebss, _got and _egot as
hidden in both 32-bit and 64-bit x86 kernels.

To build a 64-bit compressed x86 kernel as PIE, we need to disable the
relocation overflow check to avoid relocation overflow errors. We do
this with a new linker command-line option, -z noreloc-overflow, which
got added recently:

 commit 4c10bbaa0912742322f10d9d5bb630ba4e15dfa7
 Author: H.J. Lu <hjl.tools@gmail.com>
 Date:   Tue Mar 15 11:07:06 2016 -0700

    Add -z noreloc-overflow option to x86-64 ld

    Add -z noreloc-overflow command-line option to the x86-64 ELF linker to
    disable relocation overflow check.  This can be used to avoid relocation
    overflow check if there will be no dynamic relocation overflow at
    run-time.

The 64-bit compressed x86 kernel is built as PIE only if the linker supports
-z noreloc-overflow.  So far 64-bit relocatable compressed x86 kernel
boots fine even when it is built as a normal executable.
Signed-off-by: NH.J. Lu <hjl.tools@gmail.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
[ Edited the changelog and comments. ]
Signed-off-by: NIngo Molnar <mingo@kernel.org>

6d92bc9d

x86/cpu: Add advanced power management bits · 34a4cceb

由 Huang Rui 提交于 3月 25, 2016

Bit 11 of CPUID 8000_0007 edx is processor feedback interface.
Bit 12 of CPUID 8000_0007 edx is accumulated power.

Print proper names in proc/cpuinfo
Reported-and-tested-by: NBorislav Petkov <bp@alien8.de>
Signed-off-by: NHuang Rui <ray.huang@amd.com>
Cc: Tony Li <tony.li@amd.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: Sherry Hurwitz <sherry.hurwitz@amd.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: "Len Brown" <lenb@kernel.org>
Link: http://lkml.kernel.org/r/1458871720-3209-1-git-send-email-ray.huang@amd.comSigned-off-by: NThomas Gleixner <tglx@linutronix.de>

34a4cceb

x86/thread_info: Merge two !__ASSEMBLY__ sections · 5f870a3f

由 Borislav Petkov 提交于 3月 28, 2016

We have

  #ifndef __ASSEMBLY__
  ...
  #endif

  #ifndef __ASSEMBLY__
  ...
  #endif

Merge the two.

No functionality change.
Signed-off-by: NBorislav Petkov <bp@suse.de>
Link: http://lkml.kernel.org/r/1459189217-25532-1-git-send-email-bp@alien8.deSigned-off-by: NThomas Gleixner <tglx@linutronix.de>

5f870a3f

x86/cpufreq: Remove duplicated TDP MSR macro definitions · 4a6772f5

由 Vladimir Zapolskiy 提交于 3月 26, 2016

The list of CPU model specific registers contains two copies of TDP
registers, remove the one, which is out of numerical order in the
list.

Fixes: 6a35fc2d ("cpufreq: intel_pstate: get P1 from TAR when available")
Signed-off-by: NVladimir Zapolskiy <vladimir_zapolskiy@mentor.com>
Cc: Len Brown <len.brown@intel.com>
Cc: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com>
Cc: Kristen Carlson
 Accardi <kristen@linux.intel.com>
Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Link: http://lkml.kernel.org/r/1459018020-24577-1-git-send-email-vladimir_zapolskiy@mentor.comSigned-off-by: NThomas Gleixner <tglx@linutronix.de>

4a6772f5

x86/cpu: Get rid of compute_unit_id · 8196dab4

由 Borislav Petkov 提交于 3月 25, 2016

It is cpu_core_id anyway.
Signed-off-by: NBorislav Petkov <bp@suse.de>
Link: http://lkml.kernel.org/r/1458917557-8757-3-git-send-email-bp@alien8.deSigned-off-by: NThomas Gleixner <tglx@linutronix.de>

8196dab4

perf/x86/amd: Cleanup Fam10h NB event constraints · 32b62f44

由 Peter Zijlstra 提交于 3月 25, 2016

Avoid allocating the AMD NB event constraints data structure when not
needed. This gets rid of x86_max_cores usage and avoids allocating
this on AMD Core Perfctr supporting hardware (which has separate MSRs
for NB events).
Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: NBorislav Petkov <bp@suse.de>
Cc: aherrmann@suse.com
Cc: Rui Huang <ray.huang@amd.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: jencce.kernel@gmail.com
Link: http://lkml.kernel.org/r/20160320124629.GY6375@twins.programming.kicks-ass.netSigned-off-by: NThomas Gleixner <tglx@linutronix.de>

32b62f44

x86/topology: Fix AMD core count · ee6825c8

由 Peter Zijlstra 提交于 3月 25, 2016

It turns out AMD gets x86_max_cores wrong when there are compute
units.

The issue is that Linux assumes:

	nr_logical_cpus = nr_cores * nr_siblings

But AMD reports its CU unit as 2 cores, but then sets num_smp_siblings
to 2 as well.

Boris: fixup ras/mce_amd_inj.c too, to compute the Node Base Core
properly, according to the new nomenclature.

Fixes: 1f12e32f ("x86/topology: Create logical package id")
Reported-by: NXiong Zhou <jencce.kernel@gmail.com>
Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: NBorislav Petkov <bp@suse.de>
Cc: Andreas Herrmann <aherrmann@suse.com>
Cc: Andy Lutomirski <luto@kernel.org>
Link: http://lkml.kernel.org/r/20160317095220.GO6344@twins.programming.kicks-ass.netSigned-off-by: NThomas Gleixner <tglx@linutronix.de>

ee6825c8

x86, pmem: use memcpy_mcsafe() for memcpy_from_pmem() · fc0c2028

由 Dan Williams 提交于 3月 08, 2016

Update the definition of memcpy_from_pmem() to return 0 or a negative
error code.  Implement x86/arch_memcpy_from_pmem() with memcpy_mcsafe().

Cc: Borislav Petkov <bp@alien8.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: NIngo Molnar <mingo@kernel.org>
Reviewed-by: NRoss Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: NDan Williams <dan.j.williams@intel.com>

fc0c2028

26 3月, 2016 3 次提交

ACPI / processor: Request native thermal interrupt handling via _OSC · a2121167

由 Srinivas Pandruvada 提交于 3月 23, 2016

There are several reports of freeze on enabling HWP (Hardware PStates)
feature on Skylake-based systems by the Intel P-states driver. The root
cause is identified as the HWP interrupts causing BIOS code to freeze.

HWP interrupts use the thermal LVT which can be handled by Linux
natively, but on the affected Skylake-based systems SMM will respond
to it by default.  This is a problem for several reasons:
 - On the affected systems the SMM thermal LVT handler is broken (it
   will crash when invoked) and a BIOS update is necessary to fix it.
 - With thermal interrupt handled in SMM we lose all of the reporting
   features of the arch/x86/kernel/cpu/mcheck/therm_throt driver.
 - Some thermal drivers like x86-package-temp depend on the thermal
   threshold interrupts signaled via the thermal LVT.
 - The HWP interrupts are useful for debugging and tuning
   performance (if the kernel can handle them).
The native handling of thermal interrupts needs to be enabled
because of that.

This requires some way to tell SMM that the OS can handle thermal
interrupts.  That can be done by using _OSC/_PDC in processor
scope very early during ACPI initialization.

The meaning of _OSC/_PDC bit 12 in processor scope is whether or
not the OS supports native handling of interrupts for Collaborative
Processor Performance Control (CPPC) notifications.  Since on
HWP-capable systems CPPC is a firmware interface to HWP, setting
this bit effectively tells the firmware that the OS will handle
thermal interrupts natively going forward.

For details on _OSC/_PDC refer to:
http://www.intel.com/content/www/us/en/standards/processor-vendor-specific-acpi-specification.html

To implement the _OSC/_PDC handshake as described, introduce a new
function, acpi_early_processor_osc(), that walks the ACPI
namespace looking for ACPI processor objects and invokes _OSC for
them with bit 12 in the capabilities buffer set and terminates the
namespace walk on the first success.

Also modify intel_thermal_interrupt() to clear HWP status bits in
the HWP_STATUS MSR to acknowledge HWP interrupts (which prevents
them from firing continuously).
Signed-off-by: NSrinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
[ rjw: Subject & changelog, function rename ]
Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>

a2121167

mm, kasan: stackdepot implementation. Enable stackdepot for SLAB · cd11016e

由 Alexander Potapenko 提交于 3月 25, 2016

Implement the stack depot and provide CONFIG_STACKDEPOT.  Stack depot
will allow KASAN store allocation/deallocation stack traces for memory
chunks.  The stack traces are stored in a hash table and referenced by
handles which reside in the kasan_alloc_meta and kasan_free_meta
structures in the allocated memory chunks.

IRQ stack traces are cut below the IRQ entry point to avoid unnecessary
duplication.

Right now stackdepot support is only enabled in SLAB allocator.  Once
KASAN features in SLAB are on par with those in SLUB we can switch SLUB
to stackdepot as well, thus removing the dependency on SLUB stack
bookkeeping, which wastes a lot of memory.

This patch is based on the "mm: kasan: stack depots" patch originally
prepared by Dmitry Chernenkov.

Joonsoo has said that he plans to reuse the stackdepot code for the
mm/page_owner.c debugging facility.

[akpm@linux-foundation.org: s/depot_stack_handle/depot_stack_handle_t]
[aryabinin@virtuozzo.com: comment style fixes]
Signed-off-by: NAlexander Potapenko <glider@google.com>
Signed-off-by: NAndrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Andrey Konovalov <adech.fo@gmail.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Konstantin Serebryany <kcc@google.com>
Cc: Dmitry Chernenkov <dmitryc@google.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

cd11016e

arch, ftrace: for KASAN put hard/soft IRQ entries into separate sections · be7635e7

由 Alexander Potapenko 提交于 3月 25, 2016

KASAN needs to know whether the allocation happens in an IRQ handler.
This lets us strip everything below the IRQ entry point to reduce the
number of unique stack traces needed to be stored.

Move the definition of __irq_entry to <linux/interrupt.h> so that the
users don't need to pull in <linux/ftrace.h>.  Also introduce the
__softirq_entry macro which is similar to __irq_entry, but puts the
corresponding functions to the .softirqentry.text section.
Signed-off-by: NAlexander Potapenko <glider@google.com>
Acked-by: NSteven Rostedt <rostedt@goodmis.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Andrey Konovalov <adech.fo@gmail.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Konstantin Serebryany <kcc@google.com>
Cc: Dmitry Chernenkov <dmitryc@google.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

be7635e7

25 3月, 2016 2 次提交

xen/apic: Provide Xen-specific version of cpu_present_to_apicid APIC op · ed6069be

由 Boris Ostrovsky 提交于 3月 17, 2016

Currently Xen uses default_cpu_present_to_apicid() which will always
report BAD_APICID for PV guests since x86_bios_cpu_apic_id is initialised
to that value and is never updated.

With commit 1f12e32f ("x86/topology: Create logical package id"), this
op is now called by smp_init_package_map() when deciding whether to call
topology_update_package_map() which sets cpu_data(cpu).logical_proc_id.
The latter (as topology_logical_package_id(cpu)) may be used, for example,
by cpu_to_rapl_pmu() as an array index. Since uninitialized
logical_package_id is set to -1, the index will become 64K which is
obviously problematic.

While RAPL code (and any other users of logical_package_id) should be
careful in their assumptions about id's validity, Xen's
cpu_present_to_apicid op should still provide value consistent with its
own xen_apic_read(APIC_ID).
Signed-off-by: NBoris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: NKonrad Rzeszutek Wilk <konrad.wilk@oracle.com>

ed6069be

perf/x86: Move events_sysfs_show() outside CPU_SUP_INTEL · a49ac9f8

由 Huang Rui 提交于 3月 25, 2016

randconfig builds can sometimes disable CONFIG_CPU_SUP_INTEL while
enabling the AMD power reporting PMU driver, resulting in this
build failure:

  arch/x86/kernel/cpu/perf_event.h:663:31: error: 'events_sysfs_show' undeclared here (not in a function)

To fix it, move events_sysfs_show() outside of #ifdef CONFIG_CPU_SUP_INTEL.
Reported-by: NRandy Dunlap <rdunlap@infradead.org>
Reported-by: Nbuild test robot <lkp@intel.com>
Signed-off-by: NHuang Rui <ray.huang@amd.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Borislav Petkov <bp@suse.de>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sherry Hurwitz <sherry.hurwitz@amd.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: acme@kernel.org
Cc: kbuild-all@01.org
Cc: linux-next@vger.kernel.org
Cc: spg_linux_kernel@amd.com
Link: http://lkml.kernel.org/r/1458875905-4278-1-git-send-email-ray.huang@amd.comSigned-off-by: NIngo Molnar <mingo@kernel.org>

a49ac9f8

23 3月, 2016 5 次提交

x86/msr: Remove unused native_read_tscp() · 9da77666

由 Prarit Bhargava 提交于 3月 22, 2016

After e76b027e ("x86,vdso: Use LSL unconditionally for vgetcpu")
native_read_tscp() is unused in the kernel. The function can be removed like
native_read_tsc() was.
Signed-off-by: NPrarit Bhargava <prarit@redhat.com>
Acked-by: NAndy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@suse.de>
Link: http://lkml.kernel.org/r/1458687968-9106-1-git-send-email-prarit@redhat.comSigned-off-by: NThomas Gleixner <tglx@linutronix.de>

9da77666

x86/extable: use generic search and sort routines · 29934b0f

由 Ard Biesheuvel 提交于 3月 22, 2016

Replace the arch specific versions of search_extable() and
sort_extable() with calls to the generic ones, which now support
relative exception tables as well.
Signed-off-by: NArd Biesheuvel <ard.biesheuvel@linaro.org>
Acked-by: NH. Peter Anvin <hpa@linux.intel.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

29934b0f

kernel: add kcov code coverage · 5c9a8750

由 Dmitry Vyukov 提交于 3月 22, 2016

kcov provides code coverage collection for coverage-guided fuzzing
(randomized testing).  Coverage-guided fuzzing is a testing technique
that uses coverage feedback to determine new interesting inputs to a
system.  A notable user-space example is AFL
(http://lcamtuf.coredump.cx/afl/).  However, this technique is not
widely used for kernel testing due to missing compiler and kernel
support.

kcov does not aim to collect as much coverage as possible.  It aims to
collect more or less stable coverage that is function of syscall inputs.
To achieve this goal it does not collect coverage in soft/hard
interrupts and instrumentation of some inherently non-deterministic or
non-interesting parts of kernel is disbled (e.g.  scheduler, locking).

Currently there is a single coverage collection mode (tracing), but the
API anticipates additional collection modes.  Initially I also
implemented a second mode which exposes coverage in a fixed-size hash
table of counters (what Quentin used in his original patch).  I've
dropped the second mode for simplicity.

This patch adds the necessary support on kernel side.  The complimentary
compiler support was added in gcc revision 231296.

We've used this support to build syzkaller system call fuzzer, which has
found 90 kernel bugs in just 2 months:

  https://github.com/google/syzkaller/wiki/Found-Bugs

We've also found 30+ bugs in our internal systems with syzkaller.
Another (yet unexplored) direction where kcov coverage would greatly
help is more traditional "blob mutation".  For example, mounting a
random blob as a filesystem, or receiving a random blob over wire.

Why not gcov.  Typical fuzzing loop looks as follows: (1) reset
coverage, (2) execute a bit of code, (3) collect coverage, repeat.  A
typical coverage can be just a dozen of basic blocks (e.g.  an invalid
input).  In such context gcov becomes prohibitively expensive as
reset/collect coverage steps depend on total number of basic
blocks/edges in program (in case of kernel it is about 2M).  Cost of
kcov depends only on number of executed basic blocks/edges.  On top of
that, kernel requires per-thread coverage because there are always
background threads and unrelated processes that also produce coverage.
With inlined gcov instrumentation per-thread coverage is not possible.

kcov exposes kernel PCs and control flow to user-space which is
insecure.  But debugfs should not be mapped as user accessible.

Based on a patch by Quentin Casasnovas.

[akpm@linux-foundation.org: make task_struct.kcov_mode have type `enum kcov_mode']
[akpm@linux-foundation.org: unbreak allmodconfig]
[akpm@linux-foundation.org: follow x86 Makefile layout standards]
Signed-off-by: NDmitry Vyukov <dvyukov@google.com>
Reviewed-by: NKees Cook <keescook@chromium.org>
Cc: syzkaller <syzkaller@googlegroups.com>
Cc: Vegard Nossum <vegard.nossum@oracle.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Tavis Ormandy <taviso@google.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Quentin Casasnovas <quentin.casasnovas@oracle.com>
Cc: Kostya Serebryany <kcc@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Kees Cook <keescook@google.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: David Drysdale <drysdale@google.com>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

5c9a8750

x86/compat: remove is_compat_task() · f970165b

由 Andy Lutomirski 提交于 3月 22, 2016

x86's is_compat_task always checked the current syscall type, not the
task type.  It has no non-arch users any more, so just remove it to
avoid confusion.

On x86, nothing should really be checking the task ABI.  There are
legitimate users for the syscall ABI and for the mm ABI.
Signed-off-by: NAndy Lutomirski <luto@kernel.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

f970165b

KVM: page_track: fix access to NULL slot · a6adb106

由 Paolo Bonzini 提交于 3月 22, 2016

This happens when doing the reboot test from virt-tests:

[  131.833653] BUG: unable to handle kernel NULL pointer dereference at           (null)
[  131.842461] IP: [<ffffffffa0950087>] kvm_page_track_is_active+0x17/0x60 [kvm]
[  131.850500] PGD 0
[  131.852763] Oops: 0000 [#1] SMP
[  132.007188] task: ffff880075fbc500 ti: ffff880850a3c000 task.ti: ffff880850a3c000
[  132.138891] Call Trace:
[  132.141639]  [<ffffffffa092bd11>] page_fault_handle_page_track+0x31/0x40 [kvm]
[  132.149732]  [<ffffffffa093380f>] paging64_page_fault+0xff/0x910 [kvm]
[  132.172159]  [<ffffffffa092c734>] kvm_mmu_page_fault+0x64/0x110 [kvm]
[  132.179372]  [<ffffffffa06743c2>] handle_exception+0x1b2/0x430 [kvm_intel]
[  132.187072]  [<ffffffffa067a301>] vmx_handle_exit+0x1e1/0xc50 [kvm_intel]
...

Cc: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Fixes: 3d0c27adSigned-off-by: NPaolo Bonzini <pbonzini@redhat.com>

a6adb106

22 3月, 2016 6 次提交

kvm, rt: change async pagefault code locking for PREEMPT_RT · 9db284f3

由 Rik van Riel 提交于 3月 21, 2016

The async pagefault wake code can run from the idle task in exception
context, so everything here needs to be made non-preemptible.

Conversion to a simple wait queue and raw spinlock does the trick.
Signed-off-by: NRik van Riel <riel@redhat.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

9db284f3

KVM/x86: update the comment of memory barrier in the vcpu_enter_guest() · 0f127d12

由 Lan Tianyu 提交于 3月 13, 2016

The barrier also orders the write to mode from any reads
to the page tables done and so update the comment.
Signed-off-by: NLan Tianyu <tianyu.lan@intel.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

0f127d12

KVM/x86: Call smp_wmb() before increasing tlbs_dirty · 7bfdf217

由 Lan Tianyu 提交于 3月 13, 2016

Update spte before increasing tlbs_dirty to make sure no tlb flush
in lost after spte is zapped. This pairs with the barrier in the
kvm_flush_remote_tlbs().
Signed-off-by: NLan Tianyu <tianyu.lan@intel.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

7bfdf217

L
KVM/x86: Replace smp_mb() with smp_store_mb/release() in the walk_shadow_page_lockless_begin/end() · 36ca7e0a
由 Lan Tianyu 提交于 3月 13, 2016
```
Signed-off-by: NLan Tianyu <tianyu.lan@intel.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
```
36ca7e0a

KVM: Remove redundant smp_mb() in the kvm_mmu_commit_zap_page() · 9753f529

由 Lan Tianyu 提交于 3月 13, 2016

There is already a barrier inside of kvm_flush_remote_tlbs() which can
help to make sure everyone sees our modifications to the page tables and
see changes to vcpu->mode here. So remove the smp_mb in the
kvm_mmu_commit_zap_page() and update the comment.
Signed-off-by: NLan Tianyu <tianyu.lan@intel.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

9753f529

KVM, pkeys: expose CPUID/CR4 to guest · b9baba86

由 Huaitong Han 提交于 3月 22, 2016

X86_FEATURE_PKU is referred to as "PKU" in the hardware documentation:
CPUID.7.0.ECX[3]:PKU. X86_FEATURE_OSPKE is software support for pkeys,
enumerated with CPUID.7.0.ECX[4]:OSPKE, and it reflects the setting of
CR4.PKE(bit 22).

This patch disables CPUID:PKU without ept, because pkeys is not yet
implemented for shadow paging.
Signed-off-by: NHuaitong Han <huaitong.han@intel.com>
Reviewed-by: NXiao Guangrong <guangrong.xiao@linux.intel.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

b9baba86

openanolis / cloud-kernel 1 年多 前同步成功

openanolis / cloud-kernel
1 年多前同步成功