- 14 5月, 2019 1 次提交
-
-
由 Stephane Eranian 提交于
This patch fixes a restriction/bug introduced by: 583feb08 ("perf/x86/intel: Fix handling of wakeup_events for multi-entry PEBS") The original patch prevented using multi-entry PEBS when wakeup_events != 0. However given that wakeup_events is part of a union with wakeup_watermark, it means that in watermark mode, PEBS multi-entry is also disabled which is not the intent. This patch fixes this by checking is watermark mode is enabled. Signed-off-by: NStephane Eranian <eranian@google.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: jolsa@redhat.com Cc: kan.liang@intel.com Cc: vincent.weaver@maine.edu Fixes: 583feb08 ("perf/x86/intel: Fix handling of wakeup_events for multi-entry PEBS") Link: http://lkml.kernel.org/r/20190514003400.224340-1-eranian@google.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
- 10 5月, 2019 1 次提交
-
-
由 Stephane Eranian 提交于
On Intel Westmere, a cmdline as follows: $ perf record -e cpu/event=0xc4,umask=0x2,name=br_inst_retired.near_call/p .... was failing. Yet the event+ umask support PEBS. It turns out this is due to a bug in the the PEBS event constraint table for westmere. All forms of BR_INST_RETIRED.* support PEBS. Therefore the constraint mask should ignore the umask. The name of the macro INTEL_FLAGS_EVENT_CONSTRAINT() hint that this is the case but it was not. That macros was checking both the event code and event umask. Therefore, it was only matching on 0x00c4. There are code+umask macros, they all have *UEVENT*. This bug fixes the issue by checking only the event code in the mask. Both single and range version are modified. Signed-off-by: NStephane Eranian <eranian@google.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Cc: kan.liang@intel.com Link: http://lkml.kernel.org/r/20190509214556.123493-1-eranian@google.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
- 08 5月, 2019 1 次提交
-
-
由 Wang Hai 提交于
Fixes the following sparse warning: arch/x86/events/amd/iommu.c:396:30: warning: symbol 'amd_iommu_attr_groups' was not declared. Should it be static? Signed-off-by: NWang Hai <wanghai26@huawei.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: acme@kernel.org Cc: alexander.shishkin@linux.intel.com Cc: bp@alien8.de Cc: jolsa@redhat.com Cc: namhyung@kernel.org Fixes: 51686546 (x86/events/amd/iommu: Fix sysfs perf attribute groups) Link: http://lkml.kernel.org/r/20190508020418.19568-1-wanghai26@huawei.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
- 06 5月, 2019 1 次提交
-
-
由 Nadav Amit 提交于
Poking-mm initialization might require to duplicate the PGD in early stage. Initialize the PGD cache earlier to prevent boot failures. Reported-by: Nkernel test robot <lkp@intel.com> Signed-off-by: NNadav Amit <namit@vmware.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rick Edgecombe <rick.p.edgecombe@intel.com> Cc: Rik van Riel <riel@surriel.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Thomas Gleixner <tglx@linutronix.de> Fixes: 4fc19708 ("x86/alternatives: Initialize temporary mm for patching") Link: http://lkml.kernel.org/r/20190505011124.39692-1-namit@vmware.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
- 05 5月, 2019 1 次提交
-
-
由 Jiri Olsa 提交于
New race in x86_pmu_stop() was introduced by replacing the atomic __test_and_clear_bit() of cpuc->active_mask by separate test_bit() and __clear_bit() calls in the following commit: 3966c3fe ("x86/perf/amd: Remove need to check "running" bit in NMI handler") The race causes panic for PEBS events with enabled callchains: BUG: unable to handle kernel NULL pointer dereference at 0000000000000000 ... RIP: 0010:perf_prepare_sample+0x8c/0x530 Call Trace: <NMI> perf_event_output_forward+0x2a/0x80 __perf_event_overflow+0x51/0xe0 handle_pmi_common+0x19e/0x240 intel_pmu_handle_irq+0xad/0x170 perf_event_nmi_handler+0x2e/0x50 nmi_handle+0x69/0x110 default_do_nmi+0x3e/0x100 do_nmi+0x11a/0x180 end_repeat_nmi+0x16/0x1a RIP: 0010:native_write_msr+0x6/0x20 ... </NMI> intel_pmu_disable_event+0x98/0xf0 x86_pmu_stop+0x6e/0xb0 x86_pmu_del+0x46/0x140 event_sched_out.isra.97+0x7e/0x160 ... The event is configured to make samples from PEBS drain code, but when it's disabled, we'll go through NMI path instead, where data->callchain will not get allocated and we'll crash: x86_pmu_stop test_bit(hwc->idx, cpuc->active_mask) intel_pmu_disable_event(event) { ... intel_pmu_pebs_disable(event); ... EVENT OVERFLOW -> <NMI> intel_pmu_handle_irq handle_pmi_common TEST PASSES -> test_bit(bit, cpuc->active_mask)) perf_event_overflow perf_prepare_sample { ... if (!(sample_type & __PERF_SAMPLE_CALLCHAIN_EARLY)) data->callchain = perf_callchain(event, regs); CRASH -> size += data->callchain->nr; } </NMI> ... x86_pmu_disable_event(event) } __clear_bit(hwc->idx, cpuc->active_mask); Fixing this by disabling the event itself before setting off the PEBS bit. Signed-off-by: NJiri Olsa <jolsa@kernel.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: David Arcari <darcari@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Lendacky Thomas <Thomas.Lendacky@amd.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Fixes: 3966c3fe ("x86/perf/amd: Remove need to check "running" bit in NMI handler") Link: http://lkml.kernel.org/r/20190504151556.31031-1-jolsa@kernel.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
- 03 5月, 2019 2 次提交
-
-
由 Alexander Shishkin 提交于
Now that all AUX allocations are high-order by default, the software double buffering PMU capability doesn't make sense any more, get rid of it. In case some PMUs choose to opt out, we can re-introduce it. Signed-off-by: NAlexander Shishkin <alexander.shishkin@linux.intel.com> Acked-by: NPeter Zijlstra <a.p.zijlstra@chello.nl> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Cc: adrian.hunter@intel.com Link: http://lkml.kernel.org/r/20190503085536.24119-3-alexander.shishkin@linux.intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Kim Phillips 提交于
Add a new amd_hw_cache_event_ids_f17h assignment structure set for AMD families 17h and above, since a lot has changed. Specifically: L1 Data Cache The data cache access counter remains the same on Family 17h. For DC misses, PMCx041's definition changes with Family 17h, so instead we use the L2 cache accesses from L1 data cache misses counter (PMCx060,umask=0xc8). For DC hardware prefetch events, Family 17h breaks compatibility for PMCx067 "Data Prefetcher", so instead, we use PMCx05a "Hardware Prefetch DC Fills." L1 Instruction Cache PMCs 0x80 and 0x81 (32-byte IC fetches and misses) are backward compatible on Family 17h. For prefetches, we remove the erroneous PMCx04B assignment which counts how many software data cache prefetch load instructions were dispatched. LL - Last Level Cache Removing PMCs 7D, 7E, and 7F assignments, as they do not exist on Family 17h, where the last level cache is L3. L3 counters can be accessed using the existing AMD Uncore driver. Data TLB On Intel machines, data TLB accesses ("dTLB-loads") are assigned to counters that count load/store instructions retired. This is inconsistent with instruction TLB accesses, where Intel implementations report iTLB misses that hit in the STLB. Ideally, dTLB-loads would count higher level dTLB misses that hit in lower level TLBs, and dTLB-load-misses would report those that also missed in those lower-level TLBs, therefore causing a page table walk. That would be consistent with instruction TLB operation, remove the redundancy between dTLB-loads and L1-dcache-loads, and prevent perf from producing artificially low percentage ratios, i.e. the "0.01%" below: 42,550,869 L1-dcache-loads 41,591,860 dTLB-loads 4,802 dTLB-load-misses # 0.01% of all dTLB cache hits 7,283,682 L1-dcache-stores 7,912,392 dTLB-stores 310 dTLB-store-misses On AMD Families prior to 17h, the "Data Cache Accesses" counter is used, which is slightly better than load/store instructions retired, but still counts in terms of individual load/store operations instead of TLB operations. So, for AMD Families 17h and higher, this patch assigns "dTLB-loads" to a counter for L1 dTLB misses that hit in the L2 dTLB, and "dTLB-load-misses" to a counter for L1 DTLB misses that caused L2 DTLB misses and therefore also caused page table walks. This results in a much more accurate view of data TLB performance: 60,961,781 L1-dcache-loads 4,601 dTLB-loads 963 dTLB-load-misses # 20.93% of all dTLB cache hits Note that for all AMD families, data loads and stores are combined in a single accesses counter, so no 'L1-dcache-stores' are reported separately, and stores are counted with loads in 'L1-dcache-loads'. Also note that the "% of all dTLB cache hits" string is misleading because (a) "dTLB cache": although TLBs can be considered caches for page tables, in this context, it can be misinterpreted as data cache hits because the figures are similar (at least on Intel), and (b) not all those loads (technically accesses) technically "hit" at that hardware level. "% of all dTLB accesses" would be more clear/accurate. Instruction TLB On Intel machines, 'iTLB-loads' measure iTLB misses that hit in the STLB, and 'iTLB-load-misses' measure iTLB misses that also missed in the STLB and completed a page table walk. For AMD Family 17h and above, for 'iTLB-loads' we replace the erroneous instruction cache fetches counter with PMCx084 "L1 ITLB Miss, L2 ITLB Hit". For 'iTLB-load-misses' we still use PMCx085 "L1 ITLB Miss, L2 ITLB Miss", but set a 0xff umask because without it the event does not get counted. Branch Predictor (BPU) PMCs 0xc2 and 0xc3 continue to be valid across all AMD Families. Node Level Events Family 17h does not have a PMCx0e9 counter, and corresponding counters have not been made available publicly, so for now, we mark them as unsupported for Families 17h and above. Reference: "Open-Source Register Reference For AMD Family 17h Processors Models 00h-2Fh" Released 7/17/2018, Publication #56255, Revision 3.03: https://www.amd.com/system/files/TechDocs/56255_OSRR.pdf [ mingo: tidied up the line breaks. ] Signed-off-by: NKim Phillips <kim.phillips@amd.com> Cc: <stable@vger.kernel.org> # v4.9+ Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Janakarajan Natarajan <Janakarajan.Natarajan@amd.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Martin Liška <mliska@suse.cz> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Pu Wen <puwen@hygon.cn> Cc: Stephane Eranian <eranian@google.com> Cc: Suravee Suthikulpanit <Suravee.Suthikulpanit@amd.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Thomas Lendacky <Thomas.Lendacky@amd.com> Cc: Vince Weaver <vincent.weaver@maine.edu> Cc: linux-kernel@vger.kernel.org Cc: linux-perf-users@vger.kernel.org Fixes: e40ed154 ("perf/x86: Add perf support for AMD family-17h processors") Signed-off-by: NIngo Molnar <mingo@kernel.org>
-
- 02 5月, 2019 1 次提交
-
-
由 Linus Torvalds 提交于
The pvlock_page and hvclock_page variables are (as the name implies) addresses to pages, created by the linker script. But we declared them as just "extern u8" variables, which _works_, but now that gcc does some more bounds checking, it causes warnings like warning: array subscript 1 is outside array bounds of ‘u8[1]’ when we then access more than one byte from those variables. Fix this by simply making the declaration of the variables match reality, which makes the compiler happy too. Signed-off-by: NLinus Torvalds <torvalds@-linux-foundation.org>
-
- 01 5月, 2019 4 次提交
-
-
由 Jim Mattson 提交于
The size checks in vmx_nested_state are wrong because the calculations are made based on the size of a pointer to a struct kvm_nested_state rather than the size of a struct kvm_nested_state. Reported-by: NFelix Wilhelm <fwilhelm@google.com> Signed-off-by: NJim Mattson <jmattson@google.com> Reviewed-by: NDrew Schmitt <dasch@google.com> Reviewed-by: NMarc Orr <marcorr@google.com> Reviewed-by: NPeter Shier <pshier@google.com> Reviewed-by: NKrish Sadhukhan <krish.sadhukhan@oracle.com> Fixes: 8fcc4b59 Cc: stable@ver.kernel.org Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Sean Christopherson 提交于
...to avoid dereferencing a null pointer when querying the per-vCPU timer advance. Fixes: 39497d76 ("KVM: lapic: Track lapic timer advance per vCPU") Reported-by: syzbot+f7e65445a40d3e0e4ebf@syzkaller.appspotmail.com Signed-off-by: NSean Christopherson <sean.j.christopherson@intel.com> Reviewed-by: NKonrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Vitaly Kuznetsov 提交于
Commit 47c42e6b ("KVM: x86: fix handling of role.cr4_pae and rename it to 'gpte_size'") introduced a regression: 32-bit PAE guests stopped working. The issue appears to be: when guest switches (enables) PAE we need to re-initialize MMU context (set context->root_level, do reset_rsvds_bits_mask(), ...) but init_kvm_tdp_mmu() doesn't do that because we threw away is_pae(vcpu) flag from mmu role. Restore it to kvm_mmu_extended_role (as we now don't need it in base role) to fix the issue. Fixes: 47c42e6b ("KVM: x86: fix handling of role.cr4_pae and rename it to 'gpte_size'") Signed-off-by: NVitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Sean Christopherson 提交于
KVM's recent bug fix to update %rip after emulating I/O broke userspace that relied on the previous behavior of incrementing %rip prior to exiting to userspace. When running a Windows XP guest on AMD hardware, Qemu may patch "OUT 0x7E" instructions in reaction to the OUT itself. Because KVM's old behavior was to increment %rip before exiting to userspace to handle the I/O, Qemu manually adjusted %rip to account for the OUT instruction. Arguably this is a userspace bug as KVM requires userspace to re-enter the kernel to complete instruction emulation before taking any other actions. That being said, this is a bit of a grey area and breaking userspace that has worked for many years is bad. Pre-increment %rip on OUT to port 0x7e before exiting to userspace to hack around the issue. Fixes: 45def77e ("KVM: x86: update %rip after emulating IO") Reported-by: NSimon Becherer <simon@becherer.de> Reported-and-tested-by: NIakov Karpov <srid@rkmail.ru> Reported-by: NGabriele Balducci <balducci@units.it> Reported-by: NAntti Antinoja <reader@fennosys.fi> Cc: stable@vger.kernel.org Cc: Takashi Iwai <tiwai@suse.com> Cc: Jiri Slaby <jslaby@suse.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: NSean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
- 30 4月, 2019 20 次提交
-
-
由 Gary Hook 提交于
Enablement of AMD's Secure Memory Encryption feature is determined very early after start_kernel() is entered. Part of this procedure involves scanning the command line for the parameter 'mem_encrypt'. To determine intended state, the function sme_enable() uses library functions cmdline_find_option() and strncmp(). Their use occurs early enough such that it cannot be assumed that any instrumentation subsystem is initialized. For example, making calls to a KASAN-instrumented function before KASAN is set up will result in the use of uninitialized memory and a boot failure. When AMD's SME support is enabled, conditionally disable instrumentation of these dependent functions in lib/string.c and arch/x86/lib/cmdline.c. [ bp: Get rid of intermediary nostackp var and cleanup whitespace. ] Fixes: aca20d54 ("x86/mm: Add support to make use of Secure Memory Encryption") Reported-by: NLi RongQing <lirongqing@baidu.com> Signed-off-by: NGary R Hook <gary.hook@amd.com> Signed-off-by: NBorislav Petkov <bp@suse.de> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: Boris Brezillon <bbrezillon@kernel.org> Cc: Coly Li <colyli@suse.de> Cc: "dave.hansen@linux.intel.com" <dave.hansen@linux.intel.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Kees Cook <keescook@chromium.org> Cc: Kent Overstreet <kent.overstreet@gmail.com> Cc: "luto@kernel.org" <luto@kernel.org> Cc: Masahiro Yamada <yamada.masahiro@socionext.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: "mingo@redhat.com" <mingo@redhat.com> Cc: "peterz@infradead.org" <peterz@infradead.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: x86-ml <x86@kernel.org> Link: https://lkml.kernel.org/r/155657657552.7116.18363762932464011367.stgit@sosrh3.amd.com
-
由 Nadav Amit 提交于
Add a comment to clarify that users of text_poke() must ensure that no races with module removal take place. Signed-off-by: NNadav Amit <namit@vmware.com> Signed-off-by: NRick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: <akpm@linux-foundation.org> Cc: <ard.biesheuvel@linaro.org> Cc: <deneen.t.dock@intel.com> Cc: <kernel-hardening@lists.openwall.com> Cc: <kristen@linux.intel.com> Cc: <linux_dti@icloud.com> Cc: <will.deacon@arm.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Rik van Riel <riel@surriel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20190426001143.4983-22-namit@vmware.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Rick Edgecombe 提交于
Use new flag VM_FLUSH_RESET_PERMS for handling freeing of special permissioned memory in vmalloc and remove places where memory was set NX and RW before freeing which is no longer needed. Signed-off-by: NRick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: <akpm@linux-foundation.org> Cc: <ard.biesheuvel@linaro.org> Cc: <deneen.t.dock@intel.com> Cc: <kernel-hardening@lists.openwall.com> Cc: <kristen@linux.intel.com> Cc: <linux_dti@icloud.com> Cc: <will.deacon@arm.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Nadav Amit <nadav.amit@gmail.com> Cc: Rik van Riel <riel@surriel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20190426001143.4983-21-namit@vmware.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Rick Edgecombe 提交于
Use new flag VM_FLUSH_RESET_PERMS for handling freeing of special permissioned memory in vmalloc and remove places where memory was set NX and RW before freeing which is no longer needed. Tested-by: NSteven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: NRick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Acked-by: NSteven Rostedt (VMware) <rostedt@goodmis.org> Cc: <akpm@linux-foundation.org> Cc: <ard.biesheuvel@linaro.org> Cc: <deneen.t.dock@intel.com> Cc: <kernel-hardening@lists.openwall.com> Cc: <kristen@linux.intel.com> Cc: <linux_dti@icloud.com> Cc: <will.deacon@arm.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Nadav Amit <nadav.amit@gmail.com> Cc: Rik van Riel <riel@surriel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20190426001143.4983-20-namit@vmware.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Rick Edgecombe 提交于
Make hibernate handle unmapped pages on the direct map when CONFIG_ARCH_HAS_SET_ALIAS=y is set. These functions allow for setting pages to invalid configurations, so now hibernate should check if the pages have valid mappings and handle if they are unmapped when doing a hibernate save operation. Previously this checking was already done when CONFIG_DEBUG_PAGEALLOC=y was configured. It does not appear to have a big hibernating performance impact. The speed of the saving operation before this change was measured as 819.02 MB/s, and after was measured at 813.32 MB/s. Before: [ 4.670938] PM: Wrote 171996 kbytes in 0.21 seconds (819.02 MB/s) After: [ 4.504714] PM: Wrote 178932 kbytes in 0.22 seconds (813.32 MB/s) Signed-off-by: NRick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Acked-by: NPavel Machek <pavel@ucw.cz> Cc: <akpm@linux-foundation.org> Cc: <ard.biesheuvel@linaro.org> Cc: <deneen.t.dock@intel.com> Cc: <kernel-hardening@lists.openwall.com> Cc: <kristen@linux.intel.com> Cc: <linux_dti@icloud.com> Cc: <will.deacon@arm.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Nadav Amit <nadav.amit@gmail.com> Cc: Rafael J. Wysocki <rjw@rjwysocki.net> Cc: Rik van Riel <riel@surriel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20190426001143.4983-16-namit@vmware.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Rick Edgecombe 提交于
Add two new functions set_direct_map_default_noflush() and set_direct_map_invalid_noflush() for setting the direct map alias for the page to its default valid permissions and to an invalid state that cannot be cached in a TLB, respectively. These functions do not flush the TLB. Note, __kernel_map_pages() does something similar but flushes the TLB and doesn't reset the permission bits to default on all architectures. Also add an ARCH config ARCH_HAS_SET_DIRECT_MAP for specifying whether these have an actual implementation or a default empty one. Signed-off-by: NRick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: <akpm@linux-foundation.org> Cc: <ard.biesheuvel@linaro.org> Cc: <deneen.t.dock@intel.com> Cc: <kernel-hardening@lists.openwall.com> Cc: <kristen@linux.intel.com> Cc: <linux_dti@icloud.com> Cc: <will.deacon@arm.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Nadav Amit <nadav.amit@gmail.com> Cc: Rik van Riel <riel@surriel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20190426001143.4983-15-namit@vmware.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Nadav Amit 提交于
The return value of text_poke_early() and text_poke_bp() is useless. Remove it. Signed-off-by: NNadav Amit <namit@vmware.com> Signed-off-by: NRick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: <akpm@linux-foundation.org> Cc: <ard.biesheuvel@linaro.org> Cc: <deneen.t.dock@intel.com> Cc: <kernel-hardening@lists.openwall.com> Cc: <kristen@linux.intel.com> Cc: <linux_dti@icloud.com> Cc: <will.deacon@arm.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@intel.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Kees Cook <keescook@chromium.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Rik van Riel <riel@surriel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20190426001143.4983-14-namit@vmware.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Nadav Amit 提交于
There are only two types of text poking: early and breakpoint based. The use of a function pointer to perform text poking complicates the code and is probably inefficient due to the use of indirect branches. Signed-off-by: NNadav Amit <namit@vmware.com> Signed-off-by: NRick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: <akpm@linux-foundation.org> Cc: <ard.biesheuvel@linaro.org> Cc: <deneen.t.dock@intel.com> Cc: <kernel-hardening@lists.openwall.com> Cc: <kristen@linux.intel.com> Cc: <linux_dti@icloud.com> Cc: <will.deacon@arm.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@intel.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Kees Cook <keescook@chromium.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Rik van Riel <riel@surriel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20190426001143.4983-13-namit@vmware.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Nadav Amit 提交于
When modules and BPF filters are loaded, there is a time window in which some memory is both writable and executable. An attacker that has already found another vulnerability (e.g., a dangling pointer) might be able to exploit this behavior to overwrite kernel code. Prevent having writable executable PTEs in this stage. In addition, avoiding having W+X mappings can also slightly simplify the patching of modules code on initialization (e.g., by alternatives and static-key), as would be done in the next patch. This was actually the main motivation for this patch. To avoid having W+X mappings, set them initially as RW (NX) and after they are set as RO set them as X as well. Setting them as executable is done as a separate step to avoid one core in which the old PTE is cached (hence writable), and another which sees the updated PTE (executable), which would break the W^X protection. Suggested-by: NThomas Gleixner <tglx@linutronix.de> Suggested-by: NAndy Lutomirski <luto@amacapital.net> Signed-off-by: NNadav Amit <namit@vmware.com> Signed-off-by: NRick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: <akpm@linux-foundation.org> Cc: <ard.biesheuvel@linaro.org> Cc: <deneen.t.dock@intel.com> Cc: <kernel-hardening@lists.openwall.com> Cc: <kristen@linux.intel.com> Cc: <linux_dti@icloud.com> Cc: <will.deacon@arm.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@intel.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Jessica Yu <jeyu@kernel.org> Cc: Kees Cook <keescook@chromium.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Rik van Riel <riel@surriel.com> Link: https://lkml.kernel.org/r/20190426001143.4983-12-namit@vmware.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Nadav Amit 提交于
Set the page as executable after allocation. This patch is a preparatory patch for a following patch that makes module allocated pages non-executable. While at it, do some small cleanup of what appears to be unnecessary masking. Signed-off-by: NNadav Amit <namit@vmware.com> Signed-off-by: NRick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: <akpm@linux-foundation.org> Cc: <ard.biesheuvel@linaro.org> Cc: <deneen.t.dock@intel.com> Cc: <kernel-hardening@lists.openwall.com> Cc: <kristen@linux.intel.com> Cc: <linux_dti@icloud.com> Cc: <will.deacon@arm.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Rik van Riel <riel@surriel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20190426001143.4983-11-namit@vmware.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Nadav Amit 提交于
Since alloc_module() will not set the pages as executable soon, set ftrace trampoline pages as executable after they are allocated. For the time being, do not change ftrace to use the text_poke() interface. As a result, ftrace still breaks W^X. Signed-off-by: NNadav Amit <namit@vmware.com> Signed-off-by: NRick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: NSteven Rostedt (VMware) <rostedt@goodmis.org> Cc: <akpm@linux-foundation.org> Cc: <ard.biesheuvel@linaro.org> Cc: <deneen.t.dock@intel.com> Cc: <kernel-hardening@lists.openwall.com> Cc: <kristen@linux.intel.com> Cc: <linux_dti@icloud.com> Cc: <will.deacon@arm.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Rik van Riel <riel@surriel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20190426001143.4983-10-namit@vmware.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Nadav Amit 提交于
text_poke() already ensures that the written value is the correct one and fails if that is not the case. There is no need for an additional comparison. Remove it. Signed-off-by: NNadav Amit <namit@vmware.com> Signed-off-by: NRick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: <akpm@linux-foundation.org> Cc: <ard.biesheuvel@linaro.org> Cc: <deneen.t.dock@intel.com> Cc: <kernel-hardening@lists.openwall.com> Cc: <kristen@linux.intel.com> Cc: <linux_dti@icloud.com> Cc: <will.deacon@arm.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Rik van Riel <riel@surriel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20190426001143.4983-9-namit@vmware.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Nadav Amit 提交于
text_poke() can potentially compromise security as it sets temporary PTEs in the fixmap. These PTEs might be used to rewrite the kernel code from other cores accidentally or maliciously, if an attacker gains the ability to write onto kernel memory. Moreover, since remote TLBs are not flushed after the temporary PTEs are removed, the time-window in which the code is writable is not limited if the fixmap PTEs - maliciously or accidentally - are cached in the TLB. To address these potential security hazards, use a temporary mm for patching the code. Finally, text_poke() is also not conservative enough when mapping pages, as it always tries to map 2 pages, even when a single one is sufficient. So try to be more conservative, and do not map more than needed. Signed-off-by: NNadav Amit <namit@vmware.com> Signed-off-by: NRick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: <akpm@linux-foundation.org> Cc: <ard.biesheuvel@linaro.org> Cc: <deneen.t.dock@intel.com> Cc: <kernel-hardening@lists.openwall.com> Cc: <kristen@linux.intel.com> Cc: <linux_dti@icloud.com> Cc: <will.deacon@arm.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@intel.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Kees Cook <keescook@chromium.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Rik van Riel <riel@surriel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20190426001143.4983-8-namit@vmware.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Nadav Amit 提交于
To prevent improper use of the PTEs that are used for text patching, the next patches will use a temporary mm struct. Initailize it by copying the init mm. The address that will be used for patching is taken from the lower area that is usually used for the task memory. Doing so prevents the need to frequently synchronize the temporary-mm (e.g., when BPF programs are installed), since different PGDs are used for the task memory. Finally, randomize the address of the PTEs to harden against exploits that use these PTEs. Suggested-by: NAndy Lutomirski <luto@kernel.org> Tested-by: NMasami Hiramatsu <mhiramat@kernel.org> Signed-off-by: NNadav Amit <namit@vmware.com> Signed-off-by: NRick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: NMasami Hiramatsu <mhiramat@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@intel.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Kees Cook <keescook@chromium.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Rik van Riel <riel@surriel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: akpm@linux-foundation.org Cc: ard.biesheuvel@linaro.org Cc: deneen.t.dock@intel.com Cc: kernel-hardening@lists.openwall.com Cc: kristen@linux.intel.com Cc: linux_dti@icloud.com Cc: will.deacon@arm.com Link: https://lkml.kernel.org/r/20190426232303.28381-8-nadav.amit@gmail.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Nadav Amit 提交于
Prevent user watchpoints from mistakenly firing while the temporary mm is being used. As the addresses of the temporary mm might overlap those of the user-process, this is necessary to prevent wrong signals or worse things from happening. Signed-off-by: NNadav Amit <namit@vmware.com> Signed-off-by: NRick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: <akpm@linux-foundation.org> Cc: <ard.biesheuvel@linaro.org> Cc: <deneen.t.dock@intel.com> Cc: <kernel-hardening@lists.openwall.com> Cc: <kristen@linux.intel.com> Cc: <linux_dti@icloud.com> Cc: <will.deacon@arm.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Rik van Riel <riel@surriel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20190426001143.4983-5-namit@vmware.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Andy Lutomirski 提交于
Using a dedicated page-table for temporary PTEs prevents other cores from using - even speculatively - these PTEs, thereby providing two benefits: (1) Security hardening: an attacker that gains kernel memory writing abilities cannot easily overwrite sensitive data. (2) Avoiding TLB shootdowns: the PTEs do not need to be flushed in remote page-tables. To do so a temporary mm_struct can be used. Mappings which are private for this mm can be set in the userspace part of the address-space. During the whole time in which the temporary mm is loaded, interrupts must be disabled. The first use-case for temporary mm struct, which will follow, is for poking the kernel text. [ Commit message was written by Nadav Amit ] Tested-by: NMasami Hiramatsu <mhiramat@kernel.org> Signed-off-by: NAndy Lutomirski <luto@kernel.org> Signed-off-by: NNadav Amit <namit@vmware.com> Signed-off-by: NRick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: NMasami Hiramatsu <mhiramat@kernel.org> Cc: <akpm@linux-foundation.org> Cc: <ard.biesheuvel@linaro.org> Cc: <deneen.t.dock@intel.com> Cc: <kernel-hardening@lists.openwall.com> Cc: <kristen@linux.intel.com> Cc: <linux_dti@icloud.com> Cc: <will.deacon@arm.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@intel.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Kees Cook <keescook@chromium.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Rik van Riel <riel@surriel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20190426001143.4983-4-namit@vmware.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Nadav Amit 提交于
There is no apparent reason not to use text_poke_early() during early-init, since no patching of code that might be on the stack is done and only a single core is running. This is required for the next patches that would set a temporary mm for text poking, and this mm is only initialized after some static-keys are enabled/disabled. Signed-off-by: NNadav Amit <namit@vmware.com> Signed-off-by: NRick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: <akpm@linux-foundation.org> Cc: <ard.biesheuvel@linaro.org> Cc: <deneen.t.dock@intel.com> Cc: <kernel-hardening@lists.openwall.com> Cc: <kristen@linux.intel.com> Cc: <linux_dti@icloud.com> Cc: <will.deacon@arm.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@intel.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Kees Cook <keescook@chromium.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Rik van Riel <riel@surriel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20190426001143.4983-3-namit@vmware.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Nadav Amit 提交于
x86 has an nmi_uaccess_okay(), but other architectures do not. Arch-independent code might need to know whether access to user addresses is ok in an NMI context or in other code whose execution context is unknown. Specifically, this function is needed for bpf_probe_write_user(). Add a default implementation of nmi_uaccess_okay() for architectures that do not have such a function. Signed-off-by: NNadav Amit <namit@vmware.com> Signed-off-by: NRick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: <akpm@linux-foundation.org> Cc: <ard.biesheuvel@linaro.org> Cc: <deneen.t.dock@intel.com> Cc: <kernel-hardening@lists.openwall.com> Cc: <kristen@linux.intel.com> Cc: <linux_dti@icloud.com> Cc: <will.deacon@arm.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Rik van Riel <riel@surriel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20190426001143.4983-23-namit@vmware.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Nadav Amit 提交于
text_mutex is currently expected to be held before text_poke() is called, but kgdb does not take the mutex, and instead *supposedly* ensures the lock is not taken and will not be acquired by any other core while text_poke() is running. The reason for the "supposedly" comment is that it is not entirely clear that this would be the case if gdb_do_roundup is zero. Create two wrapper functions, text_poke() and text_poke_kgdb(), which do or do not run the lockdep assertion respectively. While we are at it, change the return code of text_poke() to something meaningful. One day, callers might actually respect it and the existing BUG_ON() when patching fails could be removed. For kgdb, the return value can actually be used. Suggested-by: NPeter Zijlstra <peterz@infradead.org> Signed-off-by: NNadav Amit <namit@vmware.com> Signed-off-by: NRick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: NMasami Hiramatsu <mhiramat@kernel.org> Acked-by: NJiri Kosina <jkosina@suse.cz> Cc: <akpm@linux-foundation.org> Cc: <ard.biesheuvel@linaro.org> Cc: <deneen.t.dock@intel.com> Cc: <kernel-hardening@lists.openwall.com> Cc: <kristen@linux.intel.com> Cc: <linux_dti@icloud.com> Cc: <will.deacon@arm.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@intel.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Kees Cook <keescook@chromium.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Rik van Riel <riel@surriel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Fixes: 9222f606 ("x86/alternatives: Lockdep-enforce text_mutex in text_poke*()") Link: https://lkml.kernel.org/r/20190426001143.4983-2-namit@vmware.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Linus Torvalds 提交于
This doesn't really do anything, but at least we now parse teh ZERO_PAGE() address argument so that we'll catch the most obvious errors in usage next time they'll happen. See commit 6a5c5d26 ("rdma: fix build errors on s390 and MIPS due to bad ZERO_PAGE use") what happens when we don't have any use of the macro argument at all. Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 29 4月, 2019 2 次提交
-
-
由 Thomas Gleixner 提交于
Replace the stack_trace_save*() functions with the new arch_stack_walk() interfaces. Signed-off-by: NThomas Gleixner <tglx@linutronix.de> Reviewed-by: NJosh Poimboeuf <jpoimboe@redhat.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: linux-arch@vger.kernel.org Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Alexander Potapenko <glider@google.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: linux-mm@kvack.org Cc: David Rientjes <rientjes@google.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: kasan-dev@googlegroups.com Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Akinobu Mita <akinobu.mita@gmail.com> Cc: Christoph Hellwig <hch@lst.de> Cc: iommu@lists.linux-foundation.org Cc: Robin Murphy <robin.murphy@arm.com> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Johannes Thumshirn <jthumshirn@suse.de> Cc: David Sterba <dsterba@suse.com> Cc: Chris Mason <clm@fb.com> Cc: Josef Bacik <josef@toxicpanda.com> Cc: linux-btrfs@vger.kernel.org Cc: dm-devel@redhat.com Cc: Mike Snitzer <snitzer@redhat.com> Cc: Alasdair Kergon <agk@redhat.com> Cc: Daniel Vetter <daniel@ffwll.ch> Cc: intel-gfx@lists.freedesktop.org Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com> Cc: dri-devel@lists.freedesktop.org Cc: David Airlie <airlied@linux.ie> Cc: Jani Nikula <jani.nikula@linux.intel.com> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: Tom Zanussi <tom.zanussi@linux.intel.com> Cc: Miroslav Benes <mbenes@suse.cz> Link: https://lkml.kernel.org/r/20190425094803.816485461@linutronix.de
-
由 Kairui Song 提交于
Currently perf callchain doesn't work well with ORC unwinder when sampling from trace point. We'll get useless in kernel callchain like this: perf 6429 [000] 22.498450: kmem:mm_page_alloc: page=0x176a17 pfn=1534487 order=0 migratetype=0 gfp_flags=GFP_KERNEL ffffffffbe23e32e __alloc_pages_nodemask+0x22e (/lib/modules/5.1.0-rc3+/build/vmlinux) 7efdf7f7d3e8 __poll+0x18 (/usr/lib64/libc-2.28.so) 5651468729c1 [unknown] (/usr/bin/perf) 5651467ee82a main+0x69a (/usr/bin/perf) 7efdf7eaf413 __libc_start_main+0xf3 (/usr/lib64/libc-2.28.so) 5541f689495641d7 [unknown] ([unknown]) The root cause is that, for trace point events, it doesn't provide a real snapshot of the hardware registers. Instead perf tries to get required caller's registers and compose a fake register snapshot which suppose to contain enough information for start a unwinding. However without CONFIG_FRAME_POINTER, if failed to get caller's BP as the frame pointer, so current frame pointer is returned instead. We get a invalid register combination which confuse the unwinder, and end the stacktrace early. So in such case just don't try dump BP, and let the unwinder start directly when the register is not a real snapshot. Use SP as the skip mark, unwinder will skip all the frames until it meet the frame of the trace point caller. Tested with frame pointer unwinder and ORC unwinder, this makes perf callchain get the full kernel space stacktrace again like this: perf 6503 [000] 1567.570191: kmem:mm_page_alloc: page=0x16c904 pfn=1493252 order=0 migratetype=0 gfp_flags=GFP_KERNEL ffffffffb523e2ae __alloc_pages_nodemask+0x22e (/lib/modules/5.1.0-rc3+/build/vmlinux) ffffffffb52383bd __get_free_pages+0xd (/lib/modules/5.1.0-rc3+/build/vmlinux) ffffffffb52fd28a __pollwait+0x8a (/lib/modules/5.1.0-rc3+/build/vmlinux) ffffffffb521426f perf_poll+0x2f (/lib/modules/5.1.0-rc3+/build/vmlinux) ffffffffb52fe3e2 do_sys_poll+0x252 (/lib/modules/5.1.0-rc3+/build/vmlinux) ffffffffb52ff027 __x64_sys_poll+0x37 (/lib/modules/5.1.0-rc3+/build/vmlinux) ffffffffb500418b do_syscall_64+0x5b (/lib/modules/5.1.0-rc3+/build/vmlinux) ffffffffb5a0008c entry_SYSCALL_64_after_hwframe+0x44 (/lib/modules/5.1.0-rc3+/build/vmlinux) 7f71e92d03e8 __poll+0x18 (/usr/lib64/libc-2.28.so) 55a22960d9c1 [unknown] (/usr/bin/perf) 55a22958982a main+0x69a (/usr/bin/perf) 7f71e9202413 __libc_start_main+0xf3 (/usr/lib64/libc-2.28.so) 5541f689495641d7 [unknown] ([unknown]) Co-developed-by: NJosh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: NKairui Song <kasong@redhat.com> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Young <dyoung@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20190422162652.15483-1-kasong@redhat.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
- 27 4月, 2019 1 次提交
-
-
由 Rick Edgecombe 提交于
The not-so-recent change to move VMX's VM-Exit handing to a dedicated "function" unintentionally exposed KVM to a speculative attack from the guest by executing a RET prior to stuffing the RSB. Make RSB stuffing happen immediately after VM-Exit, before any unpaired returns. Alternatively, the VM-Exit path could postpone full RSB stuffing until its current location by stuffing the RSB only as needed, or by avoiding returns in the VM-Exit path entirely, but both alternatives are beyond ugly since vmx_vmexit() has multiple indirect callers (by way of vmx_vmenter()). And putting the RSB stuffing immediately after VM-Exit makes it much less likely to be re-broken in the future. Note, the cost of PUSH/POP could be avoided in the normal flow by pairing the PUSH RAX with the POP RAX in __vmx_vcpu_run() and adding an a POP to nested_vmx_check_vmentry_hw(), but such a weird/subtle dependency is likely to cause problems in the long run, and PUSH/POP will take all of a few cycles, which is peanuts compared to the number of cycles required to fill the RSB. Fixes: 453eafbe ("KVM: VMX: Move VM-Enter + VM-Exit handling to non-inline sub-routines") Reported-by: NRick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: NRick Edgecombe <rick.p.edgecombe@intel.com> Co-developed-by: NSean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: NSean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
- 26 4月, 2019 3 次提交
-
-
由 Nadav Amit 提交于
Move flush_tlb_info variables off the stack. This allows to align flush_tlb_info to cache-line and avoid potentially unnecessary cache line movements. It also allows to have a fixed virtual-to-physical translation of the variables, which reduces TLB misses. Use per-CPU struct for flush_tlb_mm_range() and flush_tlb_kernel_range(). Add debug assertions to ensure there are no nested TLB flushes that might overwrite the per-CPU data. For arch_tlbbatch_flush() use a const struct. Results when running a microbenchmarks that performs 10^6 MADV_DONTEED operations and touching a page, in which 3 additional threads run a busy-wait loop (5 runs, PTI and retpolines are turned off): base off-stack ---- --------- avg (usec/op) 1.629 1.570 (-3%) stddev 0.014 0.009 Signed-off-by: NNadav Amit <namit@vmware.com> Acked-by: NPeter Zijlstra <a.p.zijlstra@chello.nl> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Rik van Riel <riel@surriel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20190425230143.7008-1-namit@vmware.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Rafael J. Wysocki 提交于
There are problems with running time_cpufreq_notifier() on SMP systems. First off, the rdtsc() called from there runs on the CPU executing that code and not necessarily on the CPU whose sched_clock() rate is updated which is questionable at best. Second, in the cases when the frequencies of all CPUs in an SMP system are always in sync, it is not sufficient to update just one of them or the set associated with a given cpufreq policy on frequency changes - all CPUs in the system should be updated and that would require more than a simple transition notifier. Note, however, that the underlying issue (the TSC rate depending on the CPU frequency) has not been present in hardware shipping for the last few years and in quite a few relevant cases (acpi-cpufreq in particular) running time_cpufreq_notifier() will cause the TSC to be marked as unstable anyway. For this reason, make time_cpufreq_notifier() simply mark the TSC as unstable and give up when run on SMP and only try to carry out any adjustments otherwise. Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-by: NViresh Kumar <viresh.kumar@linaro.org>
-
由 Jacob Pan 提交于
Local APIC timer clockevent parameters can be calculated based on platform specific methods. However the code is mostly duplicated with the interrupt based calibration. The commit which increased the max_delta parameter updated only one place and made the implementations diverge. Unify it to prevent further damage. [ tglx: Rename function to lapic_init_clockevent() and adjust changelog a bit ] Fixes: 4aed89d6 ("x86, lapic-timer: Increase the max_delta to 31 bits") Reported-by: NDaniel Drake <drake@endlessm.com> Signed-off-by: NJacob Pan <jacob.jun.pan@linux.intel.com> Signed-off-by: NThomas Gleixner <tglx@linutronix.de> Cc: Len Brown <lenb@kernel.org> Link: https://lkml.kernel.org/r/1556213272-63568-1-git-send-email-jacob.jun.pan@linux.intel.com
-
- 25 4月, 2019 2 次提交
-
-
由 Mike Rapoport 提交于
Mel Gorman says: "32-bit NUMA systems should be non-existent in practice. The last NUMA system I'm aware of that was both NUMA and 32-bit only died somewhere between 2004 and 2007. If someone is running a 64-bit capable system in 32-bit mode with NUMA, they really are just punishing themselves for fun." Mark DISCONTIGMEM broken for now as suggested by Christoph Hellwig, and (hopefully) remove it in a couple of releases. Suggested-by: NChristoph Hellwig <hch@infradead.org> Signed-off-by: NMike Rapoport <rppt@linux.ibm.com> Acked-by: NMel Gorman <mgorman@techsingularity.net> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@surriel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-mm@kvack.org Link: http://lkml.kernel.org/r/1556112252-9339-3-git-send-email-rppt@linux.ibm.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Mike Rapoport 提交于
Sparsemem has been a default memory model for x86-64 for over a decade, since: b263295d ("x86: 64-bit, make sparsemem vmemmap the only memory model"). Make it the default for 32-bit NUMA systems (if there any left) as well. Signed-off-by: NMike Rapoport <rppt@linux.ibm.com> Acked-by: NMel Gorman <mgorman@techsingularity.net> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@surriel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-mm@kvack.org Link: http://lkml.kernel.org/r/1556112252-9339-2-git-send-email-rppt@linux.ibm.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-