  1. 28 Aug 2015 (2 commits)
    • mm: ZONE_DEVICE for "device memory" · 033fbae9
      Committed by Dan Williams
      While pmem is usable as a block device or via DAX mappings to userspace,
      there are several usage scenarios that cannot target pmem due to its
      lack of struct page coverage. In preparation for "hot plugging" pmem
      into the vmemmap, add ZONE_DEVICE as a new zone to tag these pages
      separately from the ones that are subject to standard page allocations.
      Importantly, "device memory" can be removed at will when userspace
      unbinds the driver of the device.
      
      Having a separate zone prevents these pages from being handed out by
      the standard page allocator and otherwise marks them as distinct from
      typical uniform memory.  Device memory has different lifetime and
      performance characteristics than RAM.  However, since we have run out
      of ZONES_SHIFT bits, this functionality currently depends on
      sacrificing ZONE_DMA.
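
      As a rough sketch of the resulting layout (zone names as in the
      patch; the surrounding Kconfig wiring is simplified), the new zone
      slots in after ZONE_MOVABLE, while CONFIG_ZONE_DEVICE and
      CONFIG_ZONE_DMA exclude each other:

          /* include/linux/mmzone.h, simplified sketch */
          enum zone_type {
          #ifdef CONFIG_ZONE_DMA
              ZONE_DMA,           /* given up when ZONE_DEVICE is enabled */
          #endif
          #ifdef CONFIG_ZONE_DMA32
              ZONE_DMA32,
          #endif
              ZONE_NORMAL,
          #ifdef CONFIG_HIGHMEM
              ZONE_HIGHMEM,
          #endif
              ZONE_MOVABLE,
          #ifdef CONFIG_ZONE_DEVICE
              ZONE_DEVICE,        /* never handed out by the page allocator */
          #endif
              __MAX_NR_ZONES
          };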
      
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Jerome Glisse <j.glisse@gmail.com>
      [hch: various simplifications in the arch interface]
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
    • nd_blk: change aperture mapping from WC to WB · 67a3e8fe
      Committed by Ross Zwisler
      This should result in a pretty sizeable performance gain for reads.
      For a rough comparison I did some simple read testing using PMEM,
      comparing reads through a write-combining (WC) mapping against a
      write-back (WB) one.  This was done on a random lab machine.
      
      PMEM reads from a write combining mapping:
      	# dd of=/dev/null if=/dev/pmem0 bs=4096 count=100000
      	100000+0 records in
      	100000+0 records out
      	409600000 bytes (410 MB) copied, 9.2855 s, 44.1 MB/s
      
      PMEM reads from a write-back mapping:
      	# dd of=/dev/null if=/dev/pmem0 bs=4096 count=1000000
      	1000000+0 records in
      	1000000+0 records out
      	4096000000 bytes (4.1 GB) copied, 3.44034 s, 1.2 GB/s
      
      To be able to safely support a write-back aperture I needed to add
      support for the "read flush" _DSM flag, as outlined in the DSM spec:
      
      http://pmem.io/documents/NVDIMM_DSM_Interface_Example.pdf
      
      This flag tells the ND BLK driver that it needs to flush the cache lines
      associated with the aperture after the aperture is moved but before any
      new data is read.  This ensures that any stale cache lines from the
      previous contents of the aperture will be discarded from the processor
      cache, and the new data will be read properly from the DIMM.  We know
      that the cache lines are clean and will be discarded without any
      writeback because either a) the previous aperture operation was a read,
      and we never modified the contents of the aperture, or b) the previous
      aperture operation was a write and we must have written back the dirtied
      contents of the aperture to the DIMM before the I/O was completed.
      
      In order to add support for the "read flush" flag I needed to add a
      generic routine to invalidate cache lines, mmio_flush_range().  This is
      protected by the ARCH_HAS_MMIO_FLUSH Kconfig variable, and is currently
      only supported on x86.
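
      As a sketch of the resulting read sequence (simplified; the real
      nd_blk code also handles error status and multiple block windows,
      and aperture_read() is a made-up name for illustration):

          static int aperture_read(void *aperture, void *buf, size_t len,
                                   bool read_flush)
          {
              /* ... move the aperture to the target DPA first ... */

              if (read_flush)
                  /* discard clean-but-stale cache lines so the loads
                   * below fetch fresh data from the DIMM */
                  mmio_flush_range(aperture, len);

              memcpy(buf, aperture, len); /* now a cached (WB) read */
              return 0;
          }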
      Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
  2. 21 Aug 2015 (4 commits)
  3. 19 Aug 2015 (1 commit)
    • libnvdimm, e820: make CONFIG_X86_PMEM_LEGACY a tristate option · 7a67832c
      Committed by Dan Williams
      We currently register a platform device for e820 type-12 memory and
      register an nvdimm bus beneath it.  Registering the platform device
      triggers the device-core machinery to probe for a driver, but that
      search currently comes up empty.  Building the nvdimm-bus registration
      into the e820_pmem platform device registration in this way forces
      libnvdimm to be built-in.  Instead, convert the built-in portion of
      CONFIG_X86_PMEM_LEGACY to simply register a platform device and move the
      rest of the logic to the driver for e820_pmem, for the following
      reasons:
      
      1/ Letting e820_pmem support be a module allows building and testing
         libnvdimm.ko changes without rebooting
      
      2/ All the normal policy around modules can be applied to e820_pmem
         (unbind to disable and/or blacklisting the module from loading by
         default)
      
      3/ Moving the driver to a generic location and converting it to scan
         "iomem_resource" rather than "e820.map" means any other architecture can
         take advantage of this simple nvdimm resource discovery mechanism by
         registering a resource named "Persistent Memory (legacy)"
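
      As a sketch of the driver side of point 3/ (condensed from the shape
      of drivers/nvdimm/e820.c; error handling and the nvdimm bus and
      region registration are elided):

          static int e820_pmem_probe(struct platform_device *pdev)
          {
              struct resource *p;

              for (p = iomem_resource.child; p; p = p->sibling) {
                  if (strcmp(p->name, "Persistent Memory (legacy)"))
                      continue;
                  /* ... register a pmem region for *p ... */
              }
              return 0;
          }

          static struct platform_driver e820_pmem_driver = {
              .probe = e820_pmem_probe,
              .driver = {
                  .name = "e820_pmem",
              },
          };
          module_platform_driver(e820_pmem_driver);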
      
      Cc: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
  4. 15 Aug 2015 (1 commit)
  5. 26 Jul 2015 (2 commits)
    • x86/mm/pat: Revert 'Adjust default caching mode translation tables' · 1a4e8795
      Committed by Thomas Gleixner
      Toshi explains:
      
      "No, the default values need to be set to the fallback types,
       i.e. minimal supported mode.  For WC and WT, UC is the fallback type.
      
       When PAT is disabled, pat_init() does update the tables below to
       enable WT per the default BIOS setup.  However, when PAT is enabled
       but the CPU has the PAT errata, WT falls back to UC per the default
       values."
      
      Revert: ca1fec58 'x86/mm/pat: Adjust default caching mode translation tables'
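
      The rule being restored amounts to the following (an illustrative
      helper only, not the kernel's actual translation tables in
      arch/x86/mm/init.c):

          static enum page_cache_mode fallback_mode(enum page_cache_mode m)
          {
              switch (m) {
              case _PAGE_CACHE_MODE_WC:
              case _PAGE_CACHE_MODE_WT:
                  return _PAGE_CACHE_MODE_UC;   /* UC is the fallback type */
              default:
                  return m;
              }
          }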
      Requested-by: Toshi Kani <toshi.kani@hp.com>
      Cc: Jan Beulich <jbeulich@suse.de>
      Link: http://lkml.kernel.org/r/1437577776.3214.252.camel@hp.com
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    • perf/x86/intel/cqm: Return cached counter value from IRQ context · 2c534c0d
      Committed by Matt Fleming
      Peter reported the following potential crash, which I was able to
      reproduce with his test program:
      
      [  148.765788] ------------[ cut here ]------------
      [  148.765796] WARNING: CPU: 34 PID: 2840 at kernel/smp.c:417 smp_call_function_many+0xb6/0x260()
      [  148.765797] Modules linked in:
      [  148.765800] CPU: 34 PID: 2840 Comm: perf Not tainted 4.2.0-rc1+ #4
      [  148.765803]  ffffffff81cdc398 ffff88085f105950 ffffffff818bdfd5 0000000000000007
      [  148.765805]  0000000000000000 ffff88085f105990 ffffffff810e413a 0000000000000000
      [  148.765807]  ffffffff82301080 0000000000000022 ffffffff8107f640 ffffffff8107f640
      [  148.765809] Call Trace:
      [  148.765810]  <NMI>  [<ffffffff818bdfd5>] dump_stack+0x45/0x57
      [  148.765818]  [<ffffffff810e413a>] warn_slowpath_common+0x8a/0xc0
      [  148.765822]  [<ffffffff8107f640>] ? intel_cqm_stable+0x60/0x60
      [  148.765824]  [<ffffffff8107f640>] ? intel_cqm_stable+0x60/0x60
      [  148.765825]  [<ffffffff810e422a>] warn_slowpath_null+0x1a/0x20
      [  148.765827]  [<ffffffff811613f6>] smp_call_function_many+0xb6/0x260
      [  148.765829]  [<ffffffff8107f640>] ? intel_cqm_stable+0x60/0x60
      [  148.765831]  [<ffffffff81161748>] on_each_cpu_mask+0x28/0x60
      [  148.765832]  [<ffffffff8107f6ef>] intel_cqm_event_count+0x7f/0xe0
      [  148.765836]  [<ffffffff811cdd35>] perf_output_read+0x2a5/0x400
      [  148.765839]  [<ffffffff811d2e5a>] perf_output_sample+0x31a/0x590
      [  148.765840]  [<ffffffff811d333d>] ? perf_prepare_sample+0x26d/0x380
      [  148.765841]  [<ffffffff811d3497>] perf_event_output+0x47/0x60
      [  148.765843]  [<ffffffff811d36c5>] __perf_event_overflow+0x215/0x240
      [  148.765844]  [<ffffffff811d4124>] perf_event_overflow+0x14/0x20
      [  148.765847]  [<ffffffff8107e7f4>] intel_pmu_handle_irq+0x1d4/0x440
      [  148.765849]  [<ffffffff811d07a6>] ? __perf_event_task_sched_in+0x36/0xa0
      [  148.765853]  [<ffffffff81219bad>] ? vunmap_page_range+0x19d/0x2f0
      [  148.765854]  [<ffffffff81219d11>] ? unmap_kernel_range_noflush+0x11/0x20
      [  148.765859]  [<ffffffff814ce6fe>] ? ghes_copy_tofrom_phys+0x11e/0x2a0
      [  148.765863]  [<ffffffff8109e5db>] ? native_apic_msr_write+0x2b/0x30
      [  148.765865]  [<ffffffff8109e44d>] ? x2apic_send_IPI_self+0x1d/0x20
      [  148.765869]  [<ffffffff81065135>] ? arch_irq_work_raise+0x35/0x40
      [  148.765872]  [<ffffffff811c8d86>] ? irq_work_queue+0x66/0x80
      [  148.765875]  [<ffffffff81075306>] perf_event_nmi_handler+0x26/0x40
      [  148.765877]  [<ffffffff81063ed9>] nmi_handle+0x79/0x100
      [  148.765879]  [<ffffffff81064422>] default_do_nmi+0x42/0x100
      [  148.765880]  [<ffffffff81064563>] do_nmi+0x83/0xb0
      [  148.765884]  [<ffffffff818c7c0f>] end_repeat_nmi+0x1e/0x2e
      [  148.765886]  [<ffffffff811d07a6>] ? __perf_event_task_sched_in+0x36/0xa0
      [  148.765888]  [<ffffffff811d07a6>] ? __perf_event_task_sched_in+0x36/0xa0
      [  148.765890]  [<ffffffff811d07a6>] ? __perf_event_task_sched_in+0x36/0xa0
      [  148.765891]  <<EOE>>  [<ffffffff8110ab66>] finish_task_switch+0x156/0x210
      [  148.765898]  [<ffffffff818c1671>] __schedule+0x341/0x920
      [  148.765899]  [<ffffffff818c1c87>] schedule+0x37/0x80
      [  148.765903]  [<ffffffff810ae1af>] ? do_page_fault+0x2f/0x80
      [  148.765905]  [<ffffffff818c1f4a>] schedule_user+0x1a/0x50
      [  148.765907]  [<ffffffff818c666c>] retint_careful+0x14/0x32
      [  148.765908] ---[ end trace e33ff2be78e14901 ]---
      
      The CQM task events are not safe to call from within interrupt
      context because they require an IPI to read the counter value on all
      sockets.  And performing IPIs from within IRQ context is a "no-no".
      
      Make do with the last-read counter value currently in event->count
      when we're invoked in this context.
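
      A sketch of the fix (simplified from intel_cqm_event_count(); the
      handling of reads already in flight is elided):

          static u64 intel_cqm_event_count(struct perf_event *event)
          {
              /*
               * An up-to-date value would need an IPI to every socket,
               * which is not allowed from IRQ/NMI context, so fall
               * back to the last value we read.
               */
              if (unlikely(in_interrupt()))
                  return local64_read(&event->count);

              /* ... normal path: IPI via on_each_cpu_mask() ... */
          }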
      Reported-by: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Matt Fleming <matt.fleming@intel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vikas Shivappa <vikas.shivappa@intel.com>
      Cc: Kanaka Juvva <kanaka.d.juvva@intel.com>
      Cc: Will Auld <will.auld@intel.com>
      Cc: <stable@vger.kernel.org>
      Link: http://lkml.kernel.org/r/1437490509-15373-1-git-send-email-matt@codeblueprint.co.uk
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
  6. 24 Jul 2015 (2 commits)
  7. 23 Jul 2015 (5 commits)
  8. 22 Jul 2015 (2 commits)
    • x86/mm: Remove region_is_ram() call from ioremap · 9a58eebe
      Committed by Toshi Kani
      __ioremap_caller() calls region_is_ram() to walk through the
      iomem_resource table to check if a target range is in RAM, which was
      added to improve the lookup performance over page_is_ram() (commit
      906e36c5 "x86: use optimized ioresource lookup in ioremap
      function"). page_is_ram() was no longer used when this change was
      added, though.
      
      __ioremap_caller() then calls walk_system_ram_range(), which had
      replaced page_is_ram() to improve the lookup performance (commit
      c81c8a1e "x86, ioremap: Speed up check for RAM pages").
      
      Since both checks walk through the same iomem_resource table for
      the same purpose, there is no need to call both functions.
      
      Aside from that, walk_system_ram_range() is the only useful check at
      the moment, because region_is_ram() always returns -1 due to an
      implementation bug.  That bug in region_is_ram() cannot be fixed
      without breaking existing ioremap callers, which rely on the subtle
      difference in how walk_system_ram_range() treats non-page-aligned
      ranges.
      
      Once these offending callers are fixed we can use region_is_ram() and
      remove walk_system_ram_range().
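
      The before/after shape in __ioremap_caller(), approximated:

          /* before: two walks over the same iomem_resource table */
          if (region_is_ram(phys_addr, size) > 0) /* never true: returns -1 */
              return NULL;
          if (walk_system_ram_range(pfn, last_pfn - pfn + 1, NULL,
                                    __ioremap_check_ram) == 1)
              return NULL;

          /* after: only the check that actually works remains */
          if (walk_system_ram_range(pfn, last_pfn - pfn + 1, NULL,
                                    __ioremap_check_ram) == 1)
              return NULL;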
      
      [ tglx: Massaged changelog ]
      Signed-off-by: Toshi Kani <toshi.kani@hp.com>
      Reviewed-by: Dan Williams <dan.j.williams@intel.com>
      Cc: Roland Dreier <roland@purestorage.com>
      Cc: Mike Travis <travis@sgi.com>
      Cc: Luis R. Rodriguez <mcgrof@suse.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: linux-mm@kvack.org
      Link: http://lkml.kernel.org/r/1437088996-28511-3-git-send-email-toshi.kani@hp.com
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    • x86/mm: Move warning from __ioremap_check_ram() to the call site · 1c9cf9b2
      Committed by Toshi Kani
      __ioremap_check_ram() has a WARN_ONCE() which is emitted when the
      given pfn range is not RAM. The warning is bogus in two aspects:
      
      - it never triggers, since walk_system_ram_range() only calls
        __ioremap_check_ram() for RAM ranges.
      
      - the warning message is wrong, as it says "ioremap on RAM" after it
        has established that the pfn range is not RAM.
      
      Move the WARN_ONCE() to __ioremap_caller(), and update the message to
      include the address range so we get an actual warning when something
      tries to ioremap system RAM.
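
      A sketch of the relocated warning (the message shape is
      approximated):

          /* in __ioremap_caller(), where "is RAM" is actually decided */
          if (walk_system_ram_range(pfn, last_pfn - pfn + 1, NULL,
                                    __ioremap_check_ram) == 1) {
              WARN_ONCE(1, "ioremap on RAM at %pa - %pa\n",
                        &phys_addr, &last_addr);
              return NULL;
          }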
      
      [ tglx: Massaged changelog ]
      Signed-off-by: Toshi Kani <toshi.kani@hp.com>
      Reviewed-by: Dan Williams <dan.j.williams@intel.com>
      Cc: Roland Dreier <roland@purestorage.com>
      Cc: Luis R. Rodriguez <mcgrof@suse.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: linux-mm@kvack.org
      Link: http://lkml.kernel.org/r/1437088996-28511-2-git-send-email-toshi.kani@hp.com
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
  9. 21 Jul 2015 (4 commits)
  10. 18 Jul 2015 (3 commits)
  11. 17 Jul 2015 (9 commits)
    • x86/entry/64, x86/nmi/64: Add CONFIG_DEBUG_ENTRY NMI testing code · a97439aa
      Committed by Andy Lutomirski
      It turns out to be rather tedious to test the NMI nesting code.
      Make it easier: add a new CONFIG_DEBUG_ENTRY option that causes
      the NMI handler to pre-emptively unmask NMIs.
      
      With this option set, errors in the repeat_nmi logic or failures
      to detect that we're in a nested NMI will result in quick panics
      under perf (especially if multiple counters are running at high
      frequency) instead of requiring an unusual workload that
      generates page faults or breakpoints inside NMIs.
      
      I called it CONFIG_DEBUG_ENTRY instead of CONFIG_DEBUG_NMI_ENTRY
      because I want to add new non-NMI checks elsewhere in the entry
      code in the future, and I'd rather not add too many new config
      options or add this option and then immediately rename it.
      Signed-off-by: Andy Lutomirski <luto@kernel.org>
      Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • x86/nmi/64: Make the "NMI executing" variable more consistent · 36f1a77b
      Committed by Andy Lutomirski
      Currently, "NMI executing" is one the first time an outermost
      NMI hits repeat_nmi and zero thereafter.  Change it to be zero
      each time for consistency.
      
      This is intended to help NMI handling fail harder if it's buggy.
      Signed-off-by: Andy Lutomirski <luto@kernel.org>
      Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • x86/nmi/64: Minor asm simplification · 23a781e9
      Committed by Andy Lutomirski
      Replace LEA; MOV with an equivalent SUB.  This saves one
      instruction.
      Signed-off-by: Andy Lutomirski <luto@kernel.org>
      Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • x86/nmi/64: Use DF to avoid userspace RSP confusing nested NMI detection · 810bc075
      Committed by Andy Lutomirski
      We have a tricky bug in the nested NMI code: if we see RSP
      pointing to the NMI stack on NMI entry from kernel mode, we
      assume that we are executing a nested NMI.
      
      This isn't quite true.  A malicious userspace program can point
      RSP at the NMI stack, issue SYSCALL, and arrange for an NMI to
      happen while RSP is still pointing at the NMI stack.
      
      Fix it with a sneaky trick.  Set DF in the region of code that
      the RSP check is intended to detect.  IRET will clear DF
      atomically.
      
      ( Note: if it weren't for paravirt, there would be little need for
        all this complexity; we could check RIP instead of RSP. )
      Signed-off-by: Andy Lutomirski <luto@kernel.org>
      Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: stable@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • x86/nmi/64: Reorder nested NMI checks · a27507ca
      Committed by Andy Lutomirski
      Check the repeat_nmi .. end_repeat_nmi special case first.  The
      next patch will rework the RSP check and, as a side effect, the
      RSP check will no longer detect repeat_nmi .. end_repeat_nmi, so
      we'll need this ordering of the checks.
      
      Note: this is more subtle than it appears.  The check for
      repeat_nmi .. end_repeat_nmi jumps straight out of the NMI code
      instead of adjusting the "iret" frame to force a repeat.  This
      is necessary, because the code between repeat_nmi and
      end_repeat_nmi sets "NMI executing" and then writes to the
      "iret" frame itself.  If a nested NMI comes in and modifies the
      "iret" frame while repeat_nmi is also modifying it, we'll end up
      with garbage.  The old code got this right, as does the new
      code, but the new code is a bit more explicit.
      
      If we were to move the check right after the "NMI executing"
      check, then we'd get it wrong and have random crashes.
      
      ( Because the "NMI executing" check would jump to the code that would
        modify the "iret" frame without checking if the interrupted NMI was
        currently modifying it. )
      Signed-off-by: Andy Lutomirski <luto@kernel.org>
      Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: stable@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • x86/nmi/64: Improve nested NMI comments · 0b22930e
      Committed by Andy Lutomirski
      I found the nested NMI documentation to be difficult to follow.
      Improve the comments.
      Signed-off-by: Andy Lutomirski <luto@kernel.org>
      Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: stable@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • x86/nmi/64: Switch stacks on userspace NMI entry · 9b6e6a83
      Committed by Andy Lutomirski
      Returning to userspace is tricky: IRET can fail, and ESPFIX can
      rearrange the stack prior to IRET.
      
      The NMI nesting fixup relies on a precise stack layout and
      atomic IRET.  Rather than trying to teach the NMI nesting fixup
      to handle ESPFIX and failed IRET, punt: run NMIs that came from
      user mode on the normal kernel stack.
      
      This will make some nested NMIs visible to C code, but the C
      code is okay with that.
      
      As a side effect, this should speed up perf: it eliminates an
      RDMSR when NMIs come from user mode.
      Signed-off-by: Andy Lutomirski <luto@kernel.org>
      Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
      Reviewed-by: Borislav Petkov <bp@suse.de>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: stable@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • x86/nmi/64: Remove asm code that saves CR2 · 0e181bb5
      Committed by Andy Lutomirski
      Now that do_nmi saves CR2, we don't need to save it in asm.
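
      The C side that makes the asm save unnecessary, sketched from the
      shape of do_nmi() after the nesting rework:

          this_cpu_write(nmi_cr2, read_cr2());
          /* ... run NMI handlers; a nested #PF may clobber CR2 ... */
          if (unlikely(this_cpu_read(nmi_cr2) != read_cr2()))
              write_cr2(this_cpu_read(nmi_cr2));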
      Signed-off-by: Andy Lutomirski <luto@kernel.org>
      Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
      Acked-by: Borislav Petkov <bp@suse.de>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: stable@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • x86/nmi: Enable nested do_nmi() handling for 64-bit kernels · 9d050416
      Committed by Andy Lutomirski
      32-bit kernels handle nested NMIs in C.  Enable the exact same
      handling on 64-bit kernels as well.  This isn't currently
      necessary, but it will become necessary once the asm code starts
      allowing limited nesting.
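
      The nesting logic being enabled, in simplified form (condensed from
      do_nmi(); the CR2 handling from the entry above is elided):

          enum nmi_states { NMI_NOT_RUNNING = 0, NMI_EXECUTING, NMI_LATCHED };
          static DEFINE_PER_CPU(enum nmi_states, nmi_state);

          dotraplinkage notrace void do_nmi(struct pt_regs *regs, long error_code)
          {
              if (this_cpu_read(nmi_state) != NMI_NOT_RUNNING) {
                  this_cpu_write(nmi_state, NMI_LATCHED);
                  return;             /* remember the nested NMI */
              }
              this_cpu_write(nmi_state, NMI_EXECUTING);
          nmi_restart:
              /* ... the actual NMI handling ... */

              if (this_cpu_dec_return(nmi_state))
                  goto nmi_restart;   /* replay the latched NMI */
          }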
      Signed-off-by: Andy Lutomirski <luto@kernel.org>
      Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: stable@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  12. 15 Jul 2015 (1 commit)
    • genirq: Revert sparse irq locking around __cpu_up() and move it to x86 for now · ce0d3c0a
      Committed by Thomas Gleixner
      Boris reported that the sparse_irq protection around __cpu_up() in
      the generic code causes a regression on Xen. Xen allocates interrupts
      (and more) in its xen_cpu_up() function, so it deadlocks on the
      sparse_irq_lock.
      
      There is no simple fix for this, and we really should have the
      protection for all architectures, but for now the only solution is to
      move it to x86, where actual wreckage due to the lack of protection
      has been observed.
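
      A sketch of where the protection lives now (x86's native_cpu_up(),
      simplified):

          /* prevent alloc/free of irq descriptors while the new CPU
           * is being brought up */
          irq_lock_sparse();

          err = do_boot_cpu(apicid, cpu, tidle);
          /* ... wait for the AP to come online ... */

          irq_unlock_sparse();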
      Reported-and-tested-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Fixes: a8994181 'hotplug: Prevent alloc/free of irq descriptors during cpu up/down'
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: xiao jin <jin.xiao@intel.com>
      Cc: Joerg Roedel <jroedel@suse.de>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Yanmin Zhang <yanmin_zhang@linux.intel.com>
      Cc: xen-devel <xen-devel@lists.xenproject.org>
  13. 10 Jul 2015 (4 commits)
    • kvm: x86: fix load xsave feature warning · ee4100da
      Committed by Wanpeng Li
      [   68.196974] WARNING: CPU: 1 PID: 2140 at arch/x86/kvm/x86.c:3161 kvm_arch_vcpu_ioctl+0xe88/0x1340 [kvm]()
      [   68.196975] Modules linked in: snd_hda_codec_hdmi i915 rfcomm bnep bluetooth i2c_algo_bit rfkill nfsd drm_kms_helper nfs_acl nfs drm lockd grace sunrpc fscache snd_hda_codec_realtek snd_hda_codec_generic snd_hda_intel snd_hda_codec snd_hda_core snd_hwdep snd_pcm snd_seq_dummy snd_seq_oss x86_pkg_temp_thermal snd_seq_midi kvm_intel snd_seq_midi_event snd_rawmidi kvm snd_seq ghash_clmulni_intel fuse snd_timer aesni_intel parport_pc ablk_helper snd_seq_device cryptd ppdev snd lp parport lrw dcdbas gf128mul i2c_core glue_helper lpc_ich video shpchp mfd_core soundcore serio_raw acpi_cpufreq ext4 mbcache jbd2 sd_mod crc32c_intel ahci libahci libata e1000e ptp pps_core
      [   68.197005] CPU: 1 PID: 2140 Comm: qemu-system-x86 Not tainted 4.2.0-rc1+ #2
      [   68.197006] Hardware name: Dell Inc. OptiPlex 7020/0F5C5X, BIOS A03 01/08/2015
      [   68.197007]  ffffffffa03b0657 ffff8800d984bca8 ffffffff815915a2 0000000000000000
      [   68.197009]  0000000000000000 ffff8800d984bce8 ffffffff81057c0a 00007ff6d0001000
      [   68.197010]  0000000000000002 ffff880211c1a000 0000000000000004 ffff8800ce0288c0
      [   68.197012] Call Trace:
      [   68.197017]  [<ffffffff815915a2>] dump_stack+0x45/0x57
      [   68.197020]  [<ffffffff81057c0a>] warn_slowpath_common+0x8a/0xc0
      [   68.197022]  [<ffffffff81057cfa>] warn_slowpath_null+0x1a/0x20
      [   68.197029]  [<ffffffffa037bed8>] kvm_arch_vcpu_ioctl+0xe88/0x1340 [kvm]
      [   68.197035]  [<ffffffffa037aede>] ? kvm_arch_vcpu_load+0x4e/0x1c0 [kvm]
      [   68.197040]  [<ffffffffa03696a6>] kvm_vcpu_ioctl+0xc6/0x5c0 [kvm]
      [   68.197043]  [<ffffffff811252d2>] ? perf_pmu_enable+0x22/0x30
      [   68.197044]  [<ffffffff8112663e>] ? perf_event_context_sched_in+0x7e/0xb0
      [   68.197048]  [<ffffffff811a6882>] do_vfs_ioctl+0x2c2/0x4a0
      [   68.197050]  [<ffffffff8107bf33>] ? finish_task_switch+0x173/0x220
      [   68.197053]  [<ffffffff8123307f>] ? selinux_file_ioctl+0x4f/0xd0
      [   68.197055]  [<ffffffff8122cac3>] ? security_file_ioctl+0x43/0x60
      [   68.197057]  [<ffffffff811a6ad9>] SyS_ioctl+0x79/0x90
      [   68.197060]  [<ffffffff81597e57>] entry_SYSCALL_64_fastpath+0x12/0x6a
      [   68.197061] ---[ end trace 558a5ebf9445fc80 ]---
      
      After commit 0c4109be ('x86/fpu/xstate: Fix up bad get_xsave_addr()
      assumptions') there is no longer an assumption that an xsave bit
      present in the hardware (pcntxt_mask) is always present in a given
      xsave buffer.  An enabled state can be present in 'pcntxt_mask' but
      *not* in 'xstate_bv' when the last 'xsave' did not request that this
      feature be saved (unlikely) or because the "init optimization" caused
      it to not be saved.  This patch kills the assumption.
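
      A sketch of the adjusted copy loop (simplified from the xsave
      load/fill paths in arch/x86/kvm/x86.c; src is the source buffer):

          u64 valid = xstate_bv & ~XSTATE_FPSSE;

          while (valid) {
              u64 feature = valid & -valid;   /* lowest set bit */
              void *dest = get_xsave_addr(xsave, feature);

              /* may be NULL for a state absent from this buffer */
              if (dest) {
                  u32 size, offset, ecx, edx;

                  cpuid_count(XSTATE_CPUID, fls64(feature) - 1,
                              &size, &offset, &ecx, &edx);
                  memcpy(dest, src + offset, size);
              }
              valid -= feature;
          }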
      Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: apply guest MTRR virtualization on host reserved pages · fd717f11
      Committed by Paolo Bonzini
      Currently, guest MTRR settings are ignored if kvm_is_reserved_pfn
      returns true.  However, the guest could prefer a page type other than
      UC for such pages. A good example is a passed-through VGA frame
      buffer, which is not always UC as the host expects.
      
      This patch enables full use of virtual guest MTRRs.
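
      Roughly, the effect on the EPT memtype lookup (shape of
      vmx_get_mt_mask() around this series; helper names approximated):

          static u64 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn,
                                     bool is_mmio)
          {
              if (is_mmio)
                  return MTRR_TYPE_UNCACHABLE << VMX_EPT_MT_EPTE_SHIFT;

              if (!kvm_arch_has_noncoherent_dma(vcpu->kvm))
                  return (MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT) |
                         VMX_EPT_IPAT_BIT;    /* coherent DMA: WB is safe */

              /* previously, kvm_is_reserved_pfn() forced UC here; now
               * the guest MTRR type applies to reserved pages too */
              return kvm_mtrr_get_guest_memory_type(vcpu, gfn) <<
                     VMX_EPT_MT_EPTE_SHIFT;
          }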
      Suggested-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
      Tested-by: Joerg Roedel <jroedel@suse.de> (on AMD)
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: SVM: Sync g_pat with guest-written PAT value · e098223b
      Committed by Jan Kiszka
      When hardware supports the g_pat VMCB field, we can use it for emulating
      the PAT configuration that the guest configures by writing to the
      corresponding MSR.
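
      A sketch of the MSR write path (condensed from svm_set_msr()):

          case MSR_IA32_CR_PAT:
              if (npt_enabled) {
                  if (!kvm_mtrr_valid(vcpu, MSR_IA32_CR_PAT, data))
                      return 1;       /* reject an invalid PAT value */
                  vcpu->arch.pat = data;
                  svm->vmcb->save.g_pat = data;
                  mark_dirty(svm->vmcb, VMCB_NPT);
                  break;
              }
              /* otherwise fall through to the common PAT handling */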
      Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
      Tested-by: Joerg Roedel <jroedel@suse.de>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: SVM: use NPT page attributes · 3c2e7f7d
      Committed by Paolo Bonzini
      Right now, NPT page attributes are not used, and the final page
      attribute depends solely on gPAT (which however is not synced
      correctly), the guest MTRRs and the guest page attributes.
      
      However, we can do better by mimicking what is done for VMX.
      In the absence of PCI passthrough, the guest PAT can be ignored
      and the page attributes can be just WB.  If passthrough is being
      used, instead, keep respecting the guest PAT, and emulate the guest
      MTRRs through the PAT field of the nested page tables.
      
      The only snag is that WP memory cannot be emulated correctly,
      because Linux's default PAT setting only includes the other types.
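
      The resulting gPAT policy, roughly (condensed from the shape of the
      SVM code this series adds):

          /* AMD has no equivalent of VMX's IPAT bit, so ignoring the
           * guest PAT means pointing every gPAT entry at WB instead. */
          if (!kvm_arch_has_assigned_device(vcpu->kvm))
              *g_pat = 0x0606060606060606ULL; /* all-WB PAT */
          else
              *g_pat = vcpu->arch.pat;        /* honor the guest PAT */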
      Tested-by: Joerg Roedel <jroedel@suse.de>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>