- 19 11月, 2014 1 次提交
-
-
由 Kees Cook 提交于
When setting up permissions on kernel memory at boot, the end of the PMD that was split from bss remained executable. It should be NX like the rest. This performs a PMD alignment instead of a PAGE alignment to get the correct span of memory. Before: ---[ High Kernel Mapping ]--- ... 0xffffffff8202d000-0xffffffff82200000 1868K RW GLB NX pte 0xffffffff82200000-0xffffffff82c00000 10M RW PSE GLB NX pmd 0xffffffff82c00000-0xffffffff82df5000 2004K RW GLB NX pte 0xffffffff82df5000-0xffffffff82e00000 44K RW GLB x pte 0xffffffff82e00000-0xffffffffc0000000 978M pmd After: ---[ High Kernel Mapping ]--- ... 0xffffffff8202d000-0xffffffff82200000 1868K RW GLB NX pte 0xffffffff82200000-0xffffffff82e00000 12M RW PSE GLB NX pmd 0xffffffff82e00000-0xffffffffc0000000 978M pmd [ tglx: Changed it to roundup(_brk_end, PMD_SIZE) and added a comment. We really should unmap the reminder along with the holes caused by init,initdata etc. but thats a different issue ] Signed-off-by: NKees Cook <keescook@chromium.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Toshi Kani <toshi.kani@hp.com> Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Cc: David Vrabel <david.vrabel@citrix.com> Cc: Wang Nan <wangnan0@huawei.com> Cc: Yinghai Lu <yinghai@kernel.org> Cc: stable@vger.kernel.org Link: http://lkml.kernel.org/r/20141114194737.GA3091@www.outflux.netSigned-off-by: NThomas Gleixner <tglx@linutronix.de>
-
- 29 10月, 2014 1 次提交
-
-
由 Dexuan Cui 提交于
pte_pfn() returns a PFN of long (32 bits in 32-PAE), so "long << PAGE_SHIFT" will overflow for PFNs above 4GB. Due to this issue, some Linux 32-PAE distros, running as guests on Hyper-V, with 5GB memory assigned, can't load the netvsc driver successfully and hence the synthetic network device can't work (we can use the kernel parameter mem=3000M to work around the issue). Cast pte_pfn() to phys_addr_t before shifting. Fixes: "commit d7656534: x86, mm: Create slow_virt_to_phys()" Signed-off-by: NDexuan Cui <decui@microsoft.com> Cc: K. Y. Srinivasan <kys@microsoft.com> Cc: Haiyang Zhang <haiyangz@microsoft.com> Cc: gregkh@linuxfoundation.org Cc: linux-mm@kvack.org Cc: olaf@aepfle.de Cc: apw@canonical.com Cc: jasowang@redhat.com Cc: dave.hansen@intel.com Cc: riel@redhat.com Cc: stable@vger.kernel.org Link: http://lkml.kernel.org/r/1414580017-27444-1-git-send-email-decui@microsoft.comSigned-off-by: NThomas Gleixner <tglx@linutronix.de>
-
- 14 10月, 2014 2 次提交
-
-
由 Xishi Qiu 提交于
If all the nodes are marked hotpluggable, alloc node data will fail. Because __next_mem_range_rev() will skip the hotpluggable memory regions. numa_clear_kernel_node_hotplug() is called after alloc node data. numa_init() ... ret = init_func(); // this will mark hotpluggable flag from SRAT ... memblock_set_bottom_up(false); ... ret = numa_register_memblks(&numa_meminfo); // this will alloc node data(pglist_data) ... numa_clear_kernel_node_hotplug(); // in case all the nodes are hotpluggable ... numa_register_memblks() setup_node_data() memblock_find_in_range_node() __memblock_find_range_top_down() for_each_mem_range_rev() __next_mem_range_rev() This patch moves numa_clear_kernel_node_hotplug() into numa_register_memblks(), clear kernel node hotpluggable flag before alloc node data, then alloc node data won't fail even all the nodes are hotpluggable. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: NXishi Qiu <qiuxishi@huawei.com> Cc: Dave Jones <davej@redhat.com> Cc: Tang Chen <tangchen@cn.fujitsu.com> Cc: Gu Zheng <guz.fnst@cn.fujitsu.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Mike Travis 提交于
Use the optimized ioresource lookup, "region_is_ram", for the ioremap function. If the region is not found, it falls back to the "page_is_ram" function. If it is found and it is RAM, then the usual warning message is issued, and the ioremap operation is aborted. Otherwise, the ioremap operation continues. Signed-off-by: NMike Travis <travis@sgi.com> Acked-by: NAlex Thorlton <athorlton@sgi.com> Reviewed-by: NCliff Wickman <cpw@sgi.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Mark Salter <msalter@redhat.com> Cc: Dave Young <dyoung@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 23 9月, 2014 2 次提交
-
-
由 David Vrabel 提交于
The _PAGE_IO_MAP PTE flag was only used by Xen PV guests to mark PTEs that were used to map I/O regions that are 1:1 in the p2m. This allowed Xen to obtain the correct PFN when converting the MFNs read from a PTE back to their PFN. Xen guests no longer use _PAGE_IOMAP for this. Instead mfn_to_pfn() returns the correct PFN by using a combination of the m2p and p2m to determine if an MFN corresponds to a 1:1 mapping in the the p2m. Remove _PAGE_IOMAP, replacing it with _PAGE_UNUSED2 to allow for future uses of the PTE flag. Signed-off-by: NDavid Vrabel <david.vrabel@citrix.com> Acked-by: N"H. Peter Anvin" <hpa@zytor.com>
-
由 David Vrabel 提交于
If a fault on a kernel address is due to a non-present page, then it cannot be the result of stale TLB entry from a protection change (RO to RW or NX to X). Thus the pagetable walk in spurious_fault() can be skipped. See the initial if in spurious_fault() and the tests in spurious_fault_check()) for the set of possible error codes checked for spurious faults. These are: IRUWP Before x00xx && ( 1xxxx || xxx1x ) After ( 10001 || 00011 ) && ( 1xxxx || xxx1x ) Thus the new condition is a subset of the previous one, excluding only non-present faults (I == 1 and W == 1 are mutually exclusive). This avoids spurious_fault() oopsing in some cases if the pagetables it attempts to walk are not accessible. This obscures the location of the original fault. This also fixes a crash with Xen PV guests when they access entries in the M2P corresponding to device MMIO regions. The M2P is mapped (read-only) by Xen into the kernel address space of the guest and this mapping may contains holes for non-RAM regions. Read faults will result in calls to spurious_fault(), but because the page tables for the M2P mappings are not accessible by the guest the pagetable walk would fault. This was not normally a problem as MMIO mappings would not normally result in a M2P lookup because of the use of the _PAGE_IOMAP bit the PTE. However, removing the _PAGE_IOMAP bit requires M2P lookups for MMIO mappings as well. Signed-off-by: NDavid Vrabel <david.vrabel@citrix.com> Reported-by: NKonrad Rzeszutek Wilk <konrad.wilk@oracle.com> Tested-by: NKonrad Rzeszutek Wilk <konrad.wilk@oracle.com> Acked-by: NDave Hansen <dave.hansen@intel.com>
-
- 19 9月, 2014 2 次提交
-
-
由 Aaron Tomlin 提交于
This facility is used in a few places so let's introduce a helper function to improve code readability. Signed-off-by: NAaron Tomlin <atomlin@redhat.com> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: aneesh.kumar@linux.vnet.ibm.com Cc: dzickus@redhat.com Cc: bmr@redhat.com Cc: jcastillo@redhat.com Cc: oleg@redhat.com Cc: riel@redhat.com Cc: prarit@redhat.com Cc: jgh@redhat.com Cc: minchan@kernel.org Cc: mpe@ellerman.id.au Cc: tglx@linutronix.de Cc: hannes@cmpxchg.org Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Paul Mackerras <paulus@samba.org> Cc: Seiji Aguchi <seiji.aguchi@hds.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Cc: linuxppc-dev@lists.ozlabs.org Link: http://lkml.kernel.org/r/1410527779-8133-3-git-send-email-atomlin@redhat.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Aaron Tomlin 提交于
Tasks get their end of stack set to STACK_END_MAGIC with the aim to catch stack overruns. Currently this feature does not apply to init_task. This patch removes this restriction. Note that a similar patch was posted by Prarit Bhargava some time ago but was never merged: http://marc.info/?l=linux-kernel&m=127144305403241&w=2Signed-off-by: NAaron Tomlin <atomlin@redhat.com> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Acked-by: NOleg Nesterov <oleg@redhat.com> Acked-by: NMichael Ellerman <mpe@ellerman.id.au> Cc: aneesh.kumar@linux.vnet.ibm.com Cc: dzickus@redhat.com Cc: bmr@redhat.com Cc: jcastillo@redhat.com Cc: jgh@redhat.com Cc: minchan@kernel.org Cc: tglx@linutronix.de Cc: hannes@cmpxchg.org Cc: Alex Thorlton <athorlton@sgi.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Daeseok Youn <daeseok.youn@gmail.com> Cc: David Rientjes <rientjes@google.com> Cc: Fabian Frederick <fabf@skynet.be> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Kees Cook <keescook@chromium.org> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Cc: Michael Opdenacker <michael.opdenacker@free-electrons.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Prarit Bhargava <prarit@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Seiji Aguchi <seiji.aguchi@hds.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Vladimir Davydov <vdavydov@parallels.com> Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Cc: linuxppc-dev@lists.ozlabs.org Link: http://lkml.kernel.org/r/1410527779-8133-2-git-send-email-atomlin@redhat.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
- 16 9月, 2014 3 次提交
-
-
由 Luiz Capitulino 提交于
The setup_node_data() function allocates a pg_data_t object, inserts it into the node_data[] array and initializes the following fields: node_id, node_start_pfn and node_spanned_pages. However, a few function calls later during the kernel boot, free_area_init_node() re-initializes those fields, possibly with setup_node_data() is not used. This causes a small glitch when running Linux as a hyperv numa guest: SRAT: PXM 0 -> APIC 0x00 -> Node 0 SRAT: PXM 0 -> APIC 0x01 -> Node 0 SRAT: PXM 1 -> APIC 0x02 -> Node 1 SRAT: PXM 1 -> APIC 0x03 -> Node 1 SRAT: Node 0 PXM 0 [mem 0x00000000-0x7fffffff] SRAT: Node 1 PXM 1 [mem 0x80200000-0xf7ffffff] SRAT: Node 1 PXM 1 [mem 0x100000000-0x1081fffff] NUMA: Node 1 [mem 0x80200000-0xf7ffffff] + [mem 0x100000000-0x1081fffff] -> [mem 0x80200000-0x1081fffff] Initmem setup node 0 [mem 0x00000000-0x7fffffff] NODE_DATA [mem 0x7ffdc000-0x7ffeffff] Initmem setup node 1 [mem 0x80800000-0x1081fffff] NODE_DATA [mem 0x1081ea000-0x1081fdfff] crashkernel: memory value expected [ffffea0000000000-ffffea0001ffffff] PMD -> [ffff88007de00000-ffff88007fdfffff] on node 0 [ffffea0002000000-ffffea00043fffff] PMD -> [ffff880105600000-ffff8801077fffff] on node 1 Zone ranges: DMA [mem 0x00001000-0x00ffffff] DMA32 [mem 0x01000000-0xffffffff] Normal [mem 0x100000000-0x1081fffff] Movable zone start for each node Early memory node ranges node 0: [mem 0x00001000-0x0009efff] node 0: [mem 0x00100000-0x7ffeffff] node 1: [mem 0x80200000-0xf7ffffff] node 1: [mem 0x100000000-0x1081fffff] On node 0 totalpages: 524174 DMA zone: 64 pages used for memmap DMA zone: 21 pages reserved DMA zone: 3998 pages, LIFO batch:0 DMA32 zone: 8128 pages used for memmap DMA32 zone: 520176 pages, LIFO batch:31 On node 1 totalpages: 524288 DMA32 zone: 7672 pages used for memmap DMA32 zone: 491008 pages, LIFO batch:31 Normal zone: 520 pages used for memmap Normal zone: 33280 pages, LIFO batch:7 In this dmesg, the SRAT table reports that the memory range for node 1 starts at 0x80200000. However, the line starting with "Initmem" reports that node 1 memory range starts at 0x80800000. The "Initmem" line is reported by setup_node_data() and is wrong, because the kernel ends up using the range as reported in the SRAT table. This commit drops all that dead code from setup_node_data(), renames it to alloc_node_data() and adds a printk() to free_area_init_node() so that we report a node's memory range accurately. Here's the same dmesg section with this patch applied: SRAT: PXM 0 -> APIC 0x00 -> Node 0 SRAT: PXM 0 -> APIC 0x01 -> Node 0 SRAT: PXM 1 -> APIC 0x02 -> Node 1 SRAT: PXM 1 -> APIC 0x03 -> Node 1 SRAT: Node 0 PXM 0 [mem 0x00000000-0x7fffffff] SRAT: Node 1 PXM 1 [mem 0x80200000-0xf7ffffff] SRAT: Node 1 PXM 1 [mem 0x100000000-0x1081fffff] NUMA: Node 1 [mem 0x80200000-0xf7ffffff] + [mem 0x100000000-0x1081fffff] -> [mem 0x80200000-0x1081fffff] NODE_DATA(0) allocated [mem 0x7ffdc000-0x7ffeffff] NODE_DATA(1) allocated [mem 0x1081ea000-0x1081fdfff] crashkernel: memory value expected [ffffea0000000000-ffffea0001ffffff] PMD -> [ffff88007de00000-ffff88007fdfffff] on node 0 [ffffea0002000000-ffffea00043fffff] PMD -> [ffff880105600000-ffff8801077fffff] on node 1 Zone ranges: DMA [mem 0x00001000-0x00ffffff] DMA32 [mem 0x01000000-0xffffffff] Normal [mem 0x100000000-0x1081fffff] Movable zone start for each node Early memory node ranges node 0: [mem 0x00001000-0x0009efff] node 0: [mem 0x00100000-0x7ffeffff] node 1: [mem 0x80200000-0xf7ffffff] node 1: [mem 0x100000000-0x1081fffff] Initmem setup node 0 [mem 0x00001000-0x7ffeffff] On node 0 totalpages: 524174 DMA zone: 64 pages used for memmap DMA zone: 21 pages reserved DMA zone: 3998 pages, LIFO batch:0 DMA32 zone: 8128 pages used for memmap DMA32 zone: 520176 pages, LIFO batch:31 Initmem setup node 1 [mem 0x80200000-0x1081fffff] On node 1 totalpages: 524288 DMA32 zone: 7672 pages used for memmap DMA32 zone: 491008 pages, LIFO batch:31 Normal zone: 520 pages used for memmap Normal zone: 33280 pages, LIFO batch:7 This commit was tested on a two node bare-metal NUMA machine and Linux as a numa guest on hyperv and qemu/kvm. PS: The wrong memory range reported by setup_node_data() seems to be harmless in the current kernel because it's just not used. However, that bad range is used in kernel 2.6.32 to initialize the old boot memory allocator, which causes a crash during boot. Signed-off-by: NLuiz Capitulino <lcapitulino@redhat.com> Acked-by: NRik van Riel <riel@redhat.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: David Rientjes <rientjes@google.com> Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Cc: Yinghai Lu <yinghai@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Yasuaki Ishimatsu 提交于
When hot-adding/removing memory, sync_global_pgds() is called for synchronizing PGD to PGD entries of all processes MM. But when hot-removing memory, sync_global_pgds() does not work correctly. At first, sync_global_pgds() checks whether target PGD is none or not. And if PGD is none, the PGD is skipped. But when hot-removing memory, PGD may be none since PGD may be cleared by free_pud_table(). So when sync_global_pgds() is called after hot-removing memory, sync_global_pgds() should not skip PGD even if the PGD is none. And sync_global_pgds() must clear PGD entries of all processes MM. Currently sync_global_pgds() does not clear PGD entries of all processes MM when hot-removing memory. So when hot adding memory which is same memory range as removed memory after hot-removing memory, following call traces are shown: kernel BUG at arch/x86/mm/init_64.c:206! ... [<ffffffff815e0c80>] kernel_physical_mapping_init+0x1b2/0x1d2 [<ffffffff815ced94>] init_memory_mapping+0x1d4/0x380 [<ffffffff8104aebd>] arch_add_memory+0x3d/0xd0 [<ffffffff815d03d9>] add_memory+0xb9/0x1b0 [<ffffffff81352415>] acpi_memory_device_add+0x1af/0x28e [<ffffffff81325dc4>] acpi_bus_device_attach+0x8c/0xf0 [<ffffffff813413b9>] acpi_ns_walk_namespace+0xc8/0x17f [<ffffffff81325d38>] ? acpi_bus_type_and_status+0xb7/0xb7 [<ffffffff81325d38>] ? acpi_bus_type_and_status+0xb7/0xb7 [<ffffffff813418ed>] acpi_walk_namespace+0x95/0xc5 [<ffffffff81326b4c>] acpi_bus_scan+0x9a/0xc2 [<ffffffff81326bff>] acpi_scan_bus_device_check+0x8b/0x12e [<ffffffff81326cb5>] acpi_scan_device_check+0x13/0x15 [<ffffffff81320122>] acpi_os_execute_deferred+0x25/0x32 [<ffffffff8107e02b>] process_one_work+0x17b/0x460 [<ffffffff8107edfb>] worker_thread+0x11b/0x400 [<ffffffff8107ece0>] ? rescuer_thread+0x400/0x400 [<ffffffff81085aef>] kthread+0xcf/0xe0 [<ffffffff81085a20>] ? kthread_create_on_node+0x140/0x140 [<ffffffff815fc76c>] ret_from_fork+0x7c/0xb0 [<ffffffff81085a20>] ? kthread_create_on_node+0x140/0x140 This patch clears PGD entries of all processes MM when sync_global_pgds() is called after hot-removing memory Signed-off-by: NYasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Acked-by: NToshi Kani <toshi.kani@hp.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Cc: Tang Chen <tangchen@cn.fujitsu.com> Cc: Gu Zheng <guz.fnst@cn.fujitsu.com> Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Yasuaki Ishimatsu 提交于
When hot-adding memory after hot-removing memory, following call traces are shown: kernel BUG at arch/x86/mm/init_64.c:206! ... [<ffffffff815e0c80>] kernel_physical_mapping_init+0x1b2/0x1d2 [<ffffffff815ced94>] init_memory_mapping+0x1d4/0x380 [<ffffffff8104aebd>] arch_add_memory+0x3d/0xd0 [<ffffffff815d03d9>] add_memory+0xb9/0x1b0 [<ffffffff81352415>] acpi_memory_device_add+0x1af/0x28e [<ffffffff81325dc4>] acpi_bus_device_attach+0x8c/0xf0 [<ffffffff813413b9>] acpi_ns_walk_namespace+0xc8/0x17f [<ffffffff81325d38>] ? acpi_bus_type_and_status+0xb7/0xb7 [<ffffffff81325d38>] ? acpi_bus_type_and_status+0xb7/0xb7 [<ffffffff813418ed>] acpi_walk_namespace+0x95/0xc5 [<ffffffff81326b4c>] acpi_bus_scan+0x9a/0xc2 [<ffffffff81326bff>] acpi_scan_bus_device_check+0x8b/0x12e [<ffffffff81326cb5>] acpi_scan_device_check+0x13/0x15 [<ffffffff81320122>] acpi_os_execute_deferred+0x25/0x32 [<ffffffff8107e02b>] process_one_work+0x17b/0x460 [<ffffffff8107edfb>] worker_thread+0x11b/0x400 [<ffffffff8107ece0>] ? rescuer_thread+0x400/0x400 [<ffffffff81085aef>] kthread+0xcf/0xe0 [<ffffffff81085a20>] ? kthread_create_on_node+0x140/0x140 [<ffffffff815fc76c>] ret_from_fork+0x7c/0xb0 [<ffffffff81085a20>] ? kthread_create_on_node+0x140/0x140 The patch-set fixes the issue. This patch (of 2): remove_pagetable() gets start argument and passes the argument to sync_global_pgds(). In this case, the argument must not be modified. If the argument is modified and passed to sync_global_pgds(), sync_global_pgds() does not correctly synchronize PGD to PGD entries of all processes MM since synchronized range of memory [start, end] is wrong. Unfortunately the start argument is modified in remove_pagetable(). So this patch fixes the issue. Signed-off-by: NYasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Acked-by: NToshi Kani <toshi.kani@hp.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Cc: Tang Chen <tangchen@cn.fujitsu.com> Cc: Gu Zheng <guz.fnst@cn.fujitsu.com> Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: NIngo Molnar <mingo@kernel.org>
-
- 09 9月, 2014 2 次提交
-
-
由 Jan-Simon Möller 提交于
This fixes a compilation error in clang in that a linker section attribute can't be added to a type: arch/x86/mm/mmap.c:34:8: error: '__section__' attribute only applies to functions and global variables struct __read_mostly ... By moving the section attribute to the variable declaration, the desired effect is achieved. Signed-off-by: NJan-Simon Möller <dl9pf@gmx.de> Signed-off-by: NBehan Webster <behanw@converseincode.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/r/1409959005-11479-1-git-send-email-behanw@converseincode.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Mathias Krause 提交于
We should classify the espfix area as such only if we actually have enabled the corresponding option. Otherwise the page table dump might look confusing. Signed-off-by: NMathias Krause <minipli@googlemail.com> Link: http://lkml.kernel.org/r/1410114629-24523-1-git-send-email-minipli@googlemail.com Cc: Arjan van de Ven <arjan.van.de.ven@intel.com> Cc: H. Peter Anvin <hpa@zytor.com> Signed-off-by: NH. Peter Anvin <hpa@zytor.com>
-
- 01 9月, 2014 1 次提交
-
-
由 Matthew Wilcox 提交于
The last user of set_pmd_pfn() went away in commit f03574f2, so this has been dead code for over a year. Signed-off-by: NMatthew Wilcox <matthew.r.wilcox@intel.com> Signed-off-by: NThomas Gleixner <tglx@linutronix.de> arch/x86/include/asm/pgtable_32.h | 3 --- arch/x86/mm/pgtable_32.c | 35 ----------------------------------- 2 files changed, 38 deletions(-)
-
- 27 8月, 2014 1 次提交
-
-
由 Christoph Lameter 提交于
__get_cpu_var() is used for multiple purposes in the kernel source. One of them is address calculation via the form &__get_cpu_var(x). This calculates the address for the instance of the percpu variable of the current processor based on an offset. Other use cases are for storing and retrieving data from the current processors percpu area. __get_cpu_var() can be used as an lvalue when writing data or on the right side of an assignment. __get_cpu_var() is defined as : #define __get_cpu_var(var) (*this_cpu_ptr(&(var))) __get_cpu_var() always only does an address determination. However, store and retrieve operations could use a segment prefix (or global register on other platforms) to avoid the address calculation. this_cpu_write() and this_cpu_read() can directly take an offset into a percpu area and use optimized assembly code to read and write per cpu variables. This patch converts __get_cpu_var into either an explicit address calculation using this_cpu_ptr() or into a use of this_cpu operations that use the offset. Thereby address calculations are avoided and less registers are used when code is generated. Transformations done to __get_cpu_var() 1. Determine the address of the percpu instance of the current processor. DEFINE_PER_CPU(int, y); int *x = &__get_cpu_var(y); Converts to int *x = this_cpu_ptr(&y); 2. Same as #1 but this time an array structure is involved. DEFINE_PER_CPU(int, y[20]); int *x = __get_cpu_var(y); Converts to int *x = this_cpu_ptr(y); 3. Retrieve the content of the current processors instance of a per cpu variable. DEFINE_PER_CPU(int, y); int x = __get_cpu_var(y) Converts to int x = __this_cpu_read(y); 4. Retrieve the content of a percpu struct DEFINE_PER_CPU(struct mystruct, y); struct mystruct x = __get_cpu_var(y); Converts to memcpy(&x, this_cpu_ptr(&y), sizeof(x)); 5. Assignment to a per cpu variable DEFINE_PER_CPU(int, y) __get_cpu_var(y) = x; Converts to __this_cpu_write(y, x); 6. Increment/Decrement etc of a per cpu variable DEFINE_PER_CPU(int, y); __get_cpu_var(y)++ Converts to __this_cpu_inc(y) Cc: Thomas Gleixner <tglx@linutronix.de> Cc: x86@kernel.org Acked-by: NH. Peter Anvin <hpa@linux.intel.com> Acked-by: NIngo Molnar <mingo@kernel.org> Signed-off-by: NChristoph Lameter <cl@linux.com> Signed-off-by: NTejun Heo <tj@kernel.org>
-
- 10 8月, 2014 1 次提交
-
-
由 Jeremiah Mahler 提交于
A sparse warning is generated about 'tlb_single_page_flush_ceiling' not being declared. arch/x86/mm/tlb.c:177:15: warning: symbol 'tlb_single_page_flush_ceiling' was not declared. Should it be static? Since it isn't used anywhere outside this file, fix the warning by making it static. Also, optimize the use of this variable by adding the __read_mostly directive, as suggested by David Rientjes. Suggested-by: NDavid Rientjes <rientjes@google.com> Signed-off-by: NJeremiah Mahler <jmmahler@gmail.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Link: http://lkml.kernel.org/r/1407569913-4035-1-git-send-email-jmmahler@gmail.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
- 08 8月, 2014 1 次提交
-
-
由 Dave Hansen 提交于
Dave Jones reported seeing a bug from one of my TLB tracepoints: http://lkml.kernel.org/r/20140806181801.GA4605@redhat.com According to Paul McKenney, the right way to fix this is adding an _rcuidle suffix to the tracepoint. http://lkml.kernel.org/r/20140807065055.GA5821@linux.vnet.ibm.com This patch does just that. Reported-by: Dave Jones <davej@redhat.com>, Signed-off-by: NDave Hansen <dave.hansen@linux.intel.com> Reviewed-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Dave Hansen <dave@sr71.net> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/20140807175841.5C92D878@viggo.jf.intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
- 07 8月, 2014 3 次提交
-
-
由 Wang Nan 提交于
This patch introduces zone_for_memory() to arch_add_memory() on x86_32 to ensure new, higher memory added into ZONE_MOVABLE if movable zone has already setup. Signed-off-by: NWang Nan <wangnan0@huawei.com> Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Yinghai Lu <yinghai@kernel.org> Cc: "Mel Gorman" <mgorman@suse.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Chris Metcalf <cmetcalf@tilera.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Wang Nan 提交于
This patch introduces zone_for_memory() to arch_add_memory() on x86_64 to ensure new, higher memory added into ZONE_MOVABLE if movable zone has already setup. Signed-off-by: NWang Nan <wangnan0@huawei.com> Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Yinghai Lu <yinghai@kernel.org> Cc: "Mel Gorman" <mgorman@suse.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Chris Metcalf <cmetcalf@tilera.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Paul Cassella 提交于
Add a comment describing the circumstances in which __lock_page_or_retry() will or will not release the mmap_sem when returning 0. Add comments to lock_page_or_retry()'s callers (filemap_fault(), do_swap_page()) noting the impact on VM_FAULT_RETRY returns. Add comments on up the call tree, particularly replacing the false "We return with mmap_sem still held" comments. Signed-off-by: NPaul Cassella <cassella@cray.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 31 7月, 2014 7 次提交
-
-
由 Dave Hansen 提交于
This has been run through Intel's LKP tests across a wide range of modern sytems and workloads and it wasn't shown to make a measurable performance difference positive or negative. Now that we have some shiny new tracepoints, we can actually figure out what the heck is going on. During a kernel compile, 60% of the flush_tlb_mm_range() calls are for a single page. It breaks down like this: size percent percent<= V V V GLOBAL: 2.20% 2.20% avg cycles: 2283 1: 56.92% 59.12% avg cycles: 1276 2: 13.78% 72.90% avg cycles: 1505 3: 8.26% 81.16% avg cycles: 1880 4: 7.41% 88.58% avg cycles: 2447 5: 1.73% 90.31% avg cycles: 2358 6: 1.32% 91.63% avg cycles: 2563 7: 1.14% 92.77% avg cycles: 2862 8: 0.62% 93.39% avg cycles: 3542 9: 0.08% 93.47% avg cycles: 3289 10: 0.43% 93.90% avg cycles: 3570 11: 0.20% 94.10% avg cycles: 3767 12: 0.08% 94.18% avg cycles: 3996 13: 0.03% 94.20% avg cycles: 4077 14: 0.02% 94.23% avg cycles: 4836 15: 0.04% 94.26% avg cycles: 5699 16: 0.06% 94.32% avg cycles: 5041 17: 0.57% 94.89% avg cycles: 5473 18: 0.02% 94.91% avg cycles: 5396 19: 0.03% 94.95% avg cycles: 5296 20: 0.02% 94.96% avg cycles: 6749 21: 0.18% 95.14% avg cycles: 6225 22: 0.01% 95.15% avg cycles: 6393 23: 0.01% 95.16% avg cycles: 6861 24: 0.12% 95.28% avg cycles: 6912 25: 0.05% 95.32% avg cycles: 7190 26: 0.01% 95.33% avg cycles: 7793 27: 0.01% 95.34% avg cycles: 7833 28: 0.01% 95.35% avg cycles: 8253 29: 0.08% 95.42% avg cycles: 8024 30: 0.03% 95.45% avg cycles: 9670 31: 0.01% 95.46% avg cycles: 8949 32: 0.01% 95.46% avg cycles: 9350 33: 3.11% 98.57% avg cycles: 8534 34: 0.02% 98.60% avg cycles: 10977 35: 0.02% 98.62% avg cycles: 11400 We get in to dimishing returns pretty quickly. On pre-IvyBridge CPUs, we used to set the limit at 8 pages, and it was set at 128 on IvyBrige. That 128 number looks pretty silly considering that less than 0.5% of the flushes are that large. The previous code tried to size this number based on the size of the TLB. Good idea, but it's error-prone, needs maintenance (which it didn't get up to now), and probably would not matter in practice much. Settting it to 33 means that we cover the mallopt M_TRIM_THRESHOLD, which is the most universally common size to do flushes. That's the short version. Here's the long one for why I chose 33: 1. These numbers have a constant bias in the timestamps from the tracing. Probably counts for a couple hundred cycles in each of these tests, but it should be fairly _even_ across all of them. The smallest delta between the tracepoints I have ever seen is 335 cycles. This is one reason the cycles/page cost goes down in general as the flushes get larger. The true cost is nearer to 100 cycles. 2. A full flush is more expensive than a single invlpg, but not by much (single percentages). 3. A dtlb miss is 17.1ns (~45 cycles) and a itlb miss is 13.0ns (~34 cycles). At those rates, refilling the 512-entry dTLB takes 22,000 cycles. 4. 22,000 cycles is approximately the equivalent of doing 85 invlpg operations. But, the odds are that the TLB can actually be filled up faster than that because TLB misses that are close in time also tend to leverage the same caches. 6. ~98% of flushes are <=33 pages. There are a lot of flushes of 33 pages, probably because libc's M_TRIM_THRESHOLD is set to 128k (32 pages) 7. I've found no consistent data to support changing the IvyBridge vs. SandyBridge tunable by a factor of 16 I used the performance counters on this hardware (IvyBridge i5-3320M) to figure out the tlb miss costs: ocperf.py stat -e dtlb_load_misses.walk_duration,dtlb_load_misses.walk_completed,dtlb_store_misses.walk_duration,dtlb_store_misses.walk_completed,itlb_misses.walk_duration,itlb_misses.walk_completed,itlb.itlb_flush 7,720,030,970 dtlb_load_misses_walk_duration [57.13%] 169,856,353 dtlb_load_misses_walk_completed [57.15%] 708,832,859 dtlb_store_misses_walk_duration [57.17%] 19,346,823 dtlb_store_misses_walk_completed [57.17%] 2,779,687,402 itlb_misses_walk_duration [57.15%] 82,241,148 itlb_misses_walk_completed [57.13%] 770,717 itlb_itlb_flush [57.11%] Show that a dtlb miss is 17.1ns (~45 cycles) and a itlb miss is 13.0ns (~34 cycles). At those rates, refilling the 512-entry dTLB takes 22,000 cycles. On a SandyBridge system with more cores and larger caches, those are dtlb=13.4ns and itlb=9.5ns. cat perf.stat.txt | perl -pe 's/,//g' | awk '/itlb_misses_walk_duration/ { icyc+=$1 } /itlb_misses_walk_completed/ { imiss+=$1 } /dtlb_.*_walk_duration/ { dcyc+=$1 } /dtlb_.*.*completed/ { dmiss+=$1 } END {print "itlb cyc/miss: ", icyc/imiss, " dtlb cyc/miss: ", dcyc/dmiss, " ----- ", icyc,imiss, dcyc,dmiss } On Westmere CPUs, the counters to use are: itlb_flush,itlb_misses.walk_cycles,itlb_misses.any,dtlb_misses.walk_cycles,dtlb_misses.any The assumptions that this code went in under: https://lkml.org/lkml/2012/6/12/119 say that a flush and a refill are about 100ns. Being generous, that is over by a factor of 6 on the refill side, although it is fairly close on the cost of an invlpg. An increase of a single invlpg operation seems to lengthen the flush range operation by about 200 cycles. Here is one example of the data collected for flushing 10 and 11 pages (full data are below): 10: 0.43% 93.90% avg cycles: 3570 cycles/page: 357 samples: 4714 11: 0.20% 94.10% avg cycles: 3767 cycles/page: 342 samples: 2145 How to generate this table: echo 10000 > /sys/kernel/debug/tracing/buffer_size_kb echo x86-tsc > /sys/kernel/debug/tracing/trace_clock echo 'reason != 0' > /sys/kernel/debug/tracing/events/tlb/tlb_flush/filter echo 1 > /sys/kernel/debug/tracing/events/tlb/tlb_flush/enable Pipe the trace output in to this script: http://sr71.net/~dave/intel/201402-tlb/trace-time-diff-process.pl.txt Note that these data were gathered with the invlpg threshold set to 150 pages. Only data points with >=50 of samples were printed: Flush % of %<= in flush this pages es size ------------------------------------------------------------------------------ -1: 2.20% 2.20% avg cycles: 2283 cycles/page: xxxx samples: 23960 1: 56.92% 59.12% avg cycles: 1276 cycles/page: 1276 samples: 620895 2: 13.78% 72.90% avg cycles: 1505 cycles/page: 752 samples: 150335 3: 8.26% 81.16% avg cycles: 1880 cycles/page: 626 samples: 90131 4: 7.41% 88.58% avg cycles: 2447 cycles/page: 611 samples: 80877 5: 1.73% 90.31% avg cycles: 2358 cycles/page: 471 samples: 18885 6: 1.32% 91.63% avg cycles: 2563 cycles/page: 427 samples: 14397 7: 1.14% 92.77% avg cycles: 2862 cycles/page: 408 samples: 12441 8: 0.62% 93.39% avg cycles: 3542 cycles/page: 442 samples: 6721 9: 0.08% 93.47% avg cycles: 3289 cycles/page: 365 samples: 917 10: 0.43% 93.90% avg cycles: 3570 cycles/page: 357 samples: 4714 11: 0.20% 94.10% avg cycles: 3767 cycles/page: 342 samples: 2145 12: 0.08% 94.18% avg cycles: 3996 cycles/page: 333 samples: 864 13: 0.03% 94.20% avg cycles: 4077 cycles/page: 313 samples: 289 14: 0.02% 94.23% avg cycles: 4836 cycles/page: 345 samples: 236 15: 0.04% 94.26% avg cycles: 5699 cycles/page: 379 samples: 390 16: 0.06% 94.32% avg cycles: 5041 cycles/page: 315 samples: 643 17: 0.57% 94.89% avg cycles: 5473 cycles/page: 321 samples: 6229 18: 0.02% 94.91% avg cycles: 5396 cycles/page: 299 samples: 224 19: 0.03% 94.95% avg cycles: 5296 cycles/page: 278 samples: 367 20: 0.02% 94.96% avg cycles: 6749 cycles/page: 337 samples: 185 21: 0.18% 95.14% avg cycles: 6225 cycles/page: 296 samples: 1964 22: 0.01% 95.15% avg cycles: 6393 cycles/page: 290 samples: 83 23: 0.01% 95.16% avg cycles: 6861 cycles/page: 298 samples: 61 24: 0.12% 95.28% avg cycles: 6912 cycles/page: 288 samples: 1307 25: 0.05% 95.32% avg cycles: 7190 cycles/page: 287 samples: 533 26: 0.01% 95.33% avg cycles: 7793 cycles/page: 299 samples: 94 27: 0.01% 95.34% avg cycles: 7833 cycles/page: 290 samples: 66 28: 0.01% 95.35% avg cycles: 8253 cycles/page: 294 samples: 73 29: 0.08% 95.42% avg cycles: 8024 cycles/page: 276 samples: 846 30: 0.03% 95.45% avg cycles: 9670 cycles/page: 322 samples: 296 31: 0.01% 95.46% avg cycles: 8949 cycles/page: 288 samples: 79 32: 0.01% 95.46% avg cycles: 9350 cycles/page: 292 samples: 60 33: 3.11% 98.57% avg cycles: 8534 cycles/page: 258 samples: 33936 34: 0.02% 98.60% avg cycles: 10977 cycles/page: 322 samples: 268 35: 0.02% 98.62% avg cycles: 11400 cycles/page: 325 samples: 177 36: 0.01% 98.63% avg cycles: 11504 cycles/page: 319 samples: 161 37: 0.02% 98.65% avg cycles: 11596 cycles/page: 313 samples: 182 38: 0.02% 98.66% avg cycles: 11850 cycles/page: 311 samples: 195 39: 0.01% 98.68% avg cycles: 12158 cycles/page: 311 samples: 128 40: 0.01% 98.68% avg cycles: 11626 cycles/page: 290 samples: 78 41: 0.04% 98.73% avg cycles: 11435 cycles/page: 278 samples: 477 42: 0.01% 98.73% avg cycles: 12571 cycles/page: 299 samples: 74 43: 0.01% 98.74% avg cycles: 12562 cycles/page: 292 samples: 78 44: 0.01% 98.75% avg cycles: 12991 cycles/page: 295 samples: 108 45: 0.01% 98.76% avg cycles: 13169 cycles/page: 292 samples: 78 46: 0.02% 98.78% avg cycles: 12891 cycles/page: 280 samples: 261 47: 0.01% 98.79% avg cycles: 13099 cycles/page: 278 samples: 67 48: 0.01% 98.80% avg cycles: 13851 cycles/page: 288 samples: 77 49: 0.01% 98.80% avg cycles: 13749 cycles/page: 280 samples: 66 50: 0.01% 98.81% avg cycles: 13949 cycles/page: 278 samples: 73 52: 0.00% 98.82% avg cycles: 14243 cycles/page: 273 samples: 52 54: 0.01% 98.83% avg cycles: 15312 cycles/page: 283 samples: 87 55: 0.01% 98.84% avg cycles: 15197 cycles/page: 276 samples: 109 56: 0.02% 98.86% avg cycles: 15234 cycles/page: 272 samples: 208 57: 0.00% 98.86% avg cycles: 14888 cycles/page: 261 samples: 53 58: 0.01% 98.87% avg cycles: 15037 cycles/page: 259 samples: 59 59: 0.01% 98.87% avg cycles: 15752 cycles/page: 266 samples: 63 62: 0.00% 98.89% avg cycles: 16222 cycles/page: 261 samples: 54 64: 0.02% 98.91% avg cycles: 17179 cycles/page: 268 samples: 248 65: 0.12% 99.03% avg cycles: 18762 cycles/page: 288 samples: 1324 85: 0.00% 99.10% avg cycles: 21649 cycles/page: 254 samples: 50 127: 0.01% 99.18% avg cycles: 32397 cycles/page: 255 samples: 75 128: 0.13% 99.31% avg cycles: 31711 cycles/page: 247 samples: 1466 129: 0.18% 99.49% avg cycles: 33017 cycles/page: 255 samples: 1927 181: 0.33% 99.84% avg cycles: 2489 cycles/page: 13 samples: 3547 256: 0.05% 99.91% avg cycles: 2305 cycles/page: 9 samples: 550 512: 0.03% 99.95% avg cycles: 2133 cycles/page: 4 samples: 304 1512: 0.01% 99.99% avg cycles: 3038 cycles/page: 2 samples: 65 Here are the tlb counters during a 10-second slice of a kernel compile for a SandyBridge system. It's better than IvyBridge, but probably due to the larger caches since this was one of the 'X' extreme parts. 10,873,007,282 dtlb_load_misses_walk_duration 250,711,333 dtlb_load_misses_walk_completed 1,212,395,865 dtlb_store_misses_walk_duration 31,615,772 dtlb_store_misses_walk_completed 5,091,010,274 itlb_misses_walk_duration 163,193,511 itlb_misses_walk_completed 1,321,980 itlb_itlb_flush 10.008045158 seconds time elapsed # cat perf.stat.1392743721.txt | perl -pe 's/,//g' | awk '/itlb_misses_walk_duration/ { icyc+=$1 } /itlb_misses_walk_completed/ { imiss+=$1 } /dtlb_.*_walk_duration/ { dcyc+=$1 } /dtlb_.*.*completed/ { dmiss+=$1 } END {print "itlb cyc/miss: ", icyc/imiss/3.3, " dtlb cyc/miss: ", dcyc/dmiss/3.3, " ----- ", icyc,imiss, dcyc,dmiss }' itlb ns/miss: 9.45338 dtlb ns/miss: 12.9716 Signed-off-by: NDave Hansen <dave.hansen@linux.intel.com> Link: http://lkml.kernel.org/r/20140731154103.10C1115E@viggo.jf.intel.comAcked-by: NRik van Riel <riel@redhat.com> Acked-by: NMel Gorman <mgorman@suse.de> Signed-off-by: NH. Peter Anvin <hpa@linux.intel.com>
-
由 Dave Hansen 提交于
Most of the logic here is in the documentation file. Please take a look at it. I know we've come full-circle here back to a tunable, but this new one is *WAY* simpler. I challenge anyone to describe in one sentence how the old one worked. Here's the way the new one works: If we are flushing more pages than the ceiling, we use the full flush, otherwise we use per-page flushes. Signed-off-by: NDave Hansen <dave.hansen@linux.intel.com> Link: http://lkml.kernel.org/r/20140731154101.12B52CAF@viggo.jf.intel.comAcked-by: NRik van Riel <riel@redhat.com> Acked-by: NMel Gorman <mgorman@suse.de> Signed-off-by: NH. Peter Anvin <hpa@linux.intel.com>
-
由 Dave Hansen 提交于
We don't have any good way to figure out what kinds of flushes are being attempted. Right now, we can try to use the vm counters, but those only tell us what we actually did with the hardware (one-by-one vs full) and don't tell us what was actually _requested_. This allows us to select out "interesting" TLB flushes that we might want to optimize (like the ranged ones) and ignore the ones that we have very little control over (the ones at context switch). Signed-off-by: NDave Hansen <dave.hansen@linux.intel.com> Link: http://lkml.kernel.org/r/20140731154059.4C96CBA5@viggo.jf.intel.comAcked-by: NRik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Signed-off-by: NH. Peter Anvin <hpa@linux.intel.com>
-
由 Dave Hansen 提交于
There are currently three paths through the remote flush code: 1. full invalidation 2. single page invalidation using invlpg 3. ranged invalidation using invlpg This takes 2 and 3 and combines them in to a single path by making the single-page one just be the start and end be start plus a single page. This makes placement of our tracepoint easier. Signed-off-by: NDave Hansen <dave.hansen@linux.intel.com> Link: http://lkml.kernel.org/r/20140731154058.E0F90408@viggo.jf.intel.com Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Signed-off-by: NH. Peter Anvin <hpa@linux.intel.com>
-
由 Dave Hansen 提交于
If we take the if (end == TLB_FLUSH_ALL || vmflag & VM_HUGETLB) { local_flush_tlb(); goto out; } path out of flush_tlb_mm_range(), we will have flushed the tlb, but not incremented NR_TLB_LOCAL_FLUSH_ALL. This unifies the way out of the function so that we always take a single path when doing a full tlb flush. Signed-off-by: NDave Hansen <dave.hansen@linux.intel.com> Link: http://lkml.kernel.org/r/20140731154056.FF763B76@viggo.jf.intel.comAcked-by: NRik van Riel <riel@redhat.com> Acked-by: NMel Gorman <mgorman@suse.de> Signed-off-by: NH. Peter Anvin <hpa@linux.intel.com>
-
由 Dave Hansen 提交于
I think the flush_tlb_mm_range() code that tries to tune the flush sizes based on the CPU needs to get ripped out for several reasons: 1. It is obviously buggy. It uses mm->total_vm to judge the task's footprint in the TLB. It should certainly be using some measure of RSS, *NOT* ->total_vm since only resident memory can populate the TLB. 2. Haswell, and several other CPUs are missing from the intel_tlb_flushall_shift_set() function. Thus, it has been demonstrated to bitrot quickly in practice. 3. It is plain wrong in my vm: [ 0.037444] Last level iTLB entries: 4KB 0, 2MB 0, 4MB 0 [ 0.037444] Last level dTLB entries: 4KB 0, 2MB 0, 4MB 0 [ 0.037444] tlb_flushall_shift: 6 Which leads to it to never use invlpg. 4. The assumptions about TLB refill costs are wrong: http://lkml.kernel.org/r/1337782555-8088-3-git-send-email-alex.shi@intel.com (more on this in later patches) 5. I can not reproduce the original data: https://lkml.org/lkml/2012/5/17/59 I believe the sample times were too short. Running the benchmark in a loop yields times that vary quite a bit. Note that this leaves us with a static ceiling of 1 page. This is a conservative, dumb setting, and will be revised in a later patch. This also removes the code which attempts to predict whether we are flushing data or instructions. We expect instruction flushes to be relatively rare and not worth tuning for explicitly. Signed-off-by: NDave Hansen <dave.hansen@linux.intel.com> Link: http://lkml.kernel.org/r/20140731154055.ABC88E89@viggo.jf.intel.comAcked-by: NRik van Riel <riel@redhat.com> Acked-by: NMel Gorman <mgorman@suse.de> Signed-off-by: NH. Peter Anvin <hpa@linux.intel.com>
-
由 Dave Hansen 提交于
The if (cpumask_any_but(mm_cpumask(mm), smp_processor_id()) < nr_cpu_ids) line of code is not exactly the easiest to audit, especially when it ends up at two different indentation levels. This eliminates one of the the copy-n-paste versions. It also gives us a unified exit point for each path through this function. We need this in a minute for our tracepoint. Signed-off-by: NDave Hansen <dave.hansen@linux.intel.com> Link: http://lkml.kernel.org/r/20140731154054.44F1CDDC@viggo.jf.intel.comAcked-by: NRik van Riel <riel@redhat.com> Acked-by: NMel Gorman <mgorman@suse.de> Signed-off-by: NH. Peter Anvin <hpa@linux.intel.com>
-
- 12 6月, 2014 1 次提交
-
-
由 Jiri Kosina 提交于
If pagefault triggers due to SMEP triggering, it can't be really easily distinguished from any other oops-causing pagefault, which might lead to quite some confusion when trying to understand the reason for the oops. Print an explanatory message in case the fault happened during instruction fetch for _PAGE_USER page which is present and executable on SMEP-enabled CPUs. This is consistent with what we are doing for NX already; in addition to immediately seeing from the oops what might be happening, it can even easily give a good indication to sysadmins who are carefully monitoring their kernel logs that someone might be trying to pwn them. Signed-off-by: NJiri Kosina <jkosina@suse.cz> Link: http://lkml.kernel.org/r/alpine.LNX.2.00.1406102248490.1321@pobox.suse.czSigned-off-by: NH. Peter Anvin <hpa@linux.intel.com>
-
- 05 6月, 2014 4 次提交
-
-
由 Emil Medve 提交于
Signed-off-by: NEmil Medve <Emilian.Medve@Freescale.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Yinghai Lu <yinghai@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Yinghai Lu 提交于
On system with 2TiB ram, current x86_64 have 128M as section size, and one memory_block only include one section. So will have 16400 entries under /sys/devices/system/memory/. Current code try to use block id to find block pointer in /sys for any section, and reuse that block pointer. that finding will take some time even after commit 7c243c71 ("mm: speedup in __early_pfn_to_nid") that will skip the search in that case during booting up. So solution could be increase block size just like SGI UV system did. (harded code to 2g). This patch is trying to probe the block size to make it match mmio remap size. for example, Intel Nehalem later system will have memory range [0, TOML), [4g, TOMH]. If the memory hole is 2g and total is 128g, TOM will be 2g, and TOM2 will be 130g. We could use 2g as block size instead of default 128M. That will reduce number of entries in /sys/devices/system/memory/ On system 6TiB system will reduce boot time by 35 seconds. Signed-off-by: NYinghai Lu <yinghai@kernel.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Mel Gorman 提交于
_PAGE_NUMA is currently an alias of _PROT_PROTNONE to trap NUMA hinting faults on x86. Care is taken such that _PAGE_NUMA is used only in situations where the VMA flags distinguish between NUMA hinting faults and prot_none faults. This decision was x86-specific and conceptually it is difficult requiring special casing to distinguish between PROTNONE and NUMA ptes based on context. Fundamentally, we only need the _PAGE_NUMA bit to tell the difference between an entry that is really unmapped and a page that is protected for NUMA hinting faults as if the PTE is not present then a fault will be trapped. Swap PTEs on x86-64 use the bits after _PAGE_GLOBAL for the offset. This patch shrinks the maximum possible swap size and uses the bit to uniquely distinguish between NUMA hinting ptes and swap ptes. Signed-off-by: NMel Gorman <mgorman@suse.de> Cc: David Vrabel <david.vrabel@citrix.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Peter Anvin <hpa@zytor.com> Cc: Fengguang Wu <fengguang.wu@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Steven Noonan <steven@uplinklabs.net> Cc: Rik van Riel <riel@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Cc: Cyrill Gorcunov <gorcunov@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Naoya Horiguchi 提交于
Currently hugepage migration is available for all archs which support pmd-level hugepage, but testing is done only for x86_64 and there're bugs for other archs. So to avoid breaking such archs, this patch limits the availability strictly to x86_64 until developers of other archs get interested in enabling this feature. Simply disabling hugepage migration on non-x86_64 archs is not enough to fix the reported problem where sys_move_pages() hits the BUG_ON() in follow_page(FOLL_GET), so let's fix this by checking if hugepage migration is supported in vma_migratable(). Signed-off-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com> Reported-by: NMichael Ellerman <mpe@ellerman.id.au> Tested-by: NMichael Ellerman <mpe@ellerman.id.au> Acked-by: NHugh Dickins <hughd@google.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Tony Luck <tony.luck@intel.com> Cc: Russell King <rmk@arm.linux.org.uk> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: James Hogan <james.hogan@imgtec.com> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: David Miller <davem@davemloft.net> Cc: <stable@vger.kernel.org> [3.12+] Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 21 5月, 2014 2 次提交
-
-
由 Andy Lutomirski 提交于
This removes the last vestiges of arch_vma_name from x86, replacing it with vm_ops->name. Good riddance. Signed-off-by: NAndy Lutomirski <luto@amacapital.net> Link: http://lkml.kernel.org/r/e681cb56096eee5b8b8767093a4f6fb82839f0a4.1400538962.git.luto@amacapital.netSigned-off-by: NH. Peter Anvin <hpa@linux.intel.com>
-
由 Andy Lutomirski 提交于
Using arch_vma_name to give special mappings a name is awkward. x86 currently implements it by comparing the start address of the vma to the expected address of the vdso. This requires tracking the start address of special mappings and is probably buggy if a special vma is split or moved. Improve _install_special_mapping to just name the vma directly. Use it to give the x86 vvar area a name, which should make CRIU's life easier. As a side effect, the vvar area will show up in core dumps. This could be considered weird and is fixable. [hpa: I say we accept this as-is but be prepared to deal with knocking out the vvars from core dumps if this becomes a problem.] Cc: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: NAndy Lutomirski <luto@amacapital.net> Link: http://lkml.kernel.org/r/276b39b6b645fb11e345457b503f17b83c2c6fd0.1400538962.git.luto@amacapital.netSigned-off-by: NH. Peter Anvin <hpa@linux.intel.com>
-
- 06 5月, 2014 3 次提交
-
-
由 Andy Lutomirski 提交于
This makes the 64-bit and x32 vdsos use the same mechanism as the 32-bit vdso. Most of the churn is deleting all the old fixmap code. Signed-off-by: NAndy Lutomirski <luto@amacapital.net> Link: http://lkml.kernel.org/r/8af87023f57f6bb96ec8d17fce3f88018195b49b.1399317206.git.luto@amacapital.netSigned-off-by: NH. Peter Anvin <hpa@linux.intel.com>
-
由 Andy Lutomirski 提交于
Currently, vdso.so files are prepared and analyzed by a combination of objcopy, nm, some linker script tricks, and some simple ELF parsers in the kernel. Replace all of that with plain C code that runs at build time. All five vdso images now generate .c files that are compiled and linked in to the kernel image. This should cause only one userspace-visible change: the loaded vDSO images are stripped more heavily than they used to be. Everything outside the loadable segment is dropped. In particular, this causes the section table and section name strings to be missing. This should be fine: real dynamic loaders don't load or inspect these tables anyway. The result is roughly equivalent to eu-strip's --strip-sections option. The purpose of this change is to enable the vvar and hpet mappings to be moved to the page following the vDSO load segment. Currently, it is possible for the section table to extend into the page after the load segment, so, if we map it, it risks overlapping the vvar or hpet page. This happens whenever the load segment is just under a multiple of PAGE_SIZE. The only real subtlety here is that the old code had a C file with inline assembler that did 'call VDSO32_vsyscall' and a linker script that defined 'VDSO32_vsyscall = __kernel_vsyscall'. This most likely worked by accident: the linker script entry defines a symbol associated with an address as opposed to an alias for the real dynamic symbol __kernel_vsyscall. That caused ld to relocate the reference at link time instead of leaving an interposable dynamic relocation. Since the VDSO32_vsyscall hack is no longer needed, I now use 'call __kernel_vsyscall', and I added -Bsymbolic to make it work. vdso2c will generate an error and abort the build if the resulting image contains any dynamic relocations, so we won't silently generate bad vdso images. (Dynamic relocations are a problem because nothing will even attempt to relocate the vdso.) Signed-off-by: NAndy Lutomirski <luto@amacapital.net> Link: http://lkml.kernel.org/r/2c4fcf45524162a34d87fdda1eb046b2a5cecee7.1399317206.git.luto@amacapital.netSigned-off-by: NH. Peter Anvin <hpa@linux.intel.com>
-
由 Andy Lutomirski 提交于
The early_ioremap code requires that its buffers not span a PMD boundary. The logic for ensuring that only works if the fixmap is aligned, so assert that it's aligned correctly. To make this work reliably, reserve_top_address needs to be adjusted. Signed-off-by: NAndy Lutomirski <luto@amacapital.net> Link: http://lkml.kernel.org/r/e59a5f4362661f75dd4841fa74e1f2448045e245.1399317206.git.luto@amacapital.netSigned-off-by: NH. Peter Anvin <hpa@linux.intel.com>
-
- 03 5月, 2014 1 次提交
-
-
由 Roland Dreier 提交于
In __ioremap_caller() (the guts of ioremap), we loop over the range of pfns being remapped and checks each one individually with page_is_ram(). For large ioremaps, this can be very slow. For example, we have a device with a 256 GiB PCI BAR, and ioremapping this BAR can take 20+ seconds -- sometimes long enough to trigger the soft lockup detector! Internally, page_is_ram() calls walk_system_ram_range() on a single page. Instead, we can make a single call to walk_system_ram_range() from __ioremap_caller(), and do our further checks only for any RAM pages that we find. For the common case of MMIO, this saves an enormous amount of work, since the range being ioremapped doesn't intersect system RAM at all. With this change, ioremap on our 256 GiB BAR takes less than 1 second. Signed-off-by: NRoland Dreier <roland@purestorage.com> Link: http://lkml.kernel.org/r/1399054721-1331-1-git-send-email-roland@kernel.orgSigned-off-by: NH. Peter Anvin <hpa@linux.intel.com>
-
- 01 5月, 2014 1 次提交
-
-
由 H. Peter Anvin 提交于
The IRET instruction, when returning to a 16-bit segment, only restores the bottom 16 bits of the user space stack pointer. This causes some 16-bit software to break, but it also leaks kernel state to user space. We have a software workaround for that ("espfix") for the 32-bit kernel, but it relies on a nonzero stack segment base which is not available in 64-bit mode. In checkin: b3b42ac2 x86-64, modify_ldt: Ban 16-bit segments on 64-bit kernels we "solved" this by forbidding 16-bit segments on 64-bit kernels, with the logic that 16-bit support is crippled on 64-bit kernels anyway (no V86 support), but it turns out that people are doing stuff like running old Win16 binaries under Wine and expect it to work. This works around this by creating percpu "ministacks", each of which is mapped 2^16 times 64K apart. When we detect that the return SS is on the LDT, we copy the IRET frame to the ministack and use the relevant alias to return to userspace. The ministacks are mapped readonly, so if IRET faults we promote #GP to #DF which is an IST vector and thus has its own stack; we then do the fixup in the #DF handler. (Making #GP an IST exception would make the msr_safe functions unsafe in NMI/MC context, and quite possibly have other effects.) Special thanks to: - Andy Lutomirski, for the suggestion of using very small stack slots and copy (as opposed to map) the IRET frame there, and for the suggestion to mark them readonly and let the fault promote to #DF. - Konrad Wilk for paravirt fixup and testing. - Borislav Petkov for testing help and useful comments. Reported-by: NBrian Gerst <brgerst@gmail.com> Signed-off-by: NH. Peter Anvin <hpa@linux.intel.com> Link: http://lkml.kernel.org/r/1398816946-3351-1-git-send-email-hpa@linux.intel.com Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Andrew Lutomriski <amluto@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Dirk Hohndel <dirk@hohndel.org> Cc: Arjan van de Ven <arjan.van.de.ven@intel.com> Cc: comex <comexk@gmail.com> Cc: Alexander van Heukelum <heukelum@fastmail.fm> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: <stable@vger.kernel.org> # consider after upstream merge
-
- 24 4月, 2014 1 次提交
-
-
由 Masami Hiramatsu 提交于
Use NOKPROBE_SYMBOL macro for protecting functions from kprobes instead of __kprobes annotation under arch/x86. This applies nokprobe_inline annotation for some cases, because NOKPROBE_SYMBOL() will inhibit inlining by referring the symbol address. This just folds a bunch of previous NOKPROBE_SYMBOL() cleanup patches for x86 to one patch. Signed-off-by: NMasami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Link: http://lkml.kernel.org/r/20140417081814.26341.51656.stgit@ltc230.yrl.intra.hitachi.co.jp Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Borislav Petkov <bp@suse.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Fernando Luis Vázquez Cao <fernando_b1@lab.ntt.co.jp> Cc: Gleb Natapov <gleb@redhat.com> Cc: Jason Wang <jasowang@redhat.com> Cc: Jesper Nilsson <jesper.nilsson@axis.com> Cc: Jiri Kosina <jkosina@suse.cz> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Jiri Slaby <jslaby@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Lebon <jlebon@redhat.com> Cc: Kees Cook <keescook@chromium.org> Cc: Matt Fleming <matt.fleming@intel.com> Cc: Michel Lespinasse <walken@google.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Paul Gortmaker <paul.gortmaker@windriver.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Seiji Aguchi <seiji.aguchi@hds.com> Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vineet Gupta <vgupta@synopsys.com> Signed-off-by: NIngo Molnar <mingo@kernel.org>
-