- 24 2月, 2013 11 次提交
-
-
由 Tang Chen 提交于
We now provide an option for users who don't want to specify physical memory address in kernel commandline. /* * For movablemem_map=acpi: * * SRAT: |_____| |_____| |_________| |_________| ...... * node id: 0 1 1 2 * hotpluggable: n y y n * movablemem_map: |_____| |_________| * * Using movablemem_map, we can prevent memblock from allocating memory * on ZONE_MOVABLE at boot time. */ So user just specify movablemem_map=acpi, and the kernel will use hotpluggable info in SRAT to determine which memory ranges should be set as ZONE_MOVABLE. If all the memory ranges in SRAT is hotpluggable, then no memory can be used by kernel. But before parsing SRAT, memblock has already reserve some memory ranges for other purposes, such as for kernel image, and so on. We cannot prevent kernel from using these memory. So we need to exclude these ranges even if these memory is hotpluggable. Furthermore, there could be several memory ranges in the single node which the kernel resides in. We may skip one range that have memory reserved by memblock, but if the rest of memory is too small, then the kernel will fail to boot. So, make the whole node which the kernel resides in un-hotpluggable. Then the kernel has enough memory to use. NOTE: Using this way will cause NUMA performance down because the whole node will be set as ZONE_MOVABLE, and kernel cannot use memory on it. If users don't want to lose NUMA performance, just don't use it. [akpm@linux-foundation.org: fix warning] [akpm@linux-foundation.org: use strcmp()] Signed-off-by: NTang Chen <tangchen@cn.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Jiang Liu <jiang.liu@huawei.com> Cc: Jianguo Wu <wujianguo@huawei.com> Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Lai Jiangshan <laijs@cn.fujitsu.com> Cc: Wu Jianguo <wujianguo@huawei.com> Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Len Brown <lenb@kernel.org> Cc: "Brown, Len" <len.brown@intel.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Tang Chen 提交于
When implementing movablemem_map boot option, we introduced an array movablemem_map.map[] to store the memory ranges to be set as ZONE_MOVABLE. Since ZONE_MOVABLE is the latst zone of a node, if user didn't specify the whole node memory range, we need to extend it to the node end so that we can use it to prevent memblock from allocating memory in the ranges user didn't specify. We now implement movablemem_map boot option like this: /* * For movablemem_map=nn[KMG]@ss[KMG]: * * SRAT: |_____| |_____| |_________| |_________| ...... * node id: 0 1 1 2 * user specified: |__| |___| * movablemem_map: |___| |_________| |______| ...... * * Using movablemem_map, we can prevent memblock from allocating memory * on ZONE_MOVABLE at boot time. * * NOTE: In this case, SRAT info will be ingored. */ [akpm@linux-foundation.org: clean up code, fix build warning] Signed-off-by: NTang Chen <tangchen@cn.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Jiang Liu <jiang.liu@huawei.com> Cc: Jianguo Wu <wujianguo@huawei.com> Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Lai Jiangshan <laijs@cn.fujitsu.com> Cc: Wu Jianguo <wujianguo@huawei.com> Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Len Brown <lenb@kernel.org> Cc: "Brown, Len" <len.brown@intel.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Tang Chen 提交于
On linux, the pages used by kernel could not be migrated. As a result, if a memory range is used by kernel, it cannot be hot-removed. So if we want to hot-remove memory, we should prevent kernel from using it. The way now used to prevent this is specify a memory range by movablemem_map boot option and set it as ZONE_MOVABLE. But when the system is booting, memblock will allocate memory, and reserve the memory for kernel. And before we parse SRAT, and know the node memory ranges, memblock is working. And it may allocate memory in ranges to be set as ZONE_MOVABLE. This memory can be used by kernel, and never be freed. So, let's parse SRAT before memblock is called first. And it is early enough. The first call of memblock_find_in_range_node() is in: setup_arch() |-->setup_real_mode() so, this patch add a function early_parse_srat() to parse SRAT, and call it before setup_real_mode() is called. NOTE: 1) early_parse_srat() is called before numa_init(), and has initialized numa_meminfo. So DO NOT clear numa_nodes_parsed in numa_init() and DO NOT zero numa_meminfo in numa_init(), otherwise we will lose memory numa info. 2) I don't know why using count of memory affinities parsed from SRAT as a return value in original acpi_numa_init(). So I add a static variable srat_mem_cnt to remember this count and use it as the return value of the new acpi_numa_init() [mhocko@suse.cz: parse SRAT before memblock is ready fix] Signed-off-by: NTang Chen <tangchen@cn.fujitsu.com> Reviewed-by: NWen Congyang <wency@cn.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Jiang Liu <jiang.liu@huawei.com> Cc: Jianguo Wu <wujianguo@huawei.com> Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Lai Jiangshan <laijs@cn.fujitsu.com> Cc: Wu Jianguo <wujianguo@huawei.com> Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Len Brown <lenb@kernel.org> Cc: "Brown, Len" <len.brown@intel.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Yasuaki Ishimatsu 提交于
During the implementation of SRAT support, we met a problem. In setup_arch(), we have the following call series: 1) memblock is ready; 2) some functions use memblock to allocate memory; 3) parse ACPI tables, such as SRAT. Before 3), we don't know which memory is hotpluggable, and as a result, we cannot prevent memblock from allocating hotpluggable memory. So, in 2), there could be some hotpluggable memory allocated by memblock. Now, we are trying to parse SRAT earlier, before memblock is ready. But I think we need more investigation on this topic. So in this v5, I dropped all the SRAT support, and v5 is just the same as v3, and it is based on 3.8-rc3. As we planned, we will support getting info from SRAT without users' participation at last. And we will post another patch-set to do so. And also, I think for now, we can add this boot option as the first step of supporting movable node. Since Linux cannot migrate the direct mapped pages, the only way for now is to limit the whole node containing only movable memory. Using SRAT is one way. But even if we can use SRAT, users still need an interface to enable/disable this functionality if they don't want to loose their NUMA performance. So I think, a user interface is always needed. For now, users can disable this functionality by not specifying the boot option. Later, we will post SRAT support, and add another option value "movablecore_map=acpi" to using SRAT. This patch: If system can create movable node which all memory of the node is allocated as ZONE_MOVABLE, setup_node_data() cannot allocate memory for the node's pg_data_t. So, use memblock_alloc_try_nid() instead of memblock_alloc_nid() to retry when the first allocation fails. Signed-off-by: NYasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Signed-off-by: NLai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: NTang Chen <tangchen@cn.fujitsu.com> Signed-off-by: NJiang Liu <jiang.liu@huawei.com> Cc: Wu Jianguo <wujianguo@huawei.com> Cc: Wen Congyang <wency@cn.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Ingo Molnar <mingo@elte.hu> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Wen Congyang 提交于
When the node is offlined, there is no memory/cpu on the node. If a sleep task runs on a cpu of this node, it will be migrated to the cpu on the other node. So we can clear cpu-to-node mapping. [akpm@linux-foundation.org: numa_clear_node() and numa_set_node() can no longer be __cpuinit] Signed-off-by: NWen Congyang <wency@cn.fujitsu.com> Signed-off-by: NTang Chen <tangchen@cn.fujitsu.com> Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Cc: David Rientjes <rientjes@google.com> Cc: Jiang Liu <liuj97@gmail.com> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Wen Congyang 提交于
When a cpu is hotpluged, we call acpi_map_cpu2node() in _acpi_map_lsapic() to store the cpu's node and apicid's node. But we don't clear the cpu's node in acpi_unmap_lsapic() when this cpu is hotremoved. If the node is also hotremoved, we will get the following messages: kernel BUG at include/linux/gfp.h:329! invalid opcode: 0000 [#1] SMP Modules linked in: ebtable_nat ebtables ipt_MASQUERADE iptable_nat nf_nat xt_CHECKSUM iptable_mangle bridge stp llc sunrpc ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter ip_tables ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables binfmt_misc dm_mirror dm_region_hash dm_log dm_mod vhost_net macvtap macvlan tun uinput iTCO_wdt iTCO_vendor_support coretemp kvm_intel kvm crc32c_intel microcode pcspkr i2c_i801 i2c_core lpc_ich mfd_core ioatdma e1000e i7core_edac edac_core sg acpi_memhotplug igb dca sd_mod crc_t10dif megaraid_sas mptsas mptscsih mptbase scsi_transport_sas scsi_mod Pid: 3126, comm: init Not tainted 3.6.0-rc3-tangchen-hostbridge+ #13 FUJITSU-SV PRIMEQUEST 1800E/SB RIP: 0010:[<ffffffff811bc3fd>] [<ffffffff811bc3fd>] allocate_slab+0x28d/0x300 RSP: 0018:ffff88078a049cf8 EFLAGS: 00010246 RAX: 0000000000000000 RBX: 0000000000000001 RCX: 0000000000000000 RDX: 0000000000000001 RSI: 0000000000000001 RDI: 0000000000000246 RBP: ffff88078a049d38 R08: 00000000000040d0 R09: 0000000000000001 R10: 0000000000000000 R11: 0000000000000b5f R12: 00000000000052d0 R13: ffff8807c1417300 R14: 0000000000030038 R15: 0000000000000003 FS: 00007fa9b1b44700(0000) GS:ffff8807c3800000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 00007fa9b09acca0 CR3: 000000078b855000 CR4: 00000000000007e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process init (pid: 3126, threadinfo ffff88078a048000, task ffff8807bb6f2650) Call Trace: new_slab+0x30/0x1b0 __slab_alloc+0x358/0x4c0 kmem_cache_alloc_node_trace+0xb4/0x1e0 alloc_fair_sched_group+0xd0/0x1b0 sched_create_group+0x3e/0x110 sched_autogroup_create_attach+0x4d/0x180 sys_setsid+0xd4/0xf0 system_call_fastpath+0x16/0x1b Code: 89 c4 e9 73 fe ff ff 31 c0 89 de 48 c7 c7 45 de 9e 81 44 89 45 c8 e8 22 05 4b 00 85 db 44 8b 45 c8 0f 89 4f ff ff ff 0f 0b eb fe <0f> 0b 90 eb fd 0f 0b eb fe 89 de 48 c7 c7 45 de 9e 81 31 c0 44 RIP [<ffffffff811bc3fd>] allocate_slab+0x28d/0x300 RSP <ffff88078a049cf8> ---[ end trace adf84c90f3fea3e5 ]--- The reason is that the cpu's node is not NUMA_NO_NODE, we will call alloc_pages_exact_node() to alloc memory on the node, but the node is offlined. If the node is onlined, we still need cpu's node. For example: a task on the cpu is sleeped when the cpu is hotremoved. We will choose another cpu to run this task when it is waked up. If we know the cpu's node, we will choose the cpu on the same node first. So we should clear cpu-to-node mapping when the node is offlined. This patch only clears apicid-to-node mapping when the cpu is hotremoved. [akpm@linux-foundation.org: fix section error] Signed-off-by: NWen Congyang <wency@cn.fujitsu.com> Signed-off-by: NTang Chen <tangchen@cn.fujitsu.com> Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Cc: David Rientjes <rientjes@google.com> Cc: Jiang Liu <liuj97@gmail.com> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Tang Chen 提交于
Introduce a new API vmemmap_free() to free and remove vmemmap pagetables. Since pagetable implements are different, each architecture has to provide its own version of vmemmap_free(), just like vmemmap_populate(). Note: vmemmap_free() is not implemented for ia64, ppc, s390, and sparc. [mhocko@suse.cz: fix implicit declaration of remove_pagetable] Signed-off-by: NYasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Signed-off-by: NJianguo Wu <wujianguo@huawei.com> Signed-off-by: NWen Congyang <wency@cn.fujitsu.com> Signed-off-by: NTang Chen <tangchen@cn.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Jiang Liu <jiang.liu@huawei.com> Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Lai Jiangshan <laijs@cn.fujitsu.com> Cc: Wu Jianguo <wujianguo@huawei.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Signed-off-by: NMichal Hocko <mhocko@suse.cz> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Tang Chen 提交于
Search a page table about the removed memory, and clear page table for x86_64 architecture. [akpm@linux-foundation.org: make kernel_physical_mapping_remove() static] Signed-off-by: NWen Congyang <wency@cn.fujitsu.com> Signed-off-by: NJianguo Wu <wujianguo@huawei.com> Signed-off-by: NJiang Liu <jiang.liu@huawei.com> Signed-off-by: NTang Chen <tangchen@cn.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Lai Jiangshan <laijs@cn.fujitsu.com> Cc: Wu Jianguo <wujianguo@huawei.com> Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Wen Congyang 提交于
When memory is removed, the corresponding pagetables should alse be removed. This patch introduces some common APIs to support vmemmap pagetable and x86_64 architecture direct mapping pagetable removing. All pages of virtual mapping in removed memory cannot be freed if some pages used as PGD/PUD include not only removed memory but also other memory. So this patch uses the following way to check whether a page can be freed or not. 1) When removing memory, the page structs of the removed memory are filled with 0FD. 2) All page structs are filled with 0xFD on PT/PMD, PT/PMD can be cleared. In this case, the page used as PT/PMD can be freed. For direct mapping pages, update direct_pages_count[level] when we freed their pagetables. And do not free the pages again because they were freed when offlining. For vmemmap pages, free the pages and their pagetables. For larger pages, do not split them into smaller ones because there is no way to know if the larger page has been split. As a result, there is no way to decide when to split. We deal the larger pages in the following way: 1) For direct mapped pages, all the pages were freed when they were offlined. And since menmory offline is done section by section, all the memory ranges being removed are aligned to PAGE_SIZE. So only need to deal with unaligned pages when freeing vmemmap pages. 2) For vmemmap pages being used to store page_struct, if part of the larger page is still in use, just fill the unused part with 0xFD. And when the whole page is fulfilled with 0xFD, then free the larger page. [akpm@linux-foundation.org: fix typo in comment] [tangchen@cn.fujitsu.com: do not calculate direct mapping pages when freeing vmemmap pagetables] [tangchen@cn.fujitsu.com: do not free direct mapping pages twice] [tangchen@cn.fujitsu.com: do not free page split from hugepage one by one] [tangchen@cn.fujitsu.com: do not split pages when freeing pagetable pages] [akpm@linux-foundation.org: use pmd_page_vaddr()] [akpm@linux-foundation.org: fix used-uninitialised bug] Signed-off-by: NYasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Signed-off-by: NJianguo Wu <wujianguo@huawei.com> Signed-off-by: NWen Congyang <wency@cn.fujitsu.com> Signed-off-by: NTang Chen <tangchen@cn.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Jiang Liu <jiang.liu@huawei.com> Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Lai Jiangshan <laijs@cn.fujitsu.com> Cc: Wu Jianguo <wujianguo@huawei.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Yasuaki Ishimatsu 提交于
For removing memmap region of sparse-vmemmap which is allocated bootmem, memmap region of sparse-vmemmap needs to be registered by get_page_bootmem(). So the patch searches pages of virtual mapping and registers the pages by get_page_bootmem(). NOTE: register_page_bootmem_memmap() is not implemented for ia64, ppc, s390, and sparc. So introduce CONFIG_HAVE_BOOTMEM_INFO_NODE and revert register_page_bootmem_info_node() when platform doesn't support it. It's implemented by adding a new Kconfig option named CONFIG_HAVE_BOOTMEM_INFO_NODE, which will be automatically selected by memory-hotplug feature fully supported archs(currently only on x86_64). Since we have 2 config options called MEMORY_HOTPLUG and MEMORY_HOTREMOVE used for memory hot-add and hot-remove separately, and codes in function register_page_bootmem_info_node() are only used for collecting infomation for hot-remove, so reside it under MEMORY_HOTREMOVE. Besides page_isolation.c selected by MEMORY_ISOLATION under MEMORY_HOTPLUG is also such case, move it too. [mhocko@suse.cz: put register_page_bootmem_memmap inside CONFIG_MEMORY_HOTPLUG_SPARSE] [linfeng@cn.fujitsu.com: introduce CONFIG_HAVE_BOOTMEM_INFO_NODE and revert register_page_bootmem_info_node()] [mhocko@suse.cz: remove the arch specific functions without any implementation] [linfeng@cn.fujitsu.com: mm/Kconfig: move auto selects from MEMORY_HOTPLUG to MEMORY_HOTREMOVE as needed] [rientjes@google.com: fix defined but not used warning] Signed-off-by: NWen Congyang <wency@cn.fujitsu.com> Signed-off-by: NYasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Signed-off-by: NTang Chen <tangchen@cn.fujitsu.com> Reviewed-by: NWu Jianguo <wujianguo@huawei.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Jiang Liu <jiang.liu@huawei.com> Cc: Jianguo Wu <wujianguo@huawei.com> Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Lai Jiangshan <laijs@cn.fujitsu.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Signed-off-by: NMichal Hocko <mhocko@suse.cz> Signed-off-by: NLin Feng <linfeng@cn.fujitsu.com> Signed-off-by: NDavid Rientjes <rientjes@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Wen Congyang 提交于
For removing memory, we need to remove page tables. But it depends on architecture. So the patch introduce arch_remove_memory() for removing page table. Now it only calls __remove_pages(). Note: __remove_pages() for some archtecuture is not implemented (I don't know how to implement it for s390). Signed-off-by: NWen Congyang <wency@cn.fujitsu.com> Signed-off-by: NTang Chen <tangchen@cn.fujitsu.com> Acked-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Jiang Liu <jiang.liu@huawei.com> Cc: Jianguo Wu <wujianguo@huawei.com> Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Lai Jiangshan <laijs@cn.fujitsu.com> Cc: Wu Jianguo <wujianguo@huawei.com> Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 23 2月, 2013 2 次提交
-
-
由 Konrad Rzeszutek Wilk 提交于
With commit 8170e6be ("x86, 64bit: Use a #PF handler to materialize early mappings on demand") we started hitting an early bootup crash where the Xen hypervisor would inform us that: (XEN) d7:v0: unhandled page fault (ec=0000) (XEN) Pagetable walk from ffffea000005b2d0: (XEN) L4[0x1d4] = 0000000000000000 ffffffffffffffff (XEN) domain_crash_sync called from entry.S (XEN) Domain 7 (vcpu#0) crashed on cpu#3: (XEN) ----[ Xen-4.2.0 x86_64 debug=n Not tainted ]---- .. that Xen was unable to context switch back to dom0. Looking at the calling stack we find: [<ffffffff8103feba>] xen_get_user_pgd+0x5a <-- [<ffffffff8103feba>] xen_get_user_pgd+0x5a [<ffffffff81042d27>] xen_write_cr3+0x77 [<ffffffff81ad2d21>] init_mem_mapping+0x1f9 [<ffffffff81ac293f>] setup_arch+0x742 [<ffffffff81666d71>] printk+0x48 We are trying to figure out whether we need to up-date the user PGD as well. Please keep in mind that under 64-bit PV guests we have a limited amount of rings: 0 for the Hypervisor, and 1 for both the Linux kernel and user-space. As such the Linux pvops'fied version of write_cr3 checks if it has to update the user-space cr3 as well. That clearly is not needed during early bootup. The recent changes (see above git commit) streamline the x86 page table allocation to be much simpler (And also incidentally the #PF handler ends up in spirit being similar to how the Xen toolstack sets up the initial page-tables). The fix is to have an early-bootup version of cr3 that just loads the kernel %cr3. The later version - which also handles user-page modifications will be used after the initial page tables have been setup. [ hpa: removed a redundant #ifdef and made the new function __init. Also note that x86-32 already has such an early xen_write_cr3. ] Tested-by: N"H. Peter Anvin" <hpa@zytor.com> Reported-by: NKonrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: NKonrad Rzeszutek Wilk <konrad.wilk@oracle.com> Link: http://lkml.kernel.org/r/1361579812-23709-1-git-send-email-konrad.wilk@oracle.comSigned-off-by: NH. Peter Anvin <hpa@zytor.com> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Linus Torvalds 提交于
The code requires the use of the proper per-exception-vector stub functions (set up as the early_idt_handlers[] array - note the 's') that make sure to set up the error vector number. This is true regardless of whether CONFIG_EARLY_PRINTK is set or not. Why? The stack offset for the comparison of __KERNEL_CS won't be right otherwise, nor will the new check (from commit 8170e6be: "x86, 64bit: Use a #PF handler to materialize early mappings on demand") for the page fault exception vector. Acked-by: NH. Peter Anvin <hpa@zytor.com> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 21 2月, 2013 1 次提交
-
-
由 Marcelo Tosatti 提交于
This reverts commit caf6900f. It is causing migration failures, reference https://bugzilla.kernel.org/show_bug.cgi?id=54061. Signed-off-by: NMarcelo Tosatti <mtosatti@redhat.com>
-
- 20 2月, 2013 6 次提交
-
-
由 Ian Campbell 提交于
On ARM we want these to be the same size on 32- and 64-bit. This is an ABI change on ARM. X86 does not change. Signed-off-by: NIan Campbell <ian.campbell@citrix.com> Cc: Jan Beulich <JBeulich@suse.com> Cc: Keir (Xen.org) <keir@xen.org> Cc: Tim Deegan <tim@xen.org> Cc: Stefano Stabellini <stefano.stabellini@eu.citrix.com> Cc: linux-arm-kernel@lists.infradead.org Cc: xen-devel@lists.xen.org Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: NKonrad Rzeszutek Wilk <konrad.wilk@oracle.com>
-
由 Stefan Bader 提交于
There is a loophole between Xen's current implementation of pv-spinlocks and the scheduler. This was triggerable through a testcase until v3.6 changed the TLB flushing code. The problem potentially is still there just not observable in the same way. What could happen was (is): 1. CPU n tries to schedule task x away and goes into a slow wait for the runq lock of CPU n-# (must be one with a lower number). 2. CPU n-#, while processing softirqs, tries to balance domains and goes into a slow wait for its own runq lock (for updating some records). Since this is a spin_lock_irqsave in softirq context, interrupts will be re-enabled for the duration of the poll_irq hypercall used by Xen. 3. Before the runq lock of CPU n-# is unlocked, CPU n-1 receives an interrupt (e.g. endio) and when processing the interrupt, tries to wake up task x. But that is in schedule and still on_cpu, so try_to_wake_up goes into a tight loop. 4. The runq lock of CPU n-# gets unlocked, but the message only gets sent to the first waiter, which is CPU n-# and that is busily stuck. 5. CPU n-# never returns from the nested interruption to take and release the lock because the scheduler uses a busy wait. And CPU n never finishes the task migration because the unlock notification only went to CPU n-#. To avoid this and since the unlocking code has no real sense of which waiter is best suited to grab the lock, just send the IPI to all of them. This causes the waiters to return from the hyper- call (those not interrupted at least) and do active spinlocking. BugLink: http://bugs.launchpad.net/bugs/1011792Acked-by: NJan Beulich <JBeulich@suse.com> Signed-off-by: NStefan Bader <stefan.bader@canonical.com> Cc: stable@vger.kernel.org Signed-off-by: NKonrad Rzeszutek Wilk <konrad.wilk@oracle.com>
-
由 Stefano Stabellini 提交于
ioremap can't be used to map ring pages on ARM because it uses device memory caching attributes (MT_DEVICE*). Introduce a Xen specific abstraction to map ring pages, called xen_remap, that is defined as ioremap on x86 (no behavioral changes). On ARM it explicitly calls __arm_ioremap with the right caching attributes: MT_MEMORY. Signed-off-by: NStefano Stabellini <stefano.stabellini@eu.citrix.com> Signed-off-by: NKonrad Rzeszutek Wilk <konrad.wilk@oracle.com>
-
由 Konrad Rzeszutek Wilk 提交于
The PV and PVH code CPU init code share some functionality. The PVH code ("xen/pvh: Extend vcpu_guest_context, p2m, event, and XenBus") sets some of these up, but not all. To make it easier to read, this patch removes the PV specific out of the generic way. No functional change - just code movement. Acked-by: NStefano Stabellini <stefano.stabellini@eu.citrix.com> [v2: Fixed compile errors noticed by Fengguang Wu build system] Signed-off-by: NKonrad Rzeszutek Wilk <konrad.wilk@oracle.com>
-
由 Borislav Petkov 提交于
The "x86, AMD: Enable WC+ memory type on family 10 processors" patch currently in -tip added a workaround for AMD F10h CPUs which #GPs my guest when booted in kvm. This is because it accesses MSR_AMD64_BU_CFG2 which is not currently ignored by kvm. Do that because this MSR is only baremetal-relevant anyway. While at it, move the ignored MSRs at the beginning of kvm_set_msr_common so that we exit then and there. Acked-by: NGleb Natapov <gleb@redhat.com> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Andre Przywara <andre@andrep.de> Cc: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: NBorislav Petkov <bp@suse.de> Link: http://lkml.kernel.org/r/1361298793-31834-2-git-send-email-bp@alien8.deSigned-off-by: NH. Peter Anvin <hpa@linux.intel.com>
-
由 Borislav Petkov 提交于
The WC+ workaround for F10h introduces a new MSR and kvm host #GPs on accesses to unknown MSRs if paravirt is not compiled in. Use the exception-handling MSR accessors so as not to break 3.8 and later guests booting on older hosts. Remove a redundant family check while at it. Cc: Gleb Natapov <gleb@redhat.com> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Signed-off-by: NBorislav Petkov <bp@suse.de> Link: http://lkml.kernel.org/r/1361298793-31834-1-git-send-email-bp@alien8.deSigned-off-by: NH. Peter Anvin <hpa@linux.intel.com>
-
- 19 2月, 2013 1 次提交
-
-
由 Marcelo Tosatti 提交于
To match whats mapped via vsyscalls to userspace. Reported-by: NPeter Hurley <peter@hurleysoftware.com> Signed-off-by: NMarcelo Tosatti <mtosatti@redhat.com>
-
- 18 2月, 2013 3 次提交
-
-
由 Alexander Holler 提交于
By just reversing the order memtest is using the test patterns, an additional round to zero the memory is not necessary. This might save up to a second or even more for setups which are doing tests on every boot. Signed-off-by: NAlexander Holler <holler@ahsoftware.de> Cc: Yinghai Lu <yinghai@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1361029097-8308-1-git-send-email-holler@ahsoftware.deSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Len Brown 提交于
(pm_idle)() is being removed from linux/pm.h because Linux does not have such a cross-architecture concept. x86 uses an idle function pointer in its architecture specific code as a backup to cpuidle. So we re-name x86 use of pm_idle to x86_idle, and make it static to x86. Signed-off-by: NLen Brown <len.brown@intel.com> Cc: x86@kernel.org
-
由 Len Brown 提交于
Update APM to register its local idle routine with cpuidle. This allows us to stop exporting pm_idle to modules on x86. The Kconfig sub-option, APM_CPU_IDLE, now depends on on CPU_IDLE. Compile-tested only. Signed-off-by: NLen Brown <len.brown@intel.com> Reviewed-by: NDaniel Lezcano <daniel.lezcano@linaro.org> Cc: Jiri Kosina <jkosina@suse.cz>
-
- 16 2月, 2013 1 次提交
-
-
由 Jacob Shin 提交于
On AMD family 15h processors, there are 4 new performance counters (in addition to 6 core performance counters) that can be used for counting northbridge events (i.e. DRAM accesses). Their bit fields are almost identical to the core performance counters. However, unlike the core performance counters, these MSRs are shared between multiple cores (that share the same northbridge). We will reuse the same code path as existing family 10h northbridge event constraints handler logic to enforce this sharing. Signed-off-by: NJacob Shin <jacob.shin@amd.com> Acked-by: NStephane Eranian <eranian@google.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Jacob Shin <jacob.shin@amd.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1360171589-6381-7-git-send-email-jacob.shin@amd.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
- 15 2月, 2013 6 次提交
-
-
由 Satoru Takeuchi 提交于
vread_hpet() uses "0xf0" as the offset of the hpet counter. To clarify the meaning of this code, it should use symbolic name, HPET_COUNTER, instead. Signed-off-by: NSatoru Takeuchi <takeuchi_satoru@jp.fujitsu.com> Cc: H. Peter Anvin <hpa@linux.intel.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Signed-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Konrad Rzeszutek Wilk 提交于
This reverts commit 9d02b43d. We are doing this b/c on 32-bit PVonHVM with older hypervisors (Xen 4.1) it ends up bothing up the start_info. This is bad b/c we use it for the time keeping, and the timekeeping code loops forever - as the version field never changes. Olaf says to revert it, so lets do that. Acked-by: NOlaf Hering <olaf@aepfle.de> Signed-off-by: NKonrad Rzeszutek Wilk <konrad.wilk@oracle.com>
-
由 Konrad Rzeszutek Wilk 提交于
This reverts commit a7be94ac. Signed-off-by: NKonrad Rzeszutek Wilk <konrad.wilk@oracle.com>
-
由 Randy Dunlap 提交于
Fix kconfig warning for LGUEST_GUEST config by selecting TTY: warning: (KVMTOOL_TEST_ENABLE && LGUEST_GUEST) selects VIRTIO_CONSOLE which has unmet direct dependencies (VIRTIO && TTY) Signed-off-by: NRandy Dunlap <rdunlap@infradead.org> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Joe Millenbach <jmillenbach@gmail.com> Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
-
由 H. Peter Anvin 提交于
Move the reservation of low memory, except for the 4K which actually does belong to the BIOS, later in the initialization; in particular, after we have already reserved the trampoline. The current code locates the trampoline as high as possible, so by deferring the allocation we will still be able to reserve as much memory as is possible. This allows us to run with reservelow=640k without getting a crash on system startup. Signed-off-by: NH. Peter Anvin <hpa@zytor.com> Link: http://lkml.kernel.org/n/tip-0y9dqmmsousf69wutxwl3kkf@git.kernel.org
-
由 Paul Gortmaker 提交于
Commit cb57a2b4 ("x86-32: Export kernel_stack_pointer() for modules") added an include of the module.h header in conjunction with adding an EXPORT_SYMBOL_GPL of kernel_stack_pointer. But module.h should be avoided for simple exports, since it in turn includes the world. Swap the module.h for export.h instead. Cc: Jiri Kosina <trivial@kernel.org> Signed-off-by: NPaul Gortmaker <paul.gortmaker@windriver.com> Link: http://lkml.kernel.org/r/1360872842-28417-1-git-send-email-paul.gortmaker@windriver.comSigned-off-by: NH. Peter Anvin <hpa@linux.intel.com>
-
- 14 2月, 2013 7 次提交
-
-
由 Al Viro 提交于
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
由 Al Viro 提交于
__ARCH_WANT_SYS_RT_SIGACTION, __ARCH_WANT_SYS_RT_SIGSUSPEND, __ARCH_WANT_COMPAT_SYS_RT_SIGSUSPEND, __ARCH_WANT_COMPAT_SYS_SCHED_RR_GET_INTERVAL - not used anymore CONFIG_GENERIC_{SIGALTSTACK,COMPAT_RT_SIG{ACTION,QUEUEINFO,PENDING,PROCMASK}} - can be assumed always set.
-
由 Jan Kiszka 提交于
We already pass vmcs12 as argument. Signed-off-by: NJan Kiszka <jan.kiszka@siemens.com> Signed-off-by: NGleb Natapov <gleb@redhat.com>
-
由 Satoru Takeuchi 提交于
There was a serious problem in samsung-laptop that its platform driver is designed to run under BIOS and running under EFI can cause the machine to become bricked or can cause Machine Check Exceptions. Discussion about this problem: https://bugs.launchpad.net/ubuntu-cdimage/+bug/1040557 https://bugzilla.kernel.org/show_bug.cgi?id=47121 The patches to fix this problem: efi: Make 'efi_enabled' a function to query EFI facilities 83e68189 samsung-laptop: Disable on EFI hardware e0094244 Unfortunately this problem comes back again if users specify "noefi" option. This parameter clears EFI_BOOT and that driver continues to run even if running under EFI. Refer to the document, this parameter should clear EFI_RUNTIME_SERVICES instead. Documentation/kernel-parameters.txt: =============================================================================== ... noefi [X86] Disable EFI runtime services support. ... =============================================================================== Documentation/x86/x86_64/uefi.txt: =============================================================================== ... - If some or all EFI runtime services don't work, you can try following kernel command line parameters to turn off some or all EFI runtime services. noefi turn off all EFI runtime services ... =============================================================================== Signed-off-by: NSatoru Takeuchi <takeuchi_satoru@jp.fujitsu.com> Link: http://lkml.kernel.org/r/511C2C04.2070108@jp.fujitsu.com Cc: Matt Fleming <matt.fleming@intel.com> Cc: <stable@vger.kernel.org> Signed-off-by: NH. Peter Anvin <hpa@linux.intel.com>
-
由 Len Brown 提交于
The SMI counter is popular -- so display it by default rather than requiring an option. What the heck, we've blown the 80 column budget on many systems already... Note that the value displayed is the delta during the measurement interval. The absolute value of the counter can still be seen with the generic 32-bit MSR option, ie. -m 0x34 Signed-off-by: NLen Brown <len.brown@intel.com>
-
由 Jan Beulich 提交于
This fixes CVE-2013-0228 / XSA-42 Drew Jones while working on CVE-2013-0190 found that that unprivileged guest user in 32bit PV guest can use to crash the > guest with the panic like this: ------------- general protection fault: 0000 [#1] SMP last sysfs file: /sys/devices/vbd-51712/block/xvda/dev Modules linked in: sunrpc ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter ip_tables ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables ipv6 xen_netfront ext4 mbcache jbd2 xen_blkfront dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan] Pid: 1250, comm: r Not tainted 2.6.32-356.el6.i686 #1 EIP: 0061:[<c0407462>] EFLAGS: 00010086 CPU: 0 EIP is at xen_iret+0x12/0x2b EAX: eb8d0000 EBX: 00000001 ECX: 08049860 EDX: 00000010 ESI: 00000000 EDI: 003d0f00 EBP: b77f8388 ESP: eb8d1fe0 DS: 0000 ES: 007b FS: 0000 GS: 00e0 SS: 0069 Process r (pid: 1250, ti=eb8d0000 task=c2953550 task.ti=eb8d0000) Stack: 00000000 0027f416 00000073 00000206 b77f8364 0000007b 00000000 00000000 Call Trace: Code: c3 8b 44 24 18 81 4c 24 38 00 02 00 00 8d 64 24 30 e9 03 00 00 00 8d 76 00 f7 44 24 08 00 00 02 80 75 33 50 b8 00 e0 ff ff 21 e0 <8b> 40 10 8b 04 85 a0 f6 ab c0 8b 80 0c b0 b3 c0 f6 44 24 0d 02 EIP: [<c0407462>] xen_iret+0x12/0x2b SS:ESP 0069:eb8d1fe0 general protection fault: 0000 [#2] ---[ end trace ab0d29a492dcd330 ]--- Kernel panic - not syncing: Fatal exception Pid: 1250, comm: r Tainted: G D --------------- 2.6.32-356.el6.i686 #1 Call Trace: [<c08476df>] ? panic+0x6e/0x122 [<c084b63c>] ? oops_end+0xbc/0xd0 [<c084b260>] ? do_general_protection+0x0/0x210 [<c084a9b7>] ? error_code+0x73/ ------------- Petr says: " I've analysed the bug and I think that xen_iret() cannot cope with mangled DS, in this case zeroed out (null selector/descriptor) by either xen_failsafe_callback() or RESTORE_REGS because the corresponding LDT entry was invalidated by the reproducer. " Jan took a look at the preliminary patch and came up a fix that solves this problem: "This code gets called after all registers other than those handled by IRET got already restored, hence a null selector in %ds or a non-null one that got loaded from a code or read-only data descriptor would cause a kernel mode fault (with the potential of crashing the kernel as a whole, if panic_on_oops is set)." The way to fix this is to realize that the we can only relay on the registers that IRET restores. The two that are guaranteed are the %cs and %ss as they are always fixed GDT selectors. Also they are inaccessible from user mode - so they cannot be altered. This is the approach taken in this patch. Another alternative option suggested by Jan would be to relay on the subtle realization that using the %ebp or %esp relative references uses the %ss segment. In which case we could switch from using %eax to %ebp and would not need the %ss over-rides. That would also require one extra instruction to compensate for the one place where the register is used as scaled index. However Andrew pointed out that is too subtle and if further work was to be done in this code-path it could escape folks attention and lead to accidents. Reviewed-by: NPetr Matousek <pmatouse@redhat.com> Reported-by: NPetr Matousek <pmatouse@redhat.com> Reviewed-by: NAndrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: NJan Beulich <jbeulich@suse.com> Signed-off-by: NKonrad Rzeszutek Wilk <konrad.wilk@oracle.com>
-
由 Gleb Natapov 提交于
Reported-by: NPaolo Bonzini <pbonzini@redhat.com> Suggested-by: NPaolo Bonzini <pbonzini@redhat.com> Reviewed-by: NPaolo Bonzini <pbonzini@redhat.com> Signed-off-by: NGleb Natapov <gleb@redhat.com>
-
- 13 2月, 2013 2 次提交
-
-
由 Mel Gorman 提交于
A user reported the following oops when a backup process reads /proc/kcore: BUG: unable to handle kernel paging request at ffffbb00ff33b000 IP: [<ffffffff8103157e>] kern_addr_valid+0xbe/0x110 [...] Call Trace: [<ffffffff811b8aaa>] read_kcore+0x17a/0x370 [<ffffffff811ad847>] proc_reg_read+0x77/0xc0 [<ffffffff81151687>] vfs_read+0xc7/0x130 [<ffffffff811517f3>] sys_read+0x53/0xa0 [<ffffffff81449692>] system_call_fastpath+0x16/0x1b Investigation determined that the bug triggered when reading system RAM at the 4G mark. On this system, that was the first address using 1G pages for the virt->phys direct mapping so the PUD is pointing to a physical address, not a PMD page. The problem is that the page table walker in kern_addr_valid() is not checking pud_large() and treats the physical address as if it was a PMD. If it happens to look like pmd_none then it'll silently fail, probably returning zeros instead of real data. If the data happens to look like a present PMD though, it will be walked resulting in the oops above. This patch adds the necessary pud_large() check. Unfortunately the problem was not readily reproducible and now they are running the backup program without accessing /proc/kcore so the patch has not been validated but I think it makes sense. Signed-off-by: NMel Gorman <mgorman@suse.de> Reviewed-by: NRik van Riel <riel@redhat.coM> Reviewed-by: NMichal Hocko <mhocko@suse.cz> Acked-by: NJohannes Weiner <hannes@cmpxchg.org> Cc: stable@vger.kernel.org Cc: linux-mm@kvack.org Link: http://lkml.kernel.org/r/20130211145236.GX21389@suse.deSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 K. Y. Srinivasan 提交于
Starting with win8, vmbus interrupts can be delivered on any VCPU in the guest and furthermore can be concurrently active on multiple VCPUs. Support this interrupt delivery model by setting up a separate IDT entry for Hyper-V vmbus. interrupts. I would like to thank Jan Beulich <JBeulich@suse.com> and Thomas Gleixner <tglx@linutronix.de>, for their help. In this version of the patch, based on the feedback, I have merged the IDT vector for Xen and Hyper-V and made the necessary adjustments. Furhermore, based on Jan's feedback I have added the necessary compilation switches. Signed-off-by: NK. Y. Srinivasan <kys@microsoft.com> Link: http://lkml.kernel.org/r/1359940959-32168-3-git-send-email-kys@microsoft.comSigned-off-by: NH. Peter Anvin <hpa@linux.intel.com>
-