- 24 2月, 2011 9 次提交
-
-
ioapic_xlate provides a translation from the information in device tree to ioapic related informations. This includes - obtaining hw irq which is the vector number "=> pin number + gsi" - obtaining type (level/edge/..) - programming this information into ioapic ioapic_add_ofnode adds an irq_domain based on informations from the device tree. This information (irq_domain) is required in order to map a device to its proper interrupt controller. [ tglx: Adapted to the io_apic changes, which let us move that whole code to devicetree.c ] Signed-off-by: NSebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: NDirk Brandewie <dirk.brandewie@gmail.com> Acked-by: NGrant Likely <grant.likely@secretlab.ca> Cc: sodaville@linutronix.de Cc: devicetree-discuss@lists.ozlabs.org LKML-Reference: <1298405266-1624-10-git-send-email-bigeasy@linutronix.de> Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
-
x86_of_pci_init() does two things: - it provides a generic irq enable and disable function. enable queries the device tree for the interrupt information, calls ->xlate on the irq host and updates the pci->irq information for the device. - it walks through PCI bus(es) in the device tree and adds its children (device) nodes to appropriate pci_dev nodes in kernel. So the dtb node information is available at probe time of the PCI device. Adding a PCI bus based on the information in the device tree is currently not supported. Right now direct access via ioports is used. Signed-off-by: NSebastian Andrzej Siewior <bigeasy@linutronix.de> Tested-by: NDirk Brandewie <dirk.brandewie@gmail.com> Acked-by: NGrant Likely <grant.likely@secretlab.ca> Cc: sodaville@linutronix.de Cc: devicetree-discuss@lists.ozlabs.org LKML-Reference: <1298405266-1624-8-git-send-email-bigeasy@linutronix.de> Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
-
APIC and IO_APIC have to be added to the system early because native_init_IRQ() requires it. In order to obtain the address of the ioapic the device tree has to be unflattened so of_address_to_resource() works. The device tree is relocated to ensure it is always covered by the kernel mapping. That way the boot loader does not have to make any assumptions about kernel's memory layout. Signed-off-by: NSebastian Andrzej Siewior <bigeasy@linutronix.de> Acked-by: NGrant Likely <grant.likely@secretlab.ca> Cc: sodaville@linutronix.de Cc: devicetree-discuss@lists.ozlabs.org Cc: Dirk Brandewie <dirk.brandewie@gmail.com> LKML-Reference: <1298405266-1624-6-git-send-email-bigeasy@linutronix.de> Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
-
The here introduced irq_domain abstraction represents a generic irq controller. It is a subset of powerpc's irq_host which is going to be renamed to irq_domain and then become generic. This implementation will be removed once it is generic. The xlate callback is resposible to parse irq informations like irq type and number and returns the hardware irq number which is reported by the hardware as active. Signed-off-by: NSebastian Andrzej Siewior <bigeasy@linutronix.de> Tested-by: NDirk Brandewie <dirk.brandewie@gmail.com> Acked-by: NGrant Likely <grant.likely@secretlab.ca> Cc: sodaville@linutronix.de Cc: devicetree-discuss@lists.ozlabs.org LKML-Reference: <1298405266-1624-5-git-send-email-bigeasy@linutronix.de> Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
-
This patch adds minimal support for device tree on x86. The device tree blob is passed to the kernel via setup_data which requires at least boot protocol 2.09. Memory size, restricted memory regions, boot arguments are gathered the traditional way so things like cmd_line are just here to let the code compile. The current plan is use the device tree as an extension and to gather information which can not be enumerated and would have to be hardcoded otherwise. This includes things like - which devices are on this I2C/SPI bus? - how are the interrupts wired to IO APIC? - where could my hpet be? Signed-off-by: NSebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: NDirk Brandewie <dirk.brandewie@gmail.com> Acked-by: NGrant Likely <grant.likely@secretlab.ca> Cc: sodaville@linutronix.de Cc: devicetree-discuss@lists.ozlabs.org LKML-Reference: <1298405266-1624-3-git-send-email-bigeasy@linutronix.de> Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
-
This patch ensures that the memory passed from parse_setup_data() is large enough to cover the complete data structure. That means that the conditional mapping in parse_e820_ext() can go. While here, I also attempt not to map two pages if the address is not aligned to a page boundary. Signed-off-by: NSebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: NDirk Brandewie <dirk.brandewie@gmail.com> Cc: sodaville@linutronix.de Cc: devicetree-discuss@lists.ozlabs.org LKML-Reference: <1298405266-1624-2-git-send-email-bigeasy@linutronix.de> Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
-
由 Thomas Gleixner 提交于
Required for devicetree based io_apic configuration. Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
-
由 Thomas Gleixner 提交于
No users outside of io_apic.c. Mark bad_ioapic() __init while at it. Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
-
由 Thomas Gleixner 提交于
There are about four places in the ioapic code which do exactly the same setup sequence. Also the OF based ioapic setup needs that function to avoid putting the OF specific code into ioapic.c Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
-
- 23 2月, 2011 7 次提交
-
-
由 Thomas Gleixner 提交于
Stupid me missed the functions called from setup.c. Add the stubs back for OLPC=n Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
-
由 Henrik Kretzschmar 提交于
This patch adds IOAPIC dummy functions for compilation with local APIC, but without IOAPIC. The local variable ioapic_entries in enable_IR_x2apic() does not need initialization anymore, since the dummy returns NULL. Signed-off-by: NHenrik Kretzschmar <henne@nachtwindheim.de> LKML-Reference: <1298385487-4708-4-git-send-email-henne@nachtwindheim.de> Signed-off-by: NIngo Molnar <mingo@elte.hu>
-
由 Henrik Kretzschmar 提交于
Currently arch_disable_smp_support() on x86 disables only the support for the IOAPIC and is also compiled in if SMP-support is not. Therefore this function is renamed to disable_ioapic_support(), which meets its purpose and is only compiled in the kernel when IOAPIC support is also. A new arch_disable_smp_support() is created in smpboot.c, which calls disable_ioapic_support() and gets only compiled in the kernel when SMP support is also. Signed-off-by: NHenrik Kretzschmar <henne@nachtwindheim.de> LKML-Reference: <1298385487-4708-3-git-send-email-henne@nachtwindheim.de> Signed-off-by: NIngo Molnar <mingo@elte.hu>
-
由 Henrik Kretzschmar 提交于
This is a dummy function, used when no IOAPIC is compiled in. Signed-off-by: NHenrik Kretzschmar <henne@nachtwindheim.de> LKML-Reference: <1298385487-4708-2-git-send-email-henne@nachtwindheim.de> Signed-off-by: NIngo Molnar <mingo@elte.hu>
-
由 Henrik Kretzschmar 提交于
This enum is used by non IOAPIC code, so apicdef.h is the best place for it. Signed-off-by: NHenrik Kretzschmar <henne@nachtwindheim.de> LKML-Reference: <1298385487-4708-1-git-send-email-henne@nachtwindheim.de> Signed-off-by: NIngo Molnar <mingo@elte.hu>
-
由 Thomas Gleixner 提交于
OLPC_OPENFIRMWARE_DT is just there to be selected by OLPC and selects OF_PROMTREE. So let OLPC select OF_PROMTREE and remove that extra config indirection. Fixup code and Makefile and use CONFIG_OF_PROMTREE instead. Signed-off-by: NThomas Gleixner <tglx@linutronix.de> Cc: Andres Salomon <dilinger@queued.net>
-
由 Thomas Gleixner 提交于
Neither CONFIG_OLPC_OPENFIRMWARE nor CONFIG_OLPC_OPENFIRMWARE_DT are really necessary. OLPC selects OLPC_OPENFIRMWARE unconditionally, so move the "select OF" part under OLPC config option and fixup the dependencies in Makefiles and code. Signed-off-by: NThomas Gleixner <tglx@linutronix.de> Cc: Andres Salomon <dilinger@queued.net>
-
- 15 2月, 2011 1 次提交
-
-
由 Feng Tang 提交于
Some wall clock devices use MMIO based HW register, this new function will give them a chance to do some initialization work before their get/set_time service get called, which is usually in early kernel boot phase. Signed-off-by: NFeng Tang <feng.tang@intel.com> Signed-off-by: NJacob Pan <jacob.jun.pan@linux.intel.com> Signed-off-by: NAlan Cox <alan@linux.intel.com> Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
-
- 14 2月, 2011 1 次提交
-
-
由 Borislav Petkov 提交于
We use it in non __cpuinit code now too so drop marker. Signed-off-by: NBorislav Petkov <borislav.petkov@amd.com> LKML-Reference: <20110211171754.GA21047@aftab> Signed-off-by: NIngo Molnar <mingo@elte.hu>
-
- 10 2月, 2011 1 次提交
-
-
由 Jan Beulich 提交于
Additionally doing things conditionally upon smp_processor_id() being zero is generally a bad idea, as this means CPU 0 cannot be offlined and brought back online later again. While there may be other places where this is done, I think adding more of those should be avoided so that some day SMP can really become "symmetrical". Signed-off-by: NJan Beulich <jbeulich@novell.com> Cc: Cyrill Gorcunov <gorcunov@gmail.com> LKML-Reference: <4D525C7E0200007800030EE1@vpn.id2.novell.com> Signed-off-by: NIngo Molnar <mingo@elte.hu>
-
- 05 2月, 2011 1 次提交
-
-
由 H. Peter Anvin 提交于
Since checkin ebba638a we call verify_cpu even in 32-bit mode. Unfortunately, calling a function means using the stack, and the stack pointer was not initialized in the 32-bit setup code! This code initializes the stack pointer, and simplifies the interface slightly since it is easier to rely on just a pointer value rather than a descriptor; we need to have different values for the segment register anyway. This retains start_stack as a virtual address, even though a physical address would be more convenient for 32 bits; the 64-bit code wants the other way around... Reported-by: NMatthieu Castet <castet.matthieu@free.fr> LKML-Reference: <4D41E86D.8060205@free.fr> Tested-by: NKees Cook <kees.cook@canonical.com> Signed-off-by: NH. Peter Anvin <hpa@linux.intel.com>
-
- 04 2月, 2011 1 次提交
-
-
由 Suresh Siddha 提交于
Clearing the cpu in prev's mm_cpumask early will avoid the flush tlb IPI's while the cr3 is still pointing to the prev mm. And this window can lead to the possibility of bogus TLB fills resulting in strange failures. One such problematic scenario is mentioned below. T1. CPU-1 is context switching from mm1 to mm2 context and got a NMI etc between the point of clearing the cpu from the mm_cpumask(mm1) and before reloading the cr3 with the new mm2. T2. CPU-2 is tearing down a specific vma for mm1 and will proceed with flushing the TLB for mm1. It doesn't send the flush TLB to CPU-1 as it doesn't see that cpu listed in the mm_cpumask(mm1). T3. After the TLB flush is complete, CPU-2 goes ahead and frees the page-table pages associated with the removed vma mapping. T4. CPU-2 now allocates those freed page-table pages for something else. T5. As the CR3 and TLB caches for mm1 is still active on CPU-1, CPU-1 can potentially speculate and walk through the page-table caches and can insert new TLB entries. As the page-table pages are already freed and being used on CPU-2, this page walk can potentially insert a bogus global TLB entry depending on the (random) contents of the page that is being used on CPU-2. T6. This bogus TLB entry being global will be active across future CR3 changes and can result in weird memory corruption etc. To avoid this issue, for the prev mm that is handing over the cpu to another mm, clear the cpu from the mm_cpumask(prev) after the cr3 is changed. Marking it for -stable, though we haven't seen any reported failure that can be attributed to this. Signed-off-by: NSuresh Siddha <suresh.b.siddha@intel.com> Acked-by: NIngo Molnar <mingo@elte.hu> Cc: stable@kernel.org [v2.6.32+] Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 26 1月, 2011 3 次提交
-
-
由 Eric Dumazet 提交于
These recent percpu commits: 2485b646: x86,percpu: Move out of place 64 bit ops into X86_64 section 8270137a: cpuops: Use cmpxchg for xchg to avoid lock semantics Caused this 'perf top' crash: Kernel panic - not syncing: Fatal exception in interrupt Pid: 0, comm: swapper Tainted: G D 2.6.38-rc2-00181-gef71723 #413 Call Trace: <IRQ> [<ffffffff810465b5>] ? panic ? kmsg_dump ? kmsg_dump ? oops_end ? no_context ? __bad_area_nosemaphore ? perf_output_begin ? bad_area_nosemaphore ? do_page_fault ? __task_pid_nr_ns ? perf_event_tid ? __perf_event_header__init_id ? validate_chain ? perf_output_sample ? trace_hardirqs_off ? page_fault ? irq_work_run ? update_process_times ? tick_sched_timer ? tick_sched_timer ? __run_hrtimer ? hrtimer_interrupt ? account_system_vtime ? smp_apic_timer_interrupt ? apic_timer_interrupt ... Looking at assembly code, I found: list = this_cpu_xchg(irq_work_list, NULL); gives this wrong code : (gcc-4.1.2 cross compiler) ffffffff810bc45e: mov %gs:0xead0,%rax cmpxchg %rax,%gs:0xead0 jne ffffffff810bc45e <irq_work_run+0x3e> test %rax,%rax je ffffffff810bc4aa <irq_work_run+0x8a> Tell gcc we dirty eax/rax register in percpu_xchg_op() Compiler must use another register to store pxo_new__ We also dont need to reload percpu value after a jump, since a 'failed' cmpxchg already updated eax/rax Wrong generated code was : xor %rax,%rax /* load 0 into %rax */ 1: mov %gs:0xead0,%rax cmpxchg %rax,%gs:0xead0 jne 1b test %rax,%rax After patch : xor %rdx,%rdx /* load 0 into %rdx */ mov %gs:0xead0,%rax 1: cmpxchg %rdx,%gs:0xead0 jne 1b: test %rax,%rax Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Tejun Heo <tj@kernel.org> LKML-Reference: <1295973114.3588.312.camel@edumazet-laptop> Signed-off-by: NIngo Molnar <mingo@elte.hu>
-
由 Yinghai Lu 提交于
Left-over from the x86 merge ... Signed-off-by: NYinghai Lu <yinghai@kernel.org> LKML-Reference: <4D3E23D1.7010405@kernel.org> Signed-off-by: NIngo Molnar <mingo@elte.hu>
-
由 Andrea Arcangeli 提交于
This fixes TRANSPARENT_HUGEPAGE=y with PARAVIRT=y and HIGHMEM64=n. The #ifdef that this patch removes was erratically introduced to fix a build error for noPAE (where pmd.pmd doesn't exist). So then the kernel built but it failed at runtime because set_pmd_at was a noop. This will correct it by enabling set_pmd_at for noPAE mode too. Signed-off-by: NAndrea Arcangeli <aarcange@redhat.com> Reported-by: Nwerner <w.landgraf@ru.ru> Reported-by: NMinchan Kim <minchan.kim@gmail.com> Tested-by: NMinchan Kim <minchan.kim@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 23 1月, 2011 1 次提交
-
-
由 matthieu castet 提交于
If we use jump table in module init, there are marked as removed in __jump_table section after init is done. But we already applied ro permissions on the module, so we can't modify a read only section (crash in remove_jump_label_module_init). Make the __jump_table section rw. Signed-off-by: NMatthieu CASTET <castet.matthieu@free.fr> Cc: Xiaotian Feng <xtfeng@gmail.com> Cc: Jason Baron <jbaron@redhat.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Siarhei Liakh <sliakh.lkml@gmail.com> Cc: Xuxian Jiang <jiang@cs.ncsu.edu> Cc: James Morris <jmorris@namei.org> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Dave Jones <davej@redhat.com> Cc: Kees Cook <kees.cook@canonical.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> LKML-Reference: <4D3C3F20.7030203@free.fr> Signed-off-by: NIngo Molnar <mingo@elte.hu>
-
- 22 1月, 2011 1 次提交
-
-
由 Borislav Petkov 提交于
ea530692 made a CPU use monitor/mwait when offline. This is not the optimal choice for AMD wrt to powersavings and we'd prefer our cores to halt (i.e. enter C1) instead. For this, the same selection whether to use monitor/mwait has to be used as when we select the idle routine for the machine. With this patch, offlining cores 1-5 on a X6 machine allows core0 to boost again. [ hpa: putting this in urgent since it is a (power) regression fix ] Reported-by: NAndreas Herrmann <andreas.herrmann3@amd.com> Cc: stable@kernel.org # 37.x Cc: H. Peter Anvin <hpa@linux.intel.com> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Len Brown <lenb@kernel.org> Cc: Venkatesh Pallipadi <venki@google.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.hl> Signed-off-by: NBorislav Petkov <borislav.petkov@amd.com> LKML-Reference: <1295534572-10730-1-git-send-email-bp@amd64.org> Signed-off-by: NH. Peter Anvin <hpa@linux.intel.com>
-
- 21 1月, 2011 1 次提交
-
-
由 Akinobu Mita 提交于
The implementation of the cache flushing interfaces on the x86 is identical with the default implementation in asm-generic. Signed-off-by: NAkinobu Mita <akinobu.mita@gmail.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: arnd@arndb.de LKML-Reference: <1295523136-4277-2-git-send-email-akinobu.mita@gmail.com> Signed-off-by: NIngo Molnar <mingo@elte.hu>
-
- 19 1月, 2011 1 次提交
-
-
由 Jan Beulich 提交于
In order to be able to suppress the use of SRAT tables that 32-bit Linux can't deal with (in one case known to lead to a non-bootable system, unless disabling ACPI altogether), move the "numa=" option handling to common code. Signed-off-by: NJan Beulich <jbeulich@novell.com> Reviewed-by: NThomas Renninger <trenn@suse.de> Cc: Tejun Heo <tj@kernel.org> Cc: Thomas Renninger <trenn@suse.de> LKML-Reference: <4D36B581020000780002D0FF@vpn.id2.novell.com> Signed-off-by: NIngo Molnar <mingo@elte.hu>
-
- 14 1月, 2011 12 次提交
-
-
由 Andrea Arcangeli 提交于
For GRU and EPT, we need gup-fast to set referenced bit too (this is why it's correct to return 0 when shadow_access_mask is zero, it requires gup-fast to set the referenced bit). qemu-kvm access already sets the young bit in the pte if it isn't zero-copy, if it's zero copy or a shadow paging EPT minor fault we relay on gup-fast to signal the page is in use... We also need to check the young bits on the secondary pagetables for NPT and not nested shadow mmu as the data may never get accessed again by the primary pte. Without this closer accuracy, we'd have to remove the heuristic that avoids collapsing hugepages in hugepage virtual regions that have not even a single subpage in use. ->test_young is full backwards compatible with GRU and other usages that don't have young bits in pagetables set by the hardware and that should nuke the secondary mmu mappings when ->clear_flush_young runs just like EPT does. Removing the heuristic that checks the young bit in khugepaged/collapse_huge_page completely isn't so bad either probably but I thought it was worth it and this makes it reliable. Signed-off-by: NAndrea Arcangeli <aarcange@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Andrea Arcangeli 提交于
Archs implementing Transparent Hugepage Support must implement a function called has_transparent_hugepage to be sure the virtual or physical CPU supports Transparent Hugepages. Signed-off-by: NAndrea Arcangeli <aarcange@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Johannes Weiner 提交于
Add pmd_modify() for use with mprotect() on huge pmds. Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org> Signed-off-by: NAndrea Arcangeli <aarcange@redhat.com> Reviewed-by: NRik van Riel <riel@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Johannes Weiner 提交于
Add support for transparent hugepages to x86 32bit. Share the same VM_ bitflag for VM_MAPPED_COPY. mm/nommu.c will never support transparent hugepages. Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org> Signed-off-by: NAndrea Arcangeli <aarcange@redhat.com> Reviewed-by: NRik van Riel <riel@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Andrea Arcangeli 提交于
Lately I've been working to make KVM use hugepages transparently without the usual restrictions of hugetlbfs. Some of the restrictions I'd like to see removed: 1) hugepages have to be swappable or the guest physical memory remains locked in RAM and can't be paged out to swap 2) if a hugepage allocation fails, regular pages should be allocated instead and mixed in the same vma without any failure and without userland noticing 3) if some task quits and more hugepages become available in the buddy, guest physical memory backed by regular pages should be relocated on hugepages automatically in regions under madvise(MADV_HUGEPAGE) (ideally event driven by waking up the kernel deamon if the order=HPAGE_PMD_SHIFT-PAGE_SHIFT list becomes not null) 4) avoidance of reservation and maximization of use of hugepages whenever possible. Reservation (needed to avoid runtime fatal faliures) may be ok for 1 machine with 1 database with 1 database cache with 1 database cache size known at boot time. It's definitely not feasible with a virtualization hypervisor usage like RHEV-H that runs an unknown number of virtual machines with an unknown size of each virtual machine with an unknown amount of pagecache that could be potentially useful in the host for guest not using O_DIRECT (aka cache=off). hugepages in the virtualization hypervisor (and also in the guest!) are much more important than in a regular host not using virtualization, becasue with NPT/EPT they decrease the tlb-miss cacheline accesses from 24 to 19 in case only the hypervisor uses transparent hugepages, and they decrease the tlb-miss cacheline accesses from 19 to 15 in case both the linux hypervisor and the linux guest both uses this patch (though the guest will limit the addition speedup to anonymous regions only for now...). Even more important is that the tlb miss handler is much slower on a NPT/EPT guest than for a regular shadow paging or no-virtualization scenario. So maximizing the amount of virtual memory cached by the TLB pays off significantly more with NPT/EPT than without (even if there would be no significant speedup in the tlb-miss runtime). The first (and more tedious) part of this work requires allowing the VM to handle anonymous hugepages mixed with regular pages transparently on regular anonymous vmas. This is what this patch tries to achieve in the least intrusive possible way. We want hugepages and hugetlb to be used in a way so that all applications can benefit without changes (as usual we leverage the KVM virtualization design: by improving the Linux VM at large, KVM gets the performance boost too). The most important design choice is: always fallback to 4k allocation if the hugepage allocation fails! This is the _very_ opposite of some large pagecache patches that failed with -EIO back then if a 64k (or similar) allocation failed... Second important decision (to reduce the impact of the feature on the existing pagetable handling code) is that at any time we can split an hugepage into 512 regular pages and it has to be done with an operation that can't fail. This way the reliability of the swapping isn't decreased (no need to allocate memory when we are short on memory to swap) and it's trivial to plug a split_huge_page* one-liner where needed without polluting the VM. Over time we can teach mprotect, mremap and friends to handle pmd_trans_huge natively without calling split_huge_page*. The fact it can't fail isn't just for swap: if split_huge_page would return -ENOMEM (instead of the current void) we'd need to rollback the mprotect from the middle of it (ideally including undoing the split_vma) which would be a big change and in the very wrong direction (it'd likely be simpler not to call split_huge_page at all and to teach mprotect and friends to handle hugepages instead of rolling them back from the middle). In short the very value of split_huge_page is that it can't fail. The collapsing and madvise(MADV_HUGEPAGE) part will remain separated and incremental and it'll just be an "harmless" addition later if this initial part is agreed upon. It also should be noted that locking-wise replacing regular pages with hugepages is going to be very easy if compared to what I'm doing below in split_huge_page, as it will only happen when page_count(page) matches page_mapcount(page) if we can take the PG_lock and mmap_sem in write mode. collapse_huge_page will be a "best effort" that (unlike split_huge_page) can fail at the minimal sign of trouble and we can try again later. collapse_huge_page will be similar to how KSM works and the madvise(MADV_HUGEPAGE) will work similar to madvise(MADV_MERGEABLE). The default I like is that transparent hugepages are used at page fault time. This can be changed with /sys/kernel/mm/transparent_hugepage/enabled. The control knob can be set to three values "always", "madvise", "never" which mean respectively that hugepages are always used, or only inside madvise(MADV_HUGEPAGE) regions, or never used. /sys/kernel/mm/transparent_hugepage/defrag instead controls if the hugepage allocation should defrag memory aggressively "always", only inside "madvise" regions, or "never". The pmd_trans_splitting/pmd_trans_huge locking is very solid. The put_page (from get_user_page users that can't use mmu notifier like O_DIRECT) that runs against a __split_huge_page_refcount instead was a pain to serialize in a way that would result always in a coherent page count for both tail and head. I think my locking solution with a compound_lock taken only after the page_first is valid and is still a PageHead should be safe but it surely needs review from SMP race point of view. In short there is no current existing way to serialize the O_DIRECT final put_page against split_huge_page_refcount so I had to invent a new one (O_DIRECT loses knowledge on the mapping status by the time gup_fast returns so...). And I didn't want to impact all gup/gup_fast users for now, maybe if we change the gup interface substantially we can avoid this locking, I admit I didn't think too much about it because changing the gup unpinning interface would be invasive. If we ignored O_DIRECT we could stick to the existing compound refcounting code, by simply adding a get_user_pages_fast_flags(foll_flags) where KVM (and any other mmu notifier user) would call it without FOLL_GET (and if FOLL_GET isn't set we'd just BUG_ON if nobody registered itself in the current task mmu notifier list yet). But O_DIRECT is fundamental for decent performance of virtualized I/O on fast storage so we can't avoid it to solve the race of put_page against split_huge_page_refcount to achieve a complete hugepage feature for KVM. Swap and oom works fine (well just like with regular pages ;). MMU notifier is handled transparently too, with the exception of the young bit on the pmd, that didn't have a range check but I think KVM will be fine because the whole point of hugepages is that EPT/NPT will also use a huge pmd when they notice gup returns pages with PageCompound set, so they won't care of a range and there's just the pmd young bit to check in that case. NOTE: in some cases if the L2 cache is small, this may slowdown and waste memory during COWs because 4M of memory are accessed in a single fault instead of 8k (the payoff is that after COW the program can run faster). So we might want to switch the copy_huge_page (and clear_huge_page too) to not temporal stores. I also extensively researched ways to avoid this cache trashing with a full prefault logic that would cow in 8k/16k/32k/64k up to 1M (I can send those patches that fully implemented prefault) but I concluded they're not worth it and they add an huge additional complexity and they remove all tlb benefits until the full hugepage has been faulted in, to save a little bit of memory and some cache during app startup, but they still don't improve substantially the cache-trashing during startup if the prefault happens in >4k chunks. One reason is that those 4k pte entries copied are still mapped on a perfectly cache-colored hugepage, so the trashing is the worst one can generate in those copies (cow of 4k page copies aren't so well colored so they trashes less, but again this results in software running faster after the page fault). Those prefault patches allowed things like a pte where post-cow pages were local 4k regular anon pages and the not-yet-cowed pte entries were pointing in the middle of some hugepage mapped read-only. If it doesn't payoff substantially with todays hardware it will payoff even less in the future with larger l2 caches, and the prefault logic would blot the VM a lot. If one is emebdded transparent_hugepage can be disabled during boot with sysfs or with the boot commandline parameter transparent_hugepage=0 (or transparent_hugepage=2 to restrict hugepages inside madvise regions) that will ensure not a single hugepage is allocated at boot time. It is simple enough to just disable transparent hugepage globally and let transparent hugepages be allocated selectively by applications in the MADV_HUGEPAGE region (both at page fault time, and if enabled with the collapse_huge_page too through the kernel daemon). This patch supports only hugepages mapped in the pmd, archs that have smaller hugepages will not fit in this patch alone. Also some archs like power have certain tlb limits that prevents mixing different page size in the same regions so they will not fit in this framework that requires "graceful fallback" to basic PAGE_SIZE in case of physical memory fragmentation. hugetlbfs remains a perfect fit for those because its software limits happen to match the hardware limits. hugetlbfs also remains a perfect fit for hugepage sizes like 1GByte that cannot be hoped to be found not fragmented after a certain system uptime and that would be very expensive to defragment with relocation, so requiring reservation. hugetlbfs is the "reservation way", the point of transparent hugepages is not to have any reservation at all and maximizing the use of cache and hugepages at all times automatically. Some performance result: vmx andrea # LD_PRELOAD=/usr/lib64/libhugetlbfs.so HUGETLB_MORECORE=yes HUGETLB_PATH=/mnt/huge/ ./largep ages3 memset page fault 1566023 memset tlb miss 453854 memset second tlb miss 453321 random access tlb miss 41635 random access second tlb miss 41658 vmx andrea # LD_PRELOAD=/usr/lib64/libhugetlbfs.so HUGETLB_MORECORE=yes HUGETLB_PATH=/mnt/huge/ ./largepages3 memset page fault 1566471 memset tlb miss 453375 memset second tlb miss 453320 random access tlb miss 41636 random access second tlb miss 41637 vmx andrea # ./largepages3 memset page fault 1566642 memset tlb miss 453417 memset second tlb miss 453313 random access tlb miss 41630 random access second tlb miss 41647 vmx andrea # ./largepages3 memset page fault 1566872 memset tlb miss 453418 memset second tlb miss 453315 random access tlb miss 41618 random access second tlb miss 41659 vmx andrea # echo 0 > /proc/sys/vm/transparent_hugepage vmx andrea # ./largepages3 memset page fault 2182476 memset tlb miss 460305 memset second tlb miss 460179 random access tlb miss 44483 random access second tlb miss 44186 vmx andrea # ./largepages3 memset page fault 2182791 memset tlb miss 460742 memset second tlb miss 459962 random access tlb miss 43981 random access second tlb miss 43988 ============ #include <stdio.h> #include <stdlib.h> #include <string.h> #include <sys/time.h> #define SIZE (3UL*1024*1024*1024) int main() { char *p = malloc(SIZE), *p2; struct timeval before, after; gettimeofday(&before, NULL); memset(p, 0, SIZE); gettimeofday(&after, NULL); printf("memset page fault %Lu\n", (after.tv_sec-before.tv_sec)*1000000UL + after.tv_usec-before.tv_usec); gettimeofday(&before, NULL); memset(p, 0, SIZE); gettimeofday(&after, NULL); printf("memset tlb miss %Lu\n", (after.tv_sec-before.tv_sec)*1000000UL + after.tv_usec-before.tv_usec); gettimeofday(&before, NULL); memset(p, 0, SIZE); gettimeofday(&after, NULL); printf("memset second tlb miss %Lu\n", (after.tv_sec-before.tv_sec)*1000000UL + after.tv_usec-before.tv_usec); gettimeofday(&before, NULL); for (p2 = p; p2 < p+SIZE; p2 += 4096) *p2 = 0; gettimeofday(&after, NULL); printf("random access tlb miss %Lu\n", (after.tv_sec-before.tv_sec)*1000000UL + after.tv_usec-before.tv_usec); gettimeofday(&before, NULL); for (p2 = p; p2 < p+SIZE; p2 += 4096) *p2 = 0; gettimeofday(&after, NULL); printf("random access second tlb miss %Lu\n", (after.tv_sec-before.tv_sec)*1000000UL + after.tv_usec-before.tv_usec); return 0; } ============ Signed-off-by: NAndrea Arcangeli <aarcange@redhat.com> Acked-by: NRik van Riel <riel@redhat.com> Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Andrea Arcangeli 提交于
Add needed pmd mangling functions with symmetry with their pte counterparts. pmdp_splitting_flush() is the only new addition on the pmd_ methods and it's needed to serialize the VM against split_huge_page. It simply atomically sets the splitting bit in a similar way pmdp_clear_flush_young atomically clears the accessed bit. pmdp_splitting_flush() also has to flush the tlb to make it effective against gup_fast, but it wouldn't really require to flush the tlb too. Just the tlb flush is the simplest operation we can invoke to serialize pmdp_splitting_flush() against gup_fast. Signed-off-by: NAndrea Arcangeli <aarcange@redhat.com> Acked-by: NRik van Riel <riel@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Andrea Arcangeli 提交于
These returns 0 at compile time when the config option is disabled, to allow gcc to eliminate the transparent hugepage function calls at compile time without additional #ifdefs (only the export of those functions have to be visible to gcc but they won't be required at link time and huge_memory.o can be not built at all). _PAGE_BIT_UNUSED1 is never used for pmd, only on pte. Signed-off-by: NAndrea Arcangeli <aarcange@redhat.com> Acked-by: NRik van Riel <riel@redhat.com> Acked-by: NMel Gorman <mel@csn.ul.ie> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Andrea Arcangeli 提交于
No paravirt version of set_pmd_at/pmd_update/pmd_update_defer. Signed-off-by: NAndrea Arcangeli <aarcange@redhat.com> Acked-by: NRik van Riel <riel@redhat.com> Acked-by: NMel Gorman <mel@csn.ul.ie> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Andrea Arcangeli 提交于
Paravirt ops pmd_update/pmd_update_defer/pmd_set_at. Not all might be necessary (vmware needs pmd_update, Xen needs set_pmd_at, nobody needs pmd_update_defer), but this is to keep full simmetry with pte paravirt ops, which looks cleaner and simpler from a common code POV. Signed-off-by: NAndrea Arcangeli <aarcange@redhat.com> Acked-by: NRik van Riel <riel@redhat.com> Acked-by: NMel Gorman <mel@csn.ul.ie> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Andrea Arcangeli 提交于
Used by paravirt and not paravirt set_pmd_at. Signed-off-by: NAndrea Arcangeli <aarcange@redhat.com> Acked-by: NRik van Riel <riel@redhat.com> Acked-by: NMel Gorman <mel@csn.ul.ie> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Lasse Collin 提交于
This integrates the XZ decompression code to the x86 pre-boot code. mkpiggy.c is updated to reserve about 32 KiB more buffer safety margin for kernel decompression. It is done unconditionally for all decompressors to keep the code simpler. The XZ decompressor needs around 30 KiB of heap, so the heap size is increased to 32 KiB on both x86-32 and x86-64. Documentation/x86/boot.txt is updated to list the XZ magic number. With the x86 BCJ filter in XZ, XZ-compressed x86 kernel tends to be a few percent smaller than the equivalent LZMA-compressed kernel. Signed-off-by: NLasse Collin <lasse.collin@tukaani.org> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Alain Knaff <alain@knaff.lu> Cc: Albin Tonnerre <albin.tonnerre@free-electrons.com> Cc: Phillip Lougher <phillip@lougher.demon.co.uk> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Andres Salomon 提交于
Drop the old geode_gpio crud, as well as the raw outl() calls; instead, use the Linux GPIO API where possible, and the cs5535_gpio API in other places. Note that we don't actually clean up the driver properly yet (once loaded, it always remains loaded). That'll come later.. This patch is necessary for building the driver. Signed-off-by: NAndres Salomon <dilinger@queued.net> Cc: Greg KH <greg@kroah.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-