- 19 7月, 2007 2 次提交
-
-
由 Peter Zijlstra 提交于
The TRACE_IRQS_ON function in iret_exc: calls a C function without ensuring that the segments are set properly. Move the trace function and the enabling of interrupt into the C stub. Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Roland McGrath 提交于
The code for LDT segment selectors was not robust in the face of a bogus selector set in %cs via ptrace before the single-step was done. Signed-off-by: NRoland McGrath <roland@redhat.com> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 18 7月, 2007 19 次提交
-
-
由 Jeremy Fitzhardinge 提交于
Most of the time we can simply use the iret instruction to exit the kernel, rather than having to use the iret hypercall - the only exception is if we're returning into vm86 mode, or from delivering an NMI (which we don't support yet). When running native, iret has the behaviour of testing for a pending interrupt atomically with re-enabling interrupts. Unfortunately there's no way to do this with Xen, so there's a window in which we could get a recursive exception after enabling events but before actually returning to userspace. This causes a problem: if the nested interrupt causes one of the task's TIF_WORK_MASK flags to be set, they will not be checked again before returning to userspace. This means that pending work may be left pending indefinitely, until the process enters and leaves the kernel again. The net effect is that a pending signal or reschedule event could be delayed for an unbounded amount of time. To deal with this, the xen event upcall handler checks to see if the EIP is within the critical section of the iret code, after events are (potentially) enabled up to the iret itself. If its within this range, it calls the iret critical section fixup, which adjusts the stack to deal with any unrestored registers, and then shifts the stack frame up to replace the previous invocation. Signed-off-by: NJeremy Fitzhardinge <jeremy@xensource.com>
-
由 Jeremy Fitzhardinge 提交于
This patchs adds the mechanism to allow us to patch inline versions of common operations. The implementations of the direct-access versions save_fl, restore_fl, irq_enable and irq_disable are now in assembler, and the same code is used for both out of line and inline uses. Signed-off-by: NJeremy Fitzhardinge <jeremy@xensource.com> Cc: Chris Wright <chrisw@sous-sol.org> Cc: Keir Fraser <keir@xensource.com>
-
由 Jeremy Fitzhardinge 提交于
This patch is a rollup of all the core pieces of the Xen implementation, including: - booting and setup - pagetable setup - privileged instructions - segmentation - interrupt flags - upcalls - multicall batching BOOTING AND SETUP The vmlinux image is decorated with ELF notes which tell the Xen domain builder what the kernel's requirements are; the domain builder then constructs the address space accordingly and starts the kernel. Xen has its own entrypoint for the kernel (contained in an ELF note). The ELF notes are set up by xen-head.S, which is included into head.S. In principle it could be linked separately, but it seems to provoke lots of binutils bugs. Because the domain builder starts the kernel in a fairly sane state (32-bit protected mode, paging enabled, flat segments set up), there's not a lot of setup needed before starting the kernel proper. The main steps are: 1. Install the Xen paravirt_ops, which is simply a matter of a structure assignment. 2. Set init_mm to use the Xen-supplied pagetables (analogous to the head.S generated pagetables in a native boot). 3. Reserve address space for Xen, since it takes a chunk at the top of the address space for its own use. 4. Call start_kernel() PAGETABLE SETUP Once we hit the main kernel boot sequence, it will end up calling back via paravirt_ops to set up various pieces of Xen specific state. One of the critical things which requires a bit of extra care is the construction of the initial init_mm pagetable. Because Xen places tight constraints on pagetables (an active pagetable must always be valid, and must always be mapped read-only to the guest domain), we need to be careful when constructing the new pagetable to keep these constraints in mind. It turns out that the easiest way to do this is use the initial Xen-provided pagetable as a template, and then just insert new mappings for memory where a mapping doesn't already exist. This means that during pagetable setup, it uses a special version of xen_set_pte which ignores any attempt to remap a read-only page as read-write (since Xen will map its own initial pagetable as RO), but lets other changes to the ptes happen, so that things like NX are set properly. PRIVILEGED INSTRUCTIONS AND SEGMENTATION When the kernel runs under Xen, it runs in ring 1 rather than ring 0. This means that it is more privileged than user-mode in ring 3, but it still can't run privileged instructions directly. Non-performance critical instructions are dealt with by taking a privilege exception and trapping into the hypervisor and emulating the instruction, but more performance-critical instructions have their own specific paravirt_ops. In many cases we can avoid having to do any hypercalls for these instructions, or the Xen implementation is quite different from the normal native version. The privileged instructions fall into the broad classes of: Segmentation: setting up the GDT and the GDT entries, LDT, TLS and so on. Xen doesn't allow the GDT to be directly modified; all GDT updates are done via hypercalls where the new entries can be validated. This is important because Xen uses segment limits to prevent the guest kernel from damaging the hypervisor itself. Traps and exceptions: Xen uses a special format for trap entrypoints, so when the kernel wants to set an IDT entry, it needs to be converted to the form Xen expects. Xen sets int 0x80 up specially so that the trap goes straight from userspace into the guest kernel without going via the hypervisor. sysenter isn't supported. Kernel stack: The esp0 entry is extracted from the tss and provided to Xen. TLB operations: the various TLB calls are mapped into corresponding Xen hypercalls. Control registers: all the control registers are privileged. The most important is cr3, which points to the base of the current pagetable, and we handle it specially. Another instruction we treat specially is CPUID, even though its not privileged. We want to control what CPU features are visible to the rest of the kernel, and so CPUID ends up going into a paravirt_op. Xen implements this mainly to disable the ACPI and APIC subsystems. INTERRUPT FLAGS Xen maintains its own separate flag for masking events, which is contained within the per-cpu vcpu_info structure. Because the guest kernel runs in ring 1 and not 0, the IF flag in EFLAGS is completely ignored (and must be, because even if a guest domain disables interrupts for itself, it can't disable them overall). (A note on terminology: "events" and interrupts are effectively synonymous. However, rather than using an "enable flag", Xen uses a "mask flag", which blocks event delivery when it is non-zero.) There are paravirt_ops for each of cli/sti/save_fl/restore_fl, which are implemented to manage the Xen event mask state. The only thing worth noting is that when events are unmasked, we need to explicitly see if there's a pending event and call into the hypervisor to make sure it gets delivered. UPCALLS Xen needs a couple of upcall (or callback) functions to be implemented by each guest. One is the event upcalls, which is how events (interrupts, effectively) are delivered to the guests. The other is the failsafe callback, which is used to report errors in either reloading a segment register, or caused by iret. These are implemented in i386/kernel/entry.S so they can jump into the normal iret_exc path when necessary. MULTICALL BATCHING Xen provides a multicall mechanism, which allows multiple hypercalls to be issued at once in order to mitigate the cost of trapping into the hypervisor. This is particularly useful for context switches, since the 4-5 hypercalls they would normally need (reload cr3, update TLS, maybe update LDT) can be reduced to one. This patch implements a generic batching mechanism for hypercalls, which gets used in many places in the Xen code. Signed-off-by: NJeremy Fitzhardinge <jeremy@xensource.com> Signed-off-by: NChris Wright <chrisw@sous-sol.org> Cc: Ian Pratt <ian.pratt@xensource.com> Cc: Christian Limpach <Christian.Limpach@cl.cam.ac.uk> Cc: Adrian Bunk <bunk@stusta.de>
-
由 Jeremy Fitzhardinge 提交于
Add the "nosegneg" fake capabilty to the vsyscall page notes. This is used by the runtime linker to select a glibc version which then disables negative-offset accesses to the thread-local segment via %gs. These accesses require emulation in Xen (because segments are truncated to protect the hypervisor address space) and avoiding them provides a measurable performance boost. Signed-off-by: NIan Pratt <ian.pratt@xensource.com> Signed-off-by: NChristian Limpach <Christian.Limpach@cl.cam.ac.uk> Signed-off-by: NChris Wright <chrisw@sous-sol.org> Signed-off-by: NJeremy Fitzhardinge <jeremy@xensource.com> Acked-by: NZachary Amsden <zach@vmware.com> Cc: Roland McGrath <roland@redhat.com> Cc: Ulrich Drepper <drepper@redhat.com>
-
由 Jeremy Fitzhardinge 提交于
The tsc-based get_scheduled_cycles interface is not a good match for Xen's runstate accounting, which reports everything in nanoseconds. This patch replaces this interface with a sched_clock interface, which matches both Xen and VMI's requirements. In order to do this, we: 1. replace get_scheduled_cycles with sched_clock 2. hoist cycles_2_ns into a common header 3. update vmi accordingly One thing to note: because sched_clock is implemented as a weak function in kernel/sched.c, we must define a real function in order to override this weak binding. This means the usual paravirt_ops technique of using an inline function won't work in this case. Signed-off-by: NJeremy Fitzhardinge <jeremy@xensource.com> Cc: Zachary Amsden <zach@vmware.com> Cc: Dan Hecht <dhecht@vmware.com> Cc: john stultz <johnstul@us.ibm.com>
-
由 Jeremy Fitzhardinge 提交于
In a virtual environment, device drivers such as legacy IDE will waste quite a lot of time probing for their devices which will never appear. This helper function allows a paravirt implementation to lay claim to the whole iomem and ioport space, thereby disabling all device drivers trying to claim IO resources. Signed-off-by: NJeremy Fitzhardinge <jeremy@xensource.com> Signed-off-by: NChris Wright <chrisw@sous-sol.org> Cc: Rusty Russell <rusty@rustcorp.com.au>
-
由 Jeremy Fitzhardinge 提交于
Paravirt implementations need to set the sibling map on new cpus. Signed-off-by: NJeremy Fitzhardinge <jeremy@xensource.com>
-
由 Jeremy Fitzhardinge 提交于
Paravirt implementations need to store cpu info when bringing up cpus. Signed-off-by: NJeremy Fitzhardinge <jeremy@xensource.com>
-
由 Jeremy Fitzhardinge 提交于
Make globally leave_mm visible, specifically so that Xen can use it to shoot-down lazy uses of cr3. Signed-off-by: NJeremy Fitzhardinge <jeremy@xensource.com> Signed-off-by: NChris Wright <chrisw@sous-sol.org>
-
由 Jeremy Fitzhardinge 提交于
Add a hook so that the paravirt backend knows when the allocator is ready. This is useful for the obvious reason that the allocator is available, but the other side-effect of having the bootmem allocator available is that each page now has an associated "struct page". Signed-off-by: NJeremy Fitzhardinge <jeremy@xensource.com>
-
由 Jeremy Fitzhardinge 提交于
It's useful to know which mm is allocating a pagetable. Xen uses this to determine whether the pagetable being added to is pinned or not. Signed-off-by: NJeremy Fitzhardinge <jeremy@xensource.com>
-
由 Jeremy Fitzhardinge 提交于
Use existing elfnote.h to generate vsyscall notes, rather than doing it locally. Changes elfnote.h a bit to suit, since this is the first asm user, and it wasn't quite right. Signed-off-by: NJeremy Fitzhardinge <jeremy@xensource.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Roland McGrath <roland@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.com>
-
由 Amit Arora 提交于
fallocate() is a new system call being proposed here which will allow applications to preallocate space to any file(s) in a file system. Each file system implementation that wants to use this feature will need to support an inode operation called ->fallocate(). Applications can use this feature to avoid fragmentation to certain level and thus get faster access speed. With preallocation, applications also get a guarantee of space for particular file(s) - even if later the the system becomes full. Currently, glibc provides an interface called posix_fallocate() which can be used for similar cause. Though this has the advantage of working on all file systems, but it is quite slow (since it writes zeroes to each block that has to be preallocated). Without a doubt, file systems can do this more efficiently within the kernel, by implementing the proposed fallocate() system call. It is expected that posix_fallocate() will be modified to call this new system call first and incase the kernel/filesystem does not implement it, it should fall back to the current implementation of writing zeroes to the new blocks. ToDos: 1. Implementation on other architectures (other than i386, x86_64, and ppc). Patches for s390(x) and ia64 are already available from previous posts, but it was decided that they should be added later once fallocate is in the mainline. Hence not including those patches in this take. 2. Changes to glibc, a) to support fallocate() system call b) to make posix_fallocate() and posix_fallocate64() call fallocate() Signed-off-by: NAmit Arora <aarora@in.ibm.com>
-
由 Jeff Garzik 提交于
Mark variables with uninitialized_var() if such a warning appears, and analysis proves that the var is initialized properly on all paths it is used. Signed-off-by: NJeff Garzik <jeff@garzik.org>
-
由 Andrew Morton 提交于
Avoid dirtying remote cpu's memory if it already has the correct value. Cc: Andi Kleen <ak@suse.de> Cc: Konrad Rzeszutek <konrad@darnok.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Alexey Dobriyan 提交于
Identical implementations of PTRACE_POKEDATA go into generic_ptrace_pokedata() function. AFAICS, fix bug on xtensa where successful PTRACE_POKEDATA will nevertheless return EPERM. Signed-off-by: NAlexey Dobriyan <adobriyan@gmail.com> Cc: Christoph Hellwig <hch@lst.de> Cc: <linux-arch@vger.kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Alexey Dobriyan 提交于
Identical implementations of PTRACE_PEEKDATA go into generic_ptrace_peekdata() function. Signed-off-by: NAlexey Dobriyan <adobriyan@gmail.com> Cc: Christoph Hellwig <hch@lst.de> Cc: <linux-arch@vger.kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Pavel Emelianov 提交于
If the kernel OOPSed or BUGed then it probably should be considered as tainted. Thus, all subsequent OOPSes and SysRq dumps will report the tainted kernel. This saves a lot of time explaining oddities in the calltraces. Signed-off-by: NPavel Emelianov <xemul@openvz.org> Acked-by: NRandy Dunlap <randy.dunlap@oracle.com> Cc: <linux-arch@vger.kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> [ Added parisc patch from Matthew Wilson -Linus ] Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Rafael J. Wysocki 提交于
Currently, the freezer treats all tasks as freezable, except for the kernel threads that explicitly set the PF_NOFREEZE flag for themselves. This approach is problematic, since it requires every kernel thread to either set PF_NOFREEZE explicitly, or call try_to_freeze(), even if it doesn't care for the freezing of tasks at all. It seems better to only require the kernel threads that want to or need to be frozen to use some freezer-related code and to remove any freezer-related code from the other (nonfreezable) kernel threads, which is done in this patch. The patch causes all kernel threads to be nonfreezable by default (ie. to have PF_NOFREEZE set by default) and introduces the set_freezable() function that should be called by the freezable kernel threads in order to unset PF_NOFREEZE. It also makes all of the currently freezable kernel threads call set_freezable(), so it shouldn't cause any (intentional) change of behaviour to appear. Additionally, it updates documentation to describe the freezing of tasks more accurately. [akpm@linux-foundation.org: build fixes] Signed-off-by: NRafael J. Wysocki <rjw@sisk.pl> Acked-by: NNigel Cunningham <nigel@nigel.suspend2.net> Cc: Pavel Machek <pavel@ucw.cz> Cc: Oleg Nesterov <oleg@tv-sign.ru> Cc: Gautham R Shenoy <ego@in.ibm.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 17 7月, 2007 3 次提交
-
-
由 Heiko Carstens 提交于
The current generic bug implementation has a call to dump_stack() in case a WARN_ON(whatever) gets hit. Since report_bug(), which calls dump_stack(), gets called from an exception handler we can do better: just pass the pt_regs structure to report_bug() and pass it to show_regs() in case of a warning. This will give more debug informations like register contents, etc... In addition this avoids some pointless lines that dump_stack() emits, since it includes a stack backtrace of the exception handler which is of no interest in case of a warning. E.g. on s390 the following lines are currently always present in a stack backtrace if dump_stack() gets called from report_bug(): [<000000000001517a>] show_trace+0x92/0xe8) [<0000000000015270>] show_stack+0xa0/0xd0 [<00000000000152ce>] dump_stack+0x2e/0x3c [<0000000000195450>] report_bug+0x98/0xf8 [<0000000000016cc8>] illegal_op+0x1fc/0x21c [<00000000000227d6>] sysc_return+0x0/0x10 Acked-by: NJeremy Fitzhardinge <jeremy@goop.org> Acked-by: NHaavard Skinnemoen <hskinnemoen@atmel.com> Cc: Andi Kleen <ak@suse.de> Cc: Kyle McMartin <kyle@parisc-linux.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Paul Mundt <lethal@linux-sh.org> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: NHeiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Andrea Arcangeli 提交于
This follows a suggestion from Chuck Ebbert on how to make seccomp absolutely zerocost in schedule too. The only remaining footprint of seccomp is in terms of the bzImage size that becomes a few bytes (perhaps even a few kbytes) larger, measure it if you care in the embedded. Signed-off-by: NAndrea Arcangeli <andrea@cpushare.com> Cc: Andi Kleen <ak@suse.de> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Eric W. Biderman 提交于
Needed to get fixed virtual address for USB debug and earlycon with mmio. Signed-off-by: NEric W. Biderman <ebiderman@xmisson.com> Signed-off-by: NYinghai Lu <yinghai.lu@sun.com> Cc: Andi Kleen <ak@suse.de> Cc: Bjorn Helgaas <bjorn.helgaas@hp.com> Cc: Russell King <rmk@arm.linux.org.uk> Cc: Gerd Hoffmann <kraxel@suse.de> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 16 7月, 2007 2 次提交
-
-
由 Avi Kivity 提交于
This removes the requirement for callers to get_cpu() to check in simple cases. Cc: Andi Kleen <ak@suse.de> Signed-off-by: NAvi Kivity <avi@qumranet.com>
-
由 Avi Kivity 提交于
CPU_DYING is notified in atomic context, so no taking mutexes here. Signed-off-by: NAvi Kivity <avi@qumranet.com>
-
- 14 7月, 2007 1 次提交
-
-
由 Linus Torvalds 提交于
This reverts commit 904f7a3f. As noted by Peter Anvin: "It causes build failures on i386. Yet another case of unnecessary divergence between i386 and x86-64 I'm afraid..." Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 13 7月, 2007 10 次提交
-
-
由 Dave Jones 提交于
Based on a patch from Joachim which didn't apply, so I fixed it up by hand, and also corrected the surrounding indentation a little. Signed-off-by: NJoachim.Deguara <joachim.deguara@amd.com> Acked-by: NMark Langsdorf <mark.langsdorf@amd.com> Signed-off-by: NDave Jones <davej@redhat.com>
-
由 Andrew Morton 提交于
Make it compile on UP. Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NDave Jones <davej@redhat.com>
-
由 Adrian Bunk 提交于
This patch contains the overdue removal of X86_SPEEDSTEP_CENTRINO_ACPI. Signed-off-by: NAdrian Bunk <bunk@stusta.de> Acked-by: NVenkatesh Pallipadi <venkatesh.pallipadi@intel.com> Signed-off-by: NDave Jones <davej@redhat.com>
-
由 Rafał Bilski 提交于
On some motherboards ACPI C3 is available, but it isn't causing frequency transition on VIA Nehemiah. Longhaul wasn't working at all earlier, but due to scaling_cur_speed returning true CPU frequency now, it looks like CPU is getting stuck at highest frequency since 2.6.21. I didn't find a reason. Halt is causing frequency transition. Signed-off-by: NRafal Bilski <rafalbilski@interia.pl> Signed-off-by: NDave Jones <davej@redhat.com>
-
由 H. Peter Anvin 提交于
This removes the old i386 setup code. This is done as a separate patch to avoid breaking git bisect as some of the i386 code was also used by the old x86-64 code. Signed-off-by: NH. Peter Anvin <hpa@zytor.com> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 H. Peter Anvin 提交于
Make struct boot_params a real structure, and remove the handling of some obsolete fields, in particular hd*_info, which was only used by the ST-506 driver, and likely to be wrong for that driver on any modern BIOS. Signed-off-by: NH. Peter Anvin <hpa@zytor.com> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 H. Peter Anvin 提交于
Make definitions for struct e820entry and struct e820map consistent between i386 and x86-64. Signed-off-by: NH. Peter Anvin <hpa@zytor.com> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Venki Pallipadi 提交于
Some Intel features are spread around in different CPUID leafs like 0x5, 0x6 and 0xA. Make this feature detection code common across i386 and x86_64. Display Intel Dynamic Acceleration feature in /proc/cpuinfo. This feature will be enabled automatically by current acpi-cpufreq driver. Refer to Intel Software Developer's Manual for more details about the feature. Thanks to hpa (H Peter Anvin) for the making the actual code detecting the scattered features data-driven. Signed-off-by: NVenkatesh Pallipadi <venkatesh.pallipadi@intel.com> Signed-off-by: NH. Peter Anvin <hpa@zytor.com> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 H. Peter Anvin 提交于
The X86_MINIMUM_CPU_MODEL name isn't really right, so change it to X86_MINIMUM_CPU_FAMILY. Also, the default minimum should be 3, not 0. Signed-off-by: NH. Peter Anvin <hpa@zytor.com> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 H. Peter Anvin 提交于
Unify the handling of the CPU features vectors between i386 and x86-64. This also adopts the collapsing of features which are required at compile-time into constant tests from x86-64 to i386. Signed-off-by: NH. Peter Anvin <hpa@zytor.com> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 12 7月, 2007 1 次提交
-
-
由 Auke Kok 提交于
Instead of all drivers reading pci config space to get the revision ID, they can now use the pci_device->revision member. This exposes some issues where drivers where reading a word or a dword for the revision number, and adding useless error-handling around the read. Some drivers even just read it for no purpose of all. In devices where the revision ID is being copied over and used in what appears to be the equivalent of hotpath, I have left the copy code and the cached copy as not to influence the driver's performance. Compile tested with make all{yes,mod}config on x86_64 and i386. Signed-off-by: NAuke Kok <auke-jan.h.kok@intel.com> Acked-by: NDave Jones <davej@redhat.com> Signed-off-by: NGreg Kroah-Hartman <gregkh@suse.de>
-
- 10 7月, 2007 2 次提交
-
-
由 Ingo Molnar 提交于
track TSC-unstable events and propagate it to the scheduler code. Also allow sched_clock() to be used when the TSC is unstable, the rq_clock() wrapper creates a reliable clock out of it. Signed-off-by: NIngo Molnar <mingo@elte.hu>
-
由 Ingo Molnar 提交于
the SMP load-balancer uses the boot-time migration-cost estimation code to attempt to improve the quality of balancing. The reason for this code is that the discrete priority queues do not preserve the order of scheduling accurately, so the load-balancer skips tasks that were running on a CPU 'recently'. this code is fundamental fragile: the boot-time migration cost detector doesnt really work on systems that had large L3 caches, it caused boot delays on large systems and the whole cache-hot concept made the balancing code pretty undeterministic as well. (and hey, i wrote most of it, so i can say it out loud that it sucks ;-) under CFS the same purpose of cache affinity can be achieved without any special cache-hot special-case: tasks are sorted in the 'timeline' tree and the SMP balancer picks tasks from the left side of the tree, thus the most cache-cold task is balanced automatically. Signed-off-by: NIngo Molnar <mingo@elte.hu>
-