1. 27 October 2008, 2 commits
    • x86/uv: memory allocation at initialization · ef020ab0
      Authored by Cliff Wickman
      Impact: fix boot crash on SGI UV platforms
      
      UV initialization currently runs too late to use alloc_bootmem_pages().
      The current sequence is:
      
       start_kernel()
         mem_init()
           free_all_bootmem()           <--- discard of bootmem
         rest_init()
           kernel_init()
             smp_prepare_cpus()
             native_smp_prepare_cpus()
               uv_system_init()         <--- uses alloc_bootmem_pages()
      
      By that point the bootmem allocator has already been torn down, so it
      should be calling kmalloc() instead.
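
      For illustration, a hedged sketch of the substitution (the variable and
      size are stand-ins mirroring the commit description; the real change
      converts the allocations done in uv_system_init()):

         /* before: only valid while the bootmem allocator is still alive */
         uv_blade_info = alloc_bootmem_pages(bytes);

         /* after: safe this late in boot, once the slab allocator is up */
         uv_blade_info = kmalloc(bytes, GFP_KERNEL);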
      Signed-off-by: Cliff Wickman <cpw@sgi.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • xen: fix Xen domU boot with batched mprotect · 9f32d21c
      Authored by Chris Lalancette
      Impact: fix guest kernel boot crash on certain configs
      
      Recent i686 2.6.27 kernels with a certain amount of memory (between
      736 and 855MB) have a problem booting under a hypervisor that supports
      batched mprotect (this includes the RHEL-5 Xen hypervisor as well as
      any 3.3 or later Xen hypervisor).
      
      The problem ends up being that xen_ptep_modify_prot_commit() is using
      virt_to_machine to calculate which pfn to update.  However, this only
      works for pages that are in the p2m list, and the pages coming from
      change_pte_range() in mm/mprotect.c are kmap_atomic pages.  Because of
      this, we can run into the situation where the lookup in the p2m table
      returns an INVALID_MFN, which we then try to pass to the hypervisor,
      which then (correctly) denies the request to update a totally bogus pfn.
      
      The right thing to do is to use arbitrary_virt_to_machine, so that we
      can be sure we are modifying the right pfn.  This unfortunately
      introduces a performance penalty because of a full page-table-walk,
      but we can avoid that penalty for pages in the p2m list by checking if
      virt_addr_valid is true, and if so, just doing the lookup in the p2m
      table.
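
      A hedged sketch of that fallback (the helper name is hypothetical; the
      real change is inside xen_ptep_modify_prot_commit()):

         static xmaddr_t pte_machine_addr(pte_t *ptep)
         {
                 /* page is in the p2m list: a cheap p2m lookup suffices */
                 if (virt_addr_valid(ptep))
                         return virt_to_machine(ptep);

                 /* kmap_atomic page: pay for the full page-table walk */
                 return arbitrary_virt_to_machine(ptep);
         }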
      
      The attached patch implements this, and allows my 2.6.27 i686 based
      guest with 768MB of memory to boot on a RHEL-5 hypervisor again.
      Thanks to Jeremy for the suggestions about how to fix this particular
      issue.
      Signed-off-by: Chris Lalancette <clalance@redhat.com>
      Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
      Cc: Chris Lalancette <clalance@redhat.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  2. 23 October 2008, 5 commits
  3. 22 October 2008, 17 commits
  4. 21 October 2008, 5 commits
  5. 20 October 2008, 4 commits
    • rtc: use bcd2bin/bin2bcd · 357c6e63
      Authored by Adrian Bunk
      Change various rtc-related code to use the new bcd2bin/bin2bcd functions
      instead of the obsolete BCD_TO_BIN/BIN_TO_BCD/BCD2BIN/BIN2BCD macros.
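
      An illustrative before/after ('value' stands for a raw BCD byte read
      from the RTC; bcd2bin/bin2bcd come from <linux/bcd.h>):

         tm->tm_sec = BCD2BIN(value);   /* old, obsolete macro */
         tm->tm_sec = bcd2bin(value);   /* new helper function */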
      Signed-off-by: Adrian Bunk <bunk@kernel.org>
      Acked-by: Alessandro Zummo <a.zummo@towertech.it>
      Cc: David Brownell <david-b@pacbell.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • kdump: make elfcorehdr_addr independent of CONFIG_PROC_VMCORE · 57cac4d1
      Authored by Vivek Goyal
      o elfcorehdr_addr is used not only by the code under CONFIG_PROC_VMCORE
        but also by code outside it.  For example, is_kdump_kernel() is used by
        powerpc code to determine whether the kernel is booting after a panic
        and should therefore reuse the previous kernel's TCE table (see the
        sketch after this list).  So even if CONFIG_PROC_VMCORE is not set in
        the second kernel, one should be able to correctly determine that we
        are booting after a panic and set up the calgary iommu accordingly.
      
      o So remove the assumption that elfcorehdr_addr is under
        CONFIG_PROC_VMCORE.
      
      o Move definition of elfcorehdr_addr to arch dependent crash files.
        (Unfortunately crash dump does not have an arch independent file
        otherwise that would have been the best place).
      
      o kexec.c is not the right place, as one can have CRASH_DUMP enabled in
        the second kernel without KEXEC being enabled.
      
      o I don't see the sh setup code parsing the command line for
        elfcorehdr_addr, and I am wondering how the vmcore interface works on
        sh.  Anyway, I am at least defining elfcorehdr_addr so that compilation
        is not broken on sh.
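
      As a hedged sketch of what the move enables: with elfcorehdr_addr
      defined unconditionally, is_kdump_kernel() can simply test it from any
      header (the exact sentinel value here is illustrative):

         extern unsigned long long elfcorehdr_addr;

         /* true when the previous kernel passed an ELF core header address */
         static inline int is_kdump_kernel(void)
         {
                 return elfcorehdr_addr != 0;
         }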
      Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
      Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
      Acked-by: Simon Horman <horms@verge.net.au>
      Acked-by: Paul Mundt <lethal@linux-sh.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • container freezer: implement freezer cgroup subsystem · dc52ddc0
      Authored by Matt Helsley
      This patch implements a new freezer subsystem in the control groups
      framework.  It provides a way to stop and resume execution of all tasks in
      a cgroup by writing in the cgroup filesystem.
      
      The freezer subsystem in the container filesystem defines a file named
      freezer.state.  Writing "FROZEN" to the state file will freeze all tasks
      in the cgroup.  Subsequently writing "RUNNING" will unfreeze the tasks in
      the cgroup.  Reading will return the current state.
      
      * Examples of usage:
      
         # mkdir /containers/freezer
         # mount -t cgroup -ofreezer freezer  /containers
         # mkdir /containers/0
         # echo $some_pid > /containers/0/tasks
      
      to get the status of the freezer subsystem:
      
         # cat /containers/0/freezer.state
         RUNNING
      
      to freeze all tasks in the container:
      
         # echo FROZEN > /containers/0/freezer.state
         # cat /containers/0/freezer.state
         FREEZING
         # cat /containers/0/freezer.state
         FROZEN
      
      to unfreeze all tasks in the container:
      
         # echo RUNNING > /containers/0/freezer.state
         # cat /containers/0/freezer.state
         RUNNING
      
      This is the basic mechanism, which should do the right thing for
      user-space tasks in simple scenarios.
      
      It's important to note that freezing can be incomplete.  In that case we
      return EBUSY.  This means that some tasks in the cgroup are busy doing
      something that prevents us from completely freezing the cgroup at this
      time.  After EBUSY, the cgroup will remain partially frozen -- reflected
      by freezer.state reporting "FREEZING" when read.  The state will remain
      "FREEZING" until one of these things happens (case 2 is sketched in C
      after the list):
      
      	1) Userspace cancels the freezing operation by writing "RUNNING" to
      		the freezer.state file
      	2) Userspace retries the freezing operation by writing "FROZEN" to
      		the freezer.state file (writing "FREEZING" is not legal
      		and returns EIO)
      	3) The tasks that blocked the cgroup from entering the "FROZEN"
      		state disappear from the cgroup's set of tasks.
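
      A hedged user-space sketch of retry case (2) above (the path, buffer
      size, and retry bound are arbitrary choices for illustration):

         #include <stdio.h>
         #include <string.h>

         /* Keep writing "FROZEN" until the cgroup reports it is frozen. */
         static int freeze(const char *state_file)
         {
                 char state[16];
                 FILE *f;
                 int tries;

                 for (tries = 0; tries < 10; tries++) {
                         if (!(f = fopen(state_file, "w")))
                                 return -1;
                         fputs("FROZEN", f);
                         fclose(f);

                         if (!(f = fopen(state_file, "r")))
                                 return -1;
                         if (!fgets(state, sizeof(state), f))
                                 state[0] = '\0';
                         fclose(f);

                         if (!strncmp(state, "FROZEN", 6))
                                 return 0;       /* fully frozen */
                         /* "FREEZING": some tasks were busy, try again */
                 }
                 return -1;
         }

         /* usage: freeze("/containers/0/freezer.state"); */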
      
      [akpm@linux-foundation.org: coding-style fixes]
      [akpm@linux-foundation.org: export thaw_process]
      Signed-off-by: Cedric Le Goater <clg@fr.ibm.com>
      Signed-off-by: Matt Helsley <matthltc@us.ibm.com>
      Acked-by: Serge E. Hallyn <serue@us.ibm.com>
      Tested-by: Matt Helsley <matthltc@us.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: rewrite vmap layer · db64fe02
      Authored by Nick Piggin
      Rewrite the vmap allocator to use rbtrees and lazy tlb flushing, and
      provide a fast, scalable percpu frontend for small vmaps (requires a
      slightly different API, though).
      
      The biggest problem with vmap is actually vunmap.  Presently this requires
      a global kernel TLB flush, which on most architectures is a broadcast IPI
      to all CPUs to flush the cache.  This is all done under a global lock.  As
      the number of CPUs increases, so will the number of vunmaps a scaled
      workload will want to perform, and so will the cost of a global TLB flush.
      This gives terrible quadratic scalability characteristics.
      
      Another problem is that the entire vmap subsystem works under a single
      lock.  It is an rwlock, but it is actually taken for write in all the fast
      paths, and the read locking would likely never be run concurrently anyway,
      so it's just pointless.
      
      This is a rewrite of vmap subsystem to solve those problems.  The existing
      vmalloc API is implemented on top of the rewritten subsystem.
      
      The TLB flushing problem is solved by using lazy TLB unmapping.  vmap
      addresses do not have to be flushed immediately when they are vunmapped,
      because the kernel will not reuse them again (would be a use-after-free)
      until they are reallocated.  So the addresses aren't allocated again until
      a subsequent TLB flush.  A single TLB flush then can flush multiple
      vunmaps from each CPU.
      
      XEN and PAT and such do not like deferred TLB flushing because they can't
      always handle multiple aliasing virtual addresses to a physical address.
      They now call vm_unmap_aliases() in order to flush any deferred mappings.
      That call is very expensive (well, actually not a lot more expensive than
      a single vunmap under the old scheme), however it should be OK if not
      called too often.
      
      The virtual memory extent information is stored in an rbtree rather than a
      linked list to improve the algorithmic scalability.
      
      There is a per-CPU allocator for small vmaps, which amortizes or avoids
      global locking.
      
      To use the per-CPU interface, the vm_map_ram / vm_unmap_ram interfaces
      must be used in place of vmap and vunmap.  Vmalloc does not use these
      interfaces at the moment, so it will not be quite so scalable (although it
      will use lazy TLB flushing).
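
      A hedged usage sketch of the per-CPU interface (signatures as described
      here; page allocation and error handling elided):

         struct page *pages[4];
         void *addr;

         /* ... obtain four pages, e.g. via alloc_page() ... */
         addr = vm_map_ram(pages, 4, -1, PAGE_KERNEL);  /* -1: any node */
         if (addr) {
                 memset(addr, 0, 4 * PAGE_SIZE);        /* touch mapping */
                 vm_unmap_ram(addr, 4);
         }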
      
      As a quick test of performance, I ran a test that loops in the kernel,
      linearly mapping then touching then unmapping 4 pages.  Different numbers
      of tests were run in parallel on a 4-core, 2-socket Opteron.  Results are
      in nanoseconds per map+touch+unmap.
      
      threads           vanilla         vmap rewrite
      1                 14700           2900
      2                 33600           3000
      4                 49500           2800
      8                 70631           2900
      
      So with 8 cores, the rewritten version is already 25x faster.
      
      In a slightly more realistic test (although with an older and less
      scalable version of the patch), I ripped the not-very-good vunmap batching
      code out of XFS, and implemented the large buffer mapping with vm_map_ram
      and vm_unmap_ram...  along with a couple of other tricks, I was able to
      speed up a large directory workload by 20x on a 64 CPU system.  I believe
      vmap/vunmap is actually sped up a lot more than 20x on such a system, but
      I'm running into other locks now.  vmap is pretty well blown off the
      profiles.
      
      Before:
      1352059 total                                      0.1401
      798784 _write_lock                              8320.6667 <- vmlist_lock
      529313 default_idle                             1181.5022
       15242 smp_call_function                         15.8771  <- vmap tlb flushing
        2472 __get_vm_area_node                         1.9312  <- vmap
        1762 remove_vm_area                             4.5885  <- vunmap
         316 map_vm_area                                0.2297  <- vmap
         312 kfree                                      0.1950
         300 _spin_lock                                 3.1250
         252 sn_send_IPI_phys                           0.4375  <- tlb flushing
         238 vmap                                       0.8264  <- vmap
         216 find_lock_page                             0.5192
         196 find_next_bit                              0.3603
         136 sn2_send_IPI                               0.2024
         130 pio_phys_write_mmr                         2.0312
         118 unmap_kernel_range                         0.1229
      
      After:
       78406 total                                      0.0081
       40053 default_idle                              89.4040
       33576 ia64_spinlock_contention                 349.7500
        1650 _spin_lock                                17.1875
         319 __reg_op                                   0.5538
         281 _atomic_dec_and_lock                       1.0977
         153 mutex_unlock                               1.5938
         123 iget_locked                                0.1671
         117 xfs_dir_lookup                             0.1662
         117 dput                                       0.1406
         114 xfs_iget_core                              0.0268
          92 xfs_da_hashname                            0.1917
          75 d_alloc                                    0.0670
          68 vmap_page_range                            0.0462 <- vmap
          58 kmem_cache_alloc                           0.0604
          57 memset                                     0.0540
          52 rb_next                                    0.1625
          50 __copy_user                                0.0208
          49 bitmap_find_free_region                    0.2188 <- vmap
          46 ia64_sn_udelay                             0.1106
          45 find_inode_fast                            0.1406
          42 memcmp                                     0.2188
          42 finish_task_switch                         0.1094
          42 __d_lookup                                 0.0410
          40 radix_tree_lookup_slot                     0.1250
          37 _spin_unlock_irqrestore                    0.3854
          36 xfs_bmapi                                  0.0050
          36 kmem_cache_free                            0.0256
          35 xfs_vn_getattr                             0.0322
          34 radix_tree_lookup                          0.1062
          33 __link_path_walk                           0.0035
          31 xfs_da_do_buf                              0.0091
          30 _xfs_buf_find                              0.0204
          28 find_get_page                              0.0875
          27 xfs_iread                                  0.0241
          27 __strncpy_from_user                        0.2812
          26 _xfs_buf_initialize                        0.0406
          24 _xfs_buf_lookup_pages                      0.0179
          24 vunmap_page_range                          0.0250 <- vunmap
          23 find_lock_page                             0.0799
          22 vm_map_ram                                 0.0087 <- vmap
          20 kfree                                      0.0125
          19 put_page                                   0.0330
          18 __kmalloc                                  0.0176
          17 xfs_da_node_lookup_int                     0.0086
          17 _read_lock                                 0.0885
          17 page_waitqueue                             0.0664
      
      vmap has gone from being in the top 5 on the profiles and flushing the
      crap out of all TLBs, to using less than 1% of kernel time.
      
      [akpm@linux-foundation.org: cleanups, section fix]
      [akpm@linux-foundation.org: fix build on alpha]
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Cc: Jeremy Fitzhardinge <jeremy@goop.org>
      Cc: Krzysztof Helt <krzysztof.h1@poczta.fm>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  6. 18 October 2008, 3 commits
  7. 17 October 2008, 4 commits