提交 · c475a8ab625d567eacf5e30ec35d6d8704558062 · openeuler / raspberrypi-kernel

22 6月, 2005 40 次提交

[PATCH] can_share_swap_page: use page_mapcount · c475a8ab

由 Hugh Dickins 提交于 6月 21, 2005

Remember that ironic get_user_pages race? when the raised page_count on a
page swapped out led do_wp_page to decide that it had to copy on write, so
substituted a different page into userspace. 2.6.7 onwards have Andrea's
solution, where try_to_unmap_one backs out if it finds page_count raised.

Which works, but is unsatisfying (rmap.c has no other page_count heuristics),
and was found a few months ago to hang an intensive page migration test. A
year ago I was hesitant to engage page_mapcount, now it seems the right fix.

So remove the page_count hack from try_to_unmap_one; and use activate_page in
unuse_mm when dropping lock, to replace its secondary effect of helping
swapoff to make progress in that case.

Simplify can_share_swap_page (now called only on anonymous pages) to check
page_mapcount + page_swapcount == 1: still needs the page lock to stabilize
their (pessimistic) sum, but does not need swapper_space.tree_lock for that.

In do_swap_page, move swap_free and unlock_page below page_add_anon_rmap, to
keep sum on the high side, and correct when can_share_swap_page called.
Signed-off-by: NHugh Dickins <hugh@veritas.com>
Signed-off-by: NAndrew Morton <akpm@osdl.org>
Signed-off-by: NLinus Torvalds <torvalds@osdl.org>

c475a8ab

[PATCH] do_wp_page: cannot share file page · d296e9cd

由 Hugh Dickins 提交于 6月 21, 2005

A small optimization to do_wp_page's check for whether to avoid copy by
reusing the page already mapped. It can never share a cached file page,
nor can it share a reserved page (often the empty zero page), so it's a
waste of time to lock and unlock in those cases. Which nowadays can both
be neatly excluded by a preliminary PageAnon test.

Christoph has reported that a preliminary page_count test proved valuable
for scalability here, but PageAnon covers more common cases all at once.
Signed-off-by: NHugh Dickins <hugh@veritas.com>
Signed-off-by: NAndrew Morton <akpm@osdl.org>
Signed-off-by: NLinus Torvalds <torvalds@osdl.org>

d296e9cd

[PATCH] get_user_pages: kill get_page_map · 08ef4729

由 Hugh Dickins 提交于 6月 21, 2005

Since its birth, get_user_pages has been calling a misguided get_page_map
function.  follow_page has already returned NULL if the pfn is invalid, we
cannot reach an invalid pfn from a validated struct page.

Remove get_page_map, and the messy rewind in get_user_pages to cope with
its failure.  Oh, and could we please call that "struct page *page" like
everywhere else, instead of "struct page *map"?
Signed-off-by: NHugh Dickins <hugh@veritas.com>
Signed-off-by: NAndrew Morton <akpm@osdl.org>
Signed-off-by: NLinus Torvalds <torvalds@osdl.org>

08ef4729

[PATCH] rme96xx: fix PageReserved range · 7c2f3fda

由 Hugh Dickins 提交于 6月 21, 2005

rme96xx busmaster_malloc miscalculates and fails to set PageReserved on any
page of char *buf; but busmaster_free does it right, so do the same (I
don't have the card, just noticed this while sifting for rmap BUGs).
Signed-off-by: NHugh Dickins <hugh@veritas.com>
Signed-off-by: NAndrew Morton <akpm@osdl.org>
Signed-off-by: NLinus Torvalds <torvalds@osdl.org>

7c2f3fda

[PATCH] bad_page: clear reclaim and slab · 334795ec

由 Hugh Dickins 提交于 6月 21, 2005

Since free_pages_check complains if PG_reclaim or PG_slab is set, bad_page
ought to clear them to avoid repetitive reports (Nikita noticed this too).
Let prep_new_page check page_count and PG_slab as free_pages_check does.
Signed-off-by: NHugh Dickins <hugh@veritas.com>
Signed-off-by: NAndrew Morton <akpm@osdl.org>
Signed-off-by: NLinus Torvalds <torvalds@osdl.org>

334795ec

[PATCH] dup_mmap: update comment on new vma · 45918e1a

由 Hugh Dickins 提交于 6月 21, 2005

Remove part of comment on linking new vma in dup_mmap: since anon_vma rmap
came in, try_to_unmap_one knows the vma without needing find_vma.  But add
a comment to note that here vma is inserted without mmap_sem.
Signed-off-by: NHugh Dickins <hugh@veritas.com>
Signed-off-by: NAndrew Morton <akpm@osdl.org>
Signed-off-by: NLinus Torvalds <torvalds@osdl.org>

45918e1a

[PATCH] mbind: check_range use standard ptwalk · 91612e0d

由 Hugh Dickins 提交于 6月 21, 2005

Strict mbind's check for currently mapped pages being on node has been
using a slow loop which re-evaluates pgd, pud, pmd, pte for each entry:
replace that by a standard four-level page table walk like others in mm.
Since mmap_sem is held for writing, page_table_lock can be taken at the
inner level to limit latency.
Signed-off-by: NHugh Dickins <hugh@veritas.com>
Cc: Andi Kleen <ak@muc.de>
Signed-off-by: NAndrew Morton <akpm@osdl.org>
Signed-off-by: NLinus Torvalds <torvalds@osdl.org>

91612e0d

[PATCH] mbind: fix verify_pages pte_page · 941150a3

由 Hugh Dickins 提交于 6月 21, 2005

Strict mbind's check that pages already mapped are on right node has been
using pte_page without checking if pfn_valid, and without page_table_lock
to prevent spurious failures when try_to_unmap_one intervenes between the
pte_present and the pte_page.
Signed-off-by: NHugh Dickins <hugh@veritas.com>
Cc: Andi Kleen <ak@muc.de>
Signed-off-by: NAndrew Morton <akpm@osdl.org>
Signed-off-by: NLinus Torvalds <torvalds@osdl.org>

941150a3

[PATCH] ia64: pfn_to_nid() implementation · 400e6514

由 Bob Picco 提交于 6月 21, 2005

pfn_to_nid is undefined.  We haven't had this interface on ia64.  The
sys_mbind patches need it.

Oh, the paddr_to_nid call could fail when DISCONTIG+NUMA is configured
because there isn't any ACPI SRAT NUMA information.
Signed-off-by: NBob Picco <bob.picco@hp.com>
Acked-by: NTony Luck <tony.luck@intel.com>
Signed-off-by: NAndrew Morton <akpm@osdl.org>
Signed-off-by: NLinus Torvalds <torvalds@osdl.org>

400e6514

[PATCH] shmem: restore superblock info · 0edd73b3

由 Hugh Dickins 提交于 6月 21, 2005

To improve shmem scalability, we allowed tmpfs instances which don't need
their blocks or inodes limited not to count them, and not to allocate any
sbinfo. Which was okay when the only use for the sbinfo was accounting
blocks and inodes; but since then a couple of unrelated projects extending
tmpfs want to store other data in the sbinfo. Whether either extension
reaches mainline is beside the point: I'm guilty of a bad design decision,
and should restore sbinfo to make any such future extensions easier.

So, once again allocate a shmem_sb_info for every shmem/tmpfs instance, and
now let max_blocks 0 indicate unlimited blocks, and max_inodes 0 unlimited
inodes. Brent Casavant verified (many months ago) that this does not
perceptibly impact the scalability (since the unlimited sbinfo cacheline is
repeatedly accessed but only once dirtied).

And merge shmem_set_size into its sole caller shmem_remount_fs.
Signed-off-by: NHugh Dickins <hugh@veritas.com>
Signed-off-by: NAndrew Morton <akpm@osdl.org>
Signed-off-by: NLinus Torvalds <torvalds@osdl.org>

0edd73b3

[PATCH] SN2 XPC build patches · 65ed0b33

由 Jes Sorensen 提交于 6月 21, 2005

This patch contains the bits to make the XPC code use the uncached
allocator rather than calling into the mspec driver.  It also includes the
mspec.h header which is required to build the XPC modules.
Signed-off-by: NJes Sorensen <jes@wildopensource.com>
Signed-off-by: NAndrew Morton <akpm@osdl.org>
Signed-off-by: NLinus Torvalds <torvalds@osdl.org>

65ed0b33

[PATCH] ia64 uncached alloc · f14f75b8

由 Jes Sorensen 提交于 6月 21, 2005

This patch contains the ia64 uncached page allocator and the generic
allocator (genalloc). The uncached allocator was formerly part of the SN2
mspec driver but there are several other users of it so it has been split
off from the driver.

The generic allocator can be used by device driver to manage special memory
etc. The generic allocator is based on the allocator from the sym53c8xx_2
driver.

Various users on ia64 needs uncached memory. The SGI SN architecture requires
it for inter-partition communication between partitions within a large NUMA
cluster. The specific user for this is the XPC code. Another application is
large MPI style applications which use it for synchronization, on SN this can
be done using special 'fetchop' operations but it also benefits non SN
hardware which may use regular uncached memory for this purpose. Performance
of doing this through uncached vs cached memory is pretty substantial. This
is handled by the mspec driver which I will push out in a seperate patch.

Rather than creating a specific allocator for just uncached memory I came up
with genalloc which is a generic purpose allocator that can be used by device
drivers and other subsystems as they please. For instance to handle onboard
device memory. It was derived from the sym53c7xx_2 driver's allocator which
is also an example of a potential user (I am refraining from modifying sym2
right now as it seems to have been under fairly heavy development recently).

On ia64 memory has various properties within a granule, ie. it isn't safe to
access memory as uncached within the same granule as currently has memory
accessed in cached mode. The regular system therefore doesn't utilize memory
in the lower granules which is mixed in with device PAL code etc. The
uncached driver walks the EFI memmap and pulls out the spill uncached pages
and sticks them into the uncached pool. Only after these chunks have been
utilized, will it start converting regular cached memory into uncached memory.
Hence the reason for the EFI related code additions.
Signed-off-by: NJes Sorensen <jes@wildopensource.com>
Signed-off-by: NAndrew Morton <akpm@osdl.org>
Signed-off-by: NLinus Torvalds <torvalds@osdl.org>

f14f75b8

[PATCH] Reduce size of huge boot per_cpu_pageset · 2caaad41

由 Christoph Lameter 提交于 6月 21, 2005

Reduce size of the huge per_cpu_pageset structure in __initdata introduced
into mm1 with the pageset localization patchset.  Use one specially
configured pageset per cpu for all zones and nodes during bootup.

- Avoid duplication of pageset initialization code.
- do the adding to the pageset list before potential free_pages_bulk
  in free_hot_cold_page (otherwise we would have to hold a page
  in a pageset during the period that the boot pagesets are in use).
- remove mistaken __cpuinitdata attribute and revert back to __initdata
  for the boot pageset. A boot pageset is not necessary for cpu hotplug.

Tested for UP SMP NUMA on x86_64 (2.6.12-rc6-mm1): UP SMP NUMA Tested on
IA64 (2.6.12-rc5-mm2): NUMA (2.6.12-rc6-mm1 broken for IA64 because of
sparsemem patches)
Signed-off-by: NChristoph Lameter <clameter@sgi.com>
Signed-off-by: NAndrew Morton <akpm@osdl.org>
Signed-off-by: NLinus Torvalds <torvalds@osdl.org>

2caaad41

[PATCH] Periodically drain non local pagesets · 4ae7c039

由 Christoph Lameter 提交于 6月 21, 2005

The pageset array can potentially acquire a huge amount of memory on large
NUMA systems.  F.e.  on a system with 512 processors and 256 nodes there
will be 256*512 pagesets.  If each pageset only holds 5 pages then we are
talking about 655360 pages.With a 16K page size on IA64 this results in
potentially 10 Gigabytes of memory being trapped in pagesets.  The typical
cases are much less for smaller systems but there is still the potential of
memory being trapped in off node pagesets.  Off node memory may be rarely
used if local memory is available and so we may potentially have memory in
seldom used pagesets without this patch.

The slab allocator flushes its per cpu caches every 2 seconds.  The
following patch flushes the off node pageset caches in the same way by
tying into the slab flush.

The patch also changes /proc/zoneinfo to include the number of pages
currently in each pageset.
Signed-off-by: NChristoph Lameter <clameter@sgi.com>
Signed-off-by: NAndrew Morton <akpm@osdl.org>
Signed-off-by: NLinus Torvalds <torvalds@osdl.org>

4ae7c039

[PATCH] add OOM debug · 578c2fd6

由 Janet Morgan 提交于 6月 21, 2005

This patch provides more debug info when the system is OOM.  It displays
memory stats (basically sysrq-m info) from __alloc_pages() when page
allocation fails and during OOM kill.

Thanks to Dave Jones for coming up with the idea.
Signed-off-by: NJanet Morgan <janetmor@us.ibm.com>
Signed-off-by: NAndrew Morton <akpm@osdl.org>
Signed-off-by: NLinus Torvalds <torvalds@osdl.org>

578c2fd6

[PATCH] __read_page_state(): pass unsigned long instead of unsigned · c2f29ea1

由 Benjamin LaHaise 提交于 6月 21, 2005

By making the offset argument of __read_page_state an unsigned long instead of
unsigned, we can avoid forcing the compiler to sign extend a usually constant
argument.  This saves 1 instruction on x86-64.
Signed-off-by: NBenjamin LaHaise <benjamin.c.lahaise@intel.com>
Signed-off-by: NAndrew Morton <akpm@osdl.org>
Signed-off-by: NLinus Torvalds <torvalds@osdl.org>

c2f29ea1

[PATCH] __mod_page_state(): pass unsigned long instead of unsigned · 83e5d8f7

由 Benjamin LaHaise 提交于 6月 21, 2005

By making the offset argument of __mod_page_state an unsigned long instead
of unsigned, we can avoid forcing the compiler to sign extend a usually
constant argument.  This saves 1 instruction on x86-64.
Signed-off-by: NBenjamin LaHaise <benjamin.c.lahaise@intel.com>
Signed-off-by: NAndrew Morton <akpm@osdl.org>
Signed-off-by: NLinus Torvalds <torvalds@osdl.org>

83e5d8f7

[PATCH] vm: try_to_free_pages unused argument · 1ad539b2

由 Darren Hart 提交于 6月 21, 2005

try_to_free_pages accepts a third argument, order, but hasn't used it since
before 2.6.0.  The following patch removes the argument and updates all the
calls to try_to_free_pages.
Signed-off-by: NDarren Hart <dvhltc@us.ibm.com>
Signed-off-by: NAndrew Morton <akpm@osdl.org>
Signed-off-by: NLinus Torvalds <torvalds@osdl.org>

1ad539b2

[PATCH] mm: remove PG_highmem · cbe37d09

由 Badari Pulavarty 提交于 6月 21, 2005

Remove PG_highmem, to save a page flag.  Use is_highmem() instead.  It'll
generate a little more code, but we don't use PageHigheMem() in many places.
Signed-off-by: NBadari Pulavarty <pbadari@us.ibm.com>
Signed-off-by: NAndrew Morton <akpm@osdl.org>
Signed-off-by: NLinus Torvalds <torvalds@osdl.org>

cbe37d09

[PATCH] mmap topdown fix for large stack limit, large allocation · 73219d17

由 Chris Wright 提交于 6月 21, 2005

The topdown changes in 2.6.12-rc1 can cause large allocations with large
stack limit to fail, despite there being space available.  The
mmap_base-len is only valid when len >= mmap_base.  However, nothing in
topdown allocator checks this.  It's only (now) caught at higher level,
which will cause allocation to simply fail.  The following change restores
the fallback to bottom-up path, which will allow large allocations with
large stack limit to potentially still succeed.
Signed-off-by: NChris Wright <chrisw@osdl.org>
Signed-off-by: NAndrew Morton <akpm@osdl.org>
Signed-off-by: NLinus Torvalds <torvalds@osdl.org>

73219d17

[PATCH] Avoiding mmap fragmentation · 1363c3cd

由 Wolfgang Wander 提交于 6月 21, 2005

Ingo recently introduced a great speedup for allocating new mmaps using the
free_area_cache pointer which boosts the specweb SSL benchmark by 4-5% and
causes huge performance increases in thread creation.

The downside of this patch is that it does lead to fragmentation in the
mmap-ed areas (visible via /proc/self/maps), such that some applications
that work fine under 2.4 kernels quickly run out of memory on any 2.6
kernel.

The problem is twofold:

  1) the free_area_cache is used to continue a search for memory where
     the last search ended.  Before the change new areas were always
     searched from the base address on.

     So now new small areas are cluttering holes of all sizes
     throughout the whole mmap-able region whereas before small holes
     tended to close holes near the base leaving holes far from the base
     large and available for larger requests.

  2) the free_area_cache also is set to the location of the last
     munmap-ed area so in scenarios where we allocate e.g.  five regions of
     1K each, then free regions 4 2 3 in this order the next request for 1K
     will be placed in the position of the old region 3, whereas before we
     appended it to the still active region 1, placing it at the location
     of the old region 2.  Before we had 1 free region of 2K, now we only
     get two free regions of 1K -> fragmentation.

The patch addresses thes issues by introducing yet another cache descriptor
cached_hole_size that contains the largest known hole size below the
current free_area_cache.  If a new request comes in the size is compared
against the cached_hole_size and if the request can be filled with a hole
below free_area_cache the search is started from the base instead.

The results look promising: Whereas 2.6.12-rc4 fragments quickly and my
(earlier posted) leakme.c test program terminates after 50000+ iterations
with 96 distinct and fragmented maps in /proc/self/maps it performs nicely
(as expected) with thread creation, Ingo's test_str02 with 20000 threads
requires 0.7s system time.

Taking out Ingo's patch (un-patch available per request) by basically
deleting all mentions of free_area_cache from the kernel and starting the
search for new memory always at the respective bases we observe: leakme
terminates successfully with 11 distinctive hardly fragmented areas in
/proc/self/maps but thread creating is gringdingly slow: 30+s(!) system
time for Ingo's test_str02 with 20000 threads.

Now - drumroll ;-) the appended patch works fine with leakme: it ends with
only 7 distinct areas in /proc/self/maps and also thread creation seems
sufficiently fast with 0.71s for 20000 threads.
Signed-off-by: NWolfgang Wander <wwc@rentec.com>
Credit-to: "Richard Purdie" <rpurdie@rpsys.net>
Signed-off-by: NKen Chen <kenneth.w.chen@intel.com>
Acked-by: Ingo Molnar <mingo@elte.hu> (partly)
Signed-off-by: NAndrew Morton <akpm@osdl.org>
Signed-off-by: NLinus Torvalds <torvalds@osdl.org>

1363c3cd

[PATCH] node local per-cpu-pages · e7c8d5c9

由 Christoph Lameter 提交于 6月 21, 2005

This patch modifies the way pagesets in struct zone are managed.

Each zone has a per-cpu array of pagesets. So any particular CPU has some
memory in each zone structure which belongs to itself. Even if that CPU is
not local to that zone.

So the patch relocates the pagesets for each cpu to the node that is nearest
to the cpu instead of allocating the pagesets in the (possibly remote) target
zone. This means that the operations to manage pages on remote zone can be
done with information available locally.

We play a macro trick so that non-NUMA pmachines avoid the additional
pointer chase on the page allocator fastpath.

AIM7 benchmark on a 32 CPU SGI Altix

w/o patches:
Tasks jobs/min jti jobs/min/task real cpu
1 484.68 100 484.6769 12.01 1.97 Fri Mar 25 11:01:42 2005
100 27140.46 89 271.4046 21.44 148.71 Fri Mar 25 11:02:04 2005
200 30792.02 82 153.9601 37.80 296.72 Fri Mar 25 11:02:42 2005
300 32209.27 81 107.3642 54.21 451.34 Fri Mar 25 11:03:37 2005
400 34962.83 78 87.4071 66.59 588.97 Fri Mar 25 11:04:44 2005
500 31676.92 75 63.3538 91.87 742.71 Fri Mar 25 11:06:16 2005
600 36032.69 73 60.0545 96.91 885.44 Fri Mar 25 11:07:54 2005
700 35540.43 77 50.7720 114.63 1024.28 Fri Mar 25 11:09:49 2005
800 33906.70 74 42.3834 137.32 1181.65 Fri Mar 25 11:12:06 2005
900 34120.67 73 37.9119 153.51 1325.26 Fri Mar 25 11:14:41 2005
1000 34802.37 74 34.8024 167.23 1465.26 Fri Mar 25 11:17:28 2005

with slab API changes and pageset patch:

Tasks jobs/min jti jobs/min/task real cpu
1 485.00 100 485.0000 12.00 1.96 Fri Mar 25 11:46:18 2005
100 28000.96 89 280.0096 20.79 150.45 Fri Mar 25 11:46:39 2005
200 32285.80 79 161.4290 36.05 293.37 Fri Mar 25 11:47:16 2005
300 40424.15 84 134.7472 43.19 438.42 Fri Mar 25 11:47:59 2005
400 39155.01 79 97.8875 59.46 590.05 Fri Mar 25 11:48:59 2005
500 37881.25 82 75.7625 76.82 730.19 Fri Mar 25 11:50:16 2005
600 39083.14 78 65.1386 89.35 872.79 Fri Mar 25 11:51:46 2005
700 38627.83 77 55.1826 105.47 1022.46 Fri Mar 25 11:53:32 2005
800 39631.94 78 49.5399 117.48 1169.94 Fri Mar 25 11:55:30 2005
900 36903.70 79 41.0041 141.94 1310.78 Fri Mar 25 11:57:53 2005
1000 36201.23 77 36.2012 160.77 1458.31 Fri Mar 25 12:00:34 2005
Signed-off-by: NChristoph Lameter <clameter@sgi.com>
Signed-off-by: NShobhit Dayal <shobhit@calsoftinc.com>
Signed-off-by: NShai Fultheim <Shai@Scalex86.org>
Signed-off-by: NAndrew Morton <akpm@osdl.org>
Signed-off-by: NLinus Torvalds <torvalds@osdl.org>

e7c8d5c9

[PATCH] Hugepage consolidation · 63551ae0

由 David Gibson 提交于 6月 21, 2005

A lot of the code in arch/*/mm/hugetlbpage.c is quite similar.  This patch
attempts to consolidate a lot of the code across the arch's, putting the
combined version in mm/hugetlb.c.  There are a couple of uglyish hacks in
order to covert all the hugepage archs, but the result is a very large
reduction in the total amount of code.  It also means things like hugepage
lazy allocation could be implemented in one place, instead of six.

Tested, at least a little, on ppc64, i386 and x86_64.

Notes:
	- this patch changes the meaning of set_huge_pte() to be more
	  analagous to set_pte()
	- does SH4 need s special huge_ptep_get_and_clear()??
Acked-by: NWilliam Lee Irwin <wli@holomorphy.com>
Signed-off-by: NAndrew Morton <akpm@osdl.org>
Signed-off-by: NLinus Torvalds <torvalds@osdl.org>

63551ae0

[PATCH] VM: rate limit early reclaim · 1e7e5a90

由 Martin Hicks 提交于 6月 21, 2005

When early zone reclaim is turned on the LRU is scanned more frequently when a
zone is low on memory.  This limits when the zone reclaim can be called by
skipping the scan if another thread (either via kswapd or sync reclaim) is
already reclaiming from the zone.
Signed-off-by: NMartin Hicks <mort@sgi.com>
Signed-off-by: NAndrew Morton <akpm@osdl.org>
Signed-off-by: NLinus Torvalds <torvalds@osdl.org>

1e7e5a90

[PATCH] VM: add __GFP_NORECLAIM · 0c35bbad

由 Martin Hicks 提交于 6月 21, 2005

When using the early zone reclaim, it was noticed that allocating new pages
that should be spread across the whole system caused eviction of local pages.

This adds a new GFP flag to prevent early reclaim from happening during
certain allocation attempts. The example that is implemented here is for page
cache pages. We want page cache pages to be spread across the whole system,
and we don't want page cache pages to evict other pages to get local memory.
Signed-off-by: NMartin Hicks <mort@sgi.com>
Signed-off-by: NAndrew Morton <akpm@osdl.org>
Signed-off-by: NLinus Torvalds <torvalds@osdl.org>

0c35bbad

[PATCH] VM: early zone reclaim · 753ee728

由 Martin Hicks 提交于 6月 21, 2005

This is the core of the (much simplified) early reclaim.  The goal of this
patch is to reclaim some easily-freed pages from a zone before falling back
onto another zone.

One of the major uses of this is NUMA machines.  With the default allocator
behavior the allocator would look for memory in another zone, which might be
off-node, before trying to reclaim from the current zone.

This adds a zone tuneable to enable early zone reclaim.  It is selected on a
per-zone basis and is turned on/off via syscall.

Adding some extra throttling on the reclaim was also required (patch
4/4).  Without the machine would grind to a crawl when doing a "make -j"
kernel build.  Even with this patch the System Time is higher on
average, but it seems tolerable.  Here are some numbers for kernbench
runs on a 2-node, 4cpu, 8Gig RAM Altix in the "make -j" run:

			wall  user   sys   %cpu  ctx sw.  sleeps
			----  ----   ---   ----   ------  ------
No patch		1009  1384   847   258   298170   504402
w/patch, no reclaim     880   1376   667   288   254064   396745
w/patch & reclaim       1079  1385   926   252   291625   548873

These numbers are the average of 2 runs of 3 "make -j" runs done right
after system boot.  Run-to-run variability for "make -j" is huge, so
these numbers aren't terribly useful except to seee that with reclaim
the benchmark still finishes in a reasonable amount of time.

I also looked at the NUMA hit/miss stats for the "make -j" runs and the
reclaim doesn't make any difference when the machine is thrashing away.

Doing a "make -j8" on a single node that is filled with page cache pages
takes 700 seconds with reclaim turned on and 735 seconds without reclaim
(due to remote memory accesses).

The simple zone_reclaim syscall program is at
http://www.bork.org/~mort/sgi/zone_reclaim.cSigned-off-by: NMartin Hicks <mort@sgi.com>
Signed-off-by: NAndrew Morton <akpm@osdl.org>
Signed-off-by: NLinus Torvalds <torvalds@osdl.org>

753ee728

[PATCH] VM: add may_swap flag to scan_control · bfbb38fb

由 Martin Hicks 提交于 6月 21, 2005

Here's the next round of these patches.  These are totally different in
an attempt to meet the "simpler" request after the last patches.  For
reference the earlier threads are:

http://marc.theaimsgroup.com/?l=linux-kernel&m=110839604924587&w=2
http://marc.theaimsgroup.com/?l=linux-mm&m=111461480721249&w=2

This set of patches replaces my other vm- patches that are currently in
-mm.  So they're against 2.6.12-rc5-mm1 about half way through the -mm
patchset.

As I said already this patch is a lot simpler.  The reclaim is turned on
or off on a per-zone basis using a syscall.  I haven't tested the x86
syscall, so it might be wrong.  It uses the existing reclaim/pageout
code with the small addition of a may_swap flag to scan_control
(patch 1/4).

I also added __GFP_NORECLAIM (patch 3/4) so that certain allocation
types can be flagged to never cause reclaim.  This was a deficiency
that was in all of my earlier patch sets.  Previously, doing a big
buffered read would fill one zone with page cache and then start to
reclaim from that same zone, leaving the other zones untouched.

Adding some extra throttling on the reclaim was also required (patch
4/4).  Without the machine would grind to a crawl when doing a "make -j"
kernel build.  Even with this patch the System Time is higher on
average, but it seems tolerable.  Here are some numbers for kernbench
runs on a 2-node, 4cpu, 8Gig RAM Altix in the "make -j" run:

			wall  user   sys   %cpu  ctx sw.  sleeps
			----  ----   ---   ----   ------  ------
No patch		1009  1384   847   258   298170   504402
w/patch, no reclaim     880   1376   667   288   254064   396745
w/patch & reclaim       1079  1385   926   252   291625   548873

These numbers are the average of 2 runs of 3 "make -j" runs done right
after system boot.  Run-to-run variability for "make -j" is huge, so
these numbers aren't terribly useful except to seee that with reclaim
the benchmark still finishes in a reasonable amount of time.

I also looked at the NUMA hit/miss stats for the "make -j" runs and the
reclaim doesn't make any difference when the machine is thrashing away.

Doing a "make -j8" on a single node that is filled with page cache pages
takes 700 seconds with reclaim turned on and 735 seconds without reclaim
(due to remote memory accesses).

The simple zone_reclaim syscall program is at
http://www.bork.org/~mort/sgi/zone_reclaim.c

This patch:

This adds an extra switch to the scan_control struct.  It simply lets the
reclaim code know if its allowed to swap pages out.

This was required for a simple per-zone reclaimer.  Without this addition
pages would be swapped out as soon as a zone ran out of memory and the early
reclaim kicked in.
Signed-off-by: NMartin Hicks <mort@sgi.com>
Signed-off-by: NAndrew Morton <akpm@osdl.org>
Signed-off-by: NLinus Torvalds <torvalds@osdl.org>

bfbb38fb

[PATCH] mm: add /proc/zoneinfo · 295ab934

由 Nikita Danilov 提交于 6月 21, 2005

Add /proc/zoneinfo file to display information about memory zones.  Useful
to analyze VM behaviour.
Signed-off-by: NNikita Danilov <nikita@clusterfs.com>
Signed-off-by: NAndrew Morton <akpm@osdl.org>
Signed-off-by: NLinus Torvalds <torvalds@osdl.org>

295ab934

[PATCH] madvise: merge the maps · 05b74384

由 Prasanna Meda 提交于 6月 21, 2005

This attempts to merge back the split maps.  This code is mostly copied
from Chrisw's mlock merging from post 2.6.11 trees.  The only difference is
in munmapped_error handling.  Also passed prev to willneed/dontneed,
eventhogh they do not handle it now, since I felt it will be cleaner,
instead of handling prev in madvise_vma in some cases and in subfunction in
some cases.
Signed-off-by: NPrasanna Meda <pmeda@akamai.com>
Signed-off-by: NAndrew Morton <akpm@osdl.org>
Signed-off-by: NLinus Torvalds <torvalds@osdl.org>

05b74384

[PATCH] madvise: do not split the maps · e798c6e8

由 Prasanna Meda 提交于 6月 21, 2005

This attempts to avoid splittings when it is not needed, that is when
vm_flags are same as new flags.  The idea is from the <2.6.11 mlock_fixup
and others.  This will provide base for the next madvise merging patch.
Signed-off-by: NPrasanna Meda <pmeda@akamai.com>
Signed-off-by: NAndrew Morton <akpm@osdl.org>
Signed-off-by: NLinus Torvalds <torvalds@osdl.org>

e798c6e8

[PATCH] vmscan: notice slab shrinking · b15e0905

由 akpm@osdl.org 提交于 6月 21, 2005

Fix a problem identified by Andrea Arcangeli <andrea@suse.de>

kswapd will set a zone into all_unreclaimable state if it sees that we're not
successfully reclaiming LRU pages.  But that fails to notice that we're
successfully reclaiming slab obects, so we can set all_unreclaimable too soon.

So change shrink_slab() to return a success indication if it actually
reclaimed some objects, and don't assume that the zone is all_unreclaimable if
that is true.  This means that we won't enter all_unreclaimable state if we
are successfully freeing slab objects but we're not yet actually freeing slab
pages, due to internal fragmentation.

(hm, this has a shortcoming.  We could be successfully freeing ZONE_NORMAL
slab objects while being really oom on ZONE_DMA.  If that happens then kswapd
might burn a lot of CPU.  But given that there might be some slab objects in
ZONE_DMA, perhaps that is appropriate.)
Signed-off-by: NAndrew Morton <akpm@osdl.org>
Signed-off-by: NLinus Torvalds <torvalds@osdl.org>

b15e0905

[PATCH] smp_processor_id() cleanup · 39c715b7

由 Ingo Molnar 提交于 6月 21, 2005

This patch implements a number of smp_processor_id() cleanup ideas that
Arjan van de Ven and I came up with.

The previous __smp_processor_id/_smp_processor_id/smp_processor_id API
spaghetti was hard to follow both on the implementational and on the
usage side.

Some of the complexity arose from picking wrong names, some of the
complexity comes from the fact that not all architectures defined
__smp_processor_id.

In the new code, there are two externally visible symbols:

 - smp_processor_id(): debug variant.

 - raw_smp_processor_id(): nondebug variant. Replaces all existing
   uses of _smp_processor_id() and __smp_processor_id(). Defined
   by every SMP architecture in include/asm-*/smp.h.

There is one new internal symbol, dependent on DEBUG_PREEMPT:

 - debug_smp_processor_id(): internal debug variant, mapped to
                             smp_processor_id().

Also, i moved debug_smp_processor_id() from lib/kernel_lock.c into a new
lib/smp_processor_id.c file.  All related comments got updated and/or
clarified.

I have build/boot tested the following 8 .config combinations on x86:

 {SMP,UP} x {PREEMPT,!PREEMPT} x {DEBUG_PREEMPT,!DEBUG_PREEMPT}

I have also build/boot tested x64 on UP/PREEMPT/DEBUG_PREEMPT.  (Other
architectures are untested, but should work just fine.)
Signed-off-by: NIngo Molnar <mingo@elte.hu>
Signed-off-by: NArjan van de Ven <arjan@infradead.org>
Signed-off-by: NAndrew Morton <akpm@osdl.org>
Signed-off-by: NLinus Torvalds <torvalds@osdl.org>

39c715b7

[PATCH] x86_64: TASK_SIZE fixes for compatibility mode processes · 84929801

由 Suresh Siddha 提交于 6月 21, 2005

Appended patch will setup compatibility mode TASK_SIZE properly.  This will
fix atleast three known bugs that can be encountered while running
compatibility mode apps.

a) A malicious 32bit app can have an elf section at 0xffffe000.  During
   exec of this app, we will have a memory leak as insert_vm_struct() is
   not checking for return value in syscall32_setup_pages() and thus not
   freeing the vma allocated for the vsyscall page.  And instead of exec
   failing (as it has addresses > TASK_SIZE), we were allowing it to
   succeed previously.

b) With a 32bit app, hugetlb_get_unmapped_area/arch_get_unmapped_area
   may return addresses beyond 32bits, ultimately causing corruption
   because of wrap-around and resulting in SEGFAULT, instead of returning
   ENOMEM.

c) 32bit app doing this below mmap will now fail.

  mmap((void *)(0xFFFFE000UL), 0x10000UL, PROT_READ|PROT_WRITE,
	MAP_FIXED|MAP_PRIVATE|MAP_ANON, 0, 0);
Signed-off-by: NZou Nan hai <nanhai.zou@intel.com>
Signed-off-by: NSuresh Siddha <suresh.b.siddha@intel.com>
Cc: Andi Kleen <ak@muc.de>
Signed-off-by: NAndrew Morton <akpm@osdl.org>
Signed-off-by: NLinus Torvalds <torvalds@osdl.org>

84929801

[PATCH] coverity: idr_get_new_above_int() overrun fix · 589777ea

由 Zaur Kambarov 提交于 6月 21, 2005

This patch fixes overrun of array pa:
92   		struct idr_layer *pa[MAX_LEVEL];

in

98   		l = idp->layers;
99   		pa[l--] = NULL;

by passing idp->layers, set in
202  		idp->layers = layers;
to function  sub_alloc in
203  		v = sub_alloc(idp, ptr, &id);
Signed-off-by: NZaur Kambarov <zkambarov@coverity.com>
Signed-off-by: NAndrew Morton <akpm@osdl.org>
Signed-off-by: NLinus Torvalds <torvalds@osdl.org>

589777ea

[PATCH] coverity: ipmi: avoid overrun of ipmi_interfaces[] · 3a845099

由 Zaur Kambarov 提交于 6月 21, 2005

Fix overrun of static array "ipmi_interfaces" of size 4 at position 4 with
index variable "if_num".

Definitions involved:
297  	#define MAX_IPMI_INTERFACES 4
298  	static ipmi_smi_t ipmi_interfaces[MAX_IPMI_INTERFACES];
Signed-off-by: NZaur Kambarov <zkambarov@coverity.com>
Cc: Corey Minyard <minyard@acm.org>
Signed-off-by: NAndrew Morton <akpm@osdl.org>
Signed-off-by: NLinus Torvalds <torvalds@osdl.org>

3a845099

[PATCH] megaraid build fix · 7f20b6a4

由 bobl 提交于 6月 21, 2005

Signed-off-by: NAndrew Morton <akpm@osdl.org>
Signed-off-by: NLinus Torvalds <torvalds@osdl.org>

7f20b6a4

[PATCH] arm: irqs_disabled() type fix · 9a558cb4

由 Andrew Morton 提交于 6月 21, 2005

kernel/sched.c: In function `__might_sleep':
kernel/sched.c:5461: warning: int format, long unsigned int arg (arg 3)

We expect irqs_disabled() to return an int (poor man's bool).
Acked-by: NRussell King <rmk@arm.linux.org.uk>
Signed-off-by: NAndrew Morton <akpm@osdl.org>
Signed-off-by: NLinus Torvalds <torvalds@osdl.org>

9a558cb4

L

Merge rsync://rsync.kernel.org/pub/scm/linux/kernel/git/davem/sparc-2.6 · 9723d95d
由 Linus Torvalds 提交于 6月 21, 2005

9723d95d

[SPARC64]: Add prefetch support. · 7049e680

由 David S. Miller 提交于 6月 21, 2005

The implementation is optimal for UltraSPARC-III and later.
It will work, however suboptimally, on UltraSPARC-II and
be treated as a NOP on UltraSPARC-I.

It is not worth code patching this thing as the highest cost
is the code space, and code patching cannot eliminate that.
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

7049e680

L

Merge rsync://rsync.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6 · 4a4f8fdb
由 Linus Torvalds 提交于 6月 21, 2005

4a4f8fdb