提交 · 02febdf7f62f2fbfa89ca9dc5d929beea89c96b1 · openeuler / raspberrypi-kernel

31 7月, 2007 1 次提交

slub: add lock debugging check · 02febdf7

由 Peter Zijlstra 提交于 7月 26, 2007

Ingo noticed that the SLUB code does include the lock debugging free
check.
Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: NIngo Molnar <mingo@elte.hu>
Acked-by: NPekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: NChristoph Lameter <clameter@sgi.com>

02febdf7

30 7月, 2007 3 次提交

Remove fs.h from mm.h · 4e950f6f

由 Alexey Dobriyan 提交于 7月 30, 2007

Remove fs.h from mm.h. For this,
1) Uninline vma_wants_writenotify(). It's pretty huge anyway.
2) Add back fs.h or less bloated headers (err.h) to files that need it.

As result, on x86_64 allyesconfig, fs.h dependencies cut down from 3929 files
rebuilt down to 3444 (-12.3%).

Cross-compile tested without regressions on my two usual configs and (sigh):

alpha arm-mx1ads mips-bigsur powerpc-ebony
alpha-allnoconfig arm-neponset mips-capcella powerpc-g5
alpha-defconfig arm-netwinder mips-cobalt powerpc-holly
alpha-up arm-netx mips-db1000 powerpc-iseries
arm arm-ns9xxx mips-db1100 powerpc-linkstation
arm-assabet arm-omap_h2_1610 mips-db1200 powerpc-lite5200
arm-at91rm9200dk arm-onearm mips-db1500 powerpc-maple
arm-at91rm9200ek arm-picotux200 mips-db1550 powerpc-mpc7448_hpc2
arm-at91sam9260ek arm-pleb mips-ddb5477 powerpc-mpc8272_ads
arm-at91sam9261ek arm-pnx4008 mips-decstation powerpc-mpc8313_rdb
arm-at91sam9263ek arm-pxa255-idp mips-e55 powerpc-mpc832x_mds
arm-at91sam9rlek arm-realview mips-emma2rh powerpc-mpc832x_rdb
arm-ateb9200 arm-realview-smp mips-excite powerpc-mpc834x_itx
arm-badge4 arm-rpc mips-fulong powerpc-mpc834x_itxgp
arm-carmeva arm-s3c2410 mips-ip22 powerpc-mpc834x_mds
arm-cerfcube arm-shannon mips-ip27 powerpc-mpc836x_mds
arm-clps7500 arm-shark mips-ip32 powerpc-mpc8540_ads
arm-collie arm-simpad mips-jazz powerpc-mpc8544_ds
arm-corgi arm-spitz mips-jmr3927 powerpc-mpc8560_ads
arm-csb337 arm-trizeps4 mips-malta powerpc-mpc8568mds
arm-csb637 arm-versatile mips-mipssim powerpc-mpc85xx_cds
arm-ebsa110 i386 mips-mpc30x powerpc-mpc8641_hpcn
arm-edb7211 i386-allnoconfig mips-msp71xx powerpc-mpc866_ads
arm-em_x270 i386-defconfig mips-ocelot powerpc-mpc885_ads
arm-ep93xx i386-up mips-pb1100 powerpc-pasemi
arm-footbridge ia64 mips-pb1500 powerpc-pmac32
arm-fortunet ia64-allnoconfig mips-pb1550 powerpc-ppc64
arm-h3600 ia64-bigsur mips-pnx8550-jbs powerpc-prpmc2800
arm-h7201 ia64-defconfig mips-pnx8550-stb810 powerpc-ps3
arm-h7202 ia64-gensparse mips-qemu powerpc-pseries
arm-hackkit ia64-sim mips-rbhma4200 powerpc-up
arm-integrator ia64-sn2 mips-rbhma4500 s390
arm-iop13xx ia64-tiger mips-rm200 s390-allnoconfig
arm-iop32x ia64-up mips-sb1250-swarm s390-defconfig
arm-iop33x ia64-zx1 mips-sead s390-up
arm-ixp2000 m68k mips-tb0219 sparc
arm-ixp23xx m68k-amiga mips-tb0226 sparc-allnoconfig
arm-ixp4xx m68k-apollo mips-tb0287 sparc-defconfig
arm-jornada720 m68k-atari mips-workpad sparc-up
arm-kafa m68k-bvme6000 mips-wrppmc sparc64
arm-kb9202 m68k-hp300 mips-yosemite sparc64-allnoconfig
arm-ks8695 m68k-mac parisc sparc64-defconfig
arm-lart m68k-mvme147 parisc-allnoconfig sparc64-up
arm-lpd270 m68k-mvme16x parisc-defconfig um-x86_64
arm-lpd7a400 m68k-q40 parisc-up x86_64
arm-lpd7a404 m68k-sun3 powerpc x86_64-allnoconfig
arm-lubbock m68k-sun3x powerpc-cell x86_64-defconfig
arm-lusl7200 mips powerpc-celleb x86_64-up
arm-mainstone mips-atlas powerpc-chrp32
Signed-off-by: NAlexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

4e950f6f

Introduce CONFIG_SUSPEND for suspend-to-Ram and standby · 296699de

由 Rafael J. Wysocki 提交于 7月 29, 2007

Introduce CONFIG_SUSPEND representing the ability to enter system sleep
states, such as the ACPI S3 state, and allow the user to choose SUSPEND
and HIBERNATION independently of each other.

Make HOTPLUG_CPU be selected automatically if SUSPEND or HIBERNATION has
been chosen and the kernel is intended for SMP systems.

Also, introduce CONFIG_PM_SLEEP which is automatically selected if
CONFIG_SUSPEND or CONFIG_HIBERNATION is set and use it to select the
code needed for both suspend and hibernation.

The top-level power management headers and the ACPI code related to
suspend and hibernation are modified to use the new definitions (the
changes in drivers/acpi/sleep/main.c are, mostly, moving code to reduce
the number of ifdefs).

There are many other files in which CONFIG_PM can be replaced with
CONFIG_PM_SLEEP or even with CONFIG_SUSPEND, but they can be updated in
the future.
Signed-off-by: NRafael J. Wysocki <rjw@sisk.pl>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

296699de

Replace CONFIG_SOFTWARE_SUSPEND with CONFIG_HIBERNATION · b0cb1a19

由 Rafael J. Wysocki 提交于 7月 29, 2007

Replace CONFIG_SOFTWARE_SUSPEND with CONFIG_HIBERNATION to avoid
confusion (among other things, with CONFIG_SUSPEND introduced in the
next patch).
Signed-off-by: NRafael J. Wysocki <rjw@sisk.pl>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

b0cb1a19

27 7月, 2007 3 次提交

Allow nodes to exist that only contain ZONE_MOVABLE · b5445f95

由 Mel Gorman 提交于 7月 26, 2007

With the introduction of kernelcore=, a configurable zone is created on
request.  In some cases, this value will be small enough that some nodes
contain only ZONE_MOVABLE.  On some NUMA configurations when this occurs,
arch-independent zone-sizing will get the size of the memory holes within
the node incorrect.  The value of present_pages goes negative and the boot
fails.

This patch fixes the bug in the calculation of the size of the hole.  The
test case is to boot test a NUMA machine with a low value of kernelcore=
before and after the patch is applied.  While this bug exists in early
kernel it cannot be triggered in practice.

This patch has been boot-tested on a variety machines with and without
kernelcore= set.
Signed-off-by: NMel Gorman <mel@csn.ul.ie>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

b5445f95

memory unplug: isolate_lru_page fix · 3dd9fe8c

由 KAMEZAWA Hiroyuki 提交于 7月 26, 2007

release_pages() in mm/swap.c changes page_count() to be 0 without removing
PageLRU flag...

This means isolate_lru_page() can see a page, PageLRU() &&
page_count(page)==0..  This is BUG.  (get_page() will be called against
count=0 page.)
Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: NChristoph Lameter <clameter@sgi.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

3dd9fe8c

memory unplug: migration by kernel · dc386d4d

由 KAMEZAWA Hiroyuki 提交于 7月 26, 2007

In usual, migrate_pages(page,,) is called with holding mm->sem by system call.
(mm here is a mm_struct which maps the migration target page.)
This semaphore helps avoiding some race conditions.

But, if we want to migrate a page by some kernel codes, we have to avoid
some races. This patch adds check code for following race condition.

1. A page which page->mapping==NULL can be target of migration. Then, we have
   to check page->mapping before calling try_to_unmap().

2. anon_vma can be freed while page is unmapped, but page->mapping remains as
   it was. We drop page->mapcount to be 0. Then we cannot trust page->mapping.
   So, use rcu_read_lock() to prevent anon_vma pointed by page->mapping from
   being freed during migration.
Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: NChristoph Lameter <clameter@sgi.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

dc386d4d

25 7月, 2007 2 次提交

slab: correctly handle __GFP_ZERO · b8c1c5da

由 Andrew Morton 提交于 7月 24, 2007

Use the correct local variable when calling into the page allocator.  Local
`flags' can have __GFP_ZERO set, which causes us to pass __GFP_ZERO into the
page allocator, possibly from illegal contexts.  The page allocator will later
do prep_zero_page()->kmap_atomic(..., KM_USER0) from irq contexts and will
then go BUG.

Cc: Mike Galbraith <efault@gmx.de>
Acked-by: NChristoph Lameter <clameter@sgi.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

b8c1c5da

fix hugetlb page allocation leak · 5ab3ee7b

由 Ken Chen 提交于 7月 23, 2007

dequeue_huge_page() has a serious memory leak upon hugetlb page
allocation.  The for loop continues on allocating hugetlb pages out of
all allowable zone, where this function is supposedly only dequeue one
and only one pages.

Fixed it by breaking out of the for loop once a hugetlb page is found.
Signed-off-by: NKen Chen <kenchen@google.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

5ab3ee7b

24 7月, 2007 1 次提交

[BLOCK] Get rid of request_queue_t typedef · 165125e1

由 Jens Axboe 提交于 7月 24, 2007

Some of the code has been gradually transitioned to using the proper
struct request_queue, but there's lots left. So do a full sweet of
the kernel and get rid of this typedef and replace its uses with
the proper type.
Signed-off-by: NJens Axboe <jens.axboe@oracle.com>

165125e1

23 7月, 2007 1 次提交

x86_64: fix section mismatch warning in init.c · dec2e6b7

由 Sam Ravnborg 提交于 7月 22, 2007

Fix following warning:
WARNING: vmlinux.o(.text+0x188ea): Section mismatch: reference to .init.text:__alloc_bootmem_core (between 'alloc_bootmem_high_node' and 'get_gate_vma')

alloc_bootmem_high_node() is only used from __init scope so declare it __init.
And in addition declare the weak variant __init too.
Signed-off-by: NSam Ravnborg <sam@ravnborg.org>
Signed-off-by: NAndi Kleen <ak@suse.de>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

dec2e6b7

22 7月, 2007 3 次提交

slob: reduce list scanning · d6269543

由 Matt Mackall 提交于 7月 21, 2007

The version of SLOB in -mm always scans its free list from the beginning,
which results in small allocations and free segments clustering at the
beginning of the list over time.  This causes the average search to scan
over a large stretch at the beginning on each allocation.

By starting each page search where the last one left off, we evenly
distribute the allocations and greatly shorten the average search.

Without this patch, kernel compiles on a 1.5G machine take a large amount
of system time for list scanning.  With this patch, compiles are within a
few seconds of performance of a SLAB kernel with no notable change in
system time.
Signed-off-by: NMatt Mackall <mpm@selenic.com>
Cc: Christoph Lameter <clameter@sgi.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

d6269543

remove handle_mm_fault export · 41f9dc5c

由 Christoph Hellwig 提交于 7月 21, 2007

Now that arch/powerpc/platforms/cell/spufs/fault.c is always built in
the kernel there is no need to export handle_mm_fault anymore.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

41f9dc5c

nommu: vmalloc_32_user()/vm_insert_page() and symbol exports. · b5073173

由 Paul Mundt 提交于 7月 21, 2007

Trying to survive an allmodconfig on a nommu platform results in many
screen lengths of module unhappiness. Many of the mmap related things that
binfmt_flat hooks in to are never exported despite being global, and there
are also missing definitions for vmalloc_32_user() and vm_insert_page().

I've implemented vmalloc_32_user() trying to stick as close to the
mm/vmalloc.c implementation as possible, though we don't have any need for
VM_USERMAP, so groveling for the VMA can be skipped. vm_insert_page() has
been stubbed for now in order to keep the build happy.
Signed-off-by: NPaul Mundt <lethal@linux-sh.org>
Cc: David Howells <dhowells@redhat.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

b5073173

20 7月, 2007 26 次提交

mm: fix memory hotplug oops from ZONE_MOVABLE changes. · e228929b

由 Paul Mundt 提交于 7月 20, 2007

zone_movable_pfn is presently marked as __initdata and referenced from
adjust_zone_range_for_zone_movable(), which in turn is referenced by
zone_spanned_pages_in_node().  Both of these are __meminit annotated.  When
memory hotplug is enabled, this will oops on a hot-add, due to
zone_movable_pfn having been freed.

__meminitdata annotation gives the desired behaviour.

This will only impact platforms that enable both memory hotplug
and ARCH_POPULATES_NODE_MAP.
Signed-off-by: NPaul Mundt <lethal@linux-sh.org>
Acked-by: NMel Gorman <mel@csn.ul.ie>
Acked-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

e228929b

mm: Remove slab destructors from kmem_cache_create(). · 20c2df83

由 Paul Mundt 提交于 7月 20, 2007

Slab destructors were no longer supported after Christoph's
c59def9f change. They've been
BUGs for both slab and slub, and slob never supported them
either.

This rips out support for the dtor pointer from kmem_cache_create()
completely and fixes up every single callsite in the kernel (there were
about 224, not including the slab allocator definitions themselves,
or the documentation references).
Signed-off-by: NPaul Mundt <lethal@linux-sh.org>

20c2df83

slub: fix ksize() for zero-sized pointers · 9550b105

由 Linus Torvalds 提交于 7月 19, 2007

The slab and slob allocators already did this right, but slub would call
"get_object_page()" on the magic ZERO_SIZE_PTR, with all kinds of nasty
end results.

Noted by Ingo Molnar.

Cc: Ingo Molnar <mingo@elte.hu>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

9550b105

Fix up non-NUMA SLAB configuration for zero-sized allocations · a5c96d8a

由 Linus Torvalds 提交于 7月 19, 2007

I suspect Christoph tested his code only in the NUMA configuration, for
the combination of SLAB+non-NUMA the zero-sized kmalloc's would not work.

Of course, this would only trigger in configurations where those zero-
sized allocations happen (not very common), so that may explain why it
wasn't more widely noticed.

Seen by by Andi Kleen under qemu, and there seems to be a report by
Michael Tsirkin on it too.

Cc: Andi Kleen <ak@suse.de>
Cc: Roland Dreier <rdreier@cisco.com>
Cc: Michael S. Tsirkin <mst@dev.mellanox.co.il>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

a5c96d8a

lguest: export symbols for lguest as a module · 5992b6da

由 Rusty Russell 提交于 7月 19, 2007

lguest does some fairly lowlevel things to support a host, which
normal modules don't need:

math_state_restore:
	When the guest triggers a Device Not Available fault, we need
	to be able to restore the FPU

__put_task_struct:
	We need to hold a reference to another task for inter-guest
	I/O, and put_task_struct() is an inline function which calls
	__put_task_struct.

access_process_vm:
	We need to access another task for inter-guest I/O.

map_vm_area & __get_vm_area:
	We need to map the switcher shim (ie. monitor) at 0xFFC01000.
Signed-off-by: NRusty Russell <rusty@rustcorp.com.au>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

5992b6da

move page writeback acounting out of macros · d688abf5

由 Andrew Morton 提交于 7月 19, 2007

page-writeback accounting is presently performed in the page-flags macros.
This is inconsistent and a bit ugly and makes it awkward to implement
per-backing_dev under-writeback page accounting.

So move this accounting down to the callsite(s).
Acked-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

d688abf5

hugetlb: use set_compound_page_dtor · f8af0bb8

由 Akinobu Mita 提交于 7月 19, 2007

Use appropriate accessor function to set compound page destructor
function.

Cc:  William Irwin <wli@holomorphy.com>
Signed-off-by: NAkinobu Mita <akinobu.mita@gmail.com>
Acked-by: NAdam Litke <agl@us.ibm.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

f8af0bb8

Remove nid_lock from alloc_fresh_huge_page · 7ed5cb2b

由 Hugh Dickins 提交于 7月 19, 2007

The fix to that race in alloc_fresh_huge_page() which could give an illegal
node ID did not need nid_lock at all: the fix was to replace static int nid
by static int prev_nid and do the work on local int nid. nid_lock did make
sure that racers strictly roundrobin the nodes, but that's not something we
need to enforce strictly. Kill nid_lock.
Signed-off-by: NHugh Dickins <hugh@veritas.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

7ed5cb2b

vmalloc_32 should use GFP_KERNEL · 7ac674f5

由 Benjamin Herrenschmidt 提交于 7月 19, 2007

I've noticed lots of failures of vmalloc_32 on machines where it
shouldn't have failed unless it was doing an atomic operation.

Looking closely, I noticed that:

#if defined(CONFIG_64BIT) && defined(CONFIG_ZONE_DMA32)
#define GFP_VMALLOC32 GFP_DMA32
#elif defined(CONFIG_64BIT) && defined(CONFIG_ZONE_DMA)
#define GFP_VMALLOC32 GFP_DMA
#else
#define GFP_VMALLOC32 GFP_KERNEL
#endif

Which seems to be incorrect, it should always -or- in the DMA flags
on top of GFP_KERNEL, thus this patch.

This fixes frequent errors launchin X with the nouveau DRM for example.
Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Andi Kleen <ak@suse.de>
Cc: Dave Airlie <airlied@linux.ie>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

7ac674f5

FRV: work around a possible compiler bug · ea02e3dd

由 David Howells 提交于 7月 19, 2007

Work around a possible bug in the FRV compiler.

What appears to be happening is that gcc resolves the
__builtin_constant_p() in kmalloc() to true, but then fails to reduce the
therefore constant conditions in the if-statements it guards to constant
results.

When compiling with -O2 or -Os, one single spurious error crops up in
cpuup_callback() in mm/slab.c.  This can be avoided by making the memsize
variable const.
Signed-off-by: NDavid Howells <dhowells@redhat.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

ea02e3dd

dequeue_huge_page() warning fix · 3abf7afd

由 Andrew Morton 提交于 7月 19, 2007

mm/hugetlb.c: In function `dequeue_huge_page':
mm/hugetlb.c:72: warning: 'nid' might be used uninitialized in this function

Cc: Christoph Lameter <clameter@sgi.com>
Cc: Adam Litke <agl@us.ibm.com>
Cc: David Gibson <hermes@gibson.dropbear.id.au>
Cc: William Lee Irwin III <wli@holomorphy.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

3abf7afd

mm: variable length argument support · b6a2fea3

由 Ollie Wild 提交于 7月 19, 2007

Remove the arg+env limit of MAX_ARG_PAGES by copying the strings directly from
the old mm into the new mm.

We create the new mm before the binfmt code runs, and place the new stack at
the very top of the address space.  Once the binfmt code runs and figures out
where the stack should be, we move it downwards.

It is a bit peculiar in that we have one task with two mm's, one of which is
inactive.

[a.p.zijlstra@chello.nl: limit stack size]
Signed-off-by: NOllie Wild <aaw@google.com>
Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
Cc: <linux-arch@vger.kernel.org>
Cc: Hugh Dickins <hugh@veritas.com>
[bunk@stusta.de: unexport bprm_mm_init]
Signed-off-by: NAdrian Bunk <bunk@stusta.de>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

b6a2fea3

readahead: sanify file_ra_state names · f9acc8c7

由 Fengguang Wu 提交于 7月 19, 2007

Rename some file_ra_state variables and remove some accessors.

It results in much simpler code.
Kudos to Rusty!
Signed-off-by: NFengguang Wu <wfg@mail.ustc.edu.cn>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

f9acc8c7

readahead: split ondemand readahead interface into two functions · cf914a7d

由 Rusty Russell 提交于 7月 19, 2007

Split ondemand readahead interface into two functions.  I think this makes it
a little clearer for non-readahead experts (like Rusty).

Internally they both call ondemand_readahead(), but the page argument is
changed to an obvious boolean flag.
Signed-off-by: NRusty Russell <rusty@rustcorp.com.au>
Signed-off-by: NFengguang Wu <wfg@mail.ustc.edu.cn>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

cf914a7d

mm: share PG_readahead and PG_reclaim · fe3cba17

由 Fengguang Wu 提交于 7月 19, 2007

Share the same page flag bit for PG_readahead and PG_reclaim.

One is used only on file reads, another is only for emergency writes.  One
is used mostly for fresh/young pages, another is for old pages.

Combinations of possible interactions are:

a) clear PG_reclaim => implicit clear of PG_readahead
	it will delay an asynchronous readahead into a synchronous one
	it actually does _good_ for readahead:
		the pages will be reclaimed soon, it's readahead thrashing!
		in this case, synchronous readahead makes more sense.

b) clear PG_readahead => implicit clear of PG_reclaim
	one(and only one) page will not be reclaimed in time
	it can be avoided by checking PageWriteback(page) in readahead first

c) set PG_reclaim => implicit set of PG_readahead
	will confuse readahead and make it restart the size rampup process
	it's a trivial problem, and can mostly be avoided by checking
	PageWriteback(page) first in readahead

d) set PG_readahead => implicit set of PG_reclaim
	PG_readahead will never be set on already cached pages.
	PG_reclaim will always be cleared on dirtying a page.
	so not a problem.

In summary,
	a)   we get better behavior
	b,d) possible interactions can be avoided
	c)   racy condition exists that might affect readahead, but the chance
	     is _really_ low, and the hurt on readahead is trivial.

Compound pages also use PG_reclaim, but for now they do not interact with
reclaim/readahead code.
Signed-off-by: NFengguang Wu <wfg@mail.ustc.edu.cn>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

fe3cba17

readahead: remove the old algorithm · c743d96b

由 Fengguang Wu 提交于 7月 19, 2007

Remove the old readahead algorithm.
Signed-off-by: NFengguang Wu <wfg@mail.ustc.edu.cn>
Cc: Steven Pratt <slpratt@austin.ibm.com>
Cc: Ram Pai <linuxram@us.ibm.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

c743d96b

readahead: convert filemap invocations · 3ea89ee8

由 Fengguang Wu 提交于 7月 19, 2007

Convert filemap reads to use on-demand readahead.

The new call scheme is to
- call readahead on non-cached page
- call readahead on look-ahead page
- update prev_index when finished with the read request
Signed-off-by: NFengguang Wu <wfg@mail.ustc.edu.cn>
Cc: Steven Pratt <slpratt@austin.ibm.com>
Cc: Ram Pai <linuxram@us.ibm.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

3ea89ee8

readahead: on-demand readahead logic · 122a21d1

由 Fengguang Wu 提交于 7月 19, 2007

This is a minimal readahead algorithm that aims to replace the current one.
It is more flexible and reliable, while maintaining almost the same behavior
and performance.  Also it is full integrated with adaptive readahead.

It is designed to be called on demand:
	- on a missing page, to do synchronous readahead
	- on a lookahead page, to do asynchronous readahead

In this way it eliminated the awkward workarounds for cache hit/miss,
readahead thrashing, retried read, and unaligned read.  It also adopts the
data structure introduced by adaptive readahead, parameterizes readahead
pipelining with `lookahead_index', and reduces the current/ahead windows to
one single window.

HEURISTICS

The logic deals with four cases:

	- sequential-next
		found a consistent readahead window, so push it forward

	- random
		standalone small read, so read as is

	- sequential-first
		create a new readahead window for a sequential/oversize request

	- lookahead-clueless
		hit a lookahead page not associated with the readahead window,
		so create a new readahead window and ramp it up

In each case, three parameters are determined:

	- readahead index: where the next readahead begins
	- readahead size:  how much to readahead
	- lookahead size:  when to do the next readahead (for pipelining)

BEHAVIORS

The old behaviors are maximally preserved for trivial sequential/random reads.
Notable changes are:

	- It no longer imposes strict sequential checks.
	  It might help some interleaved cases, and clustered random reads.
	  It does introduce risks of a random lookahead hit triggering an
	  unexpected readahead. But in general it is more likely to do good
	  than to do evil.

	- Interleaved reads are supported in a minimal way.
	  Their chances of being detected and proper handled are still low.

	- Readahead thrashings are better handled.
	  The current readahead leads to tiny average I/O sizes, because it
	  never turn back for the thrashed pages.  They have to be fault in
	  by do_generic_mapping_read() one by one.  Whereas the on-demand
	  readahead will redo readahead for them.

OVERHEADS

The new code reduced the overheads of

	- excessively calling the readahead routine on small sized reads
	  (the current readahead code insists on seeing all requests)

	- doing a lot of pointless page-cache lookups for small cached files
	  (the current readahead only turns itself off after 256 cache hits,
	  unfortunately most files are < 1MB, so never see that chance)

That accounts for speedup of
	- 0.3% on 1-page sequential reads on sparse file
	- 1.2% on 1-page cache hot sequential reads
	- 3.2% on 256-page cache hot sequential reads
	- 1.3% on cache hot `tar /lib`

However, it does introduce one extra page-cache lookup per cache miss, which
impacts random reads slightly. That's 1% overheads for 1-page random reads on
sparse file.

PERFORMANCE

The basic benchmark setup is
	- 2.6.20 kernel with on-demand readahead
	- 1MB max readahead size
	- 2.9GHz Intel Core 2 CPU
	- 2GB memory
	- 160G/8M Hitachi SATA II 7200 RPM disk

The benchmarks show that
	- it maintains the same performance for trivial sequential/random reads
	- sysbench/OLTP performance on MySQL gains up to 8%
	- performance on readahead thrashing gains up to 3 times

iozone throughput (KB/s): roughly the same
==========================================
iozone -c -t1 -s 4096m -r 64k

			       2.6.20          on-demand      gain
first run
	  "  Initial write "   61437.27        64521.53      +5.0%
	  "        Rewrite "   47893.02        48335.20      +0.9%
	  "           Read "   62111.84        62141.49      +0.0%
	  "        Re-read "   62242.66        62193.17      -0.1%
	  "   Reverse Read "   50031.46        49989.79      -0.1%
	  "    Stride read "    8657.61         8652.81      -0.1%
	  "    Random read "   13914.28        13898.23      -0.1%
	  " Mixed workload "   19069.27        19033.32      -0.2%
	  "   Random write "   14849.80        14104.38      -5.0%
	  "         Pwrite "   62955.30        65701.57      +4.4%
	  "          Pread "   62209.99        62256.26      +0.1%

second run
	  "  Initial write "   60810.31        66258.69      +9.0%
	  "        Rewrite "   49373.89        57833.66     +17.1%
	  "           Read "   62059.39        62251.28      +0.3%
	  "        Re-read "   62264.32        62256.82      -0.0%
	  "   Reverse Read "   49970.96        50565.72      +1.2%
	  "    Stride read "    8654.81         8638.45      -0.2%
	  "    Random read "   13901.44        13949.91      +0.3%
	  " Mixed workload "   19041.32        19092.04      +0.3%
	  "   Random write "   14019.99        14161.72      +1.0%
	  "         Pwrite "   64121.67        68224.17      +6.4%
	  "          Pread "   62225.08        62274.28      +0.1%

In summary, writes are unstable, reads are pretty close on average:

			  access pattern  2.6.20  on-demand   gain
				   Read  62085.61  62196.38  +0.2%
				Re-read  62253.49  62224.99  -0.0%
			   Reverse Read  50001.21  50277.75  +0.6%
			    Stride read   8656.21   8645.63  -0.1%
			    Random read  13907.86  13924.07  +0.1%
	 		 Mixed workload  19055.29  19062.68  +0.0%
				  Pread  62217.53  62265.27  +0.1%

aio-stress: roughly the same
============================
aio-stress -l -s4096 -r128 -t1 -o1 knoppix511-dvd-cn.iso
aio-stress -l -s4096 -r128 -t1 -o3 knoppix511-dvd-cn.iso

					2.6.20      on-demand  delta
			sequential	 92.57s      92.54s    -0.0%
			random		311.87s     312.15s    +0.1%

sysbench fileio: roughly the same
=================================
sysbench --test=fileio --file-io-mode=async --file-test-mode=rndrw \
	 --file-total-size=4G --file-block-size=64K \
	 --num-threads=001 --max-requests=10000 --max-time=900 run

				threads    2.6.20   on-demand    delta
		first run
				      1   59.1974s    59.2262s  +0.0%
				      2   58.0575s    58.2269s  +0.3%
				      4   48.0545s    47.1164s  -2.0%
				      8   41.0684s    41.2229s  +0.4%
				     16   35.8817s    36.4448s  +1.6%
				     32   32.6614s    32.8240s  +0.5%
				     64   23.7601s    24.1481s  +1.6%
				    128   24.3719s    23.8225s  -2.3%
				    256   23.2366s    22.0488s  -5.1%

		second run
				      1   59.6720s    59.5671s  -0.2%
				      8   41.5158s    41.9541s  +1.1%
				     64   25.0200s    23.9634s  -4.2%
				    256   22.5491s    20.9486s  -7.1%

Note that the numbers are not very stable because of the writes.
The overall performance is close when we sum all seconds up:

                sum all up               495.046s    491.514s   -0.7%

sysbench oltp (trans/sec): up to 8% gain
========================================
sysbench --test=oltp --oltp-table-size=10000000 --oltp-read-only \
	 --mysql-socket=/var/run/mysqld/mysqld.sock \
	 --mysql-user=root --mysql-password=readahead \
	 --num-threads=064 --max-requests=10000 --max-time=900 run

	10000-transactions run
				threads    2.6.20   on-demand    gain
				      1     62.81       64.56   +2.8%
				      2     67.97       70.93   +4.4%
				      4     81.81       85.87   +5.0%
				      8     94.60       97.89   +3.5%
				     16     99.07      104.68   +5.7%
				     32     95.93      104.28   +8.7%
				     64     96.48      103.68   +7.5%
	5000-transactions run
				      1     48.21       48.65   +0.9%
				      8     68.60       70.19   +2.3%
				     64     70.57       74.72   +5.9%
	2000-transactions run
				      1     37.57       38.04   +1.3%
				      2     38.43       38.99   +1.5%
				      4     45.39       46.45   +2.3%
				      8     51.64       52.36   +1.4%
				     16     54.39       55.18   +1.5%
				     32     52.13       54.49   +4.5%
				     64     54.13       54.61   +0.9%

That's interesting results. Some investigations show that
	- MySQL is accessing the db file non-uniformly: some parts are
	  more hot than others
	- It is mostly doing 4-page random reads, and sometimes doing two
	  reads in a row, the latter one triggers a 16-page readahead.
	- The on-demand readahead leaves many lookahead pages (flagged
	  PG_readahead) there. Many of them will be hit, and trigger
	  more readahead pages. Which might save more seeks.
	- Naturally, the readahead windows tend to lie in hot areas,
	  and the lookahead pages in hot areas is more likely to be hit.
	- The more overall read density, the more possible gain.

That also explains the adaptive readahead tricks for clustered random reads.

readahead thrashing: 3 times better
===================================
We boot kernel with "mem=128m single", and start a 100KB/s stream on every
second, until reaching 200 streams.

			      max throughput     min avg I/O size
		2.6.20:            5MB/s               16KB
		on-demand:        15MB/s              140KB
Signed-off-by: NFengguang Wu <wfg@mail.ustc.edu.cn>
Cc: Steven Pratt <slpratt@austin.ibm.com>
Cc: Ram Pai <linuxram@us.ibm.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

122a21d1

readahead: data structure and routines · 5ce1110b

由 Fengguang Wu 提交于 7月 19, 2007

Extend struct file_ra_state to support the on-demand readahead logic.  Also
define some helpers for it.
Signed-off-by: NFengguang Wu <wfg@mail.ustc.edu.cn>
Cc: Steven Pratt <slpratt@austin.ibm.com>
Cc: Ram Pai <linuxram@us.ibm.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

5ce1110b

readahead: MIN_RA_PAGES/MAX_RA_PAGES macros · f615bfca

由 Fengguang Wu 提交于 7月 19, 2007

Define two convenient macros for read-ahead:
	- MAX_RA_PAGES: rounded down counterpart of VM_MAX_READAHEAD
	- MIN_RA_PAGES: rounded _up_ counterpart of VM_MIN_READAHEAD

Note that the rounded up MIN_RA_PAGES will work flawlessly with _large_
page sizes like 64k.
Signed-off-by: NFengguang Wu <wfg@mail.ustc.edu.cn>
Cc: Steven Pratt <slpratt@austin.ibm.com>
Cc: Ram Pai <linuxram@us.ibm.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: <stable@kernel.org>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

f615bfca

readahead: add look-ahead support to __do_page_cache_readahead() · 46fc3e7b

由 Fengguang Wu 提交于 7月 19, 2007

Add look-ahead support to __do_page_cache_readahead().

It works by
	- mark the Nth backwards page with PG_readahead,
	(which instructs the page's first reader to invoke readahead)
	- and only do the marking for newly allocated pages.
	(to prevent blindly doing readahead on already cached pages)

Look-ahead is a technique to achieve I/O pipelining:

While the application is working through a chunk of cached pages, the kernel
reads-ahead the next chunk of pages _before_ time of need.  It effectively
hides low level I/O latencies to high level applications.
Signed-off-by: NFengguang Wu <wfg@mail.ustc.edu.cn>
Cc: Steven Pratt <slpratt@austin.ibm.com>
Cc: Ram Pai <linuxram@us.ibm.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

46fc3e7b

readahead: introduce PG_readahead · d77c2d7c

由 Fengguang Wu 提交于 7月 19, 2007

Introduce a new page flag: PG_readahead.

It acts as a look-ahead mark, which tells the page reader: Hey, it's time to
invoke the read-ahead logic.  For the sake of I/O pipelining, don't wait until
it runs out of cached pages!
Signed-off-by: NFengguang Wu <wfg@mail.ustc.edu.cn>
Cc: Steven Pratt <slpratt@austin.ibm.com>
Cc: Ram Pai <linuxram@us.ibm.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

d77c2d7c

only allow nonlinear vmas for ram backed filesystems · 3ee6dafc

由 Miklos Szeredi 提交于 7月 19, 2007

page_mkclean() doesn't re-protect ptes for non-linear mappings, so a later
re-dirty through such a mapping will not generate a fault, PG_dirty will
not reflect the dirty state and the dirty count will be skewed.  This
implies that msync() is also currently broken for nonlinear mappings.

The easiest solution is to emulate remap_file_pages on non-linear mappings
with simple mmap() for non ram-backed filesystems.  Applications continue
to work (albeit slower), as long as the number of remappings remain below
the maximum vma count.

However all currently known real uses of non-linear mappings are for ram
backed filesystems, which this patch doesn't affect.
Signed-off-by: NMiklos Szeredi <mszeredi@suse.cz>
Acked-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
Cc: William Lee Irwin III <wli@holomorphy.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

3ee6dafc

mm: fix clear_page_dirty_for_io vs fault race · 79352894

由 Nick Piggin 提交于 7月 19, 2007

Fix msync data loss and (less importantly) dirty page accounting
inaccuracies due to the race remaining in clear_page_dirty_for_io().

The deleted comment explains what the race was, and the added comments
explain how it is fixed.
Signed-off-by: NNick Piggin <npiggin@suse.de>
Acked-by: NLinus Torvalds <torvalds@linux-foundation.org>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

79352894

mm: fault feedback #2 · 83c54070

由 Nick Piggin 提交于 7月 19, 2007

This patch completes Linus's wish that the fault return codes be made into
bit flags, which I agree makes everything nicer.  This requires requires
all handle_mm_fault callers to be modified (possibly the modifications
should go further and do things like fault accounting in handle_mm_fault --
however that would be for another patch).

[akpm@linux-foundation.org: fix alpha build]
[akpm@linux-foundation.org: fix s390 build]
[akpm@linux-foundation.org: fix sparc build]
[akpm@linux-foundation.org: fix sparc64 build]
[akpm@linux-foundation.org: fix ia64 build]
Signed-off-by: NNick Piggin <npiggin@suse.de>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Ian Molton <spyro@f2s.com>
Cc: Bryan Wu <bryan.wu@analog.com>
Cc: Mikael Starvik <starvik@axis.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Hirokazu Takata <takata@linux-m32r.org>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Roman Zippel <zippel@linux-m68k.org>
Cc: Greg Ungerer <gerg@uclinux.org>
Cc: Matthew Wilcox <willy@debian.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Kazumoto Kojima <kkojima@rr.iij4u.or.jp>
Cc: Richard Curnow <rc@rc0.org.uk>
Cc: William Lee Irwin III <wli@holomorphy.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it>
Cc: Miles Bader <uclinux-v850@lsi.nec.co.jp>
Cc: Chris Zankel <chris@zankel.net>
Acked-by: NKyle McMartin <kyle@mcmartin.ca>
Acked-by: NHaavard Skinnemoen <hskinnemoen@atmel.com>
Acked-by: NRalf Baechle <ralf@linux-mips.org>
Acked-by: NAndi Kleen <ak@muc.de>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
[ Still apparently needs some ARM and PPC loving - Linus ]
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

83c54070

mm: fault feedback #1 · d0217ac0

由 Nick Piggin 提交于 7月 19, 2007

Change ->fault prototype.  We now return an int, which contains
VM_FAULT_xxx code in the low byte, and FAULT_RET_xxx code in the next byte.
 FAULT_RET_ code tells the VM whether a page was found, whether it has been
locked, and potentially other things.  This is not quite the way he wanted
it yet, but that's changed in the next patch (which requires changes to
arch code).

This means we no longer set VM_CAN_INVALIDATE in the vma in order to say
that a page is locked which requires filemap_nopage to go away (because we
can no longer remain backward compatible without that flag), but we were
going to do that anyway.

struct fault_data is renamed to struct vm_fault as Linus asked. address
is now a void __user * that we should firmly encourage drivers not to use
without really good reason.

The page is now returned via a page pointer in the vm_fault struct.
Signed-off-by: NNick Piggin <npiggin@suse.de>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

d0217ac0