提交 · 71088785c6bc68fddb450063d57b1bd1c78e0ea1 · OpenHarmony / kernel_linux

20 10月, 2008 2 次提交

mm: cleanup to make remove_memory() arch-neutral · 71088785

由 Badari Pulavarty 提交于 10月 18, 2008

There is nothing architecture specific about remove_memory().
remove_memory() function is common for all architectures which support
hotplug memory remove.  Instead of duplicating it in every architecture,
collapse them into arch neutral function.

[akpm@linux-foundation.org: fix the export]
Signed-off-by: NBadari Pulavarty <pbadari@us.ibm.com>
Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
Cc: Gary Hade <garyhade@us.ibm.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

71088785

anon_vma_prepare: properly lock even newly allocated entries · d9d332e0

由 Linus Torvalds 提交于 10月 19, 2008

The anon_vma code is very subtle, and we end up doing optimistic lookups
of anon_vmas under RCU in page_lock_anon_vma() with no locking.  Other
CPU's can also see the newly allocated entry immediately after we've
exposed it by setting "vma->anon_vma" to the new value.

We protect against the anon_vma being destroyed by having the SLAB
marked as SLAB_DESTROY_BY_RCU, so the RCU lookup can depend on the
allocation not being destroyed - but it might still be free'd and
re-allocated here to a new vma.

As a result, we should not do the anon_vma list ops on a newly allocated
vma without proper locking.
Acked-by: NNick Piggin <npiggin@suse.de>
Acked-by: NHugh Dickins <hugh@veritas.com>
Acked-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

d9d332e0

18 10月, 2008 1 次提交

Export shmem_file_setup for DRM-GEM · 395e0ddc

由 Keith Packard 提交于 6月 20, 2008

GEM needs to create shmem files to back buffer objects.  Though currently
creation of files for objects could have been driven from userland, the
modesetting work will require allocation of buffer objects before userland
is running, for boot-time message display.
Signed-off-by: NEric Anholt <eric@anholt.net>
Cc: Nick Piggin <npiggin@suse.de>
Signed-off-by: NDave Airlie <airlied@redhat.com>

395e0ddc

17 10月, 2008 6 次提交

Remove Andrew Morton's old email accounts · e1f8e874

由 Francois Cami 提交于 10月 15, 2008

People can use the real name an an index into MAINTAINERS to find the
current email address.
Signed-off-by: NFrancois Cami <francois.cami@free.fr>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

e1f8e874

Kconfig: eliminate "def_bool n" constructs · 9ba16087

由 Jan Beulich 提交于 10月 15, 2008

Using "def_bool n" is pointless, simply using bool here appears more
appropriate.

Further, retaining such options that don't have a prompt and aren't
selected by anything seems also at least questionable.
Signed-off-by: NJan Beulich <jbeulich@novell.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
Cc: Sam Ravnborg <sam@ravnborg.org>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

9ba16087

misc: replace __FUNCTION__ with __func__ · 80a914dc

由 Harvey Harrison 提交于 10月 15, 2008

__FUNCTION__ is gcc-specific, use __func__
Signed-off-by: NHarvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

80a914dc

mm: do_generic_file_read() never gets a NULL 'filp' argument · 0c6aa263

由 Krishna Kumar 提交于 10月 15, 2008

The 'filp' argument to do_generic_file_read() is never NULL.
Signed-off-by: NKrishna Kumar <krkumar2@in.ibm.com>
Reviewed-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

0c6aa263

hugetlb: handle updating of ACCESSED and DIRTY in hugetlb_fault() · b4d1d99f

由 David Gibson 提交于 10月 15, 2008

The page fault path for normal pages, if the fault is neither a no-page
fault nor a write-protect fault, will update the DIRTY and ACCESSED bits
in the page table appropriately.

The hugepage fault path, however, does not do this, handling only no-page
or write-protect type faults.  It assumes that either the ACCESSED and
DIRTY bits are irrelevant for hugepages (usually true, since they are
never swapped) or that they are handled by the arch code.

This is inconvenient for some software-loaded TLB architectures, where the
_PAGE_ACCESSED (_PAGE_DIRTY) bits need to be set to enable read (write)
access to the page at the TLB miss.  This could be worked around in the
arch TLB miss code, but the TLB miss fast path can be made simple more
easily if the hugetlb_fault() path handles this, as the normal page fault
path does.
Signed-off-by: NDavid Gibson <david@gibson.dropbear.id.au>
Cc: William Lee Irwin III <wli@holomorphy.com>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: Adam Litke <agl@us.ibm.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

b4d1d99f

mm/page_alloc.c:free_area_init_nodes() fix inappropriate use of enum · db99100d

由 Andrew Morton 提交于 10月 15, 2008

Local variable `i' is a) misleadingly-named for an `enum zone_type' and b)
used for indexing zones as well as nodes as well as node_maps.

Make it an `int'.
Reported-by: NFrans Pop <elendil@planet.nl>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

db99100d

16 10月, 2008 1 次提交

vfs: Add no_nrwrite_index_update writeback control flag · 17bc6c30

由 Aneesh Kumar K.V 提交于 10月 16, 2008

If no_nrwrite_index_update is set we don't update nr_to_write and
address space writeback_index in write_cache_pages.  This change
enables a file system to skip these updates in write_cache_pages and do
them in the writepages() callback.  This patch will be followed by an
ext4 patch that make use of these new flags.
Signed-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
CC: linux-fsdevel@vger.kernel.org

17bc6c30

14 10月, 2008 2 次提交

do_generic_file_read: s/EINTR/EIO/ if lock_page_killable() fails · 85462323

由 Oleg Nesterov 提交于 6月 08, 2008

If lock_page_killable() fails because the task was killed by SIGKILL or
any other fatal signal, do_generic_file_read() returns -EIO.

This seems to be OK, because in fact the userspace won't see this error,
the task will dequeue SIGKILL and exit.

However, /sbin/init is different, it will dequeue SIGKILL, ignore it, and
return to the user-space with the bogus -EIO.

Change the code to return the error code from lock_page_killable(), -EINTR.
This doesn't fix the bug, but perhaps makes sense anyway. Imho, with this
change the code looks a bit more logical, and the "good" init should handle
the spurious EINTR or short read.

Afaics we can also change lock_page_killable() to return -ERESTARTNOINTR,
but this can't prevent the short reads.
Signed-off-by: NOleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: NIngo Molnar <mingo@elte.hu>

85462323

vfs: Remove the range_cont writeback mode. · 74baaaae

由 Aneesh Kumar K.V 提交于 10月 14, 2008

Ext4 was the only user of range_cont writeback mode and ext4 switched
to a different method. So remove the range_cont mode which is not used
in the kernel.
Signed-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
CC: linux-fsdevel@vger.kernel.org

74baaaae

13 10月, 2008 1 次提交

integrity: special fs magic · 92562927

由 Mimi Zohar 提交于 10月 07, 2008

Discussion on the mailing list questioned the use of these
magic values in userspace, concluding these values are already
exported to userspace via statfs and their correct/incorrect
usage is left up to the userspace application.

  - Move special fs magic number definitions to magic.h
  - Add magic.h include
Signed-off-by: NMimi Zohar <zohar@us.ibm.com>
Reviewed-by: NJames Morris <jmorris@namei.org>
Signed-off-by: NJames Morris <jmorris@namei.org>

92562927

10 10月, 2008 1 次提交

SLOB: fix bogus ksize calculation fix · 70096a56

由 Matt Mackall 提交于 10月 08, 2008

This fixes the previous fix, which was completely wrong on closer
inspection. This version has been manually tested with a user-space
test harness and generates sane values. A nearly identical patch has
been boot-tested.

The problem arose from changing how kmalloc/kfree handled alignment
padding without updating ksize to match. This brings it in sync.
Signed-off-by: NMatt Mackall <mpm@selenic.com>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

70096a56

09 10月, 2008 1 次提交
- J
  highmem: use bio_has_data() in the bounce path · 36144077
  由 Jens Axboe 提交于 8月 14, 2008
```
Signed-off-by: NJens Axboe <jens.axboe@oracle.com>
```
  36144077
08 10月, 2008 1 次提交

SLOB: fix bogus ksize calculation · 85ba94ba

由 Matt Mackall 提交于 10月 07, 2008

SLOB's ksize calculation was braindamaged and generally harmlessly
underreported the allocation size. But for very small buffers, it could
in fact overreport them, leading code depending on krealloc to overrun
the allocation and trample other data.
Signed-off-by: NMatt Mackall <mpm@selenic.com>
Tested-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

85ba94ba

03 10月, 2008 3 次提交

mm: handle initialising compound pages at orders greater than MAX_ORDER · 6babc32c

由 Andy Whitcroft 提交于 10月 02, 2008

When we initialise a compound page we initialise the page flags and head
page pointer for all base pages spanned by that page.  When we initialise
a gigantic page (a page of order greater than or equal to MAX_ORDER) we
have to initialise more than MAX_ORDER_NR_PAGES pages.  Currently we
assume that all elements of the mem_map in this page are contigious in
memory.  However this is only guarenteed out to MAX_ORDER_NR_PAGES pages,
and with SPARSEMEM enabled they will not be contigious.  This leads us to
walk off the end of the first section and scribble on everything which
follows, BAD.

When we reach a MAX_ORDER_NR_PAGES boundary we much locate the next
section of the mem_map.  As gigantic pages can only be maximally aligned
we know this will occur at exact multiple of MAX_ORDER_NR_PAGES pages from
the start of the page.

This is a bug fix for the gigantic page support in hugetlbfs.

Credit to Mel Gorman for spotting the issue.
Signed-off-by: NAndy Whitcroft <apw@shadowen.org>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Jon Tollefson <kniht@linux.vnet.ibm.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

6babc32c

mm: tiny-shmem nommu fix · 4b19de6d

由 Nick Piggin 提交于 10月 02, 2008

The previous patch db203d53 ("mm:
tiny-shmem fix lock ordering: mmap_sem vs i_mutex") to fix the lock
ordering in tiny-shmem breaks shared anonymous and IPC memory on NOMMU
architectures because it was using the expanding truncate to signal ramfs
to allocate a physically contiguous RAM backing the inode (otherwise it is
unusable for "memory mapping" it to userspace).

However do_truncate is what caused the lock ordering error, due to it
taking i_mutex.  In this case, we can actually just call ramfs directly to
allocate memory for the mapping, rather than go via truncate.
Acked-by: NDavid Howells <dhowells@redhat.com>
Acked-by: NHugh Dickins <hugh@veritas.com>
Signed-off-by: NNick Piggin <npiggin@suse.de>
Cc: Matt Mackall <mpm@selenic.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

4b19de6d

memory hotplug: missing zone->lock in test_pages_isolated() · 6c1b7f68

由 Gerald Schaefer 提交于 10月 02, 2008

__test_page_isolated_in_pageblock() in mm/page_isolation.c has a comment
saying that the caller must hold zone->lock. But the only caller of that
function, test_pages_isolated(), does not hold zone->lock and the lock is
also not acquired anywhere before. This patch adds the missing zone->lock
to test_pages_isolated().

We reproducibly run into BUG_ON(!PageBuddy(page)) in __offline_isolated_pages()
during memory hotplug stress test, see trace below. This patch fixes that
problem, it would be good if we could have it in 2.6.27.

kernel BUG at /home/autobuild/BUILD/linux-2.6.26-20080909/mm/page_alloc.c:4561!
illegal operation: 0001 [#1] PREEMPT SMP
Modules linked in: dm_multipath sunrpc bonding qeth_l3 dm_mod qeth ccwgroup vmur
CPU: 1 Not tainted 2.6.26-29.x.20080909-s390default #1
Process memory_loop_all (pid: 10025, task: 2f444028, ksp: 2b10dd28)
Krnl PSW : 040c0000 801727ea (__offline_isolated_pages+0x18e/0x1c4)
 R:0 T:1 IO:0 EX:0 Key:0 M:1 W:0 P:0 AS:0 CC:0 PM:0
Krnl GPRS: 00000000 7e27fc00 00000000 7e27fc00
 00000000 00000400 00014000 7e27fc01
 00606f00 7e27fc00 00013fe0 2b10dd28
 00000005 80172662 801727b2 2b10dd28
Krnl Code: 801727de: 5810900c l %r1,12(%r9)
 801727e2: a7f4ffb3 brc 15,80172748
 801727e6: a7f40001 brc 15,801727e8
 >801727ea: a7f4ffbc brc 15,80172762
 801727ee: a7f40001 brc 15,801727f0
 801727f2: a7f4ffaf brc 15,80172750
 801727f6: 0707 bcr 0,%r7
 801727f8: 0017 unknown
Call Trace:
([<0000000000172772>] __offline_isolated_pages+0x116/0x1c4)
 [<00000000001953a2>] offline_isolated_pages_cb+0x22/0x34
 [<000000000013164c>] walk_memory_resource+0xcc/0x11c
 [<000000000019520e>] offline_pages+0x36a/0x498
 [<00000000001004d6>] remove_memory+0x36/0x44
 [<000000000028fb06>] memory_block_change_state+0x112/0x150
 [<000000000028ffb8>] store_mem_state+0x90/0xe4
 [<0000000000289c00>] sysdev_store+0x34/0x40
 [<00000000001ee048>] sysfs_write_file+0xd0/0x178
 [<000000000019b1a8>] vfs_write+0x74/0x118
 [<000000000019b9ae>] sys_write+0x46/0x7c
 [<000000000011160e>] sysc_do_restart+0x12/0x16
 [<0000000077f3e8ca>] 0x77f3e8ca
Signed-off-by: NGerald Schaefer <gerald.schaefer@de.ibm.com>
Acked-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

6c1b7f68

29 9月, 2008 1 次提交

mm owner: fix race between swapoff and exit · 31a78f23

由 Balbir Singh 提交于 9月 28, 2008

There's a race between mm->owner assignment and swapoff, more easily
seen when task slab poisoning is turned on.  The condition occurs when
try_to_unuse() runs in parallel with an exiting task.  A similar race
can occur with callers of get_task_mm(), such as /proc/<pid>/<mmstats>
or ptrace or page migration.

CPU0                                    CPU1
                                        try_to_unuse
                                        looks at mm = task0->mm
                                        increments mm->mm_users
task 0 exits
mm->owner needs to be updated, but no
new owner is found (mm_users > 1, but
no other task has task->mm = task0->mm)
mm_update_next_owner() leaves
                                        mmput(mm) decrements mm->mm_users
task0 freed
                                        dereferencing mm->owner fails

The fix is to notify the subsystem via mm_owner_changed callback(),
if no new owner is found, by specifying the new task as NULL.

Jiri Slaby:
mm->owner was set to NULL prior to calling cgroup_mm_owner_callbacks(), but
must be set after that, so as not to pass NULL as old owner causing oops.

Daisuke Nishimura:
mm_update_next_owner() may set mm->owner to NULL, but mem_cgroup_from_task()
and its callers need to take account of this situation to avoid oops.

Hugh Dickins:
Lockdep warning and hang below exec_mmap() when testing these patches.
exit_mm() up_reads mmap_sem before calling mm_update_next_owner(),
so exec_mmap() now needs to do the same.  And with that repositioning,
there's now no point in mm_need_new_owner() allowing for NULL mm.
Reported-by: NHugh Dickins <hugh@veritas.com>
Signed-off-by: NBalbir Singh <balbir@linux.vnet.ibm.com>
Signed-off-by: NJiri Slaby <jirislaby@gmail.com>
Signed-off-by: NDaisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: NHugh Dickins <hugh@veritas.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Paul Menage <menage@google.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

31a78f23

23 9月, 2008 2 次提交

memcg: check under limit at shrink_usage · a10cebf5

由 Daisuke Nishimura 提交于 9月 22, 2008

Current memory cgroup(both in mainline and -mm) doesn't account swap
caches as memory(swap cache support is dropped temporarily now).

So try_to_free_mem_cgroup_pages doesn't reflect the count of pages that
have been moved to swap cache.

But this makes mem_cgroup_shrink_usage fail easily if most of the pages
are anon/shmem, and then shmem_getpage returns -ENOMEM and the process
will be killed.

This patch adds res_counter_check_under_limit to avoid these cases.

BTW, even if swap cache support is enabled again, if a process is moved to
another cgroup, which has been just made, between precharge and
shrink_usage in shmem_getpage, shrink_usage may fail just because there is
no pages to reclaim.

So this change would make sense anyway.
Signed-off-by: NDaisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Hugh Dickins <hugh@veritas.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

a10cebf5

mm: tiny-shmem fix lock ordering: mmap_sem vs i_mutex · db203d53

由 Nick Piggin 提交于 9月 22, 2008

tiny-shmem calls do_truncate in shmem_file_setup.  do_truncate takes
i_mutex, and shmem_file_setup is called with mmap_sem held.  However
i_mutex nests outside mmap_sem.

Copy the code in shmem.c to avoid this problem.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: NNick Piggin <npiggin@suse.de>
Reported-and-tested-by: NIngo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Matt Mackall <mpm@selenic.com>
Cc: Hugh Dickins <hugh@veritas.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

db203d53

15 9月, 2008 1 次提交

slub: fixed uninitialized counter in struct kmem_cache_node · 02b71b70

由 Salman Qazi 提交于 9月 11, 2008

Initialized total objects atomic for the node in init_kmem_cache_node.  The
uninitialized value was ruining the stats in /proc/slabinfo.
Acked-by: NChristoph Lameter <cl@linux-foundation.org>
Signed-off-by: NSalman Qazi <sqazi@google.com>
Signed-off-by: NPekka Enberg <penberg@cs.helsinki.fi>

02b71b70

14 9月, 2008 2 次提交

generic: add phys_addr_t for holding physical addresses · 600715dc

由 Jeremy Fitzhardinge 提交于 9月 11, 2008

Add a kernel-wide "phys_addr_t" which is guaranteed to be able to hold
any physical address.  By default it equals the word size of the
architecture, but a 32-bit architecture can set ARCH_PHYS_ADDR_T_64BIT
if it needs a 64-bit phys_addr_t.
Signed-off-by: NJeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: NIngo Molnar <mingo@elte.hu>

600715dc

mm: mark the correct zone as full when scanning zonelists · 5bead2a0

由 Mel Gorman 提交于 9月 13, 2008

The iterator for_each_zone_zonelist() uses a struct zoneref *z cursor when
scanning zonelists to keep track of where in the zonelist it is.  The
zoneref that is returned corresponds to the the next zone that is to be
scanned, not the current one.  It was intended to be treated as an opaque
list.

When the page allocator is scanning a zonelist, it marks elements in the
zonelist corresponding to zones that are temporarily full.  As the
zonelist is being updated, it uses the cursor here;

  if (NUMA_BUILD)
        zlc_mark_zone_full(zonelist, z);

This is intended to prevent rescanning in the near future but the zoneref
cursor does not correspond to the zone that has been found to be full.
This is an easy misunderstanding to make so this patch corrects the
problem by changing zoneref cursor to be the current zone being scanned
instead of the next one.
Signed-off-by: NMel Gorman <mel@csn.ul.ie>
Cc: Andy Whitcroft <apw@shadowen.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: <stable@kernel.org>		[2.6.26.x]
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

5bead2a0

04 9月, 2008 1 次提交

mmap: fix petty bug in anonymous shared mmap offset handling · ce363942

由 Tejun Heo 提交于 9月 03, 2008

Anonymous mappings should ignore offset but shared anonymous mapping
forgot to clear it and makes the following legit test program trigger
SIGBUS.

 #include <sys/mman.h>
 #include <stdio.h>
 #include <errno.h>

 #define PAGE_SIZE	4096

 int main(void)
 {
	 char *p;
	 int i;

	 p = mmap(NULL, 2 * PAGE_SIZE, PROT_READ|PROT_WRITE,
		  MAP_SHARED|MAP_ANONYMOUS, -1, PAGE_SIZE);
	 if (p == MAP_FAILED) {
		 perror("mmap");
		 return 1;
	 }

	 for (i = 0; i < 2; i++) {
		 printf("page %d\n", i);
		 p[i * 4096] = i;
	 }
	 return 0;
 }

Fix it.
Signed-off-by: NTejun Heo <tj@kernel.org>
Acked-by: NHugh Dickins <hugh@veritas.com>
Acked-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

ce363942

03 9月, 2008 4 次提交

mm: size of quicklists shouldn't be proportional to the number of CPUs · b9541852

由 KOSAKI Motohiro 提交于 9月 02, 2008

Quicklists store pages for each CPU as caches.  (Each CPU can cache
node_free_pages/16 pages)

It is used for page table cache.  exit() will increase the cache size,
while fork() consumes it.

So for example if an apache-style application runs (one parent and many
child model), one CPU process will fork() while another CPU will process
the middleware work and exit().

At that time, the CPU on which the parent runs doesn't have page table
cache at all.  Others (on which children runs) have maximum caches.

	QList_max = (#ofCPUs - 1) x Free / 16
	=> QList_max / (Free + QList_max) = (#ofCPUs - 1) / (16 + #ofCPUs - 1)

So, How much quicklist memory is used in the maximum case?

This is proposional to # of CPUs because the limit of per cpu quicklist
cache doesn't see the number of cpus.

Above calculation mean

	 Number of CPUs per node            2    4    8   16
	 ==============================  ====================
	 QList_max / (Free + QList_max)   5.8%  16%  30%  48%

Wow! Quicklist can spend about 50% memory at worst case.

My demonstration program is here
--------------------------------------------------------------------------------
#define _GNU_SOURCE

#include <stdio.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <sched.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/wait.h>

#define BUFFSIZE 512

int max_cpu(void)	/* get max number of logical cpus from /proc/cpuinfo */
{
  FILE *fd;
  char *ret, buffer[BUFFSIZE];
  int cpu = 1;

  fd = fopen("/proc/cpuinfo", "r");
  if (fd == NULL) {
    perror("fopen(/proc/cpuinfo)");
    exit(EXIT_FAILURE);
  }
  while (1) {
    ret = fgets(buffer, BUFFSIZE, fd);
    if (ret == NULL)
      break;
    if (!strncmp(buffer, "processor", 9))
      cpu = atoi(strchr(buffer, ':') + 2);
  }
  fclose(fd);
  return cpu;
}

void cpu_bind(int cpu)	/* bind current process to one cpu */
{
  cpu_set_t mask;
  int ret;

  CPU_ZERO(&mask);
  CPU_SET(cpu, &mask);
  ret = sched_setaffinity(0, sizeof(mask), &mask);
  if (ret == -1) {
    perror("sched_setaffinity()");
    exit(EXIT_FAILURE);
  }
  sched_yield();	/* not necessary */
}

#define MMAP_SIZE (10 * 1024 * 1024)	/* 10 MB */
#define FORK_INTERVAL 1	/* 1 second */

main(int argc, char *argv[])
{
  int cpu_max, nextcpu;
  long pagesize;
  pid_t pid;

  /* set max number of logical cpu */
  if (argc > 1)
    cpu_max = atoi(argv[1]) - 1;
  else
    cpu_max = max_cpu();

  /* get the page size */
  pagesize = sysconf(_SC_PAGESIZE);
  if (pagesize == -1) {
    perror("sysconf(_SC_PAGESIZE)");
    exit(EXIT_FAILURE);
  }

  /* prepare parent process */
  cpu_bind(0);
  nextcpu = cpu_max;

loop:

  /* select destination cpu for child process by round-robin rule */
  if (++nextcpu > cpu_max)
    nextcpu = 1;

  pid = fork();

  if (pid == 0) { /* child action */

    char *p;
    int i;

    /* consume page tables */
    p = mmap(0, MMAP_SIZE, PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, 0, 0);
    i = MMAP_SIZE / pagesize;
    while (i-- > 0) {
      *p = 1;
      p += pagesize;
    }

    /* move to other cpu */
    cpu_bind(nextcpu);
/*
    printf("a child moved to cpu%d after mmap().\n", nextcpu);
    fflush(stdout);
 */

    /* back page tables to pgtable_quicklist */
    exit(0);

  } else if (pid > 0) { /* parent action */

    sleep(FORK_INTERVAL);
    waitpid(pid, NULL, WNOHANG);

  }

  goto loop;
}
----------------------------------------

When above program which does task migration runs, my 8GB box spends
800MB of memory for quicklist.  This is not memory leak but doesn't seem
good.

% cat /proc/meminfo

MemTotal:        7701568 kB
MemFree:         4724672 kB
(snip)
Quicklists:       844800 kB

because

- My machine spec is
	number of numa node: 2
	number of cpus:      8 (4CPU x2 node)
        total mem:           8GB (4GB x2 node)
        free mem:            about 5GB

- Then, 4.7GB x 16% ~= 880MB.
  So, Quicklist can use 800MB.

So, if following spec machine run that program

   CPUs: 64 (8cpu x 8node)
   Mem:  1TB (128GB x8node)

Then, quicklist can waste 300GB (= 1TB x 30%).  It is too large.

So, I don't like cache policies which is proportional to # of cpus.

My patch changes the number of caches
from:
   per-cpu-cache-amount = memory_on_node / 16
to
   per-cpu-cache-amount = memory_on_node / 16 / number_of_cpus_on_node.
Signed-off-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Keiichiro Tokunaga <tokunaga.keiich@jp.fujitsu.com>
Acked-by: NChristoph Lameter <cl@linux-foundation.org>
Tested-by: NDavid Miller <davem@davemloft.net>
Acked-by: NMike Travis <travis@sgi.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

b9541852

mm/bootmem: silence section mismatch warning - contig_page_data/bootmem_node_data · 52765583

由 Marcin Slusarz 提交于 9月 02, 2008

WARNING: vmlinux.o(.data+0x1f5c0): Section mismatch in reference from the variable contig_page_data to the variable .init.data:bootmem_node_data
The variable contig_page_data references
the variable __initdata bootmem_node_data
If the reference is valid then annotate the
variable with __init* (see linux/init.h) or name the variable:
*driver, *_template, *_timer, *_sht, *_ops, *_probe, *_probe_one, *_console,
Signed-off-by: NMarcin Slusarz <marcin.slusarz@gmail.com>
Cc: Johannes Weiner <hannes@saeurebad.de>
Cc: Sean MacLennan <smaclennan@pikatech.com>
Cc: Sam Ravnborg <sam@ravnborg.org>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

52765583

VFS: fix dio write returning EIO when try_to_release_page fails · 6ccfa806

由 Hisashi Hifumi 提交于 9月 02, 2008

Dio write returns EIO when try_to_release_page fails because bh is
still referenced.

The patch

    commit 3f31fddf
    Author: Mingming Cao <cmm@us.ibm.com>
    Date:   Fri Jul 25 01:46:22 2008 -0700

        jbd: fix race between free buffer and commit transaction

was merged into 2.6.27-rc1, but I noticed that this patch is not enough
to fix the race.

I did fsstress test heavily to 2.6.27-rc1, and found that dio write still
sometimes got EIO through this test.

The patch above fixed race between freeing buffer(dio) and committing
transaction(jbd) but I discovered that there is another race, freeing
buffer(dio) and ext3/4_ordered_writepage.

: background_writeout()
     ->write_cache_pages()
       ->ext3_ordered_writepage()
     	   walk_page_buffers() -> take a bh ref
 	   block_write_full_page() -> unlock_page
		: <- end_page_writeback
                : <- race! (dio write->try_to_release_page fails)
      	   walk_page_buffers() ->release a bh ref

ext3_ordered_writepage holds bh ref and does unlock_page remaining
taking a bh ref, so this causes the race and failure of
try_to_release_page.

To fix this race, I used the approach of falling back to buffered
writes if try_to_release_page() fails on a page.

[akpm@linux-foundation.org: cleanups]
Signed-off-by: NHisashi Hifumi <hifumi.hisashi@oss.ntt.co.jp>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Mingming Cao <cmm@us.ibm.com>
Cc: Zach Brown <zach.brown@oracle.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

6ccfa806

mm: make setup_zone_migrate_reserve() aware of overlapping nodes · 344c790e

由 Adam Litke 提交于 9月 02, 2008

I have gotten to the root cause of the hugetlb badness I reported back on
August 15th.  My system has the following memory topology (note the
overlapping node):

            Node 0 Memory: 0x8000000-0x44000000
            Node 1 Memory: 0x0-0x8000000 0x44000000-0x80000000

setup_zone_migrate_reserve() scans the address range 0x0-0x8000000 looking
for a pageblock to move onto the MIGRATE_RESERVE list.  Finding no
candidates, it happily continues the scan into 0x8000000-0x44000000.  When
a pageblock is found, the pages are moved to the MIGRATE_RESERVE list on
the wrong zone.  Oops.

setup_zone_migrate_reserve() should skip pageblocks in overlapping nodes.
Signed-off-by: NAdam Litke <agl@us.ibm.com>
Acked-by: NMel Gorman <mel@csn.ul.ie>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Cc: Nishanth Aravamudan <nacc@us.ibm.com>
Cc: Andy Whitcroft <apw@shadowen.org>
Cc: <stable@kernel.org>		[2.6.25.x, 2.6.26.x]
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

344c790e

02 9月, 2008 1 次提交
- D
  Remove '#include <stddef.h>' from mm/page_isolation.c · 0ed97ee4
  由 David Woodhouse 提交于 9月 01, 2008
```
Signed-off-by: NDavid Woodhouse <David.Woodhouse@intel.com>
```
  0ed97ee4
28 8月, 2008 1 次提交

[ARM] Skip memory holes in FLATMEM when reading /proc/pagetypeinfo · e80d6a24

由 Mel Gorman 提交于 8月 14, 2008

Ordinarily, memory holes in flatmem still have a valid memmap and is safe
to use. However, an architecture (ARM) frees up the memmap backing memory
holes on the assumption it is never used. /proc/pagetypeinfo reads the
whole range of pages in a zone believing that the memmap is valid and that
pfn_valid will return false if it is not. On ARM, freeing the memmap breaks
the page->zone linkages even though pfn_valid() returns true and the kernel
can oops shortly afterwards due to accessing a bogus struct zone *.

This patch lets architectures say when FLATMEM can have holes in the
memmap. Rather than an expensive check for valid memory, /proc/pagetypeinfo
will confirm that the page linkages are still valid by checking page->zone
is still the expected zone. The lookup of page_zone is safe as there is a
limited range of memory that is accessed when calling page_zone. Even if
page_zone happens to return the correct zone, the impact is that the counters
in /proc/pagetypeinfo are slightly off but fragmentation monitoring is
unlikely to be relevant on an embedded system.
Reported-by: NH Hartley Sweeten <hsweeten@visionengravers.com>
Signed-off-by: NMel Gorman <mel@csn.ul.ie>
Tested-by: NH Hartley Sweeten <hsweeten@visionengravers.com>
Signed-off-by: NRussell King <rmk+kernel@arm.linux.org.uk>

e80d6a24

21 8月, 2008 8 次提交

mm: xip/ext2 fix block allocation race · 14bac5ac

由 Nick Piggin 提交于 8月 20, 2008

XIP can call into get_xip_mem concurrently with the same file,offset with
create=1.  This usually maps down to get_block, which expects the page
lock to prevent such a situation.  This causes ext2 to explode for one
reason or another.

Serialise those calls for the moment.  For common usages today, I suspect
get_xip_mem rarely is called to create new blocks.  In future as XIP
technologies evolve we might need to look at which operations require
scalability, and rework the locking to suit.
Signed-off-by: NNick Piggin <npiggin@suse.de>
Cc: Jared Hulbert <jaredeh@gmail.com>
Acked-by: NCarsten Otte <cotte@freenet.de>
Cc: Hugh Dickins <hugh@veritas.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

14bac5ac

mm: xip fix fault vs sparse page invalidate race · 538f8ea6

由 Nick Piggin 提交于 8月 20, 2008

XIP has a race between sparse pages being inserted into page tables, and
sparse pages being zapped when its time to put a non-sparse page in.

What can happen is that a process can be left with a dangling sparse page
in a MAP_SHARED mapping, while the rest of the world sees the non-sparse
version.  Ie.  data corruption.

Guard these operations with a seqlock, making fault-in-sparse-pages the
slowpath, and try-to-unmap-sparse-pages the fastpath.
Signed-off-by: NNick Piggin <npiggin@suse.de>
Cc: Jared Hulbert <jaredeh@gmail.com>
Acked-by: NCarsten Otte <cotte@freenet.de>
Cc: Hugh Dickins <hugh@veritas.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

538f8ea6

mm: dirty page tracking race fix · 479db0bf

由 Nick Piggin 提交于 8月 20, 2008

There is a race with dirty page accounting where a page may not properly
be accounted for.

clear_page_dirty_for_io() calls page_mkclean; then TestClearPageDirty.

page_mkclean walks the rmaps for that page, and for each one it cleans and
write protects the pte if it was dirty.  It uses page_check_address to
find the pte.  That function has a shortcut to avoid the ptl if the pte is
not present.  Unfortunately, the pte can be switched to not-present then
back to present by other code while holding the page table lock -- this
should not be a signal for page_mkclean to ignore that pte, because it may
be dirty.

For example, powerpc64's set_pte_at will clear a previously present pte
before setting it to the desired value.  There may also be other code in
core mm or in arch which do similar things.

The consequence of the bug is loss of data integrity due to msync, and
loss of dirty page accounting accuracy.  XIP's __xip_unmap could easily
also be unreliable (depending on the exact XIP locking scheme), which can
lead to data corruption.

Fix this by having an option to always take ptl to check the pte in
page_check_address.

It's possible to retain this optimization for page_referenced and
try_to_unmap.
Signed-off-by: NNick Piggin <npiggin@suse.de>
Cc: Jared Hulbert <jaredeh@gmail.com>
Cc: Carsten Otte <cotte@freenet.de>
Cc: Hugh Dickins <hugh@veritas.com>
Acked-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

479db0bf

bootmem: fix aligning of node-relative indexes and offsets · 481ebd0d

由 Johannes Weiner 提交于 8月 20, 2008

Absolute alignment requirements may never be applied to node-relative
offsets.  Andreas Herrmann spotted this flaw when a bootmem allocation on
an unaligned node was itself not aligned because the combination of an
unaligned node with an aligned offset into that node is not garuanteed to
be aligned itself.

This patch introduces two helper functions that align a node-relative
index or offset with respect to the node's starting address so that the
absolute PFN or virtual address that results from combining the two
satisfies the requested alignment.

Then all the broken ALIGN()s in alloc_bootmem_core() are replaced by these
helpers.
Signed-off-by: NJohannes Weiner <hannes@saeurebad.de>
Reported-by: NAndreas Herrmann <andreas.herrmann3@amd.com>
Debugged-by: NAndreas Herrmann <andreas.herrmann3@amd.com>
Reviewed-by: NAndreas Herrmann <andreas.herrmann3@amd.com>
Tested-by: NAndreas Herrmann <andreas.herrmann3@amd.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

481ebd0d

mm: mminit_loglevel cannot be __meminitdata anymore · 759f9a2d

由 Marcin Slusarz 提交于 8月 20, 2008

mminit_loglevel is now used from mminit_verify_zonelist <- build_all_zonelists <-

1. online_pages <- memory_block_action <- memory_block_change_state <- store_mem_state (sys handler)
2. numa_zonelist_order_handler (proc handler)

so it cannot be annotated __meminit - drop it

fixes following section mismatch warning:
WARNING: vmlinux.o(.text+0x71628): Section mismatch in reference from the function mminit_verify_zonelist() to the variable .meminit.data:mminit_loglevel
The function mminit_verify_zonelist() references
the variable __meminitdata mminit_loglevel.
This is often because mminit_verify_zonelist lacks a __meminitdata
annotation or the annotation of mminit_loglevel is wrong.
Signed-off-by: NMarcin Slusarz <marcin.slusarz@gmail.com>
Acked-by: NMel Gorman <mel@csn.ul.ie>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

759f9a2d

mm: show free swap as signed · 07279cdf

由 Hugh Dickins 提交于 8月 20, 2008

Adjust <Alt><SysRq>m show_swap_cache_info() to show "Free swap" as a
signed long: the signed format is preferable, because during swapoff
nr_swap_pages can legitimately go negative, so makes more sense thus
(it used to be shown redundantly, once as signed and once as unsigned).
Signed-off-by: NHugh Dickins <hugh@veritas.com>
Reviewed-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

07279cdf

mm: page_remove_rmap comments on PageAnon · 16f8c5b2

由 Hugh Dickins 提交于 8月 20, 2008

Add a comment to s390's page_test_dirty/page_clear_dirty/page_set_dirty
dance in page_remove_rmap(): I was wrong to think the PageSwapCache test
could be avoided, and would like a comment in there to remind me. And
mention s390, to help us remember that this block is not really common.

Also move down the "It would be tidy to reset PageAnon" comment: it does
not belong to s390's block, and it would be unwise to reset PageAnon
before we're done with testing it.
Signed-off-by: NHugh Dickins <hugh@veritas.com>
Acked-by: NMartin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

16f8c5b2

slub: Disable NUMA remote node defragmentation by default · e2cb96b7

由 Christoph Lameter 提交于 8月 19, 2008

Switch remote node defragmentation off by default. The current settings can
cause excessive node local allocations with hackbench:

  SLAB:

    % cat /proc/meminfo
    MemTotal:        7701760 kB
    MemFree:         5940096 kB
    Slab:             123840 kB

  SLUB:

    % cat /proc/meminfo
    MemTotal:        7701376 kB
    MemFree:         4740928 kB
    Slab:            1591680 kB

[Note: this feature is not related to slab defragmentation.]

You can find the original discussion here:

  http://lkml.org/lkml/2008/8/4/308Reported-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Tested-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: NChristoph Lameter <cl@linux-foundation.org>
Signed-off-by: NPekka Enberg <penberg@cs.helsinki.fi>

e2cb96b7

OpenHarmony / kernel_linux 上一次同步 大约 4 年

OpenHarmony / kernel_linux
上一次同步大约 4 年