Commits · 5beb49305251e5669852ed541e8e2f2f7696c53e · xiphi1978 / linux

07 Mar, 2010 1 commit

mm: change anon_vma linking to fix multi-process server scalability issue · 5beb4930

Authored by Rik van Riel on Mar 05, 2010

The old anon_vma code can lead to scalability issues with heavily forking
workloads.  Specifically, each anon_vma will be shared between the parent
process and all its child processes.

In a workload with 1000 child processes and a VMA with 1000 anonymous
pages per process that get COWed, this leads to a system with a million
anonymous pages in the same anon_vma, each of which is mapped in just one
of the 1000 processes.  However, the current rmap code needs to walk them
all, leading to O(N) scanning complexity for each page.

This can result in systems where one CPU is walking the page tables of
1000 processes in page_referenced_one, while all other CPUs are stuck on
the anon_vma lock.  This leads to catastrophic failure for a benchmark
like AIM7, where the total number of processes can reach the tens of
thousands.  Real workloads are still a factor of 10 less process intensive
than AIM7, but they are catching up.

This patch changes the way anon_vmas and VMAs are linked, which allows us
to associate multiple anon_vmas with a VMA.  At fork time, each child
process gets its own anon_vmas, in which its COWed pages will be
instantiated.  The parent's anon_vma is also linked to the VMA, because
non-COWed pages could be present in any of the children.

This reduces rmap scanning complexity to O(1) for the pages of the 1000
child processes, with O(N) complexity for at most 1/N pages in the system.
This reduces the average scanning cost in heavily forking workloads from
O(N) to 2.
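
The "O(N) to 2" figure can be checked with a rough cost model.  The sketch
below is not kernel code.  It assumes that a child's COWed page is found by
visiting one VMA, and that a fraction 1/N of all pages still sits in the
shared parent anon_vma and costs N visits.  The function names are made up
for illustration.

```c
#include <assert.h>

/* Old scheme: every page lives in one shared anon_vma, so the rmap walk
 * visits the VMA of each of the n processes. */
static double old_avg_scan_cost(unsigned n)
{
	assert(n > 0);
	return (double)n;
}

/* New scheme (model assumption): a fraction 1/n of the pages stays in the
 * parent's anon_vma and costs n visits; every other page costs one. */
static double new_avg_scan_cost(unsigned n)
{
	assert(n > 0);
	double shared = 1.0 / n;

	return shared * n + (1.0 - shared) * 1.0;
}
```

Under these assumptions the average for 1000 processes comes out just
below 2 visits per page, against 1000 in the old scheme.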

The only real complexity in this patch stems from the fact that linking a
VMA to anon_vmas now involves memory allocations.  This means vma_adjust
can fail, if it needs to attach a VMA to anon_vma structures.  This in
turn means error handling needs to be added to the calling functions.

A second source of complexity is that, because there can be multiple
anon_vmas, the anon_vma linking in vma_adjust can no longer be done under
"the" anon_vma lock.  To prevent the rmap code from walking up an
incomplete VMA, this patch introduces the VM_LOCK_RMAP VMA flag.  This bit
flag uses the same slot as the NOMMU VM_MAPPED_COPY, with an ifdef in mm.h
to make sure it is impossible to compile a kernel that needs both symbolic
values for the same bitflag.

Some test results:

Without the anon_vma changes, when AIM7 hits around 9.7k users (on a test
box with 16GB RAM and not quite enough IO), the system ends up running
>99% in system time, with every CPU on the same anon_vma lock in the
pageout code.

With these changes, AIM7 hits the cross-over point around 29.7k users.
This happens with ~99% IO wait time; there never seems to be any spike in
system time.  The anon_vma lock contention appears to be resolved.

[akpm@linux-foundation.org: cleanups]
Signed-off-by: Rik van Riel <riel@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Larry Woodman <lwoodman@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

5beb4930

17 Jan, 2010 4 commits

nommu: fix shared mmap after truncate shrinkage problems · 7e660872

Authored by David Howells on Jan 15, 2010

Fix a problem in NOMMU mmap with ramfs whereby a shared mmap can happen
over the end of a truncation.  The problem is that
ramfs_nommu_check_mappings() checks the reduced file size against the
VMA tree, but not the vm_region tree.

The following sequence of events can cause the problem:

	fd = open("/tmp/x", O_RDWR|O_TRUNC|O_CREAT, 0600);
	ftruncate(fd, 32 * 1024);
	a = mmap(NULL, 32 * 1024, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
	b = mmap(NULL, 16 * 1024, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
	munmap(a, 32 * 1024);
	ftruncate(fd, 16 * 1024);
	c = mmap(NULL, 32 * 1024, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);

Mapping 'a' creates a vm_region covering 32KB of the file.  Mapping 'b'
sees that the vm_region from 'a' is covering the region it wants and so
shares it, pinning it in memory.

Mapping 'a' then goes away and the file is truncated to the end of VMA
'b'.  However, the region allocated by 'a' is still in effect, and has
_not_ been reduced.

Mapping 'c' is then created, and because there's a vm_region covering the
desired region, get_unmapped_area() is _not_ called to repeat the check,
and the mapping is granted, even though the pages from the latter half of
the mapping have been discarded.

However:

	d = mmap(NULL, 16 * 1024, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);

Mapping 'd' should work, and should end up sharing the region allocated by
'a'.

To deal with this, we shrink the vm_region struct during the truncation,
lest do_mmap_pgoff() take it as licence to share the full region
automatically without calling the get_unmapped_area() file op again.
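
The effect of shrinking the region can be illustrated with a small
userspace model.  This is only a sketch: struct file_region and both
helpers are hypothetical stand-ins, not the kernel's vm_region code, and
the region is reduced to a byte range of the file.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a shareable region: the byte range of the
 * file that it backs, [start, end). */
struct file_region {
	size_t start;
	size_t end;
};

/* Can a new shared mapping of [off, off + len) reuse this region? */
static bool region_covers(const struct file_region *r, size_t off, size_t len)
{
	return off >= r->start && off + len <= r->end;
}

/* On truncation, cut the region back so it never extends past EOF. */
static void shrink_region_on_truncate(struct file_region *r, size_t new_size)
{
	assert(r->start <= r->end);
	if (r->end > new_size)
		r->end = new_size > r->start ? new_size : r->start;
}
```

In this model, a 32KB region shrunk to 16KB no longer covers a 32KB
request like mapping 'c', but still covers a 16KB request like mapping 'd'.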
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Cc: Greg Ungerer <gerg@snapgear.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

7e660872

nommu: don't need get_unmapped_area() for NOMMU · efc1a3b1

Authored by David Howells on Jan 15, 2010

get_unmapped_area() is unnecessary for NOMMU as no-one calls it.
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Cc: Greg Ungerer <gerg@snapgear.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

efc1a3b1

nommu: remove a superfluous check of vm_region::vm_usage · 779c1023

Authored by David Howells on Jan 15, 2010

In split_vma(), there's no need to check if the VMA being split has a
region that's in use by more than one VMA because:

 (1) The preceding test prohibits splitting of non-anonymous VMAs and regions
     (eg: file or chardev backed VMAs).

 (2) Anonymous regions can't be mapped multiple times because there's no handle
     by which to refer to the already existing region.

 (3) If a VMA has previously been split, then the region backing it has also
     been split into two regions, each of usage 1.
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Cc: Greg Ungerer <gerg@snapgear.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

779c1023

nommu: struct vm_region's vm_usage count need not be atomic · 1e2ae599

Authored by David Howells on Jan 15, 2010

The vm_usage count field in struct vm_region does not need to be atomic as
it's only ever modified whilst nommu_region_sem is write locked.
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Cc: Greg Ungerer <gerg@snapgear.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

1e2ae599

07 Jan, 2010 2 commits

NOMMU: Use copy_*_user_page() in access_process_vm() · 7959722b

Authored by Jie Zhang on Jan 06, 2010

The MMU code uses the copy_*_user_page() variants in access_process_vm()
rather than copy_*_user() as the former includes an icache flush.  This
is important when doing things like setting software breakpoints with
gdb.  So switch the NOMMU code over to do the same.

This patch makes the reasonable assumption that copy_from_user_page()
won't fail - which is probably fine, as we've checked the VMA from which
we're copying is usable, and the copy is not allowed to cross VMAs.  The
one case where it might go wrong is if the VMA is a device rather than
RAM, and that device returns an error, in which case rubbish will
be returned rather than EIO.
Signed-off-by: Jie Zhang <jie.zhang@analog.com>
Signed-off-by: Mike Frysinger <vapier@gentoo.org>
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: David McCullough <david_mccullough@mcafee.com>
Acked-by: Paul Mundt <lethal@linux-sh.org>
Acked-by: Greg Ungerer <gerg@uclinux.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

7959722b

NOMMU: Avoiding duplicate icache flushes of shared maps · cfe79c00

Authored by Mike Frysinger on Jan 06, 2010

When working with FDPIC, there are many shared mappings of read-only
code regions between applications (the C library, applet packages like
busybox, etc.), but the current do_mmap_pgoff() function will issue an
icache flush whenever a VMA is added to an MM instead of only doing it
when the map is initially created.

The flush can instead be done when a region is first mmapped PROT_EXEC.
Note that we may not rely on the first mapping of a region being
executable - it's possible for it to be PROT_READ only, so we have to
remember whether we've flushed the region or not, and then flush the
entire region when a bit of it is made executable.

However, this also affects the brk area. That will no longer be
executable. We can mprotect() it to PROT_EXEC on MPU-mode kernels, but
for NOMMU mode kernels, when it increases the brk allocation, making
sys_brk() flush the extra from the icache should suffice. The brk area
probably isn't used by NOMMU programs since the brk area can only use up
the leavings from the stack allocation, where the stack allocation is
larger than requested.
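
The flush-once bookkeeping described above can be sketched in userspace as
follows.  This is a model only: struct shared_region, the SKETCH_PROT_*
values and the flush counter are hypothetical, and the counter stands in
for the real icache flush.

```c
#include <assert.h>
#include <stdbool.h>

#define SKETCH_PROT_READ 0x1	/* hypothetical values for this sketch */
#define SKETCH_PROT_EXEC 0x4

struct shared_region {
	bool icache_flushed;	/* remembered across mappings of the region */
	unsigned flush_count;	/* instrumentation for this sketch only */
};

/* Called each time the region is mapped into an MM. */
static void map_region(struct shared_region *r, int prot)
{
	assert(r);
	if ((prot & SKETCH_PROT_EXEC) && !r->icache_flushed) {
		r->flush_count++;	/* stands in for flushing the whole region */
		r->icache_flushed = true;
	}
}
```

A read-only first mapping triggers no flush, the first executable mapping
triggers one, and later executable mappings of the same region trigger none.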
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Mike Frysinger <vapier@gentoo.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

cfe79c00

31 Dec, 2009 1 commit

mm: move sys_mmap_pgoff from util.c · 66f0dc48

Authored by Hugh Dickins on Dec 30, 2009

Move sys_mmap_pgoff() from mm/util.c to mm/mmap.c and mm/nommu.c,
where we'd expect to find such code: especially now that it contains
the MAP_HUGETLB handling. Revert mm/util.c to how it was in 2.6.32.

This patch just ignores MAP_HUGETLB in the nommu case, as in 2.6.32,
whereas 2.6.33-rc2 reported -ENOSYS. Perhaps validate_mmap_request()
should reject it with -EINVAL? Add that later if necessary.
Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

66f0dc48

16 Dec, 2009 1 commit

nommu: fix malloc performance by adding uninitialized flag · ea637639

Authored by Jie Zhang on Dec 14, 2009

The NOMMU code currently clears all anonymous mmapped memory.  While this
is what we want in the default case, all memory allocation from userspace
under NOMMU has to go through this interface, including malloc() which is
allowed to return uninitialized memory.  This can easily be a significant
performance penalty.  So for constrained embedded systems where security is
irrelevant, allow people to avoid clearing memory unnecessarily.

This also alters the ELF-FDPIC binfmt such that it obtains uninitialised
memory for the brk and stack region.
Signed-off-by: Jie Zhang <jie.zhang@analog.com>
Signed-off-by: Robin Getz <rgetz@blackfin.uclinux.org>
Signed-off-by: Mike Frysinger <vapier@gentoo.org>
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Paul Mundt <lethal@linux-sh.org>
Acked-by: Greg Ungerer <gerg@snapgear.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

ea637639

01 Nov, 2009 1 commit

NOMMU: Don't pass NULL pointers to fput() in do_mmap_pgoff() · 89a86402

Authored by David Howells on Oct 30, 2009

Don't pass NULL pointers to fput() in the error handling paths of the NOMMU
do_mmap_pgoff() as it can't handle it.

The following can be used as a test program:

	int main() { static long long a[1024 * 1024 * 20] = { 0 }; return a;}

Without the patch, the code oopses in atomic_long_dec_and_test() as called by
fput() after the kernel complains that it can't allocate that big a chunk of
memory.  With the patch, the kernel just complains about the allocation size
and then the program segfaults during execve() as execve() can't complete the
allocation of all the new ELF program segments.
Reported-by: Robin Getz <rgetz@blackfin.uclinux.org>
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Robin Getz <rgetz@blackfin.uclinux.org>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

89a86402

28 Sep, 2009 1 commit

const: mark struct vm_struct_operations · f0f37e2f

Authored by Alexey Dobriyan on Sep 27, 2009

* mark struct vm_area_struct::vm_ops as const
* mark vm_ops in AGP code

But leave TTM code alone, something is fishy there with global vm_ops
being used.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

f0f37e2f

25 Sep, 2009 2 commits

NOMMU: Ignore mmap() address param as it is a hint · 06aab5a3

Authored by David Howells on Sep 24, 2009

Ignore the address parameter given to NOMMU mmap() as it is a hint, rather
than giving an error if it's non-zero. MAP_FIXED still gets an error.
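
The address handling described above might look roughly like the sketch
below.  It models only this one check, not the kernel's actual function.
The helper name is hypothetical, and the use of -EINVAL for MAP_FIXED is an
assumption, since the message only says that an error is returned.

```c
#include <assert.h>
#include <errno.h>
#include <sys/mman.h>

/* Model of the address check only: a fixed placement cannot be honoured
 * without an MMU, while a plain hint is simply dropped. */
static int check_nommu_addr(unsigned long *addr, unsigned long flags)
{
	assert(addr);
	if (flags & MAP_FIXED)
		return -EINVAL;	/* assumed error code */
	*addr = 0;		/* hint ignored; the allocator picks the address */
	return 0;
}
```

A non-zero hint with MAP_PRIVATE is accepted and discarded, while the same
call with MAP_FIXED set is rejected.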
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

06aab5a3

NOMMU: Fix MAP_PRIVATE mmap() of objects where the data can be mapped directly · 645d83c5

Authored by David Howells on Sep 24, 2009

Fix MAP_PRIVATE mmap() of files and devices where the data in the backing store
might be mapped directly.  Use the BDI_CAP_MAP_DIRECT capability flag to govern
whether or not we should be trying to map a file directly.  This can be used to
determine whether or not a region has been filled in at the point where we call
do_mmap_shared() or do_mmap_private().

The BDI_CAP_MAP_DIRECT capability flag is cleared by validate_mmap_request() if
there's any reason we can't use it.  It's also cleared in do_mmap_pgoff() if
f_op->get_unmapped_area() fails.

Without this fix, attempting to run a program from a RomFS image on a
non-mappable MTD partition results in a BUG as the kernel attempts XIP, and
this can be caught in gdb:

Program received signal SIGABRT, Aborted.
0xc005dce8 in add_nommu_region (region=<value optimized out>) at mm/nommu.c:547
(gdb) bt
#0  0xc005dce8 in add_nommu_region (region=<value optimized out>) at mm/nommu.c:547
#1  0xc005f168 in do_mmap_pgoff (file=0xc31a6620, addr=<value optimized out>, len=3808, prot=3, flags=6146, pgoff=0) at mm/nommu.c:1373
#2  0xc00a96b8 in elf_fdpic_map_file (params=0xc33fbbec, file=0xc31a6620, mm=0xc31bef60, what=0xc0213144 "executable") at mm.h:1145
#3  0xc00aa8b4 in load_elf_fdpic_binary (bprm=0xc316cb00, regs=<value optimized out>) at fs/binfmt_elf_fdpic.c:343
#4  0xc006b588 in search_binary_handler (bprm=0x6, regs=0xc33fbce0) at fs/exec.c:1234
#5  0xc006c648 in do_execve (filename=<value optimized out>, argv=0xc3ad14cc, envp=0xc3ad1460, regs=0xc33fbce0) at fs/exec.c:1356
#6  0xc0008cf0 in sys_execve (name=<value optimized out>, argv=0xc3ad14cc, envp=0xc3ad1460) at arch/frv/kernel/process.c:263
#7  0xc00075dc in __syscall_call () at arch/frv/kernel/entry.S:897

Note that this fix does the following commit differently:

	commit a190887b
	Author: David Howells <dhowells@redhat.com>
	Date:   Sat Sep 5 11:17:07 2009 -0700
	nommu: fix error handling in do_mmap_pgoff()
Reported-by: Graff Yang <graff.yang@gmail.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Greg Ungerer <gerg@snapgear.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

645d83c5

24 Sep, 2009 2 commits

truncate: new helpers · 25d9e2d1

Authored by npiggin@suse.de on Aug 21, 2009

Introduce new truncate helpers truncate_pagecache and inode_newsize_ok.
vmtruncate is also consolidated from mm/memory.c and mm/nommu.c
into mm/truncate.c.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

25d9e2d1

nommu: fix two build breakages · 4266c97a

Authored by Hugh Dickins on Sep 23, 2009

My 58fa879e "mm: FOLL flags for GUP flags"
broke CONFIG_NOMMU build by forgetting to update nommu.c foll_flags type:

  mm/nommu.c:171: error: conflicting types for `__get_user_pages'
  mm/internal.h:254: error: previous declaration of `__get_user_pages' was here
  make[1]: *** [mm/nommu.o] Error 1

My 03f6462a "mm: move highest_memmap_pfn"
broke CONFIG_NOMMU build by forgetting to add a nommu.c highest_memmap_pfn:

  mm/built-in.o: In function `memmap_init_zone':
  (.meminit.text+0x326): undefined reference to `highest_memmap_pfn'
  mm/built-in.o: In function `memmap_init_zone':
  (.meminit.text+0x32d): undefined reference to `highest_memmap_pfn'

Fix both breakages, and give myself 30 lashes (ouch!)
Reported-by: Michal Simek <michal.simek@petalogix.com>
Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

4266c97a

22 Sep, 2009 4 commits

nommu: add support for Memory Protection Units (MPU) · eb8cdec4

Authored by Bernd Schmidt on Sep 21, 2009

Some architectures (like the Blackfin arch) implement some of the
"simpler" features that one would expect out of an MMU, such as memory
protection.

In our case, we actually get read/write/exec protection down to the page
boundary so processes can't stomp on each other let alone the kernel.

There is, however, a performance decrease (which depends greatly on the
workload), as the hardware/software interaction was not optimized at
design time.
Signed-off-by: Bernd Schmidt <bernds_cb1@t-online.de>
Signed-off-by: Bryan Wu <cooloney@kernel.org>
Signed-off-by: Mike Frysinger <vapier@gentoo.org>
Acked-by: David Howells <dhowells@redhat.com>
Acked-by: Greg Ungerer <gerg@snapgear.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

eb8cdec4

mm: FOLL flags for GUP flags · 58fa879e

Authored by Hugh Dickins on Sep 21, 2009

__get_user_pages() has been taking its own GUP flags, then processing
them into FOLL flags for follow_page().  Though oddly named, the FOLL
flags are more widely used, so pass them to __get_user_pages() now.
Sorry, VM flags, VM_FAULT flags and FAULT_FLAGs are still distinct.

(The patch to __get_user_pages() looks peculiar, with both gup_flags
and foll_flags: the gup_flags remain constant; but as before there's
an exceptional case, out of scope of the patch, in which foll_flags
per page have FOLL_WRITE masked off.)
Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Cc: Rik van Riel <riel@redhat.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

58fa879e

mm: remove unused GUP flags · 1c3aff1c

Authored by Hugh Dickins on Sep 21, 2009

GUP_FLAGS_IGNORE_VMA_PERMISSIONS and GUP_FLAGS_IGNORE_SIGKILL were
flags added solely to prevent __get_user_pages() from doing some of
what it usually does, in the munlock case: we can now remove them.
Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

1c3aff1c

mm: includecheck fix for mm/nommu.c · 72ff13b7

Authored by Jaswinder Singh Rajput on Sep 21, 2009

Fix the following 'make includecheck' warning:

  mm/nommu.c: internal.h is included more than once.
Signed-off-by: Jaswinder Singh Rajput <jaswinderrajput@gmail.com>
Cc: David Howells <dhowells@redhat.com>
Acked-by: Greg Ungerer <gerg@snapgear.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

72ff13b7

06 Sep, 2009 1 commit

nommu: fix error handling in do_mmap_pgoff() · a190887b

Authored by David Howells on Sep 05, 2009

Fix the error handling in do_mmap_pgoff().  If do_mmap_shared_file() or
do_mmap_private() fail, we jump to the error_put_region label at which
point we call __put_nommu_region() on the region - but we haven't yet
added the region to the tree, and so __put_nommu_region() may BUG
because the region tree is empty or it may corrupt the region tree.

To get around this, we can afford to add the region to the region tree
before calling do_mmap_shared_file() or do_mmap_private() as we keep
nommu_region_sem write-locked, so no-one can race with us by seeing a
transient region.
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
Acked-by: Paul Mundt <lethal@linux-sh.org>
Cc: Mel Gorman <mel@csn.ul.ie>
Acked-by: Greg Ungerer <gerg@snapgear.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

a190887b

19 Aug, 2009 1 commit

nommu: check fd read permission in validate_mmap_request() · 28d7a6ae

Authored by Graff Yang on Aug 18, 2009

According to POSIX (1003.1-2008), the file descriptor shall have been
opened with read permission, regardless of the protection options specified to
mmap().  The ltp test cases mmap06/07 need this.
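
The POSIX rule can be observed from userspace.  The sketch below is a
hypothetical helper, not part of the patch or of LTP.  It assumes a
writable /tmp and reports the errno that mmap() sets when the descriptor
was opened write-only, which on Linux is expected to be EACCES.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/* Returns 0 if mmap() unexpectedly succeeds, otherwise the errno it set.
 * Scratch-file setup failures abort via assert(). */
static int mmap_errno_on_write_only_fd(void)
{
	char path[] = "/tmp/mmap_wronly_XXXXXX";
	int fd = mkstemp(path);		/* creates the scratch file */
	assert(fd >= 0);
	close(fd);

	fd = open(path, O_WRONLY);	/* reopen without read permission */
	assert(fd >= 0);
	unlink(path);

	int rc = ftruncate(fd, 4096);
	assert(rc == 0);
	(void)rc;

	errno = 0;
	void *p = mmap(NULL, 4096, PROT_WRITE, MAP_SHARED, fd, 0);
	int err = (p == MAP_FAILED) ? errno : 0;

	if (p != MAP_FAILED)
		munmap(p, 4096);
	close(fd);
	return err;
}
```

Note that the request asks only for PROT_WRITE, so the failure comes from
the missing read permission on the descriptor, not from the protection
flags.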
Signed-off-by: Graff Yang <graff.yang@gmail.com>
Acked-by: Paul Mundt <lethal@linux-sh.org>
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Greg Ungerer <gerg@snapgear.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

28d7a6ae

17 Aug, 2009 1 commit

Security/SELinux: seperate lsm specific mmap_min_addr · 788084ab

Authored by Eric Paris on Jul 31, 2009

Currently SELinux enforcement of controls on the ability to map low memory
is determined by the mmap_min_addr tunable.  This patch causes SELinux to
ignore the tunable and instead use a separate Kconfig option specific to how
much space the LSM should protect.

The tunable will now only control the need for CAP_SYS_RAWIO and SELinux
permissions will always protect the amount of low memory designated by
CONFIG_LSM_MMAP_MIN_ADDR.

This allows users who need to disable the mmap_min_addr controls (usual reason
being they run WINE as a non-root user) to do so and still have SELinux
controls preventing confined domains (like a web server) from being able to
map some area of low memory.
Signed-off-by: Eric Paris <eparis@redhat.com>
Signed-off-by: James Morris <jmorris@namei.org>

788084ab

06 Aug, 2009 1 commit

Security/SELinux: seperate lsm specific mmap_min_addr · a2551df7

Authored by Eric Paris on Jul 31, 2009

Currently SELinux enforcement of controls on the ability to map low memory
is determined by the mmap_min_addr tunable.  This patch causes SELinux to
ignore the tunable and instead use a separate Kconfig option specific to how
much space the LSM should protect.

The tunable will now only control the need for CAP_SYS_RAWIO and SELinux
permissions will always protect the amount of low memory designated by
CONFIG_LSM_MMAP_MIN_ADDR.

This allows users who need to disable the mmap_min_addr controls (usual reason
being they run WINE as a non-root user) to do so and still have SELinux
controls preventing confined domains (like a web server) from being able to
map some area of low memory.
Signed-off-by: Eric Paris <eparis@redhat.com>
Signed-off-by: James Morris <jmorris@namei.org>

a2551df7

26 Jun, 2009 2 commits

nommu: provide follow_pfn(). · dfc2f91a

Authored by Paul Mundt on Jun 26, 2009

With the introduction of follow_pfn() as an exported symbol, modules have
begun making use of it. Unfortunately this was not reflected on nommu at
the time, so the in-tree users have subsequently all blown up with link
errors there.

This provides a simple follow_pfn() that just returns addr >> PAGE_SHIFT,
which will do the right thing on nommu. There is no need to do range
checking within the vma, as the find_vma() case will already take care of
this.
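
The address-to-PFN conversion mentioned above is a plain shift.  A minimal
userspace sketch, assuming 4 KiB pages (SKETCH_PAGE_SHIFT and the helper
name are illustrative, not the kernel's definitions):

```c
#include <assert.h>

#define SKETCH_PAGE_SHIFT 12	/* assumption: 4 KiB pages */

/* Without an MMU, virtual and physical addresses coincide, so the page
 * frame number is just the address with the page offset shifted out. */
static unsigned long addr_to_pfn(unsigned long addr)
{
	unsigned long pfn = addr >> SKETCH_PAGE_SHIFT;

	/* Shifting back must give the page-aligned base of addr. */
	assert((pfn << SKETCH_PAGE_SHIFT) ==
	       (addr & ~((1UL << SKETCH_PAGE_SHIFT) - 1)));
	return pfn;
}
```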
Signed-off-by: Paul Mundt <lethal@linux-sh.org>

dfc2f91a

clarify get_user_pages() prototype · 9d73777e

Authored by Peter Zijlstra on Jun 25, 2009

Currently the 4th parameter of get_user_pages() is called len, but it is
in pages, not bytes. Rename the thing to nr_pages to avoid future
confusion.
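
The bytes-versus-pages distinction matters because a caller holding a byte
range has to convert it before passing a page count.  A hedged sketch of
that conversion, assuming 4 KiB pages (the helper and SKETCH_PAGE_SIZE are
hypothetical, not kernel API):

```c
#include <assert.h>

#define SKETCH_PAGE_SIZE 4096UL	/* assumption: 4 KiB pages */

/* Number of pages touched by the byte range [start, start + len). */
static unsigned long bytes_to_nr_pages(unsigned long start, unsigned long len)
{
	assert(len > 0);
	unsigned long first = start / SKETCH_PAGE_SIZE;
	unsigned long last = (start + len - 1) / SKETCH_PAGE_SIZE;

	return last - first + 1;
}
```

A two-byte range that straddles a page boundary needs two pages, which is
why passing the byte length directly would be wrong.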
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

9d73777e

10 Jun, 2009 1 commit

nommu: Provide mmap_min_addr definition. · 35f2c2f6

Authored by Paul Mundt on Jun 09, 2009

With the "security: use mmap_min_addr indepedently of security models"
change, mmap_min_addr is used in common areas, which subsequently blows
up the nommu build. This stubs in the definition in the nommu case as
well.
Signed-off-by: Paul Mundt <lethal@linux-sh.org>

--

 mm/nommu.c |    3 +++
 1 file changed, 3 insertions(+)
Signed-off-by: James Morris <jmorris@namei.org>

35f2c2f6

08 May, 2009 1 commit

NOMMU: Don't check vm_region::vm_start is page aligned in add_nommu_region() · 8c9ed899

Authored by David Howells on May 07, 2009

Don't check vm_region::vm_start is page aligned in add_nommu_region() because
the region may reflect some non-page-aligned mapped file, such as could be
obtained from RomFS XIP.
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Greg Ungerer <gerg@uclinux.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

8c9ed899

07 May, 2009 1 commit

nommu: make the initial mmap allocation excess behaviour Kconfig configurable · fc4d5c29

Authored by David Howells on May 06, 2009

NOMMU mmap() has an option controlled by a sysctl variable that determines
whether the allocations made by do_mmap_private() should have the excess
space trimmed off and returned to the allocator.  Make the initial setting
of this variable a Kconfig configuration option.

The reason there can be excess space is that the allocator only allocates
in power-of-2 size chunks, but mmap()'s can be made in sizes that aren't a
power of 2.
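
The size of the excess follows directly from the power-of-2 rounding.  The
sketch below works in units of pages and is a model only: both helper
names are hypothetical and nothing here is the kernel's allocator code.

```c
#include <assert.h>
#include <limits.h>

/* Smallest power of two that is >= pages. */
static unsigned long round_up_pow2(unsigned long pages)
{
	assert(pages > 0 && pages <= ULONG_MAX / 2 + 1);
	unsigned long p = 1;

	while (p < pages)
		p <<= 1;
	return p;
}

/* Pages allocated beyond what the mapping asked for. */
static unsigned long excess_pages(unsigned long pages)
{
	return round_up_pow2(pages) - pages;
}
```

A 5-page request is served from an 8-page chunk and leaves 3 pages of
excess, while an 8-page request leaves none.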

There are two alternatives:

 (1) Keep the excess as dead space.  The dead space then remains unused for the
     lifetime of the mapping.  Mappings of shared objects such as libc, ld.so
     or busybox's text segment may retain their dead space forever.

 (2) Return the excess to the allocator.  This means that the dead space is
     limited to less than a page per mapping, but it means that for a transient
     process, there's more chance of fragmentation as the excess space may be
     reused fairly quickly.

During the boot process, a lot of transient processes are created, and
this can cause a lot of fragmentation as the pagecache and various slabs
grow greatly during this time.

By turning off the trimming of excess space during boot and disabling
batching of frees, Coldfire can manage to boot.

A better way of doing things might be to have /sbin/init turn this option
off.  By that point libc, ld.so and init - which are all long-duration
processes - have all been loaded and trimmed.
Reported-by: Lanttor Guo <lanttor.guo@freescale.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Tested-by: Lanttor Guo <lanttor.guo@freescale.com>
Cc: Greg Ungerer <gerg@snapgear.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

fc4d5c29

03 May, 2009 1 commit

mm: fix Committed_AS underflow on large NR_CPUS environment · 00a62ce9

Authored by KOSAKI Motohiro on Apr 30, 2009

The Committed_AS field can underflow in certain situations:

>         # while true; do cat /proc/meminfo  | grep _AS; sleep 1; done | uniq -c
>               1 Committed_AS: 18446744073709323392 kB
>              11 Committed_AS: 18446744073709455488 kB
>               6 Committed_AS:    35136 kB
>               5 Committed_AS: 18446744073709454400 kB
>               7 Committed_AS:    35904 kB
>               3 Committed_AS: 18446744073709453248 kB
>               2 Committed_AS:    34752 kB
>               9 Committed_AS: 18446744073709453248 kB
>               8 Committed_AS:    34752 kB
>               3 Committed_AS: 18446744073709320960 kB
>               7 Committed_AS: 18446744073709454080 kB
>               3 Committed_AS: 18446744073709320960 kB
>               5 Committed_AS: 18446744073709454080 kB
>               6 Committed_AS: 18446744073709320960 kB

This happens because NR_CPUS can be greater than 1000 and
meminfo_proc_show() does not check for underflow.

Scaling the calculation with NR_CPUS is not a good approach, though.  In
general, the likelihood of lock contention is proportional to the number
of online CPUs, not the theoretical maximum number of CPUs (NR_CPUS).

The current kernel has a generic percpu counter.  Using it is the right
approach: it simplifies the code, and percpu_counter_read_positive() does
not have the underflow issue.
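
The huge values in the output above are what a slightly negative signed
count looks like when printed as unsigned.  The sketch below reproduces
the first value from the log and shows clamping at zero, in the spirit of
percpu_counter_read_positive().  Both helpers are hypothetical userspace
stand-ins, not the kernel functions.

```c
#include <assert.h>
#include <stdint.h>

/* What a reader sees if a transiently negative count is printed as an
 * unsigned 64-bit value. */
static uint64_t read_raw(int64_t count)
{
	return (uint64_t)count;
}

/* Clamp negative readings to zero before reporting them. */
static uint64_t read_positive(int64_t count)
{
	uint64_t v = count < 0 ? 0 : (uint64_t)count;

	assert(v <= (uint64_t)INT64_MAX);	/* always fits the signed range */
	return v;
}
```

A count of -228224 wraps to 18446744073709323392 when read raw, and is
reported as 0 when clamped.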
Reported-by: Dave Hansen <dave@linux.vnet.ibm.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Eric B Munson <ebmunson@us.ibm.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: <stable@kernel.org>		[All kernel versions]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

00a62ce9

03 Apr, 2009 1 commit

nommu: fix a number of issues with the per-MM VMA patch · 33e5d769

Authored by David Howells on Apr 02, 2009

Fix a number of issues with the per-MM VMA patch:

 (1) Make mmap_pages_allocated an atomic_long_t, just in case this is used on
     a NOMMU system with more than 2G pages.  Makes no difference on a 32-bit
     system.

 (2) Report vma->vm_pgoff * PAGE_SIZE as a 64-bit value, not a 32-bit value,
     lest it overflow.

 (3) Move the allocation of the vm_area_struct slab back to fork.c.

 (4) Use KMEM_CACHE() for both vm_area_struct and vm_region slabs.

 (5) Use BUG_ON() rather than if () BUG().

 (6) Make the default validate_nommu_regions() a static inline rather than a
     #define.

 (7) Make free_page_series()'s objection to pages with a refcount != 1 more
     informative.

 (8) Adjust the __put_nommu_region() banner comment to indicate that the
     semaphore must be held for writing.

 (9) Limit the number of warnings about munmaps of non-mmapped regions.
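
Item (2) guards against a 32-bit multiplication wrapping.  A small sketch
of the difference, assuming 4 KiB pages and a 32-bit page offset (the
helper names and SKETCH_PAGE_SIZE are illustrative only):

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096u	/* assumption: 4 KiB pages */

/* 32-bit arithmetic: the product wraps modulo 2^32. */
static uint32_t offset_32bit(uint32_t pgoff)
{
	return (uint32_t)(pgoff * SKETCH_PAGE_SIZE);
}

/* Widen before multiplying so the byte offset is preserved. */
static uint64_t offset_64bit(uint32_t pgoff)
{
	uint64_t off = (uint64_t)pgoff * SKETCH_PAGE_SIZE;

	assert(off / SKETCH_PAGE_SIZE == pgoff);
	return off;
}
```

With a page offset of 0x100000 (4 GiB into the file), the 32-bit product
wraps to 0 while the 64-bit product gives 0x100000000.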
Reported-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: David Howells <dhowells@redhat.com>
Cc: Greg Ungerer <gerg@snapgear.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

33e5d769

27 Jan, 2009 1 commit

uclinux: add process name to allocation error message · 05ae6fa3

Authored by Greg Ungerer on Jan 13, 2009

This patch adds the name of the process to the bad allocation error
message on non-MMU systems.

Change suggested by jsujjavanich@syntech-fuelmaster.com
Signed-off-by: Greg Ungerer <gerg@uclinux.org>

05ae6fa3

21 Jan, 2009 1 commit

nommu: Stub in vm_map_ram()/vm_unmap_ram()/vm_unmap_aliases(). · eb6434d9

Authored by Paul Mundt on Jan 21, 2009

Presently we do not support these interfaces, so make them BUG() wrappers
as per the rest of the vmap interface on nommu. Fixes up the modular xfs
build.
Signed-off-by: Paul Mundt <lethal@linux-sh.org>

eb6434d9

14 Jan, 2009 2 commits

[CVE-2009-0029] System call wrappers part 13 · 6a6160a7

Authored by Heiko Carstens on Jan 14, 2009

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>

6a6160a7

[CVE-2009-0029] Convert all system calls to return a long · 2ed7c03e

Authored by Heiko Carstens on Jan 14, 2009

Convert all system calls to return a long. This should be a NOP since all
converted types should have the same size anyway, with the exception of
sys_exit_group, which returned void. That doesn't matter, since the
system call doesn't return.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>

2ed7c03e

08 Jan, 2009 4 commits

NOMMU: Teach kobjsize() about VMA regions. · ab2e83ea

Authored by Paul Mundt on Jan 08, 2009

Now that we no longer use compound pages for all large allocations,
kobjsize() actively breaks things like binfmt_flat by always handing
back PAGE_SIZE for mmap'ed regions. Fix this up by looking up the
VMA region for non-compounds.

Ideally binfmt_flat wants to get rid of kobjsize() completely, but
this is an incremental step.
Signed-off-by: Paul Mundt <lethal@linux-sh.org>
Signed-off-by: David Howells <dhowells@redhat.com>
Tested-by: Mike Frysinger <vapier.adi@gmail.com>

ab2e83ea

NOMMU: Make mmap allocation page trimming behaviour configurable. · dd8632a1

Authored by Paul Mundt on Jan 08, 2009

NOMMU mmap allocates a piece of memory for an mmap that's rounded up in size to
the nearest power-of-2 number of pages. Currently it then discards the excess
pages back to the page allocator, making that memory available for use by other
things. This can, however, cause a greater amount of fragmentation.

To counter this, a sysctl is added in order to fine-tune the trimming
behaviour. The default behaviour remains to trim pages aggressively, while
this can either be disabled completely or set to a higher page-granular
watermark in order to have finer-grained control.
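The rounding and trimming decision described above can be sketched in plain C. This is a user-space illustration only, not the kernel code: `round_up_pow2_pages()`, `pages_to_trim()` and the `trim_threshold` parameter are hypothetical names standing in for the allocation rounding and the sysctl value, and the assumed semantics are that 0 disables trimming while a value N trims only when at least N excess pages exist.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Round a page count up to the next power of two.
 * Assumes npages >= 1 and that the result fits in size_t.
 */
static size_t round_up_pow2_pages(size_t npages)
{
    size_t n = 1;

    assert(npages >= 1);
    while (n < npages)
        n <<= 1;
    return n;
}

/*
 * Decide how many excess pages to hand back to the page allocator.
 * trim_threshold is a hypothetical stand-in for the sysctl value:
 *   0 -> trimming disabled, keep the whole power-of-2 block
 *   N -> trim only when at least N excess pages exist
 */
static size_t pages_to_trim(size_t requested, size_t trim_threshold)
{
    size_t allocated = round_up_pow2_pages(requested);
    size_t excess = allocated - requested;

    if (trim_threshold == 0 || excess < trim_threshold)
        return 0;
    return excess;
}
```

Under these assumptions, a 5-page request is backed by an 8-page block: a threshold of 1 returns the 3 excess pages, while a threshold of 0 or 4 keeps them and trades memory for less fragmentation.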

vm region vm_top bits taken from an earlier patch by David Howells.
Signed-off-by: Paul Mundt <lethal@linux-sh.org>
Signed-off-by: David Howells <dhowells@redhat.com>
Tested-by: Mike Frysinger <vapier.adi@gmail.com>

dd8632a1

NOMMU: Make VMAs per MM as for MMU-mode linux · 8feae131

Authored by David Howells on Jan 08, 2009

Make VMAs per mm_struct as for MMU-mode linux.  This solves two problems:

 (1) In SYSV SHM where nattch for a segment does not reflect the number of
     shmat's (and forks) done.

 (2) In mmap() where the VMA's vm_mm is set to point to the parent mm by an
     exec'ing process when VM_EXECUTABLE is specified, regardless of the fact
     that a VMA might be shared and already have its vm_mm assigned to another
     process or a dead process.

A new struct (vm_region) is introduced to track a mapped region and to remember
the circumstances under which it may be shared and the vm_list_struct structure
is discarded as it's no longer required.

This patch makes the following additional changes:

 (1) Regions are now allocated with alloc_pages() rather than kmalloc() and
     with no recourse to __GFP_COMP, so the pages are not composite.  Instead,
     each page has a reference on it held by the region.  Anything else that is
     interested in such a page will have to get a reference on it to retain it.
     When the pages are released due to unmapping, each page is passed to
     put_page() and will be freed when the page usage count reaches zero.

 (2) Excess pages are trimmed after an allocation as the allocation must be
     made as a power-of-2 quantity of pages.

 (3) VMAs are added to the parent MM's R/B tree and mmap lists.  As an MM may
     end up with overlapping VMAs within the tree, the VMA struct address is
     appended to the sort key.

 (4) Non-anonymous VMAs are now added to the backing inode's prio list.

 (5) Holes may be punched in anonymous VMAs with munmap(), releasing parts of
     the backing region.  The VMA and region structs will be split if
     necessary.

 (6) sys_shmdt() only releases one attachment to a SYSV IPC shared memory
     segment instead of all the attachments at that address.  Multiple
     shmat()'s return the same address under NOMMU-mode instead of different
     virtual addresses as under MMU-mode.

 (7) Core dumping for ELF-FDPIC requires fewer exceptions for NOMMU-mode.

 (8) /proc/maps is now the global list of mapped regions, and may list bits
     that aren't actually mapped anywhere.

 (9) /proc/meminfo gains a line (tagged "MmapCopy") that indicates the amount
     of RAM currently allocated by mmap to hold mappable regions that can't be
     mapped directly.  These are copies of the backing device or file if not
     anonymous.

These changes make NOMMU mode more similar to MMU mode.  The downside is that
NOMMU mode requires some extra memory to track things over NOMMU without this
patch (VMAs are no longer shared, and there are now region structs).
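Item (3) in the list above, where the VMA struct address is appended to the sort key so that overlapping VMAs can coexist in one tree, can be illustrated with a small comparator. This is a minimal user-space sketch, not the kernel implementation: `struct toy_vma` and `toy_vma_cmp()` are hypothetical names, and ordering by start address, then end address, then struct address is an assumption about how such a key could be built.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for a VMA: just an address range. */
struct toy_vma {
    unsigned long vm_start;
    unsigned long vm_end;
};

/*
 * Total order for tree insertion: compare the start address, then the end
 * address, and finally the address of the struct itself. The last step
 * guarantees that two distinct VMAs covering the same range never compare
 * equal, so both can live in the same tree.
 */
static int toy_vma_cmp(const struct toy_vma *a, const struct toy_vma *b)
{
    assert(a && b);

    if (a->vm_start != b->vm_start)
        return a->vm_start < b->vm_start ? -1 : 1;
    if (a->vm_end != b->vm_end)
        return a->vm_end < b->vm_end ? -1 : 1;
    if ((uintptr_t)a != (uintptr_t)b)
        return (uintptr_t)a < (uintptr_t)b ? -1 : 1;
    return 0;
}
```

With this ordering, the comparator returns 0 only when both arguments are the same object, which is what lets identical ranges be stored side by side.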
Signed-off-by: David Howells <dhowells@redhat.com>
Tested-by: Mike Frysinger <vapier.adi@gmail.com>
Acked-by: Paul Mundt <lethal@linux-sh.org>

8feae131

NOMMU: Delete askedalloc and realalloc variables · 41836382

Authored by David Howells on Jan 08, 2009

Delete the askedalloc and realalloc variables as nothing actually uses the
value calculated.
Signed-off-by: David Howells <dhowells@redhat.com>
Tested-by: Mike Frysinger <vapier.adi@gmail.com>
Acked-by: Paul Mundt <lethal@linux-sh.org>

41836382

06 Jan, 2009 1 commit

inode->i_op is never NULL · acfa4380

Authored by Al Viro on Dec 04, 2008

We used to have a rather schizophrenic set of checks for NULL ->i_op even
though it had been eliminated years ago. You'd need to go out of your
way to set it to NULL explicitly _and_ a bunch of code would die on
such inodes anyway. After killing two remaining places that still
did that bogosity, all that crap can go away.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

acfa4380

31 Oct, 2008 1 commit

nfsd: fix vm overcommit crash · 731572d3

Authored by Alan Cox on Oct 29, 2008

Junjiro R. Okajima reported a problem where knfsd crashes if you are
using it to export shmemfs objects and run strict overcommit.  In this
situation the current->mm based modifier to the overcommit goes through a
NULL pointer.

We could simply check for NULL and skip the modifier but we've caught
other real bugs in the past from mm being NULL here - cases where we did
need a valid mm set up (eg the exec bug about a year ago).

To preserve the checks and get the logic we want, shuffle the checking
around and add a new helper to the vm_ security wrappers.

Also fix a current->mm reference in nommu that should use the passed mm.
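The shape of the fix described above can be sketched in user-space C: the core check takes the mm as an explicit argument and tolerates NULL, a wrapper for ordinary processes still treats a NULL mm as a bug, and a separate helper exists for callers that legitimately have no mm. All names here (`struct toy_mm`, `toy_enough_memory()`, and the two wrappers) are hypothetical, and the `total_vm / 32` modifier is an illustrative value rather than a statement about the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for an mm: only the field we need. */
struct toy_mm {
    long total_vm;
};

/*
 * Core check. The mm is passed in explicitly rather than read from a
 * global "current" task, and may be NULL for callers without an mm.
 * Returns 0 if the request fits, -1 otherwise.
 */
static int toy_enough_memory(const struct toy_mm *mm, long pages, long allowed)
{
    if (mm)
        allowed -= mm->total_vm / 32; /* per-process modifier, illustrative */
    return pages <= allowed ? 0 : -1;
}

/* Wrapper for ordinary processes: a NULL mm here indicates a caller bug. */
static int toy_enough_memory_user(const struct toy_mm *mm, long pages,
                                  long allowed)
{
    assert(mm != NULL);
    return toy_enough_memory(mm, pages, allowed);
}

/* Helper for callers with no mm: skips the modifier instead of crashing. */
static int toy_enough_memory_kthread(long pages, long allowed)
{
    return toy_enough_memory(NULL, pages, allowed);
}
```

Keeping the assertion in the user-facing wrapper preserves the bug-catching behaviour the message mentions, while the second helper gives mm-less callers a path that does not dereference NULL.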

[akpm@linux-foundation.org: coding-style fixes]
[akpm@linux-foundation.org: fix build]
Reported-by: Junjiro R. Okajima <hooanon05@yahoo.co.jp>
Acked-by: James Morris <jmorris@namei.org>
Signed-off-by: Alan Cox <alan@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

731572d3