提交 · 0b0acbec1bed75ec1e1daa7f7006323a2a2b2844 · Linux-御风守护者 / linux

30 10月, 2005 40 次提交

[PATCH] memory hotplug: move section_mem_map alloc to sparse.c · 0b0acbec

由 Dave Hansen 提交于 10月 29, 2005

This basically keeps up from having to extern __kmalloc_section_memmap().

The vaddr_in_vmalloc_area() helper could go in a vmalloc header, but that
header gets hard to work with, because it needs some arch-specific macros.
Just stick it in here for now, instead of creating another header.
Signed-off-by: NDave Hansen <haveblue@us.ibm.com>
Signed-off-by: NLion Vollnhals <webmaster@schiggl.de>
Signed-off-by: NJiri Slaby <xslaby@fi.muni.cz>
Signed-off-by: NYasunori Goto <y-goto@jp.fujitsu.com>
Signed-off-by: NAndrew Morton <akpm@osdl.org>
Signed-off-by: NLinus Torvalds <torvalds@osdl.org>

0b0acbec

[PATCH] memory hotplug: sysfs and add/remove functions · 3947be19

由 Dave Hansen 提交于 10月 29, 2005

This adds generic memory add/remove and supporting functions for memory
hotplug into a new file as well as a memory hotplug kernel config option.

Individual architecture patches will follow.

For now, disable memory hotplug when swsusp is enabled.  There's a lot of
churn there right now.  We'll fix it up properly once it calms down.
Signed-off-by: NMatt Tolentino <matthew.e.tolentino@intel.com>
Signed-off-by: NDave Hansen <haveblue@us.ibm.com>
Signed-off-by: NAndrew Morton <akpm@osdl.org>
Signed-off-by: NLinus Torvalds <torvalds@osdl.org>

3947be19

[PATCH] memory hotplug locking: zone span seqlock · bdc8cb98

由 Dave Hansen 提交于 10月 29, 2005

See the "fixup bad_range()" patch for more information, but this actually
creates a the lock to protect things making assumptions about a zone's size
staying constant at runtime.
Signed-off-by: NDave Hansen <haveblue@us.ibm.com>
Signed-off-by: NAndrew Morton <akpm@osdl.org>
Signed-off-by: NLinus Torvalds <torvalds@osdl.org>

bdc8cb98

[PATCH] memory hotplug locking: node_size_lock · 208d54e5

由 Dave Hansen 提交于 10月 29, 2005

pgdat->node_size_lock is basically only neeeded in one place in the normal
code: show_mem(), which is the arch-specific sysrq-m printing function.

Strictly speaking, the architectures not doing memory hotplug do no need this
locking in show_mem(). However, they are all included for completeness. This
should also make any future consolidation of all of the implementations a
little more straightforward.

This lock is also held in the sparsemem code during a memory removal, as
sections are invalidated. This is the place there pfn_valid() is made false
for a memory area that's being removed. The lock is only required when doing
pfn_valid() operations on memory which the user does not already have a
reference on the page, such as in show_mem().
Signed-off-by: NDave Hansen <haveblue@us.ibm.com>
Signed-off-by: NAndrew Morton <akpm@osdl.org>
Signed-off-by: NLinus Torvalds <torvalds@osdl.org>

208d54e5

[PATCH] memory hotplug prep: fixup bad_range() · c6a57e19

由 Dave Hansen 提交于 10月 29, 2005

When doing memory hotplug operations, the size of existing zones can obviously
change.  This means that zone->zone_{start_pfn,spanned_pages} can change.

There are currently no locks that protect these structure members.  However,
they are rarely accessed at runtime.  Outside of swsusp, the only place that I
can find is bad_range().

So, split bad_range() up into two pieces: one that needs to be locked and
anther that doesn't.
Signed-off-by: NDave Hansen <haveblue@us.ibm.com>
Signed-off-by: NAndrew Morton <akpm@osdl.org>
Signed-off-by: NLinus Torvalds <torvalds@osdl.org>

c6a57e19

[PATCH] memory hotplug prep: __section_nr helper · 4ca644d9

由 Dave Hansen 提交于 10月 29, 2005

A little helper that we use in the hotplug code.
Signed-off-by: NDave Hansen <haveblue@us.ibm.com>
Signed-off-by: NAndrew Morton <akpm@osdl.org>
Signed-off-by: NLinus Torvalds <torvalds@osdl.org>

4ca644d9

[PATCH] memory hotplug prep: break out zone initialization · ed8ece2e

由 Dave Hansen 提交于 10月 29, 2005

If a zone is empty at boot-time and then hot-added to later, it needs to run
the same init code that would have been run on it at boot.

This patch breaks out zone table and per-cpu-pages functions for use by the
hotplug code.  You can almost see all of the free_area_init_core() function on
one page now.  :)
Signed-off-by: NDave Hansen <haveblue@us.ibm.com>
Signed-off-by: NAndrew Morton <akpm@osdl.org>
Signed-off-by: NLinus Torvalds <torvalds@osdl.org>

ed8ece2e

[PATCH] memory hotplug prep: kill local_mapnr · 2774812f

由 Dave Hansen 提交于 10月 29, 2005

The following series implements memory hot-add for ppc64 and i386.  There are
x86_64 and ia64 implementations that will be submitted shortly as well,
through the normal maintainers.

This patch:

local_mapnr is unused, except for in an alpha header.  Keep the alpha one,
kill the rest.
Signed-off-by: NDave Hansen <haveblue@us.ibm.com>
Signed-off-by: NAndrew Morton <akpm@osdl.org>
Signed-off-by: NLinus Torvalds <torvalds@osdl.org>

2774812f

[PATCH] .text page fault SMP scalability optimization · 1a44e149

由 Andrea Arcangeli 提交于 10月 29, 2005

We had a problem on ppc64 where with more than 4 threads a large system
wouldn't scale well while faulting in the .text (most of the time was spent
in the kernel despite it was an userland compute intensive app).  The
reason is the useless overwrite of the same pte from all cpu.

I fixed it this way (verified on an older kernel but the forward port is
almost identical).  This will benefit all archs not just ppc64.
Signed-off-by: NAndrea Arcangeli <andrea@suse.de>
Cc: Hugh Dickins <hugh@veritas.com>
Signed-off-by: NAndrew Morton <akpm@osdl.org>
Signed-off-by: NLinus Torvalds <torvalds@osdl.org>

1a44e149

[PATCH] hugetlb: overcommit accounting check · 2e9b367c

由 Adam Litke 提交于 10月 29, 2005

Basic overcommit checking for hugetlb_file_map() based on an implementation
used with demand faulting in SLES9.

Since demand faulting can't guarantee the availability of pages at mmap
time, this patch implements a basic sanity check to ensure that the number
of huge pages required to satisfy the mmap are currently available.
Despite the obvious race, I think it is a good start on doing proper
accounting.  I'd like to work towards an accounting system that mimics the
semantics of normal pages (especially for the MAP_PRIVATE/COW case).  That
work is underway and builds on what this patch starts.

Huge page shared memory segments are simpler and still maintain their
commit on shmget semantics.
Signed-off-by: NAdam Litke <agl@us.ibm.com>
Signed-off-by: NAndrew Morton <akpm@osdl.org>
Signed-off-by: NLinus Torvalds <torvalds@osdl.org>

2e9b367c

[PATCH] hugetlb: demand fault handler · 4c887265

由 Adam Litke 提交于 10月 29, 2005

Below is a patch to implement demand faulting for huge pages. The main
motivation for changing from prefaulting to demand faulting is so that huge
page memory areas can be allocated according to NUMA policy.

Thanks to consolidated hugetlb code, switching the behavior requires changing
only one fault handler. The bulk of the patch just moves the logic from
hugelb_prefault() to hugetlb_pte_fault() and find_get_huge_page().
Signed-off-by: NAdam Litke <agl@us.ibm.com>
Signed-off-by: NAndrew Morton <akpm@osdl.org>
Signed-off-by: NLinus Torvalds <torvalds@osdl.org>

4c887265

[PATCH] hugetlb: remove repeated code · 551110a9

由 Krishnakumar R 提交于 10月 29, 2005

Clean up some repeated code related to HugeTLB.  hugetlb_zero_setup would
have already allocated the file->f_op.
Signed-off-by: NKrishnakumar. R <rkrishnakumar@gmail.com>
Signed-off-by: NAndrew Morton <akpm@osdl.org>
Signed-off-by: NLinus Torvalds <torvalds@osdl.org>

551110a9

[PATCH] cleanup hugelbfs_forget_inode · 0b1533f6

由 Christoph Hellwig 提交于 10月 29, 2005

Reformat hugelbfs_forget_inode and add the missing but harmless
write_inode_now call.  It looks the same as generic_forget_inode now except
for the call to truncate_hugepages instead of truncate_inode_pages.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NAndrew Morton <akpm@osdl.org>
Signed-off-by: NLinus Torvalds <torvalds@osdl.org>

0b1533f6

[PATCH] kill hugelbfs_do_delete_inode · 6b09b9df

由 Christoph Hellwig 提交于 10月 29, 2005

hugetlbfs_do_delete_inode is the same as generic_delete_inode now, so remove
it in favour of the latter.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NAndrew Morton <akpm@osdl.org>
Signed-off-by: NLinus Torvalds <torvalds@osdl.org>

6b09b9df

[PATCH] hugetlbfs: clean up hugetlbfs_delete_inode · 149f4211

由 Christoph Hellwig 提交于 10月 29, 2005

Make hugetlbfs looks the same as generic_detelte_inode, fixing a bunch of
missing updates to it at the same time.  Rename it to
hugetlbfs_do_delete_inode and add a real hugetlbfs_delete_inode that
implements ->delete_inode.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NAndrew Morton <akpm@osdl.org>
Signed-off-by: NLinus Torvalds <torvalds@osdl.org>

149f4211

[PATCH] hugetlbfs: move free_inodes accounting · 96527980

由 Christoph Hellwig 提交于 10月 29, 2005

Move hugetlbfs accounting into ->alloc_inode / ->destroy_inode.  This keeps
the code simpler, fixes a loeak where a failing inode allocation wouldn't
decrement the counter and moves hugetlbfs_delete_inode and
hugetlbfs_forget_inode closer to their generic counterparts.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NAndrew Morton <akpm@osdl.org>
Signed-off-by: NLinus Torvalds <torvalds@osdl.org>

96527980

[PATCH] mm: update comments to pte lock · b8072f09

由 Hugh Dickins 提交于 10月 29, 2005

Updated several references to page_table_lock in common code comments.
Signed-off-by: NHugh Dickins <hugh@veritas.com>
Signed-off-by: NAndrew Morton <akpm@osdl.org>
Signed-off-by: NLinus Torvalds <torvalds@osdl.org>

b8072f09

[PATCH] mm: fix rss and mmlist locking · f412ac08

由 Hugh Dickins 提交于 10月 29, 2005

A couple of oddities were guarded by page_table_lock, no longer properly
guarded when that is split.

The mm_counters of file_rss and anon_rss: make those an atomic_t, or an
atomic64_t if the architecture supports it, in such a case.  Definitions by
courtesy of Christoph Lameter: who spent considerable effort on more scalable
ways of counting, but found insufficient benefit in practice.

And adding an mm with swap to the mmlist for swapoff: the list is well-
guarded by its own lock, but the list_empty check now has to be repeated
inside it.
Signed-off-by: NHugh Dickins <hugh@veritas.com>
Signed-off-by: NAndrew Morton <akpm@osdl.org>
Signed-off-by: NLinus Torvalds <torvalds@osdl.org>

f412ac08

[PATCH] mm: split page table lock · 4c21e2f2

由 Hugh Dickins 提交于 10月 29, 2005

Christoph Lameter demonstrated very poor scalability on the SGI 512-way, with
a many-threaded application which concurrently initializes different parts of
a large anonymous area.

This patch corrects that, by using a separate spinlock per page table page, to
guard the page table entries in that page, instead of using the mm's single
page_table_lock. (But even then, page_table_lock is still used to guard page
table allocation, and anon_vma allocation.)

In this implementation, the spinlock is tucked inside the struct page of the
page table page: with a BUILD_BUG_ON in case it overflows - which it would in
the case of 32-bit PA-RISC with spinlock debugging enabled.

Splitting the lock is not quite for free: another cacheline access. Ideally,
I suppose we would use split ptlock only for multi-threaded processes on
multi-cpu machines; but deciding that dynamically would have its own costs.
So for now enable it by config, at some number of cpus - since the Kconfig
language doesn't support inequalities, let preprocessor compare that with
NR_CPUS. But I don't think it's worth being user-configurable: for good
testing of both split and unsplit configs, split now at 4 cpus, and perhaps
change that to 8 later.

There is a benefit even for singly threaded processes: kswapd can be attacking
one part of the mm while another part is busy faulting.
Signed-off-by: NHugh Dickins <hugh@veritas.com>
Signed-off-by: NAndrew Morton <akpm@osdl.org>
Signed-off-by: NLinus Torvalds <torvalds@osdl.org>

4c21e2f2

[PATCH] mm: uml kill unused · b38c6845

由 Hugh Dickins 提交于 10月 29, 2005

In worrying over the various pte operations in different architectures, I came
across some unused functions in UML: remove mprotect_kernel_vm,
protect_vm_page and addr_pte.
Signed-off-by: NHugh Dickins <hugh@veritas.com>
Signed-off-by: NAndrew Morton <akpm@osdl.org>
Signed-off-by: NLinus Torvalds <torvalds@osdl.org>

b38c6845

[PATCH] mm: uml pte atomicity · 8f5cd76c

由 Hugh Dickins 提交于 10月 29, 2005

There's usually a good reason when a pte is examined without the lock; but it
makes me nervous when the pointer is dereferenced more than once.
Signed-off-by: NHugh Dickins <hugh@veritas.com>
Signed-off-by: NAndrew Morton <akpm@osdl.org>
Signed-off-by: NLinus Torvalds <torvalds@osdl.org>

8f5cd76c

[PATCH] mm: cris v32 mmu_context_lock · a7e4705b

由 Hugh Dickins 提交于 10月 29, 2005

The cris v32 switch_mm guards get_mmu_context with next->page_table_lock: good
it's not really SMP yet, since get_mmu_context messes with global variables
affecting other mms.  Replace by global mmu_context_lock.
Signed-off-by: NHugh Dickins <hugh@veritas.com>
Signed-off-by: NAndrew Morton <akpm@osdl.org>
Signed-off-by: NLinus Torvalds <torvalds@osdl.org>

a7e4705b

[PATCH] mm: parisc pte atomicity · 92dc6fcc

由 Hugh Dickins 提交于 10月 29, 2005

There's a worrying function translation_exists in parisc cacheflush.h,
unaffected by split ptlock since flush_dcache_page is using it on some other
mm, without any relevant lock. Oh well, make it a slightly more robust by
factoring the pfn check within it. And it looked liable to confuse a
camouflaged swap or file entry with a good pte: fix that too.
Signed-off-by: NHugh Dickins <hugh@veritas.com>
Signed-off-by: NAndrew Morton <akpm@osdl.org>
Signed-off-by: NLinus Torvalds <torvalds@osdl.org>

92dc6fcc

[PATCH] mm: arm ready for split ptlock · 69b04754

由 Hugh Dickins 提交于 10月 29, 2005

Prepare arm for the split page_table_lock: three issues.

Signal handling's preserve and restore of iwmmxt context currently involves
reading and writing that context to and from user space, while holding
page_table_lock to secure the user page(s) against kswapd. If we split the
lock, then the structure might span two pages, secured by to read into and
write from a kernel stack buffer, copying that out and in without locking (the
structure is 160 bytes in size, and here we're near the top of the kernel
stack). Or would the overhead be noticeable?

arm_syscall's cmpxchg emulation use pte_offset_map_lock, instead of
pte_offset_map and mm-wide page_table_lock; and strictly, it should now also
take mmap_sem before descending to pmd, to guard against another thread
munmapping, and the page table pulled out beneath this thread.

Updated two comments in fault-armv.c. adjust_pte is interesting, since its
modification of a pte in one part of the mm depends on the lock held when
calling update_mmu_cache for a pte in some other part of that mm. This can't
be done with a split page_table_lock (and we've already taken the lowest lock
in the hierarchy here): so we'll have to disable split on arm, unless
CONFIG_CPU_CACHE_VIPT to ensures adjust_pte never used.
Signed-off-by: NHugh Dickins <hugh@veritas.com>
Signed-off-by: NAndrew Morton <akpm@osdl.org>
Signed-off-by: NLinus Torvalds <torvalds@osdl.org>

69b04754

[PATCH] mm: i386 sh sh64 ready for split ptlock · 60ec5585

由 Hugh Dickins 提交于 10月 29, 2005

Use pte_offset_map_lock, instead of pte_offset_map (or inappropriate
pte_offset_kernel) and mm-wide page_table_lock, in sundry arch places.

The i386 vm86 mark_screen_rdonly: yes, there was and is an assumption that the
screen fits inside the one page table, as indeed it does.

The sh __do_page_fault: which handles both kernel faults (without lock) and
user mm faults (locked - though it set_pte without locking before).

The sh64 flush_cache_range and helpers: which wrongly thought callers held
page_table_lock before (only its tlb_start_vma did, and no longer does so);
moved the flush loop down, and adjusted the large versus small range decision
to consider a range which spans page tables as large.
Signed-off-by: NHugh Dickins <hugh@veritas.com>
Acked-by: NPaul Mundt <lethal@linux-sh.org>
Signed-off-by: NAndrew Morton <akpm@osdl.org>
Signed-off-by: NLinus Torvalds <torvalds@osdl.org>

60ec5585

[PATCH] mm: follow_page with inner ptlock · deceb6cd

由 Hugh Dickins 提交于 10月 29, 2005

Final step in pushing down common core's page_table_lock. follow_page no
longer wants caller to hold page_table_lock, uses pte_offset_map_lock itself;
and so no page_table_lock is taken in get_user_pages itself.

But get_user_pages (and get_futex_key) do then need follow_page to pin the
page for them: take Daniel's suggestion of bitflags to follow_page.

Need one for WRITE, another for TOUCH (it was the accessed flag before:
vanished along with check_user_page_readable, but surely get_numa_maps is
wrong to mark every page it finds as accessed), another for GET.

And another, ANON to dispose of untouched_anonymous_page: it seems silly for
that to descend a second time, let follow_page observe if there was no page
table and return ZERO_PAGE if so. Fix minor bug in that: check VM_LOCKED -
make_pages_present ought to make readonly anonymous present.

Give get_numa_maps a cond_resched while we're there.
Signed-off-by: NHugh Dickins <hugh@veritas.com>
Signed-off-by: NAndrew Morton <akpm@osdl.org>
Signed-off-by: NLinus Torvalds <torvalds@osdl.org>

deceb6cd

[PATCH] mm: kill check_user_page_readable · c34d1b4d

由 Hugh Dickins 提交于 10月 29, 2005

check_user_page_readable is a problematic variant of follow_page.  It's used
only by oprofile's i386 and arm backtrace code, at interrupt time, to
establish whether a userspace stackframe is currently readable.

This is problematic, because we want to push the page_table_lock down inside
follow_page, and later split it; whereas oprofile is doing a spin_trylock on
it (in the i386 case, forgotten in the arm case), and needs that to pin
perhaps two pages spanned by the stackframe (which might be covered by
different locks when we split).

I think oprofile is going about this in the wrong way: it doesn't need to know
the area is readable (neither i386 nor arm uses read protection of user
pages), it doesn't need to pin the memory, it should simply
__copy_from_user_inatomic, and see if that succeeds or not.  Sorry, but I've
not got around to devising the sparse __user annotations for this.

Then we can eliminate check_user_page_readable, and return to a single
follow_page without the __follow_page variants.
Signed-off-by: NHugh Dickins <hugh@veritas.com>
Signed-off-by: NAndrew Morton <akpm@osdl.org>
Signed-off-by: NLinus Torvalds <torvalds@osdl.org>

c34d1b4d

[PATCH] mm: rmap with inner ptlock · c0718806

由 Hugh Dickins 提交于 10月 29, 2005

rmap's page_check_address descend without page_table_lock. First just
pte_offset_map in case there's no pte present worth locking for, then take
page_table_lock for the full check, and pass ptl back to caller in the same
style as pte_offset_map_lock. __xip_unmap, page_referenced_one and
try_to_unmap_one use pte_unmap_unlock. try_to_unmap_cluster also.

page_check_address reformatted to avoid progressive indentation. No use is
made of its one error code, return NULL when it fails.
Signed-off-by: NHugh Dickins <hugh@veritas.com>
Signed-off-by: NAndrew Morton <akpm@osdl.org>
Signed-off-by: NLinus Torvalds <torvalds@osdl.org>

c0718806

[PATCH] mm: xip_unmap ZERO_PAGE fix · 67b02f11

由 Hugh Dickins 提交于 10月 29, 2005

Small fix to the PageReserved patch: the mips ZERO_PAGE(address) depends on
address, so __xip_unmap is wrong to initialize page with that before address
is initialized; and in fact must re-evaluate it each iteration.
Signed-off-by: NHugh Dickins <hugh@veritas.com>
Signed-off-by: NAndrew Morton <akpm@osdl.org>
Signed-off-by: NLinus Torvalds <torvalds@osdl.org>

67b02f11

[PATCH] mm: unmap_vmas with inner ptlock · 508034a3

由 Hugh Dickins 提交于 10月 29, 2005

Remove the page_table_lock from around the calls to unmap_vmas, and replace
the pte_offset_map in zap_pte_range by pte_offset_map_lock: all callers are
now safe to descend without page_table_lock.

Don't attempt fancy locking for hugepages, just take page_table_lock in
unmap_hugepage_range.  Which makes zap_hugepage_range, and the hugetlb test in
zap_page_range, redundant: unmap_vmas calls unmap_hugepage_range anyway.  Nor
does unmap_vmas have much use for its mm arg now.

The tlb_start_vma and tlb_end_vma in unmap_page_range are now called without
page_table_lock: if they're implemented at all, they typically come down to
flush_cache_range (usually done outside page_table_lock) and flush_tlb_range
(which we already audited for the mprotect case).
Signed-off-by: NHugh Dickins <hugh@veritas.com>
Signed-off-by: NAndrew Morton <akpm@osdl.org>
Signed-off-by: NLinus Torvalds <torvalds@osdl.org>

508034a3

[PATCH] mm: unlink vma before pagetables · 8f4f8c16

由 Hugh Dickins 提交于 10月 29, 2005

In most places the descent from pgd to pud to pmd to pte holds mmap_sem
(exclusively or not), which ensures that free_pgtables cannot be freeing page
tables from any level at the same time. But truncation and reverse mapping
descend without mmap_sem.

No problem: just make sure that a vma is unlinked from its prio_tree (or
nonlinear list) and from its anon_vma list, after zapping the vma, but before
freeing its page tables. Then neither vmtruncate nor rmap can reach that vma
whose page tables are now volatile (nor do they need to reach it, since all
its page entries have been zapped by this stage).

The i_mmap_lock and anon_vma->lock already serialize this correctly; but the
locking hierarchy is such that we cannot take them while holding
page_table_lock. Well, we're trying to push that down anyway. So in this
patch, move anon_vma_unlink and unlink_file_vma into free_pgtables, at the
same time as moving page_table_lock around calls to unmap_vmas.

tlb_gather_mmu and tlb_finish_mmu then fall outside the page_table_lock, but
we made them preempt_disable and preempt_enable earlier; and a long source
audit of all the architectures has shown no problem with removing
page_table_lock from them. free_pgtables doesn't need page_table_lock for
itself, nor for what it calls; tlb->mm->nr_ptes is usually protected by
page_table_lock, but partly by non-exclusive mmap_sem - here it's decremented
with exclusive mmap_sem, or mm_users 0. update_hiwater_rss and
vm_unacct_memory don't need page_table_lock either.
Signed-off-by: NHugh Dickins <hugh@veritas.com>
Signed-off-by: NAndrew Morton <akpm@osdl.org>
Signed-off-by: NLinus Torvalds <torvalds@osdl.org>

8f4f8c16

[PATCH] mm: flush_tlb_range outside ptlock · 663b97f7

由 Hugh Dickins 提交于 10月 29, 2005

There was one small but very significant change in the previous patch:
mprotect's flush_tlb_range fell outside the page_table_lock: as it is in 2.4,
but that doesn't prove it safe in 2.6.

On some architectures flush_tlb_range comes to the same as flush_tlb_mm, which
has always been called from outside page_table_lock in dup_mmap, and is so
proved safe. Others required a deeper audit: I could find no reliance on
page_table_lock in any; but in ia64 and parisc found some code which looks a
bit as if it might want preemption disabled. That won't do any actual harm,
so pending a decision from the maintainers, disable preemption there.

Remove comments on page_table_lock from flush_tlb_mm, flush_tlb_range and
flush_tlb_page entries in cachetlb.txt: they were rather misleading (what
generic code does is different from what usually happens), the rules are now
changing, and it's not yet clear where we'll end up (will the generic
tlb_flush_mmu happen always under lock? never under lock? or sometimes under
and sometimes not?).
Signed-off-by: NHugh Dickins <hugh@veritas.com>
Signed-off-by: NAndrew Morton <akpm@osdl.org>
Signed-off-by: NLinus Torvalds <torvalds@osdl.org>

663b97f7

[PATCH] mm: pte_offset_map_lock loops · 705e87c0

由 Hugh Dickins 提交于 10月 29, 2005

Convert those common loops using page_table_lock on the outside and
pte_offset_map within to use just pte_offset_map_lock within instead.

These all hold mmap_sem (some exclusively, some not), so at no level can a
page table be whipped away from beneath them. But whereas pte_alloc loops
tested with the "atomic" pmd_present, these loops are testing with pmd_none,
which on i386 PAE tests both lower and upper halves.

That's now unsafe, so add a cast into pmd_none to test only the vital lower
half: we lose a little sensitivity to a corrupt middle directory, but not
enough to worry about. It appears that i386 and UML were the only
architectures vulnerable in this way, and pgd and pud no problem.
Signed-off-by: NHugh Dickins <hugh@veritas.com>
Signed-off-by: NAndrew Morton <akpm@osdl.org>
Signed-off-by: NLinus Torvalds <torvalds@osdl.org>

705e87c0

[PATCH] mm: page fault handler locking · 8f4e2101

由 Hugh Dickins 提交于 10月 29, 2005

On the page fault path, the patch before last pushed acquiring the
page_table_lock down to the head of handle_pte_fault (though it's also taken
and dropped earlier when a new page table has to be allocated).

Now delete that line, read "entry = *pte" without it, and go off to this or
that page fault handler on the basis of this unlocked peek. Usually the
handler can proceed without the lock, relying on the subsequent locked
pte_same or pte_none test to back out when necessary; though do_wp_page needs
the lock immediately, and do_file_page doesn't check (if there's a race,
install_page just zaps the entry and reinstalls it).

But on those architectures (notably i386 with PAE) whose pte is too big to be
read atomically, if SMP or preemption is enabled, do_swap_page and
do_file_page might cause irretrievable damage if passed a Frankenstein entry
stitched together from unrelated parts. In those configs, "pte_unmap_same"
has to take page_table_lock, validate orig_pte still the same, and drop
page_table_lock before unmapping, before proceeding.

Use pte_offset_map_lock and pte_unmap_unlock throughout the handlers; but lock
avoidance leaves more lone maps and unmaps than elsewhere.
Signed-off-by: NHugh Dickins <hugh@veritas.com>
Signed-off-by: NAndrew Morton <akpm@osdl.org>
Signed-off-by: NLinus Torvalds <torvalds@osdl.org>

8f4e2101

[PATCH] mm: arches skip ptlock · b462705a

由 Hugh Dickins 提交于 10月 29, 2005

Convert those few architectures which are calling pud_alloc, pmd_alloc,
pte_alloc_map on a user mm, not to take the page_table_lock first, nor drop it
after. Each of these can continue to use pte_alloc_map, no need to change
over to pte_alloc_map_lock, they're neither racy nor swappable.

In the sparc64 io_remap_pfn_range, flush_tlb_range then falls outside of the
page_table_lock: that's okay, on sparc64 it's like flush_tlb_mm, and that has
always been called from outside of page_table_lock in dup_mmap.
Signed-off-by: NHugh Dickins <hugh@veritas.com>
Signed-off-by: NAndrew Morton <akpm@osdl.org>
Signed-off-by: NLinus Torvalds <torvalds@osdl.org>

b462705a

[PATCH] mm: ptd_alloc take ptlock · c74df32c

由 Hugh Dickins 提交于 10月 29, 2005

Second step in pushing down the page_table_lock. Remove the temporary
bridging hack from __pud_alloc, __pmd_alloc, __pte_alloc: expect callers not
to hold page_table_lock, whether it's on init_mm or a user mm; take
page_table_lock internally to check if a racing task already allocated.

Convert their callers from common code. But avoid coming back to change them
again later: instead of moving the spin_lock(&mm->page_table_lock) down,
switch over to new macros pte_alloc_map_lock and pte_unmap_unlock, which
encapsulate the mapping+locking and unlocking+unmapping together, and in the
end may use alternatives to the mm page_table_lock itself.

These callers all hold mmap_sem (some exclusively, some not), so at no level
can a page table be whipped away from beneath them; and pte_alloc uses the
"atomic" pmd_present to test whether it needs to allocate. It appears that on
all arches we can safely descend without page_table_lock.
Signed-off-by: NHugh Dickins <hugh@veritas.com>
Signed-off-by: NAndrew Morton <akpm@osdl.org>
Signed-off-by: NLinus Torvalds <torvalds@osdl.org>

c74df32c

[PATCH] mm: ptd_alloc inline and out · 1bb3630e

由 Hugh Dickins 提交于 10月 29, 2005

It seems odd to me that, whereas pud_alloc and pmd_alloc test inline, only
calling out-of-line __pud_alloc __pmd_alloc if allocation needed,
pte_alloc_map and pte_alloc_kernel are entirely out-of-line.  Though it does
add a little to kernel size, change them to macros testing inline, calling
__pte_alloc or __pte_alloc_kernel to allocate out-of-line.  Mark none of them
as fastcalls, leave that to CONFIG_REGPARM or not.

It also seems more natural for the out-of-line functions to leave the offset
calculation and map to the inline, which has to do it anyway for the common
case.  At least mremap move wants __pte_alloc without _map.

Macros rather than inline functions, certainly to avoid the header file issues
which arise from CONFIG_HIGHPTE needing kmap_types.h, but also in case any
architectures I haven't built would have other such problems.
Signed-off-by: NHugh Dickins <hugh@veritas.com>
Signed-off-by: NAndrew Morton <akpm@osdl.org>
Signed-off-by: NLinus Torvalds <torvalds@osdl.org>

1bb3630e

[PATCH] mm: init_mm without ptlock · 872fec16

由 Hugh Dickins 提交于 10月 29, 2005

First step in pushing down the page_table_lock. init_mm.page_table_lock has
been used throughout the architectures (usually for ioremap): not to serialize
kernel address space allocation (that's usually vmlist_lock), but because
pud_alloc,pmd_alloc,pte_alloc_kernel expect caller holds it.

Reverse that: don't lock or unlock init_mm.page_table_lock in any of the
architectures; instead rely on pud_alloc,pmd_alloc,pte_alloc_kernel to take
and drop it when allocating a new one, to check lest a racing task already
did. Similarly no page_table_lock in vmalloc's map_vm_area.

Some temporary ugliness in __pud_alloc and __pmd_alloc: since they also handle
user mms, which are converted only by a later patch, for now they have to lock
differently according to whether or not it's init_mm.

If sources get muddled, there's a danger that an arch source taking
init_mm.page_table_lock will be mixed with common source also taking it (or
neither take it). So break the rules and make another change, which should
break the build for such a mismatch: remove the redundant mm arg from
pte_alloc_kernel (ppc64 scrapped its distinct ioremap_mm in 2.6.13).

Exceptions: arm26 used pte_alloc_kernel on user mm, now pte_alloc_map; ia64
used pte_alloc_map on init_mm, now pte_alloc_kernel; parisc had bad args to
pmd_alloc and pte_alloc_kernel in unused USE_HPPA_IOREMAP code; ppc64
map_io_page forgot to unlock on failure; ppc mmu_mapin_ram and ppc64 im_free
took page_table_lock for no good reason.
Signed-off-by: NHugh Dickins <hugh@veritas.com>
Signed-off-by: NAndrew Morton <akpm@osdl.org>
Signed-off-by: NLinus Torvalds <torvalds@osdl.org>

872fec16

[PATCH] mm: ia64 use expand_upwards · 46dea3d0

由 Hugh Dickins 提交于 10月 29, 2005

ia64 has expand_backing_store function for growing its Register Backing Store
vma upwards. But more complete code for this purpose is found in the
CONFIG_STACK_GROWSUP part of mm/mmap.c. Uglify its #ifdefs further to provide
expand_upwards for ia64 as well as expand_stack for parisc.

The Register Backing Store vma should be marked VM_ACCOUNT. Implement the
intention of growing it only a page at a time, instead of passing an address
outside of the vma to handle_mm_fault, with unknown consequences.
Signed-off-by: NHugh Dickins <hugh@veritas.com>
Signed-off-by: NAndrew Morton <akpm@osdl.org>
Signed-off-by: NLinus Torvalds <torvalds@osdl.org>

46dea3d0

[PATCH] mm: mm_struct hiwaters moved · f449952b

由 Hugh Dickins 提交于 10月 29, 2005

Slight and timid rearrangement of mm_struct: hiwater_rss and hiwater_vm were
tacked on the end, but it seems better to keep them near _file_rss, _anon_rss
and total_vm, in the same cacheline on those arches verified.

There are likely to be more profitable rearrangements, but less obvious (is it
good or bad that saved_auxv[AT_VECTOR_SIZE] isolates cpu_vm_mask and context
from many others?), needing serious instrumentation.
Signed-off-by: NHugh Dickins <hugh@veritas.com>
Signed-off-by: NAndrew Morton <akpm@osdl.org>
Signed-off-by: NLinus Torvalds <torvalds@osdl.org>

f449952b

Linux-御风守护者 / linux 与 Fork 源项目一致

Linux-御风守护者 / linux
与 Fork 源项目一致