提交 · 2152f8536668a957ea3214735b4761e7b22ef7d8 · openanolis / cloud-kernel

23 3月, 2006 1 次提交

Merge master.kernel.org:/pub/scm/linux/kernel/git/jejb/scsi-misc-2.6 · 2152f853

由 Linus Torvalds 提交于 3月 22, 2006

* master.kernel.org:/pub/scm/linux/kernel/git/jejb/scsi-misc-2.6: (138 commits)
  [SCSI] libata: implement minimal transport template for ->eh_timed_out
  [SCSI] eliminate rphy allocation in favour of expander/end device allocation
  [SCSI] convert mptsas over to end_device/expander allocations
  [SCSI] allow displaying and setting of cache type via sysfs
  [SCSI] add scsi_mode_select to scsi_lib.c
  [SCSI] 3ware 9000 add big endian support
  [SCSI] qla2xxx: update MAINTAINERS
  [SCSI] scsi: move target_destroy call
  [SCSI] fusion - bump version
  [SCSI] fusion - expander hotplug suport in mptsas module
  [SCSI] fusion - exposing raid components in mptsas
  [SCSI] fusion - memory leak, and initializing fields
  [SCSI] fusion - exclosure misspelled
  [SCSI] fusion - cleanup mptsas event handling functions
  [SCSI] fusion - removing target_id/bus_id from the VirtDevice structure
  [SCSI] fusion - static fix's
  [SCSI] fusion - move some debug firmware event debug msgs to verbose level
  [SCSI] fusion - loginfo header update
  [SCSI] add scsi_reprobe_device
  [SCSI] megaraid_sas: fix extended timeout handling
  ...

2152f853

22 3月, 2006 39 次提交

[PATCH] SELinux: add slab cache for inode security struct · 7cae7e26

由 James Morris 提交于 3月 22, 2006

Add a slab cache for the SELinux inode security struct, one of which is
allocated for every inode instantiated by the system.

The memory savings are considerable.

On 64-bit, instead of the size-128 cache, we have a slab object of 96
bytes, saving 32 bytes per object.  After booting, I see about 4000 of
these and then about 17,000 after a kernel compile.  With this patch, we
save around 530KB of kernel memory in the latter case.  On 32-bit, the
savings are about half of this.
Signed-off-by: NJames Morris <jmorris@namei.org>
Acked-by: NStephen Smalley <sds@tycho.nsa.gov>
Signed-off-by: NAndrew Morton <akpm@osdl.org>
Signed-off-by: NLinus Torvalds <torvalds@osdl.org>

7cae7e26

[PATCH] SELinux: cleanup stray variable in selinux_inode_init_security() · cf01efd0

由 James Morris 提交于 3月 22, 2006

Remove an unneded pointer variable in selinux_inode_init_security().
Signed-off-by: NJames Morris <jmorris@namei.org>
Acked-by: NStephen Smalley <sds@tycho.nsa.gov>
Signed-off-by: NAndrew Morton <akpm@osdl.org>
Signed-off-by: NLinus Torvalds <torvalds@osdl.org>

cf01efd0

[PATCH] SELinux: fix hard link count for selinuxfs root directory · edb20fb5

由 James Morris 提交于 3月 22, 2006

A further fix is needed for selinuxfs link count management, to ensure that
the count is correct for the parent directory when a subdirectory is
created.  This is only required for the root directory currently, but the
code has been updated for the general case.
Signed-off-by: NJames Morris <jmorris@namei.org>
Acked-by: NStephen Smalley <sds@tycho.nsa.gov>
Signed-off-by: NAndrew Morton <akpm@osdl.org>
Signed-off-by: NLinus Torvalds <torvalds@osdl.org>

edb20fb5

[PATCH] selinuxfs cleanups: sel_make_avc_files · d6aafa65

由 James Morris 提交于 3月 22, 2006

Fix copy & paste error in sel_make_avc_files(), removing a supurious call to
d_genocide() in the error path.  All of this will be cleaned up by
kill_litter_super().
Signed-off-by: NJames Morris <jmorris@namei.org>
Acked-by: NStephen Smalley <sds@tycho.nsa.gov>
Signed-off-by: NAndrew Morton <akpm@osdl.org>
Signed-off-by: NLinus Torvalds <torvalds@osdl.org>

d6aafa65

[PATCH] selinuxfs cleanups: sel_make_bools · 253a8b1d

由 James Morris 提交于 3月 22, 2006

Remove the call to sel_make_bools() from sel_fill_super(), as policy needs to
be loaded before the boolean files can be created.  Policy will never be
loaded during sel_fill_super() as selinuxfs is kernel mounted during init and
the only means to load policy is via selinuxfs.

Also, the call to d_genocide() on the error path of sel_make_bools() is
incorrect and replaced with sel_remove_bools().
Signed-off-by: NJames Morris <jmorris@namei.org>
Acked-by: NStephen Smalley <sds@tycho.nsa.gov>
Signed-off-by: NAndrew Morton <akpm@osdl.org>
Signed-off-by: NLinus Torvalds <torvalds@osdl.org>

253a8b1d

[PATCH] selinuxfs cleanups: sel_fill_super exit path · 161ce45a

由 James Morris 提交于 3月 22, 2006

Unify the error path of sel_fill_super() so that all errors pass through the
same point and generate an error message.  Also, removes a spurious dput() in
the error path which breaks the refcounting for the filesystem
(litter_kill_super() will correctly clean things up itself on error).
Signed-off-by: NJames Morris <jmorris@namei.org>
Acked-by: NStephen Smalley <sds@tycho.nsa.gov>
Signed-off-by: NAndrew Morton <akpm@osdl.org>
Signed-off-by: NLinus Torvalds <torvalds@osdl.org>

161ce45a

[PATCH] selinuxfs cleanups: use sel_make_dir() · cde174a8

由 James Morris 提交于 3月 22, 2006

Use existing sel_make_dir() helper to create booleans directory rather than
duplicating the logic.
Signed-off-by: NJames Morris <jmorris@namei.org>
Acked-by: NStephen Smalley <sds@tycho.nsa.gov>
Signed-off-by: NAndrew Morton <akpm@osdl.org>
Signed-off-by: NLinus Torvalds <torvalds@osdl.org>

cde174a8

[PATCH] selinuxfs cleanups: fix hard link count · 40e906f8

由 James Morris 提交于 3月 22, 2006

Fix the hard link count for selinuxfs directories, which are currently one
short.
Signed-off-by: NJames Morris <jmorris@namei.org>
Acked-by: NStephen Smalley <sds@tycho.nsa.gov>
Signed-off-by: NAndrew Morton <akpm@osdl.org>
Signed-off-by: NLinus Torvalds <torvalds@osdl.org>

40e906f8

[PATCH] selinux: simplify sel_read_bool · 68bdcf28

由 Stephen Smalley 提交于 3月 22, 2006

Simplify sel_read_bool to use the simple_read_from_buffer helper, like the
other selinuxfs functions.
Signed-off-by: NStephen Smalley <sds@tycho.nsa.gov>
Acked-by: NJames Morris <jmorris@namei.org>
Signed-off-by: NAndrew Morton <akpm@osdl.org>
Signed-off-by: NLinus Torvalds <torvalds@osdl.org>

68bdcf28

[PATCH] sem2mutex: security/ · bb003079

由 Ingo Molnar 提交于 3月 22, 2006

Semaphore to mutex conversion.

The conversion was generated via scripts, and the result was validated
automatically via a script as well.
Signed-off-by: NIngo Molnar <mingo@elte.hu>
Cc: Stephen Smalley <sds@epoch.ncsc.mil>
Cc: James Morris <jmorris@namei.org>
Cc: David Howells <dhowells@redhat.com>
Signed-off-by: NAndrew Morton <akpm@osdl.org>
Signed-off-by: NLinus Torvalds <torvalds@osdl.org>

bb003079

[PATCH] selinux: Disable automatic labeling of new inodes when no policy is loaded · 8aad3875

由 Stephen Smalley 提交于 3月 22, 2006

This patch disables the automatic labeling of new inodes on disk
when no policy is loaded.

Discussion is here:
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=180296

In short, we're changing the behavior so that when no policy is loaded,
SELinux does not label files at all.  Currently it does add an 'unlabeled'
label in this case, which we've found causes problems later.

SELinux always maintains a safe internal label if there is none, so with this
patch, we just stick with that and wait until a policy is loaded before adding
a persistent label on disk.

The effect is simply that if you boot with SELinux enabled but no policy
loaded and create a file in that state, SELinux won't try to set a security
extended attribute on the new inode on the disk.  This is the only sane
behavior for SELinux in that state, as it cannot determine the right label to
assign in the absence of a policy.  That state usually doesn't occur, but the
rawhide installer seemed to be misbehaving temporarily so it happened to show
up on a test install.
Signed-off-by: NStephen Smalley <sds@tycho.nsa.gov>
Acked-by: NJames Morris <jmorris@namei.org>
Signed-off-by: NAndrew Morton <akpm@osdl.org>
Signed-off-by: NLinus Torvalds <torvalds@osdl.org>

8aad3875

[PATCH] page migration reorg · b20a3503

由 Christoph Lameter 提交于 3月 22, 2006

Centralize the page migration functions in anticipation of additional
tinkering.  Creates a new file mm/migrate.c

1. Extract buffer_migrate_page() from fs/buffer.c

2. Extract central migration code from vmscan.c

3. Extract some components from mempolicy.c

4. Export pageout() and remove_from_swap() from vmscan.c

5. Make it possible to configure NUMA systems without page migration
   and non-NUMA systems with page migration.

I had to so some #ifdeffing in mempolicy.c that may need a cleanup.
Signed-off-by: NChristoph Lameter <clameter@sgi.com>
Signed-off-by: NAndrew Morton <akpm@osdl.org>
Signed-off-by: NLinus Torvalds <torvalds@osdl.org>

b20a3503

[PATCH] mm: slab cache interleave rotor fix · 442295c9

由 Paul Jackson 提交于 3月 22, 2006

The alien cache rotor in mm/slab.c assumes that the first online node is
node 0.  Eventually for some archs, especially with hotplug, this will no
longer be true.

Fix the interleave rotor to handle the general case of node numbering.
Signed-off-by: NPaul Jackson <pj@sgi.com>
Acked-by: NChristoph Lameter <clameter@engr.sgi.com>
Signed-off-by: NAndrew Morton <akpm@osdl.org>
Signed-off-by: NLinus Torvalds <torvalds@osdl.org>

442295c9

[PATCH] mm: hugetlb alloc_fresh_huge_page bogus node loop fix · fdb7cc59

由 Paul Jackson 提交于 3月 22, 2006

Fix bogus node loop in hugetlb.c alloc_fresh_huge_page(), which was
assuming that nodes are numbered contiguously from 0 to num_online_nodes().
Once the hotplug folks get this far, that will be false.
Signed-off-by: NPaul Jackson <pj@sgi.com>
Acked-by: NChristoph Lameter <clameter@sgi.com>
Signed-off-by: NAndrew Morton <akpm@osdl.org>
Signed-off-by: NLinus Torvalds <torvalds@osdl.org>

fdb7cc59

[PATCH] fix swap cluster offset · 9b65ef59

由 Akinobu Mita 提交于 3月 22, 2006

When we've allocated SWAPFILE_CLUSTER pages, ->cluster_next should be the
first index of swap cluster.  But current code probably sets it wrong offset.
Signed-off-by: NAkinobu Mita <mita@miraclelinux.com>
Acked-by: NHugh Dickins <hugh@veritas.com>
Signed-off-by: NAndrew Morton <akpm@osdl.org>
Signed-off-by: NLinus Torvalds <torvalds@osdl.org>

9b65ef59

[PATCH] drain_node_pages: interrupt latency reduction / optimization · 879336c3

由 Christoph Lameter 提交于 3月 22, 2006

1. Only disable interrupts if there is actually something to free

2. Only dirty the pcp cacheline if we actually freed something.

3. Disable interrupts for each single pcp and not for cleaning
  all the pcps in all zones of a node.

drain_node_pages is called every 2 seconds from cache_reap. This
fix should avoid most disabling of interrupts.
Signed-off-by: NChristoph Lameter <clameter@sgi.com>
Signed-off-by: NAndrew Morton <akpm@osdl.org>
Signed-off-by: NLinus Torvalds <torvalds@osdl.org>

879336c3

[PATCH] slab: fix drain_array() so that it works correctly with the shared_array · b18e7e65

由 Christoph Lameter 提交于 3月 22, 2006

The list_lock also protects the shared array and we call drain_array() with
the shared array.  Therefore we cannot go as far as I wanted to but have to
take the lock in a way so that it also protects the array_cache in
drain_pages.

(Note: maybe we should make the array_cache locking more consistent?  I.e.
always take the array cache lock for shared arrays and disable interrupts
for the per cpu arrays?)
Signed-off-by: NChristoph Lameter <clameter@sgi.com>
Signed-off-by: NAndrew Morton <akpm@osdl.org>
Signed-off-by: NLinus Torvalds <torvalds@osdl.org>

b18e7e65

[PATCH] slab: remove drain_array_locked · 1b55253a

由 Christoph Lameter 提交于 3月 22, 2006

Remove drain_array_locked and use that opportunity to limit the time the l3
lock is taken further.
Signed-off-by: NChristoph Lameter <clameter@sgi.com>
Signed-off-by: NAndrew Morton <akpm@osdl.org>
Signed-off-by: NLinus Torvalds <torvalds@osdl.org>

1b55253a

[PATCH] slab: make drain_array more universal by adding more parameters · aab2207c

由 Christoph Lameter 提交于 3月 22, 2006

And a parameter to drain_array to control the freeing of all objects and
then use drain_array() to replace instances of drain_array_locked with
drain_array.  Doing so will avoid taking locks in those locations if the
arrays are empty.
Signed-off-by: NChristoph Lameter <clameter@sgi.com>
Signed-off-by: NAndrew Morton <akpm@osdl.org>
Signed-off-by: NLinus Torvalds <torvalds@osdl.org>

aab2207c

[PATCH] slab: cache_reap(): further reduction in interrupt holdoff · 35386e3b

由 Christoph Lameter 提交于 3月 22, 2006

cache_reap takes the l3->list_lock (disabling interrupts) unconditionally
and then does a few checks and maybe does some cleanup.  This patch makes
cache_reap() only take the lock if there is work to do and then the lock is
taken and released for each cleaning action.

The checking of when to do the next reaping is done without any locking and
becomes racy.  Should not matter since reaping can also be skipped if the
slab mutex cannot be acquired.

The same is true for the touched processing.  If we get this wrong once in
awhile then we will mistakenly clean or not clean the shared cache.  This
will impact performance slightly.

Note that the additional drain_array() function introduced here will fall
out in a subsequent patch since array cleaning will now be very similar
from all callers.
Signed-off-by: NChristoph Lameter <clameter@sgi.com>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: NAndrew Morton <akpm@osdl.org>
Signed-off-by: NLinus Torvalds <torvalds@osdl.org>

35386e3b

[PATCH] mm: make shrink_all_memory try harder · 248a0301

由 Rafael J. Wysocki 提交于 3月 22, 2006

Make shrink_all_memory() repeat the attempts to free more memory if there
seems to be no pages to free.
Signed-off-by: NRafael J. Wysocki <rjw@sisk.pl>
Cc: Pavel Machek <pavel@ucw.cz>
Signed-off-by: NAndrew Morton <akpm@osdl.org>
Signed-off-by: NLinus Torvalds <torvalds@osdl.org>

248a0301

[PATCH] optimize follow_hugetlb_page · d5d4b0aa

由 Chen, Kenneth W 提交于 3月 22, 2006

follow_hugetlb_page() walks a range of user virtual address and then fills
in list of struct page * into an array that is passed from the argument
list. It also gets a reference count via get_page(). For compound page,
get_page() actually traverse back to head page via page_private() macro and
then adds a reference count to the head page. Since we are doing a virt to
pte look up, kernel already has a struct page pointer into the head page.
So instead of traverse into the small unit page struct and then follow a
link back to the head page, optimize that with incrementing the reference
count directly on the head page.

The benefit is that we don't take a cache miss on accessing page struct for
the corresponding user address and more importantly, not to pollute the
cache with a "not very useful" round trip of pointer chasing. This adds a
moderate performance gain on an I/O intensive database transaction
workload.
Signed-off-by: NKen Chen <kenneth.w.chen@intel.com>
Cc: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: NAndrew Morton <akpm@osdl.org>
Signed-off-by: NLinus Torvalds <torvalds@osdl.org>

d5d4b0aa

[PATCH] convert hugetlbfs_counter to atomic · bba1e9b2

由 Chen, Kenneth W 提交于 3月 22, 2006

Implementation of hugetlbfs_counter() is functionally equivalent to
atomic_inc_return().  Use the simpler atomic form.
Signed-off-by: NKen Chen <kenneth.w.chen@intel.com>
Cc: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: NAndrew Morton <akpm@osdl.org>
Signed-off-by: NLinus Torvalds <torvalds@osdl.org>

bba1e9b2

[PATCH] hugepage: is_aligned_hugepage_range() cleanup · 42b88bef

由 David Gibson 提交于 3月 22, 2006

Quite a long time back, prepare_hugepage_range() replaced
is_aligned_hugepage_range() as the callback from mm/mmap.c to arch code to
verify if an address range is suitable for a hugepage mapping.
is_aligned_hugepage_range() stuck around, but only to implement
prepare_hugepage_range() on archs which didn't implement their own.

Most archs (everything except ia64 and powerpc) used the same
implementation of is_aligned_hugepage_range().  On powerpc, which
implements its own prepare_hugepage_range(), the custom version was never
used.

In addition, "is_aligned_hugepage_range()" was a bad name, because it
suggests it returns true iff the given range is a good hugepage range,
whereas in fact it returns 0-or-error (so the sense is reversed).

This patch cleans up by abolishing is_aligned_hugepage_range().  Instead
prepare_hugepage_range() is defined directly.  Most archs use the default
version, which simply checks the given region is aligned to the size of a
hugepage.  ia64 and powerpc define custom versions.  The ia64 one simply
checks that the range is in the correct address space region in addition to
being suitably aligned.  The powerpc version (just as previously) checks
for suitable addresses, and if necessary performs low-level MMU frobbing to
set up new areas for use by hugepages.

No libhugetlbfs testsuite regressions on ppc64 (POWER5 LPAR).
Signed-off-by: NDavid Gibson <david@gibson.dropbear.id.au>
Signed-off-by: NZhang Yanmin <yanmin.zhang@intel.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: William Lee Irwin III <wli@holomorphy.com>
Signed-off-by: NAndrew Morton <akpm@osdl.org>
Signed-off-by: NLinus Torvalds <torvalds@osdl.org>

42b88bef

[PATCH] hugepage: Move hugetlb_free_pgd_range() prototype to hugetlb.h · 3915bcf3

由 David Gibson 提交于 3月 22, 2006

The optional hugepage callback, hugetlb_free_pgd_range() is presently
implemented non-trivially only on ia64 (but I plan to add one for powerpc
shortly).  It has its own prototype for the function in asm-ia64/pgtable.h.
 However, since the function is called from generic code, it make sense for
its prototype to be in the generic hugetlb.h header file, as the protypes
other arch callbacks already are (prepare_hugepage_range(),
set_huge_pte_at(), etc.).  This patch makes it so.
Signed-off-by: NDavid Gibson <dwg@au1.ibm.com>
Cc: William Lee Irwin III <wli@holomorphy.com>
Signed-off-by: NAndrew Morton <akpm@osdl.org>
Signed-off-by: NLinus Torvalds <torvalds@osdl.org>

3915bcf3

[PATCH] hugepage: Fix hugepage logic in free_pgtables() harder · 4866920b

由 David Gibson 提交于 3月 22, 2006

Turns out the hugepage logic in free_pgtables() was doubly broken.  The
loop coalescing multiple normal page VMAs into one call to free_pgd_range()
had an off by one error, which could mean it would coalesce one hugepage
VMA into the same bundle (checking 'vma' not 'next' in the loop).  I
transferred this bug into the new is_vm_hugetlb_page() based version.
Here's the fix.

This one didn't bite on powerpc previously for the same reason the
is_hugepage_only_range() problem didn't: powerpc's hugetlb_free_pgd_range()
is identical to free_pgd_range().  It didn't bite on ia64 because the
hugepage region is distant enough from any other region that the separated
PMD_SIZE distance test would always prevent coalescing the two together.

No libhugetlbfs testsuite regressions (ppc64, POWER5).
Signed-off-by: NDavid Gibson <dwg@au1.ibm.com>
Cc: William Lee Irwin III <wli@holomorphy.com>
Signed-off-by: NAndrew Morton <akpm@osdl.org>
Signed-off-by: NLinus Torvalds <torvalds@osdl.org>

4866920b

[PATCH] hugepage: Fix hugepage logic in free_pgtables() · 9da61aef

由 David Gibson 提交于 3月 22, 2006

free_pgtables() has special logic to call hugetlb_free_pgd_range() instead
of the normal free_pgd_range() on hugepage VMAs. However, the test it uses
to do so is incorrect: it calls is_hugepage_only_range on a hugepage sized
range at the start of the vma. is_hugepage_only_range() will return true
if the given range has any intersection with a hugepage address region, and
in this case the given region need not be hugepage aligned. So, for
example, this test can return true if called on, say, a 4k VMA immediately
preceding a (nicely aligned) hugepage VMA.

At present we get away with this because the powerpc version of
hugetlb_free_pgd_range() is just a call to free_pgd_range(). On ia64 (the
only other arch with a non-trivial is_hugepage_only_range()) we get away
with it for a different reason; the hugepage area is not contiguous with
the rest of the user address space, and VMAs are not permitted in between,
so the test can't return a false positive there.

Nonetheless this should be fixed. We do that in the patch below by
replacing the is_hugepage_only_range() test with an explicit test of the
VMA using is_vm_hugetlb_page().

This in turn changes behaviour for platforms where is_hugepage_only_range()
returns false always (everything except powerpc and ia64). We address this
by ensuring that hugetlb_free_pgd_range() is defined to be identical to
free_pgd_range() (instead of a no-op) on everything except ia64. Even so,
it will prevent some otherwise possible coalescing of calls down to
free_pgd_range(). Since this only happens for hugepage VMAs, removing this
small optimization seems unlikely to cause any trouble.

This patch causes no regressions on the libhugetlbfs testsuite - ppc64
POWER5 (8-way), ppc64 G5 (2-way) and i386 Pentium M (UP).
Signed-off-by: NDavid Gibson <dwg@au1.ibm.com>
Cc: William Lee Irwin III <wli@holomorphy.com>
Acked-by: NHugh Dickins <hugh@veritas.com>
Signed-off-by: NAndrew Morton <akpm@osdl.org>
Signed-off-by: NLinus Torvalds <torvalds@osdl.org>

9da61aef

[PATCH] hugepage: Make {alloc,free}_huge_page() local · 27a85ef1

由 David Gibson 提交于 3月 22, 2006

Originally, mm/hugetlb.c just handled the hugepage physical allocation path
and its {alloc,free}_huge_page() functions were used from the arch specific
hugepage code.  These days those functions are only used with mm/hugetlb.c
itself.  Therefore, this patch makes them static and removes their
prototypes from hugetlb.h.  This requires a small rearrangement of code in
mm/hugetlb.c to avoid a forward declaration.

This patch causes no regressions on the libhugetlbfs testsuite (ppc64,
POWER5).
Signed-off-by: NDavid Gibson <dwg@au1.ibm.com>
Cc: William Lee Irwin III <wli@holomorphy.com>
Signed-off-by: NAndrew Morton <akpm@osdl.org>
Signed-off-by: NLinus Torvalds <torvalds@osdl.org>

27a85ef1

[PATCH] hugepage: Strict page reservation for hugepage inodes · b45b5bd6

由 David Gibson 提交于 3月 22, 2006

These days, hugepages are demand-allocated at first fault time.  There's a
somewhat dubious (and racy) heuristic when making a new mmap() to check if
there are enough available hugepages to fully satisfy that mapping.

A particularly obvious case where the heuristic breaks down is where a
process maps its hugepages not as a single chunk, but as a bunch of
individually mmap()ed (or shmat()ed) blocks without touching and
instantiating the pages in between allocations.  In this case the size of
each block is compared against the total number of available hugepages.
It's thus easy for the process to become overcommitted, because each block
mapping will succeed, although the total number of hugepages required by
all blocks exceeds the number available.  In particular, this defeats such
a program which will detect a mapping failure and adjust its hugepage usage
downward accordingly.

The patch below addresses this problem, by strictly reserving a number of
physical hugepages for hugepage inodes which have been mapped, but not
instatiated.  MAP_SHARED mappings are thus "safe" - they will fail on
mmap(), not later with an OOM SIGKILL.  MAP_PRIVATE mappings can still
trigger an OOM.  (Actually SHARED mappings can technically still OOM, but
only if the sysadmin explicitly reduces the hugepage pool between mapping
and instantiation)

This patch appears to address the problem at hand - it allows DB2 to start
correctly, for instance, which previously suffered the failure described
above.

This patch causes no regressions on the libhugetblfs testsuite, and makes a
test (designed to catch this problem) pass which previously failed (ppc64,
POWER5).
Signed-off-by: NDavid Gibson <dwg@au1.ibm.com>
Cc: William Lee Irwin III <wli@holomorphy.com>
Signed-off-by: NAndrew Morton <akpm@osdl.org>
Signed-off-by: NLinus Torvalds <torvalds@osdl.org>

b45b5bd6

[PATCH] hugepage: serialize hugepage allocation and instantiation · 3935baa9

由 David Gibson 提交于 3月 22, 2006

Currently, no lock or mutex is held between allocating a hugepage and
inserting it into the pagetables / page cache. When we do go to insert the
page into pagetables or page cache, we recheck and may free the newly
allocated hugepage. However, since the number of hugepages in the system
is strictly limited, and it's usualy to want to use all of them, this can
still lead to spurious allocation failures.

For example, suppose two processes are both mapping (MAP_SHARED) the same
hugepage file, large enough to consume the entire available hugepage pool.
If they race instantiating the last page in the mapping, they will both
attempt to allocate the last available hugepage. One will fail, of course,
returning OOM from the fault and thus causing the process to be killed,
despite the fact that the entire mapping can, in fact, be instantiated.

The patch fixes this race by the simple method of adding a (sleeping) mutex
to serialize the hugepage fault path between allocation and insertion into
pagetables and/or page cache. It would be possible to avoid the
serialization by catching the allocation failures, waiting on some
condition, then rechecking to see if someone else has instantiated the page
for us. Given the likely frequency of hugepage instantiations, it seems
very doubtful it's worth the extra complexity.

This patch causes no regression on the libhugetlbfs testsuite, and one
test, which can trigger this race now passes where it previously failed.

Actually, the test still sometimes fails, though less often and only as a
shmat() failure, rather processes getting OOM killed by the VM. The dodgy
heuristic tests in fs/hugetlbfs/inode.c for whether there's enough hugepage
space aren't protected by the new mutex, and would be ugly to do so, so
there's still a race there. Another patch to replace those tests with
something saner for this reason as well as others coming...
Signed-off-by: NDavid Gibson <dwg@au1.ibm.com>
Cc: William Lee Irwin III <wli@holomorphy.com>
Signed-off-by: NAndrew Morton <akpm@osdl.org>
Signed-off-by: NLinus Torvalds <torvalds@osdl.org>

3935baa9

[PATCH] hugepage: Small fixes to hugepage clear/copy path · 79ac6ba4

由 David Gibson 提交于 3月 22, 2006

Move the loops used in mm/hugetlb.c to clear and copy hugepages to their
own functions for clarity.  As we do so, we add some checks of need_resched
- we are, after all copying megabytes of memory here.  We also add
might_sleep() accordingly.  We generally dropped locks around the clear and
copy, already but not everyone has PREEMPT enabled, so we should still be
checking explicitly.

For this to work, we need to remove the clear_huge_page() from
alloc_huge_page(), which is called with the page_table_lock held in the COW
path.  We move the clear_huge_page() to just after the alloc_huge_page() in
the hugepage no-page path.  In the COW path, the new page is about to be
copied over, so clearing it was just a waste of time anyway.  So as a side
effect we also fix the fact that we held the page_table_lock for far too
long in this path by calling alloc_huge_page() under it.

It causes no regressions on the libhugetlbfs testsuite (ppc64, POWER5).
Signed-off-by: NDavid Gibson <dwg@au1.ibm.com>
Cc: William Lee Irwin III <wli@holomorphy.com>
Signed-off-by: NAndrew Morton <akpm@osdl.org>
Signed-off-by: NLinus Torvalds <torvalds@osdl.org>

79ac6ba4

[PATCH] Enable mprotect on huge pages · 8f860591

由 Zhang, Yanmin 提交于 3月 22, 2006

2.6.16-rc3 uses hugetlb on-demand paging, but it doesn_t support hugetlb
mprotect.

From: David Gibson <david@gibson.dropbear.id.au>

  Remove a test from the mprotect() path which checks that the mprotect()ed
  range on a hugepage VMA is hugepage aligned (yes, really, the sense of
  is_aligned_hugepage_range() is the opposite of what you'd guess :-/).

  In fact, we don't need this test.  If the given addresses match the
  beginning/end of a hugepage VMA they must already be suitably aligned.  If
  they don't, then mprotect_fixup() will attempt to split the VMA.  The very
  first test in split_vma() will check for a badly aligned address on a
  hugepage VMA and return -EINVAL if necessary.

From: "Chen, Kenneth W" <kenneth.w.chen@intel.com>

  On i386 and x86-64, pte flag _PAGE_PSE collides with _PAGE_PROTNONE.  The
  identify of hugetlb pte is lost when changing page protection via mprotect.
  A page fault occurs later will trigger a bug check in huge_pte_alloc().

  The fix is to always make new pte a hugetlb pte and also to clean up
  legacy code where _PAGE_PRESENT is forced on in the pre-faulting day.
Signed-off-by: NZhang Yanmin <yanmin.zhang@intel.com>
Cc: David Gibson <david@gibson.dropbear.id.au>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: William Lee Irwin III <wli@holomorphy.com>
Signed-off-by: NKen Chen <kenneth.w.chen@intel.com>
Signed-off-by: NNishanth Aravamudan <nacc@us.ibm.com>
Cc: Andi Kleen <ak@muc.de>
Signed-off-by: NAndrew Morton <akpm@osdl.org>
Signed-off-by: NLinus Torvalds <torvalds@osdl.org>

8f860591

[PATCH] readahead: fix initial window size calculation · aed75ff3

由 Steven Pratt 提交于 3月 22, 2006

The current current get_init_ra_size is not optimal across different IO
sizes and max_readahead values.  Here is a quick summary of sizes computed
under current design and under the attached patch.  All of these assume 1st
IO at offset 0, or 1st detected sequential IO.

	32k max, 4k request

	old         new
	-----------------
	 8k        8k
	16k       16k
	32k       32k

	128k max, 4k request
	old         new
	-----------------
	32k         16k
	64k         32k
	128k        64k
	128k       128k

	128k max, 32k request
	old         new
	-----------------
	32k         64k    <-----
	64k        128k
	128k       128k

	512k max, 4k request
	old         new
	-----------------
	4k         32k     <----
	16k        64k
	64k       128k
	128k      256k
	512k      512k

Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Steven Pratt <slpratt@austin.ibm.com>
Cc: Ram Pai <linuxram@us.ibm.com>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: NAndrew Morton <akpm@osdl.org>
Signed-off-by: NLinus Torvalds <torvalds@osdl.org>

aed75ff3

[PATCH] readahead: ->prev_page can overrun the ahead window · a564da39

由 Oleg Nesterov 提交于 3月 22, 2006

If get_next_ra_size() does not grow fast enough, ->prev_page can overrun
the ahead window.  This means the caller will read the pages from
->ahead_start + ->ahead_size to ->prev_page synchronously.
Signed-off-by: NOleg Nesterov <oleg@tv-sign.ru>
Cc: Steven Pratt <slpratt@austin.ibm.com>
Cc: Ram Pai <linuxram@us.ibm.com>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: NAndrew Morton <akpm@osdl.org>
Signed-off-by: NLinus Torvalds <torvalds@osdl.org>

a564da39

[PATCH] shmem: inline to avoid warning · d15c023b

由 Hugh Dickins 提交于 3月 22, 2006

shmem.c was named and shamed in Jesper's "Building 100 kernels" warnings:
shmem_parse_mpol is only used when CONFIG_TMPFS parses mount options; and
only called from that one site, so mark it inline like its non-NUMA stub.
Signed-off-by: NHugh Dickins <hugh@veritas.com>
Signed-off-by: NAndrew Morton <akpm@osdl.org>
Signed-off-by: NLinus Torvalds <torvalds@osdl.org>

d15c023b

[PATCH] vmscan: emove obsolete checks from shrink_list() and fix unlikely in refill_inactive_zone() · 6e5ef1a9

由 Christoph Lameter 提交于 3月 22, 2006

As suggested by Marcelo:

1. The optimization introduced recently for not calling
   page_referenced() during zone reclaim makes two additional checks in
   shrink_list unnecessary.

2. The if (unlikely(sc->may_swap)) in refill_inactive_zone is optimized
   for the zone_reclaim case.  However, most peoples system only does swap.
   Undo that.
Signed-off-by: NChristoph Lameter <clameter@sgi.com>
Cc: Marcelo Tosatti <marcelo.tosatti@cyclades.com>
Signed-off-by: NAndrew Morton <akpm@osdl.org>
Signed-off-by: NLinus Torvalds <torvalds@osdl.org>

6e5ef1a9

[PATCH] Uninline sys_mmap common code (reduce binary size) · a7290ee0

由 Michael Buesch 提交于 3月 22, 2006

Remove the inlining of the new vs old mmap system call common code.  This
reduces the size of the resulting vmlinux for defconfig as follows:

mb@pc1:~/develop/git/linux-2.6$ size vmlinux.mmap*
   text    data     bss     dec     hex filename
3303749  521524  186564 4011837  3d373d vmlinux.mmapinline
3303557  521524  186564 4011645  3d367d vmlinux.mmapnoinline

The new sys_mmap2() has also one function call overhead removed, now.
(probably it was already optimized to a jmp before, but anyway...)
Signed-off-by: NAndrew Morton <akpm@osdl.org>
Signed-off-by: NLinus Torvalds <torvalds@osdl.org>

a7290ee0

[PATCH] mm: optimise page_count · 617d2214

由 Nick Piggin 提交于 3月 22, 2006

Optimise page_count compound page test and make it consistent with similar
functions.
Signed-off-by: NNick Piggin <npiggin@suse.de>
Signed-off-by: NAndrew Morton <akpm@osdl.org>
Signed-off-by: NLinus Torvalds <torvalds@osdl.org>

617d2214

[PATCH] mm: more CONFIG_DEBUG_VM · b7ab795b

由 Nick Piggin 提交于 3月 22, 2006

Put a few more checks under CONFIG_DEBUG_VM
Signed-off-by: NNick Piggin <npiggin@suse.de>
Signed-off-by: NAndrew Morton <akpm@osdl.org>
Signed-off-by: NLinus Torvalds <torvalds@osdl.org>

b7ab795b

openanolis / cloud-kernel 1 年多 前同步成功

openanolis / cloud-kernel
1 年多前同步成功