- 17 10月, 2007 8 次提交
-
-
由 Adam Litke 提交于
When gather_surplus_pages() fails to allocate enough huge pages to satisfy the requested reservation, it frees what it did allocate back to the buddy allocator. put_page() should be called instead of update_and_free_page() to ensure that pool counters are updated as appropriate and the page's refcount is decremented. Signed-off-by: NAdam Litke <agl@us.ibm.com> Acked-by: NDave Hansen <haveblue@us.ibm.com> Cc: David Gibson <hermes@gibson.dropbear.id.au> Cc: William Lee Irwin III <wli@holomorphy.com> Cc: Badari Pulavarty <pbadari@us.ibm.com> Cc: Ken Chen <kenchen@google.com> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Nishanth Aravamudan 提交于
Anton found a problem with the hugetlb pool allocation when some nodes have no memory (http://marc.info/?l=linux-mm&m=118133042025995&w=2). Lee worked on versions that tried to fix it, but none were accepted. Christoph has created a set of patches which allow for GFP_THISNODE allocations to fail if the node has no memory. Currently, alloc_fresh_huge_page() returns NULL when it is not able to allocate a huge page on the current node, as specified by its custom interleave variable. The callers of this function, though, assume that a failure in alloc_fresh_huge_page() indicates no hugepages can be allocated on the system period. This might not be the case, for instance, if we have an uneven NUMA system, and we happen to try to allocate a hugepage on a node with less memory and fail, while there is still plenty of free memory on the other nodes. To correct this, make alloc_fresh_huge_page() search through all online nodes before deciding no hugepages can be allocated. Add a helper function for actually allocating the hugepage. Use a new global nid iterator to control which nid to allocate on. Note: we expect particular semantics for __GFP_THISNODE, which are now enforced even for memoryless nodes. That is, there is should be no fallback to other nodes. Therefore, we rely on the nid passed into alloc_pages_node() to be the nid the page comes from. If this is incorrect, accounting will break. Tested on x86 !NUMA, x86 NUMA, x86_64 NUMA and ppc64 NUMA (with 2 memoryless nodes). Before on the ppc64 box: Trying to clear the hugetlb pool Done. 0 free Trying to resize the pool to 100 Node 0 HugePages_Free: 25 Node 1 HugePages_Free: 75 Node 2 HugePages_Free: 0 Node 3 HugePages_Free: 0 Done. Initially 100 free Trying to resize the pool to 200 Node 0 HugePages_Free: 50 Node 1 HugePages_Free: 150 Node 2 HugePages_Free: 0 Node 3 HugePages_Free: 0 Done. 200 free After: Trying to clear the hugetlb pool Done. 0 free Trying to resize the pool to 100 Node 0 HugePages_Free: 50 Node 1 HugePages_Free: 50 Node 2 HugePages_Free: 0 Node 3 HugePages_Free: 0 Done. Initially 100 free Trying to resize the pool to 200 Node 0 HugePages_Free: 100 Node 1 HugePages_Free: 100 Node 2 HugePages_Free: 0 Node 3 HugePages_Free: 0 Done. 200 free Signed-off-by: NNishanth Aravamudan <nacc@us.ibm.com> Acked-by: NChristoph Lameter <clameter@sgi.com> Cc: Adam Litke <agl@us.ibm.com> Cc: David Gibson <hermes@gibson.dropbear.id.au> Cc: Badari Pulavarty <pbadari@us.ibm.com> Cc: Ken Chen <kenchen@google.com> Cc: William Lee Irwin III <wli@holomorphy.com> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Adam Litke 提交于
When shrinking the size of the hugetlb pool via the nr_hugepages sysctl, we are careful to keep enough pages around to satisfy reservations. But the calculation is flawed for the following scenario: Action Pool Counters (Total, Free, Resv) ====== ============= Set pool to 1 page 1 1 0 Map 1 page MAP_PRIVATE 1 1 0 Touch the page to fault it in 1 0 0 Set pool to 3 pages 3 2 0 Map 2 pages MAP_SHARED 3 2 2 Set pool to 2 pages 2 1 2 <-- Mistake, should be 3 2 2 Touch the 2 shared pages 2 0 1 <-- Program crashes here The last touch above will terminate the process due to lack of huge pages. This patch corrects the calculation so that it factors in pages being used for private mappings. Andrew, this is a standalone fix suitable for mainline. It is also now corrected in my latest dynamic pool resizing patchset which I will send out soon. Signed-off-by: NAdam Litke <agl@us.ibm.com> Acked-by: NKen Chen <kenchen@google.com> Cc: David Gibson <david@gibson.dropbear.id.au> Cc: Badari Pulavarty <pbadari@us.ibm.com> Cc: William Lee Irwin III <wli@holomorphy.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Adam Litke 提交于
The maximum size of the huge page pool can be controlled using the overall size of the hugetlb filesystem (via its 'size' mount option). However in the common case the this will not be set as the pool is traditionally fixed in size at boot time. In order to maintain the expected semantics, we need to prevent the pool expanding by default. This patch introduces a new sysctl controlling dynamic pool resizing. When this is enabled the pool will expand beyond its base size up to the size of the hugetlb filesystem. It is disabled by default. Signed-off-by: NAdam Litke <agl@us.ibm.com> Acked-by: NAndy Whitcroft <apw@shadowen.org> Acked-by: NDave McCracken <dave.mccracken@oracle.com> Cc: William Irwin <bill.irwin@oracle.com> Cc: David Gibson <david@gibson.dropbear.id.au> Cc: Ken Chen <kenchen@google.com> Cc: Badari Pulavarty <pbadari@us.ibm.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Adam Litke 提交于
Shared mappings require special handling because the huge pages needed to fully populate the VMA must be reserved at mmap time. If not enough pages are available when making the reservation, allocate all of the shortfall at once from the buddy allocator and add the pages directly to the hugetlb pool. If they cannot be allocated, then fail the mapping. The page surplus is accounted for in the same way as for private mappings; faulted surplus pages will be freed at unmap time. Reserved, surplus pages that have not been used must be freed separately when their reservation has been released. Signed-off-by: NAdam Litke <agl@us.ibm.com> Acked-by: NAndy Whitcroft <apw@shadowen.org> Acked-by: NDave McCracken <dave.mccracken@oracle.com> Cc: William Irwin <bill.irwin@oracle.com> Cc: David Gibson <david@gibson.dropbear.id.au> Cc: Ken Chen <kenchen@google.com> Cc: Badari Pulavarty <pbadari@us.ibm.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Adam Litke 提交于
Because we overcommit hugepages for MAP_PRIVATE mappings, it is possible that the hugetlb pool will be exhausted or completely reserved when a hugepage is needed to satisfy a page fault. Before killing the process in this situation, try to allocate a hugepage directly from the buddy allocator. The explicitly configured pool size becomes a low watermark. When dynamically grown, the allocated huge pages are accounted as a surplus over the watermark. As huge pages are freed on a node, surplus pages are released to the buddy allocator so that the pool will shrink back to the watermark. Surplus accounting also allows for friendlier explicit pool resizing. When shrinking a pool that is fully in-use, increase the surplus so pages will be returned to the buddy allocator as soon as they are freed. When growing a pool that has a surplus, consume the surplus first and then allocate new pages. Signed-off-by: NAdam Litke <agl@us.ibm.com> Signed-off-by: NMel Gorman <mel@csn.ul.ie> Acked-by: NAndy Whitcroft <apw@shadowen.org> Acked-by: NDave McCracken <dave.mccracken@oracle.com> Cc: William Irwin <bill.irwin@oracle.com> Cc: David Gibson <david@gibson.dropbear.id.au> Cc: Ken Chen <kenchen@google.com> Cc: Badari Pulavarty <pbadari@us.ibm.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Adam Litke 提交于
Dynamic huge page pool resizing. In most real-world scenarios, configuring the size of the hugetlb pool correctly is a difficult task. If too few pages are allocated to the pool, applications using MAP_SHARED may fail to mmap() a hugepage region and applications using MAP_PRIVATE may receive SIGBUS. Isolating too much memory in the hugetlb pool means it is not available for other uses, especially those programs not using huge pages. The obvious answer is to let the hugetlb pool grow and shrink in response to the runtime demand for huge pages. The work Mel Gorman has been doing to establish a memory zone for movable memory allocations makes dynamically resizing the hugetlb pool reliable within the limits of that zone. This patch series implements dynamic pool resizing for private and shared mappings while being careful to maintain existing semantics. Please reply with your comments and feedback; even just to say whether it would be a useful feature to you. Thanks. How it works ============ Upon depletion of the hugetlb pool, rather than reporting an error immediately, first try and allocate the needed huge pages directly from the buddy allocator. Care must be taken to avoid unbounded growth of the hugetlb pool, so the hugetlb filesystem quota is used to limit overall pool size. The real work begins when we decide there is a shortage of huge pages. What happens next depends on whether the pages are for a private or shared mapping. Private mappings are straightforward. At fault time, if alloc_huge_page() fails, we allocate a page from the buddy allocator and increment the source node's surplus_huge_pages counter. When free_huge_page() is called for a page on a node with a surplus, the page is freed directly to the buddy allocator instead of the hugetlb pool. Because shared mappings require all of the pages to be reserved up front, some additional work must be done at mmap() to support them. We determine the reservation shortage and allocate the required number of pages all at once. These pages are then added to the hugetlb pool and marked reserved. Where that is not possible the mmap() will fail. As with private mappings, the appropriate surplus counters are updated. Since reserved huge pages won't necessarily be used by the process, we can't be sure that free_huge_page() will always be called to return surplus pages to the buddy allocator. To prevent the huge page pool from bloating, we must free unused surplus pages when their reservation has ended. Controlling it ============== With the entire patch series applied, pool resizing is off by default so unless specific action is taken, the semantics are unchanged. To take advantage of the flexibility afforded by this patch series one must tolerate a change in semantics. To control hugetlb pool growth, the following techniques can be employed: * A sysctl tunable to enable/disable the feature entirely * The size= mount option for hugetlbfs filesystems to limit pool size Performance =========== When contiguous memory is readily available, it is expected that the cost of dynamicly resizing the pool will be small. This series has been performance tested with 'stream' to measure this cost. Stream (http://www.cs.virginia.edu/stream/) was linked with libhugetlbfs to enable remapping of the text and data/bss segments into huge pages. Stream with small array ----------------------- Baseline: nr_hugepages = 0, No libhugetlbfs segment remapping Preallocated: nr_hugepages = 5, Text and data/bss remapping Dynamic: nr_hugepages = 0, Text and data/bss remapping Rate (MB/s) Function Baseline Preallocated Dynamic Copy: 4695.6266 5942.8371 5982.2287 Scale: 4451.5776 5017.1419 5658.7843 Add: 5815.8849 7927.7827 8119.3552 Triad: 5949.4144 8527.6492 8110.6903 Stream with large array ----------------------- Baseline: nr_hugepages = 0, No libhugetlbfs segment remapping Preallocated: nr_hugepages = 67, Text and data/bss remapping Dynamic: nr_hugepages = 0, Text and data/bss remapping Rate (MB/s) Function Baseline Preallocated Dynamic Copy: 2227.8281 2544.2732 2546.4947 Scale: 2136.3208 2430.7294 2421.2074 Add: 2773.1449 4004.0021 3999.4331 Triad: 2748.4502 3777.0109 3773.4970 * All numbers are averages taken from 10 consecutive runs with a maximum standard deviation of 1.3 percent noted. This patch: Simply move update_and_free_page() so that it can be reused later in this patch series. The implementation is not changed. Signed-off-by: NAdam Litke <agl@us.ibm.com> Acked-by: NAndy Whitcroft <apw@shadowen.org> Acked-by: NDave McCracken <dave.mccracken@oracle.com> Acked-by: NWilliam Irwin <bill.irwin@oracle.com> Cc: David Gibson <david@gibson.dropbear.id.au> Cc: Ken Chen <kenchen@google.com> Cc: Badari Pulavarty <pbadari@us.ibm.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 KAMEZAWA Hiroyuki 提交于
Current ia64 kernel flushes icache by lazy_mmu_prot_update() *after* set_pte(). This is too late. This patch removes lazy_mmu_prot_update and add modfied set_pte() for flushing if necessary. This patch flush icache of a page when new pte has exec bit. && new pte has present bit && new pte is user's page. && (old *ptep is not present || new pte's pfn is not same to old *ptep's ptn) && new pte's page has no Pg_arch_1 bit. Pg_arch_1 is set when a page is cache consistent. I think this condition checks are much easier to understand than considering "Where sync_icache_dcache() should be inserted ?". pte_user() for ia64 was removed by http://lkml.org/lkml/2007/6/12/67 as clean-up. So, I added it again. Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: Christoph Lameter <clameter@sgi.com> Cc: Hugh Dickins <hugh@veritas.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Acked-by: NDavid S. Miller <davem@davemloft.net> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 01 10月, 2007 1 次提交
-
-
由 Ralf Baechle 提交于
The virtual address space argument of clear_user_highpage is supposed to be the virtual address where the page being cleared will eventually be mapped. This allows architectures with virtually indexed caches a few clever tricks. That sort of trick falls over in painful ways if the virtual address argument is wrong. Signed-off-by: NRalf Baechle <ralf@linux-mips.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 20 9月, 2007 1 次提交
-
-
由 Lee Schermerhorn 提交于
This patch proposes fixes to the reference counting of memory policy in the page allocation paths and in show_numa_map(). Extracted from my "Memory Policy Cleanups and Enhancements" series as stand-alone. Shared policy lookup [shmem] has always added a reference to the policy, but this was never unrefed after page allocation or after formatting the numa map data. Default system policy should not require additional ref counting, nor should the current task's task policy. However, show_numa_map() calls get_vma_policy() to examine what may be [likely is] another task's policy. The latter case needs protection against freeing of the policy. This patch adds a reference count to a mempolicy returned by get_vma_policy() when the policy is a vma policy or another task's mempolicy. Again, shared policy is already reference counted on lookup. A matching "unref" [__mpol_free()] is performed in alloc_page_vma() for shared and vma policies, and in show_numa_map() for shared and another task's mempolicy. We can call __mpol_free() directly, saving an admittedly inexpensive inline NULL test, because we know we have a non-NULL policy. Handling policy ref counts for hugepages is a bit trickier. huge_zonelist() returns a zone list that might come from a shared or vma 'BIND policy. In this case, we should hold the reference until after the huge page allocation in dequeue_hugepage(). The patch modifies huge_zonelist() to return a pointer to the mempolicy if it needs to be unref'd after allocation. Kernel Build [16cpu, 32GB, ia64] - average of 10 runs: w/o patch w/ refcount patch Avg Std Devn Avg Std Devn Real: 100.59 0.38 100.63 0.43 User: 1209.60 0.37 1209.91 0.31 System: 81.52 0.42 81.64 0.34 Signed-off-by: NLee Schermerhorn <lee.schermerhorn@hp.com> Acked-by: NAndi Kleen <ak@suse.de> Cc: Christoph Lameter <clameter@sgi.com> Acked-by: NMel Gorman <mel@csn.ul.ie> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 23 8月, 2007 1 次提交
-
-
由 Adam Litke 提交于
It seems a simple mistake was made when converting follow_hugetlb_page() over to the VM_FAULT flags bitmasks (in "mm: fault feedback #2", commit 83c54070). By using the wrong bitmask, hugetlb_fault() failures are not being recognized. This results in an infinite loop whenever follow_hugetlb_page is involved in a failed fault. Signed-off-by: NAdam Litke <agl@us.ibm.com> Acked-by: NNick Piggin <npiggin@suse.de> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 25 7月, 2007 1 次提交
-
-
由 Ken Chen 提交于
dequeue_huge_page() has a serious memory leak upon hugetlb page allocation. The for loop continues on allocating hugetlb pages out of all allowable zone, where this function is supposedly only dequeue one and only one pages. Fixed it by breaking out of the for loop once a hugetlb page is found. Signed-off-by: NKen Chen <kenchen@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 20 7月, 2007 5 次提交
-
-
由 Akinobu Mita 提交于
Use appropriate accessor function to set compound page destructor function. Cc: William Irwin <wli@holomorphy.com> Signed-off-by: NAkinobu Mita <akinobu.mita@gmail.com> Acked-by: NAdam Litke <agl@us.ibm.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Hugh Dickins 提交于
The fix to that race in alloc_fresh_huge_page() which could give an illegal node ID did not need nid_lock at all: the fix was to replace static int nid by static int prev_nid and do the work on local int nid. nid_lock did make sure that racers strictly roundrobin the nodes, but that's not something we need to enforce strictly. Kill nid_lock. Signed-off-by: NHugh Dickins <hugh@veritas.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Andrew Morton 提交于
mm/hugetlb.c: In function `dequeue_huge_page': mm/hugetlb.c:72: warning: 'nid' might be used uninitialized in this function Cc: Christoph Lameter <clameter@sgi.com> Cc: Adam Litke <agl@us.ibm.com> Cc: David Gibson <hermes@gibson.dropbear.id.au> Cc: William Lee Irwin III <wli@holomorphy.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Nick Piggin 提交于
This patch completes Linus's wish that the fault return codes be made into bit flags, which I agree makes everything nicer. This requires requires all handle_mm_fault callers to be modified (possibly the modifications should go further and do things like fault accounting in handle_mm_fault -- however that would be for another patch). [akpm@linux-foundation.org: fix alpha build] [akpm@linux-foundation.org: fix s390 build] [akpm@linux-foundation.org: fix sparc build] [akpm@linux-foundation.org: fix sparc64 build] [akpm@linux-foundation.org: fix ia64 build] Signed-off-by: NNick Piggin <npiggin@suse.de> Cc: Richard Henderson <rth@twiddle.net> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Cc: Russell King <rmk@arm.linux.org.uk> Cc: Ian Molton <spyro@f2s.com> Cc: Bryan Wu <bryan.wu@analog.com> Cc: Mikael Starvik <starvik@axis.com> Cc: David Howells <dhowells@redhat.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: Hirokazu Takata <takata@linux-m32r.org> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Roman Zippel <zippel@linux-m68k.org> Cc: Greg Ungerer <gerg@uclinux.org> Cc: Matthew Wilcox <willy@debian.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Paul Mundt <lethal@linux-sh.org> Cc: Kazumoto Kojima <kkojima@rr.iij4u.or.jp> Cc: Richard Curnow <rc@rc0.org.uk> Cc: William Lee Irwin III <wli@holomorphy.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Jeff Dike <jdike@addtoit.com> Cc: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it> Cc: Miles Bader <uclinux-v850@lsi.nec.co.jp> Cc: Chris Zankel <chris@zankel.net> Acked-by: NKyle McMartin <kyle@mcmartin.ca> Acked-by: NHaavard Skinnemoen <hskinnemoen@atmel.com> Acked-by: NRalf Baechle <ralf@linux-mips.org> Acked-by: NAndi Kleen <ak@muc.de> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> [ Still apparently needs some ARM and PPC loving - Linus ] Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Nick Piggin 提交于
Change ->fault prototype. We now return an int, which contains VM_FAULT_xxx code in the low byte, and FAULT_RET_xxx code in the next byte. FAULT_RET_ code tells the VM whether a page was found, whether it has been locked, and potentially other things. This is not quite the way he wanted it yet, but that's changed in the next patch (which requires changes to arch code). This means we no longer set VM_CAN_INVALIDATE in the vma in order to say that a page is locked which requires filemap_nopage to go away (because we can no longer remain backward compatible without that flag), but we were going to do that anyway. struct fault_data is renamed to struct vm_fault as Linus asked. address is now a void __user * that we should firmly encourage drivers not to use without really good reason. The page is now returned via a page pointer in the vm_fault struct. Signed-off-by: NNick Piggin <npiggin@suse.de> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 18 7月, 2007 2 次提交
-
-
由 Robert P. J. Day 提交于
Signed-off-by: NRobert P. J. Day <rpjday@mindspring.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Mel Gorman 提交于
Huge pages are not movable so are not allocated from ZONE_MOVABLE. However, as ZONE_MOVABLE will always have pages that can be migrated or reclaimed, it can be used to satisfy hugepage allocations even when the system has been running a long time. This allows an administrator to resize the hugepage pool at runtime depending on the size of ZONE_MOVABLE. This patch adds a new sysctl called hugepages_treat_as_movable. When a non-zero value is written to it, future allocations for the huge page pool will use ZONE_MOVABLE. Despite huge pages being non-movable, we do not introduce additional external fragmentation of note as huge pages are always the largest contiguous block we care about. [akpm@linux-foundation.org: various fixes] Signed-off-by: NMel Gorman <mel@csn.ul.ie> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 17 7月, 2007 2 次提交
-
-
由 Joe Jin 提交于
That static `nid' index needs locking. Without it we can end up calling alloc_pages_node() with an illegal node ID and the kernel crashes. Acked-by: Ngurudas pai <gurudas.pai@oracle.com> Cc: <stable@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Nishanth Aravamudan 提交于
nid is initialized to numa_node_id() but will either be overwritten in the loop or not used in the conditional. So remove the initialization. Signed-off-by: NNishanth Aravamudan <nacc@us.ibm.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 17 6月, 2007 1 次提交
-
-
由 Benjamin Herrenschmidt 提交于
Some changes done a while ago to avoid pounding on ptep_set_access_flags and update_mmu_cache in some race situations break sun4c which requires update_mmu_cache() to always be called on minor faults. This patch reworks ptep_set_access_flags() semantics, implementations and callers so that it's now responsible for returning whether an update is necessary or not (basically whether the PTE actually changed). This allow fixing the sparc implementation to always return 1 on sun4c. [akpm@linux-foundation.org: fixes, cleanups] Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Hugh Dickins <hugh@veritas.com> Cc: David Miller <davem@davemloft.net> Cc: Mark Fortescue <mark@mtfhpc.demon.co.uk> Acked-by: NWilliam Lee Irwin III <wli@holomorphy.com> Cc: "Luck, Tony" <tony.luck@intel.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 10 5月, 2007 2 次提交
-
-
由 Ken Chen 提交于
When cpuset is configured, it breaks the strict hugetlb page reservation as the accounting is done on a global variable. Such reservation is completely rubbish in the presence of cpuset because the reservation is not checked against page availability for the current cpuset. Application can still potentially OOM'ed by kernel with lack of free htlb page in cpuset that the task is in. Attempt to enforce strict accounting with cpuset is almost impossible (or too ugly) because cpuset is too fluid that task or memory node can be dynamically moved between cpusets. The change of semantics for shared hugetlb mapping with cpuset is undesirable. However, in order to preserve some of the semantics, we fall back to check against current free page availability as a best attempt and hopefully to minimize the impact of changing semantics that cpuset has on hugetlb. Signed-off-by: NKen Chen <kenchen@google.com> Cc: Paul Jackson <pj@sgi.com> Cc: Christoph Lameter <clameter@engr.sgi.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Ken Chen 提交于
The internal hugetlb resv_huge_pages variable can permanently leak nonzero value in the error path of hugetlb page fault handler when hugetlb page is used in combination of cpuset. The leaked count can permanently trap N number of hugetlb pages in unusable "reserved" state. Steps to reproduce the bug: (1) create two cpuset, user1 and user2 (2) reserve 50 htlb pages in cpuset user1 (3) attempt to shmget/shmat 50 htlb page inside cpuset user2 (4) kernel oom the user process in step 3 (5) ipcrm the shm segment At this point resv_huge_pages will have a count of 49, even though there are no active hugetlbfs file nor hugetlb shared memory segment in the system. The leak is permanent and there is no recovery method other than system reboot. The leaked count will hold up all future use of that many htlb pages in all cpusets. The culprit is that the error path of alloc_huge_page() did not properly undo the change it made to resv_huge_page, causing inconsistent state. Signed-off-by: NKen Chen <kenchen@google.com> Cc: David Gibson <david@gibson.dropbear.id.au> Cc: Adam Litke <agl@us.ibm.com> Cc: Martin Bligh <mbligh@google.com> Acked-by: NDavid Gibson <dwg@au1.ibm.com> Cc: <stable@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 10 2月, 2007 1 次提交
-
-
由 Ken Chen 提交于
__unmap_hugepage_range() is buggy that it does not preserve dirty state of huge_pte when unmapping hugepage range. It causes data corruption in the event of dop_caches being used by sys admin. For example, an application creates a hugetlb file, modify pages, then unmap it. While leaving the hugetlb file alive, comes along sys admin doing a "echo 3 > /proc/sys/vm/drop_caches". drop_pagecache_sb() will happily free all pages that aren't marked dirty if there are no active mapping. Later when application remaps the hugetlb file back and all data are gone, triggering catastrophic flip over on application. Not only that, the internal resv_huge_pages count will also get all messed up. Fix it up by marking page dirty appropriately. Signed-off-by: NKen Chen <kenchen@google.com> Cc: "Nish Aravamudan" <nish.aravamudan@gmail.com> Cc: Adam Litke <agl@us.ibm.com> Cc: David Gibson <david@gibson.dropbear.id.au> Cc: William Lee Irwin III <wli@holomorphy.com> Cc: <stable@kernel.org> Cc: Hugh Dickins <hugh@veritas.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 14 12月, 2006 2 次提交
-
-
由 Atsushi Nemoto 提交于
To allow a more effective copy_user_highpage() on certain architectures, a vma argument is added to the function and cow_user_page() allowing the implementation of these functions to check for the VM_EXEC bit. The main part of this patch was originally written by Ralf Baechle; Atushi Nemoto did the the debugging. Signed-off-by: NAtsushi Nemoto <anemo@mba.ocn.ne.jp> Signed-off-by: NRalf Baechle <ralf@linux-mips.org> Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
-
由 Paul Jackson 提交于
Elaborate the API for calling cpuset_zone_allowed(), so that users have to explicitly choose between the two variants: cpuset_zone_allowed_hardwall() cpuset_zone_allowed_softwall() Until now, whether or not you got the hardwall flavor depended solely on whether or not you or'd in the __GFP_HARDWALL gfp flag to the gfp_mask argument. If you didn't specify __GFP_HARDWALL, you implicitly got the softwall version. Unfortunately, this meant that users would end up with the softwall version without thinking about it. Since only the softwall version might sleep, this led to bugs with possible sleeping in interrupt context on more than one occassion. The hardwall version requires that the current tasks mems_allowed allows the node of the specified zone (or that you're in interrupt or that __GFP_THISNODE is set or that you're on a one cpuset system.) The softwall version, depending on the gfp_mask, might allow a node if it was allowed in the nearest enclusing cpuset marked mem_exclusive (which requires taking the cpuset lock 'callback_mutex' to evaluate.) This patch removes the cpuset_zone_allowed() call, and forces the caller to explicitly choose between the hardwall and the softwall case. If the caller wants the gfp_mask to determine this choice, they should (1) be sure they can sleep or that __GFP_HARDWALL is set, and (2) invoke the cpuset_zone_allowed_softwall() routine. This adds another 100 or 200 bytes to the kernel text space, due to the few lines of nearly duplicate code at the top of both cpuset_zone_allowed_* routines. It should save a few instructions executed for the calls that turned into calls of cpuset_zone_allowed_hardwall, thanks to not having to set (before the call) then check (within the call) the __GFP_HARDWALL flag. For the most critical call, from get_page_from_freelist(), the same instructions are executed as before -- the old cpuset_zone_allowed() routine it used to call is the same code as the cpuset_zone_allowed_softwall() routine that it calls now. Not a perfect win, but seems worth it, to reduce this chance of hitting a sleeping with irq off complaint again. Signed-off-by: NPaul Jackson <pj@sgi.com> Signed-off-by: NAndrew Morton <akpm@osdl.org> Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
-
- 08 12月, 2006 4 次提交
-
-
由 Andy Whitcroft 提交于
Currently we we use the lru head link of the second page of a compound page to hold its destructor. This was ok when it was purely an internal implmentation detail. However, hugetlbfs overrides this destructor violating the layering. Abstract this out as explicit calls, also introduce a type for the callback function allowing them to be type checked. For each callback we pre-declare the function, causing a type error on definition rather than on use elsewhere. [akpm@osdl.org: cleanups] Signed-off-by: NAndy Whitcroft <apw@shadowen.org> Signed-off-by: NAndrew Morton <akpm@osdl.org> Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
-
由 Chen, Kenneth W 提交于
Imprecise RSS accounting is an irritating ill effect with pt sharing. After consulted with several VM experts, I have tried various methods to solve that problem: (1) iterate through all mm_structs that share the PT and increment count; (2) keep RSS count in page table structure and then sum them up at reporting time. None of the above methods yield any satisfactory implementation. Since process RSS accounting is pure information only, I propose we don't count them at all for hugetlb page. rlimit has such field, though there is absolutely no enforcement on limiting that resource. One other method is to account all RSS at hugetlb mmap time regardless they are faulted or not. I opt for the simplicity of no accounting at all. Hugetlb page are special, they are reserved up front in global reservation pool and is not reclaimable. From physical memory resource point of view, it is already consumed regardless whether there are users using them. If the concern is that RSS can be used to control resource allocation, we already can specify hugetlb fs size limit and sysadmin can enforce that at mount time. Combined with the two points mentioned above, I fail to see if there is anything got affected because of this patch. Signed-off-by: NKen Chen <kenneth.w.chen@intel.com> Acked-by: NHugh Dickins <hugh@veritas.com> Cc: Dave McCracken <dmccr@us.ibm.com> Cc: William Lee Irwin III <wli@holomorphy.com> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: David Gibson <david@gibson.dropbear.id.au> Cc: Adam Litke <agl@us.ibm.com> Cc: Paul Mundt <lethal@linux-sh.org> Cc: "David S. Miller" <davem@davemloft.net> Signed-off-by: NAndrew Morton <akpm@osdl.org> Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
-
由 Chen, Kenneth W 提交于
Following up with the work on shared page table done by Dave McCracken. This set of patch target shared page table for hugetlb memory only. The shared page table is particular useful in the situation of large number of independent processes sharing large shared memory segments. In the normal page case, the amount of memory saved from process' page table is quite significant. For hugetlb, the saving on page table memory is not the primary objective (as hugetlb itself already cuts down page table overhead significantly), instead, the purpose of using shared page table on hugetlb is to allow faster TLB refill and smaller cache pollution upon TLB miss. With PT sharing, pte entries are shared among hundreds of processes, the cache consumption used by all the page table is smaller and in return, application gets much higher cache hit ratio. One other effect is that cache hit ratio with hardware page walker hitting on pte in cache will be higher and this helps to reduce tlb miss latency. These two effects contribute to higher application performance. Signed-off-by: NKen Chen <kenneth.w.chen@intel.com> Acked-by: NHugh Dickins <hugh@veritas.com> Cc: Dave McCracken <dmccr@us.ibm.com> Cc: William Lee Irwin III <wli@holomorphy.com> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: David Gibson <david@gibson.dropbear.id.au> Cc: Adam Litke <agl@us.ibm.com> Cc: Paul Mundt <lethal@linux-sh.org> Cc: "David S. Miller" <davem@davemloft.net> Signed-off-by: NAndrew Morton <akpm@osdl.org> Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
-
由 Chen, Kenneth W 提交于
Signed-off-by: NKen Chen <kenneth.w.chen@intel.com> Cc: David Gibson <david@gibson.dropbear.id.au> Cc: Hugh Dickins <hugh@veritas.com> Signed-off-by: NAndrew Morton <akpm@osdl.org> Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
-
- 29 10月, 2006 1 次提交
-
-
由 Hugh Dickins 提交于
If you truncated an mmap'ed hugetlbfs file, then faulted on the truncated area, /proc/meminfo's HugePages_Rsvd wrapped hugely "negative". Reinstate my preliminary i_size check before attempting to allocate the page (though this only fixes the most obvious case: more work will be needed here). Signed-off-by: NHugh Dickins <hugh@veritas.com> Cc: Adam Litke <agl@us.ibm.com> Cc: David Gibson <david@gibson.dropbear.id.au> Cc: "Chen, Kenneth W" <kenneth.w.chen@intel.com> Signed-off-by: NAndrew Morton <akpm@osdl.org> Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
-
- 12 10月, 2006 1 次提交
-
-
由 Chen, Kenneth W 提交于
commit fe1668ae causes kernel to oops with libhugetlbfs test suite. The problem is that hugetlb pages can be shared by multiple mappings. Multiple threads can fight over page->lru in the unmap path and bad things happen. We now serialize __unmap_hugepage_range to void concurrent linked list manipulation. Such serialization is also needed for shared page table page on hugetlb area. This patch will fixed the bug and also serve as a prepatch for shared page table. Signed-off-by: NKen Chen <kenneth.w.chen@intel.com> Cc: Hugh Dickins <hugh@veritas.com> Cc: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: NAndrew Morton <akpm@osdl.org> Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
-
- 04 10月, 2006 1 次提交
-
-
由 Chen, Kenneth W 提交于
Spotted by Hugh that hugetlb page is free'ed back to global pool before performing any TLB flush in unmap_hugepage_range(). This potentially allow threads to abuse free-alloc race condition. The generic tlb gather code is unsuitable to use by hugetlb, I just open coded a page gathering list and delayed put_page until tlb flush is performed. Cc: Hugh Dickins <hugh@veritas.com> Signed-off-by: NKen Chen <kenneth.w.chen@intel.com> Acked-by: NWilliam Irwin <wli@holomorphy.com> Signed-off-by: NAndrew Morton <akpm@osdl.org> Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
-
- 26 9月, 2006 2 次提交
-
-
由 Christoph Lameter 提交于
There are many places where we need to determine the node of a zone. Currently we use a difficult to read sequence of pointer dereferencing. Put that into an inline function and use throughout VM. Maybe we can find a way to optimize the lookup in the future. Signed-off-by: NChristoph Lameter <clameter@sgi.com> Signed-off-by: NAndrew Morton <akpm@osdl.org> Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
-
由 Christoph Lameter 提交于
I found two location in hugetlb.c where we chase pointer instead of using page_to_nid(). Page_to_nid is more effective and can get the node directly from page flags. Signed-off-by: NChristoph Lameter <clameter@sgi.com> Signed-off-by: NAndrew Morton <akpm@osdl.org> Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
-
- 23 6月, 2006 1 次提交
-
-
由 Chen, Kenneth W 提交于
Current hugetlb strict accounting for shared mapping always assume mapping starts at zero file offset and reserves pages between zero and size of the file. This assumption often reserves (or lock down) a lot more pages then necessary if application maps at none zero file offset. libhugetlbfs is one example that requires proper reservation on shared mapping starts at none zero offset. This patch extends the reservation and hugetlb strict accounting to support any arbitrary pair of (offset, len), resulting a much more robust and accurate scheme. More importantly, it won't lock down any hugetlb pages outside file mapping. Signed-off-by: NKen Chen <kenneth.w.chen@intel.com> Acked-by: NAdam Litke <agl@us.ibm.com> Cc: David Gibson <david@gibson.dropbear.id.au> Cc: William Lee Irwin III <wli@holomorphy.com> Signed-off-by: NAndrew Morton <akpm@osdl.org> Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
-
- 01 4月, 2006 2 次提交
-
-
由 Chen, Kenneth W 提交于
With strict page reservation, I think kernel should enforce number of free hugetlb page don't fall below reserved count. Currently it is possible in the sysctl path. Add proper check in sysctl to disallow that. Signed-off-by: NKen Chen <kenneth.w.chen@intel.com> Cc: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: NAndrew Morton <akpm@osdl.org> Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
-
由 Chen, Kenneth W 提交于
git-commit: d5d4b0aa "[PATCH] optimize follow_hugetlb_page" breaks mlock on hugepage areas. I mis-interpret pages argument and made get_page() unconditional. It should only get a ref count when "pages" argument is non-null. Credit goes to Adam Litke who spotted the bug. Signed-off-by: NKen Chen <kenneth.w.chen@intel.com> Acked-by: NAdam Litke <agl@us.ibm.com> Cc: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: NAndrew Morton <akpm@osdl.org> Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
-
- 22 3月, 2006 1 次提交
-
-
由 Paul Jackson 提交于
Fix bogus node loop in hugetlb.c alloc_fresh_huge_page(), which was assuming that nodes are numbered contiguously from 0 to num_online_nodes(). Once the hotplug folks get this far, that will be false. Signed-off-by: NPaul Jackson <pj@sgi.com> Acked-by: NChristoph Lameter <clameter@sgi.com> Signed-off-by: NAndrew Morton <akpm@osdl.org> Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
-