1. 31 10月, 2005 12 次提交
  2. 30 10月, 2005 16 次提交
    • A
      [PATCH] hugetlb: overcommit accounting check · 2e9b367c
      Adam Litke 提交于
      Basic overcommit checking for hugetlb_file_map() based on an implementation
      used with demand faulting in SLES9.
      
      Since demand faulting can't guarantee the availability of pages at mmap
      time, this patch implements a basic sanity check to ensure that the number
      of huge pages required to satisfy the mmap are currently available.
      Despite the obvious race, I think it is a good start on doing proper
      accounting.  I'd like to work towards an accounting system that mimics the
      semantics of normal pages (especially for the MAP_PRIVATE/COW case).  That
      work is underway and builds on what this patch starts.
      
      Huge page shared memory segments are simpler and still maintain their
      commit on shmget semantics.
      Signed-off-by: NAdam Litke <agl@us.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      2e9b367c
    • A
      [PATCH] hugetlb: demand fault handler · 4c887265
      Adam Litke 提交于
      Below is a patch to implement demand faulting for huge pages.  The main
      motivation for changing from prefaulting to demand faulting is so that huge
      page memory areas can be allocated according to NUMA policy.
      
      Thanks to consolidated hugetlb code, switching the behavior requires changing
      only one fault handler.  The bulk of the patch just moves the logic from
      hugelb_prefault() to hugetlb_pte_fault() and find_get_huge_page().
      Signed-off-by: NAdam Litke <agl@us.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      4c887265
    • C
      [PATCH] cleanup hugelbfs_forget_inode · 0b1533f6
      Christoph Hellwig 提交于
      Reformat hugelbfs_forget_inode and add the missing but harmless
      write_inode_now call.  It looks the same as generic_forget_inode now except
      for the call to truncate_hugepages instead of truncate_inode_pages.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      0b1533f6
    • C
      [PATCH] kill hugelbfs_do_delete_inode · 6b09b9df
      Christoph Hellwig 提交于
      hugetlbfs_do_delete_inode is the same as generic_delete_inode now, so remove
      it in favour of the latter.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      6b09b9df
    • C
      [PATCH] hugetlbfs: clean up hugetlbfs_delete_inode · 149f4211
      Christoph Hellwig 提交于
      Make hugetlbfs looks the same as generic_detelte_inode, fixing a bunch of
      missing updates to it at the same time.  Rename it to
      hugetlbfs_do_delete_inode and add a real hugetlbfs_delete_inode that
      implements ->delete_inode.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      149f4211
    • C
      [PATCH] hugetlbfs: move free_inodes accounting · 96527980
      Christoph Hellwig 提交于
      Move hugetlbfs accounting into ->alloc_inode / ->destroy_inode.  This keeps
      the code simpler, fixes a loeak where a failing inode allocation wouldn't
      decrement the counter and moves hugetlbfs_delete_inode and
      hugetlbfs_forget_inode closer to their generic counterparts.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      96527980
    • H
      [PATCH] mm: split page table lock · 4c21e2f2
      Hugh Dickins 提交于
      Christoph Lameter demonstrated very poor scalability on the SGI 512-way, with
      a many-threaded application which concurrently initializes different parts of
      a large anonymous area.
      
      This patch corrects that, by using a separate spinlock per page table page, to
      guard the page table entries in that page, instead of using the mm's single
      page_table_lock.  (But even then, page_table_lock is still used to guard page
      table allocation, and anon_vma allocation.)
      
      In this implementation, the spinlock is tucked inside the struct page of the
      page table page: with a BUILD_BUG_ON in case it overflows - which it would in
      the case of 32-bit PA-RISC with spinlock debugging enabled.
      
      Splitting the lock is not quite for free: another cacheline access.  Ideally,
      I suppose we would use split ptlock only for multi-threaded processes on
      multi-cpu machines; but deciding that dynamically would have its own costs.
      So for now enable it by config, at some number of cpus - since the Kconfig
      language doesn't support inequalities, let preprocessor compare that with
      NR_CPUS.  But I don't think it's worth being user-configurable: for good
      testing of both split and unsplit configs, split now at 4 cpus, and perhaps
      change that to 8 later.
      
      There is a benefit even for singly threaded processes: kswapd can be attacking
      one part of the mm while another part is busy faulting.
      Signed-off-by: NHugh Dickins <hugh@veritas.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      4c21e2f2
    • H
      [PATCH] mm: follow_page with inner ptlock · deceb6cd
      Hugh Dickins 提交于
      Final step in pushing down common core's page_table_lock.  follow_page no
      longer wants caller to hold page_table_lock, uses pte_offset_map_lock itself;
      and so no page_table_lock is taken in get_user_pages itself.
      
      But get_user_pages (and get_futex_key) do then need follow_page to pin the
      page for them: take Daniel's suggestion of bitflags to follow_page.
      
      Need one for WRITE, another for TOUCH (it was the accessed flag before:
      vanished along with check_user_page_readable, but surely get_numa_maps is
      wrong to mark every page it finds as accessed), another for GET.
      
      And another, ANON to dispose of untouched_anonymous_page: it seems silly for
      that to descend a second time, let follow_page observe if there was no page
      table and return ZERO_PAGE if so.  Fix minor bug in that: check VM_LOCKED -
      make_pages_present ought to make readonly anonymous present.
      
      Give get_numa_maps a cond_resched while we're there.
      Signed-off-by: NHugh Dickins <hugh@veritas.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      deceb6cd
    • H
      [PATCH] mm: unmap_vmas with inner ptlock · 508034a3
      Hugh Dickins 提交于
      Remove the page_table_lock from around the calls to unmap_vmas, and replace
      the pte_offset_map in zap_pte_range by pte_offset_map_lock: all callers are
      now safe to descend without page_table_lock.
      
      Don't attempt fancy locking for hugepages, just take page_table_lock in
      unmap_hugepage_range.  Which makes zap_hugepage_range, and the hugetlb test in
      zap_page_range, redundant: unmap_vmas calls unmap_hugepage_range anyway.  Nor
      does unmap_vmas have much use for its mm arg now.
      
      The tlb_start_vma and tlb_end_vma in unmap_page_range are now called without
      page_table_lock: if they're implemented at all, they typically come down to
      flush_cache_range (usually done outside page_table_lock) and flush_tlb_range
      (which we already audited for the mprotect case).
      Signed-off-by: NHugh Dickins <hugh@veritas.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      508034a3
    • H
      [PATCH] mm: pte_offset_map_lock loops · 705e87c0
      Hugh Dickins 提交于
      Convert those common loops using page_table_lock on the outside and
      pte_offset_map within to use just pte_offset_map_lock within instead.
      
      These all hold mmap_sem (some exclusively, some not), so at no level can a
      page table be whipped away from beneath them.  But whereas pte_alloc loops
      tested with the "atomic" pmd_present, these loops are testing with pmd_none,
      which on i386 PAE tests both lower and upper halves.
      
      That's now unsafe, so add a cast into pmd_none to test only the vital lower
      half: we lose a little sensitivity to a corrupt middle directory, but not
      enough to worry about.  It appears that i386 and UML were the only
      architectures vulnerable in this way, and pgd and pud no problem.
      Signed-off-by: NHugh Dickins <hugh@veritas.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      705e87c0
    • H
      [PATCH] mm: ptd_alloc take ptlock · c74df32c
      Hugh Dickins 提交于
      Second step in pushing down the page_table_lock.  Remove the temporary
      bridging hack from __pud_alloc, __pmd_alloc, __pte_alloc: expect callers not
      to hold page_table_lock, whether it's on init_mm or a user mm; take
      page_table_lock internally to check if a racing task already allocated.
      
      Convert their callers from common code.  But avoid coming back to change them
      again later: instead of moving the spin_lock(&mm->page_table_lock) down,
      switch over to new macros pte_alloc_map_lock and pte_unmap_unlock, which
      encapsulate the mapping+locking and unlocking+unmapping together, and in the
      end may use alternatives to the mm page_table_lock itself.
      
      These callers all hold mmap_sem (some exclusively, some not), so at no level
      can a page table be whipped away from beneath them; and pte_alloc uses the
      "atomic" pmd_present to test whether it needs to allocate.  It appears that on
      all arches we can safely descend without page_table_lock.
      Signed-off-by: NHugh Dickins <hugh@veritas.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      c74df32c
    • H
      [PATCH] mm: update_hiwaters just in time · 365e9c87
      Hugh Dickins 提交于
      update_mem_hiwater has attracted various criticisms, in particular from those
      concerned with mm scalability.  Originally it was called whenever rss or
      total_vm got raised.  Then many of those callsites were replaced by a timer
      tick call from account_system_time.  Now Frank van Maarseveen reports that to
      be found inadequate.  How about this?  Works for Frank.
      
      Replace update_mem_hiwater, a poor combination of two unrelated ops, by macros
      update_hiwater_rss and update_hiwater_vm.  Don't attempt to keep
      mm->hiwater_rss up to date at timer tick, nor every time we raise rss (usually
      by 1): those are hot paths.  Do the opposite, update only when about to lower
      rss (usually by many), or just before final accounting in do_exit.  Handle
      mm->hiwater_vm in the same way, though it's much less of an issue.  Demand
      that whoever collects these hiwater statistics do the work of taking the
      maximum with rss or total_vm.
      
      And there has been no collector of these hiwater statistics in the tree.  The
      new convention needs an example, so match Frank's usage by adding a VmPeak
      line above VmSize to /proc/<pid>/status, and also a VmHWM line above VmRSS
      (High-Water-Mark or High-Water-Memory).
      
      There was a particular anomaly during mremap move, that hiwater_vm might be
      captured too high.  A fleeting such anomaly remains, but it's quickly
      corrected now, whereas before it would stick.
      
      What locking?  None: if the app is racy then these statistics will be racy,
      it's not worth any overhead to make them exact.  But whenever it suits,
      hiwater_vm is updated under exclusive mmap_sem, and hiwater_rss under
      page_table_lock (for now) or with preemption disabled (later on): without
      going to any trouble, minimize the time between reading current values and
      updating, to minimize those occasions when a racing thread bumps a count up
      and back down in between.
      Signed-off-by: NHugh Dickins <hugh@veritas.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      365e9c87
    • N
      [PATCH] core remove PageReserved · b5810039
      Nick Piggin 提交于
      Remove PageReserved() calls from core code by tightening VM_RESERVED
      handling in mm/ to cover PageReserved functionality.
      
      PageReserved special casing is removed from get_page and put_page.
      
      All setting and clearing of PageReserved is retained, and it is now flagged
      in the page_alloc checks to help ensure we don't introduce any refcount
      based freeing of Reserved pages.
      
      MAP_PRIVATE, PROT_WRITE of VM_RESERVED regions is tentatively being
      deprecated.  We never completely handled it correctly anyway, and is be
      reintroduced in future if required (Hugh has a proof of concept).
      
      Once PageReserved() calls are removed from kernel/power/swsusp.c, and all
      arch/ and driver code, the Set and Clear calls, and the PG_reserved bit can
      be trivially removed.
      
      Last real user of PageReserved is swsusp, which uses PageReserved to
      determine whether a struct page points to valid memory or not.  This still
      needs to be addressed (a generic page_is_ram() should work).
      
      A last caveat: the ZERO_PAGE is now refcounted and managed with rmap (and
      thus mapcounted and count towards shared rss).  These writes to the struct
      page could cause excessive cacheline bouncing on big systems.  There are a
      number of ways this could be addressed if it is an issue.
      Signed-off-by: NNick Piggin <npiggin@suse.de>
      
      Refcount bug fix for filemap_xip.c
      Signed-off-by: NCarsten Otte <cotte@de.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      b5810039
    • H
      [PATCH] mm: rss = file_rss + anon_rss · 4294621f
      Hugh Dickins 提交于
      I was lazy when we added anon_rss, and chose to change as few places as
      possible.  So currently each anonymous page has to be counted twice, in rss
      and in anon_rss.  Which won't be so good if those are atomic counts in some
      configurations.
      
      Change that around: keep file_rss and anon_rss separately, and add them
      together (with get_mm_rss macro) when the total is needed - reading two
      atomics is much cheaper than updating two atomics.  And update anon_rss
      upfront, typically in memory.c, not tucked away in page_add_anon_rmap.
      Signed-off-by: NHugh Dickins <hugh@veritas.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      4294621f
    • H
      [PATCH] mm: mm_init set_mm_counters · 404351e6
      Hugh Dickins 提交于
      How is anon_rss initialized?  In dup_mmap, and by mm_alloc's memset; but
      that's not so good if an mm_counter_t is a special type.  And how is rss
      initialized?  By set_mm_counter, all over the place.  Come on, we just need to
      initialize them both at once by set_mm_counter in mm_init (which follows the
      memcpy when forking).
      Signed-off-by: NHugh Dickins <hugh@veritas.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      404351e6
    • A
      [PATCH] Convert mempolicies to nodemask_t · dfcd3c0d
      Andi Kleen 提交于
      The NUMA policy code predated nodemask_t so it used open coded bitmaps.
      Convert everything to nodemask_t.  Big patch, but shouldn't have any actual
      behaviour changes (except I removed one unnecessary check against
      node_online_map and one unnecessary BUG_ON)
      Signed-off-by: N"Andi Kleen" <ak@suse.de>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      dfcd3c0d
  3. 29 10月, 2005 5 次提交
  4. 28 10月, 2005 7 次提交