1. 01 Nov 2005, 2 commits
    • [BLOCK] Unify the separate read/write io stat fields into arrays · a362357b
      Committed by Jens Axboe
      Instead of having ->read_sectors and ->write_sectors, combine the two
      into ->sectors[2] and similar for the other fields. This saves a branch
      in several places in the io path, since we don't have to care what the
      actual io direction is. On my x86-64 box, that's 200 bytes less text in
      just the core (not counting the various drivers).
      Signed-off-by: Jens Axboe <axboe@suse.de>
      a362357b
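      A minimal user-space sketch of the indexing idea above, not the kernel's
      actual disk-statistics structures: keep one two-element array per counter,
      indexed by I/O direction, so the accounting path has no read/write branch.
      The struct and helper names are illustrative only.

      #include <stdio.h>

      enum { READ = 0, WRITE = 1 };           /* direction doubles as array index */

      struct disk_stats_sketch {
              unsigned long sectors[2];       /* was: read_sectors, write_sectors */
              unsigned long ios[2];           /* was: reads, writes */
      };

      static void account_io(struct disk_stats_sketch *s, int rw,
                             unsigned long nsect)
      {
              /* one unconditional update; no "if (rw == WRITE)" branch */
              s->sectors[rw] += nsect;
              s->ios[rw]++;
      }

      int main(void)
      {
              struct disk_stats_sketch s = { .sectors = { 0, 0 }, .ios = { 0, 0 } };

              account_io(&s, READ, 8);
              account_io(&s, WRITE, 16);
              printf("read: %lu sectors, write: %lu sectors\n",
                     s.sectors[READ], s.sectors[WRITE]);
              return 0;
      }
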
    • [PATCH] fix __writeback_single_inode WARN_ON · 659603ef
      Committed by Andrea Arcangeli
      When the inode count is zero in inode writeback, the
      
      	WARN_ON(!(inode->i_state & I_WILL_FREE));
      
      is broken; it needs to test for either I_WILL_FREE or I_FREEING.
      
      When the inode is in the I_FREEING state it is already out of the VM's
      visibility, so it cannot be freed and does not require the __iget; the
      generic_delete_inode path can end up calling the sync from the low-level
      fs callback during the last iput.  So an inode in I_FREEING is also a
      valid condition for calling the sync with i_count == 0.
      
      The specific stack trace is this:
      
        0xc00000007b8fb6e0  0xc00000000010118c  .__writeback_single_inode +0x5c
        0xc00000007b8fb6e0  0xc0000000001014dc (lr) .sync_inode +0x3c
        0xc00000007b8fb790  0xc0000000001014dc  .sync_inode +0x3c
        0xc00000007b8fb820  0xc0000000001a5020  .ext2_sync_inode +0x64
        0xc00000007b8fb8f0  0xc0000000001a65b4  .ext2_truncate +0x3f8
        0xc00000007b8fba40  0xc0000000001a6940  .ext2_delete_inode +0xdc
        0xc00000007b8fbac0  0xc0000000000f7a5c  .generic_delete_inode +0x124
        0xc00000007b8fbb50  0xc0000000000f5fe0  .iput +0xb8
        0xc00000007b8fbbe0  0xc0000000000e9fd4  .sys_unlink +0x2a8
        0xc00000007b8fbd10  0xc00000000001048c  .ret_from_syscall_1 +0x0
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      659603ef
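      A self-contained sketch of the corrected check described above, assuming
      illustrative flag values; the real test lives in __writeback_single_inode()
      and uses the kernel's WARN_ON, for which warn_on() below is a stand-in.

      #include <stdio.h>

      #define I_FREEING       0x04            /* illustrative values, not the kernel's */
      #define I_WILL_FREE     0x10

      struct inode_sketch {
              unsigned long i_state;
              int i_count;
      };

      static void warn_on(int cond, const char *what)
      {
              if (cond)
                      fprintf(stderr, "WARNING: %s\n", what);
      }

      static void writeback_check(const struct inode_sketch *inode)
      {
              if (inode->i_count != 0)
                      return;
              /* before the fix, only I_WILL_FREE was accepted, so an inode being
               * synced from the generic_delete_inode path (I_FREEING) warned;
               * after the fix, either state is legitimate at i_count == 0 */
              warn_on(!(inode->i_state & (I_FREEING | I_WILL_FREE)),
                      "writeback of a zero-count inode");
      }

      int main(void)
      {
              struct inode_sketch freeing = { .i_state = I_FREEING, .i_count = 0 };

              writeback_check(&freeing);      /* no warning with the fixed test */
              printf("check passed for an I_FREEING inode with i_count == 0\n");
              return 0;
      }
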
  2. 31 Oct 2005, 26 commits
  3. 30 Oct 2005, 12 commits
    • [PATCH] hugetlb: overcommit accounting check · 2e9b367c
      Committed by Adam Litke
      Basic overcommit checking for hugetlb_file_map() based on an implementation
      used with demand faulting in SLES9.
      
      Since demand faulting can't guarantee the availability of pages at mmap
      time, this patch implements a basic sanity check to ensure that the number
      of huge pages required to satisfy the mmap is currently available.
      Despite the obvious race, I think it is a good start on doing proper
      accounting.  I'd like to work towards an accounting system that mimics the
      semantics of normal pages (especially for the MAP_PRIVATE/COW case).  That
      work is underway and builds on what this patch starts.
      
      Huge page shared memory segments are simpler and still maintain their
      commit on shmget semantics.
      Signed-off-by: Adam Litke <agl@us.ibm.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      2e9b367c
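      A user-space sketch of the sanity check described above, assuming a 2 MB
      huge page size and a made-up free_huge_pages counter: refuse the mapping if
      it would need more huge pages than are currently free.  As the log notes,
      the real check is racy by design; this only illustrates the arithmetic.

      #include <stdio.h>

      #define HPAGE_SIZE      (2UL * 1024 * 1024)     /* illustrative huge page size */

      static unsigned long free_huge_pages = 64;      /* stand-in for the global pool */

      static int hugetlb_mmap_check(unsigned long len)
      {
              unsigned long needed = (len + HPAGE_SIZE - 1) / HPAGE_SIZE;

              /* basic overcommit check: are enough huge pages free right now? */
              if (needed > free_huge_pages)
                      return -1;              /* the kernel would return -ENOMEM */
              return 0;
      }

      int main(void)
      {
              printf("128 MB mapping: %s\n",
                     hugetlb_mmap_check(128UL << 20) ? "rejected" : "ok");
              printf("512 MB mapping: %s\n",
                     hugetlb_mmap_check(512UL << 20) ? "rejected" : "ok");
              return 0;
      }
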
    • [PATCH] hugetlb: demand fault handler · 4c887265
      Committed by Adam Litke
      Below is a patch to implement demand faulting for huge pages.  The main
      motivation for changing from prefaulting to demand faulting is so that huge
      page memory areas can be allocated according to NUMA policy.
      
      Thanks to consolidated hugetlb code, switching the behavior requires changing
      only one fault handler.  The bulk of the patch just moves the logic from
      hugetlb_prefault() to hugetlb_pte_fault() and find_get_huge_page().
      Signed-off-by: Adam Litke <agl@us.ibm.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      4c887265
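      A rough user-space sketch of the prefault-versus-demand-fault distinction:
      nothing is allocated at mmap time, and a page appears only when the fault
      handler first touches it, which is where NUMA policy can be applied.  The
      tiny "page cache" array and the find_get_page()/hugetlb_fault() stand-ins
      below are illustrative, not the kernel code this patch touches.

      #include <stdio.h>
      #include <stdlib.h>

      #define NR_HPAGES 8

      static void *cache[NR_HPAGES];          /* stand-in for the file's page cache */

      static void *find_get_page(int idx)     /* lookup only, never allocates */
      {
              return cache[idx];
      }

      static void *hugetlb_fault(int idx)     /* demand fault: allocate on first touch */
      {
              void *page = find_get_page(idx);

              if (!page) {
                      /* in the kernel this would be the huge page allocator,
                       * run under the faulting process's NUMA policy */
                      page = malloc(1);
                      cache[idx] = page;
                      printf("faulted in huge page %d\n", idx);
              }
              return page;
      }

      int main(void)
      {
              hugetlb_fault(3);               /* first touch allocates */
              hugetlb_fault(3);               /* second touch finds the cached page */
              return 0;
      }
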
    • [PATCH] cleanup hugetlbfs_forget_inode · 0b1533f6
      Committed by Christoph Hellwig
      Reformat hugetlbfs_forget_inode and add the missing but harmless
      write_inode_now call.  It looks the same as generic_forget_inode now except
      for the call to truncate_hugepages instead of truncate_inode_pages.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      0b1533f6
    • [PATCH] kill hugetlbfs_do_delete_inode · 6b09b9df
      Committed by Christoph Hellwig
      hugetlbfs_do_delete_inode is the same as generic_delete_inode now, so remove
      it in favour of the latter.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      6b09b9df
    • [PATCH] hugetlbfs: clean up hugetlbfs_delete_inode · 149f4211
      Committed by Christoph Hellwig
      Make hugetlbfs_delete_inode look the same as generic_delete_inode, fixing a
      bunch of missing updates to it at the same time.  Rename it to
      hugetlbfs_do_delete_inode and add a real hugetlbfs_delete_inode that
      implements ->delete_inode.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      149f4211
    • [PATCH] hugetlbfs: move free_inodes accounting · 96527980
      Committed by Christoph Hellwig
      Move hugetlbfs accounting into ->alloc_inode / ->destroy_inode.  This keeps
      the code simpler, fixes a leak where a failing inode allocation wouldn't
      decrement the counter and moves hugetlbfs_delete_inode and
      hugetlbfs_forget_inode closer to their generic counterparts.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      96527980
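      A sketch of the accounting pattern described above, with made-up structures:
      charge the per-superblock free-inodes counter in ->alloc_inode and return the
      charge in ->destroy_inode, so a failed allocation is balanced on the spot and
      every live inode is uncharged exactly once.

      #include <stdio.h>
      #include <stdlib.h>

      struct sbinfo_sketch {
              long free_inodes;               /* stand-in for the hugetlbfs sb accounting */
      };

      struct inode_sketch {
              struct sbinfo_sketch *sb;
      };

      static struct inode_sketch *sketch_alloc_inode(struct sbinfo_sketch *sb)
      {
              struct inode_sketch *inode;

              if (sb->free_inodes <= 0)
                      return NULL;            /* over the configured limit */
              sb->free_inodes--;              /* charge up front */

              inode = malloc(sizeof(*inode));
              if (!inode) {
                      sb->free_inodes++;      /* allocation failed: undo the charge */
                      return NULL;
              }
              inode->sb = sb;
              return inode;
      }

      static void sketch_destroy_inode(struct inode_sketch *inode)
      {
              inode->sb->free_inodes++;       /* uncharge exactly once, here */
              free(inode);
      }

      int main(void)
      {
              struct sbinfo_sketch sb = { .free_inodes = 2 };
              struct inode_sketch *inode = sketch_alloc_inode(&sb);

              if (inode)
                      sketch_destroy_inode(inode);
              printf("free inodes: %ld\n", sb.free_inodes);    /* back to 2 */
              return 0;
      }
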
    • [PATCH] mm: split page table lock · 4c21e2f2
      Committed by Hugh Dickins
      Christoph Lameter demonstrated very poor scalability on the SGI 512-way, with
      a many-threaded application which concurrently initializes different parts of
      a large anonymous area.
      
      This patch corrects that, by using a separate spinlock per page table page, to
      guard the page table entries in that page, instead of using the mm's single
      page_table_lock.  (But even then, page_table_lock is still used to guard page
      table allocation, and anon_vma allocation.)
      
      In this implementation, the spinlock is tucked inside the struct page of the
      page table page: with a BUILD_BUG_ON in case it overflows - which it would in
      the case of 32-bit PA-RISC with spinlock debugging enabled.
      
      Splitting the lock is not quite for free: another cacheline access.  Ideally,
      I suppose we would use split ptlock only for multi-threaded processes on
      multi-cpu machines; but deciding that dynamically would have its own costs.
      So for now enable it by config, at some number of cpus - since the Kconfig
      language doesn't support inequalities, let preprocessor compare that with
      NR_CPUS.  But I don't think it's worth being user-configurable: for good
      testing of both split and unsplit configs, split now at 4 cpus, and perhaps
      change that to 8 later.
      
      There is a benefit even for singly threaded processes: kswapd can be attacking
      one part of the mm while another part is busy faulting.
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      4c21e2f2
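      A sketch of the lock-selection idea, using pthread mutexes and made-up struct
      names in place of the kernel's spinlocks, struct page and struct mm: above a
      configured CPU count each page-table page carries its own lock, otherwise
      everything falls back to the single mm-wide page_table_lock.  The threshold
      of 4 matches the "split now at 4 cpus" choice in the log.

      #include <pthread.h>
      #include <stdio.h>

      #define NR_CPUS                 8
      #define SPLIT_PTLOCK_CPUS       4       /* split at 4 cpus, per the log */

      struct page_sketch {
              pthread_mutex_t ptl;            /* tucked inside the page-table page */
      };

      struct mm_sketch {
              pthread_mutex_t page_table_lock;
      };

      static pthread_mutex_t *pte_lockptr(struct mm_sketch *mm,
                                          struct page_sketch *ptpage)
      {
      #if NR_CPUS >= SPLIT_PTLOCK_CPUS
              (void)mm;
              return &ptpage->ptl;            /* per-page-table lock */
      #else
              (void)ptpage;
              return &mm->page_table_lock;    /* one big lock */
      #endif
      }

      int main(void)
      {
              struct mm_sketch mm = { .page_table_lock = PTHREAD_MUTEX_INITIALIZER };
              struct page_sketch pt = { .ptl = PTHREAD_MUTEX_INITIALIZER };
              pthread_mutex_t *lock = pte_lockptr(&mm, &pt);

              pthread_mutex_lock(lock);       /* guards only this page's ptes */
              pthread_mutex_unlock(lock);
              printf("split ptlock in use: %d\n", NR_CPUS >= SPLIT_PTLOCK_CPUS);
              return 0;
      }
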
    • [PATCH] mm: follow_page with inner ptlock · deceb6cd
      Committed by Hugh Dickins
      Final step in pushing down common core's page_table_lock.  follow_page no
      longer wants caller to hold page_table_lock, uses pte_offset_map_lock itself;
      and so no page_table_lock is taken in get_user_pages itself.
      
      But get_user_pages (and get_futex_key) do then need follow_page to pin the
      page for them: take Daniel's suggestion of bitflags to follow_page.
      
      Need one for WRITE, another for TOUCH (it was the accessed flag before:
      vanished along with check_user_page_readable, but surely get_numa_maps is
      wrong to mark every page it finds as accessed), another for GET.
      
      And another, ANON to dispose of untouched_anonymous_page: it seems silly for
      that to descend a second time, let follow_page observe if there was no page
      table and return ZERO_PAGE if so.  Fix minor bug in that: check VM_LOCKED -
      make_pages_present ought to make readonly anonymous present.
      
      Give get_numa_maps a cond_resched while we're there.
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      deceb6cd
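      A sketch of the caller-visible change: a single flags argument with one bit
      per need, rather than separate boolean parameters.  The flag names below
      follow the WRITE/TOUCH/GET/ANON description in the log; the values and the
      stub follow_page_sketch() are illustrative, not the kernel's implementation.

      #include <stdio.h>

      #define FOLL_WRITE      0x01            /* caller needs a writable pte */
      #define FOLL_TOUCH      0x02            /* mark the page accessed */
      #define FOLL_GET        0x04            /* take a reference (pin) on the page */
      #define FOLL_ANON       0x08            /* give ZERO_PAGE if no page table */

      static void *follow_page_sketch(unsigned long addr, unsigned int flags)
      {
              /* a real implementation walks the page tables under the pte lock;
               * this stub only shows how the caller states what it needs */
              printf("lookup %#lx: write=%d touch=%d get=%d anon=%d\n", addr,
                     !!(flags & FOLL_WRITE), !!(flags & FOLL_TOUCH),
                     !!(flags & FOLL_GET), !!(flags & FOLL_ANON));
              return NULL;
      }

      int main(void)
      {
              /* a get_user_pages-style caller wants the page pinned and accessed */
              follow_page_sketch(0x1000, FOLL_TOUCH | FOLL_GET);
              return 0;
      }
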
    • [PATCH] mm: unmap_vmas with inner ptlock · 508034a3
      Committed by Hugh Dickins
      Remove the page_table_lock from around the calls to unmap_vmas, and replace
      the pte_offset_map in zap_pte_range by pte_offset_map_lock: all callers are
      now safe to descend without page_table_lock.
      
      Don't attempt fancy locking for hugepages, just take page_table_lock in
      unmap_hugepage_range.  Which makes zap_hugepage_range, and the hugetlb test in
      zap_page_range, redundant: unmap_vmas calls unmap_hugepage_range anyway.  Nor
      does unmap_vmas have much use for its mm arg now.
      
      The tlb_start_vma and tlb_end_vma in unmap_page_range are now called without
      page_table_lock: if they're implemented at all, they typically come down to
      flush_cache_range (usually done outside page_table_lock) and flush_tlb_range
      (which we already audited for the mprotect case).
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      508034a3
    • [PATCH] mm: pte_offset_map_lock loops · 705e87c0
      Committed by Hugh Dickins
      Convert those common loops using page_table_lock on the outside and
      pte_offset_map within to use just pte_offset_map_lock within instead.
      
      These all hold mmap_sem (some exclusively, some not), so at no level can a
      page table be whipped away from beneath them.  But whereas pte_alloc loops
      tested with the "atomic" pmd_present, these loops are testing with pmd_none,
      which on i386 PAE tests both lower and upper halves.
      
      That's now unsafe, so add a cast into pmd_none to test only the vital lower
      half: we lose a little sensitivity to a corrupt middle directory, but not
      enough to worry about.  It appears that i386 and UML were the only
      architectures vulnerable in this way; pgd and pud pose no problem.
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      705e87c0
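      A small sketch of the pmd_none() adjustment for the i386 PAE case, with an
      illustrative 64-bit entry type: read without the lock, the two 32-bit halves
      of an entry being installed may be seen out of sync, so the emptiness test
      is restricted to the lower half that actually says whether a page table is
      there.

      #include <stdint.h>
      #include <stdio.h>

      typedef struct { uint64_t pmd; } pmd_t;         /* PAE-style 64-bit entry */

      /* unsafe without the lock: looks at both 32-bit halves of the entry */
      static int pmd_none_full(pmd_t p)
      {
              return p.pmd == 0;
      }

      /* the fix: cast so only the vital lower half decides "no page table here" */
      static int pmd_none_lower(pmd_t p)
      {
              return (uint32_t)p.pmd == 0;
      }

      int main(void)
      {
              /* an entry observed while being updated: only one half visible yet */
              pmd_t racy = { .pmd = (uint64_t)1 << 32 };

              printf("full test says none: %d, lower-half test says none: %d\n",
                     pmd_none_full(racy), pmd_none_lower(racy));
              return 0;
      }
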
    • [PATCH] mm: ptd_alloc take ptlock · c74df32c
      Committed by Hugh Dickins
      Second step in pushing down the page_table_lock.  Remove the temporary
      bridging hack from __pud_alloc, __pmd_alloc, __pte_alloc: expect callers not
      to hold page_table_lock, whether it's on init_mm or a user mm; take
      page_table_lock internally to check if a racing task already allocated.
      
      Convert their callers from common code.  But avoid coming back to change them
      again later: instead of moving the spin_lock(&mm->page_table_lock) down,
      switch over to new macros pte_alloc_map_lock and pte_unmap_unlock, which
      encapsulate the mapping+locking and unlocking+unmapping together, and in the
      end may use alternatives to the mm page_table_lock itself.
      
      These callers all hold mmap_sem (some exclusively, some not), so at no level
      can a page table be whipped away from beneath them; and pte_alloc uses the
      "atomic" pmd_present to test whether it needs to allocate.  It appears that on
      all arches we can safely descend without page_table_lock.
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      c74df32c
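      A sketch of the encapsulation idea, with pthread mutexes and plain functions
      standing in for the kernel's macros and locks: one helper maps the pte page
      and takes whatever lock protects it, the other releases both, so callers stop
      open-coding spin_lock(&mm->page_table_lock) and the lock choice can change
      later without touching them again.

      #include <pthread.h>
      #include <stdio.h>

      struct mm_sketch {
              pthread_mutex_t page_table_lock;
              unsigned long pte[4];           /* stand-in for a mapped pte page */
      };

      /* map the pte page and take its lock in one step; report the lock used */
      static unsigned long *pte_alloc_map_lock(struct mm_sketch *mm,
                                               pthread_mutex_t **ptlp)
      {
              *ptlp = &mm->page_table_lock;   /* later: possibly a per-page lock */
              pthread_mutex_lock(*ptlp);
              return mm->pte;
      }

      /* undo both: drop the lock, then "unmap" (a no-op in this sketch) */
      static void pte_unmap_unlock(unsigned long *pte, pthread_mutex_t *ptl)
      {
              (void)pte;
              pthread_mutex_unlock(ptl);
      }

      int main(void)
      {
              struct mm_sketch mm = { .page_table_lock = PTHREAD_MUTEX_INITIALIZER };
              pthread_mutex_t *ptl;
              unsigned long *pte = pte_alloc_map_lock(&mm, &ptl);

              pte[0] = 1;                     /* install an entry under the lock */
              pte_unmap_unlock(pte, ptl);
              printf("entry installed: %lu\n", mm.pte[0]);
              return 0;
      }
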
    • [PATCH] mm: update_hiwaters just in time · 365e9c87
      Committed by Hugh Dickins
      update_mem_hiwater has attracted various criticisms, in particular from those
      concerned with mm scalability.  Originally it was called whenever rss or
      total_vm got raised.  Then many of those callsites were replaced by a timer
      tick call from account_system_time.  Now Frank van Maarseveen reports that
      to be inadequate.  How about this?  Works for Frank.
      
      Replace update_mem_hiwater, a poor combination of two unrelated ops, by macros
      update_hiwater_rss and update_hiwater_vm.  Don't attempt to keep
      mm->hiwater_rss up to date at timer tick, nor every time we raise rss (usually
      by 1): those are hot paths.  Do the opposite, update only when about to lower
      rss (usually by many), or just before final accounting in do_exit.  Handle
      mm->hiwater_vm in the same way, though it's much less of an issue.  Demand
      that whoever collects these hiwater statistics do the work of taking the
      maximum with rss or total_vm.
      
      And there has been no collector of these hiwater statistics in the tree.  The
      new convention needs an example, so match Frank's usage by adding a VmPeak
      line above VmSize to /proc/<pid>/status, and also a VmHWM line above VmRSS
      (High-Water-Mark or High-Water-Memory).
      
      There was a particular anomaly during mremap move, that hiwater_vm might be
      captured too high.  A fleeting such anomaly remains, but it's quickly
      corrected now, whereas before it would stick.
      
      What locking?  None: if the app is racy then these statistics will be racy,
      it's not worth any overhead to make them exact.  But whenever it suits,
      hiwater_vm is updated under exclusive mmap_sem, and hiwater_rss under
      page_table_lock (for now) or with preemption disabled (later on): without
      going to any trouble, minimize the time between reading current values and
      updating, to minimize those occasions when a racing thread bumps a count up
      and back down in between.
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      365e9c87
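      A sketch of the "update only when about to lower" convention, with made-up
      struct fields: update_hiwater_rss() runs just before rss is decreased, so the
      hot paths that raise rss by one page stay untouched, and whoever reports the
      value takes the maximum of the recorded high-water mark and the current rss.

      #include <stdio.h>

      struct mm_sketch {
              unsigned long rss, hiwater_rss;
      };

      #define update_hiwater_rss(mm)                                  \
              do {                                                    \
                      if ((mm)->hiwater_rss < (mm)->rss)              \
                              (mm)->hiwater_rss = (mm)->rss;          \
              } while (0)

      int main(void)
      {
              struct mm_sketch mm = { .rss = 0, .hiwater_rss = 0 };

              mm.rss += 1000;                 /* faulting pages in: no hiwater work */
              update_hiwater_rss(&mm);        /* about to drop pages: record the peak */
              mm.rss -= 600;                  /* unmap */

              /* a collector (such as the new VmHWM line) reports the maximum */
              printf("VmHWM: %lu pages, VmRSS: %lu pages\n",
                     mm.hiwater_rss > mm.rss ? mm.hiwater_rss : mm.rss, mm.rss);
              return 0;
      }
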