1. 30 10月, 2005 19 次提交
    • H
      [PATCH] mm: unmap_vmas with inner ptlock · 508034a3
      Hugh Dickins 提交于
      Remove the page_table_lock from around the calls to unmap_vmas, and replace
      the pte_offset_map in zap_pte_range by pte_offset_map_lock: all callers are
      now safe to descend without page_table_lock.
      
      Don't attempt fancy locking for hugepages, just take page_table_lock in
      unmap_hugepage_range.  Which makes zap_hugepage_range, and the hugetlb test in
      zap_page_range, redundant: unmap_vmas calls unmap_hugepage_range anyway.  Nor
      does unmap_vmas have much use for its mm arg now.
      
      The tlb_start_vma and tlb_end_vma in unmap_page_range are now called without
      page_table_lock: if they're implemented at all, they typically come down to
      flush_cache_range (usually done outside page_table_lock) and flush_tlb_range
      (which we already audited for the mprotect case).
      Signed-off-by: NHugh Dickins <hugh@veritas.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      508034a3
    • H
      [PATCH] mm: unlink vma before pagetables · 8f4f8c16
      Hugh Dickins 提交于
      In most places the descent from pgd to pud to pmd to pte holds mmap_sem
      (exclusively or not), which ensures that free_pgtables cannot be freeing page
      tables from any level at the same time.  But truncation and reverse mapping
      descend without mmap_sem.
      
      No problem: just make sure that a vma is unlinked from its prio_tree (or
      nonlinear list) and from its anon_vma list, after zapping the vma, but before
      freeing its page tables.  Then neither vmtruncate nor rmap can reach that vma
      whose page tables are now volatile (nor do they need to reach it, since all
      its page entries have been zapped by this stage).
      
      The i_mmap_lock and anon_vma->lock already serialize this correctly; but the
      locking hierarchy is such that we cannot take them while holding
      page_table_lock.  Well, we're trying to push that down anyway.  So in this
      patch, move anon_vma_unlink and unlink_file_vma into free_pgtables, at the
      same time as moving page_table_lock around calls to unmap_vmas.
      
      tlb_gather_mmu and tlb_finish_mmu then fall outside the page_table_lock, but
      we made them preempt_disable and preempt_enable earlier; and a long source
      audit of all the architectures has shown no problem with removing
      page_table_lock from them.  free_pgtables doesn't need page_table_lock for
      itself, nor for what it calls; tlb->mm->nr_ptes is usually protected by
      page_table_lock, but partly by non-exclusive mmap_sem - here it's decremented
      with exclusive mmap_sem, or mm_users 0.  update_hiwater_rss and
      vm_unacct_memory don't need page_table_lock either.
      Signed-off-by: NHugh Dickins <hugh@veritas.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      8f4f8c16
    • H
      [PATCH] mm: page fault handler locking · 8f4e2101
      Hugh Dickins 提交于
      On the page fault path, the patch before last pushed acquiring the
      page_table_lock down to the head of handle_pte_fault (though it's also taken
      and dropped earlier when a new page table has to be allocated).
      
      Now delete that line, read "entry = *pte" without it, and go off to this or
      that page fault handler on the basis of this unlocked peek.  Usually the
      handler can proceed without the lock, relying on the subsequent locked
      pte_same or pte_none test to back out when necessary; though do_wp_page needs
      the lock immediately, and do_file_page doesn't check (if there's a race,
      install_page just zaps the entry and reinstalls it).
      
      But on those architectures (notably i386 with PAE) whose pte is too big to be
      read atomically, if SMP or preemption is enabled, do_swap_page and
      do_file_page might cause irretrievable damage if passed a Frankenstein entry
      stitched together from unrelated parts.  In those configs, "pte_unmap_same"
      has to take page_table_lock, validate orig_pte still the same, and drop
      page_table_lock before unmapping, before proceeding.
      
      Use pte_offset_map_lock and pte_unmap_unlock throughout the handlers; but lock
      avoidance leaves more lone maps and unmaps than elsewhere.
      Signed-off-by: NHugh Dickins <hugh@veritas.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      8f4e2101
    • H
      [PATCH] mm: ptd_alloc take ptlock · c74df32c
      Hugh Dickins 提交于
      Second step in pushing down the page_table_lock.  Remove the temporary
      bridging hack from __pud_alloc, __pmd_alloc, __pte_alloc: expect callers not
      to hold page_table_lock, whether it's on init_mm or a user mm; take
      page_table_lock internally to check if a racing task already allocated.
      
      Convert their callers from common code.  But avoid coming back to change them
      again later: instead of moving the spin_lock(&mm->page_table_lock) down,
      switch over to new macros pte_alloc_map_lock and pte_unmap_unlock, which
      encapsulate the mapping+locking and unlocking+unmapping together, and in the
      end may use alternatives to the mm page_table_lock itself.
      
      These callers all hold mmap_sem (some exclusively, some not), so at no level
      can a page table be whipped away from beneath them; and pte_alloc uses the
      "atomic" pmd_present to test whether it needs to allocate.  It appears that on
      all arches we can safely descend without page_table_lock.
      Signed-off-by: NHugh Dickins <hugh@veritas.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      c74df32c
    • H
      [PATCH] mm: ptd_alloc inline and out · 1bb3630e
      Hugh Dickins 提交于
      It seems odd to me that, whereas pud_alloc and pmd_alloc test inline, only
      calling out-of-line __pud_alloc __pmd_alloc if allocation needed,
      pte_alloc_map and pte_alloc_kernel are entirely out-of-line.  Though it does
      add a little to kernel size, change them to macros testing inline, calling
      __pte_alloc or __pte_alloc_kernel to allocate out-of-line.  Mark none of them
      as fastcalls, leave that to CONFIG_REGPARM or not.
      
      It also seems more natural for the out-of-line functions to leave the offset
      calculation and map to the inline, which has to do it anyway for the common
      case.  At least mremap move wants __pte_alloc without _map.
      
      Macros rather than inline functions, certainly to avoid the header file issues
      which arise from CONFIG_HIGHPTE needing kmap_types.h, but also in case any
      architectures I haven't built would have other such problems.
      Signed-off-by: NHugh Dickins <hugh@veritas.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      1bb3630e
    • H
      [PATCH] mm: init_mm without ptlock · 872fec16
      Hugh Dickins 提交于
      First step in pushing down the page_table_lock.  init_mm.page_table_lock has
      been used throughout the architectures (usually for ioremap): not to serialize
      kernel address space allocation (that's usually vmlist_lock), but because
      pud_alloc,pmd_alloc,pte_alloc_kernel expect caller holds it.
      
      Reverse that: don't lock or unlock init_mm.page_table_lock in any of the
      architectures; instead rely on pud_alloc,pmd_alloc,pte_alloc_kernel to take
      and drop it when allocating a new one, to check lest a racing task already
      did.  Similarly no page_table_lock in vmalloc's map_vm_area.
      
      Some temporary ugliness in __pud_alloc and __pmd_alloc: since they also handle
      user mms, which are converted only by a later patch, for now they have to lock
      differently according to whether or not it's init_mm.
      
      If sources get muddled, there's a danger that an arch source taking
      init_mm.page_table_lock will be mixed with common source also taking it (or
      neither take it).  So break the rules and make another change, which should
      break the build for such a mismatch: remove the redundant mm arg from
      pte_alloc_kernel (ppc64 scrapped its distinct ioremap_mm in 2.6.13).
      
      Exceptions: arm26 used pte_alloc_kernel on user mm, now pte_alloc_map; ia64
      used pte_alloc_map on init_mm, now pte_alloc_kernel; parisc had bad args to
      pmd_alloc and pte_alloc_kernel in unused USE_HPPA_IOREMAP code; ppc64
      map_io_page forgot to unlock on failure; ppc mmu_mapin_ram and ppc64 im_free
      took page_table_lock for no good reason.
      Signed-off-by: NHugh Dickins <hugh@veritas.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      872fec16
    • H
      [PATCH] mm: update_hiwaters just in time · 365e9c87
      Hugh Dickins 提交于
      update_mem_hiwater has attracted various criticisms, in particular from those
      concerned with mm scalability.  Originally it was called whenever rss or
      total_vm got raised.  Then many of those callsites were replaced by a timer
      tick call from account_system_time.  Now Frank van Maarseveen reports that to
      be found inadequate.  How about this?  Works for Frank.
      
      Replace update_mem_hiwater, a poor combination of two unrelated ops, by macros
      update_hiwater_rss and update_hiwater_vm.  Don't attempt to keep
      mm->hiwater_rss up to date at timer tick, nor every time we raise rss (usually
      by 1): those are hot paths.  Do the opposite, update only when about to lower
      rss (usually by many), or just before final accounting in do_exit.  Handle
      mm->hiwater_vm in the same way, though it's much less of an issue.  Demand
      that whoever collects these hiwater statistics do the work of taking the
      maximum with rss or total_vm.
      
      And there has been no collector of these hiwater statistics in the tree.  The
      new convention needs an example, so match Frank's usage by adding a VmPeak
      line above VmSize to /proc/<pid>/status, and also a VmHWM line above VmRSS
      (High-Water-Mark or High-Water-Memory).
      
      There was a particular anomaly during mremap move, that hiwater_vm might be
      captured too high.  A fleeting such anomaly remains, but it's quickly
      corrected now, whereas before it would stick.
      
      What locking?  None: if the app is racy then these statistics will be racy,
      it's not worth any overhead to make them exact.  But whenever it suits,
      hiwater_vm is updated under exclusive mmap_sem, and hiwater_rss under
      page_table_lock (for now) or with preemption disabled (later on): without
      going to any trouble, minimize the time between reading current values and
      updating, to minimize those occasions when a racing thread bumps a count up
      and back down in between.
      Signed-off-by: NHugh Dickins <hugh@veritas.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      365e9c87
    • H
      [PATCH] mm: do_swap_page race major · 9e9bef07
      Hugh Dickins 提交于
      Small adjustment: do_swap_page should report its !pte_same race as a major
      fault if it had to read into swap cache, because whatever raced with it will
      have found page already in cache and reported minor fault.
      Signed-off-by: NHugh Dickins <hugh@veritas.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      9e9bef07
    • H
      [PATCH] mm: zap_pte_range dec rss · 86d912f4
      Hugh Dickins 提交于
      Small adjustment: zap_pte_range decrement its rss counts from 0 then finally
      add, avoiding negations - we don't have or need a sub_mm_rss.
      Signed-off-by: NHugh Dickins <hugh@veritas.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      86d912f4
    • H
      [PATCH] mm: copy_one_pte inc rss · 8c103762
      Hugh Dickins 提交于
      Small adjustment, following Nick's suggestion: it's more straightforward for
      copy_pte_range to let copy_one_pte do the rss incrementation, than use an
      index it passed back.  Saves a #define, and 16 bytes of .text.
      Signed-off-by: NHugh Dickins <hugh@veritas.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      8c103762
    • N
      [PATCH] core remove PageReserved · b5810039
      Nick Piggin 提交于
      Remove PageReserved() calls from core code by tightening VM_RESERVED
      handling in mm/ to cover PageReserved functionality.
      
      PageReserved special casing is removed from get_page and put_page.
      
      All setting and clearing of PageReserved is retained, and it is now flagged
      in the page_alloc checks to help ensure we don't introduce any refcount
      based freeing of Reserved pages.
      
      MAP_PRIVATE, PROT_WRITE of VM_RESERVED regions is tentatively being
      deprecated.  We never completely handled it correctly anyway, and is be
      reintroduced in future if required (Hugh has a proof of concept).
      
      Once PageReserved() calls are removed from kernel/power/swsusp.c, and all
      arch/ and driver code, the Set and Clear calls, and the PG_reserved bit can
      be trivially removed.
      
      Last real user of PageReserved is swsusp, which uses PageReserved to
      determine whether a struct page points to valid memory or not.  This still
      needs to be addressed (a generic page_is_ram() should work).
      
      A last caveat: the ZERO_PAGE is now refcounted and managed with rmap (and
      thus mapcounted and count towards shared rss).  These writes to the struct
      page could cause excessive cacheline bouncing on big systems.  There are a
      number of ways this could be addressed if it is an issue.
      Signed-off-by: NNick Piggin <npiggin@suse.de>
      
      Refcount bug fix for filemap_xip.c
      Signed-off-by: NCarsten Otte <cotte@de.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      b5810039
    • H
      [PATCH] mm: batch updating mm_counters · ae859762
      Hugh Dickins 提交于
      tlb_finish_mmu used to batch zap_pte_range's update of mm rss, which may be
      worthwhile if the mm is contended, and would reduce atomic operations if the
      counts were atomic.  Let zap_pte_range now batch its updates to file_rss and
      anon_rss, per page-table in case we drop the lock outside; and copy_pte_range
      batch them too.
      Signed-off-by: NHugh Dickins <hugh@veritas.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      ae859762
    • H
      [PATCH] mm: rss = file_rss + anon_rss · 4294621f
      Hugh Dickins 提交于
      I was lazy when we added anon_rss, and chose to change as few places as
      possible.  So currently each anonymous page has to be counted twice, in rss
      and in anon_rss.  Which won't be so good if those are atomic counts in some
      configurations.
      
      Change that around: keep file_rss and anon_rss separately, and add them
      together (with get_mm_rss macro) when the total is needed - reading two
      atomics is much cheaper than updating two atomics.  And update anon_rss
      upfront, typically in memory.c, not tucked away in page_add_anon_rmap.
      Signed-off-by: NHugh Dickins <hugh@veritas.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      4294621f
    • H
      [PATCH] mm: tlb_finish_mmu forget rss · fc2acab3
      Hugh Dickins 提交于
      zap_pte_range has been counting the pages it frees in tlb->freed, then
      tlb_finish_mmu has used that to update the mm's rss.  That got stranger when I
      added anon_rss, yet updated it by a different route; and stranger when rss and
      anon_rss became mm_counters with special access macros.  And it would no
      longer be viable if we're relying on page_table_lock to stabilize the
      mm_counter, but calling tlb_finish_mmu outside that lock.
      
      Remove the mmu_gather's freed field, let tlb_finish_mmu stick to its own
      business, just decrement the rss mm_counter in zap_pte_range (yes, there was
      some point to batching the update, and a subsequent patch restores that).  And
      forget the anal paranoia of first reading the counter to avoid going negative
      - if rss does go negative, just fix that bug.
      
      Remove the mmu_gather's flushes and avoided_flushes from arm and arm26: no use
      was being made of them.  But arm26 alone was actually using the freed, in the
      way some others use need_flush: give it a need_flush.  arm26 seems to prefer
      spaces to tabs here: respect that.
      Signed-off-by: NHugh Dickins <hugh@veritas.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      fc2acab3
    • H
      [PATCH] mm: tlb_is_full_mm was obscure · 4d6ddfa9
      Hugh Dickins 提交于
      tlb_is_full_mm?  What does that mean?  The TLB is full?  No, it means that the
      mm's last user has gone and the whole mm is being torn down.  And it's an
      inline function because sparc64 uses a different (slightly better)
      "tlb_frozen" name for the flag others call "fullmm".
      
      And now the ptep_get_and_clear_full macro used in zap_pte_range refers
      directly to tlb->fullmm, which would be wrong for sparc64.  Rather than
      correct that, I'd prefer to scrap tlb_is_full_mm altogether, and change
      sparc64 to just use the same poor name as everyone else - is that okay?
      Signed-off-by: NHugh Dickins <hugh@veritas.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      4d6ddfa9
    • H
      [PATCH] mm: page fault handlers tidyup · 65500d23
      Hugh Dickins 提交于
      Impose a little more consistency on the page fault handlers do_wp_page,
      do_swap_page, do_anonymous_page, do_no_page, do_file_page: why not pass their
      arguments in the same order, called the same names?
      
      break_cow is all very well, but what it did was inlined elsewhere: easier to
      compare if it's brought back into do_wp_page.
      
      do_file_page's fallback to do_no_page dates from a time when we were testing
      pte_file by using it wherever possible: currently it's peculiar to nonlinear
      vmas, so just check that.  BUG_ON if not?  Better not, it's probably page
      table corruption, so just show the pte: hmm, there's a pte_ERROR macro, let's
      use that for do_wp_page's invalid pfn too.
      
      Hah!  Someone in the ppc64 world noticed pte_ERROR was unused so removed it:
      restored (and say "pud" not "pmd" in its pud_ERROR).
      Signed-off-by: NHugh Dickins <hugh@veritas.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      65500d23
    • H
      [PATCH] mm: anon is already wrprotected · 72866f6f
      Hugh Dickins 提交于
      do_anonymous_page's pte_wrprotect causes some confusion: in such a case,
      vm_page_prot must already be forcing COW, so must omit write permission, and
      so the pte_wrprotect is redundant.  Replace it by a comment to that effect,
      and reword the comment on unuse_pte which also caused confusion.
      Signed-off-by: NHugh Dickins <hugh@veritas.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      72866f6f
    • H
      [PATCH] mm: zap_pte_range dont dirty anon · 6237bcd9
      Hugh Dickins 提交于
      zap_pte_range already avoids wasting time to mark_page_accessed on anon pages:
      it can also skip anon set_page_dirty - the page only needs to be marked dirty
      if shared with another mm, but that will say pte_dirty too.
      Signed-off-by: NHugh Dickins <hugh@veritas.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      6237bcd9
    • H
      [PATCH] mm: copy_pte_range progress fix · e040f218
      Hugh Dickins 提交于
      My latency breaking in copy_pte_range didn't work as intended: instead of
      checking at regularish intervals, after the first interval it checked every
      time around the loop, too impatient to be preempted.  Fix that.
      Signed-off-by: NHugh Dickins <hugh@veritas.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      e040f218
  2. 21 10月, 2005 1 次提交
  3. 20 10月, 2005 1 次提交
    • S
      [PATCH] Handle spurious page fault for hugetlb region · 3359b54c
      Seth, Rohit 提交于
      The hugetlb pages are currently pre-faulted.  At the time of mmap of
      hugepages, we populate the new PTEs.  It is possible that HW has already
      cached some of the unused PTEs internally.  These stale entries never
      get a chance to be purged in existing control flow.
      
      This patch extends the check in page fault code for hugepages.  Check if
      a faulted address falls with in size for the hugetlb file backing it.
      We return VM_FAULT_MINOR for these cases (assuming that the arch
      specific page-faulting code purges the stale entry for the archs that
      need it).
      Signed-off-by: NRohit Seth <rohit.seth@intel.com>
      
      [ This is apparently arguably an ia64 port bug. But the code won't
        hurt, and for now it fixes a real problem on some ia64 machines ]
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      3359b54c
  4. 11 9月, 2005 1 次提交
  5. 05 9月, 2005 2 次提交
    • Z
      [PATCH] x86: ptep_clear optimization · a600388d
      Zachary Amsden 提交于
      Add a new accessor for PTEs, which passes the full hint from the mmu_gather
      struct; this allows architectures with hardware pagetables to optimize away
      atomic PTE operations when destroying an address space.  Removing the
      locked operation should allow better pipelining of memory access in this
      loop.  I measured an average savings of 30-35 cycles per zap_pte_range on
      the first 500 destructions on Pentium-M, but I believe the optimization
      would win more on older processors which still assert the bus lock on xchg
      for an exclusive cacheline.
      
      Update: I made some new measurements, and this saves exactly 26 cycles over
      ptep_get_and_clear on Pentium M.  On P4, with a PAE kernel, this saves 180
      cycles per ptep_get_and_clear, for a whopping 92160 cycles savings for a
      full address space destruction.
      
      pte_clear_full is not yet used, but is provided for future optimizations
      (in particular, when running inside of a hypervisor that queues page table
      updates, the full hint allows us to avoid queueing unnecessary page table
      update for an address space in the process of being destroyed.
      
      This is not a huge win, but it does help a bit, and sets the stage for
      further hypervisor optimization of the mm layer on all architectures.
      Signed-off-by: NZachary Amsden <zach@vmware.com>
      Cc: Christoph Lameter <christoph@lameter.com>
      Cc: <linux-mm@kvack.org>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      a600388d
    • P
      [PATCH] mm: remove implied vm_ops check · 4944e76d
      Paolo 'Blaisorblade' Giarrusso 提交于
      If !vma->vm-ops we already BUG above, so retesting it is useless.  The
      compiler cannot optimize this because BUG is a macro and is not thus marked
      noreturn; that should possibly be fixed.
      Signed-off-by: NPaolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      4944e76d
  6. 30 8月, 2005 1 次提交
    • N
      [PATCH] Lazy page table copies in fork() · d992895b
      Nick Piggin 提交于
      Defer copying of ptes until fault time when it is possible to reconstruct
      the pte from backing store. Idea from Andi Kleen and Nick Piggin.
      
      Thanks to input from Rik van Riel and Linus and to Hugh for correcting
      my blundering.
      
      Ray Fucillo <fucillo@intersystems.com> reports:
      
        "I applied this latest patch to a 2.6.12 kernel and found that it does
         resolve the problem.  Prior to the patch on this machine, I was
         seeing about 23ms spent in fork for ever 100MB of shared memory
         segment.
      
         After applying the patch, fork is taking about 1ms regardless of the
         shared memory size."
      Signed-off-by: NNick Piggin <npiggin@suse.de>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      d992895b
  7. 04 8月, 2005 2 次提交
    • L
      Fix up recent get_user_pages() handling · a68d2ebc
      Linus Torvalds 提交于
      The VM_FAULT_WRITE thing is an extra bit, not a valid return value, and
      has to be treated as such by get_user_pages().
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      a68d2ebc
    • N
      [PATCH] fix get_user_pages bug · f33ea7f4
      Nick Piggin 提交于
      Checking pte_dirty instead of pte_write in __follow_page is problematic
      for s390, and for copy_one_pte which leaves dirty when clearing write.
      
      So revert __follow_page to check pte_write as before, and make
      do_wp_page pass back a special extra VM_FAULT_WRITE bit to say it has
      done its full job: once get_user_pages receives this value, it no longer
      requires pte_write in __follow_page.
      
      But most callers of handle_mm_fault, in the various architectures, have
      switch statements which do not expect this new case.  To avoid changing
      them all in a hurry, make an inline wrapper function (using the old
      name) that masks off the new bit, and use the extended interface with
      double underscores.
      
      Yes, we do have a call to do_wp_page from do_swap_page, but no need to
      change that: in rare case it's needed, another do_wp_page will follow.
      Signed-off-by: NHugh Dickins <hugh@veritas.com>
      [ Cleanups by Nick Piggin ]
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      f33ea7f4
  8. 02 8月, 2005 2 次提交
    • H
      [PATCH] x86_64: access of some bad address · 690dbe1c
      Hugh Dickins 提交于
      x86_64 has a large sparse gate area between VSYSCALL_START and
      VSYSCALL_END, not all of it presently backed by pmds.  Alexander Nyberg has
      found that in some circumstances gdb may try to ptrace here, and hit
      get_user_pages BUG_ON.  It seems odd that gdb should be accessing here, but
      it certainly shouldn't crash in this way: relax BUG_ON to -EFAULT.  Fixes
      kernel bugzilla #4801.
      Signed-off-by: NHugh Dickins <hugh@veritas.com>
      Cc: Andi Kleen <ak@suse.de>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      690dbe1c
    • L
      Fix get_user_pages() race for write access · 4ceb5db9
      Linus Torvalds 提交于
      There's no real guarantee that handle_mm_fault() will always be able to
      break a COW situation - if an update from another thread ends up
      modifying the page table some way, handle_mm_fault() may end up
      requiring us to re-try the operation.
      
      That's normally fine, but get_user_pages() ended up re-trying it as a
      read, and thus a write access could in theory end up losing the dirty
      bit or be done on a page that had not been properly COW'ed.
      
      This makes get_user_pages() always retry write accesses as write
      accesses by making "follow_page()" require that a writable follow has
      the dirty bit set.  That simplifies the code and solves the race: if the
      COW break fails for some reason, we'll just loop around and try again.
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      4ceb5db9
  9. 28 7月, 2005 1 次提交
    • A
      [PATCH] check_user_page_readable() deadlock fix · 1aaf18ff
      Andrew Morton 提交于
      Fix bug identifued by Richard Purdie <rpurdie@rpsys.net>.
      
      oprofile calls check_user_page_readable() from interrupt context, so we
      deadlock over various VFS locks.
      
      But check_user_page_readable() doesn't imply either a read or a write of the
      page's contents.  Change __follow_page() so that check_user_page_readable()
      can tell __follow_page() that we're not accessing the page's contents, and use
      that info to avoid the troublesome lock-takings.
      
      Also, make follow_page() inline for the single callsite in memory.c to save a
      bit of stack space.
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      1aaf18ff
  10. 26 6月, 2005 1 次提交
  11. 24 6月, 2005 2 次提交
    • M
      [PATCH] DocBook: update comments · 3d41088f
      Martin Waitz 提交于
      This patch updates some comments to match code changes.
      Signed-off-by: NMartin Waitz <tali@admingilde.org>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      3d41088f
    • A
      [PATCH] sparsemem memory model · d41dee36
      Andy Whitcroft 提交于
      Sparsemem abstracts the use of discontiguous mem_maps[].  This kind of
      mem_map[] is needed by discontiguous memory machines (like in the old
      CONFIG_DISCONTIGMEM case) as well as memory hotplug systems.  Sparsemem
      replaces DISCONTIGMEM when enabled, and it is hoped that it can eventually
      become a complete replacement.
      
      A significant advantage over DISCONTIGMEM is that it's completely separated
      from CONFIG_NUMA.  When producing this patch, it became apparent in that NUMA
      and DISCONTIG are often confused.
      
      Another advantage is that sparse doesn't require each NUMA node's ranges to be
      contiguous.  It can handle overlapping ranges between nodes with no problems,
      where DISCONTIGMEM currently throws away that memory.
      
      Sparsemem uses an array to provide different pfn_to_page() translations for
      each SECTION_SIZE area of physical memory.  This is what allows the mem_map[]
      to be chopped up.
      
      In order to do quick pfn_to_page() operations, the section number of the page
      is encoded in page->flags.  Part of the sparsemem infrastructure enables
      sharing of these bits more dynamically (at compile-time) between the
      page_zone() and sparsemem operations.  However, on 32-bit architectures, the
      number of bits is quite limited, and may require growing the size of the
      page->flags type in certain conditions.  Several things might force this to
      occur: a decrease in the SECTION_SIZE (if you want to hotplug smaller areas of
      memory), an increase in the physical address space, or an increase in the
      number of used page->flags.
      
      One thing to note is that, once sparsemem is present, the NUMA node
      information no longer needs to be stored in the page->flags.  It might provide
      speed increases on certain platforms and will be stored there if there is
      room.  But, if out of room, an alternate (theoretically slower) mechanism is
      used.
      
      This patch introduces CONFIG_FLATMEM.  It is used in almost all cases where
      there used to be an #ifndef DISCONTIG, because SPARSEMEM and DISCONTIGMEM
      often have to compile out the same areas of code.
      Signed-off-by: NAndy Whitcroft <apw@shadowen.org>
      Signed-off-by: NDave Hansen <haveblue@us.ibm.com>
      Signed-off-by: NMartin Bligh <mbligh@aracnet.com>
      Signed-off-by: NAdrian Bunk <bunk@stusta.de>
      Signed-off-by: NYasunori Goto <y-goto@jp.fujitsu.com>
      Signed-off-by: NBob Picco <bob.picco@hp.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      d41dee36
  12. 22 6月, 2005 3 次提交
    • H
      [PATCH] can_share_swap_page: use page_mapcount · c475a8ab
      Hugh Dickins 提交于
      Remember that ironic get_user_pages race?  when the raised page_count on a
      page swapped out led do_wp_page to decide that it had to copy on write, so
      substituted a different page into userspace.  2.6.7 onwards have Andrea's
      solution, where try_to_unmap_one backs out if it finds page_count raised.
      
      Which works, but is unsatisfying (rmap.c has no other page_count heuristics),
      and was found a few months ago to hang an intensive page migration test.  A
      year ago I was hesitant to engage page_mapcount, now it seems the right fix.
      
      So remove the page_count hack from try_to_unmap_one; and use activate_page in
      unuse_mm when dropping lock, to replace its secondary effect of helping
      swapoff to make progress in that case.
      
      Simplify can_share_swap_page (now called only on anonymous pages) to check
      page_mapcount + page_swapcount == 1: still needs the page lock to stabilize
      their (pessimistic) sum, but does not need swapper_space.tree_lock for that.
      
      In do_swap_page, move swap_free and unlock_page below page_add_anon_rmap, to
      keep sum on the high side, and correct when can_share_swap_page called.
      Signed-off-by: NHugh Dickins <hugh@veritas.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      c475a8ab
    • H
      [PATCH] do_wp_page: cannot share file page · d296e9cd
      Hugh Dickins 提交于
      A small optimization to do_wp_page's check for whether to avoid copy by
      reusing the page already mapped.  It can never share a cached file page,
      nor can it share a reserved page (often the empty zero page), so it's a
      waste of time to lock and unlock in those cases.  Which nowadays can both
      be neatly excluded by a preliminary PageAnon test.
      
      Christoph has reported that a preliminary page_count test proved valuable
      for scalability here, but PageAnon covers more common cases all at once.
      Signed-off-by: NHugh Dickins <hugh@veritas.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      d296e9cd
    • H
      [PATCH] get_user_pages: kill get_page_map · 08ef4729
      Hugh Dickins 提交于
      Since its birth, get_user_pages has been calling a misguided get_page_map
      function.  follow_page has already returned NULL if the pfn is invalid, we
      cannot reach an invalid pfn from a validated struct page.
      
      Remove get_page_map, and the messy rewind in get_user_pages to cope with
      its failure.  Oh, and could we please call that "struct page *page" like
      everywhere else, instead of "struct page *map"?
      Signed-off-by: NHugh Dickins <hugh@veritas.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      08ef4729
  13. 17 5月, 2005 1 次提交
  14. 20 4月, 2005 3 次提交
    • H
      [PATCH] freepgt: hugetlb_free_pgd_range · 3bf5ee95
      Hugh Dickins 提交于
      ia64 and ppc64 had hugetlb_free_pgtables functions which were no longer being
      called, and it wasn't obvious what to do about them.
      
      The ppc64 case turns out to be easy: the associated tables are noted elsewhere
      and freed later, safe to either skip its hugetlb areas or go through the
      motions of freeing nothing.  Since ia64 does need a special case, restore to
      ppc64 the special case of skipping them.
      
      The ia64 hugetlb case has been broken since pgd_addr_end went in, though it
      probably appeared to work okay if you just had one such area; in fact it's
      been broken much longer if you consider a long munmap spanning from another
      region into the hugetlb region.
      
      In the ia64 hugetlb region, more virtual address bits are available than in
      the other regions, yet the page tables are structured the same way: the page
      at the bottom is larger.  Here we need to scale down each addr before passing
      it to the standard free_pgd_range.  Was about to write a hugely_scaled_down
      macro, but found htlbpage_to_page already exists for just this purpose.  Fixed
      off-by-one in ia64 is_hugepage_only_range.
      
      Uninline free_pgd_range to make it available to ia64.  Make sure the
      vma-gathering loop in free_pgtables cannot join a hugepage_only_range to any
      other (safe to join huges?  probably but don't bother).
      Signed-off-by: NHugh Dickins <hugh@veritas.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      3bf5ee95
    • H
      [PATCH] freepgt: remove MM_VM_SIZE(mm) · ee39b37b
      Hugh Dickins 提交于
      There's only one usage of MM_VM_SIZE(mm) left, and it's a troublesome macro
      because mm doesn't contain the (32-bit emulation?) info needed.  But it too is
      only needed because we ignore the end from the vma list.
      
      We could make flush_pgtables return that end, or unmap_vmas.  Choose the
      latter, since it's a natural fit with unmap_mapping_range_vma needing to know
      its restart addr.  This does make more than minimal change, but if unmap_vmas
      had returned the end before, this is how we'd have done it, rather than
      storing the break_addr in zap_details.
      
      unmap_vmas used to return count of vmas scanned, but that's just debug which
      hasn't been useful in a while; and if we want the map_count 0 on exit check
      back, it can easily come from the final remove_vm_struct loop.
      Signed-off-by: NHugh Dickins <hugh@veritas.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      ee39b37b
    • H
      [PATCH] freepgt: free_pgtables use vma list · e0da382c
      Hugh Dickins 提交于
      Recent woes with some arches needing their own pgd_addr_end macro; and 4-level
      clear_page_range regression since 2.6.10's clear_page_tables; and its
      long-standing well-known inefficiency in searching throughout the higher-level
      page tables for those few entries to clear and free: all can be blamed on
      ignoring the list of vmas when we free page tables.
      
      Replace exit_mmap's clear_page_range of the total user address space by
      free_pgtables operating on the mm's vma list; unmap_region use it in the same
      way, giving floor and ceiling beyond which it may not free tables.  This
      brings lmbench fork/exec/sh numbers back to 2.6.10 (unless preempt is enabled,
      in which case latency fixes spoil unmap_vmas throughput).
      
      Beware: the do_mmap_pgoff driver failure case must now use unmap_region
      instead of zap_page_range, since a page table might have been allocated, and
      can only be freed while it is touched by some vma.
      
      Move free_pgtables from mmap.c to memory.c, where its lower levels are adapted
      from the clear_page_range levels.  (Most of free_pgtables' old code was
      actually for a non-existent case, prev not properly set up, dating from before
      hch gave us split_vma.) Pass mmu_gather** in the public interfaces, since we
      might want to add latency lockdrops later; but no attempt to do so yet, going
      by vma should itself reduce latency.
      
      But what if is_hugepage_only_range?  Those ia64 and ppc64 cases need careful
      examination: put that off until a later patch of the series.
      
      What of x86_64's 32bit vdso page __map_syscall32 maps outside any vma?
      
      And the range to sparc64's flush_tlb_pgtables?  It's less clear to me now that
      we need to do more than is done here - every PMD_SIZE ever occupied will be
      flushed, do we really have to flush every PGDIR_SIZE ever partially occupied? 
      A shame to complicate it unnecessarily.
      
      Special thanks to David Miller for time spent repairing my ceilings.
      Signed-off-by: NHugh Dickins <hugh@veritas.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      e0da382c