1. 19 Oct, 2009 1 commit
  2. 24 Sep, 2009 1 commit
  3. 22 Sep, 2009 15 commits
    • mm: move highest_memmap_pfn · 03f6462a
      Hugh Dickins committed
      Move highest_memmap_pfn __read_mostly from page_alloc.c next to zero_pfn
      __read_mostly in memory.c: to help them share a cacheline, since they're
      very often tested together in vm_normal_page().
      Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: ZERO_PAGE without PTE_SPECIAL · 62eede62
      Hugh Dickins committed
      Reinstate anonymous use of ZERO_PAGE to all architectures, not just to
      those which __HAVE_ARCH_PTE_SPECIAL: as suggested by Nick Piggin.
      
      Contrary to how I'd imagined it, there's nothing ugly about this, just a
      zero_pfn test built into one or another block of vm_normal_page().
      
      But the MIPS ZERO_PAGE-of-many-colours case demands is_zero_pfn() and
      my_zero_pfn() inlines.  Reinstate its mremap move_pte() shuffling of
      ZERO_PAGEs we did from 2.6.17 to 2.6.19?  Not unless someone shouts for
      that: it would have to take vm_flags to weed out some cases.
      Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: Rik van Riel <riel@redhat.com>
      Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: FOLL flags for GUP flags · 58fa879e
      Hugh Dickins committed
      __get_user_pages() has been taking its own GUP flags, then processing
      them into FOLL flags for follow_page().  Though oddly named, the FOLL
      flags are more widely used, so pass them to __get_user_pages() now.
      Sorry, VM flags, VM_FAULT flags and FAULT_FLAGs are still distinct.
      
      (The patch to __get_user_pages() looks peculiar, with both gup_flags
      and foll_flags: the gup_flags remain constant; but as before there's
      an exceptional case, out of scope of the patch, in which foll_flags
      per page have FOLL_WRITE masked off.)
      Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: reinstate ZERO_PAGE · a13ea5b7
      Hugh Dickins committed
      KAMEZAWA Hiroyuki has observed customers of earlier kernels taking
      advantage of the ZERO_PAGE: which we stopped do_anonymous_page() from
      using in 2.6.24.  And there were a couple of regression reports on LKML.
      
      Following suggestions from Linus, reinstate do_anonymous_page() use of
      the ZERO_PAGE; but this time avoid dirtying its struct page cacheline
      with (map)count updates - let vm_normal_page() regard it as abnormal.
      
      Use it only on arches which __HAVE_ARCH_PTE_SPECIAL (x86, s390, sh32,
      most powerpc): that's not essential, but minimizes additional branches
      (keeping them in the unlikely pte_special case); and incidentally
      excludes mips (some models of which needed eight colours of ZERO_PAGE
      to avoid costly exceptions).
      
      Don't be fanatical about avoiding ZERO_PAGE updates: get_user_pages()
      callers won't want to make exceptions for it, so increment its count
      there.  Changes to mlock and migration? happily seems not needed.
      
      In most places it's quicker to check pfn than struct page address:
      prepare a __read_mostly zero_pfn for that.  Does get_dump_page()
      still need its ZERO_PAGE check? probably not, but keep it anyway.
      Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
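The user-visible side of "mm: reinstate ZERO_PAGE" can be observed from userspace: a fresh anonymous private mapping that is only read must return zeroes. The sketch below checks just that property. It cannot tell whether the kernel backed the reads with the shared ZERO_PAGE or with freshly zeroed pages, and the function name `anon_mapping_reads_zero` is an assumption made for illustration.

```c
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <sys/mman.h>

/*
 * Returns 1 if every byte of a fresh, never-written anonymous private
 * mapping reads as zero, 0 if a non-zero byte is found, and -1 if
 * mmap() fails.
 */
int anon_mapping_reads_zero(size_t len)
{
    unsigned char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    int all_zero = 1;
    for (size_t i = 0; i < len; i++) {
        if (p[i] != 0) {        /* read access only; nothing is written */
            all_zero = 0;
            break;
        }
    }

    munmap(p, len);
    return all_zero;
}
```

Calling `anon_mapping_reads_zero(4096)` touches the mapping with read faults only, which is the access pattern the commit message is concerned with.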
    • mm: fix anonymous dirtying · 1ac0cb5d
      Hugh Dickins committed
      do_anonymous_page() has been wrong to dirty the pte regardless.
      If it's not going to mark the pte writable, then it won't help
      to mark it dirty here, and clogs up memory with pages which will
      need swap instead of being thrown away.  Especially wrong if no
      overcommit is chosen, and this vma is not yet VM_ACCOUNTed -
      we could exceed the limit and OOM despite no overcommit.
      Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: <stable@kernel.org>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
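The rule described in "mm: fix anonymous dirtying" (mark the pte dirty only when it is also being made writable) can be modelled in a few lines. This is a userspace sketch, not the kernel code: the bit values and the name `demo_anon_pte_flags` are hypothetical.

```c
#include <stdbool.h>

/* Hypothetical bit values, for illustration only. */
#define DEMO_PTE_WRITE 0x1u
#define DEMO_PTE_DIRTY 0x2u

/*
 * Build the flags for a new anonymous pte. A pte that is not writable
 * is left clean, so the page can later be thrown away instead of
 * needing swap.
 */
unsigned int demo_anon_pte_flags(bool vma_writable)
{
    unsigned int flags = 0;

    if (vma_writable)
        flags |= DEMO_PTE_WRITE | DEMO_PTE_DIRTY;

    return flags;
}
```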
    • mm: follow_hugetlb_page flags · 2a15efc9
      Hugh Dickins committed
      follow_hugetlb_page() shouldn't be guessing about the coredump case
      either: pass the foll_flags down to it, instead of just the write bit.
      
      Remove that obscure huge_zeropage_ok() test.  The decision is easy,
      though unlike the non-huge case - here vm_ops->fault is always set.
      But we know that a fault would serve up zeroes, unless there's
      already a hugetlbfs pagecache page to back the range.
      
      (Alternatively, since hugetlb pages aren't swapped out under pressure,
      you could save more dump space by arguing that a page not yet faulted
      into this process cannot be relevant to the dump; but that would be
      more surprising.)
      Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: FOLL_DUMP replace FOLL_ANON · 8e4b9a60
      Hugh Dickins committed
      The "FOLL_ANON optimization" and its use_zero_page() test have caused
      confusion and bugs: why does it test VM_SHARED? for the very good but
      unsatisfying reason that VMware crashed without.  As we look to maybe
      reinstating anonymous use of the ZERO_PAGE, we need to sort this out.
      
      Easily done: it's silly for __get_user_pages() and follow_page() to
      be guessing whether it's safe to assume that they're being used for
      a coredump (which can take a shortcut snapshot where other uses must
      handle a fault) - just tell them with GUP_FLAGS_DUMP and FOLL_DUMP.
      
      get_dump_page() doesn't even want a ZERO_PAGE: an error suits fine.
      Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Acked-by: Rik van Riel <riel@redhat.com>
      Acked-by: Mel Gorman <mel@csn.ul.ie>
      Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: add get_dump_page · f3e8fccd
      Hugh Dickins committed
      In preparation for the next patch, add a simple get_dump_page(addr)
      interface for the CONFIG_ELF_CORE dumpers to use, instead of calling
      get_user_pages() directly.  They're not interested in errors: they
      just want to use holes as much as possible, to save space and make
      sure that the data is aligned where the headers said it would be.
      
      Oh, and don't use that horrid DUMP_SEEK(off) macro!
      Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: remove unused GUP flags · 1c3aff1c
      Hugh Dickins committed
      GUP_FLAGS_IGNORE_VMA_PERMISSIONS and GUP_FLAGS_IGNORE_SIGKILL were
      flags added solely to prevent __get_user_pages() from doing some of
      what it usually does, in the munlock case: we can now remove them.
      Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: drop unneeded double negations · b7c46d15
      Johannes Weiner committed
      Remove double negations where the operand is already boolean.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
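To illustrate what "mm: drop unneeded double negations" removes: in C, a comparison already yields 0 or 1, so a leading `!!` changes nothing. It only matters when the operand can hold other values, such as a masked bit. The function names below are made up for this sketch and do not come from the patch.

```c
#include <stdbool.h>

/* The comparison already yields 0 or 1, so "!!" here is redundant. */
bool is_positive_redundant(int x)
{
    return !!(x > 0);
}

/* Equivalent form with the double negation dropped. */
bool is_positive(int x)
{
    return x > 0;
}

/*
 * Here "!!" still does real work: (flags & bit) may be any non-zero
 * value, and "!!" normalises it to exactly 0 or 1.
 */
int bit_is_set(unsigned int flags, unsigned int bit)
{
    return !!(flags & bit);
}
```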
    • ksm: fix deadlock with munlock in exit_mmap · 1c2fb7a4
      Andrea Arcangeli committed
      Rawhide users have reported a hang at startup when cryptsetup is run: the
      same problem can be reproduced simply by running the program int main() {
      mlockall(MCL_CURRENT | MCL_FUTURE); return 0; }
      
      The problem is that exit_mmap() applies munlock_vma_pages_all() to
      clean up VM_LOCKED areas, and its current implementation (stupidly)
      tries to fault in absent pages, for example where PROT_NONE prevented
      them being faulted in when mlocking.  Whereas the "ksm: fix oom
      deadlock" patch, knowing there's a race by which KSM might try to fault
      in pages after exit_mmap() had finally zapped the range, backs out of
      such faults doing nothing when its ksm_test_exit() notices mm_users 0.
      
      So revert that part of "ksm: fix oom deadlock" which moved the
      ksm_exit() call from before exit_mmap() to the middle of exit_mmap();
      and remove those ksm_test_exit() checks from the page fault paths, so
      allowing the munlocking to proceed without interference.
      
      ksm_exit, if there are rmap_items still chained on this mm slot, takes
      mmap_sem write side: so preventing KSM from working on an mm while
      exit_mmap runs.  And KSM will bail out as soon as it notices that
      mm_users is already zero, thanks to its internal ksm_test_exit checks.
      So that when a task is killed by OOM killer or the user, KSM will not
      indefinitely prevent it from running exit_mmap to release its memory.
      
      This does break a part of what "ksm: fix oom deadlock" was trying to
      achieve.  When unmerging KSM (echo 2 >/sys/kernel/mm/ksm), and even
      when ksmd itself has to cancel a KSM page, it is possible that the
      first OOM-kill victim would be the KSM process being faulted: then its
      memory won't be freed until a second victim has been selected (freeing
      memory for the unmerging fault to complete).
      
      But the OOM killer is already liable to kill a second victim once the
      intended victim's p->mm goes to NULL: so there's not much point in
      rejecting this KSM patch before fixing that OOM behaviour.  It is very
      much more important to allow KSM users to boot up, than to haggle over
      an unlikely and poorly supported OOM case.
      
      We also intend to fix munlocking to not fault pages: at which point
      this patch _could_ be reverted; though that would be controversial, so
      we hope to find a better solution.
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Acked-by: Justin M. Forbes <jforbes@redhat.com>
      Acked-for-now-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: Izik Eidus <ieidus@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
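The reproducer quoted in "ksm: fix deadlock with munlock in exit_mmap" boils down to a single `mlockall()` call followed by process exit. A minimal sketch with the call wrapped in a helper is shown below. The helper name `demo_lock_all_memory` is an assumption, and the call may fail with -1 on systems where RLIMIT_MEMLOCK is too low.

```c
#include <sys/mman.h>

/*
 * Lock all current and future mappings of the calling process.
 * Returns 0 on success and -1 on failure (errno is set by mlockall()).
 */
int demo_lock_all_memory(void)
{
    return mlockall(MCL_CURRENT | MCL_FUTURE);
}
```

Calling this from `main()` and then returning matches the program in the commit message: the hang was reported when the exiting process reached the munlock step inside exit_mmap().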
    • ksm: fix oom deadlock · 9ba69294
      Hugh Dickins committed
      There's a now-obvious deadlock in KSM's out-of-memory handling:
      imagine ksmd or KSM_RUN_UNMERGE handling, holding ksm_thread_mutex,
      trying to allocate a page to break KSM in an mm which becomes the
      OOM victim (quite likely in the unmerge case): it's killed and goes
      to exit, and hangs there waiting to acquire ksm_thread_mutex.
      
      Clearly we must not require ksm_thread_mutex in __ksm_exit, simple
      though that made everything else: perhaps use mmap_sem somehow?
      And part of the answer lies in the comments on unmerge_ksm_pages:
      __ksm_exit should also leave all the rmap_item removal to ksmd.
      
      But there's a fundamental problem, that KSM relies upon mmap_sem to
      guarantee the consistency of the mm it's dealing with, yet exit_mmap
      tears down an mm without taking mmap_sem.  And bumping mm_users won't
      help at all, that just ensures that the pages the OOM killer assumes
      are on their way to being freed will not be freed.
      
      The best answer seems to be, to move the ksm_exit callout from just
      before exit_mmap, to the middle of exit_mmap: after the mm's pages
      have been freed (if the mmu_gather is flushed), but before its page
      tables and vma structures have been freed; and down_write,up_write
      mmap_sem there to serialize with KSM's own reliance on mmap_sem.
      
      But KSM then needs to be careful, whenever it downs mmap_sem, to
      check that the mm is not already exiting: there's a danger of using
      find_vma on a layout that's being torn apart, or writing into page
      tables which have been freed for reuse; and even do_anonymous_page
      and __do_fault need to check they're not being called by break_ksm
      to reinstate a pte after zap_pte_range has zapped that page table.
      
      Though it might be clearer to add an exiting flag, set while holding
      mmap_sem in __ksm_exit, that wouldn't cover the issue of reinstating
      a zapped pte.  All we need is to check whether mm_users is 0 - but
      must remember that ksmd may detect that before __ksm_exit is reached.
      So, ksm_test_exit(mm) added to comment such checks on mm->mm_users.
      
      __ksm_exit now has to leave clearing up the rmap_items to ksmd,
      that needs ksm_thread_mutex; but shift the exiting mm just after the
      ksm_scan cursor so that it will soon be dealt with.  __ksm_enter raise
      mm_count to hold the mm_struct, ksmd's exit processing (exactly like
      its processing when it finds all VM_MERGEABLEs unmapped) mmdrop it,
      similar procedure for KSM_RUN_UNMERGE (which has stopped ksmd).
      
      But also give __ksm_exit a fast path: when there's no complication
      (no rmap_items attached to mm and it's not at the ksm_scan cursor),
      it can safely do all the exiting work itself.  This is not just an
      optimization: when ksmd is not running, the raised mm_count would
      otherwise leak mm_structs.
      Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Acked-by: Izik Eidus <ieidus@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
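The ksm_test_exit(mm) check described in "ksm: fix oom deadlock" amounts to asking whether mm_users has dropped to zero. The following is a userspace model using C11 atomics, not the kernel implementation: `struct demo_mm`, `demo_ksm_test_exit` and `demo_break_ksm` are hypothetical names.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the kernel's struct mm_struct. */
struct demo_mm {
    atomic_int mm_users;    /* number of users of the address space */
};

/* Models ksm_test_exit(): true once the last user has gone away. */
static inline bool demo_ksm_test_exit(struct demo_mm *mm)
{
    return atomic_load(&mm->mm_users) == 0;
}

/*
 * Models a KSM-side operation that must back out when the mm is
 * exiting. Returns true if it went ahead, false if it bailed out.
 * (Kernel code would hold mmap_sem around the check and the work.)
 */
bool demo_break_ksm(struct demo_mm *mm)
{
    if (demo_ksm_test_exit(mm))
        return false;       /* mm is being torn down: do nothing */

    return true;            /* safe to operate on the mm */
}
```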
    • ksm: identify PageKsm pages · 9a840895
      Hugh Dickins committed
      KSM will need to identify its kernel merged pages unambiguously, and
      /proc/kpageflags will probably like to do so too.
      
      Since KSM will only be substituting anonymous pages, statistics are best
      preserved by making a PageKsm page a special PageAnon page: one with no
      anon_vma.
      
      But KSM then needs its own page_add_ksm_rmap() - keep it in ksm.h near
      PageKsm; and do_wp_page() must COW them, unlike singly mapped PageAnons.
      Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Signed-off-by: Chris Wright <chrisw@redhat.com>
      Signed-off-by: Izik Eidus <ieidus@redhat.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Avi Kivity <avi@redhat.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ksm: no debug in page_dup_rmap() · 21333b2b
      Hugh Dickins committed
      page_dup_rmap(), used on each mapped page when forking, was originally
      just an inline atomic_inc of mapcount.  2.6.22 added CONFIG_DEBUG_VM
      out-of-line checks to it, which would need to be ever-so-slightly
      complicated to allow for the PageKsm() we're about to define.
      
      But I think these checks never caught anything.  And if it's coding errors
      we're worried about, such checks should be in page_remove_rmap() too, not
      just when forking; whereas if it's pagetable corruption we're worried
      about, then they shouldn't be limited to CONFIG_DEBUG_VM.
      
      Oh, just revert page_dup_rmap() to an inline atomic_inc of mapcount.
      Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Signed-off-by: Chris Wright <chrisw@redhat.com>
      Signed-off-by: Izik Eidus <ieidus@redhat.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Avi Kivity <avi@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
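"ksm: no debug in page_dup_rmap()" reverts the function to a bare atomic increment of the page's mapcount. A userspace model of that shape follows. The struct and function names are hypothetical; the convention that the stored counter is -1 for an unmapped page follows the kernel's usual one.

```c
#include <stdatomic.h>

/* Hypothetical stand-in for struct page; only the mapcount is modelled. */
struct demo_page {
    atomic_int mapcount;    /* -1 means "not mapped anywhere" */
};

/* Models page_dup_rmap(): one more mapping, no debug checks. */
static inline void demo_page_dup_rmap(struct demo_page *page)
{
    atomic_fetch_add(&page->mapcount, 1);
}

/* Number of mappings, given the -1 based storage convention. */
int demo_page_mapcount(struct demo_page *page)
{
    return atomic_load(&page->mapcount) + 1;
}
```

On fork, each page mapped in the parent gets one such increment for the child's new mapping.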
    • ksm: add mmu_notifier set_pte_at_notify() · 828502d3
      Izik Eidus committed
      KSM is a Linux driver that allows dynamically sharing identical memory pages
      between one or more processes.
      
      Unlike traditional page sharing, which is set up when the memory is
      allocated, KSM does it dynamically after the memory has been created.  Memory is
      periodically scanned; identical pages are identified and merged.
      
      The sharing is transparent to the processes that use it.
      
      KSM is highly important for hypervisors (kvm), where in production
      environments there might be many copies of the same data in host
      memory.  This kind of data can be: similar kernels, libraries, cache, and
      so on.
      
      Even though KSM was written for kvm, any userspace application that wants to
      use it to share its data can try it.
      
      KSM may be useful for any application that might have similar (page
      aligned) data structures in memory: KSM will find this data and merge
      it into one copy, and even if it is changed and therefore copied on
      write, KSM will merge it again as soon as it becomes identical again.
      
      Another reason to consider using KSM is that it might greatly simplify
      the userspace code of an application that wants to use shared private
      data: instead of the application managing a shared area, KSM will do
      this for the application, and even writes to this data will be allowed
      without any synchronization by the application.
      
      KSM was designed to be a loadable module that doesn't change the VM code
      of Linux.
      
      This patch:
      
      The set_pte_at_notify() macro allows setting a pte in the shadow page
      table directly, instead of flushing the shadow page table entry and then
      getting vmexit to set it.  It uses a new change_pte() callback to do so.
      
      set_pte_at_notify() is an optimization for kvm, and other users of
      mmu_notifiers, for COW pages.  It is useful for kvm when ksm is used,
      because it allows kvm not to have to receive vmexit and only then map the
      ksm page into the shadow page table, but instead map it directly at the
      same time as Linux maps the page into the host page table.
      
      Users of mmu_notifiers who don't implement new mmu_notifier_change_pte()
      callback will just receive the mmu_notifier_invalidate_page() callback.
      Signed-off-by: Izik Eidus <ieidus@redhat.com>
      Signed-off-by: Chris Wright <chrisw@redhat.com>
      Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Avi Kivity <avi@redhat.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
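The fallback rule at the end of "ksm: add mmu_notifier set_pte_at_notify()" (listeners without a change_pte callback just get an invalidate) can be sketched as an optional function pointer. This is a simplified userspace model: every type and function name below is hypothetical and none of it is the kernel's mmu_notifier API.

```c
#include <stddef.h>

/* Hypothetical, simplified notifier interface. */
struct demo_notifier_ops {
    void (*change_pte)(void *ctx, unsigned long addr, unsigned long pte);
    void (*invalidate_page)(void *ctx, unsigned long addr);
};

/* Records what a demo listener observed. */
struct demo_listener {
    unsigned long mapped_pte;   /* last pte installed via change_pte */
    int invalidations;          /* number of invalidate_page calls */
};

static void demo_change_pte(void *ctx, unsigned long addr, unsigned long pte)
{
    (void)addr;
    ((struct demo_listener *)ctx)->mapped_pte = pte;
}

static void demo_invalidate_page(void *ctx, unsigned long addr)
{
    (void)addr;
    ((struct demo_listener *)ctx)->invalidations++;
}

/* Prefer change_pte; fall back to invalidate_page when it is absent. */
void demo_notify_pte_change(const struct demo_notifier_ops *ops, void *ctx,
                            unsigned long addr, unsigned long pte)
{
    if (ops->change_pte)
        ops->change_pte(ctx, addr, pte);
    else if (ops->invalidate_page)
        ops->invalidate_page(ctx, addr);
}
```

A listener that implements `change_pte` can map the new page directly, while an older listener only learns that its entry is stale and has to take a later fault to refill it.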
  4. 19 Sep, 2009 1 commit
  5. 16 Sep, 2009 2 commits
    • HWPOISON: Add poison check to page fault handling · a3b947ea
      Andi Kleen committed
      Bail out early when hardware poisoned pages are found in page fault handling.
      Since they are poisoned they should not be mapped freshly into processes,
      because that would cause another (potentially deadly) machine check.
      
      This is generally handled in the same way as OOM, just a different
      error code is returned to the architecture code.
      
      v2: Do a page unlock if needed (Fengguang Wu)
      Signed-off-by: Andi Kleen <ak@linux.intel.com>
    • HWPOISON: Add basic support for poisoned pages in fault handler v3 · d1737fdb
      Andi Kleen committed
      - Add a new VM_FAULT_HWPOISON error code to handle_mm_fault. Right now
      architectures have to explicitly enable poison page support, so
      this is forward compatible to all architectures. They only need
      to add it when they enable poison page support.
      - Add poison page handling in swap in fault code
      
      v2: Add missing delayacct_clear_flag (Hidehiro Kawai)
      v3: Really use delayacct_clear_flag (Hidehiro Kawai)
      Signed-off-by: Andi Kleen <ak@linux.intel.com>
  6. 28 Jul, 2009 1 commit
    • mm: Pass virtual address to [__]p{te,ud,md}_free_tlb() · 9e1b32ca
      Benjamin Herrenschmidt committed
      Upcoming patches to support the new 64-bit "BookE" powerpc architecture
      will need to have the virtual address corresponding to PTE page when
      freeing it, due to the way the HW table walker works.
      
      Basically, the TLB can be loaded with "large" pages that cover the whole
      virtual space (well, sort-of, half of it actually) represented by a PTE
      page, and which contain an "indirect" bit indicating that this TLB entry
      RPN points to an array of PTEs from which the TLB can then create direct
      entries. Thus, in order to invalidate those when PTE pages are deleted,
      we need the virtual address to pass to tlbilx or tlbivax instructions.
      
      The old trick of sticking it somewhere in the PTE page struct page sucks
      too much, the address is almost readily available in all call sites and
      almost everybody implements these as macros, so we may as well add the
      argument everywhere. I added it to the pmd and pud variants for consistency.
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Acked-by: David Howells <dhowells@redhat.com> [MN10300 & FRV]
      Acked-by: Nick Piggin <npiggin@suse.de>
      Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com> [s390]
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  7. 26 Jun, 2009 1 commit
  8. 24 Jun, 2009 2 commits
  9. 22 Jun, 2009 2 commits
    • Move FAULT_FLAG_xyz into handle_mm_fault() callers · d06063cc
      Linus Torvalds committed
      This allows the callers to now pass down the full set of FAULT_FLAG_xyz
      flags to handle_mm_fault().  All callers have been (mechanically)
      converted to the new calling convention, there's almost certainly room
      for architectures to clean up their code and then add FAULT_FLAG_RETRY
      when that support is added.
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Remove internal use of 'write_access' in mm/memory.c · 30c9f3a9
      Linus Torvalds committed
      The fault handling routines really want more fine-grained flags than a
      single "was it a write fault" boolean - the callers will want to set
      flags like "you can return a retry error" etc.
      
      And that's actually how the VM works internally, but right now the
      top-level fault handling functions in mm/memory.c all pass just the
      'write_access' boolean around.
      
      This switches them over to pass around the FAULT_FLAG_xyzzy 'flags'
      variable instead.  The 'write_access' calling convention still exists
      for the exported 'handle_mm_fault()' function, but that is next.
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
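The two Linus Torvalds commits above replace a single 'write_access' boolean with a flags word. A rough sketch of that mechanical conversion follows. The `DEMO_` constants and function names are hypothetical, and the retry flag only illustrates the kind of extra bit the message anticipates.

```c
/* Illustrative flag bits; the values are assumptions for this sketch. */
#define DEMO_FAULT_FLAG_WRITE       0x01u
#define DEMO_FAULT_FLAG_ALLOW_RETRY 0x08u

/* The mechanical conversion: one boolean becomes one flag bit. */
unsigned int demo_flags_from_write_access(int write_access)
{
    return write_access ? DEMO_FAULT_FLAG_WRITE : 0u;
}

/* Callees test individual bits instead of a single boolean. */
int demo_is_write_fault(unsigned int flags)
{
    return (flags & DEMO_FAULT_FLAG_WRITE) != 0;
}

/* With a flags word, a caller can opt in to further behaviour. */
int demo_may_retry(unsigned int flags)
{
    return (flags & DEMO_FAULT_FLAG_ALLOW_RETRY) != 0;
}
```

The design point is that a flags word can grow new bits without changing any function signature again, which a lone boolean parameter cannot do.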
  10. 17 Jun, 2009 4 commits
  11. 03 May, 2009 2 commits
    • mm: close page_mkwrite races · b827e496
      Nick Piggin committed
      Change page_mkwrite to allow implementations to return with the page
      locked, and also change its callers (in page fault paths) to hold the
      lock until the page is marked dirty.  This allows the filesystem to have
      full control of page dirtying events coming from the VM.
      
      Rather than simply hold the page locked over the page_mkwrite call, we
      call page_mkwrite with the page unlocked and allow callers to return with
      it locked, so filesystems can avoid LOR conditions with page lock.
      
      The problem with the current scheme is this: a filesystem that wants to
      associate some metadata with a page as long as the page is dirty, will
      perform this manipulation in its ->page_mkwrite.  It currently then must
      return with the page unlocked and may not hold any other locks (according
      to existing page_mkwrite convention).
      
      In this window, the VM could write out the page, clearing page-dirty.  The
      filesystem has no good way to detect that a dirty pte is about to be
      attached, so it will happily write out the page, at which point, the
      filesystem may manipulate the metadata to reflect that the page is no
      longer dirty.
      
      It is not always possible to perform the required metadata manipulation in
      ->set_page_dirty, because that function cannot block or fail.  The
      filesystem may need to allocate some data structure, for example.
      
      And the VM cannot mark the pte dirty before page_mkwrite, because
      page_mkwrite is allowed to fail, so we must not allow any window where the
      page could be written to if page_mkwrite does fail.
      
      This solution of holding the page locked over the 3 critical operations
      (page_mkwrite, setting the pte dirty, and finally setting the page dirty)
      closes out races nicely, preventing page cleaning for writeout being
      initiated in that window.  This provides the filesystem with a strong
      synchronisation against the VM here.
      
      - Sage needs this race closed for ceph filesystem.
      - Trond for NFS (http://bugzilla.kernel.org/show_bug.cgi?id=12913).
      - I need it for fsblock.
      - I suspect other filesystems may need it too (eg. btrfs).
      - I have converted buffer.c to the new locking. Even simple block allocation
        under dirty pages might be susceptible to i_size changing under partial page
        at the end of file (we also have a buffer.c-side problem here, but it cannot
        be fixed properly without this patch).
      - Other filesystems (eg. NFS, maybe btrfs) will need to change their
        page_mkwrite functions themselves.
      
      [ This also moves page_mkwrite another step closer to fault, which should
        eventually allow page_mkwrite to be moved into ->fault, and thus avoiding a
        filesystem calldown and page lock/unlock cycle in __do_fault. ]
      
      [akpm@linux-foundation.org: fix derefs of NULL ->mapping]
      Cc: Sage Weil <sage@newdream.net>
      Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
      Cc: <stable@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: fix pageref leak in do_swap_page() · bc43f75c
      Johannes Weiner committed
      By the time the memory cgroup code is notified about a swapin we
      already hold a reference on the fault page.
      
      If the cgroup callback fails, make sure to unlock AND release the page
      reference which was taken by lookup_swap_cache(), or we leak the reference.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
      Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
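The error-path rule in "mm: fix pageref leak in do_swap_page()" (on failure, undo both the lock and the reference) can be modelled with two counters. This is a userspace sketch under stated assumptions: `struct demo_swap_page` and `demo_swapin` are hypothetical, and `charge_fails` stands in for the memory cgroup callback failing.

```c
/* Hypothetical model of a page with a lock bit and a reference count. */
struct demo_swap_page {
    int locked;
    int refcount;
};

/*
 * Models the swap-in path. Returns 0 on success, -1 when the
 * (simulated) cgroup charge fails.
 */
int demo_swapin(struct demo_swap_page *page, int charge_fails)
{
    page->refcount++;       /* reference taken by the swap cache lookup */
    page->locked = 1;       /* page is locked before charging */

    if (charge_fails) {
        page->locked = 0;   /* unlock ... */
        page->refcount--;   /* ... AND drop the reference, or it leaks */
        return -1;
    }

    /* Success: the reference now belongs to the new mapping. */
    page->locked = 0;
    return 0;
}
```

After a failed call the page is back in the state it started in, which is the invariant the fix restores.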
  12. 01 Apr, 2009 3 commits
  13. 31 Mar, 2009 1 commit
  14. 30 Mar, 2009 2 commits
  15. 20 Mar, 2009 1 commit
    • tracing, Text Edit Lock - kprobes architecture independent support, nommu fix · 505f2b97
      Ingo Molnar committed
      Impact: build fix on SH !CONFIG_MMU
      
      Stephen Rothwell reported this linux-next build failure on the SH
      architecture:
      
        kernel/built-in.o: In function `disable_all_kprobes':
        kernel/kprobes.c:1382: undefined reference to `text_mutex'
        [...]
      
      And observed:
      
      | Introduced by commit 4460fdad ("tracing,
      | Text Edit Lock - kprobes architecture independent support") from the
      | tracing tree.  text_mutex is defined in mm/memory.c which is only built
      | if CONFIG_MMU is defined, which is not true for sh allmodconfig.
      
      Move this lock to kernel/extable.c (which is already home to various
      kernel text related routines), which file is always built-in.
      Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Paul Mundt <lethal@linux-sh.org>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
      LKML-Reference: <20090320110602.86351a91.sfr@canb.auug.org.au>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  16. 14 Mar, 2009 1 commit