1. 06 Jan 2009, 1 commit
    • inode->i_op is never NULL · acfa4380
      Committed by Al Viro
      We used to have a rather schizophrenic set of checks for a NULL ->i_op even
      though it had been eliminated years ago.  You'd need to go out of your
      way to set it to NULL explicitly _and_ a bunch of code would die on
      such inodes anyway.  After killing two remaining places that still
      did that bogosity, all that crap can go away.
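
      As a purely illustrative sketch (hypothetical call site, not taken from the
      patch), this is the kind of now-redundant defensive check that can be dropped
      once ->i_op is guaranteed to be non-NULL:

        /* before: guard against a NULL ->i_op that can no longer occur */
        if (inode->i_op && inode->i_op->getattr)
                error = inode->i_op->getattr(mnt, dentry, stat);

        /* after: only the method pointer itself needs checking */
        if (inode->i_op->getattr)
                error = inode->i_op->getattr(mnt, dentry, stat);
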
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      acfa4380
  2. 20 Dec 2008, 3 commits
  3. 19 Dec 2008, 3 commits
  4. 21 Oct 2008, 1 commit
  5. 20 Oct 2008, 8 commits
    • memcg: avoid accounting special pages · 5b4e655e
      Committed by KAMEZAWA Hiroyuki
      There are not-on-LRU pages which can be mapped, and they are not worth
      accounting (because we can't shrink them, and handling the special cases
      would need dirty code).  We'd like to use the usual objrmap/radix-tree
      protocol and not account pages that are outside the VM's control.

      When special_mapping_fault() is called, page->mapping tends to be NULL
      and the page gets charged as an anonymous page.  insert_page() also
      handles some special pages from drivers.

      This patch avoids accounting such special pages.
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5b4e655e
    • memcg: move charge swapin under lock · 073e587e
      Committed by KAMEZAWA Hiroyuki
      While page-cache's charge/uncharge is done under page_lock(), swap-cache
      isn't.  (anonymous page is charged when it's newly allocated.)
      
      This patch moves do_swap_page()'s charge() call under the page lock.  I don't
      see any actual problem *now*, but this fix will be good for the future,
      avoiding an unnecessarily racy state.
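
      A rough sketch of the intended ordering (the charge helper name here is an
      assumption for illustration, not necessarily the exact memcg call the patch
      touches):

        /* do_swap_page() path: charge the swapin page while the page lock is held */
        lock_page(page);
        if (mem_cgroup_charge(page, mm, GFP_KERNEL))    /* assumed charge helper */
                goto out_unlock;
        /* ... install the pte, add the rmap ... */
        unlock_page(page);
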
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reviewed-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      073e587e
    • mlock: make mlock error return Posixly Correct · 9978ad58
      Committed by Lee Schermerhorn
      Rework Posix error return for mlock().
      
      Posix requires error codes for mlock*() system calls for some conditions
      that differ from what kernel low level functions, such as
      get_user_pages(), return for those conditions.  For more info, see:
      
      http://marc.info/?l=linux-kernel&m=121750892930775&w=2
      
      This patch provides the same translation of get_user_pages()
      error codes to posix specified error codes in the context
      of the mlock rework for unevictable lru.
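
      A minimal sketch of that translation (hypothetical helper name, not necessarily
      the function the patch adds):

        /* map low-level fault errors onto what POSIX allows mlock() to return */
        static long mlock_posix_error(long retval)
        {
                if (retval == -EFAULT)          /* range was not fully mapped */
                        return -ENOMEM;
                if (retval == -ENOMEM)          /* could not fault the pages in */
                        return -EAGAIN;
                return retval;
        }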
      
      [akpm@linux-foundation.org: fix build]
      Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9978ad58
    • mlock: revert mainline handling of mlock error return · c11d69d8
      Committed by Lee Schermerhorn
      This change is intended to make mlock() error returns correct.
      make_pages_present() is a lower level function used by more than mlock().
      Subsequent patch[es] will add this error return fixup in an mlock specific
      path.
      
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c11d69d8
    • swap: cull unevictable pages in fault path · 64d6519d
      Committed by Lee Schermerhorn
      In the fault paths that install new anonymous pages, check whether the
      page is evictable or not using lru_cache_add_active_or_unevictable().  If
      the page is evictable, just add it to the active lru list [via the pagevec
      cache], else add it to the unevictable list.
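
      A rough sketch of the resulting fault-path sequence (illustrative fragment,
      not the literal patch; the pte is made visible last, per note 4 below):

        page_add_new_anon_rmap(page, vma, address);
        lru_cache_add_active_or_unevictable(page, vma);  /* unevictable LRU if vma is VM_LOCKED */
        set_pte_at(mm, address, page_table, entry);      /* only now visible to other walkers */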
      
      This "proactive" culling in the fault path mimics the handling of mlocked
      pages in Nick Piggin's series to keep mlocked pages off the lru lists.
      
      Notes:
      
      1) This patch is optional--e.g., if one is concerned about the
         additional test in the fault path.  We can defer the moving of
         nonreclaimable pages until when vmscan [shrink_*_list()]
         encounters them.  Vmscan will only need to handle such pages
         once, but if there are a lot of them it could impact system
         performance.
      
      2) The 'vma' argument to page_evictable() is required to notice that
         we're faulting a page into an mlock()ed vma w/o having to scan the
         page's rmap in the fault path.   Culling mlock()ed anon pages is
         currently the only reason for this patch.
      
      3) We can't cull swap pages in read_swap_cache_async() because the
         vma argument doesn't necessarily correspond to the swap cache
         offset passed in by swapin_readahead().  This could [did!] result
         in mlocking pages in non-VM_LOCKED vmas if [when] we tried to
         cull in this path.
      
      4) Move set_pte_at() to after where we add page to lru to keep it
         hidden from other tasks that might walk the page table.
         We already do it in this order in do_anonymous_page().  And,
         these are COW'd anon pages.  Is this safe?
      
      [riel@redhat.com: undo an overzealous code cleanup]
      Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Signed-off-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      64d6519d
    • mlock: mlocked pages are unevictable · b291f000
      Committed by Nick Piggin
      Make sure that mlocked pages also live on the unevictable LRU, so kswapd
      will not scan them over and over again.
      
      This is achieved through various strategies:
      
      1) add yet another page flag--PG_mlocked--to indicate that
         the page is locked for efficient testing in vmscan and,
         optionally, fault path.  This allows early culling of
         unevictable pages, preventing them from getting to
         page_referenced()/try_to_unmap().  Also allows separate
         accounting of mlock'd pages, as Nick's original patch
         did.
      
         Note:  Nick's original mlock patch used a PG_mlocked
         flag.  I had removed this in favor of the PG_unevictable
         flag + an mlock_count [new page struct member].  I
         restored the PG_mlocked flag to eliminate the new
         count field.
      
      2) add the mlock/unevictable infrastructure to mm/mlock.c,
         with internal APIs in mm/internal.h.  This is a rework
         of Nick's original patch to these files, taking into
         account that mlocked pages are now kept on unevictable
         LRU list.
      
      3) update vmscan.c:page_evictable() to check PageMlocked()
         and, if vma passed in, the vm_flags.  Note that the vma
         will only be passed in for new pages in the fault path;
         and then only if the "cull unevictable pages in fault
         path" patch is included.  (A rough sketch follows this list.)
      
      4) add try_to_unlock() to rmap.c to walk a page's rmap and
         ClearPageMlocked() if no other vmas have it mlocked.
         Reuses as much of try_to_unmap() as possible.  This
         effectively replaces the use of one of the lru list links
         as an mlock count.  If this mechanism lets pages in mlocked
         vmas leak through w/o PG_mlocked set [I don't know that it
         does], we should catch them later in try_to_unmap().  One
         hopes this will be rare, as it will be relatively expensive.
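
      As a rough illustration of strategy 3 above, the evictability test looks
      roughly like this (simplified sketch; the real function has more cases):

        /* returns 0 for pages that must stay off the regular LRU lists */
        static int page_evictable(struct page *page, struct vm_area_struct *vma)
        {
                if (PageMlocked(page))
                        return 0;
                if (vma && (vma->vm_flags & VM_LOCKED))
                        return 0;
                return 1;
        }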
      
      Original mm/internal.h, mm/rmap.c and mm/mlock.c changes:
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      
      splitlru: introduce __get_user_pages():
      
        The new munlock processing needs GUP_FLAGS_IGNORE_VMA_PERMISSIONS,
        because the current get_user_pages() can't grab PROT_NONE pages and
        therefore PROT_NONE pages can't be munlocked.
      
      [akpm@linux-foundation.org: fix this for pagemap-pass-mm-into-pagewalkers.patch]
      [akpm@linux-foundation.org: untangle patch interdependencies]
      [akpm@linux-foundation.org: fix things after out-of-order merging]
      [hugh@veritas.com: fix page-flags mess]
      [lee.schermerhorn@hp.com: fix munlock page table walk - now requires 'mm']
      [kosaki.motohiro@jp.fujitsu.com: build fix]
      [kosaki.motohiro@jp.fujitsu.com: fix truncate race and several comments]
      [kosaki.motohiro@jp.fujitsu.com: splitlru: introduce __get_user_pages()]
      Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: Dave Hansen <dave@linux.vnet.ibm.com>
      Cc: Matt Mackall <mpm@selenic.com>
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b291f000
    • vmscan: split LRU lists into anon & file sets · 4f98a2fe
      Committed by Rik van Riel
      Split the LRU lists in two, one set for pages that are backed by real file
      systems ("file") and one for pages that are backed by memory and swap
      ("anon").  The latter includes tmpfs.
      
      The advantage of doing this is that the VM will not have to scan over lots
      of anonymous pages (which we generally do not want to swap out), just to
      find the page cache pages that it should evict.
      
      This patch has the infrastructure and a basic policy to balance how much
      we scan the anon lists and how much we scan the file lists.  The big
      policy changes are in separate patches.
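
      A sketch of the resulting LRU layout (illustrative; the unevictable list added
      elsewhere in this series is omitted):

        enum lru_list {
                LRU_INACTIVE_ANON,      /* anonymous / swap-backed pages */
                LRU_ACTIVE_ANON,
                LRU_INACTIVE_FILE,      /* file-backed page cache pages */
                LRU_ACTIVE_FILE,
                NR_LRU_LISTS
        };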
      
      [lee.schermerhorn@hp.com: collect lru meminfo statistics from correct offset]
      [kosaki.motohiro@jp.fujitsu.com: prevent incorrect oom under split_lru]
      [kosaki.motohiro@jp.fujitsu.com: fix pagevec_move_tail() doesn't treat unevictable page]
      [hugh@veritas.com: memcg swapbacked pages active]
      [hugh@veritas.com: splitlru: BDI_CAP_SWAP_BACKED]
      [akpm@linux-foundation.org: fix /proc/vmstat units]
      [nishimura@mxp.nes.nec.co.jp: memcg: fix handling of shmem migration]
      [kosaki.motohiro@jp.fujitsu.com: adjust Quicklists field of /proc/meminfo]
      [kosaki.motohiro@jp.fujitsu.com: fix style issue of get_scan_ratio()]
      Signed-off-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4f98a2fe
    • define page_file_cache() function · b2e18538
      Committed by Rik van Riel
      Define page_file_cache() function to answer the question:
      	is page backed by a file?
      
      Originally part of Rik van Riel's split-lru patch.  Extracted to make
      available for other, independent reclaim patches.
      
      Moved inline function to linux/mm_inline.h where it will be needed by
      subsequent "split LRU" and "noreclaim" patches.
      
      Unfortunately this needs to use a page flag, since the PG_swapbacked state
      needs to be preserved all the way to the point where the page is last
      removed from the LRU.  Trying to derive the status from other info in the
      page resulted in wrong VM statistics in earlier split VM patchsets.
      
      The total number of page flags in use on a 32 bit machine after this patch
      is 19.
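
      A minimal sketch of the idea (simplified; the real helper's return convention
      may differ):

        /* true if the page is backed by a file, false if swap-backed (anon, tmpfs) */
        static inline int page_file_cache(struct page *page)
        {
                return !PageSwapBacked(page);
        }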
      
      [akpm@linux-foundation.org: fix up out-of-order merge fallout]
      [hugh@veritas.com: splitlru: shmem_getpage SetPageSwapBacked sooner]
      Signed-off-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Signed-off-by: MinChan Kim <minchan.kim@gmail.com>
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b2e18538
  6. 11 Sep 2008, 1 commit
  7. 05 Aug 2008, 2 commits
    • mm: rename page trylock · 529ae9aa
      Committed by Nick Piggin
      Converting page lock to new locking bitops requires a change of page flag
      operation naming, so we might as well convert it to something nicer
      (!TestSetPageLocked_Lock => trylock_page, SetPageLocked => set_page_locked).
      
      This also facilitates lockdeping of page lock.
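
      An illustrative before/after at a call site (hypothetical fragment, not taken
      from the patch itself):

        /* old naming */
        if (!TestSetPageLocked(page)) {
                /* we now hold the page lock */
        }

        /* new naming */
        if (trylock_page(page)) {
                /* we now hold the page lock */
        }
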
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Acked-by: Peter Zijlstra <peterz@infradead.org>
      Acked-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      529ae9aa
    • mlock() fix return values · a477097d
      Committed by KOSAKI Motohiro
      Halesh says:
      
      Please find below the testcase provided to test mlock.
      
      Test Case :
      ===========================
      
      #include <sys/resource.h>
      #include <stdio.h>
      #include <sys/stat.h>
      #include <sys/types.h>
      #include <unistd.h>
      #include <sys/mman.h>
      #include <fcntl.h>
      #include <errno.h>
      #include <stdlib.h>
      
      int main(void)
      {
        int fd,ret, i = 0;
        char *addr, *addr1 = NULL;
        unsigned int page_size;
        struct rlimit rlim;
      
        if (0 != geteuid())
        {
         printf("Execute this pgm as root\n");
         exit(1);
        }
      
        /* create a file */
        if ((fd = open("mmap_test.c",O_RDWR|O_CREAT,0755)) == -1)
        {
         printf("cant create test file\n");
         exit(1);
        }
      
        page_size = sysconf(_SC_PAGE_SIZE);
      
        /* set the MEMLOCK limit */
        rlim.rlim_cur = 2000;
        rlim.rlim_max = 2000;
      
        if ((ret = setrlimit(RLIMIT_MEMLOCK,&rlim)) != 0)
        {
         printf("Cant change limit values\n");
         exit(1);
        }
      
        addr = 0;
        while (1)
        {
          /* map a page into memory each time */
          if ((addr = (char *) mmap(addr, page_size, PROT_READ | PROT_WRITE,
                                    MAP_SHARED, fd, 0)) == MAP_FAILED)
          {
            printf("cant do mmap on file\n");
            exit(1);
          }

          if (0 == i)
            addr1 = addr;
          i++;
          errno = 0;

          /* lock the mapped memory pagewise */
          if ((ret = mlock((char *)addr, 1500)) == -1)
          {
            printf("errno value is %d\n", errno);
            printf("cant lock maped region\n");
            exit(1);
          }
          addr = addr + page_size;
        }
      }
      ======================================================
      
      This testcase results in an mlock() failure with errno 14 that is EFAULT,
      but it has nowhere been specified that mlock() will return EFAULT.  When I
      tested the same on older kernels like 2.6.18, I got the correct result i.e
      errno 12 (ENOMEM).
      
      I think that in the mlock(2) source code, setting errno to ENOMEM has been
      missed in do_mlock() on mlock_fixup() failure.
      
      SUSv3 requires the following behavior from mlock(2).
      
      [ENOMEM]
          Some or all of the address range specified by the addr and
          len arguments does not correspond to valid mapped pages
          in the address space of the process.
      
      [EAGAIN]
          Some or all of the memory identified by the operation could not
          be locked when the call was made.
      
      This rule isn't so nice and is slightly strange, but many people think
      POSIX/SUS compliance is important.
      Reported-by: Halesh Sadashiv <halesh.sadashiv@ap.sony.com>
      Tested-by: Halesh Sadashiv <halesh.sadashiv@ap.sony.com>
      Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: <stable@kernel.org>		[2.6.25.x, 2.6.26.x]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a477097d
  8. 02 Aug 2008, 1 commit
  9. 31 Jul 2008, 2 commits
  10. 29 Jul 2008, 1 commit
    • mmu-notifiers: core · cddb8a5c
      Committed by Andrea Arcangeli
      With KVM/GRU/XPMEM there isn't just the primary CPU MMU pointing to pages.
      There are secondary MMUs (with secondary sptes and secondary tlbs) too.
      sptes in the kvm case are shadow pagetables, but when I say spte in
      mmu-notifier context, I mean "secondary pte".  In GRU case there's no
      actual secondary pte and there's only a secondary tlb because the GRU
      secondary MMU has no knowledge about sptes and every secondary tlb miss
      event in the MMU always generates a page fault that has to be resolved by
      the CPU (this is not the case of KVM, where a secondary tlb miss will
      walk sptes in hardware and it will refill the secondary tlb transparently
      to software if the corresponding spte is present).  The same way
      zap_page_range has to invalidate the pte before freeing the page, the spte
      (and secondary tlb) must also be invalidated before any page is freed and
      reused.
      
      Currently we take a page_count pin on every page mapped by sptes, but that
      means the pages can't be swapped whenever they're mapped by any spte
      because they're part of the guest working set.  Furthermore a spte unmap
      event can immediately lead to a page to be freed when the pin is released
      (so requiring the same complex and relatively slow tlb_gather smp safe
      logic we have in zap_page_range and that can be avoided completely if the
      spte unmap event doesn't require an unpin of the page previously mapped in
      the secondary MMU).
      
      The mmu notifiers allow kvm/GRU/XPMEM to attach to the tsk->mm and know
      when the VM is swapping or freeing or doing anything on the primary MMU so
      that the secondary MMU code can drop sptes before the pages are freed,
      avoiding all page pinning and allowing 100% reliable swapping of guest
      physical address space.  Furthermore it avoids the code that teardown the
      mappings of the secondary MMU, to implement a logic like tlb_gather in
      zap_page_range that would require many IPI to flush other cpu tlbs, for
      each fixed number of spte unmapped.
      
      To make an example: if what happens on the primary MMU is a protection
      downgrade (from writeable to wrprotect) the secondary MMU mappings will be
      invalidated, and the next secondary-mmu-page-fault will call
      get_user_pages and trigger a do_wp_page through get_user_pages if it
      called get_user_pages with write=1, and it'll re-establish an updated
      spte or secondary-tlb-mapping on the copied page.  Or it will setup a
      readonly spte or readonly tlb mapping if it's a guest-read, if it calls
      get_user_pages with write=0.  This is just an example.
      
      This allows mapping any page pointed to by any pte (and in turn visible in the
      primary CPU MMU) into a secondary MMU (be it a pure tlb like GRU, or a
      full MMU with both sptes and secondary-tlb like the shadow-pagetable layer
      with kvm), or a remote DMA in software like XPMEM (hence needing of
      schedule in XPMEM code to send the invalidate to the remote node, while no
      need to schedule in kvm/gru as it's an immediate event like invalidating
      primary-mmu pte).
      
      At least for KVM without this patch it's impossible to swap guests
      reliably.  And having this feature and removing the page pin allows
      several other optimizations that simplify life considerably.
      
      Dependencies:
      
      1) mm_take_all_locks() to register the mmu notifier when the whole VM
         isn't doing anything with "mm".  This allows mmu notifier users to keep
         track if the VM is in the middle of the invalidate_range_begin/end
         critical section with an atomic counter increased in range_begin and
         decreased in range_end.  No secondary MMU page fault is allowed to map
         any spte or secondary tlb reference, while the VM is in the middle of
         range_begin/end as any page returned by get_user_pages in that critical
         section could later immediately be freed without any further
         ->invalidate_page notification (invalidate_range_begin/end works on
         ranges and ->invalidate_page isn't called immediately before freeing
         the page).  To stop all page freeing and pagetable overwrites the
         mmap_sem must be taken in write mode and all other anon_vma/i_mmap
         locks must be taken too.
      
      2) It'd be a waste to add branches in the VM if nobody could possibly
         run KVM/GRU/XPMEM on the kernel, so mmu notifiers will only be enabled if
         CONFIG_KVM=m/y.  In the current kernel kvm won't yet take advantage of
         mmu notifiers, but this already allows compiling a KVM external module
         against a kernel with mmu notifiers enabled and from the next pull from
         kvm.git we'll start using them.  And GRU/XPMEM will also be able to
         continue the development by enabling KVM=m in their config, until they
         submit all GRU/XPMEM GPLv2 code to the mainline kernel.  Then they can
         also enable MMU_NOTIFIERS in the same way KVM does it (even if KVM=n).
         This guarantees nobody selects MMU_NOTIFIER=y if KVM and GRU and XPMEM
         are all =n.
      
      The mmu_notifier_register call can fail because mm_take_all_locks may be
      interrupted by a signal and return -EINTR.  Because mmu_notifier_register
      is used at driver startup, a failure can be handled gracefully.  Here is
      an example of the change applied to kvm to register the mmu notifiers.
      Usually when a driver starts up, other allocations are required anyway and
      -ENOMEM failure paths exist already.
      
       struct  kvm *kvm_arch_create_vm(void)
       {
              struct kvm *kvm = kzalloc(sizeof(struct kvm), GFP_KERNEL);
      +       int err;
      
              if (!kvm)
                      return ERR_PTR(-ENOMEM);
      
              INIT_LIST_HEAD(&kvm->arch.active_mmu_pages);
      
      +       kvm->arch.mmu_notifier.ops = &kvm_mmu_notifier_ops;
      +       err = mmu_notifier_register(&kvm->arch.mmu_notifier, current->mm);
      +       if (err) {
      +               kfree(kvm);
      +               return ERR_PTR(err);
      +       }
      +
              return kvm;
       }
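
      For context, a hedged sketch of what such a notifier ops table might look like
      (the callback names are assumed from the description above, not copied from
      the patch):

        static const struct mmu_notifier_ops kvm_mmu_notifier_ops = {
               .invalidate_page        = kvm_mmu_notifier_invalidate_page,
               .invalidate_range_start = kvm_mmu_notifier_invalidate_range_start,
               .invalidate_range_end   = kvm_mmu_notifier_invalidate_range_end,
               .clear_flush_young      = kvm_mmu_notifier_clear_flush_young,
               .release                = kvm_mmu_notifier_release,
        };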
      
      mmu_notifier_unregister returns void and it's reliable.
      
      The patch also adds a few needed but missing includes that would prevent
      the kernel from compiling after these changes on non-x86 archs (x86 didn't
      need them by luck).
      
      [akpm@linux-foundation.org: coding-style fixes]
      [akpm@linux-foundation.org: fix mm/filemap_xip.c build]
      [akpm@linux-foundation.org: fix mm/mmu_notifier.c build]
      Signed-off-by: Andrea Arcangeli <andrea@qumranet.com>
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
      Cc: Jack Steiner <steiner@sgi.com>
      Cc: Robin Holt <holt@sgi.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Kanoj Sarcar <kanojsarcar@yahoo.com>
      Cc: Roland Dreier <rdreier@cisco.com>
      Cc: Steve Wise <swise@opengridcomputing.com>
      Cc: Avi Kivity <avi@qumranet.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Anthony Liguori <aliguori@us.ibm.com>
      Cc: Chris Wright <chrisw@redhat.com>
      Cc: Marcelo Tosatti <marcelo@kvack.org>
      Cc: Eric Dumazet <dada1@cosmosbay.com>
      Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
      Cc: Izik Eidus <izike@qumranet.com>
      Cc: Anthony Liguori <aliguori@us.ibm.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      cddb8a5c
  11. 27 Jul 2008, 1 commit
  12. 25 Jul 2008, 7 commits
    • hugetlb: introduce pud_huge · ceb86879
      Committed by Andi Kleen
      Straightforward extensions for huge pages located in the PUD instead of
      PMDs.
      Signed-off-by: Andi Kleen <ak@suse.de>
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ceb86879
    • hugetlbfs: per mount huge page sizes · a137e1cc
      Committed by Andi Kleen
      Add the ability to configure the hugetlb hstate used on a per mount basis.
      
      - Add a new pagesize= option to the hugetlbfs mount that allows setting
        the page size (see the example after this list)
      - This option causes the mount code to find the hstate corresponding to the
        specified size, and sets up a pointer to the hstate in the mount's
        superblock.
      - Change the hstate accessors to use this information rather than the
        global_hstate they were using (requires a slight change in mm/memory.c
        so we don't NULL deref in the error-unmap path -- see comments).
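
      For example, a mount that selects 2MB huge pages for one mount point might
      look like this (illustrative usage of the new option):

        mount -t hugetlbfs -o pagesize=2M none /mnt/huge-2m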
      
      [np: take hstate out of hugetlbfs inode and vma->vm_private_data]
      Acked-by: Adam Litke <agl@us.ibm.com>
      Acked-by: Nishanth Aravamudan <nacc@us.ibm.com>
      Signed-off-by: Andi Kleen <ak@suse.de>
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a137e1cc
    • hugetlb: modular state for hugetlb page size · a5516438
      Committed by Andi Kleen
      The goal of this patchset is to support multiple hugetlb page sizes.  This
      is achieved by introducing a new struct hstate structure, which
      encapsulates the important hugetlb state and constants (eg.  huge page
      size, number of huge pages currently allocated, etc).
      
      The hstate structure is then passed around to the code which requires these
      fields; that code will do the right thing regardless of the exact hstate it
      is operating on.
      
      This patch adds the hstate structure, with a single global instance of it
      (default_hstate), and does the basic work of converting hugetlb to use the
      hstate.
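
      A hedged sketch of the kind of state such a structure encapsulates (field
      names are assumed from the description, not copied from the patch):

        struct hstate {
                unsigned int order;              /* huge page size = PAGE_SIZE << order */
                unsigned long nr_huge_pages;     /* huge pages currently allocated */
                unsigned long free_huge_pages;   /* huge pages still available in the pool */
        };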
      
      Future patches will add more hstate structures to allow for different
      hugetlbfs mounts to have different page sizes.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Acked-by: Adam Litke <agl@us.ibm.com>
      Acked-by: Nishanth Aravamudan <nacc@us.ibm.com>
      Signed-off-by: Andi Kleen <ak@suse.de>
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a5516438
    • hugetlb: guarantee that COW faults for a process that called mmap(MAP_PRIVATE) on hugetlbfs will succeed · 04f2cbe3
      Committed by Mel Gorman
      
      After patch 2 in this series, a process that successfully calls mmap() for
      a MAP_PRIVATE mapping will be guaranteed to successfully fault until a
      process calls fork().  At that point, the next write fault from the parent
      could fail due to COW if the child still has a reference.
      
      We only reserve pages for the parent but a copy must be made to avoid
      leaking data from the parent to the child after fork().  Reserves could be
      taken for both parent and child at fork time to guarantee faults but if
      the mapping is large it is highly likely we will not have sufficient pages
      for the reservation, and it is common to fork only to exec() immediately
      after.  A failure here would be very undesirable.
      
      Note that the current behaviour of mainline with MAP_PRIVATE pages is
      pretty bad.  The following situation is allowed to occur today.
      
      1. Process calls mmap(MAP_PRIVATE)
      2. Process calls mlock() to fault all pages and makes sure it succeeds
      3. Process forks()
      4. Process writes to MAP_PRIVATE mapping while child still exists
      5. If the COW fails at this point, the process gets SIGKILLed even though it
         had taken care to ensure the pages existed
      
      This patch improves the situation by guaranteeing the reliability of the
      process that successfully calls mmap().  When the parent performs COW, it
      will try to satisfy the allocation without using reserves.  If that fails
      the parent will steal the page leaving any children without a page.
      Faults from the child after that point will result in failure.  If the
      child COW happens first, an attempt will be made to allocate the page
      without reserves and the child will get SIGKILLed on failure.
      
      To summarise the new behaviour:
      
      1. If the original mapper performs COW on a private mapping with multiple
         references, it will attempt to allocate a hugepage from the pool or
         the buddy allocator without using the existing reserves. On fail, VMAs
         mapping the same area are traversed and the page being COW'd is unmapped
         where found. It will then steal the original page as the last mapper in
         the normal way.
      
      2. The VMAs the pages were unmapped from are flagged to note that pages
         with data no longer exist. Future no-page faults on those VMAs will
         terminate the process as otherwise it would appear that data was corrupted.
         A warning is printed to the console that this situation occurred.
      
      3. If the child performs COW first, it will attempt to satisfy the COW
         from the pool if there are enough pages or via the buddy allocator if
         overcommit is allowed and the buddy allocator can satisfy the request. If
         it fails, the child will be killed.
      
      If the pool is large enough, existing applications will not notice that
      the reserves were a factor.  Existing applications that depend on reserves
      not being taken are unlikely to exist, as for much of the history of
      hugetlbfs, pages were prefaulted at mmap(), allocating the pages at that
      point or failing the mmap().
      
      [npiggin@suse.de: fix CONFIG_HUGETLB=n build]
      Signed-off-by: Mel Gorman <mel@csn.ul.ie>
      Acked-by: Adam Litke <agl@us.ibm.com>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Cc: William Lee Irwin III <wli@holomorphy.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      04f2cbe3
    • mm: remove double indirection on tlb parameter to free_pgd_range() & Co · 42b77728
      Committed by Jan Beulich
      The double indirection here is not needed anywhere and hence (at least)
      confusing.
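
      Illustratively, the change is of this shape (signature sketch, not the full diff):

        /* before: pointer to pointer, though no caller needed the extra level */
        void free_pgd_range(struct mmu_gather **tlb, unsigned long addr,
                            unsigned long end, unsigned long floor, unsigned long ceiling);

        /* after: a plain pointer */
        void free_pgd_range(struct mmu_gather *tlb, unsigned long addr,
                            unsigned long end, unsigned long floor, unsigned long ceiling);
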
      Signed-off-by: Jan Beulich <jbeulich@novell.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Cc: Paul Mundt <lethal@linux-sh.org>
      Cc: "David S. Miller" <davem@davemloft.net>
      Acked-by: Jeremy Fitzhardinge <jeremy@goop.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      42b77728
    • access_process_vm device memory infrastructure · 28b2ee20
      Committed by Rik van Riel
      In order to be able to debug things like the X server and programs using
      the PPC Cell SPUs, the debugger needs to be able to access device memory
      through ptrace and /proc/pid/mem.
      
      This patch:
      
      Add the generic_access_phys access function and put the hooks in place
      to allow access_process_vm to access device or PPC Cell SPU memory.
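
      A hedged sketch of how a driver might wire this up (the .access hook and
      generic_access_phys are named above; the surrounding names are illustrative):

        static const struct vm_operations_struct mydrv_vm_ops = {
                .fault  = mydrv_vm_fault,          /* hypothetical driver fault handler */
                .access = generic_access_phys,     /* lets ptrace read/write device memory */
        };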
      
      [riel@redhat.com: Add documentation for the vm_ops->access function]
      Signed-off-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: Benjamin Herrensmidt <benh@kernel.crashing.org>
      Cc: Dave Airlie <airlied@linux.ie>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      28b2ee20
    • mm: remove nopfn · 0d71d10a
      Committed by Nick Piggin
      There are no users of nopfn in the tree. Remove it.
      
      [hugh@veritas.com: fix build error]
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0d71d10a
  13. 05 Jul 2008, 2 commits
    • get_user_pages(): fix possible page leak on oom · 7a36a752
      Committed by Oleg Nesterov
      get_user_pages() must not return the error when i != 0.  When pages !=
      NULL we have i get_page()'ed pages.
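
      A hedged sketch of the idea (the exact code in the patch may differ): when an
      error occurs after some pages have already been pinned, report the partial
      count so the caller can release them.

        /* inside the get_user_pages() loop, on a fault/oom error */
        if (ret < 0)
                return i ? i : ret;     /* don't leak the i pages we already hold */
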
      Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
      Acked-by: Nick Piggin <npiggin@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7a36a752
    • mm: dirty page accounting vs VM_MIXEDMAP · 251b97f5
      Committed by Peter Zijlstra
      Dirty page accounting accurately measures the amount of dirty pages in
      writable shared mappings by mapping the pages RO (as indicated by
      vma_wants_writenotify).  We then trap on first write and call
      set_page_dirty() on the page, after which we map the page RW and
      continue execution.
      
      When we launder dirty pages, we call clear_page_dirty_for_io() which
      clears both the dirty flag, and maps the page RO again before we start
      writeout so that the story can repeat itself.
      
      vma_wants_writenotify() excludes VM_PFNMAP on the basis that we cannot
      do the regular dirty page stuff on raw PFNs and the memory isn't going
      anywhere anyway.
      
      The recently introduced VM_MIXEDMAP mixes both !pfn_valid() and
      pfn_valid() pages in a single mapping.
      
      We can't do dirty page accounting on !pfn_valid() pages as stated
      above, and mapping them RO causes them to be COW'ed on write, which
      breaks VM_SHARED semantics.
      
      Excluding VM_MIXEDMAP in vma_wants_writenotify() would mean we don't do
      the regular dirty page accounting for the pfn_valid() pages, which
      would bring back all the head-aches from inaccurate dirty page
      accounting.
      
      So instead, we let the !pfn_valid() pages get mapped RO, but fix them
      up unconditionally in the fault path.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Acked-by: Hugh Dickins <hugh@veritas.com>
      Cc: "Jared Hulbert" <jaredeh@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      251b97f5
  14. 24 Jun 2008, 2 commits
    • mm: fix race in COW logic · 945754a1
      Committed by Nick Piggin
      There is a race in the COW logic.  It contains a shortcut to avoid the
      COW and reuse the page if we have the sole reference on the page,
      however it is possible to have two racing do_wp_page()ers with one
      causing the other to mistakenly believe it is safe to take the shortcut
      when it is not.  This could lead to data corruption.
      
      Process 1 and process2 each have a wp pte of the same anon page (ie.
      one forked the other).  The page's mapcount is 2.  Then they both
      attempt to write to it around the same time...
      
        proc1				proc2 thr1			proc2 thr2
        CPU0				CPU1				CPU3
        do_wp_page()			do_wp_page()
      				 trylock_page()
      				  can_share_swap_page()
      				   load page mapcount (==2)
      				  reuse = 0
      				 pte unlock
      				 copy page to new_page
      				 pte lock
      				 page_remove_rmap(page);
         trylock_page()
          can_share_swap_page()
           load page mapcount (==1)
          reuse = 1
         ptep_set_access_flags (allow W)
      
        write private key into page
      								read from page
      				ptep_clear_flush()
      				set_pte_at(pte of new_page)
      
      Fix this by moving the page_remove_rmap of the old page after the pte
      clear and flush.  Potentially the entire branch could be moved down
      here, but in order to stay consistent, I won't (should probably move all
      the *_mm_counter stuff with one patch).
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Acked-by: Hugh Dickins <hugh@veritas.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      945754a1
    • Fix ZERO_PAGE breakage with vmware · 672ca28e
      Committed by Linus Torvalds
      Commit 89f5b7da ("Reinstate ZERO_PAGE
      optimization in 'get_user_pages()' and fix XIP") broke vmware, as
      reported by Jeff Chua:
      
        "This broke vmware 6.0.4.
         Jun 22 14:53:03.845: vmx| NOT_IMPLEMENTED
         /build/mts/release/bora-93057/bora/vmx/main/vmmonPosix.c:774"
      
      and the reason seems to be that there's an old bug in how we handle
      FOLL_ANON on VM_SHARED areas in get_user_pages(), but since it only
      triggered if the whole page table was missing, nobody had apparently hit
      it before.
      
      The recent changes to 'follow_page()' made the FOLL_ANON logic trigger
      not just for whole missing page tables, but for individual pages as
      well, and exposed this problem.
      
      This fixes it by making the test for when FOLL_ANON is used more
      careful, and also makes the code easier to read and understand by moving
      the logic to a separate inline function.
      Reported-and-tested-by: Jeff Chua <jeff.chua.linux@gmail.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      672ca28e
  15. 21 Jun 2008, 1 commit
    • Reinstate ZERO_PAGE optimization in 'get_user_pages()' and fix XIP · 89f5b7da
      Committed by Linus Torvalds
      KAMEZAWA Hiroyuki and Oleg Nesterov point out that since the commit
      557ed1fa ("remove ZERO_PAGE") removed
      the ZERO_PAGE from the VM mappings, any users of get_user_pages() will
      generally now populate the VM with real empty pages needlessly.
      
      We used to get the ZERO_PAGE when we did the "handle_mm_fault()", but
      since fault handling no longer uses ZERO_PAGE for new anonymous pages,
      we now need to handle that special case in follow_page() instead.
      
      In particular, the removal of ZERO_PAGE effectively removed the core
      file writing optimization where we would skip writing pages that had not
      been populated at all, and increased memory pressure a lot by allocating
      all those useless newly zeroed pages.
      
      This reinstates the optimization by making the unmapped PTE case the
      same as for a non-existent page table, which already did this correctly.
      
      While at it, this also fixes the XIP case for follow_page(), where the
      caller could not differentiate between the case of a page that simply
      could not be used (because it had no "struct page" associated with it)
      and a page that just wasn't mapped.
      
      We do that by simply returning an error pointer for pages that could not
      be turned into a "struct page *".  The error is arbitrarily picked to be
      EFAULT, since that was what get_user_pages() already used for the
      equivalent IO-mapped page case.
      
      [ Also removed an impossible test for pte_offset_map_lock() failing:
        that's not how that function works ]
      Acked-by: Oleg Nesterov <oleg@tv-sign.ru>
      Acked-by: Nick Piggin <npiggin@suse.de>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Roland McGrath <roland@redhat.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      89f5b7da
  16. 25 May 2008, 1 commit
  17. 15 May 2008, 1 commit
    • fix SMP data race in pagetable setup vs walking · 362a61ad
      Committed by Nick Piggin
      There is a possible data race in the page table walking code. After the split
      ptlock patches, it actually seems to have been introduced to the core code, but
      even before that I think it would have impacted some architectures (powerpc
      and sparc64, at least, walk the page tables without taking locks eg. see
      find_linux_pte()).
      
      The race is as follows:
      The pte page is allocated, zeroed, and its struct page gets its spinlock
      initialized. The mm-wide ptl is then taken, and then the pte page is inserted
      into the pagetables.
      
      At this point, the spinlock is not guaranteed to have ordered the previous
      stores to initialize the pte page with the subsequent store to put it in the
      page tables. So another Linux page table walker might be walking down (without
      any locks, because we have split-leaf-ptls), and find that new pte we've
      inserted. It might try to take the spinlock before the store from the other
      CPU initializes it. And subsequently it might read a pte_t out before stores
      from the other CPU have cleared the memory.
      
      There are also similar races in higher levels of the page tables. They
      obviously don't involve the spinlock, but could see uninitialized memory.
      
      Arch code and hardware pagetable walkers that walk the pagetables without
      locks could see similar uninitialized memory problems, regardless of whether
      split ptes are enabled or not.
      
      I prefer to put the barriers in core code, because that's where the higher
      level logic happens, but the page table accessors are per-arch, and open-coding
      them everywhere I don't think is an option. I'll put the read-side barriers
      in alpha arch code for now (other architectures perform data-dependent loads
      in order).
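
      A hedged sketch of the write-side fix described above (simplified; the patch
      touches the generic page-table population paths):

        /* allocate and fully initialize the new pte page */
        new = pte_alloc_one(mm, address);
        if (!new)
                return -ENOMEM;

        smp_wmb();      /* order pte page initialization before it becomes visible */

        spin_lock(&mm->page_table_lock);
        if (pmd_none(*pmd))
                pmd_populate(mm, pmd, new);     /* publish the initialized table */
        spin_unlock(&mm->page_table_lock);
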
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      362a61ad
  18. 07 May 2008, 1 commit
    • x86: fix PAE pmd_bad bootup warning · aeed5fce
      Committed by Hugh Dickins
      Fix warning from pmd_bad() at bootup on a HIGHMEM64G HIGHPTE x86_32.
      
      That came from 9fc34113 x86: debug pmd_bad();
      but we understand now that the typecasting was wrong for PAE in the previous
      version: pagetable pages above 4GB looked bad and stopped Arjan from booting.
      
      And revert that cded932b x86: fix pmd_bad
      and pud_bad to support huge pages.  It was the wrong way round: we shouldn't
      weaken every pmd_bad and pud_bad check to let huge pages slip through - in
      part they check that we _don't_ have a huge page where it's not expected.
      
      Put the x86 pmd_bad() and pud_bad() definitions back to what they have long
      been: they can be improved (x86_32 should use PTE_MASK, to stop PAE thinking
      junk in the upper word is good; and x86_64 should follow x86_32's stricter
      comparison, to stop thinking any subset of required bits is good); but that
      should be a later patch.
      
      Fix Hans' good observation that follow_page() will never find pmd_huge()
      because that would have already failed the pmd_bad test: test pmd_huge in
      between the pmd_none and pmd_bad tests.  Tighten x86's pmd_huge() check?
      No, once it's a hugepage entry, it can get quite far from a good pmd: for
      example, PROT_NONE leaves it with only ACCESSED of the KERN_PGTABLE bits.
      
      However... though follow_page() contains this and another test for huge
      pages, so it's nice to keep it working on them, where does it actually get
      called on a huge page?  get_user_pages() checks is_vm_hugetlb_page(vma) to
      call alternative hugetlb processing, as does unmap_vmas() and others.
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Earlier-version-tested-by: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Jeff Chua <jeff.chua.linux@gmail.com>
      Cc: Hans Rosenfeld <hans.rosenfeld@amd.com>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      aeed5fce
  19. 28 Apr 2008, 1 commit
    • mm: add vm_insert_mixed · 423bad60
      Committed by Nick Piggin
      vm_insert_mixed will insert either a raw pfn or a refcounted struct page into
      the page tables, depending on whether vm_normal_page() will return the page or
      not.  With the introduction of the new pte bit, this is now too tricky for
      drivers to be doing themselves.
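
      A hedged usage sketch (argument names are illustrative; see the patch for the
      exact signature):

        /* in a driver fault handler: let the core pick raw-pfn vs struct-page handling */
        err = vm_insert_mixed(vma, address, pfn);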
      
      filemap_xip uses this in a subsequent patch.
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Cc: Jared Hulbert <jaredeh@gmail.com>
      Cc: Carsten Otte <cotte@de.ibm.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      423bad60