1. 25 May 2011 (9 commits)
    • mm: nommu: sort mm->mmap list properly · 6038def0
      Authored by Namhyung Kim
      When I was reading nommu code, I found that it handles the vma list/tree
      in an unusual way.  IIUC, because there can be more than one
      identical/overlapped vmas in the list/tree, it sorts the tree more
      strictly and does a linear search on the tree.  But this wasn't applied to
      the list (i.e.  the list could be constructed in a different order than
      the tree so that we can't use the list when finding the first vma in that
      order).
      
      Since inserting/sorting a vma in the tree and in the list is done at the
      same time, we can easily construct both of them in the same order.  And
      since a linear search on the tree can be more costly than one on the
      list, the search can be converted to use the list.
      
      Also, after commit 297c5eee ("mm: make the vma list be doubly linked")
      made the list doubly linked, a couple of places needed to be fixed to
      construct the list properly.
      
      Patch 1/6 is a preparation.  It keeps the list sorted in the same order
      as the tree and constructs the doubly-linked list properly.  Patch 2/6 is
      a simple optimization for vma deletion.  Patches 3/6 and 4/6 convert tree
      traversal to list traversal and the rest are simple fixes and cleanups.
      
      This patch:
      
      @vma added into @mm should be sorted by start addr, end addr and VMA
      struct addr in that order because we may get identical VMAs in the @mm.
      However this was true only for the rbtree, not for the list.
      
      This patch fixes this by remembering 'rb_prev' during the tree traversal
      like find_vma_prepare() does and linking the @vma via __vma_link_list().
      After this patch, we can iterate the whole VMAs in correct order simply by
      using @mm->mmap list.
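
      As an illustration, here is a simplified sketch of the idea (not the
      verbatim patch; the real code is add_vma_to_mm() in mm/nommu.c, sharing
      __vma_link_list() from mm/util.c, and the VMA-address tie-break is
      omitted here for brevity): remember the nearest left neighbour as
      rb_prev while walking the rbtree, then link @vma into the doubly-linked
      list right after it.

        struct vm_area_struct *prev = NULL;
        struct rb_node **p = &mm->mm_rb.rb_node, *parent = NULL, *rb_prev = NULL;

        while (*p) {
                struct vm_area_struct *pvma;

                parent = *p;
                pvma = rb_entry(parent, struct vm_area_struct, vm_rb);

                /* sort by start addr, then end addr (struct addr tie-break omitted) */
                if (vma->vm_start < pvma->vm_start) {
                        p = &(*p)->rb_left;
                } else if (vma->vm_start > pvma->vm_start) {
                        rb_prev = parent;               /* rightmost node to our left */
                        p = &(*p)->rb_right;
                } else if (vma->vm_end < pvma->vm_end) {
                        p = &(*p)->rb_left;
                } else {
                        rb_prev = parent;
                        p = &(*p)->rb_right;
                }
        }

        rb_link_node(&vma->vm_rb, parent, p);
        rb_insert_color(&vma->vm_rb, &mm->mm_rb);

        if (rb_prev)
                prev = rb_entry(rb_prev, struct vm_area_struct, vm_rb);
        __vma_link_list(mm, vma, prev, parent);         /* list now mirrors the tree order */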
      
      [akpm@linux-foundation.org: avoid duplicating __vma_link_list()]
      Signed-off-by: Namhyung Kim <namhyung@gmail.com>
      Acked-by: Greg Ungerer <gerg@uclinux.org>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Paul Mundt <lethal@linux-sh.org>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6038def0
    • mmap: avoid merging cloned VMAs · 965f55de
      Authored by Shaohua Li
      Avoid merging a VMA with another VMA which is cloned from the parent process.
      
      The cloned VMA shares the anon_vma lock with the parent process's VMA.  If
      we do the merge, more VMAs (even when the new range belongs only to the
      current process) use the parent process's anon_vma lock.  This introduces
      scalability issues.  find_mergeable_anon_vma() already considers this
      case.
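
      For reference, a rough sketch of the shape of that check in mm/mmap.c
      (simplified; treat the details as approximate): a merge across a NULL
      anon_vma is only allowed when the existing VMA's anon_vma_chain has a
      single entry, i.e. its anon_vma was not inherited from a parent at fork.

        static inline int is_mergeable_anon_vma(struct anon_vma *anon_vma1,
                                                struct anon_vma *anon_vma2,
                                                struct vm_area_struct *vma)
        {
                /*
                 * A VMA cloned at fork() carries its parent's anon_vmas on its
                 * anon_vma_chain; merging with it would spread the parent's
                 * anon_vma lock to yet more VMAs.
                 */
                if ((!anon_vma1 || !anon_vma2) &&
                    (!vma || list_is_singular(&vma->anon_vma_chain)))
                        return 1;
                return anon_vma1 == anon_vma2;
        }
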
      Signed-off-by: Shaohua Li <shaohua.li@intel.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      965f55de
    • mmap: avoid unnecessary anon_vma lock · 5f70b962
      Authored by Shaohua Li
      If we only change vma->vm_end, we can avoid taking anon_vma lock even if
      'insert' isn't NULL, which is the case of split_vma.
      
      As I understand it, the lock was needed before because rmap had to be
      able to find the 'insert' VMA while we adjust the old VMA's vm_end (the
      'insert' VMA used to be linked into the anon_vma list in
      __insert_vm_struct).
      
      But now this isn't true any more.  The 'insert' VMA is already linked to
      the anon_vma list in __split_vma() (via anon_vma_clone()) instead of in
      __insert_vm_struct.  There is no window in which rmap cannot find the
      required VMAs.  So the anon_vma lock is unnecessary there, which removes
      one lock acquisition in the brk case and improves scalability.
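
      Roughly, the change amounts to dropping 'insert' from the condition
      under which vma_adjust() grabs the anon_vma lock (illustrative sketch
      only, not the literal diff):

        /* before: lock whenever insert, importer, or vm_start changes */
        if (vma->anon_vma && (insert || importer || start != vma->vm_start))
                anon_vma = vma->anon_vma;

        /* after: a pure vm_end adjustment with 'insert' (split_vma/brk) no
         * longer needs the lock, since 'insert' was already linked to the
         * anon_vma list by anon_vma_clone() in __split_vma() */
        if (vma->anon_vma && (importer || start != vma->vm_start))
                anon_vma = vma->anon_vma;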
      
      Signed-off-by: Shaohua Li <shaohua.li@intel.com>
      Cc: Rik van Riel <riel@redhat.com>
      Acked-by: Hugh Dickins <hughd@google.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5f70b962
    • mmap: add alignment for some variables · 34679d7e
      Authored by Shaohua Li
      Give some variables the proper alignment/section placement to avoid
      cache-line issues.  In a workload that heavily does mmap/munmap, these
      variables are used frequently.
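
      For illustration, the kind of annotations this refers to (hypothetical
      placement of the hot mmap/munmap globals in mm/mmap.c; not the verbatim
      patch):

        int sysctl_overcommit_memory __read_mostly = OVERCOMMIT_GUESS;
        int sysctl_overcommit_ratio __read_mostly = 50;
        int sysctl_max_map_count __read_mostly = DEFAULT_MAX_MAP_COUNT;
        /* written on every mmap/munmap: keep it on its own cache line */
        struct percpu_counter vm_committed_as ____cacheline_aligned_in_smp;
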
      Signed-off-by: Shaohua Li <shaohua.li@intel.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      34679d7e
    • arch, mm: filter disallowed nodes from arch specific show_mem functions · 7bf02ea2
      Authored by David Rientjes
      Architectures that implement their own show_mem() function did not pass
      the filter argument to show_free_areas() to appropriately avoid emitting
      the state of nodes that are disallowed in the current context.  This patch
      now passes the filter argument to show_free_areas() so those nodes are now
      avoided.
      
      This patch also removes the show_free_areas() wrapper around
      __show_free_areas() and converts existing callers to pass an empty filter.
      
      ia64 emits additional information for each node, so skip_free_areas_zone()
      must be made global to filter disallowed nodes and it is converted to use
      a nid argument rather than a zone for this use case.
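
      The resulting pattern for an architecture's show_mem() looks roughly
      like this (illustrative sketch, not any particular arch's code):

        void show_mem(unsigned int filter)
        {
                printk(KERN_INFO "Mem-info:\n");
                /* pass the filter through instead of dropping it */
                show_free_areas(filter);
                /* ...arch-specific per-node/per-zone details follow... */
        }
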
      Signed-off-by: David Rientjes <rientjes@google.com>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Kyle McMartin <kyle@mcmartin.ca>
      Cc: Helge Deller <deller@gmx.de>
      Cc: James Bottomley <jejb@parisc-linux.org>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7bf02ea2
    • mm: vmscan: correctly check if reclaimer should schedule during shrink_slab · f06590bd
      Authored by Minchan Kim
      It has been reported on some laptops that kswapd is consuming large
      amounts of CPU and not being scheduled when SLUB is enabled during large
      amounts of file copying.  It is expected that this is due to kswapd
      missing every cond_resched() point because:
      
      shrink_page_list() calls cond_resched() if inactive pages were isolated
              which in turn may not happen if all_unreclaimable is set in
              shrink_zones(). If for whatever reason all_unreclaimable is
              set on all zones, we can miss calling cond_resched().
      
      balance_pgdat() only calls cond_resched if the zones are not
              balanced. For a high-order allocation that is balanced, it
              checks order-0 again. During that window, order-0 might have
              become unbalanced so it loops again for order-0 and returns
              that it was reclaiming for order-0 to kswapd(). It can then
              find that a caller has rewoken kswapd for a high-order and
              re-enters balance_pgdat() without ever calling cond_resched().
      
      shrink_slab only calls cond_resched() if we are reclaiming slab
      	pages. If there are a large number of direct reclaimers, the
      	shrinker_rwsem can be contended and prevent kswapd calling
      	cond_resched().
      
      This patch modifies the shrink_slab() case.  If the semaphore is
      contended, the caller will still check cond_resched().  After each
      successful call into a shrinker, the check for cond_resched() remains in
      case one shrinker is particularly slow.
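
      A condensed sketch of the resulting shrink_slab() control flow
      (abridged; the real function also computes per-shrinker scan counts):

        unsigned long shrink_slab(unsigned long scanned, gfp_t gfp_mask,
                                  unsigned long lru_pages)
        {
                struct shrinker *shrinker;
                unsigned long ret = 0;

                if (!down_read_trylock(&shrinker_rwsem)) {
                        /* Assume we'll be able to shrink next time */
                        ret = 1;
                        goto out;
                }

                list_for_each_entry(shrinker, &shrinker_list, list) {
                        /* ...do_shrinker_shrink() calls elided... */
                        cond_resched();         /* kept: one shrinker may be slow */
                }
                up_read(&shrinker_rwsem);
        out:
                cond_resched();                 /* reached even when the trylock fails */
                return ret;
        }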
      
      [mgorman@suse.de: preserve call to cond_resched after each call into shrinker]
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
      Tested-by: Colin King <colin.king@canonical.com>
      Cc: Raghavendra D Prabhu <raghu.prabhu13@gmail.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Chris Mason <chris.mason@oracle.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: <stable@kernel.org>		[2.6.38+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f06590bd
    • mm: vmscan: correct use of pgdat_balanced in sleeping_prematurely · afc7e326
      Authored by Johannes Weiner
      There are a few reports of people experiencing hangs when copying large
      amounts of data with kswapd using a large amount of CPU which appear to be
      due to recent reclaim changes.  SLUB using high orders is the trigger but
      not the root cause as SLUB has been using high orders for a while.  The
      root cause was bugs introduced into reclaim which are addressed by the
      following two patches.
      
      Patch 1 corrects logic introduced by commit 1741c877 ("mm: kswapd:
              keep kswapd awake for high-order allocations until a percentage of
              the node is balanced") to allow kswapd to go to sleep when
              balanced for high orders.
      
      Patch 2 notes that it is possible for kswapd to miss every
              cond_resched() and updates shrink_slab() so it'll at least reach
              that scheduling point.
      
      Chris Wood reports that these two patches in isolation are sufficient to
      prevent the system hanging.  AFAIK, they should also resolve similar hangs
      experienced by James Bottomley.
      
      This patch:
      
      Johannes Weiner pointed out that the logic in commit 1741c877 ("mm: kswapd:
      keep kswapd awake for high-order allocations until a percentage of the
      node is balanced") is backwards.  Instead of allowing kswapd to go to
      sleep when balanced for high order allocations, it keeps kswapd
      running uselessly.
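
      Illustrative only (not the literal diff): sleeping_prematurely() answers
      "should kswapd keep running?", so a node that is balanced for the
      requested order must make it return false, letting kswapd sleep.

        /* wrong: a balanced node kept kswapd awake */
        if (pgdat_balanced(pgdat, balanced, classzone_idx))
                return true;

        /* intended: a balanced node lets kswapd sleep */
        if (pgdat_balanced(pgdat, balanced, classzone_idx))
                return false;
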
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Wu Fengguang <fengguang.wu@intel.com>
      Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
      Tested-by: Colin King <colin.king@canonical.com>
      Cc: Raghavendra D Prabhu <raghu.prabhu13@gmail.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Chris Mason <chris.mason@oracle.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Rik van Riel <riel@redhat.com>
      Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
      Reviewed-by: Wu Fengguang <fengguang.wu@intel.com>
      Cc: <stable@kernel.org>		[2.6.38+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      afc7e326
    • slub: Fix double bit unlock in debug mode · a71ae47a
      Authored by Christoph Lameter
      Commit 442b06bc ("slub: Remove node check in slab_free") added a
      call to deactivate_slab() in the debug case in __slab_alloc(), which
      unlocks the current slab used for allocation.  Going to the label
      'unlock_out' then does it again.
      
      Also, in the debug case we do not need all the other processing that the
      'unlock_out' path does.  We always fall back to the slow path in the
      debug case.  So the tid update is useless.
      
      Similarly, ALLOC_SLOWPATH would just be incremented for all allocations,
      which is also pretty useless.
      
      So simply restore irq flags and return the object.
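
      The debug-case tail of __slab_alloc() then ends up looking roughly like
      this (a sketch reconstructed from the description, not the verbatim
      patch):

        debug:
                if (!object || !alloc_debug_processing(s, c->page, object, addr))
                        goto new_slab;

                c->freelist = get_freepointer(s, object);
                deactivate_slab(s, c);          /* unlocks the slab page */
                c->page = NULL;
                c->node = NUMA_NO_NODE;
                local_irq_restore(flags);       /* no second unlock, no tid update */
                return object;
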
      Signed-off-by: Christoph Lameter <cl@linux.com>
      Reported-and-bisected-by: James Morris <jmorris@namei.org>
      Reported-by: Ingo Molnar <mingo@elte.hu>
      Reported-by: Jens Axboe <jaxboe@fusionio.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a71ae47a
  2. 23 May 2011 (1 commit)
    • [S390] merge page_test_dirty and page_clear_dirty · 2d42552d
      Authored by Martin Schwidefsky
      The page_clear_dirty primitive always sets the default storage key
      which resets the access control bits and the fetch protection bit.
      That will surprise a KVM guest that sets non-zero access control
      bits or the fetch protection bit. Merge page_test_dirty and
      page_clear_dirty back to a single function and only clear the
      dirty bit from the storage key.
      
      In addition, move the functions page_test_and_clear_dirty and
      page_test_and_clear_young to page.h where they belong.  This requires
      changing the parameter from a struct page * to a page
      frame number.
      Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
      2d42552d
  3. 21 May 2011 (3 commits)
  4. 20 May 2011 (1 commit)
  5. 18 May 2011 (4 commits)
  6. 17 May 2011 (1 commit)
  7. 15 May 2011 (1 commit)
    • tmpfs: fix race between swapoff and writepage · 05bf86b4
      Authored by Hugh Dickins
      Shame on me!  Commit b1dea800 "tmpfs: fix race between umount and
      writepage" fixed the advertized race, but introduced another: as even
      its comment makes clear, we cannot safely rely on a peek at list_empty()
      while holding no lock - until info->swapped is set, shmem_unuse_inode()
      may delete any formerly-swapped inode from the shmem_swaplist, which
      in this case would leave a swap area impossible to swapoff.
      
      Although I don't relish taking the mutex every time, I don't care much
      for the alternatives either; and at least the peek at list_empty() in
      shmem_evict_inode() (a hotter path since most inodes would never have
      been swapped) remains safe, because we already truncated the whole file.
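
      In shmem_writepage() this amounts to taking the mutex unconditionally
      instead of peeking at the list first (a sketch of the shape, not the
      verbatim patch):

        /*
         * wrong: a lockless list_empty() peek - shmem_unuse_inode() may
         * delete the inode from shmem_swaplist between the peek and the
         * moment info->swapped is set
         */
        mutex_lock(&shmem_swaplist_mutex);
        if (list_empty(&info->swaplist))
                list_add_tail(&info->swaplist, &shmem_swaplist);
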
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: stable@kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      05bf86b4
  8. 12 May 2011 (7 commits)
    • tmpfs: fix spurious ENOSPC when racing with unswap · 59a16ead
      Authored by Hugh Dickins
      Testing the shmem_swaplist replacements for igrab() revealed another bug:
      writes to /dev/loop0 on a tmpfs file which fills its filesystem were
      sometimes failing with "Buffer I/O error"s.
      
      These came from ENOSPC failures of shmem_getpage(), when racing with
      swapoff: the same could happen when racing with another shmem_getpage(),
      pulling the page in from swap in between our find_lock_page() and our
      taking the info->lock (though not in the single-threaded loop case).
      
      This is unacceptable, and it is surprising that I've not noticed it before:
      it dates back many years, but (presumably) was made a lot easier to
      reproduce in 2.6.36, which placed a page preallocation in the race window.
      
      Fix it by rechecking the page cache before settling on an ENOSPC error.
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Cc: <stable@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      59a16ead
    • tmpfs: fix race between umount and swapoff · 778dd893
      Authored by Hugh Dickins
      The use of igrab() in swapoff's shmem_unuse_inode() is just as vulnerable
      to umount as that in shmem_writepage().
      
      Fix this instance by extending the protection of shmem_swaplist_mutex
      right across shmem_unuse_inode(): while it's on the list, the inode cannot
      be evicted (and the filesystem cannot be unmounted) without
      shmem_evict_inode() taking that mutex to remove it from the list.
      
      But since shmem_writepage() might take that mutex, we should avoid making
      memory allocations or memcg charges while holding it: prepare them at the
      outer level in shmem_unuse().  When mem_cgroup_cache_charge() was
      originally placed, we didn't know until that point that the page from swap
      was actually a shmem page; but nowadays it's noted in the swap_map, so
      we're safe to charge upfront.  For the radix_tree, do as is done in
      shmem_getpage(): preload upfront, but don't pin to the cpu; so we make a
      habit of refreshing the node pool, but might dip into GFP_NOWAIT reserves
      on occasion if subsequently preempted.
      
      With the allocation and charge moved out from shmem_unuse_inode(),
      we can also hold index map and info->lock over from finding the entry.
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Cc: <stable@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      778dd893
    • tmpfs: fix race between umount and writepage · b1dea800
      Authored by Hugh Dickins
      Konstantin Khlebnikov reports that a dangerous race between umount and
      shmem_writepage can be reproduced by this script:
      
        for i in {1..300} ; do
      	mkdir $i
      	while true ; do
      		mount -t tmpfs none $i
      		dd if=/dev/zero of=$i/test bs=1M count=$(($RANDOM % 100))
      		umount $i
      	done &
        done
      
      on a 6-CPU node with 8GB RAM: the kernel is very unstable after this accident. =)
      
      Kernel log:
      
        VFS: Busy inodes after unmount of tmpfs.
                       Self-destruct in 5 seconds.  Have a nice day...
      
        WARNING: at lib/list_debug.c:53 __list_del_entry+0x8d/0x98()
        list_del corruption. prev->next should be ffff880222fdaac8, but was (null)
        Pid: 11222, comm: mount.tmpfs Not tainted 2.6.39-rc2+ #4
        Call Trace:
         warn_slowpath_common+0x80/0x98
         warn_slowpath_fmt+0x41/0x43
         __list_del_entry+0x8d/0x98
         evict+0x50/0x113
         iput+0x138/0x141
        ...
        BUG: unable to handle kernel paging request at ffffffffffffffff
        IP: shmem_free_blocks+0x18/0x4c
        Pid: 10422, comm: dd Tainted: G        W   2.6.39-rc2+ #4
        Call Trace:
         shmem_recalc_inode+0x61/0x66
         shmem_writepage+0xba/0x1dc
         pageout+0x13c/0x24c
         shrink_page_list+0x28e/0x4be
         shrink_inactive_list+0x21f/0x382
        ...
      
      shmem_writepage() calls igrab() on the inode for the page which came from
      page reclaim, to add it later into shmem_swaplist for swapoff operation.
      
      This igrab() can race with super-block deactivating process:
      
        shrink_inactive_list()          deactivate_super()
        pageout()                       tmpfs_fs_type->kill_sb()
        shmem_writepage()               kill_litter_super()
                                        generic_shutdown_super()
                                         evict_inodes()
         igrab()
                                          atomic_read(&inode->i_count)
                                           skip-inode
         iput()
                                         if (!list_empty(&sb->s_inodes))
                                                printk("VFS: Busy inodes after...
      
      This igrab-iput pair was added in commit 1b1b32f2 "tmpfs: fix
      shmem_swaplist races" based on incorrect assumptions: igrab() protects the
      inode from concurrent eviction by deletion, but it does nothing to protect
      it from concurrent unmounting, which goes ahead despite the raised
      i_count.
      
      So this use of igrab() was wrong all along, but the race was made much worse
      in 2.6.37 when commit 63997e98 "split invalidate_inodes()" replaced
      two attempts at invalidate_inodes() by a single evict_inodes().
      
      Konstantin posted a plausible patch, raising sb->s_active too: I'm unsure
      whether it was correct or not; but burnt once by igrab(), I am sure that
      we don't want to rely more deeply upon externals here.
      
      Fix it by adding the inode to shmem_swaplist earlier, while the page lock
      on the page in the page cache still secures the inode against eviction,
      without artificially raising i_count.  It was originally added later because
      shmem_unuse_inode() is liable to remove an inode from the list while it's
      unswapped; but we can guard against that by taking the spinlock before
      dropping the mutex.
      Reported-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Tested-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Cc: <stable@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b1dea800
    • memcg: allocate memory cgroup structures in local nodes · 21a3c964
      Authored by Andi Kleen
      Commit dde79e00 ("page_cgroup: reduce allocation overhead for
      page_cgroup array for CONFIG_SPARSEMEM") added a regression whereby the
      memory cgroup data structures all end up on node 0 because the first
      attempt at allocating them would not pass in a node hint.  Since the
      initialization runs on CPU #0, it would all end up on node 0.  This is a
      problem on large memory systems, where node 0 would lose a lot of
      memory.
      
      Change the alloc_pages_exact() to alloc_pages_exact_nid().  This will
      still fall back to other nodes if not enough memory is available.
      
       [ RED-PEN: right now it would fall back first before trying
         vmalloc_node.  Probably not the best strategy ...  But I left it like
         that for now. ]
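
      The per-node page_cgroup allocation helper then looks roughly like this
      (a sketch; treat the exact fallback chain as approximate):

        static void *__init_refok alloc_page_cgroup(size_t size, int nid)
        {
                void *addr;

                /* try to place the array on its own node first; per the
                 * RED-PEN above, this may already fall back to other nodes
                 * before vmalloc_node() is ever tried */
                addr = alloc_pages_exact_nid(nid, size, GFP_KERNEL | __GFP_NOWARN);
                if (addr)
                        return addr;

                if (node_state(nid, N_HIGH_MEMORY))
                        return vmalloc_node(size, nid);
                return vmalloc(size);
        }
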
      Signed-off-by: Andi Kleen <ak@linux.intel.com>
      Reported-by: Doug Nelson
      Cc: David Rientjes <rientjes@google.com>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Cc: Dave Hansen <dave@linux.vnet.ibm.com>
      Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      21a3c964
    • mm: add alloc_pages_exact_nid() · ee85c2e1
      Authored by Andi Kleen
      Add an alloc_pages_exact_nid() that allocates on a specific node.
      
      The naming is quite broken, but fixing that would need a larger renaming
      action.
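
      A sketch of the new helper, modelled on alloc_pages_exact();
      make_alloc_exact() is the shared tail-trimming helper the patch factors
      out (treat the exact shape here as approximate):

        void *__meminit alloc_pages_exact_nid(int nid, size_t size, gfp_t gfp_mask)
        {
                unsigned order = get_order(size);
                struct page *p = alloc_pages_node(nid, gfp_mask, order);

                if (!p)
                        return NULL;
                /* trim the high-order block down to exactly 'size' bytes */
                return make_alloc_exact((unsigned long)page_address(p), order, size);
        }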
      
      [akpm@linux-foundation.org: coding-style fixes]
      [akpm@linux-foundation.org: tweak comment]
      Signed-off-by: Andi Kleen <ak@linux.intel.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Dave Hansen <dave@linux.vnet.ibm.com>
      Cc: David Rientjes <rientjes@google.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ee85c2e1
    • mm: use alloc_bootmem_node_nopanic() on really needed path · 8f389a99
      Authored by Yinghai Lu
      Stefan found nobootmem does not work on his system that has only 8M of
      RAM.  This causes an early panic:
      
        BIOS-provided physical RAM map:
         BIOS-88: 0000000000000000 - 000000000009f000 (usable)
         BIOS-88: 0000000000100000 - 0000000000840000 (usable)
        bootconsole [earlyser0] enabled
        Notice: NX (Execute Disable) protection missing in CPU or disabled in BIOS!
        DMI not present or invalid.
        last_pfn = 0x840 max_arch_pfn = 0x100000
        init_memory_mapping: 0000000000000000-0000000000840000
        8MB LOWMEM available.
          mapped low ram: 0 - 00840000
          low ram: 0 - 00840000
        Zone PFN ranges:
          DMA      0x00000001 -> 0x00001000
          Normal   empty
        Movable zone start PFN for each node
        early_node_map[2] active PFN ranges
            0: 0x00000001 -> 0x0000009f
            0: 0x00000100 -> 0x00000840
        BUG: Int 6: CR2 (null)
             EDI c034663c  ESI (null)  EBP c0329f38  ESP c0329ef4
             EBX c0346380  EDX 00000006  ECX ffffffff  EAX fffffff4
             err (null)  EIP c0353191   CS c0320060  flg 00010082
        Stack: (null) c030c533 000007cd (null) c030c533 00000001 (null) (null)
               00000003 0000083f 00000018 00000002 00000002 c0329f6c c03534d6 (null)
               (null) 00000100 00000840 (null) c0329f64 00000001 00001000 (null)
        Pid: 0, comm: swapper Not tainted 2.6.36 #5
        Call Trace:
         [<c02e3707>] ? 0xc02e3707
         [<c035e6e5>] 0xc035e6e5
         [<c0353191>] ? 0xc0353191
         [<c03534d6>] 0xc03534d6
         [<c034f1cd>] 0xc034f1cd
         [<c034a824>] 0xc034a824
         [<c03513cb>] ? 0xc03513cb
         [<c0349432>] 0xc0349432
         [<c0349066>] 0xc0349066
      
      It turns out that we should ignore the low limit of 16M.
      
      Use alloc_bootmem_node_nopanic() in this case.
      
      [akpm@linux-foundation.org: less mess]
      Signed-off-by: Yinghai LU <yinghai@kernel.org>
      Reported-by: Stefan Hellermann <stefan@the2masters.de>
      Tested-by: Stefan Hellermann <stefan@the2masters.de>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: "H. Peter Anvin" <hpa@linux.intel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: <stable@kernel.org>		[2.6.34+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8f389a99
    • mm: check PageUnevictable in lru_deactivate_fn() · bad49d9c
      Authored by Minchan Kim
      lru_deactivate_fn() should not move a page which is on the unevictable
      LRU into the inactive list.  Otherwise, we can hit a BUG when we use
      isolate_lru_pages(), as __isolate_lru_page() could return -EINVAL.
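
      The fix is essentially an early bail-out near the top of
      lru_deactivate_fn() (sketch):

        static void lru_deactivate_fn(struct page *page, void *arg)
        {
                if (!PageLRU(page))
                        return;

                /* the fix: never move an unevictable page to the inactive list */
                if (PageUnevictable(page))
                        return;

                /* ...move the page to the tail of the inactive LRU... */
        }
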
      Reported-by: Ying Han <yinghan@google.com>
      Tested-by: Ying Han <yinghan@google.com>
      Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
      Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bad49d9c
  9. 10 May 2011 (2 commits)
  10. 08 May 2011 (1 commit)
    • slub: Remove CONFIG_CMPXCHG_LOCAL ifdeffery · 1759415e
      Authored by Christoph Lameter
      Remove the #ifdefs. This means that the irqsafe_cpu_cmpxchg_double() is used
      everywhere.
      
      There may be performance implications since:
      
      A. We now have to manage a transaction ID for all arches
      
      B. The interrupt holdoff for arches not supporting CONFIG_CMPXCHG_LOCAL is reduced
      to a very short irqoff section.
      
      There are no multiple irqoff/irqon sequences as a result of this change. Even in the fallback
      case we only have to do one disable and enable like before.
      Signed-off-by: Christoph Lameter <cl@linux.com>
      Signed-off-by: Pekka Enberg <penberg@kernel.org>
      1759415e
  11. 05 May 2011 (2 commits)
    • VM: skip the stack guard page lookup in get_user_pages only for mlock · a1fde08c
      Authored by Linus Torvalds
      The logic in __get_user_pages() used to skip the stack guard page lookup
      whenever the caller wasn't interested in seeing what the actual page
      was.  But Michel Lespinasse points out that there are cases where we
      don't care about the physical page itself (so 'pages' may be NULL), but
      do want to make sure a page is mapped into the virtual address space.
      
      So using the existence of the "pages" array as an indication of whether
      to look up the guard page or not isn't actually so great, and we really
      should just use the FOLL_MLOCK bit.  But because that bit was only set
      for the VM_LOCKED case (and not all vma's necessarily have it, even for
      mlock()), we couldn't do that originally.
      
      Fix that by moving the VM_LOCKED check deeper into the call-chain, which
      actually simplifies many things.  Now mlock() gets simpler, and we can
      also check for FOLL_MLOCK in __get_user_pages() and the code ends up
      much more straightforward.
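
      The check in __get_user_pages() then keys off FOLL_MLOCK rather than the
      'pages' array (approximate shape; variable and helper names as of that
      era, treated here as assumptions):

        /* only mlock()-style population wants to skip the guard page */
        if (gup_flags & FOLL_MLOCK) {
                if (stack_guard_page_start(vma, start) ||
                    stack_guard_page_end(vma, start + PAGE_SIZE))
                        goto next_page;
        }
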
      Reported-and-reviewed-by: Michel Lespinasse <walken@google.com>
      Cc: stable@kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a1fde08c
    • slub: Fix the lockless code on 32-bit platforms with no 64-bit cmpxchg · 30106b8c
      Authored by Thomas Gleixner
      The SLUB allocator's use of the cmpxchg_double logic was wrong: it
      actually needs the irq-safe one.
      
      That happens automatically when we use the native unlocked 'cmpxchg8b'
      instruction, but when compiling the kernel for older x86 CPUs that do
      not support that instruction, we fall back to the generic emulation
      code.
      
      And if you don't specify that you want the irq-safe version, the generic
      code ends up just open-coding the cmpxchg8b equivalent without any
      protection against interrupts or preemption.  Which definitely doesn't
      work for SLUB.
      
      This was reported by Werner Landgraf <w.landgraf@ru.ru>, who saw
      instability with his distro-kernel that was compiled to support pretty
      much everything under the sun.  Most big Linux distributions tend to
      compile for PPro and later, and would never have noticed this problem.
      
      This also fixes the prototypes for the irqsafe cmpxchg_double functions
      to use 'bool' like they should.
      
      [ Btw, that whole "generic code defaults to no protection" design just
        sounds stupid - if the code needs no protection, there is no reason to
        use "cmpxchg_double" to begin with.  So we should probably just remove
        the unprotected version entirely as pointless.   - Linus ]
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Reported-and-tested-by: werner <w.landgraf@ru.ru>
      Acked-and-tested-by: Ingo Molnar <mingo@elte.hu>
      Acked-by: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Tejun Heo <tj@kernel.org>
      Link: http://lkml.kernel.org/r/alpine.LFD.2.02.1105041539050.3005@ionos
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      30106b8c
  12. 29 April 2011 (3 commits)
    • mm: check if PTE is already allocated during page fault · cc03638d
      Authored by Mel Gorman
      With transparent hugepage support, handle_mm_fault() has to be careful
      that a normal PMD has been established before handling a PTE fault.  To
      achieve this, it used __pte_alloc() directly instead of pte_alloc_map as
      pte_alloc_map is unsafe to run against a huge PMD.  pte_offset_map() is
      called once it is known the PMD is safe.
      
      pte_alloc_map() is smart enough to check if a PTE is already present
      before calling __pte_alloc but this check was lost.  As a consequence,
      PTEs may be allocated unnecessarily and the page table lock taken.  This
      useless PTE does get cleaned up but it's a performance hit which is
      visible in page_test from aim9.
      
      This patch simply re-adds the check normally done by pte_alloc_map to
      check if the PTE needs to be allocated before taking the page table lock.
      The effect is noticeable in page_test from aim9.
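
      The re-added check is a one-liner guarding __pte_alloc() in
      handle_mm_fault() (sketch):

        /* only allocate a PTE page if none has been installed yet */
        if (unlikely(pmd_none(*pmd)) &&
            unlikely(__pte_alloc(mm, vma, pmd, address)))
                return VM_FAULT_OOM;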
      
        AIM9
                        2.6.38-vanilla 2.6.38-checkptenone
        creat-clo      446.10 ( 0.00%)   424.47 (-5.10%)
        page_test       38.10 ( 0.00%)    42.04 ( 9.37%)
        brk_test        52.45 ( 0.00%)    51.57 (-1.71%)
        exec_test      382.00 ( 0.00%)   456.90 (16.39%)
        fork_test       60.11 ( 0.00%)    67.79 (11.34%)
        MMTests Statistics: duration
        Total Elapsed Time (seconds)                611.90    612.22
      
      (While this affects 2.6.38, it is a performance rather than a functional
      bug and normally outside the rules for -stable.  While the big performance
      differences are in a microbenchmark, the difference in fork and exec
      performance may be significant enough that -stable wants to consider the
      patch.)
      Reported-by: Raz Ben Yehuda <raziebe@gmail.com>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
      Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: <stable@kernel.org>		[2.6.38.x]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      cc03638d
    • oom: use pte pages in OOM score · f755a042
      Authored by KOSAKI Motohiro
      PTE pages eat up memory just like anything else, but we do not account for
      them in any way in the OOM scores.  They are also _guaranteed_ to get
      freed up when a process is OOM killed, while RSS is not.
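
      In oom_badness() the baseline score then counts page-table pages and
      swap entries alongside RSS (a sketch of the resulting lines):

        points = get_mm_rss(p->mm) + p->mm->nr_ptes;
        points += get_mm_counter(p->mm, MM_SWAPENTS);
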
      Reported-by: Dave Hansen <dave@linux.vnet.ibm.com>
      Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: <stable@kernel.org>		[2.6.36+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f755a042
    • mm: thp: fix /dev/zero MAP_PRIVATE and vm_flags cleanups · 78f11a25
      Authored by Andrea Arcangeli
      The huge_memory.c THP page fault was allowed to run if vm_ops was null
      (which would succeed for /dev/zero MAP_PRIVATE, as the f_op->mmap wouldn't
      set up a special vma->vm_ops and it would fall back to regular anonymous
      memory), but other THP logic wasn't fully activated for VMAs with a
      non-NULL vm_file (/dev/zero has a non-NULL vma->vm_file).
      
      So this removes the vm_file checks so that /dev/zero can also safely use
      THP (the other, albeit safer, approach to fixing this bug would have been to
      prevent the initial THP page fault from running if vm_file was set).
      
      After removing the vm_file checks, this also makes huge_memory.c stricter
      in khugepaged for the DEBUG_VM=y case.  It doesn't replace the vm_file
      check with a is_pfn_mapping check (but it keeps checking for VM_PFNMAP
      under VM_BUG_ON) because for a is_cow_mapping() mapping VM_PFNMAP should
      only be allowed to exist before the first page fault, and in turn when
      vma->anon_vma is null (so preventing khugepaged registration).  So I tend
      to think the previous comment saying if vm_file was set, VM_PFNMAP might
      have been set and we could still be registered in khugepaged (despite
      anon_vma having to be non-NULL for registration in khugepaged) was too paranoid.
      The is_linear_pfn_mapping check is also I think superfluous (as described
      by comment) but under DEBUG_VM it is safe to stay.
      
      Addresses https://bugzilla.kernel.org/show_bug.cgi?id=33682
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Reported-by: Caspar Zhang <bugs@casparzhang.com>
      Acked-by: Mel Gorman <mel@csn.ul.ie>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: <stable@kernel.org>		[2.6.38.x]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      78f11a25
  13. 17 April 2011 (5 commits)